Thursday 3 November 2016

Random backup failure and multiple snapshot control file entries on ASM

Every ~30th backup of our cluster database fails. The RMAN error on NODE 1 can be seen below:
Starting backup at 01-NOV-16
channel backup_disk1: starting compressed full datafile backup set
channel backup_disk1: specifying datafile(s) in backup set
input datafile file number=00006 name=+DATA/<DB_NAME>/users01.dbf
input datafile file number=00012 name=+DATA/<DB_NAME>/users02.dbf
input datafile file number=00009 name=+DATA/<DB_NAME>/undotbs2_01.dbf
input datafile file number=00003 name=+DATA/<DB_NAME>/undotbs1_01.dbf
input datafile file number=00002 name=+DATA/<DB_NAME>/sysaux01.dbf
input datafile file number=00001 name=+DATA/<DB_NAME>/system01.dbf
input datafile file number=00005 name=+DATA/<DB_NAME>/gpaudit01.dbf
input datafile file number=00008 name=+DATA/<DB_NAME>/perf_data01.dbf
input datafile file number=00011 name=+DATA/<DB_NAME>/tracker01.dbf
input datafile file number=00007 name=+DATA/<DB_NAME>/issues01.dbf
input datafile file number=00010 name=+DATA/<DB_NAME>/users01.dbf
input datafile file number=00004 name=+DATA/<DB_NAME>/apex_data01.dbf
channel backup_disk1: starting piece 1 at 01-NOV-16
channel backup_disk1: finished piece 1 at 01-NOV-16
piece handle=H:\RMAN_BACKUPS\<DB_NAME>_3ORJQ6QM_1_1.RMAN tag=TAG20161101T065933 comment=NONE
channel backup_disk1: backup set complete, elapsed time: 00:47:50
channel backup_disk1: starting compressed full datafile backup set
channel backup_disk1: specifying datafile(s) in backup set
released channel: backup_disk1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on backup_disk1 channel at 11/01/2016 07:47:25
ORA-00245: control file backup failed; target is likely on a local file system

The alert log on NODE 1 shows this:
Tue Nov 01 07:47:24 2016
Errors in file E:\ORACLE\diag\rdbms\<DB_NAME>\<DB_NAME>1\trace\<DB_NAME>1_ora_22556.trc:
ORA-00245: control file backup failed; target is likely on a local file system Tue Nov 01 07:47:26 2016 Thread 1 cannot allocate new log, sequence 28242 Checkpoint not complete

The alert log error on NODE 2 is:
Tue Nov 01 07:47:24 2016
Control file backup creation failed:
  failure to open backup target file E:\ORACLE\PRODUCT\11.2.0.3\DBHOME_1\DATABASE\SNCF<DB_NAME>1.ORA.
Errors in file E:\ORACLE\diag\rdbms\<DB_NAME>\<DB_NAME>2\trace\<DB_NAME>2_lgwr_1008.trc:
ORA-27041: unable to open file
OSD-04002: unable to open file
O/S-Error: (OS 2) The system cannot find the file specified.

The above alert log error occurs when RMAN attempts to write the snapshot controlfile to a local file system. But, the snapshot control location is not on the local system, it is configured to a shared mount point between both nodes that is always accessible as shown below.

CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_Backups\<DB_NAME>_SNAPCF.RMAN';

So what could be the issue? Well, the part I omit is the fact that there were multiple entries for the snapshot file in the control file. The real output looks like this:

E:\dba>rman target /

Recovery Manager: Release 11.2.0.3.0 - Production on Thu Nov 3 10:07:00 2016

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: <DB_NAME> (DBID=11111111111)

RMAN> show all;
...............
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_Backups\<DB_NAME>_SNAPCF.RMAN';
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_BACKUPS\<DB_NAME>_SNAPCF.RMAN';
...............

This means at some point, RMAN is defaulting to factory configuration to achieve its snapshot backup. I believe it may be because the mutliple parameters are causing RMAN to fall through to this behaviour. The difference between them is case sensitivity.

I found a match on meta link:
Bug 17879299 - Duplicate snapshot controlfile entries are shown by RMAN if snapshot controlfile is on ASM (Doc ID 17879299.8)

This is a bug which occurs from time to time in a database that has been configured with ASM.
The bug note says there is no workaround. This may be true, but the solution to correct this, and correct the random failure of backups, is to clear and reinstate the parameter in lower case as shown below:

E:\dba>rman target /

Recovery Manager: Release 11.2.0.3.0 - Production on Thu Nov 3 10:07:00 2016

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: <DB_NAME> (DBID=1566787861)

RMAN> show all;

using target database control file instead of recovery catalog
RMAN configuration parameters for database with db_unique_name <DB_NAME> are:
....................................
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_Backups\<DB_NAME>_SNAPCF.RMAN';
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_BACKUPS\<DB_NAME>_SNAPCF.RMAN';

RMAN> configure snapshot controlfile name clear;

old RMAN configuration parameters:
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_Backups\<DB_NAME>_SNAPCF.RMAN';
old RMAN configuration parameters:
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_BACKUPS\<DB_NAME>_SNAPCF.RMAN';
RMAN configuration parameters are successfully reset to default value

RMAN> show all;

RMAN configuration parameters for database with db_unique_name <DB_NAME> are:
....................
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'E:\ORACLE\PRODUCT\11.2.0.3\DBHOME_1\DATABASE\SNCF<DB_NAME>1.ORA'; # default

RMAN> CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'h:\rman_backups\<DB_NAME>_snapcf.rman';

new RMAN configuration parameters:
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'h:\rman_backups\<DB_NAME>_snapcf.rman';
new RMAN configuration parameters are successfully stored

RMAN> show all;

RMAN configuration parameters for database with db_unique_name <DB_NAME> are:
..........................
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'h:\rman_backups\<DB_NAME>_snapcf.rman';

RMAN>

Now I will monitor to see if backups continue to fail. I am hoping this bug is the cause of the sporadic failures.

Sunday 9 October 2016

Free State, Clarens 2016

The weekend was spent in a quite town in the Free State. The skater in the pictures is a friend of mine. He skates down hills, sometimes at ~90km per hour. 







Monday 26 September 2016

Creating a Physical Standby Database in 11.2.0.3

Image courtesy of the movie "the shining". Go watch it, it is very good. The twins represent DR sites in a way in that they are mirrored. They also represent database systems in that when it is time to open DR, the administrator is likely to be scared: he or she is either dealing with auditors or a non recoverable disaster - both quite scary. An Oracle standby database is a database that is synchronized at a data block level, so when the DR system is bust open in an emergency, the system should be identical to the production database it was tethered to - at least up to the last log the DR site consumed. There are no Oracle Streams or Microsoft replication type tethering mechanisms, it is kept in sync using archivelog files shed from the primary database. It is also a one way operation, at least in standard edition.

Below are notes to setup a standby database in Windows 2012.

The steps in this guide assume the the server has the exact same drives as the primary production. This means that I am not going to use any *convert* parameters to handle differences in file names. If the drives do differ, then I would use convert to handle the difference in lettering.

1. Install the database service:
In an administration cmd window, I execute the following:

set ORACLE_SID=<STANDBY_SID>
set ORACLE_HOME=E:\Oracle\Product\11.2.0.3\dbhome_1
set PATH=%ORACLE_HOME%\bin;E:\Oracle\product\11.2.0.3\dbhome_1;%PATH%

E:\Oracle\product\11.2.0.3\dbhome_1\BIN\oradim -DELETE -SID <STANDBY_SID> 

E:\Oracle\product\11.2.0.3\dbhome_1\BIN\oradim -new -SID <STANDBY_SID> -startmode manual -pfile 'E:\Oracle\Product\11.2.0.3\dbhome_1\database\init<STANDBY_SID>.ora'

If there is no pfile, make or copy one across using the primary database.
The next step is to create the folders where the control, data and redolog files will be restored.

2. Create the directories:
In windows command, execute the following:
mkdir f:\Oracle\Oradata\<STANDBY_SID>\
mkdir g:\Oracle\Oradata\<STANDBY_SID>\
mkdir h:\Oracle\Oradata\archive\<STANDBY_SID>
mkdir h:\Oracle\Oradata\<STANDBY_SID>

3. Copy the password file and a pfile:
Copy the following key files from the primary server to the standby database database directory:
prod01:E:\Oracle\prduct\11.2.0.3\dbhome_1\database\PWD<STANDBY_SID>.oraprod01:E:\Oracle\prduct\11.2.0.3\dbhome_1\database\init<STANDBY_SID>.ora
To
standby1:E:\Oracle\prduct\11.2.0.3\dbhome_1\database\

4. Restore the spfile
Now I begin the restoration process. If not done already, take a backup of the primary database. You will also need a standby controlfile from the primary site;

create standby controlfile as 'C:\<FOLDER>\<STANDBY_SID>_standby_control.rman'

Copy the backup pieces from the primary database server into a holding location on the standby database server:
C:\RMAN_Backups\<STANDBY_SID>_2016_04_05\

Execute the following in CMD:
set ORACLE_SID=<STANDBY_SID>
set ORACLE_HOME=E:\Oracle\Product\11.2.0.3\dbhome_1

rman target / nocatalog

Once done, I startup the database with force:

startup force nomount;

OUTPUT
RMAN> startup force nomount;

Oracle instance started

Total System Global Area   12827369472 bytes

Fixed Size                     2267184 bytes
Variable Size               5704255440 bytes
Database Buffers            7113539584 bytes
Redo Buffers                   7307264 bytes

Now I restore the database. I restore the spfile from the backup piece containing the spfile file. I can find the piece holding this file by reviewing the RMAN log file generated during the backup.
The clue is the piece containing the following meta data;
  SPFILE Included: Modification time: 20-SEP-16

I then plug the piece name into the following command and restore:
restore spfile to 'E:\Oracle\product\11.2.0.3\dbhome_1\database\spfile<STANDBY_SID>.ora' from 'C:\RMAN_Restore\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_21RGMSG7_1_
.RMAN';

OUTPUT
RMAN> restore spfile to 'E:\Oracle\product\11.2.0.3\dbhome_1\database\spfile<STANDBY_SID>.ora' from 'C:\RMAN_Restore\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_21RGMSG7_1_
.RMAN';

Starting restore at 26-SEP-16
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1712 device type=DISK

channel ORA_DISK_1: restoring spfile from AUTOBACKUP C:\RMAN_Restore\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_21RGMSG7_1_1.RMAN
channel ORA_DISK_1: SPFILE restore from AUTOBACKUP complete
Finished restore at 26-SEP-16

RMAN>

5. Restore the control files
Now restore the control files. Find the control file backup piece from the log file. It should be the same as the spfile piece used in the previous step. Similarly, the clue is to find the piece with the following metadata identifier:  

Control File Included: Ckp SCN: 19783506609   Ckp time: 25-SEP-16

Load up RMAN
rman target / nocatalog

Execute the following:
restore controlfile from 'C:\RMAN_Restore\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_21RGMSG7_1_1.RMAN';

OUTPUT
RMAN> restore controlfile from 'C:\RMAN_Restore\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_21RGMSG7_1_1.RMAN';

Starting restore at 26-SEP-16
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1712 device type=DISK

channel ORA_DISK_1: restoring control file
channel ORA_DISK_1: restore complete, elapsed time: 00:00:01
output file name=F:\ORACLE\ORADATA\<STANDBY_SID>\CONTROL01.CTL
output file name=G:\ORACLE\ORADATA\<STANDBY_SID>\CONTROL02.CTL
output file name=H:\ORACLE\ORADATA\<STANDBY_SID>\CONTROL03.CTL
Finished restore at 26-SEP-16

Mount the database once the control files are deployed:

mount database;

OUTPUT
RMAN> mount database;

database mounted
released channel: ORA_DISK_1

6. Catalog the backup pieces
Now to catalog the full backupset:

OUTPUT
RMAN> catalog start with 'C:\RMAN_Restore\<STANDBY_SID>_2016_09_25\';

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of catalog command at 09/26/2016 15:34:17
RMAN-06189: current DBID 1020978749 does not match target mounted database (1025452237)

The problem here is I need to exit rman and connect again with nocatalog after restoring and mounting the control files:

exit
rman target / nocatalog
catalog start with 'C:\RMAN_Restore\<STANDBY_SID>_2016_09_25\';

OUTPUT:

RMAN> catalog start with 'C:\RMAN_Restore\<STANDBY_SID>_2016_09_25\';

searching for all files that match the pattern C:\RMAN_Restore\<STANDBY_SID>_2016_09_25\

List of Files Unknown to the Database

=====================================
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_2016_09_25_backup_list.log
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_20RGMRRN_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_21RGMSG7_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_22RGMSGG_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_23RGMSGN_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_24RGMSGV_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_25RGMSH6_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_26RGMSHE_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_27RGMSHM_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_28RGMSHT_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_29RGMSI5_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_2ARGMSI9_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_STANDBY_CONTROL.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_TRACE_CONTROL.RMAN

Do you really want to catalog the above files (enter YES or NO)? YES

cataloging files...
cataloging done

List of Cataloged Files

=======================
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_20RGMRRN_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_21RGMSG7_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_22RGMSGG_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_23RGMSGN_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_24RGMSGV_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_25RGMSH6_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_26RGMSHE_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_27RGMSHM_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_28RGMSHT_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_29RGMSI5_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_2ARGMSI9_1_1.RMAN
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_STANDBY_CONTROL.RMAN

List of Files Which Where Not Cataloged

=======================================
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_2016_09_25_backup_list.log
  RMAN-07517: Reason: The file header is corrupted
File Name: C:\RMAN_RESTORE\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_TRACE_CONTROL.RMAN
  RMAN-07517: Reason: The file header is corrupted

7. Restore the database
Once cataloged, I restore the database files:

restore database;

OUTPUT
RMAN> restore database;

Starting restore at 26-SEP-16
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1712 device type=DISK

channel ORA_DISK_1: starting datafile backup set restore

channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00001 to F:\ORACLE\ORADATA\<STANDBY_SID>\SYSTEM01.DBF
channel ORA_DISK_1: restoring datafile 00002 to F:\ORACLE\ORADATA\<STANDBY_SID>\SYSAUX01.DBF
channel ORA_DISK_1: restoring datafile 00003 to F:\ORACLE\ORADATA\<STANDBY_SID>\UNDOTBS1_01.DBF
channel ORA_DISK_1: restoring datafile 00004 to F:\ORACLE\ORADATA\<STANDBY_SID>\APEX_DATA01.DBF
channel ORA_DISK_1: restoring datafile 00005 to F:\ORACLE\ORADATA\<STANDBY_SID>\AUDIT01.DBF
channel ORA_DISK_1: restoring datafile 00006 to F:\ORACLE\ORADATA\<STANDBY_SID>\USERS01.DBF
channel ORA_DISK_1: restoring datafile 00007 to F:\ORACLE\ORADATA\<STANDBY_SID>\ISSUES01.DBF
channel ORA_DISK_1: restoring datafile 00008 to F:\ORACLE\ORADATA\<STANDBY_SID>\PERF_DATA01.DBF
channel ORA_DISK_1: restoring datafile 00009 to F:\ORACLE\ORADATA\<STANDBY_SID>\UNDOTBS2_01.DBF
channel ORA_DISK_1: restoring datafile 00010 to F:\ORACLE\ORADATA\<STANDBY_SID>\USERS01.DBF
channel ORA_DISK_1: restoring datafile 00011 to F:\ORACLE\ORADATA\<STANDBY_SID>\ISSUE02.DBF
channel ORA_DISK_1: restoring datafile 00012 to F:\ORACLE\ORADATA\<STANDBY_SID>\USERS02.DBF
channel ORA_DISK_1: reading from backup piece H:\RMAN_BACKUPS\<STANDBY_SID>_20RGMRRN_1_1.RMAN


RMAN> exit

Recovery Manager complete.

E:\Oracle\product\11.2.0.3\dbhome_1\BIN>

8. Restore the standby control file
Now I restore the standby control file backup taken from the original backup:

rman target / nocatalog
shutdown immediate
startup nomount
restore controlfile from 'C:\RMAN_Restore\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_STANDBY_CONTROL.RMAN';
exit

OUTPUT
E:\Oracle\product\11.2.0.3\dbhome_1\BIN>rman target / nocatalog

Recovery Manager: Release 11.2.0.3.0 - Production on Tue Sep 26 

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: <STANDBY_SID> (DBID=1010768491, not open)
using target database control file instead of recovery catalog

RMAN> shutdown immediate

database dismounted
Oracle instance shut down

RMAN> startup nomount

connected to target database (not started)
Oracle instance started

Total System Global Area   12827369472 bytes

Fixed Size                     2293232 bytes
Variable Size               5838471696 bytes
Database Buffers            6979321856 bytes
Redo Buffers                   7282688 bytes

RMAN> restore controlfile from 'C:\RMAN_Restore\<STANDBY_SID>_2016_09_25\<STANDBY_SID>_STANDBY_CONTROL.RMAN';

Starting restore at 26-SEP-16
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1522 device type=DISK

channel ORA_DISK_1: copied control file copy
output file name=F:\ORACLE\ORADATA\<STANDBY_SID>\CONTROL01.CTL
output file name=G:\ORACLE\ORADATA\<STANDBY_SID>\CONTROL02.CTL
output file name=H:\ORACLE\ORADATA\<STANDBY_SID>\CONTROL03.CTL
Finished restore at 26-SEP16

RMAN> exit
Recovery Manager complete.

E:\Oracle\product\11.2.0.3\dbhome_1\BIN>

9. Mount the database as a physcial standby

Startup and mount the database in standby mode using:
alter database mount standby database;

Note that the key words "standby database" are optional according to the documentation, the db will know what it is to be mounted as per the control file.
SQL> startup nomount;
ORACLE instance started.

Total System Global Area 1.2827E+10 bytes
Fixed Size                  2267184 bytes
Variable Size            7918847952 bytes
Database Buffers         4898947072 bytes
Redo Buffers                7307264 bytes
SQL> alter database mount standby database;

Database altered.

SQL>
10. Checkout the new database
Check the database status
C:\Users\Administrator>sqlplus /nolog

SQL*Plus: Release 11.2.0.3.0 Production on Tue Sep 26 
Copyright (c) 1982, 2013, Oracle.  All rights reserved.

SQL> conn / as sysdba
Connected.
SQL> select open_mode,database_role from v$database;

OPEN_MODE            DATABASE_ROLE
-------------------- ----------------
MOUNTED              PHYSICAL STANDBY

SQL>

11. Apply database logs
Now all I do is load the log files into the archive log dest or sync them using robocopy and periodically run the following through a scheduler:

recover automatic standby database;
cancel
exit

If done correctly, my alert log should have a surge of log files consumed through media recovery, with many lines reading as follows:
Media Recovery Log G:\ORACLE\ORADATA\ARCHIVE\<STANDBY_SID>\<STANDBY_SID>_0922977743_0001_0000001053.ARC
Media Recovery Log G:\ORACLE\ORADATA\ARCHIVE\<STANDBY_SID>\<STANDBY_SID>_0922977743_0001_0000001054.ARC
Media Recovery Log G:\ORACLE\ORADATA\ARCHIVE\<STANDBY_SID>\<STANDBY_SID>_0922977743_0001_0000001055.ARC

Eventually, the alert log should spit out the following error:
Errors with log G:\ORACLE\ORADATA\ARCHIVE\<STANDBY_SID>\<STANDBY_SID>_0922977743_0001_0000001055.ARC
ORA-00308: cannot open archived log 'G:\ORACLE\ORADATA\ARCHIVE\<STANDBY_SID>\<STANDBY_SID>_0922977743_0001_0000001055.ARC'

That is to be expected - the database has eaten all the logs, and it would like more, but it cannot have them until the primary database sheds more logs. So don't be surprised to see Media Recovery Log messages punctuated with error logs - each error depicts the next log the standby database is expecting.

12. Monitoring
Once the system is self reliant and able to copy and apply logs, I will need to monitor the difference in time between the primary and standby. I do this by measuring the difference in hours between the current date and the timestamp of the last applied log. I can use the below script with an email scheduler to help keep an eye on the pair. If the time begins falling behind and the hour difference becomes > 1 (considering the pair in close proximity within the site), I will know there is a problem and I will need to investigate.

conn / as sysdba

define p_log_dir='&1'

set echo off
set verify off
set linesize 1000
set pages 100

spool &1.\standby_hours_behind.log

PROMPT <STANDBY_SID> DATABASE
PROMPT ======================

select trunc(24*(sysdate-max(first_time)),2) as hours_behind,
       max(sequence#) as last_applied_sequence_number,
       max(sequence#)+1 as next_sequence_number
  from v$log_history;

spool off


OUTPUT:
SQL> @H:\scripts\sql\standby_hours_behind.sql h:\scripts
Connected.
<STANDBY_SID> DATABASE
======================

HOURS_BEHIND LAST_APPLIED_SEQUENCE_NUMBER NEXT_SEQUENCE_NUMBER
------------ ---------------------------- --------------------
         .36                         1136                 1137

The system is just over 20 minutes apart - considering it rotates logs every 10 minutes, this is acceptable for the site.

Friday 16 September 2016

RAC Attack 12c - Setting up enterprise grade clusterware in your home

I was able to build a 2 node 12C cluster from the ground up using a home PC. The 2 node RAC cluster is comprised of 2 Linux 6 VM machines each consuming 1.5G of memory. The cluster runs alright and shares 4 ASM virtual box drives. The job took 2 evenings.

To do this, I used a guide created by the RAC attack group. The guys at RAC attack have done a great job in detailing each and every step  in the setup. The guide I used is here


Oracle RAC - home rolling an ORA.CVU patch

I have managed to stop the cluster from wasting away memory chunks owing to BUG [Note 1523366.1]. Oracle's solution is to opatch. However, the client cannot do that at this point owing to the country wide dependency on the system. In addition to this, to opatch the system is risky considering the fragile nature of RAC in Windows. I am not 100% confident that an opatch will be successful and I would sooner create a new cluster  than to apply a patch to and Oracle RAC system on Windows.

For the original discovery of the bug - see here
For the fragility of the cluster - see here


A recap: the command that is left open every 6 hours during the CVU health check is:

C:\Windows\system32\cmd.exe /K E:\OracleGrid\11.2.0.3\bin\cluvfy comp health -_format

A screenshot can be found here when the orphaned session grows in numbers:

For the memory leak fix, I did the following.

I backup a copy of the bat file: E:\OracleGrid\11.2.0.3\bin\cluvfy.bat 

I then open the original cluvfy,bat in notepad and carefully make the following changes

CHANGE 1
FROM:
if not (%CRSHOME%)==() (
  @set "CV_HOME=%CRSHOME%"
)

set CMDPATH=%~dp0

TO:
if not (%CRSHOME%)==() (
  @set "CV_HOME=%CRSHOME%"
)

set EXIT_OPTION=

if "%CVU_RESOURCE_OPTIONS%"=="" set EXIT_OPTION=/B


set CMDPATH=%~dp0

CHANGE 2
FROM:
exit /B %errorlevel%
goto done

:ERROR
exit /B 1

TO:
exit %EXIT_OPTION% %errorlevel%
goto done

:ERROR
exit %EXIT_OPTION% 1

Once the change is made, save and copy the file across to node 2. I can also move the CVU to avoid creating any unwanted issues using while carrying out the above changes

srvctl relocate cvu -n <node name>

However, considering the batch file is executed every 6 hours, the chances are slim.

The idea to carry this out came after looking at the source code of the 11.2.0.4 home - I noted a bug fix by a certain Oracle developer @ Oracle. I diff'd the old and new CVU, evaluated the developers intention, and took only what I needed from his fix to manually patch the CVU myself. I first recreated the problem in the 11.2.0.4 test environment, monitored it for a day, patched it with my fix, and monitored for a few days. Once done satisfied, I took the 'patch' live. This fix will allow the in house DBA and his manager to have a cluster that no longer crashes once a month - forever. The nightmare is over and this solution will suffice until we create a new cluster for 12c. Done fixed it.

Thursday 15 September 2016

How to corrupt a datablock and recover it


For this post, I am using the Linux DEV VM.

Oracle stores data in cells within datafiles referred to as datablocks. By default, those blocks are squared off into 8196 byte partitions. The blocksize usually comes into play on database creation and should be the same as the OS blocksize for improved efficiency. More info on blocks here. On a long enough time line, provided the drives survive long enough, corruption will occur at some point or another within the database. They can be recovered provided corruption is detected shortly after it occurs before it persists the days or weeks of backups. Corruption is usually reported first in the RMAN backup log therefore it is important to view the backup log daily to spot corruption.

To see the block size of an instance I run:
SQL> show parameter db_block_size

NAME                        TYPE        VALUE
--------------------------- ----------- -------
db_block_size               integer     8192

Enough about that, moving onto corrupting a datablock.

Create a new tablespace:

CREATE TABLESPACE blockbreaker 
DATAFILE '/home/oracle/app/oracle/oradata/cdb1/orcl/blockbreaker01.dbf' 
SIZE 10M 
LOGGING EXTENT MANAGEMENT LOCAL 
SEGMENT SPACE MANAGEMENT AUTO;

ALTER DATABASE DATAFILE '/home/oracle/app/oracle/oradata/cdb1/orcl/blockbreaker01.dbf' AUTOEXTEND OFF;

CREATE TABLE BLOCKMAKER 
(
VALUE VARCHAR(100)

TABLESPACE BLOCKBREAKER;

Now populate the table with data:

begin
  for i in 1..10000 loop
    insert into blockmaker 
values (to_char(sysdate,'DDMMYYYY HH24:MI:SS')||'XXXXXXXXX_'||to_char(i));

commit;
  end loop;
end;
/

alter system switch logfile;
alter system switch logfile;
alter system switch logfile;
alter system switch logfile;
alter system switch logfile;

Once created, take a full system backup - in this example I am performing a warm backup considering my database is in archivelog mode.
run {
    crosscheck backup;
    delete noprompt obsolete;
    backup as compressed backupset full database tag WARM_FULL_BACKUP format '/home/oracle/app/oracle/backup/%d_%T_%s_%p_WARM_FULL.bk';
    sql 'alter system archive log current';
    backup tag ARCHIVELOG_BACKUP format '/home/oracle/app/oracle/backup/%d_%T_%s_%p_ARCHIVE.bk' archivelog all delete all input ;
    backup tag CONTROL_BACKUP current controlfile format '/home/oracle/app/oracle/backup/%d_%T_%s_%p_CONTROL.bk';
    delete noprompt obsolete;
    }


Now the fun part. I am using a Linux VM for the following, I have not done something like  this in Windows before but I am sure that there are commands that perform similar functions in a Windows environment.
I am going to
-corrupt a block in my new table
-verify it is corrupted with DBV
-Dump the block out
-Check the corruption view to show the corruption was registered
-Recover the bad blocks


In sqlplus I run the following:

set trimspool on
set heading off
set lines 1000
set pages 1000

column file_name new_val file_name;
column file_id new_val file_id;
column block_id new_val block_id;

select f.name as file_name,
       to_char(f.block_id) as block_id,
       to_char(f.file#) as file_id   
  from (
       select * 
    from v$datafile vd, 
     dba_extents de 
   where lower(vd.name) like '%blockbreaker%' 
 and vd.file# = de.file_id 
     order by dbms_random.random 
   ) f 
  where rownum < 2
/
  
select 'dd of=&file_name bs=8192 conv=notrunc seek=&block_id << EOF' as string from dual
union all
  select 'BLOCKBREAKER IS HERE' as string from dual
union all
  select 'EOF' as string from dual
union all
  select ' ' as string from dual
union all
  select 'dbv FILE='''||name||''' blocksize=8192' as string from (select * from v$datafile where lower(name) like '%blockbreaker%' order by file#)  where rownum < 2
union all
  select 'echo ''alter system dump datafile &file_id block &block_id;'' | sqlplus sys/oracle@cdb1 as sysdba' as string from dual
union all
  select ' ' as string from dual
union all
  select 'cd '||value from v$diag_info where name='Diag Trace'
union all
  select 'ls -trl *ora*.trc' from dual;
/

exit;

The output sould look like this:
dd of=/home/oracle/app/oracle/oradata/cdb1/orcl/blockbreaker01.dbf bs=8192 conv=notrunc seek=168 << EOF
BLOCKBREAKER IS HERE
EOF

dbv FILE='/home/oracle/app/oracle/oradata/cdb1/orcl/blockbreaker01.dbf' blocksize=8192
echo 'alter system dump datafile 31 block 168;' | sqlplus sys/oracle@cdb1 as sysdba

cd /home/oracle/app/oracle/diag/rdbms/cdb1/cdb1/trace
ls -trl *ora*.trc

I copy the output and run in the shell:

[oracle@localhost backup]$ dd of=/home/oracle/app/oracle/oradata/cdb1/orcl/blockbreaker01.dbf bs=8192 conv=notrunc seek=168 << EOF
> BLOCKBREAKER IS HERE
> EOF
0+1 records in
0+1 records out
21 bytes (21 B) copied, 0.000171222 s, 123 kB/s
[oracle@localhost backup]$
[oracle@localhost backup]$ dbv FILE='/home/oracle/app/oracle/oradata/cdb1/orcl/blockbreaker01.dbf' blocksize=8192

DBVERIFY: Release 12.1.0.2.0 - Production on Thu Sep 15 09:24:06 2016

Copyright (c) 1982, 2014, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - Verification starting : FILE = /home/oracle/app/oracle/oradata/cdb1/orcl/blockbreaker01.dbf
Page 168 is marked corrupt
Corrupt block relative dba: 0x07c000a8 (file 31, block 168)
Bad header found during dbv:
Data in bad block:
 type: 66 format: 4 rdba: 0x4552424b
 last change scn: 0x4920.52454b41 seq: 0x53 flg: 0x20
 spare1: 0x4f spare2: 0x43 spare3: 0x4552
 consistency value in tail: 0x6ac40603
 check value in block header: 0x4548
 block checksum disabled



DBVERIFY - Verification complete

Total Pages Examined         : 1280
Total Pages Processed (Data) : 57
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 133
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 1089
Total Pages Marked Corrupt   : 1
Total Pages Influx           : 0
Total Pages Encrypted        : 0
Highest block SCN            : 7566501 (0.7566501)
[oracle@localhost backup]$ echo 'alter system dump datafile 31 block 168;' | sqlplus sys/oracle@cdb1 as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Thu Sep 15 09:24:06 2016

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL>
System altered.

SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
[oracle@localhost backup]$
[oracle@localhost backup]$ cd /home/oracle/app/oracle/diag/rdbms/cdb1/cdb1/trace
[oracle@localhost trace]$ ls -trl *ora*.trc

-rw-r-----. 1 oracle oracle   2205 Sep 15 09:24 cdb1_ora_31184.trc

[oracle@localhost trace]$ 

The commands should place the shell in the trace location. The latest TRC file should contain the contents of the newly corrupted block.

I cat the latest trace file out below:

[oracle@localhost trace]$ cat cdb1_ora_31184.trc

Trace file /home/oracle/app/oracle/diag/rdbms/cdb1/cdb1/trace/cdb1_ora_31184.trc
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
ORACLE_HOME = /home/oracle/app/oracle/product/12.1.0/dbhome_1
System name:    Linux
Node name:      localhost.localdomain
Release:        3.8.13-68.1.3.el7uek.x86_64
Version:        #2 SMP Wed Apr 22 11:51:54 PDT 2015
Machine:        x86_64
Instance name: cdb1
Redo thread mounted by this instance: 1
Oracle process number: 67
Unix process pid: 31184, image: oracle@localhost.localdomain


*** 2016-09-15 09:24:06.877
*** SESSION ID:(44.23378) 2016-09-15 09:24:06.877
*** CLIENT ID:() 2016-09-15 09:24:06.877
*** SERVICE NAME:(cdb1) 2016-09-15 09:24:06.877
*** MODULE NAME:(sqlplus@localhost.localdomain (TNS V1-V3)) 2016-09-15 09:24:06.877
*** CLIENT DRIVER:(SQL*PLUS) 2016-09-15 09:24:06.877
*** ACTION NAME:() 2016-09-15 09:24:06.877
*** CONTAINER ID:(1) 2016-09-15 09:24:06.877

Start dump data blocks tsn: 6 file#:31 minblk 168 maxblk 168
Block dump from cache:
Dump of buffer cache at level 4 for pdb=1 tsn=6 rdba=130023592
BH (0x78bdb158) file#: 31 rdba: 0x07c000a8 (31/168) class: 1 ba: 0x78884000
  set: 3 pool: 3 bsz: 8192 bsi: 0 sflg: 0 pwc: 0,0
  dbwrid: 0 obj: 93452 objn: 93452 tsn: [1/6] afn: 31 hint: f
  hash: [0x8290d588,0x8290d588] lru: [0x78bdb380,0x78bdb100]
  ckptq: [NULL] fileq: [NULL]
  objq: [0x78bdb128,0x78bdd1a8] objaq: [0x78bdd1b8,0x78bdb138]
  st: XCURRENT md: NULL fpin: 'ktspbwh2: ktspfmdb' fscn: 0x0.736877 tch: 1
  flags: block_written_once
  LRBA: [0x0.0.0] LSCN: [0x0.0] HSCN: [0xffff.ffffffff] HSUB: [1]
Block dump from disk:
Encrypted block <6, 130023592> content will not be dumped. Dumping header only.
buffer tsn: 6 rdba: 0x4552424b (277/1196619)
scn: 0x4920.52454b41 seq: 0x53 flg: 0x20 tail: 0x6ac40603
frmt: 0x04 chkval: 0x4548 type: 0x42=unknown
Hex dump of corrupt header 4 = CORRUPT
Dump of memory from 0x00007FD7120A1E00 to 0x00007FD7120A1E14
7FD7120A1E00 434F4C42 4552424B 52454B41 20534920  [BLOCKBREAKER IS ]
7FD7120A1E10 45524548                             [HERE]
End dump data blocks tsn: 6 file#: 31 minblk 168 maxblk 168
[oracle@localhost trace]$

The block should be registered in the dynamic corruption view table once discovered by DBV - if it has not been registered, I can discover it myself:

--Flush buffer cache to force the db to retrieve the blocks from disk SQL> alter system flush buffer_cache;

System altered.

SQL> select * from v$database_block_corruption;

no rows selected

SQL> select * from blockmaker;

VALUE
--------------------------------------------------------------------------------
............................
15092016 09:16:08XXXXXXXXX_6098
15092016 09:16:08XXXXXXXXX_6099
15092016 09:16:08XXXXXXXXX_6100
15092016 09:16:08XXXXXXXXX_6101
ERROR:
ORA-01578: ORACLE data block corrupted (file # 31, block # 168)
ORA-01110: data file 31:
'/home/oracle/app/oracle/oradata/cdb1/orcl/blockbreaker01.dbf'

6885 rows selected.

SQL> select * from v$database_block_corruption;

     FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTIO     CON_ID
---------- ---------- ---------- ------------------ --------- ----------
        31        168          1                  0 CORRUPT            0

SQL>

Now I will recover the corrupted block. This involves RMAN unpacking the last backup of the corrupted datafile and applying the logs in the backup as well as the logs generated up until this point.

RMAN> blockrecover corruption list;

Starting recover at 15-SEP-16
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=43 device type=DISK

channel ORA_DISK_1: restoring block(s)
channel ORA_DISK_1: specifying block(s) to restore from backup set
restoring blocks of datafile 00031
channel ORA_DISK_1: reading from backup piece /home/oracle/app/oracle/backup/CDB1_20160915_9_1_WARM_FULL.bk
channel ORA_DISK_1: piece handle=/home/oracle/app/oracle/backup/CDB1_20160915_9_1_WARM_FULL.bk tag=WARM_FULL_BACKUP
channel ORA_DISK_1: restored block(s) from backup piece 1
channel ORA_DISK_1: block restore complete, elapsed time: 00:00:01

starting media recovery
media recovery complete, elapsed time: 00:00:03

Finished recover at 15-SEP-16

RMAN>

the corruption flag has been cleared:

SQL> select * from v$database_block_corruption;

no rows selected

And I am able to now select from my table:

..................
VALUE
--------------------------------------------------------------------------------
15092016 09:16:08XXXXXXXXX_10000

10000 rows selected.


SQL> l
  1* select * from blockmaker
SQL>