Thursday 3 November 2016

Random backup failure and multiple snapshot control file entries on ASM

Every ~30th backup of our cluster database fails. The RMAN error on NODE 1 can be seen below:
Starting backup at 01-NOV-16
channel backup_disk1: starting compressed full datafile backup set
channel backup_disk1: specifying datafile(s) in backup set
input datafile file number=00006 name=+DATA/<DB_NAME>/users01.dbf
input datafile file number=00012 name=+DATA/<DB_NAME>/users02.dbf
input datafile file number=00009 name=+DATA/<DB_NAME>/undotbs2_01.dbf
input datafile file number=00003 name=+DATA/<DB_NAME>/undotbs1_01.dbf
input datafile file number=00002 name=+DATA/<DB_NAME>/sysaux01.dbf
input datafile file number=00001 name=+DATA/<DB_NAME>/system01.dbf
input datafile file number=00005 name=+DATA/<DB_NAME>/gpaudit01.dbf
input datafile file number=00008 name=+DATA/<DB_NAME>/perf_data01.dbf
input datafile file number=00011 name=+DATA/<DB_NAME>/tracker01.dbf
input datafile file number=00007 name=+DATA/<DB_NAME>/issues01.dbf
input datafile file number=00010 name=+DATA/<DB_NAME>/users01.dbf
input datafile file number=00004 name=+DATA/<DB_NAME>/apex_data01.dbf
channel backup_disk1: starting piece 1 at 01-NOV-16
channel backup_disk1: finished piece 1 at 01-NOV-16
piece handle=H:\RMAN_BACKUPS\<DB_NAME>_3ORJQ6QM_1_1.RMAN tag=TAG20161101T065933 comment=NONE
channel backup_disk1: backup set complete, elapsed time: 00:47:50
channel backup_disk1: starting compressed full datafile backup set
channel backup_disk1: specifying datafile(s) in backup set
released channel: backup_disk1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on backup_disk1 channel at 11/01/2016 07:47:25
ORA-00245: control file backup failed; target is likely on a local file system

The alert log on NODE 1 shows this:
Tue Nov 01 07:47:24 2016
Errors in file E:\ORACLE\diag\rdbms\<DB_NAME>\<DB_NAME>1\trace\<DB_NAME>1_ora_22556.trc:
ORA-00245: control file backup failed; target is likely on a local file system
Tue Nov 01 07:47:26 2016
Thread 1 cannot allocate new log, sequence 28242
Checkpoint not complete

The alert log error on NODE 2 is:
Tue Nov 01 07:47:24 2016
Control file backup creation failed:
  failure to open backup target file E:\ORACLE\PRODUCT\11.2.0.3\DBHOME_1\DATABASE\SNCF<DB_NAME>1.ORA.
Errors in file E:\ORACLE\diag\rdbms\<DB_NAME>\<DB_NAME>2\trace\<DB_NAME>2_lgwr_1008.trc:
ORA-27041: unable to open file
OSD-04002: unable to open file
O/S-Error: (OS 2) The system cannot find the file specified.

The above alert log error occurs when RMAN attempts to write the snapshot controlfile to a local file system. However, the configured snapshot controlfile location is not on the local system: it is a shared mount point that is always accessible from both nodes, as shown below.

CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_Backups\<DB_NAME>_SNAPCF.RMAN';

So what could be the issue? Well, what I left out is that there were multiple entries for the snapshot controlfile name in the control file. The real output looks like this:

E:\dba>rman target /

Recovery Manager: Release 11.2.0.3.0 - Production on Thu Nov 3 10:07:00 2016

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: <DB_NAME> (DBID=11111111111)

RMAN> show all;
...............
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_Backups\<DB_NAME>_SNAPCF.RMAN';
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_BACKUPS\<DB_NAME>_SNAPCF.RMAN';
...............

This suggests that at some point RMAN falls back to its default configuration when creating the snapshot controlfile, which would match the local SNCF<DB_NAME>1.ORA path in the NODE 2 alert log. I believe the multiple entries are causing RMAN to fall through to this behaviour. The two entries differ only in letter case: RMAN_Backups versus RMAN_BACKUPS.
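Spotting two entries that differ only in case is easy to miss by eye in a long "show all" listing. The following is a minimal Python sketch of how the check could be automated, assuming the "show all" output has already been captured as text. The function names and the MYDB database name are hypothetical and used only for illustration.

```python
import re
from collections import defaultdict

# Matches RMAN "show all" lines such as:
#   CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_Backups\MYDB_SNAPCF.RMAN';
SNAPSHOT_RE = re.compile(
    r"^CONFIGURE SNAPSHOT CONTROLFILE NAME TO '([^']+)';",
    re.IGNORECASE,
)


def find_snapshot_entries(show_all_output):
    """Return every snapshot controlfile path listed in the given text."""
    paths = []
    for line in show_all_output.splitlines():
        match = SNAPSHOT_RE.match(line.strip())
        if match:
            paths.append(match.group(1))
    return paths


def case_only_duplicates(paths):
    """Group paths that are identical once letter case is ignored."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.lower()].append(path)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # MYDB is a hypothetical database name standing in for <DB_NAME>.
    sample = "\n".join([
        r"CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_Backups\MYDB_SNAPCF.RMAN';",
        r"CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_BACKUPS\MYDB_SNAPCF.RMAN';",
    ])
    for group in case_only_duplicates(find_snapshot_entries(sample)):
        print("Case-only duplicates:", group)
```

Any group this returns is a set of entries that Windows would treat as the same file but that RMAN lists separately.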

I found a match on My Oracle Support (MetaLink):
Bug 17879299 - Duplicate snapshot controlfile entries are shown by RMAN if snapshot controlfile is on ASM (Doc ID 17879299.8)

This bug occurs from time to time in databases that have been configured with ASM.
The bug note says there is no workaround. That may be true, but the way to remove the duplicate entries, and hopefully stop the random backup failures, is to clear the parameter and set it again in lower case, as shown below:

E:\dba>rman target /

Recovery Manager: Release 11.2.0.3.0 - Production on Thu Nov 3 10:07:00 2016

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: <DB_NAME> (DBID=1566787861)

RMAN> show all;

using target database control file instead of recovery catalog
RMAN configuration parameters for database with db_unique_name <DB_NAME> are:
....................................
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_Backups\<DB_NAME>_SNAPCF.RMAN';
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_BACKUPS\<DB_NAME>_SNAPCF.RMAN';

RMAN> configure snapshot controlfile name clear;

old RMAN configuration parameters:
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_Backups\<DB_NAME>_SNAPCF.RMAN';
old RMAN configuration parameters:
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'H:\RMAN_BACKUPS\<DB_NAME>_SNAPCF.RMAN';
RMAN configuration parameters are successfully reset to default value

RMAN> show all;

RMAN configuration parameters for database with db_unique_name <DB_NAME> are:
....................
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'E:\ORACLE\PRODUCT\11.2.0.3\DBHOME_1\DATABASE\SNCF<DB_NAME>1.ORA'; # default

RMAN> CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'h:\rman_backups\<DB_NAME>_snapcf.rman';

new RMAN configuration parameters:
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'h:\rman_backups\<DB_NAME>_snapcf.rman';
new RMAN configuration parameters are successfully stored

RMAN> show all;

RMAN configuration parameters for database with db_unique_name <DB_NAME> are:
..........................
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'h:\rman_backups\<DB_NAME>_snapcf.rman';

RMAN>

Now I will monitor to see if backups continue to fail. I am hoping this bug is the cause of the sporadic failures.
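To make the monitoring less manual, a small script could flag any backup log that contains the same error pair as the failure above. This is only a sketch in Python: it assumes each backup run writes its RMAN output to a text log, and the log names and abbreviated contents in the example are hypothetical.

```python
# Error codes taken from the failed backup log shown earlier in this post.
FAILURE_CODES = ("RMAN-03009", "ORA-00245")


def is_snapshot_failure(log_text):
    """True if the log contains both the channel failure and ORA-00245."""
    return all(code in log_text for code in FAILURE_CODES)


def failed_logs(logs):
    """Given {log name: log text}, return the names of logs that failed this way."""
    return sorted(name for name, text in logs.items() if is_snapshot_failure(text))


if __name__ == "__main__":
    # Hypothetical log names and abbreviated contents, for illustration only.
    logs = {
        "backup_01.log": "channel backup_disk1: backup set complete",
        "backup_02.log": (
            "RMAN-03009: failure of backup command on backup_disk1 channel\n"
            "ORA-00245: control file backup failed; "
            "target is likely on a local file system"
        ),
    }
    print(failed_logs(logs))
```

Since the failure showed up roughly once in every 30 backups, the check would need to run over well more than 30 backups before a clean result means much.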

4 comments:

  1. I was stuck with the same error. I followed your page and it helped me solve it. Thank you so much.

    Replies
    1. Apologies for the late response. I am glad the post helped you sort out the problem. Well done Ramki

  2. Hi, I keep getting this error when the backup is run with "SECTION SIZE":
    BACKUP DATABASE FILESPERSET = 1 SECTION SIZE = 10000M;

    Replies
    1. Hi Ramki

      If I understand your comment correctly, it seems the directive you mentioned is triggering the bug?

      I don't know how to bypass this. I would think you'd need to find a workaround for section size. Possibly your section size and files per set may be conflicting? Try a different setting.
