Friday 22 April 2016

Database Fever

The source of this post comes from my experience during a disaster at one of the sites of the company I am contracting to. 3 days ago, I left work early because I could feel a fever coming on.
I was getting sick. While shivering in bed, I received a message that the SAN supporting an entire site had a drive failure and all the production environments in a particular site crashed.

A brief overview of the site:
-The SAN Storage was configured for RAID 5 supporting a Oracle database production VM server and a SQL development VM server.
-The Oracle VM server was hosted to a number of integration mechanisms including 2 different FTP solutions supporting a number of big companies in the region.
-The information system hosted by the Oracle Database was key in transacting industrial logistics on a large scale. This means that system failure could potentially halt operations of national interest.

Shortly after the news, the site engineer attempted to rebuild the failed disk as the configuration was suited for this kind of transaction. However, the SAN declared itself in a degraded state owing to the lack of  drives and no spares were available. What this meant was that for months, no one could resupply the disks after consecutive failures due to procurement policies. They had depleted the spare disks without replenishing the supply. What this also meant was that I would have to get up and out of bed and be ready for a new implementation of the site. Fortunately, the department manager negotiated a DR meeting for the next morning considering no transactions were planned for the next day and the site would not be used. I wasted little time and logged into the DR server for the lost system, storing and moving aside logs, backups, configurations and reading any documentation I could find on this particular system - there was little to none.

The following morning we held a video conference with all the managers and engineers involved. The plan was to not activate the DR site considering we had time before the next transaction took place. The administrators flattened the SAN and rebuilt storage minus one drive. They then planted a VM for me to work on. They also restored bits and pieces of drive backups from backup executive which fortunately had the Oracle application folder (to salvage config parts from and speed recovery) and a few other integration configuration files. No registry and no operating system could be restored. I began copying the most recent backup (sync to our DR server every night) and logs around midday and my colleague installed the database software while I planned for the specific moment when we would have everything we needed. At around 13:00 we had software, a backup and all the logs generated between the backup and the point of failure. Throughout the next few hours, my awareness of my condition was overridden by a small amount of adrenaline in my system.

The steps to install the instance service and restore and recover the database in Windows are below.

Fixing the database:
In Windows I need to install the database service using ORADIM which requires the pfile.
But to get the pfile, I need either a previous copy of that pfile or the spfile - neither of which was actively backed up except as part of the RMAN backup routine.
But I cannot restore the spfile using RMAN until I have the service (or at least to my knowledge at this point)
A catch 22.

I had little time to figure out the correct order of carrying this out so I did the following: I grabbed the spfile from backup executive restoration, opened it in a binary editor, scraped the contents out, cleaned up dynamic parameters (*.parameter_name) and remove the ascii artifacts in the contents and created a new pfile in the databases directory called 'init<instancename>.ora' using notepad.
I also verified the paths of the control_files parameter, log_archive_dest parameters and any other paths stated in the spfile to make sure they were real.

Create the service:
In command (as administrator) and using the hodgepodge pfile I created, I create the service as follows:

set ORACLE_SID=<instancename>
set ORACLE_HOME=e:\Oracle\product\11.2.0.3\dbhome_1

oradim -new -SID <instancename> -startmode manual -pfile 'E:\Oracle\Product\11.2.0.3\db_1\database\init<instancename>.ora'

REM echo.
REM echo use this if you've made a mistake in your service creation and want to start again
REM oradim -DELETE -SID MYDBSID 

I started the service in services.msc once I created it. I then use RMAN to restore the spfile from the backup. You will have hopefully kept a log of which backup piece holds your spfile and control file (usually the same piece). In my case, I had the piece name after reviewing the RMAN backup log we kept for our system backup.

Restore the spfile:

rman target / nocatalog
shutdown immediate;
startup nomount;
restore spfile from 'H:\RMAN_Backups\<instancename>_2016_04_18\<instancename>_RIR3CALE_1_1.RMAN';
shutdown immediate;
exit

The above script restores the spfile to my %ORACLE_HOME%/database directory.
Once I have this file, I then recreate the pfile from the spfile.

sqlplus /nolog
conn / as sysdba
startup nomount;
create pfile='E:\Oracle\Product\11.2.0.3\db_1\database\init<instancename>.ora.fromspfile' from spfile;
shutdown immediate;
exit;

I review the pfile for any differences between what I scraped together and what was the last functioning spfile configuration. If there is a difference, I would delete the service and recreate it with a newly created pfile from spfile using the restored spfile. As there was no difference, I proceed to carry out the next important step.

Safe guard the system from running jobs the moment I open it:

sqlplus /nolog
conn / as sysdba
startup nomount;
alter system set job_queue_processes=0 scope=spfile;
shutdown immediate;

I do this to prevent the system from running real time jobs wildly when I've not setup the supporting non-oracle systems.

Restore the control files:
Next step is to restore the control files:

rman target / nocatalog
startup nomount
restore controlfile from 'H:\RMAN_Backups\<instancename>_2016_04_18\<instancename>_RIR3CALE_1_1.RMAN';
mount database;
exit

This succeeds as I've check that the destination directories of the control files are all valid.
The next step is to catalog the remaining backup pieces and archive logs so Oracle can restore and recover the database.

Catalog the remaining backup pieces:

rman target / nocatalog
catalog start with 'H:\RMAN_Backups\<instancename>_2016_04_18\';
exit;

This command will dig through the specified directory and sub directories hunting for RMAN backup
pieces to use in any restorative commands.

Restore and recover the database:
Finally, I restore and recover the database:

rman target / nocatalog
run {
restore database;
recover database;
}
exit;

The above step should open the backup pieces cataloged, restore the data files to their allocated drives and directories and lastly apply all the archive logs it can. The restore function will be successful but the recover function will eventually "fail" as the recovery process runs out of thread:
unable to find archive log.

archive log thread=1 sequence=###
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 20/04/2016 14:27:09
RMAN-06054: media recovery requesting unknown log: thread 1 seq ### lowscn ###

This is normal when performing an incomplete recovery. I believe that the only way  around this is if I have a copy of the redologs, which are never part of an RMAN backup set.
If I had those, I wouldn't be restoring or recovering. So, once this message is received, I proceed to open the database using resetlogs.

Open the database:
The moment of truth.

sqlplus /nolog
conn / as sysdba
alter database open resetlogs;
exit;

The database opened for me and the quiet nervousness in the department subsided.
What resetlogs does is create a fork in the road for the database identity by scrubbing the scn information and incarnation number from all files and replacing it with a new set of values so all files are in sync, in other words, I am re-instantiating this database and also rendering all prior backups, archive logs, control and data files and DR solutions irrelevant. They become irrelevant because Oracle will not allow me to apply files from one incarnation to another as a safety mechanism against database corruption. It is like having a citizens id number (or social security number) changed: once the number is changed, I would think all previous societal facilities like banking and identity become ineffective as their details no longer match the respective systems - one would have to setup new accounts. This also means I need to create a new backup as soon as possible and recreate the DR site as soon as possible.

Note: I can still make use of the old backup files but again - I would need to open the database with resetlogs. RESETLOGS will only render old files from the previous incarnation irrelevant to my NEWLY opened database incarnation.

Re-enable the jobs:
I proceed to fix the integration solutions and switch the database jobs back on by running

sqlplus /nolog
conn / as sysdba
alter system set job_queue_processes=15 scope=both;

Also, I created the listener after I restored the database. I used the NETCA wizard to do this then I overwrote the contents with the restored listener.ora. I also reinstated the previous
tnsnames.ora and sqlnet.ora files from the backup executive backups. Lastly, I copied the old password file back and prior to starting the process of restoration, I copied the old diag directory into the new home. I then began working on the EDI solutions and worked to get working again.

Once this was all done, my joints were aching and I became aware of my condition - my head was burning and I was in the middle of a fever. I went home and slept. I went to work the following day to cleanup and attend any other new issues that  I may have missed. Things are running smoothly again. Today, I took the day off to recover, post this post and prepare for my exams.

No comments:

Post a Comment