Saturday, 23 April 2016

Virtualised Environments and Shared Storage

In my experience, the database disasters that fall outside the DBA's direct control have mostly involved SAN failures. Mix a SAN with virtual machines and less than adequate forward thinking and the result is a system setup that fails on a long enough timeline.

SANs are feasible storage solutions that offer convenient storage with decent disk redundancy. However, what I've found is that some administrators set up a production server on one VM and then set up either the DR or "offsite" backup server on another VM without realising that both the production and DR VMs run on the same SAN. If the SAN fails, both VMs are lost, the whole disaster recovery plan fails, and the organisation is then dependent on either a week-old tape backup or, by chance, a deliberately copied database backup, either from a recent database refresh or for testing. The same can be seen with some Oracle RAC installations: both nodes are virtualised, yet they've been implemented on the very same disk array outside of the ASM storage. So if the SAN dies, both nodes are going to fail, defeating one of the main incentives of using RAC.

I write specifically about SANs, but the same arrangement can exist on any disk array used to house a multitude of virtual machines. Anyone who believes that SANs and disk arrays are safe from failure may not have accounted for administration error: server engineers can destroy partitions or neglect periodic disk checks, leading to the slow collapse of a disk system.

What this means is that the underlying SAN structures tend to be obscured behind layers of virtual machines and disk partitions, and I find it is in the organisation's best interest to question and document the underlying foundation of the disk structure. If the engineers have set up the system in this fashion, I make alternative arrangements for the backups to be synchronised to disk that is not part of the production storage.
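Once the foundation is documented, the check itself is mechanical. Below is a minimal Python sketch of it, assuming someone has already written down which array backs each VM. The VM names, roles and array names are all hypothetical; the real mapping has to come from the virtualisation and storage engineers.

```python
from collections import defaultdict

# Hypothetical inventory: VM name -> (role, storage array backing its disks).
# None of these names come from a real site.
INVENTORY = {
    "ora-prod-01": ("production", "SAN-A"),
    "ora-dr-01": ("dr", "SAN-A"),
    "sql-dev-01": ("development", "SAN-B"),
}


def shared_storage_risks(inventory):
    """Return the arrays that back both a production VM and a DR VM."""
    roles_by_array = defaultdict(lambda: defaultdict(list))
    for vm, (role, array) in inventory.items():
        roles_by_array[array][role].append(vm)

    risks = {}
    for array, roles in roles_by_array.items():
        # .get() avoids creating empty entries while we inspect the roles
        if roles.get("production") and roles.get("dr"):
            risks[array] = {
                "production": sorted(roles["production"]),
                "dr": sorted(roles["dr"]),
            }
    return risks


if __name__ == "__main__":
    for array, vms in shared_storage_risks(INVENTORY).items():
        print(f"{array}: production {vms['production']} and DR {vms['dr']} share this array")
```

Any array that shows up in the result is a single point of failure for both the production system and its recovery path.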


Friday, 22 April 2016

Database Fever

This post comes from my experience during a disaster at one of the sites of the company I am contracting to. Three days ago, I left work early because I could feel a fever coming on. I was getting sick. While shivering in bed, I received a message that the SAN supporting an entire site had suffered a drive failure and that all the production environments at that site had crashed.

A brief overview of the site:
-The SAN storage was configured for RAID 5, supporting an Oracle database production VM server and a SQL development VM server.
-The Oracle VM server hosted a number of integration mechanisms, including 2 different FTP solutions supporting a number of big companies in the region.
-The information system hosted by the Oracle database was key in transacting industrial logistics on a large scale. This means that system failure could potentially halt operations of national interest.

Shortly after the news, the site engineer attempted to rebuild the failed disk, as the configuration was suited to this kind of operation. However, the SAN declared itself in a degraded state owing to the lack of drives, and no spares were available. What this meant was that, for months, no one had been able to resupply the disks after consecutive failures because of procurement policies. They had depleted the spare disks without replenishing the supply. What this also meant was that I would have to get out of bed and be ready for a new implementation of the site. Fortunately, the department manager negotiated a DR meeting for the next morning, considering no transactions were planned for the next day and the site would not be used. I wasted little time and logged into the DR server for the lost system, storing and moving aside logs, backups and configurations, and reading any documentation I could find on this particular system. There was little to none.

The following morning we held a video conference with all the managers and engineers involved. The plan was not to activate the DR site, considering we had time before the next transaction took place. The administrators flattened the SAN and rebuilt the storage minus one drive. They then provisioned a VM for me to work on. They also restored bits and pieces of drive backups from backup executive, which fortunately included the Oracle application folder (to salvage config parts from and speed up recovery) and a few other integration configuration files. No registry and no operating system could be restored. Around midday I began copying the most recent backup (synced to our DR server every night) and the logs, and my colleague installed the database software while I planned for the specific moment when we would have everything we needed. At around 13:00 we had software, a backup and all the logs generated between the backup and the point of failure. Throughout the next few hours, my awareness of my condition was overridden by a small amount of adrenaline in my system.

The steps to install the instance service and restore and recover the database in Windows are below.

Fixing the database:
In Windows I need to install the database service using ORADIM, which requires the pfile.
But to get the pfile, I need either a previous copy of that pfile or the spfile, neither of which was actively backed up except as part of the RMAN backup routine.
But I cannot restore the spfile using RMAN until I have the service (or at least to my knowledge at this point).
A catch-22.

I had little time to figure out the correct order for carrying this out, so I did the following: I grabbed the spfile from the backup executive restoration, opened it in a binary editor, scraped the contents out, cleaned up the dynamic parameters (*.parameter_name), removed the ASCII artifacts from the contents and, using Notepad, created a new pfile in the database directory called 'init<instancename>.ora'.
I also verified the paths in the control_files parameter, the log_archive_dest parameters and any other paths stated in the spfile to make sure they were real.
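The path check can be sketched in Python for anyone who would rather not eyeball it. This is a rough illustration and not what I ran on the day: the list of path-bearing parameters in PATH_PARAMS is an assumption and should be adjusted to whatever the pfile actually contains. For control_files it checks the parent directory, since the files themselves do not exist yet.

```python
import os
import re

# Assumption: these are the parameters in the pfile that hold file-system paths.
PATH_PARAMS = ("control_files", "audit_file_dest", "diagnostic_dest",
               "db_recovery_file_dest")


def pfile_paths(pfile_text):
    """Return (parameter, path) pairs for the path-like parameters in a pfile."""
    found = []
    for line in pfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        name = name.strip().split(".")[-1].lower()  # drop the "*." or "sid." prefix
        is_archive_dest = re.fullmatch(r"log_archive_dest_\d+", name) is not None
        if name not in PATH_PARAMS and not is_archive_dest:
            continue
        quoted = re.findall(r"'([^']*)'", value)
        for item in quoted or [value.strip()]:
            # log_archive_dest_n values usually look like 'LOCATION=<path>'
            item = re.sub(r"(?i)^location\s*=\s*", "", item.strip())
            if item:
                found.append((name, item))
    return found


def missing_locations(pfile_text):
    """Return the (parameter, path) pairs whose target directory does not exist."""
    missing = []
    for name, path in pfile_paths(pfile_text):
        # Control files do not exist yet; only their directory has to.
        target = os.path.dirname(path) if name == "control_files" else path
        if not os.path.isdir(target):
            missing.append((name, path))
    return missing
```

Calling missing_locations() on the text of the new pfile should return an empty list before the service is created.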

Create the service:
In a command prompt (run as administrator), and using the hodgepodge pfile I created, I create the service as follows:

set ORACLE_SID=<instancename>
set ORACLE_HOME=E:\Oracle\Product\11.2.0.3\db_1

oradim -NEW -SID <instancename> -STARTMODE manual -PFILE "E:\Oracle\Product\11.2.0.3\db_1\database\init<instancename>.ora"

REM Use this if you've made a mistake in your service creation and want to start again:
REM oradim -DELETE -SID <instancename>

I started the service in services.msc once I had created it. I then used RMAN to restore the spfile from the backup. Hopefully you will have kept a log of which backup piece holds your spfile and control file (usually the same piece). In my case, I had the piece name after reviewing the RMAN backup log we kept for our system backup.
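If the backup log is long, finding that piece by eye is tedious. Here is a minimal Python sketch that scans a saved RMAN log for it. It assumes the log contains the usual "including current SPFILE in backup set" and "piece handle=... tag=..." lines; the sample log and its handle names are made up for illustration, so check the wording against your own logs.

```python
import re

# Made-up excerpt in the style of an RMAN backup log.
SAMPLE_LOG = r"""
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00001 name=E:\ORADATA\DEMO\SYSTEM01.DBF
channel ORA_DISK_1: starting piece 1 at 18-APR-16
channel ORA_DISK_1: finished piece 1 at 18-APR-16
piece handle=H:\RMAN_Backups\DEMO_DATA_1_1.RMAN tag=NIGHTLY comment=NONE
channel ORA_DISK_1: starting full datafile backup set
including current control file in backup set
including current SPFILE in backup set
channel ORA_DISK_1: starting piece 1 at 18-APR-16
channel ORA_DISK_1: finished piece 1 at 18-APR-16
piece handle=H:\RMAN_Backups\DEMO_CTL_1_1.RMAN tag=NIGHTLY comment=NONE
"""


def pieces_with_spfile_or_controlfile(log_text):
    """List the backup pieces whose backup set included the spfile or control file."""
    pieces = []
    has_spfile = has_controlfile = False
    for line in log_text.splitlines():
        line = line.strip()
        if re.search(r"starting .*backup set", line):
            # A new backup set begins, so forget what the previous one held.
            has_spfile = has_controlfile = False
        elif line.startswith("including current SPFILE"):
            has_spfile = True
        elif line.startswith("including current control file"):
            has_controlfile = True
        else:
            match = re.match(r"piece handle=(.+?)(?:\s+tag=|$)", line)
            if match and (has_spfile or has_controlfile):
                pieces.append({
                    "handle": match.group(1),
                    "spfile": has_spfile,
                    "controlfile": has_controlfile,
                })
    return pieces


if __name__ == "__main__":
    for piece in pieces_with_spfile_or_controlfile(SAMPLE_LOG):
        print(piece["handle"])
```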

Restore the spfile:

rman target / nocatalog
shutdown immediate;
startup nomount;
restore spfile from 'H:\RMAN_Backups\<instancename>_2016_04_18\<instancename>_RIR3CALE_1_1.RMAN';
shutdown immediate;
exit

The above script restores the spfile to my %ORACLE_HOME%\database directory.
Once I have this file, I recreate the pfile from the spfile.

sqlplus /nolog
conn / as sysdba
startup nomount;
create pfile='E:\Oracle\Product\11.2.0.3\db_1\database\init<instancename>.ora.fromspfile' from spfile;
shutdown immediate;
exit;

I review the new pfile for any differences between what I scraped together and the last functioning spfile configuration. Had there been a difference, I would have deleted the service and recreated it with the pfile newly created from the restored spfile. As there was no difference, I proceeded to the next important step.
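The comparison can be done by eye or with a diff tool. As an alternative, here is a small Python sketch that compares two pfiles parameter by parameter, so that line order and blank lines do not show up as noise. It skips the auto-tuned entries of the form sid.__name, which is my assumption about what is safe to ignore.

```python
def parse_pfile(text):
    """Parse pfile text into a {parameter: value} dictionary."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        name = name.strip().lower()
        if ".__" in name:
            # Assumption: auto-tuned memory entries (sid.__db_cache_size etc.)
            # are not worth comparing.
            continue
        params[name] = value.strip()
    return params


def pfile_differences(old_text, new_text):
    """Return {parameter: (old_value, new_value)} for every parameter that differs."""
    old, new = parse_pfile(old_text), parse_pfile(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }
```

An empty dictionary means the scraped pfile and the one generated from the restored spfile agree. A value of None on either side means the parameter is missing from that file.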

Safeguard the system from running jobs the moment I open it:

sqlplus /nolog
conn / as sysdba
startup nomount;
alter system set job_queue_processes=0 scope=spfile;
shutdown immediate;

I do this to prevent the system from running real-time jobs wildly before I've set up the supporting non-Oracle systems.

Restore the control files:
Next step is to restore the control files:

rman target / nocatalog
startup nomount;
restore controlfile from 'H:\RMAN_Backups\<instancename>_2016_04_18\<instancename>_RIR3CALE_1_1.RMAN';
mount database;
exit

This succeeds because I've checked that the destination directories of the control files are all valid.
The next step is to catalog the remaining backup pieces and archive logs so Oracle can restore and recover the database.

Catalog the remaining backup pieces:

rman target / nocatalog
catalog start with 'H:\RMAN_Backups\<instancename>_2016_04_18\';
exit;

This command will dig through the specified directory and its subdirectories, hunting for RMAN backup pieces to use in any restorative commands.

Restore and recover the database:
Finally, I restore and recover the database:

rman target / nocatalog
run {
restore database;
recover database;
}
exit;

The above step should open the cataloged backup pieces, restore the data files to their allocated drives and directories and lastly apply all the archive logs it can. The restore will be successful but the recover will eventually "fail" when the recovery process runs out of archive logs to apply:
unable to find archive log.

archive log thread=1 sequence=###
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 20/04/2016 14:27:09
RMAN-06054: media recovery requesting unknown log: thread 1 seq ### lowscn ###

This is normal when performing an incomplete recovery. I believe the only way around this is to have a copy of the online redo logs, which are never part of an RMAN backup set.
If I had those, I wouldn't be restoring or recovering. So, once this message is received, I proceed to open the database using resetlogs.
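One way to make the stopping point explicit is to work out the last archive log available before recovering. RMAN's SET UNTIL SEQUENCE clause stops recovery just before the sequence it names, so the value to give it would be the last available sequence plus one. The Python sketch below finds that last sequence from a list of file names. It rests on two assumptions: the names follow an ARC<sequence>_... pattern (yours depends on log_archive_format), and the lowest sequence found is one the backup actually needs. It handles a single thread only.

```python
import re


def last_contiguous_sequence(filenames, pattern=r"ARC(\d+)_"):
    """Return the highest sequence reachable from the lowest one without a gap.

    `pattern` is a hypothetical file name format; its first group must
    capture the log sequence number. Returns None if nothing matches.
    """
    seqs = set()
    for name in filenames:
        match = re.search(pattern, name)
        if match:
            seqs.add(int(match.group(1)))
    if not seqs:
        return None

    ordered = sorted(seqs)
    last = ordered[0]
    for seq in ordered[1:]:
        if seq != last + 1:
            break  # first gap: recovery cannot go past this point
        last = seq
    return last


if __name__ == "__main__":
    # Made-up file names for illustration.
    logs = [
        "ARC0000000101_0900000000.0001",
        "ARC0000000102_0900000000.0001",
        "ARC0000000103_0900000000.0001",
        "ARC0000000105_0900000000.0001",  # 104 is missing
    ]
    last = last_contiguous_sequence(logs)
    print(f"last usable sequence: {last}, recover until sequence {last + 1}")
```

Knowing the number beforehand also makes it easier to tell a genuinely missing log in the middle of the chain from the expected end of the line.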

Open the database:
The moment of truth.

sqlplus /nolog
conn / as sysdba
alter database open resetlogs;
exit;

The database opened for me and the quiet nervousness in the department subsided.
What resetlogs does is create a fork in the road for the database's identity: it resets the log sequence and stamps all files with a new incarnation so that they are in sync. In other words, I am re-instantiating this database and, for practical purposes, rendering all prior backups, archive logs, control and data files and DR solutions irrelevant to the new incarnation. They become irrelevant because Oracle will not let me freely mix files from one incarnation with another, as a safety mechanism against database corruption. It is like having a citizen's ID number (or social security number) changed: once the number is changed, I would think all previous societal facilities like banking and identity become ineffective because their details no longer match the respective systems, and one would have to set up new accounts. This also means I need to create a new backup and recreate the DR site as soon as possible.

Note: I can still make use of the old backup files but, again, I would need to open the database with resetlogs. RESETLOGS only renders old files from the previous incarnation irrelevant to my NEWLY opened database incarnation.

Re-enable the jobs:
I proceed to fix the integration solutions and switch the database jobs back on by running:

sqlplus /nolog
conn / as sysdba
alter system set job_queue_processes=15 scope=both;

Also, I created the listener after I restored the database. I used the NETCA wizard to do this, then overwrote the contents with the restored listener.ora. I also reinstated the previous tnsnames.ora and sqlnet.ora files from the backup executive backups. Lastly, I copied the old password file back and, prior to starting the restoration, I had copied the old diag directory into the new home. I then began working on the EDI solutions to get them working again.

Once this was all done, my joints were aching and I became aware of my condition - my head was burning and I was in the middle of a fever. I went home and slept. I went to work the following day to clean up and attend to any other issues that I may have missed. Things are running smoothly again. Today, I took the day off to recover, write this post and prepare for my exams.

Monday, 18 April 2016

Ponta Do Ouro 2016


My brother and I
Mozambique is a colourful place. It is a developing country, so you can expect ramshackle buildings and dirt roads along the outskirts. In the short time I was in Ponta Do Ouro I did not see a single asphalt road or a computer system. The house I stayed in had intermittent hot water or sometimes no water at all. I found out later that other places were better geared up and had better facilities. However, for all that is lacking, there are two things Ponta Do Ouro has in abundance: freedom and safety. Regulation is a lot more 'relaxed' - I could walk around with a drink in my hand, and some might say I would have been looked on suspiciously if I didn't have one. No one hassled me because of all my camera gear and I was not checked by any agencies. Ponta Do Ouro is also very safe. The only crime there is petty, so I always kept one eye on anything I left alone. The other good thing is that the border post by Kosi Bay was a breeze to get in and out of. Mozambican border posts can have problems, and hot ticket items like DSLR camera gear and laptops can be a cause of contention. The southern border post is welcoming and friendly. I was fortunate that the weather was cool - cooler than Durban. Usually Mozambique is very hot, so I dressed in vests, shorts and flip flops, as seen below.



Ponta Do Ouro
Usually, the Mozambicans who actively engaged me were hawkers but the individuals I actively engaged with were friendly and talkative. Many Mozambicans have been impacted by the civil war to some degree and each person has his or her story of the period.

Other highlights of my trip included riding a quad bike 10km north of Ponta Do Ouro to Ponta Malongane, stopping along the way to drink R&Rs at each bar, and then driving back.

The other unexpected highlight was swimming with dolphins. There is something special about being taken a kilometre out to sea and being dumped into the very warm, very clear and very deep Indian Ocean to bob around next to a pod of dolphins. My family and friends enjoyed the experience as much as I did, but I was a bit more thrilled by the fact that we were swimming very far out at sea with an 8 foot swell lurching around us like blue hills.

For a comprehensive history of Mozambique, see the wiki on Mozambique.

NOTE: I did not go to Maputo. After reading other accounts by travellers, it sounds as if I would not have had as great a time with my camera in Maputo as I did in Ponta Do Ouro, so read up on where you are going should you decide to go to Mozambique. Tensions between the 2 groups have also recently flared up: it is advised not to head too far north. The South is safe and unaffected.

TIPS:
-If you are taking a DSLR or notebook - print or write all the lens and component serial numbers down on paper so that you can account for the items in the event you are asked to clear them at customs while leaving or entering the country. I did not have to do this even when I insisted, but I have read stories where if you cannot provide the serials, you may have to buy your items back when returning to your country of origin.

-Don't drink the water - buy bottled water, the bigger bottles if you're staying for long

-Keep mosquito repellent on you as malaria can kill you, which brings me to

-Check the area you're visiting for mosquitoes and malaria - take anti-malaria pills a few weeks before you travel. Ponta Do Ouro is a low risk area; I used a mosquito net and repellent - I took no pills.

-There is no cell phone coverage from South African networks - you will either need to buy a Mozambican sim card or put your phone into roaming mode if you plan to be a DBA on call in a foreign country.

-Drink 2 or 3 R&R's if you want a good time, drink more if you want to black out and forget the previous evening. These fuchsia drinks are sweet, but they can put you on the ground quite quickly.

-If you are going to Mozambique on your own without securing passage, make sure you have a vehicle that can handle sand dunes.

Pictures below


Mozambique - Ponta Do Ouro - The Bay
5 minutes North of Kosi Bay Border Post
The Main Road
The Main Corner - early morning

Jack's Barefoot Bar

The Melting Pot Cafe and Chalet
Fernandos
Residential Road
A SAN disk server that has slowly degraded
Homes

Monday, 4 April 2016

Sub-Sahara Africa

I am in the middle of installing a single instance database, a Middleware server and an EDI server (along with their DR counterparts) for an organisation in North Africa. I aim to finish by Wednesday before I embark on a weekend tour of Mozambique. And by tour I mean riding a quad bike across the forgotten dunes of Ponta Do Ouro, armed with my camera and drinking rum out of a jar. The project I am working on has been 12 hours a day, but it is a pleasant break away from Clusterware and a good lead-up to the coming trip.

Below are images I've snapped of South Africa over the last few years


Durban South Africa


Umhlanga


1200MW of power keeping all those databases running
The transport is dated and inefficient but people get by

The Road through Transkei


Burgundy - North Coast KZN


Eastern Cape cattle
North Coast KZN - Lagoon

Heartland