Thursday 18 August 2016

Oracle 12C Developer VM and TWO_TASK

I managed to get my hands on an Oracle 12C developer image along with Oracle VirtualBox, which can be downloaded from here...


It's come a long way since the last time I played with Oracle VirtualBox and I am impressed at how simple the setup is these days.

"You know sonny, back in my day, we had to create the VM image from scratch, install the OS, the software, setup the network, prepare the virtual drives and partition space ourselves! Easily a day or two of work!"

Not any more - all I had to do was grab the Oracle 12C .OVA file, load it into Oracle VirtualBox, and within 20 minutes I had a polished, running copy of Linux, a copy of Oracle 12C EE and both a CDB and a PDB. Lovely.
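If you prefer a terminal to the GUI, the same import can be scripted with VBoxManage (a sketch only; the .ova file name below is a placeholder for whatever the download is actually called):

# import the downloaded appliance into VirtualBox from the host
VBoxManage import Oracle12cDeveloperVM.ova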

The best part was that I was able to set up port forwarding via the settings panel, which meant I could use my SSH client tools from the host. I forwarded port 22 from the host to the VM.



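The equivalent NAT rule can also be added from the host command line (a sketch; the VM name is a placeholder, and you may want a different host port if 22 is already taken):

# add an SSH port-forwarding rule to the VM's NAT adapter
# (host port 22 -> guest port 22; run with the VM powered off,
#  or use "VBoxManage controlvm ... natpf1 ..." on a running VM)
VBoxManage modifyvm "Oracle12cDevVM" --natpf1 "ssh,tcp,,22,,22"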
I can now connect to the VM using MobaXterm and work as a client.


It's also been a very long time since I've worked in Linux, and I have missed it; I'll be using this new setup to explore the new features of 12C. So long, in fact, that I forgot all about TWO_TASK and what happens when it is set in .profile and never unset:

[oracle@localhost ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Thu Aug 18 13:17:10 2016

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

ERROR:
ORA-01017: invalid username/password; logon denied

[oracle@localhost ~]$ unset TWO_TASK
[oracle@localhost ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Thu Aug 18 13:17:23 2016

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL>
If I don't unset TWO_TASK, it takes precedence over ORACLE_SID, so '/ as sysdba' attempts a network connection instead of a local, OS-authenticated one and fails with ORA-01017.
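To keep it from biting again, a quick check in a new session looks something like this (a shell sketch; ~/.bash_profile is just a guess at the other usual place the export might hide):

# see whether TWO_TASK is set; if it is, "/ as sysdba" goes over SQL*Net
# instead of a local bequeath connection and OS authentication fails
echo "TWO_TASK=${TWO_TASK:-<not set>}"

# clear it for the current session
unset TWO_TASK

# find where it is being exported so it can be removed permanently
grep -n TWO_TASK ~/.profile ~/.bash_profile 2>/dev/null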

Friday 12 August 2016

Another Crash

The same system impacted in April by the lost SAN went down again yesterday. The VM file vanished after the operators performed a routine system restart; we believe something else occurred and that the operators did more than a routine reboot. As frustrating as the news was, the server admins rebuilt the OS, I restored Oracle, and the system was up and open for business within 4 hours of the call being made to recreate the environment. The database was restored to the point of failure too. I am in the process of restoring the final integration components today.

I used the same notes I put down in the page: Database Fever
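For reference, the restore side of those notes boils down to the usual RMAN sequence (a minimal sketch only, assuming controlfile autobackups and all archived logs up to the failure are available; the page above has the real, environment-specific steps):

RMAN> STARTUP NOMOUNT;
RMAN> RESTORE CONTROLFILE FROM AUTOBACKUP;
RMAN> ALTER DATABASE MOUNT;
RMAN> RESTORE DATABASE;
RMAN> RECOVER DATABASE;
RMAN> ALTER DATABASE OPEN RESETLOGS;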

Monday 8 August 2016

DR CRS Fails to Start - ACL entry creation failed for: owner

Upon reviewing the weekend emails sent by the scheduler for the system, I discovered the following output in our DR standby email:

DRDB DATABASE
=============
SP2-0640: Not connected

The correct output should show the number of hours our DR is trailing the production system. Clearly something was wrong.
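For context, the email job is simply SQL*Plus connecting to the standby and reporting its lag, along these lines (a sketch, not our actual script, which may compute the figure differently):

-- hours the standby is trailing, based on the newest applied archive log
SELECT ROUND((SYSDATE - MAX(first_time)) * 24, 1) AS hours_behind
FROM   v$archived_log
WHERE  applied = 'YES';

SP2-0640 means SQL*Plus never obtained a connection at all, so the problem sat below the database itself.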
I investigated the cluster log E:\OracleGrid\11.2.0.3\log\<NODE_1>\alert<NODE_1>.log:
2016-08-07 00:15:10.005: 
[ohasd(2040)]CRS-2112:The OLR service started on node <NODE_1>.
2016-08-07 00:15:10.425: 
[ohasd(2040)]CRS-1301:Oracle High Availability Service started on node <NODE_1>.

It had been producing these messages again and again from Sunday morning until Monday morning. Something had occurred at midnight which the clusterware did not like.

I looked at the services.msc panel and found the OracleOHService to be down. I manually started it up and watched the log. It produced the aforementioned message again and the service went back to a stopped state. OHAS was not starting and had shown no errors in either log thus far. I then went into the OHASD log folder to see what I could see.
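Incidentally, the same check and restart can be driven from a command prompt rather than clicking through services.msc (a sketch; OracleOHService is the service name shown in the panel):

REM check the state of the OHAS Windows service
sc query OracleOHService

REM try to start it, then ask the clusterware stack itself
net start OracleOHService
crsctl check has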

I found a series of 100 MB dump files output by the failed process, as well as the stack dumps.
The dump read:
Symbol file E:\OracleGrid\11.2.0.3\bin\orannzsbb11.SYM does not match binary.

This did not look good, I thought; I've seen this issue in the past and it had to do with a failed OPatch run. I had not applied any patches in DR before.
I looked through E:\OracleGrid\11.2.0.3\log\<NODE_1>\ohasd\ohasdOUT.log and found:

08/08/16 08:42:03  ssmain_run_crs:  CRS starting with restart argument
08/08/16 08:42:03  ssmain_stateCallback:  marking daemon service as active
08/08/16 08:42:03  ssmain_monitor_thread setting the service status to Running
08/08/16 08:42:03  **** caught exception (0xe06d7363) in thread 2756 ****
Oracle Database 11g OHAS Release 11.2.0.3.0 Production Copyright 1996, 2009, Oracle.  All rights reserved.
Main:  Parameters
  WaitForAutoStart:     1
  ServiceStartupDelay:  0
  MaxAutoStartDelay:    600000
OhasdQueryServiceInfo:  failed to get service handle for NisDrv, err(5)
OhasdReadService:  failed getting information for NisDrv, skipping
OhasdQueryServiceInfo:  failed to get service handle for NisSrv, err(5)
OhasdReadService:  failed getting information for NisSrv, skipping
OhasdReadService:  skipping service Oracle ACFS
OhasdReadService:  skipping service Oracle ADVM
OhasdReadService:  skipping service Oracle OKS
OhasdReadService:  skipping service OracleASMService+ASM
OhasdReadService:  skipping service OracleOraCrs11g_home1TNSListener
OhasdReadService:  skipping service OracleOraDb11g_home1ClrAgent
OhasdReadService:  skipping service OracleService<SID>
Main:  service list contains 0
Main:  completed waiting for 0 startup delay
Main:  completed all work, exiting

The drivers were not starting. This really cannot be good, I thought. I read the ohasd.log file and found:
E:\OracleGrid\11.2.0.3\log\<NODE_1>\ohasd\ohasd.log
2016-08-08 08:42:03.150: [ default][6944] OHASD Daemon Starting. Command string :restart
2016-08-08 08:42:03.150: [ default][6944] Initializing OLR

.........................
2016-08-08 08:42:03.275: [   CRSPE][2756] {0:0:2} Sent request to write event sequence number 5500000 to repository
2016-08-08 08:42:03.290: [   CRSPE][2756] {0:0:2} Wrote new event sequence to repository
2016-08-08 08:42:03.306: [   CRSPE][2756] {0:0:2} Reading (7) resources
2016-08-08 08:42:03.306: [   CRSPE][2756] {0:0:2} Reading (1) server pools
2016-08-08 08:42:03.306: [   CRSPE][2756] {0:0:2} Reading (13) types
2016-08-08 08:42:03.853: [   CRSPE][2756] {0:0:2} Finished reading configuration. Parsing...
2016-08-08 08:42:03.853: [   CRSPE][2756] {0:0:2} Parsing resource types...
2016-08-08 08:42:03.868: [  CRSSEC][2756] {0:0:2} Exception: OwnerEntry construction failed to retrieve user id by name with ACL string: owner:<DOMAIN>\<OLD_USER>:rwx and error: 1
2016-08-08 08:42:03.868: [  CRSSEC][2756] {0:0:2} Exception: ACL entry creation failed for: owner:<DOMAIN>\<OLD_USER>:rwx

For some strange reason, the clusterware wanted to validate the owner of each resource; however, the user it was set to was the previous DBA. This is strange considering we only ever use a global administration account across all machines for all Oracle administration purposes. I struggled to think of what file or process might want to use the old credentials. I then executed:

ocrdump -local -stdout

This revealed the problem in detail:

[CRS]
UNDEF : 
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_READ, USER_NAME : <DOMAIN>\<OLD_USER>, GROUP_NAME : }

The CRS component, along with all the components in the OCR, was registered to the old account. So if the account was gone, that would explain why the CRS was not coming up.
I searched for the user in Active Directory via a Properties > Security > Advanced > Owner > Edit dialog to see whether the user still existed. It did not. I could find my user, but not this one.
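The same existence check can be done without the GUI (a sketch; <DOMAIN> and <OLD_USER> are the placeholders used above):

REM ask the domain controller whether the account still exists
net user <OLD_USER> /domain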

I then requested that the network administrator contact the domain administrator to reinstate the user. It turned out that the user had been deleted over a year ago, but the DR server had not been restarted since. Reinstating the account resolved the issue and the cluster started up correctly.

Note 1491367.1 has details on how to handle the issue; however, removing components was not applicable in this situation because the vast majority of components were owned by this single user. I will look for a way to change the owner of components in time.
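One avenue to try is crsctl setperm, which can reassign the owner of an individual registered resource (a sketch only; I have not tested it against this DR stack, and it covers resources rather than the registry security keys shown by ocrdump):

REM show the current ACL for a resource, then hand ownership to the
REM global administration account (resource and account names are placeholders)
crsctl getperm resource ora.<SID>.db
crsctl setperm resource ora.<SID>.db -o <DOMAIN>\<ADMIN_USER>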