Upon reviewing the weekend scheduler sent emails for the system, I discovered the following output for our DR standby email:
DRDB DATABASE
=============
SP2-0640: Not connected
The correct output should show the number of hours our DR is trailing the production system. Clearly something was wrong.
I investigate the cluster log
E:\OracleGrid\11.2.0.3\log\<NODE_1>\alert<NODE_1>.log:
2016-08-07 00:15:10.005:
[ohasd(2040)]CRS-2112:The OLR service started on node <NODE_1>.
2016-08-07 00:15:10.425:
[ohasd(2040)]CRS-1301:Oracle High Availability Service started on node <NODE_1>.
It began producing these messages again and again from Sunday morning until Monday morning. Something had occurred at midnight which the cluster-ware did not like.
I looked at the
services.msc panel and found the
OracleOHService to be down. I manually started it up and watched the log. It produced the aforementioned message again and the server went but to a shutdown state. OHAS was not starting and showed no errors in either log thus far. I then went into the OHASD log folder to see what I could see.
I found a series of 100M dump files output by the failed process as well as the stack dumps.
The dump read:
Symbol file E:\OracleGrid\11.2.0.3\bin\orannzsbb11.SYM does not match binary.
This did not look good I thought, I've seen this issue in the past and it had to do with a failed opatch. I had not opatched anything in DR before.
I looked through
E:\OracleGrid\11.2.0.3\log\<NODE_1>\ohasd\ohasdOUT.log and found:
08/08/16 08:42:03 ssmain_run_crs: CRS starting with restart argument
08/08/16 08:42:03 ssmain_stateCallback: marking daemon service as active
08/08/16 08:42:03 ssmain_monitor_thread setting the service status to Running
08/08/16 08:42:03 **** caught exception (0xe06d7363) in thread 2756 ****
Oracle Database 11g OHAS Release 11.2.0.3.0 Production Copyright 1996, 2009, Oracle. All rights reserved.
Main: Parameters
WaitForAutoStart: 1
ServiceStartupDelay: 0
MaxAutoStartDelay: 600000
OhasdQueryServiceInfo: failed to get service handle for NisDrv, err(5)
OhasdReadService: failed getting information for NisDrv, skipping
OhasdQueryServiceInfo: failed to get service handle for NisSrv, err(5)
OhasdReadService: failed getting information for NisSrv, skipping
OhasdReadService: skipping service Oracle ACFS
OhasdReadService: skipping service Oracle ADVM
OhasdReadService: skipping service Oracle OKS
OhasdReadService: skipping service OracleASMService+ASM
OhasdReadService: skipping service OracleOraCrs11g_home1TNSListener
OhasdReadService: skipping service OracleOraDb11g_home1ClrAgent
OhasdReadService: skipping service OracleService<SID>
Main: service list contains 0
Main: completed waiting for 0 startup delay
Main: completed all work, exiting
The drives we not starting. This really cannot be good I thought. I read the ohasd.log file and found:
E:\OracleGrid\11.2.0.3\log\<NODE_1>\ohasd\ohasd.log
2016-08-08 08:42:03.150: [ default][6944] OHASD Daemon Starting. Command string :restart
2016-08-08 08:42:03.150: [ default][6944] Initializing OLR
.........................
2016-08-08 08:42:03.275: [ CRSPE][2756] {0:0:2} Sent request to write event sequence number 5500000 to repository
2016-08-08 08:42:03.290: [ CRSPE][2756] {0:0:2} Wrote new event sequence to repository
2016-08-08 08:42:03.306: [ CRSPE][2756] {0:0:2} Reading (7) resources
2016-08-08 08:42:03.306: [ CRSPE][2756] {0:0:2} Reading (1) server pools
2016-08-08 08:42:03.306: [ CRSPE][2756] {0:0:2} Reading (13) types
2016-08-08 08:42:03.853: [ CRSPE][2756] {0:0:2} Finished reading configuration. Parsing...
2016-08-08 08:42:03.853: [ CRSPE][2756] {0:0:2} Parsing resource types...
2016-08-08 08:42:03.868: [ CRSSEC][2756] {0:0:2} Exception: OwnerEntry construction failed to retrieve user id by name with ACL string: owner:<DOMAIN>\<OLD_USER>:rwx and error: 1
2016-08-08 08:42:03.868: [ CRSSEC][2756] {0:0:2} Exception: ACL entry creation failed for: owner:<DOMAIN>\<OLD_USER>:rwx
For a very strange reason, the clusterware wanted to validate the owner of each resource, however, the user it was set to was the previous DBA. This is strange considering we only ever use a global administration account across all machines for all Oracle administration purposes. I struggled to think of what file or process may want to use the old credentials. I then executed:
ocrdump -local -stdout
This revealed the problem in detail
[CRS]
UNDEF :
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_READ, USER_NAME : <DOMAIN>\<OLD_USER>, GROUP_NAME : }
The CRS component along with all the components in the OCR, were registered to the old account. So, if the account was gone, it would explain why the CRS was not coming up.
I searched for the user in active directory via a PROPERTIES>SECURITY>ADVANCED>OWNER>EDIT to see if the user existed. It did not. I could find my user, but not this one.
I then requested that the network administrator contact the domain administrator to reinstate the user. It turned out that the user was deleted over a year ago but the DR server had not been restarted since. Reinstating the account it resolved the issue and cluster started up correctly.
Note
1491367.1 has details on how to handle the issue. However, removing components was not applicable in this situation owing to the fact that the vast majority of components were all owned by this single user. I will look for a way to change the owner of components in time.