Monday 8 August 2016

DR CRS Fails to Start - ACL entry creation failed for: owner

Upon reviewing the weekend scheduler sent emails for the system, I discovered the following output for our DR standby email:

DRDB DATABASE
=============
SP2-0640: Not connected

The correct output should show the number of hours our DR is trailing the production system. Clearly something was wrong.
I investigate the cluster log E:\OracleGrid\11.2.0.3\log\<NODE_1>\alert<NODE_1>.log:
2016-08-07 00:15:10.005: 
[ohasd(2040)]CRS-2112:The OLR service started on node <NODE_1>.
2016-08-07 00:15:10.425: 
[ohasd(2040)]CRS-1301:Oracle High Availability Service started on node <NODE_1>.

It began producing these messages again and again from Sunday morning until Monday morning. Something had occurred at midnight which the cluster-ware did not like.

I looked at the services.msc panel and found the OracleOHService to be down. I manually started it up and watched the log. It produced the aforementioned message again and the server went but to a shutdown state. OHAS was not starting and showed no errors in either log thus far. I then went into the OHASD log folder to see what I could see.

I found a series of 100M dump files output by the failed process as well as the stack dumps.
The dump read:
Symbol file E:\OracleGrid\11.2.0.3\bin\orannzsbb11.SYM does not match binary.

This did not look good I thought, I've seen this issue in the past and it had to do with a failed opatch. I had not opatched anything in DR before.
I looked through E:\OracleGrid\11.2.0.3\log\<NODE_1>\ohasd\ohasdOUT.log and found:

08/08/16 08:42:03  ssmain_run_crs:  CRS starting with restart argument
08/08/16 08:42:03  ssmain_stateCallback:  marking daemon service as active
08/08/16 08:42:03  ssmain_monitor_thread setting the service status to Running
08/08/16 08:42:03  **** caught exception (0xe06d7363) in thread 2756 ****
Oracle Database 11g OHAS Release 11.2.0.3.0 Production Copyright 1996, 2009, Oracle.  All rights reserved.
Main:  Parameters
  WaitForAutoStart:     1
  ServiceStartupDelay:  0
  MaxAutoStartDelay:    600000
OhasdQueryServiceInfo:  failed to get service handle for NisDrv, err(5)
OhasdReadService:  failed getting information for NisDrv, skipping
OhasdQueryServiceInfo:  failed to get service handle for NisSrv, err(5)
OhasdReadService:  failed getting information for NisSrv, skipping
OhasdReadService:  skipping service Oracle ACFS
OhasdReadService:  skipping service Oracle ADVM
OhasdReadService:  skipping service Oracle OKS
OhasdReadService:  skipping service OracleASMService+ASM
OhasdReadService:  skipping service OracleOraCrs11g_home1TNSListener
OhasdReadService:  skipping service OracleOraDb11g_home1ClrAgent
OhasdReadService:  skipping service OracleService<SID>
Main:  service list contains 0
Main:  completed waiting for 0 startup delay
Main:  completed all work, exiting

The drives we not starting. This really cannot be good I thought. I read the ohasd.log file and found:
E:\OracleGrid\11.2.0.3\log\<NODE_1>\ohasd\ohasd.log
2016-08-08 08:42:03.150: [ default][6944] OHASD Daemon Starting. Command string :restart
2016-08-08 08:42:03.150: [ default][6944] Initializing OLR

.........................
2016-08-08 08:42:03.275: [   CRSPE][2756] {0:0:2} Sent request to write event sequence number 5500000 to repository
2016-08-08 08:42:03.290: [   CRSPE][2756] {0:0:2} Wrote new event sequence to repository
2016-08-08 08:42:03.306: [   CRSPE][2756] {0:0:2} Reading (7) resources
2016-08-08 08:42:03.306: [   CRSPE][2756] {0:0:2} Reading (1) server pools
2016-08-08 08:42:03.306: [   CRSPE][2756] {0:0:2} Reading (13) types
2016-08-08 08:42:03.853: [   CRSPE][2756] {0:0:2} Finished reading configuration. Parsing...
2016-08-08 08:42:03.853: [   CRSPE][2756] {0:0:2} Parsing resource types...
2016-08-08 08:42:03.868: [  CRSSEC][2756] {0:0:2} Exception: OwnerEntry construction failed to retrieve user id by name with ACL string: owner:<DOMAIN>\<OLD_USER>:rwx and error: 1
2016-08-08 08:42:03.868: [  CRSSEC][2756] {0:0:2} Exception: ACL entry creation failed for: owner:<DOMAIN>\<OLD_USER>:rwx

For a very strange reason, the clusterware wanted to validate the owner of each resource, however, the user it was set to was the previous DBA. This is strange considering we only ever use a global administration account across all machines for all Oracle administration purposes. I struggled to think of what file or process may want to use the old credentials. I then executed:

ocrdump -local -stdout

This revealed the problem in detail

[CRS]
UNDEF : 
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_READ, USER_NAME : <DOMAIN>\<OLD_USER>, GROUP_NAME : }

The CRS component along with all the components in the OCR, were registered to the old account. So, if the account was gone, it would explain why the CRS was not coming up.
I searched for the user in active directory via a PROPERTIES>SECURITY>ADVANCED>OWNER>EDIT to see if the user existed. It did not. I could find my user, but not this one.

I then requested that the network administrator contact the domain administrator to reinstate the user. It turned out that the user was deleted over a year ago but the DR server had not been restarted since. Reinstating the account it resolved the issue and cluster started up correctly.

Note 1491367.1 has details on how to handle the issue. However, removing components was not applicable in this situation owing to the fact that the vast majority of components were all owned by this single user. I will look for a way to change the owner of components in time.

4 comments:

  1. Hi Ray

    Thanks for this post. It saved my production system! I just experienced exactly the same issue and I was clueless as to how to solve it so thank you!

    Did you ever find a way of changing the ownership of the components? if so how did you do it?

    Tristan

    ReplyDelete
  2. Hey Tristan, thats great to hear. I am glad the post was able to assist.

    No, I did not find a way to change the owning user - I read a document or MOS post saying that the user is not changeable It may be possible to hack the OCR but I have not gone down this road. Over this last weekend, I reinstalled the DR site, in the above post, from the ground up - but with the universal admin user that is used across the domain. Again, I am glad that this post was helpful.

    ReplyDelete
  3. Ray, thank you very much for your post, if it wasnt for you, I'd probably be still searching for a solution until now. I'm trying to figure out how to change this username inside oracle. As far as I could find, Oracle suggestion is to reinstall GI using a proper windows domain account (557273.1).

    Marlon W. Silva

    ReplyDelete
    Replies
    1. No problem Marlon, glad to help. I tried looking into the OCR but it was not 'adjustable' as are some of the other windows clusterware components. Reinstating the old user can buy time until the components or site can be reinstalled.

      Delete