A few days ago, the motherboard on our ESX server blew. This server was home to two different operational environments for the organisation, and the domain I am tasked with maintaining was gone in a flash. The server engineers remounted the SAN on a backup ESX machine with the same physical properties as the failed system. Before beginning recovery I copied logs and backups from our DR to a safe location: over the last 5 months there seem to have been more hardware failures than in the 3 years of my previous stint as a DBA, back in 2010. I know that if I have a backup of the system stored safely somewhere OUTSIDE of a VM environment, I will be able to restore the system regardless of how many servers fail.
Since this was an identical physical server starting up the exact same VMs, which had been running less than 30 minutes earlier, I expected the cluster to come up in exactly the same way.
This was not the case: the servers started, but CRS failed to start. The errors began popping up one after another in the cluster alert log.
E:\OracleGrid\11.2.0.3\log\<node1>\alert<node1>.log
[ohasd(1548)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
..........
[cssd(5064)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in E:\OracleGrid\11.2.0.3\log\<node1>\cssd\ocssd.log
2016-05-30 10:44:05.766:
[ohasd(2424)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server '<node1>'.
Oracle Cluster Synchronization Services (CSS) had failed, and at first glance it seemed to me that this was because the GPNPD daemon was not running. The CSS daemon log below gives more detail.
E:\OracleGrid\11.2.0.3\log\<node1>\cssd\ocssd.log
2016-05-30 10:44:05.329: [ CSSD][5064]clssnmOpenGIPCEndp: opening cluster listener on gipc://<node1>:nm_<CLUSTER_NAME>-cluster
2016-05-30 10:44:05.329: [GIPCGMOD][5064] gipcmodGipcPassInitializeNetwork: Initializing passthrough GIPC
2016-05-30 10:44:05.385: [ GPNP][5064] clsgpnp_Init: [at clsgpnp0.c:585] 'E:\OracleGrid\11.2.0.3' in effect as GPnP home base.
2016-05-30 10:44:05.385: [ GPNP][5064] clsgpnp_Init: [at clsgpnp0.c:619] GPnP pid=5060, GPNP comp tracelevel=1, depcomp tracelevel=0, tlsrc:ORA_DAEMON_LOGGING_LEVELS, apitl:0, complog:1, tstenv:0, devenv:0, envopt:0, flags=2003
2016-05-30 10:44:05.396: [ GPNP][5064] clsgpnpkwf_initwfloc: [at clsgpnpkwf.c:399] Using FS Wallet Location : E:\OracleGrid\11.2.0.3\gpnp\<node1>\wallets\peer\
[ CLWAL][5064]clsw_Initialize: OLR initlevel [70000]
2016-05-30 10:44:05.422: [ GPNP][5064] clsgpnp_profileCallUrlInt: [at clsgpnp.c:2104] get-profile call to url "ipc://GPNPD_<node1>" disco "" [f=3 claimed- host: cname: seq: auth:]
2016-05-30 10:44:05.431: [ GPNP][5064] clsgpnp_profileCallUrlInt: [at clsgpnp.c:2234] Result: (0) CLSGPNP_OK. Successful get-profile CALL to remote "ipc://GPNPD_<node1>" disco ""
2016-05-30 10:44:05.432: [ CLSINET][5064] Returning NETDATA: 0 interfaces
2016-05-30 10:44:05.432: [GIPCXCPT][5064] gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet, ret gipcretFail (1)
2016-05-30 10:44:05.432: [GIPCGMOD][5064] gipcmodGipcPassInitializeNetwork: EXCEPTION[ ret gipcretFail (1) ] failed to determine host from clsinet
2016-05-30 10:44:05.640: [GIPCXCPT][5064] gipcSetAttributeStringF [gipcInternalAddress : gipcInternal.c : 354]: EXCEPTION[ ret gipcretFail (1) ] failure for obj 00000000063EAED0 [000000000000025a] { gipcAddress : name '', objFlags 0x0, addrFlags 0x0 }, name 'name', val 000000000012D130, len 39, flags 0x4000
2016-05-30 10:44:05.640: [GIPCXCPT][5064] gipcEndpointF [clsssclsnrsetup : clsssc.c : 2763]: EXCEPTION[ ret gipcretFail (1) ] failed endp create ctx 0000000005060F20 [000000000000006b] { gipcContext : traceLevel 2, fieldLevel 0x0, numDead 0, numPending 0, numZombie 0, numObj 5, numWait 0, hgid 000000000000006c, flags 0x2, objFlags 0x0 }, name 'gipc://<node1>:nm_<CLUSTER_NAME>-cluster', flags 0x0
2016-05-30 10:44:05.640: [ CSSD][5064]clsssclsnrsetup: gipcEndpoint failed, rc 1
2016-05-30 10:44:05.640: [ CSSD][5064]clssnmOpenGIPCEndp: failed to listen on gipc addr gipc://<node1>:nm_<CLUSTER_NAME>-cluster- ret 1
2016-05-30 10:44:05.640: [ CSSD][5064]clssnmCompleteInitVFDiscovery: failed to open gipc endp
2016-05-30 10:44:05.640: [ CSSD][5064](:CSSSC00011:)clssscExit: A fatal error occurred during initialization
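The lines that matter in the excerpt above are easy to miss among the GIPC exceptions. As a rough sketch (not an Oracle tool), the following Python function scans the text of an ocssd.log for the "no interfaces" messages seen here. The function name and the marker list are my own choices for illustration; the marker strings are copied from this 11.2.0.3 log and may be worded differently in other versions.

```python
# Substrings copied from the ocssd.log excerpt above. This list is an
# assumption for illustration, not an official set of error signatures.
INTERFACE_FAILURE_MARKERS = (
    "Returning NETDATA: 0 interfaces",
    "failed to find any interfaces in clsinet",
    "failed to determine host from clsinet",
)


def find_interface_failures(log_text):
    """Return (line_number, line) pairs for log lines containing a marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in INTERFACE_FAILURE_MARKERS):
            hits.append((number, line.strip()))
    return hits
```

To use it, read the log file into a string (for example with `open(path, errors="replace").read()`) and pass it to `find_interface_failures`. Any hit suggests checking the network interfaces before chasing the later GIPC and CSSD errors, which in this case were only consequences.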
To simplify troubleshooting, I shut down the second VM node and focused exclusively on the first node. I restarted that node and let the cluster try to start by itself. This failed.
This is when I began digging deeper into the logs and turned to MOS (My Oracle Support) to search for a solution, as the errors above did not mean much to me. I was fortunate that the first item that came up in MOS was titled "GI Fails to Start as no Private Network Interface is Available (Doc ID 1481176.1)". Without even opening the note, I checked the private network interfaces. I was relieved to find the cause: the private network interfaces for the node interconnect had been left out of the migration, meaning that the cluster nodes were unable to communicate with each other. I asked the administrators to restore the interfaces. Once they were restored, I managed to get the first node up, followed by the second node.
SUMMARY:
Check the private interface first. Make sure it is visible in the network properties on both nodes in the cluster. If it is not, have the administrators check that the private network interfaces have been made available to the VMs in the cluster.
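The same check can be scripted as a comparison between what the cluster expects and what the operating system presents. This is a minimal sketch with several assumptions:

- The configured interfaces come from `oifcfg getif` output, one "name subnet scope type" entry per line.
- The interfaces actually present come from `oifcfg iflist` output, one "name subnet" entry per line.
- Interface names may contain spaces on Windows.

The function names are hypothetical, and `oifcfg getif` may not respond while the stack is down, in which case the configured names would have to come from saved output or documentation.

```python
def parse_interconnect_names(getif_output):
    """Names marked cluster_interconnect in (assumed) `oifcfg getif` output."""
    names = set()
    for line in getif_output.splitlines():
        # Split from the right so a name containing spaces stays in one piece.
        parts = line.rsplit(None, 3)
        if len(parts) == 4 and "cluster_interconnect" in parts[3]:
            names.add(parts[0].strip())
    return names


def parse_present_names(iflist_output):
    """Interface names listed in (assumed) `oifcfg iflist` output."""
    names = set()
    for line in iflist_output.splitlines():
        parts = line.rsplit(None, 1)  # last token is the subnet
        if len(parts) == 2:
            names.add(parts[0].strip())
    return names


def missing_interconnects(getif_output, iflist_output):
    """Configured interconnect interfaces that the OS does not present."""
    configured = parse_interconnect_names(getif_output)
    present = parse_present_names(iflist_output)
    return sorted(configured - present)
```

A non-empty result on either node would point to the situation described in this post: the cluster is configured for a private interface that the VM no longer has.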