Monday 4 July 2016

ORA.CVU Memory Leak - Windows Bug


The production cluster at the company I contract to had been running smoothly for ~30 days. Last week, during my morning check, I discovered that the listener and the second and third SCAN resources on our cluster were in a state I had not seen before:

INTERMEDIATE - CHECK TIMED OUT


crsctl stat res -t returned the following:

CLUSTER HEALTH REPORT
---------------------
crsctl stat res -t
------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
------------------------------------------------
Local Resources
------------------------------------------------
................
ora.LISTENER.lsnr
               ONLINE  INTERMEDIATE <node1>    CHECK TIMED OUT     
               ONLINE  ONLINE       <node2>                              
................
------------------------------------------------
Cluster Resources
------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       <node2>                              
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  INTERMEDIATE <node1>   CHECK TIMED OUT     
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  INTERMEDIATE <node1>   CHECK TIMED OUT     
ora.cvu
      1        ONLINE  ONLINE       <node1>                              
......................

According to the logs, the issue began at 01:47 that morning and the resources had remained in this state ever since.

The clusterware alert log did not reveal the underlying cause of the issue:
E:\OracleGrid\11.2.0.3\log\<node1>\alert<node1>.log

[E:\OracleGrid\11.2.0.3\bin\oraagent.exe(4348)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {1:60709:2} in E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log.

[E:\OracleGrid\11.2.0.3\bin\oraagent.exe(5148)]CRS-5014:Agent "E:\OracleGrid\11.2.0.3\bin\oraagent.exe" timed out starting process "E:\OracleGrid\11.2.0.3\bin\lsnrctl.exe" for action "check": details at "(:CLSN00009:)" in "E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log"

[E:\OracleGrid\11.2.0.3\bin\oraagent.exe(5128)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN3.lsnr'. Details at (:CRSAGF00113:) {1:60709:48526} in E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log.

Looking at the E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log file, I found the following:
2016-07-01 11:08:36.517: [   AGENT][5124] {1:60709:2} UserErrorException: Locale is ..
2016-07-01 11:08:36.517: [ora.LISTENER.lsnr][5124] {1:60709:2} [check] clsnUtils::error Exception type=2 string=
CRS-5017: The resource action "ora.LISTENER.lsnr check" encountered the following error:
(:CLSN00009:)Utils:execCmd aborted. For details refer to "(:CLSN00109:)" in "E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log".
..
2016-07-01 11:08:38.421: [ora.LISTENER_SCAN2.lsnr][4348] {1:60709:2} [check] got lock..
2016-07-01 11:08:38.421: [ora.LISTENER_SCAN2.lsnr][4348] {1:60709:2} [check] tryActionLock }.
2016-07-01 11:08:38.421: [ora.LISTENER_SCAN2.lsnr][4348] {1:60709:2} [check] abort  }.
2016-07-01 11:08:38.421: [ora.LISTENER_SCAN2.lsnr][4348] {1:60709:2} [check] (:CLSN00110:) clsn_agent::abort }..
2016-07-01 11:08:38.421: [    AGFW][4348] {1:60709:2} Command: check for resource: ora.LISTENER_SCAN2.lsnr 1 1 completed with status: TIMEDOUT..
...

This message appears many times, and the same response is logged for the SCAN2 and SCAN3 checks.

However, users were not complaining: sessions were distributed evenly across all cluster nodes and operations could be considered "smooth". According to v$session and the listener logs, new connections were still being made through both the local listeners and the SCAN listeners. I began checking MOS for notes on this issue and found that a few different problems are associated with these symptoms, so I started cross-referencing the log files against the items found on MOS. One note made reference to a memory leak, so I grepped through all the listener logs and found a match.
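
For reference, a quick way to confirm that connections are still spread across the instances is to count user sessions per instance from SQL*Plus (a minimal sketch; gv$session is the cluster-wide equivalent of the v$session check mentioned above):

E:\>sqlplus / as sysdba

SQL> select inst_id, count(*) from gv$session where type = 'USER' group by inst_id order by inst_id;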

Early that morning the heap memory limit had been reached, and the first sign of trouble was revealed in the listener alert file:
E:\OracleGrid\11.2.0.3\log\diag\tnslsnr\<node1>\listener\alert.xml
<msg time='2016-07-01T01:47:16.033+02:00' org_id='oracle' comp_id='tnslsnr'
 type='UNKNOWN' level='16' host_id='<node1>'
 host_addr='::1'>
 <txt>TNS-12531: TNS:cannot allocate memory
 </txt>
</msg>

Just as the note had described, there was another sign to look for - memory exceptions in the cvutrace.log file
E:\OracleGrid\11.2.0.3\log\<node1>\cvu\cvutrc\cvutrace.log.0

[6880@<node1>] [main] [ 2016-07-01 05:27:02.783 CAT ] [VerificationUtil.getNodelist:685]  ==== Nodelist after: nodelist[0]:<node1>
Exception in thread "main" java.lang.NullPointerException
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:989)
at java.lang.Double.valueOf(Double.java:447)
at java.lang.Double.<init>(Double.java:539)
at oracle.ops.verification.framework.engine.component.FreeSpaceComponent.initializeReqdValues(FreeSpaceComponent.java:240)
at oracle.ops.verification.framework.engine.component.FreeSpaceComponent.init(FreeSpaceComponent.java:98)
at oracle.ops.verification.framework.engine.component.FreeSpaceComponent.<init>(FreeSpaceComponent.java:85)
at oracle.ops.verification.framework.engine.component.ComponentFactory.createComponent(ComponentFactory.java:164)
at oracle.ops.verification.framework.engine.component.HealthComponent.verifyComponents(HealthComponent.java:211)
at oracle.ops.verification.framework.engine.component.HealthComponent.verify(HealthComponent.java:145)
at oracle.ops.verification.framework.engine.ClusterVerifier.verifyComponent(ClusterVerifier.java:117)
at oracle.ops.verification.client.CluvfyDriver.main(CluvfyDriver.java:276)

The note I am referring to is: Windows: Ora.LISTENER.lsnr Status goes to INTERMEDIATE CHECK TIMED OUT (Doc ID 1523366.1).

The bug in a nutshell: ora.cvu runs cluvfy comp health -_format but fails to close its session, thereby consuming heap memory until the listener and SCAN resources on the node where CVU is hosted begin failing.

I opened Process Explorer and found 125 instances of the leftover cluvfy command:

C:\Windows\system32\cmd.exe  /K E:\OracleGrid\11.2.0.3\bin\cluvfy  comp health -_format 
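
For anyone who prefers the command line to Process Explorer, the leftover wrappers can be listed and killed with standard Windows tooling - a rough sketch only, so verify the matched PIDs before terminating anything and leave the one legitimate CVU run alone:

E:\>REM list leftover cluvfy wrapper processes with their PIDs
E:\>wmic process where "CommandLine like '%cluvfy%comp health%'" get ProcessId,CommandLine

E:\>REM forcefully terminate a specific PID once confirmed as a leftover
E:\>taskkill /PID <pid> /F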



I killed off all but one process in the hope that it would free memory and allow the cluster to correct itself. This did not work. I then tried to restart ora.cvu using

srvctl stop cvu
srvctl start cvu

The statuses did not change. I then tried to stop one of the SCAN listeners and start it up again

srvctl stop SCAN_LISTENER -i 3
srvctl start SCAN_LISTENER -i 3

The stop command became stuck and eventually timed out.
I cancelled the shutdown and tried to start the listener again; this too timed out.
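
To confirm where the resource stood after each attempt, its state can be queried individually instead of dumping the whole table (same crsctl utility as earlier; at this point it would presumably still have shown INTERMEDIATE with CHECK TIMED OUT):

E:\>crsctl stat res ora.LISTENER_SCAN3.lsnr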

I coordinated with my managers and explained the scenario, and it was decided that the first node would be restarted. Not great, but at least the shutdown was controlled by us and not left to the whims of the clusterware. The node was restarted and the issue went away - at least until the bug consumes the memory again over time.

The CVU is a health check conducted by the cluster. By default, the process runs every 6 hours, meaning it can spawn and leave behind 4 sessions every 24 hours. It runs on only one cluster node, and the bug impacts only the hosting node.

The issue is said to be fixed in 11.2.0.4. The only hint of a workaround I have found from Oracle is to disable CVU - something I am considering, as CVU was only introduced in 11.2.0.2. This admission can be found in the MOS note: Clusterware resource ora.cvu FAQ (Doc ID 1524235.1) - Known issues with ora.cvu.
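
If we do end up going the disable route, my understanding is that the resource can be stopped and disabled with srvctl, roughly as follows (a sketch; exact behaviour to be confirmed on the test cluster first):

E:\>srvctl stop cvu
E:\>srvctl disable cvu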

I am testing this on our test cluster and will research further to find out what the implications are. Until then, I will continue killing the leftover sessions to see if this curtails the failure; in 30 days I will certainly know whether or not killing the leftover cmd sessions prevents the issue.

One last possibility to buy time is to extend the CVU check interval from 6 hours to 12 or 24 hours - this would stave off a crash for anywhere between 2 and 4 months. This can be done using the following commands:
E:\>srvctl config cvu
CVU is configured to run once every 360 minutes

E:\>srvctl modify cvu -t 720
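
Re-running the config command afterwards should confirm the new interval (output format as above; I am assuming the change is picked up immediately):

E:\>srvctl config cvu
CVU is configured to run once every 720 minutes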


UPDATE:
I've been periodically killing leftover cluvfy sessions and the cluster is holding thus far; it has been over 30 days since the issue occurred. The CVU service also jumped nodes, so I monitored the wrong node for a few days until the penny dropped.
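
Lesson learned: check which node is currently hosting the resource before monitoring it, for example (a sketch; the exact output wording may differ between versions):

E:\>srvctl status cvu
CVU is enabled and is running on node <node2>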

FINAL UPDATE: I have been able to write my own fix for the problem. It can be found here
