Showing posts with label Listener. Show all posts
Showing posts with label Listener. Show all posts

Friday, 16 September 2016

Oracle RAC - home rolling an ORA.CVU patch

I have managed to stop the cluster from wasting away memory chunks owing to BUG [Note 1523366.1]. Oracle's solution is to opatch. However, the client cannot do that at this point owing to the country wide dependency on the system. In addition to this, to opatch the system is risky considering the fragile nature of RAC in Windows. I am not 100% confident that an opatch will be successful and I would sooner create a new cluster  than to apply a patch to and Oracle RAC system on Windows.

For the original discovery of the bug - see here
For the fragility of the cluster - see here


A recap: the command that is left open every 6 hours during the CVU health check is:

C:\Windows\system32\cmd.exe /K E:\OracleGrid\11.2.0.3\bin\cluvfy comp health -_format

A screenshot can be found here when the orphaned session grows in numbers:

For the memory leak fix, I did the following.

I backup a copy of the bat file: E:\OracleGrid\11.2.0.3\bin\cluvfy.bat 

I then open the original cluvfy,bat in notepad and carefully make the following changes

CHANGE 1
FROM:
if not (%CRSHOME%)==() (
  @set "CV_HOME=%CRSHOME%"
)

set CMDPATH=%~dp0

TO:
if not (%CRSHOME%)==() (
  @set "CV_HOME=%CRSHOME%"
)

set EXIT_OPTION=

if "%CVU_RESOURCE_OPTIONS%"=="" set EXIT_OPTION=/B


set CMDPATH=%~dp0

CHANGE 2
FROM:
exit /B %errorlevel%
goto done

:ERROR
exit /B 1

TO:
exit %EXIT_OPTION% %errorlevel%
goto done

:ERROR
exit %EXIT_OPTION% 1

Once the change is made, save and copy the file across to node 2. I can also move the CVU to avoid creating any unwanted issues using while carrying out the above changes

srvctl relocate cvu -n <node name>

However, considering the batch file is executed every 6 hours, the chances are slim.

The idea to carry this out came after looking at the source code of the 11.2.0.4 home - I noted a bug fix by a certain Oracle developer @ Oracle. I diff'd the old and new CVU, evaluated the developers intention, and took only what I needed from his fix to manually patch the CVU myself. I first recreated the problem in the 11.2.0.4 test environment, monitored it for a day, patched it with my fix, and monitored for a few days. Once done satisfied, I took the 'patch' live. This fix will allow the in house DBA and his manager to have a cluster that no longer crashes once a month - forever. The nightmare is over and this solution will suffice until we create a new cluster for 12c. Done fixed it.

Monday, 4 July 2016

ORA.CVU Memory Leak - Windows Bug


The production cluster at the company I contract too had been running smoothly for ~30 days. Last week, during my morning check, I discovered the the listener and second and third scan resources on our cluster were in a state I had not seen:

INTERMEDIATE - CHECK TIMED OUT


crsctl stat res -t returned the following:

CLUSTER HEALTH REPORT  
---------------------  
.                      
crsctl stat res

-t     
------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
------------------------------------------------
Local Resources
------------------------------------------------
................
ora.LISTENER.lsnr
               ONLINE  INTERMEDIATE <node1>    CHECK TIMED OUT     
               ONLINE  ONLINE       <node2>                              
................
------------------------------------------------
Cluster Resources
------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       <node2>                              
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  INTERMEDIATE <node1>   CHECK TIMED OUT     
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  INTERMEDIATE <node1>   CHECK TIMED OUT     
ora.cvu
      1        ONLINE  ONLINE       <node1>                              
......................

According to the log, the issue began at 01:47 that morning and had been consistently in this state.

The central log did not reveal the underlying cause of the issue
E:\OracleGrid\11.2.0.3\log\<node1>\alert<node1>.log

[E:\OracleGrid\11.2.0.3\bin\oraagent.exe(4348)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {1:60709:2} in E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log.

[E:\OracleGrid\11.2.0.3\bin\oraagent.exe(5148)]CRS-5014:Agent "E:\OracleGrid\11.2.0.3\bin\oraagent.exe" timed out starting process "E:\OracleGrid\11.2.0.3\bin\lsnrctl.exe" for action "check": details at "(:CLSN00009:)" in "E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log"

[E:\OracleGrid\11.2.0.3\bin\oraagent.exe(5128)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN3.lsnr'. Details at (:CRSAGF00113:) {1:60709:48526} in E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log.

Looking at the E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log log file, I find the following:
2016-07-01 11:08:36.517: [   AGENT][5124] {1:60709:2} UserErrorException: Locale is ..
2016-07-01 11:08:36.517: [ora.LISTENER.lsnr][5124] {1:60709:2} [check] clsnUtils::error Exception type=2 string=
CRS-5017: The resource action "ora.LISTENER.lsnr check" encountered the following error:
(:CLSN00009:)Utils:execCmd aborted. For details refer to "(:CLSN00109:)" in "E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log".
..
2016-07-01 11:08:38.421: [ora.LISTENER_SCAN2.lsnr][4348] {1:60709:2} [check] got lock..
2016-07-01 11:08:38.421: [ora.LISTENER_SCAN2.lsnr][4348] {1:60709:2} [check] tryActionLock }.
2016-07-01 11:08:38.421: [ora.LISTENER_SCAN2.lsnr][4348] {1:60709:2} [check] abort  }.
2016-07-01 11:08:38.421: [ora.LISTENER_SCAN2.lsnr][4348] {1:60709:2} [check] (:CLSN00110:) clsn_agent::abort }..
2016-07-01 11:08:38.421: [    AGFW][4348] {1:60709:2} Command: check for resource: ora.LISTENER_SCAN2.lsnr 1 1 completed with status: TIMEDOUT..
...

This is message appears many times and is the the same response for SCAN2 and SCAN3 checks.

However, the users were not complaining, the sessions were distributed evenly across all cluster nodes and operations could be considered "smooth". According to v$sessions and the listener logs, new connections were being made across both listeners and scans. I began checking MOS for notes on this issue. I discovered that there were a few different issues related to this problem so I began cross referencing the log files and the items found on MOS. I found a note making reference to a memory leak - so I grepped through all listener logs and I discovered a match.

Early that morning, the memory limit had been reached on the heap and the first sign of trouble was revealed in the listener alert file:
E:\OracleGrid\11.2.0.3\log\diag\tnslsnr\<node1>\listener\alert.xml
<msg time='2016-07-01T01:47:16.033+02:00' org_id='oracle' comp_id='tnslsnr'
 type='UNKNOWN' level='16' host_id='<node1>'
 host_addr='::1'>
 <txt>TNS-12531: TNS:cannot allocate memory
 </txt>
</msg>

Just as the note had depicted, there was another sign to look for - memory exceptions in the cvutrace.log file
E:\OracleGrid\11.2.0.3\log\<node1>\cvu\cvutrc\cvutrace.log file.0

[6880@<node1>] [main] [ 2016-07-01 05:27:02.783 CAT ] [VerificationUtil.getNodelist:685]  ==== Nodelist after: nodelist[0]:<node1>
Exception in thread "main" java.lang.NullPointerException
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:989)
at java.lang.Double.valueOf(Double.java:447)
at java.lang.Double.<init>(Double.java:539)
at oracle.ops.verification.framework.engine.component.FreeSpaceComponent.initializeReqdValues(FreeSpaceComponent.java:240)
at oracle.ops.verification.framework.engine.component.FreeSpaceComponent.init(FreeSpaceComponent.java:98)
at oracle.ops.verification.framework.engine.component.FreeSpaceComponent.<init>(FreeSpaceComponent.java:85)
at oracle.ops.verification.framework.engine.component.ComponentFactory.createComponent(ComponentFactory.java:164)
at oracle.ops.verification.framework.engine.component.HealthComponent.verifyComponents(HealthComponent.java:211)
at oracle.ops.verification.framework.engine.component.HealthComponent.verify(HealthComponent.java:145)
at oracle.ops.verification.framework.engine.ClusterVerifier.verifyComponent(ClusterVerifier.java:117)
at oracle.ops.verification.client.CluvfyDriver.main(CluvfyDriver.java:276)

The note I am referring to is note: Windows: Ora.LISTENER.lsnr Status goes to INTERMEDIATE CHECK TIMED OUT (Doc ID 1523366.1)

The bug in a nutshell: ora.cvu runs cluvfy comp health -_format but fails to close its session thereby consuming heap memory until the listener and scan resources on the node, where CVU is hosted ,begins failing.

I opened process explorer and found 125 instances of the open cluvfy commands:

C:\Windows\system32\cmd.exe  /K E:\OracleGrid\11.2.0.3\bin\cluvfy  comp health -_format 



I killed all off all but one process in hopes that it would free memory and allow the cluster to correct itself. This did not work. I then tried to restart ora.cvu using

srvctl stop cvu
srvctl start cvu

The statues did not change. I tried to stop one of the scans and start it up again

srvctl stop SCAN_LISTENER -i 3
srvctl start SCAN_LISTENER -i 3

The process became stuck during the shutdown process as it timed out.
I cancelled the shutdown and tried to start it up - this too timed out.

I coordinated with my managers, explained the scenario and it was decided that the first node would be restarted. Not great, but at least the shutdown was controlled by us and not left to the whims of the clusterware. The node was restarted and the issue went away - at least for now until the bug consumes it all again over time.

The CVU is a health check conducted by the cluster. By default, the process runs every 6 hours meaning it can spawn and leave 4 sessions every 24 hours. It runs on only one cluster node and the bug impacts only the hosting node.

The issue is said to be fixed in 11.2.0.4. The only hint of a workaround I have found by Oracle is to disable CVU - this is something that I am considering as CVU was only introduced recently in 11.2.0.2. This admission can be found in the MOS note: Clusterware resource ora.cvu FAQ (Doc ID 1524235.1) - Known issues with ora.cvu

I am testing this in our test cluster and will research further to find out what the implications are. Until then, I will continue killing left over sessions to see if this curtails the failure. I will certainly know whether this helps or not in 30 days whether or not killing the left over cmd sessions prevents the issue.

One last possibility to buy time is to extend the CVU check from its 6 hour check to 12 to 24 hours - this would stave of a crash from anywhere between 2 to 4 months. This can be done using the following command:
E:\>srvctl config cvu
CVU is configured to run once every 360 minutes

E:\>srvctl modify cvu -t 720


UPDATE:
I've been periodically killing left over clufvy sessions and the cluster is holding thus far, it has been over 30 days since the issue occurred. The CVU service also jumped nodes, so I monitored the wrong node for a few days until the penny dropped.

FINAL UPDATE: I have been able to write my own fix for the problem. It can be found here

Friday, 17 June 2016

Setting up a standalone APEX Listener in Glassfish

I setup Glass fish on a development box a few weeks ago. My understanding of how Glassfish works and where it fits into the enterprise architecture picture is limited but I believe I have an idea of what it is: Glassfish is a free application server. It looks and feels, at least to me, like a lightweight Oracle 11g Middleware. The reason why Glassfish was needed is that the development team wanted to setup an APEX listener and that listener needed to be accessible to the outside world and hosted outside of the database. The database that is at the centre of this setup has APEX 4.2 installed in it.

My notes for setting up APEX are below.

1. Download Glassfish and the APEX Standalone Listener:

Glassfish server can be found here:
https://glassfish.java.net/download.html

Listener can be found here:
http://www.oracle.com/technetwork/developer-tools/rest-data-services/downloads/apex-listener-download-v205-2135981.html

*NOTE: The download locations can change from time to time, so for this post I used Apex Listener 2.0.5
I later learned that Oracle rebranded the listener to Oracle Restful Services. This note is focused on the old Apex Listener.

2. Install
Unzip Glassfish in a suitable location. I use E:\oracle\glassfish4. Glassfish works out the box so there is no installation required.

3. Start the default domain

E:\oracle\glassfish4\bin>asadmin start-domain
Waiting for domain1 to start ..................
Successfully started the domain : domain1
domain  Location: E:\oracle\glassfish4\glassfish\domains\domain1
Log File: E:\oracle\glassfish4\glassfish\domains\domain1\logs\server.log
Admin Port: 4848
Command start-domain executed successfully.

4. Set the admin password 
For now, I use "adminadmin"

E:\oracle\glassfish4\bin>asadmin --host localhost --port 4848 change-admin-password
Enter admin user name [default: admin]>admin
Enter the admin password> <TYPE ADMINADMIN>
Enter the new admin password> <TYPE ADMINADMIN>
Enter the new admin password again> <TYPE ADMINADMIN>
Command change-admin-password executed successfully.

5. Enable secure admin
E:\oracle\glassfish4\bin>asadmin --host localhost --port 4848 enable-secure-admin
Enter admin user name>  admin
Enter admin password for user "admin"> <TYPE ADMINADMIN>
You must restart all running servers for the change in secure admin to take effect.
Command enable-secure-admin executed successfully.

6. Setup the APEX.WAR file for glassfish.
For Glassfish to present an Apex Listener, all it needs is the APEX.WAR file. I navigate to the location of the WAR file located inside the unzipped listener folder.
I then configure the file and save it. I configure the listener to use directory E:\oracle\glassfish4\glassfish\domains\domain1\apex_config to contain its configuration information.
So, make the directory E:\oracle\glassfish4\glassfish\domains\domain1\apex_config before configuring the WAR file.

G:\apex_listener_2_0_5>java -jar apex.war configdir E:\oracle\glassfish4\glassfish\domains\domain1\apex_config
Apr 21, 2016 3:30:10 PM oracle.dbtools.common.config.cmds.ConfigDir execute
INFO: Set config.dir to E:\oracle\glassfish4\glassfish\domains\domain1\apex_config in: G:\apex_listener_2_0_5\apex.war

7. Configure the listener to point to one database
G:\apex_listener_2_0_5>java -jar apex.war setup
Apr 21, 2016 3:30:43 PM oracle.dbtools.common.config.file.ConfigurationFolder logConfigFolder
INFO: Using configuration folder: E:\oracle\glassfish4\glassfish\domains\domain1\apex_config\apex
Enter the name of the database server [localhost]:<GLASSFISH SERVER>
Enter the database listen port [1521]:1522
Enter 1 to specify the database service name, or 2 to specify the database SID [1]:2
Enter the database SID [xe]:<DATABASE SID>
Enter the database user name [APEX_PUBLIC_USER]:APEX_PUBLIC_USER
Enter the database password for APEX_PUBLIC_USER:
Confirm password:
Enter 1 to enter passwords for the RESTful Services database users (APEX_LISTENER,APEX_REST_PUBLIC_USER), 2 to use the same password as used for APEX_PUBLIC_USE
R or, 3 to skip this step [1]:1
Enter the database password for APEX_LISTENER:
Confirm password:
Enter the database password for APEX_REST_PUBLIC_USER:
Confirm password:
Apr 21, 2016 3:32:09 PM oracle.dbtools.common.config.file.ConfigurationFiles update
INFO: Updated configurations: defaults, apex, apex_al, apex_rt

8. Make the apex images directory available to the APEX.WAR file
These images are the ones found within the APEX database installation package.
On our development server, our /i alias is set to e:\apex422_images. So, I setup the war to point to the same location.

G:\apex_listener_2_0_5>java -jar apex.war static e:\apex422_images
WAR Generation complete
WAR location     : G:\apex_listener_2_0_5\i.war
Context path     : /i
Static resources : e:\apex422_images
Ensure the static resources are available at path: e:\apex422_images
on the server where the WAR is deployed

9. Upload the JAR file into glass fish.
Navigate to https://<GLASSFISH HOSTING SERVER>:4848





Enter the credentials selected earlier.
Once logged in, click the "Applications" link on the left. Once it opens in the right panel, click the "Deploy" button. You will presented with the screen below.




















Click the "Choose File" button and locate the WAR. Once selected, click "OK"

This will load the JAR file into the Glassfish framework.

9. Stop and start the domain
E:\oracle\glassfish4\bin>asadmin stop-domain
Waiting for the domain to stop .
Command stop-domain executed successfully.

E:\oracle\glassfish4\bin>asadmin start-domain
Waiting for domain1 to start ...........................................
Successfully started the domain : domain1
domain  Location: E:\oracle\glassfish4\glassfish\domains\domain1
Log File: E:\oracle\glassfish4\glassfish\domains\domain1\logs\server.log
Admin Port: 4848
Command start-domain executed successfully.

10. Test glass fish and the new glasshfish administration page.

It seems that Glassfish piggy backs off port 8080 - the other webserver port when 80 is taken.
The glassfish administration console presents itself on port 4848.
To logon to the Glassfish administration console, use the credentials supplied in the APEX setup.

Test Apex

http://<GLASSFISH HOSTING SERVER>:8080/apex/apex_admin