Friday 16 September 2016

Oracle RAC - home-rolling an ORA.CVU patch

I have managed to stop the cluster from leaking memory because of the bug described in [Note 1523366.1]. Oracle's solution is to apply a patch with OPatch. However, the client cannot do that at this point because of the countrywide dependency on the system. Patching with OPatch is also risky given how fragile RAC is on Windows. I am not 100% confident that OPatch would succeed, and I would sooner build a new cluster than apply a patch to an Oracle RAC system on Windows.

For the original discovery of the bug - see here
For the fragility of the cluster - see here


To recap: every 6 hours the CVU health check runs the following command, and the cmd.exe session it opens is never closed:

C:\Windows\system32\cmd.exe /K E:\OracleGrid\11.2.0.3\bin\cluvfy comp health -_format

A screenshot of the orphaned sessions growing in number can be found here:
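If you would rather track the build-up than eyeball Task Manager, a Python sketch like the one below counts the running cmd.exe processes whose command line contains the health-check command. It assumes Python 3.7+ is available on the node and that the `wmic` utility is present. `wmic` ships with Windows Server releases of this era but is deprecated on newer ones. The marker string and helper names are my own, not part of Oracle's tooling. The count also includes a health check that is legitimately still running, so watch the trend rather than any single number.

```python
import subprocess
import sys

# Substring identifying the CVU health-check command shown above (my own choice of marker).
HEALTH_CHECK_MARKER = "cluvfy comp health"


def count_health_check_sessions(command_lines):
    """Count command lines belonging to the CVU health check (case-insensitive)."""
    marker = HEALTH_CHECK_MARKER.lower()
    return sum(1 for line in command_lines if marker in line.lower())


def list_cmd_command_lines():
    """Return the command lines of all running cmd.exe processes (Windows only, assumes wmic)."""
    output = subprocess.run(
        ["wmic", "process", "where", "name='cmd.exe'", "get", "CommandLine"],
        capture_output=True, text=True, check=True,
    ).stdout
    # The first line is the "CommandLine" header; skip it and any blank lines.
    return [line.strip() for line in output.splitlines()[1:] if line.strip()]


if __name__ == "__main__" and sys.platform == "win32":
    print("CVU health-check cmd.exe sessions:",
          count_health_check_sessions(list_cmd_command_lines()))
```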

For the memory leak fix, I did the following.

First, I back up the batch file: E:\OracleGrid\11.2.0.3\bin\cluvfy.bat
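If you prefer to script the backup, a minimal Python sketch follows. The timestamped `.bak` naming and the `backup_script` helper are hypothetical conventions of mine, not anything Oracle ships. `shutil.copy2` also preserves the file's timestamps, which makes the copies easier to tell apart later.

```python
import shutil
import time
from pathlib import Path


def backup_script(path):
    """Copy a script to a timestamped .bak file beside it and return the backup path.

    Refuses to overwrite an existing backup, so an earlier copy is never lost.
    """
    source = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    if backup.exists():
        raise FileExistsError(str(backup))
    shutil.copy2(source, backup)  # copy2 keeps the original modification time
    return backup
```

For example, `backup_script(r"E:\OracleGrid\11.2.0.3\bin\cluvfy.bat")` would leave a file such as `cluvfy.bat.20160916-101500.bak` in the same `bin` directory.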

I then open the original cluvfy.bat in Notepad and carefully make the following changes:

CHANGE 1
FROM:
if not (%CRSHOME%)==() (
  @set "CV_HOME=%CRSHOME%"
)

set CMDPATH=%~dp0

TO:
if not (%CRSHOME%)==() (
  @set "CV_HOME=%CRSHOME%"
)

set EXIT_OPTION=

if "%CVU_RESOURCE_OPTIONS%"=="" set EXIT_OPTION=/B


set CMDPATH=%~dp0

CHANGE 2
FROM:
exit /B %errorlevel%
goto done

:ERROR
exit /B 1

TO:
exit %EXIT_OPTION% %errorlevel%
goto done

:ERROR
exit %EXIT_OPTION% 1
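Why this works: `cmd.exe /K` keeps the command interpreter running after the command finishes, and `exit /B` ends only the batch script. As a result, every health check left a cmd.exe behind. A plain `exit` terminates cmd.exe itself.

The patched script drops the `/B` only when CVU_RESOURCE_OPTIONS is non-empty. Judging from the shape of the fix, I am assuming that the CVU resource sets this variable when it launches cluvfy. An interactive run leaves it unset and keeps `/B`, so your own console window is not closed.

The Python sketch below models that decision as a small truth table. The function names and the sample value "x" are hypothetical.

```python
def exit_option(env):
    """Mirror the patched cluvfy.bat: "/B" when CVU_RESOURCE_OPTIONS is unset or empty, else ""."""
    return "/B" if env.get("CVU_RESOURCE_OPTIONS", "") == "" else ""


def exit_command(env, errorlevel):
    """Build the exit line the patched script would effectively run."""
    option = exit_option(env)
    return f"exit {option} {errorlevel}" if option else f"exit {errorlevel}"


def leaves_cmd_open(env):
    """True if the hosting cmd.exe /K would survive (exit /B ends only the script)."""
    return exit_option(env) == "/B"


# Interactive run (variable unset): behaves like the original script.
print(exit_command({}, 0))
# Run with CVU_RESOURCE_OPTIONS set (assumed to be the CVU resource case): cmd.exe is closed.
print(exit_command({"CVU_RESOURCE_OPTIONS": "x"}, 0))
```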

Once the changes are made, save the file and copy it across to node 2. To avoid any unwanted issues while making the changes, I can also first relocate the CVU resource away from the node I am editing:

srvctl relocate cvu -n <node name>

However, since the batch file runs only once every 6 hours, the chance of it running mid-edit is slim.

The idea came from looking at the source code of the 11.2.0.4 home, where I noticed a bug fix by an Oracle developer. I diff'd the old and new CVU scripts, worked out the developer's intention, and took only what I needed from his fix to patch the CVU manually. I first recreated the problem in the 11.2.0.4 test environment and monitored it for a day. I then applied my fix and monitored it for a few more days. Once satisfied, I took the 'patch' live. This fix gives the in-house DBA and his manager a cluster that no longer crashes once a month. The nightmare is over, and this solution will suffice until we build a new cluster for 12c. Done fixed it.
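For the diff step, Python's standard `difflib` module is enough to compare two copies of cluvfy.bat. The helper names below are hypothetical. The second path is wherever your 11.2.0.4 copy lives, which I have not assumed here.

```python
import difflib
from pathlib import Path


def diff_scripts(old_text, new_text, old_name="old/cluvfy.bat", new_name="new/cluvfy.bat"):
    """Return a unified diff of two script versions as one string (empty if identical)."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    )
    return "".join(diff)


def diff_files(old_path, new_path):
    """Diff two script files on disk; undecodable bytes are replaced rather than raising."""
    old_text = Path(old_path).read_text(errors="replace")
    new_text = Path(new_path).read_text(errors="replace")
    return diff_scripts(old_text, new_text, str(old_path), str(new_path))
```

Running `print(diff_files(r"E:\OracleGrid\11.2.0.3\bin\cluvfy.bat", other_copy))` prints `-`/`+` lines for each change. That makes it easy to pick out just the exit-handling lines and leave the rest of the newer script alone.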
