Monday, 25 July 2016

Authentication - Oracle Service Terminated Unexpectedly

This morning, a public facing portal database was not starting up after a restart. I could not login the database as sysdba to startup owing to

ORA-12560: TNS:protocol adapter error.

Looking at the services, the OracleService<SID> was down.
When starting the OracleService<SID>, I received an unexpected "Oracle service terminated unexpectedly" prompt.
When investigating the alert log, I see the database begins starting up but ends with

ORACLE_BASE from environment = E:\Oracle

The next step in the startup would be

alter database mount exclusive

- but it does not get this far.

I check the listener file at E:\Oracle\diag\tnslsnr\<HOST>\listener_<SID>\trace\listener_<SID>.log
and found
TNS-12518: TNS:listener could not hand off client connection
 TNS-12560: TNS:protocol adapter error
  TNS-00530: Protocol adapter error
   64-bit Windows Error: 203: Unknown error

I also checked the E:\Oracle\diag\clients\user_<USER>\host_2051522117_80\trace\sqlnet.log file for the SYSTEM user
 
***********************************************************************

Fatal NI connect error 12637, connecting to:
 (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=BEQ)(PROGRAM=oracle)(ARGV0=oracle<SID>)(ARGS='(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))')))(CONNECT_DATA=(SID=<SID>)(CID=(PROGRAM=E:\Oracle\product\11.2.0\dbhome_1\bin\oradim.exe)(HOST=<SERVER>)(USER=SYSTEM))))

  VERSION INFORMATION:
TNS for 64-bit Windows: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for 64-bit Windows: Version 11.2.0.3.0 - Production
  Time: 25-JUL-2016 09:18:42
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12637
    
TNS-12637: Packet receive failed
    ns secondary err code: 12532
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0

The issue is related to connectivity or authentication.
I checked ora_dba privileges and found that the group had the correct users within.
I checked the user starting the OracleService<SID> service and it was the Local System users.

I suspected domain authentication issues and so I changed the SQLNET.ORA parameter

SQLNET.AUTHENTICATION_SERVICES = (NTS)

to

SQLNET.AUTHENTICATION_SERVICES = (NONE)

This worked and the database was able to startup using the service.
However, I could no longer execute backups or use sysdba but at least the database was up for now.

After much digging and investigating, it turned out the issue was that one service was not started upon the recent restart of the server machine. That service is the workstation service. This service allows a machine on the Microsoft domain to access files on remote machines. I guesstimate that the remote call aims to authenticate the user starting the service but fails to receive a valid response if this service is down.

Upon starting the service up, the database was taken out of NONE authentication and put back into NTS. The backups are running successfully and I am able to connect again as SYSDBA.



Monday, 18 July 2016

Database in the Jungle

Lagoon of Cotonou by Mark Fischer 
I am being sent to West Africa to configure and possibly reinstall database and middleware servers. It will be an interesting experience and I will likely post images and anecdotes from the week I spend in the coastal city of Cotonou in Benin.

The trip came about for a number of reasons, but the definitive one, from my understanding, is that the engineers initially selected for the project do not want to go for sudden reasons. So, my colleague and I will be shipped off to drive the project and setup the foundations for a new information system. It doesn't help that we don't speak french but we'll deal with what ever comes our way.

In preparation, I've had to receive vaccinations for various bugs including yellow fever. I was even given a little yellow book to say that I am yellow fever proof. I've also been given 2 weeks supply of malaria tablets. Until we fly, I will be preparing my scripts, software installations, patches and certificates so that I am prepared and self reliant during my week long stay. We're not sure what to expect of internet accessibility but considering it is a large organisation, we have faith that basic internet connectivity will be available in the event we need access to the outside world.

SMTP Preferences in the database

At the end of the month, the server administrators are changing the IP address of our smtp server.
I was given a list of all servers registered in the server log over the last few weeks.
On these servers are EDI apps, task scheduler jobs, databases and application servers and it was my task to drill through each and everyone, finding all the instances of where the smtp server was utilized directly through IP. Most of the stuff was simple, but I did struggle initially to find the smtp preferences for APEX installations, scheduler job notifications and plain old job notifications.

There locations can be found using the following queries and blocks of pl/sql.

To find the SMTP relay for an APEX schema:
SELECT
APEX_040200.WWV_FLOW_PLATFORM.GET_PREFERENCE('SMTP_HOST_ADDRESS'),
APEX_040200.WWV_FLOW_PLATFORM.GET_PREFERENCE('SMTP_HOST_PORT'),
APEX_040200.WWV_FLOW_PLATFORM.GET_PREFERENCE('SMTP_USERNAME'),
APEX_040200.WWV_FLOW_PLATFORM.GET_PREFERENCE('SMTP_PASSWORD'),
APEX_040200.WWV_FLOW_PLATFORM.GET_PREFERENCE('SMTP_TLS_MODE')
FROM dual

To change the SMTP relay for APEX:
EXEC APEX_UTIL.SET_PREFERENCE(p_preference=>'SMTP_HOST_ADDRESS', p_value =>'smtprelay.<smtpserver>.net');

To see the settings of the SMTP relay for the scheduler jobs:
SET SERVEROUTPUT ON

DECLARE
  V_VALUE VARCHAR(100);
BEGIN
  DBMS_SCHEDULER.GET_SCHEDULER_ATTRIBUTE('email_server',V_VALUE);
  DBMS_OUTPUT.PUT_LINE(V_VALUE);
END;
/

To see the settings for the old dba jobs system:
SELECT attribute_name, 
       value 
  FROM dba_scheduler_global_attribute 
 WHERE attribute_name like '%EMAIL%';

ATTRIBUTE_NAME                 VALUE
------------------------------ -------------------------------------
EMAIL_SERVER                   smtprelay.<smtpserver>.net:25
EMAIL_SERVER_ENCRYPTION        NONE
EMAIL_SERVER_CREDENTIAL
EMAIL_SENDER

Saturday, 16 July 2016

KZN Umdloti Seascape

These were taken in a small KZN town north of Durban city. It is winter here and photogenic sky and cloud opportunities are further and fewer between than they would be in summer. The sun had to rise about ~10 degrees off the horizon before surfacing a top the cold front out at sea. This is roughly 60 minutes from the peeking over the horizon to being seen and by that time the red light is mostly gone.





Thursday, 14 July 2016

Yay me

I've been studying part-time for most of my twenties. I have finally finished my undergraduate science degree in Informatics with a distinction through UNISA. Informatics is a broad subject that covers business information systems, what they're comprised of, how to program them, and who interacts with them. Throw in accounting, economics, auditing, management and statistics and you have a bachelor of science in informatics - at least in the South African interpretation of the subject. I will graduate in October.

However, this does not mean I began working with information systems in the last few years - I began quite some time ago. I got going early by programming in Borland 3.0, studying relational database design and programming RDBMS systems in my teenage years using Oracle. I completed a 1 year diploma in comp sci and began working as a systems administrator, web and database developer shortly after my 18th birthday. I have worked in IT for the last 14 years give or take and formally became a dba at 27.

I'll now focus on rebuilding my well-being and repay my sleep debt. Next year I'll start my honours degree, and aim to do masters in the end.

Wednesday, 13 July 2016

V$SQL_PLAN Formatted Output

A coworker was tasked to make adjustments to a query. He increased the scope of the query by adding 2 new tables and a few new parameters.
The new query was incredibly slow and he needed assistance in finding out why.

In the process of investigating the issue, I reworked a script to create a query plan display similar to DBMS_XPLAN.DISPLAY.
This is handy if one is working backwards in pulling the thread from the struggling session to the query plan.

Usually, if one has the query text on hand, the explain plan can be drawn by using the following form:

set pages 1000
set lines 1000

EXPLAIN PLAN FOR
  <YOUR QUERY HERE>;
  
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

EXAMPLE: EXPLAIN PLAN FOR
   select sysdate from dual;
   
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);


OUTPUT:
SQL>
EXPLAIN PLAN FOR
   select sysdate from dual;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);SQL>   2
Explained.

SQL> SQL>

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------
Plan hash value: 1388734953

-----------------------------------------------------------------
| Id  | Operation        | Name | Rows  | Cost (%CPU)| Time     |
-----------------------------------------------------------------
|   0 | SELECT STATEMENT |      |     1 |     2   (0)| 00:00:01 |
|   1 |  FAST DUAL       |      |     1 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------

8 rows selected.

SQL>

But, if I am working the other way around where all I have is an unhappy user with a poor performing session, the following query will allow me to see the query plan responsible for their anguish:

SET LINES 1000
SET PAGES 1000
COLUMN I FORMAT A3


  SELECT /*+ NO_MERGE */ 
         ROWNUM-1||
DECODE(access_predicates,NULL,DECODE(filter_predicates,NULL,'','*'),'*') "I",
         SUBSTR(LPAD(' ',(DEPTH-1))||
OPERATION,1,40)||
DECODE(OPTIONS,NULL,'',' (' 
|| OPTIONS 
|| ')') "Operation",
         SUBSTR(OBJECT_NAME,1,30) "Object Name",
         cardinality "# Rows",
         bytes,
         cost,
         time
    FROM (
         SELECT * 
           FROM gv$sql_plan 
          WHERE sql_id = (
                         SELECT sql_id 
                           FROM gv$session 
                          WHERE sid = <SID_HERE>
                         )
         ) plan
ORDER BY id
/


The above query was found and taken from an ASK TOM response and adjusted here and there for my own purposes (cluster, sid, formatting, removal of filter)


I will need to have identified the users session on production in order to find their SID. Once found, I plug the SID into the <SID_HERE> tag above. I was initially looking for a way to format the output of v$sql_plan and found a number of decent solutions.

The developer had included 2 new tables into their query and had forgotten to join them. The result was a cartesian on a large dataset. Below is a comparison of the 2 outputs of the explain plan of their poor performing query - one is from DBMS_XPLAN.DISPLAY, the other is from the above query.
Once connecting the table, the result set was reduced to a few records as intended.

DBMS_XPLAN.DISPLAY

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name              |Rows  |Bytes |Cost (%CPU)|Time    |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                   |20367 | 1809K|32160   (1)|00:06:26|
|   1 |  NESTED LOOPS                       |                   |    1 |   46 |    4   (0)|00:00:01|
|   2 |   NESTED LOOPS                      |                   |    1 |   34 |    3   (0)|00:00:01|
|   3 |    TABLE ACCESS BY INDEX ROWID      | TABLE_AAAAAAA     |    1 |   18 |    2   (0)|00:00:01|
|*  4 |     INDEX UNIQUE SCAN               | XXX_XXX_PK        |    1 |      |    1   (0)|00:00:01|
|   5 |    TABLE ACCESS BY INDEX ROWID      | TABLE_BBB         |    1 |   16 |    1   (0)|00:00:01|
|*  6 |     INDEX UNIQUE SCAN               | TABLE_BBB_PK      |    1 |      |    0   (0)|00:00:01|
|   7 |   TABLE ACCESS BY INDEX ROWID       | TABLE_CCC         |    1 |   12 |    1   (0)|00:00:01|
|*  8 |    INDEX UNIQUE SCAN                | TABLE_CCC_PK      |    1 |      |    0   (0)|00:00:01|
|   9 |  SORT ORDER BY                      |                   |20367 | 1809K|32160   (1)|00:06:26|
|  10 |   CONCATENATION                     |                   |      |      |           |        |
|* 11 |    FILTER                           |                   |      |      |           |        |
|* 12 |     FILTER                          |                   |      |      |           |        |
|  13 |      MERGE JOIN CARTESIAN           |                   |  403K|   35M|17206   (1)|00:03:27|
|  14 |       MERGE JOIN CARTESIAN          |                   |    1 |   91 | 3536   (1)|00:00:43|
|  15 |        NESTED LOOPS                 |                   |      |      |           |        |
|  16 |         NESTED LOOPS                |                   |    1 |   87 | 2260   (1)|00:00:28|
|* 17 |          TABLE ACCESS BY INDEX ROWID| TABLE             |    1 |   57 | 2257   (1)|00:00:28|
|* 18 |           INDEX RANGE SCAN          | TABLE_XX_IX       |    1 |      | 2256   (1)|00:00:28|
|* 19 |          INDEX RANGE SCAN           | XX_IX_1           |    1 |      |    2   (0)|00:00:01|
|* 20 |         TABLE ACCESS BY INDEX ROWID | TABLE_CCCCCCCCCCCC|    1 |   30 |    3   (0)|00:00:01|
|  21 |        BUFFER SORT                  |                   |  196 |  784 | 3533   (1)|00:00:43|
|* 22 |         INDEX FULL SCAN             | XX_XX_IX          |  196 |  784 | 1276   (2)|00:00:16|
|  23 |       BUFFER SORT                   |                   | 4545K|      |15931   (1)|00:03:12|
|  24 |        TABLE ACCESS FULL            | TABLE_DDDDD       | 4545K|      |13670   (1)|00:02:45|
|  25 |     TABLE ACCESS BY INDEX ROWID     | TABLE_EEEEEEEEE   |    1 |   14 |    4   (0)|00:00:01|
|* 26 |      INDEX SKIP SCAN                | XX_PK             |    1 |      |    3   (0)|00:00:01|
|* 27 |    FILTER                           |                   |      |      |           |        |
|* 28 |     FILTER                          |                   |      |      |           |        |
|  29 |      MERGE JOIN CARTESIAN           |                   |    1 |   91 |14953   (1)|00:03:00|
|  30 |       MERGE JOIN CARTESIAN          |                   |    1 |   91 | 1283   (2)|00:00:16|
|  31 |        NESTED LOOPS                 |                   |      |      |           |        |
|  32 |         NESTED LOOPS                |                   |    1 |   87 |    7   (0)|00:00:01|
|* 33 |          TABLE ACCESS BY INDEX ROWID| TABLE             |    1 |   57 |    4   (0)|00:00:01|
|* 34 |           INDEX RANGE SCAN          | TABLE_XX_IX       |    1 |      |    3   (0)|00:00:01|
|* 35 |          INDEX RANGE SCAN           | XX_IX_1           |    1 |      |    2   (0)|00:00:01|
|* 36 |         TABLE ACCESS BY INDEX ROWID | TABLE_CCCCCCCCCCCC|    1 |   30 |    3   (0)|00:00:01|
|  37 |        BUFFER SORT                  |                   |  196 |  784 | 1280   (2)|00:00:16|
|* 38 |         INDEX FULL SCAN             | XX_XX_IX          |  196 |  784 | 1276   (2)|00:00:16|
|  39 |       BUFFER SORT                   |                   | 4545K|      |13677   (1)|00:02:45|
|  40 |        TABLE ACCESS FULL            | TABLE_DDDDD       | 4545K|      |13670   (1)|00:02:45|
|  41 |     TABLE ACCESS BY INDEX ROWID     | TABLE_EEEEEEEEE   |    1 |   14 |    4   (0)|00:00:01|
|* 42 |      INDEX SKIP SCAN                | XX_PK             |    1 |      |    3   (0)|00:00:01|
----------------------------------------------------------------------------------------------------

QUERY : V$SQL_PLAN
I   Operation                                Object Name            # Rows    BYTES     COST     TIME
--- ---------------------------------------- ------------------  --------- -------- -------- --------
0   SELECT STATEMENT                                                                   32160
1   NESTED LOOPS                                                         1       46        4        1
2    NESTED LOOPS                                                        1       34        3        1
3     TABLE ACCESS (BY INDEX ROWID)          SUP_LOCATIONS               1       18        2        1
4*     INDEX (UNIQUE SCAN)                   SUP_AZL_PK                  1                 1        1
5     TABLE ACCESS (BY INDEX ROWID)          SUP_ZONES                   1       16        1        1
6*     INDEX (UNIQUE SCAN)                   SUP_ZONES_PK                1                 0
7    TABLE ACCESS (BY INDEX ROWID)           SUP_AREAS                   1       12        1        1
8*    INDEX (UNIQUE SCAN)                    SUP_AREAS_PK                1                 0
9   SORT (ORDER BY)                                                  20367  1853397    32160      386
10   CONCATENATION                                                                    
11*   FILTER                                                                          
12*    FILTER                                                                         
13      MERGE JOIN (CARTESIAN)                                      403484 36717044    17206      207
14       MERGE JOIN (CARTESIAN)                                          1       91     3536       43
15        NESTED LOOPS                                                                
16         NESTED LOOPS                                                  1       87     2260       28
17*         TABLE ACCESS (BY INDEX ROWID)    ITEMS                       1       57     2257       28
18*          INDEX (RANGE SCAN)              ITEMS_CT_N1                 1              2256       28
19*         INDEX (RANGE SCAN)               CT_IX_1                     1                 2        1
20*        TABLE ACCESS (BY INDEX ROWID)     CARGO_TRANSACTIONS          1       30        3        1
21        BUFFER (SORT)                                                196      784     3533       43
22*        INDEX (FULL SCAN)                 AH_VN_IX                  196      784     1276       16
23       BUFFER (SORT)                                             4545904             15931      192
24        TABLE ACCESS (FULL)                ADMIN_ITEMS           4545904             13670      165
25     TABLE ACCESS (BY INDEX ROWID)         SUP_COMMODITIES             1       14        4        1
26*     INDEX (SKIP SCAN)                    SC_PK                       1                 3        1
27*   FILTER                                                                          
28*    FILTER                                                                         
29      MERGE JOIN (CARTESIAN)                                           1       91    14953      180
30       MERGE JOIN (CARTESIAN)                                          1       91     1283       16
31        NESTED LOOPS                                                                
32         NESTED LOOPS                                                  1       87        7        1
33*         TABLE ACCESS (BY INDEX ROWID)    ITEMS                       1       57        4        1
34*          INDEX (RANGE SCAN)              ITEMS_CT_N1                 1                 3        1
35*         INDEX (RANGE SCAN)               CT_IX_1                     1                 2        1
36*        TABLE ACCESS (BY INDEX ROWID)     CARGO_TRANSACTIONS          1       30        3        1
37        BUFFER (SORT)                                                196      784     1280       16
38*        INDEX (FULL SCAN)                 AH_VN_IX                  196      784     1276       16
39       BUFFER (SORT)                                             4545904             13677      165
40        TABLE ACCESS (FULL)                ADMIN_ITEMS           4545904             13670      165
41     TABLE ACCESS (BY INDEX ROWID)         SUP_COMMODITIES             1       14        4        1
42*     INDEX (SKIP SCAN)                    SC_PK                       1                 3        1

43 rows selected.


Monday, 4 July 2016

UNISA MNG3701 and MNG3702 - The degree killer

If you are contemplating doing a bachelor of science (BSc) with a mix of business modules to complement your computer science degree at UNISA (University of South Africa), be cautious of the modules MNG3701 and MNG3702. These 2 honors styled business modules are the most difficult modules I and many other students have had to prepare for. They are difficult for the wrong reasons owing to the implementation and the kind of examination administered, the short comings of which are a topic all on their own. It is still a new module and the lecturers are ironing out the issues and I am sure the issues will be resolved in time.

However, in its current form, it is very very demanding. If you are aiming to finish a degree with a mix of business and computer science - considered sticking to pure computer science or engineering in an effort guard yourself from this "degree killer" as it has been termed. It is not unusual for students to fail this module, sometimes two or three times despite working very hard in their own capacity to pass. For business students, I imagine this kind of module is a business students thing and I wish the student all the best of luck considering the pure business curriculum would have prepared the student to write and think in the required way for passing the modules.

For more information - look at the UNISA Stdy Notes Wiki forums for an idea of what to expect. Pay attention to the feedback.

My strategy for passing these 2 modules was to book 5 days off for each module and rewrite questions again and again in an effort "brainwash" myself that I love business management. It worked and managed to keep my average and earn a distinction for my degree.

ORA.CVU Memory Leak - Windows Bug


The production cluster at the company I contract too had been running smoothly for ~30 days. Last week, during my morning check, I discovered the the listener and second and third scan resources on our cluster were in a state I had not seen:

INTERMEDIATE - CHECK TIMED OUT


crsctl stat res -t returned the following:

CLUSTER HEALTH REPORT  
---------------------  
.                      
crsctl stat res

-t     
------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
------------------------------------------------
Local Resources
------------------------------------------------
................
ora.LISTENER.lsnr
               ONLINE  INTERMEDIATE <node1>    CHECK TIMED OUT     
               ONLINE  ONLINE       <node2>                              
................
------------------------------------------------
Cluster Resources
------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       <node2>                              
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  INTERMEDIATE <node1>   CHECK TIMED OUT     
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  INTERMEDIATE <node1>   CHECK TIMED OUT     
ora.cvu
      1        ONLINE  ONLINE       <node1>                              
......................

According to the log, the issue began at 01:47 that morning and had been consistently in this state.

The central log did not reveal the underlying cause of the issue
E:\OracleGrid\11.2.0.3\log\<node1>\alert<node1>.log

[E:\OracleGrid\11.2.0.3\bin\oraagent.exe(4348)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {1:60709:2} in E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log.

[E:\OracleGrid\11.2.0.3\bin\oraagent.exe(5148)]CRS-5014:Agent "E:\OracleGrid\11.2.0.3\bin\oraagent.exe" timed out starting process "E:\OracleGrid\11.2.0.3\bin\lsnrctl.exe" for action "check": details at "(:CLSN00009:)" in "E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log"

[E:\OracleGrid\11.2.0.3\bin\oraagent.exe(5128)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN3.lsnr'. Details at (:CRSAGF00113:) {1:60709:48526} in E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log.

Looking at the E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log log file, I find the following:
2016-07-01 11:08:36.517: [   AGENT][5124] {1:60709:2} UserErrorException: Locale is ..
2016-07-01 11:08:36.517: [ora.LISTENER.lsnr][5124] {1:60709:2} [check] clsnUtils::error Exception type=2 string=
CRS-5017: The resource action "ora.LISTENER.lsnr check" encountered the following error:
(:CLSN00009:)Utils:execCmd aborted. For details refer to "(:CLSN00109:)" in "E:\OracleGrid\11.2.0.3\log\<node1>\agent\crsd\oraagent\oraagent.log".
..
2016-07-01 11:08:38.421: [ora.LISTENER_SCAN2.lsnr][4348] {1:60709:2} [check] got lock..
2016-07-01 11:08:38.421: [ora.LISTENER_SCAN2.lsnr][4348] {1:60709:2} [check] tryActionLock }.
2016-07-01 11:08:38.421: [ora.LISTENER_SCAN2.lsnr][4348] {1:60709:2} [check] abort  }.
2016-07-01 11:08:38.421: [ora.LISTENER_SCAN2.lsnr][4348] {1:60709:2} [check] (:CLSN00110:) clsn_agent::abort }..
2016-07-01 11:08:38.421: [    AGFW][4348] {1:60709:2} Command: check for resource: ora.LISTENER_SCAN2.lsnr 1 1 completed with status: TIMEDOUT..
...

This is message appears many times and is the the same response for SCAN2 and SCAN3 checks.

However, the users were not complaining, the sessions were distributed evenly across all cluster nodes and operations could be considered "smooth". According to v$sessions and the listener logs, new connections were being made across both listeners and scans. I began checking MOS for notes on this issue. I discovered that there were a few different issues related to this problem so I began cross referencing the log files and the items found on MOS. I found a note making reference to a memory leak - so I grepped through all listener logs and I discovered a match.

Early that morning, the memory limit had been reached on the heap and the first sign of trouble was revealed in the listener alert file:
E:\OracleGrid\11.2.0.3\log\diag\tnslsnr\<node1>\listener\alert.xml
<msg time='2016-07-01T01:47:16.033+02:00' org_id='oracle' comp_id='tnslsnr'
 type='UNKNOWN' level='16' host_id='<node1>'
 host_addr='::1'>
 <txt>TNS-12531: TNS:cannot allocate memory
 </txt>
</msg>

Just as the note had depicted, there was another sign to look for - memory exceptions in the cvutrace.log file
E:\OracleGrid\11.2.0.3\log\<node1>\cvu\cvutrc\cvutrace.log file.0

[6880@<node1>] [main] [ 2016-07-01 05:27:02.783 CAT ] [VerificationUtil.getNodelist:685]  ==== Nodelist after: nodelist[0]:<node1>
Exception in thread "main" java.lang.NullPointerException
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:989)
at java.lang.Double.valueOf(Double.java:447)
at java.lang.Double.<init>(Double.java:539)
at oracle.ops.verification.framework.engine.component.FreeSpaceComponent.initializeReqdValues(FreeSpaceComponent.java:240)
at oracle.ops.verification.framework.engine.component.FreeSpaceComponent.init(FreeSpaceComponent.java:98)
at oracle.ops.verification.framework.engine.component.FreeSpaceComponent.<init>(FreeSpaceComponent.java:85)
at oracle.ops.verification.framework.engine.component.ComponentFactory.createComponent(ComponentFactory.java:164)
at oracle.ops.verification.framework.engine.component.HealthComponent.verifyComponents(HealthComponent.java:211)
at oracle.ops.verification.framework.engine.component.HealthComponent.verify(HealthComponent.java:145)
at oracle.ops.verification.framework.engine.ClusterVerifier.verifyComponent(ClusterVerifier.java:117)
at oracle.ops.verification.client.CluvfyDriver.main(CluvfyDriver.java:276)

The note I am referring to is note: Windows: Ora.LISTENER.lsnr Status goes to INTERMEDIATE CHECK TIMED OUT (Doc ID 1523366.1)

The bug in a nutshell: ora.cvu runs cluvfy comp health -_format but fails to close its session thereby consuming heap memory until the listener and scan resources on the node, where CVU is hosted ,begins failing.

I opened process explorer and found 125 instances of the open cluvfy commands:

C:\Windows\system32\cmd.exe  /K E:\OracleGrid\11.2.0.3\bin\cluvfy  comp health -_format 



I killed all off all but one process in hopes that it would free memory and allow the cluster to correct itself. This did not work. I then tried to restart ora.cvu using

srvctl stop cvu
srvctl start cvu

The statues did not change. I tried to stop one of the scans and start it up again

srvctl stop SCAN_LISTENER -i 3
srvctl start SCAN_LISTENER -i 3

The process became stuck during the shutdown process as it timed out.
I cancelled the shutdown and tried to start it up - this too timed out.

I coordinated with my managers, explained the scenario and it was decided that the first node would be restarted. Not great, but at least the shutdown was controlled by us and not left to the whims of the clusterware. The node was restarted and the issue went away - at least for now until the bug consumes it all again over time.

The CVU is a health check conducted by the cluster. By default, the process runs every 6 hours meaning it can spawn and leave 4 sessions every 24 hours. It runs on only one cluster node and the bug impacts only the hosting node.

The issue is said to be fixed in 11.2.0.4. The only hint of a workaround I have found by Oracle is to disable CVU - this is something that I am considering as CVU was only introduced recently in 11.2.0.2. This admission can be found in the MOS note: Clusterware resource ora.cvu FAQ (Doc ID 1524235.1) - Known issues with ora.cvu

I am testing this in our test cluster and will research further to find out what the implications are. Until then, I will continue killing left over sessions to see if this curtails the failure. I will certainly know whether this helps or not in 30 days whether or not killing the left over cmd sessions prevents the issue.

One last possibility to buy time is to extend the CVU check from its 6 hour check to 12 to 24 hours - this would stave of a crash from anywhere between 2 to 4 months. This can be done using the following command:
E:\>srvctl config cvu
CVU is configured to run once every 360 minutes

E:\>srvctl modify cvu -t 720


UPDATE:
I've been periodically killing left over clufvy sessions and the cluster is holding thus far, it has been over 30 days since the issue occurred. The CVU service also jumped nodes, so I monitored the wrong node for a few days until the penny dropped.

FINAL UPDATE: I have been able to write my own fix for the problem. It can be found here