Recently, on a customer's two-node Oracle Clusterware, one ASM instance crashed at every startup.
- Starting it manually did not work either.
- The CSSD was running.
- For obvious reasons, CRSD did not start.
- The other ASM instance in the cluster briefly reported a CLUSTER RECONFIGURATION each time.
The ASM Alert Log file looked like:
Sun Nov 13 13:44:08 2011
MMNL started with pid=21, OS id=7783
lmon registered with NM - instance number 2 (internal mem no 1)
Sun Nov 13 13:46:05 2011
System state dump requested by (instance=2, osid=7684 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_7706.trc
Sun Nov 13 13:46:05 2011
PMON (ospid: 7684): terminating the instance due to error 481
Dumping diagnostic data in directory=[cdmp_20111113134605], requested by (instance=2, osid=7684 (PMON)), summary=[abnormal instance termination].
Instance terminated by PMON, pid = 7684
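When an instance dies repeatedly, it can help to pull the termination messages out of the alert log instead of reading it by eye. The following is only a minimal sketch: the regular expression is modelled on the single PMON line in the excerpt above, and the function name `find_terminations` is made up for this example, so other message variants may need a different pattern.

```python
import re

# Pattern modelled on the PMON line in the excerpt above (an assumption:
# other Oracle versions may word this message differently).
TERMINATION = re.compile(
    r"(?P<process>\w+) \(ospid: (?P<ospid>\d+)\): "
    r"terminating the instance due to error (?P<error>\d+)"
)


def find_terminations(alert_log_text):
    """Return a (process, ospid, error) tuple for every termination message."""
    return [
        (m.group("process"), int(m.group("ospid")), int(m.group("error")))
        for m in TERMINATION.finditer(alert_log_text)
    ]


# The relevant line from the alert log shown above.
sample = (
    "Sun Nov 13 13:46:05 2011 "
    "PMON (ospid: 7684): terminating the instance due to error 481"
)
print(find_terminations(sample))
```

The pattern matches within a line, so it works whether the log is read line by line or as one string.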
Strange problem. Checking device permissions, running read/write tests, rebooting the cluster in a downtime window: nothing helped.
To make a long story short: the NTP daemon was running, but it was not actually synchronising its time. Because an NTP daemon was present, CTSS stayed in observer mode and did not correct the clocks itself, so the server times of the two nodes drifted apart. Fixing NTP fixed the cluster.
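A running NTP daemon is therefore not proof of a synchronised clock. One way to check is to look at the output of `ntpq -p`, where the peer the daemon is currently synchronised to is marked with a leading `*`. The sketch below parses that output; the function names, host names and sample values are invented for illustration and are not taken from the customer system. On the Clusterware side, `crsctl check ctss` reports whether CTSS is running in observer or active mode.

```python
import subprocess


def selected_peer(ntpq_output):
    """Return (remote, offset_ms) of the peer ntpd is synchronised to, or None.

    In `ntpq -p` output the current system peer is flagged with a leading '*'.
    No such line means the daemon is running but not synchronised.
    """
    for line in ntpq_output.splitlines():
        if line.startswith("*"):
            # Columns: remote refid st t when poll reach delay offset jitter
            fields = line[1:].split()
            return fields[0], float(fields[8])
    return None


def check_local_ntp():
    """Run `ntpq -p` on this host and parse it (requires ntpq to be installed)."""
    result = subprocess.run(
        ["ntpq", "-p"], capture_output=True, text=True, check=True
    )
    return selected_peer(result.stdout)


# Invented sample outputs; the host names and numbers are hypothetical.
SYNCED = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp1.example.com 192.0.2.10      2 u   33   64  377    0.512    0.120   0.045
+ntp2.example.com 192.0.2.11      2 u   40   64  377    0.623   -0.310   0.060
"""

UNSYNCED = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ntp1.example.com .INIT.         16 u    -   64    0    0.000    0.000   0.000
"""

print(selected_peer(SYNCED))
print(selected_peer(UNSYNCED))
```

Running such a check on every node, for example from the monitoring system, would have pointed to the unsynchronised daemon long before the clocks drifted far enough to take down an ASM instance.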