Oracle Grid Infrastructure: How to recover from a messed up ASM/CSSD diskstring

Oracle Grid Infrastructure 11.2 with voting files and OCR in an ASM diskgroup can be a little tricky if you mess up the voting file voodoo. You know the basic situation?

With Oracle Grid Infrastructure, aka Oracle Clusterware, the cluster quorum (voting files) and the config repository (OCR) are stored in an ASM disk group. But CSSD needs the voting files before ASM is online: at startup, CSSD scans the headers of every device matching the disk string (configured by you at ASM initial setup time). If it finds a majority of the voting files (at least two, in the usual three-file setup), the party takes place. Otherwise, your CSSD will cycle, writing appropriate error messages to $GRID_HOME/log/hostname/cssd/ocssd.log on each loop.
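
To see what CSSD is complaining about, it can help to filter its log for voting file messages. This is only a minimal sketch: the function name cssd_voting_lines is made up for illustration, and it assumes nothing more than that the relevant messages contain the word "voting" in some capitalisation. Check that against your own log before trusting it.

```shell
# Print the most recent lines of a CSSD log that mention voting files.
# $1 = path to your ocssd.log, $2 = number of lines to show (default 20).
# Hypothetical helper, not part of any Oracle tooling.
cssd_voting_lines() {
    logfile=$1
    count=${2:-20}
    # -i: match "voting", "Voting", "VOTING" alike
    grep -i 'voting' "$logfile" | tail -n "$count"
}
```

Call it with the path of your ocssd.log as the first argument, for example: cssd_voting_lines /path/to/ocssd.log 50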

This is where I found myself today: I changed the ASM diskstring to an insane value, and whoops, at the next reboot my Node1 cycled its CSSD forever, every few minutes, and Node2 was caught in a rock-solid reboot loop. Looking at the CSSD logfile, I saw that CSSD had trouble identifying its voting files. (In fact, multiple device paths pointed to the same physical device. Interestingly, CSSD then dropped both of them. But that is not the issue of this post.)

Now, tell me, how do you change the asm_diskstring parameter back without ASM running, and with no CSSD available, which ASM needs in order to start? And how do we tell CSSD, which starts well before ASM, to scan the right devices?

Some facts first:

CSSD scans for all devices specified in the gpnp profile xml file, tag “DiscoveryString”
ASM scans for all devices specified in its asm_diskstring parameter
Both of them are somehow synchronized when you change the ASM init parameter (so far, I have not found out how)
ASM depends on CSSD for starting (as stated above)
gpnptool was made for editing the gpnp profile and is great stuff
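
With the stack down, you can still read the current DiscoveryString straight from the profile file on disk. A minimal sketch, assuming the value is stored as an XML attribute written as DiscoveryString="..." with double quotes; the function name gpnp_discovery_string is hypothetical. It only reads the file and changes nothing.

```shell
# Print the value of the DiscoveryString attribute from a gpnp profile file.
# $1 = path to a profile.xml (read only, nothing is modified).
# Assumption: the attribute is written as DiscoveryString="value".
gpnp_discovery_string() {
    # The profile is usually one long line, so capture just the quoted value.
    sed -n 's/.*DiscoveryString="\([^"]*\)".*/\1/p' "$1" | head -n 1
}
```

An empty result means either an empty DiscoveryString or a profile that does not look the way this sketch assumes, so look at the file itself in that case.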

But now, how to recover from a non-starting CSSD, complaining about wrong/missing/too many disks with voting files available?

Overview:

Stop all CRS stuff on the node you are working on
Repair your gpnp (Grid Plug and Play) profile, change the DiscoveryString
Start up CSSD and ASM (by starting the cluster stack)
Repair your ASM_DISKSTRING
Start all of the clusterware stack

So, how to do all of those?

Step 1:

# crsctl stop crs
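
Before touching the profile, it is worth confirming that the clusterware daemons are really gone. A small sketch that filters a process listing; the function name clusterware_procs is made up, and the daemon names in the pattern are an assumption based on a typical 11.2 install, so adjust them to what you see on your box.

```shell
# Read "ps -ef" style output on stdin and print only clusterware daemon lines.
# Assumption: daemons show up as ocssd.bin, crsd.bin, evmd.bin, gpnpd.bin.
clusterware_procs() {
    # "|| true" keeps the exit status at 0 when nothing matches,
    # which is the result we are hoping for here.
    grep -E '(ocssd|crsd|evmd|gpnpd)\.bin' || true
}
```

Usage: ps -ef | clusterware_procs. No output means none of those daemons is left running.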

Step 2:

a) go to your profile directory
~> cd $GRID_HOME/gpnp/hostname/profiles/peer

b) copy the profile.xml to a file we want to work with. Back up the old file somewhere else, too.
~> cp profile.xml profile.xml.new.us
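
For the "somewhere else" backup, a timestamped copy keeps you from overwriting an earlier backup on the second or third attempt. A minimal sketch; backup_with_timestamp is a hypothetical helper name, and the destination directory is whatever you choose.

```shell
# Copy a file to <dest_dir>/<name>.<timestamp>, keeping mode and mtime,
# and print the path of the copy. $1 = file, $2 = destination directory.
backup_with_timestamp() {
    src=$1
    dest="$2/$(basename "$src").$(date +%Y%m%d%H%M%S)"
    # -p preserves permissions and timestamps of the original
    cp -p "$src" "$dest" && printf '%s\n' "$dest"
}
```

Usage: backup_with_timestamp profile.xml /some/safe/directory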

c) remove the oracle signature from the file
~> gpnptool unsign -p=./profile.xml.new.us

d) change the DiscoveryString itself
~> gpnptool edit -asm_dis='/dev/disk/by-id/ASM*' \
-p=profile.xml.new.us -o=profile.xml.new.us -ovr

e) check the output file, it’s all plain text 😉
~> cat profile.xml.new.us
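
Since my trouble came from several device paths pointing to the same physical device, it makes sense to test the new pattern on the OS level before signing anything. A sketch under assumptions: Linux with GNU readlink -f, device paths without spaces, and two made-up function names (resolve_pattern, duplicate_targets).

```shell
# Print "resolved_target path" for every existing path matching a pattern.
# $1 = glob pattern, passed in quotes; it is expanded inside the function.
resolve_pattern() {
    for path in $1; do                    # unquoted on purpose: expand the glob
        [ -e "$path" ] || continue        # pattern matched nothing
        printf '%s %s\n' "$(readlink -f "$path")" "$path"
    done | sort
}

# Print every target that is reached through more than one matching path.
duplicate_targets() {
    resolve_pattern "$1" | awk '{print $1}' | uniq -d
}
```

Usage: resolve_pattern '/dev/disk/by-id/ASM*' shows what the pattern really matches, and duplicate_targets '/dev/disk/by-id/ASM*' should print nothing if every device is reached by exactly one path.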

f) sign the profile xml file with the wallet (Notice: the path is only the directory to the wallet, NOT the wallet file itself)
~> gpnptool sign -p=./profile.xml.new.us \
-w=file:$GRID_HOME/gpnp/hostname/wallets/peer/ \
-o=./profile.xml.new

g) move the original profile.xml out of the way
~> mv ./profile.xml ./profile.xml.old.1

h) put the new one in place
~> cp ./profile.xml.new ./profile.xml

Step 3:

# crsctl start crs
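
The stack needs a while to come up, so instead of hammering the command line you can poll for CSS. A generic sketch; wait_for_output is a hypothetical helper, and the text to wait for is an assumption about what crsctl check css prints on your version, so run the command once by hand and adjust the text.

```shell
# Run a command repeatedly until its output contains a given text.
# $1 = text to wait for, $2 = max attempts, $3 = seconds between attempts,
# remaining arguments = the command to run. Returns 1 if the text never shows.
wait_for_output() {
    want=$1; tries=$2; pause=$3
    shift 3
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Look at stdout and stderr of the command together
        if "$@" 2>&1 | grep -q "$want"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$pause"
    done
    return 1
}
```

Usage, as root: wait_for_output 'Cluster Synchronization Services is online' 30 10 crsctl check css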

Step 4:

The cluster stack should be running now, but CRSD complains that its OCR is in a disk group that’s not online. Obviously, the broken ASM_DISKSTRING successfully hides the right devices for this disk group. So we have to repair that as well.

a) connect to the now running ASM instance
~> sqlplus / as sysasm

b) change the diskstring. I’m doing scope=memory, because the spfile lives in +OCW, too, and that disk group is not mounted yet… 🙂
SQL> alter system set asm_diskstring='/dev/disk/by-id/ASM*' scope=memory;

c) Mount the diskgroup holding the cluster files (OCW here). Wise (wo)men are proud owners of a dedicated diskgroup for the cluster stuff.
SQL> alter diskgroup OCW mount;

d) Make the diskstring parameter persistent. Don’t forget this step!
SQL> alter system set asm_diskstring='/dev/disk/by-id/ASM*' scope=both;
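
To double-check the result, a few read-only queries show the active disk string, the disks ASM can see, and the state of each disk group. A minimal sketch; asm_check_sql is a made-up function name that only prints the SQL, so nothing touches the instance until you pipe it into sqlplus yourself.

```shell
# Print a short, read-only SQL*Plus script: active disk string,
# visible disks with their header status, and the state of every disk group.
asm_check_sql() {
    # Quoted heredoc delimiter: the $ in v$asm_disk is kept literally
    cat <<'SQL'
show parameter asm_diskstring
select path, header_status from v$asm_disk order by path;
select name, state from v$asm_diskgroup order by name;
exit
SQL
}
```

Usage, as the grid owner with the ASM environment set: asm_check_sql | sqlplus -s / as sysasm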

Step 5:

If the cluster stack does not continue starting on its own, restart it:
# crsctl stop crs
# crsctl start crs

Finish run:

Finally, repeat all of this on every node that was messed up. Catch them out of their reboot loops, and bring them home into the cluster. Good luck!

Yours
Martin

PS: Thanks a lot to the oracle-l mailing list (especially Freek D’Hooge) and TheBonsai. Guys, once again you saved my day!

Martin Klier