Archive for the 'Cluster and RAC' Category

Oracle on Windows: ASM instance terminated by LMON / ORA-27300 IPC_TCPConnectCheck failed with status -1

Recently I had an issue with a two-node Grid Infrastructure on Windows 2012R2. After an infrastructure-caused cluster restart (irresponsible SAN hardware patching :) ), everything was running on Node 2, and Node 1 could not join the cluster any more.

No easy solution: at CSSD level there was no issue (network and disk heartbeat worked, according to ocssd.log). It turned out that the ASM instance on Node 1 started, but its LMON could not communicate with the already-running ASM on Node 2: Instance terminated by LMON. There were no really telling ORA error messages in its alert log.

But on the working Node 2, the ASM alert log showed:
ORA-27300: OS system dependent operation:IPC_TCPConnectCheck failed with status: -1

Guessing from the module name, I started thinking about the network – and yes, somebody had activated the Windows Firewall on Node 1. Strange that the errors did not show up on the node causing them, but I was glad to have found the culprit.

How to check Windows firewall:
netsh advfirewall show currentprofile

Syntax that will always help with annoying firewalls, but has to be cleared with security first:
netsh advfirewall set currentprofile state off
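
When several nodes have to be checked, the state line can be picked out of the `netsh advfirewall show currentprofile` output by a script. A minimal sketch in Python, assuming the English-language output layout in which the profile block contains a line starting with `State` followed by `ON` or `OFF`; the function name and the sample text are made up for illustration:

```python
import re

def firewall_state(netsh_output):
    """Return 'ON' or 'OFF' from `netsh advfirewall show currentprofile` output.

    Returns None if no State line is found (for example on a localized
    Windows, where the label is translated).
    """
    match = re.search(r"^State\s+(ON|OFF)\s*$", netsh_output, re.MULTILINE)
    return match.group(1) if match else None

# Hypothetical sample, shaped like the English netsh output
sample = """\
Domain Profile Settings:
----------------------------------------------------------------------
State                                 ON
Firewall Policy                       BlockInbound,AllowOutbound
"""

if __name__ == "__main__":
    print(firewall_state(sample))
```

Run against the captured output of each node, a result of ON on one node only is the kind of asymmetry that caused the trouble here.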

Lessons learned:

  1. People tend to introduce new problems while fixing others (in this case, messing with the Windows Firewall config while looking for a SAN problem), so DBAs, adapt your thinking to that.
  2. Obvious, but easy to forget: When diagnosing RAC / Clusterware issues, look into logs on all nodes (or build a central ADR)

As usual, take care and think about the (other) box
Martin



Oracle on SLES12 SP2 – Avoiding Cgroup Task Limit

Once upon a time, there was an Oracle RAC cluster 12.2.0.1 on SUSE Linux Enterprise Server (SLES) 12 SP2 that did not do well. Its database and ASM instances used to fail with:

ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3

Of course, in such a case you will check ulimit -u / limits.conf (nproc) and sysctl.conf (kernel.pid_max). But what if this does not help?
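
Those first checks can be gathered in one go. A minimal sketch, assuming Linux and Python 3; it only reads the per-user process limit (the value `ulimit -u` reports) and `kernel.pid_max`, and the function name is made up for illustration:

```python
import resource

def process_limits():
    """Collect the soft/hard nproc limit of the current user and kernel.pid_max."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    try:
        with open("/proc/sys/kernel/pid_max") as f:
            pid_max = int(f.read().strip())
    except OSError:
        pid_max = None  # not Linux, or /proc is not available
    return {"nproc_soft": soft, "nproc_hard": hard, "pid_max": pid_max}

if __name__ == "__main__":
    for name, value in process_limits().items():
        # resource.RLIM_INFINITY stands for "unlimited"
        print(name, "unlimited" if value == resource.RLIM_INFINITY else value)
```

The limits are per user, so it should be run as the owner of the Oracle processes, not as root.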

Read more…



Oracle RAC 12.2 on SLES12 – How to disable Hardware Lock Elision Support

The Problem

Last week, I had to set up an Oracle RAC (and thus the Clusterware) version 12.2.0.1 on SUSE Linux Enterprise Server 12 SP2 (SLES12 SP2). gridSetup.sh failed in root.sh on the first node with a rather unintuitive error:

CRS-5804: Communication error with agent process
CRS-4000: Command Start failed, or completed with errors.
2017/07/10 10:18:49 CLSRSC-119: Start of the exclusive mode cluster failed
Died at /u01/app/12.2.0/grid_1/crs/install/crsinstall.pm line 2053.
The command ‘/u01/app/12.2.0/grid_1/perl/bin/perl -I/u01/app/12.2.0/grid_1/perl/lib -I/u01/app/12.2.0/grid_1/crs/install /u01/app/12.2.0/grid_1/crs/install/rootcrs.pl ‘ execution failed

Thank you – for nothing.

The rootcrs.log in the /tmp/GridAction<date> directory was a bit more enlightening, but not much:

CRS-5804: Communication error with agent process
CRS-4000: Command Start failed, or completed with errors.
The exlusive mode cluster start failed, see Clusterware alert log for more information
Executing cmd: /u01/app/12.2.0/grid_1/bin/clsecho -p has -f clsrsc -m 119
Command output:
> CLSRSC-119: Start of the exclusive mode cluster failed 
>End Command output
CLSRSC-119: Start of the exclusive mode cluster failed
###### Begin DIE Stack Trace ######
Package File Line Calling 
--------------- -------------------- ---- ----------
1: main rootcrs.pl 287 crsutils::dietrap
2: crsinstall crsinstall.pm 2053 main::__ANON__
3: crsinstall crsinstall.pm 1963 crsinstall::perform_initial_config
4: crsinstall crsinstall.pm 653 crsinstall::perform_init_config
5: crsinstall crsinstall.pm 813 crsinstall::init_config
6: crsinstall crsinstall.pm 380 crsinstall::CRSInstall
7: main rootcrs.pl 446 crsinstall::new
####### End DIE Stack Trace #######

Eh… yes.

The error in the cluster alert log was finally more helpful, and made me curious:

[ORAROOTAGENT(20948)]
CRS-8503: Oracle Clusterware process ORAROOTAGENT with operating system process ID 20948 experienced fatal signal or exception code 11.
Errors in file /u01/app/oracle/diag/crs/myserver08/crs/trace/ohasd_orarootagent_root.trc (incident=1):
CRS-8503 [__lll_unlock_elision()+48] [Signal/Exception: 11] [Instruction Addr: 0x7f5df148a4a0] [Memory Addr: (nil)] [] [] [] [] [] [] [] []

Ah. Of course. :)

Read more…



Oracle RAC 12.2 High Load on CPU from gdb when Node Missing

Recently I had to battle a new issue with the quite new Oracle Database RAC version 12.2.0.1 on Linux x86_64. The idea of RAC is to compensate for the loss of a node or service by restarting services on other nodes. But in my case, when one node in a two-node cluster was down (or its CRS stack had been stopped with crsctl stop crs), there was very high CPU load on the surviving node, in a recurring pattern: every five minutes, four threads ran at 100% for three minutes. That’s a bit disappointing – if we lose a node, we want all the CPU power of the survivor for our services, not for debugging. However interesting the debug results may be afterwards, the service must stay up!


So where does that load come from? A quick look with top gives some insight: several gdb processes are running.

Oracle RAC 12.2. high CPU load from GDB (top)

Digging deeper with pstree shows their origin.

Oracle RAC 12.2. high CPU load from GDB (pstree)

It seems that osysmond starts diagsnap.pl due to an error condition, and diagsnap in turn runs the debugger. Once or twice would be fine, but up to six gdbs at the same time, as in our case, and that every five minutes – thanks, but no thanks.
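
The same parent chain can also be traced from plain `ps -eo pid,ppid,args` output. A minimal sketch in Python; the sample process list and its PIDs are invented for illustration, and only the chain from osysmond via diagsnap.pl to gdb reflects what was observed:

```python
def parse_ps(ps_output):
    """Map pid -> (ppid, command line) from `ps -eo pid,ppid,args` output."""
    procs = {}
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) == 3:
            procs[int(parts[0])] = (int(parts[1]), parts[2])
    return procs

def ancestry(procs, pid):
    """Return command lines from the given pid up to the topmost known parent."""
    chain, seen = [], set()
    while pid in procs and pid not in seen:
        seen.add(pid)
        ppid, args = procs[pid]
        chain.append(args)
        pid = ppid
    return chain

# Hypothetical sample data (PIDs and command lines are made up)
sample = """\
  PID  PPID COMMAND
 4100     1 osysmond.bin
 5200  4100 perl diagsnap.pl
 5300  5200 gdb
 5301  5200 gdb
"""

procs = parse_ps(sample)
# All processes whose executable name ends in "gdb"
gdb_pids = sorted(pid for pid, (_, args) in procs.items()
                  if args.split()[0].endswith("gdb"))

if __name__ == "__main__":
    for pid in gdb_pids:
        print(pid, " <- ".join(ancestry(procs, pid)))
```

Counting `gdb_pids` at regular intervals is also a simple way to verify later that the workaround took effect.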

Diagsnap is used by the Oracle 12.2 Autonomous Health Framework to create diagnostics information in case of cluster issues. Mostly that means data for the management repository database.

I was able to reproduce the issue on all Linux RACs with 12.2 I have running, also verified with at least two customer systems.

Workaround

The Cluster Health Monitor can be configured not to collect this kind of information. The downside: you reduce the amount of data Oracle Support can use to help you diagnose cluster outages. On the pro side, you keep the CPU power of the surviving node for your services, and that was the plan.

Start all nodes in the cluster, and on one node, run

~$ oclumon manage -disable diagsnap

Here we go:
oclumon disabling diagsnap

After that, I was not able to reproduce the issue.

Solution

Oracle told me that a real fix will be part of the first 12.2 PSU, so I expect to see it working after July 17th, 2017.

Readme

You may not be familiar with Cluster Health Monitor and the oclumon CLI. This is a good starting point for reading up on the topic:
https://docs.oracle.com/database/121/CWADD/troubleshoot.htm#CWADD92242

Hope this is helpful – do it like we did and test thoroughly before commissioning your new RAC! :)
Martin Klier



Oracle 11g unable to extend datafile but ASM disk group shows free space

Sometimes Oracle Database 11gR2 complains with ORA-1691 about not being able to extend a (LOB) segment, although ASM monitoring based on USABLE_FILE_MB did not fire. Adding a new data file then fails with ORA-15041 from ASM. I felt this behaviour of ASM deserved some explanation.

Situation

Database Alert Log complains with ORA-1691:

ORA-1691: unable to extend lobsegment MYUSER.SYS_LOB0013128030C00003$$ 
  by 128 in tablespace USERS
ORA-1691: unable to extend lobsegment MYUSER.SYS_LOB0013128030C00003$$ 
  by 8192 in tablespace USERS

Ok, a quick look comparing dba_segments and the tablespace size – it’s full. So let’s extend it with one more datafile:

SQL> alter tablespace USERS add datafile size 2G autoextend on next 1G maxsize 32G;
 alter tablespace USERS add datafile size 2G autoextend on next 1G maxsize 32G
 *
 ERROR at line 1:
 ORA-01119: error in creating database file '+ORADATA'
 ORA-17502: ksfdcre:4 Failed to create file +ORADATA
 ORA-15041: diskgroup "ORADATA" space exhausted

Uh-oh, ORA-15041? But as I well know, the customer is monitoring FREE_MB and USABLE_FILE_MB in v$asm_diskgroup…? Maybe something nasty is going on, so let’s check whether ASM is busy rebalancing:

SQL> select * from v$asm_operation;
no rows selected
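
It helps to keep in mind what the monitored figure means. As far as the Oracle documentation for v$asm_diskgroup describes it, USABLE_FILE_MB is derived from FREE_MB and REQUIRED_MIRROR_FREE_MB and divided by the number of mirror copies, so it is a figure for the disk group as a whole. A minimal sketch of that arithmetic in Python, with invented numbers; the function name is made up for illustration:

```python
def usable_file_mb(free_mb, required_mirror_free_mb, redundancy):
    """Disk-group-wide usable space for new files, in MB.

    redundancy is the TYPE column of v$asm_diskgroup:
    EXTERN (1 copy), NORMAL (2 copies) or HIGH (3 copies).
    """
    copies = {"EXTERN": 1, "NORMAL": 2, "HIGH": 3}[redundancy]
    return (free_mb - required_mirror_free_mb) // copies

if __name__ == "__main__":
    # Invented example values
    print(usable_file_mb(102400, 20480, "NORMAL"))
    print(usable_file_mb(51200, 0, "EXTERN"))
```

Being an aggregate, the value says nothing about how the free space is spread over the individual disks of the group.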

Read more…



Looking forward to speaking at COLLABORATE16 IOUG Forum

Yes, I did it again, submitted and got two papers accepted – speaking at COLLABORATE16 in Las Vegas (#C16LV) is always a highlight of the year!


This time, it will be:

And, what I’m very thrilled to do again – helping with the Sunday RAC Attack Workshop as a ninja. We own the night! :)

Please see details of the talks here, and follow my #C16LV posts on twitter!

Big Thanks to IOUG, and see you in Vegas!
Martin Klier



Featured by Oracle Magazine

The Oracle Magazine featured me in its January / February 2016 issue. I really feel honoured and would like to say thank you for the opportunity!


For details, please see the Performing Databases Blog post about the publication.



DOAG noon2noon RAC & Dataguard – Quick Report

There’s always something to learn – for example about Oracle RAC (Real Application Clusters) and Dataguard. And the old-school frontal teaching concept is boring and, more importantly, ineffective after a few hours of passive listening. So the German Oracle Users Group DOAG organized a “noon2noon” event, “Oracle RAC and Dataguard”, this week in Würzburg (January 21st-22nd, 2016). It was the second noon2noon, after introducing the concept last year with Oracle vs. MySQL. I enjoyed it very much last time, so I readily agreed to volunteer as a RAC Attack “Ninja” and tech guy on site for the workshops in general this year.

The response was close to overwhelming – we had planned for 25 participants, but ended up with 39, and a PACKED room. Somebody called it a chicken cage, but the atmosphere was great. Thanks to the air conditioning. :)


Everybody is highly motivated, despite the packed room at noon2noon :)

But what’s that noon2noon thing?

Read more…



How to disable Oracle ACFS drivers / registry resource

Sometimes an installed ACFS can cause trouble, especially if we don’t or can’t use it (e.g. when not using a UEK Linux kernel, as with SUSE Linux Enterprise Server, SLES). There’s lots of documentation on how to create and maintain ACFS file systems, but how to get rid of ACFS altogether wasn’t so easy to guess. I had to find out how to disable the Oracle ACFS drivers, because the grid infrastructure did not stop successfully when using “crsctl stop crs” or “/etc/init.d/ohasd stop”:
CRS-2799: Failed to shut down resource ‘ora.drivers.acfs’
Thus, it was not possible to upgrade the grid infrastructure from 11.2.0.3 to version 12.1.0.2 (rootupgrade.sh fails on the first node for the same reason, when it tries to stop the CRS).

How to disable Oracle ACFS drivers - runInstaller during upgrade

Read more…



What is a “RAC Battle”?

RAC Battle [ræk ˈbæt̬l̩]

What is a “RAC Battle”? It is a format for presenting technology – two experts, battling each other over the pros and cons of Oracle Real Application Clusters. Björn Rost (Oracle ACE Director) and Martin Klier (Oracle ACE)


will fight

Wednesday November 18th, 2015
11 am
Nürnberg CCN (DOAG Konferenz 2015)
Room St. Petersburg

Who will be pro? Who will be con? We don’t know, we will decide by lot in front of the audience.

Be there to see a technology event at its best, in rounds, with no strings attached. Are you afraid of violence? Stay calm, Johannes Ahrends (Oracle ACE) will be the referee to avoid bloodshed.


Here’s the official RAC Battle link from DOAG.



