Archive for the 'Cluster and RAC' Category

Oracle RAC 12.2 on SLES12 – How to disable Hardware Lock Elision Support

The Problem

Last week, I had to set up an Oracle RAC 12.2 (and thus, the Clusterware) on SuSE Linux Enterprise Server 12 SP2 (SLES12 SP2). The installation failed on the first node with a rather non-intuitive error:

CRS-5804: Communication error with agent process
CRS-4000: Command Start failed, or completed with errors.
2017/07/10 10:18:49 CLSRSC-119: Start of the exclusive mode cluster failed
Died at /u01/app/12.2.0/grid_1/crs/install/ line 2053.
The command ‘/u01/app/12.2.0/grid_1/perl/bin/perl -I/u01/app/12.2.0/grid_1/perl/lib -I/u01/app/12.2.0/grid_1/crs/install /u01/app/12.2.0/grid_1/crs/install/ ‘ execution failed

Thank you – for nothing.

The rootcrs.log in the /tmp/GridAction<date> directory was a bit more enlightening, but not much:

CRS-5804: Communication error with agent process
CRS-4000: Command Start failed, or completed with errors.
The exlusive mode cluster start failed, see Clusterware alert log for more information
Executing cmd: /u01/app/12.2.0/grid_1/bin/clsecho -p has -f clsrsc -m 119
Command output:
> CLSRSC-119: Start of the exclusive mode cluster failed 
>End Command output
CLSRSC-119: Start of the exclusive mode cluster failed
###### Begin DIE Stack Trace ######
Package File Line Calling 
--------------- -------------------- ---- ----------
1: main 287 crsutils::dietrap
2: crsinstall 2053 main::__ANON__
3: crsinstall 1963 crsinstall::perform_initial_config
4: crsinstall 653 crsinstall::perform_init_config
5: crsinstall 813 crsinstall::init_config
6: crsinstall 380 crsinstall::CRSInstall
7: main 446 crsinstall::new
####### End DIE Stack Trace #######

Eh… yes.

The error in the Clusterware alert log was finally more helpful, and it made me curious:

CRS-8503: Oracle Clusterware process ORAROOTAGENT with operating system process ID 20948 experienced fatal signal or exception code 11.
Errors in file /u01/app/oracle/diag/crs/myserver08/crs/trace/ohasd_orarootagent_root.trc (incident=1):
CRS-8503 [__lll_unlock_elision()+48] [Signal/Exception: 11] [Instruction Addr: 0x7f5df148a4a0] [Memory Addr: (nil)] [] [] [] [] [] [] [] []

Ah. Of course. :)

Read more…

Oracle RAC 12.2 High Load on CPU from gdb when Node Missing

Recently I had to battle a new issue with the quite new Oracle Database RAC version 12.2 on Linux x86_64. The idea of RAC is to compensate for the loss of a node or service by restarting services on other nodes. But in my case, when one node in a two-node cluster was down (or its CRS stack was stopped with crsctl stop crs), there was very high CPU load on the surviving node: four threads at 100% for three minutes, every five minutes. That’s a bit disappointing. If we lose a node, we want all the CPU power of the survivor for our services, not for debugging, however interesting the debug results may be afterwards. The service must stay up!


So where does that load come from? A quick look with top gives some insight: several gdb processes are running.

Oracle RAC 12.2: high CPU load from GDB (top)

Digging deeper with pstree shows their origin.

Oracle RAC 12.2: high CPU load from GDB (pstree)

It seems that osysmond reacts to an error condition, and diagsnap, in turn, runs the debugger. Once or twice would be OK, but up to six gdb processes at the same time, as in our case, and every five minutes? Thanks, but no thanks.

Diagsnap is used by the Oracle 12.2 Autonomous Health Framework to collect diagnostic information in case of cluster issues. Mostly, that means data for the management repository database.

I was able to reproduce the issue on all Linux 12.2 RACs I have running, and verified it on at least two customer systems.


The Cluster Health Monitor can be configured not to collect this kind of information. On the con side, you reduce the amount of data Oracle Support can use to help you diagnose cluster outages. On the pro side, you will have the CPU power of the surviving node for your services, and that was the plan.

Make sure all nodes in the cluster are up, and on one node, run:

~$ oclumon manage -disable diagsnap

Here we go:
oclumon disabling diagsnap

After that, I was not able to reproduce the issue.


Oracle told me that a real fix will be part of the first 12.2 PSU, so I expect to see it working after July 17th, 2017.


You may not be familiar with the Cluster Health Monitor and the oclumon CLI. This is a good starting point for reading about the topic:



Hope this is helpful – do it like we did and test thoroughly before commissioning your new RAC! :)
Martin Klier

Oracle 11g unable to extend datafile but ASM disk group shows free space

Sometimes Oracle Database 11gR2 complains with ORA-1691 that it is unable to extend a (LOB) segment, but ASM monitoring based on USABLE_FILE_MB does not fire. Adding a new data file then fails with the ASM error ORA-15041. I felt this behaviour of ASM deserved some explanation.


The database alert log complains with ORA-1691:

ORA-1691: unable to extend lobsegment MYUSER.SYS_LOB0013128030C00003$$ 
  by 128 in tablespace USERS
ORA-1691: unable to extend lobsegment MYUSER.SYS_LOB0013128030C00003$$ 
  by 8192 in tablespace USERS

OK, a quick comparison of dba_segments with the tablespace size: it’s full. So let’s extend it with one more datafile:

SQL> alter tablespace USERS add datafile size 2G autoextend on next 1G maxsize 32G;
 alter tablespace USERS add datafile size 2G autoextend on next 1G maxsize 32G
 ERROR at line 1:
 ORA-01119: error in creating database file '+ORADATA'
 ORA-17502: ksfdcre:4 Failed to create file +ORADATA
 ORA-15041: diskgroup "ORADATA" space exhausted

Uh-oh, ORA-15041? But I know for a fact that the customer monitors FREE_MB and USABLE_FILE_MB in v$asm_diskgroup… Maybe something nasty is going on; let’s check whether ASM is busy rebalancing:

SQL> select * from v$asm_operation;
no rows selected
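For context on the monitoring side: USABLE_FILE_MB in v$asm_diskgroup is commonly described as FREE_MB minus REQUIRED_MIRROR_FREE_MB, divided by the number of copies ASM keeps (1 for EXTERN, 2 for NORMAL, 3 for HIGH redundancy). The Python sketch below only reproduces that arithmetic under this assumption; the function name usable_file_mb is hypothetical and the figures are made up, so treat it as an illustration of the formula, not as a query against ASM.

```python
# Assumed mapping of v$asm_diskgroup.TYPE to the number of extent copies.
REDUNDANCY_FACTOR = {"EXTERN": 1, "NORMAL": 2, "HIGH": 3}


def usable_file_mb(free_mb: int, required_mirror_free_mb: int, group_type: str) -> float:
    """Approximate USABLE_FILE_MB for a disk group (hypothetical helper).

    Subtracts the space reserved to restore redundancy after a disk
    failure from FREE_MB, then divides by the number of mirror copies.
    """
    factor = REDUNDANCY_FACTOR[group_type.upper()]
    return (free_mb - required_mirror_free_mb) / factor


if __name__ == "__main__":
    # Made-up numbers: NORMAL redundancy, 10 GB free, 2 GB mirror reserve.
    print(usable_file_mb(10000, 2000, "NORMAL"))
```

With FREE_MB = 10000 and REQUIRED_MIRROR_FREE_MB = 2000 in a NORMAL redundancy group, the sketch yields 4000 MB. A negative result suggests there may not be enough space to restore redundancy after a disk failure.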

Read more…

Looking forward to speaking at COLLABORATE16 IOUG Forum

Yes, I did it again, submitted and got two papers accepted – speaking at COLLABORATE16 in Las Vegas (#C16LV) is always a highlight of the year!


This time, it will be:

And what I’m very thrilled to do again: helping with the Sunday RAC Attack Workshop as a ninja. We own the night! :)

Please see details of the talks here, and follow my #C16LV posts on twitter!

Big Thanks to IOUG, and see you in Vegas!
Martin Klier

Featured by Oracle Magazine

Oracle Magazine featured me in its January/February 2016 issue. I feel truly honoured and would like to say thank you for the opportunity!

Martin Klier Oracle Magazine Jan Feb 2016

For details, please see the Performing Databases Blog post about the publication.

DOAG noon2noon RAC & Dataguard – Quick Report

There’s always something to learn, for example about Oracle RAC (Real Application Clusters) and Dataguard. And old-school lecture-style teaching is boring and, more importantly, ineffective after a few hours of passive listening. So the German Oracle Users Group DOAG organized a “noon2noon” event, “Oracle RAC and Dataguard”, this week in Würzburg (January 21st-22nd 2016). It was the second noon2noon, after the concept was introduced last year with Oracle vs. MySQL. I enjoyed it a lot last time, so I readily agreed to volunteer this year as a RAC Attack “Ninja” and as the on-site tech guy for the workshops in general.

The response was close to overwhelming: we planned for 25 participants, but ended up with 39 and a PACKED room. Somebody called it a chicken cage, but the atmosphere was great. Thanks to the air conditioning. :)

DOAG noon2noon RACattack

Everybody is highly motivated, despite the packed room at noon2noon :)

But what’s that noon2noon thing?

Read more…

How to disable Oracle ACFS drivers / registry resource

Sometimes an installed ACFS can cause trouble, especially if we don’t or can’t use it (e.g. when not running a UEK Linux kernel, as with SuSE Linux Enterprise Server SLES). There’s lots of documentation on how to create and maintain ACFS file systems, but how to get rid of ACFS altogether wasn’t so easy to figure out. I had to find out how to disable the Oracle ACFS drivers because the grid infrastructure did not stop successfully when using “crsctl stop crs” or “/etc/init.d/ohasd stop”:
CRS-2799: Failed to shut down resource ‘ora.drivers.acfs’ – and thus, it was not possible to upgrade the grid infrastructure to version ( fails on first node, also when trying to stop the CRS for the same reason).

How to disable Oracle ACFS drivers - runInstaller during upgrade

Read more…

What is a “RAC Battle”?

RAC Battle [ræk ˈbæt̬l̩]

What is a “RAC Battle”? It is a format for presenting technology: two experts battle each other over the pros and cons of Oracle Real Application Clusters. Björn Rost (Oracle ACE Director) and Martin Klier (Oracle ACE) will fight:

Wednesday November 18th, 2015
11 am
Nürnberg CCN (DOAG Konferenz 2015)
Room St. Petersburg

Who will be pro? Who will be con? We don’t know, we will decide by lot in front of the audience.

Be there to see a technology event at its best, fought in rounds, with no strings attached. Afraid of violence? Stay calm: Johannes Ahrends (Oracle ACE) will be the referee to prevent bloodshed.


Here’s the official RAC Battle link from DOAG.

Oracle dbms_workload_repository awr_global_report_text fails with ORA-06553 PLS-307

My last post was about generating AWR reports from SQL – not a very complicated case. “But” – little word, big meaning. And suddenly we get a strange error message.

In my example, I was especially interested in a Real Application Clusters “global” report. The subprogram for that purpose is dbms_workload_repository.awr_global_report_text(), which accepts the DBID, the begin snapshot, the end snapshot, an options field, and l_inst_num for the instance number. Regarding the latter, Oracle documentation says: “List of instance numbers to be included in report. If set to NULL, all instances for which begin and end snapshots are available …”

But when I did that, I got an error that felt strange to me:

ORA-06553: PLS-307: too many declarations of ‘AWR_GLOBAL_REPORT_TEXT’ match this call

Read more…

How to create an Oracle AWR report with SQL and PL/SQL

AWR is a great tool (but it requires the Diagnostics Pack license). It can create very useful reports for performance analysis over a given period. Most people will pull the AWR report from the OS level with ‘@?/rdbms/admin/awrrpt.sql’ (or one of the other scripts there). But sometimes it’s not possible to access a database server’s OS level, or to get permission to do so. So we can also create the whole range of AWR and ASH reports from the SQL level. We can have them in text or HTML; it’s just a matter of personal taste. I always preferred the text version, but there are many arguments for HTML. Anyway, here is my cheat sheet on how to do it, including the differences in RAC.

What we need to know

Our database ID (DBID):

select dbid 
  from v$database;

The period we are interested in, expressed as snapshot IDs:

select /*+ FIRST_ROWS */ * 
  from dba_hist_snapshot
  order by snap_id desc, instance_number desc;
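Once the DBID and the snapshot range are known, the report can be selected from one of the pipelined table functions in DBMS_WORKLOAD_REPOSITORY, such as awr_report_text or awr_report_html; both return their lines in a column named OUTPUT. As a minimal sketch, assuming you run the resulting statement in SQL*Plus or through a database driver of your choice, this Python helper (the name awr_report_query is hypothetical) assembles that query for a single instance:

```python
def awr_report_query(dbid: int, inst_num: int, begin_snap: int, end_snap: int,
                     html: bool = False) -> str:
    """Return a SELECT statement that pulls an AWR report at SQL level.

    Hypothetical helper: it only builds the statement text; executing it
    is left to SQL*Plus or a database driver.
    """
    # Coerce to int so nothing but plain numbers ends up in the SQL text.
    dbid, inst_num = int(dbid), int(inst_num)
    begin_snap, end_snap = int(begin_snap), int(end_snap)
    if end_snap <= begin_snap:
        raise ValueError("end snapshot must be greater than begin snapshot")
    # The text and HTML variants take the same arguments.
    func = "awr_report_html" if html else "awr_report_text"
    return (
        "select output from table("
        f"dbms_workload_repository.{func}({dbid}, {inst_num}, {begin_snap}, {end_snap}))"
    )


if __name__ == "__main__":
    # Example values only; take the real ones from v$database and dba_hist_snapshot.
    print(awr_report_query(1234567890, 1, 100, 101))
```

With a driver, bind variables would be the more idiomatic choice; the integer coercion here only keeps the generated text free of anything but numbers. For a RAC-wide report, the global variants such as awr_global_report_text are the ones to look at.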


FROM TABLE (dbms_workload_repository.awr_report_text(

Read more…
