Exadata Quarantines — Not Your Typical Oracle Quarantine

Exadata extends the concept of SQL Quarantines at the cell storage level: the behavior of Exadata quarantined SQL_IDs and how they are treated is different from how they are handled at the Oracle Database level.

Quarantined SQL Statements in the AWR

I was reviewing an AWR RAC report on an Exadata database, and when I looked in the Exadata Alerts Detail section, I noticed that a misbehaving SQL statement was in quarantine:

Cell Quarantined SQL Statement

Having the SQL_ID, the plan was to take a look at the database and see the details of that SQL_ID and other quarantine information.

Data Dictionary Views

The first place I looked was in the data dictionary view DBA_SQL_QUARANTINE, but that SQL_ID was not in there! I knew that SQL Quarantine functionality was only available on engineered systems like Exadata, so I was a bit perplexed. As it happens with Oracle Database, similar but distinct functionality can have the same name. SQL Quarantine works in conjunction with Oracle Resource Manager and the DBMS_SQLQ package to identify SQL statements that run too long or use too many resources.

The DBMS_SQLQ package can create a quarantine for a specific plan hash value or for any appearance of a SQL_ID, preventing the runaway query from running at all in the future. The Exadata cell-level quarantine is different.

Exadata Cell-Level Views

In many Exadata environments, the Oracle DBA will not have any access to the metrics or status of the cell servers via the CellCLI command-line tool. To give the Oracle DBA more visibility into what is happening at the storage level, many cell metrics are exposed in V$ views and ultimately in the AWR historical views that start with DBA_HIST. In this case, I was able to look in DBA_HIST_CELL_OPEN_ALERTS or V$CELL_OPEN_ALERTS to see what storage alerts were active during the AWR snapshot range.

DBA_HIST_CELL_OPEN_ALERTS

That’s the alert I saw in the AWR report. The SQL_ID was quarantined at the cell level, but in a different way: the SQL statement will still run, but it won’t take advantage of Exadata offloading features — all the rows returned by the query will have to be processed at the database layer.

Impacts, Root Causes, and Remediations

Losing the offloading capabilities for one or more SQL statements can mean the difference between a query finishing in 5 minutes versus 5 hours. While that definitely will have an impact on a user’s response time, more importantly, I noticed something else about the cell alert:

SQL PLAN Quarantine reason : Crash Quarantine Plan

That SQL statement on the cell server fell into the critical severity level because it crashed one or more processes on one cell server, or sometimes it can be the entire OS of the cell server in extreme circumstances. That impact definitely goes beyond one user query. What would cause an exception like this? It can be as simple as a bug in REGEXP_LIKE in the thin SQL engine at the cell layer. In this case, the quarantine was caused by an incompatibility between an older Exadata software version at the storage layer and a newer model cell server. Updating the Exadata software version will likely solve this issue and the quarantined SQL statement alert can be cleared.

Information is Power

Each version of Oracle Database and Oracle Exadata exposes more and more of the metrics from the cell layer to the database layer, which is useful when you don’t have access to the CellCLI or ExaCLI utilities.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s