Oracle Enterprise Manager Cloud Control 13c: Debugging Metrics Collection

Has this ever happened to you? You’re adding a new Exadata environment to OEM Cloud Control (in this case, version 13.3), and one of the storage servers keeps failing to return metrics, even after you drop and re-add it as an OEM target.

Exadata Machine status — cell server down?

Fortunately, the cell server wasn’t actually down, but OEM couldn’t retrieve metrics or any other status. The cell server was discovered just fine with the root account, and setup completed fine. However, the error returned via OEM was somewhat cryptic:

"Metric evaluation error start - 
oracle.sysman.emSDK.agent.fetchlet.exception.FetchletException:
 syntax error at line 1, column 0, byte 0 at
 /u01/app/agent/agent_13.3.0.0.0/perl/lib/site_perl/5.14.4/x86_64-linux-thread-multi/XML/Parser.pm line 187" 

A thorough search of Oracle Support didn’t turn up an exact match until I narrowed the query to just “Metric evaluation error” and “line 187”. To debug, run the same command that OEM uses to get the status. In this case that is cellcli, run as the cellmonitor user, with the results returned in XML format:

$ ssh -q -o ConnectTimeout=60 -o BatchMode=yes \
    -o StrictHostKeyChecking=no -o PreferredAuthentications=publickey \
    -i /home/oracle/.ssh/id_dsa -l cellmonitor dbm0celadm14 \
    cellcli -xml -e ' list cell attributes msStatus '

cellcli error message

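AHA! The correct XML response was returned, but it arrived together with a directory permission error, and as long as that error is present OEM will never show a status of “up” for the cell. Presumably the extra error text is what the agent’s XML parser rejects at line 1, column 0. Comparing the permissions on that directory on dbm0celadm14 to those on any of the other 13 cell servers, I saw this right away: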

On dbm0celadm14, the directory in question was assigned to the celltrace group; on dbm0celadm13 and every other cell server it belonged to the cellusers group. I’m not sure how it got that way, but it was an easy fix:

[root@dbm0celadm14]# chgrp cellusers /var/log/oracle/deploy
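With 14 cell servers, the comparison that exposed the odd directory can also be scripted. The following is a rough sketch rather than anything OEM provides: remote_group and unexpected_group are hypothetical helper names, and it assumes root SSH access to each cell, GNU stat on the cells, and host names that follow the dbm0celadm01 to dbm0celadm14 pattern.

```shell
# Hypothetical helper: print the owning group of directory $2 on host $1.
# Assumes key-based root SSH access and GNU stat on the remote side.
remote_group() {
  ssh -q -o BatchMode=yes -o ConnectTimeout=60 "root@$1" stat -c '%G' "$2"
}

# Hypothetical helper: read "host group" lines on stdin and print only
# those whose group differs from the expected group given as $1.
unexpected_group() {
  awk -v want="$1" '$2 != want { print }'
}

# Example use (host naming is an assumption):
# for n in $(seq -w 1 14); do
#   h="dbm0celadm$n"
#   printf '%s %s\n' "$h" "$(remote_group "$h" /var/log/oracle/deploy)"
# done | unexpected_group cellusers
```

Any host the loop prints has a group other than cellusers on that directory and is a candidate for the chgrp shown above.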

Once that was fixed, the CELL-01528 error message was no longer returned, and all subsequent OEM collection information on the cell servers showed one big happy family again:

OEM Exadata component status
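To confirm from the command line that nothing but XML comes back, the output of the debugging command can be piped through a small filter. This is only a sketch: leading_noise is a hypothetical helper name, and it assumes the XML payload begins on the first line that starts with “<”.

```shell
# Hypothetical helper: print every line that appears before the first
# line beginning with '<' (optionally indented), i.e. before the XML
# starts. Prints nothing when the output is XML from the first line on.
leading_noise() {
  awk '/^[[:space:]]*</ { exit } { print }'
}
```

Piping the ssh and cellcli -xml command shown earlier into leading_noise should print nothing on a healthy cell; any line it does print sits ahead of the XML in the response.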

For reference, the Oracle Support note with this solution is Document 2017298.1 (EM12C: Storage Cell Metric Collection Error “… syntax error … x86_64-linux-thread-multi/XML/Parser.pm line 187”): https://support.oracle.com/epmos/faces/DocumentDisplay?id=2017298.1
It references the following bug, which is supposed to be fixed by now but apparently was not fixed in the cell image software version I was using:
Oracle Support Bug 20274834 (CELL-1528: UNABLE TO CREATE THE LOG FILE IN DIRECTORY /OPT/ORACLE/CELL/CELLSRV/): https://support.oracle.com/epmos/faces/BugDisplay?id=20274834
