Has this ever happened to you? You’re adding a new Exadata environment to OEM Cloud Control (in this case, version 13.3), and one of the storage servers keeps failing to return metrics, even after you drop and re-add it as an OEM target.
Fortunately, the cell server wasn’t actually down; OEM simply couldn’t retrieve its metrics or any other status. Discovery with the root account worked, and setup completed without complaint. The error returned via OEM, however, was somewhat cryptic:
"Metric evaluation error start - oracle.sysman.emSDK.agent.fetchlet.exception.FetchletException: syntax error at line 1, column 0, byte 0 at /u01/app/agent/agent_188.8.131.52.0/perl/lib/site_perl/5.14.4/x86_64-linux-thread-multi/XML/Parser.pm line 187"
A thorough search of Oracle Support didn’t turn up an exact match until I narrowed the search to just “Metric evaluation error” and “line 187”. To debug, run the same command that OEM uses to get the status. In this case, that is the cellcli command run as the cellmonitor user, with the results returned in XML format:
$ ssh -q -o ConnectTimeout=60 -o BatchMode=yes -o StrictHostKeyChecking=no -o PreferredAuthentications=publickey -i /home/oracle/.ssh/id_dsa -l cellmonitor dbm0celadm14 cellcli -xml -e ' list cell attributes msStatus '
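AHA! Even though the correct XML response was returned, it came back along with a directory permission error, and as long as that error is in the output, OEM will never show a status of “up” for this cell. Non-XML text in the response would also be consistent with the parser giving up at “line 1, column 0”. Comparing the permissions on that directory on dbm0celadm14 to those on any of the other 13 cell servers, I saw this right away: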
On dbm0celadm14, the directory in question belonged to the celltrace group rather than the cellusers group that owned it on dbm0celadm13 and every other cell server. I’m not sure how it got that way, but it was an easy fix:
[root@dbm0celadm14]# chgrp cellusers /var/log/oracle/deploy
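If you'd rather not eyeball fourteen directory listings, a loop along these lines could surface the odd one out. This is only a sketch: the zero-padded host names dbm0celadm01 through dbm0celadm14, key-based root SSH access, and the function names collect_groups and report_mismatches are my assumptions, not anything taken from OEM or the support note.

```shell
#!/usr/bin/env bash
# Sketch: report cell servers whose directory group differs from the expected one.
# Assumptions (not from the original post): hosts are named dbm0celadm01..14,
# root can SSH in with a key, and GNU stat is available on the cells.

EXPECTED_GROUP="cellusers"
TARGET_DIR="/var/log/oracle/deploy"

# Read "host group" pairs on stdin and print only the hosts that deviate.
report_mismatches() {
  local expected="$1" host group
  while read -r host group; do
    if [ -z "$host" ]; then
      continue
    fi
    if [ "$group" != "$expected" ]; then
      echo "$host: group is '$group', expected '$expected'"
    fi
  done
}

# Collect "host group" pairs over SSH; hosts that fail are marked UNKNOWN.
collect_groups() {
  local i host group
  for i in $(seq -w 1 14); do
    host="dbm0celadm${i}"
    group=$(ssh -n -q -o ConnectTimeout=60 -o BatchMode=yes -l root "$host" \
              stat -c '%G' "$TARGET_DIR" 2>/dev/null) || group="UNKNOWN"
    echo "$host $group"
  done
}

# Usage (run from a host that can reach every cell):
#   collect_groups | report_mismatches "$EXPECTED_GROUP"
```

Keeping the comparison separate from the SSH collection means the same check can be fed from a saved listing, which is handy when direct root access to the cells is restricted.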
Once that was fixed, the CELL-01528 error message was no longer returned, and all subsequent OEM collection information on the cell servers showed one big happy family again.
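To confirm the fix from the command line rather than waiting on OEM, one option is to re-run the cellcli command and check that nothing precedes the XML. The helper below is a rough sketch: starts_with_xml is a hypothetical name, and it assumes that any non-XML text ahead of the first tag is what trips the parser.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the first non-blank line on stdin begins with '<'.
# Assumption (mine, not Oracle's): leading non-XML text is what breaks OEM's parse.

starts_with_xml() {
  local first
  # Take the first line containing a non-space character, minus leading whitespace.
  first=$(sed -n '/[^[:space:]]/{s/^[[:space:]]*//;p;q;}')
  case "$first" in
    '<'*)
      return 0
      ;;
    *)
      echo "unexpected leading text: $first" >&2
      return 1
      ;;
  esac
}

# Usage, reusing the same SSH options as the debugging command above:
#   ssh -q -o ConnectTimeout=60 -o BatchMode=yes -l cellmonitor dbm0celadm14 \
#     cellcli -xml -e 'list cell attributes msStatus' | starts_with_xml \
#     && echo "output starts with XML"
```

This only looks at the very start of the output, so it is a quick sanity check and not a substitute for a real XML validation.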
For reference, the Oracle Support note with this solution is Oracle Support Document 2017298.1 (EM12C: Storage Cell Metric Collection Error “… syntax error … x86_64-linux-thread-multi/XML/Parser.pm line 187”), which can be found at: https://support.oracle.com/epmos/faces/DocumentDisplay?id=2017298.1
It references this bug, which is supposed to be fixed by now but apparently was not fixed in the cell image software version I was using:
Oracle Support Bug 20274834 (CELL-1528: UNABLE TO CREATE THE LOG FILE IN DIRECTORY /OPT/ORACLE/CELL/CELLSRV/) can be found at: https://support.oracle.com/epmos/faces/BugDisplay?id=20274834