This is only security-related in that the underlying issue is file permissions, and for a time I thought I had created the problem myself by changing the permissions to secure the environment.
I upgraded my RAC environments to 11g late last year, and I have found very little about 11g that I do not love. There is one small thing which is extremely unlovable, but once you know what is wrong, it is easy to fix.
Our process for moving to 11g was intended to be slow and methodical. We upgraded the clusterware, and then waited a week before installing the 11g DBMS in preparation for the database upgrade. The first sign of trouble came when I was preparing to shut things down for the database upgrade. I did a srvctl stop nodeapps and got a VIP error on the remote node. Alas, I was a Diva with a mission (as in DEADLINE), so I continued with my cold backup and 11g database build. I still didn’t make the connection when the 11g database would not start on the remote node. Eventually my outage window and my patience expired, and I reverted to the cold backup of the 10g database. Imagine my surprise when it wouldn’t run on the remote node, either.
I started browsing the crsd logs when the following messages caught my eye:
… [ CREVT][518] CAAMonitorHandler 0:Could not execute /opt/oracle/product/clusterware/bin/racgwrap(check) for ora.hostname.vip
category: 1234, operation: scls_canexec, loc: , OS error: 0, other: no exe permission, file /opt/oracle/product/clusterware/bin/racgwrap
… [ CRSAPP][518] CheckResource error for ora.hostname.vip error code = -1
This doesn’t look good. I checked the permissions for racgwrap and they were set at 750. I checked the other node, and there they were 751. The other node works and the message says “other: no exe permission”. Let’s try setting the permission to 751.
This improved the situation but did not restore full functionality. I ultimately ended up doing an ls -l of the clusterware $ORACLE_HOME/bin directory on each node and found that there were 115 files with 751 permission on the functional node, and only 8 on the problem node. Still thinking that I had been overzealous in my effort to secure binaries, I restored the permissions and made plans to upgrade the database another evening.
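The node-to-node comparison described above can be sketched as a small shell function that tallies files by permission string. This is only an illustration: `mode_histogram` is a made-up helper name, and the sketch assumes file names without spaces and an `ls` that accepts `-L` (follow symbolic links).

```shell
# mode_histogram DIR (hypothetical helper, not part of any Oracle tooling):
# count the regular files directly inside DIR by permission string,
# most common mode first. ls -L follows symlinks, so a link is counted
# under the mode of the file it points to.
mode_histogram() {
    ls -lL "$1" |
        awk '$1 ~ /^-/ { count[substr($1, 1, 10)]++ }
             END { for (mode in count) print count[mode], mode }' |
        sort -rn
}
```

Running `mode_histogram "$ORACLE_HOME/bin"` on each node and comparing the two outputs should make a gap like 115 versus 8 stand out immediately.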
When a peer had the very same problem, and I knew he hadn’t been making permission changes, I started to wonder if we had a defective root.sh. I had another RAC pair to upgrade. The clusterware was done, but the DBMS was not. Throughout the DBMS install process I ran
ls -lc | grep -c -- -rwxr-x--x
on the remote node. During the “Remote operations” part of the install, the count dropped from 115 to 8. I then ran
find . -ctime -1 -exec chmod 751 {} \;
This left me with 116 binaries, so I ran the command
ls -lc | grep -v -- -rwxr-x--x
on each node and discovered that onsctl should be set to 711.
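Rather than counting by eye, one way to catch this kind of change is to record the permissions before the install and diff them afterward. The following is a minimal sketch under the same assumptions as the commands above (run against the clusterware bin directory, file names without spaces); `perm_snapshot` and `perm_changes` are hypothetical names chosen for illustration.

```shell
# perm_snapshot DIR: print "permission-string filename" for each regular
# file directly inside DIR (symlinks are followed via ls -L), sorted by name.
perm_snapshot() {
    ls -lL "$1" | awk '$1 ~ /^-/ { print substr($1, 1, 10), $NF }' | sort -k2
}

# perm_changes BEFORE AFTER: show only the snapshot lines that differ.
# Lines starting with "<" come from BEFORE, lines starting with ">" from AFTER.
perm_changes() {
    diff "$1" "$2" | grep '^[<>]'
}
```

A possible workflow: save `perm_snapshot "$ORACLE_HOME/bin" > /tmp/perm.before` ahead of the DBMS install, take a second snapshot once the "Remote operations" step has finished, and pass both files to `perm_changes`. The `<` lines then list the modes to restore, including one-offs such as onsctl at 711.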
I was feeling pretty smug at this point. This never works for me. Without fail I end up discovering I’ve got dog poop on my shoe. In this case, the dog poop was racgwrap, which is a symbolic link that points to $ORACLE_HOME/racg/admin/racgwrap. Once this was set back to 751, my humility was restored and so was the functionality of the remote node.
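Because racgwrap is a symbolic link, `ls -l` in the bin directory shows the link's own mode rather than the mode of the file under racg/admin, and a find that starts in bin presumably never visits that target. A hedged sketch for checking what the links actually resolve to follows; `link_target_modes` is a hypothetical helper, and it assumes paths without spaces.

```shell
# link_target_modes DIR: for every symbolic link directly inside DIR,
# print the permission string of the file the link points to, followed
# by the link's path. ls -L follows the link, so the mode shown belongs
# to the target, not to the link itself.
link_target_modes() {
    for entry in "$1"/*; do
        if [ -L "$entry" ]; then
            ls -ldL "$entry" | awk '{ print substr($1, 1, 10), $NF }'
        fi
    done
}
```

Calling `link_target_modes "$ORACLE_HOME/bin"` on both nodes would show whether a link such as racgwrap resolves to a 750 file on one node and a 751 file on the other.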
I have since been told that this problem is documented somewhere in Metalink, but I have not been able to locate it.
While removing world execute from binaries is considered a good security practice, you really don’t want to apply it to your clusterware binaries!