Basic shell commands to keep an eye on the health of an HP Red Hat Enterprise Linux server
Charge by the hour I may, but it does neither me nor the client any good to be rebuilding a live server while the users have nothing better to do than look for someone else to spend their money with. There are loads of tools and frameworks out there to keep a watchful eye on server health, but for monitoring one or two live servers for a small business they’re a bit of a sledgehammer to crack a nut. And who monitors the monitoring system?
These days most such servers are hosted by data-centres which will be monitoring their health anyway. And increasingly they’re virtualised too, so hardware monitoring isn’t something you need to worry about (or can do much about). But for physical servers, relying entirely on the hosting provider’s diligence puts the stability and availability of your live systems outside your direct control. Anyone who has been round the block a few times in this business knows that this is going to go wrong sooner or later. After all, everything does!
So I always like to set up a simple script to “keep a silent watch and ward”, to slightly mis-quote Gilbert and Sullivan’s “The Yeomen of the Guard”. Porting a client’s Red Hat Enterprise Linux server from Dell to HP hardware recently had me digging around for a new set of commands to use.
Keeping it simple
On the old Dell server I had a simple hourly cron job set up which used the rather nifty omreport command to check for any critical events in the Dell alert log. This command generates a series of semi-colon separated lines like the following:
# omreport system alertlog -fmt ssv
Ok;2271;Thu Nov 29 02:22:00 2012;Storage Service; The Patrol Read corrected a media error.: Physical Disk 0:0:1 Controller 0, Connector 0
Ok;2242;Wed Nov 28 21:20:17 2012;Storage Service; The Patrol Read has started.: Controller 0 (PERC 6/i Integrated)
Ok;2358;Fri Nov 23 09:37:50 2012;Storage Service; The battery charge cycle is complete.: Battery 0 Controller 0
Grepping for lines starting Critical and emailing the matches provides a quick and simple heads-up if anything is going awry:
# omreport system alertlog -fmt ssv | grep "^Critical"
Critical;2272;Wed Nov 21 23:41:34 2012;Storage Service; Patrol Read found an uncorrectable media error.: Physical Disk 0:0:1 Controller 0, Connector 0
Critical;1054;Tue Apr 10 03:08:26 2012;Instrumentation Service; Temperature sensor detected a failure value
Further grepping for only those messages generated in the last couple of hours or the current day is enough to keep these alerts focussed without jumping through too many scripting hoops, though creating a timestamp file and only emailing alerts with a more recent timestamp would be an easy way of avoiding duplicates.
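As a rough illustration of avoiding duplicate emails, here is a minimal sketch. Note that it swaps the timestamp file for a slightly different technique: it keeps a file of alert lines already reported and prints only the ones it has not seen before. The function name new_critical_alerts, the SEEN_FILE variable and its default path are all made up for this example.

```shell
#!/bin/sh
# Sketch: report each Critical alert only once by remembering what has
# already been sent. SEEN_FILE and its default path are hypothetical.
SEEN_FILE="${SEEN_FILE:-/var/tmp/alertlog.seen}"

# Reads semicolon-separated alert lines on stdin and prints the Critical
# ones that are not yet recorded in SEEN_FILE, then records them.
new_critical_alerts() {
    touch "$SEEN_FILE"
    if [ -s "$SEEN_FILE" ]; then
        # -F fixed strings, -x whole-line match, -v keep only unseen lines.
        # "|| true" because grep exits non-zero when nothing is left.
        new=$(grep '^Critical' | grep -vxFf "$SEEN_FILE" || true)
    else
        # Nothing recorded yet, so every Critical line is new.
        new=$(grep '^Critical' || true)
    fi
    [ -n "$new" ] || return 0
    printf '%s\n' "$new" | tee -a "$SEEN_FILE"
}

# Intended use from cron (needs omreport installed, so left commented out):
#   omreport system alertlog -fmt ssv | new_critical_alerts
```

Anything the function prints can then be piped on to whatever sends the email. The seen file grows over time, so on a noisy server it would need pruning now and again.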
The HP way
A bit of light Googling found an HP equivalent, the hplog command, though its output is slightly less grep-friendly because each entry is spread over more than one line:
# hplog -v
ID Severity Initial Time Update Time
-------------------------------------------------
0000 Information 21:45 09/26/2012 21:45 09/26/2012
LOG: Maintenance note: IML cleared through hpasmcli

0001 Critical 06:22 09/27/2012 06:22 09/27/2012
LOG: System Power Supply: Input Power Loss.
As I’ve less history with this new server, I decided to capture all Caution and Critical messages and use awk’s custom record separators to handle the multi-line output:
# hplog -v | awk 'BEGIN{RS="\n\n"; ORS="\n\n"} ($0 ~ "Caution" || $0 ~ "Critical") { print $0 }'
0001 Critical 06:22 09/27/2012 06:22 09/27/2012
LOG: System Power Supply: Input Power Loss or Unplugged Power Cord.

0002 Caution 00:03 07/15/2012 00:03 07/15/2012
LOG: POST Error: 1785-Slot X Drive Array Not Configured
Grepping for a date doesn’t work here because of the multi-line output, but passing in a partial date for awk to search for does the same thing:
hplog -v | awk -v checkDate="$(date +%m/%d/%Y)" 'BEGIN{RS="\n\n"; ORS="\n\n"} $0 ~ checkDate && ($0 ~ "Caution" || $0 ~ "Critical") {print $0}'
Grepping for a specific day’s date, or a specific hour, always leaves a hole between the last cron run and the end of the period you’re grepping for, and anything logged in that hole is never reported. So it’s prudent to search for the previous hour or previous day too.
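One way to cover both today and yesterday in a single pass is sketched below. It is only an illustration: the function name recent_hp_alerts is made up, and it assumes the blank-line-separated layout shown in the hplog output above. It uses RS="" (awk’s standard paragraph mode) rather than RS="\n\n", since a multi-character RS is not guaranteed to work in every awk.

```shell
#!/bin/sh
# Sketch: print hplog records (blank-line separated, read from stdin) that
# are Caution or Critical and mention any of the dates given as arguments.
# recent_hp_alerts is a hypothetical name, not part of the HP tools.
recent_hp_alerts() {
    awk -v dates="$*" '
        BEGIN {
            RS = ""                      # paragraph mode: one record per block
            ORS = "\n\n"                 # keep a blank line between records
            n = split(dates, want, " ")  # dates arrive space-separated
        }
        /Caution|Critical/ {
            for (i = 1; i <= n; i++)
                if (index($0, want[i])) { print; break }
        }'
}

# Intended use from cron (needs hplog, and "date -d yesterday" assumes GNU
# date, so left commented out):
#   hplog -v | recent_hp_alerts "$(date +%m/%d/%Y)" "$(date -d yesterday +%m/%d/%Y)"
```

Checking yesterday as well as today means the same record can turn up in more than one run, so this pairs naturally with some form of duplicate suppression.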
One nice feature of hplog is that it lets you write your own messages to the alert log:
hplog -s <severity> -l "<custom message>"
So I can test my scripts on a previously untroubled system:
# hplog -s Caution -l "Test caution message"
0012 Caution 11:01 12/01/2012 11:01 12/01/2012 0001
LOG: Maintenance note: Test caution message
Belt and braces
I don’t have historical logs on this system to prove that disk health problems will be recorded in this log file. While I presume they will be, I decided to hunt down a command which would specifically check the health of the disks. The somewhat obscurely named hpacucli seems to give me what I need:
# hpacucli ctrl all show config
Smart Array P420i in Slot 0 (Embedded) (sn: XXXXXXXXXXXXXXXX)
array A (SAS, Unused Space: 0 MB)
logicaldrive 1 (279.4 GB, RAID 1, OK)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 300 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 300 GB, OK)
A simple grep of the output for physicaldrive or logicaldrive lines which don’t contain OK, with any matches included in my email, seems to do the trick:
hpacucli ctrl all show config | grep -E "(logicaldrive|physicaldrive)" | grep -v "OK"
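A slightly stricter variant is sketched below, in case the letters OK ever turn up elsewhere on a drive line. It assumes, going only by the sample output above, that the status is always the last item before the closing bracket. The function name unhealthy_drives is made up for this example.

```shell
#!/bin/sh
# Sketch: print logicaldrive/physicaldrive lines whose final status field
# (the text between the last comma and the closing bracket) is not "OK".
# unhealthy_drives is a hypothetical name, not part of hpacucli.
unhealthy_drives() {
    awk '/logicaldrive|physicaldrive/ && !/, OK\)[ \t]*$/'
}

# Intended use from cron (needs hpacucli installed, so left commented out):
#   hpacucli ctrl all show config | unhealthy_drives
```

On a healthy system this prints nothing, so an empty result means there is nothing to add to the email.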
Powerful monitoring tools are great, but they take time to manage and configure, and the more complex they are, the more points of failure they themselves introduce. A few lines of shell script running on a regular cron is a comforting cross-check to have in place, and sometimes really as much as you need.