The Node Health Check (NHC) system, formerly distributed with the Warewulf cluster management suite, is a convenient way to deal with nodes which fail in some way, since it is able to act directly as a load sensor. (The canonical way to deal with problems is to cause an alarm load level on the relevant queue instance to prevent jobs being scheduled to it.)
Note that you will need version 1.4.1 or later (or a very old one) for NHC to work correctly with SGE. Also, the check_ps_unauth_users check doesn’t support SGE yet, though check_ps_userproc_lineage does.
There is a potential problem in that the SGE loop leaks memory in some
versions of bash, such as version 4.1.2-33 from RHEL6. The version
packaged (at the time of writing) under
https://loveshack.fedorapeople.org/copr/warewulf-nhc-1.4.2-1.el6_liv1.src.rpm
has a change to bail out if it seems to be leaking, at which point
execd will restart it; see also the patch on
github. (That also has a change to avoid spamming
syslog on each run for continuing issues, also with a patch on
github.) If you have a leaky bash, you might want to build bash 4.3, install it as, say, bash43, and modify the nhc script to run under it.
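One way to make that last change without editing the packaged file is to write a copy with a different interpreter line. A minimal sketch, assuming the nhc script starts with a `#!/bin/bash` line, that it is installed as /usr/sbin/nhc, and that the newer bash lives at /usr/local/bin/bash43; all three paths and the helper name are assumptions, not something the package provides:

```shell
# Hypothetical helper: print FILE to stdout with the interpreter on its
# first line replaced by INTERP. Only a "#!...bash" first line is changed.
with_interpreter() {
    file=$1
    interp=$2
    sed '1s|^#!.*bash.*|#!'"$interp"'|' "$file"
}

# Example use (paths are assumptions, adjust to your installation):
#   with_interpreter /usr/sbin/nhc /usr/local/bin/bash43 > /usr/local/sbin/nhc-bash43
#   chmod +x /usr/local/sbin/nhc-bash43
```

The copy can then be used wherever the original nhc script would be, leaving the packaged file untouched for upgrades.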
Load sensor configuration
- Assuming admin_user isn’t root (and think two or three times if it is), you may need to override root-only permissions on the configuration directory. This applies specifically when installing from an RPM built with the distributed spec file, which assumes that NHC runs as root and that the configuration might be sensitive:
  # chmod -R +r /etc/nhc
- As of version 1.4.1, the resource manager should be auto-detected as SGE, assuming SGE_ROOT is correctly defined in the environment, as usual. With an earlier version, unless the SGE binaries are on the execd’s PATH, it will be necessary to add
  * || export NHC_RM=sge
  to /etc/nhc/nhc.conf explicitly.
- You probably want to define (with qconf -mc) a complex to select on (un)healthy nodes:
  # qconf -sc | grep healthy
  healthy    healthy    BOOL    ==    YES    NO    0    0
- Add a load threshold to all the relevant queues, e.g.
  # qconf -aattr queue load_thresholds healthy=0 all.q
  # qconf -sq all.q | grep load
  load_thresholds       np_load_avg=1.25,healthy=0
  i.e. generate an alarm condition on the queue instance when NHC detects a problem and reports the healthy complex as false.
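For the healthy and diagnosis values to be reported at all, execd has to run NHC as a load sensor. In SGE that is normally done with the load_sensor parameter of the global or host configuration (see sge_conf(5)). A minimal sketch for checking what is configured; the helper name and the /usr/sbin/nhc path are assumptions:

```shell
# Hypothetical helper: read "qconf -sconf" output on stdin and print the
# value of the load_sensor parameter, if one is set.
configured_load_sensor() {
    awk '$1 == "load_sensor" { print $2 }'
}

# Example use on a live cluster (run as an SGE manager):
#   qconf -mconf global          # set: load_sensor /usr/sbin/nhc
#   qconf -sconf global | configured_load_sensor
```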
Now you can find alarming instances with qselect -qs a and check the reported problem on a node with qconf, looking for the diagnosis load parameter:
# SGE_SINGLE_LINE=1 qconf -se comp530 | tr ',' '\n' | grep diag
diagnosis=NHC: Ethernet device ib1 not detected.
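The two commands can be combined to list the diagnosis for every host with an alarming queue instance. This is a rough sketch with hypothetical helper names. Like the pipeline above, it splits on commas, so a diagnosis that itself contains a comma would be cut short:

```shell
# Hypothetical helper: extract the diagnosis load value from the text that
# "SGE_SINGLE_LINE=1 qconf -se HOST" prints (read from stdin).
diagnosis_from_qconf() {
    tr ',' '\n' | sed -n 's/^.*diagnosis=//p'
}

# Hypothetical helper: turn queue instance names (queue@host), as printed
# by "qselect -qs a", into a unique list of host names.
hosts_from_instances() {
    sed 's/^[^@]*@//' | sort -u
}

# Example use on a live cluster:
#   qselect -qs a | hosts_from_instances | while read -r host; do
#       diag=$(SGE_SINGLE_LINE=1 qconf -se "$host" </dev/null | diagnosis_from_qconf)
#       printf '%s: %s\n' "$host" "$diag"
#   done
```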
Note: It could take up to twice load_report_time for the alarm to propagate, since load sensors are called asynchronously.
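To put a number on that bound, the configured load_report_time can be converted to seconds and doubled. A small sketch with hypothetical helper names; for example, a load_report_time of 00:00:40 gives a worst case of 80 seconds:

```shell
# Hypothetical helper: convert an SGE HH:MM:SS time value to seconds.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Upper bound on the propagation delay stated above: 2 * load_report_time.
max_alarm_delay() {
    echo $(( 2 * $(hms_to_seconds "$1") ))
}

# Example use on a live cluster:
#   max_alarm_delay "$(qconf -sconf global | awk '$1 == "load_report_time" { print $2 }')"
```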
The distributed Nagios plugin could alert you to the alarm.
Configuring tests
You can test an NHC configuration by running nhc by hand with an empty line on its standard input, e.g.
$ echo | nhc -c /etc/nhc/nhc-test.conf
begin
comp530:healthy:true
comp530:diagnosis:HEALTHY
end
$
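The begin/end block is the SGE load sensor report format: execd writes a line to the sensor's standard input each time it wants a report, and sends quit to stop it. That is why piping an empty line from echo produces exactly one report. A minimal sketch of such a loop follows; it is not NHC's actual code, and run_checks is a hypothetical stand-in for the real tests:

```shell
# Hypothetical stand-in for the real health checks: print a diagnosis and
# return non-zero if the node is unhealthy.
run_checks() {
    echo "HEALTHY"
    return 0
}

# Minimal load sensor loop: one report per input line, stop on "quit" or EOF.
load_sensor_loop() {
    host=$1
    while read -r request; do
        [ "$request" = "quit" ] && break
        if diagnosis=$(run_checks); then healthy=true; else healthy=false; fi
        echo "begin"
        echo "$host:healthy:$healthy"
        echo "$host:diagnosis:$diagnosis"
        echo "end"
    done
}

# Example use, mirroring the nhc test above:
#   echo | load_sensor_loop "$(hostname)"
```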
Since the load sensor typically won’t run as root, it may be necessary to use a mechanism like sudo if tests need privileges to work, or if you want to take an action like killing processes.
If you use the check_ps_userproc_lineage NHC check, make sure the uid of admin_user is covered by the MAX_SYS_UID parameter, or modify the test. As of version 1.4.1, RM_DAEMON_MATCH is configured by default to check processes not spawned by sge_execd or sge_shepherd. (The process tree starts at the shepherd if the execd is restarted, and the execd can, for instance, spawn mail as the job owner.)
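A quick way to check the first point is to compare the uid with the limit. The sketch below assumes that "covered" means the uid is less than or equal to MAX_SYS_UID, which is my understanding of how NHC exempts system accounts. The account name sgeadmin, the value 999 and the log and syslog action arguments are illustrative assumptions to verify against your NHC version's documentation:

```shell
# Hypothetical helper: succeed if UID is exempt from the lineage check,
# i.e. is not greater than MAX_SYS_UID.
uid_is_exempt() {
    [ "$1" -le "$2" ]
}

# Example use (sgeadmin and 999 are assumptions):
#   uid_is_exempt "$(id -u sgeadmin)" 999 || echo "raise MAX_SYS_UID in nhc.conf"
#
# Corresponding nhc.conf lines, in the same "target || command" form as the
# NHC_RM setting above:
#   * || export MAX_SYS_UID=999
#   * || check_ps_userproc_lineage log syslog
```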