The Node Health Check (NHC) system, formerly distributed with the Warewulf cluster management suite, is a convenient way to deal with nodes which fail in some way, since it is able to act directly as a load sensor. (The canonical way to deal with problems is to cause an alarm load level on the relevant queue instance to prevent jobs being scheduled to it.)

Note that you will need version 1.4.1 or later (or a very old one) for NHC to work correctly with SGE. Also, the check_ps_unauth_users check doesn’t support SGE yet, though check_ps_userproc_lineage does.

There is a potential problem in that the SGE loop leaks memory in some versions of bash, such as version 4.1.2-33 from RHEL6. The version packaged (at the time of writing) under https://loveshack.fedorapeople.org/copr/warewulf-nhc-1.4.2-1.el6_liv1.src.rpm has a change to bail out if it seems to be leaking, at which point execd will restart it; see also the patch on GitHub. (That package also has a change to avoid spamming syslog on each run for continuing issues, again with a patch on GitHub.) If you have a leaky bash, you might want to build bash 4.3, install it as, say, bash43, and modify the nhc script to run under it.

Load sensor configuration

  • Assuming admin_user isn’t root — and think two or three times if it is — you may need to override root-only permissions on the configuration directory, specifically if installing from an RPM built with the distributed spec file, which assumes that NHC runs as root and the configuration might be sensitive:

    chmod -R +r /etc/nhc

  • As of version 1.4.1, the resource manager should be auto-detected as SGE, assuming SGE_ROOT is correctly defined in the environment, as usual. With an earlier version, it will be necessary to add

     * || export NHC_RM=sge

    to /etc/nhc/nhc.conf explicitly, unless the SGE binaries are on the execd’s PATH.

  • You probably want to define (qconf -mc) a complex to select on (un)healthy nodes:

    # qconf -sc | grep healthy
      healthy             healthy      BOOL       ==      YES         NO         0        0

  • Add a load threshold to all the relevant queues, e.g.

    # qconf -aattr queue load_thresholds healthy=0 all.q
    # qconf -sq all.q | grep load
    load_thresholds       np_load_avg=1.25,healthy=0

    i.e. generate an alarm condition on the queue instance when NHC detects a problem and reports the healthy complex as false.
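
When several queues need the threshold, the qconf -aattr call shown above can be wrapped in a loop. The following is only a minimal sketch: the function name add_healthy_threshold and the queue name short.q are made up for illustration, and it assumes the healthy complex has already been defined.

```shell
# Hypothetical helper: add the healthy=0 load threshold to each queue
# named on the command line. Assumes the "healthy" complex already exists.
add_healthy_threshold() {
    for q in "$@"; do
        # Same command as in the text, applied to one queue at a time.
        qconf -aattr queue load_thresholds healthy=0 "$q" || return 1
    done
}

# Usage (queue names are examples only):
#   add_healthy_threshold all.q short.q
```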

Now you can find alarming instances with

qselect -qs a

and check the reported problem on a node with qconf, looking for the diagnosis load parameter:

# SGE_SINGLE_LINE=1 qconf -se comp530 | tr ',' '\n' | grep diag
diagnosis=NHC: Ethernet device ib1 not detected.

Note
It could take up to twice load_report_time for the alarm to propagate, since load sensors are called asynchronously.
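
To look at several nodes at once, the qconf pipeline above can be put in a small function. This is a sketch, not part of NHC or SGE: the names extract_diagnosis and report_diagnoses are hypothetical, and the usage line assumes that qselect prints queue instances in the form queue@host.

```shell
# Extract the diagnosis load value from `qconf -se` output on stdin.
# Like the pipeline in the text, this splits on commas, so a diagnosis
# that itself contains a comma would be cut short.
extract_diagnosis() {
    tr ',' '\n' | sed -n 's/^.*diagnosis=//p'
}

# Print "host: diagnosis" for each host given as an argument.
report_diagnoses() {
    for host in "$@"; do
        diag=$(SGE_SINGLE_LINE=1 qconf -se "$host" | extract_diagnosis)
        printf '%s: %s\n' "$host" "${diag:-no diagnosis reported}"
    done
}

# Usage, assuming qselect prints lines like all.q@comp530:
#   report_diagnoses $(qselect -qs a | sed 's/^.*@//' | sort -u)
```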

The distributed Nagios plugin could alert you to the alarm.

Configuring tests

You can test an NHC configuration by running nhc by hand with an empty line on standard input, for example:

$ echo | nhc -c /etc/nhc/nhc-test.conf
begin
comp530:healthy:true
comp530:diagnosis:HEALTHY
end
$
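
When trying out several test configurations, it can be handy to turn such a report into an exit status. The sketch below assumes the report has the begin/end layout shown above and that an unhealthy node is reported as healthy:false; the function name report_is_healthy is hypothetical.

```shell
# Read a load sensor report (begin ... end) on stdin and succeed only
# if it contains a host:healthy:true line inside the report.
report_is_healthy() {
    awk -F: '
        $0 == "begin" { inside = 1; next }
        $0 == "end"   { inside = 0; next }
        inside && $2 == "healthy" { ok = ($3 == "true") }
        END { exit (ok ? 0 : 1) }
    '
}

# Usage:
#   echo | nhc -c /etc/nhc/nhc-test.conf | report_is_healthy \
#       && echo "test configuration reports a healthy node"
```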

Since the load sensor typically won’t run as root, it may be necessary to use a mechanism like sudo if tests need privileges to work, or if you want to take an action like killing processes.

If you use the check_ps_userproc_lineage NHC check, make sure the uid of admin_user is covered by the MAX_SYS_UID parameter, or modify the test. As of version 1.4.1, RM_DAEMON_MATCH is configured by default to check processes not spawned by sge_execd or sge_shepherd. (The process tree starts at the shepherd if the execd is restarted, and the execd can, for instance, spawn mail as the job owner.)
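
A quick way to check the MAX_SYS_UID condition is to compare the uid numerically. This sketch assumes that NHC treats uids less than or equal to MAX_SYS_UID as system accounts; the function name uid_is_system, the account name sgeadmin and the value 99 are illustrative only, so substitute your own admin_user and configured limit.

```shell
# Succeed if uid ($1) is within the limit ($2), i.e. would be treated
# as a system account under the assumption stated above.
uid_is_system() {
    [ "$1" -le "$2" ]
}

# Usage (sgeadmin and 99 are example values, not defaults to rely on):
#   uid_is_system "$(id -u sgeadmin)" 99 \
#       || echo "admin_user is not covered by MAX_SYS_UID"
```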