Topic:

Set up a terminate_method in a queue configuration to kill orphaned processes on master and slave nodes, or use Linux facilities to do so.

Author:

Reuti, reuti__at__staff.uni-marburg.de; Philipps-University of Marburg, Germany

Version:

1.0 -- 2005-11-22 Initial Release
1.1 -- 2012-07-09 Add proc_police and expand somewhat (Dave Love)
1.2 -- 2012-07-24 Mention cpusets

Note:

The original method in this HOWTO should only be applied when no other means has been able to remove orphaned processes on slave nodes for a parallel job. The "proc_police" method below is unintrusive and worth running generally to avoid any problems with rogue processes, assuming all parallel jobs are meant to be tightly integrated. The cpuset technique should allow loosely integrated jobs to work and be cleaned up.


Symptom of this behavior

When using e.g. MPICH2 with the mpd startup method, it is reported that under certain circumstances, some processes will not be removed by a qdel issued for a job or when the job ends abnormally. These processes compete for memory and CPU with subsequent jobs, typically seen as the load level shown by qhost being significantly higher than the number of slots scheduled on the host (assuming no over-subscription).

Note that the MPICH MPD startup is outdated, but the problem is a general one, and rogue processes are seen under some circumstances even with properly tightly-integrated MPIs.


Explanation

Each task on a slave node is a child of the started mpd, which is unique for a job and user. Whether any processes were left behind that SGE did not remove can be checked with the command:
$ ps f -eo pid,uid,gid,user,pgrp,command

This might happen as a result of a race condition in which the mpd has already left the nodes, so that the sge_shepherd for this job thinks everything was shut down properly. Instead, some processes might be left over and are now bound to the init process as parent. Despite the fact that these processes have the additional group ID attached in the correct way, SGE never tries to remove them, even when the SGE configuration has a setting of:

$ qconf -sconf
...
execd_params ENABLE_ADDGRP_KILL=TRUE
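
To narrow the ps listing above to likely leftovers, one can filter for processes that have been re-parented to init (PPID 1). The following is a minimal sketch and not part of SGE; the function name list_orphans and its optional user argument are assumptions made for illustration. Daemons also have init as parent, so the output is only a list of candidates to inspect.

```shell
#!/bin/sh
# list_orphans [USER]: read "pid ppid user pgrp command..." lines on stdin
# and print those whose parent is init (PPID 1), optionally restricted
# to one user. A ps header line is dropped because its PPID field is not 1.
list_orphans()
{
    awk -v user="$1" '$2 == 1 && (user == "" || $3 == user)'
}
```

It could be fed from ps on an execution host, for example with "ps -eo pid,ppid,user,pgrp,args | list_orphans someuser", where someuser stands for the job owner.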

Solutions

Original method

This HOWTO targets only Linux-like platforms. For other operating systems, a corresponding script must be implemented.
To get a list of eligible processes, the removal of the mpd needs to be delayed on all nodes, so that the routine defined below can read the information about the additional group ID and then send all matching processes a kill signal:
#!/bin/sh
#
# Define a routine to get all kids for this job
#
getkids()
{
    group=`cat "$SGE_JOB_SPOOL_DIR/addgrpid"`
    for process in /proc/[0-9]*; do
        # Print the PID if the job's additional group ID is in the "Groups:" line
        awk '/^Groups/ { for (i=2;i<=NF;i++) if ($i==group) { print process }}' \
            process=${process##*/} group=$group $process/status 2>/dev/null
    done
}

#
# Main call
#

sleep 5
getkids | xargs kill -9
exit 0

When such a script is stored e.g. as /usr/sge/cluster/killkids.sh, it can be defined for a particular queue with:

$ qconf -sq all.q
...
terminate_method /usr/sge/cluster/killkids.sh

and will be invoked for all jobs in this queue. As the default terminate_method is overridden, this routine is now solely responsible for removing all processes of the job.
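
Before making the script the terminate_method of a production queue, it may be worth checking which processes it would select. The sketch below repeats the /proc scan of getkids as a function that only prints PIDs and takes the group ID as an argument. The name pids_in_group and the optional second argument (an alternative proc directory, useful for testing) are assumptions made for illustration.

```shell
#!/bin/sh
# pids_in_group GID [PROCDIR]: print the PID of every process whose
# "Groups:" line in PROCDIR/<pid>/status contains GID. Nothing is killed.
pids_in_group()
{
    gid=$1
    procdir=${2:-/proc}
    for dir in "$procdir"/[0-9]*; do
        [ -r "$dir/status" ] || continue    # process gone or no match for the glob
        awk -v gid="$gid" -v pid="${dir##*/}" '
            /^Groups:/ { for (i = 2; i <= NF; i++) if ($i == gid) { print pid; exit } }
        ' "$dir/status" 2>/dev/null
    done
}
```

Called as "pids_in_group 20017" on an execution host, with 20017 standing in for the value found in the job's addgrpid file, it lists the processes the kill script would target without signalling any of them.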

Alternative for Linux 2.6.15+ (proc_police)

A better solution, on Linux kernels with the "proc connector" available (from 2.6.15) and configured, is Brian Bockelman's proc_police; see also its Trac site and the connector tutorial referenced in the code.

Unfortunately, its packet filtering has broken in recent Red Hat kernels (observed with 2.6.18-308, at least), which makes the daemon typically fail almost immediately. A source RPM of version 0.0.4 is available patched with a workaround. After installing it and enabling the init script with chkconfig(1), edit /etc/sysconfig/proc_police to set

PROCPOLICE_ROOTNAME=sge_execd

and start the service. Since the proc_police service should be started and stopped around the execd one, you probably want to invoke it from the execd init script in case execd is restarted. With other distributions, try building the vanilla tarball with the appropriate init method and configuration for the distribution, and only apply the patch if necessary.
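
The edit to /etc/sysconfig/proc_police could be scripted roughly as follows. This is a sketch under assumptions: the helper name set_sysconfig is made up for illustration, and it expects values that contain neither "|" nor "&", since both are special to the sed substitution it uses.

```shell
#!/bin/sh
# set_sysconfig FILE KEY VALUE: set KEY=VALUE in a sysconfig-style file,
# replacing an existing assignment of KEY or appending one if absent.
set_sysconfig()
{
    file=$1
    key=$2
    value=$3
    touch "$file" || return 1
    if grep -q "^$key=" "$file"; then
        # Rewrite through a temporary file and copy the result back,
        # so that ownership and permissions of FILE are kept.
        tmp=$(mktemp) || return 1
        sed "s|^$key=.*|$key=$value|" "$file" > "$tmp" && cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        echo "$key=$value" >> "$file"
    fi
}
```

A call such as "set_sysconfig /etc/sysconfig/proc_police PROCPOLICE_ROOTNAME sge_execd" would then be followed by enabling and starting the service, e.g. "chkconfig proc_police on" and "service proc_police start", assuming the init script installed by the RPM is named proc_police.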

Similar mechanisms may be available in other operating systems, such as kqueue/kevent in *BSD, which would allow tools similar to proc_police to be written, but none are currently known.

Using cpusets

The best solution is to use kernel facilities intended for the purpose. SGE version 8.1.2 and above can use Linux cpusets. This provides containment for processes started under the shepherd, and execd will kill any which are detached from the process tree at the end of the job. It works on Red Hat 5 and later (and possibly earlier Linux versions, starting sometime after 2.6.9).

To enable cpuset usage, the cpuset filesystem must be mounted and populated suitably. The script util/resources/scripts/setup-cgroups-etc does that. (Recent GNU/Linux distributions may mount the cgroups version of cpusets by default, under a different mount point, but the code currently assumes /dev/cpuset.) At present cpuset usage must also be configured by setting USE_CGROUPS=true in the execd_params of sge_conf(5).
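
Whether a cpuset hierarchy is already mounted at /dev/cpuset can be checked by reading /proc/mounts. A minimal sketch follows; the function name cpuset_mounted is an assumption, and it accepts both a native cpuset mount and a cgroup mount carrying the cpuset option, on the assumption that kernels differ in how they report such a mount.

```shell
#!/bin/sh
# cpuset_mounted [MOUNTPOINT] [MOUNTS_FILE]: succeed if MOUNTPOINT
# (default /dev/cpuset) holds a cpuset filesystem, either as type
# "cpuset" or as type "cgroup" with "cpuset" among its mount options.
cpuset_mounted()
{
    awk -v mp="${1:-/dev/cpuset}" '
        $2 == mp && ($3 == "cpuset" || ($3 == "cgroup" && ("," $4 ",") ~ /,cpuset,/)) { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}
```

For example, "cpuset_mounted || echo 'no cpuset filesystem at /dev/cpuset'" gives a quick indication of whether the setup script still needs to be run on a host.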

Warnings will be printed in the execd messages when rogue processes are found and killed. Their command lines are printed as info messages for debugging problematic jobs, i.e. you get them if the log level is "log_info". For example, running

   qrsh daemonize /bin/sleep 30
logs the messages
   ...|W|rogue process(es) found for task 259.1
   ...|I|rogue: /bin/sleep 30
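
To review such reports after the fact, the execd messages file can be searched for the two message forms shown above. A minimal sketch; the function name rogue_reports is an assumption, and since the location of the messages file depends on the execd spool directory of the installation, the file is passed as an argument.

```shell
#!/bin/sh
# rogue_reports FILE: print the warning and info lines that execd logs
# for rogue processes, matching the "|W|rogue process" and "|I|rogue: "
# forms quoted above.
rogue_reports()
{
    grep -E '\|W\|rogue process|\|I\|rogue: ' "$1"
}
```

The info lines only appear in the output if they were logged in the first place, that is, with the log level set to "log_info".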