This is a somewhat-random collection of commonly-required configuration recipes. It is written mainly from the point of view of high performance computing clusters, and some of the configuration suggestions may not be relevant in other circumstances or for old versions of gridengine. See also the other howto documents. Suggestions for additions/corrections are welcome (to d.love @ liverpool.ac.uk).
Script Execution
Unix behaviour
Set shell_start_mode to unix_behavior in your queue configurations to get the normally-expected behaviour of starting job scripts with the shell specified in the initial #! line.
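For example, assuming a cluster queue named all.q (adjust to your own queue names), the attribute can be changed directly:
$ qconf -mattr queue shell_start_mode unix_behavior all.q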
Modules environment
A side-effect of unix_behavior is usually not getting the normal login environment; in particular the module command is not available at sites that use environment modules. At least for use with the bash shell, add the following to the site sge_request file to avoid users having to source modules.sh etc. in job scripts, assuming the sessions from which jobs are submitted have modules available:
-v module -v MODULESHOME -v MODULEPATH
This may not work with other shells.
Parallel Environments
Heterogeneous/Isolated Node Groups
Suppose you have various sets of compute nodes over which you want to run parallel jobs, but each job must be confined to a specific set. Possible reasons are that you have
- significantly heterogeneous nodes (different CPU speeds, architectures, core numbers, etc.),
- groups of nodes which have restricted access, e.g. dedicated to a user group, controlled by an ACL,
- or islands of connectivity on your MPI fabric(s) which are either actually isolated or have slow communication boundaries over switch layers.
Then you’ll want to define multiple parallel environments and host groups. There will typically be one PE and one host group (with possibly an ACL) for each node type or island. The PEs will all be the same, unless you want a different fixed allocation_rule for each, but with different names. The names need to be chosen so that you can conveniently use wildcard specifications for them; normally they will all have the same base, e.g. mpi-…. As an example, for different architectures, with different numbers of cores which get exclusive use for any tightly integrated MPI:
$ qconf -sp mpi-8
pe_name            mpi-8
slots              99999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    8
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE
$ qconf -sp mpi-12
pe_name            mpi-12
slots              99999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    12
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE
with corresponding host groups @quadcore and @hexcore for each type of dual-processor box. Those PEs are assigned one-to-one to host groups, ensuring that jobs can’t run across the groups (since parallel jobs are always granted a unique PE by the scheduler, whereas they can be split across queues).
$ qconf -sq parallel
...
seq_no                2,[@quadcore=3],[@hexcore=4],...
...
pe_list               NONE,[@quadcore=make mpi-8 smp],[@hexcore=make mpi-12 smp],...
...
slots                 0,[@quadcore=8],[@hexcore=12],...
...
Now the PE naming comes in useful, since you can submit to a wildcarded PE, -pe mpi-*, if you’re not fussy about the PE you actually get. See Wildcarded PEs below for the next step.
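For example (slot count and script name purely illustrative; quote the pattern so the shell doesn’t expand it):
$ qsub -pe 'mpi-*' 24 myjob.sh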
Suppose you want to retain the possibility of running across all the PEs (assuming they’re not isolated). Then you can define an extra PE, say allmpi, which isn’t matched by the wildcard.
Note
|
SGE 8.1.1+ allows general PE wildcards (actually patterns), as documented, fixing the bug which meant that only * was available in older versions. The correct treatment might be useful with such configurations, e.g. selecting mpi-[1-4].
|
JSVs
See jsv(1) and jsv_script_interface(3) for documentation on job submission verifiers, as well as the examples in $SGE_ROOT/util/resources/jsv/.
See also the Resource Reservation section.
Wildcarded PEs
Where you would use a wildcarded PE as above, for convenience and abstraction you can use a JSV to supply the wildcard pattern. This JSV fragment from jsv_on_verify in Bourne shell re-writes -pe mpi to -pe mpi-*:
if [ $(jsv_is_param pe_name) = true ]; then
   pe=$(jsv_get_param pe_name)
   ...
   case $pe in
      mpi)
         jsv_set_param pe_name "$pe-*"
         pe="$pe-*"
         modified=1
         ...
Suppose you want to retain the possibility of running across all the PEs (assuming the groups aren’t isolated). Then you can define an extra PE, say allmpi, which isn’t re-written by the JSV.
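To apply such a client-side JSV to all submissions, a line like the following can go in the site sge_request file (the path is illustrative):
-jsv /usr/local/sge/site/jsv_pe_rewrite.sh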
Checking for Windows-style line endings in job scripts
Users sometimes transfer job scripts from MS Windows systems in binary mode, so that they end up with CRLF line endings, which typically fail, often with a rather obscure failure to execute if shell_start_mode is set to unix_behavior. The following fragment from jsv_on_verify in a shell script JSV prevents submitting such job scripts:
jsv_on_verify () {
   ...
   cmd=$(jsv_get_param CMDNAME)
   binary=n
   case $(jsv_get_param b) in y|yes) binary=y;; esac
   [ "$cmd" != NONE -a "$cmd" != STDIN -a "$binary" != y ] &&
      [ -f "$cmd" ] &&
      grep -q $'\r$' "$cmd" &&
      # Can't use multi-line messages, unfortunately.
      jsv_reject "\
Script has Windows-style line endings; transfer in text mode or use dos2unix"
   ...
Scheduling Policies
See sched_conf(5) for detailed information on the scheduling configuration.
Host Group Ordering
To change scheduling so that hosts in different host groups are preferentially used in some defined order, set queue_sort_method to seqno:
$ qconf -ssconf
...
queue_sort_method                 seqno
...
and define the ordering in the relevant queue(s) as required (with seq_no):
$ qconf -sq ...
...
seq_no                10,[@group1=4],[@group2=3],...
...
It is possible to use seqno, for instance, to schedule serial jobs preferentially to one `end' of the hosts and parallel jobs to the other `end'.
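As a sketch of that (queue and host-group names illustrative), give two queues opposite orderings so that serial jobs fill from one group first and parallel jobs from the other:
$ qconf -sq serial
...
seq_no                10,[@group1=1],[@group2=2]
...
$ qconf -sq parallel
...
seq_no                10,[@group1=2],[@group2=1]
...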
Fill Up Hosts
To schedule preferentially to hosts which are already running a job, as opposed to the default of roughly round robin according to the load level, change the load_formula:
$ qconf -ssconf
...
queue_sort_method                 load
load_formula                      slots
...
assuming the slots consumable is defined on each node.
Reasons for compacting jobs onto as few nodes as possible include avoiding fragmentation (so that parallel jobs which require complete nodes have a better chance of being fitted in), or being able to power down complete unused nodes.
Note
|
Scheduling is done according to load values reported at scheduling time, without lookahead, so that it only takes effect over time. |
Since the load formula is used to determine scheduling when hosts are equal according to queue_sort_method, you can schedule to the preferred host groups by seqno as above, and still compact jobs onto the nodes using slots in the load formula, i.e. with this configuration:
$ qconf -ssconf
...
queue_sort_method                 seqno
load_formula                      slots
...
Avoiding Starvation (Resource Reservation/Backfilling)
To avoid “starvation” of larger, higher-priority jobs by smaller, lower-priority ones (i.e. the smaller ones always run in front of the larger ones), enable resource reservation by setting max_reservation to a reasonable value (maybe around 100), and arrange that relevant jobs are submitted with -R y, e.g. using a JSV. Here is a JSV fragment suitable for client-side use, to add reservation to jobs over a certain size, assuming that PE slots is the only relevant resource:
if [ $(jsv_is_param pe_name) = true ]; then
   pe=$(jsv_get_param pe_name)
   pemin=$(jsv_get_param pe_min)
   ...
   # Check for an appropriate pe_min with no existing reservation
   # ($pe_min_reserve is a site-chosen threshold set elsewhere in the JSV).
   if [ $(jsv_is_param R) = false ]; then
      if [ $pemin -ge $pe_min_reserve ]; then
         jsv_set_param R y
         modified=1
      fi
   fi
fi
Note
|
For “backfilling” (shorter jobs can fill the gaps before reservations) to work properly with jobs which do not specify an h_rt value at submission, the scheduler default_duration must be set to a value other than the default infinity, e.g. to the longest runtime you allow.
|
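The relevant scheduler parameters (qconf -msconf) might then look something like this (values illustrative):
$ qconf -ssconf
...
max_reservation                   100
default_duration                  24:00:00
...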
To monitor reservations, set MONITOR=1 in sched_conf(5) params and use qsched(1) after running process_scheduler_log; qsched -a summarizes all the current reservations.
Fair Share
It is often required to provide a fair share of resources in some sense, whereby heavier users get reduced priority. There are two SGE policies for this. The share tree policy assigns priorities based on historical usage with a specified lifetime, while the functional policy only takes current usage into account, i.e. it is similar to the share tree with a very short decay time. (It isn’t actually possible to use the share tree with a lifetime of less than one hour.) Use one or the other, but not both, to avoid confusion.
With both methods, ensure that the default scheduler parameters are changed so that weight_ticket is a lot larger than weight_urgency and weight_priority, or set the latter two to zero if you don’t need them. Otherwise it is possible to defeat the fair share by submitting with a high priority (-p) or with resources which have a high urgency attached. See sge_priority(5) for details.
You may also want to set ACCT_RESERVED_USAGE in execd_params to use effectively `wall clock' time in the accounting that determines the shares.
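For example, in the global configuration (qconf -mconf; merge with any execd_params you already have):
execd_params                 ACCT_RESERVED_USAGE=true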
Functional
For simple use of the functional policy, add
weight_tickets_functional 10000
to the default scheduling parameters (qconf -msconf) and define a non-zero fshare for each user (qconf -muser). If you use enforce_user auto in the configuration,
auto_user_fshare 1000
could be used to set up automatically-created users (new ones only).
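A user entry with a share defined might then look like this (user name and value illustrative):
$ qconf -suser fred
name            fred
oticket         0
fshare          1000
delete_time     0
default_project NONE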
Warning
|
enforce_user auto implies not using CSP security, but is OK with MUNGE security. It typically is not wise to run without one of them. |
Share Tree
See share_tree(5).
To make a simple tree, use qconf -Astree with a file with contents similar to:
id=0
name=Root
type=0
shares=1
childnodes=1
id=1
name=default
type=0
shares=1000
childnodes=NONE
and give the share tree policy a high weight, like (qconf -msconf):
weight_tickets_share 10000
If you have auto-creation of users (see the warning above), you probably want to ensure that they are preserved with:
auto_user_delete_time 0
The share tree usage decays with a half-life of 7 days by default; modify halftime (specified in hours) to change it.
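For example, to use a one-day half-life instead (qconf -msconf):
halftime                          24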
Resource Management
Slot Limits
You normally want to prevent over-subscription of cores on execution hosts by limiting the slots allocated on a host to its core (or actually processor) count — where ``processors'' might mean hardware threads. There are multiple ways of doing so, according to taste, administrative convenience, and efficiency.
If you only have a single queue, you can get away with specifying the slot counts in the queue definition (qconf -mq), e.g. by host group
slots 0,[@hexcore=12],[@quadcore=8]...
but with multiple queues on the same hosts, you may need to avoid over-subscription due to contributions from each queue.
An easy way for an inhomogeneous cluster is with the following RQS (with qconf -arqs), although it may lead to slow scheduling in a large cluster:
{
   name         host-slots
   description  restrict slots to core count
   enabled      true
   limit        hosts {*} to slots=$num_proc
}
This would probably be the best solution if num_proc, the processor count, varies as hardware threads are turned on and off.
Alternatively, with a host group for each hardware type, you can use a set of limits like
limit hosts {@hexcore} to slots=12
limit hosts {@quadcore} to slots=8
which will avoid the possible scheduling inefficiency of the $num_proc dynamic limit.
Finally, possibly the most foolproof way in normal situations is to set the complex on each host, e.g.
$ for n in 8 16; do qconf -mattr exechost complex_values slots=$n \
    `qconf -sobjl exechost load_values "*num_proc=$n*"`; done
Memory Limits
Normally it is advisable to prevent jobs swapping. To do so, make the h_vmem complex consumable, and give it a default value that is (probably slightly less than) the lowest memory/core that you have on execution hosts, e.g.:
$ qconf -sc | grep h_vmem
h_vmem              h_vmem     MEMORY      <=    YES         YES        2000m    0
(See complex(5) and the definition of memory_specifier.)
Also set h_vmem to an appropriate value on each execution host, leaving some head-room for system processes, e.g. (with bash-style expansion):
$ qconf -mattr exechost complex_values h_vmem=31.3G node{1..32}
Then single-process jobs can’t over-subscribe memory on the hosts—at least the jobs on their own—and multi-process ones can’t over-subscribe long term (see below).
Jobs which need more than the default (2000m per slot above) need to request it at job submission with -l h_vmem=..., and may end up under-subscribing hosts' slots to get enough memory in total.
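For instance, a four-slot SMP job needing 4G per slot might be submitted as follows (PE name and values illustrative):
$ qsub -pe smp 4 -l h_vmem=4G job.sh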
Each process is limited by the system to the requested memory (see setrlimit(2)), and attempts to allocate more will fail. If it is a stack allocation, the program will typically die; if it is an attempt to malloc(3) too much, well-written programs should report an allocation failure. Also, the qmaster tracks the total memory accounted to the job, and will kill it if allocated memory exceeds the total requested.
These mechanisms are not ideal in the case of MPI-style jobs, in particular. The rlimit applied is the h_vmem request multiplied by the slot count for the job on the host, but it is for each process in the job—the limit does not apply to the process tree as a whole. This means that MPI processes, for instance, can over-subscribe in the PDC_INTERVAL before the execd notices, and out-of-memory system errors may still occur if a job can fill swap fast enough. Future use of memory control groups will help address this on Linux. However, if you happen to use Open MPI 1.7 or later, you could use a JSV to set OMPI_MCA_opal_set_max_sys_limits consistent with h_vmem.
Note
|
Killing by qmaster due to the memory limit may occur spuriously, at least under Linux, if the execd over-accounts memory usage. Older SGE versions, and possibly newer ones on old Linux versions, use the value of VmSize that Linux reports (see proc(5)); that includes cached data, and takes no account of sharing. The current SGE uses a more accurate value if possible (see execd_params USE_SMAPS). Also, if a job maps large files into memory (see mmap(2)), that may cause it to fail due to the rlimit, since that counts the memory-mapped data, at least under Linux. A future version of SGE is expected to provide control over using the rlimit.
|
Note
|
Suspended jobs contribute to the h_vmem consumed on the host, which may need to be taken into account if you allow jobs to preempt others by suspension.
|
Note
|
Setting h_vmem can cause trouble with programs using pthreads(7), typically appearing as a segmentation violation. This is apparently because the pthreads runtime (at least on GNU/Linux) derives the per-thread stack size from the h_vmem limit. The solution is to specify a reasonable value for h_stack in addition; typically a few tens up to 100 MB or so is a good value, but it may depend on the program. |
Note
|
There is also an issue with recent OpenJDK Java. It allegedly tries to allocate 1/4 of physical memory for the heap initially by default, which will fail with typical h_vmem on recent systems. The (only?) solution is to use java -XmxN explicitly, with N derived from h_vmem. |
Licence Tokens
For managing Flexlm licence tokens, see Olesen’s method. This could be adapted to similar systems, assuming they can be interrogated suitably. There’s also the licence juggler for multiple locations.
Killing Detached Processes
If any of a job’s processes detach themselves from the process tree under the shepherd, they are not killed directly when the job terminates. Use ENABLE_ADDGRP_KILL to turn on finding and killing them at job termination. It will probably be on by default in a future version.
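For example, in the global configuration (qconf -mconf; merge with any existing execd_params):
execd_params                 ENABLE_ADDGRP_KILL=true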
Core Binding
Binding processes to cores (or `CPU affinity') is normally important for performance on `modern' systems (in the mainstream at least since the SGI Origin). Assuming cores are not over-subscribed, a good default (since SGE 8.0.0c) is to set a default in sge_request(5) of
-binding linear:slots
The allocated binding is accessible via SGE_BINDING in the job’s environment, which can be assigned directly to GOMP_CPU_AFFINITY for the benefit of the GNU OpenMP implementation, for instance.
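A minimal job-script sketch for a GNU OpenMP program (program name illustrative):
export GOMP_CPU_AFFINITY="$SGE_BINDING"
./my_openmp_program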
If you happen to use Open MPI, a good default in openmpi-mca-params.conf (at least for Open MPI 1.6), matching the SGE -binding is:
orte_process_binding = core
Binding is the default in Open MPI 1.8, but to sockets, not cores, in most cases. That can be changed with, say,
hwloc_base_binding_policy = core
rmaps_base_mapping_policy = core
which also prefers binding adjacent-numbered ranks to adjacent cores.
Open MPI will respect the core binding from SGE, i.e. it won’t try to bind to cores outside SGE_BINDING.
Registered Memory Limit for openib etc.
To use Infiniband via openib, processes need to be able to `register' a large amount of memory, and the limit for jobs is typically too small by default (see ulimit -l). It is typically necessary to add H_MEMORYLOCKED=infinity to execd_params (see sge_conf(5)). This may be necessary with other transports too.
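For example (qconf -mconf, again merging with any existing execd_params):
execd_params                 H_MEMORYLOCKED=infinity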
Administration
Maintenance Periods
Rejecting Jobs
In case you want to drain the system, adding $SGE_ROOT/util/resources/jsv/jsv_reject_all.sh as a server JSV will reject all jobs at submission with a suitable message.
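For example, in the global configuration (qconf -mconf), with the path adjusted for your installation:
jsv_url                      /usr/sge/util/resources/jsv/jsv_reject_all.sh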
Down Time
If you want jobs to stay queued, there are two approaches to avoid starting ones that might run into a maintenance period, assuming you enforce a runtime limit and the maintenance won’t start any sooner than that period: a calendar and an advance reservation.
Calendar
You can define a calendar for the shutdown period and attach it to all your queues, e.g.
# qconf -scal shutdown
calendar_name    shutdown
year             6.9.2013-9.9.2013=off
week             NONE
# qconf -mattr queue calendar shutdown serial parallel
root@head modified "serial" in cluster queue list
root@head modified "parallel" in cluster queue list
Note
|
To get the scheduler to look ahead to the calendar, you need to enable resource reservation (issue #493), but that reservation may interact poorly with calendars (issue #722); it’s not clear whether this is still a problem. |
Advance reservation
Define a fake PE with allocation_rule 1 and access only by the admin ACL, say, and attach it to all your hosts, possibly via a new queue if you already have a complex pe_list setup:
$ qconf -sp all
slots              99999
user_lists         admin
...
allocation_rule    1
...
$ qconf -sq shutdown
qname              shutdown
hostlist           @allhosts
...
pe_list            all
...
Now you can make an advance reservation (assuming max_advance_reservations allows it, and you’re in arusers as well as admin):
$ qrsub -l exclusive -pe all $(qselect -pe all|wc -l) -a 201309061200 -e 201309091200
Rebooting execution hosts
To reboot execution hosts, you need to ensure they’re empty and avoid races with job submission. Thus, submit a job which requires exclusive access to the host and then does a reboot. Since you want to avoid root being able to run jobs for security reasons, use sudo(1) with appropriate settings to allow password-less execution of the commands by the appropriate users. You want to comment out Defaults requiretty from /etc/sudoers, add !requiretty to the relevant policy line, or use -pty y on the job submission. It is cleanest to shut down the execd before the reboot.
The job submission parameters will depend on what is allowed to run on the hosts in question, but assuming you can run SMP jobs on all hosts (some might not be allowed serial jobs), a suitable job might be
qsub -pe smp 1 -R y -N boot-$node -l h=$node,exclusive -p 1024 -l h_rt=60 -j y <<EOF
/usr/bin/sudo /sbin/service sgeexecd.ulgbc5 softstop
/usr/bin/sudo /sbin/reboot
EOF
where $node is the host in question, and we try to ensure the job runs early by using a resource reservation and a high priority.
For a mass reboot it may be better to submit an array job of a size equal to the number of nodes to boot. It will have to request a complex set via a load sensor with a value reflecting a state before a reboot, such as the old kernel version for reboots into a new kernel.
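A minimal load-sensor sketch for that, assuming a string complex named kernel has been defined and following the standard load sensor protocol, reporting the running kernel version:
#!/bin/sh
host=$(hostname)
while read -r input; do
   # The execd sends "quit" when the sensor should exit.
   [ "$input" = quit ] && exit 0
   echo begin
   echo "$host:kernel:$(uname -r)"
   echo end
done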
Broken/For-testing Hosts
Administrative Control
A useful tactic for dealing with hosts which are broken, or possibly required for testing and not available to normal users is to make a host group for them, say @testing (qconf -ahgrp testing), and restrict access to it only to admin users with an RQS rule like
limit users {!@admin} hosts {@testing} to slots=0
It can also be useful to have a host-level string-valued complex (say comment or broken) with information on the breakage, say with a URL pointing to your monitoring/ticketing system. A utility script can look after adding to the host group, setting the complex and, for instance, assigning downtime (in Nagios' terms) for the host in your monitoring system.
Alternatively the RQS could control access on the basis of the broken complex rather than using a separate host group.
A monitoring system like Nagios (which has hooks for such actions and is allowed admin access to the SGE system) can set the status as above when it detects a problem.
Using a restricted host group or complex is more flexible than disabling the relevant queues on the host, as sometimes recommended; that stops you running test jobs on them and can cause confusion if queues are disabled for other reasons.
Using Alarm States
As an alternative to explicitly restricting access as above, one can put a host into an alarm state to stop it getting jobs. This can be done by defining an appropriate complex and a load formula involving it, along with a suitable load sensor. The sensor executes periodic tests, e.g. using existing frameworks, and sets the load value high via the complex if it detects an error. However, since it takes time for the load to be reported, jobs might still get scheduled for a while after the problem occurs.
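One way to do that (complex and queue names illustrative): define a host-level integer complex, say broken, have the load sensor report it as 1 when a test fails, and include it in the queue load_thresholds so that a non-zero value raises an alarm on the affected queue instance:
$ qconf -sc | grep broken
broken              broken     INT         >=    YES         NO         0        0
$ qconf -mattr queue load_thresholds np_load_avg=1.75,broken=1 all.q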
Tests could also be run in the prolog, potentially setting the queue into an error state before trying to run the job. However, that is queue-specific, and the prolog only runs on parallel jobs' master node.
Copyright © 2012, 2013, Dave Love, University of Liverpool