Rationale
Consider using array jobs whenever you might otherwise submit a large number of jobs — likely auto-generated — to do the same processing on a lot of individual input datasets or values. Examples include image processing or rendering with one job for each frame in a sequence, and parameter sweeps, with each job running the same calculation with a different parameter set from a pre-defined collection.
Array jobs effectively constitute a parallel loop over the input data which can be manipulated as a unit by qsub, qdel, qhold, qalter, etc. They have a single entry in qstat output when waiting (but an entry for each running member of the array). This is easier on you than generating and manipulating a large number of individual jobs, and easier on the system, which doesn’t have to manage a very large number of jobs.
Each instance of the job in the effective loop is called a task, which has an index exported to the task’s environment for use in the job script. The tasks may be parallel. Each initially has the same SGE parameters, though qalter may be used to change some of them for waiting tasks.
See qsub(1) for details of the array job options -t, -tc, -hold_jid, and -hold_jid_ad in addition to the guidance below.
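For example, a minimal sketch of treating an array as a unit (the job id 1234 and script myjob.sh are illustrative):

$ qsub -t 1-1000 myjob.sh    # one job, 1000 tasks
$ qstat -g d                 # list the tasks individually
$ qdel 1234 -t 500-1000      # delete only tasks 500-1000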
Basic Use
The job script for an array job is usually essentially a template in which substitutions can be made on the basis of the task index environment variable $SGE_TASK_ID.
Consider processing the contents of a collection of 1000 directories with regular names of the form caseN, with N running from 1–1000. Submitting the job as
qsub -t 1-1000 ... array.sh
asks for array.sh to be run as 1000 separate tasks, and it might contain something like:
#!/bin/sh
#$ -cwd -j y -o case$TASK_ID/output
cd case$SGE_TASK_ID
exec program <input
This is equivalent to 1000 jobs, each executing commands from an element of the sequence

cd case1; program <input
...
cd case1000; program <input
so processing data in case1, …, case1000 and putting the job output in the file output in that directory. See qsub(1) for details of the substitutions in values specified with -e and -o; note the lack of an SGE_ prefix in those contexts.
Tasks are started in order of the array index, but you can’t make more assumptions than that about when they execute; a rescheduled task could effectively run after one with a later index.
Note that the sge_conf(5) parameters max_aj_tasks and max_aj_instances control, for each job, the maximum total number of tasks and the maximum number of concurrent tasks respectively.
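You can check the values currently in force in the cluster configuration, e.g.:

$ qconf -sconf | grep max_aj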
Refinements and Clichés
Indexing Arithmetic
The task index is always 1-based, but you can always do arithmetic with it if your data are essentially 0-based. In a POSIX-conformant shell like bash(1), the expression $(($SGE_TASK_ID-1)) converts the index to a 0-based one.
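For instance, a minimal sketch of a job over 0-based input names (input0 … input99 and program are illustrative):

#!/bin/sh
#$ -t 1-100 -cwd
# Convert the 1-based task index into a 0-based file suffix.
i=$(($SGE_TASK_ID-1))
program < input$i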
The array stride need not be 1. This is useful, for instance, to unroll the loop when the tasks would be short compared with the overhead of running them:
$ qsub -t 1-1000:5 ... <<'EOF'
# Each task handles $SGE_TASK_STEPSIZE consecutive inputs,
# starting at its own index.
for i in $(seq 0 $(($SGE_TASK_STEPSIZE-1))); do
  index=$(($SGE_TASK_ID+$i))
  program < input$index
done
EOF
The variables SGE_TASK_FIRST, SGE_TASK_LAST, and SGE_TASK_STEPSIZE provide the task range and stride to the script.
Non-shell Script Jobs
It isn’t necessary to use a shell script: the index is accessible in the environment of a binary job (though that would probably need to be an SGE-specific program), and a job script can be in a non-shell language such as Python, accessing the environment in the appropriate way. Here is a trivial example of constructing a file name in Python in a two-task array job, assuming the cluster configuration sets shell_start_mode to unix_behavior:
#!/usr/bin/python2
#$ -t 1-2 -l h_rt=9 -cwd -j y
import os
ip = "input"+os.getenv("SGE_TASK_ID")+".dat"
print "We might read from file", ip
and similarly for R, using littler, which need only be on PATH if we use the /bin/env cliché:
#!/bin/env r
#$ -t 1-2 -l h_rt=9 -cwd -j y
ip <- sprintf ("input%s.dat", Sys.getenv("SGE_TASK_ID"))
cat ("We might read from file", ip, "\n")
The -C argument of qsub might help if # isn’t a comment character in the script language.
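For instance, a sketch for a hypothetical script language in which // starts a comment (job.xyz is illustrative):

$ qsub -C '//$' job.xyz

with the directives in the script then written as lines like //$ -t 1-10.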
Selecting from Lists in Files
You needn’t be restricted to simple template-like jobs. You can put complex arguments or commands in a file and pull out a different line each time, e.g. with the right number of commands to be executed listed one per line in a file (called lines below). awk is convenient for this, e.g. selecting a complete command line:
eval $(awk NR==$SGE_TASK_ID lines)
or an input file from a non-regular collection:
program < $(awk NR==$SGE_TASK_ID lines)
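A related cliché matches the task range to the length of the list at submission time (job.sh standing for a script using one of the selections above):

$ qsub -t 1-$(wc -l < lines) job.sh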
An alternative suggestion is to supply a list of scripts as job arguments, as sketched below.
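A minimal sketch of that approach (run-one.sh and the script names are illustrative):

#!/bin/sh
#$ -cwd
# The job arguments are a list of scripts; run the one whose
# position in the list matches this task's index.
shift $(($SGE_TASK_ID-1))
exec sh "$1"

submitted as, e.g., qsub -t 1-3 run-one.sh script1 script2 script3.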
Dependent Arrays
qsub -hold_jid_ad can be used to hold an array job’s tasks dependent on the corresponding tasks in other array jobs. E.g. with
qsub -t 1-50 step1
qsub -t 1-50 -hold_jid_ad step1 step2
qsub -t 1-50:2 -hold_jid_ad step2 step3
array task n of step2 will only run after task n of step1 has finished, and task n of step3 will only run after tasks n and n+1 of step2 have finished. Similarly, with
qsub -t 1-50:2 step1
qsub -t 1-50 -hold_jid_ad step1 step2
tasks n and n+1 of step2 depend on task n of step1, where n is odd. NB ‘finished’ doesn’t mean ‘finished successfully’, so subsequent tasks may need to check the results of previous ones, as sketched below.
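For instance, a minimal sketch of such a check in a step2 task, assuming each step1 task writes result$SGE_TASK_ID on success (the file name is illustrative; exiting with status 100 puts the task into the error state rather than letting it fail silently):

#!/bin/sh
#$ -cwd
# Put this task in the error state if the corresponding
# step1 task left no (non-empty) result to work on.
if [ ! -s result$SGE_TASK_ID ]; then
  echo "no result from step1 task $SGE_TASK_ID" >&2
  exit 100
fi
program < result$SGE_TASK_ID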
In contrast, with
qsub -t 1-50 step1
qsub -t 1-50 -hold_jid step1 step2
tasks of step2 will only run after all those of step1 have finished.
Task Concurrency
The qsub -tc option can be used to restrict the number of tasks of the job that run concurrently, usually to be socially conscious in not dominating the cluster. You might use -tc 1 to ensure that only one task can run at once, e.g. for a series of tests that each update some state in the directory; the state itself may not matter to you, but it will cause problems if more than one task updates it at the same time. This might also be useful if each task depended on results from the previous index, though care would have to be taken in case a task failed.
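For example, to run the 1000-task job from above with at most 20 tasks executing at any one time:

$ qsub -t 1-1000 -tc 20 array.sh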
Copyright © 2013, 2014 Dave Love, University of Liverpool; licence GFDL.