Some of the information below is more applicable to administrators than users, and some may not be available to users, depending on the visibility of spool files.
Sometimes a pending job seems runnable but does not get dispatched. Grid Engine can be asked for the reason:
qstat -j jobid
If it is enabled, qstat -j jobid provides reasons why job jobid has not been dispatched in the last scheduling run, although these are typically voluminous and often hard to understand. The monitoring required for that to work is disabled by default as an efficiency measure and can be turned on with schedd_job_info in sched_conf(5). Here is sample output:
% qstat -j 242059
scheduling info:  queue "fangorn.q" dropped because it is temporarily not available
                  queue "lolek.q" dropped because it is temporarily not available
                  queue "balrog.q" dropped because it is temporarily not available
                  queue "saruman.q" dropped because it is full
                  cannot run in queue "bilbur.q" because it is not contained in its hard queue list (-q)
                  cannot run in queue "dwain.q" because it is not contained in its hard queue list (-q)
                  has no permission for host "ori"

qstat -j (with no job specified) provides global scheduling information and a summary listing jobs by the reasons they can't run, e.g.
Jobs can not run because the resource requirements can not be satisfied
        23147, 23046, 23047, 23048, 23049, 23050, 23051, 23052, ...
Jobs can not run because queue instance is not in queue list of PE
        23147, 23145, 22678, 22986, 22470, 22471, 22936, 22937, ...
Jobs can not run because available slots combined under PE are not in range of job
        23147, 23145, 22678, 22986, 22470, 22471, 22936, 22937, ...
Jobs dropped because of exceeding limit in rule
        23145, 22678, 22986, 22470, 22471, 22936, 22937, 22938, ...
Job dropped because of job dependencies
        23144

where the rest of the job lists are elided.
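To enable the scheduler monitoring this depends on, edit the scheduler configuration with
    qconf -msconf
and change schedd_job_info from false to true (sched_conf(5) also documents a job_list form restricted to particular job ids); note the efficiency caveat above.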
qalter -w p jobid
qalter -w p provides similar information to qstat -j by doing a dummy scheduling run, but without taking load values into account. It can be used when scheduling output from qstat -j isn't available.
Neither qstat nor qalter provide information on resource reservation which may prevent jobs running. That may be obtained using qsched(1) after configuring monitoring information and processing of it per qsched's man page. Similarly qrstat(1) provides information on advance reservations which may block jobs; also qhost -q indicates advance reservations (but not resource reservations) on hosts.
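For example (a hedged sketch; exact options are per the respective man pages):
    qrstat -u '*'     # list advance reservations for all users
    qhost -q          # per-host queue listing, indicating advance reservations
qsched's invocation depends on how its monitoring has been configured, so it is left to its man page.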
Information from those commands is generated by the scheduler in the normal way and takes the current utilization of the cluster into account. Sometimes this is not exactly what you are interested in; e.g. if all queue slots are already occupied by jobs of other users, no detailed message is generated for the job you are interested in.
qalter -w v jobid
This command lists the reasons why a job is not dispatchable in principle. For this purpose a dry scheduling run is performed with all consumable resources (including slots) considered to be fully available for this job and all load values are ignored.
Job or queue errors are indicated by an E in the qstat output. A job enters the error state when Grid Engine tried to execute it in a queue but the attempt failed for a reason considered specific to the job; a queue enters the error state when the failure is considered specific to the queue. (Those classifications are sometimes imperfect.)
Grid Engine offers a set of possibilities for users and administrators to obtain diagnostic information in the case of job execution errors. Since both the queue and the job error state result from failed job execution, the same diagnostic possibilities apply to both types of error state:
query job error reason
For jobs in an error state, a one-line error reason is available through
qstat -j jobid | grep error
This is the recommended first source of diagnostic information for the end user.
query queue error reason
For queues in an error state, a one-line error reason is available through
qstat -explain E
This is the recommended first source of diagnostic information for administrators in case of queue errors.
user abort mail
If jobs are submitted with the submit option -m a, an abort mail message is sent to the address specified with the -M user[@host] option. The abort mail contains diagnostic information about job errors and is the recommended source of information for users.
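For example (the address and script name are illustrative):
    qsub -m a -M fred@example.com job.sh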
qacct accounting
If no abort mail is available the user can run
qacct -j jobid
to get information about the job error from the Grid Engine accounting file. Note that the accounting file may be rotated, so for a job that ran around the time the file was rotated (typically at the start of a month) you may need to consult an older copy of the file (see the sketch below). The record will include a message if the job was killed because it exceeded its allowed time or memory limits. For distributed jobs run in a PE with accounting_summary false, it may be convenient to add the -m option available in recent SGE versions to list only the master queue, e.g.
$ qacct -m -j 21659 | egrep '^failed|^maxvmem|^ru_wallclock|^category'
failed       37  : qmaster enforced h_rt, h_cpu, or h_vmem limit
ru_wallclock 36001s
maxvmem      5.247GB
category     -U localusers -u fred -l h_rt=36000,h_vmem=10000M
shows that the job ran out of time and should have requested a larger h_rt. Note that a job's exit_status will be 0, even if it failed intrinsically, if it was a script which didn't take care to exit with an appropriate code, e.g. by exec-ing the final binary, using set -e (Bourne shell), or exiting with a saved value of $? if it needs to print something after the failed command.
See qacct(1) and accounting(5) for more information.
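If the relevant records have already been rotated out of the live accounting file, qacct can be pointed at an older copy with -f; the rotated file's name and location depend on how rotation is arranged at your site, so the path below is only illustrative:
    qacct -f $SGE_ROOT/default/common/accounting.1 -j 21659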
administrator abort mail
An administrator can request email about job execution problems by specifying an appropriate email address (see administrator_mail in sge_conf(5)). Administrator mail contains more detailed diagnostic information than user abort mail (in particular, the shepherd trace information) and is recommended in case of frequent job execution errors.
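For example, edit the global configuration with
    qconf -mconf
and set administrator_mail to a suitable address, say sge-admin@example.com (the address is illustrative).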
messages files
If no administrator mail is available, the Qmaster's messages file should be investigated. Logging related to a particular job can be found by searching for the appropriate jobid. In a default installation the Qmaster messages file is
$SGE_ROOT/default/spool/qmaster/messages
It may be helpful to alter the logging level (see loglevel in sge_conf(5)), especially as the level at which some messages are classified is at least arguable.
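For example, to extract everything logged about the job from the earlier example:
    grep 242059 $SGE_ROOT/default/spool/qmaster/messages
The logging level is changed by setting loglevel (e.g. to log_info) with qconf -mconf.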
Additionally, useful information may be found in the messages files of the Execd(s) on nodes running the job, particularly the master host in the case of distributed parallel jobs. Use qacct -j jobid to find the (master) host on which the job was started and search for the jobid in
$SGE_ROOT/default/spool/host/messages
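For example (the execution host name is hypothetical, and with local execd spooling the spool directory may live on the execution host rather than under $SGE_ROOT):
    qacct -j 242059 | grep hostname
    grep 242059 $SGE_ROOT/default/spool/node42/messages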
Recent versions of SGE allow messages to be directed to syslog instead of the messages files, and always send startup messages (before the spool area has been established) to syslog. Other versions put startup error messages into files in /tmp. It is worth checking syslog on the exec host for messages that may be relevant to job failure.
Administrators may find it useful to use Nagios monitors for jobs and queues in an error state.
If jobs fail to start correctly with a message (perhaps truncated) in the shepherd output about failing to exec the script, or with some other odd error from a script not started in Unix mode, check that the script doesn't have CRLF (MS Windows-style) line endings. It can be useful to check for that at submission time with a JSV, as sketched below.
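A minimal client-side JSV sketch for that check, assuming the shell JSV helper library shipped with SGE under $SGE_ROOT/util/resources/jsv; the rejection message is illustrative:

    #!/bin/sh
    # Reject job scripts which appear to contain CRLF (DOS) line endings.
    . $SGE_ROOT/util/resources/jsv/jsv_include.sh

    jsv_on_start()
    {
       # nothing needed at the start of a verification round
       return
    }

    jsv_on_verify()
    {
       script=$(jsv_get_param CMDNAME)
       # CMDNAME is only a readable file for script jobs (not, e.g., interactive ones)
       if [ -f "$script" ] && grep -q "$(printf '\r')" "$script"; then
          jsv_reject "job script appears to have CRLF line endings; run dos2unix on it"
          return
       fi
       jsv_accept "OK"
    }

    jsv_main

It could be attached for a single submission with qsub -jsv /path/to/crlf.jsv, or by default via jsv_url (see sge_conf(5)); the check only makes sense client-side, where the script file is accessible.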
If SGE appears to be unhappy with network names, the hostnameutils(1) utilities can be used to check SGE's idea of names including the effects of its host aliasing.
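For example, assuming the usual utilbin layout and an architecture string of lx-amd64 (adjust for your platform; the host name is hypothetical):
    $SGE_ROOT/utilbin/lx-amd64/gethostname
    $SGE_ROOT/utilbin/lx-amd64/gethostbyname node42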
It is common for jobs to fail because the passwd database information on execution hosts isn't correct, specifically not synchronized with the head node's. Check that with getent(1) or id(1) on the relevant nodes. (SGE simply uses the system services for such information, which getent checks.) It may be necessary to flush cached information after making changes, e.g. with nscd -i passwd.
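For example, compare the entries for the user concerned on the head node and on the execution host, then flush the cache after fixing any mismatch (the user name is illustrative; nscd needs root):
    getent passwd fred
    id fred
    nscd -i passwd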
If a job fails under SGE but works when run directly, the difference is most likely due to environment variables or resource limits. As usual, check the execution host syslog and its SGE messages file for clues. If there are no useful diagnostics, submit a job (conveniently done interactively with qrsh) to print information about relevant parts of the process environment (such as environment variables and resource limits) and compare the output against a situation which works. procenv provides a comprehensive listing for Linux-based systems and potentially others; there are RPMs available in a copr repo and possibly in Fedora/EPEL eventually. procenv actually subsumes information from other tools, such as id (above).
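A sketch of such a comparison, assuming procenv is installed and on the default PATH of the execution host (file names are arbitrary):
    procenv > env.interactive      # from an ordinary shell on the execution host
    qrsh procenv > env.batch       # the same command run under SGE
    diff env.interactive env.batch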
A common issue is programs using pthreads reacting badly to the default infinity (unlimited) setting of h_stack, which can be fixed by setting h_stack to a few 10s of MB.
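For example, the limit can be requested per job (the value is illustrative), or set as a default in the queue configuration with qconf -mq:
    qsub -l h_stack=64M job.sh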
If clients report communication errors, such as timeouts, without more useful diagnostics, first consult the qmaster messages file (probably after setting loglevel to log_info in sge_conf(5)). If there's no useful information there, check the networking as you would for other client-server applications. Ensure that the configured qmaster port is open (or the execd one if exec host communication is failing). Specifically, check, and maybe double-check, for any firewalling or possible effects of tcp-wrappers. A tool like tcpdump/wireshark running at either end may be useful. Otherwise, using strace(1) (or equivalent on other OSs), or generating a debugging trace from the client and/or server, might help. The basic SGE tool for checking communication is qping -info (but old versions always reported errors).
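For example, to check connectivity to the qmaster and to an execd (host names are hypothetical, and 6444/6445 are only the conventional default ports, so substitute your cluster's configured values):
    qping -info master.example.com 6444 qmaster 1
    qping -info node42.example.com 6445 execd 1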