Grid Engine Troubleshooting

Some of the information below is more applicable to administrators than users, and some may not be available to users, depending on the visibility of spool files.

Problems with pending jobs not being dispatched

Sometimes a pending job seems runnable but does not get dispatched. Grid Engine can be asked for the reason:
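For example (a sketch, where 12345 stands for the job id of interest; qstat -j only reports scheduling messages if schedd_job_info is enabled in sched_conf(5)):

```shell
# Ask the scheduler why a specific pending job is not being dispatched
# (requires "schedd_job_info true" in the scheduler configuration)
qstat -j 12345

# Dry-run validation of the job's requests:
qalter -w p 12345   # check against the present cluster state
qalter -w v 12345   # check whether the job could ever run, ignoring load
```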

Information from those commands is normally generated by the scheduler and takes the current utilization of the cluster into account. Sometimes this is not exactly what you are interested in: for example, if all queue slots are already occupied by other users' jobs, no detailed message is generated for the job you are interested in.

Job or Queue in error state E

Job and queue errors are indicated by an E in the qstat output. A job enters the error state when Grid Engine tried to execute it in a queue but failed for a reason considered specific to the job; a queue enters the error state when the failure is considered specific to the queue. (Those classifications are sometimes imperfect.)

Grid Engine offers users and administrators several possibilities for obtaining diagnostic information in case of job execution errors. Since both the queue and the job error state result from a failed job execution, the same diagnostics apply to both types of error state:
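A sketch of the usual commands (12345, all.q and node01 are placeholder names):

```shell
# Show the reason a queue instance is in the error state
qstat -explain E

# For a failed job, the error reason also appears in qstat -j while the
# job is still in the system, and in the accounting record afterwards
qstat -j 12345
qacct -j 12345

# Once the cause is fixed, clear the error state
qmod -cj 12345         # clear a job's error state
qmod -cq all.q@node01  # clear a queue instance's error state
```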

Administrators may find it useful to use Nagios monitors for jobs and queues in an error state.

Other issues

Obscure exec messages from the shepherd

If jobs fail to start, with a message (perhaps truncated) in the shepherd output about failing to exec the script, or with some other odd error early in the script, check that the script doesn't have CRLF (MS Windows-style) line endings. It can be useful to check for that at submission time with a JSV.
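A quick check for CRLF line endings (the example deliberately creates a broken script named job.sh to demonstrate):

```shell
# Detect MS Windows (CRLF) line endings in a job script; such scripts
# confuse the shepherd's attempt to exec them.
script=job.sh
printf 'echo hello\r\n' > "$script"   # deliberately broken example

if grep -q "$(printf '\r')" "$script"; then
    echo "CRLF line endings found; fix with: tr -d '\r' < $script > fixed.sh"
fi
```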

Host name-related issues

If SGE appears to be unhappy with network names, the hostnameutils(1) utilities can be used to check SGE's idea of names including the effects of its host aliasing.
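For instance (assuming a standard installation layout, with the architecture string obtained from the util/arch script; node01 is a placeholder):

```shell
# SGE's idea of the local host name, after host_aliases processing
$SGE_ROOT/utilbin/$($SGE_ROOT/util/arch)/gethostname

# SGE's resolution of another host's name
$SGE_ROOT/utilbin/$($SGE_ROOT/util/arch)/gethostbyname node01
```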

Unknown accounts on execution hosts

It is common for jobs to fail because the passwd database information on execution hosts is incorrect, specifically not synchronized with the head node's. Check that with getent(1) or id(1) on the relevant nodes. (SGE simply uses the system name services for such information, which getent queries.) It may be necessary to flush cached information after making changes, e.g. with nscd -i passwd.
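A minimal check, run on both the head node and the execution host (the UID, GID, group list and home directory should match on both):

```shell
# Look up the submitting account via the system name services,
# exactly as SGE will
getent passwd "$(id -un)"
id "$(id -un)"

# After fixing the passwd database, flush the name-service cache
# on hosts running nscd (as root):
# nscd -i passwd
```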

Something runs outside SGE, but not inside

This is most likely due to environment variables or resource limits. As usual, check the execution host's syslog and its SGE messages file for clues. If there are no useful diagnostics, submit a job (conveniently done interactively with qrsh) that prints information about the relevant parts of the process environment (such as environment variables and resource limits) and compare the output against a situation which works. procenv provides a comprehensive listing for Linux-based systems, and potentially others; RPMs are available in a copr repo, and possibly in Fedora/EPEL eventually. procenv subsumes information from other tools, such as id (above).
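If procenv isn't available, a simple report like the following can be run both in a working interactive shell and inside a job (e.g. via qrsh), and the two outputs diffed:

```shell
# Capture the parts of the process environment that most often differ
# between interactive logins and batch jobs
{
    echo "== environment ==";     env | sort
    echo "== resource limits =="; ulimit -a
    echo "== user/groups ==";     id
    echo "== working directory =="; pwd
} > env-report.txt
```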

A common issue is programs using pthreads reacting badly to the default infinity (unlimited) setting of h_stack; this can be fixed by setting h_stack to a few tens of megabytes.
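For example (32M and all.q are illustrative values, not recommendations):

```shell
# Request a finite stack limit for a single job
qsub -l h_stack=32M job.sh

# Or set it for a whole queue: edit the queue configuration and
# change "h_stack INFINITY" to e.g. "h_stack 32M"
qconf -mq all.q
```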

Communication errors

If clients report communication errors, such as timeouts, without more useful diagnostics, first consult the qmaster messages file (probably after setting loglevel to log_info in sge_conf(5)). If there's no useful information there, check the networking as you would for other client-server applications. Ensure that the configured qmaster port is open (or the execd one if exec host communication is failing). Specifically check, and maybe double-check, for any firewalling or possible effects of tcp-wrappers. A tool like tcpdump/wireshark running at either end may be useful. Otherwise, using strace(1) (or the equivalent on other OSs), or generating a debugging trace from the client and/or server, might help. The basic SGE tool for checking communication is qping -info (but old versions always reported errors).
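A sketch of typical qping invocations (mastername and node01 are placeholders; 6444 and 6445 are common default ports for the qmaster and execd respectively, so adjust to your configuration):

```shell
# Check that the qmaster is reachable and responding
qping -info mastername 6444 qmaster 1

# Similarly for an execution daemon
qping -info node01 6445 execd 1
```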