Common problems using Grid Engine

Last updated: Feb 26, 2014

This HOWTO goes over some commonly seen problems experienced when using Grid Engine, and appropriate solutions. Each entry is presented using the following scheme:


Category

Symptom

Cause

Resolution

For problems which are not explicitly mentioned here, search for a symptom in the appropriate category which matches your problem as closely as possible, and see if the resolution fixes your particular case.

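Categories:

  • Batch Submit
  • Monitoring
  • Miscellaneous Error Messages
  • Qrsh/Interactive Jobs
  • Qmake
  • Parallel/Checkpointing
  • Shadow Facility
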
Batch Submit

My output file for my job says

"Warning: no access to tty... Thus no job control in this shell."

One or more of your login files contains an stty command. These commands are only useful if there is a terminal present.

In Grid Engine batch jobs, there is no terminal associated with these jobs. You need to either remove all stty commands from your login files, or bracket them with an if statement which checks for a terminal before processing. An example of this is below:

/bin/csh:
# check terminal status; $status is 0 only if a terminal is present
stty -g >& /dev/null
if ($status == 0) then
    <place all stty commands in here>
endif

(In recent SGE versions, you can allocate a pseudo-terminal for batch jobs with qsub -pty y.)
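For Bourne-type login files such as .profile, the same guard can be written with the POSIX test for a terminal on standard input. This is a minimal sketch; the helper name has_terminal is illustrative and the stty call is only a stand-in for your own commands.

```shell
# Return success only when standard input is attached to a terminal.
# has_terminal is an illustrative helper name, not part of Grid Engine.
has_terminal() {
    [ -t 0 ]
}

if has_terminal; then
    # Stand-in for your real stty commands; skipped in batch jobs.
    stty -a > /dev/null
fi
```

In a batch job the condition is false, so none of the stty commands inside the if block are executed.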

The job standard error log file says:

`tty`: Ambiguous

but the startup files of the shell named in the job script contain no reference to tty.

shell_start_mode is by default posix_compliant; therefore, all job scripts run with the shell specified in the queue definition, not the one specified on the first line of the job script.

Use the -S flag to qsub, or change shell_start_mode to unix_behavior.

A job script runs from the command line, but it fails when run via qsub.

Process limits may be set for your job. To test this, write a test script which runs limit and limit -h (C shell) or ulimit -a and ulimit -H -a (Bourne shell). Execute it both interactively at the shell prompt and through qsub, and compare the results. See also the troubleshooting guide.

Make sure to remove any commands in configuration files which set limits in your shell.
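A Bourne shell test script of the kind described above might look like the following sketch. The function name report_limits and the file names used in the usage note are illustrative.

```shell
#!/bin/sh
# Print soft and hard resource limits so that the interactive and batch
# environments can be compared. report_limits is an illustrative name.
report_limits() {
    echo "== host: $(uname -n)"
    echo "== soft limits"
    ulimit -S -a
    echo "== hard limits"
    ulimit -H -a
}

report_limits
```

Saved as limits.sh (a hypothetical name), it can be run interactively with "sh limits.sh > interactive.txt" and as a batch job with "qsub -S /bin/sh -cwd -j y -o batch.txt limits.sh". Comparing the two output files shows which limits differ.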

Data and executables may not be accessible where needed

The job script itself must be accessible from the submit host. All data and other executables needed by the script must be accessible on the execute host; they are usually shared via NFS.

The unlimited stack size set by default by SGE may cause some applications to crash on some operating systems.

In the job script, use ulimit to set stack size limits before calling the executable that crashes.

Or modify the queue to set smaller stack size:

qconf -mattr queue h_stack 8389486 <queue_name> (hard limit in bytes)
qconf -mattr queue s_stack 8389486 <queue_name> (soft limit in bytes)
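For the job script variant, a minimal sketch is shown below. The helper name limit_stack_kb, the value 8192 and the executable ./my_app are assumptions for illustration. Note that the shell's ulimit -s takes kilobytes, unlike the queue attributes above.

```shell
#!/bin/sh
# Lower the soft stack limit (in kilobytes) for this shell and every
# process started from it. limit_stack_kb is an illustrative helper name.
limit_stack_kb() {
    ulimit -S -s "$1"
}

# Example job script usage (./my_app is a hypothetical executable):
#   limit_stack_kb 8192 && ./my_app
```

Only the soft limit is changed, so the application can still raise it up to the hard limit if it needs to.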

qsub of a job results in the error "can't set additional group id for job" (seen in administrator or user mail, or shepherd trace file) and puts queue into error state

Possible reasons

  1. The error message above can occur if the user already has the maximum number of group ids set. SGE tries to set one more group id and fails.

  2. If you are not running Grid Engine as root, the setgroups() system call will fail when trying to set the unique group ID which is used to track all the spawned processes of a job.

Corresponding solutions

  1. Check how many group ids are assigned to the user using id -a. If the count has reached the system's NGROUPS_MAX (found in /proc/sys/kernel/ngroups_max under Linux), you need to reduce the number of the user's secondary groups or increase the limit in the kernel.

  2. Be sure to run the Grid Engine daemons as root.
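The first check can be scripted roughly as follows on Linux. This is a sketch only: the function names are illustrative, and the count from id -G includes the primary group, so treat the comparison as approximate.

```shell
# Count the group ids of a user (default: the current user).
# group_count and check_group_headroom are illustrative names.
group_count() {
    # id -G prints all group ids, primary and supplementary, on one line.
    id -G ${1:+"$1"} | wc -w | tr -d ' '
}

# Compare the count with the Linux kernel limit, if it can be read.
check_group_headroom() {
    count=$(group_count "${1-}")
    limit_file=/proc/sys/kernel/ngroups_max
    if [ -r "$limit_file" ]; then
        max=$(cat "$limit_file")
        if [ "$count" -ge "$max" ]; then
            echo "no room for an additional group id ($count of $max used)"
            return 1
        fi
        echo "ok: $count of $max group ids used"
    else
        echo "cannot read $limit_file; user has $count group ids"
    fi
}
```

Calling check_group_headroom with a user name as its argument checks that user instead of the caller.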

Monitoring

Exec hosts report a load of “-”; queue is in “alarm” and/or “unknown” state

There are a few things that could cause your exec hosts to fail to report a load:


  1. The execd is not running on the host.

  2. A default domain is incorrectly specified.

  3. The qmaster host sees the exec host under a different name than the exec host sees itself.

Depending on the cause, here are the appropriate solutions


  1. Start up the execd as root.

  2. Run qconf -mconf as the Grid Engine administrator and change the default_domain to none.

  3. Set IGNORE_FQDN=TRUE in qmaster_params in the cluster configuration.

  4. As an alternative for cause 3, see the man page host_aliases(5).

Miscellaneous Error Messages

A warning is printed to <cell>/spool/<host>/messages every 30 seconds. The messages look like this:

Tue Jan 23 21:20:46 2001|execd|meta|W|local configuration meta not defined - using global configuration 

But there IS a file for each host in <cell>/common/local_conf/, each named with the FQDN.

Hostname resolution on your machine "meta" returns the short name, while on your master machine the FQDN of "meta" is returned.

Make sure that all of your /etc/hosts files and your NIS/LDAP names are consistent. In this example, there could be a line like

168.0.0.1 meta meta.your.domain

in /etc/hosts of the host "meta" while it should be instead

168.0.0.1 meta.your.domain meta
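To spot such entries, one can look up which name a hosts file lists first for a host and check whether it is fully qualified. The sketch below assumes the usual hosts file layout (address, canonical name, aliases); canonical_name and name_is_fqdn are illustrative helper names.

```shell
# Print the canonical (first) name that a hosts file assigns to a host.
# $1 = host name to look up, $2 = hosts file (default /etc/hosts)
canonical_name() {
    awk -v h="$1" '
        { sub(/#.*/, "") }    # strip comments before examining fields
        { for (i = 2; i <= NF; i++) if ($i == h) { print $2; exit } }
    ' "${2:-/etc/hosts}"
}

# Succeed if the given name contains a dot, i.e. looks fully qualified.
name_is_fqdn() {
    case "$1" in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}
```

For the first example line above, canonical_name meta prints "meta" and name_is_fqdn rejects it; for the corrected line it prints "meta.your.domain".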

Occasionally I see "CHECKSUM ERROR", "WRITE ERROR" or "READ ERROR" messages in the "messages" files of the daemons. Do I need to worry about these?


As long as these messages do not appear at one-second intervals (they typically appear between 1 and 30 times per day), there is no need to do anything about them.

Jobs will finish on a particular queue, showing in qmaster/messages:

Wed Mar 28 10:57:15 2001|qmaster|masterhost|I|job 490.1 finished on host exechost 

But then the following errors are seen on the exec host:

exechost/messages:

Wed Mar 28 10:57:15 2001|execd|exechost|E|can't find directory "active_jobs/490.1" for reaping job 490.1

exechost/messages:

Wed Mar 28 10:57:15 2001|execd|exechost|E|can't remove directory "active_jobs/490.1": opendir(active_jobs/490.1) failed: Input/output error 

The $SGE_ROOT directory, which is automounted, is being unmounted, causing the sge_execd to lose its cwd.

Use a local spool directory for your execd host. Set the parameter execd_spool_dir using qmon or qconf.

The actual hostname <myhostname> of the machine is an alias to localhost in /etc/hosts. It looks like this:

127.0.0.1   localhost  myhostname

Remove <myhostname> as an alias to localhost and put <myhostname> after the real IP address in /etc/hosts.
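A quick way to detect this situation is to search the loopback lines of the hosts file for the machine's name. This is a sketch under the assumption that loopback entries start with 127; aliased_to_loopback is an illustrative helper name.

```shell
# Succeed if the given host name appears on a loopback (127.x.x.x) line.
# $1 = host name, $2 = hosts file (default /etc/hosts)
aliased_to_loopback() {
    awk -v h="$1" '
        { sub(/#.*/, "") }    # strip comments before examining fields
        $1 ~ /^127\./ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
        END { exit !found }
    ' "${2:-/etc/hosts}"
}
```

For example, aliased_to_loopback "$(uname -n)" && echo "hostname is aliased to localhost" flags the problem on the local machine.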

Multiple queues cascade into error state, rendering the grid unusable.

Errors in a user's .cshrc/.profile put all queues into error state.

  1. Fix errors in users' .cshrc/.profile

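  2. Use the -f option in the first line of the job script (e.g., #!/bin/csh -f) to bypass the user's .cshrc. Note that -f has this effect for the C shell only; for Bourne shells it disables filename expansion and does not skip .profile.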

Java complains "Could not reserve enough space for object heap"

Recent OpenJDK JVMs allegedly default to trying to allocate 1/4 of the machine's physical memory at startup, which will usually fail with an h_vmem resource limit.

Start java with the -XmxN option, where N is an appropriate memory limit.

Qrsh/Interactive Jobs

Submitting interactive jobs with qrsh, I get the error:

% qrsh -l mem_free=1G
error: error: no suitable queues

Yet queues are available for batch jobs using qsub, and can be queried using qhost -l mem_free=1G and qstat -f -l mem_free=1G.

The message "error: no suitable queues" results from the "-w e" submit option which is active by default for interactive jobs like qrsh (look for "-w e" in qrsh(1)). This option causes the submit command to fail if the qmaster does not know for sure that the job will be dispatchable according to the current cluster configuration. The intention of this mechanism is to decline job requests in advance in case they can't be granted.

In this case 'mem_free' is configured to be a consumable resource, but you have not specified the amount of memory available on each host. The memory load values are deliberately not considered for this check, because they vary, so they can't be seen as part of the cluster configuration. To overcome this you can either

  • omit this check generally by overriding qrsh's default setting "-w e" explicitly by submitting it with "-w n" (can also be put into $SGE_ROOT/<cell>/common/sge_request)

  • if you intend managing 'mem_free' as a consumable resource specify the 'mem_free' capacity for your hosts in 'complex_values' of host_conf(5) by using 'qconf -me <hostname>'

  • if you don't intend managing 'mem_free' as consumable resource make it a non-consumable resource again in the 'consumable' column of complex(5) by using 'qconf -mc host'

qrsh won't dispatch to the same node it is on. From a qsh shell:

host2 [49]% qrsh -inherit host2 hostname
error: executing task of job 1 failed: 

host2 [50]% qrsh -inherit host4 hostname
host4

The gid_range is not sufficient. It should be defined as a range, not a single number. SGE assigns each job on a host a distinct gid.

Adjust gid_range using 'qconf -mconf' or qmon. The suggested range is:

gid_range                 20000-20100
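Because each job on a host gets its own gid, the size of the range limits how many jobs can run on one host at the same time. The sketch below computes that size, assuming a single range of the form n-m as in the suggestion above; gid_range_size is an illustrative helper name.

```shell
# Print how many group ids a gid_range value such as "20000-20100" provides.
# Returns non-zero for a single number or a malformed value.
# gid_range_size is an illustrative helper; it handles one n-m range only.
gid_range_size() {
    case "$1" in
        *[!0-9-]* | -* | *- | *-*-* | "") return 1 ;;
        *-*) ;;
        *) return 1 ;;
    esac
    low=${1%-*}
    high=${1#*-}
    [ "$high" -ge "$low" ] || return 1
    echo $((high - low + 1))
}
```

A single number such as 20000 is rejected, which matches the cause described above.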

qrsh -inherit -V does not work when used inside a parallel job:

cannot get connection to "qlogin_starter"

This problem occurs with nested qrsh calls, and is due to the -V switch. The first qrsh -inherit call will set the environment variable TASK_ID (the id of the tightly integrated task within the parallel job). The second call will then use this environment variable for registration of its task, which will fail as it tries to start a task with the same id as the first task.

You can either

  • unset TASK_ID before calling qrsh -inherit

  • not use the -V switch, but use -v and export only the environment variables really needed.
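The first option can be wrapped in a small helper so that TASK_ID is removed only for the nested call and stays set in the calling shell. This is a minimal sketch; without_task_id is an illustrative name.

```shell
# Run a command with TASK_ID removed from its environment.
# The subshell keeps the variable intact in the calling shell.
# without_task_id is an illustrative wrapper name.
without_task_id() {
    (
        unset TASK_ID
        "$@"
    )
}

# Example use inside a parallel job, with the host name from the text above:
#   without_task_id qrsh -inherit -V host4 hostname
```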

qrsh does not seem to work at all:

host2$ qrsh -verbose hostname
local configuration host2 not defined - using global configuration 
waiting for interactive job to be scheduled ...
Your interactive job 88 has been successfully scheduled.
Establishing /share/gridware/utilbin/solaris64/rsh session to host exehost ...
rcmd: socket: Permission denied
/share/gridware/utilbin/solaris64/rsh exited with exit code 1
reading exit code from shepherd ... 
error: error waiting on socket for client to connect: Interrupted system call
error: error reading returncode of remote command
cleaning up after abnormal exit of /share/gridware/utilbin/solaris64/rsh
host2$ 

Permissions for qrsh are not set properly

Check the permissions of the following files. They are located in $SGE_ROOT/utilbin/.

Note that rlogin and rsh need to be setuid and owned by root.

-r-s--x--x 1 root root 28856 Sep 18 06:00 rlogin*
-r-s--x--x 1 root root 19808 Sep 18 06:00 rsh*
-rwxr-xr-x 1 sgeadmin adm 128160 Sep 18 06:00 rshd*

NOTE: the $SGE_ROOT directory also needs to be NFS-mounted with the "setuid" option. If it is mounted with "nosuid" from your submit client, then qrsh (and associated commands) will not work.
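The ownership and setuid requirements can be checked with a short script such as the following sketch. The helper name check_setuid_root is illustrative, and the path in the usage note reuses the architecture directory from the example session above.

```shell
# Report whether a file is setuid and owned by root (uid 0).
# check_setuid_root is an illustrative helper name.
check_setuid_root() {
    f=$1
    [ -e "$f" ] || { echo "$f: missing"; return 1; }
    status=0
    [ -u "$f" ] || { echo "$f: setuid bit not set"; status=1; }
    # The third field of "ls -ldn" is the numeric owner id.
    owner=$(ls -ldn "$f" | awk '{ print $3; exit }')
    [ "$owner" = "0" ] || { echo "$f: owner uid is $owner, not 0"; status=1; }
    return $status
}
```

For example, "for f in rlogin rsh; do check_setuid_root /share/gridware/utilbin/solaris64/$f; done" prints nothing when both files are set up correctly. It does not detect a nosuid mount, which must be checked in the mount options.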

Qmake

When trying to start a distributed make qmake exits with the following error message:

qrsh_starter: executing child process qmake failed: No such file or directory

Grid Engine will start an instance of qmake on the execution host. If the Grid Engine environment (especially the PATH) is not set up in the user's shell resource file (.profile/.cshrc), this qmake call will fail.

Use the -v option to export the PATH to the qmake job. A typical qmake call is

qmake -v PATH -cwd -pe make 2-10 --

When doing qmake, the error seen is:

waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 5
Your "qrsh" request could not be scheduled, try again later.

The ARCH variable could be set incorrectly in the shell which called qmake



Set ARCH correctly to a supported value matching a host available in your cluster, or else specify the correct value at submit time, e.g.,

qmake -v ARCH=solaris64 ...

Parallel/Checkpointing

Parts of Sun HPC ClusterTools parallel jobs (job script itself, child processes, etc) fail to stop when terminated by user or by qmaster.

The user may not have supplied the necessary means (scripts) for SGE to control the distributed jobs.

Follow the complete HOW-TO instructions on Integration between Grid Engine and HPC Cluster Tools.

Shadow Facility

Shadow host fails to take over mastership of the SGE cluster

Lock file exists.

Remove the $SGE_ROOT/<cell>/spool/qmaster/lock file if the master host has crashed or can no longer function as qmaster.
NOTE: to force the shadow host to take over from another master, use the “migrate” option, i.e., “rcsge -migrate”.

Root does not have read/write access to the $SGE_ROOT directory and its sub-directories from both the master and the shadow host.

Adjust permissions for root r/w access to the $SGE_ROOT directory and its sub-directories from shadow host.

NOTE: please see the Shadow Master HOWTO