Last updated: Feb 26, 2014
The present HOWTO covers some commonly seen problems experienced when using Grid Engine, and appropriate solutions. Each entry below is presented using the following scheme: Symptom, Cause, Resolution.

For problems which are not explicitly mentioned here, look for a symptom which matches your problem as closely as possible, and see if the resolution fixes your particular case.
Symptom: My output file for my job says "Warning: no access to tty... Thus no job control in this shell."

Cause: One or more of your login files contains an stty command. These commands are only useful if a terminal is present, and in Grid Engine batch jobs there is no terminal associated with the job.

Resolution: Either remove all stty commands from your login files, or bracket them with an if statement which checks for a terminal before running them. An example for /bin/csh:

```csh
stty -g                  # checks terminal status
if ($status == 0) then   # succeeds if a terminal is present
    <place all stty commands in here>
endif
```

(In recent SGE versions, you can allocate a pseudo-terminal for batch jobs with qsub -pty y.)
Symptom: The job's standard error log file says `tty`: Ambiguous, but there is no reference to tty in the user's shell which is called in the job script.

Cause: shell_start_mode is posix_compliant by default; therefore all job scripts run with the shell specified in the queue definition, not the one specified on the first line of the job script.

Resolution: Use the -S flag to qsub to name the shell explicitly, or change shell_start_mode to unix_behavior.
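A minimal sketch of both routes (the queue name all.q and the script name job.csh are placeholders):

```sh
# Request the shell for this job only
qsub -S /bin/csh job.csh

# Or edit the queue configuration and set
#   shell_start_mode  unix_behavior
qconf -mq all.q
```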
Symptom: A job script runs from the command line, but it fails when run via qsub.

Cause: Process limits may be being set for your job. To test this, write a test script which runs limit and limit -h (C shell) or ulimit -a (Bourne shell). Execute it both interactively at the shell prompt and through qsub, and compare the results (a minimal sketch is shown below). See also the troubleshooting guide.

Resolution: Remove any commands in configuration files which set limits in your shell.
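A test-script sketch under bash (the name checklimits.sh is a placeholder; -Sa/-Ha print all soft and hard limits):

```sh
#!/bin/bash
#$ -S /bin/bash
# checklimits.sh - print the soft and hard process limits in effect
echo "=== soft limits ==="
ulimit -Sa
echo "=== hard limits ==="
ulimit -Ha
```

Run it once at the shell prompt (./checklimits.sh) and once through qsub (qsub checklimits.sh); the batch output lands in checklimits.sh.o<jobid>, and any difference between the two outputs points at limits being imposed on batch jobs.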
Cause: Data and executables may not be accessible where they are needed.

Resolution: The job script itself must be accessible from the submit host, and all data and other executables needed by the script must be accessible on the execution host; this is usually achieved by sharing them via NFS.
Cause: The unlimited stack size set by default by SGE may cause some applications to crash on some operating systems.

Resolution: In the job script, use ulimit to set a stack size limit before calling the executable that crashes (see the sketch below), or modify the queue to set a smaller stack size (limits in bytes):

```sh
qconf -mattr queue h_stack 8389486 <queue_name>   # hard limit
qconf -mattr queue s_stack 8389486 <queue_name>   # soft limit
```
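A minimal job-script sketch of the first approach (the 8 MB value and the program name ./my_app are placeholders):

```sh
#!/bin/sh
#$ -S /bin/sh
# Limit the stack to 8 MB (ulimit -s takes kilobytes) before
# starting the application that crashes with an unlimited stack.
ulimit -s 8192
./my_app
```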
Symptom: qsub of a job results in the error "can't set additional group id for job" (seen in administrator or user mail, or in the shepherd trace file) and puts the queue into an error state.

Cause: There are several possible reasons; a common one is that gid_range in the cluster configuration is missing or too small, since SGE tags each job on a host with a distinct additional group id taken from this range (see also the gid_range entry below).

Resolution: The solution depends on the cause; for the gid_range case, adjust gid_range with qconf -mconf (or qmon) so that it contains at least as many ids as the maximum number of jobs that can run concurrently on one host.
Symptom: Exec hosts report a load of "-", and the queue is in an "alarm" and/or "unknown" state.

Cause: The qmaster is not receiving load reports from those hosts. There are a few things that can cause this: sge_execd may not be running on the execution host, or it may be unable to communicate with the qmaster, for example because of hostname resolution problems or a firewall between the hosts.

Resolution: Depending on the cause, check on the execution host that sge_execd is running (and restart it if necessary), and inspect the <cell>/spool/<host>/messages file on that host for communication or hostname errors (see also the hostname resolution entry below).
Symptom: A warning is printed to <cell>/spool/<host>/messages every 30 seconds. The messages look like this:

Tue Jan 23 21:20:46 2001|execd|meta|W|local configuration meta not defined - using global configuration

But there IS a file for each host in <cell>/common/local_conf/, each named with the FQDN.

Cause: Hostname resolution on the execution host "meta" returns the short name, while on the master host "meta" resolves to the FQDN.

Resolution: Make sure that all of your /etc/hosts files and your NIS/LDAP maps are consistent. In this example, /etc/hosts on the host "meta" could contain a line like

168.0.0.1 meta meta.your.domain

while it should instead read

168.0.0.1 meta.your.domain meta
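A quick way to compare what the two machines actually resolve (a sketch; the host name meta is taken from the example above, and getent is assumed to be available on your platform):

```sh
# Run on both the execution host and the master host and compare:
hostname
getent hosts meta
```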
Symptom: Occasionally I see "CHECKSUM ERROR", "WRITE ERROR" or "READ ERROR" messages in the "messages" files of the daemons. Do I need to worry about these?

Resolution: As long as these messages do not appear at one-second intervals (they typically appear between 1 and 30 times per day), there is no need to do anything about them.
Symptom: Jobs finish on a particular queue, and qmaster/messages shows:

```
Wed Mar 28 10:57:15 2001|qmaster|masterhost|I|job 490.1 finished on host exechost
```

but then the following errors are seen in exechost/messages on the exec host:

```
Wed Mar 28 10:57:15 2001|execd|exechost|E|can't find directory "active_jobs/490.1" for reaping job 490.1
Wed Mar 28 10:57:15 2001|execd|exechost|E|can't remove directory "active_jobs/490.1": opendir(active_jobs/490.1) failed: Input/output error
```

Cause: The $SGE_ROOT directory, which is automounted, is being unmounted, causing sge_execd to lose its current working directory.

Resolution: Use a local spool directory for your execd host. Set the parameter execd_spool_dir using qmon or qconf.
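A minimal sketch of the qconf route (the spool path /var/spool/sge and the host name exechost are placeholders; the directory must be local to the execution host and writable by the admin user, and sge_execd must be restarted afterwards):

```sh
# Open the host-specific configuration in an editor and set, e.g.:
#   execd_spool_dir  /var/spool/sge
qconf -mconf exechost
```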
Cause: The actual hostname <myhostname> of the machine is listed as an alias of localhost in /etc/hosts, like this:

127.0.0.1 localhost myhostname

Resolution: Remove <myhostname> as an alias of localhost and put <myhostname> after the machine's real IP address in /etc/hosts.
Symptom: Multiple queues cascade into an error state, rendering the grid unusable.

Cause: Errors in a user's .cshrc/.profile put every queue that tries to start that user's jobs into an error state.

Resolution: Correct the offending commands in the user's shell resource files, then clear the error state on the affected queues with qmod -c.
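A sketch of inspecting and clearing the error state once the dot files are fixed (the wildcard clears all queues; a specific queue instance such as all.q@host1 can be given instead, and qstat -explain is available in SGE 6.x):

```sh
# Show which queues are in error state and why
qstat -f -explain E

# Clear the error state so the queues accept jobs again
qmod -c '*'
```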
Symptom: Java complains "Could not reserve enough space for object heap".

Cause: Recent OpenJDK JVMs default to trying to reserve about 1/4 of the machine's physical memory for the heap at startup, which will usually fail under an h_vmem resource limit.

Resolution: Start java with the -XmxN option, where N is a heap size that fits within the job's memory limit.
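A minimal submission sketch (the 2G and 1500m values and the jar name myapp.jar are placeholders, chosen so the JVM heap fits under the job's h_vmem limit with room to spare):

```sh
#!/bin/sh
#$ -S /bin/sh
#$ -l h_vmem=2G
# Cap the heap well below h_vmem to leave room for the JVM's own
# code, thread stacks and native allocations.
java -Xmx1500m -jar myapp.jar
```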
Symptom: Submitting interactive jobs with qrsh, I get the error:

```
% qrsh -l mem_free=1G
error: error: no suitable queues
```

Yet queues are available for batch jobs submitted with qsub, and can be queried using qhost -l mem_free=1G and qstat -f -l mem_free=1G.

Cause: The message "error: no suitable queues" results from the "-w e" submit option, which is active by default for interactive jobs like qrsh (look for "-w e" in qrsh(1)). This option causes the submit command to fail if the qmaster cannot determine from the current cluster configuration that the job will be dispatchable; the intention is to decline job requests in advance when they cannot be granted. In this case mem_free is configured as a consumable resource, but you have not specified the amount of memory available on each host. The memory load values are deliberately not considered for this check, because they vary over time and therefore cannot be treated as part of the cluster configuration.

Resolution: You can either specify the amount of memory available on each host (complex_values in the execution host configuration), or relax the check for the submission by overriding the default with -w n or -w w.
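A sketch of both routes (the host name node01 and the 4G value are placeholders):

```sh
# Route 1: declare how much memory each host provides for the
# consumable; opens the host configuration in an editor, add e.g.:
#   complex_values  mem_free=4G
qconf -me node01

# Route 2: skip the dispatchability check for this submission only
qrsh -w n -l mem_free=1G
```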
Symptom: qrsh won't dispatch to the same node it is on. From a qsh shell:

```
host2 [49]% qrsh -inherit host2 hostname
error: executing task of job 1 failed:
host2 [50]% qrsh -inherit host4 hostname
host4
```

Cause: gid_range is not sufficient. It should be defined as a range, not a single number, because SGE assigns each job on a host a distinct gid.

Resolution: Adjust gid_range using qconf -mconf or qmon. A suggested range is:

gid_range  20000-20100
Symptom: qrsh -inherit -V does not work when used inside a parallel job:

cannot get connection to "qlogin_starter"

Cause: This problem occurs with nested qrsh calls and is due to the -V switch. The first qrsh -inherit call sets the environment variable TASK_ID (the id of the tightly integrated task within the parallel job). Because -V exports the whole environment, the second, nested call reuses this TASK_ID when registering its own task, which fails because a task with that id has already been started.

Resolution: You can either avoid exporting the full environment to the nested call (pass only the variables you actually need with -v), or unset TASK_ID before issuing the nested qrsh -inherit call.
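A minimal sketch of the second option inside a parallel job's task launcher (the host name and command are placeholders):

```sh
# Drop the TASK_ID inherited from the outer qrsh before nesting,
# so the inner task registers under its own id.
unset TASK_ID
qrsh -inherit -V somehost some_command
```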
Symptom: qrsh does not seem to work at all:

```
host2$ qrsh -verbose hostname
local configuration host2 not defined - using global configuration
waiting for interactive job to be scheduled ...
Your interactive job 88 has been successfully scheduled.
Establishing /share/gridware/utilbin/solaris64/rsh session to host exehost ...
rcmd: socket: Permission denied
/share/gridware/utilbin/solaris64/rsh exited with exit code 1
reading exit code from shepherd ...
error: error waiting on socket for client to connect: Interrupted system call
error: error reading returncode of remote command
cleaning up after abnormal exit of /share/gridware/utilbin/solaris64/rsh
host2$
```

Cause: Permissions for qrsh are not set properly.

Resolution: Check the permissions of the following files, located in $SGE_ROOT/utilbin/<arch>/. Note that rlogin and rsh need to be setuid and owned by root:

```
-r-s--x--x  1 root      root    28856 Sep 18 06:00 rlogin*
-r-s--x--x  1 root      root    19808 Sep 18 06:00 rsh*
-rwxr-xr-x  1 sgeadmin  adm    128160 Sep 18 06:00 rshd*
```

NOTE: the $SGE_ROOT directory also needs to be NFS-mounted with the "setuid" option. If it is mounted with "nosuid" from your submit client, then qrsh (and associated commands) will not work.
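A sketch of restoring the expected ownership and modes (run as root where the files physically live; <arch> stands for the architecture directory seen in the error output, e.g. solaris64; mode 4511 corresponds to -r-s--x--x):

```sh
cd $SGE_ROOT/utilbin/<arch>
chown root rlogin rsh
chmod 4511 rlogin rsh   # setuid root, execute-only for group and other
chmod 755  rshd
```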
Symptom: When trying to start a distributed make, qmake exits with the following error message:

qrsh_starter: executing child process qmake failed: No such file or directory

Cause: Grid Engine starts an instance of qmake on the execution host. If the Grid Engine environment (especially the PATH) is not set up in the user's shell resource file (.profile/.cshrc), this remote qmake call will fail.

Resolution: Use the -v option to export the PATH to the qmake job. A typical qmake call is:

qmake -v PATH -cwd -pe make 2-10 --
Symptom: When running qmake, the error seen is:

```
waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 5
Your "qrsh" request could not be scheduled, try again later.
```

Cause: The ARCH environment variable may be set incorrectly in the shell which called qmake, so the qrsh requests issued by qmake ask for an architecture that no host in the cluster provides.

Resolution: Set ARCH to a supported value matching a host available in your cluster, or specify the correct value at submit time, e.g.:

qmake -v ARCH=solaris64 ...
Symptom: Parts of Sun HPC ClusterTools parallel jobs (the job script itself, child processes, etc.) fail to stop when terminated by the user or by the qmaster.

Cause: The user may not have supplied the necessary means (scripts) for SGE to control the distributed jobs.

Resolution: Follow the complete HOWTO instructions on Integration between Grid Engine and HPC Cluster Tools.
Symptom: The shadow host fails to take over mastership of the SGE cluster.

Cause: A lock file exists.

Resolution: Remove the $SGE_ROOT/<cell>/spool/qmaster/lock file if the master host has crashed or can no longer function as qmaster.

Cause: Root read/write access to the $SGE_ROOT directory and its sub-directories is required from both the master host and the shadow host.

Resolution: Adjust permissions so that the shadow host has root read/write access to the $SGE_ROOT directory and its sub-directories. NOTE: please see the Shadow Master HOWTO.
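A sketch of the first resolution (the cell name default is a placeholder; remove the lock only after confirming that the old qmaster is really down, since the lock exists to prevent two qmasters from running at once):

```sh
# On the shadow host, after verifying that sge_qmaster on the old
# master host is no longer running:
ls -l $SGE_ROOT/default/spool/qmaster/lock
rm $SGE_ROOT/default/spool/qmaster/lock
```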