sge_shepherd.8




NAME

       sge_shepherd - Grid Engine single job-controlling agent


SYNOPSIS

       sge_shepherd


DESCRIPTION

       sge_shepherd provides the parent process functionality for a single
       Grid Engine job.  The parent functionality is necessary on UNIX systems
       to retrieve resource usage information (see getrusage(2)) after a job
       has finished.  In addition, sge_shepherd forwards signals to the job,
       such as those for suspension, enabling, termination, and the Grid
       Engine checkpointing signal (see sge_ckpt(5) and queue_conf(5) for
       details).
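
       The signal forwarding described above can be sketched as follows.
       This is a minimal illustration, not the shepherd's implementation:
       the function name run_with_forwarding and the set of forwarded
       signals are assumptions made for the example.

```python
import signal
import subprocess

# Hypothetical choice of signals to forward; the real set is not listed here.
FORWARDED = (signal.SIGTERM, signal.SIGUSR1, signal.SIGUSR2)


def run_with_forwarding(argv):
    """Start argv as a child, relay FORWARDED signals to it, return its exit status."""
    child = None

    def forward(signum, _frame):
        # Relay the signal only while the child exists and is still running.
        if child is not None and child.poll() is None:
            child.send_signal(signum)

    # Install the handlers before starting the child so no signal is missed.
    previous = {sig: signal.signal(sig, forward) for sig in FORWARDED}
    try:
        child = subprocess.Popen(argv)
        return child.wait()
    finally:
        # Restore whatever handlers were active before the call.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

       The handlers must be installed from the main thread, which is a
       restriction of Python's signal module rather than of Grid Engine.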

       sge_shepherd receives information about the job to be started from
       sge_execd(8).  During the execution of the job it starts up to five
       child processes.  First a prolog script is run if this feature
       is enabled by the prolog parameter in the cluster configuration. (See
       sge_conf(5).)  Next a parallel environment startup procedure is run if
       the job is a parallel job. (See sge_pe(5) for more information.)  After
       that, the job itself is run, followed by a parallel environment
       shutdown procedure for parallel jobs, and finally an epilog script if
       requested by the epilog parameter in the cluster configuration. The
       prolog and epilog scripts, as well as the parallel environment startup
       and shutdown procedures, are to be provided by the Grid Engine
       administrator and are intended for site-specific actions to be taken
       before and after execution of the actual user job.
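
       The order of the child processes described above can be summarized in
       a short sketch.  The function planned_steps and its parameters are
       hypothetical names chosen for illustration; they are not part of Grid
       Engine.

```python
def planned_steps(prolog=None, epilog=None, is_parallel=False):
    """Return the ordered child-process steps for one job (at most five)."""
    steps = []
    if prolog:            # prolog configured in the cluster configuration
        steps.append("prolog")
    if is_parallel:       # parallel environment startup procedure
        steps.append("parallel start")
    steps.append("job")   # the job itself always runs
    if is_parallel:       # parallel environment shutdown procedure
        steps.append("parallel stop")
    if epilog:            # epilog configured in the cluster configuration
        steps.append("epilog")
    return steps
```

       A serial job without prolog or epilog therefore yields a single
       step, and a parallel job with both scripts configured yields all
       five.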

       After the job has finished and the epilog script is processed,
       sge_shepherd retrieves resource usage statistics about the job, places
       them in a job-specific subdirectory of the sge_execd(8) spool directory
       for reporting through sge_execd(8), and finishes.
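
       The reason a parent process is needed for usage accounting can be
       seen in a small sketch built on getrusage(2).  It assumes a UNIX
       host; run_and_measure is a hypothetical helper, and the shepherd's
       own bookkeeping and file formats are not shown.

```python
import resource
import subprocess


def run_and_measure(argv):
    """Run argv as a child, wait for it, and return (exit_status, usage)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    completed = subprocess.run(argv)
    # Usage of a child is added to RUSAGE_CHILDREN only once it was waited for.
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    usage = {
        "ru_utime": after.ru_utime - before.ru_utime,  # user CPU seconds
        "ru_stime": after.ru_stime - before.ru_stime,  # system CPU seconds
        # Maximum over all children waited for so far, not a per-child delta.
        "ru_maxrss": after.ru_maxrss,
    }
    return completed.returncode, usage
```

       Only the process that waits for the job can obtain these figures,
       which is why the job runs as a child of a dedicated parent.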

       sge_shepherd also places an exit status file in the spool directory.
       This exit status can be viewed with qacct -j JobId (see qacct(1)); it
       is not the exit status of sge_shepherd itself but of one of the methods
       executed by sge_shepherd.  This exit status can have several meanings,
       depending on the method in which an error occurred (if any).  The
       possible methods are: prolog, parallel start, job, parallel stop,
       epilog, suspend, restart, terminate, clean, migrate, and checkpoint.

       The following exit values are returned:

       0      All methods: Operation was executed successfully.

       99     Job script, prolog and epilog: When FORBID_RESCHEDULE is not set
              in the configuration (see sge_conf(5)), the job gets re-queued.
              Otherwise see "Other".

       100    Job script, prolog and epilog: When FORBID_APPERROR is not set
              in the configuration (see sge_conf(5)), the job is set to error
              state.  Otherwise see "Other".

       Other  Job script: This is the exit status of the job itself. No action
              is taken upon this exit status because the meaning of this exit
              status is not known.
              Prolog, epilog and parallel start: The queue is set to error
              state and the job is re-queued.
              Parallel stop: The queue is set to error state, but the job is
              not re-queued. It is assumed that the job itself ran
              successfully and only the cleanup script failed.
              Suspend, restart, terminate, clean, and migrate: Always
              successful.
              Checkpoint: Treated as success, except for kernel
              checkpointing, where it means that the checkpoint was not
              successful and did not happen (but migration will happen).
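
       The table above can be read as a decision procedure.  The sketch
       below encodes it under stated assumptions: classify_exit, its
       parameters, and the returned labels are hypothetical names, and the
       two boolean flags stand for FORBID_RESCHEDULE and FORBID_APPERROR
       being set in the configuration.

```python
ALWAYS_OK = {"suspend", "restart", "terminate", "clean", "migrate"}


def classify_exit(method, status, forbid_reschedule=False,
                  forbid_apperror=False, kernel_checkpoint=False):
    """Map a method's exit status to a label for the action in the table."""
    if status == 0:
        return "success"
    if method in {"job", "prolog", "epilog"}:
        if status == 99 and not forbid_reschedule:
            return "requeue_job"
        if status == 100 and not forbid_apperror:
            # Exit 100 is the status controlled by FORBID_APPERROR;
            # sge_conf(5) documents how the job is then handled.
            return "application_error"
    # Everything below is the "Other" row of the table.
    if method == "job":
        return "no_action"
    if method in {"prolog", "epilog", "parallel start"}:
        return "queue_error_and_requeue_job"
    if method == "parallel stop":
        return "queue_error_only"
    if method in ALWAYS_OK:
        return "success"
    if method == "checkpoint":
        if kernel_checkpoint:
            return "checkpoint_failed_migration_follows"
        return "success"
    raise ValueError("unknown method: %r" % (method,))
```

       Statuses 99 and 100 fall through to the "Other" row when the
       corresponding FORBID_ parameter is set, which matches the
       'Otherwise see "Other"' wording.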

       For the meaning of the return codes of the shepherd itself (which are
       interpreted by qacct(1)) see sge_status(5).


RESTRICTIONS

       sge_shepherd should not be invoked manually, but only by sge_execd(8).


ENVIRONMENT VARIABLES

       SGE_ROOT       Specifies the location of the Grid Engine standard
                      configuration files.

       SGE_CELL       If set, specifies the default Grid Engine cell.  To
                      address a Grid Engine cell, sge_execd uses (in order of
                      precedence):

                             The name of the cell specified in the environment
                             variable SGE_CELL, if it is set.

                             The name of the default cell, i.e. default.


       SGE_ENABLE_COREDUMP
                      If set, enable core dumps on Linux when the admin_user
                      is not root.  Linux normally disables core dumps when
                      the daemon has changed uid or gid.  Setting
                      SGE_ENABLE_COREDUMP in sge_execd's environment overrides
                      that behavior, so that core dumps can be produced for
                      debugging if they are otherwise allowed.  This is
                      typically not a big hazard with SGE, since most
                      information is exposed in the spool area anyhow.  Dumps
                      will appear in the qmaster spool directory, which need
                      not be world-readable.
                      On Solaris, coreadm(1) may be used to enable such dumps.

       SGE_CGROUP_DIR If Linux cgroups handling is enabled, this variable
                      names a directory under the cgroup mount point in which
                      to create job-specific directories.  The default is
                      sge.SGE_CELL so, for instance, the cpuset cgroup for a
                      job might be /sys/fs/cgroup/cpuset/sge.default/123.
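
       The defaults for SGE_CELL and SGE_CGROUP_DIR can be illustrated with
       a short sketch.  The function names, the /sys/fs/cgroup mount point,
       and the path layout (mount point, controller, directory, job id) are
       assumptions inferred from the single example path above; treating an
       empty variable as unset is also an assumption.

```python
import posixpath


def resolve_cell(environ):
    """Return the cell name: SGE_CELL if set, otherwise "default"."""
    return environ.get("SGE_CELL") or "default"


def cgroup_job_dir(environ, job_id, controller="cpuset",
                   mount_point="/sys/fs/cgroup"):
    """Build the job-specific cgroup directory for one controller."""
    # SGE_CGROUP_DIR overrides the default directory name sge.<cell>.
    subdir = environ.get("SGE_CGROUP_DIR") or "sge." + resolve_cell(environ)
    return posixpath.join(mount_point, controller, subdir, str(job_id))
```

       With an empty environment, cgroup_job_dir({}, 123) reproduces the
       example path /sys/fs/cgroup/cpuset/sge.default/123.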


FILES

       <execd_spool>/job_dir/<job_id>
                      Job-specific directory.

       <sge_root>/<cell>/common/sgepasswd
                      Password information used on Microsoft Windows hosts.
                      The file contains a list of user names and their
                      corresponding encrypted passwords.  If available, it is
                      used by sge_shepherd.  To change its contents, use the
                      sgepasswd command; editing the file manually is not
                      advised.  See sgepasswd(5).


SEE ALSO

       sge_intro(1), sge_conf(5), sge_status(5), remote_startup(5),
       sgepasswd(5), sge_execd(8).


COPYRIGHT

       See sge_intro(1) for a full statement of rights and permissions.



SGE 8.1.3pre             $Date: 2007-07-19 09:04:33 $          SGE_SHEPHERD(8)
