checkpoint.5




NAME

       checkpoint - Grid Engine checkpointing environment configuration file
       format


DESCRIPTION

       Checkpointing is a facility to save the complete status of an executing
       program or job and to restore and restart from this so-called
       checkpoint at a later point in time if the original program or job was
       halted, e.g. through a system crash.

       Grid Engine provides various levels of checkpointing support (see
       sge_ckpt(5)).  The checkpointing environment described here is a means
       to configure the different types of checkpointing in use for your Grid
       Engine cluster or parts thereof. For that purpose you can define the
       operations which have to be executed in initiating a checkpoint
       generation, a migration of a checkpoint to another host, or a restart
       of a checkpointed application.

       Supporting different operating systems may force Grid Engine to
       introduce operating system dependencies into the checkpointing
       configuration, and updates of the supported operating system versions
       may lead to frequently changing implementation details.  Please refer
       to the <sge_root>/ckpt directory for more information.

       Please use the -ackpt, -dckpt, -mckpt or -sckpt options to the qconf(1)
       command to manipulate checkpointing environments from the command-line
       or use the corresponding qmon(1) dialogue for X-Windows based
       interactive configuration.

       Note that Grid Engine allows backslashes (\) to be used to escape
       newline characters. The backslash and the newline are replaced with a
       space character before any interpretation.
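
       The continuation rule can be illustrated with a short Python sketch.
       It is not Grid Engine's own code; the function name join_continuations
       and the sample value (including the script path) are hypothetical and
       serve only to show the effect of the replacement.

```python
def join_continuations(text: str) -> str:
    """Replace every backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


# Hypothetical two-line value, for illustration only.
raw = "ckpt_command /path/to/ckpt.sh \\\n    $job_pid $ckpt_dir\n"
joined = join_continuations(raw)

# The continued value now sits on one logical line.
print(joined)
```

       After joining, the value occupies one logical line and can be
       interpreted as if it had been written without a line break.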


FORMAT

       The format of a checkpoint file is defined as follows:

   ckpt_name
       The name of the checkpointing environment in the format for ckpt_name
       in sge_types(5).  To be used in the qsub(1) -ckpt switch or for the
       qconf(1) options mentioned above.

   interface
       The type of checkpointing to be used. Currently, the following types
       are valid:

       hibernator
              The Hibernator kernel level checkpointing is interfaced.

       cpr    The SGI kernel level checkpointing is used.

       transparent
              Grid Engine assumes that the jobs submitted with reference to
              this checkpointing interface use a checkpointing library such as
              the one provided by the free package Condor.

       userdefined
              Grid Engine assumes that the jobs submitted with reference to
              this checkpointing interface use their own private checkpointing
              method.

       application-level
              Uses all of the interface commands configured in the
              checkpointing object, as with the kernel level checkpointing
              interfaces (cpr, etc.), except for the restart_command (see
              below).  The restart_command is not used, even if it is
              configured; the job script is invoked on restart instead.

   ckpt_command
       A command-line type command string to be executed by Grid Engine in
       order to initiate a checkpoint.  The following pseudo-variables are
       available to be substituted in the value:

       $host  The name of the host on which the command is executed.

       $ja_task_id
              The array job task index (0 if not an array job).

       $job_owner
              The user name of the job owner.

       $job_id
              Grid Engine's unique job identification number.

       $job_name
              The name of the job.

       $queue The cluster queue name of the master queue instance, on which
              the command is started.

       $job_pid
              The process id of the job/task to checkpoint.

       $ckpt_dir
              See ckpt_dir below.

       $ckpt_signal
              See signal below.

       $sge_cell
              The SGE_CELL environment variable (useful for locating files).

       $sge_root
              The SGE_ROOT environment variable (useful for locating files).
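
       How such a substitution might work can be sketched in Python.  This
       is an illustration under assumptions, not Grid Engine's actual
       expansion code: the function expand, the script path and all sample
       values are hypothetical, and leaving unknown names untouched is a
       choice made for this sketch only.

```python
import re

# Pseudo-variable names listed in this manual page for ckpt_command.
PSEUDO_VARIABLES = frozenset({
    "host", "ja_task_id", "job_owner", "job_id", "job_name", "queue",
    "job_pid", "ckpt_dir", "ckpt_signal", "sge_cell", "sge_root",
})

# Match "$" followed by a whole lowercase identifier, so that "$job_id"
# is never mistaken for a prefix of "$job_idx".
_TOKEN = re.compile(r"\$([a-z_]+)")


def expand(command: str, values: dict) -> str:
    """Substitute known pseudo-variables; leave everything else as written."""
    def replace(match):
        name = match.group(1)
        if name in PSEUDO_VARIABLES and name in values:
            return str(values[name])
        return match.group(0)  # assumption: unknown names are kept verbatim
    return _TOKEN.sub(replace, command)


# Hypothetical values and command string, for illustration only.
values = {"job_id": 4711, "ja_task_id": 0, "job_pid": 12345,
          "ckpt_dir": "/var/tmp/ckpt"}
template = "/hypothetical/ckpt.sh $job_pid $ckpt_dir/$job_id.$ja_task_id $unknown"
expanded = expand(template, values)
print(expanded)
```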

   migr_command
       A command-line type command string to be executed by Grid Engine during
       a migration of a checkpointing job from one host to another.  The same
       pseudo-variables are available as for ckpt_command.  Note that the
       command is expected to create a checkpoint itself - the checkpointing
       command isn't called automatically on migration.

   restart_command
       A command-line type command string to be executed by Grid Engine when
       restarting a previously checkpointed application.  The same pseudo-
       variables are available as for ckpt_command.

   clean_command
       A command-line type command string to be executed by Grid Engine in
       order to clean up after a checkpointed application has finished.  The
       same pseudo-variables are available as for ckpt_command.

   ckpt_dir
       A file system location in which checkpoints of potentially considerable
       size should be stored.

   signal
       A Unix signal to be sent to a job by Grid Engine to initiate checkpoint
       generation. The value for this field can either be a symbolic name from
       the list produced by the -l option of the kill(1) command or an integer
       number which must be a valid signal on the systems used for
       checkpointing.
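
       A value of either form can be checked against the signals known on a
       given system.  The following Python sketch uses the standard signal
       module; the function resolve_signal is hypothetical, and accepting
       names both with and without the SIG prefix is an assumption made here
       because kill -l output differs between systems.

```python
import signal


def resolve_signal(value: str) -> signal.Signals:
    """Map a symbolic name or an integer string to a signal on this system."""
    text = value.strip()
    if text.isdigit():
        # Raises ValueError if the number is not a valid signal here.
        return signal.Signals(int(text))
    name = text.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name  # assumption: accept "USR1" as well as "SIGUSR1"
    try:
        return signal.Signals[name]
    except KeyError:
        raise ValueError(f"unknown signal name: {value!r}") from None


print(resolve_signal("USR1").name)
print(int(resolve_signal("SIGTERM")))
```

       The result only reflects the host on which the check runs; signal
       numbers may differ between the systems used for checkpointing.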

   when
       The points in time when checkpoints are expected to be generated.
       Valid values for this parameter are composed of the letters s, m, x,
       and r in any combination, without any separating character in
       between. The same letters are allowed for the -c option of the qsub(1)
       command, which will override the definitions in the checkpointing
       environment used.  The meaning of the letters is as follows:

       s      A job is checkpointed, aborted and, if possible, migrated if the
              corresponding sge_execd(8) is shut down on the job's host.  This
              operation is handled by the specified migr_command.

       m      Checkpoints are generated periodically at the min_cpu_interval
              interval defined by the queue (see queue_conf(5)) in which a job
              executes.

       x      A job is checkpointed, aborted and, if possible, migrated as
              soon as the job gets suspended (manually as well as
              automatically).  This operation is handled by the specified
              migr_command.

       r      A job will be rescheduled (not checkpointed) when the host on
              which the job currently runs goes into the "unknown" state and
              the time interval reschedule_unknown (see sge_conf(5)) defined
              in the global/local cluster configuration is exceeded.
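
       A when value can be checked against the four letters above with a
       small Python sketch.  The function parse_when and the short
       descriptions in the table are hypothetical and paraphrase this page;
       whether Grid Engine accepts repeated letters or an empty value is not
       stated here, so the sketch does not reject either.

```python
# Letters described in this manual page, with paraphrased meanings.
WHEN_LETTERS = {
    "s": "checkpoint and migrate when sge_execd is shut down",
    "m": "checkpoint periodically at min_cpu_interval",
    "x": "checkpoint and migrate when the job is suspended",
    "r": "reschedule when the host stays in the unknown state",
}


def parse_when(value: str) -> set:
    """Return the set of letters in value, rejecting any unknown letter."""
    letters = set(value)
    invalid = letters - set(WHEN_LETTERS)
    if invalid:
        raise ValueError(f"invalid 'when' letters: {sorted(invalid)}")
    return letters


# Print the meaning of each letter in a combined value.
for letter in sorted(parse_when("sx")):
    print(letter, "-", WHEN_LETTERS[letter])
```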


ENVIRONMENT VARIABLES

       SGE_BINDING and SGE_CKPT_DIR may be specified on job submission.  See
       submit(1).


RESTRICTIONS

       The checkpointing, migration and restart procedures provided by
       default with the Grid Engine distribution, and the way they are
       invoked in the ckpt_command, migr_command or restart_command
       parameters of any default checkpointing environments, should not be
       changed.  If they are changed, their functionality becomes the full
       responsibility of the administrator configuring the checkpointing
       environment.  Grid Engine will just invoke these procedures and
       evaluate their exit status.  If the procedures do not perform their
       tasks properly, or are not invoked in a proper fashion, the
       checkpointing mechanism may behave unexpectedly.  Grid Engine has no
       means to detect this: all exit codes are treated as successful
       operation except in the case of kernel checkpointing.

       See also the restrictions in sge_ckpt(5).


SEE ALSO

       sge_intro(1), sge_ckpt(5), sge_types(5), qconf(1), qmod(1), qsub(1),
       sge_execd(8).


COPYRIGHT

       See sge_intro(1) for a full statement of rights and permissions.



SGE 8.1.3pre                      2012-01-07                     CHECKPOINT(5)
