checkpoint.5
NAME
checkpoint - Grid Engine checkpointing environment configuration file
format
DESCRIPTION
Checkpointing is a facility to save the complete state of an executing
program or job and to restore and restart it from this so-called
checkpoint at a later time if the original program or job was
halted, e.g. through a system crash.
Grid Engine provides various levels of checkpointing support (see
sge_ckpt(5)). The checkpointing environment described here is a means
to configure the different types of checkpointing in use for your Grid
Engine cluster or parts thereof. For that purpose you can define the
operations which have to be executed in initiating a checkpoint
generation, a migration of a checkpoint to another host, or a restart
of a checkpointed application.
Because Grid Engine supports different operating systems, the
checkpointing configuration may contain operating system dependencies,
and updates to the supported operating system versions may cause
implementation details to change frequently.
Please refer to the <sge_root>/ckpt directory for more information.
Please use the -ackpt, -dckpt, -mckpt or -sckpt options to the qconf(1)
command to manipulate checkpointing environments from the command-line
or use the corresponding qmon(1) dialogue for X-Windows based
interactive configuration.
Note that Grid Engine allows backslashes (\) to be used to escape
newline characters. The backslash and the newline are replaced with a
space character before any interpretation.
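
The continuation rule above can be illustrated with a minimal Python
sketch. It is not Grid Engine's own code; the function name
join_continuations and the script path /usr/local/bin/ckpt.sh are
hypothetical and chosen only for this example.

```python
def join_continuations(text: str) -> str:
    """Replace each backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


# A ckpt_command value split over two lines (hypothetical script path).
raw = "ckpt_command /usr/local/bin/ckpt.sh \\\n$job_pid $ckpt_dir\n"
print(join_continuations(raw))
```

A space that precedes the backslash is kept, so the joined line may
contain two consecutive spaces at the point where the lines met.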
FORMAT
The format of a checkpoint file is defined as follows:
ckpt_name
The name of the checkpointing environment in the format for ckpt_name
in sge_types(5). To be used in the qsub(1) -ckpt switch or for the
qconf(1) options mentioned above.
interface
The type of checkpointing to be used. Currently, the following types
are valid:
hibernator
The Hibernator kernel level checkpointing is interfaced.
cpr The SGI kernel level checkpointing is used.
transparent
Grid Engine assumes that the jobs submitted with reference to
this checkpointing interface use a checkpointing library such as
that provided by the free Condor package.
userdefined
Grid Engine assumes that the jobs submitted with reference to
this checkpointing interface perform their private checkpointing
method.
application-level
Uses all of the interface commands configured in the
checkpointing object, as for the kernel level checkpointing
interfaces (cpr, etc.), except for the restart_command (see
below). The restart_command is not used, even if it is
configured; the job script is invoked instead when the job
restarts.
ckpt_command
A command-line type command string to be executed by Grid Engine in
order to initiate a checkpoint. The following pseudo-variables are
available to be substituted in the value:
$host The name of the host on which the command is executed.
$ja_task_id
The array job task index (0 if not an array job).
$job_owner
The user name of the job owner.
$job_id
Grid Engine's unique job identification number.
$job_name
The name of the job.
$queue The cluster queue name of the master queue instance, on which
the command is started.
$job_pid
The process id of the job/task to checkpoint.
$ckpt_dir
See ckpt_dir below.
$ckpt_signal
See signal below.
$sge_cell
The SGE_CELL environment variable (useful for locating files).
$sge_root
The SGE_ROOT environment variable (useful for locating files).
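
As an illustration of how the pseudo-variables listed above might be
expanded in a command string, here is a minimal Python sketch. It is an
assumption about the behaviour, not Grid Engine's implementation; the
function name expand, the sample values and the script path are
hypothetical.

```python
import re

# Pseudo-variable names listed in this page for ckpt_command.
PSEUDO_VARS = (
    "host", "ja_task_id", "job_owner", "job_id", "job_name", "queue",
    "job_pid", "ckpt_dir", "ckpt_signal", "sge_cell", "sge_root",
)

# A dollar sign followed by letters and underscores.
_PATTERN = re.compile(r"\$([A-Za-z_]+)")


def expand(command: str, values: dict) -> str:
    """Substitute known pseudo-variables; leave anything else untouched."""
    def repl(match):
        name = match.group(1)
        if name in PSEUDO_VARS and name in values:
            return str(values[name])
        return match.group(0)
    return _PATTERN.sub(repl, command)


# Hypothetical values and script path, for illustration only.
values = {"job_id": 4711, "job_pid": 12345,
          "ckpt_dir": "/scratch/ckpt", "job_owner": "alice"}
print(expand("/usr/local/bin/ckpt.sh $job_id $job_pid $ckpt_dir/$job_owner",
             values))
```

Matching the whole name before looking it up keeps similar names such
as $job_id, $job_name and $job_pid from being confused with each other.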
migr_command
A command-line type command string to be executed by Grid Engine during
a migration of a checkpointing job from one host to another. The same
pseudo-variables are available as for ckpt_command. Note that the
command is expected to create a checkpoint itself - the checkpointing
command isn't called automatically on migration.
restart_command
A command-line type command string to be executed by Grid Engine when
restarting a previously checkpointed application. The same pseudo-
variables are available as for ckpt_command.
clean_command
A command-line type command string to be executed by Grid Engine in
order to clean up after a checkpointed application has finished. The
same pseudo-variables are available as for ckpt_command.
ckpt_dir
A file system location in which checkpoints of potentially considerable
size should be stored.
signal
A Unix signal to be sent to a job by Grid Engine to initiate checkpoint
generation. The value for this field can either be a symbolic name from
the list produced by the -l option of the kill(1) command or an integer
number which must be a valid signal on the systems used for
checkpointing.
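
A value for this field can be checked on a given host before it is
configured. The following Python sketch assumes a Unix host and
accepts symbolic names with or without the SIG prefix, which is an
assumption about the accepted spelling; the function name
resolve_signal is hypothetical.

```python
import signal


def resolve_signal(value: str) -> signal.Signals:
    """Map a symbolic name or an integer string to a signal on this host."""
    text = value.strip()
    if text.isdigit():
        # Raises ValueError if the number is not a valid signal here.
        return signal.Signals(int(text))
    name = text.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    try:
        return signal.Signals[name]
    except KeyError:
        raise ValueError(f"unknown signal name: {value!r}") from None


print(resolve_signal("usr1").name)
```

Signal numbers differ between operating systems, so a number that is
valid on one checkpointing host may name a different signal on another.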
when
The points in time when checkpoints are expected to be generated.
Valid values for this parameter are composed of the letters s, m, x,
and r, in any combination and without any separating character in
between. The same letters are allowed for the -c option of the qsub(1)
command, which overrides the definitions in the checkpointing
environment used. The meaning of the letters is as follows:
s A job is checkpointed, aborted and, if possible, migrated if the
corresponding sge_execd(8) is shut down on the job's host. This
operation is handled by the specified migr_command.
m Checkpoints are generated periodically at the min_cpu_interval
defined by the queue (see queue_conf(5)) in which a job
executes.
x A job is checkpointed, aborted and, if possible, migrated as
soon as the job gets suspended (manually as well as
automatically). This operation is handled by the specified
migr_command.
r A job will be rescheduled (not checkpointed) when the host on
which the job currently runs goes into the "unknown" state and
the time interval reschedule_unknown (see sge_conf(5)) defined
in the global/local cluster configuration is exceeded.
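
The letter combinations described above can be validated with a short
Python sketch. The function name parse_when is hypothetical, and
tolerating repeated letters is an assumption made for this example, not
documented Grid Engine behaviour.

```python
# Short summaries of the four occasion letters described in this page.
WHEN_LETTERS = {
    "s": "checkpoint and migrate when sge_execd is shut down",
    "m": "checkpoint periodically at min_cpu_interval",
    "x": "checkpoint and migrate when the job is suspended",
    "r": "reschedule when the host stays unknown too long",
}


def parse_when(value: str) -> list:
    """Return the distinct occasion letters in value, in order of appearance."""
    seen = []
    for letter in value:
        if letter not in WHEN_LETTERS:
            raise ValueError(f"invalid 'when' letter: {letter!r}")
        if letter not in seen:
            seen.append(letter)
    return seen


for letter in parse_when("sx"):
    print(letter, "-", WHEN_LETTERS[letter])
```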
ENVIRONMENT VARIABLES
SGE_BINDING and SGE_CKPT_DIR may be specified on job submission. See
submit(1).
RESTRICTIONS
Note that the checkpointing, migration and restart procedures provided
by default with the Grid Engine distribution, as well as the way they
are invoked in the ckpt_command, migr_command or restart_command
parameters of any default checkpointing environments, should not be
changed; otherwise the functionality remains the full responsibility
of the administrator configuring the checkpointing environment. Grid
Engine will just invoke these procedures and evaluate their exit
status. If the procedures do not perform their tasks properly, or are
not invoked in a proper fashion, the checkpointing mechanism may behave
unexpectedly; Grid Engine has no means to detect this - all exit codes
are treated as successful operation except in the case of kernel
checkpointing.
See also the restrictions in sge_ckpt(5).
SEE ALSO
sge_intro(1), sge_ckpt(5), sge_types(5), qconf(1), qmod(1), qsub(1),
sge_execd(8).
COPYRIGHT
See sge_intro(1) for a full statement of rights and permissions.
SGE 8.1.3pre 2012-01-07 CHECKPOINT(5)