sge_ckpt.5
NAME
sge_ckpt - the Grid Engine checkpointing mechanism and checkpointing
support
DESCRIPTION
Grid Engine supports two levels of checkpointing: the user level and an
operating system-provided transparent level. User level checkpointing
refers to applications which do their own checkpointing by writing
restart files at certain times or algorithmic steps and by properly
processing these restart files when restarted.
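The following is a minimal sketch of user level checkpointing, not part of Grid Engine itself. It assumes a JSON restart file and a trivial computation (summing integers); the names load_state, save_state, run and the stop_after parameter are hypothetical and exist only to illustrate writing a restart file at each algorithmic step and resuming from it.

```python
import json
import os
import tempfile


def load_state(restart_path):
    """Return the saved state, or a fresh state if no restart file exists."""
    if os.path.exists(restart_path):
        with open(restart_path) as fh:
            return json.load(fh)
    return {"step": 0, "total": 0}


def save_state(restart_path, state):
    """Write the restart file atomically so an abort never leaves it half written."""
    directory = os.path.dirname(os.path.abspath(restart_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    # os.replace swaps the new file into place in a single step.
    os.replace(tmp_path, restart_path)


def run(restart_path, n_steps, stop_after=None):
    """Sum 1..n_steps, writing a restart file after every step.

    stop_after (hypothetical) simulates an abort after that many steps
    of the current invocation; the function then returns None.
    """
    state = load_state(restart_path)
    done_now = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_now >= stop_after:
            return None  # simulated abort; progress is in the restart file
        state["step"] += 1
        state["total"] += state["step"]
        save_state(restart_path, state)
        done_now += 1
    return state["total"]
```

When the job is restarted, possibly on another host, calling run() with the same restart path continues from the last completed step, provided the restart file is visible there.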
Transparent checkpointing has to be provided by the operating system
and is usually integrated in the operating system kernel. An example
of a kernel-integrated checkpointing facility is the Hibernator
package from Softway for SGI IRIX platforms.
Checkpointing jobs need to be identified to the Grid Engine system by
using the -ckpt option of the qsub(1) command. The argument to this
flag refers to a so-called checkpointing environment, which defines the
attributes of the checkpointing method to be used (see checkpoint(5)
for details). Checkpointing environments are set up with the qconf(1)
options -ackpt, -dckpt, -mckpt and -sckpt. The qsub(1) option -c can be
used to override the when attribute of the referenced checkpointing
environment.
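As an illustration of the submission options just described, the sketch below assembles a qsub(1) command line without running it. Only the -ckpt and -c options are taken from this page; the helper name build_qsub_argv, the environment name my_ckpt and the script name job.sh are hypothetical, and the environment is assumed to have been created beforehand with qconf(1) -ackpt.

```python
import shlex


def build_qsub_argv(job_script, ckpt_env, when=None):
    """Build the argument list for submitting a checkpointing job.

    ckpt_env is the name of an existing checkpointing environment.
    when, if given, is an occasion specifier passed to -c to override
    the environment's when attribute (see qsub(1) and checkpoint(5)).
    """
    argv = ["qsub", "-ckpt", ckpt_env]
    if when is not None:
        argv += ["-c", when]
    argv.append(job_script)
    return argv


def as_command_line(argv):
    """Render the argument list as a shell-quoted string for display."""
    return shlex.join(argv)
```

The resulting list can be handed to subprocess.run() on a submit host; building it as a list rather than a string avoids shell quoting problems with unusual script paths.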
In contrast to regular batch jobs, checkpointing jobs (see the -ckpt
option to qsub(1)) are aborted under conditions in which batch or
interactive jobs are suspended or even stay unaffected.
These conditions are:
o Explicit suspension of the queue or job via qmod(1) by the cluster
administration or a queue owner if the x occasion specifier (see
qsub(1) -c and checkpoint(5)) was assigned to the job.
o A load average value exceeding the suspend threshold as configured
for the corresponding queues (see queue_conf(5)).
o Shutdown of the Grid Engine execution daemon sge_execd(8) that is
responsible for the checkpointing job.
After they are aborted, jobs will migrate to other hosts, and possibly
other cluster queues, unless they were submitted to a specific one by
an explicit user request. This migration of jobs results in dynamic
load balancing. Note: Aborting checkpointed jobs will free all
resources (memory, swap space) which the job occupies at that time.
This contrasts with the situation for suspended regular jobs, which
still use virtual memory and other consumable resources.
RESTRICTIONS
When a job migrates to another machine, at present no files are
transferred automatically to that machine. This means that all files
which are used throughout the entire job, including restart files,
executables, and scratch files, must either be visible on the new
machine or be transferred explicitly (e.g. at the beginning of the
job script).
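A job can perform the explicit transfer itself before doing any real work. The sketch below is one possible approach, assuming the needed files live in a directory that is reachable from every execution host (for example on a shared file system); the function name stage_files and its parameters are hypothetical.

```python
import os
import shutil


def stage_files(required, source_dir, work_dir):
    """Copy each required file into work_dir unless it is already there.

    Returns the list of file names that actually had to be copied.
    Raises FileNotFoundError if a file is neither visible in work_dir
    nor available in source_dir.
    """
    os.makedirs(work_dir, exist_ok=True)
    copied = []
    for name in required:
        target = os.path.join(work_dir, name)
        if os.path.exists(target):
            continue  # already visible on this host
        source = os.path.join(source_dir, name)
        if not os.path.exists(source):
            raise FileNotFoundError(
                f"cannot stage {name}: not found in {source_dir}"
            )
        shutil.copy2(source, target)
        copied.append(name)
    return copied
```

Running this at the start of every invocation is harmless on the original host, where nothing is copied, and restores the working set after a migration.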
There are also some practical limitations regarding use of disk space
for transparently checkpointing jobs. Checkpoints of a transparently
checkpointed application are usually stored in a checkpoint file or
directory by the operating system. The file or directory contains all
the text, data, and stack space for the process, along with some
additional control information. This means jobs which use a very large
virtual address space will generate very large checkpoint files. Also,
the workstations on which the jobs will actually execute may have
little free disk space. Thus it is not always possible to transfer a
transparent checkpointing job to a machine, even though that machine is
idle. Since large virtual memory jobs must wait for a machine that is
both idle and has a sufficient amount of free disk space, such jobs
may suffer long turnaround times.
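The disk space constraint can be checked ahead of time. The sketch below is a rough illustration only: it assumes the checkpoint image is about as large as the job's virtual address space, and the function name has_room_for_checkpoint and the 10% safety margin are hypothetical choices, not Grid Engine behavior.

```python
import shutil


def has_room_for_checkpoint(checkpoint_dir, image_bytes, margin=1.1):
    """Return True if checkpoint_dir has room for an image of image_bytes.

    margin adds headroom for the additional control information stored
    alongside the text, data, and stack of the process.
    """
    free = shutil.disk_usage(checkpoint_dir).free
    return free >= image_bytes * margin
```

A site could run such a check from a load sensor or a prolog script to keep large jobs away from hosts that cannot hold their checkpoints.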
There is currently no mechanism for restarting jobs with the same
resources they were granted originally. That might be important if
they were submitted with a choice or range of resources and adapted
their execution to what they were actually given.
Similarly, with heterogeneous execution hosts, jobs may need to restart
on a host which supports a superset of the instruction set of the host
where the job originally ran if the checkpoint mechanism (e.g. BLCR or
DMTCP) dumps an image of the running process. Runtime libraries, in
particular, may initialize themselves depending on details of the
architecture they start up on, for example to use a specific type of
vector unit. They may then fail if moved to an older host of similar
architecture which lacks that feature, even if they were compiled for a
common instruction set.
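The superset requirement can be expressed as a set comparison of CPU feature flags. The sketch below assumes Linux hosts whose /proc/cpuinfo contains a "flags" line, as on x86 (other architectures use different keys); the function names are hypothetical, and Grid Engine itself performs no such check.

```python
def parse_cpu_flags(cpuinfo_text):
    """Extract the feature flags from the text of /proc/cpuinfo.

    Returns the flags of the first processor entry as a set, or an
    empty set if no "flags" line is present.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(origin_flags, target_flags):
    """Return the features of the original host that the target lacks."""
    return sorted(origin_flags - target_flags)


def can_restart_on(origin_flags, target_flags):
    """True if the target supports every feature of the original host."""
    return origin_flags <= target_flags
```

Recording the flags of the host where a job first started, for instance in a file next to the restart files, would let a restart wrapper refuse to resume on a host for which can_restart_on() is false.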
SEE ALSO
sge_intro(1), qconf(1), qmod(1), qsub(1), checkpoint(5)
COPYRIGHT
See sge_intro(1) for a full statement of rights and permissions.
SGE 8.1.3pre 2012-09-18 SGE_CKPT(5)