sge_ckpt.5




NAME

       sge_ckpt - the Grid Engine checkpointing mechanism and checkpointing
       support


DESCRIPTION

       Grid Engine supports two levels of checkpointing: the user level and an
       operating system-provided transparent level. User-level checkpointing
       refers to applications which do their own checkpointing by writing
       restart files at certain times or algorithmic steps and by properly
       processing these restart files when restarted.
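
       As an illustration of user-level checkpointing, the following is a
       minimal Python sketch, not a Grid Engine interface. The restart file
       name, the JSON layout and the function names are all hypothetical.
       It writes a restart file every few steps and picks up from it when
       the job is started again.

```python
import json
import os
import tempfile

RESTART_FILE = "restart.json"  # hypothetical restart file name


def load_state(path):
    """Return the saved state, or a fresh one if no restart file exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"step": 0, "total": 0}


def save_state(path, state):
    """Write to a temporary file, then rename it over the restart file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(path, n_steps, checkpoint_every=10):
    """Sum the integers 0..n_steps-1, checkpointing at regular steps."""
    state = load_state(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_state(path, state)
    save_state(path, state)
    return state["total"]
```

       Writing to a temporary file and renaming it means that a job aborted
       in the middle of a write still finds the previous, complete restart
       file when it is restarted.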

       Transparent checkpointing has to be provided by the operating system
       and is usually integrated into the operating system kernel. An example
       of a kernel-integrated checkpointing facility is the Hibernator
       package from Softway for SGI IRIX platforms.

       Checkpointing jobs must be identified to the Grid Engine system with
       the -ckpt option of the qsub(1) command. The argument to this option
       names a so-called checkpointing environment, which defines the
       attributes of the checkpointing method to be used (see checkpoint(5)
       for details).  Checkpointing environments are set up with the qconf(1)
       options -ackpt, -dckpt, -mckpt and -sckpt. The qsub(1) option -c can be
       used to override the when attribute of the referenced checkpointing
       environment.
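
       The submission described above can be sketched as a small Python
       helper that only assembles the qsub(1) argument list. The
       environment name my_ckpt and the script name job.sh are
       hypothetical; x is the occasion specifier mentioned in this page.
       Other specifiers are described in qsub(1).

```python
import shlex


def build_qsub_command(script, ckpt_env, when=None):
    """Return the argument list for submitting script as a checkpointing job."""
    args = ["qsub", "-ckpt", ckpt_env]
    if when is not None:
        # -c overrides the "when" attribute of the checkpointing environment
        args += ["-c", when]
    args.append(script)
    return args


# "my_ckpt" and "job.sh" are placeholder names, not defaults of Grid Engine
cmd = build_qsub_command("job.sh", "my_ckpt", when="x")
print(shlex.join(cmd))
```

       On a submit host the resulting list could be handed to
       subprocess.run(); the sketch stops short of that so it can be read
       and tried without a running cluster.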

       Unlike regular batch jobs, checkpointing jobs (see the -ckpt option to
       qsub(1)) are aborted under conditions in which batch or interactive
       jobs are merely suspended or even remain unaffected. These conditions
       are:

       o  Explicit suspension of the queue or job via qmod(1) by the cluster
          administration or a queue owner if the x occasion specifier (see
          qsub(1) -c and checkpoint(5)) was assigned to the job.

       o  A load average value exceeding the suspend threshold as configured
          for the corresponding queues (see queue_conf(5)).

       o  Shutdown of the Grid Engine execution daemon sge_execd(8) that is
          responsible for the checkpointing job.

       After they are aborted, jobs migrate to other hosts, and possibly
       other cluster queues, unless they were submitted to a specific one by
       an explicit user request.  This migration of jobs results in dynamic
       load balancing.  Note: Aborting a checkpointing job frees all
       resources (memory, swap space) that the job occupies at that time.
       In contrast, suspended regular jobs continue to hold virtual memory
       and other consumable resources.


RESTRICTIONS

       When a job migrates to another machine, no files are currently
       transferred to that machine automatically. This means that all files
       used by the job, including restart files, executables, and scratch
       files, must either be visible on that machine or be transferred
       explicitly (e.g. at the beginning of the job script).
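
       One way to make the explicit transfer concrete is a staging step run
       at the start of the job. The following Python sketch assumes a
       hypothetical layout in which the needed files live in a directory
       reachable from every execution host; the function name and its
       arguments are illustrative only.

```python
import os
import shutil


def stage_files(names, source_dir, work_dir):
    """Copy each named file from source_dir into work_dir unless already there.

    Returns the list of files that were actually copied.
    """
    os.makedirs(work_dir, exist_ok=True)
    copied = []
    for name in names:
        target = os.path.join(work_dir, name)
        if os.path.exists(target):
            continue  # already visible on this host
        shutil.copy2(os.path.join(source_dir, name), target)
        copied.append(name)
    return copied
```

       A file missing from source_dir raises FileNotFoundError, so the job
       fails early rather than running without its restart files.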

       There are also some practical limitations regarding use of disk space
       for transparently checkpointing jobs. Checkpoints of a transparently
       checkpointed application are usually stored in a checkpoint file or
       directory by the operating system. The file or directory contains all
       the text, data, and stack space for the process, along with some
       additional control information. This means that jobs which use a very
       large virtual address space will generate very large checkpoint files.
       In addition, the workstations on which the jobs actually execute may
       have little free disk space. Thus it is not always possible to
       transfer a transparent checkpointing job to a machine, even though
       that machine is idle. Since large virtual memory jobs must wait for a
       machine that is both idle and has a sufficient amount of free disk
       space, such jobs may suffer long turnaround times.
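
       The disk space constraint can be checked before a checkpoint is
       written. The sketch below is an assumption about how a site might
       do this in Python and is not part of Grid Engine; the 10% safety
       margin is an arbitrary illustrative value.

```python
import shutil


def has_room_for_checkpoint(directory, image_bytes, margin=0.1):
    """True if directory has room for image_bytes plus a safety margin."""
    needed = int(image_bytes * (1 + margin))
    return shutil.disk_usage(directory).free >= needed
```

       Here image_bytes would be an estimate of the process's virtual
       address space, since the checkpoint holds its text, data, and stack.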

       There is currently no mechanism for restarting jobs with the same
       resources they were granted originally.  This can matter if a job was
       submitted with a choice or range of resources and configured itself
       at startup according to what it was actually granted.
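
       A job that checkpoints at user level can at least detect such a
       change itself. The Python sketch below assumes the job keeps a
       dictionary of state in its restart file and relies on the NSLOTS
       environment variable that Grid Engine sets for jobs (see qsub(1));
       the function names and the "nslots" key are hypothetical.

```python
import os


def granted_slots(environ=os.environ):
    """Return the slot count from NSLOTS, or None if it is not set."""
    value = environ.get("NSLOTS")
    return int(value) if value is not None else None


def slots_changed(state, environ=os.environ):
    """Record the slot count on first start; True if a restart differs."""
    current = granted_slots(environ)
    previous = state.setdefault("nslots", current)
    return previous != current
```

       The first call stores the granted slot count in the state; later
       calls compare against it, so the job can re-partition its work or
       exit with an error when the restart was granted a different number.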

       Similarly, with heterogeneous execution hosts, jobs may need to restart
       on a host which supports a superset of the instruction set of the host
       on which the job originally ran, if the checkpoint mechanism (e.g.
       BLCR or DMTCP) dumps an image of the running process.  Runtime
       libraries, in particular, may initialize themselves depending on
       details of the architecture they start up on, for example to use a
       specific type of vector unit.  They may then fail if moved to an older
       host of similar architecture which lacks that feature, even if they
       were compiled for a common instruction set.
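
       The superset requirement can be illustrated by comparing CPU feature
       flags recorded at checkpoint time with those of the restart host.
       The Python sketch below assumes Linux on x86, where /proc/cpuinfo
       contains a "flags" line; other platforms name this field
       differently. Neither function is part of Grid Engine, BLCR or DMTCP.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(checkpoint_flags, host_flags):
    """Features recorded at checkpoint time that this host lacks."""
    return sorted(set(checkpoint_flags) - set(host_flags))
```

       An empty result means the restart host offers a superset of the
       recorded features; a non-empty one lists features, such as a vector
       extension, that an already initialized runtime library might still
       try to use.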


SEE ALSO

       sge_intro(1), qconf(1), qmod(1), qsub(1), checkpoint(5),
       queue_conf(5), sge_execd(8)


COPYRIGHT

       See sge_intro(1) for a full statement of rights and permissions.



SGE 8.1.3pre                      2012-09-18                       SGE_CKPT(5)
