Topic:

Checkpointing of serial jobs using the built-in checkpointing support in SGE.

Author:

Reuti, reuti__at__staff.uni-marburg.de; Philipps-University of Marburg, Germany

Version:

1.1a -- 2005-02-14 First updated release, comments and corrections are welcome

Note:

This HOWTO complements the information contained in the man pages of SGE and the Administration Guide.

Acknowledgement:

Many thanks to Lip Kian NG (lkng__at__apstc.sun.com.sg) from SUN Microsystems Singapore for reviewing this document and eliminating typos and ambiguities.



Prerequisites

Configuration of SGE with qconf or the GUI

You should be familiar with modifying the SGE configuration, such as the queue definitions and the scheduler configuration. Additional information on the SGE checkpoint interfaces is available in the "checkpoint" and "sge_ckpt" man pages (make sure the SGE man pages are included in your $MANPATH).

While working through this HOWTO, you may find it convenient to set the scheduler parameter "schedule_interval" to around 15 seconds, and optionally set:

SGE 5.3: the global configuration parameter "schedd_params" to "FLUSH_SUBMIT_SEC=0".

SGE 6.x: the scheduler configuration parameter "flush_submit_sec" to a small non-zero value such as "4", so that this setting is honoured. A sketch of both settings follows below.
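
As a sketch of these settings (the values are just the ones suggested above; "qconf -msconf" opens the scheduler configuration in an editor and "qconf -mconf" the global configuration), the relevant lines would look like:

SGE 6.x, scheduler configuration:

$ qconf -msconf
...
schedule_interval                 0:0:15
flush_submit_sec                  4
...

SGE 5.3, global configuration:

$ qconf -mconf
...
schedd_params                     FLUSH_SUBMIT_SEC=0
...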

Available checkpointing interfaces in SGE

You can divide the available checkpointing interfaces into two groups: they are either kernel-based, or they need some support from the running application and are hence called user-level interfaces. This document shows some examples of how to implement checkpointing with the different user-level checkpoint interfaces. These include:

Transparent interface
Application-level interface
Userdefined interface

Goal of checkpointing implementation in your applications

Some lengthy calculations can run for months. If the calculation node dies due to a hardware failure or power loss, you lose all the calculation time and have to restart from the beginning. With a checkpoint in place, you can at least restart from that point of the calculation.

Location of the created checkpoint files

You should use a shared file system as the location for the written checkpoint files. This way the created checkpoint files will be accessible to the restarted job, which will most likely run on another node. Depending on your application, this can be just the home directory of the executing user, or, as used in this document, /home/checkpoint. It can of course be a file system mounted just for this purpose, but it shouldn't be a shared scratch space without redundancy, such as one provided by PVFS1.

Interval for the created checkpoints

Although it depends on your application, you can simply create checkpoints after each critical step of the calculation, without any timed interval. If it's more suitable, SGE can trigger the creation of checkpoints after a timed interval. The default setting in the queue definition for this behaviour is 5 minutes, which you should regard only as a value for testing the SGE setup. For real applications that run for weeks, it may be sufficient to create a checkpoint every 6 to 24 hours. With such a setting, the amount of data written to the checkpoint directory will not noticeably slow down other traffic to the shared file system, and you will lose at most the last 6 to 24 hours of calculation.
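
For a production run you would therefore raise "min_cpu_interval" in the queue definition accordingly; as a sketch (using the queue vast00 that appears later in this HOWTO, with a 6 hour interval as an example, edited with "qconf -mq"):

$ qconf -mq vast00
...
min_cpu_interval     06:00:00
...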

Supplied files

All scripts and files for the examples can be found in the archive Checkpoint_Howto_Examples.tar.gz. The paths in the scripts have to be adjusted to your installation of SGE and to the directory in which you intend to run the scripts.

State transition diagrams

The behaviour of the checkpoint interfaces can be described as a finite state machine. State transition diagrams can be found in the HOWTO "Checkpointing under Linux with Berkeley Lab Checkpoint/Restart" on the HOWTO page.



The transparent interface

Application Characteristic

This interface is used for applications that have built-in checkpointing support triggered by a UNIX signal.

Behaviour of this interface

This interface sends a signal to the running job to initiate a checkpoint, according to the "min_cpu_interval" setting in the queue definition. Hence you have to define values for "signal", "ckpt_dir" and "min_cpu_interval". Not all parameters in the definition of the checkpointing interface will be honoured. The required entries for the checkpointing interface are:

$ qconf -sckpt check_transparent
ckpt_name          check_transparent
interface          transparent
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /home/checkpoint
queue_list         all
signal             usr2
when               xmr      

Here the "when" condition for the creation of a checkpoint file is set to "xmr". So a checkpoint will be created, by sending the signal usr2 (to the job, i.e. the whole processgroup), when the specified time interval (which is defined in the queue definition) has elapsed, or when the node goes offline.. It's not possible to initiate a checkpoint just before the migration of the job, but we set the "x" anyway to get the job at least restarted. This is limited to happen in the migration script, which is available for application-level checkpointing. For demonstration purpose, we set the time interval in the queue to 5 minutes (for testing purpose you can lower the value also to be around 15 seconds), which is the default but not useful for production, since a checkpoint every few hours is sufficient. The queue should contain the setting:

$ qconf -sq vast00
...
min_cpu_interval     00:05:00
...

How the checkpointing interface is made available to a queue depends on the SGE version in use.

SGE 5.3: The queue type must be set to CHECKPOINTING, and the intended queues added to the queue list in the definition of the checkpointing interface.

SGE 6.x: Here the definition works the other way round: you have to add the name of the checkpointing interface to the queue definition. The type of the queue will then automatically be set to support checkpointing.
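
As an illustration of the two ways of association (the queue name is just the one used later in this HOWTO), the relevant queue entries would look like:

SGE 5.3:

$ qconf -sq vast00
...
qtype                 BATCH INTERACTIVE CHECKPOINTING
...

SGE 6.x:

$ qconf -sq vast00
...
ckpt_list             check_transparent
...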

First script "checkpoint_1.sh" (example 1)

With this setup in place, let's start with just a bash script demonstrating the checkpoint behaviour. For now we simply write to a file in the directory pointed to by $SGE_CKPT_DIR, which SGE sets from the "ckpt_dir" entry of the checkpointing interface defined above:

#!/bin/sh
# check_transparent1.sh
      
trap 'date >> $SGE_CKPT_DIR/checkpoint_1' usr2
      
echo "Script started."
      
for ((i=0; i<1000; i++)) ; do
    sleep 1
done
      
echo "Script finished."
      
exit 0

Although no actual checkpoint is created yet, you can have a look at the files created in $SGE_CKPT_DIR and at the output of the script. After some time has elapsed, you will find the following:

$ qsub -ckpt check_transparent check_transparent1.sh
your job 8005 ("check_transparent1.sh") has been submitted
$ ls /home/checkpoint/
checkpoint_1
$ cat /home/checkpoint/checkpoint_1
Tue Dec 21 23:03:35 WET 2004
Tue Dec 21 23:08:35 WET 2004

Okay, this is working, so we can now implement a more useful version.

Second script "checkpoint_2.sh" (example 2)

For this we have to write some more useful information to the checkpoint file.

#!/bin/sh
# check_transparent2.sh
      
trap 'echo $ACTUAL_VALUE > $SGE_CKPT_DIR/checkpoint_2' usr2
      
#
# Check whether we are restarted and a checkpoint file is already available.
#
      
if [ "$RESTARTED" -eq "1" -a -e "$SGE_CKPT_DIR/checkpoint_2" -a -r "$SGE_CKPT_DIR/checkpoint_2" ] ; then
    read ACTUAL_VALUE < $SGE_CKPT_DIR/checkpoint_2
    echo "Script restarted with value $ACTUAL_VALUE."
else
    ACTUAL_VALUE=1
    echo "Script started."
fi
      
#
# Start of the program.
#
      
while [ "$ACTUAL_VALUE" -le 1000 ] ; do
    echo "Processing $ACTUAL_VALUE."
    let ACTUAL_VALUE++
    sleep 1
done
      
echo "Script finished."
      
exit 0

This time we write the current value of the loop into the checkpoint file. Although this is not foolproof (if the node crashes at the very moment we are writing a new version of the checkpoint file, we would be out of luck), it is sufficient for now. So let's look at the created file after at least five minutes:

$ ls ../checkpoint/checkpoint_2 
../checkpoint/checkpoint_2
$ cat ../checkpoint/checkpoint_2 
292

This seems to be working, so we can now try to trigger the restart of the job. Instead of pulling the power cord of the node, we can use the qmod command which, with a correctly set up checkpointing interface, will kill the running job and restart it (i.e. migrate the job).

$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8016     0 check_tran reuti        r     12/22/2004 00:14:21 vast15     MASTER
$ qmod -s 8016
reuti - suspended job 8016

After a while, you will notice that the job has been restarted, possibly on a different node than before:

$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8016     0 check_tran reuti        Rr    12/22/2004 00:24:06 vast11     MASTER

You will notice that this job has the same job number as the original one, but the state field "Rr" shows that this job was restarted. When you look at the generated output file, you will see something like this:

$ cat check_transparent2.sh.o8016
...
Processing 570.
Processing 571.
Script restarted with value 292.
Processing 292.
Processing 293.
Processing 294.
Processing 295.
Processing 296.
...

As expected, processing continues from the last state saved in the checkpoint file. So we can now move the checkpoint creation into a real program, and change the bash script so that it only provides the setup needed to handle more than one checkpoint file per user.

Third script "checkpoint_3.sh" (example 3)

We create a subdirectory in $SGE_CKPT_DIR and pass this location to the program, to support more than one checkpointing program at a time. Since the signal is sent to the whole process group, we have to ignore it in the shell and handle it only in the called program.

#!/bin/sh
# check_transparent3.sh
      
trap '' usr2
      
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
      
if [ \! -e "$SGE_CKPT_JOB" ] ; then
    mkdir $SGE_CKPT_JOB
fi
      
if [ \! -d "$SGE_CKPT_JOB" ] ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
      
#
# Check whether we are restarted.
#
      
if [ "$RESTARTED" -eq "1" ] ; then
    /home/reuti/checkpoint_program3 -r -d $SGE_CKPT_JOB
else
    /home/reuti/checkpoint_program3 -d $SGE_CKPT_JOB
fi
      
rm -rf $SGE_CKPT_JOB
      
exit 0

It may not be necessary to give extra options to the program at all, as all environment variables are also exported to the program. But handling this in the shell keeps the program flexible enough to run with other queuing systems if necessary, with only the shell script needing adjustment. A sample program in C is provided in the file checkpoint_program3.c. Please compile it with:

$ gcc -o checkpoint_program3 checkpoint_program3.c -lm

The key points in the program are:

  • Check for valid checkpoint directory
  • Read the old checkpoint file if requested
  • Install a signal handler to process the checkpoint request
  • Processing
  • Output of the results

In this case, the program can also run unchanged without SGE or any checkpointing at all; it's just an option of the program. To keep only one checkpoint file in the given directory at any time, the signal handler routine removes the old file only after a new version has been written successfully.
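
The full source is part of the archive; the following is only a minimal sketch of these key points (it is not the supplied checkpoint_program3.c — the messages match the output shown below, but the option handling and the choice to only set a flag in the handler and write the file in the main loop are illustrative):

/*
 * Minimal sketch of a program for the transparent interface
 * (illustrative, not the supplied checkpoint_program3.c).
 */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t ckpt_requested = 0;
static long value = 1;
static char ckpt_file[1024];

/* SGE sends usr2 every min_cpu_interval; we only note the request here. */
static void handler(int sig)
{
    (void)sig;
    ckpt_requested = 1;
}

/* Write a new checkpoint and replace the old one only on success. */
static void write_checkpoint(void)
{
    char tmp[1060];
    FILE *f;

    fprintf(stderr, "Checkpoint creation initiated.\n");
    snprintf(tmp, sizeof(tmp), "%s.new", ckpt_file);
    if ((f = fopen(tmp, "w")) == NULL)
        return;                               /* keep the old checkpoint file */
    fprintf(f, "%ld\n", value);
    fclose(f);
    rename(tmp, ckpt_file);
    fprintf(stderr, "Checkpoint file created to restart with %ld.\n", value);
}

int main(int argc, char *argv[])
{
    int i, restarted = 0;
    const char *dir = ".";
    FILE *f;

    for (i = 1; i < argc; i++) {              /* -r: restart, -d <dir>: ckpt dir */
        if (strcmp(argv[i], "-r") == 0)
            restarted = 1;
        else if (strcmp(argv[i], "-d") == 0 && i + 1 < argc)
            dir = argv[++i];
    }
    snprintf(ckpt_file, sizeof(ckpt_file), "%s/checkpoint_3", dir);

    if (restarted) {
        fprintf(stderr, "I will try to restart...\n");
        if ((f = fopen(ckpt_file, "r")) != NULL) {
            fscanf(f, "%ld", &value);
            fclose(f);
            fprintf(stderr, "Checkpoint file read. Recalculation starts at %ld.\n", value);
        } else {
            fprintf(stderr, "No checkpoint file written up to now. Restart from the beginning.\n");
        }
    }

    signal(SIGUSR2, handler);

    while (value <= 1000) {                   /* the "calculation" */
        sleep(1);
        value++;
        if (ckpt_requested) {
            ckpt_requested = 0;
            write_checkpoint();
        }
    }
    return 0;
}

Writing to a temporary file first and renaming it afterwards is what keeps the old checkpoint usable if the node dies in the middle of a write.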

Let's submit the job and try to move it to a different machine. We may get the following output:

$ qsub -ckpt check_transparent check_transparent3.sh 
your job 8149 ("check_transparent3.sh") has been submitted
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8149     0 check_tran reuti        r     12/24/2004 11:08:00 vast12     MASTER

The job is running, and we can check the checkpoint directory to see whether any file has been written so far.

$ ls /home/checkpoint/
8149
$ ls /home/checkpoint/8149/
$

Although there is none so far, we can try to move the job to a different machine anyway, because the same situation may also occur when a node simply crashes.

$ qmod -s 8149
reuti - suspended job 8149
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8149     0 check_tran reuti        Rr    12/24/2004 11:09:09 vast15     MASTER
$ cat check_transparent3.sh.e8149
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.

The program discovered the absence of a valid checkpoint file and therefore simply starts from the beginning. We now wait until a valid file has been written.

$ ls /home/checkpoint/8149/
checkpoint_3
$ cat check_transparent3.sh.e8149
I will try to restart....
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 297.
$ qmod -s 8149
reuti - suspended job 8149
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8149     0 check_tran reuti        Rr    12/24/2004 11:14:53 vast12     MASTER         
$ cat check_transparent3.sh.e8149
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 297.
I will try to restart...
Checkpoint file read. Recalculation starts at 297.

This time the checkpoint file was really processed: the program starts again at 297, using the already calculated results. After the job has finished, the created error file contains the complete checkpoint processing log:

$ cat check_transparent3.sh.e8149
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 297.
I will try to restart...
Checkpoint file read. Recalculation starts at 297.
Checkpoint creation initiated.
Checkpoint file created to restart with 594.
Checkpoint creation initiated.
Checkpoint file created to restart with 892.

Although the checkpoints written later weren't used, the whole procedure works, and the directory created for the job-specific checkpoint files was also deleted correctly:

$ ls /home/checkpoint/
$

We can now move on to the next available interface in SGE, which behaves slightly differently.



The userdefined interface

Application Characteristic

This interface is used for applications that have built-in checkpoint support which checkpoints automatically based on the application's internal logic, without depending on external factors to initiate a checkpoint.

Behaviour of this interface (example 4)

This interface is similar to the transparent interface, but in this case there is no need to specify a signal at all (specifying a signal leads to the same behaviour as with the transparent interface):

$ qconf -sckpt check_userdefined
ckpt_name          check_userdefined
interface          userdefined
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /home/checkpoint
queue_list         all
signal             none
when               xr

The creation of the checkpoints is entirely up to the running program; there is no external trigger by SGE to create any checkpoint file at all (unless you specify the signal). Even with this setup, using checkpointing is still advantageous, because SGE will restart your job and pass the restart information along with it.

We can use nearly the same script as in example 3 to start the job (just without the trap command to catch the signal), but the program checkpoint_program4.c has to decide when to write a checkpoint. In a real program this might be after each critical step. This may also be easier to implement than dumping all the program's data and its state (perhaps in the form of a finite state machine) to a file at an arbitrary point.

#!/bin/sh
# check_userdefined4.sh
      
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
      
if [ \! -e "$SGE_CKPT_JOB" ] ; then
    mkdir $SGE_CKPT_JOB
fi
      
if [ \! -d "$SGE_CKPT_JOB" ] ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
      
#
# Check whether we are restarted.
#
      
if [ "$RESTARTED" -eq "1" ] ; then
    /home/reuti/checkpoint_program4 -r -d $SGE_CKPT_JOB
else
    /home/reuti/checkpoint_program4 -d $SGE_CKPT_JOB
fi
      
rm -rf $SGE_CKPT_JOB
      
exit 0
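
The program checkpoint_program4.c is also part of the archive; as a rough idea (illustrative, not the actual source), it is essentially the sketch from example 3 with the signal handler removed, writing a checkpoint on its own after each block of work:

/*
 * Sketch of a self-checkpointing program for the userdefined interface
 * (illustrative, not the supplied checkpoint_program4.c).
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    long value = 1;
    int i, restarted = 0;
    const char *dir = ".";
    char ckpt_file[1024], tmp[1060];
    FILE *f;

    for (i = 1; i < argc; i++) {              /* -r: restart, -d <dir>: ckpt dir */
        if (strcmp(argv[i], "-r") == 0)
            restarted = 1;
        else if (strcmp(argv[i], "-d") == 0 && i + 1 < argc)
            dir = argv[++i];
    }
    snprintf(ckpt_file, sizeof(ckpt_file), "%s/checkpoint_4", dir);
    snprintf(tmp, sizeof(tmp), "%s.new", ckpt_file);

    if (restarted && (f = fopen(ckpt_file, "r")) != NULL) {
        fscanf(f, "%ld", &value);
        fclose(f);
    }

    for (; value <= 1000; value++) {
        sleep(1);                             /* one step of the calculation */
        if (value % 300 == 0 && (f = fopen(tmp, "w")) != NULL) {
            fprintf(f, "%ld\n", value);       /* checkpoint after each "critical step" */
            fclose(f);
            rename(tmp, ckpt_file);
        }
    }
    return 0;
}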

You may notice that here the option -d also triggers the creation of the checkpoint file within the program. As we already walked through the created output in example 3, here is just the output of the issued commands:

$ qsub -ckpt check_userdefined check_userdefined4.sh 
your job 8150 ("check_userdefined4.sh") has been submitted
$ qstat -u reuti             
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8150     0 check_user reuti        r     12/24/2004 11:43:15 vast12     MASTER         
$ ls /home/checkpoint/
8150
$ ls /home/checkpoint/8150/
$ qmod -s 8150
reuti - suspended job 8150
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8150     0 check_user reuti        Rr    12/24/2004 11:47:31 vast22     MASTER         
$ cat check_userdefined4.sh.e8150 
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
$ ls /home/checkpoint/8150/                         
checkpoint_4
$ cat check_userdefined4.sh.e8150
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 300.
Checkpoint creation initiated.
Checkpoint file created to restart with 600.
$ qmod -s 8150
reuti - suspended job 8150
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8150     0 check_user reuti        Rr    12/24/2004 11:59:46 vast12     MASTER         
$ cat check_userdefined4.sh.e8150
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 300.
Checkpoint creation initiated.
Checkpoint file created to restart with 600.
I will try to restart...
Checkpoint file read. Recalculation starts at 600.
$ qstat
$ cat check_userdefined4.sh.e8150
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 300.
Checkpoint creation initiated.
Checkpoint file created to restart with 600.
I will try to restart...
Checkpoint file read. Recalculation starts at 600.
Checkpoint creation initiated.
Checkpoint file created to restart with 900.  



The application-level interface

Application Characteristics

This interface is usually used for applications that require an external process to initiate a successful checkpoint, such as passing a message, moving the checkpoint file to the shared directory, etc.

Behaviour of this interface (example 5)

This interface executes the defined commands for ckpt_command, migr_command and clean_command. The restart_command will not be honoured, since this is reserved for kernel-level checkpointing interfaces. You may also specify just signal names instead of procedures, but in our example we will use procedures to take care of the checkpoint files. Optionally you may also specify a signal, which we do not use here either.

$ qconf -sckpt check_application-level
ckpt_name          check_application-level
interface          application-level
ckpt_command       /home/reuti/check/checkpoint.sh $ckpt_dir $job_id
migr_command       /home/reuti/check/migrate.sh $job_pid $ckpt_dir $job_id
restart_command    none
clean_command      /home/reuti/check/clean.sh $ckpt_dir $job_id
ckpt_dir           /home/checkpoint
queue_list         all
signal             none
when               xmr

In this setup, the command defined as "ckpt_command" is called to create a checkpoint file whenever "min_cpu_interval" has elapsed, and the "migr_command" procedure is executed if you suspend the job or the queue the job is running in (remember that this can be done by hand using qmod, or automatically when a "suspend_thresholds" setting of the queue is exceeded). Finally, the "clean_command" removes the created checkpoint files when the job is deleted with qdel or completes successfully.

Such an interface may still be useful if you only have the binary of a program which creates checkpoint files, or which can use some kind of scratch file for a restart, but always writes it to a temporary directory like $TMPDIR, with no option to rename it or save it in a different location. Another use of this interface is the case where the checkpoint or scratch files are written too often or are too big. In these cases it's advantageous to write to a local file system first, to minimize the network traffic; the checkpointing procedure can then copy the local checkpoint file from e.g. $TMPDIR to the shared checkpoint directory.

For the example, we will reuse a slightly modified version of the program from example 4, but write a checkpoint after every 5 steps. Since this is too often to be put on the shared checkpoint directory, we use the checkpoint creation script defined in the application-level interface to copy this file to the shared checkpoint directory less often. For the restart, this copied file will be used, because all files in the scratch directory $TMPDIR are of course deleted when the job is removed from the node.

Although the copy of the checkpoint file made at every "min_cpu_interval" may be sufficient, we can get a more recent version when the migrate command runs and performs another checkpoint copy. In case of a failure of the node this won't be possible, of course. The necessary checkpoint.sh script may look like this:

#!/bin/sh
#
# checkpoint.sh
#
      
#
# Copy a checkpoint file from the scratch directory of the
# job to the checkpoint directory.
#
      
me=`basename $0`
      
# test number of args
if [ $# -ne 2 ]; then
   echo "$me: got wrong number of arguments" >&2
   exit 1
fi
      
SGE_CKPT_JOB=$1/$2
      
if [ \! -e "$SGE_CKPT_JOB" ] ; then
    mkdir $SGE_CKPT_JOB
fi
      
if [ \! -d "$SGE_CKPT_JOB" ] ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
      
while [ -z "$DONE" ] ; do
    cp $TMPDIR/checkpoint_5 $SGE_CKPT_JOB/checkpoint 2>/dev/null
    if [ "$?" -eq "0" ] ; then
        diff $TMPDIR/checkpoint_5 $SGE_CKPT_JOB/checkpoint 1>/dev/null 2>&1
        if [ "$?" -eq "0" ] ; then
            DONE="1"
        fi
    else
        DONE="1"
    fi
done
      
exit 0

To make sure we got a valid copy of the checkpoint file, we do a simple test here with diff; you may need a different form of the copy command for your real application. The two arguments given to the procedure are used to build the name of the checkpoint directory and to create it on the first call of the procedure. The next procedure initiates the migration of the job. With this type of checkpoint interface, you have to kill the job on your own; we do this with a kill -9 to the whole process group.

#!/bin/sh
#
# migrate.sh
#
      
me=`basename $0`
      
# test number of args
if [ $# -ne 3 ]; then
   echo "$me: got wrong number of arguments" >&2
   exit 1
fi
      
#
# Get the current checkpoint besides the regular copied one.
#
      
/home/reuti/check/checkpoint.sh $2 $3
      
#
# Now kill the job with the whole process group.
#
      
kill -9 -- -$1

The last step is to remove the created checkpoint directory after the job has finished. This is done by an appropriate procedure which is executed at the end of the job.

#!/bin/sh
#
# clean.sh
#
      
#
# Delete the checkpoint directory for this job.
#
      
me=`basename $0`
      
# test number of args
if [ $# -ne 2 ]; then
   echo "$me: got wrong number of arguments" >&2
   exit 1
fi
      
SGE_CKPT_JOB=$1/$2
      
if [ \! -d "$SGE_CKPT_JOB" ] ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
      
rm -rf $SGE_CKPT_JOB 
      
exit 0

Having defined all this, the job script for this type of checkpointing interface may look like this:

#!/bin/sh
# check_application-level5.sh
      
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
      
#
# Check whether we are restarted.
#
      
SGE_CKPT_FILE=$SGE_CKPT_JOB/checkpoint
      
if [ "$RESTARTED" -eq "2" -a -e "$SGE_CKPT_FILE" -a -r "$SGE_CKPT_FILE" ] ; then
    /home/reuti/checkpoint_program5 -r $SGE_CKPT_FILE -d $TMPDIR
else
    /home/reuti/checkpoint_program5 -d $TMPDIR
fi
      
exit 0

Please note that in this case the value of $RESTARTED is 2. We also take care of the situation where a node crashed before the first checkpoint file was successfully created. During a restart, you can look into the defined checkpoint directory and check whether the job's corresponding subdirectory contains any checkpoint file. After the job completes or gets deleted (by qdel), the job's checkpoint subdirectory is removed by the clean.sh procedure. For this example, ensure that the correct name of the checkpointing interface is specified, as it's different from the previous examples:

qsub -ckpt check_application-level check_application-level5.sh



Integration with the Condor library

Checkpointing with Condor in general

Another queuing system is Condor, which in addition supplies built-in checkpointing libraries (http://www.cs.wisc.edu/condor). It is possible to use just the Condor libraries in standalone mode, without any queuing system at all or together with another one. This way you can still use SGE and add the checkpointing facility of Condor. After downloading the appropriate Condor version for your system, you have to install it as a personal Condor just to get access to the compiler. You may install it in a subdirectory inside your home directory, e.g. ~/local:

$ ls -d1 condor*
condor-6.6.7-linux-x86-redhat80.tgz
$ tar -xzf condor-6.6.7-linux-x86-redhat80.tgz 
$ ls -d1 condor*
condor-6.6.7
condor-6.6.7-linux-x86-redhat80.tgz
$ cd condor-6.6.7/
$ ./condor_configure --install=/home/reuti/condor-6.6.7/release.tar \
  --install-dir=/home/reuti/local/condor-6.6.7 --make-personal-condor --verbose
$ cd

Having done this, you can remove the unpacked installation directory of Condor again. Now you have to execute two commands to get access to the compiler; these can be put in any of the usual files which are sourced during login.

$ export CONDOR_CONFIG=/home/reuti/local/condor-6.6.7/etc/condor_config
$ export PATH=/home/reuti/local/condor-6.6.7/bin:$PATH

If everything was installed successfully, it's now already possible to use the compiler. Further details can be looked up in the online Condor manual; here we just compile a small program to see whether it works this way:

$ cat ever.c
int main(void)
{
    float x;
    long  i;
      
    for (;;)
    {
        for (i=0;i<=100000;i++)
            x=3.1415926*i+i+i*i*2.7182818;
    }
      
    return 0;
}
$ condor_compile gcc ever.c -o ever
...
$ ls -l ever  
-rwxr-xr-x    1 reuti    users    15199008 Dec 25 22:18 ever
$ strip ever
$ ls -l ever
-rwxr-xr-x    1 reuti    users     1332132 Dec 25 22:18 ever

The executables created by Condor are usually really big, and can be shrunk with the strip command, as used in the above example. Before integrating this into SGE, we execute the short example to see it working.

$ ./ever
Condor: Notice: Will checkpoint to ./ever.ckpt
Condor: Notice: Remote system calls disabled.
^C
$ ./ever -_condor_restart ever.ckpt
Condor: Notice: Will restart from ever.ckpt

Before we kill the program with a Ctrl-C, we log in to another terminal session on the machine and send a "kill -usr2 <pid>" to the running "ever" program. SIGUSR2 is the signal used by Condor to force the creation of a checkpoint file. The process list can be retrieved with "ps -e f". By default the checkpoint file is written to a file named after the executable with .ckpt appended, but a different name can be forced with the option "-_condor_ckpt <filename>". In case of a restart, it is sufficient to give the option "-_condor_restart <filename>", because this will also set the filename for further checkpoints. In fact, an additional "-_condor_ckpt <filename>" will be ignored in this case.
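
As a sketch of this manual test, run from the second terminal while ./ever is still busy in the first one (the actual PID has to be looked up first):

$ ps -e f | grep ever
$ kill -usr2 <pid>

The program keeps running after the signal; only the checkpoint file ever.ckpt is written (or rewritten).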

There are also function calls available in Condor for use within a program; they are explained in section 4.2.4 (Checkpoint Library Interface) of the Condor user manual.

Please note that there are some restrictions on the programs which can be compiled with Condor; a full list can be found in section 1.4 (Current Limitations) of the Condor user manual.

Integrating Condor with the transparent checkpoint interface (example 6)

Now we will use the plain C program "condor_program6.c" and integrate it into SGE. Because we are not using any Condor library functions in this case, we can put the whole restart handling into the job script:

#!/bin/sh
# condor_transparent6.sh
      
trap '' usr2
      
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
      
if [ \! -e "$SGE_CKPT_JOB" ] ; then
    mkdir $SGE_CKPT_JOB
fi
      
if [ \! -d "$SGE_CKPT_JOB" ] ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
      
#
# Check whether we are restarted.
#
      
SGE_CKPT_FILE=$SGE_CKPT_JOB/checkpoint_6
      
if [ "$RESTARTED" -eq "1" -a -e "$SGE_CKPT_FILE" -a -r "$SGE_CKPT_FILE" ] ; then
    /home/reuti/condor_program6 -_condor_restart $SGE_CKPT_FILE
else
    /home/reuti/condor_program6 -_condor_ckpt $SGE_CKPT_FILE
fi
      
rm -rf $SGE_CKPT_JOB
      
exit 0

The corresponding program can be compiled with

condor_compile gcc condor_program6.c -o condor_program6 -lm; strip condor_program6

You will notice that this program is really short and contains no reference to checkpointing at all, because that is handled by the Condor library. Since the "sleep(1)" call used in the previous examples is not allowed in Condor programs, we have to replace it with an active loop. We can reuse the transparent interface already defined for the examples above and submit the job with:

$ qsub -ckpt check_transparent condor_transparent6.sh
your job 8188 ("condor_transparent6.sh") has been submitted

After the first checkpoint file has been created, we can try to suspend this job to move it to a different node.

$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8188     0 condor_tra reuti        r     12/25/2004 16:24:32 vast11     MASTER         
$ ls ../checkpoint/8188/
checkpoint_6
$ qmod -s 8188
reuti - suspended job 8188
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8188     0 condor_tra reuti        Rr    12/25/2004 16:30:50 vast12     MASTER         
$ cat condor_transparent6.sh.e8188
Condor: Notice: Will checkpoint to /home/checkpoint/8188/checkpoint_6
Condor: Notice: Remote system calls disabled.
Condor: Notice: Will restart from /home/checkpoint/8188/checkpoint_6
      

This works seamlessly with your applications. Because Condor will try to use exactly the same environment and files during a restart as during the original run, you can't use any files in $TMPDIR, since this directory will have a different name when the job starts on the next node. You are therefore limited to your home directory, any other directory shared across the nodes, or a shared scratch space such as one provided by PVFS{1,2}.

Integrating Condor library functions with the transparent checkpoint interface (example 7)

You are a little bit more flexible in using Condor when you integrate the library functions directly into your program. This way the job script contains no hint of Condor at all:

#!/bin/sh
# condor_transparent7.sh
      
trap '' usr2
      
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
      
if [ \! -e "$SGE_CKPT_JOB" ] ; then
    mkdir $SGE_CKPT_JOB
fi
      
if [ \! -d "$SGE_CKPT_JOB" ] ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
      
#
# Check whether we are restarted.
#
      
if [ "$RESTARTED" -eq "1" ] ; then
    /home/reuti/condor_program7 -r -d $SGE_CKPT_JOB
else
    /home/reuti/condor_program7 -d $SGE_CKPT_JOB
fi
      
rm -rf $SGE_CKPT_JOB
      
exit 0

This time we have to call the necessary library functions in the program. The lines with access to the Condor library are "init_image_with_file_name(checkpoint_read);" and "restart();". The creation of a checkpoint is still initiated with a SIGUSR2 signal, so we don't have to take care of it by hand in the program. A sample execution may give these results in the standard error file:

$ cat condor_transparent7.sh.e8189
Condor: Notice: Will checkpoint to /home/reuti/condor_program7.ckpt
Condor: Notice: Remote system calls disabled.
Update: Condor: Notice: Will checkpoint to /home/checkpoint/8189/checkpoint_7
Condor: Notice: Will checkpoint to /home/reuti/condor_program7.ckpt
Condor: Notice: Remote system calls disabled.
Update: Condor: Notice: Will checkpoint to /home/checkpoint/8189/checkpoint_7
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Condor: Notice: Will checkpoint to /home/reuti/condor_program7.ckpt
Condor: Notice: Remote system calls disabled.
Update: Condor: Notice: Will checkpoint to /home/checkpoint/8189/checkpoint_7
I will try to restart...

Because no checkpoint file had been written at the time of the first suspend of the job, the program discovers this and starts from the beginning. The later suspend restarted the program from the checkpoint.
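
The supplied condor_program7.c is part of the archive; the following is only a sketch of how the two library calls are typically arranged (the option handling, the file name and the messages are illustrative; the prototypes are the ones described in the Condor manual):

/*
 * Sketch of the restart handling with the standalone Condor checkpoint
 * library (compile with condor_compile as before; illustrative only).
 */
#include <stdio.h>
#include <string.h>

extern void init_image_with_file_name(char *ckpt_file_name);
extern void restart(void);

int main(int argc, char *argv[])
{
    static char checkpoint_read[1024];
    int i, restarted = 0;
    const char *dir = ".";
    FILE *f;

    for (i = 1; i < argc; i++) {              /* -r: restart, -d <dir>: ckpt dir */
        if (strcmp(argv[i], "-r") == 0)
            restarted = 1;
        else if (strcmp(argv[i], "-d") == 0 && i + 1 < argc)
            dir = argv[++i];
    }
    snprintf(checkpoint_read, sizeof(checkpoint_read), "%s/checkpoint_7", dir);

    /* Redirect all further checkpoints (still triggered by SIGUSR2) from
     * the default <program>.ckpt to the shared checkpoint directory. */
    init_image_with_file_name(checkpoint_read);

    if (restarted) {
        fprintf(stderr, "I will try to restart...\n");
        if ((f = fopen(checkpoint_read, "r")) != NULL) {
            fclose(f);
            restart();                        /* does not return if the restart succeeds */
        } else {
            fprintf(stderr, "No checkpoint file written up to now. Restart from the beginning.\n");
        }
    }

    /* ... the actual calculation follows here ... */
    return 0;
}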

Integrating Condor library functions with the userdefined checkpoint interface (example 8)

This way we can also control the creation of a checkpoint file from within the program, instead of relying on the SIGUSR2 method. The necessary call is "ckpt();", placed in the routine which, without the Condor library, handled the writing of the checkpoint file on its own. After compiling the program condor_program8.c in the same fashion as in the previous examples, you have to submit the job with the checkpointing interface set to check_userdefined, as done in the examples without the usage of the Condor library.
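
As a closing sketch (again illustrative, not the supplied condor_program8.c), the core of such a program could look like this; the restart branch would be handled as in the previous sketch, and the active loop again replaces the forbidden sleep(1):

extern void init_image_with_file_name(char *ckpt_file_name);
extern void ckpt(void);                       /* write a checkpoint now and continue */

int main(void)
{
    /* Illustrative fixed path; the real program would build it from -d as before. */
    static char ckpt_file[] = "/home/checkpoint/checkpoint_8";
    volatile double x = 0.0;
    long value, j;

    init_image_with_file_name(ckpt_file);

    for (value = 1; value <= 1000; value++) {
        for (j = 0; j < 100000; j++)          /* active loop instead of sleep(1) */
            x = x + 3.1415926 * j;
        if (value % 300 == 0)
            ckpt();                           /* the program decides when to checkpoint */
    }
    return 0;
}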