Loose and tight LAM/MPI Integration in SGE

Topic:

Loose and tight integration of the LAM/MPI library into SGE.

Author:

Reuti, reuti__at__staff.uni-marburg.de; Philipps-University of Marburg, Germany

Version:

1.0a -- 2005-03-23 Initial release, comments and corrections are welcome
1.1 -- 2010-11-22 Final version

Contents:

Prerequisites
Loose integration using rsh
Modification of the LAM-MPI source for qrsh startup
Loose integration using qrsh
Tight integration using qrsh
Modification of the Startup for other Platforms
References and Documents

Note:

This HOWTO complements the information contained in the man pages of SGE and the Administration Guide.

Acknowledgement:

Many thanks to Sean Dilda for pointing out a bug in the lamd_wrapper and eliminating an ambiguity.

!!! Warning !!!

LAM/MPI was superseded by the Open MPI project. Open MPI has a Tight Integration into SGE built in since 1.2, which can by invoked when you ./configure Open MPI with "--with-sge" for versions 1.3 and newer. Open MPI version 1.2 compiled it in unconditionally. Please refer to the Open MPI FAQ for details.

It will work out-of-the-box with a defined parallel environment where start- and stop_proc_args are both set to NONE in the to be used PE (in the essence, the same PE can now be used for Open MPI and MPICH2), and in the jobscript a plain mpiexec will discover automatically the granted slots and nodes without any further options. Nevertheless, in case that there is more than one MPI installation in a cluster available, the correct mpiexec corresponding to the compiled application must be used as usual.

This document just stays here in case you come across a legacy installation and have to fix or setup such an older version.

Prerequisites

Configuration of SGE with qconf or the GUI – Compilation of LAM/MPI – Cluster setup
You should be familiar with modifying the SGE environment such as the queue definitions and parallel environment (PE) configuration. Additional information on the SGE parallel environment is available from the "sge_pe" man pages (make sure the SGE man pages are defined in your $MANPATH). You should also have access to the LAM/MPI source code, and know how to build and install it. More information can be found in the LAM/MPI Documentation (see References and Documents). For the "Loose integration using rsh" startup method, a working (passwordless) login between the nodes in the cluster is also required.

Target platform

This Howto targets the LAM/MPI 7.1.1 on the Linux platform. Due to the modifications made to the source code and introduced startup sequence, these integrations using qrsh will not work on platforms, where SGE will use the additional group feature for a proper shutdown by default.

LAM/MPI

The LAM/MPI library from the Indiana University is a MPI implementation. In contrast to other implementations like MPICH, this one is using daemons on all the included nodes in a parallel job. One advantage of such an approach is, that programs issuing many mpirun commands during their execution need only one time a "conventional" communication to the nodes to start the daemons. All further communication is handled by the already running daemons, and so additional mpiruns will startup faster.

Available startup methods in SGE

For now there is no startup possibility in SGE, which allows a program started on a node to vanish into daemon land and still be under SGE control for correct accounting and proper shutdown of all slave tasks. Therefore the loose integrations will rely on a shutdown of the daemons by a conventional LAM/MPI command, while the tight integration uses a 2-stage startup, which was introduced by Christopher Duncan for former LAM/MPI versions [1].

Included setups and scripts

All three possibilities to startup the LAM/MPI daemons introduced here can be installed in parallel, as they have a unique PE and script directory. This way you can play around with them, and decide which one to install in your cluster. While walking through this Howto, you can install all of them in a shared directory, e.g. your home directory. For the final installation it could be put in the conventional place like $SGE_ROOT. The scripts, which are only slight modifications of the original scripts in the $SGE_ROOT/mpi directory, can be downloaded here for your convenience (the inserted lines are commented) [2].

Reading of this document

To avoid explaining the complete LAM/MPI behavior in each of the possible startup methods, this document should be read from the beginning onward, although you are e.g. only interested in the Tight integration. Some of the special settings of the scripts and PEs will explain itself, when the default behavior is known.

Loose integration using rsh

Characteristic of the Startup Method
Using a simple (passwordless) rsh between the nodes, the daemons on the slave nodes are started up.

Since LAM/MPI 7.1.1 is SGE-aware, it will create a job specific directory on the slave nodes of the job of the form "lam-$USER@$HOSTNAME-sge-$JOB_ID-$SGE_TASK_ID". So they will be unique for each job and not interfere, if two (or more) LAM/MPI jobs are running on the same node. They will be located in 1. $TMPDIR or 2. /tmp. As $TMPDIR is only set by SGE in this case on the headnode of the job, it will go here into the SGE created and removed $TMPDIR, and to the /tmp directory on all the slave nodes. The removal of these directories on the slave nodes will take place in the stop_proc_args script of the PE, when the lamhalt command is executed. In most of the cases this will work as intended, whether the jobs are finished by a normal end of the program or stopped with qdel. Don't change the $TMPDIR to a different location in your jobscript, because the command mpirun can't execute correctly, as it won't find any information of the current setup. If you must change it for any reason, it must also be changed in the start_proc_args and stop_proc_args scripts of the PE.

As there is no environment on the slave nodes by SGE, this startup will behave most like an interactive startup, and hence need the path to the LAM/MPI binaries also on the slave nodes defined in any file, which will be sourced during a non-interactive login.

Hostfile a.k.a. Machinefile Format

Although LAM/MPI supports a machinefile format, where each node is suffixed by the number of slots to be used on this machine (i.e. "node00 cpu=2"), the default format of the generated machinefile is sufficient, where each machine is repeated times in this file, equal to the slots to be used there.

Setup of the PE in SGE

We will name this PE "lam_loose_rsh" and set the following entries:
$ qconf -sp lam_loose_rsh
pe_name           lam_loose_rsh
queue_list        para21 para22
slots             4
user_lists        NONE
xuser_lists       NONE
start_proc_args   /home/reuti/lam_loose_rsh/startlam.sh $pe_hostfile
stop_proc_args    /home/reuti/lam_loose_rsh/stoplam.sh
allocation_rule   $round_robin
control_slaves    FALSE
job_is_first_task TRUE
This setup is for SGE 5.3, where the queues to be used are defined in the PE. For SGE 6.x, the PE must be specified in the cluster queue, in which it should be used in the entry "pe_list".

Sample job execution

Most likely you have already a MPI program based on LAM/MPI, which is running successfully in interactive mode. Otherwise you will find a sample program in the supplied archive to this Howto, which will run until a qdel of the job, so that the process distribution to the nodes can easily be checked. Let's take this job script:
$ cat tester_mpi.sh
#!/bin/sh
export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
      
mpirun C /home/reuti/mpihello
Specifying the option "C" to mpirun will use all of the slots granted to this job, which is most likely what you want all of the time. If all went right, we can see the job running, and also check the involved nodes for the running jobs:
$ qsub -pe lam_loose_rsh 2 tester_mpi.sh
your job 10389 ("tester_mpi.sh") has been submitted
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
  10389     0 tester_mpi reuti        r     02/20/2005 15:25:31 para21     SLAVE          
  10389     0 tester_mpi reuti        r     02/20/2005 15:25:31 para22     MASTER
$ rsh node21 ps -e f -o pid,ppid,pgrp,command --cols=80
  PID  PPID  PGRP COMMAND
...
 8386     1  8386 /home/reuti/local/lam-7.1.1/bin/lamd -H 192.168.151.23 -P 4288
 8387  8386  8386  \_ /home/reuti/mpihello
$ rsh node22 ps -e f -o pid,ppid,pgrp,command --cols=80
  PID  PPID  PGRP COMMAND
...
24576     1 24576 /home/reuti/local/lam-7.1.1/bin/lamd -H 192.168.151.23 -P 4288
24581 24576 24576  \_ /home/reuti/mpihello
Using this setup, the issued qdel, which will execute the lamhalt in the stop_proc_args script of the PE, is the only way to remove the running processes and created directories on the nodes (except the directory on the headnode of the job, which is removed by SGE as usual).

Another more sophisticated solution, maybe in "pseudo code" is to "rsh <node> 'ps some_jobid | grep process_leader | (cut process_group) * -1 | xargs kill -9 --'" in a loop over all involved machines. As we are interested in a Tight integration, we will no further dig into this, and it's up to the reader to implement it.

Modification of the LAM/MPI source for qrsh startup

Pitfall during startup of the Daemons
The daemon on the nodes is started by the program hboot in LAM/MPI, which will fork a child process, which in turn will be replaced by the lamd. While this will work without any modification on the headnode of the job, where hboot is started locally without rsh, we will face a problem on the slave nodes: when the parent task ends, SGE will discover this and kill the whole processgroup of the job. This would also kill the just started lamd, as it is still in the same processgroup as hboot was before.

Modification

The solution to this problem is the command setsid, which will enforce a new processgroup for the child task. This has to be added in line 317 of hboot.c (which can be found in your lam-7.1.1 sourcecode installation in the directory tools/hboot), after the child forks. The program section:
                else if (pid == 0) {            /* child */
      
                        if (ao_taken(ad, "s")) {
should become:
                else if (pid == 0) {            /* child */
                        setsid();
                        if (ao_taken(ad, "s")) {
Then the executable hboot must be build again with the conventional make inside the lam-7.1.1 sourcecode installation, and the resulting binary hboot copied just by hand to the place, where the binaries were installed, e.g. ~/local/lam-7.1.1/bin.

No warranty

As usual, there is no guaranty, that this will work successfully on all Linux distributions and versions. Just do on your own risk.

Loose integration using qrsh

Different behavior on the slave nodes compared to "Loose integration using rsh"
Because we are now using qrsh to start the daemons, no rsh would be necessary between the nodes. It is just still defined here to have an easy access to the nodes. The used qrsh will now also create a temporary directory on the slave nodes and set the $TMPDIR accordingly. Unfortunately, this will be deleted when the job (i.e. the executed hboot) ends. So it can't be used to store the LAM/MPI created information of the daemon setup. As $TMPDIR is the only environment variable which can't be exported to the nodes by using the "-v" or "-V" option to qrsh, we have to use another way. Otherwise you will see the job starting as intended, and the lamhalt will also look successful, but you will still have the processes running on the slave nodes in case that you issued a qdel, as LAM/MPI stores in its special directory the PIDs of the processes, which need to be killed.

The supplied rsh-wrapper in the mpi subdirectory in $SGE_ROOT already has the option to supply a prefix command, which should be executed before the command on the slave node. This can be set in the start_proc_args script of the PE:
export RCMD_PREFIX="export TMPDIR=/tmp;"
This way, the LAM/MPI created directories are in the same place as using just the plain rsh. Be sure, that you also changed the path to the actual location of your rsh-wrapper in the supplied script startlam.sh of the archive.

Setup of the PE in SGE

We will name this PE "lam_loose_qrsh" and set the following entries:
$ qconf -sp lam_loose_qrsh
pe_name           lam_loose_qrsh
queue_list        para21 para22
slots             4
user_lists        NONE
xuser_lists       NONE
start_proc_args   /home/reuti/lam_loose_qrsh/startlam.sh -catch_rsh $pe_hostfile
stop_proc_args    /home/reuti/lam_loose_qrsh/stoplam.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task TRUE
To allow the execution of the qrsh commands, the control_slaves entry must be set to TRUE in this case. On the headnode of the job is no qrsh necessary, so we can safely set job_is_first_task to TRUE.

Behavior on platforms with process control by the Additional Group Feature

Although the started lamd would be in a new process group, the reliable shutdown of all the processes using the additional group feature (see: man setgroups) by SGE would also kill the just started daemon, as it can't escape from this additional group.

Disadvantages and Advantages of the Loose Integration

Which ever of the two loose integrations you choose, they share most of the characteristics in common, as the only difference is the usage of qrsh instead of the plain rsh:

- wrong accounting

- no processes/directories controlled by SGE, removal relies on the lamhalt command

If all went okay, the result in case of a normal program stop or an abort with qdel is:

+ no semaphores or shared memory segments left behind

+ processes and directories removed on the slave nodes

As the output is exactly the same as with the "Loose Integration using rsh", we skip it here, because it will not present any new information or facts. The scripts for this startup method are of course also included in the supplied script archive to this Howto.

Tight integration using qrsh

2-stage Startup for Tight Integration
For now, there is no direct possibility in SGE to start a program in daemon-mode on a node, which is still under control of SGE for correct accounting and removal of the processes.

The original idea by Christopher Duncan in [1], was to make first a conventional qrsh to the slave node, and on this slave node a local-qrsh to start the final daemon. This is possible with LAM/MPI, as the daemon is not pushing itself into the background (like it is mostly done with daemonizing programs), but the starting hboot makes the forking. At this step, it is possible to insert the local-qrsh call and achieve the desired effect.

Additional changes to LAM/MPI

For this to work, it is necessary to use the already patched hboot, like in the chapter "Loose Integration using qrsh". Another change is now necessary to insert the local-qrsh. Therefore the following steps must be executed:
move to your binary installation of LAM/MPI, e.g. ~/local/lam-7.1.1/bin

rename lamd to lamd_binary
create a file lamd_wrapper with the following content:
#!/bin/sh
#
# Modified startup of lamd to get it working in a controlled
# setup by SGE. To be used for interactive usage like usual.
#
         
LAMD_BINARY=lamd_binary
         
if [ "$ENVIRONMENT" = "BATCH" -a "$PE" = "lam_tight_qrsh" ] ; then
    COUNTER=5
    while (( COUNTER-- > 0 )) ; do
        if (( `ps --no-headers -o ppid -p $$` != 1 )) ; then
            :
        else
            :
            exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $HOSTNAME $LAMD_BINARY "$@"
        fi
    done
else
    exec $LAMD_BINARY "$@"
fi
         
#
# If we are lucky, we never get here.
#
         
exit -1
make the lamd_wrapper executable, i.e.: chmod +x lamd_wrapper

make a symbolic link from lamd to lamd_wrapper, i.e.: ln -s lamd_wrapper lamd
With this script, the installed LAM/MPI can still be used interactively or with the loose integration scripts. So nothing must be changed back, to get the initial behavior.

Setup of the PE in SGE

A new PE "lam_tight_qrsh" will reflect the changes for this tight integration:
$ qconf -sp lam_tight_qrsh
pe_name           lam_tight_qrsh
queue_list        para21 para22
slots             4
user_lists        NONE
xuser_lists       NONE
start_proc_args   /home/reuti/lam_tight_qrsh/startlam.sh -catch_rsh $pe_hostfile
stop_proc_args    /home/reuti/lam_tight_qrsh/stoplam.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
As we now make a qrsh also on the headnode of the parallel job, the job_is_first_task must be set to FALSE. It may happen, that we get only one slot on this machine.

Program flow of the wrapper script

SGE will control the number of qrsh calls made to each node, so the idea is to wait a small time, until the parent PID of this lamd_wrapper became 1. This will happen, when the starting hboot left already the machine.

[Side note: In case, that there are some race conditions, and SGE still prevents the second local-qrsh to be made (although the parent PID became already 1), the allocation_rule could be changed to be just 2 (and so limited to dual CPU machines); this would give the exact behavior as Christopher Duncan's initial Perl script. In this case there is no a need to wait at all, and the entry job_is_first_task could be changed to TRUE (one local qrsh is at least allowed in this case).]

On some slow systems, there is the chance to increase the number in the wait loop to a higher value, although it wasn't observed also under already loaded machines that the waiting loop in the lamd_wrapper was ever executed. It's just in the script, to avoid an endless loop, in case anything went wrong.

Again, the startlam.mpi needs to be adjusted, so that the actual location of the rsh wrapper is used here.

Sample execution

Using the same commands like in the first example, the results show the tight integration of LAM/MPI:
$ qsub -pe lam_tight_qrsh 2 tester_mpi.sh
your job 10395 ("tester_mpi.sh") has been submitted
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
  10395     0 tester_mpi reuti        r     02/20/2005 18:39:25 para21     MASTER         
            0 tester_mpi reuti        r     02/20/2005 18:39:25 para21     SLAVE          
  10395     0 tester_mpi reuti        r     02/20/2005 18:39:25 para22     SLAVE
$ rsh node21 ps -e f -o pid,ppid,pgrp,command --cols=80
  PID  PPID  PGRP COMMAND
...
  789     1   789 /usr/sge/bin/glinux/sge_execd
 8945   789  8945  \_ sge_shepherd-10395 -bg
 8993  8945  8993  |   \_ /bin/sh /var/spool/sge/node21/job_scripts/10395
 8994  8993  8993  |       \_ mpirun C /home/reuti/mpihello
 8972   789  8972  \_ sge_shepherd-10395 -bg
 8973  8972  8973      \_ /usr/sge/utilbin/glinux/rshd -l
 8975  8973  8975          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sg
 8976  8975  8976              \_ lamd_binary -H 192.168.151.22 -P 55513 -n 0 -o
 8995  8976  8976                  \_ /home/reuti/mpihello
 8971     1  8971 /usr/sge/bin/glinux/qrsh -V -inherit -nostdin node21 lamd_bina
 8974  8971  8971  \_ /usr/sge/utilbin/glinux/rsh -n -p 55517 node21 exec '/usr/
$ rsh node22 ps -e f -o pid,ppid,pgrp,command --cols=80
  PID  PPID  PGRP COMMAND
...
  789     1   789 /usr/sge/bin/glinux/sge_execd
25180   789 25180  \_ sge_shepherd-10395 -bg
25181 25180 25181      \_ /usr/sge/utilbin/glinux/rshd -l
25183 25181 25183          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sg
25184 25183 25184              \_ lamd_binary -H 192.168.151.22 -P 55513 -n 1 -o
25185 25184 25184                  \_ /home/reuti/mpihello
25179     1 25179 /usr/sge/bin/glinux/qrsh -V -inherit -nostdin node22 lamd_bina
25182 25179 25179  \_ /usr/sge/utilbin/glinux/rsh -n -p 49142 node22 exec '/usr/
The escaped processes don't execute any time or resource consuming program, and can just stay there out of control of SGE. The real executing lamd and child processes are under SGE control, which was the goal to be achieved by this setup. These processes will be killed when they finish on their own, or after issuing a qdel. In the latter case, it is not possible to get a proper shutdown by using lamhalt, since lamd was already shutdown and the directory containing the LAM-MPI specific information removed SGE.

Advantages and Disadvantages of the Tight Integration

With this setup we trade some advantages for the disadvantages:

+ correct accounting

+ processes/directories controlled by SGE, definitely removed at the end of the job

The drawback is in this case:

- semaphores or shared memory segments may be left behind, if the job was aborted with qdel

Depending on the usage of the nodes, there may be a need to make some cleanup of the semaphores or shared memory segments from time to time, or in the stop_proc_args of the PE, in case there are often jobs killed by qdel.

Modification of the Startup for other Platforms

In case of the shutdown of the processes by the additional group feature, the other way round may be successful: usage of the "Tight integration with qrsh" and waiting at the end of hboot, until the lamd is up and running, instead of the inserted setsid. Although the starting local-qrsh will be removed in this case (and so not appear in a process listing like the above one), the started lamd will continue to operate and service for the program execution.

References and Documents

SGE-LAM Integration
[1] Latest version of the script sge-lam by Christopher Duncan. Because it can only be found in the mailing list archive at SGE and LAM/MPI, it's attached here for reference (sge-lam). As there was a small bug inside, this version was patched to remove any "-n" for the remote-qrsh call; this was taken from a former version of this script, where it was included.

[2] Archive with all the scripts used in this Howto: sge-lam-integration-scripts.tgz.

LAM/MPI

The implementation specific documentation can be found at the LAM/MPI website on the page "Documentation" (http://www.lam-mpi.org/using/docs/).

MPI documentation in general and tutorials

For some general introduction to MPI and MPI-Programming, you can study the following documents:

http://www.mpi-forum.org/docs

http://www.netlib.org/utk/papers/mpi-book/mpi-book.html

http://www-unix.mcs.anl.gov/mpi/usingmpi/index.html

http://www-unix.mcs.anl.gov/mpi/usingmpi2

ftp://math.usfca.edu/pub/MPI/mpi.guide.ps