Topic:

Tight Integration of the MPICH2 library into SGE.

Author:

Reuti, reuti__at__staff.uni-marburg.de; Philipps-University of Marburg, Germany

Version:

1.1 -- 2008-11-25 Updated release, comments and corrections are welcome
1.1.1 -- 2009-03-10 Updated a wrong spelling of $SGE_TASK_ID in the supplied scripts (thx to Kenneth for mentioning this)
1.2 -- 2010-11-19 Final version with minor adjustments


!!! Warning !!!

With version 1.3 of MPICH2, Hydra became the default startup method for the slave tasks, and the other startup methods will be removed over time. Hydra has a Tight Integration with SGE compiled in by default, so no special setup in SGE is necessary any longer to support MPICH2 jobs.

Hydra works out-of-the-box with a parallel environment in which start_proc_args and stop_proc_args are both set to NONE (in essence, the same PE can now be used for Open MPI and MPICH2), and in the job script a plain mpiexec will automatically discover the granted slots and nodes without any further options. Nevertheless, if more than one MPI installation is available in a cluster, the mpiexec corresponding to the MPI library the application was compiled with must be used, as usual.
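A minimal sketch of such a PE, created e.g. with "qconf -ap hydra" (the name and slot count are placeholders, and allocation_rule may be chosen to match your site policy):

pe_name hydra
slots 999
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE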

This document only stays here in case you come across a legacy installation and have to fix or set up such an older version.



Prerequisites
Configuration of SGE with qconf or the GUI

You should already know how to change settings in SGE, such as setting up and modifying a queue definition or the entries in a PE configuration. Additional information about queues and parallel environments can be found in the SGE man pages "queue_conf" and "sge_pe" (make sure the SGE man pages are included in your $MANPATH).

Target platform

This Howto targets MPICH2 version 1.0.8 and SGE 6.2 on Linux. Most likely it will work the same way under other operating systems, although some of the commands may need slight modifications. It will not work this way with MPICH2 version 1.0, as some of the changes that allow an easy Tight Integration were only introduced in 1.0.1.

MPICH2

MPICH2 is a library from Argonne National Laboratory (http://www.anl.gov) which implements the MPI-2 standard. Before you start with the integration of MPICH2 into SGE, you should already be familiar with the operation of MPICH2 outside of SGE and know how to compile a parallel program using MPICH2.

Included setups and scripts

The supplied archive in [1] contains the necessary scripts for the mpd and smpd startup methods (for the gforker method only the example shell script is included, as this startup method needs no scripts to start and stop any daemons). It contains scripts and programs similar to the PVM and MPICH integration packages distributed with SGE. To install it for common use in the whole cluster, you may untar it in $SGE_ROOT to get the new directories $SGE_ROOT/mpich2_mpd, $SGE_ROOT/mpich2_smpd, $SGE_ROOT/mpich2_smpd_rsh and $SGE_ROOT/mpich2_gforker.
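For example, assuming the downloaded archive lies in /tmp:

$ cd $SGE_ROOT
$ tar -xzvf /tmp/mpich2-62.tgz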

A short program is provided in [2], which will allow you to observe the correct distribution of the spawned tasks.

Queue configuration

The supplied job scripts are meant to be executed under the bash shell. As the default setting in SGE after installation is to use the csh shell, you may either need to change two entries in the queue definition to read:

$ qconf -sq all.q
...
shell /bin/bash
...
shell_start_mode unix_behavior
...
(please see "man queue_conf" for details about this setting), or submit the (parallel) jobs with the additional argument:
$ qsub -S /bin/bash ...
Please note that under the Linux operating system /bin/sh is often a link to /bin/bash, so it can be abbreviated this way.
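Alternatively, the interpreter request can be embedded in the job script itself; a minimal sketch:

#!/bin/sh
#$ -S /bin/bash
# ... the actual commands of the (parallel) job follow here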



Introduction to the MPICH2 family
This new MPICH2 implementation of the MPI-2 standard was created to supersede the widely used MPICH(1) implementation. Besides implementing the MPI-2 standard, another goal was a faster startup. To give the user greater flexibility, there are (for now) three startup methods implemented: mpd, smpd and gforker.

Be aware that for each startup method, and for each way you chose to compile it, you will get a separate set of mpirun and/or mpiexec commands. They are not interchangeable! Hence, once you have installed mpd and compiled a program to run in the ring, you cannot switch to smpd simply by using a different mpirun or mpiexec. Instead you have to recompile (or at least relink) your program with the libraries intended for this specific startup method. This means that you have to plan the $PATH used during compilation and execution of the parallel program carefully to get correct behavior. Not doing so will result in strange error messages which do not point directly to the cause of the trouble. After compiling your application software, it may be advisable not to rely on the $PATH set in your interactive shell at submission time, but to set it explicitly in the script submitted to SGE, as we do in this Howto for demonstration purposes. Also note that the preferred startup command in MPICH2 is mpiexec, not mpirun.
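For example, to make sure the mpd flavor is picked up (the installation prefix is the one used throughout this Howto and has to be adjusted to your site):

export PATH=/home/reuti/local/mpich2-1.0.8/mpd/bin:$PATH
which mpiexec    # should now point into the mpd installation, not into another MPI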



Tight Integration of the mpd startup method
First we discuss the integration of the preferred startup method in MPICH2, called mpd. You can compile MPICH2 after configuring it, perhaps with an alternative installation path for the parallel library:
$ ./configure --prefix=/home/reuti/local/mpich2-1.0.8/mpd

After the usual make and make install we can compile the short program which is supplied in [2] with:

$ mpicc -o mpihello mpihello.c

Similar to the PVM integration, we need a small helper program to start the daemons as a task on the slave nodes using the qrsh command. In some sense this start_mpich2 can be seen as a generic program that extends SGE with the ability to run a qrsh command in the background, and it can easily be modified for similar startup methods.

If you installed the whole package as suggested in $SGE_ROOT/mpich2_mpd, set the working directory to $SGE_ROOT/mpich2_mpd/src and compile the included program with:

$ ./aimk
$ ./install.sh

The installation process will put the helper program mpich2_mpd into a newly created directory $SGE_ROOT/mpich2_mpd/bin, which is the default location where the included script startmpich2.sh looks for this program. The helper program must be compiled for every platform in the cluster on which you want to run this startup method. A parallel environment for this startup method may look like:

$ qconf -sp mpich2_mpd
pe_name mpich2_mpd
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /usr/sge/mpich2_mpd/startmpich2.sh -catch_rsh $pe_hostfile \
/home/reuti/local/mpich2-1.0.8/mpd
stop_proc_args /usr/sge/mpich2_mpd/stopmpich2.sh -catch_rsh \
/home/reuti/local/mpich2-1.0.8/mpd
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE

Remember to attach this PE to a cluster queue of your choice and to adjust the path to your MPICH2 installation. As the chain of Python modules used in this startup method will create additional processes and process groups, it is essential to include a special switch in your cluster configuration, which will kill the processes at the end of a job or after an issued qdel by identifying the associated processes via an additional group id that is attached to all spawned processes on a slave node (by default, only the process group of the first process started by qrsh_starter is killed, i.e. that complete process group including its children):

$ qconf -sconf
...
execd_params ENABLE_ADDGRP_KILL=TRUE
...
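This parameter lives in the global cluster configuration, which can be edited with:

$ qconf -mconf

(add or extend the execd_params line there, keeping any parameters which are already defined).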

Having done so, we can now submit a job with the usual sequence:

$ qsub -pe mpich2_mpd 4 mpich2_mpd.sh

This will first start a local mpd process on the master node of the parallel job. Even this process is started via a local qrsh in startmpich2.sh, so that it stays under the control of its own sge_shepherd. After it has been launched successfully, its port is queried, and the accompanying daemons on the slave nodes of the parallel job are started with this information.
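The essential part of the supplied mpich2_mpd.sh job script is sketched below (paths are the ones used in this Howto and have to be adjusted; the script in [1] may differ in details):

#!/bin/sh
# use the mpd flavor of MPICH2 for this job
export PATH=/home/reuti/local/mpich2-1.0.8/mpd/bin:$PATH
# MPD_CON_EXT must carry the same value as used in startmpich2.sh, so that
# mpiexec finds the console of this job's own mpd ring (assumed format)
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
# the machine file was prepared by startmpich2.sh in the job's $TMPDIR
mpiexec -machinefile $TMPDIR/machines -n $NSLOTS ~/mpihello
exit 0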

The mpich2_mpd.sh will generate a *.po$JOB_ID file like:

$ cat mpich2.sh.po628
-catch_rsh /var/spool/sge/pc15381/active_jobs/628.1/pe_hostfile /home/reuti/local/mpich2-1.0.8/mpd
pc15381:2
pc15370:2
startmpich2.sh: check for local mpd daemon (1 of 10)
/usr/sge/bin/lx24-x86/qrsh -inherit -V pc15381 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
startmpich2.sh: check for local mpd daemon (2 of 10)
startmpich2.sh: check for mpd daemons (1 of 10)
/usr/sge/bin/lx24-x86/qrsh -inherit -V pc15370 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -p 14581 -n
startmpich2.sh: check for mpd daemons (2 of 10)
startmpich2.sh: got all 2 of 2 nodes
-catch_rsh /home/reuti/local/mpich2-1.0.8/mpd

The first check only looks for the local mpd daemon, i.e. whether it can be contacted with an mpdtrace -l, as we need this information to instruct the other daemons to use the port selected by the first started mpd. The following loop starts the daemons on all remaining slave nodes and waits until all of them are up and running. Once startmpich2.sh has finished, the mpd daemons are available, and the user program started by one mpiexec (or several) in the job script will spawn all its processes into the already running mpd ring. On the master node of the parallel job the following processes can be discovered:

$ ssh pc15381 ps -e f -o pid,ppid,pgrp,command --cols=120
PID PPID PGRP COMMAND
...
22110 1 22110 /usr/sge/bin/lx24-x86/sge_execd
31712 22110 31712 \_ sge_shepherd-628 -bg
31775 31712 31775 | \_ /bin/sh /var/spool/sge/pc15381/job_scripts/628
31776 31775 31775 | \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpiexec -machinefile /tmp/628.1.all.q/mac
31744 22110 31744 \_ sge_shepherd-628 -bg
31745 31744 31745 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/628.1/1.pc15381
31755 31745 31755 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
31777 31755 31777 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
31780 31777 31780 | \_ /home/reuti/mpihello
31778 31755 31778 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
31779 31778 31779 \_ /home/reuti/mpihello
31736 1 31713 /usr/sge/bin/lx24-x86/qrsh -inherit -V pc15381 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
31759 1 31713 /usr/sge/bin/lx24-x86/qrsh -inherit -V pc15370 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -

The distribution of the user processes follows the granted slot allocation. The other two mpihello processes can be found on the second node:

$ ssh pc15370 ps -e f -o pid,ppid,pgrp,command --cols=120
PID PPID PGRP COMMAND
...
15848 1 15848 /usr/sge/bin/lx24-x86/sge_execd
3146 15848 3146 \_ sge_shepherd-628 -bg
3148 3146 3148 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15370/active_jobs/628.1/1.pc15370
3156 3148 3156 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -p 14581 -n
3157 3156 3157 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -p 14581 -n
3159 3157 3159 | \_ /home/reuti/mpihello
3158 3156 3158 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -p 14581 -n
3160 3158 3160 \_ /home/reuti/mpihello

On the master node of the parallel job, the console of the mpd ring will be created in /tmp, as this is unfortunately hard-coded in several places in the MPICH2 source (it might change in version 1.1 of MPICH2 to honor the $TMPDIR which is already set by SGE). To distinguish between several consoles of the same user on a node, the environment variable MPD_CON_EXT is set to reflect the job number and task id of SGE in the name of the console file.

Possible optimization: in the delivered version of the scripts, the MPICH2 console will be created (only) on the master node of the parallel job at the top level of /tmp. This works and is still unique for each user and job. Since version 1.1 of MPICH2 it is also possible to relocate the console into the $TMPDIR of the job, to avoid cluttering /tmp. To achieve this, the variable $MPD_TMPDIR must be set to the value of $TMPDIR in start_proc_args, the job script and stop_proc_args, so that all processes which have to access the console are able to find it.
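With MPICH2 1.1 or later this boils down to adding a line like the following near the top of startmpich2.sh, stopmpich2.sh and the job script (a sketch, not part of the supplied scripts for 1.0.8):

export MPD_TMPDIR=$TMPDIR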

Pitfall (especially when using ROCKS): in the script startmpich2.sh, the first comparison in the loop used to start the mpds must succeed to get the correct information about the port to be used. This is implemented by comparing the actual output of the command hostname with the hostname recorded in the $PE_HOSTFILE. The ROCKS implementation always delivers the FQDN (fully qualified domain name) of the machine instead of the plain hostname. To get the desired result on ROCKS, the command must be changed to read:

NODE=`hostname --short`
in line 178, but only when you set up SGE to ignore the FQDN (so that only the plain hostname is recorded in $PE_HOSTFILE). When SGE was set up to use the FQDN as well, it will work on ROCKS by default, but on other Linux distributions it might then be necessary to define:
NODE=`hostname --long`
instead. For other operating systems, consult their documentation to find out which option to hostname delivers the correct output, or adjust the result of the command with some scripting.
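Instead of relying on a particular option to hostname, a more portable sketch is to strip any domain part in the shell itself (only appropriate when the plain hostname is recorded in $PE_HOSTFILE):

NODE=`hostname`
NODE=${NODE%%.*}    # drop a trailing domain part, if present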



Tight Integration of the daemonless smpd startup method
To compile MPICH2 for an smpd-based startup, it must first be configured (after a make distclean, in case you just walked through the mpd startup method before):
$ ./configure --prefix=/home/reuti/local/mpich2-1.0.8/smpd --with-pm=smpd

and to get a Tight Integration we need a PE like the following (note the -catch_rsh argument to the start script of the PE):

$ qconf -sp mpich2_smpd_rsh
pe_name mpich2_smpd_rsh
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /usr/sge/mpich2_smpd_rsh/startmpich2.sh -catch_rsh \
$pe_hostfile
stop_proc_args /usr/sge/mpich2_smpd_rsh/stopmpich2.sh
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE

Please look up in the MPICH2 documentation how to create a .smpd file with a "phrase" in it.
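To my knowledge this is a one-line file in your home directory that must not be readable by other users; a sketch (choose your own phrase, of course):

$ echo "phrase=YOUR_SECRET_PHRASE" > ~/.smpd
$ chmod 600 ~/.smpd

After submitting the job in exactly the same way as before (but this time using the script mpich2_smpd_rsh.sh in the qsub command):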

$ qsub -pe mpich2_smpd_rsh 4 mpich2_smpd_rsh.sh

you should see a distribution on the master node of your parallel job like:

$ ssh pc15381 ps -e f -o pid,ppid,pgrp,command --cols=120
PID PPID PGRP COMMAND
...
22110 1 22110 /usr/sge/bin/lx24-x86/sge_execd
31930 22110 31930 \_ sge_shepherd-630 -bg
31955 31930 31955 | \_ /bin/sh /var/spool/sge/pc15381/job_scripts/630
31956 31955 31955 | \_ mpiexec -rsh -nopm -n 4 -machinefile /tmp/630.1.all.q/machines /home/reuti/mpihello
31957 31956 31955 | \_ mpiexec -rsh -nopm -n 4 -machinefile /tmp/630.1.all.q/machines /home/reuti/mpihello
31958 31956 31955 | \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15381 env PMI_RANK=0 PMI_SIZE=4 PMI_KVS=359B9A86
31959 31956 31955 | \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15381 env PMI_RANK=1 PMI_SIZE=4 PMI_KVS=359B9A86
31960 31956 31955 | \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15370 env PMI_RANK=2 PMI_SIZE=4 PMI_KVS=359B9A86
31961 31956 31955 | \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15370 env PMI_RANK=3 PMI_SIZE=4 PMI_KVS=359B9A86
31986 22110 31986 \_ sge_shepherd-630 -bg
31987 31986 31987 | \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/630.1/1.pc15381
32004 31987 32004 | \_ /home/reuti/mpihello
31991 22110 31991 \_ sge_shepherd-630 -bg
31992 31991 31992 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/630.1/2.pc15381
32010 31992 32010 \_ /home/reuti/mpihello

The important thing is that the started job script, including the mpiexec and the mpihello processes, is under full SGE control.

(Side note: the default remote shell command compiled into MPICH2 this way is ssh -x. You may replace it by changing the default value ssh -x to a plain rsh in the routine mpiexec_rsh() in $MPICH2_ROOT/src/pm/smpd/mpiexec_rsh.c of the MPICH2 source, or change it each time during execution of your application program by setting the environment variable "MPIEXEC_RSH=rsh; export MPIEXEC_RSH" to get access to the rsh wrapper, as in the original MPICH integration.)



Tight Integration of the daemon-based smpd startup method
As for the mpd startup method, we will need a small helper program. Because different parameters have to be used, this program is not identical to the one used in the Tight mpd Integration.

If you installed the whole package as suggested in $SGE_ROOT/mpich2_smpd, set the working directory to $SGE_ROOT/mpich2_smpd/src and compile the included program with:

$ ./aimk
$ ./install.sh

The installation process will put the helper program mpich2_smpd into a newly created directory $SGE_ROOT/mpich2_smpd/bin, which is the default location where the included script startmpich2.sh looks for this program. A parallel environment for this startup method may look like:

$ qconf -sp mpich2_smpd
pe_name mpich2_smpd
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /usr/sge/mpich2_smpd/startmpich2.sh -catch_rsh $pe_hostfile \
/home/reuti/local/mpich2-1.0.8/smpd
stop_proc_args /usr/sge/mpich2_smpd/stopmpich2.sh -catch_rsh \
/home/reuti/local/mpich2-1.0.8/smpd
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE

If we start the daemons on our own, we have to select a free port. Although it may not be safe in all cluster setups, the formula included in startmpich2.sh, stopmpich2.sh and the demonstration submit script mpich2_smpd.sh uses "$JOB_ID MOD 5000 + 20000" for the port. Depending on the job turnaround in your cluster, you may modify it in all locations where it is defined. To force the smpds not to fork themselves into daemon land, they are started with the additional parameter "-d 0". According to the MPICH2 team, this will not have any speed impact (because the debugging level is set to 0), but only prevents the daemons from forking; the port computation itself is sketched below.
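Expressed in shell syntax, the port computation is just (a sketch; the supplied scripts may phrase it slightly differently):

port=$(($JOB_ID % 5000 + 20000))

Having this set up in a proper way, we can submit the demonstration job: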

$ qsub -pe mpich2_smpd 4 mpich2_smpd.sh

and observe the distributed tasks on the nodes, after first checking which nodes were selected:

$ qstat -g t
job-ID prior name user state submit/start at queue master ja-task-ID
------------------------------------------------------------------------------------------------------------------
643 0.55500 mpich2_smp reuti r 11/25/2008 13:11:37 all.q@pc15370.Chemie.Uni-Marbu SLAVE
all.q@pc15370.Chemie.Uni-Marbu SLAVE
643 0.55500 mpich2_smp reuti r 11/25/2008 13:11:37 all.q@pc15381.Chemie.Uni-Marbu MASTER
all.q@pc15381.Chemie.Uni-Marbu SLAVE
all.q@pc15381.Chemie.Uni-Marbu SLAVE

On the head node of the MPICH2 job, a process distribution like the following can be observed:

$ ssh pc15381 ps -e f --cols=120
PID TTY STAT TIME COMMAND
...
22110 ? Sl 1:09 /usr/sge/bin/lx24-x86/sge_execd
2446 ? S 0:00 \_ sge_shepherd-643 -bg
2518 ? Ss 0:00 | \_ /bin/sh /var/spool/sge/pc15381/job_scripts/643
2519 ? S 0:00 | \_ mpiexec -n 4 -machinefile /tmp/643.1.all.q/machines -port 20643 /home/reuti/mpihe
2485 ? Sl 0:00 \_ sge_shepherd-643 -bg
2486 ? Ss 0:00 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/643.1/1.pc1
2495 ? S 0:00 \_ /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
2520 ? S 0:00 \_ /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
2521 ? R 0:12 \_ /home/reuti/mpihello
2522 ? R 0:11 \_ /home/reuti/mpihello
2477 ? Sl 0:00 /usr/sge/bin/lx24-x86/qrsh -inherit -V pc15381 /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -
2497 ? Sl 0:00 /usr/sge/bin/lx24-x86/qrsh -inherit -V pc15370 /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -

On the slave node, only the daemon and the attached processes are shown:

$ ssh pc15370 ps -e f --cols=120
PID TTY STAT TIME COMMAND
...
15848 ? Sl 2:06 /usr/sge/bin/lx24-x86/sge_execd
23121 ? Sl 0:00 \_ sge_shepherd-643 -bg
23122 ? Ss 0:00 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15370/active_jobs/643.1/1.pc1
23130 ? S 0:00 \_ /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
23131 ? S 0:00 \_ /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
23132 ? R 0:32 \_ /home/reuti/mpihello
23133 ? R 0:32 \_ /home/reuti/mpihello

The qrsh commands forked off by startmpich2.sh (and the start_mpich2 program) are no longer bound to the starting script given in start_proc_args, but they neither consume any CPU time nor need to be shut down during a qdel (they are just waiting for the shutdown of the spawned daemons on the slave nodes). What is important is that the working tasks of mpihello are bound to the process chain, so that the accounting will be correct and a controlled shutdown of the daemons is possible. To give some feedback to the user about the started tasks, the *.po$JOB_ID file will contain the check of the started MPICH2 universe:

$ cat mpich2_smpd.sh.po643
-catch_rsh /var/spool/sge/pc15381/active_jobs/643.1/pe_hostfile /home/reuti/local/mpich2-1.0.8/smpd
pc15381
pc15381
pc15370
pc15370
/usr/sge/bin/lx24-x86/qrsh -inherit pc15381 /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
/usr/sge/bin/lx24-x86/qrsh -inherit pc15370 /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
startmpich2.sh: check for smpd daemons (1 of 10)
startmpich2.sh: found running smpd on pc15381
startmpich2.sh: found running smpd on pc15370
startmpich2.sh: got all 2 of 2 nodes
-catch_rsh /home/reuti/local/mpich2-1.0.8/smpd
shutdown smpd on pc15370
shutdown smpd on pc15381

If all is running fine, you may comment out these lines to shorten the output a little and avoid any confusion for the user. Depending on your personal taste, you may put the definition of your MPICH2 path in a file like .bashrc, which will be sourced during a non-interactive login.

Note: it is mandatory that your job script includes a line "export SMPD_OPTION_NO_DYNAMIC_HOSTS=1" besides the port definition. Otherwise the node where the job script is running will be added to your ~/.smpd, which would prevent a proper shutdown, although this environment variable is already set during the start and stop of the daemons in the appropriate scripts of the PE. The option -V is also used in the accompanying scripts for this Howto.
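So besides the adjusted $PATH, the relevant lines of the demonstration job script are sketched as follows (the supplied mpich2_smpd.sh may differ in details):

export SMPD_OPTION_NO_DYNAMIC_HOSTS=1
port=$(($JOB_ID % 5000 + 20000))
mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port ~/mpihello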


Tight Integration of the gforker startup method

Finally we discuss the integration of a startup method which is limited to one machine and hence needs no network communication at all. The command line to compile MPICH2 this way is:
$ ./configure --prefix=/home/reuti/local/mpich2-1.0.8/gforker --with-pm=gforker

After the usual make and make install we can compile the short program which is supplied in [2] with:

$ mpicc -o mpihello mpihello.c

Although we will run on one machine only, we will use a parallel environment (PE) inside SGE, to stay in line with the SGE convention of requesting more than one slot through a parallel environment in the submit command. This PE may look like:

$ qconf -sp mpich2_gforker
pe_name mpich2_gforker
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE

Remember to add this PE to a cluster queue of your choice.
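A sketch of what mpich2_gforker.sh essentially contains (the installation path has to be adjusted to your site; the supplied script may differ in details):

#!/bin/sh
# use the gforker flavor of MPICH2, which starts all processes locally
export PATH=/home/reuti/local/mpich2-1.0.8/gforker/bin:$PATH
mpiexec -n $NSLOTS ~/mpihello
exit 0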

$ qsub -pe mpich2_gforker 4 mpich2_gforker.sh

And with:

$ ssh pc15370 ps -e f -o pid,ppid,pgrp,command --cols=120
PID PPID PGRP COMMAND
...
15848 1 15848 /usr/sge/bin/lx24-x86/sge_execd
7445 15848 7445 \_ sge_shepherd-647 -bg
7447 7445 7447 \_ /bin/sh /var/spool/sge/pc15370/job_scripts/647
7448 7447 7447 \_ mpiexec -n 4 /home/reuti/mpihello
7449 7448 7447 \_ /home/reuti/mpihello
7450 7448 7447 \_ /home/reuti/mpihello
7451 7448 7447 \_ /home/reuti/mpihello
7452 7448 7447 \_ /home/reuti/mpihello

we can already see the proper startup and Tight Integration of all started processes.



Nodes with more than one network interface
With version 1.0.8 of MPICH2 it is possible to direct the network communication to a dedicated interface. For this to work, you have to adjust the generated machine file, i.e. the file $TMPDIR/machines created by the script defined in start_proc_args, to include the interface name after the number of slots, e.g.:

node01:2 ifhn=node01-grid
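One way to generate such lines is to adjust the loop in the PE's start script that converts the $PE_HOSTFILE into $TMPDIR/machines; a sketch, assuming the dedicated interfaces follow the naming <hostname>-grid as in the example above:

# append the name of the dedicated interface to each granted node
while read host slots rest; do
    echo "$host:$slots ifhn=$host-grid"
done < $PE_HOSTFILE > $TMPDIR/machines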


References and Documents
SGE-MPICH2 Integration

[1] Archive with all the scripts used in this Howto: mpich2-62.tgz. It should be installed in your $SGE_ROOT.

[2] Archive with a small MPICH2 program to check the correct distribution of all the tasks: mpihello.tgz.

MPICH2

The latest version of MPICH2 and build instructions can be downloaded from http://www.mcs.anl.gov/research/projects/mpich2/.

MPI documentation in general and tutorials

For a general introduction to MPI and MPI programming, you can study the following documents: