File-Staging approaches in Grid Engine

Although Grid Engine is often used in an environment where NFS is used to share files, sometimes it's useful to have a way of transferring specific files associated with Grid jobs. Situations where this is applicable could include:

This HOWTO will cover two general cases for setting up File-Staging in SGE.

  1. Local File-Staging: this is when you wish to transfer files directly to the compute host on which a job will run

  2. Site File-Staging: this means that you need to obtain a file from outside the Grid and place it in a central repository, from which any compute host in the Grid can access it (via NFS or some secondary staging mechanism)


Local File-Staging

For non-NFS SGE clusters, file staging can be provided by prolog and epilog methods. The actual mechanism of file transfer will depend upon the local setup, but here we outline the generic setup. The "remote" compute hosts are ones which do not share any NFS filesystems with the master host or submit hosts.

This setup uses rcp as the file transfer mechanism. The underlying permissions for rcp need to be in place ahead of time. This includes:

It is advisable to test rcp manually between hosts before implementing the file-staging procedure. Note that rcp is a security concern, since it transfers data unencrypted and relies on host-based trust. However, it should be straightforward to convert this example to use scp instead of rcp.
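
As a rough illustration of the copy step, the shell functions below move one file to and from a remote host. This is only a sketch under stated assumptions: the function names, the COPY_CMD variable, and the host:path argument convention are hypothetical and are not taken from the example scripts. Setting COPY_CMD=scp is one possible way to move away from rcp.

```shell
#!/bin/sh
# Hypothetical staging helpers that a prolog/epilog pair could call.
# COPY_CMD is an assumed variable: rcp by default, scp as the safer choice.
COPY_CMD="${COPY_CMD:-rcp}"

# stage_in SOURCE DESTDIR: copy a (possibly remote) file into a local directory.
stage_in() {
    src="$1"
    destdir="$2"
    "$COPY_CMD" "$src" "$destdir/" || {
        echo "stage_in: failed to copy $src" >&2
        return 1
    }
}

# stage_out FILE DESTDIR: copy a result file back to where the input came from.
stage_out() {
    file="$1"
    dest="$2"
    "$COPY_CMD" "$file" "$dest/" || {
        echo "stage_out: failed to copy $file" >&2
        return 1
    }
}
```

A prolog might then call something like `stage_in submithost:/home/user/input.txt /scratch/job`, where both the host name and the paths are placeholders to be replaced with local values.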

Configuration

The example given here is a simple job which transfers an input text file to the compute host, converts all characters to uppercase, and transfers the output file back to the directory from which the input file was obtained.
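
The conversion step of such a job could be sketched as follows. This is not the changecase.sh script from the appendix, whose contents are not reproduced here; the function name and its arguments are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical job body: convert a staged input file to uppercase.
# change_case INPUT OUTPUT
change_case() {
    # tr maps every lowercase character to its uppercase equivalent
    tr '[:lower:]' '[:upper:]' < "$1" > "$2"
}
```

In a job script this would run after the input file has been staged to the compute host, for example `change_case input.txt output.txt`, leaving output.txt to be transferred back afterwards.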

The prototype consists of: the prolog and epilog scripts, which run on the remote compute hosts, and a sample job script which shows how to invoke the file transfer facility. The configuration is as follows:


Site File-Staging

The transfer of files from outside the Grid to a local central repository can often be a more complicated task. Additionally, it might be that only certain hosts in the grid are appropriate for doing this transfer. Perhaps only these hosts are able to go outside the firewall, or perhaps they have high bandwidth to the network or to the local storage. Finally, extremely large files may need to be transferred, e.g., a large dataset from a public archive.

In this situation, the ideal approach is to have the file transfer procedure be a separate job in Grid Engine. The processing jobs which depend upon this data are then submitted with a dependency on this transfer job. With this setup, the transfer jobs can be configured to run only on those hosts appropriate for this task, leaving your compute hosts free to do other jobs, instead of being held up while the data transfer occurs. By transferring to a central repository, you can then allow any compute host to process the file.
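
The dependency between the two jobs can be expressed with the -hold_jid option of qsub, which holds a job until the named job has finished. The following is a minimal sketch rather than the submit script from the appendix: the job names stage_data and process_data, the queue name transfer.q, and the QSUB override variable (useful for a dry run) are all hypothetical.

```shell
#!/bin/sh
# QSUB is an assumed override hook so the commands can be dry-run,
# e.g. QSUB="echo qsub" prints the commands instead of submitting.
QSUB="${QSUB:-qsub}"

# submit_pipeline TRANSFER_SCRIPT PROCESS_SCRIPT [TRANSFER_QUEUE]
submit_pipeline() {
    transfer_script="$1"
    process_script="$2"
    transfer_queue="${3:-transfer.q}"   # hypothetical queue name

    # Transfer job, restricted to the queue whose hosts may do transfers
    $QSUB -N stage_data -q "$transfer_queue" "$transfer_script" || return 1

    # Processing job, held until the job named stage_data has finished
    $QSUB -N process_data -hold_jid stage_data "$process_script"
}
```

Because the processing job is not bound to the transfer queue, it can be scheduled on any compute host once the transfer job completes.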

Configuration

[Note that the example FTP site below no longer works, and it is not clear which ftp command the transferjob.sh script uses.]

In this second example, a transfer job is used to download selected texts from the Project Gutenberg archive via anonymous ftp. A processing job then does a word frequency analysis and generates a report. The whole process is driven by a submit script, which submits the two tasks with proper dependencies.
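
To give an idea of what the word frequency step involves, here is a small shell sketch. The original example uses a Perl script, countwords.pl, whose contents are not reproduced here; the pipeline below is a substitute written for illustration, and the function name word_freq is an assumption.

```shell
#!/bin/sh
# word_freq FILE: print "count word" lines, most frequent word first.
word_freq() {
    # Split the text into one word per line, dropping non-letters
    tr -cs '[:alpha:]' '\n' < "$1" |
        # Fold case so "The" and "the" are counted together
        tr '[:upper:]' '[:lower:]' |
        # Discard any empty line left by leading punctuation
        grep -v '^$' |
        # Count identical words, then order by descending count
        sort | uniq -c | sort -k1,1nr -k2 |
        # Normalise the padded uniq output to "count word"
        awk '{ print $1, $2 }'
}
```

A processing job could call `word_freq` on each downloaded text in the central repository and write the result to a report file.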

The setup is as follows:

Modify the scripts to match your environment. It is recommended that you first test the transfer job by itself from the command line, to ensure that the anonymous ftp is working properly.


Appendix: list of example files and scripts

    Local File-Staging

    1. file_trans_prlg.sh

    2. file_trans_eplg.sh

    3. changecase.sh

    4. song_of_wreck.txt

    Site File-Staging

    1. .netrc

    2. transferjob.sh

    3. countwords.pl

    4. submitscript.sh