Although Grid Engine is often used in an environment where NFS is used to share files, it is sometimes useful to have a way of transferring specific files associated with Grid jobs. Situations where this is applicable include:
- grids where NFS is not used on some or all compute hosts, for performance or stability reasons
- cases where files must be obtained from some remote system; not every host in the grid may be able to retrieve these files, due to connectivity, security, or bandwidth restrictions, but it is still desirable for any compute host to be able to operate on them
This HOWTO covers two general cases for setting up File-Staging in SGE:
- Local File-Staging: transferring files directly to the compute host on which a job will run
- Site File-Staging: obtaining a file from outside the grid and placing it in a central repository, from which any compute host in the grid can access it (via NFS or some secondary staging mechanism)
For non-NFS SGE clusters, file staging can be provided by prolog and epilog methods. The actual mechanism of file transfer will depend upon the local setup, but here we outline the generic approach. The "remote" compute hosts are those which do not share any NFS filesystems with the master host or submit hosts.
This setup uses rcp as the file transfer mechanism. The underlying permissions for rcp need to be in place ahead of time. This includes:
- any potential submit host must be allowed to rcp to any potential compute host
- vice versa, any potential compute host must be allowed to rcp back to any potential submit host
It is advisable to test rcp manually between hosts before implementing the file-staging procedure. Of course, the use of rcp is a security concern; however, it should be straightforward to convert this example to use scp instead of rcp.
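For example, a quick manual round-trip test from a submit host might look like the following (the hostname compute01 and the file paths are placeholders; substitute your own):

    # Copy a test file to a compute host and back again, then compare.
    echo "staging test" > /tmp/stage_test.txt
    rcp /tmp/stage_test.txt compute01:/tmp/stage_test.txt
    rcp compute01:/tmp/stage_test.txt /tmp/stage_test_back.txt
    diff /tmp/stage_test.txt /tmp/stage_test_back.txt && echo "rcp works"

Repeat the test in the other direction from a compute host; converting to scp requires no other changes here.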
The example given here is a simple job which transfers an input text file to the compute host, converts all characters to uppercase, and transfers the output file back to the directory from which the input file was obtained.
The prototype consists of the prolog and epilog scripts, which run on the remote compute hosts, and a sample job script which shows how to invoke the file transfer facility. The configuration is as follows:
- for queues on the compute hosts, set the prolog and epilog to the scripts included here (a qconf sketch follows this list):
  prolog: file_trans_prlg.sh
  epilog: file_trans_eplg.sh
- use the sample job changecase.sh as a model to create any job script which makes use of stage-in/stage-out
- submit the job as specified in the script; you can use the file song_of_wreck.txt as sample input
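One way to attach the scripts to a queue is with qconf; the queue name all.q and the script locations below are assumptions for illustration:

    # Set the prolog and epilog attributes on an existing queue.
    # (Queue name and paths are placeholders for your own setup.)
    qconf -mattr queue prolog /sge/common/scripts/file_trans_prlg.sh all.q
    qconf -mattr queue epilog /sge/common/scripts/file_trans_eplg.sh all.q

You can verify the settings afterwards with qconf -sq all.q.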
It is recommended to use $TMPDIR as the working directory for staged files during a job run. This environment variable is automatically set by SGE to a directory unique to each job. The directory and its contents are automatically deleted after the epilog completes.
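As an illustration of the pattern (a minimal sketch, not the actual changecase.sh; the file names and the way the prolog and epilog learn about them are placeholders):

    #!/bin/sh
    # Sketch only: assumes the prolog has already staged input.txt
    # into $TMPDIR, and the epilog will stage output.txt back out.
    cd $TMPDIR
    tr '[:lower:]' '[:upper:]' < input.txt > output.txt

Because everything lives under $TMPDIR, no manual cleanup is needed when the job ends.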
The transfer of files from outside the Grid to a local central repository can often be a more complicated task. Additionally, it may be that only certain hosts in the grid are appropriate for the transfer: perhaps only those hosts can reach outside the firewall, or only they have high bandwidth to the network or to the local storage. Finally, extremely large files may need to be transferred, e.g., a large dataset from a public archive.
In this situation, the ideal approach is to have the file transfer procedure be a separate job in Grid Engine. The processing jobs which depend upon this data are then submitted with a dependency on this transfer job. With this setup, the transfer jobs can be configured to run only on those hosts appropriate for this task, leaving your compute hosts free to do other jobs, instead of being held up while the data transfer occurs. By transferring to a central repository, you can then allow any compute host to process the file.
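In SGE terms this pattern is two qsub calls, the second held on the first; the queue and job names below are illustrative, not part of the example scripts:

    # Transfer job, restricted to a queue on hosts that can reach outside.
    qsub -N fetch -q transfer.q transferjob.sh ...
    # Processing job, held until the transfer job completes.
    qsub -N process -hold_jid fetch processing_job.sh ...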
[Note that the example FTP site below no longer works, and it is not clear what ftp command is used in the transferjob.sh script.]
In this second example, a transfer job is used to download selected texts from the Gutenberg Project archive via anonymous ftp. A processing job then does a word frequency analysis and generates a report. The whole process is driven by a submit script, which submits the two tasks with the proper dependencies.
The setup is as follows:
- since we are using anonymous ftp to transfer files, we need a .netrc file in the user's home directory in order to make the ftp session scriptable. Rename the download to ".netrc" and place this file in the user's home directory with permissions 700 (a sketch of such a file follows this list)
- the transferjob.sh script takes arguments specifying which Gutenberg text to download and where to put it. The destination in this case should be an NFS directory available to all compute hosts. (NOTE: we are using the Gutenberg archive found at ftp.cdrom.com)
- the countwords.pl script does the word frequency analysis for the file given as an argument
- the submitscript.sh script drives the process, and accepts as arguments the directory and filename from the Gutenberg archive
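A minimal .netrc for anonymous ftp might be created as follows (the password for anonymous logins is conventionally your email address; user@example.com is a placeholder):

    # Create ~/.netrc for scriptable anonymous ftp logins.
    cat > ~/.netrc <<'EOF'
    machine ftp.cdrom.com
    login anonymous
    password user@example.com
    EOF
    chmod 700 ~/.netrc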
Modify the scripts to match your environment. It is recommended that you first test the transfer job by itself from the command line, to ensure that the anonymous ftp is working properly.
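Since the exact ftp invocation in transferjob.sh is not documented, here is one plausible way to script such a download by hand (the directory and file names are placeholders; the .netrc file supplies the login):

    # Fetch one text non-interactively via anonymous ftp.
    ftp ftp.cdrom.com <<'EOF'
    cd pub/gutenberg/etext97
    get somefile.txt
    quit
    EOF

If this works from the command line, the same commands should work from within the transfer job.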