sge_pe.5
NAME
sge_pe - Grid Engine parallel environment configuration file format
DESCRIPTION
Parallel environments are parallel programming and runtime environments
supporting the execution of shared memory or distributed memory
parallelized applications. Parallel environments usually require some
kind of setup to be operational before starting parallel applications.
Examples of common parallel environments are OpenMP on shared memory
multiprocessor systems, and Message Passing Interface (MPI) on shared
memory or distributed systems.
sge_pe allows for the definition of interfaces to arbitrary parallel
environments. Once a parallel environment is defined or modified with
the -ap or -mp options to qconf(1), and linked with one or more queues
via pe_list in queue_conf(5), the environment can be requested for a job
via the -pe switch to qsub(1), together with a request for a numeric
range of parallel processes to be allocated by the job. Additional -l
options may be used to specify more detailed job requirements.
Note that Grid Engine allows backslashes (\) to be used to escape newline
characters. The backslash and the newline are replaced with a space
character before any interpretation.
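The continuation rule can be illustrated with a minimal Python sketch. The function name join_continuations and the sample attribute value are made up for illustration and are not part of Grid Engine; the sketch only mirrors the replacement described above.

```python
def join_continuations(text: str) -> str:
    """Replace every backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


# Hypothetical configuration text with one continued line.
raw = "user_lists deptA,\\\ndeptB\n"
print(join_continuations(raw))
```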
FORMAT
The format of a sge_pe file is defined as follows:
pe_name
The name of the parallel environment in the format for pe_name in
sge_types(1). To be used in the qsub(1) -pe switch.
slots
The total number of slots (normally one per parallel process or thread)
allowed to be filled concurrently under the parallel environment. The
type is integer; valid values are 0 to 9999999.
user_lists
xuser_lists
A comma-separated list of user access list names (see access_list(5)).
Each user contained in at least one of the user_lists access lists has
access to the parallel environment. If the user_lists parameter is set
to NONE (the default) any user has access if not explicitly excluded
via the xuser_lists parameter.
Each user contained in at least one of the xuser_lists access lists is
not allowed to access the parallel environment. If the xuser_lists
parameter is set to NONE (the default) any user has access.
If a user is contained in access lists in both xuser_lists and
user_lists, the user is denied access to the parallel environment.
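The access rules above can be summarised in a short Python sketch. It is a simplification under stated assumptions: access lists are modelled as plain sets of user names (real access lists, see access_list(5), may also contain groups), an empty list stands for NONE, and the function and variable names are hypothetical.

```python
def pe_access_allowed(user, user_lists, xuser_lists, acls):
    """Return True if `user` may use the PE.

    user_lists / xuser_lists: lists of access list names ([] means NONE).
    acls: dict mapping access list name -> set of user names (assumed model).
    """
    # Exclusion always wins, even if the user is also in user_lists.
    if any(user in acls.get(name, set()) for name in xuser_lists):
        return False
    # user_lists set to NONE: everyone not excluded has access.
    if not user_lists:
        return True
    return any(user in acls.get(name, set()) for name in user_lists)


# Hypothetical access lists.
acls = {"chem": {"alice", "bob"}, "banned": {"bob"}}
print(pe_access_allowed("alice", ["chem"], ["banned"], acls))
print(pe_access_allowed("bob", ["chem"], ["banned"], acls))
```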
start_proc_args
stop_proc_args
The command line of the startup or shutdown procedure, respectively,
for the parallel environment (an executable command, plus possible
arguments), or "none" for no procedure (typically for tightly
integrated PEs). The command line is started directly, not in a shell.
An optional prefix "user@" specifies the username under which the
procedure is to be started. In that case, see the SECURITY section
below concerning the security issues of running as a privileged user.
The startup procedure is invoked by sge_shepherd(8) on the master node
of the job prior to executing the job script. Its purpose is to set up
the parallel environment according to its needs. The shutdown
procedure is invoked by sge_shepherd(8) after the job script has
finished. Its purpose is to stop the parallel environment and to remove
it from all participating systems. The standard output of the
procedure is redirected to the file REQUEST.poJID in the job's working
directory (see qsub(1)), with REQUEST being the name of the job as
displayed by qstat(1), and JID being the job's identification number.
Likewise, the standard error output is redirected to REQUEST.peJID. If
the -e or -o options are given on job submission, the PE error and
standard output are merged into the paths specified.
The following special variables, expanded at runtime, can be used
(besides any other strings which have to be interpreted by the start
and stop procedures) to constitute a command line:
$pe_hostfile
The pathname of a file containing a detailed description of the
layout of the parallel environment to be set up by the startup
procedure. Each line of the file refers to a host on which
parallel processes are to be run. The first entry of each line
denotes the hostname, the second entry the number of parallel
processes to be run on the host, the third entry the name of the
queue. The entries are separated by spaces. If -binding pe is
specified on job submission, the fourth column is the core
binding specification as colon-separated socket-core pairs, like
"0,0:0,1", meaning the first core on the first socket and the
second core on the first socket can be used for binding.
Otherwise it will be "UNDEFINED". With the obsolete queue
processors specification the fourth entry could be a multi-
processor configuration (or "<NULL>").
$host
The name of the host on which the startup or stop procedures are
run.
$ja_task_id
The array job task index (0 if not an array job).
$job_owner
The user name of the job owner.
$job_id
Grid Engine's unique job identification number.
$job_name
The name of the job.
$pe
The name of the parallel environment in use.
$pe_slots
Number of slots granted for the job.
$processors
The processors string as contained in the queue configuration
(see queue_conf(5)) of the master queue (the queue in which the
startup and stop procedures are run).
$queue
The cluster queue of the master queue instance.
$sge_cell
The SGE_CELL environment variable (useful for locating files).
$sge_root
The SGE_ROOT environment variable (useful for locating files).
$stdin_path
The standard input path.
$stderr_path
The standard error path.
$stdout_path
The standard output path.
$merge_stderr
$fs_stdin_host
$fs_stdin_path
$fs_stdin_tmp_path
$fs_stdin_file_staging
$fs_stdout_host
$fs_stdout_path
$fs_stdout_tmp_path
$fs_stdout_file_staging
$fs_stderr_host
$fs_stderr_path
$fs_stderr_tmp_path
$fs_stderr_file_staging
The start and stop commands are run with the same environment setting
as that of the job to be started afterwards (see qsub(1)).
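A start procedure typically reads the file named by $pe_hostfile. The following Python sketch parses the format described above, assuming whitespace-separated columns. The names HostEntry and parse_pe_hostfile, and the sample host and queue names, are hypothetical. A fourth column that is not a list of socket,core pairs (for example "UNDEFINED", "<NULL>", or an obsolete processors string) is treated as having no binding.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Matches colon-separated socket,core pairs such as "0,0:0,1".
_BINDING_RE = re.compile(r"\d+,\d+(?::\d+,\d+)*")


class HostEntry(NamedTuple):
    host: str
    slots: int
    queue: str
    binding: Optional[List[Tuple[int, int]]]  # None if no socket,core list


def parse_pe_hostfile(text: str) -> List[HostEntry]:
    """Parse the contents of a PE hostfile into one entry per line."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        fourth = fields[3] if len(fields) > 3 else "UNDEFINED"
        binding = None
        if _BINDING_RE.fullmatch(fourth):
            pairs = (pair.split(",") for pair in fourth.split(":"))
            binding = [(int(sock), int(core)) for sock, core in pairs]
        entries.append(HostEntry(fields[0], int(fields[1]), fields[2], binding))
    return entries


# Made-up sample content; real files are written by Grid Engine.
sample = "node01 2 all.q@node01 0,0:0,1\nnode02 1 all.q@node02 UNDEFINED\n"
for entry in parse_pe_hostfile(sample):
    print(entry.host, entry.slots, entry.binding)
```

In a real start procedure the text would come from the file whose path is passed on the command line via $pe_hostfile.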
allocation_rule
The allocation rule is interpreted by the scheduler thread and helps
the scheduler to decide how to distribute parallel processes among the
available machines. If, for instance, a parallel environment is built
for shared memory applications only, all parallel processes have to be
assigned to a single machine, no matter how many suitable machines are
available. If, however, the parallel environment follows the
distributed memory paradigm, an even distribution of processes among
machines may be favorable, as may packing processes onto the minimum
number of machines.
The current version of the scheduler only understands the following
allocation rules:
int
An integer, fixing the number of processes per host. If it is 1,
all processes have to reside on different hosts. If the special
name $pe_slots is used, the full range of processes as specified
with the qsub(1) -pe switch has to be allocated on a single host
(no matter what value belonging to the range is finally chosen
for the job to be allocated).
$fill_up
Starting from the best suitable host/queue, all available slots
are allocated. Further hosts and queues are "filled up" as long
as a job still requires slots for parallel tasks.
$round_robin
From all suitable hosts, a single slot is allocated until all
tasks requested by the parallel job are dispatched. If more
tasks are requested than suitable hosts are found, allocation
starts again from the first host. The allocation scheme walks
through suitable hosts in a most-suitable-first order.
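The difference between the rules can be seen in a toy Python model. This is only a sketch and not the scheduler's implementation: hosts are given as a list of (name, free slots) pairs with unique names, already sorted most-suitable-first, and the real scheduler takes many more factors into account. The function names are hypothetical, and the integer rule is assumed to require a request that is an exact multiple of the per-host count.

```python
def fill_up(hosts, needed):
    """$fill_up: take all free slots on each host in turn."""
    alloc = {}
    for host, free in hosts:
        if needed == 0:
            break
        take = min(free, needed)
        if take:
            alloc[host] = take
        needed -= take
    return alloc if needed == 0 else None  # None: request cannot be met


def round_robin(hosts, needed):
    """$round_robin: one slot per host per pass, restarting at the first host."""
    free = dict(hosts)
    alloc = {host: 0 for host, _ in hosts}
    while needed > 0:
        progressed = False
        for host, _ in hosts:
            if needed == 0:
                break
            if alloc[host] < free[host]:
                alloc[host] += 1
                needed -= 1
                progressed = True
        if not progressed:
            return None  # all hosts are full
    return {host: n for host, n in alloc.items() if n}


def fixed_per_host(hosts, needed, per_host):
    """Integer rule: exactly per_host slots on each chosen host (assumed model)."""
    if needed % per_host:
        return None
    chosen = [h for h, free in hosts if free >= per_host][: needed // per_host]
    if len(chosen) * per_host != needed:
        return None
    return {host: per_host for host in chosen}


hosts = [("a", 4), ("b", 4), ("c", 4)]  # hypothetical hosts, 4 free slots each
print(fill_up(hosts, 6))
print(round_robin(hosts, 6))
print(fixed_per_host(hosts, 6, 2))
```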
control_slaves
This parameter can be set to TRUE or FALSE (the default). It indicates
whether Grid Engine is the creator of the slave tasks of a parallel
application via sge_execd(8) and sge_shepherd(8) and thus has full
control over all processes in a parallel application ("tight
integration"). Tight integration has the following benefits:
o resource limits are enforced for all tasks, even on slave hosts;
o resource consumption is properly accounted on all hosts;
o proper control of tasks, with no need to write a customized
terminate method to ensure that the whole job is finished on qdel
and that tasks are properly reaped in the case of abnormal job
termination;
o all tasks are started with the appropriate nice value which was
configured as priority in the queue configuration;
o propagation of the job environment to slave hosts, e.g. so that
they write into the appropriate per-job temporary directory
specified by TMPDIR, which is created on each host and properly
cleaned up.
To gain control over the slave tasks of a parallel application, a
sophisticated PE interface is required, which works closely together
with Grid Engine facilities, typically interpreting the Grid Engine
hostfile and starting remote tasks with qrsh(1) and its -inherit
option. See, for instance, the $SGE_ROOT/mpi directory and the howto
pages <http://arc.liv.ac.uk/SGE/howto/#Tight%20Integration%20of%20Parallel%20Libraries>.
Please set the control_slaves parameter to FALSE for all other PE
interfaces.
job_is_first_task
The job_is_first_task parameter can be set to TRUE or FALSE. A value of
TRUE indicates that the Grid Engine job script already contains one of
the tasks of the parallel application (and the number of slots reserved
for the job is the number of slots requested with the -pe switch).
FALSE indicates that the job script (and its child processes) is not
part of the parallel program, just being used to kick off the tasks
that do the work; then the number of slots reserved for the job in the
master queue is increased by 1, as indicated by qstat/qhost.
This should be TRUE for the common modern MPI implementations with
tight integration. Consider the case where the allocation rule is
$fill_up and a job is allocated only a single slot on the master host:
one of the MPI processes actually runs in that slot and should be
accounted as such, so the job is the first task.
If wallclock accounting is used (execd_params ACCT_RESERVED_USAGE
and/or SHARETREE_RESERVED_USAGE is TRUE) and control_slaves is set to
FALSE, the job_is_first_task parameter influences the accounting for
the job: a value of TRUE means that accounting for CPU and requested
memory gets multiplied by the number of slots requested with the -pe
switch. FALSE means the accounting information gets multiplied by the
number of slots + 1. Otherwise, the only significant effect of the
parameter is on the display of the job.
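As a worked example of the multiplier described above, the following Python sketch computes the factor applied to the accounting information. It applies only to the case stated in the text (wallclock accounting in use and control_slaves set to FALSE); the function name is hypothetical.

```python
def reserved_usage_multiplier(pe_slots: int, job_is_first_task: bool) -> int:
    """Factor applied to CPU and requested-memory accounting.

    Valid only when wallclock accounting is used and control_slaves is FALSE.
    """
    return pe_slots if job_is_first_task else pe_slots + 1


# A job submitted with "-pe <pe_name> 8":
print(reserved_usage_multiplier(8, True))
print(reserved_usage_multiplier(8, False))
```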
urgency_slots
For pending jobs whose PE request is a slot range with different
minimum and maximum, the number of slots they will actually use is not
yet determined. This setting specifies the method to be used by Grid
Engine to assess the number of slots such jobs might finally get.
The assumed slot allocation is used when determining the
resource-request-based priority contribution for numeric resources, as
described in sge_priority(5), and is displayed when qstat(1) is run
without the -g t option.
The following methods are supported:
int
The specified integer number is directly used as the prospective
slot amount.
min
The slot range minimum is used as the prospective slot amount. If
no lower bound is specified with the range, 1 is assumed.
max
The slot range maximum is used as the prospective slot amount. If
no upper bound is specified with the range, the absolute maximum
possible due to the PE's slots setting is assumed.
avg
The average of all numbers occurring within the job's PE range
request is assumed.
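The four methods can be sketched in Python for a single range request. This is an illustration under assumptions, not the Grid Engine implementation: the request is modelled as one range with optional bounds, open bounds in the avg case are filled in with the same defaults as for min and max, and no rounding of the average is applied. The function name is hypothetical.

```python
def prospective_slots(setting, lo, hi, pe_slots):
    """Estimate the slot amount of a pending job with range lo-hi.

    setting:  "min", "max", "avg", or a string holding an integer.
    lo, hi:   range bounds, or None when the bound is not specified.
    pe_slots: the PE's slots setting, used when hi is missing.
    """
    lo_eff = 1 if lo is None else lo
    hi_eff = pe_slots if hi is None else hi
    if setting == "min":
        return lo_eff
    if setting == "max":
        return hi_eff
    if setting == "avg":
        values = range(lo_eff, hi_eff + 1)
        return sum(values) / len(values)  # unrounded average (assumption)
    return int(setting)  # fixed integer


# A request such as "-pe <pe_name> 4-8" in a PE with slots set to 64:
for method in ("min", "max", "avg", "16"):
    print(method, prospective_slots(method, 4, 8, 64))
```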
accounting_summary
This parameter is only checked if control_slaves (see above) is set to
TRUE and thus Grid Engine is the creator of the slave tasks of a
parallel application via sge_execd(8) and sge_shepherd(8). In this
case, accounting information is available for every single slave task
started by Grid Engine.
The accounting_summary parameter can be set to TRUE or FALSE. A value
of TRUE indicates that only a single accounting record is written to
the accounting(5) file, containing the accounting summary of the whole
job, including all slave tasks, while a value of FALSE indicates an
individual accounting(5) record is written for every slave task, as
well as for the master task.
Note: When running tightly integrated jobs with
SHARETREE_RESERVED_USAGE set, and accounting_summary enabled in the
parallel environment, reserved usage will only be reported by the
master task of the parallel job. No per-parallel task usage records
will be sent from execd to qmaster, which can significantly reduce load
on the qmaster when running large, tightly integrated parallel jobs.
However, this removes the only post-hoc information about which hosts a
job used.
qsort_args library qsort-function [arg1 ...]
Specifies a method for selecting the queues/hosts, and their order, that
should be used to schedule a parallel job. For details, and the API,
consult the header file $SGE_ROOT/include/sge_pqs_api.h. library is
the path to the qsort dynamic library, qsort-function is the name of
the qsort function implemented by the library, and the args are
arguments passed to qsort. Substitutions from the hard requested
resource list for the job are made for any strings of the form
$resource, where resource is the full name of the resource as defined
in the complex(5) list. If resource is not requested in the job, a
null string is substituted.
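The $resource substitution can be pictured with the following Python sketch. It is a hypothetical model of the behaviour described above, not the code used by Grid Engine: the hard resource requests are assumed to be available as a dictionary of strings, and resource names are assumed to consist of letters, digits and underscores.

```python
import re

# Assumed shape of a resource reference: "$" followed by a name.
_RESOURCE_RE = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def substitute_resources(args, hard_requests):
    """Replace $resource in each argument; unrequested resources become ""."""
    def lookup(match):
        return hard_requests.get(match.group(1), "")
    return [_RESOURCE_RE.sub(lookup, arg) for arg in args]


# Hypothetical arguments and a job that requested only h_vmem.
print(substitute_resources(["$h_vmem", "fixed", "$mem_free"], {"h_vmem": "4G"}))
```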
RESTRICTIONS
Note that the functionality of the start and stop procedures remains
the full responsibility of the administrator configuring the parallel
environment. Grid Engine will invoke these procedures and evaluate
their exit status. A non-zero exit status will put the queue into an
error state. If the start procedure has a non-zero exit status, the
job will be re-queued.
SECURITY
If start_proc_args or stop_proc_args is specified with a user@ prefix,
the same considerations apply as for the prolog and epilog, as
described in the SECURITY section of sge_conf(5).
SEE ALSO
sge_intro(1), sge_types(1), qconf(1), qdel(1), qmod(1), qrsh(1),
qsub(1), access_list(5), sge_conf(5), sge_qmaster(8), sge_shepherd(8).
FILES
$SGE_ROOT/include/sge_pqs_api.h
COPYRIGHT
See sge_intro(1) for a full statement of rights and permissions.
SGE 8.1.3pre 2012-09-11 SGE_PE(5)