Grid Engine is a full-function, general-purpose Distributed Resource
Management (DRM) tool. The scheduler component in Grid Engine supports
a wide range of compute farm scenarios. To get the maximum
performance from your compute environment, it can be worthwhile to
review which features are enabled and which are really needed to solve
your load management problem. Enabling or disabling these features
can improve the throughput of your cluster. Each
feature notes in parentheses the version in which it was introduced. Unless
otherwise stated, it is also available in later versions.
- overall cluster tuning
Experience has shown that using NFS or similar shared file systems to
distribute files required by Grid Engine can account for a critical share
of both overall network load and file server load. Keeping such
files local is therefore always at least slightly beneficial for overall
cluster throughput, but it makes monitoring and debugging less convenient,
which may not be a good trade-off in low-throughput cases. The HOWTO
"Reducing and Eliminating NFS usage by Grid Engine"
shows common choices for accomplishing this.
- scheduler monitoring
Scheduler monitoring can help to find out
why certain jobs are not dispatched (displayed via qstat). However, providing this
information for all jobs at all times can be resource-consuming (memory
and CPU time) and is usually not needed. To disable scheduler
monitoring, set schedd_job_info to false in the scheduler
configuration. See sched_conf(5).
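As a rough illustration, the following Python sketch edits a saved copy of the scheduler configuration. It assumes the file was written with "qconf -ssconf >file" and holds one parameter per line, with the name followed by whitespace and the value. The helper name set_sched_params is made up for this example and is not part of Grid Engine.

```python
import re

def set_sched_params(conf_text, changes):
    """Return conf_text with the given parameters set to new values.

    Assumes one "name value" pair per line, as in a file saved with
    "qconf -ssconf". Raises KeyError if a parameter is not present,
    so that a misspelled name does not go unnoticed.
    """
    remaining = dict(changes)
    out = []
    for line in conf_text.splitlines():
        match = re.match(r"^(\S+)(\s+)", line)
        if match and match.group(1) in remaining:
            name, gap = match.group(1), match.group(2)
            # Keep the original spacing, replace only the value.
            out.append(f"{name}{gap}{remaining.pop(name)}")
        else:
            out.append(line)
    if remaining:
        raise KeyError(f"parameters not found: {sorted(remaining)}")
    return "\n".join(out) + "\n"

# Illustrative two-line excerpt, not a complete configuration.
sample = "schedule_interval   0:0:15\nschedd_job_info     true\n"
print(set_sched_params(sample, {"schedd_job_info": "false"}))
```

The edited text can then be written to a file and loaded with "qconf -Msconf <file>".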
- finished jobs
With array jobs, the finished jobs list in qmaster can
become quite big. Switching it off saves memory and speeds up qstat
commands, because qstat also fetches the finished jobs list.
Set finished_jobs to 0 in the global configuration. See sge_conf(5).
- job verification
Forcing validation at job submission time can be a valuable
tool to prevent non-dispatchable jobs from remaining in the pending
state forever. However, validating jobs can be time-consuming,
especially in heterogeneous environments with a variety of different
execution nodes and consumable resources, and where every user has their
own job profile. In homogeneous environments with only a couple of
different jobs, general job validation can usually be omitted. Job verification
is disabled by default and should only be used (qsub(1): -w [v|e|w]) when
needed. [It is enabled by default with DRMAA.]
- load thresholds and suspend thresholds
Load thresholds are needed if you deliberately
oversubscribe your machines and need a mechanism to prevent
excessive system load. Suspend thresholds are also used for this. The
other case in which load thresholds are needed is when the execution
node is open to interactive load that is not under the control of
Grid Engine, and you want to prevent the node from being overloaded. If
a compute farm is more single-purpose, e.g., each CPU of a compute
node is represented by only one queue slot and no interactive load is
expected on these nodes, then load_thresholds can be omitted.
To disable both thresholds, set load_thresholds to none and
suspend_thresholds to none. See queue_conf(5).
load_thresholds are applicable
to consumable resources as well (see queue_conf(5)).
Using this feature will have a negative impact on
scheduler performance.
- load adjustments
Load adjustments are used to virtually increase the
measured load after a job has been dispatched. This mechanism is
helpful on oversubscribed machines in order to align with load
thresholds. Load adjustments should be switched off if they are not
needed, because they impose additional work on the scheduler in
connection with sorting hosts and verifying load thresholds. To disable
load adjustments, set job_load_adjustments to none and load_adjustment_decay_time
to 0 in the scheduler configuration. See sched_conf(5).
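To make the idea of a virtually increased load more concrete, here is a small Python sketch. It assumes that each recently dispatched job adds a fixed correction to the measured load and that this correction decays linearly to zero over the decay time. The linear decay, the function name adjusted_load, and the sample numbers are assumptions for illustration; check sched_conf(5) for the behavior of your version.

```python
def adjusted_load(measured, adjustment, dispatch_ages, decay_time):
    """Measured load plus a decaying correction per recently dispatched job.

    dispatch_ages: seconds since each job was dispatched to the host.
    decay_time: seconds after which a job's correction has vanished;
                0 switches the corrections off entirely.
    """
    if decay_time <= 0:
        return measured
    extra = 0.0
    for age in dispatch_ages:
        # Assumed linear decay from full correction to none.
        remaining = max(0.0, 1.0 - age / decay_time)
        extra += adjustment * remaining
    return measured + extra

# Two jobs dispatched 0 s and 225 s ago, correction 0.5, decay time 450 s.
print(adjusted_load(1.0, 0.5, [0, 225], 450))
# With a decay time of 0 the measured load is used unchanged.
print(adjusted_load(1.0, 0.5, [0, 225], 0))
```

The per-job loop hints at why the scheduler has less to do when the adjustments are switched off.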
- scheduling-on-demand
The default for Grid Engine is to start scheduling runs in
a fixed scheduling interval (see schedule_interval in sched_conf(5)).
The good thing with fixed intervals is that they limit the CPU time
consumption of the qmaster/scheduler. The bad thing is that they
throttle the scheduler artificially, resulting in a limited throughput.
In many compute farms there are machines specifically dedicated to
qmaster/scheduler and in such setups there is no reason for throttling
the scheduler. Scheduling-on-demand can be
configured using the FLUSH_SUBMIT_SEC and FLUSH_FINISH_SEC
settings in sched_conf(5).
If it is activated, the throughput of a compute farm is
limited only by the power of the machine hosting qmaster/scheduler.
How many seconds to use for the flush times is
difficult to say. It depends on the time the scheduler needs for a
single run and on the number of jobs in the system. A couple of test runs
with scheduler profiling (add profile=1
to the params in sched_conf(5)) should provide
enough data to select a good value.
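The interplay between the fixed interval and a flush time can be sketched in Python as follows. The sketch assumes that a flush time of N seconds triggers a scheduling run N seconds after the first submit event since the last run, and that a value of 0 leaves only the fixed interval in effect. The function name next_run_time and its parameters are hypothetical.

```python
def next_run_time(last_run, interval, first_submit=None, flush_submit_sec=0):
    """Time (in seconds) at which the next scheduling run starts.

    last_run: time of the previous scheduling run.
    interval: fixed scheduling interval.
    first_submit: time of the first job submitted since last_run, or None.
    flush_submit_sec: assumed flush time; 0 means submit events
                      do not trigger a run.
    """
    regular = last_run + interval
    if first_submit is None or flush_submit_sec <= 0:
        return regular
    return min(regular, first_submit + flush_submit_sec)

# Fixed interval only: a job submitted at t=102 waits until t=115.
print(next_run_time(100, 15, first_submit=102))
# With a flush time of 1 second the run starts at t=103.
print(next_run_time(100, 15, first_submit=102, flush_submit_sec=1))
```

Shorter flush times reduce the waiting time of new jobs but cause more scheduling runs, which is why the duration of a single run matters when choosing the value.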
After every scheduling interval, the scheduler sends the calculated
priority information (tickets, job priority, urgency) to
the qmaster.
This information is used to order the job output of "qstat -ext",
"-urg", and "-pri". The transfer of this
information can be turned off by setting report_pjob_tickets to
false in sched_conf(5).
The scheduler contains different policy modules (see sge_priority(5))
to compute the importance of a job:
- ticket policy
- urgency policy
- POSIX priority policy
- deadline policy
- waiting time policy
All policies are turned on by
default. If some of them are not used, it is preferable to turn
those policies off by setting their
weighting factors to 0 in sched_conf(5).
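Why a weighting factor of 0 saves work can be shown with a simplified Python sketch. It assumes that the importance of a job is a weighted sum of per-policy contributions, so that a policy with weight 0 never needs to be evaluated. The dictionary keys and function names are illustrative and are not sched_conf(5) parameter names; the actual formula is described in sge_priority(5).

```python
def active_policies(weights):
    """Names of the policies whose weighting factor is non-zero."""
    return [name for name, weight in weights.items() if weight != 0]

def job_importance(contributions, weights):
    """Weighted sum over the active policies only."""
    return sum(weights[name] * contributions[name]
               for name in active_policies(weights))

# Illustrative values: the ticket policy is switched off.
weights = {"ticket": 0.0, "urgency": 0.5, "priority": 1.0}
contributions = {"ticket": 0.8, "urgency": 0.5, "priority": 0.25}

print(active_policies(weights))
print(job_importance(contributions, weights))
```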
Resource reservation prevents the
starvation of jobs with large resource requests. The scheduler
configuration allows one to enable or disable this feature as well as
to limit the number of jobs that will get a reservation. Turning off this
feature by setting max_reservation to 0 in sched_conf(5)
has a positive impact on the scheduler run time.
If resource reservation is needed, the number of jobs that
get a reservation from the scheduler should be as small as possible.
This is done by setting max_reservation to a small number in sched_conf(5).
In clusters with large numbers of jobs, a limiting factor is often the memory
footprint required to store all job properties. Experience shows that a large part of
the memory occupied by the qmaster is used to store
each job's environment as specified via "-v variable_list" or "-V".
End users sometimes find it convenient to simply use "-V", even
though it would have been entirely sufficient to inherit a handful of specific
environment variables from the submission environment. Conscious and sparing
use of job environment variables has been shown to greatly increase the maximum
number of jobs that Grid Engine can process with a given amount of main memory.
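A submission wrapper can make the sparing use of "-v" easy for end users. The following Python sketch builds a qsub argument list that passes only a named handful of variables instead of the whole environment. The function name qsub_command, the script name, and the variable names are hypothetical, and the sketch does not handle values that contain commas.

```python
import os

def qsub_command(script, wanted_vars, environ=None):
    """Build a qsub argument list that passes only the named variables.

    Variables that are not set in the submission environment are skipped.
    Values containing commas are not handled by this sketch.
    """
    environ = os.environ if environ is None else environ
    pairs = [f"{name}={environ[name]}" for name in wanted_vars if name in environ]
    cmd = ["qsub"]
    if pairs:
        cmd += ["-v", ",".join(pairs)]
    cmd.append(script)
    return cmd

# Illustrative environment; only LICENSE_FILE reaches the job.
env = {"LICENSE_FILE": "/opt/lic.dat", "HOME": "/home/user", "PS1": "$ "}
print(qsub_command("job.sh", ["LICENSE_FILE", "TMP_ROOT"], env))
```

The resulting list can be handed to a process launcher unchanged, so no shell quoting is needed.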
By default, Grid Engine qsub job submission sends the job script together
with the job itself. The -b y option can be used to prevent
the job script from being sent; only the path to the executable is sent along
with the job. This technique requires that the script be made available elsewhere,
but in many cases the script is already available or could easily be made available
by means of shared file systems. Use of -b y has a beneficial impact on cluster
scalability because job scripts do not need to be stored on disk by the qmaster at
submission time or be packed with the job when it is delivered to the execd.
- job filter based on job classes
The job filter can be enabled by adding JC_FILTER=1 to the
params field in sched_conf(5).
This feature is deprecated and, if enabled, can lead to some
minor problems in the system.
If enabled, the scheduler limits the number of jobs it looks at during
a scheduling run. At the beginning of the scheduling run, it assigns
each job a specific category based on the job's requests, priority
settings, and the job owner. All scheduling policies will assign the
same importance to each job in a category. Therefore, the jobs in a
category are in FIFO order, and their number can be limited to the number of
free slots in the system.
An exception is jobs that request a resource reservation. They are
included regardless of the number of jobs in a category.
This setting is turned off by default because, in very rare cases, the
scheduler can make a wrong decision. It is also advised to turn
report_pjob_tickets off when this feature is used. Otherwise, "qstat
-ext" can report outdated
ticket amounts. The information shown
by "qstat -j" for a job that was
excluded from a scheduling run is
very limited.
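The filtering described above can be sketched in Python. The sketch assumes that pending jobs arrive in submission order, that a category is an opaque key, and that jobs requesting a reservation do not count against their category's limit. The last point is an assumption, and the function and field names are made up; this is not the scheduler's actual implementation.

```python
from collections import defaultdict

def filter_jobs(pending, free_slots):
    """Select the jobs a scheduling run would look at.

    pending: list of dicts with the keys "id", "category" and "reserve",
             in submission (FIFO) order.
    free_slots: number of free slots in the system.
    """
    taken = defaultdict(int)  # jobs already selected per category
    selected = []
    for job in pending:
        if job["reserve"]:
            # Jobs requesting a reservation are always included.
            selected.append(job["id"])
        elif taken[job["category"]] < free_slots:
            taken[job["category"]] += 1
            selected.append(job["id"])
    return selected

pending = [
    {"id": 1, "category": "A", "reserve": False},
    {"id": 2, "category": "A", "reserve": False},
    {"id": 3, "category": "A", "reserve": False},
    {"id": 4, "category": "A", "reserve": True},
    {"id": 5, "category": "B", "reserve": False},
]
print(filter_jobs(pending, free_slots=2))
```

In this example job 3 is excluded from the run, which is why "qstat -j" has little to report about it.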
Scheduler profiles, such as those used during Grid Engine installation,
can be stored using "qconf -ssconf >file". The
profiles are not
stored internally. By combining a cron job with dynamic changes to the
scheduler configuration, i.e. loading a new profile with "qconf
-Msconf <file>", one can switch to a leaner
configuration overnight and
return to a user-friendly configuration during the day.
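A script started by cron could pick the profile by time of day, as in the following Python sketch. The profile file names, the day boundaries, and the function name profile_command are hypothetical. The sketch only prints the command; running it is left to the caller.

```python
import datetime

def profile_command(hour, day_profile, night_profile, day_start=7, day_end=19):
    """Return the qconf call that loads the profile for the given hour.

    Hours from day_start up to, but not including, day_end count as day.
    """
    is_day = day_start <= hour < day_end
    path = day_profile if is_day else night_profile
    return ["qconf", "-Msconf", path]

# Hypothetical files previously saved with "qconf -ssconf >file".
now = datetime.datetime.now()
print(profile_command(now.hour, "sched_day.conf", "sched_night.conf"))
```

Loading the same profile repeatedly is harmless, so the script can simply be scheduled at both switch-over times.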