Default (In)security

A default Grid Engine installation (without CSP or MUNGE) is highly insecure, and demands trusting all users who have access to it. For instance, any exec host can be owned using something like:[1]

$ fakeroot qrsh id -u
0
$

That can be defeated for root by setting min_uid in sge_conf(5). However, any user with access to an admin host (e.g. with qmaster running on the login node) can just run

fakeroot qconf -mconf

to change it. In that case, if they can write executables to a file system visible from the qmaster, they can also own the qmaster, say by configuring mailer or jsv_url.

Thus, to protect exec hosts and the qmaster with default “security”, it is at least necessary to set min_uid/min_gid and run the qmaster on a separate host to which non-admin users have no access.[2] (If that can’t be done, it might be possible to restrict access to the qmaster socket, perhaps with the iptables owner module.)
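As a rough illustration of checking the uid and gid lower bounds (spelled min_uid and min_gid in sge_conf(5)), the sketch below reads configuration text in the name/value form printed by qconf -sconf and reports whether both bounds meet a chosen floor. The function name check_min_ids and the floor of 100 are assumptions made for the example, not part of SGE.

```shell
#!/bin/sh
# Hypothetical helper: read "qconf -sconf"-style output on stdin and
# succeed only if both min_uid and min_gid are present and >= the floor.
# Usage: check_min_ids FLOOR
check_min_ids() {
    awk -v floor="$1" '
        $1 == "min_uid" { uid = $2; seen_uid = 1 }
        $1 == "min_gid" { gid = $2; seen_gid = 1 }
        END {
            ok = seen_uid && seen_gid && uid + 0 >= floor + 0 && gid + 0 >= floor + 0
            exit !ok
        }
    '
}

# Possible use on a host with SGE installed (not run here):
#   qconf -sconf | check_min_ids 100 || echo "min_uid/min_gid too low" >&2
```

A check like this only confirms the setting; it does nothing about users who can reach an admin host and change the configuration.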

With admin host restrictions and uid limits in place, it is still possible to submit jobs as any allowed user with qsub and an LD_PRELOAD trick, as with fakeroot or otherwise — with a simple DRMAA client, for instance.

In most environments you will want to use either CSP or MUNGE, as below.

CSP Security

CSP security is the original method. It prevents job submission as another user, and configuration changes by anyone who is not an admin or operator user. It thus allows admin hosts to double as submission hosts more safely, if required. It also secures the daemon communication channels, but that will usually be a secondary consideration.[3]

There are some limitations of using CSP:

  • Users must be explicitly added (no enforce_user auto) and have keys generated for them (util/sgeCA/sge_ca -usercert);

  • The keys must be distributed to the relevant hosts, though you can have selective authorization of users by submission host according to which keys are distributed where;

  • Keys must be renewed (using util/sgeCA/renew_all_certs.sh) after the set expiry time (a year by default) and redistributed;

  • Currently (SGE 8.1.2) only a single security method can be configured, so using CSP excludes AFS support, for instance (see below).

It is not necessary to re-install to turn on CSP, just:

  • Stop all the SGE daemons;

  • Set security_mode to csp in the common/bootstrap file;

  • Generate certificates with util/sgeCA/sge_ca -init and util/sgeCA/sge_ca -usercert;

  • Distribute the certificates as appropriate;

  • Restart the daemons.
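The bootstrap edit in the steps above can be sketched as follows. This is a minimal illustration, assuming the bootstrap file holds one whitespace-separated name/value pair per line and lives at $SGE_ROOT/$SGE_CELL/common/bootstrap; the helper name set_security_mode is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: set the security_mode value in a bootstrap file,
# leaving every other line untouched.
# Usage: set_security_mode FILE MODE
set_security_mode() {
    file=$1
    mode=$2
    tmp=$(mktemp) || return 1
    # Rewrite only the security_mode line; write back with cat so the
    # original file keeps its ownership and permissions.
    awk -v mode="$mode" '
        $1 == "security_mode" { printf "%-24s%s\n", $1, mode; next }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Outline of the whole sequence, to be run by hand on a real cluster:
#   1. stop all the SGE daemons
#   2. set_security_mode "$SGE_ROOT/$SGE_CELL/common/bootstrap" csp
#   3. util/sgeCA/sge_ca -init, then util/sgeCA/sge_ca -usercert
#      (arguments as described in sge_ca(8))
#   4. distribute the certificates, then restart the daemons
```

It is worth keeping a copy of the original bootstrap file, so that the change can be reverted if the daemons fail to start.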

Note that CSP isn’t really a public key system as used for https, but is basically relying on shared secrets distributed within a single administrative domain. The security of a user’s key is dependent on the security of all hosts to which it is distributed — a privileged user on a submit host can impersonate any user whose keys are on that host. Typically keys will only be distributed en masse to submit hosts which are secure login nodes. Users can copy their own keys to another submit host, such as a personal workstation (with sge_ca -copy, assuming the home directory is secure).

See sge_ca(8) for more usage information. The real security of CSP is unclear; there is no known audit of it.

MUNGE Authentication

Authentication with MUNGE was introduced in SGE 8.1.9, and may be most convenient in an HPC cluster. However, it has not been well tested at the time of writing. It is probably more convenient than CSP since it only requires a secret shared by daemons running on each host. It also allows operation with enforce_user=auto. However, it provides authentication, not encryption of the communication channels, and is probably only appropriate in a tightly-coupled security domain like an HPC cluster.

To use it, SGE must be built against the MUNGE library, e.g. the GNU/Linux packaged versions. MUNGE must then be set up on all the SGE hosts (see the installation guide), i.e. with the daemon running against the shared key for the cluster. The SGE daemons can then be started with munge configured as the security_mode.
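Before starting the SGE daemons it may be worth confirming that MUNGE itself works and that the bootstrap file names it. The sketch below is one way to do that, assuming the munge and unmunge commands from the MUNGE distribution are on the PATH; the function names and the bootstrap location are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the security_mode value from a bootstrap file.
# Usage: get_security_mode FILE
get_security_mode() {
    awk '$1 == "security_mode" { print $2 }' "$1"
}

# Hypothetical helper: encode a credential locally and decode it again.
# Fails if munge/unmunge are missing or the local munged is not usable.
munge_roundtrip_ok() {
    command -v munge >/dev/null 2>&1 || return 1
    command -v unmunge >/dev/null 2>&1 || return 1
    munge -n | unmunge >/dev/null 2>&1
}

# Possible use on a configured host (not run here):
#   munge_roundtrip_ok || echo "MUNGE is not working on this host" >&2
#   [ "$(get_security_mode "$SGE_ROOT/$SGE_CELL/common/bootstrap")" = munge ] \
#       || echo "security_mode is not munge" >&2
```

A local round trip only shows that the local daemon is running. Decoding on another host, for example by piping the output of munge -n to unmunge over ssh, checks that both hosts share the same key.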

Other Methods

The only other security method which currently (SGE 8.1.9) works properly is the afs one. However, as implemented, it doesn’t provide authentication of users submitting jobs. Without that it is possible to submit a job as another user (as above) and steal credentials from another job of that user running on the same host, so that it could actually facilitate security breaches. It may be possible to use AUKS with CSP as an alternative, but there’s currently no setup recipe published.

The GSSAPI (kerberos/dce) method would work for authenticating job submission and passing (but not renewing) Kerberos tickets, but the mechanism for calling the sub-programs involved is partially broken and needs re-implementing.

Some largely historical information on security is available. The security framework should be re-done with hooks to allow arbitrary, composable methods.

External Program Hooks

SGE no longer runs external remote startup programs (see remote_startup(5)) in the user-defined environment, and so is not vulnerable to that part of CVE-2012-0208 (see the SGE source). Other published responses to the CVE pass the environment with some sanitization, but fail to remove all the known sensitive variables.

Although the environment passed to other external methods, such as the prolog, is sanitized when they are invoked with privileges (user@…), the sanitization may not be foolproof. See the SECURITY section of sge_conf(5).
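As a defence in depth, a privileged prolog or similar script could discard whatever environment it inherits before running anything else. The following is a minimal sketch of that idea and not something SGE provides; the function name run_clean and the fixed PATH are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: run a command with an empty environment apart from
# a fixed PATH, so that variables set by the job owner (LD_PRELOAD,
# LD_LIBRARY_PATH and the like) never reach it.
# Usage: run_clean COMMAND [ARG...]
run_clean() {
    env -i PATH=/usr/bin:/bin "$@"
}

# Possible use inside a prolog script (the path is made up):
#   run_clean /opt/site/prolog-real.sh
```

A real script that needs job information would have to pass an explicit allow-list of variables on the env command line instead of inheriting everything.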

Obviously these concerns are moot without restrictions imposed by the min_uid limit or user authentication, as above.

 

Copyright © 2012, 2013, 2016 Dave Love, University of Liverpool

Licence GFDL.


1. It may fail on systems which use ssh for remote startup with passwordless private keys, but a batch job will still work, and can steal the key.
2. Obviously even in the absence of separate hardware, a virtual machine can provide a separate host for a cluster head node. Having a separate head with only admin access also helps with admin tools like PowerMan which lack authentication.
3. A single certificate covers the qmaster and execd daemons, and so is potentially vulnerable to compromise of an execution host in case man-in-the-middle-type threats are a concern.