Grid Engine configuration backup

Author: Dave Love, Liverpool University, 2013–02

Introduction

It is advisable to make backups of the configuration as it changes, for the usual reasons of disaster recovery and recording/undoing configuration changes that might, for instance, prove ill-advised. Two methods of backup for different purposes are supplied with the installation, described below. Note that neither of them backs up the job spool as supplied.

The configuration is stored in the qmaster spool area (see ‘spooling_params’ in bootstrap(5)) which is likely subject to normal system backups. In the case of ‘classic’ spooling (into simple text files), normal system backups may be enough, but the methods below may still be useful, especially for more frequent backups and easy checking of changes. Also there are extra considerations if Berkeley DB spooling is used, when it is not safe simply to copy the DB files in a normal backup, and it may not be safe to back up at all in some circumstances.

Using ‘upgrade modules’

The scripts in $SGE_ROOT/util/upgrade_modules, intended for distribution updates and changing the spool type etc. (inst_sge -upd), are also useful for backup/restore and recording the configuration. They use qconf(1) to save and restore the configuration. That requires a running qmaster and operator/manager privileges, typically as user sgeadmin. After setting $SGE_ROOT and $SGE_CELL as usual, and with $SGE_ROOT as the current directory, use

    util/upgrade_modules/save_sge_config.sh directory

to back up the configuration into the fresh directory named as the argument. Note that some data are not present in the configuration saved this way, particularly jobs and the usage information used for fair share purposes. The inst_sge method below preserves more.

Restoring the configuration (which also requires a running qmaster) is done using util/upgrade_modules/load_sge_config.sh. Use the -help option for more information.

Note that the CSP configuration isn't backed up, but a new configuration can be re-created on restoring with inst_sge -upd.

Revision control

Backup directories made by running save_sge_config.sh under cron(1) may conveniently be kept under revision control to provide a record of changes, not necessarily as an actual backup. There is a version of the technique using git ("scripts/GridEngine-git-config") in the flex-grid repository, based on an earlier Subversion one. A simple darcs-based script is also available. As in these examples, it is usual to delete the accounting file, and possibly arseqnum and jobseqnum (recording the advance reservation and job sequence numbers), and also to expunge the execution host load_values. These can be expected to change for each backup, contributing noise to the history. Similarly, it might be useful to expunge the delete_time parameter for any auto-created users in the users sub-directory of the backup. However, backups modified like that may no longer re-load directly.
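
As a rough illustration of the pruning step, the following sketch removes the files that change on every backup from a saved directory before it is committed. The function name prune_noise is hypothetical, and it assumes the files carry exactly the names accounting, arseqnum and jobseqnum somewhere under the backup directory. It does not attempt to expunge load_values or delete_time, whose format in the saved files should be checked first.

```shell
#!/bin/sh
# Sketch only: prune_noise is a hypothetical helper, not part of Grid Engine.

prune_noise() {
    # $1: backup directory written by save_sge_config.sh
    # Assumption: the noisy files are named exactly as below, at any depth.
    find "$1" -type f \
        \( -name accounting -o -name arseqnum -o -name jobseqnum \) \
        -exec rm -f {} +
}
```

It would be called on the backup directory after save_sge_config.sh has finished and before the revision control commit. Keep an unpruned copy if the backup may also need to be re-loaded.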

Using inst_sge

Backups using inst_sge work by copying (part of) the spool area, rather than using qconf, preserving more information than the upgrade script method, such as share tree information. With Berkeley DB spooling, some of the copying involves the db_dump and db_load utilities, which must be available in the $SGE_ROOT/utilbin/$ARCH/ directory. If they were not installed, copy or symbolically link your operating system's versions there. Some systems support more than one BDB version, and may call the utilities something like db5.1_dump, in which case it may be necessary to select one corresponding to the version of the BDB library used by gridengine. For a packaged version, consult the package dependencies to determine that. E.g.

    ln -s /usr/bin/db5.1_dump utilbin/lx-amd64/db_dump
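
Both db_dump and db_load are needed, so the linking might be wrapped up along the following lines. This is a sketch under assumptions: link_bdb_tools is a hypothetical helper, and the name db5.1_load is inferred by analogy with db5.1_dump, so check what your system actually installs.

```shell
#!/bin/sh
# Sketch only: link_bdb_tools is a hypothetical helper.

link_bdb_tools() {
    # $1: target directory, e.g. $SGE_ROOT/utilbin/lx-amd64
    # $2: path prefix of the system utilities, e.g. /usr/bin/db5.1_
    for tool in dump load; do
        src="${2}${tool}"
        dst="$1/db_${tool}"
        if [ -e "$dst" ] || [ -L "$dst" ]; then
            continue    # already installed or linked; leave it alone
        fi
        if [ -x "$src" ]; then
            ln -s "$src" "$dst"
        else
            echo "missing or not executable: $src" >&2
            return 1
        fi
    done
}
```

For the example above, the call would be something like link_bdb_tools utilbin/lx-amd64 /usr/bin/db5.1_ with $SGE_ROOT as the current directory.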

Caution: This method is not safe if Berkeley DB spooling is in use, the qmaster is running, and the qmaster opens the spool database in ‘private’ mode to allow spooling to a network filesystem. (See the BDB documentation.) Some gridengine implementations do this unconditionally; others may provide it as an option, allowing a choice between spooling to a network filesystem and using this backup method on a live system (with a local spool directory). The database has been opened in private mode if the qmaster is running and there are no files of the form __db.nnn in the directory specified by spooling_params in bootstrap(5). Where private mode is optional (as in the Son of Grid Engine development version at the time of writing), it is selected by appending the private option to spooling_params in the bootstrap file. In that case, inst_sge should refuse to do the backup if the qmaster is running.
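
A wrapper script could check for the __db.nnn files before attempting a live backup. The following is a minimal sketch, assuming the BDB spool directory is passed in by hand rather than parsed from the bootstrap file; bdb_env_files_present is a hypothetical name. The result only means anything while the qmaster is running.

```shell
#!/bin/sh
# Sketch only: bdb_env_files_present is a hypothetical helper.

bdb_env_files_present() {
    # $1: BDB spool directory named by spooling_params
    # Succeeds if at least one __db.* environment file exists there.
    for f in "$1"/__db.*; do
        [ -e "$f" ] && return 0
    done
    return 1
}
```

If the function fails while the qmaster is running, the database has presumably been opened in private mode, and the backup should not be taken from the live system.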

To back up manually using this method, cd to $SGE_ROOT, run

    ./inst_sge -bup

and answer the questions. This can be done with or without the qmaster running, including automatically on a live system (e.g. under cron), by supplying a configuration file:

    ./inst_sge -bup -auto conf-file

An example configuration is installed as $SGE_ROOT/util/install_modules/backup_template.conf. In auto mode, a timestamp is appended to the backup directory name, so you may want to reap old directories with something like tmpwatch(1). The -auto method may fail with some Bourne shell implementations; if so, try running it under bash (assuming you have it installed) as

    bash inst_sge ...
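
Where tmpwatch(1) is not available, old timestamped directories could be reaped with find(1), roughly as follows. This is a sketch, not a tested tool: reap_old_backups is a hypothetical helper, and the parent directory and name prefix are assumptions that depend on the backup directory set in your configuration file. It uses rm -rf, so check the arguments carefully.

```shell
#!/bin/sh
# Sketch only: reap_old_backups is a hypothetical helper.

reap_old_backups() {
    # $1: parent directory holding the backups
    # $2: backup directory name without the appended timestamp
    # $3: remove directories last modified more than this many days ago
    for d in "$1/$2"*; do
        [ -d "$d" ] || continue
        # -prune stops find descending, so only the directory's own
        # modification time is tested.
        find "$d" -prune -mtime +"$3" -exec rm -rf {} \;
    done
}
```

A cron job might run ./inst_sge -bup -auto conf-file and then call the function with, say, 30 as the last argument.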

See also documentation from Oracle.

To restore the configuration (which must be done with the qmaster shut down) use

    ./inst_sge -rst directory

Like the upgrade script, this method doesn't back up the job spool (BDB sge_jobs database or classic spool/qmaster/{jobs,job_scripts}), but it might usefully be extended to back it up if throughput is low compared with the backup frequency, and if potentially re-running jobs isn't problematic. For classic spooling, see the case in the DoBackup function, and for Berkeley DB spooling see the use of db_dump/db_load in the SwitchArchBup and SwitchArchRst functions (all in $SGE_ROOT/util/install_modules/inst_common.sh).

Any CSP configuration is also not included in the backup, because of the risk of exposing private keys if the backup is stored insecurely. If you do want to make a copy, the configuration resides in $SGE_ROOT/$SGE_CELL/common/sgeCA and, usually, /var/lib/sgeCA (where private keys live).