3 Batch Jobs

The RCE provides access to batch nodes, a cluster of many computers. The batch nodes are good for jobs will run for a long time, and for groups of very similar jobs (e.g., simulations where a number of parameters are varied).

Running jobs on the batch nodes is somewhat more complicated than running interactive jobs on the RCE. The main access points are two command line programs, condor_submit_util and condor_submit. In this tutorial we focus on writing simple submit files and submitting them with condor_submit. For more details on automatically generating and submitting using condor_submit_util refer to the main RCE batch job documentation.

3.1 Preparing a batch submission

In practical terms, running in ‘batch’ means that you will not be able to interact with the running process. This means that all the information your program needs to successfully complete needs to be specified ahead of time. You can pass arguments to your process so that each job gets different inputs, but the script must process these arguments and do the right thing without further instruction.

When you submit a job to the batch processing system each process will generate output and (perhaps) errors. It is usually a good idea to make a sub-folder to store these results. Thus your project folder should contain at least the following:

script or program to run
submit file
output directory

When preparing your job for batch submission you usually need to figure out how to split up the computation, (with one piece going to each process), and how to tell each process which piece it is responsible for. The examples below illustrate how to do this.

3.2 Submit file overview

In order to run jobs in parallel on the batch nodes you need to create a submit file that describes the process to be run on each node. If creating these files by hand you may use any text editor (e.g., gedit, accessible though the Applications --> Accessories menu on the RCE).

The submit file template below includes all required elements. (Note that this file is a template only – see the next section for working examples.)

# Universe whould always be 'vanilla'. This line MUST be
#included in your submit file, exactly as shown below.
Universe = vanilla

# The following arguments are _optional_. If included
# they are used to specify the requirements for the
# submission.
request_cpus = 1
request_disk = 4GB
request_memory = 4GB

# Enter the path to the program you wish to run.
# The default runs the R program. To run another
# program just change '/user/local/bin/R' to the
# path to the program you want to run. For example,
# to run Stata set Executable to '/usr/local/bin/stata'.
Executable = /usr/local/bin/R

# Specify any arguments you want to pass to the executable.
Arguments = --no-save --no-restore --slave

# Specify the relative path to the input file (if any). If you
# are using R this should be your R script. If you are using
# Stata this should be your do file.
input = example.R

# Specify where to output any results printed by your program.
output = output/out.$(Process)
# Specify where to save any errors returned by your program.
error = output/error.$(Process)
# Specify where to save the log file.
Log = output/log
# Enter the number of processes to request. This should
# always be the last part of your submit file.
Queue 10

This submit file instructs the scheduler to request 10 nodes (Queue 10), start R on each one (Executable = /usr/local/bin/R), run the code in example.R (input = example.R), write the output to files named out.0 – out.9 in the output folder (output = output/out.$(Process)), write any errors to files named out.0 – out.9 in the output folder (error = output/error.$(Process)), and write a log file in the output folder (Log = output/log). Each of the 10 requested nodes must be able to provide at least one cpu (request_cpus = 1), four Gb of disk space (request_disk = 4GB) and four Gb of memory (request_memory = 4GB).

The elements included in the submit file template above should be suffucient for most jobs. You can download this submit file template and modify it to suit your needs. For a complete description of the Condor submit file syntax, including less commonly used elements not described here refer to the official documentation.

3.3 Monitoring and managing

After submitting the jobs we may wish to monitor them, e.g. to check if they are running. You can do this by running condor_q <your_user_name> in a terminal. If this returns nothing then you have no jobs in the queue. Otherwise you will see information for each request in the queue which will look something like this:

-- Schedd: HMDC.batch@rce6-5.hmdc.harvard.edu : <10.0.0.10:9619?sock=7858_e19e_247>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 200.0   izahn           4/27 11:45   0+00:00:04 R  0   0.0  R --no-save --no-r
 200.1   izahn           4/27 11:45   0+00:00:04 R  0   0.0  R --no-save --no-r
 200.2   izahn           4/27 11:45   0+00:00:04 R  0   0.0  R --no-save --no-r
 200.3   izahn           4/27 11:45   0+00:00:04 R  0   0.0  R --no-save --no-r
 200.4   izahn           4/27 11:45   0+00:00:04 R  0   0.0  R --no-save --no-r
 200.5   izahn           4/27 11:45   0+00:00:04 R  0   0.0  R --no-save --no-r
 200.6   izahn           4/27 11:45   0+00:00:04 R  0   0.0  R --no-save --no-r
 200.7   izahn           4/27 11:45   0+00:00:04 R  0   0.0  R --no-save --no-r
 200.8   izahn           4/27 11:45   0+00:00:04 R  0   0.0  R --no-save --no-r
 200.9   izahn           4/27 11:45   0+00:00:04 R  0   0.0  R --no-save --no-r
 200.10  izahn           4/27 11:45   0+00:00:04 R  0   0.0  R --no-save --no-r
 200.11  izahn           4/27 11:45   0+00:00:04 R  0   0.0  R --no-save --no-r
 200.12  izahn           4/27 11:45   0+00:00:04 R  0   0.0  R --no-save --no-r

Perhaps the most important information returned by condor_q is the program status (the ST column). Status I means your job is in the queue but has not yet started running, R means the job is currently running, and H means the job is on hold. If you job is on hold you can get more information about what the problem might be by running condor_q -hold.

You will know your job is finished when it is no longer listed in the condor_q output. When it finishes you can examine the output and/or error files to see if the program exited successfully.

If you would like to remove a batch job from the queue you may do so using condor_rm. For example condor_rm 200 will remove the jobs listed above.

For more details on monitoring and manageing your batch jobs please refer to http://projects.iq.harvard.edu/rce/book/checking-your-process-status