The Portable Batch System (PBS) and Load Sharing Facility (LSF) are popular job schedulers for batch environments. Their goal is to distribute computing jobs among the available computing resources. The figure on the right shows the current load of the Odyssey cluster at Harvard Research Computing; it demonstrates how computing jobs submitted by hundreds of researchers can be organized. I have some experience with both systems: the new Cray machine Lindgren at PDC and all clusters in Lunarc use PBS, while Odyssey at Harvard Research Computing uses LSF. This post summarizes the most useful commands on these systems, together with some sample submission scripts.
PBS: I will describe PBS first. Similar write-ups can be found here for Lindgren, while Lunarc provides both a quick start guide and a more detailed reference. Of course, PBSworks has a detailed manual, but we will keep things simple here. The three most important commands for PBS are `qsub`, `qdel`, and `qstat`. The first and second ones allow you to submit and delete jobs; the last one lets you check your job status.
Suppose you have ssh-ed into a system using PBS, say Lindgren in our case. Typing `qstat` lists all the submitted and running jobs:
```
user@lindgren:~$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
25647.nid01532            ihd,K=32         ckc             01:24:03 R gpu
25642.nid01532            ihd,K=64         ckc             01:24:06 R gpu
25637.nid01532            ihd,K=128        ckc                    0 Q gpu
26145.nid01532            ihd,K=256        ckc                    0 Q gpu
26145.nid01532            ihd,K=512        ckc                    0 Q gpu
26145.nid01532            ihd,K=1024       ckc                    0 Q gpu
26341.nid01532            gpen1            user            00:10:07 R batch
26343.nid01532            gpen2            user                   0 Q batch
...
27627.nid01532            sila-mri-ideal   mep             00:20:02 R batch
27628.nid01532            sila-mri         mep             00:20:00 R batch
```
The first column is the job id; you will need it in order to delete a job, e.g., `qdel 25647`. The second column is the name you can specify for a job. The third column is the user name. The fourth column, if the job has already started, lists the time used by the job. The fifth "S" column shows the status: "Q" stands for queuing, "R" for running, and "C" for completed. Finally, the last column is the name of the queue. On Lindgren there is only one queue, "batch", but on other systems such as Platon in Lunarc, you can submit to the "gpu" queue in order to run on GPUs.
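When the list is long, the status column can be summarized with a short pipeline. Here is a minimal sketch that counts jobs per status; the sample lines below are stand-ins for real `qstat` output, inlined with `printf` so the snippet is self-contained:

```shell
# Count jobs per status ("S" is the fifth whitespace-separated field).
printf '%s\n' \
  '25647.nid01532 ihd,K=32  ckc  01:24:03 R gpu' \
  '26145.nid01532 ihd,K=256 ckc         0 Q gpu' \
  '26341.nid01532 gpen1     user 00:10:07 R batch' \
  | awk '{count[$5]++} END {for (s in count) print s, count[s]}' | sort
# prints:
# Q 1
# R 2
```

On a real system you would replace the `printf` with `qstat | tail -n +3` to skip the two header lines before counting.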
Suppose you want to submit a new job; you will need to prepare a submission script. A submission script for PBS is a simple bash script:
```
#!/bin/sh
#PBS -N slimdisk
#PBS -l mppwidth=16
#PBS -l walltime=24:00:00
#PBS -e err.pbs
#PBS -o out.pbs

module add hdf5
cd $PBS_O_WORKDIR
aprun -n 16 ./zeusmp.x > out 2> err
```
The first line `#!/bin/sh` is not necessary. The lines starting with `#PBS` are options that will be passed to `qsub`. For example, the line `#PBS -N slimdisk:Mdot=10` is equivalent to using `qsub -N slimdisk:Mdot=10 ...`. Because of this, there cannot be any space in the arguments. The meaning of these options can be found in the manual `man pbs_job_attributes`. The important ones are:
-N: the name assigned to the job by the qsub or qalter command. Format: a string of up to 15 characters, the first of which must be alphabetic; the default value is the base name of the job script, or STDIN if the script is read from standard input.
-l: resource list, a set of `name=value` strings. The most commonly used names are:

mppwidth = number of processing elements

mppdepth = number of threads per processor

mppnppn = processing elements per node

walltime = maximum amount of real time, in the format hh:mm:ss
In the above example, we need 16 MPI cores, so we simply use `#PBS -l mppwidth=16`. The wall time is set to one day with `#PBS -l walltime=24:00:00`.
-e, -o: the error and output paths, which will contain the job's standard error and output streams, respectively. One can also use `-j` to join the error and output together.
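As an illustration of `-j`, the header of the script above could be rewritten so that both streams end up in a single file. This is only a sketch of the directive syntax; the file name `log.pbs` is my choice, not part of the original script:

```shell
#!/bin/sh
#PBS -N slimdisk
#PBS -l mppwidth=16
#PBS -l walltime=24:00:00
# "-j oe" merges standard error into standard output,
# so only one output path is needed:
#PBS -j oe
#PBS -o log.pbs
```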
After the `#PBS` options, we have a couple of lines of simple bash script to start the job.
`module add hdf5` loads the hdf5 module, i.e., sets the correct search paths. `cd $PBS_O_WORKDIR` moves us to the working directory. Finally, `aprun -n 16 ./zeusmp.x > out 2> err` launches a 16-MPI-core job, redirecting the standard output to the file `out` and the standard error to the file `err`.
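The redirection at the end works the same way for any command, so you can try it locally. A quick demonstration using `sh -c` in place of `aprun` (which only exists on Cray systems):

```shell
# Write stdout to "out" and stderr to "err", as in the aprun line above.
sh -c 'echo hello; echo oops >&2' > out 2> err
cat out   # prints "hello"
cat err   # prints "oops"
```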
Once the submission script is ready, you can now submit it to the system:

```
user@lindgren:~$ qsub job.pbs
27645.nid01532
user@lindgren:~$ qstat -u user

nid01532:
                                                     Req'd  Req'd   Elap
Job ID               Username Queue    Jobname  ... TSK Memory Time  S Time
-------------------- -------- -------- -------- ... --- ------ ----- - -----
27645.nid01532       user     batch    slimdisk ...  --     -- 24:00 Q    --
```
The option `-u user` makes `qstat` print only the jobs submitted by `user`. If you want to cancel the job for any reason, you can simply use `qdel`:
```
user@lindgren:~$ qdel 27645
user@lindgren:~$ qstat -u user

nid01532:
                                                     Req'd  Req'd   Elap
Job ID               Username Queue    Jobname  ... TSK Memory Time  S Time
-------------------- -------- -------- -------- ... --- ------ ----- - -----
27645.nid01532       user     batch    slimdisk ...  --     -- 24:00 C 00:03
```
Note that the status column has switched from "Q" (queuing) to "C" (completed, here because the job was cancelled). The job should disappear from the list after a few seconds.
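A common follow-up is to wait in a script until a job has left the queue. A minimal sketch, assuming `qstat <jobid>` exits with a nonzero status once the server no longer knows about the job (the job id 27645 is just the one from the example above):

```shell
jobid=27645
# Poll every 30 seconds until qstat no longer reports the job.
while qstat "$jobid" >/dev/null 2>&1; do
    sleep 30
done
echo "job $jobid is no longer in the queue"
```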
LSF: I will write a similar summary for LSF later.