PBS on Aquila
Sheyna E. Gifford, July 24, 2002; last edited by Kelley McDonald, Dec 4, 2003.
To suggest changes, additions, clarifications in this documentation,
contact Central Services: central@astro.
Abstract
The purpose of this
article is to describe how Veridian Systems' Portable Batch System
(OpenPBS v. 2.3) has been configured on the UC Berkeley astronomy
department's Aquila cluster. It is intended not only as a manual for
everyday Aquila users and administrators, but also as a reference document
for those individuals who may be constructing massively parallel
systems running opensource software.
Of this document, only
Section 5: Job Submission
is intended for users; the rest is for system administrators.
Introduction
The Portable Batch System is a creation
of Veridian Information Systems, Inc. Its function is to allow users
of a system to submit jobs consisting of a shell script and control
attributes to the PBS server daemon, which will then pass
the job on to the Scheduler and the Job Executor. The Scheduler keeps
tabs on when and where the job is running. It is also capable of
holding jobs in the queue until the resources required to run the
job are available. The Execution daemon, lovingly dubbed ``Mom'',
receives jobs directly from the server daemon, and, with the aid of
the scheduler, attempts to spawn a session identical to the login
session of the user that submitted the job. The job is then
executed in accordance with the specifications in the user script.
Upon completion of the job the output is delivered, and the job exits
PBS.
PBS is designed to function as a
stand-alone Batch System. However, it integrates well with other
systems software, especially those modules designed to increase the
efficiency of jobs that run on parallel computing networks with
distributed memory. One example of such a system is MPI, or Message
Passing Interface. PBS itself is unaware of the existence of MPI.
However, MPICH, an opensource implementation of MPI, has been
designed to be used in conjunction with PBS.
(See notes on MPI below.)
The body of this document is dedicated
to describing how to build and operate OpenPBS v. 2.3 (hereafter PBS)
on a small cluster of SMP computers running opensource software, each
with their own memory, networked and configured to run in parallel,
specifically, the UC Berkeley Astronomy Department Beowulf cluster Aquila.
The basics of PBS installation and configuration for a system with
one central master computer and a series of processor nodes, or
slaves, will be covered. Then, queue creation, activation and
maintenance will be addressed, followed by a discussion of
troubleshooting. Lastly, there will be a Notes section with
miscellaneous information relevant to PBS, opensource computing and
Beowulf clusters.
Section 1: Installation
Acquiring PBS is a simple matter of
downloading it from the
PBS Public Homepage.
Utilities and patches for recent releases may be
found there as well.
At the present time the most recent
version of OpenPBS is OpenPBS_2_3_16.
Place the archive in the /usr/local directory
of the master node, then decompress and untar it there.
1.2 Configuration
Strictly speaking, no options are
absolutely required when building PBS.
The options selected for master on
Aquila were:
Documentation is not built by default,
and thus must be enabled. The same is true of
system logging and the GUIs, xpbs and
xpbsmon. gcc is the default compiler. rsh and rcp are assumed to be
the means of intranetwork communication. If you intend to
use ssh and scp instead, you have to enable that with another configure
option.
Also be aware that unless you
specify otherwise with the --set-server-home=DIR option, the
directory that PBS will build is /usr/spool/PBS. Running make and
make install will create this PBS_HOME directory on your master node.
There can be only one Scheduler and
Server on any given system. However, there needs to be a mom
everywhere you intend to execute processes, so you must
build the directories required to create mom on each individual
processor node as follows:
On each slave node, cd into the
buildutils directory in /usr/local/PBS and run the following scripts:
As long as each slave node is mounting
/usr/local from master, this should build /usr/spool/PBS/ and all
of the files required to build and run mom on each node.
Give each of the /usr/spool/PBS/mom_priv
directories a file called "config", and make sure that the clienthost
name given in that file matches the contents of server_name. For example:
Later on, this should allow you to run more on
the file /usr/spool/PBS/server_name and get the result ``master''.
According to the literature, the clienthost machine (master) does not
actually need this file; however, it has always been included in our
configuration.
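The config file described above is small enough to generate rather than edit by hand on every node. The following is a minimal sketch, not part of PBS itself: write_mom_config is a hypothetical helper name, and the two lines it writes simply mirror the example config shown in this document.

```bash
#!/bin/bash
# Hypothetical helper (not shipped with PBS): write mom_priv/config
# under a given PBS home directory.
write_mom_config() {
    local pbs_home=$1      # e.g. /usr/spool/PBS
    local server_host=$2   # e.g. master
    mkdir -p "$pbs_home/mom_priv" || return 1
    # Single quotes keep $logevent and $clienthost literal in the file.
    printf '$logevent 0x1ff\n$clienthost %s\n' "$server_host" \
        > "$pbs_home/mom_priv/config"
}
```

On a slave node one might run, as root, write_mom_config /usr/spool/PBS master.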
On the master node,
create a file: /usr/spool/PBS/server_priv/nodes
that lists all of the available
nodes where jobs will be running. For
example:
All of the nodes in the cluster MUST
have the timeshare (:ts) extension.
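For clusters whose nodes are named node1, node2, and so on, the nodes file can be generated rather than typed. This is only a sketch under that naming assumption; make_nodes_file is a hypothetical helper, and clusters with other host names will need a different loop.

```bash
#!/bin/bash
# Hypothetical helper: print one "nodeN:ts" line per node, assuming the
# hosts are named node1 .. nodeN as in the Aquila examples.
make_nodes_file() {
    local count=$1
    local i=1
    while [ "$i" -le "$count" ]; do
        printf 'node%d:ts\n' "$i"
        i=$((i + 1))
    done
}
```

For example, make_nodes_file 8 > /usr/spool/PBS/server_priv/nodes would produce the eight-node file used on Aquila.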
1.3 Modifications to /etc/services
PBS is extremely particular about the
ports it will use for communication. Therefore,
on the master and all of the nodes, add
the following lines to the end of the /etc/services file:
This concludes the initial installation
and configuration. Please remember to check the Open PBS homepage for
updates and bug reports. Development of this software by the
opensource community is continuous. As a result new releases and
patches are being issued constantly, and, as you will see in the
following sections, this software is not without its bugs, both major
and minor.
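Because the /etc/services additions from Section 1.3 have to be repeated on the master and on every node, it can help to apply them with a script that is safe to run more than once. The sketch below assumes the four port entries listed in this document; add_pbs_services is a hypothetical helper name, and it takes the file to edit as an argument so that it can be tried on a copy first.

```bash
#!/bin/bash
# Hypothetical helper: append the PBS port entries used in this document
# to a services file, skipping any entry that is already present.
add_pbs_services() {
    local services_file=$1
    local name port
    [ -f "$services_file" ] || return 1
    while read -r name port; do
        if ! grep -Eq "^${name}[[:space:]]+${port}([[:space:]]|\$)" "$services_file"; then
            printf '%s\t%s\n' "$name" "$port" >> "$services_file"
        fi
    done <<'EOF'
pbs_server 15001/tcp
pbs_mom 15002/tcp
pbs_mom 15003/udp
pbs_sched 15004/tcp
EOF
}
```

Run as root, add_pbs_services /etc/services would add only the entries that are missing.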
Section 2: Enabling and Starting the PBS Batch System
The PBS batch system consists of the
execution daemon (mom), the server, and the scheduler, each of which
must be individually activated in the order and manner presented in
the following subsections.
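As an overview, the start-up order can be written down as a small dry-run helper that only prints the commands, so that they can be reviewed before being run as root. This is a sketch: pbs_start_plan is a hypothetical name, the commands are the ones given in the subsections of this section, and pbs_mom must still be started separately on every slave node.

```bash
#!/bin/bash
# Hypothetical helper: print the start-up commands for the master node
# in the order used in this document. Nothing is executed.
pbs_start_plan() {
    local mode=${1:-restart}   # "first" creates a new server database
    echo "pbs_mom"
    if [ "$mode" = "first" ]; then
        echo "pbs_server -t create"
    else
        echo "pbs_server"
    fi
    echo 'qmgr -c "set server scheduling=true"'
    echo "pbs_sched"
}
```

Calling pbs_start_plan first lists the commands for a brand-new installation; calling it with no argument lists those for a restart.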
2.1 MOM
The execution daemon is the simplest to
start of the three. After all of the appropriate files and changes
from section one are in place, as root at the command line simply
type:
On master and all the nodes, this
should start the PBS execution daemon. The action will return no
value or message if it is successful. Should the system complain at
this point, try changing directories to /sbin and starting Mom from
there.
2.2 The Server
There need only be one server, and it
should be built on master. Run the command:
Again, if successful, it should return
no value. If this is not the first time you are starting the
server, you need only type:
for the program to begin. If you are
attempting to rebuild the system from scratch (as will be discussed
in later chapters) you may get the error message, ``server database
already exists. Proceed Y/N?'' You will have to proceed in order for
PBS to run, but the server will continue to use the same database as
it did before (meaning that any previous jobs you had running or
options you had set will be reinstated).
Assuming that all proceeds normally,
you may run:
You should now be able to start, enable
and configure the scheduler.
2.3 The Scheduler
At the command line run:
With the scheduler activated, it is
possible to invoke the Queue Manager (Qmgr). The qmgr command allows
you to create and modify servers and queues for use within PBS.
Without at least one active queue to run jobs and a server to
moderate tasks for the queue, PBS is little more than a powerless
daemon trapped within the system. To begin the process that will
unleash them upon your cluster type:
This will start the queue manager. You
will see:
One server already exists: the
clienthost ``master''. Therefore, the first order of business is to
create a queue. To create a queue by the name of ``rque'', for
example, one would type:
Queues can be of either queue_type=execution
or queue_type=route. There must be at least one execution queue for
jobs to run. In addition, the queue has to know to which server it is
bound, hence the @master suffix affixed to rque.
Now the server needs to be made aware
of the queue insofar as knowing where to direct jobs by default. If
multiple queues exist, it is possible to submit jobs to one queue
specifically. So, for example, if rque, dque and mque were all
@master, I could submit my job specifically to mque.
In any case the server requires a
default queue. Assuming that you have already created rque@master,
run:
At this point you can start and enable
your queue:
At this point all that remains to be
done before jobs can be submitted is to tell both the server and the
queue(s) how many jobs may run at a time. Do this
with the attribute max_running. For example:
Qmgr: s s master max_running=10
Qmgr: s q rque@master max_running=10
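The queue set-up steps of this section can also be collected into one batch of commands, which is convenient when the server has to be rebuilt. The following is a sketch rather than a tested recipe: qmgr_setup_commands is a hypothetical helper that prints the unabbreviated forms (create, set, queue, server) of the commands used in this section, and feeding its output to qmgr on standard input is an assumption to verify on your own installation.

```bash
#!/bin/bash
# Hypothetical helper: print the qmgr commands from this section for a
# given queue, server and max_running limit. Nothing is executed.
qmgr_setup_commands() {
    local queue=$1 server=$2 limit=$3
    cat <<EOF
create queue ${queue}@${server} queue_type=execution
set server ${server} default_queue=${queue}@${server}
set queue ${queue}@${server} enabled=true
set queue ${queue}@${server} started=true
set server ${server} max_running=${limit}
set queue ${queue}@${server} max_running=${limit}
EOF
}
```

For the example in this section one would review the output of qmgr_setup_commands rque master 10 and then, if it looks right, pipe it into qmgr.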
At this point your system should be
ready to receive, queue, run jobs, and deliver the output. Run
``xpbs'' at the command line to start a GUI that will allow you to
monitor the status of your queues and the jobs within those queues.
Run ``xpbsmon'' to bring up a color-coded display of the nodes which
are online, busy, available, etc.
If you have any problems seeing either
of the GUIs, or if a user has a preference regarding color and/or
font, many of those options may be set in the ~/.xpbsrc file. More
information on how to set xpbs preferences may be found on page 57 of
the administrator's manual.
2.3.1 Further Queue Attributes
Experience has shown that, in general,
the fewer attributes assigned to a single queue or server, the
better. Setting resource defaults regarding job size, cycles or CPU
time can cause jobs to be rejected right and left. Unless severe
problems arise as a result of jobs monopolizing nodes for weeks at a
time, it is suggested that resource limits and defaults remain unset.
On the other hand, the query_other_jobs
attribute can be extremely helpful, especially if there are groups of
users on the system with a vested interest in the progress of each
other's jobs. This attribute allows users to qstat, or view the
status of, jobs submitted by other users. To activate it, set:
2.3.2 Unsetting Queue Attributes, Queues, and Killing Jobs
Attributes may also be unset, as
needed. If, for instance, a job is stuck in a queue, one solution in
order to get things moving along again is to make another queue. In
this case, in order to prevent jobs from being submitted to the queue
with the stopped job, you change the server's default queue setting.
Before you set it, though, you have to unset it. For example:
Remember always to name objects
@master, or @their respective server, or
you will encounter errors like:
Simply naming the applicable server,
even if there is no other server, will solve the problem:
Rque can then be disabled, or
destroyed, as the case may be. To delete a queue, run:
Once in a while you will take a
completely valid action, with an entirely correct syntax, and the
system will still complain. One example is:
If the queue is busy a user with
administrative privileges may attempt to delete the job in question
with the qsig command, which, when it takes an argument of 0, signals
a running job to quit immediately.
Problems may arise when the job itself
is still in the queue, but is not running. To see if a queued job is
running, run xpbs and look under the Jobs section for the column
labeled ``S'' (``Status''). If the letter in the column is not an R,
the job is not running and therefore cannot be killed with either the
qsig command, or manually via the delete button in the GUI (xpbs).
Once in a while a job will fail during
execution, and will remain in the queue with a Status of R (Running),
despite its processes being halted. Jobs of this type can take up
CPU time and lock up nodes, preventing them from being used for other
jobs. PBS sees part of its job as ``protecting the job against
system crashes''. As a result, if a system or process goes down, PBS
will often keep the job in queue, with the status of running.
Furthermore it will ignore attempts to manually deactivate these
jobs. Stopping and restarting PBS will not serve to clear the queue,
nor will rebooting the entire
system, as PBS is specifically designed
to preserve jobs through a system shutdown.
The only viable recourse that the
Aquila team has identified in the contingency of a job being
irrevocably stuck in a queue is to -- quite literally -- kill off the
server. That is, if the status of a job is reading as Q, E, or R for
an extended period but it is clearly making no progress, you can kill
off PBS completely by killing all PBS daemons with kill -9 PID,
and then erasing, completely, the PBS home directory. Of course, this
requires then running the configuration script and, essentially,
remaking and reinstalling PBS on the master node.
On the upside, this is a fairly swift
procedure, especially if you wrote a script to do your initial
configuration. If the PBS/home/server_priv/nodes file and the
PBS/home/mom_priv/config file have been copied elsewhere, they can
simply be recopied back into the appropriate directories after the
reinstallation.
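Since the two hand-written files are the only things worth keeping before the PBS home directory is erased, saving them can be scripted. The sketch below is an illustration, not a tested recovery procedure: backup_pbs_config is a hypothetical helper, and it assumes the files live under server_priv/nodes and mom_priv/config inside the PBS home directory, as elsewhere in this document.

```bash
#!/bin/bash
# Hypothetical helper: copy the nodes file and the mom config file out
# of a PBS home directory before that directory is erased.
backup_pbs_config() {
    local pbs_home=$1   # e.g. /usr/spool/PBS
    local dest=$2       # directory outside the PBS home
    local f
    mkdir -p "$dest" || return 1
    for f in server_priv/nodes mom_priv/config; do
        if [ -f "$pbs_home/$f" ]; then
            mkdir -p "$dest/$(dirname "$f")" || return 1
            cp -p "$pbs_home/$f" "$dest/$f" || return 1
        fi
    done
}
```

The saved tree keeps the same relative layout, so after reinstalling the files can be copied straight back into place.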
If you wait long enough, and kill off
and bring back the server a few times, zombied jobs will sometimes
simply disappear of their own accord. How they do this is unknown.
This sort of exorcism works in mysterious and unpredictable ways, if
it works at all. In general, it has been our experience that you must
erase the PBS/home directory in /usr/spool, for upon restarting the
server with the pbs_server -t create command, you will get an error
stating that a server database already exists. Part of that
database, apparently, is a complete record of all jobs that were
running before the server was halted. Unless you erase all records
of the server database, you will be right back where you started.
Section 3: Hanging and Unhanging the System
A number of fairly typical errors can
cause jobs to ``zombie'', or become trapped in the queue, unable to
run and/or exit, unable to be erased, even by a user with full
administrative privileges.
One of the most common causes of a
zombied job is an inconsistency between a user's .login and .cshrc
(or .bashrc) file. When Mom attempts to run a batch job, she does so
by creating a new session that corresponds to the user's login
session. In other words, Mom will attempt to run .login as well as
(for example) .cshrc. If there is an irreconcilable conflict between
those two files (for example, the same variable set to two
different values), the job executor may
crash while trying to create the session.
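A quick way to look for the kind of conflict described above is to compare the variables that the two start-up files set. The following is a rough sketch for C shell files only: conflicting_setenv is a hypothetical helper, and it recognizes nothing beyond simple lines of the form setenv NAME value, so it will miss variables set in other ways.

```bash
#!/bin/bash
# Hypothetical helper: print the names of variables that are given
# different values by "setenv NAME value" lines in two files.
conflicting_setenv() {
    awk '
        $1 == "setenv" && NF >= 3 {
            if (FILENAME == ARGV[1])
                first[$2] = $3            # remember values from the first file
            else if (($2 in first) && first[$2] != $3)
                print $2                  # same name, different value
        }
    ' "$1" "$2"
}
```

A user might run conflicting_setenv ~/.login ~/.cshrc before submitting a first job; no output means that no simple conflict was found.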
Another issue arises when the output of
a job is undeliverable. Cases have occurred where either the
directory to which the output was directed did not exist, or the
permissions were not set properly, such that the program was unable to
write to the designated directory. In either case the job hangs in
the queue as a result. Its status is that of Exiting, but since it
cannot deliver its output, the exit is unsuccessful.
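Both of the output problems described above can be checked before the job is submitted. Here is a minimal sketch; check_output_dir is a hypothetical helper, and it only tests the directory as seen from the machine where it is run, which may differ from what a compute node sees.

```bash
#!/bin/bash
# Hypothetical helper: report whether a directory exists and is
# writable by the current user. Returns non-zero on either failure.
check_output_dir() {
    local dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    elif [ ! -w "$dir" ]; then
        echo "not writable: $dir"
        return 1
    fi
    echo "ok: $dir"
}
```

Running it on the directory a job script writes to, for example check_output_dir /home/mwhite/treepm/lcdm300, catches the two cases mentioned in this section.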
Section 4: Job Rejection
If you request a job for which there
are insufficient resources, your job will be rejected. On the Aquila
cluster, this has only happened in one instance, for there are no set
restrictions as to the amount of CPU time or cycles that a given job
may take up.
When shared memory is enabled,
insufficient shared memory to run a job will cause that job to be
dropped, with an error message like:
This is an issue of insufficient amount
of available, allocated global memory. You can increase the amount of
memory by setting the environment variable P4_GLOBMEMSIZE (in bytes).
Since the request for 17640040 bytes failed, one can reset the value
of P4_GLOBMEMSIZE to 20000000 bytes with the following additions to
the script:
For C Shell :
Alternatively, if you have failed to
set a maximum number of running jobs (s q rque@master
max_running=12, for example), you will get the message ``qsub: Job
exceeds queue resource limits''. Simply setting max_running for the
server and queue(s) to some value will resolve that issue.
Section 5: Job Submission
Assuming that all goes well during the
installation and configuration (that the queues are active and enabled
and everything has been configured as it should be), a user ought to be
able to submit a job with the reasonable expectation that it will run
its proper course and return the appropriate output.
The qsub command
In order to increase the odds of a
successful run, users are encouraged to emulate the following format.
Bear in mind that there are essentially 3 commands you will use
regularly: qsub, qstat, and qdel.
Each of these has numerous options
which can make your life easier once you
become acquainted with the system.
While you can submit jobs in a variety of
ways, perhaps the most convenient is to create a shell script that
passes the PBS arguments. Here are 2 such examples: one for a
serial job and one for a parallel job.
Example One:
Example Two:
PBS_NODEFILE is an environment variable which will be
set by PBS to point to a file which lists the nodes
that the job has been allocated.
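Rather than hard-coding the process count passed to mpirun, a job script can derive it from the node file. The sketch below assumes that the file named by PBS_NODEFILE holds one line per allocated processor, which should be confirmed on your own system; count_procs is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: count the non-empty lines of a node file. Each
# line is assumed to stand for one allocated processor.
count_procs() {
    grep -c . "$1"
}
```

Inside a job script one could then write NP=$(count_procs "$PBS_NODEFILE") and pass -np "$NP" to mpirun, so that the count always follows the nodes and ppn values requested with #PBS -l.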
The format of the shell script is the
usual. It starts with
then you can have any number of lines
beginning with #PBS. These are treated
as comments by the shell, but contain
information for PBS. The most common
arguments are:
Following the PBS arguments are the
shell script commands. In these examples the scripts begin by
"cd"ing to the working directory. As qsub remembers from
where you submitted the job, this isn't strictly
necessary. This is a useful technique, however, as the scripts and
the relevant data often reside in different places.
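The layout described above (the interpreter line, then #PBS lines, then the commands) lends itself to a small generator for users who submit many similar jobs. This is a sketch only; make_job_script is a hypothetical helper, and it emits just three of the directives seen in the examples: -N names the job, -k eo keeps the stdout and stderr files, and -l requests resources such as nodes and processors per node. The examples in this document also use -m a, which sends mail if the job is aborted, and -M, which names the mail recipient.

```bash
#!/bin/bash
# Hypothetical helper: print a job script in the layout used by the
# examples in this document. Arguments: job name, node count,
# processors per node, working directory, then the command to run.
make_job_script() {
    local name=$1 nodes=$2 ppn=$3 workdir=$4
    shift 4
    cat <<EOF
#!/bin/bash
#PBS -N ${name}
#PBS -k eo
#PBS -l nodes=${nodes}:ppn=${ppn}
#
cd "${workdir}"
#
$*
EOF
}
```

For instance, make_job_script Test 1 1 /home/mwhite/treepm/lcdm300 ./my_program > test.sh writes a serial job script that can then be handed to qsub; my_program is a placeholder.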
Job Submission
In the above example, the job "prop.sh"
runs a halo property
finder on the 512^3 simulation using 14
processors (the whole cluster at
present). The job "xi.sh"
runs a serial code to compute the correlation
function for a fake galaxy catalogue
produced from this simulation. To run
the latter, for example, one would type
at the command line:
When the job is successfully submitted,
it will return a value that corresponds to the job ID number (99 in
this case). To observe your job in the queue type:
Job 99, named Xi(r) and submitted by mwhite, is
running. So far it has used
0 time, since it has just started. It
has a status of R, running, and is doing so in queue dque. The queue
to which jobs are submitted may be specified with the -q option.
Since it wasn't specified, one may assume that dque is the default
queue.
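When many jobs are in the queue it can be handy to pull out the status letter of a single job. The following sketch assumes the default qstat layout shown in this document, in which the status is the second-to-last column; job_state is a hypothetical helper that reads qstat output on standard input.

```bash
#!/bin/bash
# Hypothetical helper: read qstat-style output on stdin and print the
# status letter of the job whose id matches the first argument.
job_state() {
    local job_id=$1
    # The status is assumed to be the second-to-last column.
    awk -v id="$job_id" '$1 == id { print $(NF - 1) }'
}
```

For the job above, qstat | job_state 99.master would be expected to print R while the job is running and nothing once it has left the queue.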
To delete the job from the queue one
may use:
which, if successful, returns no
output. One could also say "qdel 99.master".
Now, if you try and view the queue, you
will see that the queue is empty:
Stdout and Stderr
PBS creates log files with
messages pertaining to the run of the job. For example:
These contain stdout and stderr for job
99. The output contains simply:
This reveals, essentially, that the job
was terminated. The first two lines always appear and can be safely
ignored.
*
above section submitted by Martin White
Section 6: Notes
6.1 MPI
In order to increase the efficiency
with which PBS processes batch jobs on a parallel network, other
software may be brought into play. The two most popular opensource
applications for this purpose are PVM and MPI. As was mentioned
briefly in the introduction, MPICH has been introduced into the
Aquila cluster in the hope that process communication via message
passing will significantly speed up the system.
MPI must be built directly in the
/usr/local directory in order for all of the nodes to properly import
the execution binaries. There it will build a home directory and
source tree. Thereafter some minor tweaking is required in order
for it to ``see'' the nodes upon which PBS will be running batch
processes.
According to the official MPI
documentation, every time MPI runs, it creates a new file of machine
names to use just for that run, using a machines file in the MPI In
addition you must edit the machines file in
$mpihomedir$/util/machines/machines.LINUX to contain the information
from the server_priv nodefile (e.g. the names of the nodes and how
many processors they each contain). Here is an example:
If you encounter difficulty running more than
16 processes in PBS, despite having an adequate number of servers
built in qmgr, it could possibly be an issue with MPI. By default
mpirun will use at most the number of processors specified in the
machines list for each node (up to 16 per machine, 16 TOTAL in our
case). This can be controlled/altered with the MPI_MAX_CLUSTER_SIZE
environment variable.
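Since the machines file described earlier in this section repeats the node names from the server_priv nodes file, the one can be derived from the other. This is a sketch under the assumption that every node has the same number of processors; nodes_to_machines is a hypothetical helper that reads the nodes file on standard input.

```bash
#!/bin/bash
# Hypothetical helper: turn "node1:ts" style lines on stdin into
# "node1:<ppn>" lines, where <ppn> is the processor count per node.
nodes_to_machines() {
    local ppn=$1
    # Drop any ":property" suffix such as ":ts", skip blank lines,
    # then append ":<ppn>" to each remaining host name.
    sed -e 's/:.*$//' -e '/^$/d' -e "s/\$/:${ppn}/"
}
```

On Aquila, nodes_to_machines 2 < /usr/spool/PBS/server_priv/nodes would print lines of the form node1:2, matching the machines file example in this document.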
6.2 SSH
When PBS was first brought up on
Aquila, ssh and secure copy were being used for intranetwork
communication. This proved to be both cumbersome and complicated.
Additionally, SSH and MPI did not play especially well together,
despite efforts on behalf of the system administrators to put public
and private keys in all the proper places. When all was said and
done it proved easier to put rsh and rcp on the system and leave ssh
and scp on master only for purposes of communication with the outside
world.
6.3 MISC
The author of this document is aware of
the existence of an .rpm package for installation on systems running
RedHat, but has no advice to provide in that regard, as the Aquila
cluster ultimately ran SuSE as a primary OS.
# ./configure --enable-docs --enable-gui --enable-syslog --enable-gcc
# sh pbs_mkdirs mom
# sh pbs_mkdirs aux
# sh pbs_mkdirs default
# more /usr/spool/PBS/mom_priv/config
$logevent 0x1ff
$clienthost master
# more /usr/spool/PBS/server_priv/nodes
node1:ts
node2:ts
node3:ts
node4:ts
node5:ts
node6:ts
node7:ts
node8:ts
# Local services
pbs_server 15001/tcp
pbs_mom 15002/tcp
pbs_mom 15003/udp
pbs_sched 15004/tcp
# pbs_mom
# pbs_server -t create
# pbs_server
# qmgr -c "set server scheduling=true"
# pbs_sched
# qmgr
qmgr
Max open servers: 4
Qmgr:
Qmgr: c q rque@master
queue_type=execution
Qmgr: s s master
default_queue=rque@master
Qmgr: s q rque@master enabled=true
Qmgr: s q rque@master started=true
Qmgr: s s master query_other_jobs=true
unset server master default_queue
Qmgr: s s default_queue=dque
qmgr obj= svr=default: Unauthorized Request
Qmgr: s s master
default_queue=dque@master
Qmgr:
Qmgr: delete queue dque@master
Qmgr: delete queue dque@master
qmgr obj=dque svr=master: Cannot delete busy queue
p0_12177: (11.128671)
xx_shmalloc: returning NULL; requested 17640040 bytes.
setenv P4_GLOBMEMSIZE 20000000
For Bourne Shell :
P4_GLOBMEMSIZE=20000000
export P4_GLOBMEMSIZE
Many options exist to modify these
commands. To view those available, just type "man" and the
command name.
Job Scripts
#!/bin/bash
#PBS -N Xi(r)
#PBS -k eo
#PBS -m a
#PBS -M mwhite@astron.berkeley.edu
#PBS -l nodes=1:ppn=1
#
cd /home/mwhite/treepm/lcdm300/
#
../GalFake/pnt_corrfn_mesh 300.0
tpmsph_1.0000.gal > galcorr.log
#
#!/bin/bash
#PBS -N Prop
#PBS -k eo
#PBS -m a
#PBS -M mwhite@astron.berkeley.edu
#PBS -l nodes=7:ppn=2
#
cd /home/mwhite/treepm/lcdm300/
#
/usr/local/bin/mpirun -np 14 \
-machinefile $PBS_NODEFILE \
./properties tpmsph_1.0000 16 300.0 \
0.3 0.0 0.7 > tpmsph_1.0000.prop
#
#!/bin/bash
$ qsub xi.sh
99.master
$ qstat
Job id Name User
Time Use S Queue
---------------- ---------------- -------- - -----
99.master Xi(r)
mwhite 0 R dque
$ qdel 99
$ qstat
-rw------- 1 mwhite mwhite 95 May 7 18:08 Xi(r).o99
-rw------- 1 mwhite mwhite 0 May 7 18:07 Xi(r).e99
[mwhite@master ~]$ more "Xi(r).o99"
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Terminated
node1:2
node2:2
node3:2
node4:2
node5:2
node6:2
node7:2
node8:2