PBS on Aquila

Sheyna E. Gifford July 24, 2002; last edited by Kelley McDonald, Dec 4, 2003.

To suggest changes, additions, clarifications in this documentation, contact Central Services: central@astro.

CONTENTS

Abstract

The purpose of this article is to describe how Veridian System's Portable Batch System (OpenPBS v. 2.3) has been configured on the UC Berkeley astronomy department's Aquila cluster. It is intended not only as a manual for everyday Aquila users and administrator, but as a reference document for those individuals who may be constructing massively parallel systems running opensource software.

Of this document, only Section 5: Job Submission is intended for users, the rest is for sys admins.

Introduction

The Portable Batch System is a creation of Veridian Information Systems, Inc. Its function is to allow users of a system to submit jobs consisting of a shell script and control attributes to the PBS server daemon, which will then pass the job onto the Scheduler and the Job Executor. The Scheduler keeps tabs on when and where the job is running. It is also capable of holding jobs in the queue until the resources required to run the job are available. The Execution daemon, lovingly dumbed ``Mom'', receives jobs directly from the server daemon, and, with the aid of the scheduler, attempts to spawn a session identical to the login session of the user that submitted the job. The job is then executed in accordance with the specifications in the user script. Upon completion of the job the output is delivered, and the job exits PBS.

PBS is designed to function as a stand-alone Batch System. However, it integrates well with other systems software, especially those modules designed to increase the efficiency of jobs that run on parallel computing networks with distributed memory. One example of such a system is MPI, or Message Passing Interface. PBS itself is unaware of the existence of MPI. However, MPICH, the opensource implementation of MPI, has been designed to be used in conjunction with PBS. (See notes on MPI below.)

The body of this document is dedicated to describing how to build and operate OpenPBS v. 2.3 (hereafter PBS) on a small cluster of SMP computers running opensource software, each with their own memory, networked and configured to run in parallel, specifically, the UC Berkeley Astronomy Department Beowuld Cluster aquila. The basics of PBS installation and configuration for a system with one central master computer and a series of processors nodes, or slaves, will be covered. Then, queue creation, activation and maintenance will be addressed, followed by a discussion of troubleshooting. Lastly, there will be a Notes section with miscellaneous information relevant to PBS, opensource computing and Beowulf clusters.

Section 1: Installation

Acquiring PBS is a simple matter of downloading it from the PBS Public Homepage. Utilities and patches for recent releases may be found there as well.

At the present time the most recent version of OpenPBS is OpenPBS_2_3_16. Install it in the /usr/local directory of the master node and unpack and untar it.

1.2 Configuration

Strictly speaking, no options are absolutely required when building PBS. The options selected for master on Aquila were:

# ./configure --enable-docs --enable-gui --enable-syslog --enable-gcc

Documentation is not built by default, and thus must be enabled. Similarly so with the system logs and the GUIs, xpbs and xpbsmon. gcc is the default compiler. rsh and rcp are assumed to be the enabled mode of intranetwork communication. If you intend to use ssh and scp, you have to enable that with another configure option. Also be aware that unless you specify otherwise with the --set-server-home=DIR option, the directory that PBS will build is /usr/spool/PBS. Running make and make install will create this PBS_HOME directory on your master node.

There can be only one Scheduler and Server on any given system. However, there needs to be a mom everywhere you intend to execute processes, so you must build the directories required to create mom on each individual processor node as follows:

On each slave node, cd into the buildutils directory in /usr/local/PBS and run the following scripts:

# sh pbs_mkdirs mom
# sh pbs_mkdirs aux
# sh pbs_mkdirs default

As long as each slave node is mounting /usr/local from master, this should build /usr/spool/PBS/ and all of the files required to build and run mom on each node. Give the all of the /usr/spool/PBS/mom_priv directories a file called "config" and make sure that server_name=the clienthost name given in that file. For example:

# more /usr/spool/PBS/mom_priv/config
$logevent 0x1ff
$clienthost master

Later on this should allow you to more the file /usr/spool/PBS/server_name with the result of ``master''. According to the literature the clienthost machine (master) does not actually need this file, however, it has always been included in our configuration.

On the master node, create a file: /usr/spool/PBS/server_priv/nodes that lists all of the available nodes where jobs will be running. For example:

# more /usr/spool/PBS/server_priv/nodes
node1:ts
node2:ts
node3:ts
node4:ts
node5:ts
node6:ts
node7:ts
node8:ts

All of the nodes in the cluster MUST have the timeshare (:ts) extension.

1.3 Modifications to /etc/services

PBS is extremely peculiar about the ports will use for communication. Therefore on the master and all of the nodes add the following lines to end of the /etc/services file:

# Local services
pbs_server      15001/tcp
pbs_mom         15002/tcp
pbs_mom         15003/udp
pbs_sched       15004/tcp

This concludes the initial installation and configuration. Please remember to check the Open PBS homepage for updates and bug reports. Development of this software by the opensource community is continuous. As a result new releases and patches are being issued constantly, and, as you will see in the following sections, this software is not without its bugs, both major and minor.

Section 2: Enabling and Starting the PBS Batch System

The PBS batch system consists of the execution daemon (mom), the server, and the scheduler, each of which must be individually activated in the order and manner presented in the following subsections.

2.1 MOM

The execution daemon is the simplest to start of the three. After all of the appropriate files and changes from section one are in place, as root at the command line simply type:

# pbs_mom

On master and all the nodes, this should start the PBS execution daemon. The action will return no value or message if it is successful. Should the system complain at this point, try changing directories to /sbin and starting Mom from there.

2.2 The Server

There need only be one server, and it should be built on master. Run the command:

# pbs_server -t create

Again, if successful, it should return no value. If this is not the first time you are starting the scheduler, you need only type:

# pbs_server

for the program to begin. If you are attempting to rebuild the system from scratch (as will be discussed in later chapters) you may get the error message, ``server database already exists. Proceed Y/N?'' You will have to proceed in order for PBS to run, but the server will continue to use the same database as it did before (meaning that any previous jobs you had running or options you had set will be reinstated).

Assuming that all proceeds normally, you may run:

# qmgr -c "set server scheduling=true"

You should now be able to start, enable and configure the scheduler.

2.3 The Scheduler

At the command line run:

# pbs_sched

With the scheduler activated, it is possible to evoke the Queue Manager (Qmgr). The qmgr command allow you to create and modify servers and queues for use within PBS. Without at least one active queue to run jobs and a server to moderate tasks for the queue, PBS is little more than a powerless daemon trapped within the system. To begin the process that will unleash them upon your cluster type:

# qmgr

This will start the queue manager. You will see:

qmgr
Max open servers: 4
Qmgr:

One server already exists-the clienthost ``master''. Therefore, the first order of business is to create a queue. To create a queue by the name of ``rque'', for example, one would type:

Qmgr: c q rque@master
queue_type=execution

Queues can be of either type=execution or type=scheduling. There must be at least one execution queue for jobs to run. In addition the queue has to know to which server it is bound, hence the @master prefix affixed to rque.

Now the server needs to be made aware of the queue insofar as knowing where to direct jobs by default. If multiple queues exists, it is possible to submit jobs to one queue specifically. So, for example, if rque, dque and mque were all @master, I could submit my job specifically to mque.

In any case the server requires a default queue. Assuming that you have already created rque@master, run:

Qmgr: s s master
default_queue=rque@master

At this point you can start and enable your queue:

Qmgr: s q rque@master enabled=true
Qmgr: s q rque@master started=true

At this point all that remains to be done before jobs can be submitted is to tell both the server and the queue(s) how many jobs can be submitted to them at a time. Do this with the attribute max_running. For example:

Qmgr: s s master max_running=10 Qmgr: s q rque@master max_running=10

At this point your system should be ready to receive, queue, run jobs, and deliver the output. Run ``xpbs'' at the command line to start a gui that will allow you to monitor the status of your queues and the jobs within those queues. Run ``xpbsmon'' to bring up a color-coated display of the nodes which are online, busy, available, etc.

If you have any problems seeing either of the GUIs, or if a user has a preference regarding color and/or font, many of those options may be set in the ~./xpbsrc file. More information on how to set xpbs preferences may be found on page 57 of the administrator's manual.

2. 31 Further Queue Attributes

Experience has shown that, in general, the fewer attributes assigned to a single queue or server, the better. Setting resource defaults regarding job size, cycles or CPU time can cause jobs to be rejected right and left. Unless severe problems arise as a result of jobs monopolizing nodes for weeks at a time, it is suggested that resource limits and defaults remain unset.

On the other hand, the query_other_jobs attribute can be extremely helpful, especially if there are groups of users on the system with a vested interested in the progress of each others' jobs. This attribute allows users to qstat, or view the status of jobs submitted by another users. To activate it, set:

Qmgr: s s master query_other_jobs=true

2. 32 Unsetting Queue Attributes, Queues, and Killing Jobs

Attributes may also be unset, as needed. If, for instance, a job is stuck in a queue, one solution in order to get things moving along again is to make another queue. In this case, in order prevent jobs from being submitted to the queue with the stopped job, you change the default queue server settings. Before you set it, though, you have to unset it. For example:

unset server master default_queue

Remember always to name objects @master, or @their respective server, or you will encounter errors like:

Qmgr: s s default_queue=dque
qmgr obj= svr=default: Unauthorized Request

Simply naming the applicable server, even if there is no other server, will solve the problem:

Qmgr: s s master
default_queue=dque@master
Qmgr:

Rque can then be disabled, or destroyed, as the case may be. To delete a queue, run:

Qmgr: delete queue dque@master

Once in a while you will take a completely valid action, with an entirely correct syntax, and the system will still complain. One example is:

Qmgr: delete queue dque@master
qmgr obj=dque svr=master: Cannot delete busy queue

If the queue is busy a user with administrative privileges may attempt to delete the job in question with the qsig command, which, when it takes an argument of 0 signals a running job to quit immediately.

Problems may arise when the job itself is still in the queue, but is not running. To see if a queued job is running, run xpbs and look under the Jobs section for the column labeled``S'' (``Status''). If the letter in the column is not an R, the job is not running and therefore cannot be killed with either the qsig command, or manually via the delete button in the GUI (xpbs).

Once in a while a job will fail during execution, and will remain in the queue with a Status of R (Running), despite its processes being halted. Jobs of this type can take up CPU time and lock up nodes, preventing them from being used for other jobs. PBS sees part of its jobs as ``protecting the job against system crashes''. As a result, if a system or process goes down, PBS will often keep the job in queue, with the status of running. Furthermore it will ignore attempts to manually deactivate these jobs. Stopping and restarting PBS will not serve to clear the queue, nor will rebooting the entire system, as PBS is specifically designed to preserve jobs through a system shutdown. The only viable recourse that the Aquila team has identified in the contingency of a job being irrevocably stuck in a queue is to -- quite literally -- kill off the server. That is, if the status of a job is reading as Q, E, or R for an extended period but is clearing making no progress, you can kill off PBS completely by sending all PBS daemons the kill -9 PID signal, and then erasing, completely, the PBS home directory. Of course, this requires then running the configuration script and, essentially, remaking and reinstalling PBS on the master node. On the upside, this is a fairly swift procedure, especially if you wrote a script to do your initial configuration. If the PBS/home/server_priv/nodes file and the PBS/home/mom_priv/config file have been copied elsewhere, they can simply be recopied back into the appropriate directories after the reinstallation. If you wait long enough, and kill off and bring back the server a few times, zombied jobs will sometimes simply disappear of their own accord. How they do this is unknown. This sort of exorcism works in mysterious and unpredictable ways, if it works at all. In general, it has been our experience you must erase the PBS/home directory in usr/spool, for upon restarting the server with pbs_server -t create command, you will get an error stating that the a server database already exists. Part of that database, apparently, is a complete record of all jobs that were running before the server was halted. Unless you erase all records of the server database, you will be right back where you started.

Section 3: Hanging and Unhanging the System

A number of fairly typical errors can cause jobs to ``zombie''-or become trapped in the queue, unable to run and/or exit, unable to be erased, even by a user with full administrative privileges. One of the most common causes of a zombied job is an inconsistency between a user's .login and .cshrc (or .bashrc) file. When Mom attempts to run a batch job, she does so by creating a new session the corresponds to the user's login session. In other words, Mom will attempt to run .login as well as (for example) .cshrc. If there is an irreconcilable conflict between those two files-for example, the same variable set to two different values-the job executor may crash while trying to create the session.

Another issue arises when the output of a job is undeliverable. Cases have occurred where either the directory to which the output was directed did not exist, or the permissions were not set properly, such that program was unable to write to the designated directory. In either case the job hangs in the queue as a result. Its status is that of Exiting, but since it cannot not deliver its output, the exit is unsuccessful.

Section 4: Job Rejection

If you request a job for which there are insufficient resources, your job will be rejected. On the Aquila cluster, this has only happened in one instance, for there are no set restrictions as to the amount of CPU time or cycles that a given job may take up. In the case of enabled shared memory, insufficient shared memory to run a job will cause that job to be dropped, with the errors message like:

p0_12177: (11.128671)
xx_shmalloc: returning NULL; requested 17640040 bytes.

This is an issue of insufficient amount of available, allocated global memory. You can increase the amount of memory by setting the environment variable P4_GLOBMEMSIZE (in bytes). Since the request for 17640040 bytes failed, one can reset the value of P4_GLOBMEMSIZE to 20000000 bytes with the following additions to the script:

For C Shell :

setenv P4_GLOBMEMSIZE 20000000
For Bourne Shell :
P4_GLOBMEMSIZE=20000000
export P4_GLOBMEMSIZE

Alternatively, if you have failed to create a maximum number of running processes (s q rque@master max_running=12 for example) you will get a message of qsub: Job exceeds queue resource limits. Simply setting max_running for the server and queue(s) to some value will ameliorate that issue.

Section 5: Job Submission

Assuming that all goes well during the installation and configuration-that the queues are active and enabled and everything has been configured as it should be-a user ought to be able to submit a job with the reasonable expectation that it will run its proper course and return the appropriate output.

The qsub command

In order to increase the odds of a successful run, users are encouraged to emulate the following format. Bear in mind that there are essentially 3 commands you use regularly:

Each of these has numerous options which can make your life easier once you become acquainted with the system.

Many options exists to modify these commands. To view those available, just type "man" and the command name. Job Scripts

While you can submit jobs a variety of ways, perhaps the most convenient is to createt a shell script that passes the PBS arguments. Here are 2 such examples -- one for a serial job one for a parallel.

Example One:

#!/bin/bash
#PBS -N Xi(r)
#PBS -k eo
#PBS -m a
#PBS -M mwhite@astron.berkeley.edu
#PBS -l nodes=1:ppn=1
#
cd /home/mwhite/treepm/lcdm300/
#
../GalFake/pnt_corrfn_mesh 300.0
tpmsph_1.0000.gal > galcorr.log
#

Example Two:

#!/bin/bash
#PBS -N Prop
#PBS -k eo
#PBS -m a
#PBS -M mwhite@astron.berkeley.edu
#PBS -l nodes=7:ppn=2
#
cd /home/mwhite/treepm/lcdm300/
#
/usr/local/bin/mpirun -np 14 \
-machinefile $PBS_NODEFILE \
./properties tpmsph_1.0000 16 300.0 \
0.3 0.0 0.7 > tpmsph_1.0000.prop
#

PBS_NODEFILE is an environment variable which will be set by PBS to point to a file which lists the nodes that that job has been allocated.

The format of the shell script is the usual. It starts with

#!/bin/bash

then you can have any number of lines beginning with #PBS. These are treated as comments by the shell, but contain information for PBS. The most common arguments are:

Following the PBS arguments are the shell script commands. In these exampels the scripts begin by "cd"ing to the working directory. As qsub remembers from where you submitted the job, this isn't strictly necessary. This is a useful technique, however, as the scripts and the relavant data often reside in different places.

Job Submission

In the above example, the job "prop.sh" runs a halo property finder on the 512^3 simulation using 14 processors (the whole cluster at present). The job "xi.sh" runs a serial code to compute the correlation function for a fake galaxy catalogue produced from this simulation. To run the latter, for example, one would type at the command line:

$ qsub xi.sh
99.master

When the job is successfully submitted, it will return a value that corresponds to the job ID numer (99 in this case). To observe your job in the queue type:

$ qstat
Job id           Name             User 
           Time Use S Queue
---------------- ---------------- -------- - -----
99.master        Xi(r)           
mwhite                  0 R dque

Job 99, named Xi(r) submitted mwhite is running. So far it has used 0 time, since it has just started. It has a status of R, running, and is doing so in queue dque. The queue to which jobs are submitted may be specified with the -q option. Since it wasn't specified, one may assume that dque is the default queue.

To delete the job from the queue one may use:


$ qdel 99

which, if successful, returns no output. One could also say "qdel 99.master". Now, if you try and view the queue, you will see that the queue is empty:

$ qstat

Stdout and Stderr

PBS creates files log files with messages pertaining to the run of the job. For example:

-rw-------    1 mwhite   mwhite  95 May  7 18:08 Xi(r).o99
-rw-------    1 mwhite   mwhite   0 May  7 18:07 Xi(r).e99

Containing stdout and stderr for job 99. The output contains simply:

[mwhite@master ~]$ more "Xi(r).o99"
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Terminated

This reveals, essentially, that the job was terminated. The first two lines always appear and can be safely ignored.

* above section submitted by Martin White

Section 6: Notes

6. 1 MPI

In order to increase the efficiency with which PBS processes batch jobs on a parallel network, other software may be brought into play. The two most popular opensource applications for this purpose are PVM and MPI. As was mentioned briefly in the introduction, MPICH has been introduced into the Aquila cluster in the hope that process communication via message passing will significantly speed up the system.

MPI must be built directly in the /usr/local directory in order for all of the nodes to properly import the execution binaries. There it will build a home directory and source tree. Thereafter some minor tweaking is required in order to for it to ``see'' the nodes upon which PBS will be running batch processes.

According to the official MPI documentation, every time MPI runs, it creates a new file of machine names to use just for that run, using a machines file in the MPI In addition you must edit the machines file in $mpihomedir$/util/machines/machines.LINUX to contain the information from the server_priv nodefile (e.g. the names of the nodes and how many processors they each contain). Here is an example:

node1:2
node2:2
node3:2
node4:2
node5:2
node6:2
node7:2
node8:2

If you encounter difficulty more than 16 processes in PBS, despite having an adequate number of servers built in qmgr, it could possibly be an issue with MPI. By default mpirun will use at most the number of processors specified in the machines list for each node (up to 16 per machine, 16 TOTAL in our case). This can be controlled/altered with the MPI_MAX_CLUSTER_SIZE environmental variable.

6.2 SSH

When PBS was first brought up on Aquila, ssh and secure copy were being used for intranetwork communication. This proved to be both cumbersome and complicated. Additionally, SSH and MPI did not play especially well together, despite efforts on behalf of the system administrators to put public and private keys in all the proper places. When all was said and done it proved easier to put rsh and rcp on the system and leave ssh and scp on master only for purposes of communication with the outside world.

6.3 MISC

The author is this document is aware of the existence of an .rpm package for installation on systems running RedHat, but has no advice to provide in that regard, as the Aquila cluster ultimately ran SuSE as a primary OS.

Section 7: Acknowledgments

The author of this document wishes to acknowledge and thank the endless patience and support of the Aquila users, all of whom have braved rough waters and trying times while the adminitrative staff earn its sea legs and struggled to set this new ship to rights.

Special thanks to Martin white, who authored the section on job submission, and whose suggestions where always at the ready when we encountered difficulty.

Furthmore, this system would not have been possible without the continuous support of the senior systems adminitrator of the Department of Astrophysics, Kelley McDonald, and Professor Marc Davis, without whom the project would never have come into being. Gentlemen, my this bird stay upright, and be swift, to better serve the needs of science now and in the future.

PLEASE MAINTAIN AND UPDATE THIS DOCUMENT.

Thank you and happy batching.