Basic Information:
The Herdtools package is aimed at providing a set of user-level
tools with a simplified interface to control and run jobs across a
cluster of similar machines. We have not paid much attention to
portability, so these scripts have only been tested on Linux-based
clusters; however, they are fairly standard Perl scripts, so they
should run on many other platforms.
Here is a screen shot showing the webwatch interface. Each machine has a row in the table. The two meters show CPU and memory usage, the remaining columns show user-defined statistics (provided by Rob Brown's procstatd). Of course, the title area, the message of the day, etc., are also customizable. You can even change the meter images if you'd like.
The project is hosted at SourceForge, which provides us with email lists, discussion forums, bug tracking, and download facilities; see the Herdtools SourceForge page for these resources.
The Herdtools are released under the GNU General Public License (GPL). This license gives you the right to use, copy, and distribute original and modified versions of the code, provided that all such modifications also adopt the GPL. We offer no warranty for the Herdtools programs; they should be considered "AS IS", without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for any particular purpose.
Ok, now that the legal stuff is out of the way...
Basic Configuration:
If you grabbed this code via an RPM package, then all the pieces should
automatically go into the correct locations. If you just have source
code (Perl code), then you should put the programs into an easily
accessible directory, perhaps /usr/bin or /usr/local/bin.
Some of the tools are probably best described as "sys admin" tools since
they may have security or CPU-load implications (i.e., having
many users run them could cause the cluster to become overloaded);
these tools should probably be put in something like
/usr/sbin or /usr/local/sbin.
Some of the tools assume that each machine in the cluster has Rob Brown's procstatd daemon running on it. This daemon serves out basic information about the current status of the machine. We actually use a modified version of procstatd which returns additional information for a parallel ps-like utility. It is probably best to have procstatd run at start-up, although manually running the daemon will work fine too.
Some of the tools also use the webwatchd daemon to monitor the cluster. The webwatchd daemon contacts a set of remote procstatd's and routinely asks them for information. This information is then published as a web page (and as a raw-data page) that users can refer to. The idea here is that having dozens of users repeatedly ping every machine in the cluster for information may increase the network load dramatically; instead, all users can view the single webwatchd-created page, reducing overall network traffic. It is probably best to have webwatchd run at start-up, although running it manually will work fine too.
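To have the daemons come up at boot, one simple approach on many Linux systems is a couple of lines in a start-up script such as /etc/rc.d/rc.local. This is only a sketch: the install paths and the assumption that the daemons put themselves in the background are guesses, so check your local installation.

```shell
# /etc/rc.d/rc.local (fragment) -- start the Herdtools daemons at boot.
# The paths below are assumptions; adjust to wherever you installed them.

# procstatd: runs on every cluster node, serves out machine-status info
/usr/sbin/procstatd &

# webwatchd: runs on the monitoring host only; polls the procstatd's
# and publishes the status web page
/usr/sbin/webwatchd &
```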
The Herdtools generally all look to the same configuration file for basic info, called herdtools_site.pm. It is actually a Perl module file, so it is generally best to put it into /usr/lib/perl5/__version__ (there may be a site-specific subdirectory). In this file, you'll find a number of options for the configuration of the system itself as well as configuration for various helper programs.
Since it is a Perl module, you'll need to know just a hint of Perl syntax, most of which you can pick up from what is shown in the default configuration. Most configuration variables start with a dollar sign followed by the variable name; this is just basic Perl syntax for a scalar variable. Character strings can be enclosed in either double or single quotes. All statements end with a semi-colon. The %allhostinfo variable is a "hash table" and has its own syntax, explained below.
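A few lines of herdtools_site.pm might look like the following. The option names here are purely illustrative (the real ones are whatever ships in the default configuration file); the point is just the syntax.

```perl
# herdtools_site.pm (fragment) -- illustrative names only; use the
# option names from the default configuration shipped with Herdtools.
$remote_shell = "/usr/bin/ssh";   # a scalar holding a character string
$conn_timeout = 10;               # a scalar holding a number
# note that every statement ends with a semi-colon
```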
List of Tools:
Host Configuration:
The set of hosts in the cluster are defined in a Perl hash table
called "%allhostinfo". The percent sign indicates to Perl
that it is a hash; the rest of the syntax is straight Perl (although
it may be a bit cryptic to non-Perl programmers).
The basic syntax for this hash table looks like:
    %allhostinfo = (
        "cow1"  => [ "cow","mem1024","f90","gcc" ],
        "cow2"  => [ "cow","mem1024","f90","gcc" ],
        "cow3"  => [ "cow","mem1024","myprog" ],
        "cow4"  => [ "cow","mem1024" ],
        "calf1" => [ "calf","mem512","f90","gcc" ],
        "calf2" => [ "calf","mem512","f90","gcc" ],
        "calf3" => [ "calf","mem512","myprog" ],
        "calf4" => [ "calf","mem512" ]
    );

The first quoted string on each line is the name of a machine (cow1, cow2, etc.); the entries between the square brackets are the assigned "tags" for that host. One caveat to note: each host definition line ends with a comma except the last one, which is followed by the closing parenthesis and semi-colon.
Usually, the tags assigned to a machine indicate the type of system it is, what machines it is similar to, what machines it is different from, etc. This can be things like the amount of memory in the machine, licensed compilers, other special software, or other hardware that is unique to that machine. Or it could indicate that certain machines share an ethernet switch or UPS. These tags can later be used to select all hosts that have a Fortran 90 compiler, or all hosts that have large memory spaces. Note that at this point, tags are really just character strings -- so defining "mem1024" is useful to a human reader, but the system cannot determine that "mem1024" is bigger than "mem512". You can remove all of these if you wish, or you can get more and more detailed if you have many software packages that are installed on selected machines. See below for more details on why this might be useful.
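To make the tag idea concrete, here is a small, stand-alone Perl sketch (not Herdtools code itself) showing how a single tag can be used to pick hosts out of a hash table shaped like %allhostinfo:

```perl
#!/usr/bin/perl
# Stand-alone sketch: select all hosts carrying a given tag.
use strict;
use warnings;

my %allhostinfo = (
    "cow1"  => [ "cow","mem1024","f90","gcc" ],
    "cow3"  => [ "cow","mem1024","myprog" ],
    "calf1" => [ "calf","mem512","f90","gcc" ],
    "calf4" => [ "calf","mem512" ],
);

# keep only hosts whose tag list contains "f90"
my @f90_hosts = sort grep {
    my $tags = $allhostinfo{$_};         # array-ref of tags for this host
    grep { $_ eq "f90" } @$tags;         # true if any tag matches
} keys %allhostinfo;

print "@f90_hosts\n";                    # prints "calf1 cow1"
```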
Configuration File:
There are several basic system configuration options that you may want
to set, some for security reasons, others for more robustness in
connecting to remote machines. Note that the top part of the file
should not be tampered with -- the "package", "use",
"@ISA", and "@EXPORT" lines must be there in order
for Perl to import this into the herdtools programs.
Commandline Options to all Herdtools Commands:
To make matters easier, we have included a set of basic options that all
Herdtools commands will parse. These include mechanisms to select groups
of machines out of the entire cluster.
Basic options:
Advanced options:
Using Host Tags:
Host tags are simply an attempt to make life easier,
to allow quick access to common sets of machines. If several machines
have Fortran compilers and the rest do not, the tags provide a quick
means of running compilation jobs on the compilation-capable machines.
Similarly, you may have some machines with more memory than others.
Rather than remembering what hosts those are, you can simply select
all "mem1024" machines when you need to run a large memory job.
If a UPS system fails, you may want to quickly shut down all machines
that have tag "ups1". And of course, if software has only
been licensed for N machines, you can quickly request that
your job be run on one of those machines by selecting the appropriate
tag.
When you use the "-i tag" option, your hostlist will contain those machines that are marked with the given tag (think "include machines with...").
When you use the "-e tag" option, your hostlist will be re-scanned and those machines that are marked with the given tag will be removed (think "exclude machines with..."). Note that since this removes hosts from an assumed previously defined hostlist, you may want to use "-e" with the "-A" option to start with ALL hosts, then whittle down the list.
When using multiple "-i" or "-e" options, we tried to program for the usual case. For multiple "-i" options, hosts are selected only if they match ALL "include" tags. For multiple "-e" options, hosts are removed if they match ANY "exclude" tag. So if you have a set of "big memory" machines that you need to replace a compiler library on, you can use "-i fatmem -i f90" -- this will include only those machines that have lots of memory and have the Fortran compiler installed. Since Matlab tends to hog CPU resources, you may want to search for machines which have lots of memory, but do not have Matlab installed -- "-i fatmem -e matlab".
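The ALL-includes / ANY-excludes rule described above can be sketched in a few lines of stand-alone Perl (again, not the actual Herdtools implementation, just the semantics):

```perl
#!/usr/bin/perl
# Stand-alone sketch: hosts must match ALL include tags and
# NONE of the exclude tags (the documented -i / -e semantics).
use strict;
use warnings;

my %allhostinfo = (
    "cow1"  => [ "fatmem","f90","matlab" ],
    "cow2"  => [ "fatmem","f90" ],
    "calf1" => [ "f90" ],
);
my @include = ("fatmem", "f90");   # like: -i fatmem -i f90
my @exclude = ("matlab");          # like: -e matlab

my @hosts = sort grep {
    my %tags = map { $_ => 1 } @{ $allhostinfo{$_} };
    # every include tag must be present ...
    ( !grep { !$tags{$_} } @include )
    # ... and no exclude tag may be present
    && ( !grep { $tags{$_} } @exclude );
} keys %allhostinfo;

print "@hosts\n";                  # prints "cow2"
```

cow1 is dropped by the matlab exclude, calf1 by the missing fatmem include, leaving cow2.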
Examples:
Parallel make:
We have included a command called "cl_run" which can be used
to craft a parallel make. Instead of using the "gcc"
command to compile, you tell make to use cl_run
(and then tell cl_run to execute the gcc command).
    % make -j 8 CC="cl_run -- gcc"

The "-j 8" option tells make to use at most 8 child processes in parallel (you can set this higher, if you have more compilation hosts). The "CC=..." part tells make what command to run for the C compiler. Note that we are assuming here that gcc is installed everywhere (since it is free), thus no tags are needed to select a compilation host. Also note that the "--" argument is sent to cl_run to separate out cl_run-specific arguments from the user program (compiler) arguments.
Why the name? Well, you have a "server farm" don't you? So these are tools to help tame your "herd"... ha ha? (I know, I know, don't quit the day job)
RCSID $Id: herdtools.html,v 1.4 2002/03/12 14:27:21 jbp4444 Exp jbp4444 $