procstatd - process status daemon

Synopsis:

usage: procstatd [options]
     -d port             start up in 'daemon mode' with given port
     -v                  verbose output

Description:
procstatd comes from Rob Brown in the Physics Department at Duke University. It is a straightforward process status monitor that communicates over TCP. It mostly provides data from the /proc filesystem, but it can also be adapted to report other status information (e.g., temperature sensors or UPS monitors).

I have added two new commands to procstatd: quik and jobs. The quik command collects data only from /proc/stat and /proc/loadavg (on Linux systems). It is intended to reduce monitoring overhead so that a simple cluster load-balancing system can poll procstatd repeatedly without significantly loading the system. The jobs command is simply a remote ps utility. It replies with the number of jobs, followed by one line per job containing the contents of /proc/PID/status and /proc/PID/cmdline.
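
To give a feel for how little work the quik data requires, here is a minimal sketch in Python (not the C code inside procstatd) that derives CPU percentages from two samples of the aggregate cpu line in /proc/stat and reads the three load averages from /proc/loadavg. The function names are hypothetical, and only the first four counters on the cpu line (user, nice, system, idle) are used; any further counters that newer kernels append are ignored.

```python
def parse_cpu_line(stat_text):
    """Return (user, nice, system, idle) counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The aggregate line is labelled exactly "cpu"; per-CPU lines are cpu0, cpu1, ...
        if fields and fields[0] == "cpu":
            return tuple(int(x) for x in fields[1:5])
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before, after):
    """Turn two counter samples into percentages over the sampling interval."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    names = ("cpu_user", "cpu_nice", "cpu_sys", "cpu_idle")
    if total <= 0:
        # No time elapsed between the samples; report everything as zero.
        result = {name: 0.0 for name in names}
        result["cpu"] = 0.0
        return result
    result = {name: 100.0 * d / total for name, d in zip(names, deltas)}
    # Assumption: the overall "cpu" figure is everything that is not idle.
    result["cpu"] = 100.0 * (total - deltas[3]) / total
    return result


def parse_loadavg(loadavg_text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(x) for x in loadavg_text.split()[:3])
    return {"load1": one, "load5": five, "load15": fifteen}
```

On a Linux machine you would read /proc/stat twice, a second or so apart, pass both results through parse_cpu_line, and hand the two tuples to cpu_percentages.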

procstatd is written in C and is very easy to modify to suit your own cluster. In the past, I have also added a monitor to watch the auto-mount daemon (amd), since it was acting flaky for a while in our cluster and a machine was not really usable once amd went down. I have also added support for monitoring Unix IPC items (shared memory segments, semaphores, and message queues).

You should probably have procstatd start at boot, although starting the daemon manually works just as well. Since procstatd gives out potentially sensitive information about the system (especially with our modifications), you might want to use TCP wrappers (it works fine with them), ipchains, or iptables to limit which machines can access that port.

We have found procstatd to be extremely useful in debugging system problems. It will generally still respond even when other methods might fail, and the ability to get a remote ps of a wonky machine is certainly worth the price... which is free (it is released under the GPL). For more info, see: Rob Brown's procstatd page.

Options:
Usually, I run it with:

     % procstatd -d 7885

The "-d" option puts it into daemon mode. Note that the port number you use here MUST match the one in the cluster configuration file!
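
Once the daemon is listening, any TCP client can talk to it. The following is only a sketch of such a client in Python. The query function is hypothetical, and the wire details are assumptions on my part: that a command such as quik is sent as a single newline-terminated line of text, and that the reply can be collected by reading until the daemon closes the connection or the socket times out. Check the procstatd source for the actual protocol before relying on this.

```python
import socket


def query(host, port, command, timeout=2.0):
    """Send one command to a procstatd-style daemon and return the reply text.

    Assumption: commands are newline-terminated ASCII; this is not taken
    from the procstatd source.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii") + b"\n")
        try:
            while True:
                data = sock.recv(4096)
                if not data:
                    # The peer closed the connection; the reply is complete.
                    break
                chunks.append(data)
        except socket.timeout:
            # The daemon kept the connection open; keep what arrived so far.
            pass
    return b"".join(chunks).decode("ascii", errors="replace")
```

With the port used above, a call might look like query("beowulf1.ee.duke.edu", 7885, "quik").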

Output Data:

ident beowulf1.ee.duke.edu 152.3.196.135 0.00: basic identity of the machine; name and IP address
version 2.4.9-21smp 0.00: version of the OS kernel that is running; note that some sysadmins might consider this a security risk (if there is a known hole in 2.4.9, an attacker now knows how to attack this machine)
cpu 0.68: total cpu load
cpu_user 0.00: amount of cpu load that is caused by user jobs
cpu_nice 0.00: amount of cpu load that has been "niced" to a lower priority
cpu_sys 0.68: amount of cpu load that is due to system tasks (could be I/O being done for a user too)
cpu_idle 99.32: amount of cpu that is idle
cpu0 1.35
cpu0_user 0.00
cpu0_nice 0.00
cpu0_sys 1.35
cpu0_idle 98.65: on multi-processor systems, the numbered cpuN lines above are repeated for each CPU; they show the percentage of CPU in use on that single CPU only; the un-numbered cpu figures are aggregates for the whole system
load1 0.00: the 1 minute load average of the machine
load5 0.00: the 5 minute load average of the machine
load15 0.00: the 15 minute load average of the machine
proc 0.00
ctxt 18.00
swap 0.00: swap activity
swap_in 0.00
swap_out 0.00
page 0.00: paging activity
page_in 0.00
page_out 0.00
intr 174.00
mem_total 524.74: total memory available on the system
mem_used 344.65: amount of memory currently in use by the system
mem_free 180.10: amount of free memory
mem_shared 0.00: shared memory currently in use by the system
mem_buff 19.99
mem_cache 232.18
mem_swap_total 2146.75: total swap space available
mem_swap_used 80.28: swap space currently being used by the system
mem_swap_free 2066.47: swap space that is currently free
eth0 26.00: ethernet connection #0; similar stats will appear for any other ethernet connections on the system
eth0_err 0.00
eth0_rx 23.00
eth0_rx_err 0.00
eth0_tx 3.00
eth0_tx_err 0.00
users 0.00: number of users currently logged in
time 10:00am 1014130807.00: what the machine thinks the current time is
uptime 11d:11h:18m:48.51s 991128.51: amount of uptime the machine has accrued
shm_num 0.00: number of Unix IPC shared memory segments that are currently allocated
shm_tot 0.00: amount of memory being used by Unix IPC shared memory segments
sem_num 0.00: number of Unix IPC semaphore sets that are currently allocated
sem_tot 0.00: total number of Unix IPC semaphores
msg_num 0.00: number of Unix IPC message queues that are currently allocated
msg_tot 0.00: total amount of Unix IPC message queues
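
Judging from the sample values above, each statistic consists of a name, an optional text part (as in ident, version, time, and uptime), and a final numeric value. A monitoring client could split the lines along those lines; the Python sketch below does so. The function names are hypothetical, and I am assuming that the descriptions after the colons above are annotations for this page and not part of what the daemon sends.

```python
def parse_line(line):
    """Split one statistic line into (name, text, value).

    The first token is the name, the last token is the numeric value,
    and anything in between is kept as free text.
    """
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError("expected at least a name and a value: %r" % line)
    name = tokens[0]
    value = float(tokens[-1])
    text = " ".join(tokens[1:-1])
    return name, text, value


def parse_report(report):
    """Parse a multi-line report into a dict of name -> (text, value)."""
    stats = {}
    for line in report.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, text, value = parse_line(line)
        stats[name] = (text, value)
    return stats
```

For example, parsing the uptime line above yields the text "11d:11h:18m:48.51s" and the value 991128.51, which is the same uptime expressed in seconds.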

RCSID $Id: procstatd.html,v 1.3 2002/03/12 14:11:18 jpormann Exp $