procstatd - process status daemon
Synopsis:
usage: procstatd [options]
-d port start up in 'daemon mode' with given port
-v verbose output
Description:
procstatd comes from Rob Brown in the Physics Department at
Duke University. It is a straightforward process status monitor
that communicates over TCP. It mostly provides data from the
/proc filesystem, but it can also be adapted to look for
other status information (e.g., temperature sensors or UPS monitors).
I have added two new commands to procstatd: quik and
jobs. The quik command collects data only from
/proc/stat and /proc/loadavg (on Linux systems) and
is intended to reduce overhead of the monitoring so that a simple
cluster load balancing system can repeatedly ping procstatd
without significantly overloading the system. The jobs
command is simply a remote ps utility. It replies with the
number of jobs and then 1 line per job containing the contents of
/proc/PID/status and /proc/PID/cmdline.
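As a rough illustration of the kind of work quik does with /proc/stat,
the sketch below derives CPU percentages from two samples of a "cpu"
line. This is not procstatd's actual C code; the function names are
hypothetical, and it assumes the first four counters on the line are
user, nice, system, and idle ticks.

```python
def parse_cpu_line(line):
    """Split a 'cpu ...' line from /proc/stat into (label, [tick counters])."""
    fields = line.split()
    return fields[0], [int(x) for x in fields[1:]]


def cpu_percentages(before, after):
    """Return user/nice/sys/idle percentages between two samples of one cpu line.

    ASSUMPTION: the first four counters are user, nice, system, idle;
    any further counters on the line are ignored.
    """
    names = ("user", "nice", "sys", "idle")
    _, a = parse_cpu_line(before)
    _, b = parse_cpu_line(after)
    deltas = [y - x for x, y in zip(a[:4], b[:4])]
    total = sum(deltas)
    if total == 0:
        # No ticks elapsed between the samples, so nothing can be computed.
        return {n: 0.0 for n in names}
    return {n: 100.0 * d / total for n, d in zip(names, deltas)}
```

In practice the two samples would be the same line read from /proc/stat
a short interval apart; the per-CPU lines (cpu0, cpu1, ...) can be
handled the same way.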
procstatd is written in C and is very easy to modify to suit
your own cluster. In the past, I added a monitor to watch the
auto-mount daemon (amd), since it was flaky in our cluster for a
while and a machine was not really usable once amd went down.
I also added support for monitoring Unix IPC items (shared memory
segments, semaphores, and message queues).
You should probably have procstatd run at boot-up, although
manually starting the daemon works just as well. Since procstatd
gives out potentially sensitive information about the system
(especially with our modifications), you might want
to use TCP wrappers (it works fine with them), ipchains, or
iptables to limit which machines can connect to the procstatd
port.
We have found procstatd to be extremely useful in debugging
system problems. It will generally still respond even when other
methods might fail, and the ability to get a remote ps of a
wonky machine is certainly worth the price... which is free (it is
released under the GPL).
For more info, see Rob Brown's procstatd page.
Options:
Usually, I run it with:
% procstatd -d 7885
The "-d" option puts it into daemon mode.
Note that the port number you use here MUST match the one
in the cluster configuration file!
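With the daemon listening on a port as above, a client can query it
over TCP. The following is only a minimal sketch: the function name is
hypothetical, and the wire details are ASSUMPTIONS not confirmed by this
page (the command is sent as a newline-terminated word, and the reply is
read until the daemon closes the connection or goes quiet).

```python
import socket


def query_procstatd(host, port=7885, command="quik", timeout=5.0):
    """Send one command to procstatd and return the reply as a list of lines.

    ASSUMPTION: commands are newline-terminated ASCII words, and the
    reply ends when the daemon closes the connection or stops sending.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii") + b"\n")
        try:
            while True:
                data = sock.recv(4096)
                if not data:
                    break  # connection closed by the daemon
                chunks.append(data)
        except socket.timeout:
            pass  # treat a quiet connection as the end of the reply
    return b"".join(chunks).decode("ascii", errors="replace").splitlines()
```

For example, query_procstatd("beowulf1.ee.duke.edu", 7885, "jobs") would
request the remote ps listing, provided the assumptions above hold for
your build of the daemon.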
Output Data:
- ident beowulf1.ee.duke.edu 152.3.196.135 0.00
- basic identity of the machine; name and IP address
- version 2.4.9-21smp 0.00
- version of the OS that is running; note that some sys admins might
consider this to be a security risk (if there is a hole in v.2.4.9,
then you now know how to attack this machine)
- cpu 0.68
- total cpu load
- cpu_user 0.00
- amount of cpu load that is caused by user jobs
- cpu_nice 0.00
- amount of cpu load that has been "niced" to a lower priority
- cpu_sys 0.68
- amount of cpu load that is due to system tasks (could be I/O being
done for a user too)
- cpu_idle 99.32
- amount of cpu that is idle
- cpu0 1.35
- cpu0_user 0.00
- cpu0_nice 0.00
- cpu0_sys 1.35
- cpu0_idle 98.65
- for multi-processor systems, you will see these lines repeated; they
show the percent of CPU being used for that single CPU only; the
non-numbered cpu figures are aggregate for the whole system
- load1 0.00
- the 1 minute load average of the machine
- load5 0.00
- the 5 minute load average of the machine
- load15 0.00
- the 15 minute load average of the machine
- proc 0.00
-
- ctxt 18.00
- context switch activity
- swap 0.00
- swap activity
- swap_in 0.00
- swap-in activity
- swap_out 0.00
- swap-out activity
- page 0.00
- paging activity
- page_in 0.00
- page-in activity
- page_out 0.00
- page-out activity
- intr 174.00
- interrupt activity
- mem_total 524.74
- total memory available on the system
- mem_used 344.65
- amount of memory currently in use by the system
- mem_free 180.10
- amount of free memory
- mem_shared 0.00
- shared memory currently in use by the system
- mem_buff 19.99
- memory currently used for buffers
- mem_cache 232.18
- memory currently used for cache
- mem_swap_total 2146.75
- total swap space available
- mem_swap_used 80.28
- swap space currently being used by the system
- mem_swap_free 2066.47
- swap space that is currently free
- eth0 26.00
- ethernet connection #0; similar stats will appear for any other
ethernet connections on the system
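- eth0_err 0.00
- errors on eth0
- eth0_rx 23.00
- receive traffic on eth0
- eth0_rx_err 0.00
- receive errors on eth0
- eth0_tx 3.00
- transmit traffic on eth0
- eth0_tx_err 0.00
- transmit errors on eth0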
- users 0.00
- number of users currently logged in
- time 10:00am 1014130807.00
- what the machine thinks the current time is; the numeric value is
seconds since the Unix epoch
- uptime 11d:11h:18m:48.51s 991128.51
- amount of uptime the machine has accrued; the numeric value is the
same duration in seconds
- shm_num 0.00
- number of Unix IPC shared memory segments that are currently allocated
- shm_tot 0.00
- amount of memory being used by Unix IPC shared memory segments
- sem_num 0.00
- number of Unix IPC semaphore sets that are currently allocated
- sem_tot 0.00
- total number of Unix IPC semaphores
- msg_num 0.00
- number of Unix IPC message queues that are currently allocated
- msg_tot 0.00
- total amount of Unix IPC message queues
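Going by the sample output above, each line is a field name, optional
descriptive text, and a trailing numeric value. A small parser along
those lines might look like the sketch below. The function names are
hypothetical, and the "name [text] value" layout is inferred from the
sample rather than taken from the procstatd source.

```python
def parse_status_line(line):
    """Parse one 'name [text ...] value' line into (name, text, value).

    ASSUMPTION: the first token is the field name and the last token is
    always numeric, as in the sample output.
    """
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError(f"not a status line: {line!r}")
    name = tokens[0]
    value = float(tokens[-1])
    text = " ".join(tokens[1:-1])  # empty for purely numeric fields
    return name, text, value


def parse_status(lines):
    """Build a dict mapping field name -> (text, value), skipping blank lines."""
    fields = {}
    for line in lines:
        if not line.strip():
            continue
        name, text, value = parse_status_line(line)
        fields[name] = (text, value)
    return fields
```

A simple load balancer could, for instance, parse the reply from each
node and compare the load1 values to pick the least busy machine.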
RCSID $Id: procstatd.html,v 1.3 2002/03/12 14:11:18 jpormann Exp $