Sponsored by the BMBF

Installation Instructions for Ganglia and MDS4 (monitoring)

Contents:

  1. Overview
  2. Build Ganglia
  3. Install Ganglia
  4. Test Ganglia
  5. Configure Ganglia as a System Service
  6. Configure the Globus Toolkit for Ganglia
  7. Configure Globus MDS
  8. Isolating or Connecting Machines
  9. Start Globus and Test

1. Overview

A general overview of Ganglia and its combination with Globus can be found at IBM:
Maximize your grid potential, Part 1: Ganglia.

In the following, only the monitoring daemon gmond is used. The Ganglia Meta Daemon gmetad, which is better suited to observing entire cluster federations, is not considered further here; MDS4 reads the output of gmond.

First, we need the current Ganglia sources. The installation archives are available from SourceForge via
http://ganglia.info/ (ganglia-3.0.7.tar.gz). These instructions have been tested with Ganglia 3.0.7.

If your host is running a firewall, note that gmond uses the default port 8649 (both TCP and UDP) to gather and serve hardware statistics; this port must be open on the host, but it need not be reachable from the Internet.
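On a Linux host, the corresponding firewall rules might look like the following sketch in iptables-save format (the chain name INPUT and the file location /etc/sysconfig/iptables are Red Hat defaults; adapt both to your distribution):

```
# illustrative fragment for /etc/sysconfig/iptables: accept gmond traffic
-A INPUT -p tcp --dport 8649 -j ACCEPT
-A INPUT -p udp --dport 8649 -j ACCEPT
```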

2. Build Ganglia

To build the software, start in the Globus directory. As user globus:

  • cd /work1/globus/
  • tar xvfz /tmp/ganglia-3.0.x.tar.gz

This will unpack the archive into a directory ganglia-3.0.x/. We now change to this directory:

  • cd ganglia-3.0.x/

All commands that follow are assumed to be executed from within this directory.

The Globus helper package contains a script to configure Ganglia. Suppose that the package has already been unpacked into a subdirectory globus-helper in the globus user's home directory:

  • cp ~/globus-helper/globus-install/ganglia.cfg .
  • sh -x ganglia.cfg

Edit the file

 gmond/gmond.init

and replace the line

 GMOND=/usr/sbin/gmond

with

 GMOND=/usr/local/globus/ganglia/sbin/gmond
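The same replacement can also be made non-interactively. A minimal sketch of the sed expression, demonstrated here on a sample line rather than on the real file:

```shell
# the substitution, shown on sample input; to apply it in place, run e.g.
#   sed -i 's|^GMOND=.*|GMOND=/usr/local/globus/ganglia/sbin/gmond|' gmond/gmond.init
printf 'GMOND=/usr/sbin/gmond\n' |
  sed 's|^GMOND=.*|GMOND=/usr/local/globus/ganglia/sbin/gmond|'
```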

Now build Ganglia:

  • make

3. Install Ganglia

The installation of Ganglia is done as user root:

  • make install

This will install files under /usr/local/globus/ganglia/. The libraries lib/libganglia* and the following files should now exist there:
 include/ganglia.h
 bin/ganglia-config
 bin/gmetric
 bin/gstat
 sbin/gmond

Create a configuration file as user root:

  • /usr/local/globus/ganglia/sbin/gmond -t > /etc/gmond.conf

Edit the file /etc/gmond.conf and fill in the fields "name", "owner", and "latlong".
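The relevant part of /etc/gmond.conf then looks roughly as follows; the values here are taken from the sample output in the next section and must be replaced with your own site data:

```
cluster {
  name = "AIP workstation cashmere"
  owner = "AIP"
  latlong = "N52.4040 E13.1022"
  url = "unspecified"
}
```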

4. Test Ganglia

It should now be possible to execute gmond.

  • /usr/local/globus/ganglia/sbin/gmond

Then gmond should be listening on port 8649:

  • telnet localhost 8649

The XML output may look somewhat messy, but it is easy for machines to parse. If gmond is already running on other computers in the local network segment, be prepared to see output about those machines in addition to the local one. This can lead to problems later when MDS4 processes the output, and it may require changes to the gmond configuration; see section 8 below.

  <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
   <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
      <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
      <!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
   ...
]>
<GANGLIA_XML VERSION="3.0.5" SOURCE="gmond">
<CLUSTER NAME="AIP workstation cashmere" LOCALTIME="1193928615" OWNER="AIP" LATLONG="N52.4040 E13.1022" URL="unspecified">
<HOST NAME="cashmere.aip.de" IP="141.33.4.98" REPORTED="1193928599" TN="16" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1193928579">
<METRIC NAME="disk_total" VAL="339.425" TYPE="double" UNITS="GB" TN="27" TMAX="1200" DMAX="0" SLOPE="both" SOURCE="gmond"/>
...
</HOST>
</CLUSTER>
</GANGLIA_XML>

5. Configure Ganglia as a System Service

It is recommended to stop any running gmond (as user root):

  • gmond/gmond.init stop

and to install it permanently as a service (MDS4 works best with gmond being installed as a service):

  • cp gmond/gmond.init /etc/rc.d/init.d/gmond
  • /sbin/chkconfig --add gmond
  • /sbin/chkconfig --list gmond
  • /etc/rc.d/init.d/gmond start

6. Configure the Globus Toolkit for Ganglia

We now configure MDS4 to analyze the gmond output.

The configuration of the Globus Toolkit for Ganglia depends on the version installed.

a) For Globus Toolkit version < 4.0.5

Edit the file

  $GLOBUS_LOCATION/etc/globus_wsrf_mds_usefulrp/gluerp.xml

and replace the  "defaultProvider" line with

  <defaultProvider>java org.globus.mds.usefulrp.glue.GangliaElementProducer</defaultProvider>

b) For Globus Toolkit version ≥ 4.0.5

Globus 4.0.5+ uses the Resource Property Provider component of the UsefulRP subsystem to publish, over MDS, information about the specific grid services available on the resource. It comes with a tool mds-gluerp-configure that writes the settings files correctly. The basic configuration, providing Ganglia information and a 'fork' job submission, is generated with two separate commands:

  • mds-gluerp-configure none ganglia $GLOBUS_LOCATION/etc/globus_wsrf_mds_index/ganglia-config.xml

    Successfuly wrote configuration output file to: /usr/local/globus/gtk/etc/globus_wsrf_mds_index/ganglia-config.xml

  • mds-gluerp-configure fork ganglia $GLOBUS_LOCATION/etc/gram-service-Fork/gluerp-config.xml

    Successfuly wrote configuration output file to: /usr/local/globus/gtk/etc/gram-service-Fork/gluerp-config.xml

7. Configure Globus MDS

To configure Globus for MDS, edit the file

 $GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd

and insert the following lines into the "<globalConfiguration>" section:

 <parameter name="logicalHost" value="myhost.domain.de"/>
 <parameter name="publishHostName" value="true"/>

where myhost.domain.de is to be replaced by the fully qualified domain name of the machine that will run gmond.

For the MDS upload, edit the file

 $GLOBUS_LOCATION/etc/globus_wsrf_mds_index/hierarchy.xml

un-comment the existing commented-out "<upstream>" section and replace its contents with

 https://astrogrid-mds.aip.de:8443/wsrf/services/DefaultIndexService
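After the edit, the relevant part of hierarchy.xml should look roughly like this (an illustrative sketch; the surrounding comments and elements in the file may differ between Globus versions):

```
<upstream>
  https://astrogrid-mds.aip.de:8443/wsrf/services/DefaultIndexService
</upstream>
```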

8. Isolating or Connecting Machines

Ganglia's normal behavior is to exchange information between machines running Ganglia over UDP; the channels are set up in the /etc/gmond.conf sections udp_send_channel and udp_recv_channel.

If another machine on the same network segment is already running gmond, the two daemons will automatically exchange and report each other's data. This automatic exchange over the UDP channel is sufficient for simple Ganglia usage, but it is not sufficient for MDS4; the grouping of machines should instead be done later using MDS4.

As an example, to prevent communication with other gmond daemons when problems with MDS4 monitoring are encountered, the /etc/gmond.conf sections udp_send_channel and udp_recv_channel can be modified by changing the mcast_join addresses and the bind address to a value that differs from the one used in the other gmond configurations.
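As a sketch, assuming the other gmond daemons use the shipped default multicast address 239.2.11.71, both channels could be moved to a deviating address (239.2.11.99 here is purely illustrative; pick any otherwise unused multicast address):

```
udp_send_channel {
  mcast_join = 239.2.11.99
  port = 8649
  ttl = 1
}

udp_recv_channel {
  mcast_join = 239.2.11.99
  port = 8649
  bind = 239.2.11.99
}
```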

In the future, more advanced methods will be developed for monitoring real clusters, so that complex gmond output can be processed without error by MDS4.

For now, more than one HOST entry may cause problems:

  • telnet localhost 8649 | grep "HOST NAME=" | wc -l

should ideally be equal to 1.
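The counting step itself can be tried against a saved snapshot. In the following sketch, a here-document with placeholder host data stands in for real output captured from port 8649, and grep -c is used as shorthand for grep | wc -l:

```shell
# create a placeholder snapshot; in practice, capture the real output of
# gmond on port 8649 instead (e.g. via telnet localhost 8649)
cat > /tmp/gmond-snapshot.xml <<'EOF'
<GANGLIA_XML VERSION="3.0.7" SOURCE="gmond">
<CLUSTER NAME="example" OWNER="example" LATLONG="N0 E0" URL="unspecified" LOCALTIME="0">
<HOST NAME="host1.example.org" IP="192.0.2.1" REPORTED="0" TN="0" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="0">
</HOST>
</CLUSTER>
</GANGLIA_XML>
EOF
# count the HOST entries; for MDS4 the result should ideally be 1
grep -c 'HOST NAME=' /tmp/gmond-snapshot.xml
```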

9. Start Globus and test

Start the Globus container:

  • /etc/init.d/globus restart

An example of the contents of the log file $GLOBUS_LOCATION/var/container.log after a correct setup of Globus for Ganglia is shown at the end of this section.

MDS4 and Ganglia should be communicating. We can verify this with the following query:

  • wsrf-query -a -z none -s https://127.0.0.1:8443/wsrf/services/DefaultIndexService

The answer may take a few seconds, but if MDS4 can parse the Ganglia output correctly, we should receive the names of the computers and many details about the processor, main memory, disk space, operating system, load, etc. If information is missing from the output, MDS4 has a problem; possibly one of the issues mentioned above is blocking the output.

As an example, this is a fragment of the correct output of MDS4:

 
<ns11:AggregatorData>
<ns1:GLUECE xmlns:ns1="http://mds.globus.org/glue/ce/1.1">
<ns1:Cluster ns1:Name="Astrogrid-D" ns1:UniqueID="Astrogrid-D">
<ns1:SubCluster ns1:Name="main" ns1:UniqueID="main">
<ns1:Host ns1:Name="mohair.aip.de" ns1:UniqueID="mohair.aip.de">
<ns1:Processor ns1:CacheL1="0" ns1:CacheL1D="0" ns1:CacheL1I="0" ns1:CacheL2="0"
ns1:ClockSpeed="3206" ns1:InstructionSet="x86"/>
<ns1:MainMemory ns1:RAMAvailable="9" ns1:RAMSize="503" ns1:VirtualAvailable="1984"
ns1:VirtualSize="2612"/>
<ns1:OperatingSystem ns1:Name="Linux" ns1:Release="2.6.9-22.0.1.EL"/>
<ns1:Architecture ns1:SMPSize="1"/>
<ns1:FileSystem ns1:AvailableSpace="15633" ns1:Name="entire-system"
ns1:ReadOnly="false" ns1:Root="/" ns1:Size="27047"/>
<ns1:NetworkAdapter ns1:IPAddress="141.33.4.99" ns1:InboundIP="true" ns1:MTU="0"
ns1:Name="mohair.aip.de" ns1:OutboundIP="true"/>
<ns1:ProcessorLoad ns1:Last15Min="17" ns1:Last1Min="55" ns1:Last5Min="32"/>
</ns1:Host>
</ns1:SubCluster>
</ns1:Cluster>
<ns1:ComputingElement ns1:Name="default" ns1:UniqueID="default">
<ns1:Info ns1:TotalCPUs="1"/>
<ns1:State ns1:EstimatedResponseTime="0" ns1:FreeCPUs="1" ns1:RunningJobs="0"
ns1:Status="enabled" ns1:TotalJobs="0" ns1:WaitingJobs="0" ns1:WorstResponseTime="0"/>
<ns1:Policy ns1:MaxCPUTime="0" ns1:MaxRunningJobs="0" ns1:MaxTotalJobs="0"
ns1:MaxWallClockTime="0" ns1:Priority="0"/>
</ns1:ComputingElement>
</ns1:GLUECE>
</ns11:AggregatorData>

Note especially the strings Processor, ProcessorLoad, MainMemory, OperatingSystem, Architecture, FileSystem. These are derived from Ganglia information read by MDS.

Finally, in the file $GLOBUS_LOCATION/var/container.log, it is normal to see, after the listing of SOAP services, lines like

  2008-06-17 12:14:46,580 INFO  impl.DefaultIndexService [ServiceThread-43,processConfigFile:107] Reading default registration configuration from file: /usr/local/globus/gtk/etc/globus_wsrf_mds_index/hierarchy.xml
2008-06-17 12:14:46,588 INFO  impl.DefaultIndexService
[ServiceThread-43,performDefaultRegistrations:193] Processing upstream registration to https://astrogrid-mds.aip.de:8443/wsrf/services/DefaultIndexService