Sponsored by BMBF

Clusterfinder

Clusterfinder is a use case within the AstroGrid-D project that tests the deployment and performance of a typical data-intensive astrophysical application. The algorithm for any point in the sky depends only on data from nearby points, so the data access and calculation can easily be parallelized, making Clusterfinder well suited for production on the grid. In recent years astronomy has shifted away from the study of individual or unusual objects toward the statistics of large numbers of objects, observed at a variety of wavelengths across the electromagnetic spectrum, so the techniques developed for Clusterfinder are applicable to many cutting-edge astronomical studies.

The scientific purpose of Clusterfinder is to reliably identify clusters of galaxies by correlating their signature in X-ray images with that in catalogs of optical observations.

Cosmology and galaxy clusters

After the Big Bang, matter collapsed into objects of various sizes. Gas collected into stars, stars into galaxies, but the largest such structures are clusters of hundreds of galaxies. Between the galaxies in a cluster is an ionized gas, which is so hot that it emits primarily X-rays. Clusters are ideal tracers of the large-scale structure of the universe, so the study of the properties of large numbers of clusters can yield answers to fundamental questions of cosmology.

There are a number of ways in which clusters can be observed. The most obvious is to find galaxies with an optical telescope and then look for areas of the sky with an unusually large number of galaxies. This method occasionally goes wrong because the galaxies may actually be spread out along the line of sight rather than gathered in a compact cluster. Another method is to observe the X-ray emission of the gas between the galaxies. This is also not entirely reliable because there are many sources of X-rays besides clusters. To provide a more reliable identification of clusters over a large fraction of the sky, the "clusterfinder" methodology was developed at the Max-Planck-Institut für extraterrestrische Physik. The theory of point processes is applied to calculate the statistical "likelihood" of a cluster at any point in space, first using the galaxies from SDSS (the largest existing catalog of galaxies, covering a fifth of the sky and containing nearly 2 million galaxies) and then using the X-ray photons from RASS (the largest record of astronomical X-ray observations, documenting 150,000 X-ray sources). Since a peak in one of these data sets is probably a false positive unless there is also a peak in the other, the likelihoods from the two data sets are multiplied together, and the peaks in the combined likelihood are then extracted into a catalog of galaxy clusters.
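The combination step can be illustrated with a minimal sketch. It assumes the two likelihoods have already been evaluated on the same grid, and it uses a small 2-D grid and a simple "greater than all neighbours" peak criterion purely for illustration. The function names, the threshold, and the peak criterion are assumptions of this sketch, not the actual Clusterfinder implementation, which is a Fortran program working on sky coordinates and redshift.

```python
def combine_likelihoods(optical, xray):
    """Multiply two equally shaped 2-D likelihood maps cell by cell."""
    if len(optical) != len(xray) or any(
        len(a) != len(b) for a, b in zip(optical, xray)
    ):
        raise ValueError("likelihood maps must have the same shape")
    return [
        [o * x for o, x in zip(row_o, row_x)]
        for row_o, row_x in zip(optical, xray)
    ]


def find_peaks(grid, threshold):
    """Return (row, col, value) for every cell that exceeds the threshold
    and is strictly greater than all of its (up to 8) neighbours."""
    peaks = []
    n_rows = len(grid)
    for i, row in enumerate(grid):
        for j, value in enumerate(row):
            if value <= threshold:
                continue
            neighbours = [
                grid[a][b]
                for a in range(max(i - 1, 0), min(i + 2, n_rows))
                for b in range(max(j - 1, 0), min(j + 2, len(row)))
                if (a, b) != (i, j)
            ]
            if all(value > n for n in neighbours):
                peaks.append((i, j, value))
    return peaks


# Toy maps: the optical map has two peaks, the X-ray map only one.
optical = [
    [0.125, 0.125, 0.125, 0.125, 0.125],
    [0.125, 1.0, 0.125, 0.75, 0.125],
    [0.125, 0.125, 0.125, 0.125, 0.125],
]
xray = [
    [0.125, 0.125, 0.125, 0.125, 0.125],
    [0.125, 0.5, 0.125, 0.125, 0.125],
    [0.125, 0.125, 0.125, 0.125, 0.125],
]
combined = combine_likelihoods(optical, xray)
cluster_candidates = find_peaks(combined, threshold=0.25)
```

In this toy example the optical peak at column 3 has no X-ray counterpart, so its combined value falls below the threshold and only the peak present in both maps survives as a cluster candidate.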

Deployment, parallel computation, and logistics in the grid

Clusterfinder is implemented as a Fortran 90 program, which takes as input, in addition to a model of cosmology and galaxy clusters, the grid of sky coordinates (right ascension and declination) and redshifts (a measure of distance from the Earth) on which the likelihood is to be calculated. Since the likelihood of a cluster based on data from one part of the sky is independent of the likelihood somewhere else, the algorithm can be trivially parallelized. This is fortunate because the calculations can be quite intensive. Scanning all of the available data will require about 20,000 CPU-hours. While this would take over two years on a single processor, the use of clusters and/or the grid can provide access to hundreds of processors at the same time, reducing the time to finish a whole-sky calculation to several days. An exploratory calculation on a smaller area, which might take a month on a single computer, can be done overnight on the grid.
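The arithmetic behind these estimates, and the idea of cutting the sky grid into independent pieces, can be sketched as follows. The figure of 200 processors, the patch size, and the function names are assumptions chosen for illustration. The rectangular tiling in right ascension and declination ignores the fact that such tiles do not cover equal areas on the sky, and perfect parallel scaling is assumed.

```python
import math


def wall_clock_days(cpu_hours, n_processors):
    """Idealized run time in days, assuming perfect parallel scaling."""
    return cpu_hours / n_processors / 24.0


def split_into_patches(ra_min, ra_max, dec_min, dec_max, patch_deg):
    """Tile a rectangle in (RA, Dec) into patches at most patch_deg wide.

    Returns a list of (ra_lo, ra_hi, dec_lo, dec_hi) tuples in degrees.
    """
    n_ra = math.ceil((ra_max - ra_min) / patch_deg)
    n_dec = math.ceil((dec_max - dec_min) / patch_deg)
    patches = []
    for i in range(n_ra):
        for j in range(n_dec):
            ra_lo = ra_min + i * patch_deg
            dec_lo = dec_min + j * patch_deg
            patches.append(
                (
                    ra_lo,
                    min(ra_lo + patch_deg, ra_max),
                    dec_lo,
                    min(dec_lo + patch_deg, dec_max),
                )
            )
    return patches


single_cpu_years = wall_clock_days(20_000, 1) / 365.0  # roughly 2.3 years
grid_days = wall_clock_days(20_000, 200)               # roughly 4.2 days
patches = split_into_patches(0.0, 10.0, 0.0, 5.0, patch_deg=2.5)
```

With 20,000 CPU-hours, one processor needs about 2.3 years, while 200 processors would need a little over four days, consistent with the estimates above. Each patch returned by `split_into_patches` could become one independent job.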

In order to run a program on 100 machines, it must first be deployed on 100 machines. This can be a complex process in the best of circumstances, involving transfer of source code (or the appropriate binary), scripts, and configuration files, identification of the location of libraries, services, and commands, and setting of environment variables. The grid by definition is composed of heterogeneous machines, making the process much harder. To control the complexity of this process, two systems have been developed in AstroGrid-D: grid-modules and environments. Grid-modules encapsulates the diversity of various applications and provides a uniform interface to the user for common processes like installing, updating, and compiling. Because the source code is kept in a Subversion repository, updates to the latest version or to a defined production version are easily performed. Environments, with a similar philosophy, encapsulates the diversity of the calculation hosts in the grid. With these systems, Clusterfinder can be brought into a state capable of production on many machines with a minimum of manual effort. A job, consisting of the calculation of a likelihood map on a certain patch of sky with a certain set of parameters, can then be submitted as a globus job to one of these grid hosts from any grid client (globus, CoG-kit, gsi-sshterm). Finally, the results are collected by a central host from the grid hosts, either through the post-staging capabilities of globus or by direct grid transfer using globus-url-copy.
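As a rough illustration of the last step, the following sketch builds (but does not execute) a globus-url-copy command that would pull one result file from a grid host to the central host. It relies only on the basic `globus-url-copy <source-url> <destination-url>` form with a `gsiftp://` source and a `file://` destination. The host name, the paths, and the helper name `build_collect_command` are hypothetical, and a real transfer would additionally need valid grid credentials.

```python
def build_collect_command(grid_host, remote_path, local_path):
    """Build the argument list for copying one result file with
    globus-url-copy from a grid host to the local (central) host.

    Both paths must be absolute. Nothing is executed here.
    """
    if not remote_path.startswith("/") or not local_path.startswith("/"):
        raise ValueError("remote_path and local_path must be absolute")
    source = f"gsiftp://{grid_host}{remote_path}"
    destination = f"file://{local_path}"
    return ["globus-url-copy", source, destination]


# Hypothetical host and file names, for illustration only.
command = build_collect_command(
    "gridhost.example.org",
    "/scratch/clusterfinder/patch_001.dat",
    "/data/results/patch_001.dat",
)
```

The resulting list could be passed to `subprocess.run` on a machine where the Globus client tools are installed. Keeping the command construction separate from its execution makes it easy to log or inspect the transfers before running them.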

In the case of Clusterfinder, special consideration must be given to the input data. The SDSS and RASS catalogs are too large to keep all the data on one machine, much less on 100 grid machines. Therefore the makefile controlling the Clusterfinder workflow is set up to request just the data needed.
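Because the likelihood at a point depends only on nearby data, a job needs just the catalog entries inside its patch plus a surrounding margin. The following sketch shows that selection in its simplest form. The dictionary keys `ra` and `dec`, the margin value, and the function name are assumptions of this sketch. It is not the makefile logic itself, and it ignores the wrap-around of right ascension at 360 degrees and the narrowing of RA intervals toward the poles.

```python
def select_for_patch(catalog, patch, margin_deg):
    """Return the catalog entries inside the patch enlarged by a margin.

    catalog    : iterable of dicts with 'ra' and 'dec' in degrees
    patch      : (ra_lo, ra_hi, dec_lo, dec_hi) in degrees
    margin_deg : extra border so that points near the patch edge
                 still see all of their neighbours
    """
    ra_lo, ra_hi, dec_lo, dec_hi = patch
    return [
        obj
        for obj in catalog
        if ra_lo - margin_deg <= obj["ra"] <= ra_hi + margin_deg
        and dec_lo - margin_deg <= obj["dec"] <= dec_hi + margin_deg
    ]


# Toy catalog with made-up positions.
catalog = [
    {"id": 1, "ra": 10.2, "dec": 1.0},   # inside the patch
    {"id": 2, "ra": 12.4, "dec": 1.5},   # outside, but within the margin
    {"id": 3, "ra": 20.0, "dec": 1.0},   # far away in RA
    {"id": 4, "ra": 11.0, "dec": -5.0},  # far away in Dec
]
needed = select_for_patch(catalog, (10.0, 12.0, 0.0, 2.0), margin_deg=0.5)
needed_ids = [obj["id"] for obj in needed]
```

Only the first two entries are selected, so a grid host working on this patch would never have to hold the other entries.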

An important and difficult task is the level above job submission and file transfer, namely the logistics of splitting a complete calculation into jobs that can run in parallel, identifying a grid host with the capacity to accept a job at the given time, reassembling the individual results into a coherent whole, and documenting the internal and external conditions under which the calculation was carried out. In the case of Clusterfinder, this is handled with the help of a PostgreSQL database.
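The kind of bookkeeping involved can be sketched with a single job table. To keep the example self-contained, it uses Python's built-in `sqlite3` module as a stand-in for the PostgreSQL database. The table layout, column names, and function names are hypothetical, and the simple claim logic is not safe against several clients claiming jobs at the same moment.

```python
import sqlite3


def open_job_db(path=":memory:"):
    """Open the job database and create the (hypothetical) jobs table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id INTEGER PRIMARY KEY,
               ra_min REAL, ra_max REAL, dec_min REAL, dec_max REAL,
               status TEXT NOT NULL DEFAULT 'pending',
               host TEXT,
               result_path TEXT)"""
    )
    return conn


def add_jobs(conn, patches):
    """Register one pending job per (ra_min, ra_max, dec_min, dec_max)."""
    conn.executemany(
        "INSERT INTO jobs (ra_min, ra_max, dec_min, dec_max) VALUES (?, ?, ?, ?)",
        patches,
    )
    conn.commit()


def claim_job(conn, host):
    """Assign the oldest pending job to a host; return its id or None."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE jobs SET status = 'running', host = ? WHERE id = ?",
        (host, row[0]),
    )
    conn.commit()
    return row[0]


def finish_job(conn, job_id, result_path):
    """Record where the result of a finished job was stored."""
    conn.execute(
        "UPDATE jobs SET status = 'done', result_path = ? WHERE id = ?",
        (result_path, job_id),
    )
    conn.commit()


def count_by_status(conn):
    """Return a dict such as {'pending': 3, 'done': 5}."""
    rows = conn.execute("SELECT status, COUNT(*) FROM jobs GROUP BY status")
    return dict(rows.fetchall())


conn = open_job_db()
add_jobs(conn, [(0.0, 2.5, 0.0, 2.5), (2.5, 5.0, 0.0, 2.5)])
first = claim_job(conn, "gridhost-a")
finish_job(conn, first, "/data/results/patch_001.dat")
summary = count_by_status(conn)
```

Once no job is left in the `pending` or `running` state, the recorded `result_path` values tell the central host which files to reassemble, and the table itself documents which host processed which patch.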

Clusterfinder as a grid service

A demonstration version of Clusterfinder is available as a portal application. The user can input coordinates and get back a likelihood map. It is planned to extend this portal to provide a production version of Clusterfinder as a grid service, including control over all the input parameters. At a later stage, users are to be allowed to replace certain modules, e.g. the one that calculates the cluster profile, and to use data from other sources, resulting in a powerful tool for answering a broad range of astronomical questions.

Contact and further information

Dr. Arthur Carlson (awc _a t_ mpe.mpg.de), Max-Planck-Institut für extraterrestrische Physik