Sponsored by BMBF

Data Stream Management

Motivation

Some astrophysical applications require efficient and distributed Grid-based processing of vast data volumes which originate from databases as well as data streams. By means of our services, community members can subscribe to data streams linked to persistent data.
  1. Disseminating, Publishing, and Processing Data in the Grid.

    To an increasing degree, researchers intend to provide large data sets in the Grid (e.g., the Millennium-Simulation or other large catalogs) and process those efficiently in a distributed fashion. The means in use range from complex operations to complete process descriptions or workflows (see further developments on the Process Coordinator of the Planck project).

    By using mobile code, sophisticated descriptions, and intelligent distribution mechanisms, we parallelize the processing efficiently and minimize network traffic. Appropriate load balancing among the computers and across the network is pivotal.

  2. Grid-based Management and Processing of Data Streams.

    Decentralized and distributed information processing is the key to enabling researchers to gain new and more thorough insights by associating and handling distributed persistent data (e.g., observations or simulations) or continuously generated data streams originating from sensors, gauging stations, telescopes, and the like.

    To reach the necessary efficiency, we propose adaptive in-network query processing, which moves the queries towards the data source and thus optimizes the data flow in the network.

Architectural Design & Background

The AstroGrid-D data stream management system (DSM) is a collection of Grid services that enable stream-based processing. An architectural overview is shown in the figure below.

For more details on the design and implementation of the data stream management, see the Project Documents of the "Distributed Database Access and Data Stream Management" working group (WG4).
The AstroGrid-D data stream management is based on the research of the StreamGlobe and StarGlobe projects of the database systems group at the Technische Universität München (IN.TUM).

Requirements

  • Java (Version 1.5 or higher)
  • Globus Toolkit (GT4.0.x). DSM only uses functionality provided by the Java WS Core of the Globus Toolkit. (The Java WS Core is contained in the full installer.)
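
The Java requirement above can be checked before deploying. The following is a minimal sketch, not part of the DSM distribution: java_version_ok is a hypothetical helper that only encodes the lower bound stated above (1.5 or higher) and says nothing about whether a given Globus Toolkit release runs on newer Java versions.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the given Java version
# string is 1.5 or higher, e.g. "1.5.0_22", "1.6.0" or "17.0.1".
java_version_ok() {
    version=$1
    major=${version%%[!0-9]*}     # leading digits, e.g. "1" or "17"
    rest=${version#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of that remainder
    case $major in
        '') return 1 ;;           # no leading number: reject
    esac
    if [ "$major" -eq 1 ]; then
        # Old-style version strings: 1.5, 1.6, ...
        [ "${minor:-0}" -ge 5 ]
    else
        # Any major version above 1 is numerically higher than 1.5.
        [ "$major" -gt 1 ]
    fi
}
```

One possible way to feed it the installed version, assuming java -version prints a line of the form `... version "1.5.0_22"` on standard error:

```shell
v=$(java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)
java_version_ok "$v" && echo "Java $v is sufficient" || echo "Java $v is too old or not found"
```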

Download

The AstroGrid-D data stream management is available as a precompiled binary.
Download current version.

Installation

  1. Extract the streamglobe.gar file from the DSM archive. If you saved the DSM archive into INSTALL_DIR (e.g. /home/globus/gars), streamglobe.gar will be extracted into INSTALL_DIR.
    unzip dsm-<version>.zip
  2. Deploy the data stream management services from the streamglobe.gar file. In the INSTALL_DIR directory, the command is
    $GLOBUS_LOCATION/bin/globus-deploy-gar streamglobe.gar
    Note: Please use the account which is used to run the container (e.g. globus). Otherwise you can run into permission issues during start-up or undeployment.

  3. (Re-)start the Globus container.
    If the AstroGrid-D Data Stream Management is installed correctly, you should see the following five services running in your container:
    .../wsrf/services/streamglobe/ContentProvider
    .../wsrf/services/streamglobe/ContentProviderFactory
    .../wsrf/services/streamglobe/Peer
    .../wsrf/services/streamglobe/PeerFactory
    .../wsrf/services/streamglobe/SpeakerPeer
    This information is shown either on your terminal (if you start the container with globus-start-container) or in the log file of your container (e.g. $GLOBUS_LOCATION/var/container.log) if you started it with globus-start-container-detached or the start-stop script from the Quickstart Guide.
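
The check in step 3 can be scripted when the container writes to a log file. The sketch below is an illustration only: check_dsm_services is a hypothetical helper, and it assumes that the container lists each service URL at the end of its own line, which may not hold for every logging setup.

```shell
#!/bin/sh
# Hypothetical helper: report which of the five DSM services appear in a
# container log. The exit status is the number of services not found.
check_dsm_services() {
    logfile=$1
    missing=0
    for svc in ContentProvider ContentProviderFactory Peer PeerFactory SpeakerPeer; do
        # Anchor at the end of the line so that "Peer" does not match
        # "PeerFactory" and "ContentProvider" does not match its factory.
        if grep -q "/wsrf/services/streamglobe/${svc}[[:space:]]*\$" "$logfile"; then
            echo "found:   $svc"
        else
            echo "missing: $svc"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

For example, with the log location mentioned above:

```shell
check_dsm_services "$GLOBUS_LOCATION/var/container.log"
```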

Configuration

The configuration file for the data stream management system is $GLOBUS_LOCATION/etc/streamglobe/jndi-config.xml, which contains the configuration for all services of the data stream management. The default configuration should be fine for most setups. Find details for some of the configuration parameters below.

SpeakerPeer (<service name="streamglobe/SpeakerPeer" />)

  • gridDiscovery: if set to true, the SpeakerPeer uses grid services to verify that connected peers are alive. This is the default. Otherwise it uses multicast to do so (which of course is restricted to local area setups).
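
To see how gridDiscovery is currently set, it is usually enough to search the configuration file for the parameter name. The sketch below assumes only that the name gridDiscovery appears literally in jndi-config.xml; it makes no assumption about the surrounding XML structure. show_grid_discovery is a hypothetical helper, not a DSM tool.

```shell
#!/bin/sh
# Hypothetical helper: print the numbered lines of the DSM configuration
# that mention gridDiscovery. Defaults to the path given above.
show_grid_discovery() {
    config=${1:-$GLOBUS_LOCATION/etc/streamglobe/jndi-config.xml}
    if [ ! -r "$config" ]; then
        echo "cannot read $config" >&2
        return 1
    fi
    # grep exits with status 1 if the parameter is not mentioned at all.
    grep -n 'gridDiscovery' "$config"
}
```

Called without an argument, it reads the default configuration file; the reported line numbers show where to look when editing the value by hand.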

Uninstallation

  1. Undeploy the data stream management from your Globus container.
    $GLOBUS_LOCATION/bin/globus-undeploy-gar streamglobe
  2. (Re-)start the Globus container.