
Some Remarks on Parallel Data Mining Using a Persistent Object Manager

Neil Araujo, Robert Grossman, David Hanley, and Wen Xu
Kirill Denisenko, Mark Fischler, Mark Galli, and Seung Chan Ahn
              David Malon and Edward May


         University of Illinois at Chicago
          Fermi National Accelerator Laboratory
             Argonne National Laboratory

                    
                         Abstract
    
    High performance data management will be as important as high performance computing by the beginning of the next millennium. In this environment, data mining will take on increasing importance. In this paper, we discuss our experience with parallel data mining on an IBM SP-2, focusing on four issues that we feel are emerging as critical for data mining applications in general:

    Lightweight Data Management. Since data mining applications read data often but write data rarely, the full functionality of a database is not needed. As an alternative, we view the data as a collection of persistent objects and use lightweight software tools to create, access, manage, and query it. This has the important added advantage that such a lightweight data management system can be more easily parallelized and tuned than can a full-function database.

    Balancing Parallel Input-Output and Parallel Execution. Our study used the CAP IBM SP-2 at the Fermi National Accelerator Laboratory, which consisted of twenty-four nodes. Eight of the twenty-four nodes were configured as input-output nodes, with approximately ten gigabytes of disk attached to each one. The other sixteen nodes were designated as query nodes. Multiple queries ran simultaneously on each of the query nodes, and each query node could request data from any of the input-output nodes, allowing each query node to obtain, in principle, an eightfold speedup in the delivery of data. While some data mining applications can effectively exploit complete data parallelism, with data analyzed locally on each node and no data communication necessary, for others, such as ours, it is important to balance data-centered parallelism with parallel input-output.
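
    The "in principle" eightfold figure can be illustrated with a small sketch. It assumes, purely for illustration, that equal-sized units of data are striped round-robin across the input-output nodes and that every node delivers data at the same rate; neither the striping policy nor the function names below come from the paper.

```python
from collections import Counter


def stripe_round_robin(n_folios, n_io_nodes):
    """Assign unit i to input-output node i % n_io_nodes (assumed policy)."""
    return [i % n_io_nodes for i in range(n_folios)]


def ideal_speedup(placement):
    """Upper bound on delivery speedup when all nodes serve data in parallel.

    Assumes equal-sized units and equal node bandwidth, so the busiest
    node determines the total delivery time.
    """
    load = Counter(placement)
    return len(placement) / max(load.values())


balanced = stripe_round_robin(n_folios=64, n_io_nodes=8)
skewed = [0] * 64  # every unit on a single input-output node

print(ideal_speedup(balanced))
print(ideal_speedup(skewed))
```

    Under these assumptions the bound reaches the number of input-output nodes only when the data is spread evenly; any skew in placement lowers it.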

    Data Placement. We developed an object-oriented data model that logically viewed the data as a collection of events whose attributes were the banks of the underlying data. There are at least three ways of physically clustering the data on disk: by event, by attribute, or by some hybrid combination of event and attribute. We gained an initial understanding of the trade-offs among these clustering strategies.
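
    The difference between clustering by event and clustering by attribute can be sketched with toy data. The events, bank names, and the cost measure (records touched by a query that needs one bank from every event) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy data: each event has an id and a few named banks.
EVENTS = [
    {"id": 0, "banks": {"tracks": [1.2, 0.7], "calorimeter": [5.0], "muon": []}},
    {"id": 1, "banks": {"tracks": [2.4], "calorimeter": [3.1, 0.2], "muon": [9.9]}},
    {"id": 2, "banks": {"tracks": [], "calorimeter": [1.0], "muon": []}},
]


def cluster_by_event(events):
    """Row-wise layout: all banks of one event are stored next to each other."""
    return [(e["id"], name, data) for e in events for name, data in e["banks"].items()]


def cluster_by_attribute(events):
    """Column-wise layout: one bank of all events is stored next to each other."""
    columns = defaultdict(list)
    for e in events:
        for name, data in e["banks"].items():
            columns[name].append((e["id"], data))
    return dict(columns)


def read_bank_from_event_layout(layout, bank):
    """Scan every record, keeping only the requested bank; count records touched."""
    touched, result = 0, []
    for event_id, name, data in layout:
        touched += 1
        if name == bank:
            result.append((event_id, data))
    return result, touched


def read_bank_from_attribute_layout(layout, bank):
    """Read only the requested bank's column; count records touched."""
    result = layout[bank]
    return result, len(result)
```

    In this toy model a single-bank query touches every record in the event-clustered layout but only one record per event in the attribute-clustered layout. A query that needs all banks of a few events would favor the opposite choice, which is one reason a hybrid layout can be attractive.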

    Resource Management. We were caught unawares by the amount of effort that we had to devote to resource management. For example, even with only 24 nodes, a more flexible scheme was needed to dynamically reassign the function of a node, say from a query node to an input-output node. As another example, our multiple gigabyte data sets were divided into tens of thousands of physical collections, called folios. The folios were stored on disks attached to the input-output nodes, but their location would change from time to time. In retrospect, a more efficient and flexible scheme for tracking the physical location of the folios was needed. As a third example, the index sets for the data sets were too large to fit on the disks of the query nodes, so the index sets themselves had to be managed.
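
    One simple way to track folios whose location changes is a catalog that maps each folio to the node currently holding it. The sketch below is a hypothetical illustration, not the scheme used in the study; the class, its methods, and the folio and node names are all invented for the example.

```python
class FolioCatalog:
    """Hypothetical in-memory map from folio id to the node that holds it."""

    def __init__(self):
        self._node_of = {}

    def register(self, folio_id, node):
        """Record where a newly created folio is stored."""
        self._node_of[folio_id] = node

    def move(self, folio_id, new_node):
        """Update the location of a known folio after it is relocated."""
        if folio_id not in self._node_of:
            raise KeyError(f"unknown folio: {folio_id}")
        self._node_of[folio_id] = new_node

    def locate(self, folio_id):
        """Return the node that currently holds the folio."""
        return self._node_of[folio_id]

    def folios_on(self, node):
        """List the folios stored on one node, in sorted order."""
        return sorted(f for f, n in self._node_of.items() if n == node)


catalog = FolioCatalog()
catalog.register("folio-0001", "io-3")
catalog.register("folio-0002", "io-3")
catalog.move("folio-0001", "io-5")
```

    Query nodes would consult such a catalog before requesting data, so that relocating a folio requires only one update rather than changes on every query node.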


    Submitter's Name: Robert Grossman
    Submitter's Institution: University of Illinois at Chicago
    Address of Institution: Laboratory for Advanced Computing (M/C 249)
                            851 S. Morgan Street
                            Chicago, IL 60607
    Submitter's EMAIL address: grossman@uic.edu
    Submitter's telephone number: 312 413 2176
    Fax number: 312 996 1491
    
    Authors: Neil Araujo, Robert Grossman, David Hanley, and Wen Xu
             University of Illinois at Chicago
    
             Kirill Denisenko, Mark Fischler, Mark Galli, and Seung Chan Ahn
             Fermi National Accelerator Laboratory
    
             David Malon and Edward May
             Argonne National Laboratory