Neil Araujo, Robert Grossman, David Hanley, and Wen Xu
University of Illinois at Chicago
Kirill Denisenko, Mark Fischler, Mark Galli, and Seung Chan Ahn
Fermi National Accelerator Laboratory
David Malon and Edward May
Argonne National Laboratory
Abstract
High performance data management will be as important as high performance
computing by the beginning of the next millennium. In this environment, data
mining will take on increasing importance. In this paper, we discuss our
experience with parallel data mining on an IBM SP-2, focusing on the following
four issues which we feel are emerging as critical for data mining
applications in general:
Lightweight Data Management. Since data mining applications read data often but write data rarely, the full functionality of a database is not needed. As an alternative, we view the data as a collection of persistent objects and use lightweight software tools to create, access, manage, and query it. This has the important added advantage that such a lightweight data management system can be more easily parallelized and tuned than can a full-function database.
Balancing Parallel Input-Output and Parallel Execution. Our study used the CAP IBM SP-2 at the Fermi National Accelerator Laboratory, which consisted of twenty-four nodes. Eight of the twenty-four nodes were configured as input-output nodes, with approximately ten gigabytes of disk attached to each one. The other sixteen nodes were designated as query nodes. Multiple queries ran simultaneously on each of the query nodes, and each query node could request data from any of the input-output nodes, allowing each query node to obtain, in principle, an eightfold speedup in the delivery of data. While some data mining applications can effectively exploit complete data parallelism, with data analyzed locally on each node and no data communication necessary, for others such as ours, it is important to balance data-centered parallelism with parallel input-output.
Data Placement. We developed an object-oriented data model which logically viewed the data as a collection of events whose attributes were the banks of the underlying data. There are at least three ways of physically clustering the data on disk: by event, by attribute, or by some hybrid combination of event and attribute. We gained an initial understanding of the trade-offs among these clustering strategies.
Resource Management. We were caught unawares by the amount of effort that we had to devote to resource management. First, even with only 24 nodes, we needed a more flexible scheme for dynamically reassigning the function of a node, for example, from a query node to an input-output node. Second, our multi-gigabyte data sets were divided into tens of thousands of physical collections, called folios. The folios were stored on disks attached to the input-output nodes, but their location would change from time to time. In retrospect, a more efficient and flexible scheme for tracking the physical location of the folios was needed. Third, the index sets for the data sets were too large to fit on the disks of the query nodes, and they themselves had to be managed.
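The clustering trade-off described under Data Placement can be illustrated with a small sketch. It is not the system described in this paper: the toy events, the bank names, the folio capacity, and the function names are all assumptions made for illustration. The sketch packs the same (event, bank) cells into fixed-size folios in two orders, by event and by attribute, and counts how many folios a query would have to read under each layout.

```python
# Hypothetical toy data: eight events, each with three named banks.
BANK_NAMES = ("calorimeter", "muon", "tracks")
EVENTS = {eid: {bank: f"{bank}-{eid}" for bank in BANK_NAMES} for eid in range(8)}

# Assumed folio size: number of (event, bank) cells stored per folio.
FOLIO_CAPACITY = 4


def pack(cells, capacity):
    """Cut an ordered list of cells into consecutive fixed-size folios."""
    return [cells[i:i + capacity] for i in range(0, len(cells), capacity)]


def cluster_by_event(events, capacity):
    """Store all banks of one event next to each other."""
    cells = [(eid, bank) for eid in sorted(events) for bank in sorted(events[eid])]
    return pack(cells, capacity)


def cluster_by_attribute(events, capacity):
    """Store the same bank of many events next to each other."""
    banks = sorted({bank for record in events.values() for bank in record})
    cells = [(eid, bank) for bank in banks
             for eid in sorted(events) if bank in events[eid]]
    return pack(cells, capacity)


def folios_touched(folios, wanted_events, wanted_banks):
    """Count the folios holding at least one cell that the query needs."""
    return sum(
        any(eid in wanted_events and bank in wanted_banks for eid, bank in folio)
        for folio in folios
    )


by_event = cluster_by_event(EVENTS, FOLIO_CAPACITY)
by_attribute = cluster_by_attribute(EVENTS, FOLIO_CAPACITY)

# Query A: one bank across every event (a typical scan over an attribute).
scan = (set(EVENTS), {"tracks"})
# Query B: every bank of a single event (a typical full-event fetch).
fetch = ({3}, set(BANK_NAMES))

for name, (wanted_events, wanted_banks) in (("scan", scan), ("fetch", fetch)):
    print(name,
          "by event:", folios_touched(by_event, wanted_events, wanted_banks),
          "by attribute:", folios_touched(by_attribute, wanted_events, wanted_banks))
```

In this toy setting, the attribute scan reads fewer folios under attribute clustering, while the single-event fetch reads fewer folios under event clustering. A hybrid layout would sit between the two.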
Submitter's Name: Robert Grossman
Submitter's Institution: University of Illinois at Chicago
Address of Institution: Laboratory for Advanced Computing (M/C 249)
851 S. Morgan Street
Chicago, IL 60607
Submitter's EMAIL address: grossman@uic.edu
Submitter's telephone number: 312 413 2176
Fax number: 312 996 1491
Authors: Neil Araujo, Robert Grossman, David Hanley, and Wen Xu
University of Illinois at Chicago
Kirill Denisenko, Mark Fischler, Mark Galli, and Seung Chan Ahn
Fermi National Accelerator Laboratory
David Malon and Edward May
Argonne National Laboratory