Mark Fischler, Mike Isely, Mark Galli, Stephan Lammel,
Dave Sachs, Jeff Mack
Fermi National Accelerator Laboratory
Abstract
A new generation of experiments in high energy physics is approaching. With the approval of the Large Hadron Collider at CERN and the revised Main Injector plans at Fermilab, high statistics experiments will start operation within five to ten years. With luminosities of 10^33 and 10^34 cm^-2 s^-1 and several hundred thousand readout channels, data from these experiments cannot be handled and analysed in the traditional HEP way. This paper discusses approaches to the data intensive phase of post-reconstruction analysis, characterised by a wide variety of physicists initiating multiple scans through large bodies of data.
Approaches to analysis can be characterised by the type of investigations they support efficiently, the nature of the selection criteria the user can supply, and the data and results delivered. Currently, experiments store raw data in banks that are managed by FORTRAN memory management packages. A reconstruction pass creates objects of interest, again stored sequentially in files. In a second "pre-analysis" production pass, events are sorted by characteristics such as the existence of lepton candidates. Early data-intensive analysis handles the event information much like these production passes, fragmenting data and controlling organization through memory management packages. In the next analysis step, specific information per event is collected and saved in n-tuples. This information is then rapidly available for histogramming, visualization, and statistics. While bank headers provide limited information about the object stored, the organization of n-tuples is defined by the code that produced them. Access to data other than that extracted for n-tuples can be very time consuming. One approach to future analysis is to retain the current data handling methodology and rely on improved computing hardware to cope with the factor of about 20 increase in data volume.
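The n-tuple step described above can be illustrated with a minimal C++ sketch. It is not taken from any experiment's code: the Event and NtupleRow types, their field names, and the functions makeNtuple and histogramMissingEt are all hypothetical. The sketch shows why histogramming from an n-tuple is fast (only a fixed set of columns is scanned) and why anything not extracted into a column requires going back to the full events.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical full event record; field names are illustrative only.
struct Event {
    int run;
    int nLeptons;
    double missingEt;           // GeV
    std::vector<double> jetEt;  // GeV, variable length per event
};

// One n-tuple row: a fixed set of columns chosen when the n-tuple is produced.
// Information not copied here (e.g. the individual jets) is no longer available
// without rescanning the full events.
struct NtupleRow {
    int nLeptons;
    double missingEt;
    double leadingJetEt;  // 0 if the event has no jets
};

// Extract the selected per-event quantities into an n-tuple.
std::vector<NtupleRow> makeNtuple(const std::vector<Event>& events) {
    std::vector<NtupleRow> rows;
    rows.reserve(events.size());
    for (const Event& e : events) {
        double lead = 0.0;
        for (double et : e.jetEt) lead = std::max(lead, et);
        rows.push_back(NtupleRow{e.nLeptons, e.missingEt, lead});
    }
    return rows;
}

// Fill a fixed-bin histogram of missing Et over [lo, hi) for rows that
// pass a simple lepton-count cut.
std::vector<int> histogramMissingEt(const std::vector<NtupleRow>& ntuple,
                                    int minLeptons, double lo, double hi,
                                    std::size_t nBins) {
    assert(nBins > 0 && hi > lo);
    std::vector<int> bins(nBins, 0);
    const double width = (hi - lo) / static_cast<double>(nBins);
    for (const NtupleRow& r : ntuple) {
        if (r.nLeptons < minLeptons) continue;
        if (r.missingEt < lo || r.missingEt >= hi) continue;
        std::size_t b = static_cast<std::size_t>((r.missingEt - lo) / width);
        bins[std::min(b, nBins - 1)]++;  // guard against rounding at the upper edge
    }
    return bins;
}
```

A caller would build the n-tuple once with makeNtuple and then call histogramMissingEt repeatedly with different cuts, without touching the original events again.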
PAW and PIAF tools are already in use for data analysis. All hot information is collected and stored in n-tuples, distributed over several CPUs. Users then extract their desired information - histograms or further n-tuples - from this master n-tuple. Future analyses could be based on this second approach, which becomes especially attractive once n-tuples have been formed.
The Fermilab CAP project is focusing on a third approach, suitable where limitations imposed by n-tuple methods impede analysis. Data are loaded into the CAP system and hot information for event selection is stored in an object-oriented format. Queries can provide full or summary event information; the volume of data extracted is assumed to be rather small and thus manageable on a standard workstation. An Event Query Language (EQL) extends the familiar PAW style, automating conditions involving multiple particles or objects within the event and links between objects. The intricate hierarchical structure present in experiment data is expressed in a C++-based paradigm of "physics objects". Efficient processing, avoiding repeated scans of the full data set, is achieved by grouping the physics objects of each kind into stores which can be scanned individually as needed. To implement relations between objects, and to handle the data organization smoothly, lightweight persistent object management software is used. Starting from the Ptool persistent object API, we have created a Physics Object Persistency Manager (POPM) designed to support optimum query rates on a large parallel system. New features include data parallelism, low-overhead access to persistent data by locked "physical pointers", and read-ahead logic to achieve very high performance on individual queries. The DST data for D0 Run1B has been brought into the CAP system and can be "mined" via EQL queries.
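The idea of grouping physics objects of each kind into separately scannable stores can be sketched in C++ as follows. This is an in-memory illustration under stated assumptions, not the POPM or EQL interface: the Electron, Jet, and Stores types and the functions countAbove and selectEvents are hypothetical, and the link from an object back to its event is modelled as a plain event identifier rather than a persistent pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using EventId = std::uint32_t;

// Hypothetical physics objects; each carries a link back to its event.
struct Electron { EventId event; double et; };
struct Jet      { EventId event; double et; };

// One store per kind of object, so a query scans only the kinds it needs.
struct Stores {
    std::vector<Electron> electrons;
    std::vector<Jet> jets;
};

// Scan a single store and count, per event, the objects with et above a threshold.
template <typename T>
std::map<EventId, int> countAbove(const std::vector<T>& store, double minEt) {
    std::map<EventId, int> counts;
    for (const T& obj : store) {
        if (obj.et > minEt) ++counts[obj.event];
    }
    return counts;
}

// Multi-object condition: at least one electron above minElectronEt and at
// least minJets jets above minJetEt in the same event. Only the electron and
// jet stores are read; any other stores would be left untouched.
std::set<EventId> selectEvents(const Stores& s, double minElectronEt,
                               double minJetEt, int minJets) {
    assert(minJets >= 0);
    const std::map<EventId, int> ele = countAbove(s.electrons, minElectronEt);
    const std::map<EventId, int> jet = countAbove(s.jets, minJetEt);
    std::set<EventId> selected;
    for (const auto& entry : ele) {
        const auto it = jet.find(entry.first);
        const int nJets = (it == jet.end()) ? 0 : it->second;
        if (nJets >= minJets) selected.insert(entry.first);
    }
    return selected;
}
```

The returned set of event identifiers is small compared with the stores themselves, which matches the assumption in the text that the extracted volume is manageable on a workstation.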
A fourth approach manages the data in an even earlier phase of the analysis cycle. All information about an event, whether from the data acquisition system, the reconstruction passes, or even generated during the analysis phase, is collected in a database system. This data is available for distribution to processors requiring the information. Frequently accessed information stays at the workstation or analysis cluster; additional data is also present so that when small changes in selection criteria are applied, the missing information is made available to the user quickly and transparently. This approach is being investigated on the same testbed hardware as the CAP project.
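One way to picture frequently accessed information staying local, with missing data fetched transparently, is a small cache in front of the database. The C++ sketch below is an assumption-laden illustration only: the abstract does not specify a caching policy, so the least-recently-used eviction used here, the EventCache class, and the fetch callback standing in for a database read are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Local cache of event data keyed by an integer event id (hypothetical).
// On a miss, the data is fetched through a user-supplied callback that
// stands in for the central database; the caller sees no difference.
class EventCache {
public:
    using Fetch = std::function<std::string(int)>;

    EventCache(std::size_t capacity, Fetch fetch)
        : capacity_(capacity), fetch_(std::move(fetch)) {
        assert(capacity_ > 0);
    }

    // Return the data for an event. The reference stays valid until the
    // entry is evicted by a later call to get().
    const std::string& get(int id) {
        const auto it = index_.find(id);
        if (it != index_.end()) {
            // Hit: move the entry to the front (most recently used).
            lru_.splice(lru_.begin(), lru_, it->second);
            return it->second->second;
        }
        ++misses_;
        if (lru_.size() == capacity_) {
            // Evict the least recently used entry.
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(id, fetch_(id));
        index_[id] = lru_.begin();
        return lru_.front().second;
    }

    std::size_t misses() const { return misses_; }

private:
    using Entry = std::pair<int, std::string>;
    std::size_t capacity_;
    Fetch fetch_;
    std::size_t misses_ = 0;
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<int, std::list<Entry>::iterator> index_;
};
```

With this arrangement a small change in selection criteria that touches a few uncached events costs only those few fetches, while repeated access to popular events never leaves the workstation.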
The common feature of any approach to this phase of analysis is the need for high-performance access to large datasets. We discuss the pros and cons of these four analysis approaches currently being considered for the next Collider and Fixed Target runs at Fermilab. Ideally, various methods could be integrated on one or more systems, preferably sharing one copy of the event data on disk. Since efficient implementations of each approach have distinct data organization needs, a compromise must be reached, trading some duplication of data for some efficiency improvements. We illustrate instances where efficient data sharing can be implemented, and present problems which may prevent this in other cases.
Submitter's Name: Mark Fischler
Submitter's Institution: Fermi National Accelerator Laboratory
Address of Institution: P.O. Box 500
Batavia, Illinois, 60510, USA
Submitter's E-mail address: mf@ffnal.gov
Submitter's telephone/fax: (708)840-4339
840-8208 fax
Authors:
Mark Fischler Fermi National Accelerator Laboratory
Mike Isely FNAL
Mark Galli FNAL
Stephan Lammel FNAL
Dave Sachs FNAL
Jeff Mack FNAL