Digital Archive Directions (DADs) Workshop
DATE: June 22-26, 1998
HOST: The National Archives and Records Administration
8601 Adelphi Road
College Park, MD 20740-6001
1. Identification of Proposed Topic [Required]
Persistent Archives in a Digital Library Framework
1.2 Contributor(s)
Enabling Technologies Group
San Diego Supercomputer Center
PO Box 85608
San Diego CA 92186-5608
Reagan Moore - firstname.lastname@example.org - Point of Contact
1.3 Description of Proposed Project
The supercomputer, digital library, and archival storage communities have common persistent archival storage requirements. Each of these communities is building software infrastructure to organize and store large collections of data. In particular, digital library technology provides many of the capabilities needed for a persistent archive. A co-evolution of supercomputing, federated archiving, and digital library technologies is critical for future application and information processing activities. We are developing an infrastructure called the "Data Intensive Computing Environment" (DICE) as a first step towards achieving this goal.
Currently, we are setting up a general digital library system for ingesting, managing, archiving, and accessing several collections of scientific data whose total size can grow to petabytes with billions of objects. The contents of these archives are scientific datasets, including documents, images, field-generated data, and simulation results, in fields ranging from astronomy, environmental science, sociology, and ecology to neuroscience. An important issue is making information in the archived digital libraries available through the web as well as through APIs for processing on platforms such as the Cray C90, Cray T3E, IBM SP2, and TERA. We are also investigating automation of the ingestion of data into the Archive Digital Library. Many of the data sets are produced by scientific instruments or are the output of supercomputing applications.
Our system is built around a Storage Resource Broker (SRB) and a Metadata Catalog (MCAT), both developed at SDSC. The SRB provides a uniform API for access to heterogeneous archival storage systems and deals with federation of storage sites and replication of data objects. MCAT is a repository that handles different levels of metadata, including:
Both MCAT and SRB expose different levels of input/output mechanisms including web-based access, Unix-type shell level access, programmatic access and GUI-based access for both resource and data discovery and for data manipulation.
The persistent archive system can manage a wide variety of storage resources. SRB drivers have been implemented for archival storage systems such as the IBM High Performance Storage System (HPSS), NSL UniTree, and ADSM; for database systems such as DB2, Oracle, and Illustra; and for local and remote Unix file system access. Currently, the system runs on supercomputers such as the Cray C90, Cray T3E, and IBM SP2, and on workstations such as Sun, SGI, and DEC platforms.
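The idea of one interface in front of many storage systems can be illustrated with a minimal sketch. This is not the SRB API: the names `StorageDriver`, `InMemoryDriver`, `Broker`, `put`, and `get` are hypothetical, and an in-memory dictionary stands in for a real back end such as an archive, a database, or a Unix file system. The sketch assumes a simple policy of writing every replica on `put` and falling back to the next replica on `get`.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class StorageDriver(ABC):
    """Hypothetical driver interface; one subclass per storage system."""

    @abstractmethod
    def write(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, name: str) -> bytes: ...


class InMemoryDriver(StorageDriver):
    """Stand-in for a real back end (archive, database, file system)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def write(self, name: str, data: bytes) -> None:
        self._objects[name] = data

    def read(self, name: str) -> bytes:
        return self._objects[name]  # raises KeyError if the object is absent


class Broker:
    """Routes requests to drivers and records where replicas live."""

    def __init__(self, drivers: Dict[str, StorageDriver]) -> None:
        self.drivers = drivers
        self.replicas: Dict[str, List[str]] = {}  # object name -> resource names

    def put(self, name: str, data: bytes, resources: List[str]) -> None:
        # Write one copy to every requested resource.
        for resource in resources:
            self.drivers[resource].write(name, data)
        self.replicas[name] = list(resources)

    def get(self, name: str) -> bytes:
        # Try each recorded replica in order until one can serve the object.
        for resource in self.replicas.get(name, []):
            try:
                return self.drivers[resource].read(name)
            except KeyError:
                continue
        raise KeyError(name)


broker = Broker({"archive": InMemoryDriver(), "unixfs": InMemoryDriver()})
broker.put("survey/img001", b"pixels", ["archive", "unixfs"])
```

The caller names an object and never a storage system, so adding a new back end in this sketch means adding one driver class without changing client code.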
Some of the salient features of the SRB/MCAT system are as follows:
As mentioned in the previous section, current trends in data and information generation are leading to a paradigm shift in storing and manipulating datasets. This shift is driven by demands that range from large collections of data (the Digital Sky Survey will access over 2 billion images) to large data objects (a brain image in the NeuroScience Database can eventually be as large as a terabyte). For scientific disciplines to survive the onslaught of massive data loads, efficient infrastructure must be developed that provides automated means of ingesting, managing, and querying information and of feeding it into future computations.
Hence one needs to consider a system that is not only used for archiving digital objects, but that also provides several mechanisms for knowledge discovery and supports efficient access to the holdings in the system.
This requires an infrastructure that can integrate the three technologies provided by digital libraries, archival systems, and supercomputers.
1.5 Definitions of Concepts and Special Terms
1.6 Expected Relationship with OAIS Reference Model
The infrastructure under development at SDSC is an extension of the OAIS model. In addition to maintaining the metadata needed to migrate data collections onto new media, migrate metadata into new catalogs, and migrate data holdings into new format standards, we are working to support information discovery and rapid access to the holdings. Migrating the collections forward in time also requires the ability to extend the ontology under which the collections are organized, incorporate new metadata, and modify semantic definitions of terms. Given these capabilities, a persistent archive can be created that meets the needs of the supercomputing and digital library communities.
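One way to picture migrating catalog metadata forward in time is a chain of small upgrade steps, each moving a record from one schema version to the next. The sketch below is an illustration under stated assumptions, not the MCAT design: the attribute names (`instrument`, `sensor`, `format`), the version numbering, and the functions `v1_to_v2`, `v2_to_v3`, and `migrate` are all hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def v1_to_v2(rec: Record) -> Record:
    """Rename a term whose definition was refined ('instrument' -> 'sensor')."""
    out = dict(rec)  # copy so the archived original is left untouched
    out["sensor"] = out.pop("instrument", "unknown")
    return out


def v2_to_v3(rec: Record) -> Record:
    """Incorporate a new attribute, using an explicit placeholder value."""
    out = dict(rec)
    out.setdefault("format", "unspecified")
    return out


# Hypothetical schema history: entry i upgrades version i+1 to version i+2.
MIGRATIONS: List[Callable[[Record], Record]] = [v1_to_v2, v2_to_v3]


def migrate(rec: Record, from_version: int) -> Record:
    """Apply every upgrade step from from_version up to the latest schema."""
    for step in MIGRATIONS[from_version - 1:]:
        rec = step(rec)
    return rec


old = {"name": "sky_tile_42", "instrument": "CCD-A"}
new = migrate(old, from_version=1)
```

Keeping each step separate means a record written under any past schema can be brought up to date by replaying only the steps it has not yet seen.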
2. Scope of Proposed Standard [Desired]
An archival digital library should not only deal with different disciplines but also provide a means of interaction between the disciplines and their collections. This requires a meta catalog that can hold discipline-specific ontologies and semantics that can be used for such interaction. Also, because we are dealing with fields that are rapidly evolving, one needs to consider the longevity of such ontologies and migrate them forward in time.
Since we are dealing with objects that are to be accessed using different types of APIs and methods, one needs to migrate forward the methods and procedures that are used in analyzing data. Hence one needs to go beyond storing preservation-level metadata for the objects and also consider preservation-level metadata for methods and APIs.
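As a rough illustration of preservation-level metadata for methods, a catalog could record each analysis method alongside the objects it applies to. The sketch below makes several assumptions: the fields of `MethodRecord`, the `Catalog` class, and the example method and object names are hypothetical and are not taken from MCAT.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class MethodRecord:
    """Hypothetical preservation metadata describing an analysis method."""
    name: str
    version: str
    language: str
    input_format: str


@dataclass
class Catalog:
    """Links data objects to the methods that can be used to analyze them."""
    methods: Dict[str, MethodRecord] = field(default_factory=dict)
    applies_to: Dict[str, List[str]] = field(default_factory=dict)  # object -> method names

    def register(self, method: MethodRecord, objects: List[str]) -> None:
        self.methods[method.name] = method
        for obj in objects:
            self.applies_to.setdefault(obj, []).append(method.name)

    def methods_for(self, obj: str) -> List[MethodRecord]:
        # Unknown objects simply have no registered methods.
        return [self.methods[m] for m in self.applies_to.get(obj, [])]


catalog = Catalog()
catalog.register(
    MethodRecord("segment_cortex", "1.0", "C", "raw-volume"),
    ["brain/scan_007"],
)
```

With such records in place, a future user who retrieves `brain/scan_007` can also learn which procedure, in which version and language, was meant to process it.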
2.1 Recommended Scope of Standard
The SRB/MCAT project is part of the data-intensive computing environments thrust area of NPACI (the National Partnership for Advanced Computational Infrastructure), which spans 23 universities throughout the United States. The project will be used as a tool for integrating digital library technology with archival storage systems to support large-scale projects such as the Digital Sky Survey project (at CalTech) and the NeuroScience Database Project (at UCLA).
2.2 Existing Practice in Area of Proposed Standard
Current practice in supercomputing environments is to manually store data in and retrieve data from archival storage systems, and to perform data discovery and resource discovery manually.
Current practice in sharing scientific data is through anonymous ftp, email and other manually driven exchanges.
Current practice in digital libraries is to store data on disk-based file systems with no automatic access to archival storage systems. Replication, access control, and federations are not currently performed in digital libraries.
2.3 Expected Stability of Proposed Standard with Respect to Current and Potential Technological Advances
We expect to have a commercial product within the next three years that integrates digital library technology, archival storage systems, and supercomputing systems.
Since we are building on top of digital library technology, all advances in that area can be incorporated to provide value-added services for archival storage systems.
Reagan Moore (
Arcot (raja) Rajasekar email@example.com) +1.619.534.8378
Curator: John Garrett (John.Garrett@gsfc.nasa.gov) +1.301.286.3575
Responsible Official: Code 633.2 / Don Sawyer (Donald.Sawyer@gsfc.nasa.gov) +1.301.286.2748
Last Revised: May 15, 1998, Arcot Rajasekar (May 26, 1998, John Garrett)