Digital Archive Directions (DADs) Workshop

(A part of the ISO Archiving Workshop Series)
 
 
  

     

Position Paper


Digital Archive Directions (DADs) Workshop

DATE: June 22-26, 1998

HOST: The National Archives and Records Administration
Archives II
8601 Adelphi Road
College Park, MD 20740-6001

 


 

1. Identification of Proposed Topic [Required]

1.1 Title

Persistent Archives in a Digital Library Framework

1.2 Contributor(s)

Enabling Technologies Group
San Diego Supercomputer Center
PO Box 85608
San Diego CA 92186-5608

Reagan Moore - moore@sdsc.edu - Point of Contact

1.3 Description of Proposed Project

The supercomputer, digital library, and archival storage communities have common persistent archival storage requirements. Each of these communities is building software infrastructure to organize and store large collections of data. In particular, digital library technology provides many of the capabilities needed for a persistent archive. A co-evolution of supercomputing, federated archiving and digital library technologies is critical for future application and information processing activities. We are developing an infrastructure called the "Data Intensive Computing Environment" (DICE) as a first step towards achieving this goal.

Currently, we are setting up a general digital library system for ingesting, managing, archiving and accessing several collections of scientific data whose total size can grow to petabytes with billions of objects. The contents of these archives are scientific datasets including documents, images, field-generated data and simulation results, in fields ranging from astronomy, environmental science, sociology and ecology to neuroscience. An important issue is making the information in the archived digital libraries available through the web as well as through APIs for processing on platforms such as the Cray C90, Cray T3E, IBM SP2 and TERA. We are also investigating automation of the ingestion of data into the Archive Digital Library. Many of the data sets are produced by scientific instruments or are the output of supercomputing applications.

Our system is built around a Storage Resource Broker (SRB) and a Metadata Catalog (MCAT), both developed at SDSC. The SRB provides a uniform API for access to heterogeneous archival storage systems and deals with federation of storage sites and replication of data objects. MCAT is a repository that handles different levels of metadata, including:

  1. system-level metadata about storage resources, used by the SRB to provide location transparency, access transparency and protocol transparency, together with user-level metadata and data-level metadata about types and formats, replication and partitioning of data sets;
  2. application-level metadata including ontology information for the relationship of the terms in the attribute domain as well as indexing of individual data objects into the ontology;
  3. usage-level metadata including audit trail, authentication and access control;
  4. preservation-level metadata including lineage (creation characteristics), ingestion protocols and usage methods.
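
The four metadata levels listed above can be pictured as one catalog record per data object. The following is a minimal sketch only: the class name `CatalogEntry`, its field names and the sample values are hypothetical illustrations, not the actual MCAT schema.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical record grouping the four metadata levels for one object."""
    object_name: str
    # 1. System-level: where and how the object is stored.
    system: dict = field(default_factory=dict)
    # 2. Application-level: domain attributes and ontology terms.
    application: dict = field(default_factory=dict)
    # 3. Usage-level: access control and audit trail.
    usage: dict = field(default_factory=dict)
    # 4. Preservation-level: lineage, ingestion protocol, usage methods.
    preservation: dict = field(default_factory=dict)


# Illustrative values only; none of these names come from a real collection.
entry = CatalogEntry(
    object_name="sky_plate_0001",
    system={"resource": "archive-1", "data_type": "image", "replicas": 2},
    application={"ontology_terms": ["galaxy", "survey-plate"]},
    usage={"acl": {"alice": "read"}, "audit": []},
    preservation={"lineage": "instrument capture", "ingestion_protocol": "batch"},
)
```

Keeping the levels in separate fields reflects the idea that each level serves a different consumer: the broker reads the system level, while discovery tools read the application level.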

Both MCAT and SRB expose different levels of input/output mechanisms including web-based access, Unix-type shell level access, programmatic access and GUI-based access for both resource and data discovery and for data manipulation.

The persistent archive system can manage a wide variety of storage resources. SRB drivers have been implemented for archival storage systems such as the IBM High Performance Storage System (HPSS), NSL UniTree and ADSM, for database systems such as DB2, Oracle and Illustra, and for local and remote Unix file system access. Currently, the system runs on supercomputers such as the Cray C90, Cray T3E and IBM SP2, and on workstations such as Sun, SGI, and DEC platforms.
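
The idea of one uniform API in front of per-system drivers can be sketched as follows. This is an illustration under stated assumptions, not the SRB interface: `StorageDriver`, `UnixFileDriver`, `InMemoryDriver` and `Broker` are hypothetical names, and the in-memory driver merely stands in for an archive or database back end.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical interface that every back-end driver implements."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...


class UnixFileDriver(StorageDriver):
    """Driver for a local Unix file system rooted at a directory."""

    def __init__(self, root: str):
        self.root = root

    def write(self, path, data):
        with open(os.path.join(self.root, path), "wb") as f:
            f.write(data)

    def read(self, path):
        with open(os.path.join(self.root, path), "rb") as f:
            return f.read()


class InMemoryDriver(StorageDriver):
    """Stand-in for an archive or database back end (no real system calls)."""

    def __init__(self):
        self._blobs = {}

    def write(self, path, data):
        self._blobs[path] = data

    def read(self, path):
        return self._blobs[path]


class Broker:
    """Routes the same put/get calls to whichever driver owns a resource."""

    def __init__(self):
        self._drivers = {}

    def register(self, resource: str, driver: StorageDriver) -> None:
        self._drivers[resource] = driver

    def put(self, resource, path, data):
        self._drivers[resource].write(path, data)

    def get(self, resource, path):
        return self._drivers[resource].read(path)


broker = Broker()
broker.register("unix-fs", UnixFileDriver(tempfile.mkdtemp()))
broker.register("archive", InMemoryDriver())
# The client code is identical for both back ends.
for resource in ("unix-fs", "archive"):
    broker.put(resource, "dataset.bin", b"simulation output")
```

Supporting a new storage system then means writing one more driver class; client code that calls `put` and `get` does not change.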

Some of the salient features of the SRB/MCAT system are as follows:

  • Storage-level
    • uniform access to multiple heterogeneous storage systems
    • replication, partition, federation of collections
    • user-definable proxy services for server-level processing (e.g. data subsetting, data selection, ...)
    • authentication and encryption
    • caching and efficient access to replicated data
  • Metadata level
    • data and resource discovery through metadata
    • system-level metadata
    • metadata schema definition for application-level metadata
    • resource definition through an abstraction model
    • automatic mapping of queries and data structures from various data models to internal metadata model
    • automatic mapping of queries and data structures between various data models
    • access control mechanism for collections and metadata
    • auditing facilities
  • User-level
    • generation of ingestion procedures
    • generation of rendering procedures and uniform APIs
    • uniform APIs for programming
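
As a rough illustration of how replication and location transparency from the list above fit together, the sketch below maps a logical object name to several physical replicas and returns the first one whose resource is reachable. `ReplicaCatalog`, its methods, and the resource names and paths are hypothetical; the actual SRB/MCAT selection policy is not described here.

```python
class ReplicaCatalog:
    """Hypothetical map from logical names to physical replicas."""

    def __init__(self):
        self._replicas = {}   # logical name -> list of (resource, physical path)
        self._online = set()  # resources currently reachable

    def register(self, logical_name, resource, physical_path):
        self._replicas.setdefault(logical_name, []).append((resource, physical_path))

    def set_online(self, resource, online=True):
        if online:
            self._online.add(resource)
        else:
            self._online.discard(resource)

    def resolve(self, logical_name):
        """Return the first replica, in registration order, that is reachable."""
        for resource, path in self._replicas.get(logical_name, []):
            if resource in self._online:
                return resource, path
        raise LookupError(f"no reachable replica for {logical_name!r}")


catalog = ReplicaCatalog()
catalog.register("brain_scan_42", "disk-cache", "/cache/b42")
catalog.register("brain_scan_42", "tape-archive", "/archive/b42")
catalog.set_online("disk-cache")
catalog.set_online("tape-archive")

first_choice = catalog.resolve("brain_scan_42")   # fast replica while available
catalog.set_online("disk-cache", False)           # simulate the cache going down
fallback = catalog.resolve("brain_scan_42")       # same logical name, other copy
```

The caller only ever uses the logical name `brain_scan_42`; which physical copy is read is decided by the catalog.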

1.4 Justification

As mentioned in the previous section, current trends in data and information generation are leading to a paradigm shift in storing and manipulating datasets. The shift is driven by data that ranges from large collections (the Digital Sky Survey will access over 2 billion images) to large individual data objects (a brain image in the NeuroScience Database can eventually be as large as a terabyte). For scientific disciplines to survive the onslaught of massive data loads, efficient infrastructure must be developed that provides automated means of information ingestion, management, querying and ingestion into future computations.

Hence one needs to consider a system that is not only used for archiving digital objects, but that also provides several mechanisms for knowledge discovery and supports efficient access to the holdings in the system.

This requires an infrastructure that can integrate the three technologies provided by digital libraries, archival systems and super computers.

1.5 Definitions of Concepts and Special Terms

Access Control
Controlling access to data objects at the per-user, per-object and/or per-collection level
Access Transparency
Access to heterogeneous storage systems without knowing their particular access protocols
Auditing
Collection of usage statistics
Digital Library
A system for storing, curating and disseminating digital objects, mostly through a web-based API
Domain-specific Metadata
Meta information on objects that is particular to the field and/or individual collection
Federation
Coordinating the use of more than one distributed system in a transparent manner
Location Transparency
Access to data objects through their characteristics, without knowing their physical location
Metadata Catalog (MCAT)
A repository of metadata information
Protocol Transparency
Access to storage, computing and communication systems without knowing their particular data exchange characteristics
Proxy Functions
Functions that are applied to data objects at the storage level (rather than at the client level)
Standardized Access
Access to heterogeneous resources through a homogeneous interface
Storage Resource Broker (SRB)
A system that allows standardized access to multiple, heterogeneous storage systems
Super Computing
Computations that demand gigaflops to teraflops of performance and require supercomputers or metacomputers
System-level Metadata
Meta information that is systemic to all objects (as opposed to domain-specific metadata)
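
To make the Proxy Functions definition concrete, here is a toy sketch in which a subsetting function runs where the data is held and only the small result travels back to the caller. `StorageServer`, `register_proxy`, `invoke` and `subset` are hypothetical names chosen for illustration, not SRB calls.

```python
class StorageServer:
    """Toy server holding objects and a registry of proxy functions."""

    def __init__(self):
        self._objects = {}
        self._proxies = {}

    def store(self, name, data):
        self._objects[name] = data

    def register_proxy(self, name, func):
        self._proxies[name] = func

    def invoke(self, proxy_name, object_name, **params):
        """Apply a registered function next to the data; return only its result."""
        func = self._proxies[proxy_name]
        return func(self._objects[object_name], **params)


def subset(data, start, stop):
    """User-defined proxy function: return one slice of a large object."""
    return data[start:stop]


server = StorageServer()
server.store("simulation_output", list(range(1_000_000)))
server.register_proxy("subset", subset)

# The client receives five values rather than the million stored on the server.
result = server.invoke("subset", "simulation_output", start=10, stop=15)
```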

1.6 Expected Relationship with OAIS Reference Model

The infrastructure under development at SDSC is an extension of the OAIS model. In addition to maintaining the metadata needed to migrate data collections onto new media, migrate metadata into new catalogs, and migrate data holdings into new format standards, an effort is being made to support information discovery and rapid access to the holdings. The migration of the collections forward in time also requires the ability to extend the ontology under which the collections are organized, incorporate new metadata, and modify semantic definitions of terms. Given these capabilities, a persistent archive can be created that meets the needs of the supercomputing and digital library communities.
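
One small piece of migrating a collection forward in time, renaming terms and incorporating new metadata attributes, might look like the sketch below. It assumes metadata records are flat key/value mappings; the function `migrate_record` and the attribute names are hypothetical.

```python
def migrate_record(record, renamed_terms, new_defaults):
    """Return a copy of `record` using the new vocabulary.

    renamed_terms: old attribute name -> new attribute name
    new_defaults:  attributes added by the new schema, with default values
    """
    migrated = {}
    for key, value in record.items():
        # Carry every value forward, under its new name if it was renamed.
        migrated[renamed_terms.get(key, key)] = value
    for key, default in new_defaults.items():
        # Add attributes the old schema did not have, without overwriting.
        migrated.setdefault(key, default)
    return migrated


old_record = {"obj_class": "galaxy", "format": "FITS"}
new_record = migrate_record(
    old_record,
    renamed_terms={"obj_class": "object_type"},
    new_defaults={"schema_version": 2},
)
```

Because the function returns a new mapping and leaves `old_record` untouched, the original metadata can be kept alongside the migrated form as part of the lineage.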

 


 

2. Scope of Proposed Standard [Desired]

An archival digital library should not only deal with different disciplines but also provide a means of interaction between the disciplines and their collections. This requires a meta catalog that can hold discipline-specific ontologies and semantics that can be used for such interaction. Also, because we are dealing with fields that are rapidly evolving, one needs to consider the longevity of such ontologies and migrate them forward in time.

Since we are dealing with objects that are to be accessed using different types of APIs and methods, one needs to migrate forward the methods and procedures that are used in analyzing data. Hence one needs to go beyond storing preservation-level metadata for the objects and also consider preservation-level metadata for methods and APIs.

2.1 Recommended Scope of Standard

The SRB/MCAT project is part of the data intensive computing environments thrust area of NPACI (the National Partnership for Advanced Computational Infrastructure), which spans 23 universities throughout the United States. The project will be used as a tool for integrating digital library technology with archival storage systems to support large scale projects such as the Digital Sky Survey project (at Caltech), the NeuroScience Database Project (at UCLA), etc.

2.2 Existing Practice in Area of Proposed Standard

Current practice in supercomputing environments is to manually store data in and retrieve data from archival storage systems, and to perform data discovery and resource discovery manually.

Current practice in sharing scientific data is through anonymous ftp, email and other manually driven exchanges.

Current practice in digital libraries is to store data on disk-based file systems with no automatic access to archival storage systems. Replication, access control, and federation are not currently performed in digital libraries.

2.3 Expected Stability of Proposed Standard with Respect to Current and Potential Technological Advances

We expect to have a commercial product within the next three years that integrates digital library technology, archival storage systems and super computing systems.

Since we are building on top of digital library technology, all advances in that area can be incorporated into providing value-added services for archival storage systems.




URL: http://ssdoo.gsfc.nasa.gov/nost/isoas/dads/DADSbase.html

A service of NOST at NSSDC.

Author: Reagan Moore (moore@sdsc.edu) and Arcot (Raja) Rajasekar (sekar@sdsc.edu) +1.619.534.8378
Curator: John Garrett (John.Garrett@gsfc.nasa.gov) +1.301.286.3575
Responsible Official: Code 633.2 / Don Sawyer (Donald.Sawyer@gsfc.nasa.gov) +1.301.286.2748
Last Revised: May 15, 1998, Arcot Rajasekar (May 26, 1998, John Garrett)