Archival Workshop on Ingest, Identification, and Certification Standards (AWIICS)

(A part of the ISO Archiving Workshop Series)


Draft Report

DATE: October 13-15, 1999

HOST: The National Archives and Records Administration
Archives II
8601 Adelphi Road
College Park, MD 20740-6001



Ingest Group Notes

Don Sawyer convened the Ingest Working Group of AWIICS at 1430 hours on Oct. 13. The members of the group were:

  • William Callicott / NODC/NOAA
  • Geoffrey Goodrum / NOAA/NESDIS - National Climate Data Center
  • Michael Clark / US Government Printing Office
  • Jane Cohen / Defense Technical Information Center
  • Evelyn Frangakis / USDA, National Agricultural Library
  • Brian Lavoie / OCLC Online Computer Library Center, Inc.
  • Charles Luciano / StorageTek Inc.
  • Steve Marley / Raytheon System Company
  • Mike Martin / Jet Propulsion Laboratory
  • Warren Murphy / University of Alabama
  • Oya Rieger / Cornell University Library
  • Malcolm Rives / NASA MSFC / Sverdrup
  • Don Sawyer / NASA/GSFC/NSSDC
  • Tim Smith / USGS/EROS Data Center/Raytheon
  • John Stegenga / National Library of Canada
  • Helen Wu / Raytheon, ITSS
  • Xiaoshi Xing / CIESIN, Columbia University
  • Charles Early / NCI Information Systems, Inc.

After a period of introductions and solicitation of presentations, Mike Martin began his presentation on "The Archive Ingest Process."

Mike is working on the Planetary Data System (PDS). There are three data systems.
Funded at 5 million dollars/year.
Distributed system.

Cassini is an example of a large project which will generate large amounts of data. The developers of Cassini are not concerned with archiving issues, which is a common theme we all deal with.

It's a concern of PDS to have an interface with these producers of data. There are processes to streamline these interfaces with smaller producers.

This talk is about the methodology PDS uses to deal with the projects that will produce this data.

The methodology is taken from the PDS Data Preparation Workbook.

  • Orientation
  • Establish contact with the archive
    Normally done by administration.
  • Provide General Information to the archive
    Provide a submission agreement.
  • Obtain the Archive Orientation Material
    The archive must have orientation material:
    Data preparation workbook
    Standards workbook (e.g., how many files you should put in a directory; images should be 10 MB)
    Data dictionary
  • Establish Technical Contacts
    Data Engineers: someone whose job it is to understand the archive, data formats, etc.
    Project interface team: archive people working closely with the engineers designing the data sources.

  • Archive planning
  • Prepare a producer Data Management Plan (PDMP)
    PDMP can be a nightmare because of the complexity of the data interface and extracting data from an assortment of teams.
  • Prepare a submission agreement (SA)
    The formal interface between the producer and the archive.
  • Plan for updates to the PDMP and SA
    Good planning.
  • Keep the Archive Data Engineer informed
    A little reminder
  • Participate in Planning Meetings
    A project may have its own group, but there needs to be a group that includes archive engineers.
  • Review/Sign off Archive Interface Plan
    Internal resources must be assigned if the data is to be released.
  • Design
  • The goal is to minimize rework
  • Review Archive Standards
  • Design Data Products and Representation Information
  • Design the data product
  • Estimate File Sizes
  • Determine file formats
  • Access Patterns
  • Data objects and file configurations
  • Design data product representation information: the structural information and the semantic information
  • Design Data Set or Collection
  • Define the Purpose and scope of data set or collection
  • Determine other data set or collection components
  • Create Data Set or Collection Names and Identifiers
  • Really important to review this and be sure people are following the guidelines.
  • Determine the Storage Medium
  • Design Volumes and Volume Sets
  • Map data sets or collections to volumes
  • Name the volumes and volume sets
  • Determine non-data subdirectories and files for each volume
  • Determine data organization
  • Design data production process
  • Plan data validation process: use checksums, hashes, or digital signatures
  • Prepare the preservation descriptive information: preparing fixity data
  • Data Set Assembly and Validation
  • Create the data products
  • Create a data staging area for volume production
    You may need to buffer data for a period before archiving.
  • Prepare volume components
  • Prepare a set of test volumes and distribute
  • Execute data validation procedures
  • Transfer data to final medium
  • Review
  • Establish a review committee
  • Prepare for the data delivery review or peer review
  • Conduct the data delivery review or peer review
  • Correct/Document review liens
  • Delivery
  • Coordinate generation of duplicate copies with the archive
  • Data classification procedures
  • Physically transfer volumes to archive
  • Update corrected or enhanced data sets
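
The validation and fixity steps in the Design portion of the list above (checksums, hashes) can be illustrated with a short sketch. This is not the PDS procedure, which these notes do not describe. It assumes SHA-256 as the hash, a volume laid out as an ordinary directory tree, and a manifest held as a simple path-to-digest mapping; the function names `file_digest`, `build_manifest`, and `verify_manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(volume_root: Path) -> dict[str, str]:
    """Map each file's path (relative to the volume root) to its digest.

    The producer would run this before delivery (assumed workflow).
    """
    return {
        path.relative_to(volume_root).as_posix(): file_digest(path)
        for path in sorted(volume_root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(volume_root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest changed.

    The archive would run this after transfer (assumed workflow).
    """
    problems = []
    for relative_path, expected in manifest.items():
        target = volume_root / relative_path
        if not target.is_file() or file_digest(target) != expected:
            problems.append(relative_path)
    return problems
```

In this sketch the producer would call `build_manifest` on the staged volume and ship the result with the data, and the archive would call `verify_manifest` on the received copy; an empty list means every listed file arrived unchanged. A digital signature over the manifest would additionally establish who produced it, which plain hashes do not.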

    Steve Marley discussed data access patterns that are orthogonal to data ingestion patterns. There was considerable discussion of this problem and recognition that, if performance is an issue, it may be necessary to organize the data in multiple ways.

    Mike has the same problem when he needs to collate data from multiple instruments so users can see the data overlaid.

    You can't expect the data producers to do much more than guarantee that the data gets to the archive; the archive is responsible for transforming the data into a usable format. This appears to be the norm.

    Steve Marley: At some level you need to realize that you are in a domain, and you need to optimize for that domain.

    William Callicott: There can be a huge cost in unanticipated patterns of data usage.

    Oya Rieger: At Cornell the promise is to collect a very rich image collection, and to convert 'on the fly' to other formats as needed.

    If there are multiple orthogonal access patterns that are time critical, multiple copies must be stored.
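
A minimal sketch of the point about orthogonal patterns, under stated assumptions: the records, the `Observation` type, and the instrument names are hypothetical, and the "second copy" is simply a second in-memory organization of the same records. Data arrives in time order (the ingest pattern), while users ask for it by instrument (the access pattern).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    time: int          # acquisition time, arbitrary units
    instrument: str    # hypothetical instrument name
    payload: str       # stand-in for the actual data


def ingest_order(observations):
    """Ingest pattern: a single stream ordered by acquisition time."""
    return sorted(observations, key=lambda obs: obs.time)


def by_instrument(observations):
    """Access pattern: a second organization grouped by instrument.

    Within each instrument the records stay in time order.
    """
    grouped = defaultdict(list)
    for obs in ingest_order(observations):
        grouped[obs.instrument].append(obs)
    return dict(grouped)
```

Answering "everything from one instrument" from the time-ordered stream means scanning all of it, whereas the grouped organization answers it directly. That is the cost being traded against storage when multiple copies are kept.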

    Enforcing standards can cost you your data providers. A waiver should be possible but difficult to obtain.

    Where is the integrity of the data ensured?

    John Stegenga: In working with the small-time publisher, the rigor doesn't seem to be as necessary as in a large bureaucracy.

    The more proactive we are, when the concept for data is being created, the easier it is to involve ourselves with archival concerns. From a commercial view, we are committed to capture the data as submitted by the publisher. If we notice errors, we may point them out and ask that the publisher re-submit.

    Don: What are your biggest problems with the ingest process?

    John S.: Negotiation with the publisher. We need to be very wary of copyrights and creative integrity. We spend a long time negotiating and then an equally long time negotiating on a technical level.

    Don: So then, formalizing these procedures may shorten the time spent in negotiation?

    John S: Yes

    Jane C.: What we get in is a deliverable on a product or project, and the contractor gets paid whether the SIP is good or not, as long as the missile works.

    ISO 11179 was mentioned. This is a metadata standard for documenting data elements and for registering them.

    Jane is more concerned with metadata aspects. Realistically not every archive has the clout to enforce standards.

    A research library's interest is in longevity, whereas a publisher is interested in more short term concerns.

    Action Item: Mike Martin takes an action item to provide the PDS manual via URL.
    Action Item: Create a mailing list of the participants of the group.

    You've got to show the data providers that you are providing a service that either saves them money, or helps them meet contractual obligations.

    Don: Are you interested in a standard along the lines of what Mike described if it were properly generalized to accommodate the library services as well as small archives?

    EPA online has a software package you can download to create a metadata registry.

    What is the minimum set of metadata? Could this be standardized?

    The web has impacted the relationship between the information provider and the archivist. There used to be a professional who mediated the transfer. Now there is a web service that provides the archivist function, except the web service doesn't completely fill the role.

    Is there a scenario where the producer is responsible to guarantee the longevity, and the 'archive' is really just a data store?

    (Additional Notes are To Be Supplied)

    A service of NOST at NSSDC. Comments and suggestions are always welcome.

    Author: Archival Workshop Program Committee, +1.301.286.3575
    Curator: John Garrett, +1.301.286.3575
    Responsible Official: Code 633.2 / Don Sawyer, +1.301.286.2748
    Last Revised: 31 October 1999, Don Sawyer