The Archive Ingest Process

NOTE: This document is derived from the Planetary Data System's Data Preparation Workbook, JPL Document D-7669

Mike Martin

10-7-99

This paper describes a methodology for the archive ingest process. It identifies the steps that need to be carried out by both the producer and the archive staff to plan and execute the generation and transfer of information products to an archive. It also identifies the resources (procedures, standards, tools) that will be required to support the ingest process. The term "project" is used to represent an entity that will be a producer and will constitute a major interface to the archive over a period of time.

As background, there are seven sub-functions in the OAIS reference model that support the ingest function. These functions are presented from the archive's point of view. In summary, the archive negotiates submission agreements with the producer, receives submissions, performs physical validations on the submissions, and generates compliant archival information packages and descriptive information. These are audited by administration and then transferred to archival storage and the data management system, respectively.

The ingest methodology must support these seven sub-functions, but must also define the roles and responsibilities of both the producer and the archive in this process. There are six steps in the ingest process:

1. Orientation
2. Archive Planning
3. Archive Design
4. Data Set Assembly and Validation
5. Review
6. Delivery

The approach to archiving will vary depending on the producer's relationship to the archive. For archiving simple data sets the interface may be streamlined, not requiring extensive plans and documents. For large projects, early and frequent contact is desirable and will often be formalized. Figure 1 shows a typical time-line and the correlation between data producer and archive events. Archive personnel and discipline experts may become involved, providing a wide range of support. This will help to ensure that the data flows smoothly into the system. Cooperative development along these lines is often cost-effective for both the producer and the archive.

Orientation

This section describes the orientation phase of the ingest process. During this phase contact is established between the archive and the data producer.

Establish Contact with the archive

The first point of contact with the archive should be through the archive administration team. The administration function coordinates support for all ingest activities and is responsible for negotiating a submission agreement, which will provide a written agreement between the archive and the producer regarding the archive submission.

Provide General Information to the archive

For large producers, the following general kinds of information will be helpful during early contact with the archive team:

Other general information that may be useful to the archive, to the extent it is known, may include:

Obtain the Archive Orientation Material

The archive orientation material should provide an overview of the archive, including general descriptions of the roles and responsibilities of producers and of the archive staff. The orientation may range from a formal presentation to the delivery of some printed documentation. As a minimum, the archive should provide the producer with an Archive Ingestion Package. This package should include the archive ingestion process description, archive standards references, archive nomenclature and data dictionary, and the archive tool reference.

Establish Technical Contacts

For large projects, the archive staff will identify personnel who will provide technical support to the project. The project should also identify its management and technical contacts for archive issues at this time.

Generally, large projects will establish teams (Archive Working Group) to address project-wide archive issues. These teams are chaired by members of the project and attended by archive representatives and project personnel involved in data archiving.

In addition, the project or the archive may establish a Project Interface Team (PIT), which meets more frequently than the project team, and addresses some of the more detailed archive issues. The PIT will consist of members of the archive, discipline experts and the project.

Archive Planning

For large projects, archive planning consists of identifying the data to be archived, developing a detailed archiving schedule, and defining an end-to-end data flow through the project. Part of this planning also defines roles and responsibilities of the variety of teams involved in producing final archive products. This activity is less formal for data restorations, and does not require preparation of the documents discussed in the following steps.

Prepare a Producer Data Management Plan (PDMP)

A Producer Data Management Plan (PDMP) provides a general description of the project data processing, cataloging, and communication plan. The archive manager should be a signatory on this document.

The archive should provide assistance in developing the PDMP at the request of the project. The archive should provide guidelines for Producer Data Management Plans, including sample PDMPs. Once the project has completed a draft of this document, the archive will participate in its review, and help to identify and resolve archive-related issues.

Prepare a Submission Agreement (SA)

The Submission Agreement (SA) provides a detailed description of the production and delivery plans for archive products for a project. The archive manager is a signatory on this document.

The contents of an SA include:

The archive will also provide assistance in developing the SA at the request of the project. Example SAs may be obtained from the administration team or from the archive data engineer assigned to the project. The archive will participate in reviewing the SA.

Plan for Updates to the PDMP and SA

It is inevitable that changes will occur that will affect both the PDMP and the SA. In particular, detailed data set lists and schedules found in the SA often develop over time, making these appendices "working" guidelines for archive planning. The archive data engineer should be notified of changes to these plans as they occur, and document revisions should be scheduled periodically. Changes include additions and deletions of products, changes in schedule, and changes in quantity, product content or format.

Keep the Archive Data Engineer Informed

For large projects, the assigned data engineer will need to be placed on all relevant project distribution lists. This individual may also be invited to attend certain project meetings that involve discussions about archiving. By including the archive in data system planning efforts, there may be ways to reduce the work required later for preparing data for archiving.

Participate in Planning Meetings

For active flight projects, the archive data engineer or Project Interface Team (PIT) leader will schedule regular planning meetings. During the design phases of a project, these meetings may be used to help with drafting and reviewing the PDMP and SA. During the execution phase of a project, when the emphasis shifts to archive testing and production, these meetings may be used to develop details of archive transfer procedures between the project and the archive.

Review/Sign off Archive Interface Plan

For large projects, an Archive Interface Plan (AIP) may be written by the archive data engineer or another member of the archive staff. The Archive Interface Plan establishes the general roles and responsibilities of the archive, its discipline experts, related archives, and each project team that has an interface with the archive. There will be signatories on this document for each identified interface. This document will require review and signature, and issues may be brought to Project Interface Team planning meetings for discussion.

Archive Design

Archive design consists of reviewing archive standards, designing data objects, representation information and preservation description information, packaging them into data sets and collections, determining storage media, designing volumes and volume sets, designing the data production process, planning for data validation and developing high level descriptive information for the archive catalog. These tasks are not meant to be sequential. In many cases there may be several iterations between various steps.

Review Archive Standards

Once a data set has been identified for archiving, it is time to review various ways to organize the data so that it will be most accessible and usable to a broad community. The archive should have developed standards to help in both organizing and describing data. A review of the Archive Ingestion Process Description, the Standards Reference, and the Data Dictionary and Nomenclature Standards should provide a basic understanding of the archive operations and standards.

For large projects, the archive may sponsor one or more Archive Ingestion Workshops that will focus on the use of the archive standards. The archive data engineer and discipline experts are available to answer your questions and provide guidance on archive design issues. They will be able to provide examples of data volumes already entered into the archive, to show how data is packaged, labeled, documented, catalogued, and ordered. Seeing examples of what has been done in the past is usually the easiest way to understand the design process.

Design Data Products and Representation Information

This activity includes the determination of both the contents and the file format of any digital data products for a data set. This includes the definition of the data objects that make up the product and the definition of the representation information.

Define the Data Product. This activity involves the definition of the data object from a user perspective. This includes defining how the data should be divided into individual data products and determining which parameters or measurements will be included. Typically, the definition of a data product will depend on several factors. The most important consideration, however, is the probable way the data will need to be accessed and the expected frequency of access. There will probably be some iteration in defining the data product, and the next step, estimating the file sizes.

Estimate File Sizes. Once the data product has been identified, some estimates of the size of data files that will result from this definition must be made. The producer should determine both average size files and maximum size files that could be produced.
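As a minimal sketch of this step, the average and maximum file sizes for an image-style data product might be estimated from its dimensions and sample size. All of the numbers and the `label_overhead` parameter below are illustrative assumptions, not archive requirements:

```python
# Hypothetical sketch: estimate average and maximum file sizes for an
# image data product from its dimensions and sample size. The numbers
# and the label_overhead parameter are illustrative assumptions.

def estimate_product_size(lines, samples_per_line, bytes_per_sample,
                          label_overhead=2048):
    """Return the size in bytes of one data product file."""
    return lines * samples_per_line * bytes_per_sample + label_overhead

# A typical product: a 1024 x 1024 image with 2-byte samples.
average_size = estimate_product_size(1024, 1024, 2)

# Worst case: the instrument's largest frame, 4096 x 4096 with 2-byte samples.
maximum_size = estimate_product_size(4096, 4096, 2)
```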

Determine the Data Format. The format for each type of data product file that will be present in the data set or collection is determined next. Data formats may be found in software user's guides, requirements or interface specification documents, or embedded in the software source code designed to read or write the data. In many cases, it may be best to reformat the data into a more portable format. Once again, the anticipated use of the data will have a great bearing on the type of format and organization of the data. Generally, the more structured the anticipated usage, the less concern about the storage format. For example, compressed data products archived together with special purpose software to perform decompression and other functions may have specialized formats. Tabular data that would be well suited to use in an off-the-shelf database product or spreadsheet package should be stored in easily interpreted data formats.

Determine the Data Objects and File Configurations. This step will involve determining the data objects in the data products that have been identified. The archive should provide a set of standard data object descriptions to be used to define both the contents and structure of data products. Once the data objects of a given data product have been identified, the producer must determine the physical file structure and storage architecture for the objects within the file system. Quite often there is a one-to-one correspondence between a data product and a file; however, there may be reasons for splitting data object components into separate files or even storing them in separate directories.

Design Data Product Representation Information. One of the most important steps in designing the data products for a data set is the design of the representation information. This information is critical to the long-term understanding of the structure of the data product and to the interpretation of the data values.

Design the Data Set or Collection

Define the Purpose and Scope of the Data Set or Collection. Defining the objective and scope of a data set or a data set collection is often done before, or at the same time as the definition of the individual data products. Some questions to consider in this determination are:

Data sets may be grouped together with other data sets into data set collections. Data set collections consist of data sets that are related by observation type, discipline, target, or time, and should be treated as a unit, to be archived and distributed together for a specific scientific objective or analysis.

Determine Other Data Set or Collection Components. Determine the ancillary data (e.g., navigation data, calibration data, pointing information), software, and documentation that will be included with the data set or data set collection. In certain cases, ancillary data may be archived as separate data sets, particularly if it is applicable to a wide variety of data sets.

Create Data Set or Collection Names and Identifiers. Choosing a name and an identifier for your data set or collection is usually straightforward. Since these values are a required part of the data product label, these are usually determined early in the design process. The archive standards document should provide guidance for naming conventions for data sets and collections.

Determine the Storage Medium. Media used for storage of archival data sets should be utilized efficiently. This includes full recording of individual media volumes, selection of recording formats to minimize wasted space on the media, and the use of simple data compression techniques to reduce the volume of infrequently accessed data. If a medium other than magnetic tape or CD-ROM is to be used, then the archive should be notified to determine whether it can be accommodated. The use of replicable media, such as CD-ROM, for products that are expected to be widely used, is recommended. Media recommendations should be included in the archive standards document.
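The "simple data compression" recommendation can be sketched with gzip, a widely supported, well-documented format, so compressed products remain interpretable long after the producing software is gone. The sample data here is an invented, highly repetitive record:

```python
# A minimal sketch of simple data compression for infrequently accessed
# data. gzip is an assumed choice; the archive standards document would
# name the permitted compression formats.
import gzip

raw = b"sample telemetry record " * 1000   # illustrative, repetitive data
compressed = gzip.compress(raw)

# Fraction of space saved on the storage medium.
savings = 1 - len(compressed) / len(raw)
```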

Design Volumes and Volume Sets

A volume represents a physical unit of data such as a magnetic tape, floppy disk, or CD-ROM. Delivery media that support a hierarchical organization should be used if possible, e.g., CD-ROM, CD-R. When using a serial medium, such as magnetic tape, a hierarchical organization cannot be physically implemented and may have to be recreated by the archive staff. If a stream-oriented medium is chosen, then the explicit volume structure must be documented. For the purposes of this discussion, we will consider a data volume that supports a hierarchical directory structure.

Map Data Sets or Collections to Volumes. Taking into account the total volume of data to be archived (including ancillary data), estimates of individual file sizes, schedule of availability of the data (including any proprietary period), and operational constraints of the volume assembler, determine the allocation of data to physical volumes. When a data set spans multiple volumes, it is recommended that each volume provide a complete set of ancillary data pertaining to the data contained on the volume.
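The allocation of files to physical volumes can be sketched as a simple first-fit packing against the usable capacity of the medium. The function name, the byte sizes, and the file names below are all invented for illustration; the ancillary-data and scheduling considerations from the text are omitted:

```python
# Hypothetical first-fit sketch of mapping data files to physical volumes.
# capacity is the usable volume size in bytes; files is a list of
# (name, size) pairs in the order they become available.

def map_files_to_volumes(files, capacity):
    """Return a list of volumes, each a list of file names."""
    volumes, current, used = [], [], 0
    for name, size in files:
        if size > capacity:
            raise ValueError(f"{name} exceeds a single volume")
        if used + size > capacity:        # close this volume, start a new one
            volumes.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        volumes.append(current)
    return volumes

# Illustrative sizes only; a real CD-ROM holds roughly 650 MB.
volumes = map_files_to_volumes(
    [("a.img", 300), ("b.img", 300), ("c.img", 200), ("d.img", 150)],
    capacity=650)
```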

Name the Volumes and Volume Sets. If the data set or data set collection spans more than one volume, a name and volume set identifier must be selected to uniquely identify it. This is especially important if the volume set needs to be ordered and distributed as a single unit. The ingestion standards document should provide guidance on Volume and Volume Set Naming.

Determine non-data Subdirectories and Files for each Volume. A number of additional directories and/or files are added to each archive volume to provide needed documentation, indices, software, etc. to allow proper use of the data volume. Some of these directories and files are required by the archive system, some are recommended, and some have specific applications. The standards for required and optional subdirectories and files should be specified in the archive standards document.
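A check for the required non-data entries on a staged volume can be sketched as follows. The names in the `REQUIRED` set are assumptions for illustration; the authoritative list belongs in the archive standards document:

```python
# Hypothetical validation sketch: verify that a staged volume contains
# the required non-data entries. The REQUIRED names are assumed, not
# normative; the archive standards document defines the real list.
import os
import tempfile

REQUIRED = {"AAREADME.TXT", "INDEX", "DOCUMENT"}   # illustrative names

def missing_entries(volume_root):
    """Return required entries absent from the volume's root directory."""
    return sorted(REQUIRED - set(os.listdir(volume_root)))

# Demonstrate on a throwaway volume that is missing its DOCUMENT directory.
root = tempfile.mkdtemp()
open(os.path.join(root, "AAREADME.TXT"), "w").close()
os.mkdir(os.path.join(root, "INDEX"))
missing = missing_entries(root)
```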

Determine the Data Organization. Archival data sets are generally organized by time, by some target of the observation, or by some event which is being studied. It is important to minimize the number of directory levels that must be traversed to get to the data and to minimize the amount of directory changing required during normal data access operations. For example, if two related data files are always processed in conjunction with each other, the files should be grouped in the same directory.

Establish Data Product File Naming and Directory Naming Conventions. The file naming scheme utilized should result in file names that uniquely identify the data included in the file but have some characteristics in common with other files. The archive standards document should provide guidance for file and directory naming conventions.
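Such a convention can be enforced mechanically with a regular expression. The pattern below (a three-letter instrument code, an eight-digit date, and a four-digit sequence number) is an invented example, not an archive standard:

```python
# Hypothetical sketch of enforcing a file naming convention. The pattern
# instrument_date_sequence.ext is an invented convention for illustration.
import re

NAME_PATTERN = re.compile(r"^[a-z]{3}_\d{8}_\d{4}\.(img|tab)$")

def is_valid_name(filename):
    """Return True if the file name follows the assumed convention."""
    return NAME_PATTERN.fullmatch(filename) is not None
```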

Determine the Indices Needed. An index is an important part of an archive volume. A good index or a set of indices provides the user with the means to rapidly locate specific data files of interest. For instance, columns containing latitude and longitude ranges or time periods followed by a column of directory and file names will provide the user with information needed to find individual data files of interest for a particular investigation. Wherever possible, indices should be built directly from the data product labels.
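Building an index directly from the labels can be sketched as below. Each label here is a plain dict with invented keys (`min_lat`, `max_lat`, `start_time`, `path`); a real producer would parse the actual label files:

```python
# Hypothetical sketch: build a tabular index from data product labels.
# The label keys and column names are illustrative assumptions.
import csv
import io

def build_index(labels):
    """Write one index row (lat range, time, path) per product label."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["MIN_LAT", "MAX_LAT", "START_TIME", "FILE_PATH"])
    for lab in labels:
        writer.writerow([lab["min_lat"], lab["max_lat"],
                         lab["start_time"], lab["path"]])
    return out.getvalue()

index_text = build_index([
    {"min_lat": -5.0, "max_lat": 5.0,
     "start_time": "1999-07-10T00:00:00", "path": "data/cam_0001.img"},
])
```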

Design Data Production Process

This activity will be specific to the given environment of the data provider. The archive data engineer may be able to provide procedures or tools to assist in the volume production process.

Plan Data Validation Process

For large projects, data validation plans are usually documented in project operational procedures. For other submissions, planning the data validation process is part of the review process. The archive should provide several validation tools that can be useful in a variety of validation steps.

For large projects, a Data Validation Plan may be prepared which defines the operational procedures that will be in place for validating the content and physical organization of the volumes produced. The archive manager should have a signature line on this document. A sample Validation Plan may be obtained from the archive data engineer. Draft Data Validation Plans should be provided to the archive data engineer for review.

When planning data validation, the following should be addressed:

Prepare the Preservation Descriptive Information

Data submitted to the archive should be accompanied by a set of Preservation Descriptive Information. This includes reference information, fixity information, provenance information and context information. This information will be included in the submission information package. Examples and assistance in completing these items will be provided by the archive data engineer.
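The fixity component can be sketched as a checksum manifest: one digest per file, which the archive can later recompute to detect corruption. SHA-256 is an assumed algorithm choice here; the submission agreement would name the one actually required:

```python
# A minimal sketch of generating fixity information: a checksum for
# every file in a submission. SHA-256 is an assumed choice; the file
# paths and contents are illustrative.
import hashlib

def fixity_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()

manifest = {path: fixity_digest(content)
            for path, content in [("data/a.img", b"\x00\x01"),
                                  ("data/b.tab", b"1,2,3\n")]}
```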

Data Set Assembly and Validation

Data set assembly consists of collecting and formatting all of the components of the data set, processing them according to the design, and storing them on the planned medium. It also involves preparing data product labels, writing the volume documentation, and creating the volume indices.

Create the Data Products

Process and format the data objects into products as designed. Creation of representation information may be concurrent with processing the data objects themselves, or may be done independently after other processing has been completed.

Create a Data Staging Area for Volume Production

Before volume production begins, a data staging area should be created. This area should provide a structure similar to that which will be used on the final medium; this allows the volume assembler to store the pieces of the data set as they are assembled.
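Creating such a staging area can be sketched as building one subdirectory per planned volume directory. The subdirectory names below are illustrative assumptions:

```python
# Hypothetical sketch: build a staging area whose directory layout
# mirrors the planned volume structure, so components can be stored in
# place as they are assembled. The subdirectory names are assumed.
import os
import tempfile

def create_staging_area(root, subdirs=("CATALOG", "DATA", "DOCUMENT", "INDEX")):
    """Create one staging subdirectory per planned volume directory."""
    for sub in subdirs:
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    return sorted(os.listdir(root))

staged = create_staging_area(tempfile.mkdtemp())
```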

Prepare Volume Components

This includes preparing any needed volume documentation, collecting preservation description information, collecting and formatting documentation, preparing supporting software, collecting ancillary files, and generating any data indices needed to accompany the volume.

Prepare a Set of Test Volumes and Distribute

Test volumes should be prepared and distributed to the archive data engineer for volume validation.

Execute Data Validation Procedures

Volumes will be validated according to the operational procedures and validation criteria outlined in the Project's Data Validation Plan or the Submission Agreement.

Transfer the Data to the Final Medium

After the test volumes have been validated the archive products are written to the media that will be used for submission to the archive.

Review

Prior to acceptance by the archive, the submission information package needs to be reviewed. The purpose of the review process is to ensure the accuracy, dependability, and usefulness of the data to its designated community. This review process is flexible, and depends on both the amount and complexity of the data being archived. In some cases the archive may require only an internal review to assure that submitted information packages meet archive standards. Other reviews may require a process called a Peer Review, where potential users of the data are called upon to evaluate the submission.

Establish a Review Committee

The archive manager or delegate is responsible for establishing a review committee. Members typically include the archive data engineer, the data producer and both discipline experts and potential users from the designated community. A chairperson of the committee will be named for coordinating the review process. For large projects, a committee established by a project ingest team may often perform a substantial part of this review function.

Prepare for the Data Delivery Review or Peer Review

Reviews may be handled in a variety of ways, and are usually determined by the availability and location of participants on the review committee. Most often, in order to provide reviewers enough time to adequately review the data, a review period is established. A meeting (or a teleconference) can then be held at the end of the review period to discuss results and identify any liens. Three to four weeks is a typical time period for a review, but this depends on the size or complexity of the data, and whether results from any prior reviews are available.

The chairperson of the review committee will provide instructions to the review committee. This includes providing either copies of the data set(s) (e.g., CD-WO media) or electronic access to the data (or samples of them), supporting documentation, completed preservation description information, product representation information samples if applicable, and review criteria. If special software is available for viewing, analyzing, or ordering the data, distribution of or access to this software will also be provided.

Conduct the Data Delivery Review or Peer Review

Once preparations for the review are complete, the review is conducted according to the instructions provided by the review chairperson. If the review is conducted as a meeting, data providers or their representatives may make presentations on the submission information packages which they have provided. This may include demonstrations of associated software to display, subset, or order the data. For data being reviewed over a longer time period, reviewers may read provided documentation, test out supplied software, and where possible, try using the data for science purposes and with locally available software.

At the end of the review meeting, or review period, the review chairperson summarizes the results of the review, and any liens identified are documented. Liens are usually classified as major and minor. One or more reviews may be held before all the data are delivered.

Correct/Document Review Liens

Wherever feasible, liens are corrected prior to delivery of the submission information package to the archive. If it is not feasible to correct all of the liens, the results of the review will be archived with the data sets to document all known errors and discrepancies. The review committee is responsible for deciding whether to go ahead and archive the data with documented liens.

Delivery

After the Data Delivery Review or Peer Review is complete, data sets are considered to be delivered to the archive.

Coordinate Generation of Duplicate Copies with the Archive (if applicable)

If the project chooses to produce products using a replicable medium (e.g., CD-ROM) for internal distribution, the archive may wish to provide funding for additional copies to be made for the archive's designated community. The project should keep the archive informed about production plans in order to avoid multiple setup costs for additional duplicates.

Data Classification Procedures

There may be special procedures limiting access to submitted products. It is important that the archive ensure that its products meet Federal data distribution requirements of the Departments of Commerce, State, and Defense. The archive needs to determine the appropriate security classifications for all submitted data and handle the data accordingly.

Physically Transfer Volumes to the Archive

Projects should implement the data delivery steps specified in their Submission Agreement. These procedures will ensure the project notifies the archive personnel when the physical products are delivered to the archive.

Update Corrected or Enhanced Data Sets

Data sets can be updated to include corrections or enhancements after they have been archived. Contact the archive data engineer for information on this process.