The Archive Ingest Process
NOTE: This document is derived from the Planetary Data System's Data Preparation Workbook, JPL Document D-7669
This paper describes a methodology for the archive ingest process. It identifies the steps that need to be carried out by both the producer and the archive staff to plan and execute the generation and transfer of information products to an archive. It also identifies the resources (procedures, standards, tools) that will be required to support the ingest process. The term "project" is used to represent an entity that will be a producer and will constitute a major interface to the archive over a period of time.
As background, there are seven sub-functions in the OAIS reference model that support the ingest function. These functions are presented from the archive's point of view. In summary, the archive negotiates submission agreements with the producer, receives submissions, performs physical validation on the submissions, and generates compliant archival information packages and descriptive information, which are audited by administration and then transferred to archival storage and the data management system, respectively.
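The flow summarized above can be sketched in code. The following is a minimal illustration, not an archive interface: every function, field, and check here is hypothetical, and only the sequence of sub-functions (receive, validate, generate the archival information package and descriptive information, then transfer to archival storage and data management) comes from the text.

```python
import hashlib

archival_storage = []   # destination for archival information packages (AIPs)
data_management = []    # destination for descriptive information

def receive(submission):
    # Receive Submission: accept the package delivered under the agreement.
    assert "content" in submission, "submission must carry content"

def validate_physical(submission):
    # Quality Assurance: a physical validation, here a checksum comparison.
    digest = hashlib.md5(submission["content"]).hexdigest()
    if digest != submission["checksum"]:
        raise ValueError("physical validation failed")

def generate_aip(submission):
    # Generate AIP: package the content with its representation information.
    return {"content": submission["content"],
            "representation_info": submission.get("format", "unknown")}

def generate_descriptive_info(aip):
    # Generate Descriptive Information for the archive catalog.
    return {"size": len(aip["content"]),
            "format": aip["representation_info"]}

def ingest(submission):
    # Co-ordinate Updates: after administrative audit, the AIP goes to
    # archival storage and the descriptive information to data management.
    receive(submission)
    validate_physical(submission)
    aip = generate_aip(submission)
    archival_storage.append(aip)
    data_management.append(generate_descriptive_info(aip))

# Example submission (all values illustrative).
sip = {"content": b"raw image bytes",
       "checksum": hashlib.md5(b"raw image bytes").hexdigest(),
       "format": "FITS"}
ingest(sip)
```

In a real archive each of these steps is an organizational activity, not a single function call; the sketch only shows how the outputs of one sub-function feed the next.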
The ingest methodology must support these seven sub-functions, but must also define the roles and responsibilities of both the producer and the archive in this process. There are six steps in the ingest process: orientation, archive planning, archive design, data set assembly, submission review, and delivery to the archive.
The approach to archiving will vary depending on the producer's relationship to the archive. For archiving simple data sets the interface may be streamlined, not requiring extensive plans and documents. For large projects, early and frequent contact is desirable and will often be formalized. Figure 1 shows a typical time-line and the correlation between data producer and archive events. Archive personnel and discipline experts may become involved, providing a wide range of support. This will help to ensure that the data flows smoothly into the system. Cooperative development along these lines is often cost-effective for both the producer and the archive.
This section describes the orientation phase of the ingest process. During this phase contact is established between the archive and the data producer.
The first point of contact with the archive should be through the archive administration team. The administration function coordinates support for all ingest activities and is responsible for negotiating a submission agreement, which will provide a written agreement between the archive and the producer regarding the archive submission.
For large producers, the following general kinds of information will be helpful during early contact with the archive team:
Other general information that may be useful to the archive, to the extent it is known, may include:
The archive orientation material should provide an overview of the archive, including general descriptions of the roles and responsibilities of producers and of the archive staff. The orientation may range from a formal presentation to the delivery of some printed documentation. As a minimum, the archive should provide the producer with an Archive Ingestion Package. This package should include the archive ingestion process description, archive standards references, archive nomenclature and data dictionary, and the archive tool reference.
For large projects, the archive staff will identify a data engineer who will provide technical support to the project. The project should also identify its management and technical contacts for archive issues at this time.
Generally, large projects will establish a team (an Archive Working Group) to address project-wide archive issues. This team is chaired by a member of the project and attended by archive representatives and project personnel involved in data archiving.
In addition, the project or the archive may establish a Project Interface Team (PIT), which meets more frequently than the project team, and addresses some of the more detailed archive issues. The PIT will consist of members of the archive, discipline experts and the project.
For large projects, archive planning consists of identifying the data to be archived, developing a detailed archiving schedule, and defining an end-to-end data flow through the project. Part of this planning also defines roles and responsibilities of the variety of teams involved in producing final archive products. This activity is less formal for data restorations, and does not require preparation of the documents discussed in the following steps.
A Producer Data Management Plan (PDMP) provides a general description of the project data processing, cataloging, and communication plan. The archive manager should be a signatory on this document.
The archive should provide assistance in developing the PDMP at the request of the project, and should supply guidelines for Producer Data Management Plans, including sample PDMPs. Once the project has completed a draft of this document, the archive will participate in its review and help to identify and resolve archive-related issues.
The Submission Agreement (SA) provides a detailed description of the production and delivery plans for archive products for a project. The archive manager is a signatory on this document.
The contents of an SA include:
The archive will also provide assistance in developing the SA at the request of the project. Example SAs may be obtained from the administration team or from the archive data engineer assigned to the project. The archive will participate in reviewing the SA.
It is inevitable that changes will occur that will affect both the PDMP and the SA. In particular, detailed data set lists and schedules found in the SA often develop over time, making these appendices "working" guidelines for archive planning. The archive data engineer should be notified of changes to these plans as they occur, and document revisions should be scheduled periodically. Changes include additions and deletions of products, changes in schedule, and changes in quantity, product content or format.
For large projects, the assigned data engineer will need to be placed on all relevant project distribution lists. This individual may also be invited to attend certain project meetings that involve discussions about archiving. By including the archive in data system planning efforts, the project may reduce the work required later to prepare data for archiving.
For active flight projects, the archive data engineer or Project Interface Team (PIT) leader will schedule regular planning meetings. During the design phases of a project, these meetings may be used to help with drafting and reviewing the PDMP and SA. During the execution phase of a project, when the emphasis shifts to archive testing and production, these meetings may be used to develop details of archive transfer procedures between the project and the archive.
For large projects, an Archive Interface Plan (AIP) may be written by the archive data engineer or another member of the archive staff. The Archive Interface Plan establishes the general roles and responsibilities of the archive, its discipline experts, related archives, and each project team that has an interface with the archive. There will be signatories on this document for each identified interface. This document will require review and signature, and issues may be brought to Project Interface Team planning meetings for discussion.
Archive design consists of reviewing archive standards, designing data objects, representation information and preservation description information, packaging them into data sets and collections, determining storage media, designing volumes and volume sets, designing the data production process, planning for data validation and developing high level descriptive information for the archive catalog. These tasks are not meant to be sequential. In many cases there may be several iterations between various steps.
Once a data set has been identified for archiving, it is time to review various ways to organize the data so that it will be most accessible and usable to a broad community. The archive should have developed standards to help in both organizing and describing data. A review of the Archive Ingestion Process Description, the Standards Reference, and the Data Dictionary and Nomenclature Standards should provide a basic understanding of the archive operations and standards.
For large projects, the archive may sponsor one or more Archive Ingestion Workshops that will focus on the use of the archive standards. The archive data engineer and discipline experts are available to answer questions and provide guidance on archive design issues. They will be able to provide examples of data volumes already entered into the archive, to show how data is packaged, labeled, documented, catalogued, and ordered. Seeing examples of what has been done in the past is usually the easiest way to understand the design process.
This activity includes the determination of both the contents and the file format of any digital data products for a data set. This includes the definition of the data objects that make up the product and the definition of the representation information.
Data sets may be grouped together with other data sets into data set collections. Data set collections consist of data sets that are related by observation type, discipline, target, or time, and should be treated as a unit, to be archived and distributed together for a specific scientific objective or analysis.
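The grouping of data sets into collections can be illustrated with a short sketch. The field names (`id`, `target`) are invented for the example; the text only states that collections are formed around a shared attribute such as observation type, discipline, target, or time.

```python
from collections import defaultdict

def group_into_collections(data_sets, key="target"):
    """Group data set descriptions into collections by a shared attribute.

    `data_sets` is a list of dicts; `key` names the attribute that
    relates the members of a collection (target, discipline, time, ...).
    """
    collections = defaultdict(list)
    for ds in data_sets:
        collections[ds[key]].append(ds["id"])
    return dict(collections)

# Illustrative input: three data sets, two sharing a target.
sets = [{"id": "DS1", "target": "MARS"},
        {"id": "DS2", "target": "MARS"},
        {"id": "DS3", "target": "VENUS"}]
collections = group_into_collections(sets)
```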
This activity will be specific to the given environment of the data provider. The archive data engineer may be able to provide procedures or tools to assist in the volume production process.
For large projects, data validation plans are usually documented in project operational procedures. For other submissions, planning the data validation process is part of the review process. The archive should provide several validation tools that can be useful in a variety of validation steps.
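One kind of validation tool the archive might supply is a physical-integrity check of a delivered volume. The sketch below is an assumption about what such a tool could look like: the manifest format (relative path to MD5 digest) and the lien messages are invented for illustration.

```python
import hashlib
import os

def validate_volume(root, manifest):
    """Return a list of liens: missing files and checksum mismatches.

    `root` is the volume's top-level directory; `manifest` maps relative
    file paths to their expected MD5 hex digests.
    """
    liens = []
    for relpath, expected in manifest.items():
        path = os.path.join(root, relpath)
        if not os.path.exists(path):
            liens.append(f"MISSING: {relpath}")
            continue
        with open(path, "rb") as f:
            actual = hashlib.md5(f.read()).hexdigest()
        if actual != expected:
            liens.append(f"CHECKSUM MISMATCH: {relpath}")
    return liens
```

An empty return value means the volume passed this physical check; content validation (are the data scientifically usable?) remains a human review activity.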
For large projects, a Data Validation Plan may be prepared which defines the operational procedures that will be in place for validating the content and physical organization of the volumes produced. The archive manager should have a signature line on this document. A sample Validation Plan may be obtained from the archive data engineer. Draft Data Validation Plans should be provided to the archive data engineer for review.
Data set assembly consists of collecting and formatting all of the components of the data set, processing it according to the design, and storing it on the planned medium. It also involves preparing data product labels, writing the volume documentation, and creating the volume indices.
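Creating the volume index mentioned above can be sketched as a walk over the assembled volume that records one row per file. The column set here (file name and size) is illustrative only; a real archive's index structure would be defined by its standards documents.

```python
import csv
import os

def write_volume_index(root, index_path):
    """Walk an assembled volume and write a simple index of its files.

    Writes one CSV row per file under `root`, with its path relative
    to the volume root and its size in bytes.
    """
    with open(index_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["FILE_NAME", "SIZE_BYTES"])
        for dirpath, _, filenames in os.walk(root):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                writer.writerow([rel, os.path.getsize(full)])
```

Writing the index to a location outside the volume tree (or excluding it from the walk) keeps the index from listing itself.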
Process and format the data objects into products as designed. Creation of representation information may be concurrent with processing the data objects themselves, or may be done independently after other processing has been completed.
Test volumes should be prepared and distributed to the archive data engineer for volume validation.
Volumes will be validated according to the operational procedures and validation criteria outlined in the Project's Data Validation Plan or the Submission Agreement.
After the test volumes have been validated, the archive products are written to the media that will be used for submission to the archive.
Prior to acceptance by the archive, the submission information package needs to be reviewed. The purpose of the review process is to ensure the accuracy, dependability, and usefulness of the data to its designated community. This review process is flexible, and depends on both the amount and complexity of the data being archived. In some cases the archive may require only an internal review to assure that submitted information packages meet archive standards. Other submissions may require a process called a Peer Review, in which potential users of the data are called upon to evaluate it.
The archive manager or delegate is responsible for establishing a review committee. Members typically include the archive data engineer, the data producer, discipline experts, and potential users from the designated community. A chairperson will be named to coordinate the review process. For large projects, a committee established by a project ingest team may often perform a substantial part of this review function.
Reviews may be handled in a variety of ways, and are usually determined by the availability and location of participants on the review committee. Most often, in order to provide reviewers enough time to adequately review the data, a review period is established. A meeting (or a teleconference) can then be held at the end of the review period to discuss results and identify any liens. Three to four weeks is a typical time period for a review, but this depends on the size or complexity of the data, and whether results from any prior reviews are available.
The chairperson of the review committee will provide instructions to the review committee. This includes providing either copies of the data set(s) (e.g., on CD-WO media) or electronic access to the data (or samples of them), supporting documentation, completed preservation description information, product representation information samples if applicable, and review criteria. If special software is available for viewing, analyzing, or ordering the data, distribution of or access to this software will be provided as well.
Once preparations for the review are complete, the review is conducted according to the instructions provided by the review chairperson. If the review is conducted as a meeting, data providers or their representatives may make presentations on the submission information packages which they have provided. This may include demonstrations of associated software to display, subset, or order the data. For data being reviewed over a longer time period, reviewers may read provided documentation, test out supplied software, and where possible, try using the data for science purposes and with locally available software.
At the end of the review meeting, or review period, the review chairperson summarizes the results of the review, and any liens identified are documented. Liens are usually classified as major and minor. One or more reviews may be held before all the data are delivered.
Wherever feasible, liens are corrected prior to delivery of the submission information package to the archive. If it is not feasible to correct all of the liens, the results of the review will be archived with the data sets to document all known errors and discrepancies. The review committee is responsible for deciding whether to go ahead and archive the data with documented liens.
If the project chooses to produce products using a replicable medium (e.g., CD-ROM) for internal distribution, the archive may wish to provide funding for additional copies to be made for the archive's designated community. The project should keep the archive informed about production plans in order to avoid multiple setup costs for additional duplicates.
There may be special procedures limiting access to submitted products. It is important that the archive ensure that its products meet Federal data distribution requirements of the Departments of Commerce, State, and Defense. The archive needs to determine the appropriate security classifications for all submitted data and handle the data accordingly.
Projects should implement the data delivery steps specified in their Submission Agreement. These procedures ensure that the project notifies archive personnel when physical products are delivered to the archive.
Data sets can be updated to include corrections or enhancements after they have been archived. Contact the archive data engineer for information on this process.