National Aeronautics and Space Administration
NASA Space Science Data Coordinated Archive

White paper on NASA science data retention

Objective

This brief note addresses which NASA science data should be retained indefinitely, and the conditions under which certain data may, and should, be released (that is, destroyed).

Discussion

Observational data from NASA missions record the state of some aspects of our Earth, solar system, or universe at some point in time. Because our world is continually changing, time-stamped data cannot be recreated once abandoned or lost. New and old data, addressed in new combinations and in new ways, enable us to increasingly understand our physical world. Therefore, NASA observational data represent an asset that must be retained in a usable state into the indefinite future.

There are many "datasets" that result from a given NASA investigation, ranging from the initially downlinked telemetry data to some highly summarized versions of the physical parameters derived from the measurements. Not all such datasets need to be retained indefinitely.

The concept of the definitive dataset resulting from a given investigation is important here. The definitive dataset contains all of the science potential of the investigation in that no irreversible transformations have been applied to the data. The definitive dataset has been cleaned of telemetry errors, has typically been time-aligned with overlaps removed, has had needed ancillary data (e.g., orbit/attitude) appended, and has been otherwise annotated as appropriate. In other words, the definitive dataset is the highest level not-yet-irreversibly-transformed dataset from an investigation. In some circles, this is called the Level 1A data processing level.

Definitive datasets from NASA missions should be retained indefinitely.

Datasets leading up to the production of the definitive dataset should be retained only to a point six months past the creation and certification of the definitive dataset. Certification implies that all appropriate tests have been done by mission personnel to ensure that the definitive dataset faithfully captures the content of the datasets from which it was generated.
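
The six-month rule above can be illustrated with a short date calculation. This is only a sketch: the function names are hypothetical, and it assumes that "six months" means six calendar months after the certification date, with the day clamped to the end of shorter months. The paper does not specify how the interval is counted.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Return the date `months` calendar months after `d`.

    Assumption: if the target month is shorter, clamp to its last day.
    """
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def earliest_release_date(certified_on: date) -> date:
    """Earliest date on which pre-definitive data could be released."""
    return add_months(certified_on, 6)


if __name__ == "__main__":
    print(earliest_release_date(date(2024, 1, 15)))
```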

Typically, one or more data sets are generated from definitive datasets. These derived datasets ("post-definitive data sets") are typically more immediately usable than the definitive datasets, often because sensor outputs have been convolved with calibration coefficients to produce geophysical parameters with which science analyses are done. Some of these derived datasets capture much of the science potential of the investigation, insofar as full or near-full coverage and resolution in the independent variable space (time, space, energy, etc.) of the investigation are retained. On the other hand, many higher level derived datasets retain only a small (but possibly highly important) fraction of the science potential of the original observations.

Derived datasets should be retained as long as they remain scientifically viable (i.e., algorithms or coefficients used in their derivation remain credible) and the cost of regenerating them (for some anticipated request level) outweighs the cost of their retention and maintenance. (Note that some derived data sets become widely used and become considered "definitive"; we use "definitive" in a different sense in this paper, as described above.)

Datasets to be retained indefinitely should be archived in adherence to then-relevant standards regarding formats, media, etc., and with all supporting material (documentation, ancillary data, software, etc.) needed to make the data correctly and independently usable. Data integrity must be maintained (no data loss due to media deterioration) and datasets must continue to be findable, accessible, and usable.

Ensuring continuing data integrity and usability requires periodic data renewal cycles. Some such cycles will involve only bit migration from old to new media. Other such cycles may involve more resource intensive data reorganizations and/or reformattings and sometimes even recreation of related software.

Upon entering into particularly resource intensive data renewal cycles for definitive data sets, archive managers may solicit from the appropriate NASA Science Associate Administrator (AA) approval to release (destroy) the dataset rather than to renew it. The AA will approve such release if, after consultations with representatives of the potential data-using community, it is determined that the data renewal costs would not be justified in light of limited projected future use of the data.

As long as independently usable definitive data set(s) from a given investigation continue to exist, derived (post-definitive) datasets from that investigation that are about to undergo a data renewal cycle may be released by archive managers if, in their judgment and with the concurrence of their advisory committee of potential users, the cost of renewal and further retention is not justified by projected future data usage, especially given the option of regenerating the dataset.

Where derived data sets still exist for a given investigation but the corresponding definitive data set(s) no longer exist, the derived data set retaining the greatest measure of the science potential of the investigation will be treated as if it were a definitive data set, in that AA approval is required for its release.

NASA will endeavor to minimize data renewal cycle costs, and hence to maximize renewal/retention of definitive and derived data sets and their supporting materials, by defining and implementing standards for data formats, metadata, software, etc.

Synopsis of Retention Policies

  1. Pre-definitive data from a given mission or investigation will be released six months after project personnel certify that the definitive data set(s) created therefrom faithfully replicate their content.
  2. Definitive datasets and, for missions or investigations where definitive datasets no longer exist, the derived data sets retaining the most science potential of the mission or investigation, will be retained indefinitely, except that they may be released with the approval of the relevant Science Associate Administrator when the cost to renew/retain them exceeds their anticipated future value. (Such releases are expected to be rare.)
  3. Other derived data sets will be retained as long as they remain scientifically credible, except that they may be released by archive managers when the cost to renew/retain them exceeds the cost to regenerate them to meet anticipated future demand.
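
The three policies above can be read as a small decision procedure. The following is a minimal sketch of that reading, not an official NASA implementation. All class, field, and function names are hypothetical, and the cost and value figures are assumed to be estimates in a common, arbitrary unit supplied by archive managers and their advisory committees.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    PRE_DEFINITIVE = "pre-definitive"
    DEFINITIVE = "definitive"
    DERIVED = "derived"


class Action(Enum):
    RETAIN = "retain"
    RELEASE = "release"
    MANAGER_MAY_RELEASE = "archive manager may release"
    SEEK_AA_APPROVAL = "seek Associate Administrator approval to release"


@dataclass
class Dataset:
    kind: Kind
    # Months since the definitive dataset was certified (None if not yet certified).
    months_since_certification: Optional[int] = None
    # For derived data: does a usable definitive dataset still exist?
    definitive_exists: bool = True
    # For derived data: is this the derived dataset retaining the most science potential?
    retains_most_science_potential: bool = False
    scientifically_credible: bool = True
    # Hypothetical estimates, all in the same arbitrary unit.
    renewal_cost: float = 0.0
    regeneration_cost: float = 0.0
    projected_future_value: float = 0.0


def retention_action(ds: Dataset) -> Action:
    # Policy 1: pre-definitive data are released six months after certification.
    if ds.kind is Kind.PRE_DEFINITIVE:
        certified = ds.months_since_certification is not None
        if certified and ds.months_since_certification >= 6:
            return Action.RELEASE
        return Action.RETAIN

    # Policy 2: definitive data, or the best surviving derived dataset when
    # no definitive dataset exists, may be released only with AA approval.
    treated_as_definitive = ds.kind is Kind.DEFINITIVE or (
        not ds.definitive_exists and ds.retains_most_science_potential
    )
    if treated_as_definitive:
        if ds.renewal_cost > ds.projected_future_value:
            return Action.SEEK_AA_APPROVAL
        return Action.RETAIN

    # Policy 3: other derived data are kept while credible, unless renewal
    # costs more than regenerating them to meet anticipated demand.
    if not ds.scientifically_credible:
        return Action.MANAGER_MAY_RELEASE
    if ds.renewal_cost > ds.regeneration_cost:
        return Action.MANAGER_MAY_RELEASE
    return Action.RETAIN
```

For example, under these assumptions a derived dataset whose renewal would cost more than regenerating it from a surviving definitive dataset yields `Action.MANAGER_MAY_RELEASE`, while a definitive dataset in the same cost position yields `Action.SEEK_AA_APPROVAL`. The sketch deliberately returns "may release" outcomes rather than forcing release, since the paper leaves the final judgment to people.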

Derivative Requirements (not specific to data retention/release)

  1. Projects must create and certify optimally standards-adherent definitive data sets, and accompanying material (documentation, ancillary data, software, etc.) as needed to make the data independently usable, and deliver them to NASA's "managed archives." Projects' plans should be spelled out in Project Data Management Plans.
  2. NASA archives must ensure the continuing preservation, accessibility, and usability of the data in their care. Plans for doing so should be spelled out in Archives' Operating Plans.
  3. NASA archives must have user advisory committees to advise on (among other things) the likely future use and value of datasets that are candidates for resource-intensive renewal cycles.