White paper on NASA science data retention
This brief note addresses which NASA science data should be retained indefinitely and the conditions under which certain data may, and should, be released (that is, destroyed rather than retained).
Observational data from NASA missions record the state of some aspect of our Earth, solar system, or universe at a particular point in time. Because our world is continually changing, time-stamped data cannot be recreated once abandoned or lost. New and old data, analyzed in new combinations and in new ways, enable us to understand our physical world increasingly well. NASA observational data therefore represent an asset that must be retained in a usable state into the indefinite future.
There are many "datasets" that result from a given NASA investigation, ranging from the initially downlinked telemetry data to some highly summarized versions of the physical parameters derived from the measurements. Not all such datasets need to be retained indefinitely.
The concept of the definitive dataset resulting from a given investigation is important here. The definitive dataset contains all of the science potential of the investigation, in that no irreversible transformations have been applied to the data. The definitive dataset has been cleaned of telemetry errors, has typically been time-aligned with overlaps removed, has had the needed ancillary data (e.g., orbit/attitude) appended, and has been otherwise annotated as appropriate. In other words, the definitive dataset is the highest-level dataset from an investigation that has not yet been irreversibly transformed. In some circles, this is called the Level 1A data processing level.
Definitive datasets from NASA missions should be retained indefinitely.
Datasets leading up to the production of the definitive dataset should be retained only until six months after the creation and certification of the definitive dataset. Certification implies that mission personnel have performed all appropriate tests to ensure that the definitive dataset faithfully captures the content of the datasets from which it was generated.
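The six-month rule above can be sketched as a small date calculation. This is only an illustration: the paper does not say how "six months" is to be counted, so the sketch assumes calendar months with the day clamped to the end of shorter months, and the names add_months and earliest_release_date are hypothetical.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Return d shifted forward by whole calendar months (assumed convention)."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    # Clamp the day so that, e.g., 31 August + 6 months lands on the last day of February.
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def earliest_release_date(certified_on: date, retention_months: int = 6) -> date:
    """Earliest date on which pre-definitive datasets might be released.

    certified_on is the date the definitive dataset was created and certified.
    The default of 6 months comes from the paper; the calendar-month
    interpretation is an assumption of this sketch.
    """
    return add_months(certified_on, retention_months)


if __name__ == "__main__":
    print(earliest_release_date(date(2023, 3, 15)))
```

An archive using a different convention (for example, a fixed number of days) would replace add_months accordingly.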
Typically, one or more datasets are generated from definitive datasets. These derived datasets ("post-definitive datasets") are typically more immediately usable than the definitive datasets, often because sensor outputs have been combined with calibration coefficients to produce the geophysical parameters with which science analyses are done. Some of these derived datasets capture much of the science potential of the investigation, insofar as full or near-full coverage and resolution in the independent variable space (time, space, energy, etc.) of the investigation are retained. On the other hand, many higher-level derived datasets retain only a small (but possibly highly important) fraction of the science potential of the original observations.
Derived datasets should be retained as long as they remain scientifically viable (i.e., the algorithms or coefficients used in their derivation remain credible) and the cost of regenerating them (for some anticipated request level) exceeds the cost of their retention and maintenance. (Note that some derived datasets become widely used and come to be considered "definitive"; we use "definitive" in a different sense in this paper, as described above.)
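The retention test for derived datasets can be sketched as a simple comparison. The sketch below is illustrative only: the paper does not define how costs are estimated, so it assumes that regeneration cost scales linearly with the anticipated number of requests, and the class DerivedDataset, its fields, and the function should_retain are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DerivedDataset:
    """Hypothetical record of the inputs to the retention test."""
    name: str
    scientifically_viable: bool           # algorithms/coefficients still credible
    regeneration_cost_per_request: float  # assumed cost to regenerate once
    anticipated_requests: int             # assumed request level over the period
    retention_cost: float                 # retention and maintenance cost over the period


def should_retain(ds: DerivedDataset) -> bool:
    """Apply the two conditions stated in the paper.

    Retain only if the dataset is still scientifically viable AND the expected
    regeneration cost exceeds the retention and maintenance cost.
    """
    if not ds.scientifically_viable:
        return False
    # Assumption: total regeneration cost is per-request cost times request count.
    expected_regeneration_cost = ds.regeneration_cost_per_request * ds.anticipated_requests
    return expected_regeneration_cost > ds.retention_cost


if __name__ == "__main__":
    popular = DerivedDataset("hourly averages", True, 500.0, 40, 8000.0)
    print(should_retain(popular))
```

All costs must be expressed in the same units and over the same period for the comparison to be meaningful.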
Datasets to be retained indefinitely should be archived in adherence to then-relevant standards regarding formats, media, etc., and with all supporting material (documentation, ancillary data, software, etc.) needed to make the data correctly and independently usable. Data integrity must be maintained (no data loss due to media deterioration) and datasets must continue to be findable, accessible, and usable.
Ensuring continuing data integrity and usability requires periodic data renewal cycles. Some such cycles will involve only bit migration from old to new media. Others may involve more resource-intensive data reorganization or reformatting, and sometimes even the recreation of related software.
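For the simplest renewal cycle, bit migration, one common way to confirm that no data were lost is to compare checksums before and after the copy. The paper does not prescribe any particular method; the sketch below assumes SHA-256 digests and file-based storage, and the names sha256_of and migrate_file are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(source: Path, destination: Path) -> str:
    """Copy a file to new media and verify the copy is bit-identical.

    Returns the verified digest so that it can be stored with the dataset's
    supporting material. Raises OSError if the digests differ.
    """
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise OSError(f"checksum mismatch after migrating {source} to {destination}")
    return after
```

The old copy would be discarded only after the new copy has been verified.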
When a definitive dataset is about to enter a particularly resource-intensive data renewal cycle, archive managers may seek approval from the appropriate NASA Science Associate Administrator (AA) to release (destroy) the dataset rather than renew it. The AA will approve such a release if, after consultation with representatives of the potential data-using community, it is determined that the data renewal costs would not be justified in light of limited projected future use of the data.
As long as one or more independently usable definitive datasets from a given investigation continue to exist, archive managers may release derived (post-definitive) datasets from that investigation that are about to undergo a data renewal cycle, provided that in their judgment, and with the concurrence of their advisory committee of potential users, the cost of renewal and further retention is not justified by projected future data usage, especially given the option of regenerating the dataset.
Where derived datasets still exist for a given investigation but the corresponding definitive dataset(s) no longer exist, the derived dataset retaining the greatest measure of the science potential of the investigation will be treated as if it were a definitive dataset, in that its release requires AA approval.
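The three release rules above differ mainly in who must approve a release. A minimal sketch of that logic follows; the enum Approver and the function release_approver are hypothetical names, and the sketch identifies only the approving party, not whether the release is justified on cost grounds. The paper does not address derived datasets that are neither the richest surviving dataset nor backed by a definitive dataset, so that case is reported as unspecified rather than guessed.

```python
from enum import Enum


class Approver(Enum):
    ASSOCIATE_ADMINISTRATOR = "NASA Science Associate Administrator"
    ARCHIVE_MANAGER = "archive manager, with advisory committee concurrence"
    NOT_SPECIFIED = "not specified in this paper"


def release_approver(
    is_definitive: bool,
    definitive_exists: bool,
    is_richest_surviving_derived: bool,
) -> Approver:
    """Return who would approve the release of a dataset facing a renewal cycle.

    is_definitive: the dataset itself is a definitive dataset.
    definitive_exists: an independently usable definitive dataset from the
        same investigation still exists.
    is_richest_surviving_derived: the dataset is the derived dataset retaining
        the greatest measure of the investigation's science potential.
    """
    if is_definitive:
        return Approver.ASSOCIATE_ADMINISTRATOR
    if definitive_exists:
        return Approver.ARCHIVE_MANAGER
    if is_richest_surviving_derived:
        # Treated as if it were a definitive dataset.
        return Approver.ASSOCIATE_ADMINISTRATOR
    return Approver.NOT_SPECIFIED


if __name__ == "__main__":
    print(release_approver(False, True, False).value)
```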
NASA will endeavor to minimize data renewal cycle costs, and hence to maximize the renewal and retention of definitive and derived datasets and their supporting materials, by defining and implementing standards for data formats, metadata, software, etc.