Digital Archive Directions (DADs) Workshop
DATE: June 22-26, 1998
HOST: The National Archives and Records Administration
8601 Adelphi Road
College Park, MD 20740-6001
1. Identification of Proposed Topic [Required]
HDF as an Archive Format
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
605 E. Springfield Ave.
Champaign IL 61820
1.3 Description of Proposed Project
The Hierarchical Data Format (HDF) is a multi-object file format for storing and transferring scientific data. HDF is widely used for scientific data management and has been selected as a standard data format for Earth Observing System (EOS) data products.
HDF data files are "self-describing" in the sense that they include information describing the type, storage structure, and location of the data in the file. They also provide convenient structures for applications to store application-specific metadata. HDF files are also "architecture-transparent" in the sense that a file's contents are represented in a form that can be accessed by computers with different ways of storing integers, characters, and floating-point numbers. Finally, HDF files provide structures that support direct access, so that a small subset of a large dataset can be read efficiently without first reading through all the preceding data.
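The three properties above can be illustrated with a minimal sketch. It is not the HDF format or the HDF library: the header layout, the magic value `TOY1`, the type codes, and the function names are all hypothetical. The header records the element type and count (self-description), every field uses an explicit big-endian encoding (architecture transparency), and a slice is read by seeking straight to its offset (direct access).

```python
import io
import struct

# Hypothetical toy layout (NOT the HDF format): a fixed big-endian header
# followed by the raw values.
#   magic (4 bytes) | type code (1 byte) | element count (uint32)
HEADER = struct.Struct(">4sBI")
# Type code 1 = 32-bit integer, 2 = 64-bit float; both stored big-endian.
TYPE_CODES = {1: ">i", 2: ">d"}


def write_dataset(stream, type_code, values):
    """Write a header describing the data, then the data itself."""
    fmt = TYPE_CODES[type_code]
    stream.write(HEADER.pack(b"TOY1", type_code, len(values)))
    for value in values:
        stream.write(struct.pack(fmt, value))


def read_slice(stream, start, count):
    """Read `count` elements from index `start` without scanning earlier data."""
    stream.seek(0)
    magic, type_code, total = HEADER.unpack(stream.read(HEADER.size))
    if magic != b"TOY1":
        raise ValueError("not a toy dataset")
    if start < 0 or count < 0 or start + count > total:
        raise IndexError("slice outside dataset")
    item = struct.Struct(TYPE_CODES[type_code])
    # Jump directly to the first requested element.
    stream.seek(HEADER.size + start * item.size)
    data = stream.read(count * item.size)
    return [item.unpack_from(data, i * item.size)[0] for i in range(count)]


# Usage: store 1000 floats in memory, then read three values near the end.
buffer = io.BytesIO()
write_dataset(buffer, 2, [0.5 * i for i in range(1000)])
print(read_slice(buffer, 990, 3))
```

Because the reader takes the element type and byte order from the file rather than from the host machine, the same bytes decode identically on any platform.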
HDF is not a data format that can be unpacked easily by knowing word locations, byte ordering, and so on. With HDF, the user is insulated from these details, so that differences in system-specific storage details are transparent. Because of this complexity, HDF files can be accessed only through the HDF library of subroutine and function calls (from FORTRAN or C).
NCSA and NASA have been working together since 1991 to provide support for the HDF format and access software. As the EOS project ramps up in the next two years, we will increase our efforts to fine tune the access software to improve its quality and performance, and to improve the usability of HDF.
HDF as an archive format
Although HDF was not originally designed to be an archival format, the fact that enormous amounts of EOS data will be stored in HDF, coupled with HDF's own self-describing features, argues strongly for using HDF as an archival format.
Recognizing the likelihood that HDF will be used for archiving, we are seeking to identify and address the requirements that will need to be met in order to make HDF an effective archival format. We began by identifying the characteristics of a good scientific data archiving format:
These criteria are described in detail in the white paper: HDF as An Archive Format: Issues and Recommendations. Table 1 contains brief comments on the strengths and weaknesses of HDF with respect to the criteria. The only criteria that HDF satisfies unequivocally are #4 and #8. Others are partially satisfied, and some, such as #11, are not satisfied at all.
Table 1 HDF strengths and weaknesses as an archive format
We have determined that the one criterion fundamental to all the others is #6: the HDF format, API, and low-level I/O library must be rigorously defined so that future applications will be able to make sense of the format, even when existing software becomes obsolete.
During the coming year we plan to focus our efforts on this requirement by revising and extending the HDF specification and code documentation. After that we will
1.5 Definitions of Concepts and Special Terms
1.6 Expected Relationship with OAIS Reference Model
The submission of digital metadata, i.e., data about digital or physical data sources, to the archive. Because HDF easily accommodates metadata, the ingest function can be supported by ensuring that producers of HDF files store much (perhaps all) of the Descriptive Information that must be produced during the ingest operation. Since HDF files can easily be added to, the ingest operation could be used to add extra information, such as browse images, to HDF files as they are brought into the system.
Protocol standard(s) to search and retrieve archive metadata information. Some work has been done on developing formal descriptive information based on the Object Description Language (ODL) developed for the Planetary Data System. A specification has been created for the HDF-EOS profile, which supports swath, point, and grid data. "HDF Configuration Records" (HCR), based on this specification, can be stored in an HDF file or separately. If producers ensured that these records were stored in HDF files, the records could be extracted from those files and become part of the descriptive information provided to the Data Management function area.
Manage storage hierarchy. It is possible to store HDF files in a number of different storage formats. For instance, it is possible to store large blobs of raw data (Content Information) in flat files while keeping the Representation Information and smaller amounts of Content Information in an HDF file. This makes it possible to use highly accessible media (e.g. disk) for the data that is most likely to be accessed, and less accessible, lower cost media for data that is less likely to be accessed, while still maintaining the integrity of an HDF "file".
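The split between a main file and externally stored raw data can be sketched as follows. This is an illustration only, not the HDF mechanism: the `ObjectEntry` record, its fields, and `read_object` are hypothetical names. Each entry records where an object's bytes live, so a reader resolves small objects from the main file and large ones from a separate flat file that could sit on cheaper media.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectEntry:
    """Hypothetical directory entry: where one object's bytes live."""
    name: str
    offset: int
    length: int
    external_path: Optional[str] = None  # None means "inside the main file"


def read_object(main_path, entry):
    """Return an object's bytes from the main file or from its external file."""
    path = entry.external_path or main_path
    with open(path, "rb") as handle:
        handle.seek(entry.offset)
        data = handle.read(entry.length)
    if len(data) != entry.length:
        raise IOError(f"object {entry.name!r} is truncated in {path}")
    return data
```

The caller sees one logical "file" either way; only the entry knows which physical file holds the bytes.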
Error checking. HDF does not currently support error checking mechanisms, but the format of HDF makes it easy to add such information at any level of granularity. If HDF is to be used for archiving, such mechanisms should be added.
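One way such a mechanism might look is a checksum stored alongside each object, so that corruption can be localized to a single object rather than to the whole file. The sketch below is an assumption about a possible design, not an existing HDF feature: `attach_checksums` and `find_corrupt` are hypothetical helpers, and CRC-32 from Python's standard `zlib` module stands in for whatever check an archive would actually choose.

```python
import zlib


def attach_checksums(objects):
    """Map each object name to (payload, CRC-32 of payload)."""
    return {name: (payload, zlib.crc32(payload))
            for name, payload in objects.items()}


def find_corrupt(stored):
    """Return the names of objects whose payload no longer matches its CRC-32."""
    return sorted(name for name, (payload, crc) in stored.items()
                  if zlib.crc32(payload) != crc)
```

The same idea applies at coarser or finer granularity, for example one checksum per file or one per block within a large object.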
Disaster recovery. Since new objects can easily be added to HDF, disaster recovery could be enhanced by storing the AIP for a file in the file itself.
Develop standards and policies. HDF is such a flexible format that it is very important to develop standards and policies for organizing data in HDF files. The HDF-EOS standard is a good example of this. The HDF configuration record, described above, can be used to help encourage such standards.
Access and dissemination (the delivery of digital sources from the archive)
Prepare finding aids. There is a great demand for this function for HDF. It is common for HDF files to contain browse images. It is also common for users to want to scan large numbers of images or other datasets in collections of HDF files. For instance NCSA is currently working on a project to extract feature information from large collections of HDF files, and much of that project involves building aids to assist users in searching for desired features.
Generate DIP. Where HDF files contain very large granules, users commonly want only a subset of a granule. It is important that data be organized in a way that supports subsetting. HDF users have stressed this functionality, and consequently much emphasis has been put on support for efficient subsetting from HDF files. HDF supports internal structures that can improve subsetting efficiency substantially.
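To show why internal structure matters for subsetting, the sketch below assumes a dataset stored as fixed-size rectangular chunks and computes which chunks overlap a requested subset. It is a generic illustration, not the HDF library's interface: the function name and its parameters are hypothetical.

```python
from itertools import product


def chunks_for_subset(start, count, chunk_shape):
    """Return the chunk coordinates that overlap a rectangular subset.

    start, count: first index and extent of the subset along each dimension.
    chunk_shape: extent of one chunk along each dimension.
    """
    ranges = []
    for first, extent, chunk in zip(start, count, chunk_shape):
        if extent <= 0 or chunk <= 0 or first < 0:
            raise ValueError("extents must be positive and start non-negative")
        last = first + extent - 1
        # Integer division maps an element index to its chunk index.
        ranges.append(range(first // chunk, last // chunk + 1))
    return list(product(*ranges))
```

For a 1000 x 1000 array stored in 100 x 100 chunks, `chunks_for_subset((150, 0), (100, 50), (100, 100))` selects 2 of the 100 chunks, so only those two need to be read to deliver the subset.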
2. Scope of Proposed Standard [Desired]
2.1 Recommended Scope of Standard
The complexity of HDF is both a strength and a weakness in determining its appropriateness as an archive format. It is a strength because of the flexibility it affords for organizing data and metadata for archival storage and access. It is a weakness in that very complex software must be available in order to access HDF files. The proposed effort will concentrate on finding ways to overcome the weaknesses resulting from HDF's complexity.
2.2 Existing Practice in Area of Proposed Standard
Current efforts to deal with the problem of HDF complexity focus on creating comprehensive documentation for the format and software. We are very interested in finding other ways to address the problem.
2.3 Expected Stability of Proposed Standard with Respect to Current and Potential Technological Advances
The format: We do not anticipate that changes in technology will cause problems for supporting the format. At this time, there are no plans to change the current HDF format beyond minor enhancements. We may simplify the format somewhat by removing certain information that is required by older versions of the software, but this will only improve the suitability of the format for archiving.
The library: As long as C and Fortran 77 remain viable, the library should be maintainable. However, over time the efficiency and maintainability of the library are expected to suffer as software technologies advance.
A service of NOST at NSSDC.
Curator: John Garrett (John.Garrett@gsfc.nasa.gov) +1.301.286.3575
Responsible Official: Code 633.2 / Don Sawyer (Donald.Sawyer@gsfc.nasa.gov) +1.301.286.2748
Last Revised: (May 26, 1998, John Garrett)