Digital Archive Directions (DADs) Workshop

(A part of the ISO Archiving Workshop Series)

Position Paper


Digital Archive Directions (DADs) Workshop

DATE: June 22-26, 1998

HOST: The National Archives and Records Administration
Archives II
8601 Adelphi Road
College Park, MD 20740-6001

 


 

1. Identification of Proposed Topic [Required]

1.1 Title

HDF as an Archive Format

1.2 Contributor(s)

Mike Folk

National Center for Supercomputing Applications

University of Illinois at Urbana-Champaign

605 E. Springfield Ave.

Champaign IL 61820

 

mfolk@ncsa.uiuc.edu

fax: 217-244-1987

phone: 217-244-0647

1.3 Description of Proposed Project

The Hierarchical Data Format (HDF) is a multi-object file format for storing and transferring scientific data. HDF is widely used for scientific data management and has been selected as a standard data format for Earth Observing System (EOS) data products.

HDF data files are "self-describing" in the sense that they include information describing the type, storage structure, and location of the data in the file. They also provide convenient structures for applications to store application-specific metadata. HDF files are also "architecture-transparent" in the sense that a file's contents are represented in a form that can be accessed by computers with different ways of storing integers, characters, and floating-point numbers. HDF files also provide structures that facilitate efficient direct access, so that a small subset of a large dataset can be read without first reading through all the preceding data.

HDF is not a data format that can be unpacked easily by knowing word locations, byte ordering, and so on. With HDF, the user is insulated from these details, so that differences in system-specific storage are transparent. Because of this complexity, HDF files can be accessed only through the HDF library of subroutine and function calls (from FORTRAN or C).
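
To illustrate this library-mediated access, the following sketch (in C, using the HDF4 SD interface) reads a small rectangular subset of a two-dimensional dataset; the library resolves byte order, number type, and file layout on the caller's behalf. The file name "granule.hdf" and dataset name "Temperature" are hypothetical.

    /* Sketch: read a 10 x 10 subset of a 2-D dataset through the HDF SD API. */
    #include <stdio.h>
    #include "mfhdf.h"

    int main(void)
    {
        int32 start[2] = {100, 200};   /* starting coordinates of the subset */
        int32 edge[2]  = {10, 10};     /* number of values to read per dimension */
        float32 buffer[10][10];
        int32 sd_id, sds_index, sds_id;

        sd_id = SDstart("granule.hdf", DFACC_READ);        /* open the file */
        if (sd_id == FAIL) return 1;

        sds_index = SDnametoindex(sd_id, "Temperature");   /* find dataset by name */
        sds_id = SDselect(sd_id, sds_index);               /* attach to it */

        /* Read only the requested region; preceding data is never touched. */
        SDreaddata(sds_id, start, NULL, edge, (VOIDP)buffer);

        SDendaccess(sds_id);
        SDend(sd_id);
        return 0;
    }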

NCSA and NASA have been working together since 1991 to provide support for the HDF format and access software. As the EOS project ramps up over the next two years, we will increase our efforts to fine-tune the access software to improve its quality and performance, and to improve the usability of HDF.


1.4 Justification

HDF as an archive format

Although HDF was not originally designed to be an archival format, the fact that enormous amounts of EOS data will be stored in HDF, coupled with HDF's own self-describing features, argues strongly for using HDF as an archival format.

Recognizing the likelihood that HDF will be used for archiving, we are seeking to identify and address the requirements that will need to be met in order to make HDF an effective archival format. We began by identifying the characteristics of a good scientific data archiving format:

  1. The format is self-describing
  2. The format is compact
  3. The format supports sequential access
  4. The format is suitable for a variety of storage technologies
  5. The format is simple
  6. Access software is available for the format or is easy to implement
  7. A rigorous definition of the format, API and I/O library is available
  8. The format is widely used
  9. There is long-term institutional support for the format
  10. The format is stable in that it changes little or not at all over time
  11. Tools are available that support efficient access on mass storage systems

These criteria are described in detail in the white paper "HDF as an Archive Format: Issues and Recommendations." Table 1 contains brief comments on the strengths and weaknesses of HDF with respect to the criteria. The only criteria that HDF satisfies unequivocally are #4 and #8. Others are partially satisfied, and some, such as #11, are not satisfied at all.

Table 1. HDF strengths and weaknesses as an archive format

1. Self-describing

   Application-specific metadata
      Strengths:  Attributes and annotations are available to users for storing metadata.
      Weaknesses: HDF itself does not support profiles such as HDF-EOS; they must be provided by the community that defines them.

   Structural metadata
      Strengths:  Partially. For instance, pre-defined HDF tags indicate the structure and content of data elements, and number types are encoded in an HDF file.
      Weaknesses: The syntax and semantics of simple objects, higher-level data structures, and special storage methods are defined in the documentation, not in the file; the HDF software interprets them.

2. Compact

   Low overhead
      Strengths:  Reasonably low if structures are defined carefully.
      Weaknesses: High overhead if many small structures are stored.

   Compact number representations
      Strengths:  Yes. Supports binary data and n-bit number types.

   Support for data compression
      Strengths:  Supports several kinds of compression.
      Weaknesses: Some compression software is third-party software, and no clear procedure is in place for maintaining it over the long term. Also, no clear mechanism exists for incorporating new compression methods.

3. Sequential access
      Weaknesses: "Header" information can be anywhere in the file, including at the end.

4. New storage technologies
      Strengths:  HDF is not tied to any particular technology.

5. Simple
      Weaknesses: It is not easy to interpret HDF files without using the HDF library. The same is true for any API that uses HDF, such as HDF-EOS.

6. Ease of writing access software
      Weaknesses: Because the HDF format is complex, access software is not easy to write. There is heavy reliance on the NCSA library.

7. Rigorously defined

   Format
      Strengths:  Partially. Tags are fully specified in the Specification.
      Weaknesses: Higher-level structures need to be documented. Tag documentation needs to be reviewed.

   API
      Strengths:  Yes. A User's Guide and a Reference Manual are available.
      Weaknesses: Needs revision. Both manuals are currently under review.

   Library source code
      Strengths:  Somewhat. The internal (in-code) documentation is good.
      Weaknesses: Only a small amount of documentation is available in the Specification. Use of third-party code is also a problem.

8. Wide use
      Strengths:  The use of HDF is widespread and growing.

9. Long-term institutional support
      Strengths:  NASA is committed to supporting HDF at a level of 1.5 FTE through 2001. ASCI may provide similar support.
      Weaknesses: There is no long-term commitment to the development, maintenance, and support of HDF. The heavy reliance on a small, centralized group, rather than on the user community, for basic HDF access software is risky.

10. Stability
      Strengths:  At some cost in complexity, backward compatibility has so far been maintained in the HDF format and library. Some older structures may be omitted in the future, but the library will always read them.
      Weaknesses: Future changes to the format could increase the complexity of accessing HDF archives. Also, HDF5, an entirely new format, will be available soon; the issues it raises are to be covered in a separate document.

11. Mass storage support
      Strengths:  HDF could be made to have some of the features of "tar".
      Weaknesses: HDF does not have tar-like tools at this time.

It has been determined that the one criterion that is fundamental to all others is #7: the HDF format, API, and low-level I/O library must be rigorously defined so that future applications will be able to make sense of the format, even when existing software becomes obsolete.

During the coming year we plan to focus our efforts on this requirement by revising and extending the HDF specification and code documentation. After that, we will turn to the remaining criteria.

1.5 Definitions of Concepts and Special Terms

[None]

1.6 Expected Relationship with OAIS Reference Model

Ingest

The submission of digital metadata, i.e., data about digital or physical data sources, to the archive. Because HDF easily accommodates metadata, the ingest function can be supported by ensuring that producers of HDF files store much (perhaps all) of the Descriptive Information that must be produced during the ingest operation. Since HDF files can easily be added to, the ingest operation could be used to add extra information, such as browse images, to HDF files as they are brought into the system.
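
As a sketch of how this might look in practice, the following C fragment (using the HDF4 SD interface) attaches two pieces of Descriptive Information to an existing file as global attributes at ingest time; the file name, attribute names, and values are hypothetical.

    /* Sketch: record Descriptive Information as file-level attributes at ingest. */
    #include <string.h>
    #include "mfhdf.h"

    int main(void)
    {
        const char *producer = "EOS instrument team (hypothetical)";
        const char *ingested = "1998-06-22T00:00:00Z";

        int32 sd_id = SDstart("granule.hdf", DFACC_WRITE);   /* open existing file */
        if (sd_id == FAIL) return 1;

        /* Global attributes become part of the file itself, so the
           descriptive record travels with the data it describes. */
        SDsetattr(sd_id, "producer", DFNT_CHAR8,
                  (int32)strlen(producer), (VOIDP)producer);
        SDsetattr(sd_id, "ingest_time", DFNT_CHAR8,
                  (int32)strlen(ingested), (VOIDP)ingested);

        SDend(sd_id);
        return 0;
    }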

Data Management

Protocol standard(s) to search and retrieve archive metadata information. Some work has been done on developing formal descriptive information based on the Object Description Language (ODL) developed for the Planetary Data System. A specification has been created for the HDF-EOS profile, which supports swath, point, and grid data. "HDF Configuration Records" (HCRs), based on this specification, can be stored in an HDF file or separately. If producers ensured that these records were stored in HDF files, the records could be extracted and made part of the descriptive information provided to the Data Management functional area.
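
As an illustration only (the actual HCR layout is defined by the HDF-EOS specification and is not reproduced here), the sketch below stores a small ODL-style descriptive record in an HDF4 file as a file description annotation, from which a Data Management service could later extract it. The record's contents, the file name, and the field names are all hypothetical.

    /* Sketch: store an ODL-style descriptive record as a file annotation. */
    #include <string.h>
    #include "hdf.h"

    int main(void)
    {
        /* Hypothetical ODL-like record; not the actual HCR schema. */
        const char *record =
            "GROUP = DATASET_DESCRIPTION\n"
            "  DATA_TYPE  = \"GRID\"\n"
            "  SHORT_NAME = \"EXAMPLE_PRODUCT\"\n"
            "END_GROUP = DATASET_DESCRIPTION\n"
            "END\n";

        int32 file_id = Hopen("granule.hdf", DFACC_RDWR, 0);   /* open the file */
        if (file_id == FAIL) return 1;

        int32 an_id  = ANstart(file_id);                /* annotation interface */
        int32 ann_id = ANcreatef(an_id, AN_FILE_DESC);  /* new file description */

        ANwriteann(ann_id, record, (int32)strlen(record));

        ANendaccess(ann_id);
        ANend(an_id);
        Hclose(file_id);
        return 0;
    }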

Archival storage

Manage storage hierarchy. HDF files can be stored in a number of different storage configurations. For instance, it is possible to store large blobs of raw data (Content Information) in flat files while keeping the Representation Information and smaller amounts of Content Information in an HDF file. This makes it possible to use highly accessible media (e.g., disk) for the data that is most likely to be accessed, and less accessible, lower-cost media for data that is less likely to be accessed, while still maintaining the integrity of an HDF "file".
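
One way this might be done is sketched below, using the external-element feature of the HDF4 SD interface: the dataset's description stays in the HDF file while its raw values are written to a separate flat file that can be placed on cheaper media. All file and dataset names are hypothetical.

    /* Sketch: keep Representation Information in the HDF file, bulk data outside it. */
    #include "mfhdf.h"

    int main(void)
    {
        int32 dims[2] = {2400, 2400};
        int32 sd_id, sds_id;

        sd_id = SDstart("granule.hdf", DFACC_CREATE);
        if (sd_id == FAIL) return 1;

        /* The structural description of the dataset stays in granule.hdf ... */
        sds_id = SDcreate(sd_id, "Radiance", DFNT_FLOAT32, 2, dims);

        /* ... but its raw values are directed to an external flat file,
           starting at byte offset 0.  Later SDwritedata/SDreaddata calls
           go to that file transparently. */
        SDsetexternalfile(sds_id, "radiance_raw.dat", 0);

        SDendaccess(sds_id);
        SDend(sd_id);
        return 0;
    }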

Error checking. HDF does not currently support error-checking mechanisms, but the structure of the format makes it easy to add such information at any level of granularity. If HDF is to be used for archiving, such mechanisms should be added.
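
One possible convention, sketched below in C, is to compute a checksum over a dataset's values and record it as an attribute of that dataset. The attribute name is hypothetical, and the additive checksum shown is a placeholder for something stronger such as CRC-32.

    /* Sketch: store a per-dataset checksum as an HDF attribute. */
    #include <stddef.h>
    #include "mfhdf.h"

    /* Illustrative additive checksum; a real archive would likely use CRC-32. */
    static uint32 simple_checksum(const unsigned char *bytes, size_t n)
    {
        uint32 sum = 0;
        size_t i;
        for (i = 0; i < n; i++)
            sum += bytes[i];
        return sum;
    }

    /* Attach the checksum to an open dataset (sds_id) covering nbytes of data. */
    int store_checksum(int32 sds_id, const void *data, size_t nbytes)
    {
        uint32 sum = simple_checksum((const unsigned char *)data, nbytes);
        /* The same idea works per file, per dataset, or per block, giving
           whatever level of granularity the archive requires. */
        return SDsetattr(sds_id, "data_checksum", DFNT_UINT32, 1, (VOIDP)&sum);
    }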

Disaster recovery. Since new objects can easily be added to HDF, disaster recovery could be enhanced by storing the AIP for a file in the file itself.

Administration

Develop standards and policies. HDF is such a flexible format that it is very important to develop standards and policies for organizing data in HDF files. The HDF-EOS standard is a good example of this. The HDF configuration record, described above, can be used to help encourage such standards.

Access and dissemination (the delivery of digital sources from the archive)

Prepare finding aids. There is a great demand for this function for HDF. HDF files commonly contain browse images, and users often want to scan large numbers of images or other datasets across collections of HDF files. For instance, NCSA is currently working on a project to extract feature information from large collections of HDF files, and much of that project involves building aids to assist users in searching for desired features.
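
As a minimal sketch of a finding-aid builder, the C fragment below lists every scientific dataset in one HDF4 file (name, rank, and dimensions); a real aid would also harvest attributes and browse images, and the file name is hypothetical.

    /* Sketch: enumerate the datasets in an HDF4 file as raw material for a finding aid. */
    #include <stdio.h>
    #include "mfhdf.h"

    int main(void)
    {
        char  name[MAX_NC_NAME];
        int32 dims[MAX_VAR_DIMS];
        int32 sd_id, sds_id, n_sds, n_file_attrs, rank, ntype, nattrs, i, j;

        sd_id = SDstart("granule.hdf", DFACC_READ);
        if (sd_id == FAIL) return 1;

        SDfileinfo(sd_id, &n_sds, &n_file_attrs);      /* how many datasets? */
        for (i = 0; i < n_sds; i++) {
            sds_id = SDselect(sd_id, i);
            SDgetinfo(sds_id, name, &rank, dims, &ntype, &nattrs);
            printf("%s: rank %d, dims", name, (int)rank);
            for (j = 0; j < rank; j++)
                printf(" %d", (int)dims[j]);
            printf("\n");
            SDendaccess(sds_id);
        }
        SDend(sd_id);
        return 0;
    }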

Generate DIP. In the common case where HDF files contain very large granules, users often want only a subset of a granule, so it is important that data be organized in a way that supports subsetting. HDF users have stressed this functionality, and consequently much emphasis has been placed on supporting efficient subsetting from HDF files. HDF supports internal structures that can improve subsetting efficiency substantially.
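
One such internal structure is chunked (tiled) storage. As a sketch, assuming the SD chunking interface introduced with HDF 4.1, the fragment below creates a dataset stored as 100 x 100 tiles so that a subset request reads only the tiles it overlaps; names and sizes are hypothetical.

    /* Sketch: create a chunked (tiled) dataset to make subsetting efficient. */
    #include "mfhdf.h"

    int main(void)
    {
        int32 dims[2] = {2400, 2400};
        HDF_CHUNK_DEF cdef;
        int32 sd_id, sds_id;

        sd_id = SDstart("granule.hdf", DFACC_CREATE);
        if (sd_id == FAIL) return 1;

        sds_id = SDcreate(sd_id, "Radiance", DFNT_FLOAT32, 2, dims);

        /* Store the 2400 x 2400 array as 100 x 100 tiles; a later subset
           read touches only the tiles containing the requested region. */
        cdef.chunk_lengths[0] = 100;
        cdef.chunk_lengths[1] = 100;
        SDsetchunk(sds_id, cdef, HDF_CHUNK);

        SDendaccess(sds_id);
        SDend(sd_id);
        return 0;
    }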

 


 

2. Scope of Proposed Standard [Desired]

2.1 Recommended Scope of Standard

The complexity of HDF is both a strength and a weakness in determining its appropriateness as an archive format. It is a strength because of the flexibility it affords for organizing data and metadata for archival storage and access. It is a weakness in that very complex software must be available in order to access HDF files. The proposed effort will concentrate on finding ways to overcome the weaknesses resulting from HDF's complexity.

2.2 Existing Practice in Area of Proposed Standard

Currently, efforts to deal with the problem of HDF complexity focus on creating comprehensive documentation for the format and software. We are very interested in finding other ways to address the problem.

2.3 Expected Stability of Proposed Standard with Respect to Current and Potential Technological Advances

The format: We do not anticipate that changes in technology will cause problems for supporting the format. At this time, there are no plans to change the current HDF format beyond minor enhancements. We may simplify the format somewhat by removing certain information that is required by older versions of the software, but this will only improve the suitability of the format for archiving.

The library: As long as C and Fortran 77 remain viable, the library should be maintainable. However, over time the efficiency and maintainability of the library are expected to suffer as software technologies advance.

