FEP - Format Use by a Researcher - Eduardo Santiago - HDF

Eduardo Santiago
Los Alamos National Laboratory
 
1. Format (Format System) Identification

HDF 4.1r1

2. Original Motivation

The ACE Science Center (ASC) provides our Level 1 data in HDF.

I have been tremendously impressed with HDF, especially with ASC's wrapper systems. The HDF format is space-efficient and fast to read. Using ASC's hdfgen tools, time to develop routines for reading/writing HDF files is measured in minutes.

3. Data Types

Processing Level: Level 1 and higher

Object Types: Time Series, Multidimensional.

4. Support

Our use of HDF is essentially split into two: the NCSA libraries, and the ASC wrappers.

Plain HDF (from NCSA) is nasty. The developers went for flexibility, at the cost of usability... jumping into writing HDF code is intimidating and painful.

This is not an indictment of their efforts: I believe they did the right thing. The flexibility of HDF allows serious developers to use the format and provide simple, clean abstraction layers of their own.

Fortunately for those of us who care only about reading and writing simple time series, the ASC folks have developed a magnificent front end to HDF, providing a simple open/read/write/close interface that deals with C data structures. With the ASC code, using HDF is a pleasure.

If I had to use HDF without the ASC tools, I would be forced to reinvent them... which would require learning a lot more about HDF than I really care to know.

5. Software

I wrote some simple IDL functions for automatically reading any existing hdfgen-written HDF file. One single user-visible function, READ_ASC_HDF(), will return a properly formatted array of data structures containing the contents of the desired file. This makes it trivial for researchers to read data files, without knowing anything at all about the underlying HDF format.
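The single-reader-function idea can be illustrated outside IDL. The Python sketch below is not READ_ASC_HDF(), hdfgen, or HDF itself. It assumes a made-up container (a JSON header naming each field and its type, followed by packed big-endian records) purely to show how a file that describes itself lets one generic function return a list of structures. The names write_records and read_records, and the layout, are hypothetical.

```python
import json
import os
import struct
import tempfile

# Hypothetical container layout (NOT HDF, for illustration only):
#   4-byte big-endian header length, a JSON header, then fixed-size records.

def write_records(path, fields, records):
    """fields: list of (name, struct type code); records: list of dicts."""
    header = json.dumps({"fields": fields}).encode("utf-8")
    # ">" selects big-endian, standard sizes, and no padding between fields.
    record_format = ">" + "".join(code for _, code in fields)
    with open(path, "wb") as f:
        f.write(struct.pack(">I", len(header)))
        f.write(header)
        for rec in records:
            f.write(struct.pack(record_format, *(rec[name] for name, _ in fields)))

def read_records(path):
    """Return a list of dicts using only the description stored in the file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack(">I", f.read(4))
        fields = json.loads(f.read(header_len).decode("utf-8"))["fields"]
        record_format = ">" + "".join(code for _, code in fields)
        names = [name for name, _ in fields]
        body = f.read()
    return [dict(zip(names, values))
            for values in struct.iter_unpack(record_format, body)]

# Usage: the reader is given only a path, never the field list.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.dat")
write_records(
    demo_path,
    [("time", "d"), ("speed", "f"), ("flag", "B")],
    [{"time": 1.5, "speed": 400.0, "flag": 1},
     {"time": 2.5, "speed": 450.0, "flag": 0}],
)
data = read_records(demo_path)
print(data[0]["speed"], len(data))
```

The caller of read_records() needs no knowledge of the layout, which is the property that makes a one-function reader possible.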

6. Environment

UNIX-only (Linux, Solaris) with GNU tools.

7. Usage

We use HDF all over the place. We process Level 1, and save the results in HDF format. All our plotting and analysis code reads the results from these HDF files.

Furthermore, we have used the ASC hdfgen.pl code to generate HDF templates for various other missions (SWOOPS, LENA-P), in order to store their datasets in HDF format.

8. Experience

 >Relative to its ability to carry and manage research-needed metadata

One HDF file can contain multiple not-necessarily-associated datasets. It is thus possible, and easy, to archive some metadata along with principal data.

Sometimes this isn't appropriate, though... in particular, calibration factors and pointing solutions often have to be recomputed for the whole mission. It is just as easy -- not to mention wiser -- to save such metadata in separately-distributed HDF files.

 >Relative to its related software

I can only compare to CDF, ASCII, and raw binary. That said:

  • HDF and CDF seem mostly equivalent, at my level of experience.
  • ASCII (when done properly, with relevant headers) is my usual favorite for distributing data. It requires no special libraries, machine architectures, or whatnot. However, it is slow to read, and usually requires human intervention. Properly written HDF files can be read by a single function, without writing any special-purpose code.
  • raw binary is a pain in the neck. See my comments.
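One concrete reason raw binary is painful can be shown in a few lines. The Python sketch below is an added illustration, not taken from any mission code. It assumes a writer that stored the IEEE 754 single-precision value 1.0 in big-endian order, and shows that a reader with no header to consult can decode the same four bytes in several plausible ways.

```python
import struct

# Four bytes as they might sit in a headerless binary file.
# Assumption for this demo: the writer stored the float 1.0, big-endian.
raw = struct.pack(">f", 1.0)

as_big_endian = struct.unpack(">f", raw)[0]     # reader guesses the byte order correctly
as_little_endian = struct.unpack("<f", raw)[0]  # reader guesses the wrong byte order
as_integer = struct.unpack(">i", raw)[0]        # reader guesses the wrong type

print(as_big_endian, as_little_endian, as_integer)
```

Nothing in the bytes themselves says which interpretation is right; that knowledge has to live in documentation or in someone's head.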

9. Desired Functionality

As Andrew said, it would be nice to have CDF-like mechanisms for describing individual dataset elements.

A good 2-page "Introduction to HDF" would have helped me when learning it. I still don't have a good mental overview of the library, but perhaps that's my own obtuseness.

10. Selection Criteria

  1. file-level TRANSPORTABILITY. Believe it or not, there are some formats out there whose libraries cannot read a big-endian-written file on a little-endian machine. Obviously, this is not acceptable in a professional environment. Also, floating point values must be stored in IEEE format.
  2. code-level PORTABILITY. One distribution must compile and run on anything imaginable. Furthermore, custom config/make scripts are pathetic. There is no reason not to use autoconf.
  3. LANGUAGE INDEPENDENCE. Designing with Fortran (or C, or C++, or SNOBOL, or Lisp) in mind will guarantee headaches when that language becomes obsolete. Not only must current culture be considered when designing a format, but designers must make a good effort to peer into their crystal ball and foresee upcoming developments in language use. Tough one.
  4. files must be STANDALONE. Other than the format libraries, there MUST NOT be any framework (project config files, file descriptions) required for reading a data file.
  5. data files must be SELF-DESCRIBING. It should be possible for a programmer to figure out the contents of a data file with no external references. This is closely related to (4), but subtly different enough to warrant its own bullet.
  6. ROBUSTNESS. The library must perform its own error checking, and guarantee returning good data or a well-defined error status. Core dumps are a sign of stupidity and sloppiness. Returning corrupt data is worse, and never, EVER acceptable.
  7. AVAILABILITY. This is impossible to determine a priori, of course, but any new format should be supported by a wide variety of existing data analysis tools... if it is any good.
  8. SPACE EFFICIENCY. An N-byte quantity in memory should take up no more than N bytes on disk. Per-file overhead is natural and expected; per-record overhead is not.
  9. INTUITIVENESS. This is a hard one to quantify. If a user cannot develop a simple (and correct) mental model of how records are stored and accessed, though, the format is doomed.
  10. THE FUTURE. How do we ensure that our descendants are able to read data stored today? Storing data as bit streams (as opposed to, say, VAX/VMS proprietary-"RMS" files) is a good start. What other cultural factors need to be meta-examined?
  11. VERSION CONTROL. It would be nice if "data versions" were supported as part of the format. For instance, we often have multiple revisions of our "level 2" products. I write wrappers around the file lookup/open() functions to deal with this, but it might be useful to have that ability as part of the file format. Then again, perhaps it just adds complexity without enough benefit.
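
The lookup wrapper mentioned under item 11 might look roughly like the following Python sketch. The actual wrappers and naming conventions are not described here, so everything below is an assumption: the file naming scheme `<product>_v<revision>.hdf`, the function names find_latest and open_latest, and the example product names are all hypothetical.

```python
import os
import re
import tempfile

# Hypothetical convention: <product>_v<revision>.hdf, e.g. swepam_l2_v03.hdf
_VERSIONED = re.compile(r"^(?P<product>.+)_v(?P<rev>\d+)\.hdf$")

def find_latest(directory, product):
    """Return the path of the highest-numbered revision of product, or None."""
    best_rev, best_path = -1, None
    for name in os.listdir(directory):
        match = _VERSIONED.match(name)
        if match is None or match.group("product") != product:
            continue
        rev = int(match.group("rev"))  # numeric compare, so v10 beats v9
        if rev > best_rev:
            best_rev, best_path = rev, os.path.join(directory, name)
    return best_path

def open_latest(directory, product, mode="rb"):
    """Open the newest revision of product, or fail with a clear error."""
    path = find_latest(directory, product)
    if path is None:
        raise FileNotFoundError(f"no revisions of {product!r} in {directory!r}")
    return open(path, mode)

# Usage with empty placeholder files standing in for real data products.
demo_dir = tempfile.mkdtemp()
for fname in ("swepam_l2_v1.hdf", "swepam_l2_v2.hdf",
              "swepam_l2_v10.hdf", "swoops_l2_v99.hdf"):
    open(os.path.join(demo_dir, fname), "wb").close()

latest = find_latest(demo_dir, "swepam_l2")
print(os.path.basename(latest))
```

Callers ask for a product by name and never see revision numbers, which is the convenience that building versioning into the format itself would have to match.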

11. Impact on Research

Learning a new format is always painful. Sometimes it is worth the effort. HDF certainly was: a small investment in time resulted in huge productivity gains on several projects.

Some formats are a living nightmare. Raw binary, for instance, is unpleasantly common, and always nasty to deal with. It pales in comparison with UDF, though.

12. Other Comments

No, I'm not really a researcher... I'm a software factotum who provides tools to the real science types, so they can read and analyze data with a one-minute-or-less learning investment. By your classification, though, I didn't seem to fit under "Tool Developer".

URL: http://ssdoo.gsfc.nasa.gov/nost/fep/researcher-santiago-hdf.html

Author: Eduardo Santiago / Los Alamos National Laboratory / ACE/SWEPAM, Ulysses/SWOOPS, LENA-P (esm@lanl.gov) +1 505/665-3130
Curator: John Garrett (John.Garrett@gsfc.nasa.gov) +1.301.286.3575
NASA Official: Code 633.2 / Don Sawyer (Don.Sawyer@gsfc.nasa.gov) +1.301.286.2748
Last Revised: 1999-12-15 T18:49:22, Eduardo Santiago