
FEP - Format Use by a Researcher - Eduardo Santiago - UDF

Eduardo Santiago
Los Alamos National Laboratory
 
Comment on this template in the HyperNews Discussion.  

1. Format (Format System) Identification

UDF

2. Original Motivation

The IMAGE mission distributes level 1 data in UDF.

UDF has been unsatisfactory as a data format. It violates nearly every single one of my requirements for a file format (posted elsewhere):

  • STANDALONE DATA: Data files require a complex, unwieldy, per-project infrastructure. Without a set of "VIDF" files -- furthermore, without exactly the same "VIDF" file used to write the data -- you cannot read a data file. If you're lucky, you'll get a core dump. If not, you might end up with corrupt data. There is no provision for alerting the user to this situation.

    Worse, data files cannot be ftp'ed over and replicated locally: a cumbersome clicky-GUI process is required to "install" data (I wrote myself a standalone, cron-able Perl equivalent, but nonprogrammers will be stuck with tedious daily clicking).

    Data "installation" is prone to occasional failure due to improperly packaged or created data files. While this in itself isn't necessarily unacceptable (it can happen with any project), with UDF, these failures can leave your "installation" in an uncertain -- possibly unusable -- state.

  • SELF-DESCRIBING DATA FILES: Even if the data files were standalone, and could be read without the "VIDF" files, the only description of the contents is in a separate set of ("PIDF") files -- which are not viewable by humans except through special tools.

  • TRANSPORTABILITY: Once data files have been "installed", they can only be read back in on a machine of the same endianness. You cannot share data between, say, Intel and SPARC.

    Also, data files must be "installed" according to UDF rules. If your mission spans five years, you will have at least 5 x 365 x 2 = 3,650 files per directory! Even if you, like me, strive to create nice per-year hierarchies, UDF will not allow this.

    Six months after launch, one data directory contains 1,685 files.

  • INTEGRITY: UDF insists on converting all telemetry values -- bytes, shorts, or ints -- to floating-point format.

  • ROBUSTNESS: the library SEGVs frequently, often due to uninitialized pointer references. This is unacceptable behavior from a library (but at least it's fixable, in theory).

    Furthermore, some of the infrastructure files are hand-maintained ASCII files, written in an unmaintainable format. Typographical errors (affecting data integrity) abound, and are nearly impossible to find. With just a little effort put into human-centered design, most of these errors (e.g., those caused by counting elements by hand) would become impossible, and many others would stick out, even to the untrained eye.

    Even data files are not robust. The "installation" process (see STANDALONE, above) comprises several complex steps: untarring a collection of files, moving them into place, and running various utilities on them. A failure in any of these steps can result in an inconsistent database, which no one but the UDF wizard can fix. With a proper format such as HDF, corrupt data files can simply be removed from the filesystem, or re-grabbed from the source, without worrying about other internal state.

    From a programmer's point of view, the software is unusable. Even though UDF is meant to be a library, it frequently violates the most elementary rules of library code: it calls printf() on some errors instead of returning a code; many times it masks error codes, so multiple possible failures all map onto the single code "FAILURE", with no hope of tracking down what the root cause is; it even calls "exit()" in some circumstances!

    Finally, even six months after launch, new revisions of the UDF code are released weekly, occasionally daily. Enough said.

  • INTUITIVENESS: UDF uses -- but never adequately documents -- data accessed via "sensors", "scans", "ancillary sensors", "cal sets", and so on.

    To my knowledge, there is no overview-style document describing the UDF data model. Even with months of experience working with UDF, I still have not built a mental model of the philosophy behind it. It doesn't make any sense to me. Admittedly, this could be my own denseness.

    It's even impossible to know what data are available. With a proper file format, one can simply use "ls" to see which files are there. UDF, however, maintains all that in a database, requiring special tools to find out what's available.

    UDF is supposed to simplify people's lives by converting telemetered bits to physical quantities. For the most part, it seems to do this well... but in an unmaintainable fashion. Arcane "Tables" and "Operations" are used, so "y = mx + b" must be written in a bizarre form, incomprehensible to any but a handful of people on the planet.

  • VERSION CONTROL: It is impossible to determine the version or processing level of a data file, except by the timestamp. This is fabulously unreliable.

  • DOCUMENTATION: is nearly nonexistent, often wrong, and poorly cross-referenced. Examples are scarce and not always helpful.

    The only way I was able to get anything working was through frequent interaction with the developer. Even so, the reasons for doing things a certain way were never clear. For instance, one has to use file_pos() to position the data pointer in most instances, but under other subtly different circumstances, one must use FilePosRec().
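To make the INTEGRITY and TRANSPORTABILITY points concrete, here is a minimal Python sketch. It is not UDF code. It assumes the floating-point conversion is to IEEE 754 single precision (the width UDF actually uses is not stated above; with double precision, 32-bit integers would survive intact), and the sample values are invented for illustration.

```python
import struct

# Hypothetical 32-bit telemetry counter: 2**24 + 1 is the smallest positive
# integer that single precision cannot represent exactly.
raw_count = 16_777_217

# Round-trip the value through a 4-byte float (assumed width, see above).
as_float32 = struct.unpack("<f", struct.pack("<f", float(raw_count)))[0]
lost = raw_count - int(as_float32)  # counts silently dropped by the conversion

# The integer 1 written big-endian (as on SPARC) ...
payload = struct.pack(">i", 1)
# ... and read back little-endian (as on Intel) without any byte swapping.
misread = struct.unpack("<i", payload)[0]

print(as_float32, lost, misread)
```

Neither step raises an error: the float round-trip quietly changes the value, and the byte-order mismatch quietly returns a different integer, which is why both problems are hard to notice in practice.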

This is a representative, but by no means complete, sample of the problems inherent in UDF. For more details, please contact me.
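The library-conduct rules mentioned under ROBUSTNESS can be sketched as follows. This is an illustration in Python, not UDF's API: `ReadStatus`, `read_record`, and the in-memory `store` are hypothetical names. The point is only that each failure gets its own status, nothing is printed, and the process is never terminated by the library.

```python
import enum


class ReadStatus(enum.Enum):
    """One distinct status per failure, instead of a single catch-all FAILURE."""
    OK = 0
    FILE_NOT_FOUND = 1
    DEFINITION_MISMATCH = 2
    TRUNCATED_RECORD = 3


def read_record(store, name, expected_version, record_size):
    """Return (status, payload); never print and never exit the process.

    `store` is a hypothetical stand-in for the data files on disk:
    it maps a file name to (definition_version, raw_bytes).
    """
    entry = store.get(name)
    if entry is None:
        return ReadStatus.FILE_NOT_FOUND, None
    version, data = entry
    if version != expected_version:
        # The definition used to read differs from the one used to write.
        return ReadStatus.DEFINITION_MISMATCH, None
    if len(data) < record_size:
        return ReadStatus.TRUNCATED_RECORD, None
    return ReadStatus.OK, data[:record_size]
```

The caller decides what to do with each status (retry, skip the file, report to the user), which is exactly the decision a library cannot make on the caller's behalf.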

3. Data Types

Processing Level: Level 1.

Object Types: Time Series, Multidimensional, Spectra.

4. Support

An enormous amount of support has been necessary so far. I have been in constant contact with the UDF developer (UDF is pretty much a one-man show).

The developer does provide prompt support: from responding to bug reports, to analyzing my code snippets and describing where my thinking has strayed from the UDF model, he always seems to be available. However, one shouldn't need full-time access to a developer, just to read data.

Even with constant support, "uh-oh" issues arise almost weekly. For instance, when I was unable to reposition the data pointer on one dataset, I was informed that I had to use the special routine ToThisTime(). That was the first I'd heard of it, and it wasn't clear why the existing file_pos() function couldn't do the job.

Finally, it isn't always possible to resolve issues. On various occasions, I have reported bugs to the developer (along with copious documentation explaining why each is a bug, how to reproduce it, even how to fix it), and the bugs still remain.

5. Software

I have developed, from scratch, an IDL package that provides a simple open()/read()/close() interface to UDF. Working on the assumption that researchers just want to get their hands on their data, I have simplified UDF as much as is humanly possible.

The complexity of the underlying code -- just for reading data, using UDF library calls -- is overwhelming: over four thousand lines of code (C and Perl) are required to provide this interface.

6. Environment

UNIX-only (Linux, Solaris) with GNU tools.

7. Usage

Since we're given data in UDF, and duplicate storage costs are pretty high, we do keep the UDF files and read from them.

All subsequent Level 2 (and beyond) products are written using HDF, of course.

8. Experience

 >Relative to its ability to carry and manage research-needed metadata

This seems to be a -- possibly the -- fundamental rationale behind UDF.

I don't know if or how it will succeed in this goal. Even if it performs as claimed, the cost is prohibitive. I've seen other missions accomplish this with far less effort.

 >Relative to its related software

UDF does do something that no other format does: convert packed telemetry bits to "normal" quantities. However, IMnsHO, this should be done by ground station software which then saves and distributes data files in a common, robust format.
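For readers unfamiliar with the conversion in question, here is a minimal sketch of the two steps: unpacking bit-packed telemetry and applying a linear "y = mx + b" calibration. The packing layout (two 12-bit unsigned samples per three bytes), the function names, and the slope and offset values are all assumptions made for illustration; they do not describe UDF or any IMAGE instrument.

```python
def unpack_12bit_pairs(packed):
    """Split each group of 3 bytes into two 12-bit unsigned samples.

    Hypothetical layout: the first sample occupies byte 0 and the high
    nibble of byte 1; the second occupies the low nibble of byte 1 and byte 2.
    """
    if len(packed) % 3:
        raise ValueError("packed length must be a multiple of 3 bytes")
    samples = []
    for i in range(0, len(packed), 3):
        b0, b1, b2 = packed[i:i + 3]
        samples.append((b0 << 4) | (b1 >> 4))
        samples.append(((b1 & 0x0F) << 8) | b2)
    return samples


def calibrate(counts, slope, offset):
    """Apply y = m*x + b to every raw count."""
    return [slope * c + offset for c in counts]


raw = unpack_12bit_pairs(bytes([0xAB, 0xCD, 0xEF]))  # [0xABC, 0xDEF]
physical = calibrate(raw, slope=0.5, offset=-10.0)   # made-up coefficients
```

Written this way, the calibration is a single readable expression, and the raw integer counts remain available alongside the converted values.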

9. Desired Functionality

None.

Although the idea behind UDF is a tempting one, there are too many problems with the implementation itself. A complete redesign and reimplementation, from the ground up, is necessary to make it useful.

10. Selection Criteria

See my comments elsewhere.

11. Impact on Research

See my comments elsewhere.

For the most part, I like to accept data in whatever form the distributor likes (as long as it's well documented). If it's not in a useful form, I convert it to HDF or CDF (via automated scripts), and forevermore ignore the source data files.

This has worked successfully so far, requiring only a few days' effort (at most) to be able to handle a new mission data format. After this, no thought is ever again given to data format, since generic wrapper functions are used to access data.
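The "generic wrapper" idea can be sketched as a small registry of per-format readers behind one access function. This is an illustration only: the names `register` and `read_data` are hypothetical, and CSV and JSON stand in for real formats such as HDF or CDF so that the example runs with the standard library alone.

```python
import csv
import io
import json

_READERS = {}  # format name -> reader function


def register(fmt):
    """Decorator that records a reader function under a format name."""
    def wrap(fn):
        _READERS[fmt] = fn
        return fn
    return wrap


@register("csv")
def _read_csv(text):
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    # Return one list of floats per column, keyed by column name.
    return {name: [float(row[name]) for row in rows]
            for name in (reader.fieldnames or [])}


@register("json")
def _read_json(text):
    return json.loads(text)


def read_data(fmt, text):
    """Single entry point used by analysis code, whatever the source format."""
    try:
        reader = _READERS[fmt]
    except KeyError:
        raise ValueError(f"no reader registered for format {fmt!r}") from None
    return reader(text)
```

Supporting a new mission format then means writing one reader function; code that calls `read_data` does not change.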

With UDF, I've spent over six months getting things working. This unnecessarily hampered my productivity on other projects.

12. Other Comments

In writing these comments, I have succeeded in alienating and infuriating a large number of people. Such is life.

My intent is not to belittle or insult the UDF developer: he seems to be a terrific fellow, provides quite good support, and is really trying hard to make UDF work. However, that doesn't excuse UDF.

UDF can dazzle with its promise of quick access to data, and instant-gratification pretty pictures. But any attempt to do anything more with it will immediately show its weaknesses.


URL: http://ssdoo.gsfc.nasa.gov/nost/fep/researcher-Eduardo Santiago-UDF.html


Author: Eduardo Santiago / Los Alamos National Laboratory / IMAGE/MENA (esm@lanl.gov) 505/665-3130
Curator: John Garrett (John.Garrett@gsfc.nasa.gov) +1.301.286.3575
NASA Official: Code 633.2 / Don Sawyer (Don.Sawyer@gsfc.nasa.gov) +1.301.286.2748
Last Revised: 2000-10-02 T15:51:24, Eduardo Santiago