Digital Archive Issues from the Perspective of an Earth Science Data Producer

 

Bruce R. Barkstrom

 

Atmospheric Sciences Division

NASA Langley Research Center

Hampton, VA 23681-0001

 

 

Introduction

Earth science data have several unique characteristics that play a role in that data’s archival; we enumerate them in the body of this paper.

Each of these characteristics influences the way data producers generate their data and the way users access and retrieve it. Of course, the perspective of the data producers is far from identical with the perspective of the many user communities. As we explore these perspectives and their implications in this paper, we will be reminded that digital archives have a long-term view of their data, but that they also need to be in intimate contact with the living communities of producers and users for that data.

We anticipate that the producer and user communities will experience significant changes in the way they produce data, store it, search for it, and use it as they gain experience in working with data that are linked in the fashion of ‘hypertext’. The WWW pages and links that move us from web site to web site show how the familiar ‘Table of Contents’ and ‘Index’ of books expand when computers power the search. Eventually the influence of these pages and links will reach into the way data producers package their ‘data products’. When we arrive at that part of the next millennium, fluid forms of data packaging will fit naturally into the world of the digital archive. However, we still have the pangs of childhood and adolescence to endure before this field is mature.

In the body of this paper, we begin with the views of the data producers. We are particularly interested in the data structures that producers currently create. These structures often reveal a great deal about the scientific producer’s ‘world view’. They also reveal the constraints that current technology imposes on data production. For example, most of the data producers in the first EOS missions produce ‘files’. One of the reasons they do so is that database approaches still raise serious concerns about data throughput. An additional concern arises from the complexity of data production and configuration management. We illustrate this complexity by showing how these file structures provide a useful ‘tree’ to index the data in the archives. Data producers have to introduce special branches in these trees to account for versions of data products that arise from validating scientific data. We also recognize that this tree structure for files and file collections corresponds to a ‘Table of Contents’ for a particular data producer’s products. Metadata that aggregates data values in the data files then plays the role of an ‘Index’ that allows users to search through data in a ‘random access’ mode.

The second section of the body of this text considers how users view what the data producers create. Users often want to search for data in ways that are quite different from the paths that data producers use. We first consider the diversity of user communities that may access Earth science data. From this discussion, we see that users often adopt an ‘object-oriented’ frame of mind, a ‘retail’ approach to data. Data producers, on the other hand, tend to look at data from a ‘wholesaler’ point of view that emphasizes uniform blocks of data that do not have many external differences. When the archivist needs to mediate between these two points of view, he may want to use several different approaches to building searchable structures. He might include producer-provided statistical summaries of the data product files. He might use secondary indices to the files. Independent investigators could build such indices. We are probably just at the beginning of an era in which this kind of ‘data scholarship’ will markedly expand the usefulness of scientific data archives.

In the third section of this text, we consider the role of digital archive centers for scientific data. Before we discuss this role, we note that the divergence between data producer views and data user views naturally leads us to consider subsetting data in files and then combining the subsets into new supersets. In print libraries, such an approach would be equivalent to allowing users to Xerox individual pages, then to excerpt the material from the pages, and finally, to create new scholarly works from interpretations of the excerpts. With subsetting and supersetting of the basic data, scientists can produce entirely new interpretations of natural phenomena. Of course, these new interpretations need peer review and extended discussion before the scientific community accepts them. The archive centers that contain the data may help by serving as guides and facilitators for this discussion and consensus building activity. In order to fulfill this role, these centers need to remain in intimate contact with the living communities of data producers and data users.

This perspective suggests that in the long run, digital archive centers will become centers for ‘scholarly access’ to scientific data. It also suggests that the current notion of placing ‘used’ data in ‘long-term archives’ (which often appear to be viewed as ‘write-once, read-never’ data sinks in a managerial context) is incorrect. A much more useful notion is to view digital archive centers as specialized research libraries that are one component of a community of data scholars.

 

 

A Producer Perspective on Earth Science Data

Data Producers as Members of a Scientific Community

From the perspective of this paper, a data producer is a scientist (or a scientific team) who agrees to produce data for peer-reviewed scientific work. Because a critical component of scientific work is being able to reproduce results, data producers provide their data to institutions we will call data centers. In previous times, data producers created their own data centers and distributed their data to other researchers. However, it has been helpful to have data centers that can deal with data distribution, documentation, and preservation. In some cases, data centers also partner with data producers to generate scientific data products.

Clearly, the scientists who generate data products are familiar with the instrumentation that generates the raw data. In addition, these scientists need to be familiar with the algorithms that convert raw data into useful scientific information. Given these skills and knowledge, data producers form a relatively homogeneous reference community for scientific data. In other words, scientific data are created within a community familiar with the disciplinary basis for the scientific investigations that serve as the initial justification for collecting the data.

Some Unique Characteristics of Scientific Data

If a data center were dealing with financial data, it would need computers to collect, manipulate, and store these data. In the present environment, such financial data are often generated in discrete transactions from sites such as Automated Teller Machines or Point of Sale terminals in stores. From these discrete sites, they are transmitted to more centralized locations for ingesting into databases. There, the computers create statistical summaries and provide the results to programs that compare the transaction summaries with results from programs that model the financial flows within the firm and within the economy as a whole.

In what ways do scientific data differ from financial data? We can identify several unique characteristics of scientific data, and of Earth science data in particular:

- the spatial and temporal sampling that underlies the data values,
- the influence of the data production system architecture,
- the spatial and temporal structures that organize the data,
- the ‘file’ (or relation) schemas that hold the data, and
- the complexity of configuration management during production and validation.

Let us discuss these characteristics in more detail. They govern much of what data producers do.

 

Spatial and Temporal Sampling for Earth (or Space) Science Data

The spatial and temporal coordinates that underlie Earth science data are critical to these scientific data values. In a few cases (mainly for in-situ measurements), instruments provide samples of the environment at particular points in space. However, in remote sensing, the data points sample volumes of the Earth’s atmosphere or areas on its surface. For example, in remote sensing with a satellite imager, the data value in a given pixel, m, is related to the radiance, I, leaving the top of the atmosphere at a latitude, l, and longitude, λ, through a Point Spread Function (PSF). The imager has a coordinate system related to its optical axis, the position of the satellite (also in latitude and longitude), and the satellite’s attitude. Without going through a detailed derivation, we can write the relationship as an integral:

m = ∫∫ P(l − l′, λ − λ′) I(l′, λ′) dl′ dλ′

where P is the PSF and the primed coordinates range over the area the pixel samples.

Even if a data producer records the latitude, longitude, and time of the data value, a data user may still need information about the PSF in order to determine how much the data taking process has smeared the underlying distribution of radiance.

For the same reason, data users may need very detailed information about the spatial and temporal sampling pattern of the instruments that collect the data. While data producers often supply illustrations of sampling patterns, they must supply quantitative descriptions of them as well. Even a simple push-broom sensor that samples from limb-to-limb is likely to have samples that are closer together at nadir (in distance on the Earth’s surface) than they are near the limb. A microwave imager may use a conical scan, in which the data points are uniformly spaced around the scan. A user who wants to compare the push-broom imager data with the conical scanning microwave data will have to understand the geometry of both instruments. The need for this kind of detailed and precise sampling information is one of the distinguishing characteristics of scientific data.
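
To make the nadir-to-limb spacing concrete, the sketch below uses a simplified flat-Earth approximation for a push-broom sensor that samples at uniform angular increments. The altitude and angular step are illustrative assumptions, not values from any particular instrument.

    import math

    h_km = 800.0        # assumed satellite altitude, km (illustrative)
    step_deg = 1.0      # assumed uniform angular step between samples

    def nadir_distance_km(scan_angle_deg):
        """Flat-Earth approximation: ground distance from nadir to the sample."""
        return h_km * math.tan(math.radians(scan_angle_deg))

    for angle in (0.0, 15.0, 30.0, 45.0, 60.0):
        spacing = nadir_distance_km(angle + step_deg) - nadir_distance_km(angle)
        print(f"scan angle {angle:4.0f} deg: sample spacing ~ {spacing:5.1f} km")

Even with this crude geometry, the spacing between successive samples grows by roughly a factor of four between nadir and a 60-degree scan angle.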

The Influence of the Data Production System Architecture

Data producers tend to design production system architectures that make the data they work with as uniform as possible. Thus, they tend to work with uniform temporal intervals (hours, days, or months) or uniform spatial chunks (the entire globe, latitudinal profiles that are averages over all longitudes, equal-angle grids or equal-area grids). They often design their production processes so that data enters these processes in uniform time intervals even though the data have complex spatial sampling patterns within the time interval. Later in the processing, the same data producer may rearrange the data into a more uniform spatial structure that has a time series for each spatial region – particularly if the data come from an instrument that creates a more-or-less continuous stream of measurements.

The Spatial and Temporal Structures Underlying Earth Science Data

Figure 1 is a schematic representation of the underlying spatial and temporal structure for Earth science data. Some in-situ instruments take data from a fixed point on the Earth. This kind of sampling would appear in figure 1 as a straight line that goes from the front surface to the back, parallel to the time axis. Geostationary satellite instruments typically collect data from a circular cylinder in this figure. The center of the cylinder is located directly under the satellite. Its axis extends back through the volume, parallel to the time axis. Low-Earth orbiter instruments weave a sideways, ‘S’-shaped swath through this sampling space.

Figure 1. Projected Spatial and Temporal Coordinates for Earth Science Data. For many kinds of Earth science data, it is useful to think of the data obtained by various sensors as having attached variables that give spatial location and time for each data value. In complete generality, there are three spatial variables (latitude, longitude, and altitude) as well as time. As this schematic figure illustrates, we often project the three spatial coordinates onto two.

Earth Science Data ‘File’ (or Relation) Schemas

Data producers and archivists are not entirely consistent in recording these underlying spatial and temporal sampling structures. One factor that influences both of these groups is the expense of storing large amounts of data. In products that contain gridded data, the structure is regular enough that data producers may not include the grid boundaries in the data product. Often, they feel that they can reference the grid in documentation, expecting users to work with the data just as well as they could if the grid were embedded directly in the file. In other cases, the data producer will embed sample locations within the data products themselves.

Figure 2 illustrates a ‘canonical’ way of thinking about how data may be incorporated into the ‘files’ within a data product. We assume that in this representation, the data producer arranges the data values into records. The first four values in a record provide the centroid of the space and time sampling for the data values. The data values after these first four are then the measurement values taken within this sampling volume. Noting that data producers and archivists desire ‘self-documenting’ files, we also show that each file contains a second collection of records that ‘annotate’ the data values. As figure 2 shows, our suggestion for this annotation provides a Name for each field, a Description of it, the Units of that field’s measurements, Bad Value Definitions, and similar kinds of information.
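
A minimal sketch of this canonical layout, using hypothetical field names: each data record begins with the four space-time centroid values, and a parallel set of annotation records describes every field.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Annotation:
        """One annotation record describing a single field (figure 2)."""
        name: str
        description: str
        units: str
        bad_value: float          # predefined 'bad value' sentinel

    @dataclass
    class DataRecord:
        """One data value record: space-time centroid, then measurements."""
        time: float               # e.g., seconds past an epoch (assumption)
        latitude: float
        longitude: float
        altitude: float
        values: List[float]       # measurement fields, in annotation order

    annotations = [Annotation("lw_flux", "Longwave flux at the top of the atmosphere",
                              "W m-2", -999.0)]     # hypothetical field
    record = DataRecord(0.0, 36.1, -76.3, 0.0, [242.5])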

The form shown in figure 2 is intended to be isomorphic with the layout of data fields in a conventional database. If a data producer wanted to use this kind of software, he would design tables whose rows correspond to the records and whose columns correspond to the fields. A few of the data producers for the Earth Observing System have moved in this direction.

More commonly, data producers rearrange the sequence of fields and records. They may choose to build blocks of storage that contain only one field. The file then consists of sequences of homogeneous blocks of data values. Often, data producers do not include all of the fields identified in light gray in this figure. For example, if a data producer is creating files that contain images, he might choose to leave out the time field and the altitude field. He might embed samples of latitude and longitude every tenth line and for every tenth pixel in the image, similar to the structure NOAA uses for providing data from the Advanced Very High Resolution Radiometer (AVHRR). Alternatively, if there are many spectral bands for a very high data rate instrument, the data producer might put the latitude and longitude of each pixel in a separate file that users access when they need that information.
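
Where geolocation is embedded only at every tenth line and pixel, as in the AVHRR-like scheme just described, a user can recover an approximate full-resolution grid by interpolation. A sketch with NumPy, using a small synthetic anchor grid rather than real AVHRR geolocation:

    import numpy as np

    # Hypothetical anchors: latitude at every 10th line and pixel,
    # a 21 x 21 array standing in for a 201 x 201 image.
    anchor = np.add.outer(np.linspace(30.0, 40.0, 21),
                          np.linspace(0.0, 0.5, 21))

    def bilinear_expand(a, factor=10):
        """Expand an anchor grid to full resolution by bilinear interpolation."""
        n = (a.shape[0] - 1) * factor + 1
        x = np.arange(n) / factor            # fractional anchor index
        rows = np.array([np.interp(x, np.arange(a.shape[1]), r) for r in a])
        cols = np.array([np.interp(x, np.arange(a.shape[0]), c) for c in rows.T])
        return cols.T                        # full-resolution grid

    full_lat = bilinear_expand(anchor)       # shape (201, 201)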

Figure 2. Canonical Treatment of Data and Annotation Structures within a ‘File’. In this schematic illustration of the way data appear within a single ‘file’, we show records of data values and records for annotating the fields within the data value records. As we show the data structure, data values collected by a particular sampling of space and time are stored in records of successive fields. Each field has a name, a description, units, etc., as we show in the "Annotation" Records. A complete file would contain at least these two kinds of records if it were to be completely self-documenting. In practice, scientific files may treat the space and time sample values implicitly so that they are not recorded in the file at all. We display the Data Value Records and the Annotation Records in a form appropriate for database tables. In practice, data producers may group the fields and annotations in other sequences than the one we show here.

Scientific data from real instruments also contain ‘bad values’. Sometimes the data collection system doesn’t work and loses data. At other times, the instrumentation records incorrect values. In either event, data producers create quality assurance logic in their software and institute quality assurance procedures that help their teams identify bad values. The indicators of bad values are sometimes ‘flag’ fields and are sometimes encoded as particular data values. The canonical data structure in figure 2 supports either method of communicating predefined bad values.
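
A sketch of how a user might honor both conventions at once, assuming a hypothetical -999.0 sentinel and a parallel quality flag field:

    import numpy as np

    values = np.array([241.3, -999.0, 243.7, 245.1])  # -999.0: assumed sentinel
    flags = np.array([0, 0, 1, 0])                    # nonzero: assumed 'bad' flag

    # Mask a value if either convention marks it bad.
    bad = (values == -999.0) | (flags != 0)
    clean = np.ma.masked_array(values, mask=bad)
    print(clean.mean())                               # averages only the good values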

 

Data Producer Configuration Management Complexities

The data producer that creates the data and the data center that stores it for distribution have the responsibility for ensuring that the data fields and the ties to the spatial and temporal sampling are adequately documented for users. In addition, the producer and the data center need to pay careful attention to configuration management of scientific data. This task can become quite complex as we show in the next few figures and paragraphs.

Data users often underestimate the complexity of data production because producers present simple diagrams showing how archival data products are connected with the algorithms that ingest one form of data to create another. The top diagram in figure 3a shows such a simplistic view for production of data by the Earth Radiation Budget Experiment (ERBE). In this diagram, data from the ERBE instruments enters on the left. The first process converts the raw instrument data into instantaneous fluxes at the top of the Earth’s atmosphere. The second process averages the instantaneous fluxes to produce monthly averages. However, this linear depiction is much too simple to describe the actual dependence of the archival products upon the ancillary data that the ERBE production team has to supply.

The lower diagram in figure 3a shows these additional kinds of files. The first process has to have calibration coefficients and satellite ephemeris data. The second process has to have Angular Distribution Models that enter directly into the conversion from radiances to fluxes. This process also needs instrument spectral characteristics and a map of the underlying surfaces that cover the Earth (oceans are different from deserts, etc.). The monthly averaging process needs to account for the systematic diurnal pattern of desert surface heating during the sunlight hours, as well as the variation of albedo over the course of the day. These technical details are not the primary concern of an archivist. However, the complexity of this topology is an important characteristic of scientific data. The lower diagram in this figure has a considerably more complex topology than the advertising diagram in the upper portion.

As the astute reader might expect, even the lower diagram in figure 3a does not convey the true complexity of production topology. Figure 3b on the following page shows some of that complexity for producing the monthly average, albeit assuming that a month has just two days. Of course a real month has about thirty days, which means that the daily parts of the process would be repeated thirty times. To be completely correct, we would have to include the last day of the previous month and the first day of the following month. In addition, the ERBE monthly production had to include options that process data from only one satellite or as many as three. We leave the appropriate diagram connecting files and processes to the reader’s imagination.

The connectivity we illustrate directly concerns the data center and the serious data user. It determines the traceability or genealogy of data in a particular data product. It also affects how producers label versions of data sets. Data producers spend a great deal of their time validating data after they start producing it. A substantial part of that work involves finding unexpected discrepancies between data from one source and data from other sources. When such a discrepancy is discovered and verified, the data producer teams have to identify the source and correct the problem. Sometimes the problem lies in the algorithms that underlie the production processes. At other times, the problem lies in the input data. Once the producer identifies the cause and corrects it, he has to decide how to remove the problem from the data in the data center. Occasionally, the problem is small enough that the producer can simply note it for the users to take into account when they have a unique data use. More likely, some of the data in the data center will have to be reprocessed. After reprocessing, the data user will encounter the problem of multiple versions of the data.

Figure 3a. Production Topology Issues – Simplistic Connectivity of ‘Files’ Used in Data Production. From an external view, it is often useful to present an extremely simple view of the relationship between data product files and the production processes. The upper view shows such a simplification, in which we remove all of the data ‘files’ except those that are the ‘archival’ data product files. The lower view shows the generic data products that enter the production ‘jobs’. Data ‘files’ are shown as rectangles; processes that ingest and create data are shown as circles.

 

Figure 3c illustrates the feedback processes involved in scientific data validation. In the ERBE measurements that we illustrate here, each satellite carried both a scanning instrument and a non-scanning one. Both instrument types also carried internal and solar calibration sources. These instruments flew on several different satellites whose sampling of the top of the atmosphere overlapped at orbit intersections. Each of these calibration sources and intercomparison opportunities offered a separate validation intercomparison whose results could feed back on the coefficients that produced the data. For example, if the solar calibration targets suggested that the non-scanner instrument changed gain, then the team would change the non-scanner calibration coefficients. As we can see from the figure, multiple validation opportunities introduce complex feedback loops – at the same time that they increase the certainty of the data.

Figure 3b. Production Topology Issues – Genealogical Complexity. This figure illustrates a partial expansion of the ‘file-process’ topology that represents an actual production run which creates a monthly average data product from a month of input data. For careful archival work, the final data products need a configuration management system that will allow a researcher to track the genealogy of the antecedent data used by the production process, as well as the algorithms that created the final data products.

Many Earth science data producers work to produce data products that have long-term consistency. For this purpose, they may need to generate global data products for several months in a year before they can evaluate whether or not the data need further refinement. The ERBE data provide an excellent example of this requirement. Obtaining a net radiation balance that is close to zero over an annual cycle is a very stringent requirement. The ERBE science team was not satisfied with their data until they had computed the net radiation balance with a year of data to see that the observed imbalance was relatively small (which it was). Many other groups make such long-term, large area consistency checks part of their validation effort. From a user perspective, it is easy to talk of ‘eager’ and ‘lazy’ approaches to data production. From a data producer perspective, either of these approaches is overly simplistic. When they have to do serious validation work, data producers would probably prefer to describe their approach as ‘considered’.

Figure 3c. Production Topology Issues – Versioning. This figure illustrates a partial expansion of the validation process that data producers use to reduce uncertainties in their final data products. The rectangles indicate the coefficient files the producers use to generate the calibrated data. Each change in these files generates a new Data Set Version in the hierarchy we describe below.

The Topology of Earth Science Data Inventories

Data producers generate collections of files using process-product topologies that contain complex webs of coefficient and algorithm genealogies. These webs are not random, with untraceable weavings of relationships. Rather, they are like tapestries with highly regular patterns. Data producers create groups of files that are very similar to one another, particularly for satellite data. Within one of these groups, the fields within the records have the same sequence and data types. In figure 4, we call this top level of homogeneity the Data Product level. A Data Product consists of Data Sets, in which the files contain data from a homogeneous collection of sources (a single satellite, a single in-situ data source). For ERBE we would have some Data Sets that contained only data from the Earth Radiation Budget Satellite (ERBS), others that contained only data from NOAA-9, and some that contained both. As a science team proceeds through validation, they modify either the algorithms or coefficients. These variations introduce Data Set Versions within the Data Sets. In practice, we may need to introduce an additional configuration version to produce a unique index for each file.

The data, files, and file collections form a ‘natural’ hierarchy for data producers, a tree structure. Figure 4 illustrates this structure. Assuming that a data producer does not put the same data record twice into a data center, this tree provides a unique location for each data record. We also note that in many cases, we can place the files within a Data Set Version into a sequence ordered by the starting time of the data the file contains. In more formal terms, we might say that time and space sequences often provide a unique indexing for the possible file opportunities within a Data Set. As the files are created within a Data Set Version, the data center can relate the ‘file opportunity index’ to the file position in the version sequence.
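
The tree of figure 4 maps naturally onto a path-like unique index. A sketch with hypothetical product and file names, in which each ‘file opportunity’ within a Data Set Version is keyed by the start time of the data it holds:

    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class DataSetVersion:
        version_id: str
        files: Dict[str, str] = field(default_factory=dict)   # start time -> file

    @dataclass
    class DataSet:
        name: str
        versions: Dict[str, DataSetVersion] = field(default_factory=dict)

    @dataclass
    class DataProduct:
        name: str
        data_sets: Dict[str, DataSet] = field(default_factory=dict)

    # Hypothetical example: an ERBE-like monthly product, ERBS-only data set.
    inventory = DataProduct("MonthlyFlux", {"ERBS": DataSet("ERBS",
        {"V2": DataSetVersion("V2",
            {"1985-07": "MonthlyFlux_ERBS_V2_1985-07.dat"})})})

    # A unique path to one file: product / data set / version / file opportunity.
    path = ("MonthlyFlux", "ERBS", "V2", "1985-07")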

Figure 4. Data, Files, and File Collections as a Tree. This hierarchy describes a standard organization of Earth science data that is common in the EOS Distributed Active Archive centers. Most of the data is kept in files made of records. The records consist of data values that are typically a ‘float’ data type, i.e., 4-byte floating-point numbers. Data set versions are intended to be "as homogeneous as possible" in their contents, so that both processing algorithms and input coefficient changes are minimized within the files in this grouping. Data sets are groups of data set versions, in which the dominant difference from data set to data set is the time and space sampling. Data products are collections of data sets. Data products are likely to be fairly uniform in content and structure. Data files may contain several record types, as well as appropriate documentation and simple metadata. These metadata components do not appear in this figure.

 

The material we have covered briefly here should convey some of the perspective that data producers bring to the problems of data production and archival. Data producers are not insensitive to users. However, the users that are most important to them are usually scientists with interests and experiences similar to the data producer. Production performance becomes paramount. These influences lead data producers to treat the data for which they are responsible in terms of relatively homogeneous ‘chunks’ of data that are often easiest to order in a time sequence or in gridded spatial structures. Data producers are also more likely to be sensitive to configuration issues than are data users. In the section that follows, we consider the other point of view.

 

Some Thoughts on the User Perspective

Science Data User Communities

In contrast with the relative homogeneity of the data producer’s community, users come from diverse communities:

- researchers working within the discipline that produced the data,
- interdisciplinary researchers who combine data from several sources,
- educational users, and
- members of the general public.

As we move from the highly specialized world of the scientific researcher to the diverse inclinations and backgrounds of the general public, it becomes increasingly important to provide a support infrastructure. We expect researchers to be comfortable with long words and to have a precise understanding of the meaning of data annotation. The general public grows impatient with long words and long definitions. Regardless, none of these communities wants to wait for data.

Spatial and Temporal Structure Needs of Different Users

The divergence among the user communities forces data centers to take several different approaches when they present the spatial and temporal structure of data to data users. Researchers working in a particular scientific discipline receive grounding in the spatial and temporal structures of their discipline’s data as part of their education. Concise tables or documentation written by other researchers probably suffice for researchers. Researchers doing interdisciplinary work are likely to have to deal with a variety of conventions. A common documentation format for describing coordinate systems and data formats is very helpful to researchers working with several different data cultures. Educational and public users may not be familiar with spatial gridding conventions. It is easy to picture the confusion that different map projections create. Students used to seeing how large Greenland is with respect to Africa on a Mercator projection may be shocked at how much Greenland shrinks when it is displayed on an equal-area projection. Both of these communities need simple narratives that can help users locate map features.

We can find examples of this diversity even within a single disciplinary community. The International Satellite Cloud Climatology Project (ISCCP) uses an ‘igloo-like’ gridding scheme that is approximately equal-area. ISCCP starts numbering its boxes at the South Pole. ERBE uses an equal-angle gridding whose box numbering starts at the North Pole. When the author wants to compare ERBE fluxes with ISCCP clouds, he has to be careful about transforming from one grid to another. Since he has lived with these two data sets for a long time, the author simply expects to take some time in dealing with the underlying grids and indexing approaches. Other users are not so patient.
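
As a small illustration of the bookkeeping such a transformation involves, the sketch below converts an equal-angle box number counted from the North Pole, in the ERBE style, into latitude-longitude bounds. The 2.5-degree resolution and the eastward numbering from the date line are assumptions for illustration, not the project’s published formula.

    def equal_angle_box_bounds(box, res_deg=2.5, lon_boxes=144):
        """Bounds of 0-based box number `box`; rows counted southward from
        the North Pole, boxes counted eastward from the date line."""
        row, col = divmod(box, lon_boxes)
        lat_n = 90.0 - row * res_deg            # northern edge of the row
        lon_w = -180.0 + col * res_deg          # western edge of the box
        return (lat_n - res_deg, lat_n), (lon_w, lon_w + res_deg)

    print(equal_angle_box_bounds(0))    # ((87.5, 90.0), (-180.0, -177.5))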

Getting the communities to establish a common and well-documented agreement on these underlying conventions is certain to exceed the patience of Job (and the travel budget of even the Defense Department). The author recalls discussions on establishing a common grid for EOS. These talks extended over a year and a half without really reaching a consensus that was enthusiastically supported by the EOS scientific communities. It is probably more realistic to work toward getting each community to document its conventions. Then, the archival community can at least point users to this documentation and, perhaps, provide software to translate from one convention to another.

Reaching a consensus on the underlying spatial and temporal structures and on the software services that use them is more difficult for scientific data than it seems on the surface. The sampling processes that create these data vary from instrument to instrument and mission to mission. Sophisticated users may want to enhance the resolution of digital data or remap it to conform to conventions not in the original data. Interdisciplinary users combining data from different sensor systems and different missions also need to resample data so they can put data in their own spatial and temporal structure. These users face particularly difficult problems in obtaining consistent, reliable, and quantitative documentation from multiple sources about the PSF, calibration, and algorithmic basis for the data products they use in their research.

It isn’t easy to supply software to scientific data users. The number of scientific software users is much smaller than the number of users for utilities like word processors. Thus, vendors have a smaller user base over which they can amortize their development effort. In addition, the scientific community is ‘tribal’ in nature, with each tribe having its own ‘data religion’. Software vendors are left to face diverse, warring communities that create niche markets. Standards may help, but scientists don’t like to work on standards. Developing standards takes time away from scientific research. ‘Shareware’ developed by researchers may help other users, but scientists are often hard-pressed to support a variety of hardware and software products. If data centers agree to distribute such software, they face difficult resource allocation decisions as well as problems with user expectations and liability.

User Spatial ‘Objects’

There are other differences between the data worlds of producers and of users. Producers think in terms of large blocks of data. The author believes that many data users think in terms of objects that come from different spatial classes:

- fixed targets, such as cities or islands,
- fixed regions, such as continents or field experiment areas,
- distributed features that require expert identification, such as ecosystems or snowfields, and
- moving targets that change shape over time, such as storms, dust clouds, smoke plumes, and fires.

Figure 5 illustrates these classes. It also places the classes in the context of the kinds of data structures producers are likely to use.

Cities and islands are relatively easy to identify in longitude and latitude. So are fixed regions, such as continents or field experiments. Identifying ecosystems and snowfields requires expert help. For a definitive decision on whether a remote sensing feature is a snowfield or a glacier, consult a glaciologist. Scientists certainly need to help identify moving targets such as storms, dust clouds, smoke plumes, and fires. Distinguishing between clouds and smoke is not a trivial problem.

There is considerable interest in ‘data mining’ techniques that identify ‘interesting objects’ in data. Marketing or financial engineering may benefit from such techniques. However, their applicability to scientific data is unproven. Scientific researchers place great importance on the heritage and physical basis of the algorithms they use to derive results. This community will expect object identification algorithms to be understandable and repeatable. Feature identification algorithms should separate one class of objects from another with well-understood uncertainties. Proposals to apply non-peer-reviewed ‘data mining’ techniques to remotely sensed data will probably not achieve scientific credibility.

Figure 5. Spatial and Temporal Structures for User Objects. The upper sequence of rectangles illustrates a time sequence of ‘global maps’, each of which is contained in a single file. This is the ‘standard order’ in which a data producer has inserted the files into the data set. The lower sequence of rectangles illustrates a time sequence of ‘global maps’ in which the first user is interested in a fixed target, the second is interested in a region for a brief period of time, and the third is interested in a moving target that changes shape over time. The dashed arrows for the moving target are intended to help the reader see that the object is the same from one map to the next.

 

Data Search Services

Each of the user spatial classes we described creates challenges for a data center trying to help users obtain data. Data centers need several approaches to help users find data for each of these object classes. Two approaches come to mind immediately:

- deliver the producer’s files whole, leaving the user to extract the data for the objects of interest, or
- extract the data belonging to an object at the data center and deliver only that data.

The first approach delivers files; the second delivers data for objects. Between these two approaches there is a considerable spectrum of search services and delivery mechanisms.

Let us broaden the discussion to the general problem of helping users find data, that is, of providing search services. There appear to be five primary approaches to providing these services:

- inventory search,
- parameter (keyword) search,
- metadata search,
- documentation search, and
- secondary index search.

Let us briefly consider each of these methods.

Inventory Search

In the first approach to finding data, the data center maintains a user-accessible representation of the tree structure we displayed in figure 4. Users traverse that tree structure to find the files they want. Data centers can provide this search tree in several ways. Small collections might have HTML presentations that would allow users to traverse from page to page until they located the files that interested them. Larger collections of data files will use databases to store this kind of information. Because the tree structure can be portrayed as an outline, this approach to finding data is very similar to using a ‘Table of Contents’ for a book. In this metaphor, the files correspond to pages of text. Data Products correspond to chapters, Data Sets to sections, and Data Set Versions to subsections of text.

Parameter (Keyword) Search

In the second approach, users locate appropriate data products by searching through tables that list parameters contained in the data product files. These tables can use the same implementation mechanisms we just discussed for using the tree structure of file collections. A parameter list that points to files or file collections is similar to an index for a printed text. However, the metaphor isn’t exact. The parameters in data products are nearly identical from one data set to another. If we take this metaphor literally, it would apply to books that used different words in different chapters, but in which the sections, subsections, and pages within a chapter had the same words. As we suggest below, developing a list of parameters in files is probably easiest to do from documentation.

Parameter lists also need to incorporate aliases and related kinds of ‘pointers’. A user who wants to find data that contain the Earth’s ‘radiation balance’ may not think to look for data containing the Earth’s ‘radiation budget’, even though these two terms are essentially identical. To further complicate the job that data centers undertake, terminology evolves as the communities of discourse evolve. An index for the Philosophical Transactions of the Royal Society in 1700 would contain a vastly different set of words and phrases than one for that publication in 2000. Data centers need to find ways of identifying terminology that their data producers use when they are generating their products. Then, the data centers need to periodically review the evolution of that terminology.
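
A sketch of an alias-aware parameter index; the product names and the alias table entries are hypothetical. The alias table maps a user’s vocabulary onto the producer’s term before the lookup:

    # Hypothetical parameter index: canonical term -> data products.
    parameter_index = {
        "radiation budget": ["ERBE_S4", "CERES_ES8"],
        "cloud fraction": ["ISCCP_C1"],
    }

    # Alias table: user's term -> canonical term used by the producers.
    aliases = {"radiation balance": "radiation budget"}

    def find_products(term):
        term = term.lower().strip()
        return parameter_index.get(aliases.get(term, term), [])

    print(find_products("Radiation Balance"))   # ['ERBE_S4', 'CERES_ES8']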

Metadata Searches

In the third search method, users examine statistical summaries of the data in individual files. Data centers seem to associate the term metadata with these summary values. We can distinguish between two different kinds of statistical summary data:

- aggregated statistical values, such as averages computed over the data in a file, and
- structure preserving summaries, such as browse images.

Both of these kinds of metadata present issues for data centers.

Aggregated Statistical Values

Aggregate summaries of the data in a file may be useful in searching for particular files if the summary values differ significantly from one file to another. Where data files contain distinctly different samplings, widely separated in space, such summaries may be quite useful. For example, the EOS ASTER data are images about 60 km on a side. A user might well be able to use the average of the ratio of two spectral bands over the whole image to separate files from one another. CERES ES8 files provide a counter-example. These files contain data from a single satellite for an entire day. The average fraction of the data identified as ‘Overcast’ within each ES8 file is so stable that searching on this fraction will provide no useful distinction between one file and another. Data users still need some sense of the semantics of these statistics in order to use them wisely.

The author includes spatial and temporal boundaries in the aggregate summary metadata. When data searches use these boundaries, the boundary representation can be important in the data center’s response to user queries. For example, in dealing with satellite data, it may be more efficient to use a coordinate system oriented with respect to the orbit path and distance along that path. Such an approach is similar to the ‘Path-Row’ representation Landsat uses. On the one hand, this coordinate system appears to make it easy to find coincident data from several instruments on the same satellite. On the other hand, an orbital coordinate system is more difficult for users to interact with.

To simplify the interaction between users and a data center, a system designer might be tempted to use an Earth-fixed bounding rectangle to summarize the spatial limits of the data in a file. Certainly an Earth-fixed coordinate system using longitude and latitude is relatively easy to relate to other reference material. However, that referential simplicity carries a price. Figure 6 illustrates some of the geometric problems that an Earth-fixed, bounding rectangle approach to spatial search may have. Finding an appropriate match between the data structure and the search mechanism is one of the ‘semantic’ problems that face data producers, data users, and data centers.
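
The false-positive fraction that the caption of figure 6 mentions can be estimated directly. The sketch below builds a synthetic diagonal swath on a coarse grid (the grid size and swath shape are assumptions for illustration), computes its Earth-fixed bounding rectangle, and reports the fraction of the rectangle that holds no data:

    import numpy as np

    # Synthetic 'swath': a diagonal band on a 90 x 180 grid of 2-degree cells.
    lat_idx, lon_idx = np.indices((90, 180))
    swath = np.abs(lat_idx - lon_idx / 2) < 5          # assumed coverage mask

    # Earth-fixed bounding rectangle of the swath.
    rows, cols = np.nonzero(swath)
    box = np.zeros_like(swath)
    box[rows.min():rows.max() + 1, cols.min():cols.max() + 1] = True

    false_positives = 1.0 - swath.sum() / box.sum()
    print(f"{false_positives:.0%} of the bounding box holds no swath data")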

Figure 6. Orbital Geometry Complications in an Earth-Fixed (Longitude–Latitude) Representation. The shaded geometry in each of these maps represents the portion of the Earth sampled in the data file. The dotted lines represent the bounding rectangles that we might use in an Earth-fixed coordinate system to help users conduct a spatial search. For each of these geometries, the Earth-fixed bounding rectangle would produce a ‘false positive’ query response for many user locations. The ratio of shaded area to total area in the bounding box gives the fraction of ‘false positives’.

Structure Preserving Summaries

Structure preserving summaries raise many of the same issues that statistical aggregate metadata does for producers, users, and data centers. Browse images are a classic example of this kind of metadata. In generating a browse image, a data producer will sample the original image with a lower frequency than the original image used. As an example, a data producer might choose to place every tenth pixel of every tenth line in the browse image. If the original image were 2000 pixels by 2000 pixels, the browse image would be 200 pixels by 200 pixels. The data volume of the browse image is thereby reduced by a factor of 100 from the original image.
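
A sketch of this decimation with NumPy slicing; the image contents are synthetic:

    import numpy as np

    original = np.zeros((2000, 2000), dtype=np.float32)   # synthetic image

    browse = original[::10, ::10]        # every tenth pixel of every tenth line
    print(browse.shape)                  # (200, 200)
    print(original.size // browse.size)  # 100: the volume reduction factor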

Browse images (and similar summary search structures) are useful, but they do not solve all search issues. Users still have to be very careful about the characteristics of these summaries. We need to understand the possible difficulties users face when they try to relate summaries of files that have vastly different spatial and temporal resolutions.

A MODIS image and a CERES ES8 file provide a useful example of this difficulty. The MODIS image includes about two minutes of data and covers an area about 2000 km by 2000 km with a resolution of 1 km. Let us assume that the MODIS team generates a browse image by taking every tenth pixel of every tenth scan line. In other words, the browse image contains data with a resolution of 20 km by 20 km. The CERES ES8 file includes twenty-four hours of data and covers the whole globe with an average resolution of about 30 to 50 km. One immediate problem with a summary of the ES8 file that preserves static spatial structure is that the data covers most areas of the Earth twice in one file. Suppose we build a grid of 100-km regions to summarize the ES8 data product. An average region will have two ES8 data values – one during the day and one during the night. For the sake of argument, we create a browse image for this spatial grid by accepting the last non-zero value that was observed.

Now suppose a user wanted to compare a file of ES8 data with a MODIS image. Would the browse images of each file be useful? The MODIS browse image has a spatial resolution comparable with the original resolution of the CERES ES8 data. The ES8 browse image would have about four hundred regions within the MODIS browse image’s bounding rectangle. However, some of the regions would have data from the first satellite overpass, others would have data from the second. Comparing these browse images might – or might not – be useful.

 

Documentation Search

In the fourth approach, users search for data based on documentation provided by the data producers and the data centers. In the Earth science communities served by EOSDIS, two forms of documentation have become standard: the Algorithm Theoretical Basis Document (ATBD) and the ‘Guide’ documents that the EOS Distributed Active Archive Centers (DAACs) provide. The ATBDs provide moderately detailed descriptions of the algorithms, the input data, the output data, and important intermediate data products for each EOS investigation. The EOS data producers write the ATBDs for their own investigation. After they are written, the EOS Project arranges for peer review and discussion of these documents. In the end, the ATBDs appear as WWW documents that can facilitate data use by individuals who are quite unfamiliar with the background of a given investigation. The EOS Guide documents are brief summaries of data products. The DAACs typically write the Guides for WWW access by the user community. Both of these documents are examples of background material that data producers and data centers provide to help users locate and productively interact with data.

Secondary Index Search

With the fifth method, users traverse secondary indexing structures to find data objects that interest them. As we commented previously, data producers tend to concentrate on rather uniform ‘chunks’ of data to simplify their production software. When a data producer identifies ‘objects’ in his data, they are likely to be closely tied to the original intent of the investigation. For example, ERBE and CERES identify atmospheric columns that are clear and distinguish these columns from those that have some cloud. It is relatively straightforward to extract clear-sky data from either project’s data products because the notion of ‘clear-sky’ is already embedded in the data. However, the ERBE and CERES data producers do not identify "storm systems" or "hurricanes" in their products. Later investigators have to supply algorithms to locate these new objects and identify the data values in the original data product files that belong to them.

Building secondary indexes appears to be a very cost-effective way of increasing the value of data in data centers. It does not involve massive investments in new instruments and launch vehicles. On the other hand, this approach does require access to data sets that may be quite voluminous. It also requires developing the software to identify and retrieve the objects.
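
A sketch of such a secondary index: an independent investigator’s detector maps each object it finds to the files and records that contain it, without modifying the producer’s files. The detector, threshold, and identifiers here are hypothetical.

    from collections import defaultdict

    # Secondary index: object id -> list of (file name, record number).
    storm_index = defaultdict(list)

    def detect_storms(file_name, records):
        """Hypothetical detector: call a record part of a 'storm' when its
        longwave flux (field 0) falls below an assumed threshold."""
        for rec_no, rec in enumerate(records):
            if rec[0] < 150.0:                    # assumed threshold, W m-2
                storm_index["storm-A"].append((file_name, rec_no))

    detect_storms("ES8_1998-01-01.dat", [[240.0], [120.0], [132.5]])
    print(storm_index["storm-A"])   # [('ES8_1998-01-01.dat', 1), ('ES8_1998-01-01.dat', 2)]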

Print Technology and Hypertext

The five search methods we have just discussed use "metadata" to help users search. As figure 7 suggests, much of our current thinking about metadata appears to have structural similarities to the search mechanisms we are familiar with from print technology. As Landow [1997] points out, print technology has taken about four hundred years to evolve reasonably standard forms of annotation that guide readers to information they need. This navigation technology includes almost unnoticeable devices such as spaces between words and periods at the end of sentences. It includes indentation or white space between paragraphs. This technology also includes page numbers, section and subsection headings, or even "chapter and verse" references. At the level identified in figure 7, these devices include Tables of Contents and Indexes.

Figure 8 suggests two extensions to these aids that move us in the direction of employing ‘hypertext’ technology. Such technology would allow the entire data using community to create entirely new ways of exploring and interacting with scientific data. It stands in marked contrast with ‘print’ technology. The latter creates an expectation that there is a unique sequence that properly orders data. Users should strictly follow this sequence when they want to interact with data. A ‘hypertext’ view suggests that there are many possible sequences in which users can traverse data to find meaning.

Figure 7. ‘Print’ Technology Search Data Structures. In a non-fiction printed document, we expect two kinds of data structures to help us navigate through the text. This figure identifies some analogues for the data structures we have identified: a searchable ‘file inventory’ serves as the equivalent to a ‘table of contents’, a ‘list of features’ serves as the equivalent to an ‘index’. In practice, the inventory structure will have more hierarchical layers corresponding to the hierarchy we introduced with ‘files’, ‘data set versions’, ‘data sets’, and ‘data products’. Likewise, there may be several lists that provide random access suggestions to the data itself. These random access items generally are grouped into ‘metadata’.

Figure 8 on the following page illustrates two ‘hypertext’ traversal mechanisms. At the lower right, we illustrate the secondary index search mechanism. Near the top, we suggest that data producers might embed pointers from one file to another. In a truly object-oriented approach to data, the pointers could also contain references to functions users could activate to interact further with it.
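
A minimal sketch of what such an embedded pointer might look like as a record; the field names and the optional method reference are assumptions for illustration, not an existing format:

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class FeatureLink:
        """A passive 'hypertext' link from one data feature to another."""
        source: Tuple[str, int]        # (file name, record number) of the anchor
        target: Tuple[str, int]        # (file name, record number) it points to
        relation: str                  # e.g., "same storm, next month"
        method: Optional[str] = None   # optional user-invocable action (object view)

    link = FeatureLink(("S4_1985-07.dat", 41), ("S4_1985-08.dat", 7),
                       "same storm, next month")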

As we explore below, this view of data and its access mechanisms opens up much more fluid possibilities than we might expect from previous visions of data centers and data archives. The new vision also creates new problems. Solving them and reducing the experimentation we experience in the WWW to a body of accepted, standard, and useful practice constitutes part of the interesting journey we have embarked upon as a scientific community.

Figure 8 is relatively ‘tame’. More interesting possibilities open when we consider extending the objects users can interact with from one data set to many. Figure 9 illustrates the underlying topology that this possible extension suggests. In the foreground, we have a data set obtained early in our remote sensing history. Later, the community makes a new data set with instruments that share a common sampling and measurement capability. As researchers work with the later data set, they realize that another data set in the same period provides significant additional information. For example, we might consider the top-of-atmosphere (TOA) fluxes from ERBE as an early data set. CERES provides a continuation of that data into the EOS era. In the new era, Lightning Imaging Sensor (LIS) data on TRMM provide new insights into where clouds have ice. The LIS data may add considerably to the value of the CERES data.

Figure 8. ‘Hypertext’ Technology Search Data Structures. This figure illustrates ‘hypertext’ styles of linkages directly from one data feature to another. In a fully object-oriented approach to data, these pointers could become active elements that invoke ‘methods’ to create particular kinds of responses from systems that would recognize the appropriate semantics of the methods. The figure is more conservative in illustrating only passive links between features in the data, or in adding secondary indices to the structures created by the data producer.

When an investigator develops a method of identifying interesting objects in one of these data sets, it is useful to think about how that method might be extended to identify similar objects in the other data sets. For example, suppose an investigator develops a method of identifying "storms" in the ERBE data and can extend that method to the CERES data. To take maximum advantage of LIS information, our investigator needs to be able to identify "storms" in the LIS data set and to identify how these two views of the underlying phenomena complement each other.

There is no free lunch in the digital archive world. The next section explores some of the burdens the hypertext approach to data brings with it. Thinking that all we have to do is to build secondary indexes is much too simple.

 

Figure 9. Configuration Management Relationships among Data Collections. This figure illustrates three data sets. The one in the foreground was collected early in the history of this data producer community. The second, in the background right, is similar to the first in general characteristics, but has different spatial and spectral sampling. The third, in the left background, is a related data set obtained with different sensors and containing different physical quantities.

Inter-Data Collection Configuration Management Issues

Configuration management issues connected with scientific data sets over long time horizons are easy to identify in figure 9:

- maintaining consistency between an early data set and a successor that differs in spatial and spectral sampling,
- relating data sets obtained with different sensors and containing different physical quantities, and
- extending the methods that identify interesting objects in one data set so that they identify similar objects in the others.

In each of these system configuration management issues, we can see challenges for digital archives – not just in dealing with changes in the format of the data stream, but with the scientific (or semantic) content of the data. Although we present these issues in the section on the user perspective on scientific data, they look forward to the next section, in which we take the data center’s perspective.

 

An Archive View

The previous two sections provided a perspective on data production and search from the standpoint of communities whose members lie outside of data and archival centers. In this section, we want to sharpen our perception of some of the issues these organizations face from the inside. We put these perceptions into the context of an ‘Open Archival Information System (OAIS)’, where there is a strong emphasis on interactions between organizations involved in these kinds of efforts.

Producer Data Ingest and Production

We begin our discussion of the data center view by thinking about the interface with the data producers. An archive center certainly has a strong interest in reducing the differences in the unique infrastructure they have to build for each producer. There are enough difficulties in dealing with the differences in scientific data content to keep data centers busy for a long time. It would seem useful to develop standard interface templates for the kind of material data centers need if they are going to produce data or if they are going to accept it from producers. Certainly data centers want to allow users to search for the data and to order it. The CCSDS Reference Model for an Open Archival Information System (OAIS) [1998] provides the beginning of such a standards-based approach. This Reference Model concentrates on developing a framework that will allow data centers to identify common ‘packagings’ that Earth science data producers can use to develop data collections that archives can ingest.

The Reference Model is still silent regarding some of the issues that we have raised about what data producers need to understand. Perhaps the most important of these is the issue of preserving the semantic (i.e., the scientific) content of Earth science data. While data formats are important, they are considerably more transient than the underlying sampling structure that is unique to scientific data. It would appear that the data centers have a responsibility for bringing these semi-conscious structures into clear visibility. At the same time, data producers cannot be bullied into providing what the community needs over the long haul. It is probably not even possible to attain a uniform and consistent set of definitions without some incentive that is more concrete than "the community’s interest and the common good". The most practical approach may be to develop a community consensus on a (hopefully) small set of standards in each discipline that provides data. Over time, the visibility of the standards may evolve towards a community consensus across the whole realm of the Earth sciences. In natural systems, energy is required to create order out of disorder. Data systems exhibit the same requirement.

User Data Searching and Distribution

The data center interface with users is similar to the interface with data producers in its diversity – except that there are more types of users and more of them. Unanimity is not to be expected within the foreseeable future. Indeed, homogeneity in interfaces is probably undesirable because it removes the interfaces from experimentation and regeneration. In other words, we should expect data centers to innovate and even to compete. Diversity is interesting and beneficial in the long run.

Technology will continue to ensure some of this useful diversity. Careful examination of the existing data distribution statistics shows a clear pattern:

Many users order small amounts of data; a few users order large data amounts.

We do not expect technology to change these habits in the near future. Media shipments will be with us for a long time.

At the same time, we do not expect the user community to stand still. The prevailing paradigm is moving in the direction of using objects. We will comment later on the overall effect of this change in user expectations. Using objects in a hypertext context should radically change the way data centers work. One of the most profound changes appears in the services we discuss in the next subsection: subsetting and supersetting.

Subsetting and Supersetting

In the classic approach EOSDIS has taken to providing data access, users are forced into thinking about ‘tidy’ but large packages of data – the files created by the data producers. If the users are similar to the data producers and are prepared to handle these packages, there is no real problem. On the other hand, if users are thinking within a framework that is different from the one the data producers had, their lives may be more complicated.

One of the problems users encounter is that the uniform ‘chunking’ of data that is natural to a data producer also creates files that include more data than many users want. If a user wants ERBE longwave fluxes over the Sahara desert, he has to get ERBE longwave fluxes for the whole globe. If a user wants one day of ISCCP cloud data, he has to order a whole month. If a user just wants data from Lake Geneva in Walworth County, WI, he has to order the entire Landsat image that includes Walworth, Green, and Rock counties. It is as if the data providers had never encountered retail butchers, but insisted that data users accept the whole cow and butcher it themselves.
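
To make the mismatch concrete, here is a minimal sketch of the kind of regional subsetting a user might want. The grid resolution, array shape, and region bounds are assumptions chosen for illustration, not properties of any actual ERBE product.

    import numpy as np

    # Hypothetical global field on a 2.5-degree grid: 72 latitudes x 144 longitudes.
    global_flux = np.random.rand(72, 144)

    def subset_region(field, lat_min, lat_max, lon_min, lon_max, dlat=2.5, dlon=2.5):
        """Extract a rectangular latitude/longitude subset from a global grid.

        Assumes rows run from -90 to +90 degrees latitude and columns
        from 0 to 360 degrees longitude, in steps of dlat and dlon.
        """
        i0 = int((lat_min + 90.0) / dlat)
        i1 = int((lat_max + 90.0) / dlat)
        j0 = int(lon_min / dlon)
        j1 = int(lon_max / dlon)
        return field[i0:i1, j0:j1]

    # A user who wants only the Sahara need not take the whole globe:
    sahara = subset_region(global_flux, 15.0, 30.0, 0.0, 30.0)

A service of this kind is conceptually simple; the practical difficulty lies in making it efficient enough to be cost effective at archive scale.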

To be fair, users often want rather strangely shaped objects. A NASA Headquarters manager once offered to allow the author to take data from the state of Virginia as a subset – and didn’t realize for several years how oddly shaped such data chunks might become. Algorithms to provide a reasonable user selection of subsetted data may not be trivial, particularly since such algorithms need to be highly efficient to be cost effective.

As we have already seen, the user object orientation often crosses the boundaries of the data chunks created by data producers. When this happens, users need to be able to create a new collection of data from pieces of existing chunks. In other words, they need to be able to superset subsets.

As figure 10 shows, we can view subsetting and supersetting as methods of creating new ‘file’ objects from old ‘file’ objects. In slightly different (more UNIX-like) language, subsetting and supersetting become a filtering operation in a ‘production pipe’. For this to take place, users need to be able to identify the files they need, the subsets they want from each file, and the order in which the final objects appear in the superset file.
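
In that UNIX-like spirit, the following hedged sketch shows how subsetting and supersetting might compose as a filtering pipeline. The record lists and the selection predicate are hypothetical stand-ins for producer files and the user’s ‘object’ boundary.

    def subset(records, wanted):
        """Filter one producer file's records down to the user's object."""
        return [r for r in records if wanted(r)]

    def superset(files, wanted):
        """Concatenate subsets from several files into one new 'file'.

        If each record carries its own annotation (as in figure 2),
        the output has the same format as any input file.
        """
        out = []
        for records in files:
            out.extend(subset(records, wanted))
        return out

    # Example: gather the records covering Lake Geneva from three files.
    # lake_geneva(r) is a hypothetical predicate testing a record's coordinates.
    # lake = superset([file_a, file_b, file_c], lake_geneva)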

The performance issue is similar to the performance of ad hoc queries in databases. We need methods of optimizing the query structure to minimize the load on the data center resources. Of course, there is a spectrum of solutions to this problem. At the extreme end lie queries that we could state as the classic "Polar Bear" problem: "Find all the polar bears under the Arctic Ice during a winter when there is an ozone hole and where there is an algal bloom." [The perverse form of this query arises when "Polar Bear", "Arctic", "ice", "winter", "ozone hole", and "algal bloom" are objects that the data system must construct from understanding natural language constructs and the contents of the existing system. A related query can be stated "search through all the data in all the archives to find me something interesting to think about." It appears that these queries are NP-hard. To the author, there are enough scientifically interesting problems that this kind of computer science research is uninteresting.] Trying to set up systems to solve these queries automatically does not seem useful at this time. The more interesting issues arise when we try to couple human knowledge with computers – to form a synergistic approach to the query optimization problem.

Figure 10. Subsetting and Supersetting as Ways of Overcoming Differences between Producer ‘Data Chunking’ and User ‘Object-Orientation’. The rectangles represent files, such as those we illustrated in figure 5. The subsetting service at a data center extracts ‘objects’ from these files and places them in a new file containing the superset. We can think of this combined operation as creating a new file by filtering several old ones. If the data producer has carefully incorporated the Annotation Records we identified in figure 2, then the file format of the subsetted files will be identical with the file format of the superset file.

Semantic Requirements for Data Interchange

Both data producers and data users are moving toward an "object-oriented" approach to scientific data. In the long-run, both communities would benefit from a distributed "information architecture". With such an architecture, interdisciplinary investigators could set up automated procedures in their own facilities that would find appropriate data at several different data centers, subset and superset the data files, and extract the results for their own use. Educational users could find lesson plans that automatically extracted sample data and provided software to experiment with. The public could sit down to an automated ‘documentary’ that extracted and displayed complex images to the accompaniment of audio clips. For now, such a vision appears to be an interesting goal, but one that will require an extraordinary amount of work to achieve. The problem is not merely technological; it is sociological. To be practical, we need to find ways of fostering cost-effective data centers. Almost certainly this means avoiding attempts to build "one-size fits all" systems.
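
As one hedged illustration of this vision, an investigator’s own facility might orchestrate search, subsetting, and supersetting across several centers. Everything below – the center names, the service functions, and their arguments – is hypothetical; no such common interface exists today.

    # A purely illustrative sketch of the distributed 'information
    # architecture'. All names and services here are invented.
    CENTERS = ["center_a", "center_b", "center_c"]

    def find_files(center, parameter, region, period):
        """Stub: would query the center's (hypothetical) catalog service."""
        return []

    def fetch_subset(center, file_id, region):
        """Stub: would invoke the center's (hypothetical) subsetting service."""
        return None

    def assemble(parameter, region, period):
        """Build the investigator's superset at the home facility."""
        pieces = []
        for center in CENTERS:
            for file_id in find_files(center, parameter, region, period):
                pieces.append(fetch_subset(center, file_id, region))
        return pieces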

To ensure cost effectiveness, it seems reasonable to suggest that interoperability is only required where there is significant interchange of data. For this reason, interoperability between NASA’s space science enterprise and the agency’s Earth science enterprise probably does not need much more than the ability to exchange lists of parameters and the data products that contain them. Only as communities move toward continual exchange of data do we need to arrange for long-term commitments between institutions.

When we look at these institutional interfaces in this light, it is clear that the interfaces are one of the costs of community interchange. We cannot avoid the fact that useful data interchange between archives requires common semantic structures and content. Astronomers locate their data by right ascension and declination; Earth scientists locate theirs by latitude and longitude. We could waste a lot of time and money trying to merge incompatible structures.

Tentative Conclusions

Where do these views lead us? One perspective suggests that, over the long term, scientific data archives will move to an object orientation. In the previous text, we suggested what such object orientation means for Earth science data. Here, we summarize that view and identify some interesting issues for digital archives. Some of these issues are particular to scientific data; others are common to all kinds of digital data. After raising these issues, we suggest some opportunities to move forward.

An Object Oriented View of Archive Information Evolution

Looking at the architecture of data centers over the next twenty years, several computer science groups envision the possibility of Cooperative Information Centers, composed of heterogeneous data centers that exchange data [e.g., Papazoglou and Schlageter, 1998]. Authors espousing this idea have had enough experience to feel that it will take a long time to implement. On the other hand, this vision does suggest a future in which individual data centers cooperate to provide services that are more helpful than any could provide alone. This vision does not require a single, homogeneous approach, and it fits well with the structure we have suggested in the previous sections of this paper.

First, we do not see a future in which data producers and data users have a single view of what they want from data. We expect that the data producer’s ‘chunks’ and the user’s ‘objects’ will always be part of the data landscape. This dichotomy is probably a permanent feature of scientific data archival.

Second, it is also clear that we need a more visible information structure than we have had so far. Long-term archival requirements for scientific data include the need to document the semantic (sampling) structure of the data, the file and metadata structures of the archival holdings, the scientific algorithms used in production, and the production topology that records the genealogy of the data.

The underlying semantic structure has not been as visible as the formatting issue for scientific data. However, getting this structure documented is probably more important in the long run. Likewise, it is difficult to see how we can make progress in providing effective data services without paying considerably more attention to the data structure of archival holdings; the file and metadata structures determine how long it takes to respond to user requests. We have made some progress in documenting the scientific algorithms for data production, although these algorithms are so discipline dependent that it is difficult to foresee useful standardization of these key data production tools. In addition, when we put algorithms into a data production environment, we have to become much more systematic about documenting the production topology – the connectivity between ‘jobs’ and ‘files’ that determines the genealogy of scientific data in an archive.
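
A hedged sketch of such bookkeeping: if each production job records its input and output files, the genealogy of any archived file can be recovered by walking the resulting graph. The job and file names below are invented for illustration.

    # Minimal production-topology bookkeeping: jobs as nodes that
    # connect input files to output files.
    jobs = {
        "geolocate_day_001": {"inputs": ["raw_day_001"],  "outputs": ["geo_day_001"]},
        "calibrate_day_001": {"inputs": ["geo_day_001"],  "outputs": ["flux_day_001"]},
        "monthly_average":   {"inputs": ["flux_day_001"], "outputs": ["flux_month_01"]},
    }

    def genealogy(target, jobs):
        """Trace every job (and, through recursion, every ancestor file)
        that contributed to the file named 'target'."""
        history = []
        for name, job in jobs.items():
            if target in job["outputs"]:
                history.append(name)
                for parent in job["inputs"]:
                    history.extend(genealogy(parent, jobs))
        return history

    # genealogy("flux_month_01", jobs)
    # -> ['monthly_average', 'calibrate_day_001', 'geolocate_day_001']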

Third, long-term data access and retrieval requirements for scientific data need mechanisms that allow users to search the holdings, to order data, and to subset and superset the producers’ files into the objects their own frameworks require.

Scientific Data Archival Issues

We can list the issues these considerations raise as follows: the permanent dichotomy between producer ‘chunks’ and user ‘objects’; the need to document semantic structure, file and metadata structures, production algorithms, and production topology; and the need for efficient services, such as subsetting and supersetting, that reshape producer files into user objects.

A Perspective on the Future of Digital Archives for
Scientific Data

The evolution of the scientific data environment will change the needs of data producers and data users in a number of interesting ways.

When we think about these changes in the communities that use data archives, we may expect the archival institutions to evolve as well. It is not clear at the moment whether it is more effective to hire Ph.D. computer scientists at the data centers or to raise the computer science expertise of the data producers and data users. Regardless of the approach, the environment of continuous change that surrounds information technology will be with us indefinitely. The entire conception of scientific data archives may change from ‘producer file collections’ and ‘write-once, read-none deep archives’ into ‘communities of data scholars’. It is these communities that will have to carry the responsibility for continually translating the past record of what we have observed into useful information about the future.

References

Consultative Committee for Space Data Systems (CCSDS), 1998: Reference Model for an Open Archival Information System (OAIS), CCSDS 650.0-W-3.0, April 15, 1998.

Landow, G. P., 1997: Hypertext 2.0, Being a revised, amplified edition of Hypertext: The Convergence of Contemporary Critical Theory and Technology, The Johns Hopkins University Press, Baltimore, MD, 353 pp.

Papazoglou, M. P. and G. Schlageter, 1998: Cooperative Information Systems: Trends and Directions, Academic Press, San Diego, CA, 369 pp.