Digital Archive Directions (DADs) Workshop
DATE: June 22-26, 1998
HOST: The National Archives and Records Administration
8601 Adelphi Road
College Park, MD 20740-6001
1. Identification of Proposed Topic [Required]
1.1 Title
The British Atmospheric Data Centre - a Pragmatic Archive of Atmospheric Data
Dr Peter M Allan
Rutherford Appleton Laboratory
Oxon OX11 0QX
1.3 Description of Proposed Project
The British Atmospheric Data Centre (BADC) holds digital data gathered from a wide range of instruments that measure atmospheric processes. Some data are from satellites, some from aircraft, some from ground-based measurements and some from computer models. The BADC is the primary archive for data on the atmosphere that have been collected as a result of research funded by the UK Natural Environment Research Council. The BADC also has strong links to other organizations such as the UK Meteorological Office and the European Centre for Medium-Range Weather Forecasts, and provides access to some of their data products for scientific research. On account of the diverse range of data that are stored at the BADC, we have had to develop an archive system that is versatile enough to cope with the variety and size of data products, while at the same time being easy for customers to use. The following describes how data are archived at the BADC, how customers can search the range of data held and how they can obtain data. It also describes some areas where we are aware of shortcomings of our existing system, those we already have plans to improve and items where standardization along the lines indicated by the OAIS model would bring additional benefits.
The BADC archives data from many sources over which we have no control. Consequently our policy is to store data in the file format in which they are given to us and not to attempt a conversion to a 'standard' format. All data formats make assumptions about the data being held and experience has shown that no matter how carefully one designs a 'standard' format, the next new type of data you come across will not fit into the existing format. In other words, you cannot think of everything.
Most data are held on-line on magnetic disk as this provides rapid access. It also provides an easy upgrade path. For example, some of our 5-year-old 2 GB drives have been replaced with 9 GB ones. Others are currently being replaced with even higher capacity drives. When the size of a dataset makes it too costly to store on magnetic disk, we make use of a robotic tape system elsewhere within our laboratory. This currently has a capacity of 30 TB (although the BADC only uses about 0.5 TB). Finally, we provide access to some data that are stored on CD-ROMs by means of a dedicated network server.
The metadata consist primarily of a catalogue of the data files in a relational database. As with the data, the requirements on the catalogue are diverse. Some satellite datasets come as daily files with essentially global coverage. They always contain the same type of data and arrive at a regular rate. At the opposite extreme, files from a project on atmospheric chemistry contain measurements of a variety of chemical species. The species measured vary from day to day, the locations of the observation sites vary (the airborne ones within a measurement) and the whole project is grouped into measurement campaigns of a few weeks each spread over two years.
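To illustrate how one relational catalogue could serve both extremes, the following is a minimal sketch only. The table names, column names and helper functions are hypothetical and are not taken from the actual BADC catalogue; SQLite is used purely because it ships with Python. Regular daily products simply leave the campaign column empty, while chemistry files record a campaign and any number of species.

```python
import sqlite3


def build_catalogue(conn):
    """Create a hypothetical catalogue: datasets, their files, species per file."""
    conn.executescript("""
        CREATE TABLE dataset (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE data_file (
            id         INTEGER PRIMARY KEY,
            dataset_id INTEGER NOT NULL REFERENCES dataset(id),
            path       TEXT NOT NULL,
            obs_date   TEXT NOT NULL,  -- ISO date of the observation
            campaign   TEXT            -- NULL for regular daily products
        );
        CREATE TABLE file_species (
            file_id INTEGER NOT NULL REFERENCES data_file(id),
            species TEXT NOT NULL
        );
    """)


def add_file(conn, dataset, path, obs_date, species, campaign=None):
    """Register one file, creating its dataset entry if it is new."""
    conn.execute("INSERT OR IGNORE INTO dataset (name) VALUES (?)", (dataset,))
    dataset_id = conn.execute(
        "SELECT id FROM dataset WHERE name = ?", (dataset,)
    ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO data_file (dataset_id, path, obs_date, campaign) "
        "VALUES (?, ?, ?, ?)",
        (dataset_id, path, obs_date, campaign),
    )
    conn.executemany(
        "INSERT INTO file_species (file_id, species) VALUES (?, ?)",
        [(cur.lastrowid, s) for s in species],
    )


def files_with_species(conn, species):
    """Return the paths of all files that contain the given species."""
    rows = conn.execute(
        "SELECT f.path FROM data_file f "
        "JOIN file_species s ON s.file_id = f.id "
        "WHERE s.species = ? ORDER BY f.path",
        (species,),
    )
    return [row[0] for row in rows]
```

Keeping the species in a separate table means that a file measuring a new species needs only new rows, not a change to the schema, which is one way of coping with content that varies from day to day.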
Customers of the BADC access our data holdings in the following manner. The initial entry point is our web home page http://www.badc.rl.ac.uk/. From here, they can search the list of our data holdings and, from web pages relating to the individual datasets, they can search the catalogue of individual files. For all files that are on-line, the customer can copy the files by ftp. Those that are stored on the robotic tape store must be retrieved before being copied. Some data files are publicly available whereas others have restricted access imposed by their suppliers. Access is controlled by use of Unix groups.
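The group-based access rule can be sketched as follows. This is an illustration of standard Unix permission semantics only, not the BADC's actual configuration: the function name and the example group numbers are hypothetical, and the sketch assumes the customer is not the owner of the file and that no access control lists are in use.

```python
import stat


def customer_can_read(file_mode, file_gid, customer_gids):
    """Decide read access for a non-owner using Unix group/other bits.

    If the customer belongs to the file's group, the group read bit
    decides; otherwise the 'other' read bit decides.
    """
    if file_gid in customer_gids:
        return bool(file_mode & stat.S_IRGRP)
    return bool(file_mode & stat.S_IROTH)


# Hypothetical example: a restricted file readable only by group 500,
# and a public file readable by everyone.
RESTRICTED_MODE = 0o640
PUBLIC_MODE = 0o644
```

Under these assumptions, a restricted dataset corresponds to files whose mode omits the "other" read bit, so that only customers who have been added to the supplier's group can retrieve them.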
To sum up, the techniques described above provide customers with on-line access to our data holdings and the ability to search our catalogue from the web. However, there are areas that could be improved to provide a better service to our customers and make it easier for staff to maintain the data centre.
Although our catalogue can be searched from our web pages, a potential customer needs to know of the existence of the BADC and that it might hold data they need before they can search anything. In order to advertise our data more widely and to build a distributed search capability for all of the data centres funded by the UK Natural Environment Research Council (NERC), we are actively participating in programmes that make use of Z39.50-based search systems. One such is the Catalogue Interoperability Protocol (CIP) that is being promoted by the CEO (Centre for Earth Observation). Other initiatives make use of other Z39.50 profiles; CIP, GEO, GILS and ENRM are all relevant. The lack of a single search profile is hampering our progress in this area.
As mentioned above, data are stored on a range of media. This is for two reasons: some datasets are too large to store economically on magnetic disk, and data supplied on CD-ROM are generally best kept in that format. However, the need to use a robotic tape system for large volume datasets is inconvenient, as the access method for the customer depends on the media used. It also increases the maintenance overheads for BADC staff. The use of a hierarchical storage management (HSM) system appears attractive. However, the lack of standards in this area makes us unwilling to invest in a particular technology that might be superseded within a few years. Given the way we operate, an attractive HSM would have to make all data files appear to be on-line, would retrieve near-line files within one minute and would use industry standard hardware in a manner that could easily be upgraded. Ideally it should integrate with our existing robotic tape system.
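The first HSM requirement, that all files appear to be on-line regardless of the medium holding them, can be sketched as a thin facade over two tiers. This is a toy illustration under stated assumptions: the class and method names are hypothetical, in-memory dictionaries stand in for disk and tape, and no real HSM product or recall timing is modelled.

```python
class TieredStore:
    """Toy HSM facade: one namespace over an on-line and a near-line tier."""

    def __init__(self, online, nearline):
        self.online = dict(online)      # path -> contents held on disk
        self.nearline = dict(nearline)  # path -> contents held on tape
        self.recalls = 0                # number of near-line recalls so far

    def listing(self):
        """Every file is listed, whichever tier currently holds it."""
        return sorted(set(self.online) | set(self.nearline))

    def read(self, path):
        """Return a file's contents, staging it from near-line if needed."""
        if path not in self.online:
            if path not in self.nearline:
                raise FileNotFoundError(path)
            # Stage the file to disk so that later reads are immediate.
            self.online[path] = self.nearline[path]
            self.recalls += 1
        return self.online[path]
```

The customer calls the same read operation for every file; only the first access to a near-line file incurs a recall, which is where a one-minute retrieval target would apply in a real system.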
This paper raises the question of whether the OAIS model formally accommodates an archive such as the BADC, which is of moderate size (1 TB). The BADC has been designed the way it has in order to be versatile and cost effective. If the OAIS model is to become widely accepted, it is important either that an archive like the BADC fits within the OAIS model, or that it can clearly be demonstrated that building an archive along OAIS lines is more versatile and cheaper than other methods (i.e. it should scale well to small archives).
1.5 Definitions of Concepts and Special Terms
1.6 Expected Relationship with OAIS Reference Model
As described above, the BADC maps well onto the OAIS model. There is ingest of data through a new dataset being acquired wholesale or by members of a designated project submitting files via ftp to a project ingest system. There is archival storage using magnetic disk, CD-ROM and robotic tape. Data management is provided by means of web pages describing the data and by the catalogue of files. Access and dissemination are provided by means of the web and ftp. Administration ensures that the hardware and software are appropriate to their tasks and provides a human interface for customers.
2. Scope of Proposed Standard [Desired]
2.1 Recommended Scope of Standard
The scope is to examine how archives of moderate size fit into the OAIS model and how the model can be made attractive to small archives.
2.2 Existing Practice in Area of Proposed Standard
Existing practice is to advertise the existence of the archive on the web and to describe each dataset using a set of web pages with a consistent look and feel across all datasets. The catalogue can be searched by means of a web interface. Data that are on-line can be picked up directly by ftp. Data that are not on-line must be ordered first.
2.3 Expected Stability of Proposed Standard with Respect to Current and Potential Technological Advances
The overall design of the BADC is well liked by customers and support staff and shows no signs of needing radical changes in the future. However, there are several areas that would benefit from improved technology. A key area at present is to make our catalogue more widely accessible by means of Z39.50-related search systems. This needs a standard profile to become really useful. The use of a hierarchical storage management system would bring benefits to customers in terms of ease of access and to support staff in terms of ease of maintenance.
Curator: John Garrett (John.Garrett@gsfc.nasa.gov) +1.301.286.3575
Responsible Official: Code 633.2 / Don Sawyer (Donald.Sawyer@gsfc.nasa.gov) +1.301.286.2748
Last Revised: May 11, 1998, Dr. Peter M. Allan (May 26, 1998, John Garrett)