Science Archives in the 21st Century
On April 25-26, 2007, the NSSDC sponsored a workshop entitled "Science Archives in the 21st Century" at the University of Maryland University College Inn and Conference Center, to facilitate communication and elicit best practices and outstanding challenges from practicing science data managers. Emphasis was placed on good stewardship of NASA's Heliophysics, Planetary, Astrophysics, and Earth science data as well as perspectives from other science archives in the US and internationally.
The agenda included a keynote presentation by Raymond Walker / UCLA and invited talks by Robert Hanisch / Space Telescope Science Institute and Aaron Roberts / NASA. It was structured into sessions on Long-Term Preservation, Archival Policies and Implementation, Emerging Archival Standards and Technologies, Meeting User Needs, and Provider Interactions. Poster presentations were an integral part of the workshop: presenters introduced their poster topics to all participants in a Poster Madness session, and four separate poster sessions were set aside for one-on-one interaction.
54 persons participated, representing
The Executive Planning Committee for the workshop consisted of:
Ed Grayzeck (chair)/NSSDC,
Don Sawyer (co-chair)/NSSDC,
Ben Kobler (logistics)/NASA GSFC Code 586,
Mike A'Hearn/University of Maryland,
Bob McGuire/SPDF, and
A complete list of all participants, the agenda, and all presentations is available at: http://nssdc.gsfc.nasa.gov/nost/conf/archive21st/.
Ed Grayzeck started off the workshop by reintroducing the three goals of the gathering:
Ed outlined the response. He highlighted the breadth of experience of the 54 participants as a benefit to the group. The challenge was to find common ground, lessons learned, and future actions across the five prime topics (long-term preservation, policies and implementation, standards and technology, meeting user needs, and provider interactions). He remarked that the initial invitations had gone out to a deliberately diverse set of participants from Earth science, planetary studies, astrophysics, and solar/space physics. The resulting group came as managers and scientists from NASA, sister government agencies, university environments, and international data partners. He further pointed out that the poster sessions would be interleaved with the oral talks so as to get full participation.
After a short introduction of the supporting staff and NSSDC sponsorship, all were invited to introduce themselves, giving a concise background. The official welcome was presented by Joe Bredekamp, NASA headquarters, who gave us the history of the NASA effort to unify the data environment and its evolution along scientific lines.
Ray Walker gave the keynote presentation, The Path Toward Data System Integration. As a scientist involved in archiving over the past 30 years, Ray Walker pointed to a persistent dream: a global data environment in which all Earth and space science data are organized in a common way with "one stop shopping" for any data product. He outlined his experience and derived five attainable goals:
Of these goals, the fifth is new, and Ray sees archiving as interleaved with data distribution. He cautioned that we need to work with existing standards, to evolve them, perhaps to re-establish the core needs, and to develop an interlingua that permits speaking across the science disciplines. He highlighted two examples from his experience: first, the Planetary Data System with its rich data model and protocols; second, the development of SPASE as a tool to harness the diverse community of space physics.
He then identified the following evolving challenges.
During the remainder of the workshop, the participants discussed these challenges and brought out new ones, especially relating to metadata and establishing data quality levels.
The session on Long-Term Preservation started with three perspectives from the astrophysical, social and earth science, and computer science arenas: Bob Hanisch spoke on Long-Term Preservation of Astronomical Research Results, Bob Chen spoke on Government-University Collaboration in Long-Term Archiving of Scientific Data, and Reagan Moore spoke on Rule Based Preservation Systems.
The themes followed on from the keynote: ensure that data is preserved (for more than 20 years), useable, and findable. In modern scientific inquiry, the sources of data are worldwide, and international efforts are needed to streamline interoperability. Three such efforts are the IVOA, IPDA, and SPASE. There is a tension between the need to preserve the data and the need to serve it. Libraries and universities have a long history of preservation but are usually centralized. More recently, governments and international agencies have taken a role. Each archive must decide on its role as preserver in the digital arena and should look at lessons learned by analog archives. Centralized archives in the digital age are evolving and becoming more distributed.
A new method that builds on this loose federation is the data grid, which provides for preservation centrally through the use of storage resource brokers and supports infrastructure independence. In this view, preservation is thought of as communicating with the future: future technology will be different from today's technology, and the preserved records will need to be migrated onto it. But preservation is also communication from the past. In order to make assertions about authenticity, chain of custody, and integrity, we need to be able to characterize the policies that governed prior management of the records. The management policies and preservation processes comprise representation information about the preservation environment. Preservation therefore requires provision of representation information about both the records and the preservation environment.
With each of the respective archives acting as an independent site, guidelines are needed for identifying when an archive is robust, such as the OCLC work and the Trusted Repository Assessment Criteria. In addition, data needs metadata, and there should be quality flags on both. There also needs to be recognition that science data is not normally just text.
In the panel discussions, the provenance issue was raised and was declared very important, i.e., it is best to track the data as it is migrated both in content and format. The question of a centralized archive was debated and most found the trend was to distribute both the data and the expertise. Most agreed that we need to keep on top of the fixity issue as well as technology for any migration and long-term preservation.
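As an illustration of what keeping on top of the fixity issue can mean in practice, the following is a minimal sketch and not a procedure described at the workshop. It assumes an archive records a SHA-256 digest for every file at ingest and re-checks those digests later; the function names (sha256_of, build_manifest, verify_manifest) and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its digest.

    Hypothetical helper: run once at ingest and store the result
    alongside the holdings.
    """
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return relative paths that are missing or whose digest changed."""
    root = Path(root)
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

A check of this kind only detects bit-level change. A format migration changes the bits deliberately, so under this sketch a new manifest would be built after the migration and the old one kept as part of the provenance record.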
There were three oral presentations to identify current practices in three science areas within NASA (Heliophysics, Planetary Science and Earth Science). Aaron Roberts spoke on Archiving in the Data Environment of Heliophysics with NASA, Reta Beebe spoke on NASA Planetary Data System: Structure, Mission Interfaces and Distribution, and Jeanne Behnke spoke on Evolving a Ten Year Old Data Archive.
The presentations shared a common theme, the goal of NASA policies for space science: to ensure data sharing. Different models may suit different scientific communities, but in all cases the community must be involved. The models range from a centralized system that evolves to be more inclusive, through a confederation of curator groups, to a series of operating missions and data repositories that are loosely managed inside NASA.
A few simple lessons were given to the workshop:
The discussions revolved around questions of implementation and cost savings. All agreed that standards must be customer-based and that higher-level data was best.
In this session, Don Sawyer spoke on An Overview of Selected ISO Standards Applicable to Digital Archives, David Giaretta spoke on Toward an International Standard for Audit and Certification of Digital Repositories, and Joey Mukherjee spoke on Usability Issues Facing 21st Century Data Archives.
There are a number of international standards addressing digital data with particular reference to archives as addressed in An Overview of Selected ISO Standards Applicable to Digital Archives. Some are full ISO standards and others are in development. The ones highlighted during this session addressed the following topic areas:
All of these are applicable across the science domain and are not specific to any discipline. It can take several years for such standards to become recognized and extensively used. The uptake can also vary greatly across different communities. For example, the OAIS reference model (Reference Model for an Open Archival Information System (OAIS)) has become very widely adopted by all types of organizations. It was the right standard at the right time and it continues to meet the critical need to be able to communicate about archival systems and their information models.
The newest of the above efforts, and potentially one that will have a very wide impact, is the certification of archives. The presentation Toward an International Standard for Audit and Certification of Digital Repositories describes the current situation. Experience has shown that it is difficult to preserve bits over a long time period, and even more difficult to preserve their information content, and thus there is wide interest in identifying criteria by which an archive/repository can be judged. Several efforts have developed documents addressing such criteria, and particularly noteworthy is the TRAC document (Trustworthy Repositories Audit & Certification: Criteria and Checklist). However, all have been developed by groups with limited participation. The ISO standardization process is taking these documents as input and is open to participation by all. One can obtain these materials and participate by going to http://wiki.digitalrepositoryauditandcertification.org.
In the presentation Usability Issues Facing 21st Century Data Archives, the focus was on making data archives more useful and easier to maintain for providers, users, and management. It was argued that current archiving practice does not capture enough of the data needed by future scientists and that the quality of what is captured is uneven. Quality processed data should flow from the processing team and eventually get to the long-term archive. What is needed is a better format that meets all these needs, one that is simpler to use, easy to extend, and broadly applicable so that it becomes widely adopted. Such a format might already exist, or it might be some combination of the best features of a number of common formats such as HDF, IDFS, FITS, etc. It would need buy-in from visualization tool vendors and from archivists as well as archives.
During the discussion session regarding the emerging ISO standards, it was noted that very small repositories/archives may have difficulty keeping up with such standards.
Some participants had read the TRAC document and reactions were varied. One noted that he would be afraid to show it to his management, while another found it readily useful and applicable. Some leveling of the criteria seemed needed, and it was unclear how the evaluation would actually be done. It was noted that, particularly where there might be competition between archives, these criteria could become important. Also there may eventually be a high level management requirement for certification.
Regarding a new format, or broad adoption of some newly emerging format, the prospect of securing buy-in was a central concern. Will adequate tools ensure buy-in? One comment was that what is needed is better interoperability through mapping of scientific content, not a new format.
The advisability and practicality of holding all data in a single format was questioned, as it may be difficult to ensure adequate data cleanup for higher level products, such as maps. In some cases the low level data needs to be saved because it has critical information, but in other cases it is never requested. Still, it is generally not a problem to save the low level data. The value of storing data that is no longer actively requested in a useful form is clear; a recent example is NSSDC lunar data that had not been looked at for many years and is now of interest for future missions.
This session dealt with the new goal presented by Ray Walker, namely that aiding a scientist user these days involves more than simple data access. Four approaches were outlined by the following presenters: Arnold Rots spoke on Associating Persistent Identifiers between Trustworthy Repositories, Vincent Genot spoke on Science Archives Need to Communicate more than Data: the Example of AMDA and CDPP, Christophe Arviset spoke on ESA Scientific Archives and Virtual Observatory Systems, and Mark Showalter spoke on Accessing Diverse Data Sets at the PDS Rings Node.
In the digital age, the accessibility and distribution of data/metadata are prominent, and the evolving archives both centralized and decentralized have valuable lessons.
In the centralized category (ESAC), a function-ordered approach permits the reuse of software, and the knowledge base is maintained from mission to mission. Because it is separated by function, the archive can handle both proprietary data sets and widely public offerings. Interoperability is largely gained by insisting on one simple format, FITS. The Planetary Data System (PDS) is an example of a confederation, which handles diverse data through a series of independent discipline nodes that customize data access and distribute data to specialized communities. In addition, translation tools are provided to convert a wide variety of formats. Interoperability is achieved through higher order processing.
A loose federation of missions, virtual observatories, and resident archives is illustrated by the heliophysics data environment. A common data model and interlingua (SPASE) allows cross-discipline interaction. A few concepts such as time bases and simple tools provide the structure to agree on working formats in a few areas. The idea of having archives act as publishing houses was also discussed, since the web now allows instant exposure of the data but carries no implication about quality. DOIs and other identifiers remove ambiguity and can be offered by archives, societies, and commercial entities.
The panel discussion showcased how this goal was fast evolving and must be customized for users in their respective scientific communities.
This session had three presentations from working archives on how they were streamlining the input from data providers. Andrew Davis spoke on Integrating an ACE Science Data Center and SAMPEX Resident Archive into the Emerging Virtual Observatory System: Practical Experience and Perspectives, Bruce Berriman spoke on Best Practices in Ingestion and Data Access at the InfraRed Processing and Analysis Center, and Dan Kowal spoke on Applying Submission Agreements to Long Existing Data Flows - A NOAA Story.
There are a few lessons learned that can make the job of the data provider easier. First, during the ingestion process, make sure that an outline of the submission agreement or package is clearly understood. The rule of thumb is that it is essential to make the data useable and combinable from the start. Second, the provider needs to follow community standards on formats but the archive must make its usability criteria known. The goal is to produce well-documented data that is then bundled with the final product. Third, the archive needs to respond to the user community through a set of tools that guide it in setting the services available. These tools need to be modular so they can be reused or modified for later submissions.
All agreed that you can't start early enough.
A poster summary is given in Appendix A. It contains a short summary of the posters from notes taken during the poster author's 2-minute presentation.
Rapporteur reports on the posters were prepared by Lou Reich, Kathy Fontaine, and Steve Joy. A summary of those reports appears in Appendix B.
The workshop was favorably received by all who participated, and some were quite surprised at how useful it was. The prominence given to posters was appreciated by many, and the length and format of the workshop were favorably received. In response to the original goals for the workshop, the following were achieved:
The consensus was that a meeting every 2-3 years would be appropriate and useful, but that more frequent focused meetings on selected topics such as services, formats, or data models should also occur. It was felt that a general wiki would not be that useful, though focus groups might be.