Summary Document for
Science Archives in the 21st Century
A workshop held at the University of Maryland University College Inn and Conference Center, on April 25 - 26, 2007
APPENDIX A - POSTER SUMMARIES
Bruce Barkstrom - Provenance, Production and Planning
Much of earth science data and some of space science data result from files and jobs, which are denumerable and can be indexed as time series. Provenance will not fit within files, and metadata standards are incomplete. New versions of datasets may use previously produced data to guide the next step in production, and may be produced by four kinds of changes: input, source code, coefficients, and connectivity.
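The four kinds of changes named above can be illustrated with a minimal Python sketch. The record layout, field names, and sample values below are hypothetical and are not taken from the poster; the sketch only shows how two versions of a dataset could be compared to see which kinds of change separate them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical provenance record for one version of a dataset."""
    input_files: frozenset          # names of the input files used
    source_code_version: str        # version of the production code
    coefficients_version: str       # version of the calibration coefficients
    connectivity: frozenset         # (producer_job, consumer_job) pairs


def change_kinds(old, new):
    """Return which of the four kinds of change separate two versions."""
    kinds = []
    if old.input_files != new.input_files:
        kinds.append("input")
    if old.source_code_version != new.source_code_version:
        kinds.append("source code")
    if old.coefficients_version != new.coefficients_version:
        kinds.append("coefficients")
    if old.connectivity != new.connectivity:
        kinds.append("connectivity")
    return kinds


# Illustrative values only.
v1 = ProvenanceRecord(
    input_files=frozenset({"raw_001.dat"}),
    source_code_version="1.0",
    coefficients_version="cal-a",
    connectivity=frozenset({("ingest", "calibrate")}),
)
v2 = ProvenanceRecord(
    input_files=frozenset({"raw_001.dat"}),
    source_code_version="1.1",
    coefficients_version="cal-b",
    connectivity=frozenset({("ingest", "calibrate")}),
)

print(change_kinds(v1, v2))
```

Keeping such records outside the data files themselves is one way to read the remark that provenance will not fit within files.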
Kirk Borne - LSST: Preparing for a Data Avalanche through Partitioning, Parallelization, and Provenance
LSST will generate a huge amount of data: 65 petabytes of images and 70 petabytes of metadata. LSST data management features sky partitioning to speed ingest and retrieval, data parallelization to speed pipeline processing, and data provenance to track all changes and allow creation of data products on-the-fly, thus reducing archival storage requirements. The data will support a great deal of science and related science education.
Paul Butterworth - A NASA LAMBDA Report
LAMBDA is the smallest data center represented at the conference (two full FTEs). A small, focused data center can be very agile and responsive, but its staff get nervous when they hear of data management problems and outside [standards] requirements.
Dan Crichton - Developing the International Planetary Data Alliance
IPDA's purpose is two-fold: (1) to develop international standards which allow sharing scientific data products across agencies and missions; (2) to develop technical information standards to allow interoperability between data systems. IPDA is also working to ensure alignment within the international scientific community, proposing a session for the July 2008 COSPAR meeting.
Ken Ebisawa - Scientific Satellite Data Archives at JAXA
JAXA archives scientific satellite data in Japan, specifically solar, solar-terrestrial and astronomical data, in the DARTS database. The next large mission data to be received are from SELENE. JAXA is looking ahead to heliospheric plasma science data from Bepi-Colombo MMO, which will require interoperability with a heliospheric plasma science database, something more than the PDS structure can provide.
Ed Grayzeck - Role of a Permanent Archive in the Evolving NASA Space Science Environment
Ray Walker spoke about evolving data centers. The Resident Archives and their services will be a new approach for missions as they near completion and for NSSDC to manage them. The poster explains the terms and structures of Resident Archives. Missions that might have the first Resident Archives are IMAGE (now on life support) and POLAR I (funded only through fall 2007).
Ed Guinness - Approaches for Archiving and Distributing Science Data from Planetary Missions
The Geosciences Node has been involved in PDS since before PDS existed. They have curated about 100 datasets from 20 missions, learned a lot from experience and from users, and have created standard practices for working with providers and users. For instance, early involvement with a mission and its instrument teams is important.
Users have diverse requirements, ranging from raw data to highly derived products. Tools for users to find and analyze subsets of data are becoming more important.
Ted Haberman - Freeing Ourselves from the "Tyranny of the OR"
NASA and NOAA now run separate, efficient stovepipe archive systems, built for "science users" OR "GIS users". He is interested in exploring ways to build the AND archive, rather than the OR archive, to bridge both user groups. One approach is to put a geospatial database in front of a tape archive system, which allows geospatial tools to access data and also allows the creation and staging of data products for science users. His poster includes a discussion of the future of, and the number of, formats used in archived data.
Kent Hills - An Application of CCSDS Archival Standards to Meet Both the Submitter and Archive Needs during Data Ingest
When NSSDC brings in a dataset, one of the things it has to do is populate the data system with all the required attributes. His poster shows a new method to bring in all the required information and handle it in as automated a way as possible.
Joe Hourcle - FRBR in a Scientific Context
When a scientist asks "What data do you have?", a wide variety of answers is possible. It is the equivalent of asking a librarian "How many books do you have?" Does the Bible count as one or many? How do we count hardback and softback editions? His poster discusses the Functional Requirements for Bibliographic Records (FRBR) developed for libraries. It will take time to refine FRBR; so far it works fine for files and not so fine for collections.
Steven Hughes - The Application of Semantic Technologies to Scientific Archives
Metadata is the key; no key word or identifier exists in a vacuum. The development and subsequent management of the Information Model is the most significant factor in developing information systems that are on time, within budget, and viable over time.
Barry Jacobs - NASA Datasets Management Using Process Libraries and Electronic Handbooks
Datasets are complicated enough on their own; they then have to be distributed, managed, and improved, and everyone does it differently. He builds a process library and a subprocess library that show how each data center does it, including links to samples, documents and levels of access.
Nate James - Show Me the Data
NSSDC is surveying science users for lessons learned. His poster highlights four areas: the top three things users say they need, what keeps them from getting the data they want, tools/techniques that have worked best for them, and new technologies proven to be user friendly.
Todd King - Implementing a Virtual Observatory: Models, Frameworks, Tools
Virtual observatories will help us normalize access to data and data centers. His poster describes how to build a VO, drawing on experience with PDS & SPASE. They are currently trying to build a VO for magnetospherics (VMO), and the poster discusses what to expect and how to move forward.
Mike Martin - Whither Physical Media
PDS answers the 200 yr preservation problem mentioned earlier. His PDS working group assessed the CDs and DVDs at PDS nodes and found that the media quality and the way the discs were written are highly variable. They have recommended migration from CD-R and DVD-R archives to on-storage systems with high-density (DLT) backup.
Pat McCaslin - Use of Archive Information Packages at the NSSDC
The National Space Science Data Center (NSSDC) has adopted concepts from the Reference Model for an Open Archival Information System (OAIS). They are using Archive Information Packages (AIPs) for the preservation of digital data. His poster presents the advantages of using AIPs and describes NSSDC's experience doing so.
Bob McDonald - Replication Policies for Distributed Digital Preservation Environments
His project is called Chronopolis, a collaborative partnership of many institutions, to curate intellectual capital on a national scale. They are starting with a storage swap between SDSC and NCAR. They hope to extend to a cross-institution replication of critical data.
Based on that experience, the goal is to develop a framework and replication strategy between data repositories. The storage swap runs from 100 TB (2006) through 300 TB (2010).
Tom McGlynn - Data Preservation and Data Reuse in Archive Design and Implementation
His poster muses on the tension between what we say archives are for and what we use them for. Astrophysics has tried to address this, but practices are all over the map. Some archives preserve every version and make every one available, confusing users; others serve only the latest version. What about recovering datasets that require dynamic processing using changing software? The Chandra archive is a good middle ground: it serves a default version but saves the others.
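The middle-ground policy described above (serve a default version, but keep the others retrievable) can be sketched in a few lines of Python. This is only an illustration under assumed names; the class, its methods, and the sample labels and paths are hypothetical and do not describe how the Chandra archive or any other archive is actually implemented.

```python
class VersionedDataset:
    """Hypothetical store that keeps every version but serves one by default."""

    def __init__(self, name):
        self.name = name
        self._versions = {}      # version label -> storage location
        self._default = None     # label served when no version is requested

    def add_version(self, label, location, make_default=True):
        """Keep the new version; optionally make it the one served by default."""
        self._versions[label] = location
        if make_default or self._default is None:
            self._default = label

    def get(self, label=None):
        """Return the default version unless a specific one is requested."""
        key = self._default if label is None else label
        return self._versions[key]

    def available_versions(self):
        """List every retained version label, in sorted order."""
        return sorted(self._versions)


# Illustrative labels and paths only.
dataset = VersionedDataset("example_observation")
dataset.add_version("v1", "/archive/example_observation/v1")
dataset.add_version("v2", "/archive/example_observation/v2")

print(dataset.get())        # the default (latest) version
print(dataset.get("v1"))    # an older version is still retrievable
```

Most users call `get()` with no argument and never see the older versions, while a user who needs to reproduce an earlier result can still ask for one by label.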
John Moses - Guidance for Science Data Centers through Understanding Metrics
ESDIS has a program to understand metrics for tracking how well an archive is doing. User statistics are posted in real time; they get 40,000 hits/day, and each data center has access only to its own data. Their annual customer satisfaction survey now uses 50 rating questions. They are now moving to more COTS products adapted to their needs.
Jim Thieman - Tradeoffs in the Development of the SPASE Data Model
The SPASE data model for heliophysics provides common terminology that enables unified searches and ready comparison of results. His poster discusses tradeoffs in the development of the SPASE data model, e.g., to what level of detail should datasets be described?
Joe Zender - Science Archives over the Past Centuries: What Can We Learn?
His poster looks back in time at the time scales for archiving and using data. Mankind is 150 generations old. Were people aware of archiving in the old days, and did they preserve data? Where are the data? Can we find data from Huygens? Contemporary data from Ahearn? For long-term preservation, records need to go on optical media, as before.