Science Archives in the 21st Century
Topic: Meeting provider needs - ingestion of data; Meeting user needs - fast access to large data sets
The Infrared Processing and Analysis Center (IPAC) at Caltech hosts the NASA/IPAC Infrared Science Archive (IRSA) and the Michelson Science Center (MSC) Archive. IRSA is the steward of the scientific data sets of NASA's infrared missions, and the MSC facilitates NASA's planet-finding and exo-planet science program, including multi-mission archives. Together they serve nearly 30 TB of data across the entire electromagnetic spectrum from 17 missions and projects, and they share a common hardware and software architecture. This presentation describes their best practices in the areas of ingestion and user access.
Ingestion: While some providers are large missions, others are small groups of astronomers inexperienced in delivering products. Providing standards and interface specifications for data delivery within a Submission Information Package is necessary for ingestion but has proven insufficient. Communication with the provider starts at the beginning of the project, and the provider is asked to deliver draft products for inspection before their pipelines have entered production. On-line validation tools, whose design has been driven by common mistakes in data delivery, have proven a powerful aid to providers. The functionality offered includes validation of the structure and content of catalogs; generation of documentation of catalog attributes; registration of images on the sky; and validation of the syntax, content and astrometric accuracy of astronomical images.
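As an illustration of the kind of check such a validation tool might run on a catalog, the following is a minimal sketch, not the archives' actual tool. The column names `ra` and `dec`, the function `validate_catalog`, and the choice of checks are all assumptions made for this example.

```python
import math

# Hypothetical required columns; a real interface specification would define these.
REQUIRED_COLUMNS = ("ra", "dec")


def validate_catalog(rows, required=REQUIRED_COLUMNS):
    """Return a list of human-readable problems found in catalog rows.

    rows: iterable of dicts mapping column name to a raw (string) value.
    """
    problems = []
    for i, row in enumerate(rows, start=1):
        # Structure check: every required column must be present and non-empty.
        missing = [c for c in required if c not in row or row[c] in ("", None)]
        if missing:
            problems.append(f"row {i}: missing {', '.join(missing)}")
            continue
        # Content check: coordinates must parse as numbers.
        try:
            ra, dec = float(row["ra"]), float(row["dec"])
        except (TypeError, ValueError):
            problems.append(f"row {i}: non-numeric coordinates")
            continue
        # Content check: coordinates must lie in the valid range (degrees).
        if not (math.isfinite(ra) and 0.0 <= ra < 360.0):
            problems.append(f"row {i}: ra {ra} outside [0, 360)")
        if not (math.isfinite(dec) and -90.0 <= dec <= 90.0):
            problems.append(f"row {i}: dec {dec} outside [-90, 90]")
    return problems
```

Returning a list of messages rather than raising on the first error lets a provider see every problem in a draft delivery at once.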
Access: The archives must return in real time subsets of large data sets (catalogs, images and spectra) that they will curate for the indefinite future. The archive architecture is optimized for efficient access, maintainability and portability, and is highly fault tolerant. Catalogs are housed in flat tables on a high-end EMC disk farm configured as RAID 0+1. An Informix DBMS offers dynamic parallelization of queries, but indexing for spatial queries is resident in memory outside the database. There are no stored procedures in the DBMS. All queries are composed through "thin" interfaces that sit atop a component-based architecture of re-usable ANSI-C modules that are "plugged" together for easy development of new applications. This architecture enables cost-effective deployment of new access services, such as those provided for the NASA Stellar and Exo-planet Database, the Cosmic Evolution Survey Archive and the Keck Observatory Archive.
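The idea of keeping a spatial index in memory, outside the database, can be sketched as follows. This is an illustrative example only: the abstract does not say which indexing scheme the archives use, so the declination-band index, the band height `BAND_DEG`, and the functions `build_index` and `cone_search` are assumptions. The index returns row identifiers, which a service could then use to fetch full records from the DBMS.

```python
import math
from collections import defaultdict

BAND_DEG = 1.0  # hypothetical height of each declination band, in degrees


def band_of(dec):
    """Map a declination in degrees to an integer band number."""
    return int(math.floor((dec + 90.0) / BAND_DEG))


def build_index(sources):
    """Build the in-memory index from (row_id, ra_deg, dec_deg) tuples."""
    index = defaultdict(list)
    for row_id, ra, dec in sources:
        index[band_of(dec)].append((row_id, ra, dec))
    return index


def separation_deg(ra1, dec1, ra2, dec2):
    """Angular separation on the sky in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(index, ra, dec, radius_deg):
    """Return sorted row ids of sources within radius_deg of (ra, dec)."""
    # Only bands that overlap the search cone need to be scanned.
    lo = band_of(max(-90.0, dec - radius_deg))
    hi = band_of(min(90.0, dec + radius_deg))
    hits = []
    for band in range(lo, hi + 1):
        for row_id, s_ra, s_dec in index.get(band, ()):
            if separation_deg(ra, dec, s_ra, s_dec) <= radius_deg:
                hits.append(row_id)
    return sorted(hits)
```

A call such as `cone_search(build_index(sources), 10.0, 20.0, 0.2)` scans only the bands near declination 20 degrees instead of the whole catalog, which is the point of holding the index in memory.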