Dear Colleague:

Enclosed are the proceedings from the Open Meeting on Space Science Data Systems held last week here at NASA Headquarters. I appreciate your participation and interest in this very important topic. I look forward to your continued involvement as we proceed toward a more coherent federation of space science data systems. In the meantime, I welcome any comments and thoughts, either via e-mail to joe.bredekamp@hq.nasa.gov, or by phone at 202-358-2348. Thanks again.

Joe Bredekamp
================================================================
Joseph H. Bredekamp                      joe.bredekamp@hq.nasa.gov
Senior Science Program Executive/        Voice: 202-358-2348
Information Systems                      Sec:   202-358-1588
NASA Office of Space Science             Fax:   202-358-3097
Agenda

 9:00  Introduction                           Joe Bredekamp
 9:15  Space Science Vision                   Henry Brinton
 9:45  Synopsis of Current Data Environment   John Nousek/Penn State
                                              Ray Arvidson/Washington U.
                                              Tim Killeen/U. Michigan
10:45  Data Management Task Group Report      Jeff Linsky/U. Colorado
11:15  SSDS Concept and Approach              Joe Bredekamp
12:00  Lunch
 1:00  Plenary Discussion
           Management Issues
           Technical Issues
           Transition Issues
 4:00  Summary Statements
 4:30  Adjourn
Dr. John Nousek from Pennsylvania State University described what a science user expects from astrophysics data today: free; available on request in any volume; convenient-to-use indexing, browse, and location tools; prompt data delivery; usable and understandable support analysis tools, especially the software to process and interact with the data; freedom to select alternative forms; exchange standards; user-supplied software extensions; re-processing/re-calibration available on demand; immediate correction of software/data errors; and availability of detailed expertise.

An important concept underlying astrophysics data is the data/software life cycle. It starts before launch with calibration data. After launch there is an early (performance verification) phase, in which the data is proprietary. After this first testing phase come the general observations, still proprietary, but guest investigators start to use the data. This evolves into the open time, with a mixture of proprietary data and data going into the archive. When the mission becomes mature, new data is still coming in, data is going into the archive, and re-processing of the data set occurs. This is usually very time consuming and expensive. Approximately one year after the end of the mission, all proprietary rights have expired, all data is archival, and archival tools are essential. Somewhat later there is a final reprocessing of the data, which goes into the archive. After this comes the extended archive phase (science archive research center), when a new set of people takes over responsibility for the data. As the computer generation evolves, data reclamation and reformatting become necessary. Finally, there is a historical role to preserve data integrity.

Dr. Nousek presented some lessons learned from his experience: dramatic change can occur suddenly, and even the best projects can become irrelevant; agility is essential to minimize the planning/managerial overhead; no progress occurs until the first use of the system; technology is always astounding, and no project is ever technology bound. The point of data/software is scientific discovery, and development/resources must follow the user patterns. The data environment empowers scientists to make discoveries.
Dr. Ray Arvidson from Washington University continued the discussion from a planetary sciences perspective, as both a data user and a data producer. In response to Committee on Data Management and Computation (CODMAC) reports in 1982 and 1984, the Pilot Planetary Data System (PDS) was formed to prototype ideas and begin serving the community. The best data systems are those that have direct science involvement. In 1991, the PDS became operational, and it serves the community today. It has archived data from about 20 missions and currently works with 11 active missions. The PDS publishes and distributes peer-reviewed data from completed missions; establishes planetary archive standards; works with flight projects to ensure generation of PDS-compatible products; and provides science expertise on the archived data. There is a central node at JPL for system management, but the PDS is fully distributed into the community. A process has been established for archiving that directly involves the PDS with the mission or the community members. A set of standards and approaches has been developed that is accepted by the community and is used by the missions and communities. Peer review is an important part of the entire process. The current system is discipline-oriented, and it is not set up to cut across the traditional discipline boundaries. The SSDS, properly done, will enable theme-based research to be accomplished. However, in moving to the SSDS, Dr. Arvidson emphasized that the successful activities should not be unraveled. The challenge is a distributed system that keeps what works, meets existing obligations, and enables coordination of activities and seamless access to OSS-produced information and data. In response to a question, Dr. Arvidson described how the PDS is organized. Funding goes to JPL, and the nodes (competitively selected) are operated under JPL contract. Most of the archiving and the work is done at the distributed sites. It can be thought of as a "virtual institute."
One of the challenges of the SSDS will be management. With respect to interacting with the programs, many of those involved in PDS are also involved in the missions; also, there is a Planetary Data System Management Council which has worked very well.
Dr. Tim Killeen from the University of Michigan completed the discussions from the perspective of space physics (Sun-Earth Connections). There appears to be a five-year transforming cycle in data systems, and the pace of change should be driving the development of the SSDS. In the early 1980s, everything was PI driven (the data belonged to the individual). By the mid-1980s, coordinated multifaceted data sets were seen. NASA invented SPAN, which transformed the way science was being done. In the 1990s, the Web was transforming everything, and is becoming a way of publishing in a fast-moving field. Another aspect was the move by the community toward data-theory closure. The science is no longer in disciplines, but in connections and understanding processes: a unified system of mass, energy, and momentum. What is now important for the science are the connections. The SSDS must include the next transforming cycle in data systems. Some questions are: How does the SSDS fit into emerging technologies? How does it fit into the larger picture of science in the future? Increasingly, science will require modules (data and others) that can be readily combined. Collaboration technologies are improving rapidly, and there is an insatiable demand. Any new system must track the "agents of change." The current SPDS is an excellent, well-constructed Web presence with multiple capabilities. It is comprehensive and well maintained. The permanent archive should be a NASA responsibility. The use of the SPDS capabilities seems to have been less than one might expect. The "growing pains" have put some off permanently, and there is a one-dimensional nature to the products. The system requires significant investment to optimize personal involvement. Also, there have been some unrealistic expectations of, or insufficient appreciation for, the magnitude of the task.
Of importance to the science environment are: model/data source "brokers"; high-performance visualizers; digital libraries; data base agents; electronic workshops and campaigns with replay capabilities; and Space Weather science testbeds. The SSDS must have collaboration tools, digital libraries, electronic workshops, and on-line help. The deployed user nodes should be competitively selected; have value-added expertise; have collective oversight responsibility; have overlapping specializations spanning the full field; and have natural alliances with international and interdisciplinary partners. There should be regular evaluation reviews, and progress toward objectives must be tracked. The system should use the university analogy: strong deans with direct financial responsibility, and a provost with institution-wide oversight and responsibility. A suggestion was made that the SSDS provide full references/citations, easily available to users.
Recommended building blocks for the SSDS include: selection (competitive, for a fixed period of time); a coordination office (does not hold any data, but sets standards and advises projects); distributed data (to "data lovers" in the user communities); data nodes (distributed centers that curate, maintain, validate, and distribute space data); periodic peer review (for each component of the SSDS); an advisory structure (Science Information Systems and Operations MOWG); a permanent archive; existing data nodes (keep the ones that are working well); user communities; education; and NSSDC (disestablish after the transition to SSDS is complete). The current environment is characterized by scarce resources, exploding data bases, changing research styles, rapidly changing technology, and distributed resources. In addition, decreasing manpower at NASA Headquarters and a desire to outsource indicate that NASA's role should change from Manager to Partner with the user communities in the management of the SSDS. A change in role from Manager to Sponsor would be dangerous in the long term.
Dr. Linsky addressed the question of whether or not the Coordination Office (CO) should be outsourced. Arguments in favor: competition brings out the best ideas; an outside group is better able to fight for proper funding; and universities are highly innovative environments. Arguments against: minimize the cost and disruption of changing the host and the location during the transition period; a NASA Center is better able to withstand "glitches" in funding; a very long-term commitment is essential; NASA must remain an active player; and a NASA Center can enforce NASA's data requirements better than a non-NASA entity.
The Task Group recommended assigning the CO to GSFC. However, a minority of the group would like to see the CO competed. Any assignment to GSFC should be done with a clear charter and an active Management Council that represents the user communities. If NASA decides to compete the CO, then NASA Centers should be allowed to compete with other entities. The CO should be small, and its functions should be strictly limited to: project management (with the input of the Management Council); leadership in the setting of standards; leadership of system engineering for the SSDS; pro-active coordination and planning with active missions and instrument teams, including advice on writing and implementing Project Data Management Plans; and participation in the periodic reviews of the data nodes and permanent archive. The CO will have no in-house science data sets. The Management Council represents the user communities and acts as a "Board of Directors" and a source of expertise for the CO. The SSDS is funded by NASA, through GSFC (or other NASA Center) contracts.
There were a number of comments and questions associated with the Coordinating Office. A concern was expressed over how the CO could maintain credibility and expertise without some direct connection with the data. Dr. Linsky noted this concern. Also, if a small management office is formed, how can it ensure that the right elements get into the Project Data Management Plans (PDMPs)? It is envisioned that the Information Systems and Operations Working Group will provide policy advice on PDMPs. There is a need for financial stability of the CO, but there are alternatives in between the two extreme positions described in the presentation. The existing discipline nodes have different cultures, e.g., planetary has a central coordinating node, but astrophysics does not. However, all nodes should be organized in a way that enables interdisciplinary science. Every discipline node should provide data for the permanent archive. The permanent archive should be linked with the CO. NASA needs to think about how this role will be played by the working data sets residing in the nodes. The question of a "world data center" has not been addressed; it may be outmoded given the Web and a permanent archive. The ISOMOWG should ensure that NASA maintains the position of stakeholder. One of the functions of the CO would be to direct queries in the right direction on the Web site. The structure being implemented for OSS education and public outreach includes four forums (one for each science theme) as well as a set of broker/facilitators. This data system must interoperate with those locations.
There are three principal elements in the architecture: permanent archive; distributed domain nodes; and a management node. The management node is vested in the community (envisioned as an "institute without walls" and an opportunity for partnership). The intent would be to select this node through a competitive process, and have it be the principal funding entity. The management node will conduct the competitive selection of domain nodes. The SSDS becomes an organic part of the infrastructure for science. It must have well-defined and coherent interaction with the OSS Education and Public Outreach structure, and must maintain strong international cooperation. In planning and implementing the space science data environment, OSS seeks participation across broad community segments: science users, data system engineers and technologists, academia and government, private sector and technology innovators, and educators and the general public. Next steps in the process are: further study of the architecture and design; refinement of the implementation plan; and development of the solicitation for the management node. There could be a follow-up workshop (or series of workshops) leading to the development of the solicitation.
Comments from the audience:
Tom Garrard (Cal Tech) - Community Input Needs: The Space Physics Data System (SPDS) is a parallel organization to the PDS. There are substantial cultural differences between the space physics community and the astrophysics community. The SP issues are associated with data complexity and calibration. SPDS has been a volunteer/community driven organization with very limited funding. Emphasis has been on mechanisms for input from the community, and keeping the community informed. Recommendations are: stronger mechanisms for community feedback, including more town meetings and workshops, electronic and/or geographically distributed; publicity via AGU etc., and via WWW; domain coordination teams with large memberships and a clear (hierarchical?) path to the "top". These recommendations are addressed to OSS and the ISOMOWG, as well as to the coordination office when it is established.
Nick White (HEASARC-GSFC) - Lessons Learned and Future Prospects: Using catalogs and data at all levels requires access to a multitude of sites. NASA created ADS and ESA created ESIS to address this problem. Both projects failed to achieve their objectives, and have either been reduced in scope or canceled. Their mistakes should not be repeated. ADS put a burden on the data providers, and also conflicted with existing user interfaces from data providers. Ultimately, the system was overtaken by the advent of the WWW. Requirements for a distributed data system are: do not put a burden on the data providers; use public domain access methods (e.g., ftp, http, etc.); do not try to provide a do-it-all user interface (concentrate on defining a few critical standards and enable the community to build their own user interfaces). Rich sources of survey data are already network accessible. A seamless integration of services using URLs can be expected. A few simple standards should be agreed upon. In response to a question, Dr. White noted that probably user community involvement in ADS was insufficient, particularly during implementation.
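Dr. White's recommendation can be pictured with a small sketch: instead of a do-it-all interface, archives agree on a shared query-URL convention that any provider can serve and any client can compose with standard tools. This is purely illustrative; the archive host and the parameter names (target, format) are invented for the example, not part of any actual standard discussed at the meeting.

```python
# Hypothetical sketch of "a few simple standards" over public access methods:
# a common query-URL convention that imposes no burden on data providers,
# since both sides use only standard URL tooling.
from urllib.parse import urlencode, urlparse, parse_qs

def build_query_url(archive_base, **params):
    """Compose a catalog query as a plain URL, per the shared convention."""
    return archive_base + "?" + urlencode(sorted(params.items()))

def read_query(url):
    """Any archive can parse the same convention with standard tools."""
    parsed = urlparse(url)
    return {k: v[0] for k, v in parse_qs(parsed.query).items()}

# Example: a client composes a query, and a provider decodes it.
url = build_query_url("http://archive.example.gov/search",
                      target="M31", format="fits")
print(url)              # http://archive.example.gov/search?format=fits&target=M31
print(read_query(url))  # {'format': 'fits', 'target': 'M31'}
```

Because the convention is just a URL, such services compose naturally in the Web environment described above: any user community can build its own interface on top without coordination beyond the shared parameter names.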
There is an effort underway to review all of the active space physics data sets. The ones that are exceptionally well done will be recognized; those that are not will be improved with the participation of the community. How can users be motivated to interact in the development phase of the system? One good example of a system that works well is CDAW Web.
Roger Brissenden (SAO) - High Energy Data Provider: From the astrophysics community point of view, it is very important to build on the existing system throughout the community. The infrastructure should be "light-weight" and have low overhead. Look at systems that use the Web. Rapid evolution of systems at the point of expertise should be enabled.
Michael Kurtz (SAO) - Astrophysics Digital Library: A system already exists among text-based astronomy systems. The data providers supply the data based upon the final published product. The journals maintain tables of data in machine-readable form. The nomenclature is agreed upon, and objects have names. The Abstract Service covers planetary data as well.
Bob Hanisch (STScI) - Technical Opportunities and Scientific Rationale: Be very careful not to force from the top down. Look at what level of integration is valuable scientifically and achievable technically. Nomenclature may have very little overlap. The most important role is finding data that is relevant to the scientific problem. Once data is found, the important function is getting it in a form in which it can be used. This can be done by building upon the excellent facilities already in place, and integrating them. ADS drew upon proprietary systems and put burdens upon data providers. Related services will want to participate in SSDS. The technology is now in hand, and prototypes have been developed. In terms of public outreach, data services must support the educational outreach community, but it is a mistake to make the archive systems fully accessible as a public resource: the data can be too complicated and voluminous to be useful to them.
Kevin Gamiel (NCSA, U. Ill) - Call for Collaboration: There are already a lot of systems in which investments have been made. NCSA will offer a free software package that drops into existing network accessible data repositories, and a standard set of queries with full support of concepts for searching different types of astronomical resources.
David Book (U Md.) - Email Archive: Today, the intellectual archive (notes) should also be for the public and historians. E-mail could be flagged as "archive" and it would be sent to a repository. [The sense of the group was that this would not be very useful.]
Joe Mazzarella (IPAC) - Reality Checks: Scope and Resources: The SSDS should have a well-thought-out cost-benefit analysis early on. There are limitations to the use of a system for interdisciplinary research. There are common needs, such as an image browsing tool for the Web. How does the SSDS differ from using the Web with common sense? The cost of hardware and people should be factored into the $20 million budget. Problems have been encountered relative to large data transfers over the Web. Take advantage of what people in other fields know, and leverage technology developments.
Bob Fowler (NSF) - For a scientist to maintain a data archive that will be generally useful to a large segment of the community in a technologically changing environment will be a challenge.
The evolution of Web technology is exciting. There has to be some confidence that the data obtained is trustworthy.
Don Sawyer (GSFC) - Technology Comments: There are new technologies coming along which may significantly impact how data is used. Don't get over enamored with any one technology that may exclude others.
Dennis Gallagher (NASA/MSFC) - The scientists are still thinking in terms of holding the data. Data does not have to be held (e.g., a "data mine") in order to be accessible. What could be more easily competed is a "data miner" service, which is people based. The information technology people who are managing the data for science must work in close collaboration with the people who are doing the science from the data.
Recommendations
Another model
Use a successful model (e.g., PDS)
Plan -> Pilot -> Plan -> Compete -> Plan -> Compete
Put together a pilot activity (an appointed coordination office) to define what is in the permanent archive, the services, etc., and to develop what the new nodes should look like. As a function of time, the new CO would be put in place, without interrupting services that have begun. Assess how this is working, and continue to iterate until the goal is reached.
Morris Aizenman (NSF): What is it that the customer wants? Where are we going with this new structure and why? What is it that we don't have that we need? What is the CO going to do? What is the value added?
Ray Tatum: It sounds like a paradigm is being laid down. Because of the diverse cultures, some distribution is required. The critical part is the transition phase. Do not move too fast; spend some time studying. Industry wants to be involved in the process. The things the science community does well, it should continue to do; the things that industry does well, it should do in partnership with the science community.
Ray Walker: Space Science Data Coordination
Other Comments
Barry Madore: The question is not how NASA should implement these solutions, but whether it should be implementing solutions at all. Ultimately, the most important thing is knowledge. Enforcement of a system is not going to lead to answers; scientists are self-motivated without this structure.
Nick White: One of the biggest impacts of the Linsky Report is to dissolve the NSSDC. The NSSDC has been around for a long time, and it would be wise to let those who are running NSSDC develop a plan on how it should be dissolved.