The SEDAC Environmental Treaties and Resource Indicators Project (ENTRI): Integrating Diverse Data via the World Wide Web

presented at the Science Information Systems Interoperability Conference, College Park, Maryland, November 8, 1995

Frederick Zimmerman
Research Scientist, Environmental Research Institute of Michigan
Consortium for International Earth Science Information Network (CIESIN)
Socioeconomic Data and Applications Center (SEDAC)
4251 Plymouth Road, Ann Arbor, MI 48105
Phone: 313/741-4657 | Fax: 313/663-6622
E-mail: fzimmerm@ciesin.org | WWW URL: http://sedac.ciesin.org/

This paper was prepared and written by Frederick Zimmerman, supported with funds provided by CIESIN/NASA. The opinions, conclusions, and recommendations contained herein represent those of the writer and are not necessarily those of CIESIN or NASA.

(c) 1995, CIESIN, Frederick Zimmerman

Abstract

The Call for Papers for this session on organizing WWW resources compares the current state of the Web to the Tower of Babel, but holds out the hope that the Web can ultimately become an Alexandrine Library. This paper discusses the approach taken by SEDAC's Environmental Treaties and Resource Indicators (ENTRI) project to organizing information resources for access via the Web. SEDAC has developed a prototype application called the Policy Instruments Database (PIDB), a WWW-based reference tool that provides information about international environmental agreements (WWW URL: http://sedac.ciesin.org/pidb/pidb-home.html). The strategy behind the development of the PIDB is--to continue the library metaphor--that a single well-edited reference book is often more useful than a dozen discrete and unintegrated monographs.

1. User Needs

This project was initiated in response to national and international scientific assessments that there is a need for improved access to information about international environmental agreements. These assessments include authoritative recommendations by the U.S. Global Change Research Program (USGCRP), NASA's Mission to Planet Earth (MTPE), the Human Dimensions of Global Environmental Change Programme of the International Social Science Council (HDP), and the SEDAC Users' Working Group (UWG). The recommendations of these expert bodies support the proposition that information about international environmental agreements is critical to understanding the human dimensions of global change.

Because nation-states are, and will likely continue to be, the primary political units forced to deal with global environmental issues such as global climate change, land use and land cover patterns, stratospheric ozone, and so on, the need for future international environmental agreements is likely to be substantial and continuing. Furthermore, there is strong expert support for the proposition that the implementation of existing agreements is also extremely important, especially as the need for international cooperation on global environmental change increases with the connectivity and complexity of global environmental, economic, and political systems.

Thus, international environmental agreements are increasing in importance as tools to help promote equitable and efficient strategies to mitigate the adverse effects of global environmental problems. Until the development of the SEDAC PIDB service, there was no integrated resource for obtaining information about the range of international environmental agreements that might be applicable to a particular country or set of countries, nor any systematic way to track the progress of the ratification and implementation process across multiple agreements. Discovering such information using traditional methods requires access to a major research library and hours or days of painstaking research and perhaps telephone calls and letters to treaty secretariats.

Very little treaty information was available on the Web when SEDAC began developing the PIDB. Most of the organizations that had relevant information had not yet provided networked electronic access to their information systems. So the challenge that faced SEDAC was more than organizing existing and new Web resources. Rather, SEDAC needed to bring an entire class of information resources onto the Web for the first time.

The ENTRI team decided to focus its initial efforts on "policy instruments" such as treaties, agreements, and laws; information on the negotiation, structure, and status of these legal instruments; and directives, initiatives, and statements from government agencies and non-governmental organizations. In response to user comments and the advice of an expert panel, SEDAC decided that it needed to collect policy instruments related to nine global environmental issues: global climate change, stratospheric ozone depletion, transboundary air pollution, desertification and drought, conservation of biological diversity, deforestation, oceans and their living resources, trade and the environment, and population. Again with the guidance of an expert advisory panel, SEDAC identified several different information resources (detailed in section 3 below) that might help meet user needs for information about international environmental agreements.

If SEDAC had limited itself to organizing existing networked information resources, it would have been much less responsive to its users' needs. In the case of the Policy Instruments Database, SEDAC was able to draw on competencies CIESIN had developed specifically for the purpose of encouraging networking between data providers and end users.

2. The Information Cooperative as a Tool for Developing Network Access to Information Resources

CIESIN developed its Information Cooperative program to facilitate networked access to information resources related to global environmental change. The program revolves around several technical and organizational competencies, including distributed metadata dissemination and discovery (see, for example, the CIESIN Gateway software, which won a 1995 Computerworld/Smithsonian award), capacity building (see, for example, the Info Coop "country nodes" in Eastern European economies in transition, such as Estonia and Latvia), and--most important for the purposes of this session--providing networked access to hitherto non-networked data (see, for example, the WWW-searchable Social Indicators of Development from the World Bank).

The organizational competencies developed by the Information Cooperative were at least as important to SEDAC as the technical ones. For example, many data providers have strong concerns about intellectual property rights and return on prior investments. Addressing such concerns requires persistent, sophisticated attention from staff expert in data policy issues. Similarly, most organizations invited to participate in data partnership activities need to understand how the partnership activities will advance their own missions. Making the case for a mutually beneficial joint activity requires a cadre of staff who are adept in quickly coming to understand the inner workings and external objectives of large, complex international organizations.

These Information Cooperative organizational competencies helped SEDAC meet the policy information needs of its users. As part of the Policy Instruments Database effort, SEDAC staff developed Information Cooperative relationships with the Environmental Law Centre at the World Conservation Union (IUCN) and the Environmental Law and Institutions Programme Activity Centre (ELI/PAC) at the United Nations Environment Programme (UNEP). Similar relationships were established with Freedom House and the Multilaterals Project at the Fletcher School of Law and Diplomacy at Tufts University (although these relationships were not formally part of the Information Cooperative program). Building such partnerships required very significant investments of staff time and other resources. Without them, there would have been no PIDB.

The moral? CIESIN and SEDAC were able to organize Web resources more effectively because they had developed expertise at organizing organizational resources. Efforts to organize Web resources on a purely technical level are unlikely to be as effective as efforts which address the organizational needs of information providers. If SEDAC had adopted a purely technical solution to meeting the policy information needs of its users, it would have been limited to a "lowest common denominator" solution--"whatever we can digitize plus whatever else we can link to on the Web"--which would not have provided a comprehensive solution to user needs. Addressing the organizational needs of information providers allows Web organizers to build more complex systems that address user needs from first principles.

3. Technical Approach to Data Integration

This section of the paper discusses the technical details of SEDAC's approach to integrating the data that constitutes the Policy Instruments Database. Several different data sets were acquired.

The World Conservation Union (IUCN) provided a snapshot of its International Treaties Database, created to track the status and content of international treaties related to the environment. The IUCN International Treaties Database uses a proprietary software package developed by IUCN in the late 1960s, running on an IBM mainframe platform of early-1980s vintage. The section of the database provided by IUCN contained 410 records, each about 50 lines long, in a non-fielded ASCII dump format. There was no existing network access to the IUCN system.

The Environmental Law and Institutions Programme Activity Centre (ELI/PAC) at the United Nations Environment Programme (UNEP) provided a digital version of its Register of International Treaties and other Agreements Related to the Environment. This resource consisted essentially of text files, each about 100 lines long, providing key summary information about approximately 150 treaties, in a standard layout, but not in a fielded format. There was no existing network access to this information resource.

The Multilaterals Project at the Fletcher School of Law and Diplomacy provided about 25 digitized treaty texts. These texts were already available on the Internet via FTP server. CIESIN also digitized, by scanning and proofreading, approximately 50 treaty texts which were not available elsewhere in electronic format. Digitizing these texts, which were typically between 10 and 30 single-spaced pages long, required a substantial effort.

Freedom House provided political and civil liberties ratings for all the countries of the world on a 1-7 scale. Freedom House also provided 1990 GDP data for all the countries in the world. The data came at first on 15 (!) Mac disks containing a single large Stuffit file. SEDAC staff had to unstuff the 15 MB file and then extract the data tables from the PageMaker format file used for producing Freedom House's annual survey, Freedom In The World. SEDAC staff also had to normalize the country names used by Freedom House to be compatible with the country names used in the IUCN and UNEP status information systems.
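The country-name normalization step can be illustrated with a minimal Python sketch. The alias table, the function names, and the sample spellings are all assumptions made for illustration; the paper does not record how SEDAC staff actually performed the mapping.

```python
# Hypothetical alias table: maps variant spellings to one canonical name.
# The entries are illustrative only, not taken from the PIDB.
ALIASES = {
    "united states of america": "United States",
    "usa": "United States",
    "russian federation": "Russia",
    "cote d'ivoire": "Ivory Coast",
}


def normalize_country(name: str) -> str:
    """Return a canonical country name for a raw source spelling."""
    cleaned = " ".join(name.split())      # collapse runs of whitespace
    key = cleaned.lower().rstrip(".")     # case-insensitive, ignore trailing dot
    return ALIASES.get(key, cleaned)      # unknown names pass through unchanged


def unmatched(names, canonical):
    """List normalized names that still do not appear in the canonical set."""
    return sorted({normalize_country(n) for n in names} - set(canonical))
```

A helper such as unmatched() is useful mainly as a check: any name it reports would need a new alias entry or a manual decision before the data sets could be joined.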

Thus, the initial pool of resources included approximately 75 full treaty texts; 150 treaty summaries; 410 status files; and three simple political and socioeconomic variables for the approximately 200 nations in the world. Some of the resources were already available in electronic format, some were not; none of the resources shared common formats. A small subset of the resources were already available via electronic networking, but most were not.

SEDAC decided to provide two principal avenues for Web access to the PIDB data: 1) simple Web browsing and WAIS free text searching of documents arranged in an HTML "tree" and 2) Web queries of an Oracle relational database. The first approach required information technology support of relatively low sophistication (although the sheer number of HTML files--more than a thousand--added nontrivial complexity to Web management). The second approach required a considerably more sophisticated understanding both of Web/cgi technology and of Oracle relational databases.

Data integration for browse access required translating the source data into HTML files, one per data "granule" (e.g., a treaty text, or a status record for a particular treaty), then storing them on a Unix system; writing appropriate Web menu pages, e.g. for treaty text files, for treaty status files, and for treaty summary files; and cross-linking the text files with the corresponding treaty status and summary files, so that a user who browsed to a treaty text could click to the corresponding status and summary information. This approach had the advantage that it supported "casual browsing" of the data collection in a manner familiar to most Web users. It also had the advantage that it preserved the structure of the data on its own terms and in its own mode of presentation.
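The browse-access steps described above (one HTML file per granule, cross-linked to its sibling granules) might be sketched in Python as follows. The directory layout, the treaty identifiers, and the function names are hypothetical; the paper does not describe the PIDB's actual file organization or the tools used to generate it.

```python
import html
from pathlib import Path

# Hypothetical layout: one subdirectory per kind of granule.
KINDS = ("texts", "status", "summaries")


def granule_page(treaty_id: str, title: str, kind: str, body: str) -> str:
    """Build one HTML page for a granule, linked to its sibling granules."""
    links = "\n".join(
        f'<li><a href="../{other}/{treaty_id}.html">{other}</a></li>'
        for other in KINDS
        if other != kind                  # do not link a page to itself
    )
    return (
        f"<html><head><title>{html.escape(title)}</title></head>\n"
        f"<body><h1>{html.escape(title)}</h1>\n"
        f"<pre>{html.escape(body)}</pre>\n"
        f"<ul>\n{links}\n</ul></body></html>\n"
    )


def write_granule(root: Path, treaty_id: str, title: str, kind: str, body: str) -> Path:
    """Write the page to <root>/<kind>/<treaty_id>.html and return its path."""
    target = root / kind / f"{treaty_id}.html"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(granule_page(treaty_id, title, kind, body), encoding="utf-8")
    return target
```

Using a shared identifier as the file name in every subdirectory is one simple way to make the cross-links computable rather than hand-maintained, which matters once the collection exceeds a thousand files.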

Data integration for search access was more complicated. It required first building and maintaining an Oracle database, an expensive proposition in itself. Nonfielded ASCII-dumped data coded in a cryptic syntax had to be loaded into the Oracle database in a fielded format. Most of the documentation for the IUCN database was in German. Data loading took weeks. Next, Web cgi-bin scripts had to be written to enable the httpd server to query the Oracle database.
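Converting a non-fielded dump into fielded records can be illustrated with a short Python sketch. The dump syntax assumed here (a short upper-case tag, a colon, free text, indented continuation lines, and a blank line between records) is invented for illustration; the actual IUCN format is not documented in this paper and was certainly more cryptic.

```python
import re

# Hypothetical dump syntax: a short upper-case tag, a colon, then free text;
# indented lines continue the previous field; a blank line ends a record.
TAG_LINE = re.compile(r"^([A-Z]{2,4}):\s*(.*)$")


def parse_dump(text: str) -> list[dict[str, str]]:
    """Split a raw dump into records, each a dict of tag -> value."""
    records, current, last_tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():                      # blank line closes the record
            if current:
                records.append(current)
            current, last_tag = {}, None
            continue
        match = TAG_LINE.match(line)
        if match:                                 # start of a new field
            last_tag = match.group(1)
            current[last_tag] = match.group(2).strip()
        elif last_tag and line[0].isspace():      # continuation of previous field
            current[last_tag] += " " + line.strip()
        # any other line is ignored in this sketch
    if current:                                   # final record without trailing blank
        records.append(current)
    return records
```

Each resulting dictionary corresponds to one row that could then be loaded into a relational table, with the tags mapped to column names.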

SEDAC science staff's analysis of user needs drove the development of a set of "basic questions" for the Oracle database. Using these questions, users could query the database to discover the answers to recurring questions like "which countries are party to treaty X?" and "to which treaties is country Q a party?" This turned out to be an important and successful feature of the human interface.
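The two "basic questions" quoted above map naturally onto relational joins. The following Python sketch uses the standard sqlite3 module as a stand-in for Oracle; the three-table schema, the identifiers, and the sample rows are assumptions for illustration and are not the PIDB's actual design.

```python
import sqlite3

# Hypothetical schema; the paper does not describe the PIDB's actual tables.
SCHEMA = """
CREATE TABLE treaty  (treaty_id TEXT PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE country (country_id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE party   (treaty_id TEXT REFERENCES treaty,
                      country_id TEXT REFERENCES country,
                      PRIMARY KEY (treaty_id, country_id));
"""


def parties_to(conn, treaty_id):
    """Basic question: which countries are party to treaty X?"""
    rows = conn.execute(
        "SELECT c.name FROM party p "
        "JOIN country c ON c.country_id = p.country_id "
        "WHERE p.treaty_id = ? ORDER BY c.name",
        (treaty_id,),
    )
    return [name for (name,) in rows]


def treaties_of(conn, country_id):
    """Basic question: to which treaties is country Q a party?"""
    rows = conn.execute(
        "SELECT t.treaty_id, t.title FROM party p "
        "JOIN treaty t ON t.treaty_id = p.treaty_id "
        "WHERE p.country_id = ? ORDER BY t.title",
        (country_id,),
    )
    return rows.fetchall()


def demo_connection():
    """Build a tiny in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO treaty VALUES (?, ?)",
                     [("t1", "Sample Ozone Agreement"),
                      ("t2", "Sample Forest Agreement")])
    conn.executemany("INSERT INTO country VALUES (?, ?)",
                     [("AA", "Country A"), ("BB", "Country B")])
    conn.executemany("INSERT INTO party VALUES (?, ?)",
                     [("t1", "AA"), ("t1", "BB"), ("t2", "AA")])
    return conn
```

Passing the user's choice as a bound parameter (the ? placeholder) rather than pasting it into the SQL string is a deliberate choice in this sketch, since the queries are meant to be driven by input from a public Web form.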

When a user asks the database a "basic question", the Web cgi-bin script queries the Oracle database and returns an HTMLized query result. The fact that the database output is HTMLized is very important because it means that the query results are live links. For example, the question "to which treaties is country Q a party" returns a list of hotlinked treaty names. The user can click on the treaty name and go to the actual text of the treaty, which is in turn crosslinked to treaty summary information and the full status record. Putting all the pieces together took a lot of work. Moreover, making the database search capability fully accessible to the general public raised some complex system administration and security issues which ultimately took months to resolve. The effort was worthwhile; the PIDB has received numerous positive comments from users and is serving a growing user community.
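The "HTMLized" query result can be sketched in Python as a function that turns result rows into a page of hyperlinks. The base_url path prefix and the function names are hypothetical; the original cgi-bin scripts are not reproduced in this paper and were not necessarily written this way.

```python
import html


def render_treaty_list(country_name, treaties, base_url="/pidb/texts"):
    """Turn (treaty_id, title) rows into an HTML page of hyperlinked titles.

    base_url is a hypothetical path prefix, not the PIDB's real layout.
    """
    items = "\n".join(
        f'<li><a href="{base_url}/{html.escape(tid, quote=True)}.html">'
        f"{html.escape(title)}</a></li>"
        for tid, title in treaties
    )
    heading = f"Treaties to which {html.escape(country_name)} is a party"
    return (
        f"<html><head><title>{heading}</title></head>\n"
        f"<body><h1>{heading}</h1>\n<ul>\n{items}\n</ul></body></html>\n"
    )


def cgi_response(page):
    """Prefix the page with the header block a CGI script writes to stdout."""
    return "Content-Type: text/html\r\n\r\n" + page
```

Because every row becomes an anchor pointing at the corresponding treaty text, the query result behaves as a navigable page rather than a static report, which is the property the paragraph above emphasizes.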

The moral for this session is that sometimes the most powerful approach to organizing Web resources is to bring new resources onto the Web in a way that fully addresses your users' needs.

4. Future Developments

The Environmental Treaties and Resource Indicators team expects to add considerable new functionality to the PIDB in the coming year. One major effort will be to generalize the PIDB's ability to ask database questions that interrelate treaty information and socioeconomic national indicators. The database will be expanded to cover additional socioeconomic, environmental, and Earth science national indicators, and the database interface will be improved. This activity, like previous PIDB prototyping, will not draw heavily on existing Web resources, but will be directed more at utilizing organizational partnerships and information technology skills to develop comprehensive new "second-generation" Web resources.

Now that a core of comprehensive functionality has been established starting from a "pre-Web" origin point, it makes more sense to opportunistically take advantage of the often rich but also often inconsistent resources that are becoming available elsewhere on the net. User needs analysis, limitations on SEDAC resources, and observation of Web trends all seem to indicate that to be successful in the long run, the PIDB must be not just a database of documents, but a database of networked resources: a gateway. Accordingly, another major theme of this year's development efforts will be enhancing the PIDB's ability to organize and take advantage of existing networked resources.

User needs analysis indicates that information about differing national policy responses to global environmental issues is critical for understanding the international and national context in which the scientific results of global change research are applied. In the coming year, SEDAC hopes to build spider scripts that enable the user to navigate from relational database query results to information about how different nations are responding to particular international environmental issues and agreements. These spiders will provide access to distributed national response information, primarily but not exclusively textual, residing elsewhere on the Internet; for example, National Action Plans filed online at various treaty secretariats.

User needs analysis and Web first principles also indicate that in the long run, it is better to be able to provide access to treaty texts and status information residing at locations, such as treaty secretariats, where the information can be most readily kept up to date and accurate. SEDAC expects to enhance the functionality of the PIDB in the future to support Lycos-like full text indexing across documents residing at multiple network locations. SEDAC also expects ultimately to enhance the PIDB to deal with status information residing in remote databases.
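Full text indexing across documents held at several locations can be illustrated, in its simplest form, by an inverted index keyed on document URL. This Python sketch is an assumption about one possible approach, not a description of Lycos or of any planned PIDB design; fetching the remote documents is omitted, and the caller is assumed to supply a mapping from URL to already-retrieved text.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(documents):
    """Map each lower-cased word to the set of document URLs containing it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in WORD.findall(text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return URLs containing every word of the query (a simple AND search)."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    hits = set(index.get(words[0], set()))    # copy so the index is not mutated
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits
```

Because the index stores only URLs, the documents themselves can remain at the treaty secretariats or other sites that maintain them, which is the arrangement the paragraph above argues for.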

Thus, it can be seen that in some respects the PIDB represents a "transitional" stage in organizing Web resources. Its design reflects the judgments that 1) in some situations, more value can be provided to users by bringing new resources onto the Web than by organizing existing Web resources, but 2) in the long run, more value will be provided to users by efforts to achieve interoperability with other Web resources.

Frederick Zimmerman, SEDAC, WWW URL http://sedac.ciesin.org/pidb/pidb-home.html