Searching for Data Information at a Directory Level

Volume 10, Numbers 3 & 4, December 1994
by Jim Thieman
Interoperable Systems Office

Introduction

The NSSDC has for many years operated, and continues to operate, the NASA Master Directory (NMD). This service enables the research and general community to quickly find information about, and the location of, data of interest, especially in the space sciences. Recently, network tools such as the Web and Gopher have provided capabilities similar to the NMD for finding data and information. Is the NMD something that can now be replaced by these tools? This article will discuss that question and the various approaches to obtaining directory information.

The NMD directory service is available online and is accessible through several types of interfaces. The directory interfaces have evolved as the technology for creating them has progressed. A Web interface has been developed for the Global Change Master Directory (GCMD) [Information Systems Office Newsletter - Issue 33, p. 31, August 1994] by the GCMD staff and has been modified by the NMD staff so it can be integrated into the many other data and information services that the NSSDC has to offer. This interface offers simple yet powerful search capability. Through this and the other interfaces, the NMD provides information about data not only at NSSDC, but throughout the world. For some of the datasets there are systems in other locations which provide more detailed information about the data. The older interfaces provide automated connections, called links, to these other systems so the user can find information and data quickly. The Web hyperlink capability will be used to provide similar service for Web directory users.

Although the network tools such as the Web and Gopher are simple, powerful, and access a wealth of information, they still have shortcomings which can lead to ineffective searches for data and information. There are advantages and disadvantages to each of several approaches to finding directory information.

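The approaches to be examined are WWW search methods, centralized directory searches with standards, WAIS text retrieval directory searches, and distributed directory searches.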

WWW Search Methods

The popularity of the World-Wide Web is very evident. The chart of NSFnet network usage statistics shows that Web usage on NSFnet increased by a factor of six in the first nine months of 1994 and that it is overtaking many popular methods of transferring bits through the networks. The ability to use embedded hyperlinks to hop from one information source to another at a very different location makes it easy to reach widely separated sources of information. What is more difficult, however, is finding exactly the right piece of information. A number of "spider" services have been created which navigate through the Web automatically and gather information from the titles and text found on the Web pages. Often this textual information is indexed or organized so that a user of the service can quickly find where a few keywords of interest have been used. Several of these services use the Wide Area Information Servers (WAIS) text indexing capability, for example, the Webcrawler service offered by Brian Pinkerton at the University of Washington.

If users have some idea of particular sources for the information they are seeking, they can often reach their desired destination with well-chosen keywords. A search for all sources of information relevant to a particular topic, however, will often be difficult due to: the lack of standardization of terms describing the topic; the mixing of the relevant information with other information irrelevant to the search; the multiplicity of pages which must often be navigated to reach exactly what the user is looking for; and the difficulty major institutions have in succinctly describing all the information they hold.

For example, if the user is interested in radio astronomy data from the planet Jupiter, entering the keywords "radio", "astronomy", and "Jupiter" in the Webcrawler mentioned above yields some very useful pages compiling a great deal of information and many hyperlinks in astronomy. One must search through those pages thoroughly to determine which lead to Jupiter radio astronomy data. The NSSDC archives Jupiter radio astronomy data, but it is not listed among the Webcrawler search results because the NSSDC cannot describe each of its thousands of data sets on Web pages. There is also the common problem that many data and information sources are not yet a part of the Web.
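The keyword lookup described above can be sketched in a few lines of Python. This is only an illustration of the general idea of a text index built from gathered pages, not the way Webcrawler or WAIS is actually implemented; the page identifiers and page text are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical page text, standing in for what a spider might gather.
PAGES = {
    "astro-links": "Astronomy resources: radio astronomy, Jupiter, and planetary links",
    "nssdc-home": "NSSDC home page: space science data archive and services",
    "radio-club": "Amateur radio club meeting notes",
}

def build_index(pages):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index

def search(index, keywords):
    """Return the page ids that contain every one of the keywords."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

index = build_index(PAGES)
hits = search(index, ["radio", "astronomy", "Jupiter"])
print(hits)
```

In this toy collection the general astronomy links page is found, while the archive's home page is not, because none of the three words appears in its text. That mirrors the situation described above: an archive can hold relevant data without being found by a text search of its pages.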


Centralized Directory Searches with Standards

Information professionals, such as librarians, are using the Web and the other network tools to benefit their users. Many prefer, however, to search for specific topics through services such as Dialog, where the information has been put into standard form and a search is likely to yield a more complete set of information. The NMD stores information in a standard form, called the Directory Interchange Format or DIF. The DIF requires a description of datasets to have controlled keywords attached in particular fields. Use of the controlled terms in a search yields a higher probability that the results are comprehensive in the subject area.
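The effect of controlled keywords can be illustrated with a small Python sketch. The record below is a simplified stand-in for a directory entry and is not the actual DIF layout; the field names, the keyword list, and the sample entries are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary; a real keyword list would be far larger.
CONTROLLED_KEYWORDS = {"RADIO ASTRONOMY", "JUPITER", "MAGNETOSPHERE", "SOLAR WIND"}

@dataclass
class DirectoryEntry:
    """Simplified stand-in for a directory record (field names are illustrative)."""
    title: str
    summary: str
    keywords: set = field(default_factory=set)

    def __post_init__(self):
        # Reject any keyword that is not in the controlled list.
        unknown = self.keywords - CONTROLLED_KEYWORDS
        if unknown:
            raise ValueError(f"uncontrolled keywords: {sorted(unknown)}")

def find_by_keywords(entries, wanted):
    """Return titles of entries tagged with every wanted controlled keyword."""
    wanted = set(wanted)
    return [e.title for e in entries if wanted <= e.keywords]

entries = [
    DirectoryEntry("Jovian decametric emission observations",
                   "Ground-based receiver data.",
                   {"RADIO ASTRONOMY", "JUPITER"}),
    DirectoryEntry("Interplanetary plasma measurements",
                   "Spacecraft plasma instrument data.",
                   {"SOLAR WIND"}),
]
matches = find_by_keywords(entries, ["RADIO ASTRONOMY", "JUPITER"])
print(matches)
```

The first entry is found even though neither "radio" nor "Jupiter" appears in its title or summary, which is the kind of case where controlled terms help a search be comprehensive.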

The problem with standardized information sources is the amount of time and effort it can take to put information into the standard form. To create a DIF with the minimum amount of required information takes only a few minutes. A more thorough description of the data, which is very helpful to the user, may require a few hours, depending on the amount of detail included. A person unfamiliar with the DIF will also need additional time to learn enough about the format to write one.

Though a standardized information source is the best for user purposes, it takes cooperative effort and dedicated attention to gather information in a standard form from a wide variety of sources. The NMD has been made as comprehensive as possible with respect to NASA-funded data, but the NMD also includes descriptions of space science data from other sources in the world. It is difficult to quickly put all of that information in the DIF format. There are compromises, however, which can help to alleviate this problem.


WAIS Text Retrieval Directory Searches

Overview information about datasets exists in many directories around the world and the content of this information does not differ greatly from one directory to another. Many of these directories exist in electronic form. It is not too difficult to gather the directory databases in a central location and use the WAIS or similar indexing software to provide text search capability into the accumulated information. This differs from generalized WAIS searching of the Web since the gathered information is limited to directory information on a particular subject, such as space science data. For example, there is a WAIS searching capability available in the NMD and the entry of the keywords "radio", "astronomy", and "Jupiter" quickly yields a list of datasets relevant to that topic and the information about where they are located.

Again, the success of searches depends on the commonality of terminology used in the directory entries. To make searches more likely to yield comprehensive information, it may be necessary to provide some "added value" to the databases by attaching keywords chosen from standardized lists where needed. The simple attachment of keywords would not require as much effort as the restructuring of the information into the DIF format. Also, simple skeletal DIFs could probably be made from these value-added entries through automated processes. For datasets which are considered to be particularly useful or important, the full process of DIF creation should be followed; the dataset would then be retrievable by the wider variety of search techniques available through the traditional NMD.
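One way such keyword attachment might be automated is sketched below in Python. The mapping from free-text terms to controlled keywords, the record fields, and the sample description are all hypothetical; a real process would draw on the standardized keyword lists and would likely need human review.

```python
import re

# Hypothetical mapping from free-text terms to controlled keywords.
TERM_TO_KEYWORD = {
    "jupiter": "JUPITER",
    "jovian": "JUPITER",
    "radio": "RADIO ASTRONOMY",
    "decametric": "RADIO ASTRONOMY",
}

def attach_keywords(description):
    """Return the sorted controlled keywords suggested by words in the text."""
    words = re.findall(r"[a-z]+", description.lower())
    return sorted({TERM_TO_KEYWORD[w] for w in words if w in TERM_TO_KEYWORD})

def skeletal_record(title, description):
    """Build a minimal record (illustrative fields only) from a text entry."""
    return {
        "title": title,
        "summary": description,
        "keywords": attach_keywords(description),
    }

record = skeletal_record("Decametric survey",
                         "Jovian decametric bursts recorded at 20 MHz.")
print(record["keywords"])
```

Here the words "Jovian" and "decametric" lead to the same controlled keywords that a description mentioning "Jupiter" and "radio" would receive, so both kinds of entry become retrievable with one set of search terms.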


Distributed Directory Searches

Even with the simplified process of gathering all forms of directory databases into a single text-searchable database, it is still difficult for any one organization to gather comprehensive information and keep it up to date. It would be better if each source could keep its information at its own location and update it as changes or new additions occur. The Committee on Earth Observing Satellites International Directory Network (CEOS IDN) is a federation of network-connected directory database nodes which share information with each other in the DIF format. One future scenario calls for users to be able to log into any of the directory nodes and submit a search query which would be sent to many or all of the nodes, with the results returned to the original node. Each node would then need only a database of information describing the datasets in its own domain. Even if there are not enough local resources to put the information in the DIF format, general text descriptions could be gathered and made text searchable through the WAIS distributed server techniques. The advantage of this approach over simple Web text searches is the narrowing of the search domain to directory information in specific subject areas only. Again, the addition of controlled keywords to the entries in the databases could make the search even more effective.
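The scenario of sending one query to many nodes and collecting the results at the originating node can be sketched as follows. This Python example uses in-memory lists in place of real network connections; the node names, the entries, and the simple substring matching are assumptions for illustration and do not describe the CEOS IDN or WAIS protocols.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical nodes, each holding only the entries for its own domain.
NODES = {
    "node-a": ["Jupiter radio astronomy receiver data", "Solar wind plasma data"],
    "node-b": ["Radio astronomy observations of Jupiter and Saturn"],
    "node-c": ["Ocean colour imagery"],
}

def search_node(node_name, keywords):
    """Stand-in for a network query: match all keywords in one node's entries."""
    wanted = [k.lower() for k in keywords]
    return [(node_name, entry) for entry in NODES[node_name]
            if all(k in entry.lower() for k in wanted)]

def federated_search(keywords):
    """Send the query to every node in parallel and merge the results."""
    with ThreadPoolExecutor() as pool:
        per_node = list(pool.map(lambda name: search_node(name, keywords),
                                 sorted(NODES)))
    return [hit for hits in per_node for hit in hits]

results = federated_search(["radio", "astronomy", "Jupiter"])
for node, entry in results:
    print(node, "-", entry)
```

Each node answers only from its own entries, and the originating node simply concatenates the answers in node order. A real network would also have to handle slow or unreachable nodes, which this sketch leaves out.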


Conclusion

The Web has brought a revolution in the ease of access to a wealth of information. It has not, however, replaced the need to assemble information about specific subject areas and to make that information readily findable through standardized techniques. NSSDC's NASA Master Directory offers this value-added service in the search for NASA-funded and NASA-relevant data and will continue to gather information on important space science datasets worldwide. This service is readily available to Web users. With increased use of text searching techniques on less rigidly formatted information, and potential cooperative effort by distributed data information sources, the service can be made even more useful for the community. It is still important to have as much commonality as possible in the databases being searched. The amount of commonality incorporated will depend on the resources available, but newer search tools are lessening the amount of effort required. We invite you to try the present system and send us your suggestions for improvement.

NMD ACCESS INFORMATION

World-Wide Web Universal Resource Locator (URL)

http://nssdc.gsfc.nasa.gov/nmd/nmd.html

The NMD may also be found within the NSSDC Online Data and Information Service (NODIS) which can be accessed in the following ways.

Internet

TELNET NSSDCA.GSFC.NASA.GOV
USERNAME: NODIS

NSI/DECnet (SPAN)
$ SET HOST NSSDCA
USERNAME: NODIS


Author: Miranda Beall
Curators: Erin Gardner and Miranda Beall
Responsible Official: Dr. Joseph H. King, Code 633
Last Revised: 21 Nov 1996 [EDG]