Searching for Data Information at a Directory Level
Volume 10, Numbers 3 & 4, December 1994
by Jim Thieman
Interoperable Systems Office
Introduction
The NSSDC has for many years operated, and continues to operate, the
NASA Master Directory (NMD).
This service enables the research and general community to
quickly find information about and the location of data of interest, especially
in the space sciences. Recently, network tools, such as the Web and Gopher,
have provided capabilities similar to the NMD for the finding of data and
information. Is the NMD something that can now be replaced by these tools?
This article will discuss that question and the various approaches to obtaining
directory information.
The NMD directory service is available online and is accessible through several
types of interfaces. The directory interfaces have evolved as the technology
for creating them has progressed. A Web interface has been developed for the
Global Change Master Directory (GCMD)
[Information Systems Office Newsletter -
Issue 33, p. 31, August 1994] by the GCMD staff and has been modified by the
NMD staff so it can be integrated into the many other data and information
services that the NSSDC has to offer. This interface offers simple yet powerful
search capability. Through this and the other interfaces, the NMD provides
information about data not only at NSSDC, but throughout the world. For some
of the datasets there are systems in other locations which provide more
detailed information about the data. The older interfaces provide automated
connections, called links, to these other systems so the user can find
information and data quickly. The Web hyperlink capability will be used to
provide similar service for Web directory users.
Although network tools such as the Web and Gopher are simple and powerful and
give access to a wealth of information, they still have shortcomings which can
lead to ineffective searches for data and information. There are advantages and
disadvantages to each of several approaches to finding directory information.
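The approaches to be examined are: general keyword searching of the Web;
searching standardized directory information, as in the NMD; text searching of
directory databases gathered in a central location; and searching across
distributed directory databases.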
The popularity of the World-Wide Web is very evident. The chart shows that Web
usage on NSFnet has increased by a factor of six in the first nine
months of 1994 and it is overtaking many popular methods of transferring bits
through the networks. The ability to use embedded hyperlinks to hop from one
information source to another at a very different location makes it easy to
reach widely separated sources of information. What is more difficult,
however, is finding exactly the right piece of information. A number of
"spider" services have been created which navigate through the web
automatically and gather information from the titles and text found on the web
pages. Often this textual information is indexed or organized so that a user
of the service can quickly find where a few keywords of interest have been
used. Several of these services use the Wide Area Information Servers (WAIS)
text indexing capability, for example, the Webcrawler service offered by Brian
Pinkerton at the University of Washington.
If users have some idea of particular sources for the information they are
seeking they can often reach their desired destination with well chosen
keywords. A search for all sources of information relevant to a particular
topic, however, will often be difficult due to: the lack of standardization of
terms describing the topic; the mixing of the relevant information with other
information irrelevant to the search; the multiplicity of pages which must
often be navigated to reach exactly what you are looking for; and the
difficulty for major institutions to describe in detail all the information
they have in a succinct manner.
For example, if the user is interested in radio astronomy data from the planet
Jupiter, entering the keywords "radio", "astronomy", and "Jupiter" in the
Webcrawler mentioned above yields some very useful pages compiling lots of
information and hyperlinks in astronomy. One must search through those
pages thoroughly to determine which lead to Jupiter radio astronomy data. The
NSSDC archives Jupiter radio astronomy data, but NSSDC is not listed among the
Webcrawler search results because it cannot describe each of the thousands of
data sets it holds on its Web pages. Then there is the common problem
that there are still many data and information sources that are not yet a part
of the Web.
Information professionals, such as librarians, are using the Web and the other
network tools to benefit their users. Many prefer, however, to search for
specific topics through services such as Dialog, where the information has
been put into standard form and a search is likely to yield a more complete set
of information. The NMD stores information in a standard form, called the
Directory Interchange Format or DIF. The DIF requires each dataset description
to carry controlled keywords in particular fields. Use of the
controlled terms in a search yields a higher probability that the results are
comprehensive in the subject area.
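The difference between controlled-keyword and free-text searching can be illustrated with a minimal sketch. The field names (`title`, `summary`, `keywords`), the vocabulary, and the dataset names below are hypothetical and are not taken from the DIF specification; the sketch only assumes that each entry carries keywords drawn from a fixed list.

```python
# Hypothetical controlled vocabulary; the real DIF keyword lists are not reproduced here.
CONTROLLED_TERMS = {"RADIO ASTRONOMY", "JUPITER", "MAGNETOSPHERE"}


def make_entry(title, summary, keywords):
    """Build a directory entry, rejecting keywords outside the controlled list."""
    unknown = [k for k in keywords if k.upper() not in CONTROLLED_TERMS]
    if unknown:
        raise ValueError(f"not in controlled vocabulary: {unknown}")
    return {"title": title, "summary": summary,
            "keywords": {k.upper() for k in keywords}}


def search_controlled(entries, terms):
    """Return titles of entries whose keyword field contains every requested term."""
    wanted = {t.upper() for t in terms}
    return [e["title"] for e in entries if wanted <= e["keywords"]]


def search_free_text(entries, words):
    """Return titles of entries whose title or summary mentions every word."""
    hits = []
    for e in entries:
        text = (e["title"] + " " + e["summary"]).lower()
        if all(w.lower() in text for w in words):
            hits.append(e["title"])
    return hits


# Two invented entries describing the same kind of data in different words.
entries = [
    make_entry("Dataset A", "Decametric emission observations of the Jovian system",
               ["Radio Astronomy", "Jupiter"]),
    make_entry("Dataset B", "Radio astronomy observations of Jupiter",
               ["Radio Astronomy", "Jupiter"]),
]

print(search_controlled(entries, ["radio astronomy", "jupiter"]))
print(search_free_text(entries, ["radio", "astronomy", "Jupiter"]))
```

In this sketch the free-text search misses Dataset A because its description says "Jovian" and "decametric", while the controlled-keyword search finds both entries.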
The problem with the standardized information sources is the amount of time and
effort it can take to put information into the standard form. To create a DIF
with the minimum amount of required information takes only a few minutes. A
more thorough description of the data, which is very helpful to the user, may
require a few hours, depending on the amount of detail included. A person
unfamiliar with the DIF needs additional time to learn enough about it to
write one.
Though a standardized information source is the best for user purposes, it
takes cooperative effort and dedicated attention to gather information in a
standard form from a wide variety of sources. The NMD has been made as
comprehensive as possible with respect to NASA-funded data, but the NMD also
includes descriptions of space science data from other sources in the world. It
is difficult to quickly put all of that information in the DIF format. There
are compromises, however, which can help to alleviate this problem.
Overview information about datasets exists in many directories around the world
and the content of this information does not differ greatly from one directory
to another. Many of these directories exist in electronic form. It is not too
difficult to gather the directory databases in a central location and use the
WAIS or similar indexing software to provide text search capability into the
accumulated information. This differs from generalized WAIS searching of the
Web since the gathered information is limited to directory information on a
particular subject, such as space science data. For example, there is a WAIS
searching capability available in the NMD and the entry of the keywords
"radio", "astronomy", and "Jupiter" quickly yields a list of datasets relevant
to that topic and the information about where they are located.
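As a rough illustration of text searching over gathered directory entries, the sketch below builds a plain inverted index and intersects the posting sets for each keyword. It is a simple stand-in, not the WAIS software itself, and the entry identifiers and descriptions are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(entries):
    """Map each word to the set of entry ids whose text contains it."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for word in tokenize(text):
            index[word].add(entry_id)
    return index


def search(index, keywords):
    """Return the sorted ids of entries containing every keyword."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Invented entries standing in for records gathered from several directories.
gathered = {
    "dirA/001": "Jupiter radio astronomy observations, held at archive X",
    "dirB/017": "Solar radio bursts catalogue",
    "dirC/204": "Radio astronomy survey of Jupiter and Saturn",
}

index = build_index(gathered)
print(search(index, ["radio", "astronomy", "Jupiter"]))
```

Because the index covers only directory entries on one subject, every hit is a dataset description rather than an arbitrary page that happens to mention the words.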
Again, the success of searches depends on the commonality of terminology used
in the directory entries. To make searches more likely to yield comprehensive
information, it may be necessary to provide some "added value" to the databases
by attaching keywords chosen from standardized lists. The simple
attachment of keywords would not require as much effort as the restructuring of
the information into the DIF format. Also, simple skeletal DIFs could probably
be made from these value-added entries through automated processes. For
datasets which were considered to be particularly useful or important the full
process of DIF creation should be followed and the dataset would then be
retrievable by the wider variety of search techniques available through the
traditional NMD.
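One way such an automated process might look is sketched below: free-text words are mapped to controlled terms through a synonym table, and the result is wrapped in a skeletal record. The synonym table, the controlled terms, and the record fields are all assumptions made for illustration; they do not reproduce the DIF syntax or the NMD keyword lists.

```python
import re

# Hypothetical mapping from free-text words to controlled terms.
SYNONYMS = {
    "jupiter": "JUPITER",
    "jovian": "JUPITER",
    "decametric": "RADIO ASTRONOMY",
    "radio": "RADIO ASTRONOMY",
}


def attach_keywords(description):
    """Return the sorted controlled terms suggested by words in the description."""
    words = re.findall(r"[a-z]+", description.lower())
    return sorted({SYNONYMS[w] for w in words if w in SYNONYMS})


def skeletal_record(entry_id, description):
    """Wrap an existing description and its attached keywords in a minimal record."""
    return {
        "id": entry_id,
        "summary": description,
        "keywords": attach_keywords(description),
    }


record = skeletal_record("dirA/001", "Decametric emission from the Jovian magnetosphere")
print(record["keywords"])
```

A record produced this way would still need review by someone who knows the data, which is why the full DIF process remains worthwhile for the most important datasets.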
Even with the simplified process of gathering all forms of directory databases
into a single text-searchable database it is still difficult for any one
organization to gather comprehensive information and keep it up-to-date. It
would be better if each source could keep its information at its own
location and update it as changes or new additions occur. The Committee on
Earth Observation Satellites International Directory Network (CEOS IDN) is a
federation of network-connected directory database nodes which share
information with each other in the DIF format. One future scenario calls for
users to be able to log into any of the directory nodes and submit a search
query which would be sent to many or all of the nodes and return results to the
original node. Then each node need only have a database of information
describing the datasets in its domain. Even if there are not enough local
resources to put the information in the DIF format, general text descriptions
could be gathered and made text searchable through the WAIS distributed server
techniques. The advantage of this approach over simple Web text searches is
the narrowing of the search domain to directory information in specific subject
areas only. Again, the addition of controlled keywords to the entries in the
databases could make the search even more effective.
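The distributed scenario can be sketched as a set of nodes, each holding only its own entries, with a query entered at one node forwarded to its peers and the results merged at the originating node. The class, node names, and entries below are hypothetical, and the nodes are ordinary in-process objects rather than networked servers, so this shows only the flow of the query and not the IDN protocol.

```python
class DirectoryNode:
    """A directory node that knows only the datasets in its own domain."""

    def __init__(self, name, entries):
        self.name = name
        self.entries = entries  # entry id -> free-text description
        self.peers = []         # other nodes this node can forward queries to

    def search_local(self, keywords):
        """Return (node name, entry id) pairs whose text mentions every keyword."""
        words = [k.lower() for k in keywords]
        return [(self.name, entry_id)
                for entry_id, text in self.entries.items()
                if all(w in text.lower() for w in words)]

    def search_federation(self, keywords):
        """Search this node and all peers, merging results at this node."""
        results = self.search_local(keywords)
        for peer in self.peers:
            results.extend(peer.search_local(keywords))
        return sorted(results)


# Invented nodes and entries for illustration.
node_a = DirectoryNode("node-a", {"A1": "Jupiter radio astronomy data"})
node_b = DirectoryNode("node-b", {"B1": "Earth ozone measurements"})
node_c = DirectoryNode("node-c", {"C1": "Radio astronomy observations of Jupiter",
                                  "C2": "Solar wind plasma"})
node_a.peers = [node_b, node_c]

print(node_a.search_federation(["radio", "astronomy", "jupiter"]))
```

Each result keeps the name of the node that supplied it, so the user at the originating node can see where the dataset description is maintained.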
Conclusion
In conclusion, the Web has brought a revolution in the ease of access to a
wealth of information. It has not, however, replaced the need to assemble
information about specific subject areas and to make that information easy to
find through standardized techniques. NSSDC's NASA Master Directory offers
this value-added service in the search for NASA-funded and NASA-relevant data
and will continue to gather information on important space science datasets
world-wide. This service is readily available to Web users. With increased
use of text searching techniques on less rigidly formatted information, and
with potential cooperative effort by distributed data information sources, the
service can be made even more useful for the community. It is still important
to have as much commonality as possible in the databases being searched. The
amount of commonality incorporated will depend on resources available, but
newer search tools are lessening the amount of effort required. We invite you
to try the present system and send us your suggestions for improvement.
NMD ACCESS INFORMATION
World-Wide Web Uniform Resource Locator (URL)
http://nssdc.gsfc.nasa.gov/nmd/nmd.html
The NMD may also be found within the NSSDC Online Data and Information
Service (NODIS), which can be accessed in the following ways.
Internet
TELNET NSSDCA.GSFC.NASA.GOV
USERNAME: NODIS
NSI/DECnet (SPAN)
$ SET HOST NSSDCA
USERNAME: NODIS
Author: Miranda Beall
Curators: Erin Gardner and Miranda Beall
Responsible Official: Dr. Joseph H. King, Code 633
Last Revised: 21 Nov 1996 [EDG]