Generating a Domain-specific Thesaurus Automatically:
An Experiment on FlyBase

Hsinchun Chen 1, Bruce Schatz 2, Joanne Martinez 3, Tobun Dorbin Ng 4

Abstract:

This research describes an algorithmic approach to the automatic generation of thesauri that can serve as online search aids for scientific databases or electronic community systems. Using object filtering, automatic indexing, and co-occurrence analysis techniques, we extracted key terms and phrases representing Drosophila-related research knowledge from a core collection of Medline and Biosis abstracts. In addition to subject descriptors from the named databases, objects used in object filtering included gene and protein names, cellular and biological function names, researcher names, and subject descriptors from FlyBase, a database currently in use by molecular biologists in the Drosophila melanogaster-related research community. On average, each term had about 41 weighted neighboring terms indicating relevant concepts.

We tested the fly thesaurus in an experiment with six fly biologists of varying degrees of expertise and background. This study confirms earlier research (on a C. elegans worm thesaurus) that demonstrated the feasibility of an algorithmic approach to the vocabulary problem in online information retrieval. The knowledge and literature representing Drosophila research present important challenges, including non-standardized naming conventions, a long history with a great deal of vocabulary fluidity, and about ten times the volume of literature of worm research. The experiment showed that the thesaurus was an excellent memory-jogging device and that it supported learning and serendipity browsing. Despite some occurrences of obvious noise, the system was useful in suggesting relevant concepts for the researchers' queries. The experience and lessons learned during the fly thesaurus development and evaluation process are presented in detail.

Introduction

Biological research is highly data-intensive, and data accumulation in this area is growing rapidly because of massive efforts such as the Human Genome Initiative and other genome mapping projects. The complexity of the data and their inter-relationships have made genome informatics even more challenging. At present, the genomes of several organisms are being sequenced and mapped, including Caenorhabditis elegans (nematode worm), Drosophila melanogaster (fruit fly), Mus musculus (mouse), Homo sapiens (human), Escherichia coli (bacterium), Saccharomyces cerevisiae (yeast), and Arabidopsis thaliana (plant). Because communities in molecular biology form around organisms rather than techniques or problems, the results generated are stored in separate databases by each scientific community. Information about the identity, function, cytological and genetic location, mutations, and aberrations of any particular gene is therefore scattered among a variety of distributed, heterogeneous databases.

These systems range from elaborate, sophisticated database management systems to unstandardized machine-readable files created through text conversion. Distributed, heterogeneous databases are preferable to vast, centralized databases because of such inherent difficulties as file size, currency, updating, and retrieval. Nevertheless, the resulting problem of information scattering makes it difficult for scientists to share and compare information that exists in different files. Connectivity among databases is essential if scientists are to make full use of their contents. Adequate tools for structuring and accessing the contents of a diverse and heterogeneous collection of scientific databases are needed to unlock the knowledge embedded in them. Tools developed for this purpose must address the following problems: information overload, information scattering, vocabulary fluidity over time, unstandardized nomenclature, and semantic differences in the vocabularies used by the various subdisciplines of biology.

Scientific Databases, Electronic Community Systems, and Information Retrieval Problems

The Human Genome Initiative (HGI) offers tremendous challenges not only to the biology, biomedicine, and genetics research communities, but also to the information science and computer science communities. According to Courteau [13], the Human Genome Project ``will generate more data than any single project to date in biology,'' resulting in complete sequences and physical maps containing the location of every gene of the human genome and the genomes of other model organisms. The vast amount of knowledge accumulated during the project's scientific discovery process can only be managed with the use of computing technologies that support efficient and effective storage, retrieval, and analysis of information, that foster seamless distributed scientific collaboration, and that facilitate timely information dissemination and sharing.

FlyBase [20] is a set of linked databases designed to provide the Drosophila research community with access to broad and thorough coverage of molecular, genetic, and community information. The Drosophila community is one of the oldest groups in biological research. Most of the knowledge it has generated has been recorded in two sources, the ``Redbook'' [31] and the publications of the Drosophila Information Service. FlyBase gives promise of providing improved access to domain knowledge through use of computer technologies.

A consortium of Drosophila researchers, funded by grants from the U.S. National Institutes of Health (Washington) and Medical Research Council (London), guides FlyBase development and provides advice on preferred formats to groups developing allied databases and to individual contributors of data. FlyBase is accessible and searchable through Gopher (and Veronica) client software or obtainable via anonymous ftp from its primary archival site at the Department of Biology at Indiana University (IU). The primary database consists of a series of flat files that contain molecular and genetic data on Drosophila melanogaster, including the entire contents of the ``Redbook.''

In order to provide comprehensive access to Drosophila information, the IU Department of Biology also provides access to numerous ``allied'' databases developed at other institutions, each with its own focus. Included are a unified bibliography on Drosophila; pointers to nucleic acid and protein sequence databases; stock lists directing researchers to sources of clones and other biological reagents; a genetic map of Drosophila; a list of Drosophila genes sorted by function; lists of clones from European and American sources, and instructions for placing orders; a directory of Drosophila workers; and other associated databases useful to Drosophila researchers. Forums for informal communication between researchers are provided through the Drosophila Information Newsletter and a Bionet News group dedicated to Drosophila research. Each information source is resident on computers at its ``home'' institution. Access to the complete set of databases is organized through the IUBIO Gopher site. Table 1 summarizes some of the major online information sources of relevance to fly biologists.


 
Table 1: Multiple, disparate online information sources for fly biologists

Information Source                             Description
Redbook                                        ``The Genome of Drosophila melanogaster''
Drosophila Information Service                 molecular and genetic data on Drosophila
bibliography on Drosophila                     allied database
nucleic acid and protein sequence databases    allied databases
stock lists                                    sources of clones and other biological reagents
genetic map                                    Drosophila-related
genes sorted by function                       Drosophila-related
clones from Europe and U.S. sources            allied databases
Drosophila workers                             directory
Drosophila Information Newsletter              informal communication
Bionet News group                              online news group
Worm Community System                          allied database
 

In addition to various genome databases such as FlyBase, electronic community systems (ECS) have been proposed and implemented [43]. These have drawn significant attention recently due to the rapid proliferation and advancement of computing, databases, and telecommunication technologies. An electronic community system encodes a research community's information and knowledge and provides an online environment to support the manipulation of that knowledge. An advantage of this type of system over traditional databases is that an ECS enables users to browse the available knowledge easily, record their own knowledge for others to use, indicate authorizations for users to either view or annotate their own data, annotate entries from others' research, and form interrelationships between concepts [13]. An ECS is much like an electronic library where users can browse for relevant information, filter out information they do not currently need, and share data that they have collected.

Another novel characteristic of an ECS is its ability to handle a wide variety of community knowledge, both formal and informal. In order to ``live effectively within a community, one must have available both formal archival material and informal transient folklore'' [43]. An ECS provides both by integrating published literature about formal research findings with intermediate results, experimental protocols, laboratory notes, electronic bulletin board discussions, e-mail exchanges, and other informal data sources. It is therefore becoming an indispensable tool that allows researchers to browse, explore, and understand a vast and rapidly changing world of scientific knowledge while also creating a group memory.

The Worm Community System (WCS), a major NSF-funded collaboratory project, has been considered a model electronic community system [39] [44]. Constructed for C. elegans researchers, it offers traditional database functionalities along with literature, informal information and research lore, mapping programs and graphics, and the ability for users to browse, share, and filter a large amount of timely worm community knowledge. The system is intended to serve not only the entire community of worm biologists but also other related biology and biomedical community members [43] [13] [44]. In previous research, we developed a worm thesaurus based on the complete WCS literature [12]. The worm thesaurus was found to be an excellent memory-jogging tool and concept-based search aid and was incorporated into the WCS Release 2 made available to worm biologists in August 1993.

While the formats of the WCS and FlyBase are very different, each attempts to serve as a central, electronic ``clearinghouse'' for information a researcher needs to remain up-to-date. However, despite the potentially substantial benefits of being able to access, retrieve, and analyze data and information about other organisms, the use of distributed heterogeneous databases presents many significant obstacles. Besides the technical problems inherent in the use of various database formats, other sources of difficulties include information overload, information scattering, vocabulary fluidity over time, and differences in semantic meaning and nomenclature conventions between domains. In the context of information retrieval, providing access to multiple, disparate information sources (e.g., the fly information sources, as shown in Table 1) and supporting cross-domain information retrieval (e.g., fly biologists accessing information about recent findings in worm genetics) are two pressing research areas.

In this paper, we present an algorithmic approach to generation of a fly thesaurus. The main techniques used in our approach are presented in Section 3. Section 4 discusses in detail the algorithms and sample results for object filtering and automatic indexing of Drosophila literature and the algorithms and findings of the cluster analysis process. Section 5 presents the current status of our system implementation and an experiment we conducted involving subject area experts, novices, and (fly) community outsiders. Experimental results are discussed in detail in Section 6. Directions for future research are presented in Section 7.

An Algorithmic Approach to Automatic Thesaurus Generation: An Overview

Information retrieval in large document collections often requires vocabulary expansion aids because, as Blair and Maron [4] contend, ``vocabulary problems make high recall impossible in full-text databases.'' Gomez et al. [24] [23] found in their studies that ``searcher success is markedly improved by greatly increasing the number of names per object.'' They proposed an ``unlimited aliasing'' strategy, which allows essentially unlimited numbers of aliases for objects, to alleviate the vocabulary difference problem.

Many research groups have created vocabulary-based search aids for online information retrieval systems by making use of existing thesauri or dictionaries. Thesauri, in particular, exhibit a structure similar to human word-association networks. While these tools are able to provide the searcher with alternate terms to use in searching, they do not overcome the knowledge acquisition bottleneck [25]: the cognitive demand required of humans (indexers or domain experts) to create thesauri or dictionaries in the first place. An alternative approach to creating vocabulary-based search aids is based on automatic thesaurus generation.

The specific algorithms adopted in this research are object filtering, automatic indexing, and co-occurrence analysis. In the following section, we present an overview of these techniques and our modifications.

Generating a Fly Thesaurus Automatically

Although the above techniques had been employed in other applications, including the development of a highly domain-specific, up-to-date automatic thesaurus for the worm community, the volume of Drosophila literature and the unstandardized naming scheme utilized in the fly community offered special challenges. During our system development process, significant adaptation was required to meet the specific constraints and novel characteristics of FlyBase.

Specific concepts like gene names, function names, researchers, and subject descriptors obtained from FlyBase, and from the large number of fly-related documents in Medline and Biosis abstracts, provided the foundation for automatic thesaurus generation. FlyBase itself represents the long research history of the Drosophila community. The volume of information is estimated as 10 times that of the WCS. The automatic thesaurus generated for FlyBase used as a knowledge source a collection of 5,854 abstracts (more than 10 MBs of textual information) from Medline and Biosis CD-ROM, with a time span between 1983 and 1993. After interviewing several fly biologists at the University of Arizona, we chose to generate and represent the important concepts described in recent literature instead of capturing the concepts exhibited in all fly literature, because the latest concepts may be of most interest to current researchers. The experience and lessons learned during the system development and thesaurus evaluation process are reported below.

Object Filtering

We created four lists of fly-related keywords from several directories in FlyBase and from Biosis and Medline. Unlike the unified and systematic naming convention used in the worm community, the format in Drosophila is more free-style. Capitalization is important for gene and function names. Furthermore, there is no systematic way to name alleles in Drosophila. Special characters such as left parenthesis ``('', pipe ``|'', apostrophe ``''', comma ``,'', slash ``/'', and plus sign ``+'' are commonly used in the naming convention. Examples of such names include ``(-Glycerol phosphate dehydrogenase'', ``Ac-SD'', ``Su(Ste)'', ``l(3)4.15'', ``ort1 ninaE1'', ``ADP/ATP translocase'', ``fructose-1,6-bisphosphate aldolase'', ``(Na+ K+) ATPase ( subunit'', and ``Casein kinase II, | subunit''. Because of this unconstrained naming convention, a parser that had been developed previously and used for several applications [9] [12] [8] was revised to handle these scientific terms. Our revised parser used the space character as the delimiter and treated all special characters and punctuation marks as valid parts of a term. We created four groups of object filters as follows:

In total, we pre-identified 23,773 terms known to be specific to the fly domains using object filters. These terms were of four types: gene, function, researcher, and subject descriptor (terms identified by the four different filters were assigned to different types). We used these object lists to ``filter'' domain-specific concepts from the Biosis and Medline abstracts and index the objects to the abstracts.
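The filtering step described above can be illustrated with a minimal sketch. It is not the authors' parser (the prototype was written in ANSI C); the function names `tokenize` and `filter_objects`, the `max_words` limit, and the longest-match strategy are all assumptions made for illustration. What it takes from the text is that tokens are split on spaces only, that special characters remain part of a term, that matching is case-sensitive, and that each matched term carries one of the four types.

```python
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Split on whitespace only, so "(", ")", "/", "+", "," and similar
    # characters stay inside a token, as in names like "Su(Ste)".
    return text.split()


def filter_objects(
    text: str, lexicon: Dict[str, str], max_words: int = 4
) -> List[Tuple[str, str]]:
    """Return (term, type) pairs for known terms found in text.

    lexicon maps a term to its type (hypothetical layout), e.g.
    {"Su(Ste)": "gene"}. Matching is case-sensitive and prefers the
    longest phrase starting at each position (an assumed strategy).
    """
    tokens = tokenize(text)
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lexicon:
                found.append((candidate, lexicon[candidate]))
                i += n
                break
        else:
            # No known term starts at this token; move on.
            i += 1
    return found
```

With a lexicon containing ``Su(Ste)'' and ``l(3)4.15'' as genes and ``ADP/ATP translocase'' as a function, a sentence mentioning all three yields three typed matches in order of appearance. One limitation of this sketch is that sentence punctuation attached to a token (for example a trailing comma) prevents a match, so a real parser would need additional handling.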

Automatic Indexing and Term Weighting

After documents were ``filtered'' using the four lists, the remaining text in the abstracts was processed by the automatic indexing routine. The following steps were executed in order:

Co-occurrence Analysis

After the concept descriptors for each abstract were identified, we proceeded to perform term co-occurrence analysis for all documents in the document collection. (We refer to the thesaurus as a concept space to distinguish it from the information space represented by the Biosis and Medline documents.)
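As a rough illustration of term co-occurrence analysis, the sketch below builds weighted, directed links between terms from per-document term lists. The weighting used here (the fraction of documents containing term a that also contain term b) is a simple stand-in chosen for clarity, not the weighting function used in this research. The function name `cooccurrence_weights` and the `max_links` parameter are hypothetical; the latter only mirrors the idea of capping the number of links kept per term.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, Iterable, List, Optional, Tuple


def cooccurrence_weights(
    doc_terms: Iterable[List[str]], max_links: Optional[int] = None
) -> Dict[str, List[Tuple[str, float]]]:
    """Map each term to its neighbours, sorted by decreasing weight.

    weight(a -> b) = (documents containing both a and b)
                     / (documents containing a)
    This is an assumed, simplified asymmetric weight.
    """
    doc_freq: Counter = Counter()
    pair_freq: Counter = Counter()
    for terms in doc_terms:
        unique = set(terms)  # count each term once per document
        doc_freq.update(unique)
        # Ordered pairs, so both a -> b and b -> a are counted.
        pair_freq.update(permutations(sorted(unique), 2))

    links: Dict[str, List[Tuple[str, float]]] = {}
    for (a, b), n_ab in pair_freq.items():
        links.setdefault(a, []).append((b, n_ab / doc_freq[a]))
    for a in links:
        # Highest weight first; ties broken alphabetically.
        links[a].sort(key=lambda pair: (-pair[1], pair[0]))
        if max_links is not None:
            links[a] = links[a][:max_links]
    return links


docs = [
    ["HOMEOTIC", "ANTC"],
    ["HOMEOTIC", "EMBRYONIC"],
    ["HOMEOTIC", "ANTC", "EMBRYONIC"],
    ["EMBRYONIC"],
]
links = cooccurrence_weights(docs)
print(links["ANTC"])  # [('HOMEOTIC', 1.0), ('EMBRYONIC', 0.5)]
```

The toy documents are invented. They show why directed weights can be useful: every document mentioning ANTC also mentions HOMEOTIC (weight 1.0), while only two of the three HOMEOTIC documents mention ANTC (weight about 0.67).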

Current Status and Experimental Design

Our prototype system, which consisted of a thesaurus generation component and a thesaurus query system, was developed in ANSI C and ran on SUN SPARC stations, DECstations, and DEC Alpha machines. It took 10.6 hours of CPU time to generate the thesaurus using a DECstation 5000/120 (25 MIPS, ULTRIX-based). The resulting size of the fly thesaurus was 13.5 MBs. Users can browse the thesaurus through a query interface that was previously developed for use with the worm thesaurus [12]. We first present a sample fly thesaurus query session and then discuss an experimental design that was adopted to evaluate the usefulness of the fly thesaurus and to identify areas for improvement.

A Sample Session

A sample (typical) sequence of query sessions is shown in Figures 2, 3, and 4. (This sample session comes from our earlier pilot study, in which the Fly Thesaurus Query System had loaded the complete fly thesaurus, i.e., without applying the 100-link threshold.) After a previous unrelated session had involved ``(cell) death'' (terms related to ``death'' were displayed in the main window), the user was interested in finding something about homeotic genes, a class of genes involved in body pattern formation in the developing fly. The term ``HOMEOTIC'' had been entered in the new query box. Although multiple terms may be searched together, the user elected to search only one term, as indicated by the lower case ``q'' entered for term 2 (Figure 2, bottom box). Figure 3 shows a display of co-occurring terms in decreasing order of weight. As expected, a number of gene names appeared on the list of relevant terms. Term 24, ``EMBRYONIC'', suggests that ``HOMEOTIC'' may be related to genes in the developing fly. Terms 1, 10, 11, 18, and 32 are all variations on the word ``antenna''. A group of genes important in the proper development and placement of the fly antennae is known to be clustered together on the chromosome and is called the ``Antennapedia Complex.'' Furthermore, terms 3, 7, 9, 12, 14, 15, and 16 are all gene names.

The user was interested in determining whether any of the system-suggested genes are included in the Antennapedia Complex, and marked terms 1 (ANTC), 2 (HOMEOTIC GENES), and 17 (COMPLEX) for further searching. The menu choice ``Use term(s) with marker'' under ``Thesaurus Re-Query'' re-activated the thesaurus. Figure 4 shows the result of the iterative query using the three selected terms. We see that ANTC (term 1) appears to function as a member of an immunoglobulin superfamily, and that the protein expressed by the gene appears to be glucose dehydrogenase. We also see that six of the gene names that appeared on the list of terms suggested for ``HOMEOTIC'' (3, 4, 7, 14, 15, and 16) directly relate to all three of the marked terms, as indicated in the square brackets following the co-occurrence weight.
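A multi-term re-query of this kind can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the query system combines weights across several query terms, so summing them is a guess, and the function name `requery`, the thesaurus layout (a dictionary of neighbour weights per term), and all weights in the example are hypothetical. The list of query-term numbers returned for each suggestion plays the role of the square brackets described above.

```python
from typing import Dict, List, Tuple


def requery(
    thesaurus: Dict[str, Dict[str, float]], query_terms: List[str]
) -> List[Tuple[str, float, List[int]]]:
    """Rank neighbours of several query terms.

    Returns (term, combined weight, query-term numbers) triples, where
    the numbers (1-based) say which query terms the suggestion is
    linked to. Combining by summation is an assumption.
    """
    scores: Dict[str, float] = {}
    sources: Dict[str, List[int]] = {}
    for idx, q in enumerate(query_terms, start=1):
        for neighbour, weight in thesaurus.get(q, {}).items():
            if neighbour in query_terms:
                continue  # do not suggest the query terms themselves
            scores[neighbour] = scores.get(neighbour, 0.0) + weight
            sources.setdefault(neighbour, []).append(idx)
    # Highest combined weight first; ties broken alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return [(t, scores[t], sources[t]) for t in ranked]


# Invented links and weights, for illustration only.
toy = {
    "ANTC": {"SCR": 0.5, "DFD": 0.25},
    "HOMEOTIC GENES": {"SCR": 0.5, "UBX": 0.25},
    "COMPLEX": {"SCR": 0.25},
}
for term, weight, linked in requery(toy, ["ANTC", "HOMEOTIC GENES", "COMPLEX"]):
    print(term, weight, linked)
```

In this toy data only SCR is linked to all three marked terms, so it is listed first with `[1, 2, 3]`, which corresponds to a suggestion that ``directly relates to all three of the marked terms.''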

A user can iteratively query additional marked system-suggested terms as many times as needed to arrive at a list of sufficiently specific terms. Alternatively, the thesaurus can be re-activated by entering new user-suggested terms. Terms selected by a user during the iterative thesaurus browsing process are recorded in a separate area. The next stage of development for this thesaurus will involve traversal from the fly concept space into the previously generated worm concept space and will feature capability to retrieve worm documents using fly query terms through the help of a conjoined fly-worm concept space. This fly-worm traversal project aims to examine issues related to cross-domain vocabulary switching by developing automatic domain-specific thesauri and merging multiple thesauri and by providing spreading activation algorithms for concept traversal [11].


  
Figure 2: A user entered HOMEOTIC in the Current Selection box of the Fly Thesaurus Query System


  
Figure 3: The user selected three new terms, ANTC, HOMEOTIC GENES, and COMPLEX


  
Figure 4: The user invoked the fly thesaurus using all four query terms

Experimental Design

A fly thesaurus evaluation experiment was conducted in Fall 1993. The experiment consisted of two parts: a term association experiment and a searcher browsing experiment. In order to investigate the effects of a searcher's background on thesaurus usage, six subjects at three levels of expertise were selected. Subjects 1 and 2 were considered experts in Drosophila biology. Both were faculty members with more than 9 years' experience in Drosophila genetics and had published numerous papers in this area. Subjects 3 and 4 were considered novices: one was a doctoral student in Molecular and Cellular Biology; the other was a postdoctoral fellow in Ecology and Evolutionary Biology. Both had worked in laboratories dedicated to Drosophila research for at least 2 years. Subjects 5 and 6 were considered outsiders. Both worked outside the fly domain but had been exposed to fly concepts and had reasons to be interested in using a fly thesaurus. Subject 5 was a faculty member with limited previous experience in a Drosophila research laboratory (a 1-year sabbatical). Subject 6, a doctoral student with extensive experience in worm genetics, had frequently attended fly-worm joint seminars.

Results of Fly Thesaurus Evaluation

Results of the Association Experiment

Results of the Browsing Experiment

The subjective evaluation of the fly thesaurus reported here is the result of analyzing logged browsing sessions and transcripts of the subjects' verbal protocols. During browsing, subjects were asked to think aloud and to give specific comments, observations, or suggestions. These protocols provided clues for system improvement.

A search involved either typing in a new term (user-suggested) or marking system-suggested terms (iterations). A new search was defined as an entry of a searcher's own term. The online thesaurus suggested additional terms, which were shown on the system display. Whether they were using their own terms or system-suggested terms, most subjects used terms that they were either familiar with or curious about. Analysis of the logged search sessions revealed that the six subjects performed a total of 19 searches. Most search topics were related to biological functions or phenomena (e.g., ``LASER ABLATION,'' ``SPERMATOCYTE,'' ``SPERMATOGENESIS,'' ``LONGIVITY,'' ``LINEAGE PATTERN,'' etc.). However, a few search topics were related to author names (e.g., ``John, T.'') and gene names (e.g., ``SPE-12,'' ``SPE-26''). The greatest number of iterations following a new search was three, with nearly 90% (17/19) of all searches being followed by 0, 1, or 2 iterations. This pattern is consistent with a comment from Subject 1, who noted that after 3 iterations she was getting no new terms or information and that the same set of terms was reappearing.

As a result of our analysis of the session logs and verbal protocols, a taxonomy was developed to represent the observations of the subjects (see Figure 9). Included in this taxonomy are problems identified, which may be taken as potential pitfalls to consider in development of a domain-specific automatic thesaurus, and advantages/benefits identified, which may be taken as potential usefulness of such a thesaurus. This experiment gave us a better idea of how the thesaurus could be modified and used for more real-life purposes and by different user groups.


  
Figure 9: A taxonomy of system problems as identified by searchers

Potential Pitfalls

A. Quality of nodes and links:

B. Quality of System and Interface

Some of the pitfalls reported can be resolved algorithmically, while others may require manual attention. After the experiment, we examined the high-frequency terms in the fly thesaurus and manually removed many terms which are noise and/or too general to be useful for biologists. More extensive fly thesaurus development work that will involve a larger fly collection is also planned. A new interface based on the user-friendly and accessible WWW has recently been created. The current fly thesaurus can be accessed at: http://bpaosf.bpa.arizona.edu:8000/cgi-bin/BioQuest.

Potential Usefulness

Verbal protocols revealed two significant, novel areas of potential usefulness.

A. Serendipity Discovery:

One expert initially was concerned about what she termed the ``misdirectedness'' of the thesaurus's search method and expressed frustration with it. However, after a few searches, she conceded that, ``It's like a...word thesaurus in that there are probably going to be things in here that I absolutely never would have thought about...and routes through the data, through the literature that I never would take without it.''

Most users thought the thesaurus would be useful in introducing them to various aspects of the domain that were beyond their present level of experience. One expert and one novice found that imagining a scenario in which multiple terms were related led them to deduce possible contexts for seemingly unrelated terms. We have not seen this kind of creative use of a thesaurus reported before. Two subjects (an expert and an outsider) did not see this potential and were primarily interested in seeing the things they knew to be related and relevant. They viewed anything unexpected as noise.

B. Memory-Jogging:

The second novel use of the fly thesaurus is memory-jogging. One novice searched for the gene ``wingless''. The term list retrieved contained the term ``signalling'', which triggered a spark of recognition in the subject. She said, ``Let's try `signalling', because `wingless' is presumed to be some sort of signalling molecule.'' Here the thesaurus's emphasis on knowledge discovery and memory enhancement, combined with the subject's own deductive reasoning, triggered her creativity. This research indicates that automatic thesauri have the potential to improve searching beyond what is possible with manual thesauri.

In summary, the results from the experiment were encouraging. The thesaurus suggested relevant terms and concepts that were not only helpful for different users but also useful in spurring users' ideas and their desire to acquire knowledge.

Conclusion and Future Directions

Our most immediate future research directions, as part of a long-term effort to develop a more efficient and ``intelligent'' framework and design for the management, retrieval, sharing, and dissemination of information for distributed scientific computing, include the following:

The rationale behind our research is that, instead of relying on information specialists knowledgeable in several subject areas to perform term matching and consultation for different users, we could automatically create domain-specific thesauri tailored to the vocabularies and concepts of the related disciplines and develop an online search aid to bridge the vocabulary differences. We hope that, by expanding the concept (terminology and linkage) coverage of the underlying databases, information retrieval systems will eventually be able to assist in seamless and ``intelligent'' concept-based information retrieval.

Acknowledgments

This project was supported mainly by two NSF grants: the NSF CISE Research Initiation Award, IRI-9211418, 1992-1994 (H. Chen, ``Building a Concept Space for an Electronic Community System'') and NSF CISE Special Initiative on Coordination Theory and Collaboration Technology, IRI-9015407, 1990-1993 (B. Schatz et al., ``Building a National Collaboratory Testbed''). We would also like to thank the faculty and students of the Molecular and Cellular Biology Department, Ecology and Evolutionary Biology Department, Anatomy Department, and Biochemistry Department at the University of Arizona for their kind assistance and valuable suggestions, in particular, those of Dr. Samuel Ward, Dr. Danny Brower, Dr. John Clark, Dr. John Little, Dr. Lynn Manceau, Dr. Mary Rykowski, Ms. Alicia Minniti, and Ms. Lisa Werner.

Bibliography

1
T. Ahlswede and M. Evens.
Generating a relational lexicon from a machine-readable dictionary.
International Journal of Lexicography, 1(3):214-237, 1988.

2
M. J. Bates.
Subject access in online catalogs: a design model.
Journal of the American Society for Information Science, 37(6):357-376, November 1986.

3
L. M. Bellamy and L. Bickham.
Thesaurus development for subject cataloging.
Special Libraries, pages 9-15, Winter 1989.

4
D. C. Blair and M. E. Maron.
An evaluation of retrieval effectiveness for a full-text document-retrieval system.
Communications of the ACM, 28(3):289-299, 1985.

5
A. Y. Chamis.
Vocabulary Control and Search Strategies in Online Searching.
Greenwood Press, New York, 1991.

6
M. A. Chaplan.
Mapping Laborline thesaurus terms to Library of Congress subject headings: Implications for vocabulary switching.
Library Quarterly, 65(1):39-61, January 1995.

7
H. Chen and V. Dhar.
Cognitive process as a basis for intelligent retrieval systems design.
Information Processing and Management, 27(5):405-432, 1991.

8
H. Chen, P. Hsu, R. Orwig, L. Hoopes, and J. F. Nunamaker.
Automatic concept classification of text from electronic meetings.
Communications of the ACM, 37(10):56-73, October 1994.

9
H. Chen and K. J. Lynch.
Automatic construction of networks of concepts characterizing document databases.
IEEE Transactions on Systems, Man and Cybernetics, 22(5):885-902, September/October 1992.

10
H. Chen, K. J. Lynch, K. Basu, and D. T. Ng.
Generating, integrating, and activating thesauri for concept-based document retrieval.
IEEE EXPERT, Special Series on Artificial Intelligence in Text-based Information Systems, 8(2):25-34, April 1993.

11
H. Chen and D. T. Ng.
An algorithmic approach to concept exploration in a large knowledge network (automatic thesaurus consultation): symbolic branch-and-bound vs. connectionist Hopfield net activation.
Journal of the American Society for Information Science, 46(5):348-369, June 1995.

12
H. Chen, B. R. Schatz, T. Yim, and D. Fye.
Automatic thesaurus generation for an electronic community system.
Journal of the American Society for Information Science, 46(3):175-193, April 1995.

13
J. Courteau.
Genome databases.
Science, 254:201-207, October 11, 1991.

14
J. P. Courtial and J. Pomian.
A system based on associational logic for the interrogation of databases.
Journal of Information Science, 13:91-97, 1987.

15
C. J. Crouch.
An approach to the automatic construction of global thesauri.
Information Processing and Management, 26(5):629-640, 1990.

16
C. J. Crouch and B. Yang.
Experiments in automatic statistical thesaurus construction.
In Proceedings of the Fifteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 77-88, Copenhagen, Denmark, June 21-24, 1992.

17
L. B. Doyle.
Indexing and abstracting by association.
American Documentation, 13(4):378-390, October 1962.

18
F. C. Ekmekcioglu, A. M. Robertson, and P. Willett.
Effectiveness of query expansion in ranked-output document retrieval systems.
Journal of Information Science, 18:139-147, 1992.

19
B. Everitt.
Cluster Analysis, 2nd Edition.
Heinemann Educational Books, London, England, 1980.

20
FlyBase.
The Drosophila Genetic Database.
Available from the ftp.bio.indiana.edu network server and Gopher site, 1993.

21
E. A. Fox.
Development of the CODER system: A testbed for artificial intelligence methods in information retrieval.
Information Processing and Management, 23(4):341-366, 1987.

22
E. A. Fox, J. T. Nutter, T. Ahlswede, M. Evens, and J. Markowitz.
Building a large thesaurus for information retrieval.
In 2nd Conference on Applied Natural Language Processing, B. Ballard, Editor, pages 101-108, Association for Computational Linguistics, Morristown, NJ: Bell Communications Research, 1988.

23
G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais.
The vocabulary problem in human-system communication.
Communications of the ACM, 30(11):964-971, November 1987.

24
L. M. Gomez, C. C. Lochbaum, and T. K. Landauer.
All the right words: finding what you want as a function of the richness of indexing vocabulary.
Journal of the American Society for Information Science, 41(8):547-559, 1990.

25
F. Hayes-Roth, D. A. Waterman, and D. Lenat.
Building Expert Systems.
Addison-Wesley, Reading, MA, 1983.

26
S. Jones, M. Gatford, S. Robertson, M. Hancock-Beaulieu, J. Secker, and S. Walker.
Interactive thesaurus navigation: intelligent rules OK?
Journal of the American Society for Information Science, 46(1):52-59, 1995.

27
Y. W. Kim and J. H. Kim.
A model of knowledge based information retrieval with hierarchical concept graph.
Journal of Documentation, 46:113-116, 1990.

28
S. D. Knapp.
Creating BRS/TERM, a vocabulary database for searchers.
DATABASE, 7(4):70-75, December 1984.

29
M. E. Lesk.
Word-word associations in document retrieval systems.
American Documentation, 20(1):27-38, January 1969.

30
D. A. Lindberg and B. L. Humphreys.
The UMLS knowledge sources: Tools for building better user interfaces.
In Proceedings of the Fourteenth Annual Symposium on Computer Applications in Medical Care, pages 121-125, Los Alamitos, CA: Institute of Electrical and Electronics Engineers, November 4-7, 1990.

31
D. L. Lindsley and G. G. Zimm.
The Genome of Drosophila melanogaster.
Academic Press, New York, NY, 1992.

32
A. T. McCray and W. T. Hole.
The scope and structure of the first version of the UMLS semantic network.
In Proceedings of the Fourteenth Annual Symposium on Computer Applications in Medical Care, pages 126-130, Los Alamitos, CA: Institute of Electrical and Electronics Engineers, November 4-7, 1990.

33
R. T. Niehoff.
Development of an integrated energy vocabulary and the possibilities for on-line subject switching.
Journal of the American Society for Information Science, 27(1):3-17, January-February 1976.

34
R. T. Niehoff and S. Kwasny.
The role of automated subject switching in a distributed information network.
Online Review, 3(2):181-194, 1979.

35
H. J. Peat and P. Willett.
The limitations of term co-occurrence data for query expansion in document retrieval systems.
Journal of the American Society for Information Science, 42(5):378-383, June 1991.

36
T. Petersen.
The AAT: A model for the restructuring of LCSH.
Journal of Academic Librarianship, 9(4):207-210, September 1983.

37
T. Petersen.
Developing a new thesaurus for art and architecture.
Library Trends, 38(4):644-658, Spring 1990.

38
A. B. Piternick.
Searching vocabularies: a developing category of online search tools.
Online Review, 8(5):441-449, 1984.

39
R. Pool.
Beyond database and e-mail.
Science, 261:841-843, August 13, 1993.

40
E. Rasmussen.
Clustering algorithms.
In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Editors, Prentice Hall, Englewood Cliffs, NJ, 1992.

41
B. F. Ryan, B. L. Joiner, and T. A. Ryan.
MINITAB Handbook, 2nd Edition.
PWS-KENT Publishing Company, Boston, MA, 1985.

42
G. Salton.
Automatic Text Processing.
Addison-Wesley Publishing Company, Inc., Reading, MA, 1989.

43
B. R. Schatz.
Building an electronic community system.
Journal of Management Information Systems, Special Issue, Winter 1991/1992.

44
B. R. Schatz.
Building collaboratories for molecular biology.
In National Collaboratories: Applying Information Technology for Scientific Research, National Research Council, National Academy Press, Washington, DC, 1993.

45
H. E. Stiles.
The association factor in information retrieval.
Journal of the Association for Computing Machinery, 8(2):271-279, 1961.

hchen@bpa.arizona.edu