Generating a Domain-specific Thesaurus Automatically:
An Experiment on FlyBase

Hsinchun Chen 1, Bruce Schatz 2, Joanne Martinez 3, Tobun Dorbin Ng 4

Abstract:

This research describes an algorithmic approach to the automatic generation of thesauri that can serve as online search aids for scientific databases or electronic community systems. Using object filtering, automatic indexing, and cluster analysis techniques, we extracted key terms and phrases representing Drosophila-related research knowledge from a core collection of Medline and Biosis abstracts. In addition to subject descriptors from the named databases, objects used in object filtering included gene and protein names, cellular and biological function names, researcher names, and subject descriptors from FlyBase, a database currently in use by molecular biologists in the Drosophila melanogaster-related research community. On average, each term had about 41 weighted (0..1) neighboring terms indicating relevant concepts.

We tested the fly thesaurus in an experiment with six fly biologists of varying degrees of expertise and background. This study confirms earlier research (on a C. elegans worm thesaurus) that demonstrated the feasibility of an algorithmic solution to the information overload problem in databases and the vocabulary problem in online information retrieval. The knowledge and literature representing Drosophila research present important challenges, including non-standardized naming conventions, a long history with a great deal of vocabulary fluidity, and a volume of literature about ten times that of the worm research. The experiment showed that the thesaurus was an excellent memory-jogging device and that it supported learning and serendipitous browsing. Despite some occurrences of obvious noise, the system was useful in suggesting relevant concepts for the researchers' queries. The experience and lessons learned during the fly thesaurus development and evaluation process are presented in detail.

Introduction

Biological research is highly data-intensive, and data accumulation in this area is growing extremely rapidly because of massive efforts such as the Human Genome Initiative and other genome mapping projects. At present, the genomes of several organisms are being sequenced and mapped, including Caenorhabditis elegans (nematode worm), Drosophila melanogaster (fruit fly), Mus musculus (mouse), Homo sapiens (human), Escherichia coli (bacterium), Saccharomyces cerevisiae (yeast), and Arabidopsis thaliana (plant). Because communities in molecular biology form around organisms rather than techniques or problems, the results generated are stored by each scientific community in its own set of databases. Information about the identity, function, cytological and genetic location, mutations, and aberrations of any particular gene is therefore scattered among a variety of distributed, heterogeneous databases.

These systems range from elaborate, sophisticated database management systems to unstandardized machine-readable files created through text conversion. Distributed, heterogeneous databases are preferable to vast, centralized databases because of such inherent difficulties as file size, currency, updating, and retrieval. Nevertheless, the resulting problem of information scattering makes it difficult for scientists to share and compare information that exists in different files. Connectivity among databases is essential if scientists are to make full use of their contents. Adequate tools for structuring and accessing the contents of a diverse and heterogeneous collection of scientific databases are needed to unlock the knowledge embedded in them. Tools developed for this purpose must address the following problems: information overload, information scattering, vocabulary fluidity over time, unstandardized nomenclature, and semantic differences in the vocabulary used by the various subdisciplines of biology.

Scientific Databases, Electronic Community Systems, and Information Retrieval Problems

The Human Genome Initiative (HGI) offers tremendous challenges not only to the biology, biomedicine, and genetics research communities, but to the information science and computer science communities as well. According to Courteau [7], the Human Genome Project ``will generate more data than any single project to date in biology,'' resulting in complete sequences and physical maps containing the location of every gene of the human genome and the genomes of other model organisms. The vast amount of knowledge accumulated during the project's scientific discovery process can only be managed with the use of computing technologies that support efficient and effective storage, retrieval, and analysis of information, foster seamless distributed scientific collaboration, and facilitate timely information dissemination and sharing.

FlyBase is a set of linked databases designed to provide the Drosophila research community with access to broad and thorough coverage of molecular, genetic, and community information. The Drosophila community is one of the oldest groups in biological research. The knowledge it has generated has been recorded mostly in two sources: the ``Redbook'' (i.e., ``The Genome of Drosophila melanogaster,'' by D.L. Lindsley and G.G. Zimm, Academic Press, 1992) and the publications of the Drosophila Information Service. FlyBase promises improved access to domain knowledge through use of computer technologies.

A consortium of Drosophila researchers funded by grants from the U.S. National Institutes of Health (Washington) and Medical Research Council (London) is responsible for its development. This group guides the development and provides advice on preferred formats to groups developing allied databases and to individual contributors of data. FlyBase is accessible and searchable through Gopher client software, or obtainable via anonymous ftp from its primary archival site at the Department of Biology at Indiana University (IU). The primary database consists of a series of flat files that contain molecular and genetic data on Drosophila melanogaster, including the entire contents of the ``Redbook.''

In order to provide comprehensive access to Drosophila information, the IU Department of Biology also provides access to numerous ``allied'' databases developed at other institutions, each with its own focus. Included are a unified bibliography on Drosophila; pointers to nucleic acid and protein sequence databases; stock lists directing researchers to sources of clones and other biological reagents; a genetic map of Drosophila; a list of Drosophila genes sorted by function; lists of clones from European and American sources, with instructions for placing orders; a directory of Drosophila workers; and other associated databases useful to Drosophila researchers. A forum for informal communication between researchers is provided through the Drosophila Information Newsletter and a Bionet News group dedicated to Drosophila. Each information source is resident on computers at its ``home'' institution. Access to the complete set of databases is organized through the IUBIO gopher site.

In addition to the various genome databases such as FlyBase, electronic community systems (ECS) have been proposed and implemented, and have drawn significant attention recently due to the rapid proliferation and advancement of computing, database, and telecommunication technologies. An electronic community system encodes a research community's information and knowledge and provides an online environment to support the manipulation of that knowledge. An ECS enables researchers of a scientific community to enter and share community knowledge and findings in a timely manner and in a distributed environment, and thereby function more efficiently and effectively within the community.

An advantage of this type of system over traditional databases is that an ECS enables users to browse the available knowledge easily, record their own knowledge for others to use, indicate authorizations for users to either view or annotate their own data [7], annotate entries from others' research, and form interrelationships between concepts [19]. An ECS is much like an electronic library where users can browse for relevant information, filter out the information they do not currently need, and share the data that they have found [19].

Another novel characteristic of an ECS is its ability to handle a wide variety of community knowledge, both formal and informal [19]. In order to ``live effectively within a community, one must have available both the formal archival material and the informal transient folklore'' [19]. An ECS does this by integrating published literature about formal research findings with intermediate results, experimental protocols, laboratory notes, electronic bulletin board discussions, e-mail exchanges, and other informal data sources. It is therefore becoming an indispensable tool that allows researchers to browse, explore, and understand a vast and rapidly changing world of scientific knowledge while it also creates a group memory.

The Worm Community System (WCS), a major NSF-funded collaboratory project, has been considered a model electronic community system [15] [20]. Constructed for C. elegans researchers, it offers traditional database functionalities along with literature, informal information and research lore, mapping programs and graphics, and the ability for users to browse, share, and filter a large amount of timely worm community knowledge. The system is intended to serve not only the entire community of worm biologists but also other related biology and biomedical community members [19] [7] [20]. In previous research, we developed a worm thesaurus based on the complete WCS literature [6]. The worm thesaurus was found to be an excellent memory-jogging tool and a concept-based search aid and was incorporated into the WCS. Release 2 of the WCS was made available to worm biologists in August 1993.

While the formats of the WCS and FlyBase are very different, each attempts to serve as a central, electronic ``clearinghouse'' for information a researcher needs to remain up-to-date. However, despite the potential substantial benefits of being able to access, retrieve, and analyze data and information about homologues of their gene or protein in other organisms, the use of distributed heterogeneous databases presents many significant obstacles. Besides the technical problems inherent in the use of various database formats, other sources of difficulties include information overload, information scattering, vocabulary fluidity over time, and differences in semantic meaning and nomenclature conventions between domains.

In this paper, we present an algorithmic approach to generation of a fly thesaurus. The main techniques used in our approach are presented in Section 3. Section 4 discusses in detail the algorithms and sample results for object filtering and automatic indexing of Drosophila literature and the algorithms and findings of the cluster analysis process. Section 5 presents the current status of our system implementation and an experiment we conducted involving subject area experts, novices, and (fly) community outsiders. Experimental results are discussed in detail in Section 6. Directions for future research are presented in Section 7.

An Algorithmic Approach to Automatic Thesaurus Generation: An Overview

In this research, our aim was to apply an algorithmic approach to the generation of a robust knowledge base based on statistical correlation analysis of the concepts (knowledge) embedded in the documents of real-life, textual databases. The research output consisted of a thesaurus-like knowledge base, which can aid in concept-based information management and retrieval. This automatically-generated thesaurus component, akin to a manually-created thesaurus, can also play an important role in solving a searcher's vocabulary problems during information retrieval.

In information science, use of a thesaurus or a knowledge base for ``intelligent'' information retrieval has drawn significant attention in recent years. There have been many attempts to capture experts' domain knowledge for information retrieval. A few examples are described below. CoalSORT [14], a knowledge-based interface, facilitates the use of bibliographic databases on coal technology. A semantic network, representing an expert's domain knowledge, embodies the system's intelligence. Fox's CODER system [11] consists of a thesaurus that was generated from the Handbook of Artificial Intelligence and Collins Dictionary. The ``Intelligent Intermediary for Information Retrieval'' (I3R), developed by Croft [8], consists of a group of ``experts'' that communicate via a common data structure, called the blackboard. The system consists of a user model builder, a query model builder, a thesaurus expert, a search expert (for suggesting statistics-based search strategies), a browser expert, and an explainer. Chen and Dhar [2] incorporated a portion of the Library of Congress Subject Headings into the design of an intelligent retrieval system. The system adopted a branch-and-bound spreading activation algorithm to assist users in articulating their queries. The National Library of Medicine's Unified Medical Language System (UMLS) project aims to build an intelligent automated system that understands biomedical terms and their interrelationships and uses this understanding to help users retrieve and organize information from machine-readable sources [13] [12]. The UMLS includes a Metathesaurus, a Semantic Network, and an Information Sources Map. The Metathesaurus contains information about biomedical concepts and their representation in more than 10 different vocabularies and thesauri.

Most of the knowledge bases adopted in these intelligent systems were either generated manually from domain experts, using the knowledge acquisition process, or derived from existing thesauri (which were also created manually in the first place by some indexing/subject experts). A complementary approach to manual knowledge base creation is the automatic thesaurus generation approach.

Virtually all techniques for automatic thesaurus generation are based on the statistical co-occurrence of word types in text [6] [4] [9] [18]. Similarity coefficients are often obtained between pairs of distinct terms based on coincidences in term assignments to the documents of the collection. For example, a cosine computation can be used to generate normalized term similarities between 0 and 1. When pairwise similarities are obtained between all term pairs, an automatic term-classification process such as single-link or complete-link classification can group into common classes all terms with sufficiently large pairwise similarities [10] [17] [18]. The terms in the thesaurus classes can replace the initial search terms and be used to increase retrieval recall.
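As a minimal sketch of the general technique described in this paragraph (and not of the specific procedure used for the fly thesaurus), the following Python fragment computes cosine similarities between terms from their document assignments and then groups terms with a single-link rule. The toy documents, the binary term vectors, the helper names, and the threshold of 0.9 are all illustrative assumptions.

```python
from math import sqrt

# Hypothetical toy collection: each document is the set of terms assigned to it.
docs = [
    {"homeotic", "antennapedia", "embryo"},
    {"homeotic", "antennapedia"},
    {"wingless", "signalling"},
    {"wingless", "signalling", "embryo"},
]

def term_vectors(docs):
    """Map each term to a binary vector of its document assignments."""
    terms = sorted(set().union(*docs))
    return {t: [1 if t in d else 0 for d in docs] for t in terms}

def cosine(u, v):
    """Cosine similarity; lies between 0 and 1 for non-negative vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def single_link_classes(vectors, threshold):
    """Single-link grouping: two terms share a class whenever a chain of
    pairwise similarities >= threshold connects them."""
    terms = list(vectors)
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent pointers to the class representative.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    classes = {}
    for t in terms:
        classes.setdefault(find(t), set()).add(t)
    return sorted(sorted(c) for c in classes.values())

vectors = term_vectors(docs)
classes = single_link_classes(vectors, threshold=0.9)
print(classes)
```

In this toy collection, terms that are always assigned to the same documents fall into one class, while ``embryo'', which co-occurs with both groups only half of the time, stays in a class of its own. Lowering the threshold merges classes, which is the usual trade-off between recall and noise in this kind of grouping.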

The specific algorithms adopted in this research include: object filtering, automatic indexing, and cluster analysis. In the following section, we present an overview of these techniques and our modifications.

Although the above techniques had been employed in other applications, including the development of a highly domain-specific, up-to-date automatic thesaurus for the worm community, the volume of Drosophila literature and the unstandardized naming scheme utilized in the fly community offered special challenges. During our system development process, significant adaptation was required to meet the specific constraints and novel characteristics of FlyBase.

Specific concepts like gene names, function names, researchers, and subject descriptors obtained from FlyBase, as well as the large number of documents accumulated in Medline and Biosis abstracts, provide the foundation for automatic thesaurus generation. FlyBase itself represents the long research history of the Drosophila community. The volume of information is estimated as 10 times that of the WCS. The reported generation of an automatic thesaurus for FlyBase uses as a knowledge source a collection of 5,854 abstracts (more than 10MB of textual information) from Medline and Biosis CD-ROM with a time span between 1983 and 1993. After interviewing several fly biologists at the University of Arizona, we chose to generate and represent the important concepts described in recent literature instead of capturing the concepts exhibited in all fly literature, because the latest concepts may be of most interest to current researchers. The experience and lessons learned during the system development process and thesaurus evaluation are reported below.

Generating a Fly Thesaurus Automatically

To generate a fly thesaurus from the fly literature, we performed object filtering, automatic indexing, and cluster analysis, in that order. Sample results are shown in this section.

Object Filtering

We created the following lists of fly-related keywords from several directories in FlyBase and from Biosis and Medline. Unlike the unified and systematic naming convention used in the worm community, the format in Drosophila is more free-style. Capitalization is important for gene and function names. Furthermore, there is no systematic way to name alleles in Drosophila. Special characters like left parenthesis ``('', pipe ``|'', apostrophe ``''', comma ``,'', slash ``/'', and plus sign ``+'' are commonly used in the naming convention. Here are some examples of those names: ``(-Glycerol phosphate dehydrogenase'', ``Ac-SD'', ``Su(Ste)'', ``l(3)4.15'', ``ort1 ninaE1'', ``ADP/ATP translocase'', ``fructose-1,6-bisphosphate aldolase'', ``(Na+ + K+) ATPase ( subunit'', and ``Casein kinase II, | subunit''. Because of this unconstrained naming convention, a parser that had been developed previously and used for several applications [4] [6] [3] was revised to handle these scientific terms. We created four groups of object filters as follows:

In total, we identified 23,773 fly-specific terms, which served to identify important concepts in Biosis and Medline documents.
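To illustrate why free-style names call for special handling, here is a small Python sketch of an object filter. It is not the revised parser mentioned above; it is an assumed, simplified approach that escapes the special characters in each known name, matches case-sensitively, and prefers longer names over shorter ones. The function names and the sample sentence are hypothetical.

```python
import re

def build_object_filter(names):
    """Compile a case-sensitive matcher for a list of known object names."""
    # Longest names first, so a multi-word name wins over a shorter overlap.
    ordered = sorted(set(names), key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Plain \b word boundaries fail next to "(" or ")", so instead require
    # that the match is not embedded in a longer alphanumeric token.
    return re.compile(r"(?<![A-Za-z0-9])(?:" + alternatives + r")(?![A-Za-z0-9])")

def filter_objects(text, object_filter):
    """Return the known objects found in text and the text left over."""
    found = object_filter.findall(text)
    remainder = object_filter.sub(" ", text)
    return found, remainder

# Names taken from the examples in the text; the sentence is made up.
names = ["Su(Ste)", "l(3)4.15", "ADP/ATP translocase", "Ac-SD"]
text = ("Mutations in Su(Ste) and l(3)4.15 affect ADP/ATP translocase "
        "levels; su(ste) is not matched.")

object_filter = build_object_filter(names)
found, remainder = filter_objects(text, object_filter)
print(found)
```

The lowercase ``su(ste)'' is left in the remainder because capitalization is significant for gene names. The remainder is what a later indexing step would see once the known objects have been removed.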

Automatic Indexing

After documents were ``filtered'' by the four lists, the remaining texts in documents were processed by the automatic indexing routine. Automatic indexing was implemented mainly based on the procedure reported in [18]. The following steps were executed in order:

Cluster Analysis

After the concept descriptors for each abstract were identified, we proceeded to perform term co-occurrence analysis for all documents in the document collection. A term weighting scheme based on the Vector Space model [18] and an asymmetric similarity function [4] similar to the popular cosine function [10] were adopted for analysis. The blueprint for generating such a concept space (we refer to the thesaurus as a concept space to distinguish it from the information space represented by the Biosis and Medline documents) is shown below:
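The following Python sketch is not the authors' blueprint, which relies on the term weighting of [18] and the similarity function of [4]. It only illustrates, under a deliberately simplified assumption, how a co-occurrence weight can be asymmetric: the weight from term j to term k is taken here to be the number of documents containing both terms divided by the number of documents containing j. The function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import permutations

def asymmetric_weights(docs):
    """w[(j, k)] = (# docs containing both j and k) / (# docs containing j)."""
    doc_freq = Counter()
    co_freq = Counter()
    for terms in docs:
        unique = set(terms)
        doc_freq.update(unique)
        # Ordered pairs, so (j, k) and (k, j) are counted separately.
        co_freq.update(permutations(sorted(unique), 2))
    return {(j, k): n / doc_freq[j] for (j, k), n in co_freq.items()}

def neighbors(weights, term):
    """Neighboring terms of `term`, strongest link first."""
    linked = [(k, w) for (j, k), w in weights.items() if j == term]
    return sorted(linked, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical descriptor sets for four abstracts.
docs = [
    {"homeotic", "antc"},
    {"homeotic", "embryonic"},
    {"homeotic", "antc", "embryonic"},
    {"embryonic"},
]

weights = asymmetric_weights(docs)
print(neighbors(weights, "antc"))
print(neighbors(weights, "homeotic"))
```

With these toy documents the link from ``antc'' to ``homeotic'' has weight 1, because every abstract mentioning ``antc'' also mentions ``homeotic'', whereas the reverse link has weight 2/3. A symmetric measure such as the cosine would assign the same value in both directions.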

Current Status and Experimental Design

Our prototype system, which consisted of a thesaurus generation component and a thesaurus query system, was developed in ANSI C and ran on Sun SPARCstations, DECstations, and DEC Alpha. It took 10.6 hours of CPU time to generate the thesaurus using a DECstation 5000/120 (25 MIPS, ULTRIX-based). The resulting size of the fly thesaurus was 13.5 MB. Users browse the thesaurus through a query interface that was previously developed for use with the worm thesaurus [6]. We present a sample fly thesaurus query session first. We then discuss an experimental design which was adopted to evaluate the usefulness of the fly thesaurus and to identify areas for improvement.

A Sample Session

A sample sequence of query sessions is shown in Figures 2, 3, and 4. The user was interested in finding something about homeotic genes (after a previous session which involved cell death, as shown on the main window), a class of genes involved in body pattern formation in the developing fly. The term ``HOMEOTIC'' had been entered in the new query box. Although multiple terms may be searched together, the user elected to search only one term, as indicated by the lowercase ``q'' entered for term 2 (Figure 2, bottom box). Figure 3 shows a display of relevant terms in decreasing weighted order. As expected, we see that a number of gene names have appeared on the list of relevant terms. We also see that term 2 indicates that the term ``HOMEOTIC'' is descriptive of certain genes. Term 24, ``EMBRYONIC'', confirms that ``HOMEOTIC'' does indeed refer to genes in the developing fly. Terms 1, 10, 11, 18, and 32 are all variations on the word ``antenna''. This makes sense, too. A group of genes important in the proper development and placement of the fly antennae are known to be clustered together on the chromosome, and are called the ``Antennapedia Complex.'' Furthermore, terms 3, 7, 9, 12, 14, 15, and 16 are all gene names.

The user was interested in determining whether any of the system-suggested genes are included in the Antennapedia Complex, and marked terms 1 (ANTC), 2 (HOMEOTIC GENES), and 17 (COMPLEX) for further searching. The menu choice ``Use term(s) with marker'' under ``Thesaurus Re-Query'' re-activated the thesaurus. Figure 4 shows the result of the iterative query using the three selected terms. We see that ANTC (term 1) appears to function as a member of an immunoglobulin superfamily, and that the protein expressed by the gene is glucose dehydrogenase. We also see that six of the gene names that appeared on the list of terms suggested for ``HOMEOTIC'' (3, 4, 7, 14, 15, and 16) directly relate to all three of the marked terms, as indicated in the square brackets following the co-occurrence weight. Thus we can surmise that these are all member genes of the Antennapedia Complex.

A user can iteratively query more marked system-suggested terms as many times as needed to arrive at a list of sufficiently specific terms. Alternatively, the thesaurus can be re-activated by entering new user-suggested terms. Terms selected by a user during the iterative thesaurus browsing process are recorded in a separate area. For the fly thesaurus, the abstracts from which the terms were extracted are not retrievable. The next stage of development for this thesaurus will involve traversal from the fly concept space into the previously generated worm concept space, and will feature a facility to retrieve worm documents using fly query terms.
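The re-query step in this session can be sketched in Python as follows. This is a hedged illustration, not the query system itself: the concept space is assumed to be a dictionary from each term to its weighted neighbors, the weights and the names ``GENE A'' to ``GENE C'' are made up, and combining evidence by summing weights is our own simplifying assumption. The list of marked terms attached to each suggestion plays the role of the square brackets shown after the co-occurrence weight.

```python
def requery(concept_space, marked):
    """Suggest terms linked to the marked terms, noting which marked terms
    each suggestion is linked to."""
    scores = {}
    sources = {}
    for m in marked:
        for neighbor, weight in concept_space.get(m, {}).items():
            if neighbor in marked:
                continue  # do not suggest a term the user already marked
            scores[neighbor] = scores.get(neighbor, 0.0) + weight
            sources.setdefault(neighbor, []).append(m)
    # Terms linked to more marked terms come first, then higher total weight.
    ranked = sorted(scores, key=lambda t: (-len(sources[t]), -scores[t], t))
    return [(t, scores[t], sources[t]) for t in ranked]

# Hypothetical fragment of a concept space; all weights are invented.
concept_space = {
    "ANTC": {"GENE A": 0.5, "GENE B": 0.25, "HOMEOTIC GENES": 0.75},
    "HOMEOTIC GENES": {"GENE A": 0.5, "ANTC": 0.5},
    "COMPLEX": {"GENE A": 0.25, "GENE C": 0.75},
}
marked = ["ANTC", "HOMEOTIC GENES", "COMPLEX"]

suggestions = requery(concept_space, marked)
for term, score, linked_to in suggestions:
    print(term, score, linked_to)
```

Here ``GENE A'' is ranked first because it is linked to all three marked terms, which mirrors the reasoning used in the sample session to surmise membership in the Antennapedia Complex.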


  
Figure 2: Enter HOMEOTIC


  
Figure 3: Select new terms


  
Figure 4: Iterate with new terms

Experimental Design

A fly thesaurus evaluation experiment was conducted in Fall 1993. The experiment consisted of two parts: a term association experiment and a searcher browsing experiment. In order to investigate the effects of a searcher's background on thesaurus usage, six subjects at three levels of expertise were selected. Subjects 1 and 2 were considered experts in Drosophila biology. Both were faculty members with more than 9 years experience in Drosophila genetics, and had published numerous papers in this area. Subjects 3 and 4 were considered novices: one was a doctoral student in Molecular and Cellular Biology; the other was a postdoctoral fellow in Ecology and Evolutionary Biology. Both had worked in laboratories dedicated to Drosophila research for at least 2 years. Subjects 5 and 6 were considered outsiders. Both worked outside the fly domain, but had had exposure to fly concepts and had reasons to be interested in using a fly thesaurus. Subject 5 was a faculty member with limited previous experience in a Drosophila research laboratory (1 year sabbatical). Subject 6, a doctoral student with extensive experience in worm genetics, had frequently attended joint fly-worm seminars.

Results of Fly Thesaurus Evaluation

Results of the Association Experiment

Results of the Browsing Experiment

The subjective evaluation of the fly thesaurus reported here is the result of analyzing the logged browsing session and the transcripts of the subjects' verbal protocols. During browsing, subjects were asked to think aloud and to give specific comments, observations, or suggestions.

A search involved either typing in a new term (user-suggested), or marking a system-suggested term (iterations). A new search was defined as entry of a searcher's own term. The online thesaurus suggested additional terms, which were shown on an X-Window display. Whether they were using their own terms or system-suggested terms, most subjects used terms that they were either familiar with or curious about. Analysis of the logged search sessions revealed that the six subjects performed a total of 19 searches. All subjects performed 2, 3, or 4 new searches during their browsing sessions. Table 1 shows that the greatest number of iterations following a new search was three, with nearly 90% (17/19) of all searches being followed by 0, 1, or 2 iterations. This finding is underscored by a comment from Subject 1, who noted that after 3 iterations she was getting no new terms/information, and that the same set of terms was reappearing. We conclude that for most searches, the system will converge after as few as 2 or 3 iterations. This has interesting implications for implementation of the spreading activation algorithms in the cross-domain traversal stage of the project.


 
Table 1: Number of iterations per new search used by subjects while browsing

Expertise level   Subject no.   Search 1   Search 2   Search 3   Search 4
Expert            1             3          0          -          -
Expert            2             1          2          1          -
Novice            3             3          1          2          2
Novice            4             0          0          1          -
Outsider          5             1          1          0          -
Outsider          6             2          2          0          0
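The totals quoted for Table 1 can be re-derived directly from its entries. The short Python check below transcribes the table (the dictionary layout and the variable names are our own) and recomputes the number of searches and the share followed by at most two iterations.

```python
# Iterations per new search, transcribed from Table 1 (subject no. -> counts).
iterations = {
    1: [3, 0],
    2: [1, 2, 1],
    3: [3, 1, 2, 2],
    4: [0, 0, 1],
    5: [1, 1, 0],
    6: [2, 2, 0, 0],
}

all_searches = [n for per_subject in iterations.values() for n in per_subject]
total = len(all_searches)                              # number of new searches
at_most_two = sum(1 for n in all_searches if n <= 2)   # followed by 0, 1, or 2 iterations
share = at_most_two / total

print(total, at_most_two, round(100 * share, 1), max(all_searches))
```

This gives 19 searches, of which 17 (about 89.5%) were followed by two or fewer iterations, with a maximum of three iterations after any single search.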

 

As a result of our analysis of the session logs and verbal protocols, a taxonomy was developed to represent the observations of the subjects (see Figure 10). Included in this taxonomy are problems identified, which may be taken as potential pitfalls to consider in development of a domain-specific automatic thesaurus, and advantages/benefits identified, which may be taken as potential usefulness of such a thesaurus. This section is organized according to the structure of the taxonomy. This experiment gave us a better idea of how the thesaurus could be modified and used for more real-life purposes and by different user groups.


  
Figure 10: Taxonomy of system problems as identified by searchers

Potential Pitfalls

A. Quality of nodes and links:

B. Quality of System and Interface

Potential Usefulness

A. Serendipitous Discovery:

Subjective evaluation of the fly thesaurus revealed two primary, novel areas of potential usefulness. The first is serendipitous discovery. One expert was at first concerned about what she termed the ``misdirectedness'' of the thesaurus' search method, and expressed frustration with it. However, after a few searches, she conceded that, ``It's like a ... word thesaurus in that there are probably going to be things in here that I absolutely never would have thought about... and routes through the data, through the literature that I never would take without it.'' Most users thought the thesaurus would be useful in introducing them to various aspects of the domain that were beyond their present level of experience. The expert and the novice discussed above found that imagining a scenario in which multiple terms were related led them, by deduction, to a possible context for seemingly unrelated terms. This kind of creative use of the thesaurus is an observation we have not seen reported before. Two subjects (an expert and an outsider) did not see this potential. They were primarily interested in seeing the things they knew to be related and relevant, and anything unexpected was viewed as noise.

B. Memory-Jogging:

The second novel use for the fly thesaurus is memory-jogging. One novice searched for the gene ``wingless''. The term list retrieved contained the term ``signalling'', which triggered a spark of recognition in the subject. She said, ``Let's try `signalling', because `wingless' is presumed to be some sort of signalling molecule.'' In one case, serendipitous discovery and memory-jogging occurred almost simultaneously. The subject recognized the terms ``m4'' and ``96F'' on the list of retrieved terms, and was reminded that, ``m4 is a minute, which may exist at 96F; I know there are some out there [in the chromosome]. So I would say that maybe there is a rearrangement. Oh I know what it is, this is probably a clonal analysis using a minute to diminish the size of the wild type clones, so that the mutant clones can grow.'' By applying deductive reasoning to the thesaurus' inherent characteristics emphasizing knowledge discovery and memory-enhancement, the subject's creativity was triggered. This research has demonstrated that automatic thesauri have the potential to offer improvements to searching that far exceed those possible with manual thesauri.

In summary, the results from the experiment were very encouraging. The thesaurus suggested relevant terms and concepts that would not only be helpful for different users, but also useful in spurring user ideas and the desire to acquire knowledge. The thesaurus was tested by objective and subjective measures and produced results indicating that it did produce relevant terms, improve search recall, and would be useful as a thesaurus for FlyBase.

Conclusion and Future Directions

Our most immediate future research directions, as part of a long term effort to develop a more efficient and ``intelligent'' framework and design for the management, retrieval, sharing, and dissemination of information for distributed, scientific computing, include the following:

Acknowledgments

This project was supported mainly by two NSF grants: the NSF CISE Research Initiation Award, IRI-9211418, 1992-1994 (H. Chen, ``Building a Concept Space for an Electronic Community System'') and NSF CISE Special Initiative on Coordination Theory and Collaboration Technology, IRI-9015407, 1990-1993 (B. Schatz et al., ``Building a National Collaboratory Testbed''). We would also like to thank the faculty and students of the Molecular and Cellular Biology Department, Ecology and Evolutionary Biology Department, Anatomy Department, and Biochemistry Department at the University of Arizona for their kind assistance and valuable suggestions, in particular, Dr. Samuel Ward, Dr. Danny Brower, Dr. John Clark, Dr. John Little, Dr. Mary Rykowski, Ms. Alicia Minniti, and Ms. Lisa Werner.

Bibliography

1
M. J. Bates.
Subject access in online catalogs: a design model.
Journal of the American Society for Information Science, 37(6):357-376, November 1986.

2
H. Chen and V. Dhar.
Cognitive process as a basis for intelligent retrieval systems design.
Information Processing and Management, 27(5):405-432, 1991.

3
H. Chen, P. Hsu, R. Orwig, L. Hoopes, and J. F. Nunamaker.
Automatic concept classification of text from electronic meetings.
Communications of the ACM, 37(10):56-73, October 1994.

4
H. Chen and K. J. Lynch.
Automatic construction of networks of concepts characterizing document databases.
IEEE Transactions on Systems, Man and Cybernetics, 22(5):885-902, September/October 1992.

5
H. Chen, K. J. Lynch, K. Basu, and D. T. Ng.
Generating, integrating, and activating thesauri for concept-based document retrieval.
IEEE EXPERT, Special Series on Artificial Intelligence in Text-based Information Systems, 8(2):25-34, April 1993.

6
H. Chen, B. R. Schatz, T. Yim, and D. Fye.
Automatic thesaurus generation for an electronic community system.
Journal of the American Society for Information Science, 46(3):175-193, April 1995.

7
J. Courteau.
Genome databases.
Science, 254:201-207, October 11, 1991.

8
W. B. Croft and R. H. Thompson.
I3R: A new approach to the design of document retrieval systems.
Journal of the American Society for Information Science, 38(6):389-404, 1987.

9
C. J. Crouch.
An approach to the automatic construction of global thesauri.
Information Processing and Management, 26(5):629-640, 1990.

10
B. Everitt.
Cluster Analysis.
Second Edition, Heinemann Educational Books, London, England, 1980.

11
E. A. Fox.
Development of the CODER system: A testbed for artificial intelligence methods in information retrieval.
Information Processing and Management, 23(4):341-366, 1987.

12
D. A. Lindberg and B. L. Humphreys.
The UMLS knowledge sources: Tools for building better user interface.
In Proceedings of the Fourteenth Annual Symposium on Computer Applications in Medical Care, pages 121-125, Los Alamitos, CA: Institute of Electrical and Electronics Engineers, November 4-7, 1990.

13
A. T. McCray and W. T. Hole.
The scope and structure of the first version of the UMLS semantic network.
In Proceedings of the Fourteenth Annual Symposium on Computer Applications in Medical Care, pages 126-130, Los Alamitos, CA: Institute of Electrical and Electronics Engineers, November 4-7, 1990.

14
I. Monarch and J. G. Carbonell.
CoalSORT: A knowledge-based interface.
IEEE EXPERT, pages 39-53, Spring 1987.

15
R. Pool.
Beyond database and e-mail.
Science, 261:841-843, 13 August 1993.

16
B. F. Ryan, B. L. Joiner, and T. A. Ryan.
MINITAB Handbook, 2nd Edition.
PWS-KENT Publishing Company, Boston, MA, 1985.

17
G. Salton.
Generation and search of clustered files.
ACM Transactions on Database Systems, 3(4):321-346, December 1978.

18
G. Salton.
Automatic Text Processing.
Addison-Wesley Publishing Company, Inc., Reading, MA, 1989.

19
B. R. Schatz.
Building an electronic community system.
Journal of Management Information Systems, Winter, Special Issue 1991/1992.

20
B. R. Schatz.
Building collaboratories for molecular biology.
In National Collaboratories: Applying Information Technology for Scientific Research, National Research Council, National Academy Press, Washington, DC, 1993.

hchen@bpa.arizona.edu