Hsinchun Chen 1, Bruce Schatz 2, Joanne Martinez 3, Tobun Dorbin Ng 4
We tested the fly thesaurus in an experiment with six fly biologists of varying degrees of expertise and background. This study confirms earlier research (on a C. elegans worm thesaurus) that demonstrated the feasibility of an algorithmic solution to the information overload problem in databases and the vocabulary problem in online information retrieval. The knowledge and literature representing Drosophila research present important challenges, including non-standardized naming conventions, a long history with a great deal of vocabulary fluidity, and a volume of literature about ten times that of the worm research. The experiment showed that the thesaurus was an excellent memory-jogging device and that it supported learning and serendipitous browsing. Despite some occurrences of obvious noise, the system was useful in suggesting relevant concepts for the researchers' queries. The experience and lessons learned during the fly thesaurus development and evaluation process are presented in detail.
Biological research is highly data-intensive, and data accumulation in this area is growing extremely rapidly because of massive efforts such as the Human Genome Initiative and other genome mapping projects. At present, the genomes of several organisms are being sequenced and mapped, including Caenorhabditis elegans (nematode worm), Drosophila melanogaster (fruit fly), Mus musculus (mouse), Homo sapiens (human), Escherichia coli (bacterium), Saccharomyces cerevisiae (yeast), and Arabidopsis thaliana (plant). Because communities in molecular biology form around organisms rather than techniques or problems, the results generated are stored by each scientific community in its own set of databases. Information about the identity, function, cytological and genetic location, mutations, and aberrations of any particular gene is therefore scattered among a variety of distributed, heterogeneous databases.
These systems range from elaborate, sophisticated database management systems to unstandardized machine-readable files created through text conversion. Distributed, heterogeneous databases are preferable to vast, centralized databases because of such inherent difficulties as file size, currency, updating, and retrieval. Nevertheless, the resulting problem of information scattering makes it difficult for scientists to share and compare information that exists in different files. Connectivity among databases is essential if scientists are to make full use of their contents. Adequate tools for structuring and accessing the contents of a diverse and heterogeneous collection of scientific databases are needed to unlock the knowledge embedded in them. Tools developed for this purpose must address the following problems: information overload, information scattering, vocabulary fluidity over time, unstandardized nomenclature, and semantic differences in the vocabulary used by the various subdisciplines of biology.
The Human Genome Initiative (HGI) offers tremendous challenges not only to the biology, biomedicine, and genetics research communities, but to the information science and computer science communities as well. According to Courteau [7], the Human Genome Project ``will generate more data than any single project to date in biology,'' resulting in complete sequences and physical maps containing the location of every gene of the human genome and the genomes of other model organisms. The vast amount of knowledge accumulated during the project's scientific discovery process can only be managed with the use of computing technologies that support efficient and effective storage, retrieval, and analysis of information, foster seamless distributed scientific collaboration, and facilitate timely information dissemination and sharing.
FlyBase is a set of linked databases designed to provide the Drosophila research community with access to broad and thorough coverage of molecular, genetic, and community information. The Drosophila community is one of the oldest groups in biological research. The knowledge it has generated has been recorded mostly in two sources, the ``Redbook'' (i.e., ``The Genome of Drosophila melanogaster,'' by D.L. Lindsley and G.G. Zimm, Academic Press, 1992), and the publications of the Drosophila Information Service. FlyBase promises improved access to domain knowledge through use of computer technologies.
A consortium of Drosophila researchers funded by grants from the U.S. National Institutes of Health (Washington) and Medical Research Council (London) is responsible for the development of FlyBase. This group guides the development and provides advice on preferred formats to groups developing allied databases and to individual contributors of data. FlyBase is accessible and searchable through Gopher client software, or obtainable via anonymous ftp from its primary archival site at the Department of Biology at Indiana University (IU). The primary database consists of a series of flat files that contain molecular and genetic data on Drosophila melanogaster, including the entire contents of the ``Redbook.''
In order to provide comprehensive access to Drosophila information, the IU Department of Biology also provides access to numerous ``allied'' databases developed at other institutions, each with its own focus. Included are a unified bibliography on Drosophila; pointers to nucleic acid and protein sequence databases; stock lists directing researchers to sources of clones and other biological reagents; a genetic map of Drosophila; a list of Drosophila genes sorted by function; lists of clones from European and American sources, and instructions for placing orders; directory of Drosophila workers; and other associated databases useful to Drosophila researchers. A forum for informal communication between researchers is provided through the Drosophila Information Newsletter and a Bionet News group dedicated to Drosophila. Each information source is resident on computers at its ``home'' institution. Access to the complete set of databases is organized through the IUBIO gopher site.
In addition to the various genome databases such as FlyBase, electronic community systems (ECS) have been proposed and implemented, and have drawn significant attention recently due to the rapid proliferation and advancement of computing, databases, and telecommunication technologies. An electronic community system encodes a research community's information and knowledge and provides an online environment to support the manipulation of that knowledge. An ECS enables researchers of a scientific community to enter and share community knowledge and findings in a timely manner and in a distributed environment, and thereby function more efficiently and effectively within the community.
An advantage of this type of system over traditional databases is that an ECS enables users to browse the available knowledge easily, record their own knowledge for others to use, indicate authorizations for users to either view or annotate their own data [7], annotate entries from others' research, and form interrelationships between concepts [19]. An ECS is much like an electronic library where users can browse for relevant information, filter out the information they do not currently need, and share the data that they have found [19].
Another novel characteristic of an ECS is its ability to handle a wide variety of community knowledge, both formal and informal [19]. In order to ``live effectively within a community, one must have available both the formal archival material and the informal transient folklore'' [19]. An ECS does this by integrating published literature about formal research findings with intermediate results, experimental protocols, laboratory notes, electronic bulletin board discussions, e-mail exchanges, and other informal data sources. It is therefore becoming an indispensable tool that allows researchers to browse, explore, and understand a vast and rapidly changing world of scientific knowledge while it also creates a group memory.
The Worm Community System (WCS), which is a major NSF-funded collaboratory project, has been considered a model electronic community system [15] [20]. Constructed for C. elegans researchers, it offers traditional database functionalities along with literature, informal information and research lore, mapping programs and graphics, and the ability for users to browse, share, and filter a large amount of timely worm community knowledge. The system is intended to serve not only the entire community of worm biologists but also other related biology and biomedical community members [19] [7] [20]. In previous research, we developed a worm thesaurus based on the complete WCS literature [6]. The worm thesaurus was found to be an excellent memory-jogging tool and a concept-based search aid and was incorporated into the WCS. Release 2 of the WCS was made available to worm biologists in August 1993.
While the formats of the WCS and FlyBase are very different, each attempts to serve as a central, electronic ``clearinghouse'' for information a researcher needs to remain up-to-date. However, despite the potential substantial benefits of being able to access, retrieve, and analyze data and information about homologues of their gene or protein in other organisms, the use of distributed heterogeneous databases presents many significant obstacles. Besides the technical problems inherent in the use of various database formats, other sources of difficulties include information overload, information scattering, vocabulary fluidity over time, and differences in semantic meaning and nomenclature conventions between domains.
In this paper, we present an algorithmic approach to generation of a fly thesaurus. The main techniques used in our approach are presented in Section 3. Section 4 discusses in detail the algorithms and sample results for object filtering and automatic indexing of Drosophila literature and the algorithms and findings of the cluster analysis process. Section 5 presents the current status of our system implementation and an experiment we conducted involving subject area experts, novices, and (fly) community outsiders. Experimental results are discussed in detail in Section 6. Directions for future research are presented in Section 7.
In this research, our aim was to apply an algorithmic approach to the generation of a robust knowledge base based on statistical correlation analysis of the concepts (knowledge) embedded in the documents of real-life, textual databases. The research output consisted of a thesaurus-like knowledge base, which can aid in concept-based information management and retrieval. This automatically-generated thesaurus component, akin to a manually-created thesaurus, can also play an important role in solving a searcher's vocabulary problems during information retrieval.
In information science, use of a thesaurus or a knowledge base for ``intelligent'' information retrieval has drawn significant attention in recent years. There have been many attempts to capture experts' domain knowledge for information retrieval. A few examples are described below. CoalSORT [14], a knowledge-based interface, facilitates the use of bibliographic databases on coal technology. A semantic network, representing an expert's domain knowledge, embodies the system's intelligence. Fox's CODER system [11] consists of a thesaurus that was generated from the Handbook of Artificial Intelligence and Collin's Dictionary. The ``Intelligent Intermediary for Information Retrieval'' (I3R), developed by Croft [8], consists of a group of ``experts'' that communicate via a common data structure, called the blackboard. The system consists of a user model builder, a query model builder, a thesaurus expert, a search expert (for suggesting statistics-based search strategies), a browser expert, and an explainer. Chen and Dhar [2] incorporated a portion of the Library of Congress Subject Headings into the design of an intelligent retrieval system. The system adopted a branch-and-bound spreading activation algorithm to assist users in articulating their queries. The National Library of Medicine's Unified Medical Language System (UMLS) project aims to build an intelligent automated system that understands biomedical terms and their interrelationships and uses this understanding to help users retrieve and organize information from machine-readable sources [13] [12]. The UMLS includes a Metathesaurus, a Semantic Network, and an Information Sources Map. The Metathesaurus contains information about biomedical concepts and their representation in more than 10 different vocabularies and thesauri.
Most of the knowledge bases adopted in these intelligent systems were either generated manually from domain experts, using the knowledge acquisition process, or derived from existing thesauri (which were also created manually in the first place by some indexing/subject experts). A complementary approach to manual knowledge base creation is the automatic thesaurus generation approach.
Virtually all techniques for automatic thesaurus generation are based on the statistical co-occurrence of word types in text [6] [4] [9] [18]. Similarity coefficients are often obtained between pairs of distinct terms based on coincidences in term assignments to the documents of the collection. For example, a cosine computation can be used to generate normalized term similarities between 0 and 1. Once pairwise similarities have been obtained for all term pairs, an automatic term-classification process such as single-link or complete-link classification can group into common classes all terms with sufficiently large pairwise similarities [10] [17] [18]. The terms in the thesaurus classes can replace the initial search terms and be used to increase retrieval recall.
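The co-occurrence idea can be illustrated with a minimal sketch. It assumes each term is represented only by the set of documents it was assigned to (a binary incidence vector), and it treats single-link classification as the connected components of the graph whose edges are term pairs at or above a similarity threshold. The function names, the threshold, and the toy data are illustrative assumptions, not part of the systems cited above.

```python
import math

def cosine_similarity(docs_a, docs_b):
    """Cosine of two binary term vectors, given the document sets of each term."""
    if not docs_a or not docs_b:
        return 0.0
    return len(docs_a & docs_b) / math.sqrt(len(docs_a) * len(docs_b))

def single_link_classes(term_docs, threshold):
    """Group terms whose pairwise similarity chains reach the threshold."""
    terms = sorted(term_docs)
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the class representative (with path halving).
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if cosine_similarity(term_docs[a], term_docs[b]) >= threshold:
                parent[find(a)] = find(b)

    classes = {}
    for t in terms:
        classes.setdefault(find(t), []).append(t)
    return sorted(sorted(members) for members in classes.values())

# Toy data: term -> set of document ids in which the term was assigned.
toy = {
    "homeotic": {1, 2, 3},
    "antennapedia": {2, 3},
    "white": {4, 5},
    "eye": {4, 5, 6},
}
print(single_link_classes(toy, threshold=0.5))
```

Single-link grouping tends to chain loosely related terms together, which is one reason complete-link classification is mentioned as an alternative.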
The specific algorithms adopted in this research include: object filtering, automatic indexing, and cluster analysis. In the following section, we present an overview of these techniques and our modifications.
In [1], Bates proposed a design model for subject access in online catalogs. She stressed the importance of building domain-specific lexicons for online retrieval purposes. A domain-specific, controlled list of keywords can help identify legitimate search vocabularies and help searchers ``dock'' on to the retrieval system. For most domain-specific databases, there generally appear to be some existing lists of subject descriptors (e.g., the subject indexes at the back of a textbook), researchers' names (e.g., author indexes or researcher directory), and other domain-specific objects (e.g., genes, experimental methods, organizational names, etc.), either online or obtainable through OCR scanning. These domain-specific keywords can be used to help in automatic identification of important concepts in documents.
In [4], we used several domain-specific controlled lists of subject keywords, researchers' names, and organizational names for indexing in a Russian computing database (with about 200 MBs and 40,000 documents). In creating the worm thesaurus [6], we utilized author indexes from literature sources, the WCS gene list, the subject index from the Worm Book, and an existing keyword list.
After object filtering, the texts remaining may still contain many important concepts. An algorithmic approach to the identification of remaining descriptors is required. An effective and robust method for content identification that is simple and domain-independent is the automatic indexing technique, often used in information science for indexing literature. In [18], Salton presents a blueprint for automatic indexing, which typically includes dictionary look-up, stop-wording, word stemming, and term-phrase formation. The algorithm first identifies individual words. A stop word list is then used to remove non-semantic bearing words such as the, a, on, in, etc. After removing the stop words, a stemming algorithm is used to identify the word stem for the remaining words. Finally, term-phrase formation that formulates phrases by combining only adjacent words is performed.
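The indexing steps just described can be sketched as follows. This is a simplified illustration under stated assumptions: the stop-word list is a tiny sample, the suffix stripper is a crude stand-in for a real stemming algorithm, and phrases are formed only from words that are adjacent in the original text (a stop word breaks a phrase). All names are hypothetical.

```python
import re

STOP_WORDS = {"the", "a", "an", "on", "in", "of", "and", "is", "are", "to", "by"}
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemmer

def stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_text(text, max_phrase_len=2):
    """Return single-word and adjacent-word descriptors found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Split the word stream into runs of consecutive non-stop words.
    runs, current = [], []
    for word in words:
        if word in STOP_WORDS:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(stem(word))
    if current:
        runs.append(current)
    # Form phrases of 1..max_phrase_len adjacent stems within each run.
    terms = []
    for run in runs:
        for length in range(1, max_phrase_len + 1):
            for start in range(len(run) - length + 1):
                terms.append(" ".join(run[start:start + length]))
    return terms

print(index_text("The homeotic genes in the developing fly"))
```

Lower-casing is acceptable for this generic step only; it would not suit case-sensitive names such as gene symbols, which are better handled by a separate lookup against known terms.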
While automatic indexing identifies subject descriptors in a document, the relative importance of each descriptor to representing the content of the document may vary. Salton's Vector Space Model associates with each descriptor a weight to represent its descriptive power. Among the many probabilistic techniques that have been developed by various information science researchers, techniques which typically incorporate term frequency and inverse document frequency have been found to be simple and yet very useful [18]. The basic rationales underlying these two measures are that: terms which appear more times in a document should be assigned higher weights (term frequency), and terms which appear in fewer documents in the whole database (the more specific terms) should have higher weights (inverse document frequency).
Based on cluster analysis [10], the Vector Space Model has been extended for automatic thesaurus generation (or automatic knowledge base generation). The first stage in many cluster analyses is to convert the raw data (e.g., indexes and weights) into a matrix of inter-individual similarity, dissimilarity or distance measures. The result of a cluster analysis will be a number of groups, clusters, types, or classes of individuals [10]. In automatic thesaurus generation [9] [4], the most commonly-used algorithms compute probabilities of indexes co-occurring in all documents of a database (sometimes referred to as co-occurrence analysis). Just as a human inductive learning process generates concepts from a set of examples and benefits from the largest possible number of examples, a thesaurus created from a textual database becomes more ``knowledgeable'' as it becomes more subject-specific and larger in the size of its collections. In the biomedical and biological domains, we have found that the subject matter is often very specific (i.e., gene names, experimental methods, and topics) and provides a sound basis for performing cluster analysis [6].
Although the above techniques had been employed in other applications, including the development of a highly domain-specific, up-to-date automatic thesaurus for the worm community, the volume of Drosophila literature and the unstandardized naming scheme utilized in the fly community offered special challenges. During our system development process, significant adaptation was required to meet the specific constraints and novel characteristics of FlyBase.
Specific concepts like gene names, function names, researchers, and subject descriptors obtained from FlyBase, as well as the large number of documents accumulated in Medline and Biosis abstracts, provide the foundation for automatic thesaurus generation. FlyBase itself represents the long research history of the Drosophila community. The volume of information is estimated as 10 times that of the WCS. The reported generation of an automatic thesaurus for FlyBase uses as a knowledge source a collection of 5,854 abstracts (more than 10 MB of textual information) from Medline and Biosis CD-ROM with a time span between 1983 and 1993. After interviewing several fly biologists at the University of Arizona, we chose to generate and represent the important concepts described in recent literature instead of capturing the concepts exhibited in all fly literature, because the latest concepts may be of most interest to current researchers. The experience and lessons learned during the system development process and thesaurus evaluation are reported below.
In order to generate a fly thesaurus from the fly literature, we performed object filtering, automatic indexing, and cluster analysis in order. Sample results are shown in this section.
We created the following lists of fly-related keywords from several directories in FlyBase and from Biosis and Medline. Unlike the unified and systematic naming convention used in the worm community, the format in Drosophila is more free-style. Capitalization is important for gene and function names. Furthermore, there is no systematic way to name alleles in Drosophila. Special characters like left parenthesis ``('', pipe ``|'', apostrophe ``''', comma ``,'', slash ``/'', and plus sign ``+'' are commonly used in the naming convention. Here are some examples of those names: ``(-Glycerol phosphate dehydrogenase'', ``Ac-SD'', ``Su(Ste)'', ``l(3)4.15'', ``ort1 ninaE1'', ``ADP/ATP translocase'', ``fructose-1,6-bisphosphate aldolase'', ``(Na+ + K+) ATPase ( subunit'', and ``Casein kinase II, | subunit''. Because of this unconstrained naming convention, a parser that had been developed previously and used for several applications [4] [6] [3] was revised to handle these scientific terms. We created four groups of object filters as follows:
In total, we identified 23,773 fly-specific terms, which served to identify important concepts in Biosis and Medline documents.
After documents were ``filtered'' by the four lists, the remaining texts in documents were processed by the automatic indexing routine. Automatic indexing was implemented mainly based on the procedure reported in [18]. The following steps were executed in order:
After the concept descriptors for each abstract were identified, we proceeded to perform term co-occurrence analysis for all documents in the document collection. A term weighting scheme based on the Vector Space model [18] and an asymmetric similarity function [4] similar to the popular cosine function [10] were adopted for analysis. The blueprint for generating such a concept space (we refer to the thesaurus as a concept space to distinguish it from the information space represented by the Biosis and Medline documents) is shown below:
We first computed the term frequency and the document frequency for each term in a document. Term frequency, $tf_{ij}$, represents the number of occurrences of term j in document i. Document frequency, $df_j$, represents the number of documents in a collection of N documents in which term j occurs. High term frequency indicates that a term is highly related to a document. High document frequency, on the other hand, indicates that a term is too general to be useful as a descriptor (i.e., has no descriptive power).
Usually terms identified from the title of a document are more descriptive than terms identified from the abstract of the document. This is also the case here, as confirmed by fly experts at University of Arizona. In addition, terms identified through object filtering are usually more accurate than terms generated by automatic indexing. This is due to the fact that terms generated by automatic indexing are relatively ``noisy.'' In our research, terms identified in titles were assigned heavier weights than terms in abstracts and terms identified by object filtering were assigned heavier weights than terms identified by automatic indexing.
We retained automatic indexing terms that appeared more than three times, in order to remove incidental noise terms. The total numbers of unique terms generated (through both object filtering and automatic indexing) from the fly literature are shown in the following table. Not surprisingly, subject descriptors constituted the largest share of the descriptors.
| Type of Descriptor | Number of Terms |
| Genes | 4,875 |
| Functions | 182 |
| Researchers | 8,349 |
| Subject Descriptors | 155,523 |
| Total | 168,929 |
We then computed the combined weight of term j in document i, $d_{ij}$, based on the product of ``term frequency'' and ``inverse document frequency'' as follows:
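Written out in this notation, and assuming that the descriptor length $w_j$ scales the inverse document frequency inside the logarithm (an assumption of this sketch), the combined weight takes a form such as:

\[
d_{ij} = tf_{ij} \times \log\left(\frac{N}{df_j} \times w_j\right)
\]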
where N represents the total number of fly documents, and $w_j$ represents the number of words in descriptor $T_j$. Multiple-word terms were assigned heavier weights than single-word terms because multiple-word terms usually conveyed more precise semantic meaning than single-word terms.
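The weighting step can be sketched in a few lines. The sketch assumes the combined weight has the form tf × log((N / df) × w), with w the number of words in the descriptor; this placement of w is an assumption of the example, as are the function name and the toy documents.

```python
import math
from collections import Counter

def combined_weights(doc_terms):
    """Return one dict per document mapping each descriptor to its weight.

    doc_terms: list of descriptor lists, one list per document.
    Assumed form: weight = tf * log((N / df) * number_of_words).
    """
    n_docs = len(doc_terms)
    doc_freq = Counter()
    for terms in doc_terms:
        doc_freq.update(set(terms))  # count each descriptor once per document
    weights = []
    for terms in doc_terms:
        term_freq = Counter(terms)
        weights.append({
            term: tf * math.log((n_docs / doc_freq[term]) * len(term.split()))
            for term, tf in term_freq.items()
        })
    return weights

toy_docs = [
    ["homeotic gene", "homeotic gene", "fly"],
    ["fly", "white"],
    ["fly"],
]
for doc_weights in combined_weights(toy_docs):
    print(doc_weights)
```

In this toy collection the single-word descriptor that occurs in every document receives a weight of zero, while the two-word descriptor confined to one document receives the largest weight, which matches the rationale given for both the inverse document frequency and the word-count factor.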
We then generated a term co-occurrence table based on the asymmetric ``Cluster Function'' developed by the authors [4]. We have shown that this asymmetric similarity function represented better term association than the popular cosine function. The weighting factor appearing in the equations below is a further improvement of our cluster algorithm.
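Assuming that each similarity weight is the ratio of the summed joint weights to the summed weights of the source term, scaled by the weighting factor of the target term, the pair of equations can be sketched as follows (the labels ClusterWeight and WeightingFactor are names chosen for this sketch):

\[
\mathit{ClusterWeight}(T_j, T_k) = \frac{\sum_{i=1}^{N} d_{ijk}}{\sum_{i=1}^{N} d_{ij}} \times \mathit{WeightingFactor}(T_k)
\]

\[
\mathit{ClusterWeight}(T_k, T_j) = \frac{\sum_{i=1}^{N} d_{ijk}}{\sum_{i=1}^{N} d_{ik}} \times \mathit{WeightingFactor}(T_j)
\]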
These two equations indicate the similarity weights from term $T_j$ to term $T_k$ (the first equation) and from term $T_k$ to term $T_j$ (the second equation). $d_{ij}$ and $d_{ik}$ were calculated based on the equation in the previous step. $d_{ijk}$ represents the combined weight of both descriptors $T_j$ and $T_k$ in document i. $d_{ijk}$ is defined similarly as follows:
where $tf_{ijk}$ represents the number of occurrences of both term j and term k in document i (the smaller number of occurrences between the terms was chosen). $df_{jk}$ represents the number of documents (in a collection of N documents) in which terms j and k occur together. $w_j$ represents the number of words of descriptor $T_j$.
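By analogy with the single-term weight, and assuming that the descriptor length again enters inside the logarithm, a form consistent with the definitions that follow is:

\[
d_{ijk} = tf_{ijk} \times \log\left(\frac{N}{df_{jk}} \times w_j\right)
\]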
In order to penalize general terms (terms which appeared in many places) in the co-occurrence analysis, we adopted the following weighting schemes:
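One factor with this behavior, assumed here to be an inverse document frequency normalized by $\log N$, is:

\[
\mathit{WeightingFactor}(T_k) = \frac{\log\left(N / df_k\right)}{\log N},
\qquad
\mathit{WeightingFactor}(T_j) = \frac{\log\left(N / df_j\right)}{\log N}
\]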
Terms with a higher $df_k$ value (more general terms) had a smaller weighting factor value, which caused the co-occurrence probability to become smaller. In effect, general terms were pushed down in the co-occurrence table (terms in the co-occurrence table were presented in reverse probabilistic order, with more relevant terms appearing first).
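The whole co-occurrence computation can be sketched end to end. The sketch rests on several assumptions: the single-term weight is tf × log((N / df) × w), the joint weight uses the smaller of the two term frequencies and the joint document frequency in the same form, the asymmetric weight is the ratio of summed joint weights to summed source-term weights, and the penalty for the target term is log(N / df) / log N. The function name and toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

def cluster_weights(doc_terms):
    """Return {(source, target): asymmetric co-occurrence weight}."""
    n_docs = len(doc_terms)
    term_freqs = [Counter(terms) for terms in doc_terms]
    doc_freq, pair_freq = Counter(), Counter()
    for tf in term_freqs:
        doc_freq.update(tf.keys())
        pair_freq.update(permutations(tf, 2))  # ordered pairs, once per document

    def n_words(term):
        return len(term.split())

    single = defaultdict(float)  # sum over documents of the single-term weight
    joint = defaultdict(float)   # sum over documents of the joint weight
    for tf in term_freqs:
        for j, count in tf.items():
            single[j] += count * math.log(n_docs / doc_freq[j] * n_words(j))
        for j, k in permutations(tf, 2):
            shared = min(tf[j], tf[k])
            joint[(j, k)] += shared * math.log(n_docs / pair_freq[(j, k)] * n_words(j))

    weights = {}
    for (j, k), numerator in joint.items():
        # Skip degenerate cases that would divide by zero.
        if single[j] > 0 and n_docs > 1:
            factor = math.log(n_docs / doc_freq[k]) / math.log(n_docs)
            weights[(j, k)] = numerator / single[j] * factor
    return weights

toy_docs = [["Antp", "homeotic"], ["Antp", "homeotic"], ["homeotic", "fly"], ["fly"]]
for pair, weight in sorted(cluster_weights(toy_docs).items(), key=lambda kv: -kv[1]):
    print(pair, round(weight, 3))
```

In the toy data the link from the broader term to the narrower one scores higher than the reverse link, because the broader term is penalized when it appears as the target. Note that under these assumptions the ratio is not guaranteed to stay below 1.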
Sample entries in the system-generated co-occurrence tables are shown in Figure 1. As shown in the co-occurrence table, ``DISCS'' was found to be most strongly related to researcher ``BROWER, D.'' with a weighted probability of 0.286650. The other terms related to researcher ``BROWER, D.'' were listed in descending order. In the second entry, gene ``e(bx)'' was found to be most relevant to gene ``white''. In the third entry, the subject ``LOCALIZED'' was found to be the most relevant to researcher ``LEHMANN, R''. The last entry revealed that the term most relevant to the subject ``HOMEOTIC'' was the gene ``ANTC''.
Without setting a probabilistic threshold for the co-occurrence table, the total number of co-occurrence pairs was 811,356. Some terms may have a few thousand related concepts. The enormous volume of these data not only used a lot of memory, it might also overwhelm searchers during the thesaurus browsing process. For productive user-system interaction, only highly relevant concepts should be suggested to searchers. With the experience from worm thesaurus generation [6], we chose 100 as the maximum number of links for any node. The resulting fly thesaurus contained 748,253 pairs of related concepts. After applying the thresholds, the total number of unique terms found in the four sources was 18,099 (a significant portion of the less common subject descriptors was removed as a result of this process). On average, each term had about 41 relevant neighboring concepts.
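The link cap can be illustrated with a short sketch that keeps, for each term, only its highest-weighted neighbors. The function name and input layout are assumptions made for illustration; the default of 100 mirrors the maximum number of links chosen above.

```python
import heapq
from collections import defaultdict

def prune_links(weights, max_links=100):
    """Keep at most max_links strongest neighbors per source term.

    weights: {(source, target): co-occurrence weight}
    Returns {source: [(target, weight), ...]} in descending weight order.
    """
    neighbors = defaultdict(list)
    for (source, target), weight in weights.items():
        neighbors[source].append((weight, target))
    return {
        source: [(target, weight) for weight, target in heapq.nlargest(max_links, links)]
        for source, links in neighbors.items()
    }

toy_weights = {("a", "b"): 0.9, ("a", "c"): 0.5, ("a", "d"): 0.7, ("b", "a"): 0.2}
print(prune_links(toy_weights, max_links=2))
```

Because the cap is applied per source term, a link may survive in one direction and be dropped in the other, which is consistent with the asymmetric nature of the co-occurrence weights.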
Our prototype system, which consisted of a thesaurus generation component and a thesaurus query system, was developed in ANSI C and ran on Sun SPARCstations, DECstations, and DEC Alpha. It took 10.6 hours of CPU time to generate the thesaurus using a DECstation 5000/120 (25 MIPS, ULTRIX-based). The resulting size of the fly thesaurus was 13.5 MB. Users browse the thesaurus through a query interface that was previously developed for use with the worm thesaurus [6]. We present a sample fly thesaurus query session first. We then discuss an experimental design which was adopted to evaluate the usefulness of the fly thesaurus and to identify areas for improvement.
A sample sequence of query sessions is shown in Figures 2, 3, and 4. The user was interested in finding something about homeotic genes (after a previous session which involved cell death, as shown on the main window), a class of genes involved in body pattern formation in the developing fly. The term ``HOMEOTIC'' had been entered in the new query box. Although multiple terms may be searched together, the user elected to search only one term, as indicated by the lowercase ``q'' entered for term 2 (Figure 2, bottom box). Figure 3 shows a display of relevant terms in decreasing weighted order. As expected, we see that a number of gene names have appeared on the list of relevant terms. We also see that term 2 indicates that the term ``HOMEOTIC'' is descriptive of certain genes. Term 24, ``EMBRYONIC'', confirms that ``HOMEOTIC'' does indeed refer to genes in the developing fly. Terms 1, 10, 11, 18, and 32 are all variations on the word ``antenna''. This makes sense, too. A group of genes important in the proper development and placement of the fly antennae are known to be clustered together on the chromosome, and are called ``Antennapedia Complex.'' Furthermore, terms 3, 7, 9, 12, 14, 15, and 16 are all gene names.
The user was interested in determining whether any of the system-suggested genes are included in the Antennapedia Complex, and marked terms 1 (ANTC), 2 (HOMEOTIC GENES), and 17 (COMPLEX) for further searching. The menu choice ``Use term(s) with marker'' under ``Thesaurus Re-Query'' re-activated the thesaurus. Figure 4 shows the result of the iterative query using the three selected terms. We see that ANTC (term 1) appears to function as a member of an immunoglobulin superfamily, and that the protein expressed by the gene is glucose dehydrogenase. We also see that six of the gene names that appeared on the list of terms suggested for ``HOMEOTIC'' (3, 4, 7, 14, 15, and 16) directly relate to all three of the marked terms, as indicated in the square brackets following the co-occurrence weight. Thus we can surmise that these are all member genes of the Antennapedia Complex.
A user can iteratively query more marked system-suggested terms as many times as needed to arrive at a list of sufficiently specific terms. Alternatively, the thesaurus can be re-activated by entering new user-suggested terms. Terms selected by a user during the iterative thesaurus browsing process are recorded in a separate area. For the fly thesaurus, the abstracts from which the terms were extracted are not retrievable. The next stage of development for this thesaurus will involve traversal from the fly concept space into the previously generated worm concept space, and will feature a facility to retrieve worm documents using fly query terms.
A fly thesaurus evaluation experiment was conducted in Fall 1993. The experiment consisted of two parts: a term association experiment and a searcher browsing experiment. In order to investigate the effects of a searcher's background on thesaurus usage, six subjects at three levels of expertise were selected. Subjects 1 and 2 were considered experts in Drosophila biology. Both were faculty members with more than 9 years of experience in Drosophila genetics, and had published numerous papers in this area. Subjects 3 and 4 were considered novices: one was a doctoral student in Molecular and Cellular Biology; the other was a postdoctoral fellow in Ecology and Evolutionary Biology. Both had worked in laboratories dedicated to Drosophila research for at least 2 years. Subjects 5 and 6 were considered outsiders. Both worked outside the fly domain, but had had exposure to fly concepts and had reasons to be interested in using a fly thesaurus. Subject 5 was a faculty member with limited previous experience in a Drosophila research laboratory (a 1-year sabbatical). Subject 6, a doctoral student with extensive experience in worm genetics, had frequently attended joint fly-worm seminars.
The first step of the term association experiment was to give each subject a pre-selected term. Ten fly terms chosen with the help of several fly researchers were presented to each subject in order. These terms included researchers' names, gene names, and subject descriptors. The subjects were asked to write down concepts (genes, researchers, and subject descriptors) related to each pre-selected term. A sample experimental sheet for one of the terms, ``white'' (a gene name)5 and the 12 related terms generated by Subject 2 are shown in Figure 5. Subjects were then asked to mark terms suggested by the fly thesaurus as irrelevant, somewhat relevant, or very relevant. Figure 6 shows a sample system term association sheet for the same term ``white'' after a subject generated his own terms.
After the term-association experiment, subjects were asked to browse the online fly thesaurus freely, using any terms they preferred and exploring any way they liked. During browsing, subjects were asked to think aloud, and to give specific comments, observations, or suggestions regarding the user interface and the quality of the fly thesaurus. Their complete online sessions were logged. Verbal protocols were recorded and later transcribed for analysis. We aimed to identify directions for system improvement through the detailed analysis of the protocols. The complete association and browsing experiment lasted 1 to 1.5 hours for each subject.
By counting the numbers of terms generated by the subjects themselves and the system-suggested terms marked relevant (both somewhat relevant and very relevant) by the subjects, we were able to tabulate and analyze whether the thesaurus was able to contribute relevant terms during a retrieval process. An analysis of variance procedure (ANOVA) using a statistical package MINITAB [16] was conducted for the search terms, followed by a two-sample t-test to determine the differences in means. The results are summarized in Figure 7. Overall, for each term the fly thesaurus (KB) was able to suggest 11.200 terms. Subjects (All) were able to generate 12.017 terms by themselves. Like the findings in WCS, the expert group performed better than both novice and outsider groups. However, unlike the findings in WCS, the outsider group performed better than the novice group in generating relevant terms.
For the fly thesaurus, the two-sample t-test revealed that there were no statistically significant differences in means (at the level of significance used) for (ALL vs. KB, P-value=0.689), (NOVICES vs. KB, P-value=0.708), and (OUTSIDERS vs. KB, P-value=0.367).
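The comparison of means can be sketched as follows. This is a minimal illustration and not the MINITAB procedure used in the study: it assumes the unpooled (Welch) form of the two-sample t-test, the function name `welch_t` is ours, and the term counts in the usage lines are made-up placeholders rather than experimental data. Turning the statistic into a P-value additionally requires the Student t distribution, which is omitted here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples.

    Assumption: unpooled variances (Welch's form); the text does not say
    which variant of the two-sample t-test was run.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical counts of relevant terms per query term (NOT the study's data).
subject_counts = [1, 2, 3, 4, 5]
thesaurus_counts = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(subject_counts, thesaurus_counts)
```

A large P-value for the resulting statistic, as reported above, means the data give no evidence that the two sources differ in the number of relevant terms they contribute.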
In contrast to the document recall and precision measures typically used in information science research, we adopted concept recall and concept precision for evaluation. Instead of examining the number of relevant documents retrieved, we counted the number of relevant terms (concepts) identified. These two measures were considered appropriate for evaluating the quality of term-association in thesauri [4]. They were computed as follows:
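Written out by analogy with document recall and precision, and assuming that the counts are taken as described in the next paragraph (the wording of the numerators and denominators is ours), the two measures are:

```latex
\[
\text{Concept Recall} =
  \frac{\text{number of retrieved relevant concepts}}
       {\text{number of total relevant concepts}},
\qquad
\text{Concept Precision} =
  \frac{\text{number of retrieved relevant concepts}}
       {\text{number of total retrieved concepts}}
\]
```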
For all subjects, the terms they initially generated and the terms selected from the thesaurus were included to represent the Total Relevant Concepts - the target set of concepts which can be obtained through user-thesaurus interaction. Based on this target set of concepts, we were able to examine the subjects' initial terms (generated without any thesaurus help) and determine the subjects' concept recall and concept precision levels when the thesaurus component was unavailable, i.e., by counting the number of terms which matched the target terms. We then evaluated the concept recall and concept precision levels for the thesaurus by counting the number of thesaurus terms which matched with the target terms. Both ANOVA tests and two-sample t-tests were performed for concept recall and concept precision.
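The counting procedure can be sketched with sets. This is an illustration under stated assumptions, not the evaluation code used in the study: the function names are hypothetical, and the example terms for the query ``white'' are invented for the sketch.

```python
def concept_recall(candidate_terms, target_terms):
    """Fraction of the target (total relevant) concepts found among the candidates."""
    target = set(target_terms)
    return len(set(candidate_terms) & target) / len(target)


def concept_precision(candidate_terms, target_terms):
    """Fraction of the candidate terms that belong to the target set."""
    candidates = set(candidate_terms)
    return len(candidates & set(target_terms)) / len(candidates)


# Hypothetical example for the query term "white" (terms invented for illustration).
subject_terms = {"eye color", "pigment", "zeste"}
thesaurus_terms = {"zeste", "transposon", "study", "protein"}
marked_relevant = {"zeste", "transposon"}  # thesaurus terms the subject marked relevant

# Target set: the subject's own terms plus the thesaurus terms marked relevant.
target = subject_terms | marked_relevant

subject_recall = concept_recall(subject_terms, target)        # 3 of 4 target terms
thesaurus_recall = concept_recall(thesaurus_terms, target)    # 2 of 4 target terms
thesaurus_precision = concept_precision(thesaurus_terms, target)  # 2 of 4 suggestions
```

In this toy case the subject's recall is limited only by the terms the thesaurus added, while the thesaurus loses precision on every suggestion the subject did not mark as relevant.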
The ANOVA results for concept recall are shown in Figure 8. Overall, there were no significant differences (P-value = 0.366) between the subject groups and the thesaurus. On average, subjects' recall level was 53.81%, while that of the thesaurus was 49.80%. These two percentages imply that the overlap between the set of terms from the subjects and that from the thesaurus was only 3.61%. These findings also indicate that subjects could generate about half of the total relevant terms independently, and that the thesaurus was able to help subjects associate the other half. Among experts, novices, and outsiders, there were no significant differences (P-value = 0.277, P-value = 0.774, and P-value = 0.791 respectively) between each subject group and the thesaurus. For these three subject groups, the overlap between the set of terms generated by subjects and that generated by the thesaurus was consistent with the overall overlap: 4.33%, 3.09%, and 3.34% respectively. At all levels of expertise, subjects could recall only about half of the relevant terms without any assistance. With assistance from the thesaurus, the other half of the relevant terms were recalled.
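The overall overlap figure follows from inclusion-exclusion, assuming that the subjects' terms and the thesaurus terms together make up the whole target set (100% of the total relevant concepts). Writing $R_S$ and $R_T$ for the recall of the subjects and of the thesaurus (symbols introduced here for illustration):

```latex
\[
\text{overlap} = R_S + R_T - 100\%
             = 53.81\% + 49.80\% - 100\%
             = 3.61\%
\]
```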
As shown in Figure 9, the thesaurus produced a low level of precision compared with those produced by the human subjects. Overall, human subjects had about a 98.20% concept precision level; the thesaurus had a 37.33% precision level. The differences in overall level and all three subject-group levels were significant (P-value 0.000 in all levels). The low precision level of the thesaurus was attributed partially to the noise terms (mostly terms considered too general) in the thesaurus, as reported in the subjects' determination of the relevancy of those terms to their queries and needs. As is evident in information science research, even man-made thesauri are only useful when terms are presented in the context of the searchers' needs and when selected by the searchers themselves. Thesauri should be used for consultation purposes, not for automatic term replacement. Searchers' involvement during the thesaurus consultation process is crucial to the success of thesaurus usage.
In conclusion, in terms of quantity, the thesaurus and the human subjects were able to generate the same number of relevant terms; but the contents of their lists were almost mutually exclusive. However, human subjects were more precise than the thesaurus. With close human-computer interaction, it appears that an automatic thesaurus-augmented search process can become very fruitful and productive.
The subjective evaluation of the fly thesaurus reported here is the result of analyzing the logged browsing session and the transcripts of the subjects' verbal protocols. During browsing, subjects were asked to think aloud and to give specific comments, observations, or suggestions.
A search involved either typing in a new term (user-suggested), or marking a system-suggested term (iterations). A new search was defined as entry of a searcher's own term. The online thesaurus suggested additional terms, which were shown on an X-Window display. Whether they were using their own terms or system-suggested terms, most subjects used terms that they were either familiar with or curious about. Analysis of the logged search sessions revealed that the six subjects performed a total of 19 searches. All subjects performed 2, 3, or 4 new searches during their browsing sessions. Figure 1 shows that the greatest number of iterations following a new search was three, with nearly 90% (17/19) of all searches being followed by 0, 1, or 2 iterations. This finding is supported by a comment from Subject 1, who noted that after 3 iterations she was getting no new terms or information, and that the same set of terms was reappearing. We conclude that for most searches, the system will converge after as few as 2 or 3 iterations. This has interesting implications for implementation of the spreading activation algorithms in the cross-domain traversal stage of the project.
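The convergence behaviour can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the system's algorithm: the function `browse_until_converged`, the `top_n` cut-off, and the toy thesaurus are all hypothetical, and the simulated searcher simply marks every new suggestion.

```python
def browse_until_converged(thesaurus, start_term, top_n=3, max_rounds=10):
    """Repeatedly query newly suggested terms until nothing new appears.

    `thesaurus` maps a term to a list of related terms, ranked by weight.
    Returns (all terms seen, number of query rounds that produced new terms).
    """
    seen = {start_term}
    frontier = [start_term]
    rounds = 0
    while frontier and rounds < max_rounds:
        suggested = []
        for term in frontier:
            suggested.extend(thesaurus.get(term, [])[:top_n])
        # Keep only suggestions the searcher has not seen yet (order preserved).
        new_terms = [t for t in dict.fromkeys(suggested) if t not in seen]
        if not new_terms:
            break  # the same terms keep reappearing: converged
        seen.update(new_terms)
        frontier = new_terms
        rounds += 1
    return seen, rounds


# Toy thesaurus, invented for illustration only.
toy_thesaurus = {
    "wingless": ["signalling", "segment polarity genes"],
    "signalling": ["wingless", "receptor"],
    "segment polarity genes": ["wingless", "engrailed"],
    "receptor": ["signalling"],
    "engrailed": ["segment polarity genes", "wingless"],
}
seen_terms, productive_rounds = browse_until_converged(toy_thesaurus, "wingless")
```

In this toy graph the third round returns only terms already seen, which mirrors the reappearing term sets that Subject 1 described.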
As a result of our analysis of the session logs and verbal protocols, a taxonomy was developed to represent the observations of the subjects (see Figure 10). Included in this taxonomy are problems identified, which may be taken as potential pitfalls to consider in development of a domain-specific automatic thesaurus, and advantages/benefits identified, which may be taken as potential usefulness of such a thesaurus. This section is organized according to the structure of the taxonomy. This experiment gave us a better idea of how the thesaurus could be modified and used for more real-life purposes and by different user groups.
A. Quality of nodes and links:
The issue of chronology was important to all subjects. However, we learned that the users had different expectations with regard to the emphasis placed on recent and seminal work, and on frequency of citation. This strongly affected their evaluation of the system-generated thesaurus. Two subjects (an expert and an outsider) were more concerned about the priority of more recent publications, citing numbers of publications by each person on the list, and the co-authors and recency of each. One subject (an expert) was more concerned about the priority of the authors responsible for the more seminal works. Two users expected to see certain names close to the top of the list because they had made major contributions to the field (seminal work). However, they were very far down the list because their work was published many years ago, or not on the list at all because their work was published earlier than the range of years included in the database.
In contrast, other subjects suggested that the order of author names was inappropriate because they felt that those at the top of the list were not necessarily more relevant to the area because they had not contributed as much to the field. One was particularly concerned that the authors who published the greatest number of papers in an area should be listed closer to the top of the list, and those who contributed the least should be weighted less heavily.
One subject (an outsider) did not search for a single author. One subject (a novice) searched primarily for authors. This suggests that some users place greater importance on a greater number of citations, and others place more importance on the more historical, foundational work in the field.
In addition to frequency of occurrence and chronological importance, users judged the appropriateness of term weighting according to more subtle degrees of relatedness, which required more subject expertise. One user (an expert) searched on ``N'', the abbreviation for the gene Notch. Associated but separate genes, such as ``deltex'' and ``enhancer of Split'', were ranked higher than researchers who actually work on the Notch gene. She also found our synonymous linkage between ``Notch'' and ``Abruptex'' inappropriate. While these two genes lie in the same chromosomal region, and were once thought to be identical, they are now known to be separate genes. The subject pointed out that this distinction may not be obvious to anyone not fully aware of the historical development of the field, specifically younger investigators.
Spelling (A1.1) and hyphenation (A1.2) of words are not standardized in the abstracts, and this created redundancies in the thesaurus. Spelling differences such as ``Homeodomain'' (American) and ``Homoeodomain'' (British), ``discs'' and ``disks'', and ``signaling'' and ``signalling'' were often confusing for the user. One subject spent more than 30 minutes retracing previous searches trying to find a gene named ``map'' that she was unfamiliar with. She had first identified it by marking the term ``signalling''. When she retraced her steps, she typed in ``signaling'', and was unable to find the desired term. She finally gave up in frustration.
One of the outsiders recognized the appearance of two forms of the term ``DNA BINDING'' (one hyphenated, the other not). Because our algorithmic approach depends upon co-occurrence of terms directly extracted from text, conceptually related terms could be separated if standard practices for hyphenation and spelling are not in force. Hyphenation practices are not well standardized in author-written abstracts, or between databases. Databases containing abstracts composed by professional abstracters or indexers are less likely to contribute to a problem of redundant terms resulting from variations in hyphenation.
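One way to reduce such redundancy is to map every extracted term to a canonical key before co-occurrence counting. The sketch below is an assumption about how that could be done, not part of the thesaurus as built: the variant table `SPELLING_VARIANTS` and the function `normalize_term` are hypothetical, and a real table would have to be curated for the domain.

```python
import re

# Hypothetical variant table; entries taken from the examples in the text.
SPELLING_VARIANTS = {
    "homoeodomain": "homeodomain",
    "signalling": "signaling",
    "disks": "discs",
}


def normalize_term(term):
    """Map a term to a canonical key: lower case, hyphens as spaces, one spelling."""
    words = re.sub(r"[-\s]+", " ", term.strip().lower()).split(" ")
    return " ".join(SPELLING_VARIANTS.get(word, word) for word in words)
```

With this key, ``DNA-BINDING'' and ``DNA BINDING'' collapse to one entry, and ``signalling'' and ``signaling'' retrieve the same list. Folding case is only appropriate for subject descriptors; gene names, where capitalization carries meaning, would need to bypass the lower-casing step.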
Subjects might find a term non-significant or meaningless if it is: an inappropriately included stopword; a single term from a term phrase (a fragmented phrase); or a very broad or non-specific term.
One outsider commented on the appearance of the term ``study'', which he termed "noise", saying that ``you would probably find more of [those terms] in every session''. This points to a need to carefully evaluate the stopword list with regard to the vocabulary of the domain.
Scientific concepts are often described in highly specific multiple-term phrases. All six subjects remarked that some of the 2-word phrases were meaningless, and that the 3-word phrases that contain those 2-word phrases were more proper and held more meaning. This also occurred with 1- and 2-term phrases. One user suggested removing shorter phrases that are contained within longer phrases when we generate term phrases. Examples identified by the subjects included the following. ``Polarity Genes'' means nothing, but ``Segment Polarity Genes'' is significant. ``Cell Formation'' and ``Pole'' are not useful by themselves, but ``Pole Cell Formation'' is very significant. ``Wing Imaginal'' means nothing: the full term is ``Wing Imaginal Disk''.
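The user's suggestion can be sketched as a filter over the generated phrase list. This is a hypothetical illustration (the function `drop_contained_phrases` is ours), and it applies the rule unconditionally; in practice a shorter phrase that is also meaningful on its own would need to be protected.

```python
def drop_contained_phrases(phrases):
    """Remove phrases whose word sequence occurs inside a longer phrase."""
    tokenized = [tuple(p.lower().split()) for p in phrases]

    def contained_in(short, long):
        # True if `short` is a contiguous run of words inside `long`.
        span = len(short)
        return any(long[i:i + span] == short for i in range(len(long) - span + 1))

    kept = []
    for phrase, words in zip(phrases, tokenized):
        if not any(len(other) > len(words) and contained_in(words, other)
                   for other in tokenized):
            kept.append(phrase)
    return kept


# The fragments reported by the subjects.
candidates = [
    "Polarity Genes", "Segment Polarity Genes",
    "Pole", "Cell Formation", "Pole Cell Formation",
    "Wing Imaginal", "Wing Imaginal Disk",
]
filtered = drop_contained_phrases(candidates)
```

Applied to the subjects' examples, only the three full phrases survive.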
Most subjects indicated that split term phrases would be considered ``too general''. One expert asked, ``sexual, sexual what? hormone, hormone what?'' For this subject, the appearance of such terms was so disturbing that he called the ``list of words'' ``useless'', and stated that a list of titles that would provide context would be more useful.
The distinction between general and sufficiently specific terms was emphasized by all subjects. One expert defined useless terms as those that are incomplete, and those that do not indicate the "importance of something". Four subjects (two novices, one outsider, and one expert) defined specific terms as being those that would be unique in the literature, those that would clearly identify an author's field of study, or those that more clearly identified the context of a more general term.
In contrast, the other three subjects (an expert, a novice, and an outsider) were not dismayed by the presence of 'very general' terms. The novice's concerns over the appropriateness of general terms like 'populations' were eased when he learned that by marking multiple terms he could simulate a more 'specific' term, such as 'diverse populations'. The expert and the outsider suggested that one could glean clues from other terms on a given list, and thereby deduce some semantic meaning. In each search performed by the outsider, she attempted to identify semantic links between several of the general terms on the list, to arrive at a scenario that might describe the findings presented in the abstract. In one case, she remarked that it would be beneficial to search for the "role" of a gene, or find out what some researcher has learned as a result of his "characterization" of a protein. By marking several terms of interest, she created a ``cluster'' of comments that represented a concept, similar to the ``topics'' generated by the Electronic Brainstorming Session classification tool created by our group. The subject was even able to find reasons for accepting such general terms as "extensive" (which came up when searching "mammalian homologue", e.g., "extensive homology") and "member" (a term three positions higher was "superfamily"). She surmised that genes with ``extensive'' homology may be ``members'' of a particular ``superfamily'' of genes. She stated that, ``If you know something about biology in general,'' you can find some benefit in having the general terms included in the list by making some deductions about what their relationship might be. Still, she stressed that the availability of abstracts would be helpful for confirmation. In fact, all subjects inquired as to the availability of abstracts.
The expert recognized the same potential for deducing context, but had some difficulty doing so because the nature of the terms was highly technical. She concluded that terms relating to technical subjects would be considered general, because techniques are often "re-used" and even modified for use in other fields of study. One novice further suggested that querying on very technical terms will mean retrieval of a wide range of terms that may or may not have significance to one another, because of the application of particular techniques to a wide range of fields of study.
Another source of ambiguity was the presence of multiple authors with the same name. One outsider performing an author search was perplexed by the output because the terms seemed so disparate, until he concluded that the list of terms represented the work of more than one author.
Although the interface includes a field to indicate the term type (subject (s), gene (g), function (f), author (a)) to reduce ambiguity, most users did not pay much attention to it. Consequently, some terms that fit into more than one ``type'' were considered by the subjects to be ambiguous. Two subjects came across a gene named "map". Because all the terms on the retrieved list seemed to be related to chromosome mapping and gene sequencing, it seemed unclear whether the problem of ambiguity of semantic meaning is really cleared up. It is more likely, however, that this was an example of the ambiguity and generality of technical terms. This ambiguity could very easily and very convincingly be explained by a scenario in which the gene ``map'' was being characterized with regard to its position on the physical or cytological map.
The problem related to synonymous linkages between genes was alluded to briefly above. We used two of the flat files in FlyBase that linked gene names with all synonymous names. One of the subjects (an expert) pointed out that some of the genes listed as synonyms are not actually synonyms, but are alleles, facets, etc. One term (Abruptex) with a weight of 1.000 (synonym) was retrieved for the Notch query. The subject pointed out that Abruptex is a class of alleles that fit under the Notch umbrella. Therefore Notch should be at the top of the tree. These two terms were listed in FlyBase as synonyms.
Our use of the Synonyms files from FlyBase resulted in other incidents where the thesaurus brought up gene names unfamiliar to the subjects. All 6 subjects pointed out that they were unfamiliar with "cel", a gene name that came up in at least one search for each subject. Two of the subjects searched for the gene "wingless", and came up with Dint-1. Neither was familiar with this gene name. Two were unfamiliar with "Erased", which was the primary synonym that came up for "engrailed". It seems that the direction of some of the synonyms is appropriate, and that for others is not. In either case, it becomes clear that asymmetric links between ``synonyms'' are more appropriate than symmetric links where a single gene name searches for all the "synonyms" of a gene. The question of synonymy between genes is an important one. A set of criteria for assigning synonymy may need to be developed. For example, it clearly is not sufficient for two genes to lie on the same stretch of a chromosome (e.g., Abruptex and Notch).
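Asymmetric links can be represented as directed, typed edges that are followed in one direction only. The sketch below is an assumption about one possible representation, not the structure of the FlyBase files or of the thesaurus: the relation labels and the function `follow_links` are hypothetical, and the two entries come from the examples discussed above.

```python
from collections import defaultdict

# Directed, typed links (source -> target). Relation labels are hypothetical.
LINKS = [
    ("Abruptex", "Notch", "allele class of"),
    ("Dint-1", "wingless", "alternative name of"),
]

outgoing = defaultdict(list)
for source, target, relation in LINKS:
    outgoing[source].append((target, relation))


def follow_links(term):
    """Return the (target, relation) pairs reachable from `term` only."""
    return list(outgoing.get(term, []))
```

A query on ``Abruptex'' then leads to ``Notch'' with an explicit relation, while a query on ``Notch'' does not surface ``Abruptex'' as a full-weight synonym.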
The two experts and the outsider who is a faculty member were frustrated by not seeing names they expected to be on the list. One novice was dismayed to see that the gene she works on (``cappuccino'') was not in the thesaurus. We estimate that because of the field's long history, the amount of literature, formal and informal, representing the area of Drosophila genetics is at least 10 times that for the area of worm genetics. The database underlying the thesaurus is only a sampling of the fly literature, fewer than 6000 abstracts from 2 databases. The subjects commented, however, that even within those restrictions, their terms, people, etc. should have been in that collection of documents. Non-experts commented that it would be difficult for them to know whether the lists of terms retrieved for any query were complete. One novice who is a post-doc said he would only be able to know if it were a complete list if he had plenty of time to study the list.
B. Quality of System and Interface
Ease of use of the interface was an issue for all of the subjects. Half of the subjects were finally comfortable with the interface by the end of the browsing session (including one who stated at the beginning of the session that she 'hates computers'). Half never became comfortable, but two of these felt that with more experience its operation would eventually become 'second nature'. One subject found the left cursor/right cursor "impossible", and did not become comfortable with the directionality even by the end of her searching. (This was one of the subjects interested in seeing a mouse capability.) The right cursor key is required to go up to the left-most option on the menu. All the subjects asked for assistance with getting to the menu bar. One subject never got the pattern for typing in a new term. Four subjects would prefer to use a mouse, and several reached for the mouse repeatedly during the search.
The thesaurus returned as many as 200 related terms in response to a query term. In its current implementation, subjects must cursor through the list line-by-line. This was considered ``annoying'', and two subjects suggested that there be "page-up" and "page-down" keys. While the list of terms already queried could be viewed and terms could be deleted from it, terms on the list could not be accessed for use in subsequent searches. Three subjects (an outsider, an expert, and a novice) commented that their inability to go back to prior searches impeded the progress of their search. They each wanted to be able to just see what a given term meant, or was referring to, and when they followed through to find out, they were stuck and could not get back to where they were. Three subjects said that the inaccessibility of the terms on the list necessitated starting over and re-entering search terms. The novice, in particular, spent the majority of her searching attempting to determine the significance of a gene named 'map'. The source of her difficulties was twofold: multiple forms of the word due to spelling differences (discussed above), and lack of access to previously searched terms.
Because the ultimate use of the fly thesaurus is to permit fly researchers to access pertinent information of interest from the WCS, we did not provide subjects with access to the abstracts underlying the fly thesaurus. All subjects inquired about access to abstracts to provide confirmation of the contextual meaning of a given term, and to help them better judge the relatedness and relevance of the retrieved terms, particularly for the most general terms such as terms related to laboratory techniques that would be included in abstracts related to a wide variety of topics.
In the current implementation, both the ``and'' and ``or'' Boolean linkages are displayed. Several subjects requested that the choice of Boolean operator be a user option. Three subjects suggested that system-suggested terms be grouped according to the number of marked or user-suggested terms matched.
The term searched is not at the top of the list of retrieved terms. This causes problems when the searcher sees a term lower on the list that is of interest, and wants to search it together with the original term. The absence of the term on the list requires that the user re-enter both terms using the "Type In A New Term" function. Several subjects were frustrated with this limitation on further searching. Both experts would prefer that terms be classed according to 'type', so that all authors would be together, subjects together, genes together. One subject questioned whether it is realistic to rank an author as being more highly related than a subject or a gene name. All but one subject (an outsider) found the display "hard to read" because of the interspersion of term ``types'' in the display. Two subjects found it useful to reserve judgment about the appropriateness of the ranking until after having read through the entire list of terms to see where terms showed up. To one subject it seemed ``odd'' that a particular gene would be considered more relevant than a person, or a function more relevant than a subject descriptor. She also remarked that it is "hard to read" a "grab-bag" of terms. She suggested a ``clickable'' option that would allow the searcher to select the term type for the terms retrieved.
The online search tool distinguishes upper and lower case in an effort to address the domain-specific naming convention for genes, which requires dominant gene names to be capitalized and recessive gene names to be in lower case. While this convention is well-established and widely known, subjects generally had difficulty getting used to this feature of the online search tool.
A. Serendipitous Discovery:
Subjective evaluation of the fly thesaurus revealed two primary, novel areas of potential usefulness. The first is serendipitous discovery. One expert was at first concerned about what she termed the ``misdirectedness'' of the thesaurus' search method, and expressed frustration with it. However, after a few searches, she conceded that, "It's like a ... word thesaurus in that there are probably going to be things in here that I absolutely never would have thought about... and routes through the data, through the literature that I never would take without it." Most users thought the thesaurus would be useful in introducing them to various aspects of the domain that were beyond their present level of experience. The expert and the novice discussed above found that imagining a scenario in which multiple terms were related led them deductively to a possible context for seemingly unrelated terms. This kind of creative use of the thesaurus is an observation we have not seen reported before. Two subjects (an expert and an outsider) did not see this potential. They were primarily interested in seeing the things they knew to be related and relevant, and anything unexpected was viewed as noise.
B. Memory-Jogging:
The second novel use for the fly thesaurus is memory-jogging. One novice searched for the gene ``wingless''. The term list retrieved contained the term ``signalling'', which triggered a spark of recognition in the subject. She said, ``Let's try `signalling', because `wingless' is presumed to be some sort of signalling molecule.'' In one case, serendipitous discovery and memory-jogging occurred almost simultaneously. The subject recognized the terms ``m4'' and ``96F'' on the list of retrieved terms, and was reminded that, ``m4 is a minute, which may exist at 96F; I know there are some out there [in the chromosome]. So I would say that maybe there is a rearrangement. Oh I know what it is, this is probably a clonal analysis using a minute to diminish the size of the wild type clones, so that the mutant clones can grow.'' By applying deductive reasoning to the thesaurus' inherent characteristics emphasizing knowledge discovery and memory-enhancement, the subject's creativity was triggered. This research has demonstrated that automatic thesauri have the potential to offer improvements to searching that far exceed those possible with manual thesauri.
In summary, the results from the experiment were very encouraging. The thesaurus suggested relevant terms and concepts that would not only be helpful for different users, but useful in spurring user ideas and desire to acquire knowledge. The thesaurus was tested by objective and subjective measures and produced results indicating that it did produce relevant terms, improve search recall, and would be useful as a thesaurus for the FlyBase.
Our most immediate future research directions, as part of a long term effort to develop a more efficient and ``intelligent'' framework and design for the management, retrieval, sharing, and dissemination of information for distributed, scientific computing, include the following:
We believe we are moving closer to finding a solution to one of the most challenging problems in IR: the vocabulary problem. In scientific communities an outsider (e.g., a fly biologist) often needs to search for literature in other domains (e.g., worm biology) using his/her own vocabularies (i.e., fly-specific terms). Now that we have created concept spaces for the fly and worm communities, we will analyze the overlapping terms in the two concept spaces. By adopting a multiple-thesauri consultation process (we have developed one, reported in [4]), a searcher's fly-specific terms can be used to traverse the two concept spaces and eventually converge towards specific terms in the (target) worm concept space. Results of some initial structural comparison between the fly and worm thesauri are summarized in Table 2.
We intend for the automatically generated thesauri to be useful for scientists interested in searching literature in other research domains using terminology from their own domain. Before devising the mechanism for traversal from one thesaurus to the other, it was first important to determine the extent of overlap between the index terms contained in each. Because the object filter lists available for the two domains differed, some direct comparisons are not possible. Table 2 shows the number of terms in the Worm and Fly thesauri, the number of author terms, number of gene terms, number of subject terms, number of function terms (fly thesaurus only), and number of method terms (worm thesaurus only). The last three columns report the number of terms appearing in both thesauri, and the proportion of each thesaurus that the overlapping terms represent. It is not surprising that no overlap exists in gene names: the naming conventions for the two domains are extremely different. Furthermore, it is noteworthy that 252 author names appear in both thesauri. The format for author names is last name and first initial, which could present some ambiguity. Still, it is feasible that some authors have published in both domains. The extent of overlap for the subjects was greater than 25% for the fly thesaurus and 32% for the worm thesaurus. With this much overlap, the likelihood of finding intermediate terms for concept space traversal is promising. We are in the process of designing a concept traversal experiment with the help of several molecular biologists who are knowledgeable in both fly and worm biology.
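The overlap comparison amounts to a set intersection computed per term type. The sketch below illustrates the calculation only: the function `overlap_report` is hypothetical, and the subject descriptors are invented placeholders, not the contents of either thesaurus or of Table 2.

```python
def overlap_report(fly_terms, worm_terms):
    """Count shared terms and the share of each thesaurus they represent."""
    fly, worm = set(fly_terms), set(worm_terms)
    shared = fly & worm
    return {
        "shared": len(shared),
        "share_of_fly": len(shared) / len(fly),
        "share_of_worm": len(shared) / len(worm),
    }


# Hypothetical subject descriptors (NOT taken from either thesaurus).
fly_subjects = {"embryo", "mutation", "segment polarity", "homeodomain"}
worm_subjects = {"embryo", "mutation", "vulva"}
report = overlap_report(fly_subjects, worm_subjects)
```

Because the two thesauri differ in size, the same shared set represents a different proportion of each, which is why the two subject-term percentages reported above are not equal.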
As a natural extension of the current research, we will be testing some AI-based general search algorithms (e.g., branch-and-bound and Hopfield network [5]) for automatic thesaurus consultation. These algorithms will be able to assist searchers in traversing the entire concept space by following the more relevant links first, a general characteristic of optimal or heuristic search algorithms. We have done some work in this area already, but significant experimentation is still required to develop a robust automatic thesaurus consultation module for the WCS.
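The idea of following the more relevant links first can be illustrated with a plain best-first search over weighted links. This is a stand-in chosen for brevity, not the branch-and-bound or Hopfield network algorithms of [5]: the function `best_first_terms`, the scoring of a path as the product of its link weights, and the toy links are all assumptions made for the sketch.

```python
import heapq


def best_first_terms(links, start, limit=5):
    """Visit terms in order of decreasing path relevance from `start`.

    `links[term]` is a list of (neighbour, weight) with 0 < weight <= 1;
    path relevance is the product of the weights along the path.
    """
    best = {start: 1.0}
    heap = [(-1.0, start)]  # max-heap via negated scores
    visited = []
    while heap and len(visited) < limit:
        neg_score, term = heapq.heappop(heap)
        if term in visited:
            continue  # stale heap entry for a term already expanded
        visited.append(term)
        for neighbour, weight in links.get(term, []):
            score = -neg_score * weight
            if score > best.get(neighbour, 0.0):
                best[neighbour] = score
                heapq.heappush(heap, (-score, neighbour))
    return visited


# Toy weighted links, invented for illustration.
toy_links = {
    "wingless": [("segment polarity", 0.9), ("signalling", 0.5)],
    "segment polarity": [("embryo", 0.8)],
    "signalling": [("embryo", 0.9)],
}
order = best_first_terms(toy_links, "wingless")
```

Here ``embryo'' is reached through the stronger chain (0.9 x 0.8 = 0.72) before ``signalling'' (0.5) is expanded, so the searcher would be shown it earlier.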
Currently the thesaurus is generated in batch mode (in about 4 hours). Because scientific databases are rapidly and continually growing, an effective method for incremental update of the thesaurus is needed. We are currently developing an incremental version of our cluster algorithm for implementation in the worm and fly thesauri. By storing some intermediate results for term frequencies and inverse document frequencies, an incremental update of the thesaurus is possible.
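The stored intermediate results could take the form of running counts that are updated one abstract at a time. The sketch below shows only that bookkeeping and is not the incremental cluster algorithm under development: the class `IncrementalCounts` is hypothetical, and the `log(N / df)` form of inverse document frequency is a common textbook choice assumed here, which may differ from the weighting actually used.

```python
from collections import Counter
from itertools import combinations
from math import log


class IncrementalCounts:
    """Running statistics that can be updated one abstract at a time."""

    def __init__(self):
        self.num_docs = 0
        self.doc_freq = Counter()   # number of abstracts containing each term
        self.pair_freq = Counter()  # number of abstracts containing both terms

    def add_document(self, terms):
        """Fold the terms of one new abstract into the stored counts."""
        unique = sorted(set(terms))
        self.num_docs += 1
        self.doc_freq.update(unique)
        self.pair_freq.update(combinations(unique, 2))

    def idf(self, term):
        """Inverse document frequency (assumed log(N / df) form).

        Only defined for terms that have been seen in at least one abstract.
        """
        return log(self.num_docs / self.doc_freq[term])


counts = IncrementalCounts()
counts.add_document(["Notch", "deltex"])  # hypothetical abstracts
counts.add_document(["Notch", "wing"])
```

Adding an abstract touches only the counters for the terms it contains, so co-occurrence weights can be recomputed for the affected terms without reprocessing the whole collection.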
So far we have not included the ``time'' dimension of the documents and concepts in our analysis. By time-tagging each concept and analyzing the activities associated with each concept (e.g., when it first appeared, when it was most actively used, etc.), we believe a more fluid and time-precise thesaurus can be created.
The rationale behind our approach is that instead of letting knowledgeable information specialists (knowledgeable in several subject areas) perform term matching and consultation, we could automatically create different domain-specific thesauri tailored to the vocabularies and concepts exhibited in the related disciplines. For example, worm-related disciplines might be classified based on (1) organisms: coli, yeast, fly, mice, and human; or (2) level of analysis: biochemical, molecular, cellular, developmental, medical, and ecological. Our longer-term research effort will involve creating other ``outsider'' thesauri and incorporating them in the WCS. We hope by expanding the ``knowledge'' of the WCS, the community system will eventually be able to assist in seamless, distributed, and ``intelligent'' concept-based information retrieval for different community users.
This project was supported mainly by two NSF grants: the NSF CISE Research Initiation Award, IRI-9211418, 1992-1994 (H. Chen, ``Building a Concept Space for an Electronic Community System'') and NSF CISE Special Initiative on Coordination Theory and Collaboration Technology, IRI-9015407, 1990-1993 (B. Schatz et al., ``Building a National Collaboratory Testbed''). We would also like to thank the faculty and students of the Molecular and Cellular Biology Department, Ecology and Evolutionary Biology Department, Anatomy Department, and Biochemistry Department at the University of Arizona for their kind assistance and valuable suggestions, in particular, Dr. Samuel Ward, Dr. Danny Brower, Dr. John Clark, Dr. John Little, Dr. Mary Rykowski, Ms. Alicia Minniti, and Ms. Lisa Werner.