Hsinchun Chen 1, Bruce Schatz 2, Joanne Martinez 3, Tobun Dorbin Ng 4
We tested the fly thesaurus in an experiment with six fly biologists of varying degrees of expertise and background. This study confirms earlier research (on a C. elegans worm thesaurus) that demonstrated the feasibility of an algorithmic approach to the vocabulary problem in online information retrieval. The knowledge and literature representing Drosophila research present important challenges, including non-standardized naming conventions, a long history with a great deal of vocabulary fluidity, and about ten times the volume of literature representing the worm research. The experiment showed that the thesaurus was an excellent memory-jogging device and that it supported learning and serendipitous browsing. Despite some occurrences of obvious noise, the system was useful in suggesting relevant concepts for the researchers' queries. The experience and lessons learned during the fly thesaurus development and evaluation process are presented in detail.
Biological research is highly data-intensive, and data accumulation in this area is growing rapidly because of massive efforts such as the Human Genome Initiative and other genome mapping projects. The complexity of the data and their inter-relationships have made genome informatics even more challenging. At present, the genomes of several organisms are being sequenced and mapped, including Caenorhabditis elegans (nematode worm), Drosophila melanogaster (fruit fly), Mus musculus (mouse), Homo sapiens (human), Escherichia coli (bacterium), Saccharomyces cerevisiae (yeast), and Arabidopsis thaliana (plant). Because communities in molecular biology form around organisms rather than techniques or problems, the results generated are stored in separate databases by each scientific community. Information about the identity, function, cytological and genetic location, mutations, and aberrations of any particular gene is therefore scattered among a variety of distributed, heterogeneous databases.
These systems range from elaborate, sophisticated database management systems to unstandardized machine-readable files created through text conversion. Distributed, heterogeneous databases are preferable to vast, centralized databases because of such inherent difficulties as file size, currency, updating, and retrieval. Nevertheless, the resulting problem of information scattering makes it difficult for scientists to share and compare information that exists in different files. Connectivity among databases is essential if scientists are to make full use of their contents. Adequate tools for structuring and accessing the contents of a diverse and heterogeneous collection of scientific databases are needed to unlock the knowledge embedded in them. Tools developed for this purpose must address the following problems: information overload, information scattering, vocabulary fluidity over time, unstandardized nomenclature, and semantic differences in the vocabularies used by the various subdisciplines of biology.
The Human Genome Initiative (HGI) offers tremendous challenges not only to the biology, biomedicine, and genetics research communities, but also to the information science and computer science communities. According to Courteau [13], the Human Genome Project ``will generate more data than any single project to date in biology,'' resulting in complete sequences and physical maps containing the location of every gene of the human genome and the genomes of other model organisms. The vast amount of knowledge accumulated during the project's scientific discovery process can only be managed with the use of computing technologies that support efficient and effective storage, retrieval, and analysis of information, that foster seamless distributed scientific collaboration, and that facilitate timely information dissemination and sharing.
FlyBase [20] is a set of linked databases designed to provide the Drosophila research community with access to broad and thorough coverage of molecular, genetic, and community information. The Drosophila community is one of the oldest groups in biological research. Most of the knowledge it has generated has been recorded in two sources, the ``Redbook'' [31] and the publications of the Drosophila Information Service. FlyBase gives promise of providing improved access to domain knowledge through use of computer technologies.
A consortium of Drosophila researchers, funded by grants from the U.S. National Institutes of Health (Washington) and Medical Research Council (London), guides FlyBase development and provides advice on preferred formats to groups developing allied databases and to individual contributors of data. FlyBase is accessible and searchable through Gopher (and Veronica) client software or obtainable via anonymous ftp from its primary archival site at the Department of Biology at Indiana University (IU). The primary database consists of a series of flat files that contain molecular and genetic data on Drosophila melanogaster, including the entire contents of the ``Redbook.''
In order to provide comprehensive access to Drosophila information, the IU Department of Biology also provides access to numerous ``allied'' databases developed at other institutions, each with its own focus. Included are a unified bibliography on Drosophila; pointers to nucleic acid and protein sequence databases; stock lists directing researchers to sources of clones and other biological reagents; a genetic map of Drosophila; a list of Drosophila genes sorted by function; lists of clones from European and American sources, and instructions for placing orders; a directory of Drosophila workers; and other associated databases useful to Drosophila researchers. Forums for informal communication between researchers are provided through the Drosophila Information Newsletter and a Bionet News group dedicated to Drosophila research. Each information source is resident on computers at its ``home'' institution. Access to the complete set of databases is organized through the IUBIO Gopher site. Table 1 summarizes some of the major online information sources of relevance to fly biologists.
In addition to various genome databases such as FlyBase, electronic community systems (ECS) have been proposed and implemented [43]. These have drawn significant attention recently due to the rapid proliferation and advancement of computing, databases, and telecommunication technologies. An electronic community system encodes a research community's information and knowledge and provides an online environment to support the manipulation of that knowledge. An advantage of this type of system over traditional databases is that an ECS enables users to browse the available knowledge easily, record their own knowledge for others to use, indicate authorizations for users to either view or annotate their own data, annotate entries from others' research, and form interrelationships between concepts [13]. An ECS is much like an electronic library where users can browse for relevant information, filter out information they do not currently need, and share data that they have collected.
Another novel characteristic of an ECS is its ability to handle a wide variety of community knowledge, both formal and informal. In order to ``live effectively within a community, one must have available both formal archival material and informal transient folklore'' [43]. An ECS provides both by integrating published literature about formal research findings with intermediate results, experimental protocols, laboratory notes, electronic bulletin board discussions, e-mail exchanges, and other informal data sources. It is therefore becoming an indispensable tool that allows researchers to browse, explore, and understand a vast and rapidly changing world of scientific knowledge while at the same time creating a group memory.
The Worm Community System (WCS), a major NSF-funded collaboratory project, has been considered a model electronic community system [39] [44]. Constructed for C. elegans researchers, it offers traditional database functionalities along with literature, informal information and research lore, mapping programs and graphics, and the ability for users to browse, share, and filter a large amount of timely worm community knowledge. The system is intended to serve not only the entire community of worm biologists but also other related biology and biomedical community members [43] [13] [44]. In previous research, we developed a worm thesaurus based on the complete WCS literature [12]. The worm thesaurus was found to be an excellent memory-jogging tool and concept-based search aid and was incorporated into the WCS Release 2 made available to worm biologists in August 1993.
While the formats of the WCS and FlyBase are very different, each attempts to serve as a central, electronic ``clearinghouse'' for information a researcher needs to remain up-to-date. However, despite the potentially substantial benefits of being able to access, retrieve, and analyze data and information about other organisms, the use of distributed heterogeneous databases presents many significant obstacles. Besides the technical problems inherent in the use of various database formats, other sources of difficulties include information overload, information scattering, vocabulary fluidity over time, and differences in semantic meaning and nomenclature conventions between domains. In the context of information retrieval, providing access to multiple, disparate information sources (e.g., the fly information sources, as shown in Table 1) and supporting cross-domain information retrieval (e.g., fly biologists accessing information about recent findings in worm genetics) are two pressing research areas.
In this paper, we present an algorithmic approach to generation of a fly thesaurus. The main techniques used in our approach are presented in Section 3. Section 4 discusses in detail the algorithms and sample results for object filtering and automatic indexing of Drosophila literature and the algorithms and findings of the cluster analysis process. Section 5 presents the current status of our system implementation and an experiment we conducted involving subject area experts, novices, and (fly) community outsiders. Experimental results are discussed in detail in Section 6. Directions for future research are presented in Section 7.
Information retrieval in large document collections often requires vocabulary expansion aids because, as Blair and Maron [4] contend, ``vocabulary problems make high recall impossible in full-text databases.'' Gomez et al. [24] [23] found in their studies that ``searcher success is markedly improved by greatly increasing the number of names per object.'' They proposed an ``unlimited aliasing'' strategy, which allows essentially unlimited numbers of aliases for objects, to alleviate the vocabulary difference problem.
Many research groups have created vocabulary-based search aids for online information retrieval systems by making use of existing thesauri or dictionaries. Thesauri, in particular, exhibit a structure similar to human word-association networks. While these tools are able to provide the searcher with alternate terms to use in searching, they do not overcome the knowledge acquisition bottleneck [25]: the cognitive demand required of humans (indexers or domain experts) to create thesauri or dictionaries in the first place. An alternative approach to creating vocabulary-based search aids is based on automatic thesaurus generation.
The National Library of Medicine's Unified Medical Language System (UMLS) project is one of the largest-scale efforts adopting existing domain-specific knowledge sources or thesauri in information access. It aims to build an intelligent automated system that understands biomedical terms and their interrelationships and uses this understanding to help users retrieve and organize information from machine-readable sources [32] [30]. The UMLS includes a Metathesaurus (consisting of biomedical concepts and their relationships as presented in more than 10 different existing vocabularies and thesauri); a Semantic Network (containing information about and relationships between the categories or classes included in the Metathesaurus); and an Information Sources Map or directory (containing information about various biomedical databases). The system suggests terms for user selection.
Many recent information science projects also have adopted multiple existing thesauri for term suggestion. Chamis discussed the issues of thesaurus compatibility and strategies and systems developed to overcome difficulties in searching multiple incompatible databases [5]. In particular, she described the effectiveness of the Vocabulary Switching System (VSS), an integrated vocabulary consisting of twelve existing thesauri in four diverse subject areas (business, social sciences, life sciences, physical sciences). Knapp's BRS/TERM vocabulary database maps natural language synonyms and controlled vocabulary descriptors from seven bibliographic databases in the social and behavioral sciences [28]. The NTIS database consists of records from databases of numerous government agencies, each of which has its own thesaurus. The NTIS thesaurus represents a merged vocabulary of these various micro-thesauri, together with natural language terms, and ``tags'' indicating the source of each term [38]. In a similar effort, Chaplan mapped terms from the Laborline Thesaurus to the Library of Congress Subject Headings (LCSH) [6]. Development of the Art and Architecture Thesaurus (AAT) began as an attempt to improve upon the LCSH vocabulary by integrating terms from numerous disparate domain-specific thesauri and word lists, and presenting them in a hierarchical structure similar to that of the NLM's Medical Subject Headings (MeSH). The result is a faceted, hierarchical vocabulary that is compatible with and appropriate for libraries primarily centered on LCSH [36] [37]. Another project undertaken by the Genentech library, based on the methods used by Petersen with the AAT, attempted to rectify inconsistencies between the LCSH and MeSH descriptors in domains related to genetic engineering and molecular biology [3]. 
Finally, Niehoff and associates at Battelle Columbus Laboratories developed an integrated vocabulary for the energy domain which represented terms from eleven existing vocabularies [33] [34].
Several projects have attempted to incorporate existing thesauri in the design of knowledge-based information retrieval systems. Fox et al. focused on creation of so-called ``relational thesauri.'' For example, Fox's CODER system adopted the Handbook of Artificial Intelligence and Collin's Dictionary [21] [22]. Ahlswede and Evens [1] parsed Webster's Seventh New Collegiate Dictionary to obtain a ``lexical database'' containing lexical or lexical-semantic relationships from the dictionary definitions. Lesk converted an online version of Murray's Oxford English Dictionary into a thesaurus-like tool to facilitate searching of historical manuscripts. These approaches represent attempts to produce ``universal lexicons,'' rather than domain-specific thesauri or dictionaries. Chen et al. conducted a series of experiments which included several large-scale, domain-specific thesauri. In [7], Chen and Dhar incorporated a portion of the LCSH in the computing area into a system that used a branch-and-bound spreading activation algorithm to assist users in query formulation. More recently, they developed concept-based document retrieval using multiple thesauri: two existing thesauri (LCSH and the ACM Computing Review Classification System) and an automatically-generated computing-specific thesaurus [10] [11].
Numerous investigators have developed algorithmic approaches to automatic thesaurus generation. Most of these approaches employ techniques that compute coefficients of ``relatedness'' between terms using statistical co-occurrence algorithms (e.g., cosine, Jaccard, Dice similarity functions) [9] [15] [42] [40]. Some algorithms, however, perform cluster analysis to further group terms of similar meanings [19] [40].
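For illustration only, the three symmetric co-occurrence functions named above can be sketched in Python. The sketch assumes the simplest binary setting, in which each term is represented by the set of documents that contain it; the document identifiers and the two example terms are hypothetical and not drawn from any collection discussed here.

```python
import math

def cosine(docs_a: set, docs_b: set) -> float:
    """Cosine similarity for binary document-occurrence sets."""
    if not docs_a or not docs_b:
        return 0.0
    return len(docs_a & docs_b) / math.sqrt(len(docs_a) * len(docs_b))

def jaccard(docs_a: set, docs_b: set) -> float:
    """Shared documents divided by documents containing either term."""
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0

def dice(docs_a: set, docs_b: set) -> float:
    """Twice the shared documents divided by the sum of both set sizes."""
    total = len(docs_a) + len(docs_b)
    return 2 * len(docs_a & docs_b) / total if total else 0.0

# Hypothetical postings: document ids in which each term occurs.
homeotic = {1, 2, 3, 4}
antennapedia = {2, 3, 4, 5}

scores = {
    "cosine": cosine(homeotic, antennapedia),
    "jaccard": jaccard(homeotic, antennapedia),
    "dice": dice(homeotic, antennapedia),
}
```

All three functions return the same value when their arguments are swapped, which is the symmetry property at issue in the discussion of asymmetric alternatives.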
Stiles [45] was one of the early researchers to report improved retrieval performance using a method based on term association (with collections of librarian-applied subject tags). Doyle [17] further argued that the principles underlying association-based retrieval should apply whether the associations are determined by humans or by machines (programs). Courtial and Pomian [14] argued that searches performed in the realms of science and technology frequently involve association of concepts that lie outside the traditional associations represented in thesauri. Associative networks gleaned through textual analysis, they argued, facilitate innovation by making obvious associations that would otherwise be impossible for humans to find on their own. In early research [29], Lesk found little overlap between term relationships generated through term associations and those presented in existing thesauri.
More recently, Crouch and Yang [16] automatically generated thesaurus classes from text keywords, which can subsequently be used to index documents and queries. Crouch's approach is based on Salton's vector space model and the term discrimination theory. Documents are clustered using the complete link clustering algorithm (agglomerative, hierarchical method). Ekmekcioglu et al. [18] tested retrieval performances for 110 queries on a database of 26,280 bibliographic records using four approaches: original queries and query expansion using co-occurrence data, Soundex code (a phonetic code that assigns the same code to words that sound the same), and string similarity measure (based on similar character microstructure), respectively. The four approaches produced 509 (original queries), 526 (term co-occurrence), 518 (Soundex), and 534 (string) documents, respectively. They concluded that there were no significant differences in retrieval effectiveness among these expansion methods and initial queries. However, a close examination of their results revealed that there was a very small degree of overlap between the retrieved relevant documents generated by the initial queries and those produced by the co-occurrence approach (19% overlap using the Dice coefficient). This suggests that search performance may be greatly improved if a searcher can select and use the terms suggested by a co-occurrence thesaurus in addition to the terms he/she has generated.
The limitations of the popular symmetric similarity functions, e.g., cosine, Dice, and Jaccard's, have been reported by Peat and Willett [35]. Their research showed that similar terms identified by symmetric co-occurrence function tended to occur very frequently in the database that is being searched and thus did little or nothing to improve the discriminatory power of the original query. They concluded that this can help explain Sparck Jones's finding that the best retrieval results were obtained if only the less frequently occurring terms were clustered and the more frequently occurring terms were left unclustered.
Several research groups recently have experimented with an algorithmic approach to cross-domain term switching. Chen et al. experimented extensively in generating, integrating, and activating multiple thesauri (some were existing thesauri, others automatically generated, all in computing-related areas) [10] [11]. Both Kim and Kim [27] and Chen et al. [10] proposed treating (automatic and manually-created) thesauri as a neural network or semantic network and applying spreading activation algorithms for term-switching. Despite questions about the usefulness of automatic thesaurus browsing heuristics [26], our recent experiment revealed that activation-based term suggestion was comparable to the manual thesaurus browsing process in document recall and precision, but that the manual browsing process was much more laborious and cognitively demanding [11].
The specific algorithms adopted in this research include: object filtering, automatic indexing, and co-occurrence analysis. In the following section, we present an overview of these techniques and our modifications.
In [2], Bates proposed a design model for subject access in online catalogs. She stressed the importance of building domain-specific lexicons for online retrieval purposes. A domain-specific, controlled list of keywords can help identify legitimate search vocabularies and help searchers ``dock'' on to the retrieval system. For most domain-specific databases, there generally appear to be some existing lists of subject descriptors (e.g., the subject indexes at the back of textbooks), researchers' names (e.g., author indexes or researcher directories), and other domain-specific objects (e.g., genes, experimental methods, organizational names, etc.), either online or obtainable through OCR scanning. These domain-specific keywords can be used to help in automatic identification of important concepts in documents. After the object filter lists are collected, we can perform a pattern matching procedure to identify words and phrases in documents that match with the indexes in the object filter lists. These matched terms are then used as candidate indexes for the specific document.
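The pattern matching procedure described above can be sketched as a longest-match lookup of word sequences against the filter lists. This is a minimal illustration under stated assumptions, not the procedure used in this work: the function name `object_filter`, the four-word phrase limit, the stripping of trailing sentence punctuation, and the example filter lists are all hypothetical.

```python
def object_filter(text, filters, max_words=4):
    """Return (term, type) pairs for phrases in `text` found in a filter list.

    `filters` maps a type name (e.g. "gene") to a set of known terms.
    Longer phrases are tried first so that multi-word terms win.
    """
    lookup = {term: kind for kind, terms in filters.items() for term in terms}
    tokens = text.split()
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            # Assumption: trailing sentence punctuation is not part of a term.
            candidate = " ".join(tokens[i:i + n]).rstrip(".,;:")
            if candidate in lookup:
                matches.append((candidate, lookup[candidate]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no filter term starts at this token
    return matches

# Hypothetical filter lists and abstract text.
filters = {
    "gene": {"white", "Antennapedia"},
    "subject": {"homeotic genes", "embryo"},
}
text = "Expression of Antennapedia and other homeotic genes in the embryo."
matches = object_filter(text, filters)
```

The matched terms would then serve as candidate indexes for the document, and the unmatched text would be passed on to automatic indexing.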
In [9], we used several domain-specific controlled lists of subject keywords, researchers' names, and organizational names for indexing in a Russian computing database (with about 200 MBs and 40,000 documents). In creating the worm thesaurus [12], we utilized author indexes from literature sources, the WCS gene list, the subject index from the Worm Book, and an existing keyword list.
After object filtering, the texts remaining may still contain many important concepts not detected by the filters. In [42], Salton presents a blueprint for automatic indexing, which typically includes dictionary look-up, stop-wording, word stemming, and term-phrase formation. The algorithm first identifies individual words. A stop word list is then used to remove function words such as the, a, on, in, etc. After removing the stop words, a stemming algorithm is used to identify the word stem for the remaining words. Finally, term-phrase formation that formulates phrases by combining only adjacent words is performed.
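The four steps above (word identification, stop-wording, stemming, and phrase formation from adjacent words) can be sketched as follows. The stop word list is a tiny hypothetical sample, and `crude_stem` is a deliberately simple stand-in for a real stemming algorithm; neither is taken from [42]. The sketch also assumes that a stop word breaks adjacency, so phrases are formed only within runs of consecutive content words.

```python
import re

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"the", "a", "an", "on", "in", "of", "and", "is", "are", "to"}

def crude_stem(word):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def automatic_index(text, max_phrase_len=2):
    """Identify words, drop stop words, stem, and form adjacent-word phrases."""
    words = re.findall(r"[a-z]+", text.lower())
    runs, current = [], []
    for word in words:
        if word in STOP_WORDS:
            if current:          # a stop word ends the current run
                runs.append(current)
                current = []
        else:
            current.append(crude_stem(word))
    if current:
        runs.append(current)
    terms = []
    for run in runs:
        for n in range(1, max_phrase_len + 1):
            for i in range(len(run) - n + 1):
                terms.append(" ".join(run[i:i + n]))
    return terms

terms = automatic_index("The homeotic genes in the developing embryo")
```

Note that the regular expression used here discards digits and special characters, which is acceptable for general text but would not suit domain names that depend on them.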
While automatic indexing identifies subject descriptors in a document, the relative importance of each descriptor to representing the content of the document may vary. Salton's Vector Space Model associates a weight with each descriptor to represent its descriptive power. Among the many probabilistic techniques that have been developed by various information science researchers, techniques which typically incorporate term frequency and inverse document frequency have been found to be simple and yet very useful [42]. Based on cluster analysis [19], the Vector Space Model has been extended for automatic thesaurus generation. In our research, we only adopted a co-occurrence analysis algorithm to convert the raw data (e.g., indexes and weights) into a matrix of inter-individual similarity (or distance) measures. In the biomedical and biological domains, we have found that the subject matter is often very specific (i.e., gene names, experimental methods, and topics) and provides a sound basis for performing co-occurrence analysis [12].
Although the above techniques had been employed in other applications, including the development of a highly domain-specific, up-to-date automatic thesaurus for the worm community, the volume of Drosophila literature and the unstandardized naming scheme utilized in the fly community offered special challenges. During our system development process, significant adaptation was required to meet the specific constraints and novel characteristics of FlyBase.
Specific concepts like gene names, function names, researchers, and subject descriptors obtained from FlyBase, and from the large number of fly-related documents in Medline and Biosis abstracts, provided the foundation for automatic thesaurus generation. FlyBase itself represents the long research history of the Drosophila community. The volume of information is estimated as 10 times that of the WCS. The automatic thesaurus generated for FlyBase used as a knowledge source a collection of 5,854 abstracts (more than 10 MBs of textual information) from Medline and Biosis CD-ROM, with a time span between 1983 and 1993. After interviewing several fly biologists at the University of Arizona, we chose to generate and represent the important concepts described in recent literature instead of capturing the concepts exhibited in all fly literature, because the latest concepts may be of most interest to current researchers. The experience and lessons learned during the system development and thesaurus evaluation process are reported below.
We created four lists of fly-related keywords from several directories in FlyBase and from Biosis and Medline. Unlike the unified and systematic naming convention used in the worm community, the format in Drosophila is more free-style. Capitalization is important for gene and function names. Furthermore, there is no systematic way to name alleles in Drosophila. Special characters such as left parenthesis ``('', pipe ``|'', apostrophe ``''', comma ``,'', slash ``/'', and plus sign ``+'', are commonly used in the naming convention. Here are some examples of those names: ``(-Glycerol phosphate dehydrogenase'', ``Ac-SD'', ``Su(Ste)'', ``l(3)4.15'', ``ort1 ninaE1'', ``ADP/ATP translocase'', ``fructose-1,6-bisphosphate aldolase'', ``(Na+ K+) ATPase ( subunit'', and ``Casein kinase II, | subunit''. Because of this unconstrained naming convention, a parser that had been developed previously and used for several applications [9] [12] [8] was revised to handle these scientific terms. Our revised parser used the space character as the delimiter and considered all special characters and punctuation marks valid for forming a term. We created four groups of object filters as follows:
In total, we pre-identified 23,773 terms known to be specific to the fly domains using object filters. These terms were of four types: gene, function, researcher, and subject descriptor (terms identified by the four different filters were assigned to different types). We used these object lists to ``filter'' domain-specific concepts from the Biosis and Medline abstracts and index the objects to the abstracts.
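The effect of delimiting terms by spaces only, as the revised parser did, can be illustrated with a short Python comparison. Both function names are hypothetical; `naive_tokens` stands for a conventional tokenizer that keeps only letters and digits, and `fly_tokens` for a space-delimited one that treats every other character as part of the term and preserves capitalization.

```python
import re

def naive_tokens(text):
    """Conventional tokenizer: special characters act as delimiters."""
    return re.findall(r"[A-Za-z0-9]+", text)

def fly_tokens(text):
    """Space-delimited tokenizer: special characters stay inside the term."""
    return text.split()

sentence = "Su(Ste) interacts with l(3)4.15"
broken = naive_tokens(sentence)
intact = fly_tokens(sentence)
```

With the conventional tokenizer the gene names fall apart into fragments such as ``Su'' and ``Ste'', whereas the space-delimited version keeps ``Su(Ste)'' and ``l(3)4.15'' whole so that they can be matched against the object filter lists. A side effect, not addressed in this sketch, is that sentence punctuation attached to a word is retained as well.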
After documents were ``filtered'' using the four lists, the remaining text in the abstracts was processed by the automatic indexing routine. The following steps were executed in order:
We first computed the term frequency and the document frequency for each term in a document (i.e., terms generated by object filtering and automatic indexing). Term frequency, $tf_{ij}$, represents the number of occurrences of term $j$ in document $i$. Document frequency, $df_j$, represents the number of documents in a collection of $N$ documents in which term $j$ occurs. High term frequency indicates that a term is highly related to a document. High document frequency, on the other hand, indicates that a term is too general to be useful as a descriptor (i.e., has no descriptive power).
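As a small illustration of these two counts, the following Python sketch computes term frequencies per document and document frequencies over a collection. The function name and the three-document collection are hypothetical.

```python
from collections import Counter

def term_and_document_frequencies(docs):
    """Return (tf, df) for a collection.

    `docs` maps a document id to the list of terms indexed for it.
    tf[i][j] is the number of occurrences of term j in document i;
    df[j] is the number of documents in which term j occurs.
    """
    tf = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each document counts once per term
    return tf, df

# Hypothetical indexed collection.
docs = {
    1: ["white", "eye", "white"],
    2: ["white", "homeotic"],
    3: ["homeotic", "ANTC"],
}
tf, df = term_and_document_frequencies(docs)
```

In this toy collection ``white'' occurs twice in document 1 but is counted once toward its document frequency for that document.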
Usually terms identified from the title of a document are more descriptive than terms identified from the abstract of the document. This is also the case here, as confirmed by fly experts at University of Arizona. In addition, terms identified through object filtering are usually more accurate than terms generated by automatic indexing. This is due to the fact that terms generated by automatic indexing are relatively ``noisy.'' In our research, terms identified in titles were assigned heavier weights than terms in abstracts and terms identified by object filtering were assigned heavier weights than terms identified by automatic indexing. (An ad hoc weighting scheme was adopted where small constant numbers were added to the actual term frequencies of terms.)
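The ad hoc scheme can be pictured as adding small constants to the raw term frequency. The constants below are purely hypothetical placeholders chosen for illustration; the values actually used are not stated here.

```python
# Hypothetical constants; the actual values are not given in the text.
TITLE_BONUS = 2    # added when the term occurs in the document title
FILTER_BONUS = 1   # added when the term came from an object filter

def boosted_tf(raw_tf, in_title, from_object_filter):
    """Raw term frequency plus small constants for more reliable terms."""
    bonus = (TITLE_BONUS if in_title else 0)
    bonus += (FILTER_BONUS if from_object_filter else 0)
    return raw_tf + bonus

title_gene = boosted_tf(3, in_title=True, from_object_filter=True)
abstract_word = boosted_tf(3, in_title=False, from_object_filter=False)
```

Under these assumed constants, a filter-identified term in the title outweighs an automatically indexed abstract term with the same raw frequency.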
After extensive experimentation and consultation with several fly experts, we retained only automatic indexing terms that appeared more than three times, in order to remove incidental noise terms. The total numbers of unique terms generated (through both object filtering and automatic indexing) from the fly literature are shown in Table 2. Not surprisingly, subject descriptors constituted the largest share of the descriptors.
We then computed the combined weight of term $j$ in document $i$, $d_{ij}$, based on the product of ``term frequency'' and ``inverse document frequency'' as follows:

$$d_{ij} = tf_{ij} \times \log\left(\frac{N}{df_j} \times w_j\right)$$
where $N$ represents the total number of fly documents, and $w_j$ represents the number of words in descriptor $T_j$. Multiple-word terms were assigned heavier weights than single-word terms because multiple-word terms usually conveyed more precise semantic meaning than single-word terms. Although such an ad hoc weighting scheme was added on the basis of our experience and fly expert feedback, a systematic evaluation of the impact of such weighting had not been conducted at this point.
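A minimal numerical sketch of the combined weight follows. It assumes the weight has the form term frequency times the logarithm of $(N / df_j) \times w_j$, which matches the definitions of $N$ and $w_j$ given here; the natural logarithm and the function name are assumptions made for illustration.

```python
import math

def combined_weight(tf_ij, df_j, n_docs, words_in_term):
    """tf-idf style weight with a bonus for multi-word descriptors.

    Assumed form: tf_ij * log((N / df_j) * w_j), natural logarithm.
    """
    return tf_ij * math.log((n_docs / df_j) * words_in_term)

# A term found in every document of a 1000-document collection
# carries no weight; a rarer term and a two-word term carry more.
everywhere = combined_weight(2, 1000, 1000, 1)
rare_single = combined_weight(1, 10, 1000, 1)
rare_double = combined_weight(1, 10, 1000, 2)
```

Because the word count enters inside the logarithm, a two-word descriptor receives a higher weight than a single-word descriptor with the same frequencies, in line with the rationale given above.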
After the concept descriptors for each abstract were identified, we proceeded to perform term co-occurrence analysis for all documents in the document collection. (We refer to the thesaurus as a concept space to distinguish it from the information space represented by the Biosis and Medline documents.)
We then generated a term co-occurrence table based on the asymmetric ``Cluster Function'' developed by the authors. In a previous term (concept) association experiment, we showed that this asymmetric similarity function produced a term association recall level significantly higher (at the 1.7% significance level) than that produced by the popular cosine function [9] (the term precision levels were comparable). Based on feedback from several fly experts, the weighting factor appearing in the equations below aimed to reward related terms which are specific and penalize related terms which are too general to be useful for scientific retrieval purposes. (On numerous occasions, fly biologists expressed dissatisfaction with system-suggested terms in their domain which were too general, e.g., gene, mutation, etc. They often considered those terms to be irrelevant.)
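$$\text{ClusterWeight}(T_j, T_k) = \frac{\sum_{i=1}^{N} d_{ijk}}{\sum_{i=1}^{N} d_{ij}} \times \text{WeightingFactor}(T_k)$$

$$\text{ClusterWeight}(T_k, T_j) = \frac{\sum_{i=1}^{N} d_{ikj}}{\sum_{i=1}^{N} d_{ik}} \times \text{WeightingFactor}(T_j)$$

These two equations indicate the similarity weights from term $T_j$ to term $T_k$ (the first equation) and from term $T_k$ to term $T_j$ (the second equation). $d_{ij}$ and $d_{ik}$ were calculated based on the equation in the previous step. $d_{ijk}$ represents the combined weight of both descriptors $T_j$ and $T_k$ in document $i$. $d_{ijk}$ is defined similarly as follows:

$$d_{ijk} = tf_{ijk} \times \log\left(\frac{N}{df_{jk}} \times w_j\right)$$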
where $tf_{ijk}$ represents the number of occurrences of both term $j$ and term $k$ in document $i$ (the smaller number of occurrences between the terms was chosen). $df_{jk}$ represents the number of documents (in a collection of $N$ documents) in which terms $j$ and $k$ occur together. $w_j$ represents the number of words of descriptor $T_j$.
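In order to penalize general terms (terms which appeared in many places) in the co-occurrence analysis, we adopted the following weighting schemes:

$$\text{WeightingFactor}(T_k) = \frac{\log (N / df_k)}{\log N}$$

$$\text{WeightingFactor}(T_j) = \frac{\log (N / df_j)}{\log N}$$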
Terms with a higher $df_k$ value (more general terms) had a smaller weighting factor value, which caused the co-occurrence probability to become smaller. In effect, general terms were pushed down in the co-occurrence table (terms in the co-occurrence table were presented in reverse probabilistic order, with more relevant terms appearing first).
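The asymmetric computation can be sketched end to end in Python. The sketch rests on assumptions: that the weight from $T_j$ to $T_k$ is the sum over documents of the joint weight divided by the sum of the weight of $T_j$, multiplied by a factor $\log(N/df_k) / \log N$ that shrinks as $T_k$ becomes more general; that joint term frequency is the smaller of the two frequencies; and that natural logarithms are used. The function names and the ten-document collection are hypothetical.

```python
import math
from collections import Counter, defaultdict

def n_words(term):
    """Number of words in a descriptor."""
    return len(term.split())

def cluster_weights(docs):
    """Asymmetric weights {(tj, tk): weight} under the assumed formulas."""
    n = len(docs)
    if n < 2:
        return {}
    tf = {doc: Counter(terms) for doc, terms in docs.items()}
    df = Counter(t for counts in tf.values() for t in counts)
    df_pair = Counter(
        (a, b) for counts in tf.values() for a in counts for b in counts if a != b
    )
    sum_d = defaultdict(float)       # sum over documents of d_ij
    sum_d_pair = defaultdict(float)  # sum over documents of d_ijk
    for counts in tf.values():
        for a, tf_a in counts.items():
            sum_d[a] += tf_a * math.log(n / df[a] * n_words(a))
            for b, tf_b in counts.items():
                if a != b:
                    joint_tf = min(tf_a, tf_b)
                    sum_d_pair[(a, b)] += joint_tf * math.log(
                        n / df_pair[(a, b)] * n_words(a)
                    )
    weights = {}
    for (a, b), numerator in sum_d_pair.items():
        if sum_d[a] > 0:
            factor = math.log(n / df[b]) / math.log(n)  # penalizes general b
            weights[(a, b)] = numerator / sum_d[a] * factor
    return weights

# Hypothetical collection: "gene" is general, "ANTC" is specific.
docs = {
    1: ["ANTC", "HOMEOTIC"], 2: ["HOMEOTIC", "gene"], 3: ["HOMEOTIC", "gene"],
    4: ["gene", "white"], 5: ["gene"], 6: ["white"],
    7: ["eye"], 8: ["eye"], 9: ["wing"], 10: ["wing"],
}
weights = cluster_weights(docs)
```

In this toy collection the weight from ``HOMEOTIC'' to the specific term ``ANTC'' exceeds the weight from ``HOMEOTIC'' to the general term ``gene'', even though ``gene'' co-occurs with ``HOMEOTIC'' more often, and the weights in the two directions between ``HOMEOTIC'' and ``ANTC'' differ.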
Sample entries in the system-generated co-occurrence tables are shown in Figure 1. As shown in the co-occurrence table, ``DISCS'' was found to be most strongly related to researcher ``BROWER, D.'' with a weighted probability of 0.286650. The other terms related to researcher ``BROWER, D.'' were listed in descending order. In the second entry, gene ``e(bx)'' was found to be most relevant to gene ``white''. In the third entry, the subject ``LOCALIZED'' was found to be the most relevant to researcher ``LEHMANN, R''. The last entry revealed that the subject ``HOMEOTIC'' had as its most relevant term the gene ``ANTC''.
Without setting a probabilistic threshold for the co-occurrence table, the total number of co-occurrence pairs was 811,356. Some terms had as many as a few thousand co-occurring terms. Such a volume of data not only consumed a great deal of memory but could also overwhelm searchers during the thesaurus browsing process. For productive user-system interaction, only highly relevant concepts should be suggested to searchers. Based on our experience from worm thesaurus generation [12], we chose 100 as the maximum number of links for any node. The fly thesaurus resulting after applying the thresholds contained 748,253 pairs of related concepts. On average, each term had about 41 relevant neighboring concepts.
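The link limit described above can be sketched as a per-term top-k cut. This is one plausible reading only: the function name `prune_links`, the nested-dictionary layout, and the use of a simple top-k rule as the sole threshold are assumptions for illustration.

```python
import heapq


def prune_links(cooccurrence, max_links=100):
    """Keep only the strongest max_links neighbors of each term.

    cooccurrence: {term: {neighbor: weight}} (hypothetical layout).
    Returns {term: [(neighbor, weight), ...]} in descending weight order.
    """
    pruned = {}
    for term, neighbors in cooccurrence.items():
        pruned[term] = heapq.nlargest(
            max_links, neighbors.items(), key=lambda item: item[1]
        )
    return pruned
```

Because each list is returned in descending weight order, the pruned table can be displayed directly in the ranked form used by the query interface.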
Our prototype system, which consisted of a thesaurus generation component and a thesaurus query system, was developed in ANSI C and ran on Sun SPARCstations, DECstations, and DEC Alpha machines. It took 10.6 hours of CPU time to generate the thesaurus using a DECstation 5000/120 (25 MIPS, ULTRIX-based). The resulting size of the fly thesaurus was 13.5 MB. Users can browse the thesaurus through a query interface that was previously developed for use with the worm thesaurus [12]. We first present a sample fly thesaurus query session and then discuss an experimental design that was adopted to evaluate the usefulness of the fly thesaurus and to identify areas for improvement.
A sample (typical) sequence of query sessions is shown in Figures 2, 3, and 4. (For this sample session, the Fly Thesaurus Query System had loaded the complete fly thesaurus from our earlier pilot study, i.e., the version without the 100-link threshold.) After a previous unrelated session had involved ``(cell) death'' (terms related to ``death'' were displayed in the main window), the user was interested in finding something about homeotic genes, a class of genes involved in body pattern formation in the developing fly. The term ``HOMEOTIC'' had been entered in the new query box. Although multiple terms may be searched together, the user elected to search only one term, as indicated by the lower case ``q'' entered for term 2 (Figure 2, bottom box). Figure 3 shows a display of co-occurring terms in decreasing weighted order. As expected, a number of gene names appeared on the list of relevant terms. Term 24, ``EMBRYONIC'', suggests that ``HOMEOTIC'' may be related to genes in the developing fly. Terms 1, 10, 11, 18, and 32 are all variations on the word ``antenna''. A group of genes important in the proper development and placement of the fly antennae are known to be clustered together on the chromosome, and are called the ``Antennapedia Complex.'' Furthermore, terms 3, 7, 9, 12, 14, 15, and 16 are all gene names.
The user was interested in determining whether any of the system-suggested genes are included in the Antennapedia Complex, and marked terms 1 (ANTC), 2 (HOMEOTIC GENES), and 17 (COMPLEX) for further searching. The menu choice ``Use term(s) with marker'' under ``Thesaurus Re-Query'' re-activated the thesaurus. Figure 4 shows the result of the iterative query using the three selected terms. We see that ANTC (term 1) appears to function as a member of an immunoglobulin superfamily, and that the protein expressed by the gene appears to be glucose dehydrogenase. We also see that six of the gene names that appeared on the list of terms suggested for ``HOMEOTIC'' (3, 4, 7, 14, 15, and 16) directly relate to all three of the marked terms, as indicated in the square brackets following the co-occurrence weight.
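The re-query step described above, in which each suggested term is annotated with the marked terms it relates to, might be sketched as follows. The ranking rule (number of marked terms matched, then summed weight), the function name `requery`, and the placeholder terms are all assumptions; the text does not specify how the system combined the weights.

```python
def requery(thesaurus, marked_terms):
    """Combine the neighbor lists of several marked terms.

    thesaurus: {term: {neighbor: weight}} (hypothetical layout).
    Returns [(term, summed_weight, [indices of marked terms it relates to])],
    ordered by how many marked terms it relates to, then by summed weight.
    """
    candidates = {}
    for index, term in enumerate(marked_terms, start=1):
        for neighbor, weight in thesaurus.get(term, {}).items():
            if neighbor in marked_terms:
                continue  # do not suggest the query terms themselves
            entry = candidates.setdefault(neighbor, {"weight": 0.0, "sources": []})
            entry["weight"] += weight
            entry["sources"].append(index)
    ranked = sorted(
        candidates.items(),
        key=lambda item: (-len(item[1]["sources"]), -item[1]["weight"], item[0]),
    )
    return [(term, entry["weight"], entry["sources"]) for term, entry in ranked]
```

The list of source indices plays the role of the square brackets in the display: a term related to every marked term carries all of their indices.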
A user can iteratively query additional marked system-suggested terms as many times as needed to arrive at a list of sufficiently specific terms. Alternatively, the thesaurus can be re-activated by entering new user-suggested terms. Terms selected by a user during the iterative thesaurus browsing process are recorded in a separate area. The next stage of development for this thesaurus will involve traversal from the fly concept space into the previously generated worm concept space and will feature the capability to retrieve worm documents using fly query terms with the help of a conjoined fly-worm concept space. This fly-worm traversal project aims to examine issues related to cross-domain vocabulary switching by developing automatic domain-specific thesauri, merging multiple thesauri, and providing spreading activation algorithms for concept traversal [11].
A fly thesaurus evaluation experiment was conducted in Fall 1993. The experiment consisted of two parts: a term association experiment and a searcher browsing experiment. In order to investigate the effects of a searcher's background on thesaurus usage, six subjects at three levels of expertise were selected. Subjects 1 and 2 were considered experts in Drosophila biology. Both were faculty members with more than 9 years of experience in Drosophila genetics, and had published numerous papers in this area. Subjects 3 and 4 were considered novices: one was a doctoral student in Molecular and Cellular Biology; the other was a postdoctoral fellow in Ecology and Evolutionary Biology. Both had worked in laboratories dedicated to Drosophila research for at least 2 years. Subjects 5 and 6 were considered outsiders. Both worked outside the fly domain but had had exposure to fly concepts and had reasons to be interested in using a fly thesaurus. Subject 5 was a faculty member with limited previous experience in a Drosophila research laboratory (a 1-year sabbatical). Subject 6, a doctoral student with extensive experience in worm genetics, had frequently attended fly-worm joint seminars.
The first step of the term association experiment was to give each subject a pre-selected term. Ten fly terms chosen with the help of several fly researchers were presented to each subject in order. These terms included researchers' names, gene names, and subject descriptors. The subjects were asked to write down concepts (genes, researchers, and subject descriptors) related to each pre-selected term. A sample experimental sheet for one of the terms, ``white'' (a gene name), and the 12 related terms generated by Subject 2 are shown in Figure 5. Subjects were then asked to mark terms suggested by the fly thesaurus as irrelevant, somewhat relevant, or very relevant.
The specific research question tested in this experiment was:
H1: Does the fly thesaurus help find more relevant terms and generate better concept recall and precision than human subjects of varying backgrounds?
After the term-association experiment, subjects were asked to browse the online fly thesaurus freely, using any terms they preferred and exploring in any way they liked. During browsing, subjects were asked to think aloud, and to give specific comments, observations, or suggestions regarding the user interface and the quality of the fly thesaurus. Their complete online sessions were logged. Verbal protocols were recorded and later transcribed for analysis. We aimed to identify directions for system improvement through detailed analysis of the protocols. It should be cautioned that experience resulting from the term association experiment could have altered a subject's behavior in this experiment. The complete association and browsing experiment lasted 1 to 1.5 hours for each subject.
The research question for this experiment was:
H2: What are the potential strengths and weaknesses of the fly thesaurus and interface as experienced by users?
By counting the numbers of terms generated by the subjects themselves and the system-suggested terms marked as either somewhat relevant or very relevant by the subjects, we were able to tabulate and analyze whether the thesaurus was able to contribute relevant terms during a retrieval process. An analysis of variance (ANOVA) was conducted for the search terms using the MINITAB statistical package [41], followed by a two-sample t-test to determine the differences in means. The results are summarized in Figure 6. Overall, an average of 11.2 fly thesaurus terms was selected by the subjects as relevant. Subjects were able to generate an average of 12.017 terms by themselves. As in the WCS experiment, the expert group was able to generate more relevant terms than both novice and outsider groups. However, differing from the findings in WCS, the outsider group generated more relevant terms than the novice group. Among the relevant terms suggested by the fly thesaurus, terms produced by object filtering were the majority (about 70-80% for all cases). However, the automatic indexing procedure did contribute relevant terms in some cases.
For the fly thesaurus, the two-sample t-test revealed that there were no statistically significant differences in means (at the level of significance of 0.05) for (ALL vs. KB, P-value=0.689), (NOVICES vs. KB, P-value=0.708), and (OUTSIDERS vs. KB, P-value=0.367).
In contrast to the document recall and precision measures typically used in information science research, we adopted concept recall and concept precision for evaluation. Instead of examining the number of relevant documents retrieved, we counted the number of relevant terms (concepts) identified. They were computed as follows:
For all subjects, the terms they initially generated and the terms selected from the thesaurus were included to represent the Total Relevant Concepts - the target set of concepts that can be obtained through user-thesaurus interaction. (In fairness, this measure actually represents only ``subject-perceived relevance.'') Based on this target set of concepts, we were able to examine the subjects' initial terms (generated without any thesaurus help) and determine the subjects' concept recall and concept precision levels when the thesaurus component was unavailable, i.e., by counting the number of terms that matched the target terms. We then evaluated the concept recall and concept precision levels for the thesaurus by counting the number of thesaurus terms that matched with the target terms. Both ANOVA tests and two-sample t-tests were performed for concept recall and concept precision.
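The procedure above can be sketched in a few lines of Python. The definitions used here are assumptions consistent with the description: concept recall is the number of produced terms found in the target set divided by the size of the target set, and concept precision is the same count divided by the number of terms produced. The function name and the toy term lists are hypothetical.

```python
def concept_recall_precision(produced, target):
    """Return (recall, precision) of a produced term list against the target set."""
    produced, target = set(produced), set(target)
    matched = produced & target
    recall = len(matched) / len(target) if target else 0.0
    precision = len(matched) / len(produced) if produced else 0.0
    return recall, precision


# Hypothetical data for one pre-selected term.
subject_terms = {"a", "b", "c"}                 # generated without thesaurus help
thesaurus_suggested = {"c", "d", "e", "f", "g"}  # everything the thesaurus listed
thesaurus_selected = {"c", "d"}                  # marked somewhat or very relevant

# Total Relevant Concepts: the subject's own terms plus the selected thesaurus terms.
target = subject_terms | thesaurus_selected

subject_scores = concept_recall_precision(subject_terms, target)
thesaurus_scores = concept_recall_precision(thesaurus_suggested, target)
```

With these toy lists the subject reaches a recall of 0.75 at a precision of 1.0, while the thesaurus reaches a recall of 0.5 at a precision of 0.4, mirroring the qualitative pattern of comparable recall and lower thesaurus precision.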
The ANOVA results for concept recall are shown in Figure 7. Overall, there were no significant differences (P-value = 0.366) between the subject groups and the thesaurus. On average, subjects' recall level was 53.81% while that of the thesaurus was 49.80%. These two percentages suggested that the degree of overlap between the set of terms from subjects and that from the thesaurus was only 3.61% (i.e., the sum of 53.81% and 49.80%, minus 100%). Although this overlap was low, the thesaurus terms were carefully evaluated and selected by the subjects as relevant. Furthermore, such findings indicated that subjects could generate almost 50% of the total relevant terms independently. The thesaurus was able to help subjects associate an equal number of additional terms. Among experts, novices, and outsiders, there were no significant differences (P-value = 0.277, P-value = 0.774, and P-value = 0.791, respectively) between each subject group and the thesaurus. We found that for these three subject groups, the percentage of overlap between the set of terms generated by subjects and that generated by the thesaurus was consistent with the overall overlap percentage: 4.33%, 3.09%, and 3.34%, respectively. At all levels of expertise, subjects could recall only about half of the relevant terms without any assistance. With assistance from the thesaurus, the other half of the relevant terms were recalled.
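The overlap figure follows from inclusion-exclusion. As a worked sketch, let S be the relevant terms generated by subjects, T the relevant terms contributed by the thesaurus, and R the Total Relevant Concepts, and assume (as the construction of the target set implies) that R is the union of S and T; the symbols are introduced here only for illustration.

```latex
\frac{|S \cap T|}{|R|}
  = \frac{|S|}{|R|} + \frac{|T|}{|R|} - \frac{|S \cup T|}{|R|}
  = 53.81\% + 49.80\% - 100\%
  = 3.61\%
```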
As shown in Figure 8, the thesaurus produced a low level of precision compared with those produced by the human subjects. Overall, human subjects had about a 98.20% concept precision level; the thesaurus had a 37.33% precision level. The differences between the thesaurus and the subjects were significant overall and for each of the three subject groups (P-value = 0.000 in all cases). The low precision level of the thesaurus was attributed partially to the noise terms (mostly terms considered too general) in the thesaurus, as reported in the subjects' determination of the relevancy of those terms to their queries and needs. As is evident in information science research, even man-made thesauri are only useful when terms are presented in the context of the searchers' needs and when selected by the searchers themselves. Thesauri should be used for consultation purposes, not for automatic term replacement. We believe that searchers' involvement during the thesaurus consultation process is crucial to the success of thesaurus usage.
In conclusion, in terms of quantity, the thesaurus and the human subjects were able to generate the same number of relevant terms, but the contents of their lists were almost mutually exclusive. However, human subjects were more precise than the thesaurus. With close human-computer interaction, it appears that an automatic thesaurus-augmented search process can become very fruitful and productive.
The subjective evaluation of the fly thesaurus reported here is the result of analyzing logged browsing sessions and transcripts of the subjects' verbal protocols. During browsing, subjects were asked to think aloud and to give specific comments, observations, or suggestions. These protocols provided clues for system improvement.
A search involved either typing in a new term (user-suggested), or marking system-suggested terms (iterations). A new search was defined as an entry of a searcher's own term. The online thesaurus suggested additional terms, which were shown on the system display. Whether they were using their own terms or system-suggested terms, most subjects used terms that they were either familiar with or curious about. Analysis of the logged search sessions revealed that the six subjects performed a total of 19 searches. Most search topics were related to biological functions or phenomena (e.g., ``LASER ABLATION,'' ``SPERMATOCYTE,'' ``SPERMATOGENESIS,'' ``LONGIVITY,'' ``LINEAGE PATTERN,'' etc.). However, a few search topics were related to author names (e.g., ``John, T.'') and gene names (e.g., ``SPE-12,'' ``SPE-26''). The greatest number of iterations following a new search was three, with nearly 90% (17/19) of all searches being followed by 0, 1, or 2 iterations. This pattern is underscored by a comment from Subject 1, who noted that after 3 iterations she was getting no new terms or information, and that the same set of terms was reappearing.
As a result of our analysis of the session logs and verbal protocols, a taxonomy was developed to represent the observations of the subjects (see Figure 9). Included in this taxonomy are problems identified, which may be taken as potential pitfalls to consider in development of a domain-specific automatic thesaurus, and advantages/benefits identified, which may be taken as potential usefulness of such a thesaurus. This experiment gave us a better idea of how the thesaurus could be modified and used for more real-life purposes and by different user groups.
A. Quality of Nodes and Links:
The issue of chronology was important to all subjects. However, we learned that the users had different expectations with regard to the emphasis placed on recent and seminal work, and on frequency of citation. This strongly affected their evaluation of the system-generated thesaurus. For example, two subjects (an expert and an outsider) were more concerned about giving priority to more recent publications, citing numbers of publications by each person on the list, and indicating co-authors and recency of each item. One expert subject was more concerned about the priority of the authors responsible for more seminal work.
Spelling and hyphenation of words are not standardized in the abstracts, and this created redundancies in the thesaurus. Spelling differences such as ``Homeodomain'' (American) and ``Homoeodomain'' (British), ``discs'' and ``disks,'' and ``signaling'' and ``signalling'' were often confusing for the user.
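One way such redundancies could be reduced is to fold known spelling variants onto a single form before terms enter the thesaurus. The sketch below is illustrative only: the variant table, the choice of which spelling is treated as canonical, and the function names are assumptions, and hyphenation is deliberately left alone because gene names often contain meaningful hyphens.

```python
# Hypothetical variant table built from the spelling pairs quoted above.
# Which spelling counts as canonical is an arbitrary choice for this sketch.
VARIANTS = {
    "homoeodomain": "homeodomain",
    "disks": "discs",
    "disk": "disc",
    "signalling": "signaling",
}


def normalize_term(term):
    """Lower-case a term and map each word to its canonical spelling."""
    return " ".join(VARIANTS.get(word, word) for word in term.lower().split())


def merge_variants(neighbors):
    """Merge {term: weight} entries that normalize to the same term,
    keeping the largest weight for each merged entry."""
    merged = {}
    for term, weight in neighbors.items():
        key = normalize_term(term)
        merged[key] = max(weight, merged.get(key, 0.0))
    return merged
```

Keeping the maximum weight is only one possible merge rule; recomputing the co-occurrence weights after normalizing the index terms would be the more faithful alternative.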
Subjects could find a term non-significant or meaningless if it was an inappropriately included stopword, a single term from a term phrase (a fragmented phrase), or a very broad or non-specific term. The distinction between general and sufficiently specific terms was commonly emphasized by all subjects. One expert defined useless terms as those that are incomplete and those that do not indicate the ``importance of something.'' Four subjects (two novices, one outsider, and one expert) defined specific terms as being those that would be unique in the literature, those that would clearly identify an author's field of study, or those that more clearly identified the context of a more general term.
One source of ambiguity was the presence of multiple authors with the same name. One outsider performing an author search was perplexed by the output because the terms seemed so disparate, until he concluded that the list of terms represented the work of more than one author. Although the interface includes a field to indicate the term type (subject (s), gene (g), function (f), author (a)) to reduce ambiguity, most users did not pay much attention to it. Consequently, some terms that fitted into more than one ``type'' were considered by the subjects to be ambiguous.
A problem related to synonymous linkages between genes was identified. We used two of the flat files in FlyBase that linked gene names with all synonymous names. One of the subjects (an expert) pointed out that some of the genes listed as synonyms are not actually synonyms, but are alleles, facets, etc. One term (Abruptex) with a weight of 1.000 (synonym) was retrieved for the Notch query. The subject pointed out that Abruptex is a class of alleles that fit under the Notch umbrella. Therefore Notch should be at the top of the tree. However, these two terms were listed in FlyBase as synonyms. Our use of the Synonyms files from FlyBase resulted in incidents in which the thesaurus brought up gene names unfamiliar to the subjects. All 6 subjects pointed out that they were unfamiliar with ``cel'', a gene name that came up in at least one search for each subject, and questioned the appropriateness of this linkage.
The two experts and the outsider who is a faculty member were frustrated by not seeing names they expected to be on the list. One novice was dismayed to see that the gene she works on (``cappuccino'') was not in the thesaurus. We estimate that because of the field's long history, the amount of formal and informal literature representing the area of Drosophila genetics is at least 10 times that for the area of worm genetics. The database underlying the thesaurus is therefore only a sampling of the fly literature: fewer than 6000 abstracts from 2 databases.
B. Quality of System and Interface:
The interface program was written in C with the CURSES function library, which provides control for many different alphanumeric terminals. The library provides a convenient rapid-prototyping tool for developing interfaces. Users can move the highlighted bar on the screen to the desired command by pressing the up/down and left/right arrow keys. Although the interface required very few keystrokes, ease of use (especially of the arrow keys) was a problem for all of the subjects. Half of the subjects were finally comfortable with the interface by the end of the browsing session (including one who stated at the beginning of the session that she ``hates computers''). Half never became comfortable, but two of these felt that with more experience its operation would eventually become ``second nature.''
The query term is not displayed at the top of the list of retrieved terms. This causes problems when the searcher sees a term lower on the list that is of interest and wants to search it together with the original term. To do so requires that the user re-enter both terms using the ``Type In A New Term'' function. Several subjects were frustrated with this limitation on further searching. Both experts would prefer that terms be classed according to ``type,'' with all authors listed together, subjects together, and genes together. One subject questioned whether it is realistic to rank an author as being more highly related to a term than a subject or a gene name. All but one subject (an outsider) found the display ``hard to read'' because term ``types'' were interspersed in the display.
Some of the pitfalls reported can be resolved algorithmically, while others may require manual attention. After the experiment, we examined the high-frequency terms in the fly thesaurus and manually removed many terms which are noise and/or too general to be useful for biologists. More extensive fly thesaurus development work that will involve a larger fly collection is also planned. A new interface based on the user-friendly and accessible WWW has recently been created. The current fly thesaurus can be accessed at: http://bpaosf.bpa.arizona.edu:8000/cgi-bin/BioQuest.
Verbal protocols revealed two significant, novel areas of potential usefulness.
A. Serendipity Discovery:
One expert initially was concerned about what she termed the ``misdirectedness'' of the thesaurus's search method and expressed frustration with it. However, after a few searches, she conceded that, ``It's like a...word thesaurus in that there are probably going to be things in here that I absolutely never would have thought about...and routes through the data, through the literature that I never would take without it.''
Most users thought the thesaurus would be useful in introducing them to various aspects of the domain that were beyond their present level of experience. One expert and one novice found that imagining a scenario in which multiple terms were related led them to deduce possible contexts for seemingly unrelated terms. We have not seen this kind of creative use of a thesaurus reported before. Two subjects (an expert and an outsider) did not see this potential and were primarily interested in seeing the things they knew to be related and relevant. Anything unexpected was viewed as noise.
B. Memory-Jogging:
The second novel use for the fly thesaurus is memory-jogging. One novice searched for the gene ``wingless''. The term list retrieved contained the term ``signalling'', which triggered a spark of recognition in the subject. She said, ``Let's try `signalling', because `wingless' is presumed to be some sort of signalling molecule.'' The thesaurus's emphasis on knowledge discovery and memory enhancement, combined with the subject's own deductive reasoning, triggered her creativity. This research has shown that automatic thesauri have the potential to offer improvements to searching that exceed those possible with manual thesauri.
In summary, the results from the experiment were encouraging. The thesaurus suggested relevant terms and concepts that would not only be helpful for different users, but useful in spurring user ideas and desire to acquire knowledge.
Our most immediate future research directions, as part of a long-term effort to develop a more efficient and ``intelligent'' framework and design for the management, retrieval, sharing, and dissemination of information for distributed scientific computing, include the following:
We believe we have made a contribution to addressing one of the most challenging problems in IR: the vocabulary problem. In scientific communities an outsider (e.g., a fly biologist) often needs to search for literature in other domains (e.g., worm biology) using his/her own vocabulary (i.e., fly-specific terms). Now that we have created concept spaces for the fly and worm communities, a multiple-thesauri consultation process (we have developed one, reported in [9]) should make it possible for a searcher's fly-specific terms to traverse the two concept spaces and eventually converge on specific terms in the (target) worm concept space. Results of some initial structural comparison between the fly and worm thesauri are summarized in Table 3.
Before devising a mechanism for traversal from one thesaurus to the other, it was important to determine the extent of overlap between the index terms contained in each. Because the object filter lists available for the two domains differed, some direct comparisons were not possible. Table 3 shows the numbers of terms in the worm and fly thesauri. These include the number of author terms, number of gene terms, number of subject terms, number of function terms (fly thesaurus only), and number of method terms (worm thesaurus only). The last three columns report the number of terms appearing in both thesauri and the respective proportion of each thesaurus that the overlapping terms represent. It is not surprising that no overlap exists in gene names: the naming conventions for the two domains are extremely different. It is noteworthy, however, that 252 author names appear in both thesauri. The format for author names is last name and first initial, which could present some ambiguity. Still, it is likely that some authors have published in both domains. The extent of overlap for the subject descriptors was greater than 25% for the fly thesaurus and 32% for the worm thesaurus. With this much overlap, the likelihood of finding intermediate terms for concept space traversal is promising.
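The comparison described above amounts to a per-type set intersection. The following sketch assumes each thesaurus is available as a mapping from term type to a set of terms and that matching is case-insensitive; the function name, the layout, and the placeholder terms are hypothetical, and term types present in only one thesaurus are simply skipped.

```python
def overlap_by_type(fly_terms, worm_terms):
    """Count shared terms per term type.

    fly_terms, worm_terms: {term_type: set of terms} (hypothetical layout).
    Returns {term_type: (shared_count, share_of_fly, share_of_worm)}.
    """
    report = {}
    for term_type in sorted(set(fly_terms) & set(worm_terms)):
        fly = {term.casefold() for term in fly_terms[term_type]}
        worm = {term.casefold() for term in worm_terms[term_type]}
        shared = fly & worm
        report[term_type] = (
            len(shared),
            len(shared) / len(fly) if fly else 0.0,
            len(shared) / len(worm) if worm else 0.0,
        )
    return report
```

Shared subject descriptors found this way are the natural candidates for the intermediate terms mentioned in the text, since they exist as nodes in both concept spaces.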
As an extension of the current research, we will be testing several AI-based general search algorithms (e.g., branch-and-bound and Hopfield network [10]) for automatic thesaurus consultation. We hope these algorithms will assist searchers in traversing domain-specific concept spaces by following the more relevant links first, a general characteristic of optimal or heuristic search algorithms. We have done some work in this area already, but significant experimentation is still required to develop a robust automatic thesaurus consultation module.
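To illustrate what ``following the more relevant links first'' could look like, here is a generic best-first traversal over a weighted concept space. It is not the branch-and-bound or Hopfield network algorithm cited above; the path score (the product of link weights), the cutoff `min_score`, and the function name are assumptions made for this sketch.

```python
import heapq


def best_first_concepts(thesaurus, start_terms, limit=10, min_score=0.05):
    """Visit concepts in descending order of path score.

    thesaurus: {term: {neighbor: weight}} with weights in (0, 1].
    The score of a concept is the largest product of link weights along
    any path from a start term (start terms have score 1.0).
    Returns [(term, score), ...] for at most `limit` concepts.
    """
    best = {term: 1.0 for term in start_terms}
    heap = [(-1.0, term) for term in start_terms]
    heapq.heapify(heap)
    visited, seen = [], set()
    while heap and len(visited) < limit:
        negative_score, term = heapq.heappop(heap)
        if term in seen:
            continue  # a better path to this term was already expanded
        seen.add(term)
        score = -negative_score
        visited.append((term, score))
        for neighbor, weight in thesaurus.get(term, {}).items():
            candidate = score * weight
            if candidate >= min_score and candidate > best.get(neighbor, 0.0):
                best[neighbor] = candidate
                heapq.heappush(heap, (-candidate, neighbor))
    return visited
```

Because link weights never exceed 1, scores cannot increase along a path, so the first time a concept is popped from the queue its score is already the best one reachable. In a conjoined fly-worm space, a shared subject descriptor would act as the bridge from a fly start term to worm-specific terms.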
In the current version, the thesaurus was generated in batch mode. Because scientific databases are rapidly and continually growing, an effective method for incrementally updating the thesaurus is needed. We are currently developing an incremental version of our cluster algorithm for implementation in the worm and fly thesauri. Storing some intermediate results for term frequencies and inverse document frequencies should, we believe, make incremental updating of a thesaurus possible.
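The idea of storing intermediate results can be sketched as a small accumulator of raw counts, from which frequency-based weights are recomputed on demand as documents arrive. This is only an illustration of the bookkeeping, not the incremental cluster algorithm under development; the class name, its methods, and the plain log(N/df) form of the inverse document frequency are assumptions.

```python
import math
from collections import Counter


class IncrementalCounts:
    """Running document-frequency statistics that can absorb new documents."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()       # documents containing each term
        self.df_pair = Counter()  # documents containing each (sorted) term pair

    def add_document(self, term_freqs):
        """term_freqs: {term: frequency} for one newly indexed document."""
        terms = sorted(term_freqs)
        self.n_docs += 1
        self.df.update(terms)
        for position, first in enumerate(terms):
            for second in terms[position + 1:]:
                self.df_pair[(first, second)] += 1

    def idf(self, term):
        """Inverse document frequency log(N / df); 0.0 for unseen terms."""
        if self.df[term] == 0:
            return 0.0
        return math.log(self.n_docs / self.df[term])


counts = IncrementalCounts()
counts.add_document({"homeotic": 2, "antc": 1})
counts.add_document({"homeotic": 1})
```

Only the counters need to be persisted between runs; weights that depend on N and the document frequencies change with every added document and are therefore derived at query or regeneration time rather than stored.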
So far we have not included in our analysis the ``time'' dimension of the documents and concepts. By time-tagging each concept and weighting concepts differently in the thesaurus generation process, we believe a more fluid and time-precise thesaurus can be created.
The rationale behind our research is that instead of letting knowledgeable information specialists (knowledgeable in several subject areas) perform term matching and consultation for different users, we could automatically create different domain-specific thesauri tailored to the vocabularies and concepts exhibited in the related disciplines and develop an online search aid to bridge the vocabulary differences. We hope, by expanding the concept (terminology and linkage) coverage of the underlying databases, information retrieval systems will eventually be able to assist in seamless and ``intelligent'' concept-based information retrieval.
This project was supported mainly by two NSF grants: the NSF CISE Research Initiation Award, IRI-9211418, 1992-1994 (H. Chen, ``Building a Concept Space for an Electronic Community System'') and NSF CISE Special Initiative on Coordination Theory and Collaboration Technology, IRI-9015407, 1990-1993 (B. Schatz et al., ``Building a National Collaboratory Testbed''). We would also like to thank the faculty and students of the Molecular and Cellular Biology Department, Ecology and Evolutionary Biology Department, Anatomy Department, and Biochemistry Department at the University of Arizona for their kind assistance and valuable suggestions, in particular, those of Dr. Samuel Ward, Dr. Danny Brower, Dr. John Clark, Dr. John Little, Dr. Lynn Manceau, Dr. Mary Rykowski, Ms. Alicia Minniti, and Ms. Lisa Werner.