Internet Categorization and Search:
A Self-Organizing Approach

Hsinchun Chen
Associate Professor
Management Information Systems Department, University of Arizona
Tucson, Arizona 85721, (602) 621-4153, hchen@bpa.arizona.
and
Chris Schuffels
Research Scientist
Management Information Systems Department, University of Arizona
Tucson, Arizona 85721, (602) 621-2478, cschuffels@bpa.arizona.
and
Rich Orwig
Research Scientist
Management Information Systems Department, University of Arizona
Tucson, Arizona 85721, (602) 621-2478, rorwig@bpa.arizona.

Abstract:

The problems of information overload and vocabulary differences have become more pressing with the emergence of increasingly popular Internet services. The main information retrieval mechanisms provided by the prevailing Internet WWW software are based on either keyword search (e.g., the Lycos server at CMU and the Yahoo server at Stanford) or hypertext browsing (e.g., Mosaic and Netscape). This research aims to provide an alternative concept-based categorization and search capability for WWW servers based on selected machine learning algorithms.

Our proposed approach, which is grounded on automatic textual analysis of Internet documents (homepages), attempts to address the Internet search problem by first categorizing the content of Internet documents. We report results of our recent testing of a multi-layered neural network clustering algorithm employing the Kohonen self-organizing feature map to categorize (classify) Internet homepages according to their content. The category hierarchies created could serve to partition the vast Internet services into subject-specific categories and databases and improve Internet keyword searching and/or browsing.

Introduction

Despite the usefulness of database technologies, users of online information systems are often overwhelmed by the amount of current information, the subject and system knowledge required to access this information, and the constant influx of new information [11]. The result is termed ``information overload" [3]. A second difficulty associated with information retrieval and information sharing is the classic ``vocabulary problem," which is a consequence of the diversity of expertise and backgrounds among system users [29] [30] [9]. The ``fluidity" of concepts and vocabularies in various domains further complicates the retrieval issue [9] [26] [18]. A concept may be perceived differently by different searchers and it may also convey different meanings at different times. To address the ``information overload" and the ``vocabulary problem" in a large information space that is used by searchers of varying backgrounds, a more ``intelligent" and proactive search aid is needed.

The problems of information overload and vocabulary differences have become more pressing with the emergence of the increasingly popular Internet services [47] [24]. Although Internet protocols such as WWW/http support significantly easier importation and fetching of online information sources, their use is accompanied by the problem of users not being able to explore and find what they want in an enormous information space [2] [6] [55]. While the Internet services are popular and appealing to many online users, we believe that difficulties with searching the Internet will worsen as the amount of online information increases. We consider that devising a scalable approach to Internet search is critical to the success of Internet services and other current and future National Information Infrastructure applications.

The main information retrieval mechanisms provided by the prevailing Internet WWW-based software are based on either ``keyword search" (e.g., the Lycos server at CMU and the Yahoo server at Stanford) or ``hypertext browsing'' (e.g., NCSA Mosaic and Netscape browser). Keyword search often results in relatively low precision and/or poor recall, and slow response time due to the limitations of the indexing and communication methods (bandwidth), controlled language based interfaces (the vocabulary problem), and the inability of searchers themselves to fully articulate their needs. Furthermore, browsing allows users to explore only a very small portion of the large Internet information space. An extensive information space accessed through hypertext-like browsing can also potentially confuse and disorient its user, resulting in the ``embedded digression problem," and can cause the user to spend a great deal of time while learning nothing specific, typifying the ``art museum phenomenon" [8] [25].

Internet ``surfers'' also have begun to raise their expectations of the Internet services - from a simple desire to find something fun (for no particular reason) to hoping to find something that might be useful (to their work or personal interests). For example, the Lycos server at CMU has become one of the hottest and most popular servers on the Internet due to its comprehensive listing and indexing of Internet homepages (10+ million URLs in October, 1995 and growing) and (keyword) search capability. However, the Lycos server has been hampered severely by the information overload and communication bandwidth problems discussed above.

Our proposed approach, which is grounded on automatic textual analysis of Internet documents (homepages), aims to address the Internet search problem by first automatically categorizing the content of Internet documents and subsequently providing category-specific search capabilities. As the first step toward intelligent categorization and search for the Internet, we propose a multi-layered neural network clustering algorithm employing a Kohonen self-organizing feature map to categorize (classify) Internet homepages according to their content. The category hierarchies could serve to partition the vast Internet services into subject-specific categories and databases. Once individual subject categories have been created, subject-specific searches or browsing can be performed.

In Section 2, we first present an overview of machine learning techniques for information retrieval. We then review the current status of Internet categorization and searching and associated problems. In Section 3, we present our framework for addressing these problems. Sections 4 and 5 discuss the specific algorithm and results from our ongoing Internet categorization research. Conclusions and discussion are provided in Section 6.

Internet Categorization and Search: Techniques and Problems

Machine Learning for Information Retrieval

Searching is a concept frequently discussed in the context of information retrieval research. In this section, we provide a brief summary of the emerging machine learning approach to searching. For a complete review of other techniques, readers are referred to [10].

Inductive machine learning techniques have drawn attention from researchers in computer and information sciences in recent years. In particular, Doszkocs et al. [22] have provided an excellent review of connectionist models for information retrieval and Lewis [37] and Chen [10] have surveyed and experimented with various machine learning algorithms in information retrieval and discussed promising areas for future research at the intersection of these two fields.

Based on our experience in this area, we believe that the new and emerging machine learning algorithms that analyze the common characteristics of documents and retrieval patterns of searchers are promising and may provide a viable solution to the complex and large-scale Internet categorization and search problem.

Internet Categorization and Search: An Overview

In its roots as the ARPANET, the Internet was conceived primarily as a means for remote login and experimentation with telecommunication [6]. However, the predominant usage quickly became email communication. This trend continues into the present form of the Internet, but with increasingly diverse support for collaborative data sharing and distributed, multimedia information access, especially using the World-Wide Web (WWW). Many people consider the Internet and the WWW the backbone of the Information Superhighway and the window to Cyberspace.

The WWW was developed initially to support physicists and engineers at CERN, the European Particle Physics Laboratory in Geneva, Switzerland [2]. In 1993, when several browser programs (most notably NCSA Mosaic) became available for distributed, multimedia, hypertext-like information fetching, the Internet became the preview of a rich and colorful information Cyberspace [55]. However, as Internet services based on WWW have become more popular, information overload has become a pressing research problem [6]. The user interaction paradigm on the Internet has shifted from simple hypertext-like browsing (human-guided activity exploring the organization and contents of an information space) to content-based searching (a process in which the user describes a query and a system locates information that matches the description). Many researchers and practitioners have considered Internet searching to be one of the more pressing and rewarding areas of research for future NII applications.

Internet searching has been the hottest topic at recent World-Wide Web Conferences. Two major approaches have been developed and experimented with: one is the client-based search spider (agent) and the other is online database indexing and searching. However, many systems contain components of both approaches.

A Framework for Internet Categorization and Search

In this section we present our overall framework and design for Internet categorization and search and the specific machine learning algorithms to use. Relevant findings and preliminary testing results from our own research will also be presented. Status of our current research (i.e., Stage 1. Internet categorization) will be reported in the next section. Our proposed design consists of three stages, to be executed consecutively.

1.
Stage 1. Internet categorization using multi-layered Kohonen self-organizing feature map:

In order to improve the efficiency of searching on the Internet, the first task is to partition the Internet information space into distinct subject categories meaningful to Internet users. Categorization and subject classification are common practices in library and information sciences (e.g., the INSPEC database for the computer engineering domain, the ERIC database for education, etc.). Subject partitioning creates smaller databases, which are more efficient for searching. In addition, a subject directory created as a result of a categorization or classification can also aid searchers' ``directory-browsing," a searcher-guided information seeking behavior frequently seen in the previously popular Gopher information servers. Many searchers of the Yahoo database adopted a combination of directory browsing and keyword searching within the specific sub-category of homepages.

After examining several clustering algorithms in the areas of computer science (e.g., hierarchical and non-hierarchical methods) and neural network algorithms in our previous research (to be discussed in detail below), we concluded that a variant of the Kohonen self-organizing feature map (SOM) appears promising. The algorithm has been shown to be robust in numerous image processing and pattern recognition applications [56]. It also creates an intuitive, graphical display of important concepts contained in textual information [39] [48].

A multi-layered, graphical SOM approach to Internet categorization has been adopted for this research. By analyzing keywords/descriptors in Internet homepages and their probabilities of co-occurrence, we should be able to represent the most important Internet subject categories (e.g., science, engineering, business, politics, entertainment, etc.) in different regions of a map. For each large region, a recursive process of analyzing homepages in the region and creating sub-maps could then be undertaken. Because each map might contain 30-50 categories, 5-6 layers of maps could then easily represent a number of homepages on the order of tens to hundreds of millions (e.g., $30^{5}$ to $50^{6}$). After subject categories have been created, searchers would be able to browse the subject directory to locate the appropriate partition in which to launch their keyword searching.

2.
Stage 2. Concept-based search based on cluster analysis and Hopfield net associative retrieval:

In addition to keyword searching in a subject category, we propose a concept space approach to information retrieval. By analyzing the co-occurrence probabilities of keywords in homepages of specific subject categories, we could create a concept space for each subject category. Such a concept space would represent the important terms and their weighted relationships in a graph structure, akin to an associative man-made thesaurus. A system-created concept space has been shown to be an effective tool to suggest alternative terms for searching and to articulate and reformulate precise queries during information retrieval. In a recent experiment involving an electronic community system and actual molecular biologists, a system-generated (nematode) worm concept space was shown to be an excellent ``memory-jogging'' tool that supported learning and serendipitous browsing. Despite some occurrences of obvious noise, the system was useful in suggesting relevant concepts for the researchers' queries and it helped improve concept recall [16].

The success of the concept space approach has been shown in various domain-specific applications such as Russian computing [13] and molecular biology [16]. However, the usefulness of such an approach for accessing diverse and large-scale Internet servers remains to be tested. After concept spaces (graphs) have been created for each subject category, we plan to incorporate into Internet searches several graph traversal algorithms previously tested in other applications, e.g., branch-and-bound and Hopfield net association [15]. We believe the thesaurus-like concept spaces created automatically for each subject category will serve as an excellent memory-jogging and term suggestion aid for searchers on the Internet.

3.
Stage 3. Intelligent spider (agent) using genetic algorithm:

In addition to enhancing Internet categorization and concept-based IR capabilities, the subject categories created during the categorization process could also be used to develop ``intelligent" global-search spiders (agents) for more efficient and optimal client-based search of relevant Internet information.

Based on our experience with various serial and parallel search algorithms and the analysis of the characteristics of the Internet structure, a genetic algorithm-based spider is proposed. By following homepages linked to starting homepages (a form of crossover) and performing sampling on a category-specific list of all other potentially relevant homepages (a form of mutation), a stochastic process of global evolution towards the ``fittest" (the most similar) homepages can be achieved. We believe this GA-based search algorithm is efficient and that using it will help us obtain optimal global search results on the Internet that are based on users' preferences (i.e., identify a list of homepages most relevant to the user-supplied starting homepages). Such a genetic algorithm-based approach has been adopted successfully in recent ``intelligent agent'' research for human-computer interaction design [42] and for inductive query by examples [12].
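
The evolutionary loop proposed for Stage 3 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the spider itself: pages are represented as in-memory term sets rather than fetched documents, fitness is taken to be Jaccard similarity to the starting homepages, and the names `ga_spider`, `pages`, `links`, and `category_urls` are hypothetical.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between two term sets (assumed fitness measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ga_spider(pages, links, start_urls, category_urls,
              generations=5, pop_size=4, mutation_samples=2, seed=0):
    """Evolve a population of URLs toward pages most similar to the start pages.

    pages: url -> set of terms; links: url -> list of outgoing urls.
    """
    rng = random.Random(seed)
    target = set().union(*(pages[u] for u in start_urls))
    population = list(start_urls)
    for _ in range(generations):
        # "Crossover" in the text's sense: follow links out of current pages.
        linked = [v for u in population for v in links.get(u, []) if v in pages]
        # "Mutation": sample pages from the category-specific list.
        count = min(mutation_samples, len(category_urls))
        sampled = rng.sample(category_urls, count)
        candidates = set(population) | set(linked) | set(sampled)
        # Selection: keep the fittest pages (ties broken by URL).
        ranked = sorted(candidates, key=lambda u: (-jaccard(pages[u], target), u))
        population = ranked[:pop_size]
    return [u for u in population if u not in start_urls]


# Hypothetical toy collection standing in for real homepages.
pages = {
    "a": {"star", "trek", "fan"},
    "b": {"star", "trek", "episode"},
    "c": {"star", "wars"},
    "d": {"cooking", "recipes"},
    "e": {"trek", "fan", "club"},
}
links = {"a": ["b"], "b": ["c"]}
category_urls = ["c", "d", "e"]
```

Calling `ga_spider(pages, links, ["a"], category_urls)` returns the surviving URLs other than the starting page, ordered by similarity. Page "e" is not linked from any other page in this toy collection, so it can only enter the population through the sampling step.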

Research Design and Findings for Internet Categorization and Search

The system design adopted for Stage 1 of our research and the resulting findings are reported below.

Research Design: Multi-layered Self-organizing Feature Maps (M-SOM)

Categorization and classification are processes which involve clustering/grouping items of similar nature. Tagging similar items with meaningful labels (names) results in subject categories. When pairwise similarities are obtained between items, a hierarchical agglomerative cluster generation process can be adopted (a process often used in information science [54]). Several serial clustering algorithms exist, e.g., single-link clustering and complete-link clustering (based on popular minimal-spanning tree algorithms such as Prim's and Kruskal's) [58] [53] [51]. While these methods have demonstrated their usefulness in clustering documents, a somewhat newer and more promising approach based on the connectionist paradigm, or neural network computing, has attracted a resurgence of interest [33] [57]. There are several reasons for this, including the appearance of faster digital computers on which to simulate large networks, interest in building massively parallel computers, and, most importantly, the development of more powerful neural network architectures and algorithms.
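
For comparison with the neural approach, single-link clustering from pairwise similarities can be sketched as a Kruskal-style pass over the most similar pairs, using a union-find structure. This is a minimal illustration of the serial technique mentioned above, not part of the M-SOM design. The function name `single_link_clusters` and the `threshold` cut-off are assumptions made for the example.

```python
def single_link_clusters(n_items, similarities, threshold):
    """Merge items linked by a chain of pairs with similarity >= threshold.

    similarities: list of (similarity, i, j) tuples over item indices.
    """
    parent = list(range(n_items))

    def find(i):
        # Follow parent pointers to the cluster root, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Kruskal-style pass: visit the most similar pairs first.
    for sim, i, j in sorted(similarities, reverse=True):
        if sim < threshold:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    clusters = {}
    for item in range(n_items):
        clusters.setdefault(find(item), []).append(item)
    return sorted(clusters.values())
```

Lowering `threshold` merges more items into fewer clusters, so repeating the call with decreasing thresholds traces out the levels of the agglomerative hierarchy.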

Kohonen's self-organizing feature maps [40] [34], in particular, have drawn significant attention in various engineering and scientific domains. In the basic form of the algorithm, continuous-valued vectors are presented sequentially in time without specifying the desired output. After enough input vectors have been presented, network connection weights will specify cluster or vector centers that sample the input space such that the point density function of the vector centers tends to approximate the probability density function of the input vectors. In addition, the connection weights will be organized such that topologically close nodes are sensitive to inputs that are physically similar. Lin [39] first adopted the Kohonen SOM for information retrieval. In his prototype, he generated self-organizing clusters of important concepts in a small database of several hundred documents.

In order to organize the large number of homepages (10M+) on the Internet, we proposed a multi-layered SOM algorithm, which permits unlimited layers of Kohonen maps (we refer to it as M-SOM). A sketch of our proposed M-SOM algorithm is presented below:

1.
Initialize input nodes, output nodes, and connection weights:

Use the top (most frequently occurring) N terms (say 1000) from all homepages as the input vector and create a two-dimensional map (grid) of M output nodes (say a 20-by-10 map of 200 nodes). Initialize weights from N input nodes to M output nodes to small random values.

2.
Present each document (homepage) in order:

Represent each document (homepage) by a vector of N terms and present to the system.

3.
Compute distances to all nodes:

Compute distance $d_{j}$ between the input and each output node $j$ using

\begin{displaymath}d_{j} = \sum_{i=0}^{N-1} (x_{i}(t) - w_{ij}(t))^{2} \end{displaymath}

where $x_{i}(t)$ is the input to node $i$ at time $t$ and $w_{ij}(t)$ is the weight from input node $i$ to output node $j$ at time $t$.

4.
Select winning node $j^{*}$ and update weights to node $j^{*}$ and neighbors:

Select winning node $j^{*}$ as that output node with minimum $d_{j}$. Update weights for node $j^{*}$ and its neighbors to reduce their distances (between input nodes and output nodes). (See [34] [40] for the algorithmic detail of neighborhood adjustment.)

5.
Label regions in map:

After the network is trained through repeated presentation of all homepages (each homepage is presented at least 5 times), submit unit input vectors of a single term to the trained network and assign the winning node the name of the input term. Neighboring nodes which contain the same name/term then form a concept/topic region (group). Similarly, submit each homepage as input to the trained network again and assign it to a particular node in the map. The resulting map thus represents regions of important terms/concepts (the more important a concept, the larger its region) and the assignment of homepages to each region. Concept regions that are similar (conceptually) will also appear in the same neighborhood.

6.
Apply the above steps recursively for large regions:

For each map region which contains more than k (say 100) homepages, conduct a recursive procedure of generating another self-organizing map until each region contains no more than k homepages.
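
Steps 1 through 5 above can be sketched in plain Python as follows. This is a minimal illustration rather than our implementation. The linearly decaying learning rate, the fixed square neighborhood of `radius` grid cells, and all function names are assumptions made for the example; the actual neighborhood adjustment follows [34] [40]. Step 6 would re-apply the same functions to the documents of each large region.

```python
import random


def squared_distance(x, w):
    """Step 3: d_j = sum_i (x_i - w_ij)^2 for one output node."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def winner(x, weights):
    """Step 4: index of the output node with minimum distance to x."""
    return min(range(len(weights)),
               key=lambda j: squared_distance(x, weights[j]))


def train_som(docs, rows, cols, epochs=5, rate=0.5, radius=1, seed=0):
    """Steps 1-4 on a rows-by-cols grid; docs are equal-length term vectors."""
    rng = random.Random(seed)
    n_terms = len(docs[0])
    # Step 1: small random initial weights, one vector per output node.
    weights = [[rng.uniform(0.0, 0.1) for _ in range(n_terms)]
               for _ in range(rows * cols)]
    for epoch in range(epochs):
        step = rate * (1.0 - epoch / epochs)  # assumed linear decay
        for x in docs:  # Step 2: present each document in order.
            r0, c0 = divmod(winner(x, weights), cols)
            for j, w in enumerate(weights):
                r, c = divmod(j, cols)
                # Assumed neighborhood: nodes within `radius` of the winner.
                if max(abs(r - r0), abs(c - c0)) <= radius:
                    for i in range(n_terms):
                        w[i] += step * (x[i] - w[i])
    return weights


def label_nodes(terms, weights):
    """Step 5: name each node after the single-term inputs it wins."""
    labels = {}
    for i, term in enumerate(terms):
        unit = [1.0 if k == i else 0.0 for k in range(len(terms))]
        labels.setdefault(winner(unit, weights), []).append(term)
    return labels


def assign_docs(docs, weights):
    """Step 5: assign each document index to its winning node."""
    regions = {}
    for d, x in enumerate(docs):
        regions.setdefault(winner(x, weights), []).append(d)
    return regions
```

A typical call sequence would be `weights = train_som(docs, 10, 20)` followed by `label_nodes(terms, weights)` and `assign_docs(docs, weights)`. Any node whose document list exceeds the threshold k then becomes the input to a further `train_som` call.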

We believe that, with 5-6 layers of self-organizing maps and a simple subject category browsing interface, we can partition Internet resources into meaningful and manageable sizes, ready for hypertext browsing and/or category-specific searching.
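
The capacity behind this estimate can be checked with a short calculation. The sketch below assumes an idealized hierarchy in which every region is split into the same number of sub-regions and homepages are spread evenly, which real maps will not satisfy. The function names are hypothetical.

```python
def leaf_regions(categories_per_map, layers):
    """Number of bottom-level regions if every region is split at each layer."""
    return categories_per_map ** layers


def layers_needed(n_pages, categories_per_map, max_pages_per_region):
    """Fewest layers so that evenly filled leaf regions stay within the limit."""
    layers, regions = 0, 1
    while regions * max_pages_per_region < n_pages:
        regions *= categories_per_map
        layers += 1
    return layers


for categories in (30, 50):
    for layers in (5, 6):
        count = leaf_regions(categories, layers)
        print(f"{categories} categories, {layers} layers: {count:,} regions")
```

With 30 categories per map, five layers give 24,300,000 bottom-level regions. With 50 categories, six layers give 15,625,000,000. Under the even-split assumption and a threshold of 100 homepages per region, `layers_needed(10_000_000, 30, 100)` evaluates to 4, so the 5-6 layer figure leaves room for the uneven region sizes seen in practice.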

Research Findings

We have applied the proposed algorithm in applications of varying size: electronic brainstorming comments (10+ KBs, several hundred comments) and Internet entertainment-related homepages (3+ MBs, 10,000+ homepages). In this section, we report the results and status of our system implementation and evaluation.

Categorizing Electronic Brainstorming Comments

In [48], Orwig describes research in the application and evaluation of a Kohonen SOM algorithm to the problem of categorization of brainstorming output using electronic meeting systems.

A major advantage of the meeting software is its ability to let meeting participants brainstorm ideas in a parallel mode. Brainstormers can sit around a table and ``talk'' at the same time, using their keyboards. Often as many as several hundred comments can be generated by a group of 10-20 meeting participants during a typical one-hour electronic brainstorming (EBS) session. While meeting software has been shown to be extremely useful for idea generation, a divergent task, the process of categorizing crucial ideas embedded in meeting comments and generating a consensus list of important topics (idea categorization), a convergent task, is more difficult.

Because of the relatively small amount of data involved in the categorization process (often on the order of 10-30 KBs or 50-300 comments), a single-layered SOM algorithm was developed and tested. Figure 1 shows the SOM output of actual brainstorming comments in an electronic meeting session. Twenty group participants, who included managers and users of GroupSystems (an electronic meeting system developed at the University of Arizona) from various companies and government agencies, were asked to use the electronic brainstorming tool to respond to the following question: What are the most important information technology problems with respect to Collaborative Systems to be solved over the next five years? The group generated 201 comments over a period of 30 minutes. While the participants were responding to the question, an expert group facilitator utilized an existing Categorizer tool to produce a list of the major topics addressed by the respondents (by manually browsing the participants' comments). When he recognized that a new concept was appearing frequently in the comments, he added it to his list and attached relevant comments to the topic. The expert spent the entire 30 minutes during the EBS session, plus more time during break, to arrive at a list of 20 items.


  
Figure 1: SOM-generated list of topics

In a recent experiment [48], we compared a facilitator-generated list of topics with one generated by SOM. Eight facilitator subjects participated in the experiment. Subjects were given the text output of the brainstorming session results and the two lists. Subjects read through the actual comments first and then corrected each list by deleting inappropriate topics and adding topics that they thought were missing. The resulting lists of topics were then used to compute the recall and precision levels of the two lists. Statistical results for hypothesis testing were also obtained.

On average (with sample size N=8), the facilitator list obtained an 81% precision level and the SOM list obtained a 55% precision level. The difference was statistically significant (at the 5% significance level). Compared with human facilitators, the SOM algorithm was less precise in generating topics. In recall, the facilitator list reached an 88.5% level and the SOM list reached an 81% level. The difference was not statistically significant. However, the SOM algorithm took significantly less time to produce a list of topics (45 minutes for the facilitator and four minutes for the system). Considering the cognitive demand of generating topics manually and the prospect of using the SOM output as an information visualization and decision aid (i.e., using the SOM output as a straw-man list for further user refinement), we believe the results from this experiment were encouraging. They suggest an efficient, algorithmic alternative for information searchers or system users. Based on this initial research, we proceeded to test the SOM approach in several other larger Internet applications (where human categorization becomes even more difficult).
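
One plausible way to derive precision and recall from a subject's corrections is sketched below. The scoring rule is an assumption made for illustration and may differ from the procedure used in [48]: topics a subject leaves in place count as correct, deleted topics count against precision, and added topics count against recall. The function name and the example topics are hypothetical.

```python
def precision_recall(proposed, deleted, added):
    """Score one topic list from a subject's deletions and additions.

    precision = kept topics / proposed topics
    recall    = kept topics / (kept topics + topics the subject added)
    """
    kept = [topic for topic in proposed if topic not in deleted]
    corrected_size = len(kept) + len(added)
    precision = len(kept) / len(proposed) if proposed else 0.0
    recall = len(kept) / corrected_size if corrected_size else 0.0
    return precision, recall


# Hypothetical corrections by one subject.
proposed = ["networking", "security", "training", "cost", "standards"]
deleted = {"cost"}
added = ["user interface", "group memory"]
precision, recall = precision_recall(proposed, deleted, added)
```

Averaging these per-subject values over the eight subjects would give list-level figures of the kind reported above.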

Categorizing Internet Entertainment Homepages

In order to examine the scalability of the SOM approach to Internet categorization, we created a testbed of about 10,000 Internet homepages related to entertainment, using the Yahoo server. (We developed a spider/softbot that traversed and fetched the homepages at the top three layers of the entertainment section of the Yahoo directory.) The resulting testbed was about 3 MBs in size. The experiment aimed to use the M-SOM approach to classify the 10,000+ homepages into meaningful, multi-layered categories.
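
The traversal performed by such a spider can be sketched as a depth-limited breadth-first search. This is a minimal illustration, not the spider we used. `fetch_links` is a hypothetical callable standing in for fetching a page and extracting its links, and counting the starting page as depth 0 is an assumption about how the "top three layers" are numbered.

```python
from collections import deque


def crawl(start_url, fetch_links, max_depth=3):
    """Visit pages breadth-first, following links at most max_depth levels."""
    seen = {start_url}
    visited = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links below the deepest layer
        for link in fetch_links(url):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical in-memory directory used in place of real HTTP fetches.
toy_directory = {
    "entertainment": ["music", "movies"],
    "music": ["jazz"],
    "movies": ["entertainment"],
    "jazz": ["jazz-fan-page"],
    "jazz-fan-page": ["too-deep"],
}
pages = crawl("entertainment", lambda url: toy_directory.get(url, []))
```

The `seen` set prevents the back-link from "movies" to "entertainment" from causing a loop, and "too-deep" is never visited because it lies four levels below the starting page.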

The first-layer SOM process took 1 hour and 37 minutes on a DEC Alpha 3000/600 (200 MHz, 128 MBs RAM) and produced about 50 regions (groups) on the map. Using 100 homepages as the threshold for further SOM categorization, the second-layer SOM process took about 1 hour and 21 minutes. The SOM categorization process for the 10,000+ entertainment homepages ended after four levels.

The computational characteristics and initial output produced for the entertainment homepage analysis are interesting. We observed that many of the larger concept regions appeared to be meaningful and related to each other (e.g., SAN FRANCISCO and LOS ANGELES form neighboring regions). Initial browsing of the homepages in a concept region showed relevant homepages, e.g., we found Star Trek fan homepages under the SCIENCE FICTION concept region.

A sample concept browsing WWW server using SOM was developed recently. The resulting ET-Map server contains about 50 concept regions at the first layer and is available at: http://ai.bpa.arizona.edu/ent/et-map.html. The large concept regions (using 100 URLs as the threshold) can be clicked on to produce sub-regions. For example, Figure 2 shows the top-level map for all entertainment homepages. Clicking on the STAR TREK (503 URLs) concept region of the top-level map displays a sub-map which contains SCIENCE FICTION (23 URLs), as shown in Figure 3. Clicking on the SCIENCE FICTION region results in a ranked list of URLs summarized with titles and top keywords (Figure 4). Each URL is ``live'' and can be clicked on to fetch an actual homepage. We felt that this experimental server was interesting enough for some initial user evaluation.


  
Figure 2: ET-Map, top-most layer


  
Figure 3: Sub-regions for STAR TREK region


  
Figure 4: URLs in SCIENCE FICTION sub-region of the STAR TREK region

User Evaluation: An Experiment on the Entertainment Map

In order to assess the usefulness of the ET-Map for Web browsing, we designed a qualitative experiment based upon protocol analysis [23]. Our research goal was to understand the characteristics of the SOM output as demonstrated in the ET-Map server and its potential for becoming an alternative for concept browsing and searching for WWW services, using the Yahoo entertainment directory as the benchmark for comparison.

Experimental Design

The experiment involved 10 subjects: five graduate students from the Library Science Department at the University of Arizona, four graduate students from the MIS Department, and one System Administrator, also from the MIS Department. The subjects compared the ET-Map (http://ai.bpa.arizona.edu/ent/et-map.html) created by the Kohonen SOM algorithm to the manually catalogued Entertainment hierarchy of Yahoo (http://www.yahoo.com/Entertainment/). The subjects were asked to perform three searches twice, once by searching Yahoo's Entertainment hierarchy and once by searching the ET-Map. The subjects verbalized their thought processes and comments while searching and the experimenters collected the protocols for analysis.

The searches were all performed using Netscape 1.1 or higher on either an X-terminal or a Macintosh. The only `search' mechanism which subjects could use was the Netscape Find function, which searches the text on the loaded page for the input word. No instructions were given to the subjects as to whether the searches should be broad (any home pages on a specific subject) or narrow (a particular page). Due to the open-endedness of the experiment and the nature of browsing the WWW, some of the searches became extensive and unsuccessful (most subjects would abandon an unsuccessful search after approximately 10 minutes). At the other end of the spectrum, a few of the searches were short and succeeded in under a minute, with the majority of the search times falling between the two extremes. All subjects completed the experiment within an hour and 15 minutes.

Experimental Results

Patterns of Problems Common to Both Tools

The ET-Map Strengths and Weaknesses

Through analysis of the search patterns and verbal protocols of the subjects, it became apparent that the ET-Map has certain weaknesses and certain strengths.

It also appears that some of the problems associated with the map are due to the nature of the documents on the Web. A large percentage of the `entertainment' home pages are personal home pages, which are difficult to classify. On most personal home pages, the owners may discuss their Professional Experience (e.g., Ikos Systems, Ready Systems, Link Flight Simulation, Dymac, Simpact Associates), their Education (e.g., M.A. Applied Mathematics, UCSD), and their Personal Interests (e.g., Astronomy, Bay Area, Comics, Computer Languages, Family, Fantasy and Science Fiction, Games, History, Maps, Mathematics, Sports). The only appropriate classification for these pages is probably as a personal home page region. If the personal home pages could be weeded out of the collection and placed into their own category, the map would be much cleaner and more clearly attain its promise.

Conclusion

This research aimed to address current and future Internet searching problems by developing and testing promising neural network categorization techniques. Based on a general research framework for Internet categorization and searching, the first stage of our research involves an automatic, multi-layered, self-organizing approach to categorizing Internet homepages based on their contents (terms). Results of this categorization process can then be used at the second stage to create category-specific concept spaces for assisting in concept-based, associative information retrieval.

The multi-layered SOM (M-SOM) algorithm has been tested in several applications, including electronic brainstorming comments and Internet entertainment-related homepages. The initial testing results were interesting. The techniques appeared to produce meaningful results for small-scale applications (e.g., EBS categorization) and potentially useful concept maps for serendipitous browsing for large-scale applications (e.g., Internet homepage categorization). However, more systematic refinement and user evaluation for large-scale Internet applications and parallelization for selected algorithms are needed (and are under way).

Acknowledgment

This project is supported by a Research Initiation Award grant awarded by the Division of Information, Robotics, and Intelligent Systems, NSF (``Building a Concept Space for an Electronic Community System,'' PI: H. Chen, 1992-1994, IRI9211418), a National Collaboratory grant awarded by NSF (``Systems Technology for Building a National Collaboratory'', PI: B. Schatz, 1990-1994), a Digital Library Initiative grant awarded by NSF/ARPA/NASA (``Building the Interspace: Digital Library Infrastructure for a University Engineering Community,'' PIs: B. Schatz, H. Chen, et al., 1994-1998, IRI9411318), and an NSF/CISE grant (``Concept-based Categorization and Search on Internet: A Machine Learning, Parallel Computing Approach,'' PI: H. Chen, 1995-1995, IRI9525790).

Bibliography

1
R. K. Belew.
Adaptive information retrieval.
In Proceedings of the Twelfth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 11-20, Cambridge, MA, June 25-28, 1989.

2
T. Berners-Lee, R. Cailliau, A. Luotonen, H. F. Nielsen, and A. Secret.
The World-Wide Web.
Communications of the ACM, 37(8):76-82, August 1994.

3
D. C. Blair and M. E. Maron.
An evaluation of retrieval effectiveness for a full-text document-retrieval system.
Communications of the ACM, 28(3):289-299, 1985.

4
M. J. Blosseville, G. Hebrail, M. G. Monteil, and N. Penot.
Automatic document classification: natural language processing, statistical analysis, and expert system techniques used together.
In Proceedings of the Fifteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 51-57, Copenhagen, Denmark, June 21-24, 1992.

5
C. M. Bowman.
The Harvest information discovery and access system.
In Proceedings of the Second International World Wide Web Conference '94, Chicago, IL, October 17-20, 1994.

6
C. M. Bowman, P. B. Danzig, U. Manber, and F. Schwartz.
Scalable internet resource discovery: research problems and approaches.
Communications of the ACM, 37(8):98-107, August 1994.

7
L. Breiman, J. Friedman, R. Olshen, and C. Stone.
Classification and Regression Trees.
Wadsworth, Monterey, CA, 1984.

8
E. Carmel, S. Crawford, and H. Chen.
Browsing in hypertext: A cognitive study.
IEEE Transactions on Systems, Man and Cybernetics, 22(5):865-884, September/October 1992.

9
H. Chen.
Collaborative systems: solving the vocabulary problem.
IEEE COMPUTER, 27(5):58-66, Special Issue on Computer-Supported Cooperative Work (CSCW), May 1994.

10
H. Chen.
Machine learning for information retrieval: neural networks, symbolic learning, and genetic algorithms.
Journal of the American Society for Information Science, 46(3):194-216, April 1995.

11
H. Chen and V. Dhar.
User misconceptions of online information retrieval systems.
International Journal of Man-Machine Studies, 32(6):673-692, June 1990.

12
H. Chen and J. Kim.
GANNET: a machine learning approach to document retrieval.
Journal of Management Information Systems, 11(3):7-41, Winter 1994-95.

13
H. Chen and K. J. Lynch.
Automatic construction of networks of concepts characterizing document databases.
IEEE Transactions on Systems, Man and Cybernetics, 22(5):885-902, September/October 1992.

14
H. Chen, K. J. Lynch, K. Basu, and D. T. Ng.
Generating, integrating, and activating thesauri for concept-based document retrieval.
IEEE EXPERT, Special Series on Artificial Intelligence in Text-based Information Systems, 8(2):25-34, April 1993.

15
H. Chen and D. T. Ng.
An algorithmic approach to concept exploration in a large knowledge network (automatic thesaurus consultation): symbolic branch-and-bound vs. connectionist Hopfield net activation.
Journal of the American Society for Information Science, 46(5):348-369, June 1995.

16
H. Chen, B. R. Schatz, T. Yim, and D. Fye.
Automatic thesaurus generation for an electronic community system.
Journal of the American Society for Information Science, 46(3):175-193, April 1995.

17
H. Chen and L. She.
Inductive query by examples (IQBE): A machine learning approach.
In Proceedings of the 27th Annual Hawaii International Conference on System Sciences (HICSS-27), Information Sharing and Knowledge Discovery Track, Maui, HI, January 4-7, 1994.

18
J. Courteau.
Genome databases.
Science, 254:201-207, October 11, 1991.

19
S. L. Crawford, R. Fung, L. A. Appelbaum, and R. M. Tong.
Classification trees for information retrieval.
In Proceedings of the 8th International Workshop on Machine Learning, pages 245-249, Morgan Kaufmann, 1991.

20
S. L. Crawford and R. M. Fung.
An analysis of two probabilistic model induction techniques.
Statistics and Computing, 2(2):83-90, June 1992.

21
P. DeBra and R. Post.
Information retrieval in the World-Wide Web: making client-based searching feasible.
In Proceedings of the First International World Wide Web Conference '94, Geneva, Switzerland, 1994.

22
T. E. Doszkocs, J. Reggia, and X. Lin.
Connectionist models and information retrieval.
Annual Review of Information Science and Technology (ARIST), 25:209-260, 1990.

23
K. A. Ericsson and H. A. Simon.
Protocol analysis: verbal reports as data.
MIT Press, Cambridge, MA, 1993.

24
O. Etzioni and D. Weld.
A softbot-based interface to the Internet.
Communications of the ACM, 37(7):72-79, July 1994.

25
C. L. Foss.
Tools for reading and browsing hypertext.
Information Processing and Management, 25(4):407-418, 1989.

26
K. A. Frenkel.
The human genome project and informatics.
Communications of the ACM, 34(11):41-51, November 1991.

27
N. Fuhr, S. Hartmann, G. Knorz, G. Lustig, M. Schwantner, and K. Tzeras.
AIR/X - a rule-based multistage indexing system for large subject fields.
In Proceedings of the Eighth National Conference on Artificial Intelligence (AAAI-90), pages 789-795, Boston, MA, July 29-August 3, 1990.

28
R. Fung and S. L. Crawford.
Constructor: a system for the induction of probabilistic models.
In Proceedings of the 8th National Conference on Artificial Intelligence (AAAI-90), pages 762-769, Boston, MA, July 29-August 3, 1990.

29
G. W. Furnas.
Statistical semantics: How can a computer use what people name things to guess what things people mean when they name things.
In Proceedings of the Human Factors in Computer Systems Conference, pages 251-253, Gaithersburg, MD, Association for Computing Machinery, March 1982.

30
G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais.
The vocabulary problem in human-system communication.
Communications of the ACM, 30(11):964-971, November 1987.

31
M. Gordon.
Probabilistic and genetic algorithms for document retrieval.
Communications of the ACM, 31(10):1208-1218, October 1988.

32
M. D. Gordon.
User-based document clustering by redescribing subject descriptions with a genetic algorithm.
Journal of the American Society for Information Science, 42(5):311-322, June 1991.

33
K. Knight.
Connectionist ideas and algorithms.
Communications of the ACM, 33(11):59-74, November 1990.

34
T. Kohonen.
Self-Organization and Associative Memory.
Third Edition, Springer-Verlag, Berlin Heidelberg, 1989.

35
M. Koster.
ALIWEB: Archie-like indexing in the web.
In Proceedings of the First International World Wide Web Conference '94, Geneva, Switzerland, 1994.

36
K. L. Kwok.
A neural network for probabilistic information retrieval.
In Proceedings of the Twelfth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 21-30, Cambridge, MA, June 25-28, 1989.

37
D. D. Lewis.
Learning in intelligent information retrieval.
In Proceedings of the 8th International Workshop on Machine Learning, pages 235-239, Morgan Kaufmann, 1991.

38
D. D. Lewis.
An evaluation of phrasal and clustered representations on a text categorization task.
In Proceedings of the Fifteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 37-50, Copenhagen, Denmark, June 21-24, 1992.

39
X. Lin, D. Soergel, and G. Marchionini.
A self-organizing semantic map for information retrieval.
In Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 262-269, Chicago, IL, October 13-16, 1991.

40
R. P. Lippmann.
An introduction to computing with neural networks.
IEEE Acoustics Speech and Signal Processing Magazine, 4(2):4-22, April 1987.

41
K. J. MacLeod and W. Robertson.
A neural algorithm for document clustering.
Information Processing and Management, 27(4):337-346, 1991.

42
P. Maes.
Agents that reduce work and information overload.
Communications of the ACM, 37(7):30-40, July 1994.

43
M. E. Maron and J. L. Kuhns.
On relevance, probabilistic indexing and information retrieval.
Journal of the ACM, 7(3):216-243, July 1960.

44
B. Masand, L. Gordon, and D. Waltz.
Classifying news stories using memory-based reasoning.
In Proceedings of the Fifteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 59-65, Copenhagen, Denmark, June 21-24, 1992.

45
Mauldin and Leavitt.
Web-agent related research at the CMT.
In Proceedings of the ACM Special Interest Group on Networked Information Discovery and Retrieval (SIGNIDR-94), August 1994.

46
O. McBryan.
GENVL and WWWW: tools for taming the web.
In Proceedings of the First International World Wide Web Conference '94, Geneva, Switzerland, 1994.

47
K. Obraczka, P. B. Danzig, and S. Li.
Internet resource discovery services.
IEEE COMPUTER, 26(9):8-24, September 1993.

48
R. Orwig, H. Chen, and J. F. Nunamaker.
A graphical, self-organizing approach to classifying electronic meeting output.
Journal of the American Society for Information Science, 48(2):157-170, February 1997.

49
F. Petry, B. Buckles, D. Prabhu, and D. Kraft.
Fuzzy information retrieval using genetic algorithms and relevance feedback.
In Proceedings of the ASIS Annual Meeting, pages 122-125, 1993.

50
V. V. Raghavan and B. Agarwal.
Optimal determination of user-oriented clusters: An application for the reproductive plan.
In Proceedings of the Second International Conference on Genetic Algorithms and Their Applications, pages 241-246, Cambridge, MA, July 1987.

51
E. Rasmussen.
Clustering algorithms.
In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Editors, Prentice Hall, Englewood Cliffs, NJ, 1992.

52
L. F. Rau and P. S. Jacobs.
Creating segmented databases from free text for text retrieval.
In Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 337-346, Chicago, IL, October 13-16, 1991.

53
G. Salton.
Generation and search of clustered files.
ACM Transactions on Database Systems, 3(4):321-346, December 1978.

54
G. Salton.
Automatic Text Processing.
Addison-Wesley Publishing Company, Inc., Reading, MA, 1989.

55
B. R. Schatz and J. B. Hardin.
NCSA Mosaic and the World Wide Web: global hypermedia protocols for the internet.
Science, 265:895-901, 12 August 1994.

56
P. K. Simpson.
Artificial Neural Systems: Foundations, Paradigms, Applications, and Implementations.
McGraw-Hill Book Company, New York, NY, 1990.

57
P. K. Simpson.
Fuzzy min-max neural networks - part 2: clustering.
IEEE Transactions on Fuzzy Systems, 1(1):32-45, February 1993.

58
K. Sparck Jones.
Automatic Keyword Classification for Information Retrieval.
Butterworths, London, 1971.

59
R. Wilkinson and P. Hingston.
Using the cosine measure in a neural network for document retrieval.
In Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 202-210, Chicago, IL, October 13-16, 1991.

60
R. Wilkinson, P. Hingston, and T. Osborn.
Incorporating the vector space model in a neural network used for document retrieval.
Library Hi Tech, 10(12):69-75, 1992.

61
J. Yang and R. R. Korfhage.
Effects of query term weights modification in document retrieval: a study based on a genetic algorithm.
In Proceedings of the Second Annual Symposium on Document Analysis and Information Retrieval, pages 271-285, Las Vegas, NV, April 26-28, 1993.

62
J. Yang, R. R. Korfhage, and E. Rasmussen.
Query improvement in information retrieval using genetic algorithms: a report on the experiments of the TREC project.
In Text Retrieval Conference (TREC-1), pages 31-58, Gaithersburg, MD, November 4-6, 1993.


hchen@bpa.arizona.edu