
An Automatic Indexing and
Neural Network Approach to
Concept Retrieval and Classification of
Multilingual (Chinese-English) Documents

Chung-hsin Lin and Hsinchun Chen
Management Information Systems Department, University of Arizona
Tucson, Arizona 85721
(602) 621-4153, hchen@bpa.arizona.edu

Abstract:

An automatic indexing and concept classification approach to a multilingual (Chinese and English) bibliographic database is presented. We introduced a multi-linear term-phrasing technique to extract concept descriptors (terms or keywords) from a Chinese-English bibliographic database. A concept space of related descriptors was then generated using a co-occurrence analysis technique. Like a man-made thesaurus, the system-generated concept space can be used to generate additional semantically-relevant terms for search. For concept classification and clustering, a variant of a Hopfield neural network was developed to cluster similar concept descriptors and to generate a small number of concept groups to represent (summarize) the subject matter of the database. The concept space approach to information classification and retrieval has been adopted by the authors in other scientific databases and business applications, but multilingual information retrieval presents a unique challenge. This research reports our experiment on multilingual databases.

Our system was initially developed in the MS-DOS environment, running the ETEN Chinese operating system. For performance reasons, it was then tested on a UNIX-based system. Because Chinese is an ideographic language, we developed a Chinese term-phrase indexing paradigm that takes these ideographic characteristics into account and used it as the basis of a multilingual information classification model. By applying the neural network based concept classification technique, the model presents a novel way of organizing unstructured multilingual information.

Introduction

The overwhelming volume of online information generated and disseminated across computer networks has created a significant burden for researchers and practitioners. For structured numeric data, database management systems (DBMS) have typically been used. However, for unstructured textual data, information management, processing and retrieval remain very complex and problematic.

Information Retrieval (IR) is a research area that has been studied extensively in the western world [8] [43]. For example, retrieving information from large, unstructured databases of English documents has been an area of inquiry for several decades. Nevertheless, due to the cognitive process involved in organizing and retrieving information [8], there still exists significant uncertainty and difficulty in online information management and information retrieval.

The prevailing keyword searching and user browsing techniques for IR suffer from an inability to support concept or content-based search [2]. Recently researchers and practitioners have turned their attention to ``concept-based'' IR, e.g., the (automatic) concept space approach proposed by Chen and his co-workers [10] [11] and the (manual) thesaurus search and content-based IR supported by commercial packages such as Verity's Topic and Oracle's ConText. Such a concept-based retrieval capability has been considered by many researchers and practitioners to be an effective complement to the prevailing keyword search or user browsing options available in most information retrieval systems.

For oriental languages the situation is even more complex than for western languages [C10]. Due to the unique linguistic and grammatical structures of oriental languages, IR techniques applicable to western languages may not be appropriate for oriental language information organization and retrieval. Although several techniques have been proposed, an automatic and systematic approach to multilingual information management and retrieval is still lacking. In this research, we examined a multilingual bibliographic database which contained technical documents mainly written in Chinese, with occasional English terms. This application allowed us to study some crucial multilingual information retrieval issues and helped us develop an integrated, concept-based Chinese-English information classification and retrieval model.

The structure of this article is as follows. Section 2 discusses information retrieval problems in the context of multilingual databases and presents our proposed concept space approach. Extensive literature reviews are also provided. Section 3 presents a classical English classification model and our proposed multilingual model. Section 4 describes in detail the automatic multilingual indexing techniques adopted in this research, in particular for Chinese information. Section 5 presents a neural network based concept classification technique. Based on cluster analysis and Hopfield network algorithms, an unstructured database of multilingual documents can be organized in a semantically structured form. Section 6 summarizes a system implementation for a sample bibliographic database of 1052 documents. Conclusions and directions for additional research appear in Section 7.

Problem Description and Research Design

Different human languages exhibit significantly different linguistic and grammatical characteristics which strongly affect how information is structured and represented in modern databases. This is particularly true of the contrasts between western languages (e.g., English, French, German, etc.) and oriental languages (e.g., Chinese, Japanese, Korean, etc.). Despite these differences, there are common problems associated with online information management and retrieval across databases created in different languages. In this research we attempted to address the information management and retrieval issues related to a multilingual database containing mainly Chinese and English texts.

In this section, we will first review some common IR problems and the prevailing techniques to address these problems. Discussion of specific research objectives and the techniques adopted in this research will follow.

Information Retrieval Problems and Techniques: An Overview

In the past few decades, the availability of cheap and effective storage devices and information systems has prompted the rapid growth and proliferation of relational, graphical, and textual databases. Information collection and storage have become easier, but the effort required to retrieve relevant information has become significantly greater, especially in large-scale databases. This situation is particularly evident for textual databases, which are widely used in traditional library science environments, in business applications (e.g., manuals, newsletters, and electronic data interchanges), and in scientific applications (e.g., electronic community systems and scientific databases). Information stored in these databases often becomes voluminous, fragmented, and unstructured after years of intensive use. Only users with extensive subject area knowledge, system knowledge, and classification scheme knowledge are able to maneuver and explore in these textual databases [7].

In conventional information retrieval environments, keywords are manually or automatically assigned and queries are formulated by using terms interconnected by Boolean operators. Although widely used, Boolean query languages have some drawbacks: users find it difficult to formulate their queries using the Boolean syntax; the retrieved documents are not ranked in any particular order; and, most importantly, the retrieval results are often inadequate [43] [44]. The vocabulary problem in human-computer interaction further confounds the keyword-based Boolean retrieval mechanism [5]. In [23], Furnas et al. found that in spontaneous word choice for objects in five domains, two people favored the same term with less than 20% probability. This fundamental property of language limits the success of various design methodologies for keyword-driven interaction.

The vector space model, proposed by Salton [43] [44], presents an alternative approach for handling information retrieval applications. In the vector space model, both the stored documents and the user queries are represented by sets of term-weight pairs, without Boolean operators. Documents and queries are then compared using similarity functions (e.g., the cosine function, Jaccard's function, etc.) computed over the matching terms [16] [43]. The resulting similarity scores indicate how closely each document matches the user's query, so the retrieved documents can be ranked and displayed to users more meaningfully.
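The comparison step described above can be illustrated with a minimal Python sketch. The sparse dictionary representation, the function names, and the toy weights are assumptions made for illustration only; the weighted Jaccard form shown is one common variant and is not necessarily the one used in [16] [43].

```python
import math

def cosine(doc, query):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(doc[t] * query[t] for t in set(doc) & set(query))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

def jaccard(doc, query):
    """Weighted Jaccard coefficient: dot / (|d|^2 + |q|^2 - dot)."""
    dot = sum(doc[t] * query[t] for t in set(doc) & set(query))
    denom = (sum(w * w for w in doc.values())
             + sum(w * w for w in query.values()) - dot)
    return dot / denom if denom else 0.0

def rank(documents, query):
    """Return (score, name) pairs sorted from most to least similar."""
    scored = [(cosine(vec, query), name) for name, vec in documents.items()]
    return sorted(scored, reverse=True)

# Hypothetical toy collection: term -> weight
documents = {
    "d1": {"information": 0.8, "retrieval": 0.6},
    "d2": {"neural": 0.9, "network": 0.4},
    "d3": {"information": 0.3, "network": 0.7},
}
query = {"information": 1.0, "retrieval": 1.0}

for score, name in rank(documents, query):
    print(name, round(score, 3))
```

Because every document receives a score, the result list is ordered, which is exactly what a Boolean query cannot offer.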

Most recent internet resource discovery services support basic Boolean or vector space querying capabilities. For example, Gopher permits keyword-based search across different information sources and WAIS allows search using vector space matching. In addition to these keyword-based searching functionalities, internet resource discovery software often also provides user browsing capabilities. For example, Gopher allows its user to traverse in a system hierarchy (a road map for the Gopher space) and Mosaic supports hypertext browsing across information sources. However, several design problems frequently arise from user browsing, especially in a large information space. The browsing system can potentially confuse and disorient its user through what is known as the embedded digression problem and it can cause the user to spend a great deal of time while learning nothing specific, the art museum phenomenon [20] [3]. Hypertext systems provide bookmarks, context maps, backtracking or history lists, and guided tours to help the user overcome such problems [38]. However, browsing in a large and unfamiliar information space is still time-consuming, cognitively demanding, and at times non-productive.

Based on our experiences in dealing with several textual applications, including intelligence analysis [10] [11], meeting support systems [9], and scientific (molecular biology) databases [12], we have developed an algorithmic, concept-based approach to information classification and retrieval. In our design, we generate a concept space by first extracting concepts (terms) automatically from the texts in databases. Similar concepts are then linked through the co-occurrence analysis of concepts in texts. The concept space created represents the vocabularies used in documents and the similarity probabilities between these vocabularies. A graph traversal and/or clustering algorithm can then be used to help searchers identify similar concepts in the concept space and ``dock'' on to the relevant part of the information space automatically. We believe the difficulties and problems associated with keyword-based searching and user browsing can be partially alleviated by adopting the proposed concept-based IR approach. We present a blueprint of our approach and a review of relevant literature below. Details about the specific algorithms will be discussed in the next section.

By providing a meaningful and semantics-rich concept space and clusters for the often large and unstructured information space using the co-occurrence analysis and clustering techniques, the proposed approach presents a concept-based retrieval option to users, complementary to the prevailing keyword-based searching and user browsing. However, the feasibility of such an approach in the complex multilingual IR environment remains to be examined.

Multilingual Information Retrieval: Chinese-English Bibliographic Databases

English is a phonographic language in which almost every word has one or more independent meanings [C3]. However, the structure of the Chinese language is based on pictographs and each Chinese word (character) has a unique meaning. In ancient Chinese literature, writing tended to be more concise, with a single word conveying several meanings. In modern Chinese writing, especially for technical literature, term-phrases which contain more than one word have often been used to express precise meaning. A technical term is seldom represented by a single word. Nevertheless, most existing Chinese information retrieval systems are still based on word indexing [C7] [C9] [C13] [C14] [C15].

A problem associated with Chinese technical literature indexing is referred to as the word-division problem. Automatically extracting meaningful phrases instead of incidental, meaningless words from Chinese literature remains a challenge for researchers [C4] [C5] [C6] [C8] [C12]. Two approaches to the Chinese word-division problem have been proposed. The first of these is grammar-based [18] [54] [35] [53]. Incorporating a syntactical and semantic knowledge base into a computer program will enable it to perform semantic parsing of documents and texts [46] [24]. However, the effectiveness of this approach for large applications has not been fully tested. Developing an effective and robust natural language parser for Chinese documents is still in its infancy.

The other approach is non-grammar-based [C1] [50] [17] and typically uses a large built-in dictionary to help extract phrases automatically from unstructured Chinese information. In [C2], the researcher coupled a phonetic (pinyin) symbol with the original ideographic Chinese entry to help extract phrases from Chinese sentences. This approach represents Chinese information in an English-like structure. A space character is used for separating each pinyin symbol in the phonetic symbol file and an algorithm is used to automatically divide and extract phrases appearing in the pinyin symbol file. This approach is a variant of the term-marking operation described in [C11]. Due to the difficulties in creating large-scale, detailed syntactical and semantic knowledge bases (the grammar-based approach), our research adopted the non-grammar-based approach.

The phonographic nature of the English language has created the well-documented suffix-stripping problem during information retrieval [40] [34] [26]. It is not clear whether the Chinese language is immune from this problem or not. For example, should `` '' (book) be suffix-stripped to be `` '' since these two terms have the same meaning? This research allowed us to examine a few well-known Chinese-English document retrieval problems such as the English suffixing problem. The specific English suffix-stripping algorithm adopted in our research was initially developed by Porter [40]. Some sample rules are shown in Figure 1.


  
Figure 1: Examples of Porter's Stemming Rules

After addressing the Chinese word-division and the English suffix-stripping problems, we then were able to adopt automatic indexing techniques, the vector space model, and concept space generation techniques [43] [9] to analyze and classify Chinese-English documents. The specific goals of this research included:

We aimed to improve Chinese information retrieval from the character-word level to the term-phrase level, characteristic of English information retrieval, but still preserve the ideographic features of the Chinese language and then to facilitate automatic extraction and classification of concepts hidden in unstructured multilingual databases.

The test bed for our research was a Chinese-English bibliographic database, initially stored in the MS-DOS environment. Due to the requirement of processing Chinese information, the ETEN Chinese operating system (one of the most popular Chinese operating systems on PCs) was also used. Our prototype system was developed in ANSI C. The system was then ported to a UNIX SPARC 390 workstation for systematic analysis and testing.

The sample bibliographic database contained a collection of articles published in ETEN magazine over the past four years. Figure 2 presents sample entries. Each entry consists of structured data (i.e., the author, the publication year, and the volume number) and unstructured data (i.e., the title). Many titles contain both Chinese and English terms, a characteristic of technical Chinese databases. In the multilingual system model proposed in this paper, we were concerned with only the unstructured part (i.e., title) of each bibliographic entry. A total of 1052 document entries were stored in the sample bibliographic database.


  
Figure 2: Sample Chinese-English Bibliographic Records

A Multilingual Information Classification Model

An English Classification Model

The stages involved in indexing and classifying unstructured English documents according to the concept space approach proposed by Chen and his co-workers can be summarized in the English classification model shown in Figure 3. In this model, four steps are necessary to automatically extract and represent concept descriptors (terms or keywords).


  
Figure 3: English Classification Model

A Multilingual Classification Model

As shown in Figure 4, our multilingual classification model consists of six phases, the first four for automatic multilingual indexing and the last two for automatic concept classification.


  
Figure 4: Multilingual Classification Model

Due to the difficulty in translating foreign technical and scientific terminologies into Chinese, it is a common practice to use both Chinese and foreign terminologies in Chinese technical literature. This phenomenon is evident in our sample bibliographic database of computing related documents. In order to support concept-based management and retrieval of Chinese-English documents, we developed an integrated multilingual information classification model. An overview of the model is presented below. Details about specific techniques are presented in Sections 4 and 5.

Automatic Multilingual Indexing

This section will describe in detail the first four phases of the multilingual classification model. They all contribute to our efforts to resolve the automatic Chinese-English indexing problem. Examples and analyses are also presented.

Source Entry Reformation

Source entry reformation reformulates each source entry into a collection of text segments without stopwords. Since there are two languages involved, Chinese and English, both need to be processed.

Each Chinese word was separated from the others by the space character. For technical Chinese literature, a Chinese word often plays the role of a character in English. Therefore, the basic semantic unit in modern Chinese is a term-phrase, not a single word. Words need to be combined in order to obtain specific meanings.

English words in the source entry were checked against a stopword list of 150 entries. Each identified stopword was replaced by a `-' symbol. The remaining words were then processed by Porter's stemming algorithm [40]. After the stopwording and stemming process, English entries were ready for automatic indexing. However, Chinese entries still needed to be further analyzed.
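As a rough illustration of the stopwording step for English words, the following Python sketch replaces stopwords with `-` and then cuts the entry into text segments at those markers. The tiny stopword set is a hypothetical stand-in for the 150-entry list, splitting segments at the `-` markers is our reading of the description rather than a documented detail, and Porter stemming is deliberately left out.

```python
# Hypothetical stand-in for the 150-entry stopword list.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def reform_entry(title):
    """Mark stopwords with '-' and split the title into stopword-free segments."""
    marked = ["-" if w.lower() in STOPWORDS else w for w in title.split()]
    segments, current = [], []
    for token in marked:
        if token == "-":
            # A stopword marker closes the segment collected so far.
            if current:
                segments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return marked, segments

marked, segments = reform_entry("An Introduction to the Design of Database Systems")
print(marked)
print(segments)
```

Each resulting segment is a run of adjacent content words, which is the unit that phrase formation would operate on.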

Multi-Linear Term-Phrasing

Two approaches were considered for forming phrases (see Figure 6). The first is called single-linear term-phrasing, in which no two terms may use the same word from a given text segment. For example, the sets of terms that can be extracted from the text segment ``Information Retrieval Systems'' include (``Information'', ``Retrieval'', and ``Systems''), (``Information'' and ``Retrieval Systems''), or (``Information Retrieval'' and ``Systems''). ``Information Retrieval'' and ``Retrieval Systems'' will not be generated together as a set of candidate terms using single-linear term-phrasing, because both contain the word ``Retrieval''.


  
Figure 6: Single-Linear vs. Multi-Linear Term-Phrasing for ``Desktop (W1-W2) Publishing (W3-W4) Systems (W5-W6)''

We believe a multi-linear term-phrasing approach is more appropriate for extracting complete, meaningful Chinese phrases. This approach can also be adopted for English documents. Under this approach, extracted phrases may share constituent words. For example, for ``Information Retrieval Systems'', both ``Information Retrieval'' and ``Retrieval Systems'' can be formed. We found that the multi-linear method produced more phrases than the single-linear method. Both Chinese and English sentences were transformed into a collection of multiple-word terms in the same way.
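The difference between the two phrasing methods can be made concrete with a short Python sketch. The function names are ours, and the maximum phrase length of four words is taken from the 1-word to 4-word phrases reported for the database; everything else is an illustrative assumption.

```python
def multi_linear_phrases(words, max_len=4):
    """All runs of 1..max_len adjacent words; phrases may share words."""
    return [tuple(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]

def single_linear_segmentations(words, max_len=4):
    """All ways to cut the segment into consecutive, non-overlapping phrases."""
    if not words:
        return [[]]
    results = []
    for n in range(1, min(max_len, len(words)) + 1):
        head = tuple(words[:n])
        for rest in single_linear_segmentations(words[n:], max_len):
            results.append([head] + rest)
    return results

segment = ["Information", "Retrieval", "Systems"]

# Multi-linear: overlapping phrases are all kept.
for phrase in multi_linear_phrases(segment):
    print(" ".join(phrase))

# Single-linear: each candidate set uses every word exactly once.
for candidate in single_linear_segmentations(segment):
    print([" ".join(p) for p in candidate])
```

For this three-word segment the multi-linear method yields six phrases, including both ``Information Retrieval'' and ``Retrieval Systems'', whereas no single-linear candidate set contains both.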

Figure 6 shows the different possible permutations of adjacent words which can be generated for the same text segment using the two approaches. Table 1 presents a summary of the results of multi-linear term-phrasing for the bibliographic database; 45450 terms were extracted from the bibliographic database, with 1-word, 2-word, 3-word, and 4-word phrases.


 
Table 1: Results of Multi-Linear Term-Phrasing

1-word terms   2-word terms   3-word terms   4-word terms   Total
15442          12580          9947           7481           45450

 


Legality Checking

Term-phrasing often caused an undesired shadow effect for Chinese information. In order to clean up some of the noise, a built-in dictionary was used. As discussed earlier, in modern Chinese technical literature, a single Chinese word plays the role of a character in English. In English, sub-phrases formed from adjacent words of a meaningful term-phrase are often still meaningful. For example, ``Desktop Publishing Systems'' can produce meaningful adjacent term-phrases like ``Desktop Publishing'' or ``Publishing Systems''. But this is not true for Chinese phrases. For example, the equivalent `` '' actually consists of three meaningful Chinese terms: `` '' (Desktop), `` '' (Publishing), and `` '' (Systems). But the term-phrasing process may create invalid combinations like `` '', `` '' and so on. These invalid term-phrases were undesired shadows of the valid term-phrases.

Illegal phrases should be removed from the source entry. As described in [C1] [50] [17] [35] [53], a built-in dictionary could be employed to solve the word-division problem. Our system included a built-in dictionary to check the legality of the extracted terms. Table 2 compares the size of the built-in dictionary with the results of legality checking for the sample bibliographic data. With only 3338 terms in the dictionary we were able to screen out 22139 illegal terms. The terms in the built-in dictionary were created manually based on the dictionary described in [C1] and some common computer science terms. Online general-purpose dictionaries and other domain-specific dictionaries (e.g., in business, computer science, engineering, medicine, etc.) are often available from vendors and publishers or can be generated through OCR scanning of existing hard-copy sources. In this experiment we created our dictionary manually because of our limited domain and the prototype nature of our experiment. However, in larger-scale applications, existing dictionaries could be incorporated.

Since the modern Chinese language no longer treats a word as a basic semantic unit, single Chinese words were not considered as valid descriptors and only legal 2-word, 3-word, and 4-word phrases were used in concept space generation. English terms were not analyzed in this phase; single-word English terms were included in the concept space generation process.
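A minimal Python sketch of legality checking follows. The six-character example string and the three-entry dictionary are hypothetical illustrations chosen to mirror the ``Desktop Publishing Systems'' case; they are not taken from the actual dictionary or database.

```python
def check_legality(chars, dictionary, min_len=2, max_len=4):
    """Split multi-linear phrases of min_len..max_len words into legal and illegal ones."""
    legal, illegal = [], []
    for n in range(min_len, max_len + 1):
        for i in range(len(chars) - n + 1):
            phrase = "".join(chars[i:i + n])
            # A phrase is legal only if the built-in dictionary lists it.
            (legal if phrase in dictionary else illegal).append(phrase)
    return legal, illegal

# Hypothetical dictionary entries: desktop, publishing (typesetting), system
dictionary = {"桌上", "排版", "系統"}
chars = list("桌上排版系統")  # each Chinese word (character) is one unit

legal, illegal = check_legality(chars, dictionary)
print("legal:  ", legal)
print("illegal:", illegal)
```

Single characters are never generated here (`min_len=2`), matching the decision to exclude single Chinese words as descriptors. Shadow phrases that straddle two valid terms, such as the second and third characters taken together, fall into the illegal list.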


 
Table 2: Statistics of Legality Checking

Term                               2-word terms   3-word terms   4-word terms   Total
Dictionary                         1633           810            895            3338
Legal Terms in Sample Database     4892           1602           1375           7869
Illegal Terms in Sample Database   7688           8345           6106           22139

 


Filtering

After removing illegal Chinese terms, legal but non-semantic-bearing (general) Chinese terms were then filtered by a stopterm list, a process similar to stopwording in English. The dictionary used in legality checking was mainly based on [C1] and included many general entries that were not useful for indexing purposes. Table 3 summarizes the results of stopterm filtering. A total of 1581 stop terms were included and they helped filter 3791 terms in our sample database. (We could also have created a smaller, but more precise dictionary for both legality checking and filtering purposes.) Figure 5 displays some sample stop terms.


 
Table 3: Statistics of Term Filtering

Term                         2-word terms   3-word terms   4-word terms   Total
Stopterm List                793            514            274            1581
Non Stop Terms in Database   2493           544            1041           4078
Stop Terms identified        2399           1058           334            3791

 


By applying the legality checking function and a stopterm file of 1581 distinct terms, the collection of terms created from automatic indexing was reduced from 45450 to 5126 (4078 Chinese terms and 1048 English terms). Almost 89% of the terms derived from the multi-linear term-phrasing phase were noise and were eliminated. These 5126 terms included only 2126 distinct terms, each of which represented a unique concept descriptor for the bibliographic database. Figure 7 shows sample results of the complete automatic indexing procedure.
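The counts reported in Tables 1 to 3 can be cross-checked with a few lines of Python. The dictionary names are ours; the numbers are copied from the tables and the text.

```python
phrased = {1: 15442, 2: 12580, 3: 9947, 4: 7481}  # Table 1, by phrase length
legal   = {2: 4892, 3: 1602, 4: 1375}             # Table 2, legal terms
illegal = {2: 7688, 3: 8345, 4: 6106}             # Table 2, illegal terms
kept    = {2: 2493, 3: 544, 4: 1041}              # Table 3, non stop terms
stopped = {2: 2399, 3: 1058, 4: 334}              # Table 3, stop terms identified
english_terms = 1048                              # from the text

for n in (2, 3, 4):
    # Every n-word phrase is either legal or illegal.
    assert legal[n] + illegal[n] == phrased[n]
    # Every legal phrase is either kept or filtered as a stop term.
    assert kept[n] + stopped[n] == legal[n]

total_phrased = sum(phrased.values())
final_terms = sum(kept.values()) + english_terms
reduction = 1 - final_terms / total_phrased

print(total_phrased, final_terms, f"{reduction:.1%}")
```

The three tables are mutually consistent, and the overall reduction from 45450 to 5126 terms works out to about 88.7%, which matches the "almost 89%" figure.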


  
Figure 7: Sample Index File

Automatic Concept Classification

Earlier research on automatic classification has considered a wide range of application areas, including biology, anthropology, archaeology, and linguistics. In addition to existing statistical techniques [28] [4], many techniques consider class definitions, class-finding, and quasi-classification structures based on various theoretical perspectives [30].

Due to the uncertainty involved in information processing, object-class relationships may change over time. Researchers in information retrieval have been concerned with grouping or clustering documents or descriptors (terms) [16] [19]. Document clustering groups documents of a similar nature, while term clustering can be used for automatic thesaurus generation or information abstraction.

In the next two subsections, we focus our discussion on a neural network approach to term classification. The first subsection describes the relevance weighting mechanism in measuring the degree of relevance of pairs of concept descriptors in the concept space. The second subsection describes a neural network model which clustered relevant concept descriptors into concept groups.

Relevance Weighting

In order to identify the ``relevance'' relationship between any two of the 2126 distinct concepts identified earlier, each concept descriptor needed to be evaluated further to determine its ``importance'' in a document and across the whole database. The technique adopted was based on Salton's Vector Space Model [43], which considers term frequency ($tf_{ij}$) and document frequency ($df_j$). We recently used the same technique in [9].

1. First, a threshold on the document frequency $df_j$ (i.e., the number of documents in the database that contain term $j$) is required to eliminate concept descriptors with low frequency. In our system, this threshold can be decided by the system user. By applying the threshold, a portion of the underlying information may be lost (referred to as information loss). That is, infrequent indexes that appear only in selected source bibliographic entries may no longer be used in the next concept space generation and clustering phase. Despite the potential danger of missing some infrequent but important concept descriptors, the threshold helped remove many noisy terms and significantly improved classification efficiency, an observation confirmed in our earlier study [9]. It should be noted that the document frequency of each term was weighted according to the word count of the specified term. That is, if a term consisted of 3 words, then the original document frequency was multiplied by 3 to get the new document frequency. This weighting scheme helped bring out the often more specific multiple-word terms.

2. A combined term weight $d_{ij}$ was then computed as follows:



where $d_{ij}$ represents the relevance weight of the $j$th concept descriptor in document $i$. This procedure transformed the source document into a representation of a finite set of concept descriptors with relevance weights (between 0 and 1) [36] [31] [22]. This computation is a variant of the SMART indexing function developed by Salton [43] [45]. Because the sample Chinese-English bibliographic database was small (it held fewer records than most publicly accessible, large-scale bibliographic databases), we used document frequency (more frequent terms are weighted more heavily) instead of the inverse document frequency (more specific terms are weighted more heavily) used in large database applications for computing term weights. By using document frequency, we were able to weight concept descriptors that appear in more documents more heavily than descriptors that appear sparsely in the whole database. This simple modification helped the system identify important concepts/terms for representing the documents in the database. (In our experience, inverse document frequency was found useful for generating specific indexes for large-scale databases [12], but for smaller applications, we found the document frequency weight to be better for capturing important, consensual concepts [9].)

3. A co-importance weight $d_{ijk}$ was then computed based on the following formula:



where $d_{ijk}$ represents the relevance weight of descriptors $j$ and $k$ in document $i$, $tf_{ijk} = \min(tf_{ij}, tf_{ik})$, and $df_{jk} = \min(df_j, df_k)$.

4. An asymmetric co-occurrence function developed by the authors [10], shown below, computed the relevance of any two concept descriptors $j$ and $k$ in the concept space (relational weight).


\begin{displaymath}W_{jk} =
\frac{\sum_{i=1}^{n} d_{ijk}}
{\sum_{i=1}^{n} d_{ij}} \end{displaymath}




\begin{displaymath}W_{kj} =
\frac{\sum_{i=1}^{n} d_{ijk}}
{\sum_{i=1}^{n} d_{ik}} \end{displaymath}



where $n$ represents the total number of documents in the database. An empirically determined weight threshold of 0.1 was then adopted to retain strongly-associated pairs of descriptors in the concept space. Figure 8 shows a portion of such a concept space, where concept pairs are displayed in decreasing order of weight. More relevant concepts showed higher relevance probabilities.
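Two of the steps above, the word-count weighting of document frequency (step 1) and the asymmetric co-occurrence function (step 4), can be sketched in Python as follows. The function names, data layout, and toy values are assumptions made for illustration. The per-document weights $d_{ij}$ and $d_{ijk}$ are taken as precomputed inputs, and applying the thresholds with `>=` to the weighted values is our assumption.

```python
from collections import Counter

def weighted_document_frequency(doc_terms):
    """Document frequency of each term, multiplied by the term's word count."""
    df = Counter()
    for terms in doc_terms:
        for term in set(terms):          # count each term once per document
            df[term] += 1
    return {term: count * len(term) for term, count in df.items()}

def apply_df_threshold(weighted_df, threshold):
    """Keep only descriptors whose weighted document frequency reaches the threshold."""
    return {t: w for t, w in weighted_df.items() if w >= threshold}

def cooccurrence_weights(d, d_pair, threshold=0.1):
    """W[(j, k)] = sum_i d_ijk / sum_i d_ij, kept only if it reaches the threshold.

    d[i][j] holds d_ij; d_pair[i][(j, k)] holds d_ijk, one entry per unordered pair.
    """
    sum_single, sum_pair = Counter(), Counter()
    for weights in d:
        for j, w in weights.items():
            sum_single[j] += w
    for pairs in d_pair:
        for pair, w in pairs.items():
            sum_pair[pair] += w
    W = {}
    for (j, k), s in sum_pair.items():
        for a, b in ((j, k), (k, j)):    # the two directions differ in the denominator
            if sum_single[a] > 0 and s / sum_single[a] >= threshold:
                W[(a, b)] = s / sum_single[a]
    return W

# Hypothetical toy data: terms are tuples of words.
docs = [[("neural", "network"), ("database",)],
        [("neural", "network")],
        [("database",)],
        [("expert",)]]
print(apply_df_threshold(weighted_document_frequency(docs), threshold=2))

d = [{"A": 0.5, "B": 0.5}, {"A": 0.5}, {"A": 0.5, "B": 0.25}]
d_pair = [{("A", "B"): 0.5}, {}, {("A", "B"): 0.25}]
print(cooccurrence_weights(d, d_pair))
```

In the toy data, descriptor A occurs in more documents than B, so the weight from A to B (0.5) is lower than the weight from B to A (1.0). This asymmetry is what distinguishes the function from a symmetric similarity coefficient.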


  
Figure 8: Partial Result of Co-relevance Weighting

Parallel Classification

In order to cluster relevant concepts into concept groups, we adopted a new and interesting connectionist (neural network) approach.

Clustering algorithms [16] [14] [29] have been used extensively in IR over the past two decades [41] [43]. Their ability to categorize or classify by assigning items to automatically created groups gives them a natural affinity with the goals of online information processing and management. Clustering methods are usually categorized according to the type of cluster structure they produce. The simple nonhierarchical methods divide the data set into clusters where no overlap is allowed [45]. The more popular hierarchical methods produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected. The most commonly used hierarchical clustering methods are based on either the single-link method, which joins, at each step, the most similar pair of objects that are not yet in the same cluster, or the complete-link method, which uses the least similar pair between each of two clusters to determine the inter-cluster similarity [41]. Despite the usefulness of the hierarchy (also called a dendrogram [41]) produced by such methods, these algorithms do not allow incorporation of a priori decisions about the number of desired clusters, cluster size, and criteria for cluster membership, and the dendrogram does not provide a meaningful abstract (summary) representation of the data set.
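The two linkage criteria can be contrasted with a short Python sketch. The similarity values and term names are hypothetical; the functions only compute the inter-cluster similarity that each method would use when deciding which clusters to merge.

```python
def pair_similarity(sim, x, y):
    """Look up the similarity of an unordered pair of items."""
    return sim[frozenset((x, y))]

def single_link(sim, cluster_a, cluster_b):
    """Inter-cluster similarity = similarity of the MOST similar cross pair."""
    return max(pair_similarity(sim, x, y) for x in cluster_a for y in cluster_b)

def complete_link(sim, cluster_a, cluster_b):
    """Inter-cluster similarity = similarity of the LEAST similar cross pair."""
    return min(pair_similarity(sim, x, y) for x in cluster_a for y in cluster_b)

# Hypothetical pairwise similarities between four terms.
sim = {
    frozenset(("t1", "t3")): 0.9,
    frozenset(("t1", "t4")): 0.2,
    frozenset(("t2", "t3")): 0.4,
    frozenset(("t2", "t4")): 0.1,
}
a, b = {"t1", "t2"}, {"t3", "t4"}
print(single_link(sim, a, b), complete_link(sim, a, b))
```

Here single-link rates the two clusters as close (0.9) because of one strong pair, while complete-link rates them as distant (0.1). In both cases the merge order is driven purely by the data, with no place to state a desired number or size of clusters in advance.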

More recently, neural network clustering has attracted significant interest from researchers [32] [48]. Neural network clustering offers the ability to determine the size, shape, number, and placement of pattern clusters adaptively while intrinsically operating in parallel [47] [48]. Kohonen's self-organizing feature map (SOM) [32] is one of several neural network clustering algorithms that have demonstrated significant utility in various engineering, scientific, and business applications [33]. In this research we adopted the Hopfield network [27] for information classification. The summation function used in the algorithm allowed our system to cluster strongly-related terms (concepts) together to form concept groups, and a sigmoid transformation function allowed overlapping clusters to be generated during the parallel clustering process. By controlling various thresholds, we were able to generate a small number of meaningful concept groups. We had implemented this algorithm successfully in previous research, which involved clustering ideas generated by meeting participants in an electronic meeting environment [9]. Details about the algorithm are presented below.

A neural network model simulates the self-organizing and adaptive properties of a neurological subsystem in the human brain. The neuron nodes and weighted links in a neural network model are active processing agents. The Hopfield neural network [27] [33], which resembles an associative network and exhibits a parallel relaxation property, can transform a partial, noisy distributed pattern into a stable state representation. This important property has been used in various pattern recognition and image restoration applications [33] [21]. In the proposed multilingual classification model, a variant of the Hopfield neural network was developed to cluster highly relevant concept descriptors. The strength of the connection between two neuron nodes represented the degree of relevance between the two corresponding concept descriptors.

The classifying behavior of this network was considered a variant of the competitive learning paradigm of neural networks [13] [25]. Multiple neurons, instead of a single winner neuron, were activated by a specific input neuron. These relevant output neurons, which were all concept descriptors in the concept space, were categorized as a concept group [9]. As shown in Figure 9, a concept space can be perceived as a single-layer network of relevant (weighted) concepts (nodes). By activating each concept in turn and converging to its strongly-associated neighbors, we could generate overlapping regions (clusters) of concept groups (represented as ovals in Figure 9). The Hopfield network clustering procedure adopted was as follows:


  
Figure 9: A Conceptual Diagram for Hopfield Net Clustering

1. Assigning Connection Weights

Training of the Hopfield net was performed by the relevance weighting computation phase described earlier. Each concept descriptor was represented by a neuron node, and each relevance weight was treated as a synaptic weight.

2. Initialization with Unknown Input Pattern

$\mu_{i}(0) = x_{i}$, $0 \leq i \leq n-1$

$\mu_{i}(t)$ is the output of neuron $i$ at time $t$, and $x_{i}$, which has a value between 0 and 1, is the initial input value for neuron $i$. Initially all neurons were assigned 0 except for the specific node (concept) to be classified, which was assigned 1. After each iteration, the computed output was used as the input for the next iteration. This initialization and activation process was repeated n times (once for each of the n nodes), each time starting with a different neuron. The activation equation was iterated until the network converged or until it reached 100 iterations.

3. Activation and Iteration


\begin{displaymath}\mu_{j}(t+1) = f_{s}\left[\sum_{i=0}^{n-1} W_{ij}\mu_{i}(t)\right], \: 0 \leq j \leq n-1 \end{displaymath}

where $W_{ij}$ is the relational weight defined earlier and $f_{s}$ is the continuous sigmoid transformation function [33] as shown below.


\begin{displaymath}f_{s}(net_{j}) = \frac{1}{1+\exp[\frac{-(net_{j}-\theta_{j})}{\theta_{0}}]}\end{displaymath}

where $net_{j} = \sum_{i=0}^{n-1} W_{ij}\mu_{i}(t)$, $\theta_{j}$ served as a threshold, and $\theta_{0}$ was used to alter the shape of the sigmoid function.

4. Convergence

The above process was repeated until the output of the network no longer changed appreciably between two successive iterations, which was determined by checking:


\begin{displaymath}\sum_{j=0}^{n-1}[\mu_{j}(t+1)-\mu_{j}(t)]^{2} \leq \epsilon \end{displaymath}

where $\epsilon$ was the maximal allowable difference between two iterations. Once the network converged, the final output represented the set of terms relevant to the starting term. In our system, the following values were used: $\theta_{j} = 0.1$, $\theta_{0} = 0.01$, and $\epsilon = 1$, all determined empirically.
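The four steps above can be sketched as follows. This is a minimal illustration rather than the original implementation. It uses the parameter values reported above ($\theta_{j} = 0.1$, $\theta_{0} = 0.01$, $\epsilon = 1$, at most 100 iterations), under which the sigmoid is close to a step function centred at 0.1. Three things are assumptions of the sketch: the small weight matrix is hypothetical, the unit self-weight on the diagonal is introduced so that an activated node stays active, and the 0.5 cutoff used to read a concept group off the final output is our own choice.

```python
import math


def sigmoid(net, theta_j=0.1, theta_0=0.01):
    """Continuous sigmoid with threshold theta_j and shape parameter theta_0."""
    z = (net - theta_j) / theta_0
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def activate(weights, start, theta_j=0.1, theta_0=0.01, epsilon=1.0, max_iter=100):
    """Run the activation procedure from one starting node and return the outputs."""
    n = len(weights)
    mu = [0.0] * n
    mu[start] = 1.0  # step 2: only the concept being classified is switched on
    for _ in range(max_iter):
        # Step 3: mu_j(t+1) = f_s[sum_i W_ij * mu_i(t)], updated synchronously.
        nxt = [
            sigmoid(sum(weights[i][j] * mu[i] for i in range(n)), theta_j, theta_0)
            for j in range(n)
        ]
        # Step 4: squared change between two successive iterations.
        change = sum((a - b) ** 2 for a, b in zip(nxt, mu))
        mu = nxt
        if change <= epsilon:
            break
    return mu


def concept_group(weights, start, cutoff=0.5):
    """Indices of nodes left active; the 0.5 cutoff is an assumption of this sketch."""
    return [j for j, out in enumerate(activate(weights, start)) if out > cutoff]


# Hypothetical asymmetric weights W[i][j] (from node i to node j).
# Assumption: W[i][i] = 1.0 so that an active node keeps itself active.
W = [
    [1.0, 0.5, 0.0, 0.0],
    [0.4, 1.0, 0.3, 0.0],
    [0.0, 0.2, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

groups = [concept_group(W, start) for start in range(len(W))]
print(groups)
```

Starting from node 0 the activation spreads to nodes 1 and 2, whereas starting from node 2 it reaches only node 1, so the resulting groups overlap. Node 3, which has no links above the threshold, forms a group by itself.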

The Hopfield net classification process relied strongly upon the associative property of the neural net and the parallel relaxation method for concept activation. A system testing session was performed and is described in detail in the next section.

System Implementation and Evaluation


 
Table 4: Classification and Information Loss Analysis

Doc-Freq     % of Documents   # of Terms   # of Categories
Threshold    Indexed          Left         Classified
---------    --------------   ----------   ---------------
30           61.56%           47           24
29           61.75%           48           21
28           65.70%           51           24
27           66.18%           54           27
26           66.96%           56           28
25           67.82%           59           31
24           67.82%           59           28
23           68.88%           68           34
22           68.88%           68           36
21           70.91%           73           39
20           71.39%           72           44
19           73.22%           90           51
18           74.08%           91           51
17           75.72%           102          56
16           75.72%           102          55
15           78.32%           120          53
14           78.52%           123          -
13           79.00%           131          -
12           79.58%           133          -
11           82.27%           180          -
10           82.95%           181          -
9            83.82%           195          -
8            84.39%           214          -
7            87.28%           320          -
6            87.48%           321          -
5            90.27%           409          -
4            90.75%           414          -
3            95.47%           1044         -
2            97.21%           1284         -
1            98.94%           1872         -
0            100.00%          2126         -
 

Conclusions and Future Extensions

Database management systems (DBMS) have provided a feasible solution to indexing and classifying structured, numeric information. However, for unstructured, textual information, development of an automatic and ``intelligent'' information system is important. This research proposed a systematic blueprint of a multilingual classification model to help automatically index and classify unstructured Chinese-English information.

Special features of this information classification model include a Chinese multi-linear term-phrasing technique, an integrated Chinese-English automatic indexing framework, and a neural network based concept classification component. For automatic multilingual indexing, the multi-linear term-phrasing method was adopted to extract term-phrases for a concept space of Chinese and English descriptors. In automatic concept classification, a variant of the Hopfield neural network and its parallel relaxation algorithm was developed to categorize concept descriptors. We believe this multilingual methodology can also contribute to research related to other multilingual databases involving languages such as Spanish-English or Japanese-English. The specific directions for our future research include:

Acknowledgments

This project was supported in part by a grant awarded by the International Program Development Fund, University of Arizona, 1992-1993 and a Research Initiation Award grant awarded by the Division of Information, Robotics, and Intelligent Systems, National Science Foundation (IRI-9211418), 1992-1994. We wish to thank the anonymous reviewers for their comments and suggestions.

Bibliography

1
A. L. Baker, J. M. Bieman, and W.F. Tooley.
Interlingua: A multi-language business information center.
Proc. of the 21st Annual Hawaii Intl. Conference on System Sciences, 4:79-86, 1988.

2
D. C. Blair.
Indeterminacy in the subject access to documents.
Information Processing and Management, 22(2):229-241, 1986.

3
E. Carmel, S. Crawford, and H. Chen.
Browsing in hypertext: A cognitive study.
IEEE Transactions on Systems, Man and Cybernetics, 22(5):865-884, September/October 1992.

4
P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman.
Autoclass: A Bayesian classification system.
In Proc. of Fifth Intl. Conference on Machine Learning, San Mateo, CA, 1988. Morgan Kaufmann.

5
H. Chen.
Collaborative systems: solving the vocabulary problem.
IEEE COMPUTER, 27(5):58-66, Special Issue on Computer-Supported Cooperative Work (CSCW), May 1994.

6
H. Chen and V. Dhar.
Reducing indeterminism in consultation: a cognitive model of user/librarian interaction.
In Proceedings of the 6th National Conference on Artificial Intelligence (AAAI-87), pages 285-289, Seattle, WA, July 13-17, 1987.

7
H. Chen and V. Dhar.
User misconceptions of online information retrieval systems.
International Journal of Man-Machine Studies, 32(6):673-692, June 1990.

8
H. Chen and V. Dhar.
Cognitive process as a basis for intelligent retrieval systems design.
Information Processing and Management, 27(5):405-432, 1991.

9
H. Chen, P. Hsu, R. Orwig, L. Hoopes, and J. F. Nunamaker.
Automatic concept classification of text from electronic meetings.
Communications of the ACM, 37(10):56-73, October 1994.

10
H. Chen and K. J. Lynch.
Automatic construction of networks of concepts characterizing document databases.
IEEE Transactions on Systems, Man and Cybernetics, 22(5):885-902, September/October 1992.

11
H. Chen, K. J. Lynch, K. Basu, and D. T. Ng.
Generating, integrating, and activating thesauri for concept-based document retrieval.
IEEE EXPERT, Special Series on Artificial Intelligence in Text-based Information Systems, 8(2):25-34, April 1993.

12
H. Chen, B. R. Schatz, T. Yim, and D. Fye.
Automatic thesaurus generation for an electronic community system.
Journal of the American Society for Information Science, 46(3):175-193, April 1995.

13
T. E. Doszkocs, J. Reggia, and X. Lin.
Connectionist models and information retrieval.
Annual Review of Information Science and Technology (ARIST), 25:209-260, 1990.

14
R. Dubes and A. K. Jain.
Clustering methodologies in exploratory data analysis.
Advances in Computers, 19:113-228, Yovits, M. C., Editor 1980.

15
O. Ekeberg.
Robust dictionary lookup using associative networks.
Int. Journal of Man-Machine Studies, 28:29-43, 1988.

16
B. Everitt.
Cluster Analysis.
Second Edition, Heinemann Educational Books, London, England, 1980.

17
C. K. Fan and W. H. Tsai.
Automatic word identification in Chinese sentences by the relaxation technique.
Computer Processing of Chinese and Oriental Languages, 4(1):33-56, 1988.

18
Y. Feng and K. Z. Wang.
Chinese question-answer experimental system based on the sense coherence among the SNEs.
Second Intl. Conference on Computers and Applications, pages 927-933, 1987.

19
D. Fisher.
Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.

20
C. L. Foss.
Tools for reading and browsing hypertext.
Information Processing and Management, 25(4):407-418, 1989.

21
L. Fu.
Neural Networks in Computer Intelligence.
McGraw-Hill, New York, NY, 1994.

22
N. Fuhr and C. Buckley.
A probabilistic learning approach for document indexing.
ACM Transactions on Information Systems, 9(3):223-248, July 1991.

23
G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais.
The vocabulary problem in human-system communication.
Communications of the ACM, 30(11):964-971, November 1987.

24
L. S. Gay and W. B. Croft.
Interpreting nominal compounds for information retrieval.
Information Processing & Management, 26(1):21-38, 1990.

25
S. Grossberg.
Competitive learning: From interactive activation to adaptive resonance.
Cognitive Science, 11:23-63, 1987.

26
D. Gusfield, G. M. Landau, and B. Schieber.
Efficient algorithm for the all pairs suffix prefix problem.
Information Processing Letters, 41(4):181-185, 1992.

27
J. J. Hopfield.
Neural networks and physical systems with emergent collective computational abilities.
Proceedings of the National Academy of Sciences, USA, 79(4):2554-2558, 1982.

28
L. Hunter and D. J. States.
Bayesian classification of protein structure.
IEEE Expert, pages 67-74, August 1992.

29
A. K. Jain and R. C. Dubes.
Algorithms for Clustering Data.
Prentice-Hall, Inc., Englewood Cliffs, NJ, 1988.

30
K. S. Jones.
Some thoughts on classification for retrieval.
The Journal of Documentation, 26(2):89-101, June 1970.

31
K. S. Jones.
Experiments in relevance weighting of search terms.
Information Processing & Management, 15:133-144, 1979.

32
T. Kohonen.
Self-Organization and Associative Memory.
Third Edition, Springer-Verlag, Berlin Heidelberg, 1989.

33
R. P. Lippmann.
An introduction to computing with neural nets.
IEEE Acoustics Speech and Signal Processing Magazine, 4(2):4-22, April 1987.

34
J. B. Lovins.
Development of a stemming algorithm.
Mechanical Translation and Computational Linguistics, 11:22-31, 1968.

35
J. H. Lu.
Approach to converting phrases in pinyin to phrases in Chinese characters.
Proc. of the 1988 IEEE Intl. Conference on Systems, Man, and Cybernetics, 1:389-391, 1988.

36
Z. Mazur.
Properties of a model of information retrieval system based on thesaurus with weights.
Information Processing & Management, 15:145-154, 1979.

37
R. S. Michalski and R. E. Stepp.
Learning from observation: conceptual clustering.
In Machine Learning, An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Editors, Tioga Publishing Company, Palo Alto, CA, pages 331-363, 1983.

38
J. Nielsen.
Hypertext and Hypermedia.
Academic Press, New York, NY, 1990.

39
K. Obraczka, P. B. Danzig, and S. Li.
Internet resource discovery services.
IEEE COMPUTER, 26(9):8-24, September 1993.

40
M. E. Porter.
Competitive Strategy: Techniques for Analyzing Industries and Competitors.
Free Press, New York, NY, 1980.

41
E. Rasmussen.
Clustering algorithms.
In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Editors, Prentice Hall, Englewood Cliffs, NJ, 1992.

42
G. Salton.
Generation and search of clustered files.
ACM Transactions on Database Systems, 3(4):321-346, December 1978.

43
G. Salton.
Automatic Text Processing.
Addison-Wesley Publishing Company, Inc., Reading, MA, 1989.

44
G. Salton, J. Allan, and C. Buckley.
Automatic structuring and retrieval of large text files.
Communications of the ACM, 37(2):97-108, February 1994.

45
G. Salton and M. E. Lesk.
Information analysis and dictionary construction.
In The Smart Retrieval System - Experiments in Automatic Document Processing, G. Salton, Editor, Prentice-Hall Inc., Englewood Cliffs, NJ, pages 115-142, 1971.

46
C. B. Schwind.
Semantic trees for natural language representation.
Information Processing & Management, 19(4):223-235, 1983.

47
P. K. Simpson.
Artificial Neural Systems: Foundations, Paradigms, Applications, and Implementations.
McGraw-Hill Book Company, New York, NY, 1990.

48
P. K. Simpson.
Fuzzy min-max neural networks - part 2: clustering.
IEEE Transactions on Fuzzy Systems, 1(1):32-45, February 1993.

49
K. Sparck Jones.
Automatic Keyword Classification for Information Retrieval.
Butterworths, London, 1971.

50
R. Sproat and C. L. Shih.
A statistical method for finding word boundaries in Chinese text.
In Proc. of 1989 International Conference on Computer Processing of Chinese and Oriental Languages, Changsha, China, 1990.

51
R. E. Stepp.
Concepts in conceptual clustering.
In Proceedings of the 10th International Joint Conference on Artificial Intelligence (IJCAI-87), pages 211-213, Milan, Italy, August 23-28 1987.

52
S. S. Tseng, C. C. Yang, and C. C. Hsieh.
An experimental model of Chinese textual database.
Journal of the Chinese Institute of Engineers, 13(6):607-622, 1990.

53
X. D. Wang and B. Q. Dai.
Chinese speech understanding system.
9th Intl. Conference on Pattern Recognition, pages 14-17, 1988.

54
C. W. Yang, Z. Lai, and Y. Zhang.
Application of Chinese natural language understanding to robot control.
Proc. of the 1988 IEEE Intl. Conference on Systems, Man, and Cybernetics, 1:385-388, 1988.

