H. Chen1
University of Arizona
-
W. K. McHenry2
Georgetown University
-
K. J. Lynch, and S. E. Goodman
University of Arizona
Title: ``A Textual Database/Knowledge-Base Coupling Approach to Creating Computer-Supported Organizational Memory''
Functional Areas: MIS/DSS (Database/Knowledge Base).
Methodological Areas: Computer Science (Database/Knowledge Base).
Behavioral researchers who perceive organizations to be information processing systems have agreed upon functional resemblances between organizations and human beings. Both receive stimuli, store and process information, and perform actions in response. An organization made up of individuals, structure, culture, procedures, and ecology acquires external stimuli and information and responds by making decisions within its organizational boundaries in much the same way as a human being [75]. The behaviors and performance of an organization depend upon the structure of its retention facility, the information contained there, and the processes used for information acquisition, management, and retrieval.
Early theorists defined organizational memory in terms of standard operating procedure [55], past events, promises, assumptions, and behaviors [54], cause maps, architecture, strategic orientation [42], and memory retention facility [75]. The use of information technologies to acquire organizational memory and assist in intelligence analysis and decision making has been examined only recently [46] [1] [75]. Huber [46], in particular, postulates that advanced information technologies3 such as computer-assisted information storage and retrieval systems and expert systems can lead to organizational intelligence4 that is more accessible, timely, comprehensive, and accurate. This proposition is particularly significant in relation to increasing global competition among business entities, the influx of huge amounts of online information, and the advancement and proliferation of various advanced information technologies.
In this article, based on a literature review and our own empirical studies, we first propose a framework for understanding the computer-supported information management and retrieval process in organizational memory as well as problems that may be encountered. The components depicted in this framework are used to support a textual database/knowledge-base coupling approach to creating effective computer-supported organizational memory that is illustrated by describing the implementation experience of an international computing research group.
The structure of this paper is as follows. In Section 2, we first present an overview of research in computer-supported organizational memory. We then propose an information management and retrieval framework in Section 3. Based on this framework, we present a textual database/knowledge-base coupling approach to creating computer-supported organizational memory in Section 4. We also provide a blueprint of procedures and techniques for system designers to use. In Section 5, we describe a research group that adopted this approach, including specific implementation details. Other important issues to be considered in creating computer-supported organizational memory, such as automatic information acquisition, automatic indexing, incremental knowledge extraction, and human-computer interactions, are discussed in Section 6. We conclude the article in Section 7.
Organizational memory has long been recognized by researchers and practitioners as an important factor in the success of an organization's operations and its responsiveness to the changes and challenges of its environment. In a recent article, Walsh and Ungson have proposed a framework for examining the structure and components of organizational memory and identified areas for future research. Within this framework, organizational memory is defined as: ``a construct that is composed of the structure of its information retention facility, the information contained in it, the process of information acquisition and retrieval, and its consequential effects'' [75]. Having access to information an organization has acquired helps reduce the uncertainty and indeterminism involved in that organization's decision making. Such information is retained by an organization's individuals, culture, procedures, structure, and ecology. Individuals, in particular, are involved in the complete process of acquiring, retaining, and processing information and are considered the ``sensor'' and ``cognitive'' units of an organization (using the information processing systems metaphor). The information they acquire and the knowledge they have make up a significant portion of an organizational memory.
Even though information management and retrieval activities are recognized by Walsh and Ungson as being important, these authors do not consider the special role information technologies play in creating a robust and accessible organizational memory but consider information technologies to be only part of the information retention facility called individuals.
Huber, on the other hand, has for some time been suggesting the unique contribution of advanced information technologies to organizational design, organizational memory, and decision making [44] [45] [46]. In [46], he proposes a theory about the effects that computer-assisted communication, information storage and retrieval, and decision-aiding technologies have on organizational activities. The following propositions in his theory are particularly relevant to organizational memory:
Huber specifically stresses the role of advanced information technologies, such as information storage and retrieval systems and knowledge-based systems, in creating more timely, comprehensive, and accurate organizational intelligence and in capturing organization members' expertise. Huber urges researchers to examine his theory and propositions empirically, paying particular attention to the effects of diverse information technologies on the different aspects of organizational memory and decision making.
While organizational researchers have been postulating theories and suggesting frameworks for organizational memory development, many Computer Science and Information Systems researchers have continuously advanced the state of the art in information technologies. Techniques such as mathematical modeling and analysis, object-oriented databases, machine learning, knowledge-based information retrieval, rule-based programming, and other automatic decision aids have been incorporated into various systems to assist in organizational intelligence development and decision making. Examples of such implementation are found in various decision support systems5 and question-answering systems6.
A special type of information system that helps create computer-supported organizational memory, and that typically exists in military organizations, business units, and research institutions, is the intelligence system. This type of system assists environmental scanning, information collection and management, and intelligence retrieval and analysis for an organization in a competitive and dynamic environment [67] [50]. The system that we discuss extensively in this paper is another example of such a system.
Kochen [50] describes intelligence systems as follows:
Contemporary intelligence systems rely more on overt, public sources than on covert espionage missions. They must screen, evaluate, correlate, interpret, analyze, and synthesize vast amounts of information. These activities require judgement, hypotheses-formulation, reasoning, and a great deal of knowledge, understanding, and intelligence. They are performed by persons with the help of computers.
An intelligence system needs to interact and scan its environment to solicit useful data. The system screens incoming data using some known indicators or noteworthy patterns that the organization has acquired in the past. Relevant and interesting data are then archived and managed in the organization's database. Problem solvers within the organization who possess their own experience, knowledge, and skills often generate hypotheses and questions in response to new environmental stimuli. They formulate search strategies, employ their own knowledge, and identify and retrieve pieces of relevant information from the archived database in order to verify or reject hypotheses for solutions. This process is iterative in nature and often requires a significant amount of human judgment, intuition, and expertise. The answers that the problem solvers present and the outcomes of actions taken are then recorded and archived in the organizational database to become part of the organizational memory. Corporate intelligence systems are a special type of intelligence system, one that has been used extensively when corporations face dynamic environments and intense global competition [37] [36] [67].
In addition to developing intelligence systems, the computer supported cooperative work (CSCW) community has long been examining issues related to human collaborative activities in an organizational setting. Johansen [48] discussed the organizational and system development problems involved in synchronous and asynchronous collaboration. He commented that asynchronous communication and collaboration capabilities, in particular, will be one of the most compelling features of many groupware products. We echo his view and further postulate that the evolution and change of organizational members over time could cause asynchronous collaboration and organizational memory building to become extremely complex and difficult. The temporal features of group work and how those temporal matters are affected when technological tools are added are discussed extensively in [57], although McGrath had a primarily organizational perspective rather than a system development view.
The role of ``human augmentation systems'' in support of collaborative research has been recognized by Engelbart [29]. In [30], he discussed the issue of the ``interoperability between knowledge-work domains.'' He predicted that someday all of our basic knowledge-work domains will be integrated within one coherent ``organizational knowledge workshop.'' Close cooperation between compound knowledge domains puts new demands on knowledge work interchange. Engelbart proposed an open hyperdocument system for linking different knowledge domains.
Previous CSCW field research has also demonstrated the importance of collaborative information sharing and retrieval and has pinpointed problems associated with current system-supported information processing functions in various scientific, engineering, and business domains [48] [17] [20] [69]. A major difficulty in accessing pertinent information results from current systems' lack of support for concept-based information retrieval. Searchers who express their concepts using their own vocabularies are unlikely to find relevant information because of the vocabulary differences among collaborators and between the system and the searchers.
Despite variations in implementation, online information retention, information management, and information retrieval activities are all essential for creating effective and useful organizational memory. As Walsh and Ungson [75] and Huber [46] have suggested, the computer-assisted information management and retrieval process within organizations poses interesting and important research questions and requires detailed investigation. In the next section, we discuss the nature and difficulties of computer-assisted information management and retrieval in the organizational memory context before presenting our proposed solution. We make explicit references to information indexing and retrieval problems that have previously been reported in information sciences.
Information management and retrieval has been researched extensively by information scientists whose research has revealed that computer-supported information management and retrieval can be difficult for the following reasons [51] [4]:
Differences in information searchers' and information managers' knowledge and training often cause disparity in their indexing and searching approaches and create a terms matching problem between search terms and index terms. Even with the aid of information technologies, information retrieval may still be difficult.
Information sciences research provides a sound basis for understanding the fundamental process of information management and retrieval, but it does not address many issues unique to organizational memory. First, the information stored in an organizational memory is pertinent to that organization's environmental stimuli, its decision responses, and its unique procedures. Information formats are likely to be unstructured, e.g., newsletters, business cards, product brochures, electronic mail messages, etc. This is in contrast to a typical information sciences setting, where information not only is more structured (mostly books or articles), but in most cases does not directly address the survival or success of an organization because information is created for public access and reference purposes.
Second, the individuals involved in information management and retrieval activities, and their backgrounds, are very different in these two settings. In the organizational memory context, the information collection, management, retrieval, and analysis activities are conducted by members of the organization. Their roles are clearly defined by the organization and they share responsibilities with other organizational members in the success or failure of the organization. In traditional information sciences settings such as libraries or bibliographic databases, information is typically managed by professional information specialists who have no stake in the success or failure of the organization with which information searchers are affiliated. Information management and retrieval in the organizational memory context presents some unique research questions not previously addressed by information scientists.
In this section, we present a framework for understanding information management and retrieval in terms of the human agents involved in creating and using organizational memory, the types of knowledge these agents possess, and the observed characteristics of their information management and retrieval behaviors. Prior research related to the components within this framework and findings from our own studies also are presented.
Four types of knowledge are involved in online information management and retrieval. First, the classification scheme knowledge that is used for indexing and storing documents is also required to search for them. Second, subject area knowledge is required for selecting appropriate index terms and for expressing a query in search terms. Third, system knowledge is necessary for operating effectively on specific information storage and retrieval systems. And finally, organizational knowledge is needed to comply with organizational goals, procedures, and norms. The four knowledge components are presented toward the left of Figure 1.
These knowledge components have been reported in prior research. Bates [3] and Blair [7], for example, identified subject area knowledge and classification knowledge as factors that affect subject-based search. In a more detailed discussion, Belkin postulated that searchers exhibit an ``anomalous state of knowledge'' [6] regarding the subject area of their inquiries. The author of a document, on the other hand, has a much more complete state of knowledge concerning the subject area of the document. In an empirical study of the organization memory-based information management process, Chen et al. [20] identified collaboration knowledge, a type of organizational knowledge, as crucial for managing organization-specific information. The term refers to information managers' knowledge of the organizational goals and knowledge of the expertise of organizational members.
The four types of knowledge outlined above are typically distributed among two parties to whom we refer as ``information agents.'' They are information managers, who acquire, classify, and manage documents based on some pre-determined classification scheme, organizational standards, and system requirements, and information searchers, who access online databases to retrieve information that is pertinent to their needs.
In the information sciences setting, information managers are often referred to as indexers, who are well trained in a classification scheme (e.g., the Library of Congress classification scheme) and have sufficient subject area knowledge. Searchers, on the other hand, are interested in only a small portion of the document collection. They often have good subject knowledge but little classification scheme knowledge. In libraries, reference librarians often act as intermediaries between searchers and the database.
In the organizational memory context, the information manager's role may be more diverse than that of an indexer in a library. In addition to selecting appropriate index terms for documents, the information manager may also be involved in document acquisition, retention, and entry.
Indexing uncertainty and search uncertainty are the primary sources of information management and retrieval problems. Indexing uncertainty arises because different information specialists can assign different index terms to a given document, based on their own indexing heuristics and/or their organization's classification rules. Search uncertainty arises because searchers have latitude both in choosing terms to express their queries and in the search strategies they employ in acquiring information. Because of the indeterminism involved in indexing and searching, an exact match between the searcher's terms and those of the indexer is unlikely (the terms matching problem). Both types of uncertainty were recognized in prior information sciences research.
The process of indexing is partly indeterminate. Evidence suggests that different information managers, well trained in an indexing scheme, might assign index terms for a given document differently. It has also been observed that an information manager might use different terms for the same document at different times [47] [74]. This discrepancy in index term assignment may be attributed to the information manager's training and experience in the subject area and classification scheme.
Information managers may follow certain indexing principles or may have developed their own rules through years of experience. For example, indexers of various bibliographic databases often adhere to the following principles [4] [14]:
For organization-specific databases, the indexing and information management rules may not be as stringent as those adopted in public-accessible databases. Organizations may have imposed a looser structure on their databases and may expect information managers to develop their own heuristics or rules for information management [20].
In addition to latitude in selecting search terms, the approach searchers adopt for performing a search also varies. Search strategy usually refers to a plan or approach for the whole search. It often includes a well-defined sequence of operations which are conducted in an attempt to achieve a specific goal. In cognitive psychology, it is also referred to as the control knowledge for problem solving [12] [2]. Bourne identified two search strategies. In the ``building-block'' strategy, the user enters various terms as separate search statements. After the search results are derived, he or she combines all search statements into a single final statement using the Boolean operator, AND. This strategy contrasts with the ``pearl-growing'' strategy, in which the user initially searches on a few specific terms to retrieve some citations. These citations are then examined for new candidate search terms to be added to subsequent searches [56]. Chen and Dhar [16] reported five document search strategies adopted by online searchers. Two strategies, which are based on trial-and-error and forward-backward screen browsing, respectively, were adopted by inexperienced searchers and were generally ineffective. Three strategies, which were based on extensive thesaurus consultation, utilization of efficient online options, and exploration of known citations, were adopted by experienced searchers and reference librarians.
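The building-block and pearl-growing strategies can be contrasted in a minimal sketch. The toy collection `DOCS` and the function names are hypothetical and chosen only for illustration. In particular, the sketch adds every index term of every retrieved document as a new candidate term, whereas a human searcher would select candidates by inspection.

```python
# Hypothetical toy collection: document id -> set of index terms.
DOCS = {
    "d1": {"expert systems", "organizational memory"},
    "d2": {"expert systems", "information retrieval", "thesaurus"},
    "d3": {"information retrieval", "organizational memory", "thesaurus"},
    "d4": {"neural networks"},
}


def search(term):
    """Return the ids of documents indexed under a single term."""
    return {doc_id for doc_id, terms in DOCS.items() if term in terms}


def building_block(terms):
    """Run each term as its own search statement, then AND the result sets."""
    result_sets = [search(t) for t in terms]
    return set.intersection(*result_sets) if result_sets else set()


def pearl_growing(seed_terms, rounds=1):
    """Search on a few seed terms, then harvest new terms from the hits.

    Simplification (an assumption of this sketch): all index terms of the
    retrieved documents become candidate terms for the next round.
    """
    terms = set(seed_terms)
    hits = set()
    for _ in range(rounds):
        hits = set().union(*(search(t) for t in terms))
        for doc_id in hits:
            terms |= DOCS[doc_id]
    return hits, terms


if __name__ == "__main__":
    print(building_block(["information retrieval", "organizational memory"]))
    print(pearl_growing(["thesaurus"]))
```

The building-block search narrows the result by intersection, while pearl growing widens the vocabulary available for subsequent searches.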
Using the above information management and retrieval framework, we developed a textual database/knowledge-base coupling approach to creating effective knowledge-based organizational memory. The stages involved in our approach are similar to those described in the Management Intelligence System design model proposed by Kochen [50] and the information acquisition, management, and retrieval model proposed by Walsh and Ungson [75]. The techniques we adopted in each stage, however, are more specific and different from those in either of those proposals.
Our approach includes four basic stages: information acquisition, textual database creation, automatic knowledge base generation, and knowledge-based information retrieval. In this section we discuss the issues to be addressed in each of these stages and the techniques to be used. The complete approach and some specific techniques are illustrated in detail through an implementation example in the next section.
The complete process of creating a knowledge-based organizational memory is shown graphically on the left in Figure 2. The agents, information sources, and technologies involved during this process are shown on the right-hand side of the figure.
Most organizations have to function in challenging and dynamic environments. They need to interact with their suppliers and customers, obtain competitor information, and understand the general economic, social, and political context in which the organization is positioned. Comprehensive and longitudinal environmental scanning and timely and effective information collection (as shown in the box labelled Information Acquisition in Figure 2) are essential in this information acquisition stage. An organization which has more ``intelligence'' about its customers, its competitors, and the general social context is believed to be able to react more promptly to new threats and opportunities [75].
An influx of large amounts of potentially relevant information, often unstructured, needs to be scanned and filtered so that only relevant information can be retained by the organization. This environmental scanning activity is often performed by human information managers or analysts who have extensive subject knowledge. Despite the increasing availability of multimedia information such as drawings, pictures, films, and animations, a significant portion of organization-related information is still in textual format. This presents some unique information management and retrieval problems, especially regarding the contents or semantics of textual information (the classical subject access problem in online catalog systems [4]).
Information acquired from the environment can be classified into two basic categories according to its source: primary source information is collected first-hand and secondary source information is obtained from other existing, information-dissemination sources. The ``front-line'' members of an organization (e.g., sales persons or overseas representatives) typically contribute a lot of first-hand information. Occasionally top organizational executives or intelligence unit members may conduct on-site visits or perform extensive user surveys or environmental analyses in order to obtain first-hand information. Information collected from primary sources is often more reliable, but usually more expensive to obtain. In contrast, information from secondary sources is easier to acquire, but it often requires extra synthesis or interpretation to meet the organization's needs. Organizational members may need to scan and filter a large volume of publications produced by governments (both domestic and foreign), industries, and academic institutions in order to obtain a set of organization-specific information.
During the process of acquiring information from its environment, an organization may also develop its own procedures and format to identify and collect relevant information and to transform this information into useful ``organizational intelligence.'' Most successful multinational corporations have devoted extensive resources to environmental scanning and information collection [36] [67].
After information is acquired, it needs to be stored and archived in databases for future retrieval and analysis. (In this research we have addressed only information management and retrieval issues regarding textual documents.) In this textual database creation stage, information entry and indexing need to be performed by information managers (as shown in Figure 2).
During information entry, online documents can be entered manually by information managers or created through some combination of OCR (Optical Character Recognition) scanning and database importation (from other databases). OCR software, which converts text scanned from hard copy to ASCII files, is generally 95% accurate and prices for PC-based OCR software have dropped significantly in recent years to well below $1,000 [33]. Database importation typically involves down-loading files from external databases and converting tagged entries into local database formats, a process which can be largely automated.
For information which is more unstructured and which requires human synthesis, interpretation, and/or translation, a manual entry process may be more desirable. Most secondary information from existing online sources or publications can be imported or scanned automatically, however. This online information is then stored according to the formats and requirements specified by the databases.
Once information is captured online, information managers often need to assign indexes or descriptors to each document to represent its contents. This indexing process is necessary for applications that are based on commercial database management systems (through the creation of data dictionary and attribute/domain assignment) or conventional inverted index-based information retrieval systems. In full-text retrieval systems, however, each word or group of adjacent words automatically becomes an index for a document and keyword search or proximity search can be performed to retrieve documents. In this research our focus has been on database management systems and index-based information retrieval systems. We will, however, discuss issues related to full-text retrieval systems when appropriate.
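The difference between keyword search and proximity search in a full-text system can be sketched with a small positional inverted index. This is an illustrative sketch only: the function names and the `max_gap` parameter are assumptions, and tokenization is reduced to whitespace splitting, which ignores punctuation.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each lower-cased word to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def keyword_search(index, word):
    """Return ids of documents that contain the word anywhere."""
    return set(index.get(word.lower(), {}))


def proximity_search(index, first, second, max_gap=1):
    """Return ids of documents where `second` follows `first` within max_gap words."""
    a = index.get(first.lower(), {})
    b = index.get(second.lower(), {})
    hits = set()
    for doc_id in a.keys() & b.keys():
        if any(0 < q - p <= max_gap for p in a[doc_id] for q in b[doc_id]):
            hits.add(doc_id)
    return hits


if __name__ == "__main__":
    # Hypothetical documents, for illustration only.
    docs = {
        "m1": "expert systems support organizational memory",
        "m2": "memory chips for expert users of database systems",
    }
    idx = build_positional_index(docs)
    print(keyword_search(idx, "memory"))
    print(proximity_search(idx, "expert", "systems"))
```

With `max_gap=1` the proximity search behaves like an adjacent-words (phrase) match; larger values relax the adjacency requirement.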
In addition to a manual indexing process, automatic indexing, which may include dictionary look-up, stop-wording, word stemming, morphological analysis, term-phrase formation, and phrase-level syntactic analysis, is also applicable in an online environment. (See [68] for a good overview of these techniques.)
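Three of these steps (stop-wording, word stemming, and term-phrase formation) can be sketched as follows. The stop list, the suffix list, and the function names are assumptions made for illustration. The suffix stripper is far cruder than a real stemmer and will mangle many words.

```python
import re

# Hypothetical, deliberately tiny stop list.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}
# Suffixes are tried in this order; this is not a real stemming algorithm.
SUFFIXES = ("ing", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def auto_index(text):
    """Return single-word stems plus two-word phrases of adjacent non-stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    terms = []
    previous = None
    for word in words:
        if word in STOP_WORDS:
            previous = None  # a stop word breaks any phrase in progress
            continue
        stemmed = stem(word)
        terms.append(stemmed)
        if previous is not None:
            terms.append(previous + " " + stemmed)
        previous = stemmed
    return sorted(set(terms))


if __name__ == "__main__":
    print(auto_index("Searching the expert systems"))
```

Phrases are formed only from adjacent words that survive stop-wording, which is one simple reading of term-phrase formation. Dictionary look-up and syntactic analysis are not attempted here.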
In an online environment an indexer's subject area knowledge, system knowledge, and classification scheme knowledge are crucial to the success of indexing (as indicated in Figure 1). Through organizational members' collective environmental scanning and information collection efforts and effective information entry and indexing, a textual database could eventually emerge as a comprehensive and useful repository of organizational memory.
While the two phases already described help to create a computer-supported organizational memory in the form of textual databases, the problems associated with search uncertainty remain largely unresolved. Information searchers who do not know the specific classification scheme and indexes used, and who are unfamiliar with the system's functionalities, will still experience significant difficulty during information retrieval, and these problems only worsen as the number of documents continues to increase (the classical information overload problem). So the pressing question becomes: ``How can we facilitate the creation of large online databases and at the same time foster effective and efficient information access?''
One solution to this problem is to let algorithms automatically ``learn'' (or analyze) the subjects/concepts embedded in the documents of the database and to represent the result as an online search aid - a thesaurus-like component which we refer to in this paper as a knowledge base. Like a man-made thesaurus that helps searchers locate specific subject terms and find relevant documents, a system-generated thesaurus can also provide pointers to actual documents. This automatic approach to knowledge base generation (as shown in Figure 2, it produces a knowledge base from a textual database without human intervention) is grounded in inductive knowledge discovery research [34] [60] [21] [25] and automatic thesaurus generation [68] [18]. It could produce a network of indexes and their relationships to represent the concepts and subjects in the underlying textual database. In [18], we refer to this network as a Concept Space.
The most commonly used automatic thesaurus generation algorithms compute probabilities of indexes co-occurring in all documents of a database (co-occurrence analysis in Figure 2). Akin to a human inductive learning process which generates concepts from a set of examples (the more examples, the better), a knowledge base created from a database that is more subject-specific and contains a larger collection will be more ``knowledgeable.'' The knowledge base generation process can be performed in a batch mode when a large amount of information is in existence, incrementally whenever new information is acquired, or in some manner between the two (e.g., perform a batch learning operation once a sufficiently large collection exists and perform incremental learning afterwards).
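A batch co-occurrence analysis can be sketched as below. The weighting shown, an estimate of the conditional probability P(b | a) from document counts, is one common choice and is an assumption of this sketch. It is not necessarily the exact formula used in the system described in this paper, and the function name is hypothetical.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_weights(doc_terms):
    """Estimate P(b | a) for every ordered pair of co-occurring index terms.

    doc_terms is an iterable of per-document index term lists.
    The result maps (a, b) to the share of documents indexed by a
    that are also indexed by b. The weights are asymmetric.
    """
    term_count = Counter()
    pair_count = Counter()
    for terms in doc_terms:
        unique = set(terms)  # count each term once per document
        term_count.update(unique)
        pair_count.update(permutations(unique, 2))
    return {(a, b): n / term_count[a] for (a, b), n in pair_count.items()}


if __name__ == "__main__":
    # Hypothetical index term assignments, for illustration only.
    collection = [
        ["expert systems", "thesaurus"],
        ["expert systems", "thesaurus", "retrieval"],
        ["expert systems"],
        ["retrieval"],
    ]
    for pair, weight in sorted(cooccurrence_weights(collection).items()):
        print(pair, round(weight, 2))
```

Because the weights are directional, a narrow term can point strongly to a broad term without the reverse being true, which gives the resulting network a thesaurus-like cross-reference structure. An incremental variant would keep the two counters and update them as each new document arrives.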
Because the knowledge base is generated completely from all indexes in the underlying textual database, there is a tight linkage between the knowledge base and the database. Indexes in the knowledge base can be used to access documents in the database and documents found relevant by searchers can be used (through their indexes) to traverse the knowledge base.
The knowledge base, which is more concise in size and representation and which is semantics-rich (due to the explicit cross-references), can serve as a search front-end to its underlying database. Searchers can browse the knowledge base and use any search terms they deem appropriate to represent their needs. Because of the rich vocabularies captured by the knowledge base, search terms can easily ``dock'' on the knowledge base (the hit-the-side-of-barn principle described by Bates [4]) and, once ``docked'' on, the knowledge base's cross-reference structure can be used by searchers to browse the network and identify concepts of interest to them - a type of thesaurus consultation process [16]. This knowledge-based browsing process has been shown to be useful for alleviating the terms matching problem discussed earlier in the information management and retrieval framework [16] [4]. Alternatively we can perceive the knowledge base as a ``recall-aiding'' tool. With the abundant clues (terms and links) suggested by the knowledge base, information searchers, experts or novices, will have a better chance of recalling other subject-specific concepts regarding their queries [2].
Browsing has been recognized for its ability to foster serendipitous discovery and memory-jogging in hypertext systems [13]. Vannevar Bush, science advisor to FDR, is credited with conceptualizing the first (hypertext) browsing system. He proposed a browsing system based upon microfilm and photocells called Memex [9], which predated the modern digital computer. However, different researchers and system designers appear to have slightly different interpretations of hypertext. Halasz defines it as a style of building information management and representation systems using typed links to create a network of nodes [41]. Smith and Weiss [71] define hypertext as a form of electronic document viewed through an interactive browser, as well as a network-based approach to information management. Booth [8] describes branching and networked hypertext as non-linear and interactive in contrast to linear and ``passive'' text documents. In this research, we use browsing to indicate a user-controlled process of traversing a large network of concepts.
In addition to browsing, concepts in the knowledge base can also be retrieved through an automatic spreading activation process, during which known search terms can be used to follow existing links in the knowledge base and activate other relevant concepts automatically [65]. The spreading activation process can be performed by following some ad hoc activation heuristics (e.g., specific-link-first) [22] [70], through some search algorithms (e.g., branch-and-bound method) [16], or by means of existing neural network convergence algorithms (e.g., Hopfield net algorithm) [5] [19]. An automatic spreading activation process can spare searchers from browsing manually in a network of concepts without learning anything specific and can help them avoid getting lost in a large network (the ``Art Museum Phenomenon'' [31]).
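The spreading activation idea can be illustrated with a minimal sketch. It assumes a simple best-first heuristic with a multiplicative decay and a cut-off threshold; the function name, the `decay`, `threshold`, and `max_nodes` parameters, and the sample weights are all hypothetical and are not the heuristics, branch-and-bound, or Hopfield net procedures cited above.

```python
import heapq


def spread_activation(links, seeds, decay=0.8, threshold=0.1, max_nodes=10):
    """Best-first spreading activation over a weighted concept network.

    links: dict mapping a concept to {neighbor: link weight in (0, 1]}.
    seeds: search terms already known to the searcher (activation 1.0).
    decay, threshold, max_nodes: illustrative assumptions, not source values.
    """
    activation = {term: 1.0 for term in seeds}
    # heapq is a min-heap, so activations are stored negated.
    frontier = [(-1.0, term) for term in seeds]
    heapq.heapify(frontier)
    expanded = set()
    while frontier and len(expanded) < max_nodes:
        neg_act, node = heapq.heappop(frontier)
        if node in expanded:
            continue  # stale entry for a node that was already expanded
        expanded.add(node)
        for neighbor, weight in links.get(node, {}).items():
            new_act = -neg_act * weight * decay
            # Keep only the strongest activation path to each concept.
            if new_act >= threshold and new_act > activation.get(neighbor, 0.0):
                activation[neighbor] = new_act
                heapq.heappush(frontier, (-new_act, neighbor))
    ranked = sorted(
        ((act, term) for term, act in activation.items() if term not in seeds),
        reverse=True,
    )
    return [(term, round(act, 3)) for act, term in ranked]


# Invented network: concept names and weights are placeholders.
demo_links = {"a": {"b": 0.9, "c": 0.5}, "b": {"c": 0.8, "d": 0.1}}
print(spread_activation(demo_links, ["a"]))
```

The threshold keeps weakly related concepts (here `d`) from being activated, which is one simple way to stop the traversal from wandering through the whole network.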
The concept of traversing an information space automatically also exists in ``knowbots,'' knowledge ``robots'' (or automatic librarians) that automatically hunt for information on the Internet or in bibliographic databases. The ``Knowbot Information Service'' (KIS) on the Internet can be perceived as a ``white pages'' service that allows a user to present a query and have KIS search several other databases of various sorts [61]. Recently researchers have incorporated knowbots in information retrieval, e.g., Baylor College of Medicine's MEDLINE Retriever [32]. Our spreading activation algorithms resemble knowbots' proactive search engines. However, we have not addressed the issues of accessing information sources across the Internet and between heterogeneous databases.
Both knowledge-based browsing and spreading activation in our framework are likely to resolve the search uncertainty problem discussed earlier. They are shown on the bottom of Figure 2.
The textual database/knowledge-base approach is best illustrated by means of an implementation example. In this section we describe a research group which has adopted a textual database, automatic knowledge base generation, and knowledge-based information access for supporting international computing research. Over the past decade this research group has also been involved in all the activities described in the various stages of our proposed approach. Through this example, we aim to show how information technologies and techniques such as textual databases and knowledge bases can be used to provide more timely, accessible, and comprehensive organizational memory.
Even though the research group is smaller in size than most business entities, it has been operating in a competitive and dynamic research environment for a decade and has developed its own unique group norm, culture, and procedures for its tasks. The research group interacts extensively with its environments and it relies strongly on various external information sources. Its structural simplicity and its reliance on information technologies present an ideal opportunity to examine the effects of information technologies on organizational memory. In the following subsections (5.1-5.4), we describe the group's activities and experiences according to the four-stage database/knowledge-base coupling model we have proposed.
The organization we studied is the Mosaic research group at the University of Arizona. Group members have conducted research over the past decade in the areas of foreign-area studies and assessment of information technologies, focusing on the (former) Soviet Union and Eastern Europe (Russian/EE) [40] [59]. Today, the group has more than 15 members, who together have fluency in seven languages, proficiency in five others, and working capabilities in four more.
Group members (analysts) collect articles and other forms of international computing-related academic publications, browse and study documents collected, exchange ideas with foreign researchers via email, telephone, and other means, visit foreign countries and organizations periodically, and attend major international conferences and professional meetings. They build their knowledge around certain subject areas, develop their own personal contacts with foreign researchers and organizations, and shape their beliefs, values, and judgement concerning international computing technologies and developments in specific countries of interest.
Group member transformations (the standard operating procedures within an organization that convert an input (e.g., raw material, a new recruit) into an output (e.g., finished goods, a company veteran) [75]) are done through extensive language, subject area, and research training. Language training is conducted through language courses and tutoring services the group provides. Subject area training requires extensive exposure to the subject area and consistent interaction with external information sources. Group members typically learn first from academic publications and other secondary sources. They then participate in various site visits, foreign trips, and conferences in order to get a first-hand impression and knowledge of foreign technologies.
Over the past decade, the group has developed distinctive principles and techniques for acquiring Russian/EE computing information [59]. The techniques address two goals: determining what information to collect, and determining how to interpret that information.
The group's belief in the Web model has prompted group members to scan and collect a wide spectrum of information concerning computing policies, history, installations, infrastructure, usage, proliferation, and manufacturing capabilities. Longitudinal environmental monitoring and comprehensive information collection efforts allow the Mosaic analysts to access critical information and provide in-depth analysis for complex international computing research questions.
The Mosaic analysts collect important computing information from most of the countries in the world (143 countries in 1990). Even though the focus is on the former East-bloc countries, analysts also monitor major technological development information in the more highly developed countries, especially the U.S., Japan, and some European countries. The comprehensive coverage of various regions in the world and the recording of significant technological experiences permit analysts to make parallel comparisons and make forecasts for the information technologies in the less developed and developing countries.
This technique becomes feasible only when a very large base of information, from both primary and secondary sources and ranging from official government statements and newspaper reports to on-site visits to plants, interviews with workers and engineers, and discussions with foreign researchers, has been laid down.
This technique has emerged from more than a decade of painstaking experience in East-bloc computing research - official statements and government news are often misleading; finding reliable literature is hard; getting access to and finding people or organizations in (former) Communist countries is difficult; the languages and cultures of foreign countries create impediments to research; and a researcher's own biases can cause distortion and misunderstandings (see [59] for a detailed discussion).
The overriding attributes of Russian/EE computing research are that multiple, triangulating references on all aspects of computing (technological, economical, and social), experiences of the more developed countries, and the information acquired from both primary and secondary sources, are needed to make a point, support a comparison, reason a conclusion, or dig out an answer. Each piece of data collected from a variety of sources may itself be insignificant, but when combined with other pieces of data it attains meaning and provides support for intelligence analysis.
A significant portion of the Mosaic group memory and expertise has been captured by the Mosaic textual database. A custom-made information storage and retrieval system, built on top of INGRES, supports Mosaic research on international computing [53]. The system furnishes information modeling and management capabilities, facilitates document preparation, and supports inter-analyst communication [58]. The system has been in operation for the past nine years.
The data entry task on the Mosaic database is facilitated by the system's tool for information entry and indexing. The data entry procedure makes use of different frame-like templates for various information sources (e.g., article template, book template, business card template, etc.). Each template contains fields to be filled by the analysts (a template entry process). These fields include country identifier, organization identifier, journal identifier, and a series of bibliographic fields such as author, editor, publisher, etc. After template entry, text can be entered.
Analysts often abstract secondary source information from a long document source or translate and synthesize from a foreign-language document. First-hand information is often entered in detail and with careful comments by the analysts who collected it. Document entry is a manual process and it often requires significant effort on the part of the analysts. During document entry, content-bearing terms that appear in the document are marked by analysts as keywords. The system provides notations for different keyword types: ``|'' for person name (e.g., A B Gel'b in Figure 3), ``#'' for organization name (e.g., IKANESSR and Goskomizobreteniye), ``^'' for subject keywords (e.g., law, invention, and registration), and ``an::'' for the analyst's comments about the text. After document entry and indexing are completed, the system parses the complete entry and performs syntactic checking. A sample article template entry is shown in Figure 3.
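The keyword notations lend themselves to a small parsing sketch. The text above gives only the marker characters, so the delimiting convention used here (a marked term is wrapped in its marker on both sides, and an ``an::'' comment runs to the end of its line) is an assumption made for illustration, as are the function name and the sample entry text.

```python
import re

# Marker characters come from the text; wrapping a term in its marker on
# both sides (e.g. #IKANESSR#) is an ASSUMED delimiting convention.
MARKERS = {"|": "person", "#": "organization", "^": "keyword"}
TERM_RE = re.compile(r"([|#^])(.+?)\1")
# Assumed: an analyst comment starts with "an::" and runs to end of line.
COMMENT_RE = re.compile(r"an::\s*(.*)$", re.MULTILINE)


def extract_indexes(entry_text):
    """Collect marked terms and analyst comments from one document entry."""
    found = {"person": [], "organization": [], "keyword": [], "comment": []}
    for marker, term in TERM_RE.findall(entry_text):
        found[MARKERS[marker]].append(term.strip())
    found["comment"] = [c.strip() for c in COMMENT_RE.findall(entry_text)]
    return found


# Hypothetical entry text built from the example terms mentioned above.
sample = (
    "|A B Gel'b| of #IKANESSR# discussed the ^invention^ ^law^.\n"
    "an:: compare with #Goskomizobreteniye# rules"
)
print(extract_indexes(sample))
```

A real parser would also need the syntactic checks the system performs after entry, such as rejecting an unclosed marker.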
Junior analysts often handle general, secondary publications and summarize information verbatim (i.e., little analysis and few comments). Senior analysts, on the other hand, have specific areas of interest and provide more analysis of the information items. During document entry and indexing, junior analysts often seek advice from the senior analysts for help in understanding foreign documents and/or suggesting appropriate indexes or folder names. Due to the physical arrangement of the ``Mosaic house'' (where members work in adjacent offices), inter-analyst communication often is conducted in person. But the system also supports online information exchange through email and folder assignment. For example, each analyst has one or more ``hot'' folders which collect information from other analysts on the items of relevance to his/her research. During information entry, analysts are requested to send relevant, timely information to the appropriate analyst-specific folders.
Documents such as articles in foreign journals, magazines and newspapers, book chapters, trip reports, conference proceedings and notes, electronic mail messages, personal research communications, technical reports, product brochures, and business cards are translated, summarized, entered (manually), and indexed (manually) in the database by the Mosaic analysts. These textual information items or documents (a generic term we use to refer to any Mosaic information item) represent subject-related facts and knowledge and contain answers for past, present, and future research questions.
In order to capture the complete content of documents and provide detailed access points to documents, each document has over 20 different types of indexes, as shown in Table 1. The semantic indexes, which include keyword, country, organization, person, and folder, reveal the content of a document. The fine-grained indexing structure forces analysts to index information at a high level of specificity and permits many connections to be made between disparate facts.
An overview of the Mosaic database can help reveal the growth of the group's memory in response to the changing international computing environment7. Six variables were chosen to show the subject area growth of the Mosaic database: the number of countries covered, the number of organizations covered, the number of journals reported, and the number of unique documents, keywords, and folders created. The annual tallies for these variables, between 1982 and 1990, are shown in Table 2.
In 1982, the database included information from 25 countries, mostly East European. In 1990, the database contained information about 143 countries and, due to the intertwining nature of international computing issues (see the parallel forecasting technique discussed earlier), few countries have been left out of the group's information collection effort, although the focus remains on Russia and Eastern Europe.
In 1982, the database stored information about 345 organizations. Since then, it has grown at a rate of about 1,500 organizations a year, to about 11,000. This consistent growth rate indicates the group's persistent information collection activities and the proliferation of information technology-related organizations, including producer and user organizations. This organizational information has proven extremely useful in cross-cultural research, especially for tracing the source of information, verifying the reliability of information, and identifying future contacts [59]. The database has grown at a rate of about 186 journals a year and it currently stores information from more than 1,400 different journals worldwide.
Documents, keywords, and folders also indicate the growth of the group's subject area. By 1990 the database contained about 40,000 documents, with a size of about 200 megabytes. Folders are virtual files created by the analysts to represent topics of special interest to the group. Vocabularies for keywords and folders have grown at an average rate of 5,280 and 1,167 a year, respectively, resulting in the current total of more than 44,000 unique keywords and 10,000 folders.
In addition to textual databases, Mosaic analysts are also supported by knowledge-based techniques. In particular, a knowledge base was created automatically from the Mosaic textual database and the two are tightly integrated.
Grounded in knowledge discovery in databases8, the Mosaic researchers have adopted statistics-based algorithms and an AI-based representation to extract knowledge from the Mosaic database. The algorithms were based on the frequency with which terms co-occur in the documents, and the resulting knowledge is captured in a semantic network representation in which nodes represent different types of concepts and weighted links indicate their strengths of relevance.
Documents provide explicit linkages between relevant indexes in the domain. A document may describe a recent technological development (keywords and folders) by some foreign researchers (persons and countries) in some institution (organization). Documents on similar topics collected over time from different sources often contain similar indexes. Collectively, the co-occurrence of indexes in the documents stored in this large-scale database can reveal the relationships among the important topics (projects, computers, policy, etc.), crucial persons, relevant organizations, and countries in East-bloc computing.
We sketch the procedure and algorithms used in creating the Mosaic knowledge base below. As shown in Figure 2, this procedure did not require any human intervention, but the success of this approach was dependent on the quality and quantity of documents acquired and the goodness of the analysts' assigned indexes (i.e., completeness and preciseness). With 40,000 documents in the Mosaic database and careful expert indexing, the result from our automatic knowledge base generation appeared robust.
The two equations give the similarity weights from Tj to Tk (the first equation) and from Tk to Tj (the second equation), where dij indicates whether index Tj appears in document i (value: 0 or 1), dik indicates whether index Tk appears in document i (value: 0 or 1), and dijk indicates whether both indexes Tj and Tk appear in document i (value: 0 or 1).
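One form of the two weights that is consistent with these definitions can be written as follows. It assumes that n denotes the number of documents in the database and that each weight is the fraction of the documents containing one index that also contain the other; the symbols W and n are introduced here for illustration.

```latex
\[
W(T_j, T_k) = \frac{\sum_{i=1}^{n} d_{ijk}}{\sum_{i=1}^{n} d_{ij}},
\qquad
W(T_k, T_j) = \frac{\sum_{i=1}^{n} d_{ijk}}{\sum_{i=1}^{n} d_{ik}}
\]
```

Under this reading the two weights are asymmetric and each lies between 0 and 1, since the number of documents containing both indexes can never exceed the number containing either one.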
Notice that the amount of computation required was quite large - it took several weeks of computing cycles on a small VAX/VMS mainframe. We computed the co-occurrence of the 60,000-plus indexes in our complete database. The two algorithms acted as a ``batch'' learning process, making it possible to examine the patterns in all the documents collected by the Mosaic analysts over the past decade.
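A batch co-occurrence pass of this kind can be sketched in a few lines. The sketch assumes that the weight from one index to another is the number of documents containing both, divided by the number of documents containing the first; the function name and the toy documents are hypothetical, and this is not the original Mosaic program.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(documents):
    """Compute asymmetric co-occurrence weights in one batch pass.

    documents: iterable of collections of index terms, one per document.
    Returns {(source, target): weight}, where the ASSUMED weight is
    (documents containing both) / (documents containing the source).
    """
    term_count = Counter()  # documents containing each index
    pair_count = Counter()  # documents containing both indexes of a pair
    for indexes in documents:
        unique = sorted(set(indexes))  # each index counts once per document
        term_count.update(unique)
        pair_count.update(combinations(unique, 2))
    weights = {}
    for (tj, tk), both in pair_count.items():
        weights[(tj, tk)] = both / term_count[tj]
        weights[(tk, tj)] = both / term_count[tk]
    return weights


# Toy documents; the index names are placeholders.
docs = [{"a", "b"}, {"a", "b", "c"}, {"a"}]
for pair, weight in sorted(cooccurrence_weights(docs).items()):
    print(pair, round(weight, 3))
```

The pair counting grows with the square of the number of indexes per document, which gives a sense of why a pass over a complete database of this size was expensive.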
Five types of subject indexes (the semantic indexes in Table 1) were extracted by our algorithms: keywords (describing topics, machines, projects, and so on, e.g., technology transfer, MVS 810, etc.), folders (virtual files that store a collection of documents of interest to the Mosaic analysts, e.g., ``softlaw.dat'' for the Russian software protection law folder), persons (persons related to a document, e.g., Y. Andropov), organizations (institutions related to a document, e.g., Academy of Science in Kiev), and countries (e.g., USSR, Poland, etc.). For each pair of objects there is a probability (between 0 and 1) that indicates their strength of relevance. Collectively, we therefore could consider our knowledge base as a frame-based semantic network where nodes represented objects and links represented relationships between objects. Figure 4 shows the frame-based representation for keywords. Each frame contains different attributes (slots) which indicate the properties of the keyword (e.g., name and number of occurrences in the database) and its weighted relationships with other objects (keywords, folders, persons, organizations, and countries). Similar frame-based representations were created for the folder, person, organization, and country objects. The resulting Mosaic knowledge base has about 20,000 indexes and 280,000 relationships.
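A frame of this kind might be modelled roughly as below. The slot names, the `Frame` class, and the sample weights are assumptions for illustration only; the actual slot layout is the one shown in Figure 4.

```python
from dataclasses import dataclass, field

# The five object types named in the text.
OBJECT_TYPES = ("keyword", "folder", "person", "organization", "country")


@dataclass
class Frame:
    """One node of the semantic network (slot names are hypothetical)."""

    name: str
    object_type: str
    occurrences: int = 0
    # One slot per related object type: {related name: weight in [0, 1]}.
    links: dict = field(default_factory=lambda: {t: {} for t in OBJECT_TYPES})

    def relate(self, other_type, other_name, weight):
        """Record a weighted relationship to another object."""
        if other_type not in OBJECT_TYPES:
            raise ValueError(f"unknown object type: {other_type}")
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must lie between 0 and 1")
        self.links[other_type][other_name] = weight

    def strongest(self, other_type, k=5):
        """Return up to k related objects of one type, strongest first."""
        related = self.links[other_type]
        return sorted(related.items(), key=lambda item: item[1], reverse=True)[:k]


# Invented weights and count, for illustration only.
frame = Frame("technology transfer", "keyword", occurrences=12)
frame.relate("country", "USSR", 0.75)
frame.relate("country", "Poland", 0.25)
frame.relate("folder", "softlaw.dat", 0.5)
print(frame.strongest("country"))
```

Keeping one slot per object type makes it straightforward to label suggested terms by type when they are displayed to a searcher.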
The Mosaic knowledge base contains both the subject area knowledge in East-bloc computing and the classification scheme knowledge unique to the Mosaic database. In an attempt to determine the performance of the knowledge base, we found that the Mosaic knowledge base performed comparably to human experts in a concept-association experiment [18]. With a friendly system interface (as shown in Figure 7), it can provide accessible and knowledgeable assistance for concept-based information retrieval.
The Mosaic database was initially created in 1985 using INGRES; SQL was the main information retrieval engine. Because of the need for more flexible and fine-grained retrieval, a customized information retrieval front-end (AAIS, the Arizona Analyst Information System [58]), which supported information retrieval, management, and manipulation, was developed in 1987 and has been in operation since then. In 1991, we successfully developed and integrated a Mosaic knowledge base component and a thesaurus-browsing interface with the Mosaic database. The Mosaic knowledge base can be invoked during an online information retrieval process, helping users identify semantically-relevant concepts and documents (the browsing component in the box labelled Knowledge-Based Information Retrieval in Figure 2). For historical reasons, the AAIS interface was developed in Fortran; but the newer Mosaic knowledge base interface was developed mainly in C. Most analysts use both the AAIS/Mosaic interface and SQL search operations.
The Mosaic knowledge-based information retrieval interface allows extensive query articulation and concept exploration during search. Searchers' initial query statements are taken as the ``triggers'' to identify other semantically-relevant indexes. The system uses the query terms to consult the Mosaic knowledge base, activate relevant indexes, and rank them in order. Since the Mosaic knowledge base captures almost all the important terms and indexes used by the Mosaic researchers in the past decade, the thesaurus consultation process will be able to assist terms matching. Terms suggested by the system also serve as clues to help searchers articulate their needs. Searchers can use the knowledge base component iteratively - select relevant terms, activate knowledge base terms by using the selected terms, make more selections, activate more knowledge base terms, and so on. During this human-system interaction cycle, the Mosaic knowledge base becomes a concept exploration or concept convergence aid to the users, alleviating the cognitive demand on the users in refining their ``anomalous state of knowledge'' [6].
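The select-and-activate cycle can be sketched as a ranking function that is called again after each round of selections. The scoring rule (summing the weights of links from the selected terms) and all names and weights below are assumptions for illustration, not the ranking used by the Mosaic interface.

```python
def suggest_terms(weights, selected, top_n=5):
    """Rank knowledge base terms related to the terms selected so far.

    weights: {(source, target): weight} from the knowledge base.
    selected: query terms and earlier selections made by the searcher.
    Scoring by summed link weight is an ASSUMED rule for this sketch.
    """
    chosen = set(selected)
    scores = {}
    for (source, target), weight in weights.items():
        if source in chosen and target not in chosen:
            scores[target] = scores.get(target, 0.0) + weight
    # Strongest first; ties broken alphabetically for a stable display.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Placeholder terms and weights.
kb = {("x", "y"): 0.75, ("x", "z"): 0.25, ("y", "z"): 0.5, ("y", "w"): 0.5}
query = ["x"]
for _ in range(2):
    suggestions = suggest_terms(kb, query)
    print(query, "->", suggestions)
    if not suggestions:
        break
    query.append(suggestions[0][0])  # stand-in for the searcher's own choice
```

In practice the searcher, not the program, decides which suggestions to keep; the loop above simply takes the top suggestion to show how each selection changes the next ranking.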
A sample search session and system options are presented below. In addition to using the SQL query language, analysts could invoke the AAIS search front-end (the VIEW main menu as shown in Figure 5), which supports fine-grained search options (e.g., search on country, directory, file, person name, and organization name) and document manipulation functions (e.g., SORT, SELECT, MERGE, etc.). For example, using a combination of KEYWORD/ORG search (``HANDICAP*/IBM''), documents are retrieved and displayed, as shown in Figure 6. A searcher can also invoke the ``THESAURUS'' option from the AAIS main menu.
Figure 7 shows the top-ranked terms suggested by the Mosaic knowledge base with an initial query request for ``EL'BRUS'' (a Soviet high-performance computer) using the THESAURUS search option. These terms were ranked in decreasing order of relevance and their object types were also indicated on the display, i.e., (F) for folder, (K) for keywords, (O) for organization, (N) for person name, and (C) for country. For example, ``BESM-6'' was the predecessor of the EL'BRUS system; ``ITMVT'' was the organization in which the machine was developed; and ``Albert Nikolayevich Naumov'' was the key scientist credited with developing the machine. The searchers could then select any terms they deemed relevant to their queries and perform more specific searches using the selected terms. When searchers felt comfortable with their articulated queries, they could activate the system's document retrieval module, which uses the final selected terms to search in the complete database.
After the initial implementation, most analysts reported that the knowledge base had been an excellent tool for finding additional topics, people, and organizations. Even for senior analysts, the knowledge base often suggested some unexpected people or topics and previously-forgotten keywords. It also provided semantic interpretation (e.g., related keywords, persons, organizations, etc.) for previously obscure folders, which were created by different analysts at different times over the past decade (a reflection of the expert-turnover problem). Three junior analysts used the Mosaic knowledge base extensively as a training tool (there were four junior analysts in the Mosaic group at the time when we incorporated the Mosaic knowledge base in 1991), exploring and traversing the network of knowledge and its underlying database in order to become familiar with topics of interest.
The success or usefulness of the Mosaic organizational memory cannot be easily quantified. But the continuous support from various grants and contracts over the past decade for creating and developing this unique database (it has become the most comprehensive source of East-bloc computing information), the extensive use of the Mosaic database for supporting intelligence analysis for technological assessment, US export control, and technology proliferation, and the fact that the group has established itself as a leading center for East-bloc computing research are indirect evidence of its value. In addition to the group's knowledgeable individuals and supportive research setting, Mosaic's computer-supported organizational memory, which was created by the Mosaic analysts in the first place and which fosters seamless user-system collaboration, has contributed significantly to its success.
The Mosaic organizational memory-building effort adopted some but not all of the techniques we suggested. Based on our proposed approach and the Mosaic experience, we discuss some important research and implementation issues.
Once the information is down-loaded, it can be converted automatically into local database formats through some format conversion program. Since most online documents are in tagged format (e.g., author field, title field, etc.), format conversion would involve matching external source fields into local database fields. Other printed materials can also be put online by means of OCR software. However, extensive manual effort may still be needed in order to clean up the OCR output (i.e., the potential 5% error). Both OCR input and online source conversion are important for collecting a large amount of environmental information by relatively few organizational members.
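Field matching of this kind can be sketched as a lookup from external tags to local field names. The two-letter tags, the `TAG: value` line format, and the local field names below are hypothetical; real online sources each define their own tagged formats.

```python
# Hypothetical mapping from external source tags to local database fields.
FIELD_MAP = {"AU": "author", "TI": "title", "JN": "journal", "PY": "year"}


def convert_record(tagged_text, field_map=FIELD_MAP):
    """Convert one tagged record (ASSUMED 'TAG: value' lines) to local fields.

    Returns (record, unmapped) so unknown tags can be reviewed, not dropped.
    """
    record, unmapped = {}, {}
    for line in tagged_text.splitlines():
        tag, sep, value = line.partition(":")
        if not sep or not tag.strip():
            continue  # skip blank lines and lines without a tag
        tag, value = tag.strip().upper(), value.strip()
        target = field_map.get(tag)
        if target is None:
            unmapped.setdefault(tag, []).append(value)
        else:
            record.setdefault(target, []).append(value)
    return record, unmapped


sample = "AU: Ivanov, I.\nAU: Petrov, P.\nTI: Computing: a survey\nXX: misc"
print(convert_record(sample))
```

Returning the unmapped fields separately reflects the point that some manual clean-up is still likely after automatic conversion.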
The Mosaic information acquisition process is a slow, painstaking, and longitudinal effort - the 50-plus past and present Mosaic analysts acquired the 40,000 international computing documents over a period of ten years. Mosaic's manual practice is likely to be infeasible for organizations that face a more dynamic and changing environment and that have severe constraints on resources. A more automated process that requires less human intervention may be desirable. (Because of recent Russian and East European political changes, the Mosaic group is also moving toward a more automated information acquisition process in order to cope with more dynamic and uncertain Russian/EE computing environments.)
An alternative to manual indexing is automatic indexing (as shown in Figure 2). In [68], Salton provides a blueprint for performing automatic indexing, which typically includes stop-wording, dictionary consultation, stemming, and term-phrase formation. This simple automatic indexing approach has been shown to be as good as other more complex natural language processing methods (e.g., ATN parser, case grammar, etc.) [68] [15]. An automatic indexing procedure does not identify index types like organization, person, or country, however. It produces mainly subject identifiers. Identification of specific index types is an obvious advantage of manual indexing, during which analysts can provide more precise semantics for documents.
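The indexing steps named above can be sketched as a short pipeline. The stop list and suffix rules here are tiny illustrative stand-ins rather than a full stop list or a published stemming algorithm, the dictionary consultation step is omitted, and phrases are simply pairs of adjacent non-stop words; none of this is taken from [68].

```python
import re

# Tiny illustrative stop list; a production list would be far longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "is", "are",
              "to", "on", "by", "with"}
# Crude suffix stripping used only for this sketch.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip one common suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def auto_index(text):
    """Return sorted single-term and two-word phrase indexes for a text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    runs, run = [], []
    for token in tokens:
        if token in STOP_WORDS:  # stop-wording also breaks phrase runs
            if run:
                runs.append(run)
            run = []
        else:
            run.append(stem(token))
    if run:
        runs.append(run)
    terms = set()
    for run in runs:
        terms.update(run)  # single-word indexes
        terms.update(" ".join(pair) for pair in zip(run, run[1:]))  # phrases
    return sorted(terms)


print(auto_index("The registration of inventions"))
print(auto_index("software protection law"))
```

Note that every term produced is an untyped subject identifier: nothing in the pipeline can tell a person from an organization, which is the limitation discussed above.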
Associated with the benefit of quick automatic indexing is the side effect of generating less precise and significantly more numerous indexes for each document. On the bright side, this creates more access points for retrieving documents, but the undesirable side-effects may include imprecise and/or inaccurate retrieval of documents for analysis and a significant waste of memory.
The significantly larger number of indexes that can be generated from automatic indexing also may result in a much larger automatic knowledge base. This can be shown through some experiments we performed. In the Mosaic manual indexing environment, the 200-megabyte Mosaic database generated a 13-megabyte knowledge base. This 6.5% overhead in storage for the knowledge base appeared reasonable considering the benefit of having the knowledge base. In another experiment we examined a textual database which contained about 3,400 DIALOG articles (40 megabytes in size), with DIALOG indexes already assigned to them (manually assigned by other experts). Using existing DIALOG indexes as the basis, we were able to generate a 1-megabyte knowledge base (2.5% overhead). But when we performed an automatic indexing procedure (without using assigned indexes), the resulting knowledge base was significantly larger - 15.5 megabytes (38.75% overhead). For organizations with large databases (say a few gigabytes), this level of overhead could become a source of concern.
Another potential problem in automating document scanning, selection, and indexing is that the start of `human' analysis is postponed. If the same analysts are both collecting and scanning documents, as is the current practice of Mosaic research, their analysis probably already has begun during the document selection and filtering stage. However, the automated procedure may still save enough time to make the overall research and intelligence analysis process more time-efficient.
After completion, the database and knowledge base were in synchronization in terms of the indexes in use in February, 1991. But since Mosaic database creation is an ongoing effort, the database coverage has exceeded that of the knowledge base since then. As discussed earlier, the Mosaic database has grown at a rate of about 5,000 documents per year. Without an incremental algorithm or periodic batch processing, tight coupling between database and knowledge base will no longer be possible. An incremental algorithm for knowledge discovery is particularly essential in a dynamic environment where new and time-critical documents arrive constantly (newer information is often more valuable than older information). We are currently designing an incremental co-occurrence analysis algorithm that can perform incremental updating of the knowledge base.
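One simple way to make co-occurrence analysis incremental is to store the raw counts and derive weights on demand. The sketch below is an illustration of that idea under the assumption that a weight is a ratio of document counts; it is not the algorithm under design by the authors, and the class and method names are hypothetical.

```python
from collections import Counter
from itertools import combinations


class IncrementalCooccurrence:
    """Keep raw counts so each new document updates the weights cheaply."""

    def __init__(self):
        self.n_docs = 0
        self.term_count = Counter()  # documents containing each index
        self.pair_count = Counter()  # documents containing both indexes

    def add_document(self, indexes):
        """Fold one newly indexed document into the stored counts."""
        unique = sorted(set(indexes))
        self.n_docs += 1
        self.term_count.update(unique)
        self.pair_count.update(combinations(unique, 2))

    def weight(self, source, target):
        """ASSUMED weight: (docs with both) / (docs with source)."""
        if source == target or self.term_count[source] == 0:
            return 0.0
        pair = (source, target) if source < target else (target, source)
        return self.pair_count[pair] / self.term_count[source]


kb = IncrementalCooccurrence()
kb.add_document({"a", "b"})
print(kb.weight("a", "b"))
kb.add_document({"a"})  # a later document changes the weight from "a"
print(kb.weight("a", "b"), kb.weight("b", "a"))
```

Only the counts touched by a new document change, so the cost of an update depends on the number of indexes in that document rather than on the size of the whole database.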
Most full-text retrieval software has been designed to handle large volumes of text by indexing every word (and its position). This allows users to perform proximity search, morphological search (using prefix, suffix, or wildcards), and thesaurus search. BRS/SEARCH and BASIS/plus are typical of this type of software. PixTex and Topic, on the other hand, are among the most advanced full-text retrieval systems and feature ``content-based IR'' and ``learning'' capabilities. PixTex calls its indexing process ``learning.'' The system automatically extracts patterns from binary data (texts or images) and associates them with the storage location of the data using neural network technology (the exact form and algorithm are not clear due to the lack of publications and the proprietary nature of the product). By automatically storing visual scene or textual contents in terms of Huffman codes, the system can then retrieve other similar scene objects or texts during IR. Verity's Topic claims to use fuzzy logic in a design employing ``conceptual searching'' for ``intelligent'' document retrieval systems. Topic allows users to create and re-use hierarchical, weighted query trees that become part of the corporate memory and produce rank-ordered documents. It also appears to have some ``similarity search'' capability (e.g., `find me all documents like this one.'). We believe that with the extensive indexing capabilities provided by such full-text retrieval software we can adopt the automatic knowledge base (thesaurus) generation capability of our proposed approach in the full-text retrieval environment. The thesaurus browsing and concept-augmentation features available in full-text retrieval software have enormous potential for use in a system equipped with an automatically-generated, domain-specific thesaurus. It should be noted that none of the existing full-text retrieval software exhibits knowledge base generation capability.
Computer-supported organizational memory has become an integral part of many organizational activities. As Huber postulated in his conceptual theory of organizational memory: ``use of advanced information technologies leads to increased information accessibility, which leads to improvements in effectiveness of intelligence development and decision making.'' Our research provided more detailed and technical support for this proposition. In particular, we suggested application of two technologies, textual database and system-generated knowledge base, and we discussed the feasibility of using these technologies for organizational memory development and intelligence analysis.
Our discussion was grounded in a framework of information management and retrieval in which knowledge components, information agents, and indexing and search uncertainty intertwine and affect activities related to organizational memory. Four types of knowledge - subject area, classification scheme, system, and organizational - play an important role in the information management and retrieval process. Mismatches between index terms and search terms cause a significant portion of information retrieval problems.
A textual database/knowledge-base coupling approach was proposed and a blueprint for system design was discussed. The approach includes four stages: information acquisition, textual database creation, automatic knowledge base generation, and knowledge-based information retrieval. These four stages need to be performed in order, as each stage provides support for the next. Algorithmic techniques appropriate for creating robust computer-supported organizational memory, such as automatic information acquisition, automatic indexing, term co-occurrence analysis, and concept-based information retrieval, were presented for each stage.
We illustrated a complete implementation of the approach through the example of an international computing research group called Mosaic. The 200-MB textual Mosaic database provides timely, accurate, and comprehensive information to support various intelligence analysis techniques - the Web Model, parallel forecasting, and from-the-small-to-the-large. The Mosaic knowledge base, with 20,000 indexes and 280,000 relationships, was generated automatically from the Mosaic database, and provides expert-like support in exploring concepts and articulating ideas; it also permits knowledge-based information retrieval. We also discussed other implementation considerations pertaining to automatic information acquisition, automatic indexing, incremental knowledge base creation, and human-computer interactions.
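The last two stages can be sketched in a few lines. The example below is an illustration under stated assumptions, not the algorithm used in our implementation: it assumes documents have already been indexed, it weights a relationship from term a to term b as the fraction of documents containing a that also contain b (one simple asymmetric co-occurrence measure among many), and the names `build_cooccurrence_thesaurus`, `expand_query`, `min_weight`, and `top_k` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_thesaurus(doc_terms, min_weight=0.5):
    """Generate a thesaurus from indexed documents.

    `doc_terms` is a hypothetical {doc_id: set of index terms} mapping.
    Assumed weight: weight(a -> b) = docs containing both / docs containing a.
    Returns {term: [(related_term, weight), ...]}, strongest first.
    """
    doc_freq = defaultdict(int)
    pair_freq = defaultdict(int)
    for terms in doc_terms.values():
        unique = sorted(set(terms))
        for term in unique:
            doc_freq[term] += 1
        # Count each unordered pair once per document.
        for a, b in combinations(unique, 2):
            pair_freq[(a, b)] += 1

    thesaurus = defaultdict(list)
    for (a, b), both in pair_freq.items():
        # The weight is asymmetric, so evaluate both directions.
        for src, dst in ((a, b), (b, a)):
            weight = both / doc_freq[src]
            if weight >= min_weight:
                thesaurus[src].append((dst, weight))
    for term in thesaurus:
        thesaurus[term].sort(key=lambda item: (-item[1], item[0]))
    return dict(thesaurus)


def expand_query(thesaurus, query_terms, top_k=2):
    """Concept-based retrieval step: augment the user's search terms
    with the top_k most strongly related thesaurus terms."""
    expanded = list(query_terms)
    for term in query_terms:
        for related, _weight in thesaurus.get(term, [])[:top_k]:
            if related not in expanded:
                expanded.append(related)
    return expanded
```

Expanding a query in this way is one means of reducing mismatches between index terms and search terms: a searcher who supplies only one term also retrieves documents indexed under terms that frequently co-occur with it.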
The research presented in this article suggests an approach to creating computer-supported organizational memory that is potentially applicable to organizations in which receiving a large influx of unstructured, textual documents is the norm. We are currently investigating other areas where our approach could be useful, including: knowledge-based environmental scanning for corporate intelligence systems, ``intelligent'' email storage, retrieval, and dissemination, and long-term ``group memory'' creation for group decision support systems.
We thank Dr. Andrew Dillon, Ann Danowitz, and the anonymous referees for their comments on earlier versions of the paper.