An Information Access Model with a Unified Approach to Data Type,
Retrieval Mechanism and Information Need

Gregory B. Newby
University of North Carolina at Chapel Hill

 

Abstract

In a global information economy, as well as smaller corporate or personal economies, emphasis is on efficiency and cost effectiveness. When economic progress is strongly tied to information access, models are needed to help predict what types of retrieval mechanisms, data types and information needs may be combined for maximum performance. In cases where information seekers have choices about which systems to utilize, where system designers have choices about retrieval mechanisms to implement or develop, and where content creators have choices about what organization scheme to apply to their data, it is important to be able to make decisions about the likely benefits of the choices. This paper presents the background and rationale for a description of information systems which places typical information retrieval and other types of systems in context with one another. A relationship between data type, retrieval mechanism, and information need is presented which can be used to generate realistic expectations about system performance based on their combination. Examples are drawn from various types of information systems, and a typology is derived for identifying useful combinations of data type, retrieval mechanism, and information needs. It is anticipated that cost savings for information seekers, system designers or content developers may be achieved by seeking improvements on one aspect over another.

1. Introduction

Information retrieval (IR) systems are primarily devoted to matching structured queries to document surrogates, yet such matching applies to only a portion of the larger realm of automated means of generating information to address information needs. Activities such as browsing or question answering are commonly found during IR system use (e.g., Borgman, 1995; Barry, 1994), yet most systems were not designed with those activities in mind. Other types of IR systems exist to support more than the matching of queries to document surrogates, yet these have important limitations in interface, database size, speed, etc. (e.g., Mark Pejtersen’s "Book House," 1992).

This paper discusses a unified view of information systems which includes typical IR systems as well as others. By taking this unified view, it is hoped that system designers and information seekers may have a sound functional goal and realistic expectation of performance for a given system in a particular situation. Inasmuch as the information economy (from global and international components to micro-economies of companies and individuals) emphasizes effective and cost-efficient access to information, a view of the likely cost-benefit performance of development or improvement in one aspect of an information system over another is required.

Consider the proposed relation among data type, retrieval mechanism, and information need, below. The goal is to create a framework in which different types of information systems may be related to one another. Based on this relation, realistic assessment of the usability of a system in a particular information need situation, with a particular data collection, can be made.

Relation among information system components:


The effectiveness of an information system is dependent on the combination of

a) information need

b) data collection

and c) retrieval mechanism


The various components of the statement require scrutiny. First, consider the role of the information need. The information need has been given great attention by scholars (for example, Dervin & Nilan, 1986), yet is not commonly a part of an IR system. Part of the reason for this is that most IR systems have no way of dealing with information need -- they have a single data collection and retrieval mechanism, and no flexibility in the nature of the information need they are designed to serve. (The information need they are designed for is, as stated above, typically the retrieval of document surrogates based on structured queries.) Information needs may have many components, including a situation, past knowledge, a gap in knowledge, a time frame for gaining a response, etc.

The role of data contained in an information system is obvious, as a subset of the available data is what system output usually consists of. A less obvious implication is that not all information needs can be addressed effectively by all data collections. Part of the role of the data collection in the proposed relationship is to include an understanding of this limitation. A further implication is that different retrieval mechanisms are more or less suitable for different data collections.

Retrieval mechanisms are the set of algorithms (i.e., methods) employed to produce an output set from the data collection. These may include Boolean matching schemes, parsers for natural language input, information need profiles, etc. IR algorithms are documented extensively in sources such as Ingwersen (1992) and Pao (1989). The relation of the retrieval mechanism to the particular data collection has already been stated; its relation to information need should also be made clear. If an information need requires an answer to a question, for example, the retrieval mechanism should be capable of generating or identifying an answer. If the information need requires a fact, the retrieval mechanism must be capable of producing that fact. For most information needs, the ideal role of the retrieval mechanism is necessarily beyond that of simply matching and producing document surrogates.

The notion of a maximally effective solution is put forth as the goal of information system designers. As presented, a maximally effective solution cannot consist solely of a retrieval mechanism (although most literature in IR seems to imply that it can -- e.g., Callan et al., 1995). The effectiveness of an information system cannot be assessed separately from the human information needs that lead to its use, nor apart from the data contained in the system and the utility the data content and format have for addressing the information need. This limitation on effectiveness is consistent with a user-based approach to relevance (see esp. Schamber et al., 1991), yet is not consistent with most approaches to evaluating retrieval system effectiveness (cf. Salton & McGill, 1983; Sparck Jones, 1981).

Before considering the assumptions and implications of the proposed relation further, some limitations should be introduced. First, it is intended to be applied to various automated means of resolving information needs. It is not intended to be applied to, for example, reference librarians, telephone operators, or others who deal with information needs (although it could be useful to them). Second, some of the goals presented here are not yet achievable. (In fact, there is no implication that even existing commercial IR systems have achieved their maximally effective performance.) As such, this work makes no attempt to be a panacea for the implementation of information systems, but only a framework from which to plan or evaluate new or existing systems.

2. Background

This section will briefly examine some of the assumptions underlying the statement presented in the proposed relation, above, in the interest of creating boundaries on its domain. The view of information systems presented here encompasses somewhat more than that taken in most IR research and development, yet most of the components are derived from IR. For example, the reliance on information need as a primary component of an information system has existed in the IR literature for some time (e.g., Taylor, 1968).

Assumption 1: A single retrieval mechanism is not sufficient for all information needs or data collections. This is self-evident when a whole variety of data and information needs are considered (as is done below), but can be overlooked by system designers working in a more typical IR scenario involving bibliographic document surrogates.

Assumption 2: Not all information needs can be addressed effectively by all data collections. This balances the statement that a maximally effective solution exists for combinations of data, information need, and retrieval mechanism.

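Assumption 3: A measure of "maximal effectiveness" must necessarily include factors which are not part of the data or the retrieval mechanism. These include the various components of information needs (e.g., Newby et al., 1991), such as the situation, past knowledge, a gap in knowledge, and the time frame for gaining a response, as noted in the Introduction.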

Assumption 4: "Maximal effectiveness" is an ideal which may not be practically achievable. Furthermore, the difficulties in assessing information needs (as described below) mean that real and practical assessment of effectiveness is necessarily artificial or otherwise flawed.

3. Retrieval Mechanisms

The mechanisms that information systems use to produce data based on some sort of input have great variety. This section introduces some of the prominent mechanisms and their benefits and limitations. These include mathematical and analytical engines; Boolean set schemes; vector space; information space; neural networks; and physical location-based approaches. Mechanisms in this section are derived from the IR literature and other literatures on information systems.

Mathematical and analytical engines are not obvious components of IR systems, yet are frequently included. A calculator is a typical mathematical engine: It includes circuitry for solving well-defined problems, and can accept input of a pre-defined type of (usually) any value. So, an arithmetic circuit in an electronic calculator might be able to add only real numbers, but will add any (or almost any, perhaps limited by size and precision) series of numbers.

Information retrieval systems make use of mathematical or analytical engines to count or rank documents and other behind-the-scenes functions. Such an engine is important to a system that derives responses using algorithms (see the section on retrieving facts, below), especially when the domain of an analytical engine is taken beyond numbers to include any formal symbol system, such as that encountered in symbolic logic. A suitably formed problem in, for example, logistics, could be solved by a system possessing a mathematical and analytical engine.

Boolean set schemes are found commonly in IR systems. There are two major components: A mechanism for identifying membership in a set, and a mechanism for combining or otherwise operating on sets. The first component is essentially a database component: an index of some sort is consulted to determine whether a record (which could be a bibliographic citation, a document, or some other item) belongs to a target set. This is repeated for all items in the appropriate database, yielding one set.

The second component, for set combination, may allow Boolean operations such as finding the intersection, the union, a set negation, etc. The records contained in this type of system are usually divided into logical fields, such as AUTHOR and SUBJECT, in order for the information need to be specified more fully.
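The two components described above can be sketched in a few lines. This is a minimal illustration, not a description of any particular IR system: the records, the field contents, and the helper names `build_index` and `lookup` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records, divided into logical fields such as AUTHOR and SUBJECT.
RECORDS = {
    1: {"AUTHOR": "salton", "SUBJECT": "retrieval vector"},
    2: {"AUTHOR": "taylor", "SUBJECT": "libraries questions"},
    3: {"AUTHOR": "salton", "SUBJECT": "feedback retrieval"},
}


def build_index(records):
    """Component 1: map each (field, term) pair to the set of record ids containing it."""
    index = defaultdict(set)
    for record_id, fields in records.items():
        for field, text in fields.items():
            for term in text.split():
                index[(field, term)].add(record_id)
    return index


def lookup(index, field, term):
    """Return the set of records whose field contains the term (empty set if none)."""
    return set(index.get((field, term), set()))


INDEX = build_index(RECORDS)

# Component 2: combine the resulting sets with Boolean operations.
by_salton = lookup(INDEX, "AUTHOR", "salton")
about_feedback = lookup(INDEX, "SUBJECT", "feedback")
about_libraries = lookup(INDEX, "SUBJECT", "libraries")

salton_and_feedback = by_salton & about_feedback    # AND (intersection)
salton_or_libraries = by_salton | about_libraries   # OR (union)
salton_not_feedback = by_salton - about_feedback    # AND NOT (set difference)
```

Each record is either in a set or not; nothing in this scheme expresses how well a record matches.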

Boolean set schemes are limited by their nature. Documents (or whatever is contained in a database) may be members or not members of a particular set, with no gradation (fuzzy Boolean systems attempt to circumvent this difficulty). Furthermore, Boolean systems, like most other IR systems, are also limited by the quality of the work which produced the record. Because current practice in librarianship dictates that a bibliographic record is designed independently of any particular information need (per American Library Association, 1988), the effectiveness of Boolean or other systems that rely on such records is limited to the information seeker’s (or some intermediary or intermediary system’s) ability to translate his or her information need into the generic form used by the indexer.

In Boolean systems, there is no capability for iterative searching other than through continued refinement of sets or returning to previously constructed sets. That is, there is basically no process available except for the construction and combination of sets: If an information need is not being met, the information seeker’s only recourse is to attempt alternate set constructions.

Vector space systems circumvent the discrete set membership problem of Boolean systems by creating a continuous spatial mapping of the contents of a database (Salton & McGill, 1983). As implemented, they do not take relations among index terms into account (which could be useful for addressing limitations of accuracy in indexing and expressing information needs, as described below), except for the important task of developing term importance weights. Document membership in a retrieved set can, however, be assessed along a metric scale.
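To make the contrast with set membership concrete, the following sketch ranks documents by cosine similarity between weighted term vectors. It assumes one common weighting, term frequency multiplied by 1 + log(N/df), which the text does not specify; the documents and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical three-document collection.
DOCS = {
    "d1": "boolean retrieval of bibliographic records",
    "d2": "vector space retrieval with term weights",
    "d3": "cataloging rules for bibliographic records",
}


def idf_weights(docs):
    """Term importance weights: rarer terms receive larger weights (assumed formula)."""
    n = len(docs)
    df = Counter(term for text in docs.values() for term in set(text.split()))
    return {term: 1.0 + math.log(n / count) for term, count in df.items()}


def vectorize(text, idf):
    """Represent a text as a sparse vector of term frequency times term weight."""
    counts = Counter(text.split())
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Score every document against the query and sort by decreasing similarity."""
    idf = idf_weights(docs)
    q = vectorize(query, idf)
    scores = {doc_id: cosine(q, vectorize(text, idf)) for doc_id, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank("vector retrieval", DOCS)
```

Here d2 matches both query terms, d1 matches one, and d3 matches none, so the documents fall along a scale rather than into or out of a set.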

Vector space retrieval mechanisms facilitate relevance feedback (Oddy, 1977; Salton et al., 1985), an important advance for information needs which involve exploration or an otherwise unknown item search (see below). Vector spaces are limited by the accuracy and information need-dependent qualities of the indexing process. In vector spaces, weighting of terms (both positive and negative weights) may be utilized.
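One widely used way to implement relevance feedback in a vector space is a Rocchio-style query update. The text does not name a formula, so the sketch below is an assumed illustration; the parameter values `alpha`, `beta`, and `gamma` and the example vectors are hypothetical.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move a query vector toward relevant documents and away from non-relevant ones.

    All vectors are sparse dicts mapping term -> weight. The defaults for
    alpha, beta, and gamma are illustrative assumptions, not tuned values.
    """
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms.update(doc)

    updated = {}
    for term in terms:
        # Average weight of the term in each judged group (0.0 if the group is empty).
        pos = sum(d.get(term, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(term, 0.0) for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        updated[term] = alpha * query.get(term, 0.0) + beta * pos - gamma * neg
    return updated


# Hypothetical judgments from an information seeker.
query = {"retrieval": 1.0}
judged_relevant = [{"retrieval": 1.0, "feedback": 1.0}]
judged_nonrelevant = [{"retrieval": 1.0, "cataloging": 1.0}]

new_query = rocchio(query, judged_relevant, judged_nonrelevant)
```

The updated query gains a positive weight for a term seen only in the relevant document and a negative weight for a term seen only in the non-relevant one, which is the kind of positive and negative weighting mentioned above.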

Information space mechanisms incorporate term relationships in order to circumvent some limitations in the accuracy and generality of the indexing process. Information space mechanisms (Newby, 1996; Chen & Ng, 1995) include the capability for relevance feedback as found in vector space systems. Information spaces also offer additional means of seeking information through the visualization of the information space content. This is accomplished by selecting a subset of the available database content and producing a visualization of the content wherein relationships in the information space are presented. Vector spaces, by contrast, cannot be visualized effectively because their highly multidimensional structure cannot easily be reduced to the two or three dimensions required for visualization.

Information space mechanisms may be combined with psychometric approaches to better serve information needs by enhancing the space to include relations among documents as perceived by a particular information seeker or for a particular type of information need (Ingwersen, 1992).

Neural networks have been utilized for various purposes, with the overall purpose of dealing with ill-defined problem domains. For information systems, neural networks have the benefit of being trainable based on relevance feedback or other factors, such as document relations. Relatively little research on this area of application exists in the IR literature, although some commercial retrieval products report use of neural network technologies.

A difficulty with relevance feedback is that full training of a system may be impossible for the true variety of information seekers and their purposes. However, neural network approaches may nevertheless be utilized with success to identify relations among documents or document surrogates which may be missed by information or vector spaces and by traditional indexing.

Neural networks are probably best utilized as a retrieval mechanism for situations where an iterative information seeking process is anticipated.

Physical location-based systems make use of an information seeker’s previous experience with some physical phenomenon or object (e.g., Mark Pejtersen, 1992). That phenomenon, which is frequently a room or building, becomes the domain onto which a map of some information content is projected. This approach has great utility for purposes where the actual quantity of available information is limited, or for narrowing in on a goal for use with a different retrieval mechanism.

For example, an auto mechanic might make use of a diagram of a car and its subsystems in order to target a set of documents which might contain a suitable response to her information need. Familiarity with the physical domain as depicted in a car diagram could help to narrow the appropriate set of documents to be searched by another mechanism (perhaps a Boolean set mechanism) more quickly and precisely than could be accomplished by starting with the Boolean mechanism.

4. Information Needs

The role of information needs for constructing and evaluating information systems has always co-existed with IR system research (early examples include Taylor, 1968; Cuadra et al. 1967). However, IR systems typically do not have any method for taking information need into account. Other types of information systems have similar difficulties, although there are notable exceptions. For example, a common approach in expert systems is to put emphasis on identifying the information need by asking increasingly focused questions. In the expert systems scenario, the information seeker might not be fully aware of his or her information need, and therefore be unable to state it clearly.

Dervin (1986) and her colleagues have identified an array of information needs based on a physical movement metaphor with demonstrated utility in IR situations. Common types of information needs include: seeking references for a scholarly paper; getting ideas for a dissertation; seeking answers to reference-type questions, perhaps for a paper for a K-12 student; identifying the existence of a particular method or procedure, e.g., for a patent search or medical problem; and finding all instances that match a particular criterion, as found when creating a mailing list from a database.

Across these and other examples of information need, three major types emerge: expansive; narrowing; and targeted searches for information. In an expansive information need, the information seeker is trying to broaden his or her knowledge of some topic domain. A literature review is a good example of this, where the seeker in this scenario is happy to follow paths that may lead to new or unknown topics or areas, in the hopes that there will be useful outcomes from unexpected areas. IR systems do not serve expansive information needs well, but other types of information systems do. Systems that meet expansive information needs well include the thesaurus and newspaper (in electronic or non-electronic form), the encyclopedia, and the World Wide Web.

A narrowing information need exists when the information seeker has good a priori knowledge of what he or she is seeking, and is able to make incremental judgments about the suitability of findings. The relevance feedback model for IR systems fits this type of information need closely. A common non-IR scenario is ordering theater tickets or selecting from a menu: the information seeker might discard options at an increasing level of specificity until eventually arriving at a good solution. In the theater scenario, someone could first select a play, then select a week to see it, then a date, then a particular showing, and finally particular seats, perhaps seeking alternate solutions simultaneously ("Which night has the best seats available?").
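The theater scenario can be read as a sequence of filters applied to a shrinking set of options. The sketch below assumes a flat list of ticket records; the inventory, the field names, and the `narrow` helper are invented for illustration.

```python
# Hypothetical ticket inventory; every value here is invented for illustration.
TICKETS = [
    {"play": "Play A", "week": 1, "date": "Fri", "showing": "evening", "seat": "A1"},
    {"play": "Play A", "week": 1, "date": "Fri", "showing": "evening", "seat": "A2"},
    {"play": "Play A", "week": 1, "date": "Sat", "showing": "matinee", "seat": "B5"},
    {"play": "Play A", "week": 2, "date": "Fri", "showing": "evening", "seat": "A1"},
    {"play": "Play B", "week": 1, "date": "Fri", "showing": "evening", "seat": "C3"},
]


def narrow(options, field, value):
    """Keep only the options whose field equals the chosen value."""
    return [option for option in options if option[field] == value]


# Each step discards options at an increasing level of specificity.
steps = [("play", "Play A"), ("week", 1), ("date", "Fri"), ("showing", "evening")]
remaining = TICKETS
sizes = [len(remaining)]
for field, value in steps:
    remaining = narrow(remaining, field, value)
    sizes.append(len(remaining))

# The final choice is made among the seats that survive every step.
available_seats = [option["seat"] for option in remaining]
```

The list `sizes` records how many options survive each judgment, which makes the incremental character of a narrowing need visible.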

A targeted search occurs when the information seeker can identify a correct or suitable response without an additional process of narrowing or expanding. This does not imply that only one response is possible, although that could be the case. In an IR system, a targeted information need could result in an author/title search using a Boolean retrieval system. Information needs directed at other types of information systems could include, "What is Abraham Lincoln’s birth date?" and "Which mutual funds outperformed the market index last year?"

Many instances of use of information systems involve combinations of these major categories. Browsing, for example, involves alternating between expansive and narrowing strategies. The categories of expansive, narrowing, and targeted information needs are broad and have ill-defined boundaries, but should serve to focus the current discussion on the suitability of combinations of information need, data type, and retrieval mechanism.

5. Data Types

This work expands the notion of information retrieval to encompass various types of data in addition to bibliographic or full-text documents. Brief descriptions of some different types of data that might be the focus of information systems work follow. Suggested data types include bibliographic data and full-text documents, natural language responses, facts, and algorithmic outcomes.

Bibliographic data and full text documents are the typical focus of information retrieval work, and are readily understood. The most important quality of this type of data for retrieval is that information needs are generally not met by information when this is the type of data to be retrieved. In other words, a bibliographic citation or the full text of a document is often not what an information seeker wants. Rather, he or she is seeking an answer to a question, some clarification, or another type of information that might be contained within the document that the citation refers to.

With bibliographic data and full text documents, it may be up to the user to meet an information need based on what is retrieved. An information need of the sort, "What are the major bodies of theory to be presented in IR research during the 1980s," could be well-met by documents retrieved through a search of a database of bibliographic citations and full text. What is retrieved would not actually meet the need directly, although there is information contained within the documents or document surrogates which is useful and appropriate.

The addition of tags within a natural language document, such as are produced using SGML, could aid in retrieval (see the section on retrieval mechanisms above), and may additionally serve to provide a better match between an information need and the information contained in a document. Even so, any data to be retrieved are essentially in original form, and the suitability of a response for a particular information need will necessarily be constrained by that original form. The same limitation could be applied to hypertext or other ways of generating links or relationships among documents.

Natural language response refers to a data type created "on the fly" in response to an information need. Some expert systems provide good examples of this type of data generation: after an assessment of information need, an expert system may produce a response which was generated just for that need. This is different from a typical bibliographic retrieval system in that the document retrieved may not have existed previously (although various components for the document may have existed).

This type of data is most closely associated with systems that do not currently exist: artificially intelligent systems which can produce not only documents but also intelligent-sounding responses to queries in true natural language. But our current purposes do not require such sophisticated responses; it is sufficient to note that information systems are capable (in a limited fashion) of producing unique responses to information needs.

Unlike bibliographic citations or full text documents, natural language responses may be tailored to provide information to meet an information need, rather than pointers to documents or documents themselves that may, eventually, provide a suitable response to an information need.

Facts are typically found in the domain of database systems rather than information retrieval systems. For example, a bank’s interest rate for a certain transaction on a particular date might be retrieved based on a structured query using database management software. Facts are important, though, both as responses to information needs and as components with which to generate information, reports, natural language documents "on the fly" (as above), and so forth.
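The interest rate example can be sketched with a small relational table and a structured query. The schema, the stored rates, and the `lookup_rate` helper are hypothetical; only the standard-library `sqlite3` module is assumed.

```python
import sqlite3

# In-memory database holding hypothetical facts: one rate per transaction type and date.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rates (transaction_type TEXT, rate_date TEXT, rate_percent REAL)"
)
conn.executemany(
    "INSERT INTO rates VALUES (?, ?, ?)",
    [
        ("savings", "1996-01-15", 4.5),
        ("savings", "1996-02-15", 4.25),
        ("mortgage", "1996-01-15", 7.75),
    ],
)


def lookup_rate(connection, transaction_type, rate_date):
    """Return the stored rate for a transaction type on a date, or None if no fact exists."""
    row = connection.execute(
        "SELECT rate_percent FROM rates WHERE transaction_type = ? AND rate_date = ?",
        (transaction_type, rate_date),
    ).fetchone()
    return None if row is None else row[0]


savings_rate = lookup_rate(conn, "savings", "1996-01-15")
missing_rate = lookup_rate(conn, "savings", "1995-01-15")
```

The query either returns the stored fact or returns nothing; there is no ranking and no partial match, which is what distinguishes this kind of retrieval from the IR-style mechanisms discussed earlier.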

Unlike documents created "on the fly," facts are stored before a particular information need is involved. A world atlas is a good example of such a store of facts. "What is the capital of Estonia?" may be an expression of information need that can be well met by a fact collection.

Fact retrieval in a database management system is not trivial, but is nevertheless typically more straightforward than fact retrieval using a typical IR-style query interface. Such matters are important, and are discussed above in the section on retrieval mechanisms.

Algorithmic outcomes are responses based on some fixed set of steps performed on the expression of information need. Although some expert systems involve such processes, a better example is an arithmetic system. A housing contractor might query such a system in order to determine the quantities and types of materials required for a particular job.
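The contractor example might look like the following sketch, in which a fixed set of arithmetic steps turns a stated need (wall dimensions) into a response (a quantity of material). The sheet area and waste allowance are assumed defaults chosen for illustration, and `sheets_needed` is a hypothetical name.

```python
import math


def sheets_needed(wall_length_ft, wall_height_ft, sheet_area_sqft=32.0, waste_fraction=0.10):
    """Number of wallboard sheets required to cover one wall.

    Steps: compute the wall area, add an allowance for waste, divide by the
    area of one sheet, and round up because sheets are bought whole.
    The default sheet area and waste allowance are illustrative assumptions.
    """
    area = wall_length_ft * wall_height_ft
    return math.ceil(area * (1.0 + waste_fraction) / sheet_area_sqft)


long_wall = sheets_needed(20, 8)
short_wall = sheets_needed(12, 8)
exact_fit = sheets_needed(8, 8, waste_fraction=0.0)
```

Nothing is stored in advance here: every distinct pair of dimensions yields a response computed at the time of the request.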

Algorithmic outcomes are important to include here because, unlike the other data types discussed above, there is conceivably an infinite number of responses that can be generated.

This section has provided a brief summary of a variety of data types that may be accessed through information systems. This description is meant to be illustrative, not exhaustive, as other types and sub-types exist.

6. Typology

Based on the discussion above, we can consider different combinations of information need and data, and draw conclusions about which retrieval mechanism is best suited for that combination. Figure 1 shows the outcomes of such a process.

Figure 1: A 3 by 3 Table of Coarse-Level Relationships among Data Type, Retrieval Mechanism and Information Need (Empty boxes indicate unlikely system types.)


Slice a. Expansive information needs

Mechanism                        | Bibliographic Data | Facts  | Natural Language Responses
Boolean and database systems     | Better             | Worse  |
Vector & information spaces      | Better             | Worse  |
Expert systems & neural networks |                    | Worse  | Better

Slice b. Narrowing information needs

Mechanism                        | Bibliographic Data | Facts  | Natural Language Responses
Boolean and database systems     | Worse              | Better |
Vector & information spaces      | Worse              | Better |
Expert systems & neural networks |                    | Better | Worse

Slice c. Focused information needs

Mechanism                        | Bibliographic Data | Facts  | Natural Language Responses
Boolean and database systems     | Better             | Worse  |
Vector & information spaces      | Better             | Worse  |
Expert systems & neural networks |                    | Better | Worse


In Figure 1, the retrieval mechanisms are simplified, and presented with little discussion (as are the information needs and data types). Yet it may be seen that more specific examples may also be derived utilizing this method.
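Figure 1 can also be treated as a lookup structure. The sketch below transcribes the three slices into a nested dictionary and answers the question a designer might ask: for a given information need and data type, which mechanisms are rated "Better"? The dictionary layout and the function names are assumptions; the ratings are those given in the figure, and empty boxes are represented by missing keys. Slice c is labeled "focused" in the figure, corresponding to the targeted need described in Section 4.

```python
BOOLEAN = "Boolean and database systems"
SPACES = "Vector & information spaces"
EXPERT = "Expert systems & neural networks"

BIB = "Bibliographic Data"
FACTS = "Facts"
NLR = "Natural Language Responses"

# Ratings transcribed from Figure 1; an absent data type is an empty box.
FIGURE_1 = {
    "expansive": {
        BOOLEAN: {BIB: "Better", FACTS: "Worse"},
        SPACES: {BIB: "Better", FACTS: "Worse"},
        EXPERT: {FACTS: "Worse", NLR: "Better"},
    },
    "narrowing": {
        BOOLEAN: {BIB: "Worse", FACTS: "Better"},
        SPACES: {BIB: "Worse", FACTS: "Better"},
        EXPERT: {FACTS: "Better", NLR: "Worse"},
    },
    "focused": {
        BOOLEAN: {BIB: "Better", FACTS: "Worse"},
        SPACES: {BIB: "Better", FACTS: "Worse"},
        EXPERT: {FACTS: "Better", NLR: "Worse"},
    },
}


def rating(need, mechanism, data_type):
    """Return 'Better', 'Worse', or None for an empty box (an unlikely system type)."""
    return FIGURE_1[need][mechanism].get(data_type)


def better_mechanisms(need, data_type):
    """List the mechanisms rated 'Better' for a given need and data type."""
    return [
        mechanism
        for mechanism, ratings in FIGURE_1[need].items()
        if ratings.get(data_type) == "Better"
    ]


expansive_bibliographic = better_mechanisms("expansive", BIB)
focused_facts = better_mechanisms("focused", FACTS)
```

Holding the data type fixed and changing the need changes the answer, which is the interdependence the typology is meant to expose.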

Example combinations of information need, retrieval mechanism and data type:


Sample 1: Trivial mathematics

Sample 2: Question answering

Sample 3: Expanding knowledge

Sample 4: Narrowing knowledge (typical IR system goal)

Sample 5: Check existence

Sample 6: Get all citations


In these examples, the most important thing to notice is that there is a complete interdependence of the information need, the data type, and the retrieval mechanism. Changing any of them results in a loss (or gain) in the usefulness we can expect in the others. The least obvious of these is data type, where for a given information need (such as the need dealing with Abraham Lincoln), any variety of data types might address that need (for example, a bibliography of Abraham Lincoln; an encyclopedia; or a chronology of US history). Only after we have determined which data type will be used can we make the most intelligent decision about a retrieval mechanism.

In the modern environment of hundreds of various data types, and millions of separate data stores (if one is willing to consider individual books or databases as data stores), the selection of a retrieval mechanism based on data type is necessary for maximal success. This implies a critical need for additional retrieval mechanisms or steps in an information seeking process to identify a suitable data store to utilize.

7. Conclusion

This work is intended to assist system designers, evaluators, and information seekers. By considering the interdependent roles of the information need, the retrieval mechanism, and the data type, a maximally effective combination of different available mechanisms and data types may be identified for a particular information seeking instance. In the global information environment where emphasis is needed on efficient use of systems, and effective choices about which systems to use (or further develop), the proposed model could be useful for decision making.

System designers need to consider the role of alternative retrieval mechanisms and representations or storage models for data types, in order that systems can be effective across a range of information needs. Although this work has focused on common types of information needs, retrieval mechanisms, and data types, it is clear that there are areas for which good solutions do not exist. Such areas might rely on natural language parsing, artificial intelligence, or producing responses which are derived on the fly from disparate data stores.

For users of retrieval systems, this analysis might help explain why many types of information needs are not met effectively by current IR systems. Based on the relation among data type, retrieval mechanism and information need proposed here, information seekers may choose to utilize a wider variety of mechanisms or data types depending on the nature of the information need and what data and mechanisms are available.

References

American Library Association. 1988. Anglo-American Cataloging Rules 2nd Ed. Chicago: American Library Association.

Barry, Carol. 1994. User-Defined Relevance Criteria: An Exploratory Study. Journal of the American Society for Information Science 45(3): 149-159.

Borgman, Christine L.; Hirsh, Sandra G.; Gallagher, Andrew L. 1995. Children’s Searching Behavior on Browsing and Keyword Online Catalogs: The Science Library Catalog Project. Journal of the American Society for Information Science 46(9): 663-684.

Callan, J.P.; Croft, W.B.; Broglio, J. 1995. TREC and TIPSTER Experiments with Inquery. Information Processing and Management 31(3): 327-343.

Chen, H. & Ng, T. 1995. An Algorithmic Approach to Concept Exploration in a Large Knowledge Network. Journal of the American Society for Information Science 46(5): 348-369.

Cuadra, Carlos et al. 1967. Technology and Libraries. Bethesda: ERIC.

Dervin, Brenda. 1986. Neutral Questioning: A New Approach to the Reference Interview. RQ 25: 506-513.

Dervin, Brenda & Nilan, Michael S. 1986. Information Needs and Uses. In: Williams, Martha E. (Ed.). Annual Review of Information Science and Technology 21: 3-33. Medford, NJ: Learned Information.

Ingwersen, Peter. 1992. Information Retrieval Interaction. London: Taylor Graham.

Mark Pejtersen, Annalise. 1992. New Model for Multimedia Interfaces to Online Public Access Catalogues. Electronic Library 10(6): 359-386.

Newby, Gregory B. 1996. Metric Multidimensional Information Space. In Harman, Donna (Ed.). Proceedings of TREC-5. Gaithersburg, MD: NIST.

Newby, Gregory B.; Nilan, Michael S.; Duvall, Lorraine M. 1991. Towards a Reassessment of Individual Differences for Information Systems: The Power of User-Based Situational Predictors. Proceedings of the American Society for Information Science Annual Meeting 28. Medford, NJ: Learned Information.

Oddy, Robert N. 1977. Information Retrieval through Man-Machine Dialogue. Journal of Documentation 33: 1-14.

Pao, Miranda. 1989. Concepts of Information Retrieval. Englewood, Colorado: Libraries Unlimited.

Salton, Gerard; Fox, Edward A.; Voorhees, Ellen. 1985. Advanced Feedback Methods in Information Retrieval. Journal of the American Society for Information Science 36(3): 200-210.

Salton, Gerard; McGill, Michael J. 1983. Introduction to Modern Information Retrieval. New York: McGraw Hill.

Schamber, Linda; Eisenberg, Michael E.; Nilan, Michael S. 1991. Towards a Dynamic, Situational Definition of Relevance. Information Processing and Management 26(2): 755-776.

Sparck Jones, Karen. 1981. Information Retrieval Experiment. London: Butterworths.

Taylor, Robert S. 1968. Question-Negotiation and Information Seeking in Libraries. College and Research Libraries 29(1): 178-191.