Vol.30 No.3, July 1998

The User Interface in Text Retrieval Systems

Offer Drori

Introduction

Text Retrieval Systems and Data Processing Systems

Information Unit

Information Format

Retrieval Operations

How the User Interface Works

Design of the User Interface

Defining the Search

Advantages and Disadvantages

Defaults for the Search Definition

Processing a List of Items

List of Titles

List of Titles + Beginning of Document

List of Titles + Sections that Match the Search Criteria

Introduction

Ever since the advent of on-line computer systems, the development of the user interface has been a key issue in research and applications. The main function of the user interface is to mediate between the system operator and the computer programs that run the information systems.

The nature of the interface required by information systems, and the functions it must provide to the user, have been studied extensively. Some design rules apply to any information system, while others are characteristic of specific types of information systems only (MicrosoftG, 1995) (Galitz, 1994) (MicrosoftU, 1995) (Shneiderman, 1998).

The aim of this article is to define the user interface characteristic of text retrieval systems, which generally differ in nature from other information systems.

Text Retrieval Systems and Data Processing Systems

In order to understand the special needs of the user interface in text retrieval systems, a distinction must first be made between such systems and standard data processing systems. The differences between them, which are expressed in several spheres, create the need for a special user interface for text retrieval systems (Drori1, 1997) (Drori2, 1997).

Information Unit

The unit of information in data processing is generally a field within a record. In text retrieval, the unit of information is a document. The definition of the document varies from one system to another, but what they all have in common is a segment of text of variable length dealing with a single logical subject. A document is generally an independent entity, though there may be many interlinked documents in a given repository.

Information Format

In data processing, the information is in the form of raw data presented in one of several ways, as numbers, codes, etc. In text retrieval, the information is in the form of text segments containing free text, numbers, tables, etc.

Retrieval Operations

The retrieval operation in data processing is usually straightforward and unequivocal, such as "retrieval of a record with key X". The outcome of such a retrieval operation is usually a singular result: the desired record.

In text retrieval, the retrieval operation is complex, and usually difficult to formulate and define. The response, too, is a general one, usually comprising the display of all system documents likely to contain a match of the request. In comparison, the outcome in data processing is a final and complete result.

In text retrieval, the result is partial: the answer to the query may be in the list of documents meeting the search criteria, but there may be additional answers in the repository that were not included in the list of documents generated in the output. The result obtained from text retrieval generally requires further processing, whether manually (by the user reading the relevant documents) or by computer (by filtering the results obtained).
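The contrast between the two retrieval operations can be illustrated with a minimal sketch. The record table, the document repository, and the function names `lookup_record` and `search_documents` are all hypothetical and serve only as an illustration; a real system would be far more elaborate.

```python
# Hypothetical data: a keyed record table and a small document repository.
records = {
    1001: {"name": "Cohen", "balance": 250},
    1002: {"name": "Levi", "balance": 90},
}

documents = {
    "doc1": "Annual report on interface design for retrieval systems",
    "doc2": "Notes on menu design in graphical interfaces",
    "doc3": "Designing screens and dialogs for on-line users",
}


def lookup_record(key):
    """Data processing: one key yields one record, or None if absent."""
    return records.get(key)


def search_documents(word):
    """Text retrieval: return the ids of every document containing the word."""
    word = word.lower()
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if word in text.lower().split()
    )


print(lookup_record(1001))         # a single, complete answer
print(search_documents("design"))  # a list of candidates to be reviewed
```

In this sketch the search for "design" lists doc1 and doc2 but not doc3, which contains only the form "Designing". The third document may well be relevant, which is one way in which the result of a text search is partial.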

How the User Interface Works

Data processing is characterized by a one-time operation per question: a discrete query whose result is the answer to the query.

In text retrieval, the work process is a lengthier one, commonly involving interaction between the user and the system until the desired result is obtained.

To sum up, in light of the above, text retrieval is clearly more complex and problematic than the standard data processing operation.

Design of the User Interface

The nature of the way in which text retrieval systems are used, together with the nature of databases, necessitates design of a special interface. In standard information systems, the user interface generally handles the selection of the subject to be processed, whether this involves the processing of a discrete item or processing of a list of items. In most cases, the subject is selected by the user via a menu-driven interface, which can be implemented in different ways, based on the rules of menu design for a graphical interface.

Discrete items are processed by displaying the information fields for the item on the screen for viewing or updating. Processing of lists involves the user defining the search criteria for the information base, execution of the search operation according to the criteria, and generation of a list of items that meet the search criteria.

Definition of the search criteria and processing the list of items constituting the results of the search are the key factors differentiating the user interface in standard information systems from the user interface in text retrieval systems.

Defining the Search

Defining the search in standard information systems generally involves the input of a key (usually numerical) for execution of the search. In text retrieval systems, the search is generally defined by text (one word or more), with Boolean operators between the words. This operation can be performed in a number of ways:

On a computer-guided form that enables the user to enter words into data fields, one data field per word, and to select conventional symbols on the screen to define the Boolean expression between the words.
By entering the search words as free text and entering the operators either as reserved words (such as "greater than") or as the equivalent conventional mathematical symbols (such as the ">" sign).
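The first of these two methods, the computer-guided form, might be modeled as follows. This is a sketch under stated assumptions: each form row is represented as a hypothetical (operator, word) pair, the operator names "AND", "OR" and "AND NOT" are illustrative, and rows are combined strictly from top to bottom with no precedence or parentheses.

```python
def matches_form(rows, text):
    """Evaluate a form-style query against one document.

    `rows` is a list of (operator, word) pairs, one per form row.
    The operator of the first row is ignored. Rows are combined from
    top to bottom, which is an assumption made for this sketch.
    """
    words = set(text.lower().split())
    result = None
    for operator, word in rows:
        present = word.lower() in words
        if result is None:
            result = present
        elif operator == "AND":
            result = result and present
        elif operator == "OR":
            result = result or present
        elif operator == "AND NOT":
            result = result and not present
        else:
            raise ValueError(f"unknown operator: {operator}")
    return bool(result)


form = [("", "text"), ("AND", "retrieval"), ("AND NOT", "menu")]
print(matches_form(form, "Text retrieval systems"))
print(matches_form(form, "Text retrieval by menu"))
```

Because every row carries exactly one word and one operator, the user never has to type query syntax, but expressions that need grouping cannot be entered on such a form.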

Advantages and Disadvantages

Defining a search by means of a computer-guided form is relatively easy and very convenient for the less proficient user. It requires the entry of the necessary words and selection of the appropriate Boolean symbol. This method can be further refined by graphically displaying, adjacent to the user definition, the significance of the search operation being built.

The main disadvantage of this approach is its lack of flexibility for defining complex searches. Theoretically, generating a form with all possible combinations for the search is possible. In practice, however, this would create a complicated and cumbersome form, and the main advantage of simplicity would be lost. As a result, the use of computer-guided forms is largely restricted to relatively simple searches.

Searches based on free text excel in the rich possibilities and absence of restrictions that they provide for more complex searches. Free text generally enables a lengthier search definition strategy (even several lines long), including different levels of parentheses, and the use of more Boolean expressions. The most obvious disadvantage of this method is the need for extensive knowledge of the search language (which differs for each such type of system), as well as the intricacies involved. As in any system that enables numerous combinations, this system is complex and best suited to proficient users who require a complex and sophisticated search strategy.
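To make the free text approach concrete, the following is a minimal sketch of a parser for a small, hypothetical search language. It assumes upper-case reserved words AND, OR and NOT, nested parentheses, and the usual precedence (NOT binds tightest, then AND, then OR). Real search languages differ from system to system, as noted above.

```python
import re


def evaluate(query, text):
    """Evaluate a free-text Boolean query against one document."""
    words = set(text.lower().split())
    # Parentheses are separate tokens; anything else is a word or operator.
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            advance()
            right = parse_and()  # always parse, even if value is already True
            value = value or right
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            advance()
            right = parse_not()
            value = value and right
        return value

    def parse_not():
        if peek() == "NOT":
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        token = peek()
        if token is None or token in (")", "AND", "OR"):
            raise ValueError("malformed query")
        advance()
        if token == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        return token.lower() in words

    result = parse_or()
    if peek() is not None:
        raise ValueError("unexpected token: " + peek())
    return result


print(evaluate("(interface OR menu) AND NOT database", "user interface design"))
```

Even in this small sketch the user must know the reserved words, the precedence rules and the correct use of parentheses, and must cope with error messages when the query is malformed. This is the knowledge burden that makes the approach better suited to proficient users.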

In sum, the average user will generally prefer the computer-guided form, and proficient users the free text approach. A system that provides both options will obviously cater to a wider group of users.

Defaults for the Search Definition

The search definition takes account of the requested words as well as the search criteria defined by the user. In addition to the user's explicit definition, system defaults must be defined for cases in which no symbols are specified between the requested words. Should the default be restrictive, so that the output comprises only answers that exactly match the defined criteria, or should it be broad, generating as large a set of answers as possible? For example, should the system search for the word exactly as it was entered, or treat it as a root to be expanded with prefixes or suffixes?
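The effect of such defaults can be sketched as follows. The class `SearchDefaults` and its two settings are hypothetical names chosen for illustration: one setting decides whether words entered without an operator are joined by an implicit AND or an implicit OR, the other whether a word is matched exactly or treated as the beginning of longer words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchDefaults:
    require_all_words: bool = True  # True: implicit AND; False: implicit OR
    expand_word: bool = False       # True: also match longer words starting with it


def word_matches(term, words, expand_word):
    if expand_word:
        return any(w.startswith(term) for w in words)
    return term in words


def search(query, documents, defaults=None):
    """Return ids of documents matching a query with no explicit operators."""
    defaults = defaults or SearchDefaults()
    terms = query.lower().split()
    combine = all if defaults.require_all_words else any
    hits = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        if terms and combine(
            word_matches(t, words, defaults.expand_word) for t in terms
        ):
            hits.append(doc_id)
    return hits


docs = {
    "a": "interface design",
    "b": "interfaces for retrieval",
    "c": "menu design",
}
print(search("interface design", docs))
print(search("interface design", docs, SearchDefaults(False, True)))
```

With the restrictive defaults only document "a" is returned; with the broad defaults all three documents are returned for the same two words.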

Naturally, the answer to this question requires familiarity with the system, familiarity with the organization or the users operating the system, and most important of all, familiarity with the database used by the retrieval system. It can be assumed that the larger the repository, the more restrictive the defaults should be, in order to provide a reasonably small number of results for review. In contrast, we are all familiar with situations in which it is in the interest of the service provider to display as many results as possible, as in the case of a number of search engines currently on the Internet (due to the element of competition).

Processing a List of Items

In standard information systems, a list of items generally consists of a list of records that share a common key, with each record representing a different event in the system. Sometimes, the list represents records with different keys, with the list sorted according to the main key, helping the user to find the specific record desired.

List of Titles

In text retrieval systems, the list consists of a set of documents that match the defined search criteria. Since all the items in the generated list are potentially the requested answer, the user is required to check all the items in order to find the one that meets his/her requirements.

The list presented by standard text retrieval systems consists of document titles; where a title does not provide sufficient indication of the document's content, the user is meant to enter and view the document. The problem with this type of solution is that the document title is insufficient in many cases, obligating the user to enter each item individually in order to decide whether the document is relevant or not. This approach, in which the user is obliged to view each and every document, is tedious and time-consuming.

List of Titles + Beginning of Document

More advanced systems provide, in addition to the title, the first few lines of the document. The assumption is that the beginning of the document generally provides additional details about the document content, enabling the user to reach a decision as to its relevance without having to peruse the full text of the document (not to mention the relative technical ease of implementing this approach). This solution has been adopted by most search engines on the Internet. However, while it reduces the need for the user to enter each and every document, it does not suffice, for two reasons. One is that the beginning of the document often contains identifying details that do not disclose much about its content; the other is that the first few lines are often not enough to understand the connection between the search and the content of these lines.
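A results-list entry of this second kind might be produced along the following lines. The function name `title_with_opening` and the sample memo are hypothetical, and the sketch assumes the document body is available as plain text.

```python
def title_with_opening(title, body, max_lines=2):
    """Build a list entry from the title plus the first non-empty lines."""
    lines = [line.strip() for line in body.splitlines() if line.strip()]
    return [title] + lines[:max_lines]


memo = (
    "Ref. no. 98-114\n"
    "Prepared by the systems group\n"
    "The retrieval interface shows matching lines.\n"
    "Appendix follows."
)
for line in title_with_opening("Memo", memo):
    print(line)
```

For this sample the entry consists of the title, a reference number and the name of the originating group, none of which tells the user why the document matched the search.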

List of Titles + Sections that Match the Search Criteria

In order to overcome this problem, a third approach has been developed, which, in addition to the document title, also displays several lines -- as in the case of the second approach -- except that these are not the first lines of the document, but rather, lines in a section of the document that match the search criteria. This means that, if a certain document is included in the results list because a particular Boolean expression has been met, the lines in the body of the document that meet the Boolean expression are displayed. This makes it highly probable that the user will be able to determine the relevance of the document from the section of the document that matches the search criteria, and is likely to significantly reduce the user's need to read the full text of numerous documents in the results list.
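The third approach can be sketched in the same style. As a simplifying assumption, the sketch below treats a line as matching when it contains any of the search words, rather than evaluating a full Boolean expression against each section; the function name and the sample memo are hypothetical.

```python
def title_with_matching_lines(title, body, search_words, max_lines=2):
    """Build a list entry from the title plus lines containing a search word."""
    wanted = {w.lower() for w in search_words}
    matching = []
    for line in body.splitlines():
        # Strip common punctuation so that "lines." still matches "lines".
        line_words = {w.strip('.,;:()"').lower() for w in line.split()}
        if wanted & line_words:
            matching.append(line.strip())
    return [title] + matching[:max_lines]


memo = (
    "Ref. no. 98-114\n"
    "Prepared by the systems group\n"
    "The retrieval interface shows matching lines.\n"
    "Appendix follows."
)
for line in title_with_matching_lines("Memo", memo, ["retrieval"]):
    print(line)
```

Here the entry shows the line that caused the document to be retrieved, rather than the identifying details at the top of the memo.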

This approach, in which a few lines located in a section of the document that match the search criteria are presented together with the title, was attempted in a working text retrieval system with a large number of users. It was found that the user's capacity to decide the relevance of a document based on these few lines was greater than in the case of the former two approaches, i.e., in which the user receives the title only, or alternatively, the title plus the first few lines of the document. The advantage was even greater in information bases containing full text documents rather than indexed documents. The reason for this is that, in the case of indexed documents, keywords are added at the beginning of the document, so that displaying the first few lines of a document is relevant to understanding its content, while in the case of non-indexed documents, the first few lines do not necessarily divulge the content of the document.

The sole disadvantage of this approach lies in the technical aspects of its implementation. Displaying the document title is simple enough and does not require any sophistication. Displaying the document title plus the first few lines is more complex, since it requires reading the first few lines of each document in the course of generating the list of results, a function that requires greater resources.

In the third approach, building the list of results is even more complicated, since, in addition to generating a list based on the search criteria, the system must read the documents in their entirety (not only the first few lines as in the second approach) and allow only those sections that meet the search criteria to pass the filter. This approach must also deal with situations in which the search criteria are met by more than one section in the document, and must provide a way to display them all.
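One possible way of handling several matching sections in a single document is sketched below. The parameters `context` (lines shown around each match) and `max_sections` (a cap on the number of sections displayed) are assumptions introduced for this illustration, as is the rule that neighbouring matches are merged into one section.

```python
def matching_sections(body, search_words, context=1, max_sections=3):
    """Scan the whole document and return its matching sections as text."""
    wanted = {w.lower() for w in search_words}
    lines = body.splitlines()
    hit_indexes = [
        i
        for i, line in enumerate(lines)
        if wanted & {w.strip('.,;:()"').lower() for w in line.split()}
    ]
    # Merge hits whose context windows overlap or touch into one section.
    sections = []
    for i in hit_indexes:
        start = max(0, i - context)
        end = min(len(lines), i + context + 1)
        if sections and start <= sections[-1][1]:
            sections[-1] = (sections[-1][0], end)
        else:
            sections.append((start, end))
    return ["\n".join(lines[s:e]) for s, e in sections[:max_sections]]


report = "\n".join([
    "Introduction to the archive.",
    "The index is rebuilt nightly.",
    "Nothing relevant here.",
    "More filler text.",
    "Still filler.",
    "Users query the index by word.",
    "End of document.",
])
for section in matching_sections(report, ["index"], context=0):
    print(section)
```

Unlike the second approach, every line of the document has to be examined before the entry for that document can be built, which is the additional cost described above.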

Although this operation is technically more complicated, proper formulation of the search algorithms will achieve the desired results within acceptable response times, without the system user sensing a delay that would hinder the work process.
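The article does not say which algorithms were used. As one illustration only, and not a description of the system discussed here, a word index that records the line on which each word occurs would let the matching lines be fetched without rereading whole documents at query time. The function names and the representation of a document as a list of lines are assumptions of this sketch.

```python
from collections import defaultdict


def build_line_index(documents):
    """Map each word to the (doc_id, line_number) pairs where it occurs.

    `documents` maps a document id to a list of text lines.
    """
    index = defaultdict(list)
    for doc_id, lines in documents.items():
        for line_no, line in enumerate(lines):
            # A set ensures each word is recorded once per line.
            for word in {w.strip('.,;:()"').lower() for w in line.split()}:
                if word:
                    index[word].append((doc_id, line_no))
    return index


def lines_for_word(index, documents, word):
    """Fetch the matching lines by direct lookup in the index."""
    return [
        (doc_id, documents[doc_id][line_no])
        for doc_id, line_no in index.get(word.lower(), [])
    ]


docs = {
    "d1": ["Title page", "The index is rebuilt nightly."],
    "d2": ["Users query the index."],
}
index = build_line_index(docs)
for doc_id, line in lines_for_word(index, docs, "index"):
    print(doc_id, line)
```

The cost of scanning the documents is paid once, when the index is built, rather than on every search.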

Summary

We have shown that the design of the user interface for text retrieval systems is distinctive in two main respects: definition of the search criteria, and handling of the resulting list of items that meet the criteria. We have reviewed the advantages and disadvantages of defining search criteria using free text versus the computer-guided form. We also demonstrated that a list of items comprising the title, and additionally several lines from a section in the document that meets the search criteria, can save the user a large amount of work, and facilitate the use of a system of this type.

References

Drori1, 1997 Drori, Offer, "Search Engines on the Internet", in: the Special Interest Group on Text Retrieval Systems (SIGTRS) Newsletter, Or-Yehuda: SPL WorldGroup, February 1997, Vol. C, No. 1.

Drori2, 1997 Drori, Offer, "Integration of Text Retrieval Systems and Conventional Information Systems", in: the Special Interest Group on Text Retrieval Systems (SIGTRS) Newsletter, Or-Yehuda: SPL WorldGroup, July 1997, Vol. C, No. 2.

Galitz, 1994 Galitz, Wilbert O. It's Time to Clean Your Windows. New York: John Wiley & Sons, Inc., 1994.

MicrosoftG, 1995 The Windows Interface Guidelines for Software Design. Redmond, WA: Microsoft Press, 1995.

MicrosoftU, 1995 User Interface Design for Microsoft Windows 95, Microsoft Corporation, 9.95.

Shneiderman, 1998 Shneiderman, Ben, Designing the User Interface: Strategies for Effective Human-Computer Interaction. 3rd ed. Reading, Massachusetts: Addison-Wesley, 1998.

About the Author

Offer Drori teaches information systems at the School of Business Administration and at the School of Library, Archive and Information Studies of the Hebrew University. He is head of information database systems at SHAAM Information Systems and Chairman of SIGTRS (Special Interest Group on Text Retrieval Systems) of SPL WorldGroup (Israel).

Author's Address

SHAAM Information Systems
Poaly zedek St. #4
Jerusalem
Israel 93420

E-mail: offerd@shum.huji.ac.il
