

THE DEVELOPING CONCEPT OF IR

This century's transformation of information retrieval, from card and printed indexes, via various forms of mechanization, to computer-based systems, has been one continuous process of re-evaluation. In some sense, to pick up the story at the point where, more-or-less simultaneously, computers were invented and the Journal of Documentation started, is to begin in the middle. In particular, there is some difficulty with the terminology, which was already in transition at the time computers began to be used; the use of computers then accelerated the process.

Take, for example, the word `indexing'. Traditionally, it means `creating an index', or perhaps entering items in an (existing) index; the index was a thing which the searcher could see. Nowadays, it is much more likely to mean the assignment of index terms to an item, whatever subsequently happens to them -- and indeed the searcher may have no idea what mechanisms are involved when he or she puts a request to a system. Sometimes the term is used to describe an internal mechanism -- the process of generating an inverted file (in whatever form the input comes). If the system is free-text, then the process of generating an inverted file includes the process of assigning index terms (even if only trivially); if the system involves human indexing, then the two processes are distinct. On the other hand, if we are talking about machine-generated printed indexes, then the distinction is more complex.

Some such terminological problems were becoming apparent at earlier stages of mechanization. Furthermore, as the example shows, the terminological problems reflect more fundamental problems about distinguishing different functions and different parts of processes in IR, given the changing technological context.


Indexing

Human indexing for computer retrieval

The question of whether and what kind of human input is required for a computer retrieval system, beyond the simple bibliographic record, abstract and/or text of the document, is clearly a vexed one. Some authors adapted an existing form of human indexing (or classification) to machine operation (e.g. Caless and Kirk [48] with UDC, or Hines [49] with LC and Dewey codes); others attempted to assess the value of including human indexing (e.g. Olive, Terry and Datta [78], Barker, Veal and Wyatt [70]). Later, there were some authors who considered quite different kinds of human indexing, on the assumption that they would be used in computer systems (e.g. Kircz [140] with rhetorical structures).

Automatic indexing

Whether or not there has been some intellectual input, the question of what machine manipulation of the input record is required at the input stage is another subject of much discussion. Many authors considered automatic indexing processes which might be regarded as equivalent to, or perhaps alternatives to, human indexing. I have already referred to the early book by Maron [28] discussing the possibility of using statistical data as the basis for an automatic indexing process -- an idea that was to be reiterated many times. Another early contribution, by Artandi [32] (in fact the first Journal paper to discuss a computer program in some detail), discussed the detection of proper nouns in text, by means of rules specifying textual patterns that indicate the presence of proper nouns. (At a conference I attended in late 1993, a paper discussed the automatic detection in text of citations, with particular reference to author names. Plus ça change!) Later, Artandi described a project in which similar rules were used for subject indexing of text [56]; this was interesting for coming long before the idea of rule-based expert systems became so common.
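A toy sketch may make the flavour of such rule-based detection concrete. The single rule shown here (a capitalised word that is not sentence-initial is taken to be a proper noun) is purely illustrative, not Artandi's actual rules:

```python
import re

def proper_nouns(text):
    """Detect candidate proper nouns by a single illustrative rule."""
    nouns = []
    for sentence in re.split(r"[.!?]+\s*", text):
        words = sentence.split()
        # rule: a capitalised word that is not sentence-initial is
        # taken to be (part of) a proper noun
        for w in words[1:]:
            if w[:1].isupper():
                nouns.append(w.strip(",;:"))
    return nouns
```

Even this trivial rule illustrates the characteristic failure mode of the approach: a proper noun that happens to open a sentence is missed, and further rules must be piled on to recover it.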

Dillon and McDonald [111] demonstrated a method of book indexing, using dictionary-based syntactic tagging of words, rule-based identification of multi-word content-bearing units (concepts), and grouping of concepts. Field [87] and Robertson and Harding [115] devised methods by which the system would learn, from a training sample, what index terms a human would assign to a document, given some other information about that document (such as free-text terms). Rada et al. [126] discussed augmenting a thesaurus with additional entry terms in order to improve automatic indexing.

Several authors discussed automatic methods which bear some resemblance to indexing, but do not result in conventionally indexed documents with direct subject descriptions. For example, Salton described an experiment with citation indexing [63]; Martyn [38] defined bibliographic coupling (linking items by citations-in-common). Griffiths, Robinson and Willett [114] and Enser [117] derived clusters of documents (classification without labels). Needham and Sparck Jones [33] clustered the terms already used to index documents (by some kind of free-text process); again, the clusters themselves do not have labels. Clustering is discussed further in the section on retrieval system theory, below.

Free text

The idea that one might do away with anything that looks like an intellectual indexing step altogether, by using free-text records, is an attractive one from an economic point of view, though it only begins to make any kind of sense once the move from punched cards to computers is established. Most early writers assumed that some form of indexing was required; as an intermediate step, Shaw and Rothman [53] proposed that words should be chosen from the text by a human indexer, but not controlled in any way. Gralewska's [54] system included title words as well as controlled-language indexing. But by the early seventies, the possibilities of free text were being explored, and the free-text versus controlled indexing debate was in full spate. For example, Barker, Veal and Wyatt [70] compared searching of titles and abstracts with index terms; similarly Hersey [66], Olive, Terry and Datta [78]. Although the frequency has declined, there are still some papers on the same lines (for example Cousins [143]).

Actually the distinction between free-text systems and automatic indexing is decidedly fuzzy. Consider the following steps, some of which are present in all free-text systems, while others are used only in some. Although all might be regarded as elementary or trivial from the point of view of traditional human indexing, they must nevertheless be treated as some sort of indexing operation.

(a) Free-text indexing from part of the item only (e.g. title or abstract). This is clearly making use of someone else's selection of the important words to describe an item.

(b) Word identification. There must be a set of rules for this, dealing not only with word separators such as blank characters and punctuation, but also with upper-lower case, embedded hyphens or hyphens at the end of lines, numbers etc.

(c) Stop-lists. Most systems identify and exclude certain common words (the list is usually manually prepared).

(d) Stemming or suffix stripping.

(e) Dictionary operations such as identification of phrases, acronyms, synonyms.

(f) Inverted file generation. In some sense, this step subsumes all the above. However, it also has its own built-in effects on the later searching stage which might force it to be regarded as a form of indexing in its own right: for example, the traditional inverted index makes it easy to do right-hand-truncation at the searching stage, but much more difficult to allow left-hand-truncation.
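Steps (b) to (f) can be sketched in a few lines of code. The word pattern, the stop-list and the suffix list shown here are hypothetical examples for illustration, not the rules of any particular system:

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "of", "and", "a", "in", "to"}   # (c) manually prepared
SUFFIXES = ("ing", "ed", "es", "s")                  # (d) crude suffix stripping

def words(text):
    # (b) word identification: fold case; treat anything that is not a
    # letter, digit or embedded hyphen as a separator
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_file(documents):
    # (f) inverted file: term -> sorted list of document identifiers
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for w in words(text):
            if w in STOP_WORDS:
                continue                             # (c) stop-list
            index[stem(w)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}
```

Note that because the index is keyed on terms, right-hand truncation at search time amounts to a scan over a contiguous range of sorted keys, whereas left-hand truncation would require examining every key -- the asymmetry remarked on under (f).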

It is, of course, possible to have a system which does none of these things -- for example, the searching facilities provided in word-processing packages, which involve serial scanning of text. Furthermore, one could probably produce anecdotal arguments against any of them -- for example, although case folding is clearly in general a good thing, the distinction in a medical database between AIDS and (hearing) aids is one that one would like to maintain! But it is generally assumed that even the most minimal retrieval system must include some of these steps. The reasons for this assumption are of two quite different kinds:

(i) Efficiency: There is little hope of providing the kind of speed of search required, on the kind of size databases required, without using inverted files. Some of the above steps are simply necessary for inverted files; some are desirable from the point of view of inverted file size.

(ii) Effectiveness: Many or all of these steps can (to a greater or lesser extent) be justified on the grounds of providing better retrieval performance.

These two reasons have become very firmly intertwined, and it is now difficult to disentangle them. So the possibility (if not now, then in the future) that technological developments such as parallel processing will render the first reason obsolete clouds the issue greatly. Not that it is obvious that the first reason will necessarily become obsolete -- although the size and power of computers is increasing at a phenomenal rate, so too is the size of the databases.

Hutchins [50] did propose a system that would work without indexing (the principle was that at search time, all possible variants of search terms as they might appear in text would be generated, and matched against the raw text).3

However, if one were trying to implement his system, one would almost certainly use an inverted file for efficiency reasons, and in the process do at least (b) above. Sharp [85] appealed for the term `natural language' to be used only where the language of the document is not changed at all (though it is not clear whether he would exclude even case conversion).


Searching

Boolean logic

As indicated above, the transition from punched-card systems (which were designed to allow for post-coordination) to Boolean search logic in computer-based systems was a fairly natural one, and from an early stage it was assumed that searching would involve Boolean logic. Fairthorne's discussion of the subject in a review [29] has already been mentioned. Authors have continued to write about the use of Boolean logic (sometimes including those extensions to the logic which text retrieval has induced almost without noticing, such as the implied-OR in truncation and explosion, and term adjacency and proximity operators). For example, Harley, le Minor and Weil [75] discussed a universal search formulation language; Barraclough [93] surveyed work on online retrieval; Dillon and Desper [103] developed a method for the automatic modification of Boolean queries following relevance feedback (see below); Radecki [105] [109] discussed the relation between Boolean and weighted retrieval; Vickery et al. [122] described an expert system which manipulates Boolean search statements. Caless and Kirk [48] extended Boolean logic in a different way when searching on UDC codes, by adding order operators.
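Boolean searching over an inverted file, including the implied-OR of right-hand truncation mentioned above, reduces to set operations on posting lists. The tiny index and the function names below are illustrative assumptions:

```python
# a miniature inverted file: term -> set of document identifiers
index = {
    "retriev": {1, 2, 4},
    "retrieval": {1, 4},
    "index": {2, 3},
    "boolean": {1, 3},
}

def postings(term):
    return index.get(term, set())

def truncate(prefix):
    # right-hand truncation: an implied OR over every indexed term
    # beginning with the given prefix
    result = set()
    for term, docs in index.items():
        if term.startswith(prefix):
            result |= docs
    return result

# the search statement (boolean AND retriev*) NOT index, expressed as
# intersection, union-via-truncation and difference of posting sets
hits = (postings("boolean") & truncate("retriev")) - postings("index")
```

The dichotomous character of the approach is plain: a document is either in `hits` or it is not, with no ordering among those retrieved.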

Associative methods

The alternative to Boolean and similar methods for search statement construction is to use some kind of associative method such as search term weighting. The most obvious difference between the two approaches is that Boolean-type searches are dichotomous: an item is either retrieved or not. Associative methods tend to be used to rank items retrieved, so that the items which match the search statement best are at the top of the ranking. In this respect, associative methods are feasible only in computer systems. Robertson and Belkin [97] discuss the principles of ranking.

Associative methods also tend to make use of obviously statistical information such as term occurrence or co-occurrence or frequency within or between documents etc. However, there is recent interest in making use also of term position information in text, which has usually been seen in Boolean systems (Keen [137]).
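A minimal sketch of an associative method may clarify the contrast with Boolean retrieval: documents are scored against the query by summing term weights, then ranked. The inverse-document-frequency weight used here is one standard choice for illustration, not a method attributed to any of the papers cited above:

```python
import math
from collections import Counter

# toy collection: document identifier -> list of (already indexed) terms
docs = {
    1: ["term", "weighting", "ranks", "documents"],
    2: ["boolean", "logic", "ranks", "nothing"],
    3: ["term", "frequency", "and", "term", "weighting"],
}

N = len(docs)

def idf(term):
    # collection-frequency weight: rarer terms carry more weight
    df = sum(1 for terms in docs.values() if term in terms)
    return math.log((N + 1) / (df + 1))

def score(query, terms):
    # within-document frequency times collection weight, summed over
    # the query terms
    tf = Counter(terms)
    return sum(tf[t] * idf(t) for t in query)

def ranked(query):
    # every document receives a score; the output is an ordering,
    # not a retrieved/not-retrieved dichotomy
    return sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
```

Here document 3 outranks document 1 on the query `term weighting` simply because `term` occurs in it twice -- exactly the kind of statistical occurrence information the paragraph above describes.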

Associative methods have been more common in experimental systems or environments than in operational ones: for example, Salton's system [63] [80] or Sparck Jones' experiments [69] [101] [113]. They have also figured more often in theoretical papers (see Robertson [92] for some examples). However, some authors have described work in something approaching an operational environment (e.g. Miller [68]).

Interactive searching

In the early seventies, online searching became feasible. Most online systems were based on batch search systems, and they tended to have similar facilities (Barraclough [93]). A few authors began to consider the possibilities of highly interactive retrieval (e.g. Oddy [90]); however, on the whole the most interesting work on interactive IR was happening in the context of manual searching (e.g. Keen [94], Ingwersen [108]). The unwillingness of the major online hosts to make substantial changes in their interfaces led eventually to consideration of alternative ways to provide interactive help to searchers, such as front-end systems. Two substantial papers in the Journal have surveyed, reviewed and analysed these aids and the principles on which they are based (Efthimiadis [133], Vickery and Vickery [145]). One particular interface was described by Vickery and Vickery [142].

One interactive mechanism which has been the subject of a number of Journal papers is relevance feedback. The principle is that if the user indicates to the system which items (resulting from a first attempt at searching, say) are of interest to her/him, then the system can modify the search statement to match more closely the desired items, and thus find more similar items. In effect, the user's relevance judgements provide indirect evidence as to her/his real need, in addition to the information provided directly in the form of a query. The use of relevance feedback in SDI has already been mentioned (Barker, Veal and Wyatt [72]); other papers on the subject are Sparck Jones [101], Dillon and Desper [103], Wu and Salton [104], Robertson [119] [136], and Hancock-Beaulieu and Walker [144].

Although not all relevance feedback methods are based on this view, the idea fits very well with the probabilistic model for IR (see the section on retrieval system theory, below).
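The modification of a search statement from relevance judgements can be sketched using the well-known Rocchio formulation as an example; the papers cited above use a variety of methods, and this particular formula and its coefficients are illustrative, not drawn from any of them:

```python
# hypothetical tuning constants: weight given to the original query,
# to the relevant items, and to the non-relevant items respectively
ALPHA, BETA, GAMMA = 1.0, 0.75, 0.25

def rocchio(query, relevant, nonrelevant):
    """Modify a query from feedback; all arguments are dicts mapping
    terms to weights, and the result is a new query of the same form."""
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)
    new_query = {}
    for t in terms:
        w = ALPHA * query.get(t, 0.0)
        if relevant:
            # move the query towards the centroid of the relevant items
            w += BETA * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            # and away from the centroid of the non-relevant items
            w -= GAMMA * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_query[t] = max(w, 0.0)   # negative weights are usually dropped
    return new_query
```

The effect is exactly that described above: terms from items judged relevant are added to (or strengthened in) the search statement, so that a second search matches the desired items more closely.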

Another aspect of interaction which has received some attention is display. In fact there has been an interesting continuity between the work on printed indexes mentioned above, and later work on the display of records, index entries, concepts and relations on computer screens. Examples of work in this area are Craven [106] [123] [135], Bertrand-Gastaldy and Davidson [120], Sano [138], and Bovey and Brown [124].


Theories and models

From the point of view of the Journal of Documentation, the major effect of the development of computer-based information retrieval systems (as predicted by Fairthorne) was not so much in the development of specific, practical systems, as in the stimulation of ideas. (The more formally explored and presented ideas, of course, become theories or models.) Experiments or experimental systems are then often used to test the ideas. Fuller discussion of these aspects must wait until after a section on the evaluation of systems.


