

INFORMATION RETRIEVAL EXPERIMENT

Although there is now a strong association between computer-based retrieval and evaluation, and although the history of formal evaluation begins almost simultaneously with the history of computer-based methods, the two developed quite independently of each other for some time. Much of the early evaluation work was done on manual systems.


Cranfield

The remark at the Royal Society conference in 1948 on the desirability of experimentation was mentioned above, but the first major evaluation experiment in information retrieval was the first Cranfield experiment, begun in the late fifties. This experiment was very well represented in the Journal, beginning with a long analytical review by O'Connor of two early reports from the project [26]. This review was interesting for, among other things, suggesting analytical experiments looking at the results for individual queries. (This idea was taken up in the big Medlars experiment in the mid-sixties (Lancaster [62]), but relatively seldom since.) The final Cranfield 1 report also prompted a substantial review (Mote [31]), and then a whole series of longer articles. Kyle [35] and Hyslop [39] both tried to draw conclusions from Cranfield 1 that would be of practical significance in operational systems. Brownson [40] reviewed the state of the art of evaluation, with particular reference to Cranfield 1; Fairthorne [41] discussed some theoretical issues. The idea of testing systems began to spread (e.g. Martyn and Slater [37]; Rolling [42]; Martyn [45]). It is very clear that Cranfield 1 had a major impact on our perception of information retrieval systems, and of the possibility of experimental study of IR.


Methodology

Brownson [40] also announced the funding of the two major studies of relevance of the sixties, one of which was later reported in the Journal (Cuadra and Katter [51]). Relevance was also discussed by Barhydt [46], and other aspects of testing by Saracevic and Rees [44]. Subsequently, the Journal published many papers on methodological and/or theoretical aspects of evaluation, particularly evaluation measures (Brookes [52]; Robertson [55]; Miller [64]; Brookes [73]; Cleverdon [74]; Heine [77] etc.).
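The measures at issue in most of these papers are, at bottom, elaborations of precision and recall. The following is a minimal sketch, in Python (a modern convenience, not something used in the work cited), of the basic set-based definitions for a single query; the function name and document identifiers are purely illustrative.

    def precision_recall(retrieved, relevant):
        """Set-based precision and recall for a single query.

        retrieved: document identifiers returned by the system
        relevant:  document identifiers judged relevant to the query
        """
        retrieved, relevant = set(retrieved), set(relevant)
        hits = retrieved & relevant
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # Example: 3 of the 5 retrieved documents are among the 6 judged relevant.
    p, r = precision_recall({"d1", "d2", "d3", "d4", "d5"},
                            {"d1", "d3", "d5", "d7", "d8", "d9"})
    print(p, r)  # 0.6 0.5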

One might argue that, having raised the hope of treating IR as an experimental discipline, Cranfield 1 promptly dashed it by revealing the extreme difficulty of devising adequate methodologies. Certainly the problems are severe, and although the frequency of methodological papers has declined, this is in no sense because the problems have been solved. The recent TREC project in the United States (about which more below) has reinforced this point.


Evaluation experiments

Cleverdon in 1970 reviewed evaluation tests up to that point [59]. He emphasised the way in which the seeking of experimental evidence (rather than philosophical argument or anecdote) had become an accepted method of enquiry in information retrieval. He also particularly excluded from his consideration studies which implement and test only part of a system (for example, inter-indexer consistency studies) -- a point to which I return below.

Many specific tests were reported in the Journal, some on manual and some on computer-based systems (for example, Corbett [47], Shaw and Rothman [53], Searle [60], Salton [63], Miller [68], Sparck Jones [69], Barker, Veal and Wyatt [70] etc.). Some of these tests were intended to provide data for specific decision-making (e.g. about particular features of an operational system, or between competing systems). Other experiments were intended to inform more generally about information retrieval: to establish general principles of system design or implementation. More rarely, experimental studies were undertaken not to evaluate as such, but to characterise systems, procedures, methods or databases in a way that might increase our understanding of evaluation results: an example is van Rijsbergen and Sparck Jones [79].


Test collections

From Cranfield on, many experiments have been performed on datasets created for earlier experiments (the prime example is the Cranfield 2 data, which has been used for an astonishing number and range of experiments since it became available). This led to extensive consideration in the UK in the 1970s of the possibility of creating a new, bigger and better test collection of material (documents, requests and relevance judgements). It was referred to as the `ideal' test collection, and the first published paper about it appeared in the Journal (Sparck Jones and van Rijsbergen [89]).

Unfortunately, the ideal test collection project never got off the ground. However, it eventually inspired a similar project in the United States, which began in 1991: TREC (Text REtrieval Conference). The basis of TREC is that a central organisation builds the test collection, and researchers around the world use it to test their own methods and systems, reporting back to the conference with results presented in a standardised way. The TREC collection is far larger than any previous test collection, which makes the exercise extremely interesting; however, it must also be said that the TREC methodology strongly reflects its origins in the 1970s ideal test collection proposal, and (in my view) needs substantial development to bring it closer to current concerns, such as highly interactive systems.
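To make the shape of such an exercise concrete, the sketch below (again in Python, with invented names; it does not reproduce the actual TREC file formats or measures) shows the three components of a test collection (documents, requests and relevance judgements) and a standardised per-topic scoring loop applied to one system's output.

    from dataclasses import dataclass

    @dataclass
    class TestCollection:
        documents: dict   # doc_id -> document text
        requests: dict    # topic_id -> request statement
        qrels: dict       # topic_id -> set of doc_ids judged relevant

    def evaluate(run, collection):
        """Score one system's output against the collection's judgements.

        run: topic_id -> ranked list of doc_ids retrieved by the system
        Returns per-topic precision and recall (set-based, as above).
        """
        scores = {}
        for topic_id, ranking in run.items():
            relevant = collection.qrels.get(topic_id, set())
            hits = sum(1 for doc_id in ranking if doc_id in relevant)
            scores[topic_id] = {
                "precision": hits / len(ranking) if ranking else 0.0,
                "recall": hits / len(relevant) if relevant else 0.0,
            }
        return scores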


