A probabilistic model of information retrieval

Development and comparative experiments

Spärck Jones, Walker and Robertson

Part 1 (sections 1-4.4 plus results)
Part 2 (sections 4.5-10.3)

This paper covers developments in the relevance weighting probabilistic model associated with Cambridge and City Universities and the Okapi system. It includes extensive experimental results, from a TREC-derived corpus, on different components of the term weighting and document scoring system.

A note on terminology

The weighting scheme commonly known as IDF is briefly referred to as such in this document, but is generally called CFW (collection frequency weight) here. This is at odds with the (now) common use of 'collection frequency' to mean the total number of occurrences of a term in a collection. (See also The IDF page.) The weighting scheme referred to as RW (relevance weight) is the original Robertson/Spärck Jones relevance weight from 1976, also known at different times as RSJ, F4 or Miller4. The scheme now commonly known as Okapi BM25 (or simply Okapi or BM25) appears in the present paper in various guises (CW, CIW, QACW, QACIW), depending on exactly what information is included. Finally, the weight referred to here as OW (offer weight) is elsewhere called a 'term selection function' or 'value', since this is the purpose it serves.
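For readers who know these weights only by name, the two simplest ones can be sketched as follows. This is an illustrative sketch rather than the paper's own code: the function names `cfw` and `rw` are hypothetical, and the symbols are assumed to have their usual meanings (N documents in the collection, n containing the term, R known relevant documents, r of those containing the term). The formulas are the commonly stated forms, with the usual 0.5 correction in the relevance weight.

```python
import math


def cfw(N: int, n: int) -> float:
    """Collection frequency weight (IDF-style): log N - log n.

    N: number of documents in the collection (assumed > 0)
    n: number of documents containing the term (assumed > 0)
    """
    return math.log(N) - math.log(n)


def rw(N: int, n: int, R: int, r: int) -> float:
    """Robertson/Sparck Jones relevance weight with 0.5 smoothing.

    R: number of known relevant documents
    r: number of known relevant documents containing the term
    """
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (n - r + 0.5) * (R - r + 0.5)
    return math.log(numerator / denominator)


if __name__ == "__main__":
    # With no relevance information (R = r = 0), rw reduces to an
    # IDF-like weight, log((N - n + 0.5) / (n + 0.5)).
    print(cfw(1000, 10))
    print(rw(1000, 10, 0, 0))
    # With relevance information favouring the term, the weight rises.
    print(rw(1000, 10, 5, 4))
```

With R = r = 0 the relevance weight depends only on N and n, which is why it can stand in for a collection frequency weight when no relevance information is available.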

Publication

An earlier and fuller version of this document was published as University of Cambridge Computer Laboratory Technical Report 446, 1998 (via www.cl.cam.ac.uk/TechReports/). The present version is essentially the same (with the possible exception of some proof changes) as the one published in Information Processing and Management in 2000. It is presented here in two parts, as it was in IP&M, but is intended to be read as a single document.

Publication details are as follows:

  1. K. Sparck Jones, S. Walker and S.E. Robertson, A probabilistic model of information retrieval: development and comparative experiments. Information Processing and Management 36, Part 1, 779-808 (2000).
  2. K. Sparck Jones, S. Walker and S.E. Robertson, A probabilistic model of information retrieval: development and comparative experiments. Information Processing and Management 36, Part 2, 809-840 (2000).

Stephen Robertson
February 2005