Introduction

This paper discusses an approach to the incorporation of new variables into traditional probabilistic models for information retrieval, and some experimental results relating thereto. Some of the discussion has appeared in the proceedings of the second TREC conference [1], albeit in less detail.

Statistical approaches to information retrieval have traditionally (to over-simplify grossly) taken two forms: firstly approaches based on formal models, where the model specifies an exact formula; and secondly ad-hoc approaches, where formulae are tried because they seem to be plausible. Both categories have had some notable successes. A more recent variant is the regression approach of Fuhr and Cooper (see, for example, Cooper [3]), which incorporates ad-hoc choice of independent variables and functions of them with a formal model for assessing their value in retrieval, selecting from among them and assigning weights to them.

One problem with the formal model approach is that it is often very difficult to take into account the wide variety of variables that are thought or known to influence retrieval. The difficulty arises either because there is no known basis for a model containing such variables, or because any such model may simply be too complex to give a usable exact formula.

One problem with the ad-hoc approach is that there is little guidance as to how to deal with specific variables: one has to guess at a formula and try it out. This problem is also apparent in the regression approach, although "trying it out" has a somewhat different sense here (the formula is tried in a regression model, rather than in a retrieval test).

The approach in the present paper is to take a model which provides an exact but intractable formula, and use it to suggest a much simpler formula. The simpler formula can then be tried in an ad-hoc fashion, or used in turn in a regression model. Although we have not yet taken this latter step of using regression, we believe that the present suggestion lends itself to such methods.

The variables we have included in this paper are within-document term frequency, document length, and within-query term frequency. (It is worth observing that collection frequency of terms appears naturally in traditional probabilistic models, particularly in the form of the approximation to inverse collection frequency weighting demonstrated by Croft and Harper [4].) The formal model used to investigate the effects of these variables is the 2-Poisson model (Harter [5]; Robertson, van Rijsbergen and Porter [6]).
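As a rough illustration of the idea behind the 2-Poisson model, the sketch below treats the within-document frequency of a term as a mixture of two Poisson distributions, one for documents that are "elite" for the term and one for the rest. It is a minimal sketch of the general mixture form only, not the formulation developed in this paper; the function names, parameter names and numerical values are illustrative assumptions.

```python
import math


def poisson_pmf(k, rate):
    """Probability of observing exactly k occurrences under a Poisson with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def two_poisson_pmf(tf, p_elite, rate_elite, rate_non_elite):
    """Mixture of two Poissons for within-document term frequency tf.

    p_elite        -- assumed proportion of documents that are elite for the term
    rate_elite     -- assumed mean term frequency in elite documents
    rate_non_elite -- assumed mean term frequency in non-elite documents
    """
    return (p_elite * poisson_pmf(tf, rate_elite)
            + (1.0 - p_elite) * poisson_pmf(tf, rate_non_elite))


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the shape of the mixture.
    for tf in range(8):
        print(tf, two_poisson_pmf(tf, p_elite=0.3, rate_elite=4.0, rate_non_elite=0.5))
```

When the elite rate is higher than the non-elite rate, larger term frequencies are more probable under the elite component, which is the intuition that makes within-document term frequency informative.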

Steve Robertson
Mon May 13 18:33:21 BST 1996