
Eliteness

 

The approach taken in reference [6] is to model within-document term frequencies by means of a mixture of two Poisson distributions. However, before discussing the 2-Poisson model, it is worth extracting one idea which is necessary for the model, but can perhaps stand on its own. (This idea was in fact taken from the original 2-Poisson work by Harter [5], but was extended somewhat to allow for multi-concept queries.)

We hypothesize that occurrences of a term in a document have a random or stochastic element, which nevertheless reflects a real but hidden distinction between those documents which are "about" the concept represented by the term and those which are not. Those documents which are "about" this concept are described as "elite" for the term. We may draw an inference about eliteness from the term frequency, but this inference will of course be probabilistic. Furthermore, relevance (to a query which may of course contain many concepts) is related to eliteness rather than directly to term frequency; term frequency itself is assumed to depend only on eliteness. The usual term-independence assumptions are replaced by assumptions that the eliteness properties for different terms are independent of each other. Although the details are not provided here, it is clear that these independence assumptions can in turn be replaced by "linked dependence" assumptions in the style of Cooper [7].
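The chain of inference described above (term frequency to eliteness, eliteness to relevance) can be illustrated with a small numerical sketch for a single term. Everything specific in it is an assumption made for illustration only: the Poisson form of the two term-frequency distributions (anticipating the 2-Poisson model), the function and parameter names, and all of the numerical values. It is not taken from references [5] or [6].

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing k occurrences under a Poisson with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def p_elite_given_tf(tf, prior_elite, mean_elite, mean_non_elite):
    """Bayes' rule: probability that a document is elite for the term, given tf.

    Assumption (illustrative): tf is Poisson-distributed with one mean for
    elite documents and another for non-elite documents.
    """
    elite = prior_elite * poisson_pmf(tf, mean_elite)
    non_elite = (1.0 - prior_elite) * poisson_pmf(tf, mean_non_elite)
    return elite / (elite + non_elite)


def p_relevant_given_tf(tf, prior_elite, mean_elite, mean_non_elite,
                        p_rel_elite, p_rel_non_elite):
    """Relevance depends on tf only through the hidden eliteness variable."""
    pe = p_elite_given_tf(tf, prior_elite, mean_elite, mean_non_elite)
    return p_rel_elite * pe + p_rel_non_elite * (1.0 - pe)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to make the effect visible.
    params = dict(prior_elite=0.1, mean_elite=4.0, mean_non_elite=0.5,
                  p_rel_elite=0.3, p_rel_non_elite=0.01)
    for tf in range(6):
        print(tf, p_relevant_given_tf(tf, **params))
```

Under these assumptions the probability of relevance rises with term frequency but can never exceed the probability of relevance for an elite document, since a high term frequency only makes eliteness more certain.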

As usual, the various assumptions of the model are very clearly over-simplifications of reality. It seems nevertheless to be useful to introduce this hidden variable of eliteness in order to gain some understanding of the relation between multiple term occurrence and relevance.



Steve Robertson
Mon May 13 18:33:21 BST 1996