
Hypotheses Concerning Document Length

 

We may postulate at least two reasons why documents might vary in length. Some documents may simply cover more material than others; an extreme version of this hypothesis would have a long document consisting of a number of unrelated short documents concatenated together (the "Scope hypothesis"). An opposite view would have long documents like short documents, but longer: in other words, a long document covers a similar scope to a short document, but simply uses more words (the "Verbosity hypothesis").

It seems likely that real document collections contain a mixture of these effects; individual long documents may be at either extreme or of some hybrid type. (It is worth observing that some of the long TREC news items read exactly as if they are made up of short items concatenated together.) However, with the exception of a short discussion in section 5.7, the work on document length reported in this paper assumes the Verbosity hypothesis; little progress has yet been made with models based on the Scope hypothesis.

The Verbosity hypothesis would imply that document properties such as relevance and eliteness can be regarded as independent of document length; given eliteness for a term, however, the number of occurrences of that term would depend on document length.
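
As a purely illustrative sketch of this implication, suppose (an assumption made here for simplicity, not a claim of the text above) that, for an elite document, the expected number of occurrences of a term grows in direct proportion to document length. Rescaling a raw count to an average-length document then removes the length effect. The function name `normalised_tf` and the toy figures are hypothetical.

```python
def normalised_tf(tf: int, doc_length: int, avg_length: float) -> float:
    """Rescale a raw term count to an average-length document.

    Assumes (hypothetically) that occurrences scale linearly with length,
    which is one simple reading of the Verbosity hypothesis.
    """
    if doc_length <= 0:
        raise ValueError("doc_length must be positive")
    return tf * avg_length / doc_length


# Toy data: two documents assumed elite for the same term,
# the second four times as verbose as the first.
avg_length = 200.0
short_score = normalised_tf(tf=3, doc_length=100, avg_length=avg_length)
long_score = normalised_tf(tf=12, doc_length=400, avg_length=avg_length)

# Under the linear assumption both documents look equally "about" the term.
print(short_score, long_score)
```

Under the Scope hypothesis this rescaling would not be appropriate, since a long document made of concatenated unrelated items could contain one short item with a high concentration of the term.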



Steve Robertson
Mon May 13 18:33:21 BST 1996