As indicated above (section 5.1), an alternative hypothesis concerning document length would regard a long document as a set of unrelated, concatenated short documents. The obvious response to this hypothesis is to attempt to find appropriate boundaries within the documents, and to treat short passages rather than full documents as the retrievable units. A number of experiments along these lines have been reported in the literature [10].
This approach appears difficult to combine with the ideas discussed above in a way that would explain document length as a mixture of the two hypotheses. One possible solution, used by Salton [11], is to allow passages to compete with full documents for retrieval. There nevertheless seems to be room for more theoretical analysis.
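
The idea of letting passages compete with full documents can be illustrated with a minimal sketch. It is not the procedure of [11], nor the weighting scheme discussed above: the fixed-size windows used as passage boundaries, the toy scoring function `score` with its damping constant `k`, and the names `split_passages` and `rank_with_passages` are all assumptions made purely for illustration.

```python
from collections import Counter


def split_passages(tokens, size):
    """Cut a token list into consecutive, non-overlapping windows of `size` tokens.

    Fixed-size windows are an assumption; real passage boundaries might
    instead follow paragraphs or sections.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def score(query_terms, tokens, k=10.0):
    """Toy score (an assumption): query-term occurrences divided by (length + k).

    The constant k damps the advantage of very short units, so a passage
    beats its parent document only when the matches are concentrated in it.
    """
    counts = Counter(tokens)
    matches = sum(counts[t] for t in query_terms)
    return matches / (len(tokens) + k)


def rank_with_passages(query, docs, size=50, k=10.0):
    """Rank documents by their best-scoring unit (full text or any passage).

    Returns a list of (doc_id, best_score, unit_kind) tuples, best first,
    where unit_kind is "full" or "passage".
    """
    query_terms = set(query.lower().split())
    ranked = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        # The full document and each of its passages are candidate units.
        units = [("full", tokens)]
        units += [("passage", p) for p in split_passages(tokens, size)]
        scored = [(score(query_terms, toks, k), kind) for kind, toks in units]
        # On ties, max() keeps the first candidate, i.e. the full document.
        best_score, best_kind = max(scored, key=lambda pair: pair[0])
        ranked.append((doc_id, best_score, best_kind))
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Under this toy score, a document whose query terms are spread evenly is represented by its full text, while one whose matches are concentrated in a single window is represented by that passage. This is one simple way in which the two hypotheses about document length could coexist in a single ranking, although it offers no theoretical justification for the mixture.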