We describe now the mechanisms to rank documents according to their relevance to a query.

Table of contents

Approach

A node a is able to establish the profile of a owned document d by measuring the number of times d has been downloaded, and by observing the words that were associated with the download requests. We call it a vote.

Considering Word

Relevance to a Word

A document d is relevant to a word w if most of the votes for d are associated with w. A node a quantifies the relevance of w for d as the ratio of the votes for w to the votes for other words.

REL_a(d, w) = \frac{vote_a(d,w)}{\sum_{w' \in \mathcal W(a)}vote_a(d,w')}

Popularity for a Word

On the other side, a document d is popular for a word w if most of the received votes for w are associated with d. The popularity is computed with the ratio of votes for d to the votes for another documents.

POP_a(d, w) = \frac{vote_a(d,w)}{\sum_{d' \in \mathcal D(a)}vote_a(d',w)}

Computation Adjustment

But, as pointed out in Neurogrid (http://www.neurogrid.net), a document receiving numerous votes has a more accurate measure of relevance than another which receives few votes. In order to prevent such bias, we use the Hoeffding Bound (http://en.wikipedia.org/wiki/Hoeffding%27s_inequality) to compute the maximum deviation e(n) of the ratio from its expected value for n votes. So, it is possible to obtain the pessimistic value of the relevance and the popularity.

As the popularity of old documents increase, newly published documents may never be found. So, we implement an aging mechanism' by periodically multiplying the number of votes for a document by an aging factor (for instance 0.99).

Considering the Initiator

The request initiator should also be considered when ranking document. If a word w is relevant to a document d and if the user z is a specialist of w, we guess that the document d will be of interest for z. So we obtain the following computation of the "matching" between a user and a document:

matching_a(z, d) = \sum_{w \in \mathcal W(a)}{REL_a(d,w) * SP_a(z,w)}

Document Ranking Score

Finally, the node a is able to compute the document ranking score of a document d for a queried word w and a requester z as:

DRSa(z,d,w) = RELa(d,w) * POPa(d,w) * matchinga(z,d)

Nodes also use informations contained in the search responses to establish a partial profile of some known but not stored documents. The accuracy of this partial profile depends on the amount of informations in the search response. There is a trade-off between the compactness of the search responses and the accuracy of the profiles.

Using the document ranking score, a node can select, from its point of view, the most relevant document for the requester. The selection is done among the documents published, cached or known through previous search responses.

This algorithm which has been simplified for clarity, can be improved by forwarding first the request before scanning its documents. So, this operation does not affect the latency of the query.