Relevance Measurement in Information Retrieval Systems

One of the challenges an Information Retrieval (IR) system faces is relevance quality. It is the main factor that determines the end user's satisfaction. (The other two are latency and corpus size.)

To design and implement an IR system with high relevance quality, we must have some method to measure relevance quality.

Generally speaking, a measuring system consists of three components:
- Test Corpus (Document Collection for Test purpose)
- Test Query Set (Set of Queries for Test)
- Measuring Parameter (usually a function, used to score the retrieval result of an IR system for a query in the query set, against the test corpus)

The test corpus and query set are another story; here we focus only on the measuring parameter/function.

1. Precision/Recall for unranked retrieval results

Precision = #Relevant Documents Retrieved / #Retrieved Documents. It is the percentage of the returned documents that are actually relevant to the user query.

Recall = #Relevant Documents Retrieved / #Total Relevant Documents. It is the percentage of the relevant documents in the corpus that are retrieved in the query result.
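These two definitions can be sketched in a few lines of Python. The function name and the document IDs below are hypothetical, just for illustration:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for an unranked result set.

    retrieved: collection of doc IDs returned by the system
    relevant:  collection of doc IDs judged relevant for the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: the system returned 4 docs, 3 of which are
# relevant, and the corpus contains 6 relevant docs in total.
p, r = precision_recall({1, 2, 3, 4}, {1, 2, 3, 5, 6, 7})
print(p, r)  # 0.75 0.5
```

Note that precision and recall pull in opposite directions: returning more documents can only keep or raise recall, but usually lowers precision.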

2. NDCG for ranked retrieval results

NDCG stands for Normalized Discounted Cumulative Gain, an evaluation measure based on human relevance ratings.

Gain - a human rater assigns a numeric value (the score gained) to represent how good a returned document is for a specific query.

Cumulative Gain - the rater assigns a gain value to each document in the top p returned results; the values are assigned individually and independently.
 \mathrm{CG_{p}} = \sum_{i=1}^{p} rel_{i}
Discounted Cumulative Gain - the relevance score of each returned document is weighted by its position in the retrieval result, so gains lower in the list count for less.
 \mathrm{DCG_{p}} = rel_{1} + \sum_{i=2}^{p} \frac{rel_{i}}{\log_{2}i}
Normalized Discounted Cumulative Gain - the idea is simple: normalize the final value into [0, 1]. Usually, the DCG score of the ideally ordered document list (sorted by decreasing gain score) is used as the normalizing factor (IDCG). So
 \mathrm{nDCG_{p}} = \frac{DCG_{p}}{IDCG_{p}}
For a concrete example of how to compute the NDCG value of a query result, see the Wikipedia article on NDCG.
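The three formulas above translate directly into Python. This is a sketch rather than a reference implementation, and the gain values in the example are hypothetical:

```python
import math

def dcg(rels):
    """DCG_p = rel_1 + sum_{i=2}^{p} rel_i / log2(i)."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(rels, start=1))

def ndcg(rels):
    """nDCG_p = DCG_p / IDCG_p, where IDCG is the DCG of the
    same gains sorted in decreasing order (the ideal ordering)."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Hypothetical gain scores for the top-5 results of one query,
# in the order the system returned them.
print(ndcg([3, 2, 3, 0, 1]))
```

A list already sorted by decreasing gain scores exactly 1.0, since its DCG equals the IDCG.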

NDCG is widely used in evaluating today's commercial search engines, but there is a problem: whenever the returned documents are ordered in decreasing order of gain score, the NDCG value reaches its maximum of 1, regardless of how relevant those documents actually are.

This means that NDCG only measures the ranking algorithm of a search engine and cannot tell whether the returned documents are actually related to the user's intention. From the end user's perspective, however, the perfect result is a set of highly related documents, ordered properly.
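A minimal sketch of this issue, assuming IDCG is computed from the gains of the returned documents themselves (the `ndcg` helper re-implements the formulas in this post; all gain values are hypothetical):

```python
import math

def ndcg(rels):
    """nDCG over a list of gain scores, per the formulas above,
    normalizing by the ideal reordering of that same list."""
    def dcg(xs):
        return sum(x if i == 1 else x / math.log2(i)
                   for i, x in enumerate(xs, start=1))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Two result lists for the same query, both already in decreasing
# gain order, so both score a "perfect" 1.0 -- even though the
# second list matched only marginally relevant documents.
strong_match = [3, 3, 2, 1]   # highly relevant docs were found
weak_match = [1, 1, 0, 0]     # only weak matches were found
print(ndcg(strong_match))  # 1.0
print(ndcg(weak_match))    # 1.0
```

Under this variant the score rewards ordering alone, which is exactly the matching-versus-ranking gap discussed below.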

More technically, a typical query-serving subsystem of an IR system has two phases: matching (finding highly related documents) and ranking (ordering the matched documents). NDCG may be a proper tool for measuring the ranking phase, but definitely not the matching phase. So I think it is not an ideal measuring mechanism for an IR system as a whole.

So, tuning the whole system against the NDCG score alone may not be the right direction for improving a search engine.

- The ideal list used to calculate the normalization factor is the list of the highest-scored documents among all judged documents, ordered properly, not merely the returned documents in their best order. Computed that way, the problem I mentioned above does not exist.
- But the final effect of this measuring method still depends on the test corpus, the test queries, and the predefined gain scores for each query.
