Friday, 20 May 2016

Lenskit: Precision, Recall and NDCG

This post is dedicated to accuracy metrics in the Lenskit framework.

NDCG

In the Lenskit implementation, normalized discounted cumulative gain (NDCG) is calculated as follows:

NDCG = DCG / IDCG,

where IDCG is the DCG for a list with perfect order, while DCG is calculated as follows:

DCG = Σ_{u ∈ U} Σ_{i=1}^{|R_u|} rel_u(i) / max(1, log₂ i),

where U is the set of users, rel_u(i) is the relevance for user u of the item at position i, and R_u is the recommendation list for user u. The following example illustrates the formula.
Suppose a user was recommended a list of 5 movies. The user did not really like the first movie in the list and gave it 3 stars out of 5. She really liked (5 stars) the second movie, though. The following table demonstrates user tastes.
DCG is then calculated by summing the ratings, each discounted by its position in the list. This is an absolute value, so it is unclear how good the order actually is. We can therefore compare this value with the DCG of the list in perfect order (IDCG), i.e. with the same ratings sorted from highest to lowest.

The relative value is then their ratio (n = 5): NDCG = DCG / IDCG.
The Lenskit implementation does not discount the positions of the first two items in the list, because the logarithmic discount equals 1 at positions 1 and 2. The NDCG value would not change if the first two ratings were 5 and 3 instead of 3 and 5. The implementation is consistent with the paper where the metric was proposed:
Järvelin, K., Kekäläinen, J. IR evaluation methods for retrieving highly relevant documents // Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2000. pp. 41-48.
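The calculation can be sketched in a few lines of Python. Note that the rating list below is hypothetical: only the first two values (3 and 5 stars) come from the example above; the remaining three are assumed for illustration.

```python
import math

def dcg(rels):
    # Base-2 logarithmic discount from Jarvelin & Kekalainen (2000):
    # max(1, log2(i)) equals 1 at positions 1 and 2, so the first two
    # items in the list are not discounted at all.
    return sum(rel / max(1.0, math.log2(i))
               for i, rel in enumerate(rels, start=1))

def ndcg(rels):
    # Normalize by the DCG of the same relevances in perfect order.
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Only the first two ratings (3 and 5) are from the post's example;
# the remaining three are assumed.
ratings = [3, 5, 4, 2, 1]
print(round(ndcg(ratings), 4))

# Swapping the first two items leaves NDCG unchanged, because
# neither position is discounted:
print(ndcg([5, 3, 4, 2, 1]) == ndcg([3, 5, 4, 2, 1]))  # True
```

A perfectly ordered list yields NDCG = 1; only misorderings from position 3 onward lower the score.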

Precision and Recall

Precision and recall are calculated as follows:

Precision = |relevant ∩ recommended| / |recommended|
Recall = |relevant ∩ recommended| / |relevant|
Let's consider the previous example of movie recommendations. To calculate these metrics, we can consider some items relevant depending on a threshold. For example, we could assume that if the user gave more than 3 stars to a movie, then the movie is relevant to the user.
Suppose the user likes 6 movies overall in our dataset. Recall is then the number of relevant movies among the recommendations divided by 6, and precision is that same number divided by the list length of 5.
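This can be checked directly with hypothetical item ids. Here we assume (this detail is not in the post) that only the 5-star movie among the five recommendations clears the ">3 stars" threshold, giving precision = 1/5 and recall = 1/6:

```python
def precision_recall(recommended, relevant):
    # Precision: share of recommended items that are relevant.
    # Recall: share of relevant items that were recommended.
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: items 1-5 were recommended; the user's 6 relevant
# movies include only item 2 (the 5-star movie) from that list.
recommended = [1, 2, 3, 4, 5]
relevant = [2, 10, 11, 12, 13, 14]
p, r = precision_recall(recommended, relevant)
print(p)  # 0.2
print(round(r, 3))  # 0.167
```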
Currently, the metrics are not consistent in Lenskit. Precision and recall ignore users for whom a recommendation algorithm is unable to generate recommendations, while NDCG takes all users into account.
For example, if the dataset is very sparse, item-based collaborative filtering may fail to build item neighborhoods and therefore cannot generate suggestions for some users. NDCG would penalize the algorithm for suggesting nothing, while precision and recall would simply skip these users.
The implementation is going to be fixed in the future. Until then, to make the metrics more consistent, one can force precision and recall to capture all users by changing the metric class (or a copy of it) so that users without recommendations are still counted.
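The original class-level change is not reproduced here. As a language-agnostic illustration (a sketch of the idea, not LensKit code), the fix amounts to keeping users with empty recommendation lists in the denominator instead of skipping them:

```python
def average_precision_over_users(rec_lists, relevant_sets):
    # Average per-user precision over ALL test users: a user for whom
    # the recommender produced nothing contributes 0, mirroring how
    # NDCG penalizes empty lists instead of skipping the user.
    total = 0.0
    for user, recs in rec_lists.items():
        relevant = relevant_sets.get(user, set())
        if recs:
            total += len(set(recs) & relevant) / len(recs)
        # users with an empty list add nothing but still count below
    return total / len(rec_lists) if rec_lists else 0.0

# Two users: "alice" gets recommendations, "bob" gets none.
recs = {"alice": [1, 2, 3, 4], "bob": []}
liked = {"alice": {2, 4, 7}, "bob": {5, 6}}
print(average_precision_over_users(recs, liked))  # 0.25
```

With the skip-empty behavior, alice's precision of 0.5 would be reported alone; counting bob as 0 halves the average, which matches how NDCG treats him.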
