Thursday 18 February 2016

Lenskit: data split in evaluation

Lenskit 2 has an embedded evaluation module. In this post I am going to describe how it splits datasets.



Let's consider an example in which we are given a data file with 4 users:
The data file corresponds to the following user-item matrix:
If we ask Lenskit to split our data with crossfold 1 and holdout 3, we receive two files, test.0.csv and train.0.csv, that correspond to the following user-item matrices.
Lenskit hid 3 ratings of each user, regardless of how many ratings the user has. For example, user 4 has no ratings at all in the training dataset.
If we set crossfold 2 and holdout 2, which corresponds to 2-fold cross-validation, we obtain the following result:
The framework selected half of the users for the first fold and the other half for the second fold. For each user, it hid 2 ratings.
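To make the behaviour described above concrete, here is a minimal Python sketch of a split by users with a fixed per-user holdout. It is not Lenskit's code: the function name crossfold_split, the sample ratings, and the random choice of which ratings to hide are my own assumptions for illustration. I also assume that users outside a fold's test set keep all of their ratings in that fold's training data.

```python
import random
from collections import defaultdict


def crossfold_split(ratings, n_folds, holdout, seed=0):
    """Split (user, item, rating) tuples into n_folds (train, test) pairs.

    Hypothetical helper, not part of Lenskit. Users are partitioned into
    n_folds groups; for each user in a fold's group, `holdout` ratings go
    to the test set and the rest go to the training set.
    """
    rng = random.Random(seed)

    # Group the ratings by user.
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    # Shuffle the users and deal them into folds round-robin.
    users = sorted(by_user)
    rng.shuffle(users)

    folds = []
    for k in range(n_folds):
        test_users = set(users[k::n_folds])
        train, test = [], []
        for user, rows in by_user.items():
            if user in test_users:
                rows = rows[:]
                rng.shuffle(rows)  # assumption: hidden ratings are picked at random
                test.extend(rows[:holdout])   # hidden ratings
                train.extend(rows[holdout:])  # may be empty for users with few ratings
            else:
                train.extend(rows)
        folds.append((train, test))
    return folds


# Made-up data: 4 users with 5, 4, 4 and 3 ratings respectively.
counts = {1: 5, 2: 4, 3: 4, 4: 3}
ratings = [(u, i, 3.0) for u, n in counts.items() for i in range(1, n + 1)]

# crossfold 1, holdout 3: every user loses 3 ratings to the test set.
train, test = crossfold_split(ratings, n_folds=1, holdout=3)[0]
print(len(train), len(test))
```

With this made-up data, user 4 has exactly 3 ratings, so all of them end up in the test set and none remain in the training set, which mirrors what happened to user 4 in the example above.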
I could not find detailed documentation on how the Lenskit framework splits data. I hope this post helps.
