Common metrics to evaluate recommendation systems

Are your evaluation metrics top-heavy?

Screenshot from Netflix

How to Evaluate?

As users only get to see the top-k recommended items, recommender systems are usually evaluated with top-heavy metrics: metrics that assign higher scores to models that place most of the relevant items within the top k. A top-heavy metric rewards an algorithm that performs very well on the first few positions and discounts (or entirely ignores) performance on the remaining items.
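To make the top-heavy property concrete, here is a minimal sketch of Average Precision@K (one of the metrics discussed below) for a single query; the item names are made up for illustration. Note how moving the only relevant item from rank 1 to rank 10 collapses the score:

```python
def average_precision_at_k(ranked_items, relevant, k):
    """AP@K: average of precision@i over every rank i <= k that holds a relevant item."""
    hits = 0
    score = 0.0
    for i, item in enumerate(ranked_items[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

# The ranking that places the sole relevant item "a" first scores 10x higher
# than the one that places it at position 10 -- the top-heavy property.
top = average_precision_at_k(["a"] + [f"x{i}" for i in range(9)], {"a"}, 10)     # 1.0
bottom = average_precision_at_k([f"x{i}" for i in range(9)] + ["a"], {"a"}, 10)  # 0.1
```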

Average Metric


In this section, we take a look at some of the common metrics that can be used for evaluating recommendation systems.

Area Under ROC Curve
Comparison between model A and model B
Average Precision@K
Inverse Log Reward
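As a sketch of the AUC comparison between two models, ROC AUC can be computed directly from scores via its Mann-Whitney interpretation: the probability that a randomly chosen relevant item is scored above a randomly chosen irrelevant one. The score values below are illustrative, not from the article:

```python
def roc_auc(scores_pos, scores_neg):
    """Probability that a random relevant item outscores a random irrelevant one
    (ties count as half) -- the Mann-Whitney view of ROC AUC."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Model A scores every relevant item above every irrelevant one; model B does not.
auc_a = roc_auc([0.9, 0.8], [0.7, 0.4, 0.2])  # 1.0
auc_b = roc_auc([0.6, 0.3], [0.7, 0.4, 0.2])  # 0.5
```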

Comparison of Metrics

For simplicity, let’s assume that there is only 1 relevant item in the catalog for a given input instance (e.g. only 1 relevant movie on Netflix for a given user and their historical behavior). In this case, each of the metrics discussed above maps the rank that the recommendation system assigns to this relevant item to a single number. The following plots show each metric as a function of the rank of the relevant item.
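In this single-relevant-item setting, the metrics reduce to simple closed forms in the item's rank, which is what the plots compare. A sketch of that mapping is below; the catalog size `N` and cutoff `K` are illustrative, and the `1 / log2(rank + 1)` form for Inverse Log Reward is an assumption (a DCG-style discount), not a formula taken from the article:

```python
import math

N = 100  # illustrative catalog size
K = 10   # illustrative cutoff

def auc_from_rank(rank, n=N):
    # With one relevant item, AUC is the fraction of irrelevant items ranked below it.
    return (n - rank) / (n - 1)

def ap_at_k_from_rank(rank, k=K):
    # With one relevant item, AP@K collapses to 1/rank if the item is in the top k, else 0.
    return 1.0 / rank if rank <= k else 0.0

def inv_log_reward_from_rank(rank):
    # Assumed form: 1 / log2(rank + 1).
    return 1.0 / math.log2(rank + 1)

for rank in (1, 2, 5, 10, 50):
    print(rank, auc_from_rank(rank), ap_at_k_from_rank(rank), inv_log_reward_from_rank(rank))
```

Plotting these three curves reproduces the qualitative picture in the comparison figures: AUC decays linearly with rank, while AP@K and the inverse-log reward are strongly top-heavy.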

Comparison of Metrics
Comparison of Metrics (zoomed in)

Upcoming topic

Hopefully, we now have a high-level idea of how recommendation systems are evaluated. In the next blog post (yet to be published), we will discuss how these metrics behave when sampling is applied during evaluation. This sampling can lead to misleading conclusions when selecting the best recommendation system based on the sampled metrics computed during offline evaluation, so stay tuned!

Machine Learning Engineer @ PlayStation
