What are popular ways of evaluating recommender systems?

By: Bart Baesens, Seppe vanden Broucke

This Q&A first appeared in Data Science Briefings, the DataMiningApps newsletter, as a “Free Tweet Consulting Experience” — where we answer a data science or analytics question of 140 characters maximum. Also want to submit your question? Just Tweet us @DataMiningApps. Want to remain anonymous? Then send us a direct message and we’ll keep all your details private. Subscribe now for free if you want to be the first to receive our articles and stay up to date on data science news, or follow us @DataMiningApps.


You asked: What are popular ways of evaluating recommender systems?

Our answer:

Two categories of evaluation metrics are generally considered: the quality (goodness or badness) of the output presented by a recommender system, and its time and space requirements. Recommender systems that generate predictions (numerical values corresponding to users’ ratings for items) should be evaluated separately from recommender systems that propose a list of N items a user is expected to find interesting (top-N recommendation).

The first category of evaluation metrics concerns the quality of the output presented by a recommender system. For recommender systems that make predictions, prediction accuracy can be measured using statistical accuracy metrics (of which Mean Absolute Deviation (MAD) is the most popular one) and decision support accuracy metrics (of which the area under the Receiver Operating Characteristic curve, or AUC, is the most popular one). Coverage denotes the percentage of items for which the recommender system can make a prediction; coverage might decrease when the user-item matrix is sparse.

For top-N recommendation, the important metrics are recall- and precision-related measures. The data is first divided into a training set and a test set. The algorithm runs on the training set and produces a list of recommended items for each user. The ‘hit set’ contains only the recommended (top-N) items that also appear in the test set. Recall and precision are then determined as follows (a short code sketch follows the list):

  • Recall = size of the hit set / size of the test set
  • Precision = size of the hit set / N
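
As a minimal sketch of the metrics discussed above, the snippet below computes MAD and coverage for predicted ratings, and recall and precision for one user’s top-N list. The dictionaries `predictions` and `test_ratings` and the list `recommended` are hypothetical data structures introduced here purely for illustration.

```python
# Minimal sketch (hypothetical data structures, for illustration only).
# predictions:  {(user, item): predicted_rating} -- may be missing entries
# test_ratings: {(user, item): actual_rating} from the held-out test set

def mad(predictions, test_ratings):
    """Mean Absolute Deviation over the pairs for which a prediction exists."""
    errors = [abs(predictions[key] - rating)
              for key, rating in test_ratings.items() if key in predictions]
    return sum(errors) / len(errors) if errors else float("nan")

def coverage(predictions, test_ratings):
    """Percentage of test pairs for which the system can make a prediction."""
    covered = sum(1 for key in test_ratings if key in predictions)
    return 100.0 * covered / len(test_ratings)

def recall_precision(recommended, test_items, n):
    """Recall and precision for one user's top-N list.

    recommended: the items proposed for the user, best first
    test_items:  the items that user actually has in the test set
    """
    hit_set = set(recommended[:n]) & set(test_items)
    recall = len(hit_set) / len(test_items) if test_items else 0.0
    precision = len(hit_set) / n
    return recall, precision
```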

A problem with recall and precision is that recall usually increases as N grows, while precision decreases. The F1 metric therefore combines both measures:

  • F1 = (2 * recall * precision) / (recall + precision)

Computing F1 for each user and then taking the average gives the score of the top-N recommendation list.
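
Continuing the hypothetical sketch above, F1 can be computed per user and then averaged over all users:

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision; 0 if both are 0."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

def average_f1(per_user_results):
    """per_user_results: list of (recall, precision) tuples, one per user."""
    scores = [f1(r, p) for r, p in per_user_results]
    return sum(scores) / len(scores) if scores else 0.0
```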

The other category of evaluation metrics deals with the performance of a recommender system in terms of time and space requirements. Response time is the time the system needs to formulate a response to a user’s request. Storage requirements can be considered in two ways: the main memory requirement (online space needed by the system) and the secondary storage requirement (offline space needed by the system).
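
Response time can be measured directly; below is a minimal sketch, assuming a hypothetical `recommend(user, n)` callable that stands in for your own system.

```python
import time

def measure_response_time(recommend, user, n=10):
    """Wall-clock time (in seconds) to produce one top-N recommendation list.

    `recommend` is a hypothetical callable; replace it with your own system.
    """
    start = time.perf_counter()
    recommend(user, n)
    return time.perf_counter() - start
```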

Additional metrics can also be considered, depending on the type of recommender system and the domain in which it is used. For example, in a direct marketing context it is common practice to build a cumulative lift curve or to calculate the AUC. One also has to decide whether the evaluation will be done online or offline. Although offline evaluation is typically applied, it is often misleading because the context of the recommendation is not considered. Online evaluations, however, typically come with higher costs and additional risks (e.g. bad recommendations may hurt customer satisfaction).
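
As an illustration of the direct-marketing style evaluation mentioned above, the sketch below computes the AUC with scikit-learn and a simple cumulative lift value for the top decile; the labels and scores are hypothetical example data, not results from any real system.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical data: 1 = the user responded to / liked the recommendation, 0 = not.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0, 1])
# Scores produced by the recommender (e.g. predicted ratings or probabilities).
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.3, 0.35, 0.05, 0.6])

auc = roc_auc_score(y_true, y_score)

# Cumulative lift at the top decile: response rate among the top 10% of scored
# cases divided by the overall response rate.
top = max(1, len(y_score) // 10)
order = np.argsort(-y_score)
lift_top_decile = y_true[order][:top].mean() / y_true.mean()

print(f"AUC: {auc:.2f}, lift in top decile: {lift_top_decile:.1f}")
```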