Evaluating Recommender Systems

By: DJ Rich


I’m working on a retail recommender system and thinking first about evaluation. Here are a few things I’ve learned to worry about.

Conditional on existing RS

Evaluating a candidate RS by how well it predicts test set purchases isn’t quite what you want, since every purchase in the data happened under the existing RS. The data depends on the system that generated it. In truth, proposing a new RS is proposing an intervention, which makes it a causal question and makes the logged data only partially relevant.
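One standard way to account for this, assuming the existing RS's recommendation probabilities were logged, is inverse propensity scoring: reweight each logged outcome by how much more (or less) often the candidate would have shown that item. This is a minimal sketch with invented data, not a production estimator (in practice you'd worry about variance and clipping):

```python
# Inverse propensity scoring (IPS) sketch for off-policy evaluation.
# Logged data comes from the EXISTING recommender; we estimate how a
# CANDIDATE recommender would have done. All numbers are invented.

def ips_estimate(logs, candidate_prob):
    """Estimate the candidate policy's expected reward from logged data.

    logs: list of (item, logging_prob, reward), where logging_prob is the
          probability the existing RS recommended that item.
    candidate_prob: dict item -> probability the candidate recommends it.
    """
    total = 0.0
    for item, p_log, reward in logs:
        # Reweight the logged reward by the ratio of candidate to logging
        # recommendation probabilities for the item actually shown.
        total += (candidate_prob.get(item, 0.0) / p_log) * reward
    return total / len(logs)

# Toy logs: the old RS shows item "a" 90% of the time, "b" 10%.
logs = [("a", 0.9, 1), ("a", 0.9, 0), ("b", 0.1, 1), ("a", 0.9, 1)]

# A candidate that shows "b" more often gets credit for b's logged reward.
candidate = {"a": 0.5, "b": 0.5}
print(round(ips_estimate(logs, candidate), 3))
```

The large weight on the rarely-shown item is the point, and also the weakness: the estimate is unbiased but can be very high variance.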

Positive feedback

I’m working with item purchases, and those are positive feedback only. If an item is recommended often and never purchased, shouldn’t it be recommended less? A model trained on positive feedback learns that only slowly. In general, this is where the preference for explicit user ratings comes from.
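One partial fix, assuming impression counts are logged alongside purchases, is to rank by purchase *rate* rather than raw purchase counts. A tiny sketch with invented names and numbers:

```python
# Why impression counts matter with purchase-only data. Ranking by raw
# purchases rewards whatever the old RS pushed hardest; a purchase rate
# downranks the over-recommended item. All data here is invented.

items = {
    # item: (times_recommended, times_purchased)
    "heavily_pushed": (1000, 30),
    "rarely_shown":   (50,   10),
}

by_purchases = max(items, key=lambda i: items[i][1])
by_rate      = max(items, key=lambda i: items[i][1] / items[i][0])

print(by_purchases)  # "heavily_pushed" wins on raw counts...
print(by_rate)       # ...but "rarely_shown" converts far better
```

This still says nothing about items the old RS never showed at all, which is the explore-exploit problem below.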

Complexity of the user experience

There is so much more to a user’s experience than what item purchases indicate, like what else they saw when purchasing. Every purchase is conditional on the user experience at purchase time, and that’s real hard to represent as a vector: one rich enough to produce personalization, but not so fine-grained that it splits user experiences into cases too sparse to estimate.

A/B Testing

You should evaluate RS’s with A/B testing, but an A/B test normally tells you about changing only one variable at a time. A/B testing alone is a very slow path to an optimal RS. Think years.
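A back-of-envelope sample size calculation shows why. Using the standard two-proportion formula, with an invented 2% base purchase rate and a hoped-for 5% relative lift (2.0% to 2.1%):

```python
# Rough sample size per arm for a two-proportion A/B test.
# The base rate and lift are invented for illustration.

def samples_per_arm(p_base, p_new, z_alpha=1.96, z_beta=0.84):
    # z_alpha: two-sided 5% significance; z_beta: 80% power.
    p_bar = (p_base + p_new) / 2
    delta = p_new - p_base
    return 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / delta ** 2

n = samples_per_arm(0.020, 0.021)
print(f"{n:,.0f} users per arm")  # hundreds of thousands for a small lift
```

Hundreds of thousands of users per arm, per variable tested, and you can only run so many tests at once.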


There’s an explore-exploit trade-off. Maybe the old RS never recommends certain products that would be very popular if they were recommended. An RS trained on that data will repeat the mistake. A smart RS knows to explore new items just the right amount.
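The simplest version of "just the right amount" is epsilon-greedy: with small probability, recommend a random item instead of the current best, so items the old RS never surfaced still get trials. A minimal sketch with invented scores:

```python
import random

# Epsilon-greedy sketch of the explore/exploit trade-off. Scores and
# item names are invented; a real system would update scores online.

def recommend(scores, epsilon, rng):
    """scores: dict item -> estimated appeal from historical data."""
    if rng.random() < epsilon:
        return rng.choice(sorted(scores))   # explore: any item, uniformly
    return max(scores, key=scores.get)      # exploit: current best

rng = random.Random(0)
scores = {"a": 0.9, "b": 0.1, "c": 0.0}    # "c" has never been shown

picks = [recommend(scores, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count("a"), picks.count("c"))  # mostly "a", but "c" gets trials
```

Fancier schemes (Thompson sampling, upper confidence bounds) explore in proportion to uncertainty rather than uniformly.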

The right way, technically

The theoretically right way to do this is with reinforcement learning: something that learns in real time from real users. But RL evaluation is extremely hard and precarious. It’s an actual test-in-production routine. You don’t get to freeze the parameters, evaluate the model offline, launch in shadow mode and then deploy to production. You have to create a meta-model-thing that fits on its own, stress it in a simulated environment (which never matches reality), deploy it to production, then relaunch it when the real evaluation rolls in. And repeat. Most companies won’t stomach this.
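To make the "stress it in a simulated environment" step concrete, here is a toy version: a learning agent run against a simulator with invented purchase rates before any real users are involved. As above, a real environment would never match the simulator; this only sanity-checks that the agent learns at all:

```python
import random

# Toy simulated-environment check for a learning agent. The simulator's
# per-item purchase rates are invented; passing this says nothing about
# production, it only catches agents that fail to learn even here.

def run_sim(true_rates, epsilon, steps, seed):
    rng = random.Random(seed)
    items = sorted(true_rates)
    shows = {i: 0 for i in items}
    buys = {i: 0 for i in items}
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            item = rng.choice(items)  # explore
        else:
            # Exploit the best empirical rate; unseen items start optimistic.
            item = max(items,
                       key=lambda i: buys[i] / shows[i] if shows[i] else 1.0)
        reward = 1 if rng.random() < true_rates[item] else 0
        shows[item] += 1
        buys[item] += reward
        total += reward
    return total / steps

# The agent should roughly find item "b"'s 30% rate, minus exploration cost.
rate = run_sim({"a": 0.05, "b": 0.30, "c": 0.10}, epsilon=0.1, steps=5000, seed=1)
print(round(rate, 2))
```

The real versions of this loop, and the repeated redeploy-and-relaunch cycle around them, are what most companies won’t stomach.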