How to Understand Recommender Systems

By: DJ Rich

Posted: Updated:

In this post, I’ll present a perspective for understanding recommender systems. At their core is a model that infers users’ preferences for items, which are then used to recommend those that are most preferred. The literature is a zoo of such models, and it’s hard to see what they have in common. But at bottom, they handle mostly the same set of information. Thinking in terms of this set makes the range of recommender systems easier to understand.

The context is that I’m expanding the recommender system for Aampe, a client of mine that does engagement management. For example, if an online retailer signs up with Aampe, their algorithms manage the messages sent to the retailer’s users. The algorithms are optimized for producing purchases1 without sending too many messages, which can cause users to shut off notifications.

Some of the strongest messages are those that recommend a specific and relevant purchase.

A great message might be “Do you need to buy sunblock?” after a user purchased towels, sunglasses, a bathing suit and no sunblock.

So, a strong recommender is critical for effective messaging.

The Aampe team and I are encouraged by the performance of the system we’ve developed so far, but there’s more testing to do and it’s not yet deployed. Once that’s done, we’ll write a post on how exactly it works.

For now, I’ll present how I’m thinking about these systems in general.

The Information

The following is most of the information we need to understand recommender systems. Say we have:

  • \(N\) users indexed by \(i = 1, \cdots, N\). Think of these as user IDs.

  • \(M\) items indexed by \(j = 1, \cdots, M\). Think of these as item IDs.

  • User \(i\) has feature vector \(\mathbf{v}_i\). This could indicate things like their gender, geography, search queries, estimated age or engagement levels.

  • Item \(j\) has feature vector \(\mathbf{w}_j\). This could indicate things like the item’s price, category, name, manufacturer or brand.

  • \(y_{i, j}\) is an interaction target. If it’s binary, \(y_{i, j} = 1\) might indicate user \(i\) purchased item \(j\). If it’s an integer between \(1\) and \(5\), it could be user \(i\)’s rating of item \(j\). Being slightly sloppy, I call them preferences.

    We can represent all user-item targets with an \(N\)-by-\(M\) matrix \(\mathbf{Y}\). Note that in typical cases, \(\mathbf{Y}\) is mostly unobserved.

  • \(\mathbf{c}_{i, j}\) is a feature vector of length \(d\) characterizing the circumstance (the ‘context’) under which interaction \(y_{i, j}\) took place. For example, it could represent the time of the interaction or the other items on screen during the interaction.

    We can represent all contextual information with an \(N\)-by-\(M\)-by-\(d\) tensor \(\mathbf{C}\). Note that it will have (at least) the same unobserved pattern as \(\mathbf{Y}\).

Collaborative Filtering

Collaborative filtering (‘CF’) methods use similarities to infer unobserved user-item preferences. We’re interested in unobserved preferences because we’d like to recommend highly preferred items not yet encountered by the user.

For a given user, the preferences of a peer group of similar users are used to estimate the user’s preferences for unencountered items. Reference to a peer group is what makes it ‘collaborative’.

Specifically, an estimate of an unobserved preference is given with the following sum over observed elements, where \(s(i, i')\) is the similarity between users \(i\) and \(i'\):

\[\hat{y}_{i, j} = \sum_{i':\,y_{i', j}\textrm{ observed}} s(i, i')y_{i', j}\]

This is a generalized notion of a peer group; \(s(i, i')\) could be defined such that it indicates discretely if \(i\) and \(i'\) are from the same group. Further, \(s(i, i')\) needs to be properly normalized such that the sum amounts to a weighted average.

Alternatively, items may collaborate to infer unobserved user-item preferences. The prediction is formed as follows, where \(s(j, j')\) is the similarity between items \(j\) and \(j'\):

\[\hat{y}_{i, j} = \sum_{j':\,y_{i, j'}\textrm{ observed}} s(j, j')y_{i, j'}\]

See Two Decades of Recommender Systems at Amazon.com for an example of an item-based system2.

CF may only require item IDs, user IDs and the matrix \(\mathbf{Y}\). User-similarities could be defined with a similarity metric between \(\mathbf{Y}\)’s rows or item-similarities could be defined with a metric across columns. Depending on how \(\mathbf{Y}\) is defined, more creative similarities are available as well. Again, see the Amazon paper for an example.

Further, precomputing similarities and storing the observed elements of \(\mathbf{Y}\) may be all that is required prior to making recommendations. These are so called memory-based CF because the training routine amounts to storing (‘memorizing’) the targets and their similarities. This makes it an easy to implement and little-tuning-required nearest-neighbors algorithm. Like any neighbors-based algorithm, this shifts the compute burden to predict time, where a prediction involves summing over many terms or searching for a small set of neighbors (‘collaborators’).

One difficulty with memory-based methods comes from the sometimes-extreme sparsity of \(\mathbf{Y}\). For instance, if \(\mathbf{Y}\) is ratings of movies, an extreme majority of \(\mathbf{Y}\) is unobserved since most people don’t rate movies even if they watch them. Consequently for a given user, the collaborators necessarily need to be quite dissimilar, producing a bad estimate.

To compensate for this sparsity, we make assumptions in so called model-based CF. We assume \(\mathbf{Y}\) has some structure. After fitting to observations, this structure can fill in the unobserved values. The most well known example is matrix factorization, where we make the following approximation:

\[\mathbf{Y} \approx \mathbf{G}\mathbf{H}^T\]

where \(\mathbf{G}\) is \(N\)-by-\(k\), \(\mathbf{H}\) is \(M\)-by-\(k\) and \(k\) is a hyperparameter to be tuned. These matrices can be learned by minimizing (via gradient descent) the Frobenius norm of the approximation errors. It’s a required speedup to only compute predictions on observed values during training. Further, it eases the training routine to add two intercept vectors, \(\mathbf{a} \in \mathbb{R}^N\) and \(\mathbf{b} \in \mathbb{R}^M\), to match the column and row means of \(\mathbf{Y}\) respectively:

\[\mathbf{Y} \approx \textrm{diag}(\mathbf{a}) + \mathbf{G}\mathbf{H}^T + \textrm{diag}(\mathbf{b})\]

With this, any user-item preference can be predicted as:

\[\hat{y}_{i, j} = a_i + \mathbf{g}_i\mathbf{h}_j^T + b_j\]

See chapter 22 of Murphy (2022) for more details on this style of model.

In my experience, I’ve only seen matrix factorization be useful when sparsity is severe. If finding reasonably similar collaborators is possible, the modeling assumptions add little predictive value.

Further, model based approaches require careful tuning and regularization. So before pursuing this, it might be worth exploring a different similarity (e.g. item-item instead of user-user) where sparsity is not an issue.

Content-Based Methods

As mentioned, everything above essentially only uses the interaction target \(\mathbf{Y}\). Observed feature vectors describing items (\(\mathbf{w}_j\)) are ignored despite their useful information (e.g. ‘this is a movie about nature’). Similarly, user feature vectors (\(\mathbf{v}_i\)) are ignored. To include them is a key aspect of content-based methods.

The typical and simple content-based approach is as follows. Both users and items share the same feature space. That is, \(\mathbf{v}_i\) and \(\mathbf{w}_j\) are of the same length and their elements measure similar things. For example, the second element of both vectors might measure exposure to the topic ‘nature’. If items are movies and users are viewers, a nature movie would have its second element equal to one. A user who watches nature movies 50% of the time would have their second element equal to \(0.5\). From this, users are recommended items with similar feature vectors. That is, someone who watches mostly nature videos should be recommended more nature videos.

The upside of this approach is the feature vector for a new item is known upfront, so it can be recommended immediately upon entry into the system. In CF, preferences need to accumulate before an item is recommended, the so-called ‘cold-start’ problem (though it exists for users as well). On the downside, the hand-designed feature space inclusive of both users and items is restrictive, making it unlikely to produce a powerful model. To compensate, deep learning can be used to add model capacity. See section 3.1 of H. Zhou et al. (2023) for an example.

Because of this, I like my more general definition of content-based methods as those that use \(\mathbf{v}_i\) and \(\mathbf{w}_j\), however they are defined. It is easy to define models that roll these vectors into the CF-style predictions of \(\hat{y}_{i, j}\) from earlier. I might call them ‘content-aware’ methods. Others call them hybrid systems (but I find this terminology to not be especially informative).

Using the Context

The last piece of information to include is the context \(\mathbf{C}\) describing the conditions under which user-item interactions took place. Since this is the largest chunk of data and reasonable performance can be achieved without it, it is often ignored3.

In a simple case, \(\mathbf{C}\) could be a binary matrix indicating whether an interaction took place on a mobile device. To incorporate this, we could conditionalize our model on this mobile-yes-or-no fact. That means we essentially have two recommenders, one for mobile and one for not-mobile. But this is a poor use of data since the data is divided and not shared across systems.

Because of this, \(\mathbf{C}\) is especially difficult to use. We are trying to model a high dimensional \(\mathbf{Y}\) conditional on an even higher dimensional \(\mathbf{C}\). This means for a given preference to predict, there is likely very little data considered relevant.

Personally, I have made no attempt of using \(\mathbf{C}\), so I can’t provide much advice on this front. See section II.A.3 of Y. Li et al. (2023) for a discussion.

Other approaches

To be a bit more comprehensive, I’ll briefly describe some other approaches and how they relate to this framework:

  • AutoRec: It is one of the earliest successful demonstrations of autoencoders applied to recommender systems. An autoencoder is a function that recreates its input after passing it through a low-dimensional space. The low dimension forbids a perfect recreation and so the autoencoder must learn the essential characteristics of the input. AutoRec uses an autoencoder on the columns (items) of \(\mathbf{Y}\); missing values are ignored during training but predicted with the final output-recreation of the input. See S. Suvas et al. (2015) for more details.

  • Tree-based methods: Tree-based methods can be strongly predictive with little tuning, making them an attractive tool for recommender systems. The idea is to turn each column (or row) of \(\mathbf{Y}\) into a prediction problem. A given column becomes the target, and all other columns become the input features. One complication is the features are mostly missing, so their values need to be imputed, ideally as predictions of the other models. This is not a trivial issue. See section 3.2.1 of Aggarwal (2016) for details.

  • Reinforcement Learning: A problem with everything discussed so far is the assumption that accurately predicting \(\mathbf{Y}\) is sufficient for what a recommender system should do. In reality, this is a myopic, short-term measure. It is better to measure long-term metrics. Maybe there exists an item that isn’t highly preferred but gets users returning to the product (e.g. educational items might be such a case). This means the recommender system shouldn’t optimize individual recommendations for a short term signal but a long sequence of recommendations for a long term reward. This is exactly the paradigm of reinforcement learning. However, this brings other challenges. See L. Zou et al. (2019) for a discussion.

Complications

In closing, here are some complications to be aware of:

  • Evaluation is difficult. It is not merely a matter of predicting a held out portion of \(\mathbf{Y}\). See my other post for a short description.

  • Missing-patterns may be signals. An entry in \(\mathbf{Y}\) might be missing because the item was never exposed to the user or because it was and the user declined to interact. In the latter case, the missing pattern should be modeled to properly infer preferences.

  • There may be feedback loops. If a system recommends an item often, people may buy it primarily due to their high exposure to it. In the next training, this product will look like an even more attractive item to recommend, due to the exposure-induced buying. The system will lift winners and suppress losers only due the system’s previous behavior and not due to people’s actual preferences. This is one reason why ratings, which allow for explicit negativity, are often a favored signal.


References

  1. C. Aggarwal. Recommender Systems: The Textbook. Springer International Publishing. 2016.

  2. B. Smith and G. Linden. Two Decades of Recommender Systems at Amazon.com. IEEE Internet Computing, vol. 21, no. 3, pp. 12-18, May-June 2017

  3. K. Murphy. Probabilistic Machine Learning: An Introduction. MIT Press. 2022.

  4. H. Zhou, F. Xiong, H. Chen. A Comprehensive Survey of Recommender Systems Based on Deep Learning. Applied Sciences. 2023

  5. Y. Li, K. Liu, R. Satapathy, S. Wang, E. Cambria. Recent Developments in Recommender Systems: A Survey. arXiv. 2023

  6. S. Suvash, A. Menon, S. Sanner, and L. Xie. Autorec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web, pp. 111-112. ACM, 2015.

  7. L. Zou, L. Xia, Z. Ding, J. Song, W. Liu, D. Yin. Reinforcement Learning to Optimize Long-term User Engagement in Recommender Systems. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 2810-2818. July 2019


Something to Add?

If you see an error, egregious omission, something confusing or something worth adding, please email dj@truetheta.io with your suggestion. If it’s substantive, you’ll be credited. Thank you in advance!


Footnotes

  1. Or some other goal behavior. 

  2. The prediction or use of \(\mathbf{Y}\) isn’t explicit in the Amazon paper. This follows from the fact that, implicitly, their \(\mathbf{Y}\) is a binary matrix indicating which users purchased which items. The sum over observed elements that I’ve written is equivalent to the averaging over item-similarities they perform. 

  3. A notable exception is the time of the interaction. This can be rather simply included by recency-weighting preferences.