Recommendations: What and Why?

What are Recommendations?

How does YouTube know what video you might want to watch next? How does the Google Play Store pick an app just for you? Magic? No, in both cases, an ML-based recommendation model determines how similar videos and apps are to other things you like and then serves up a recommendation. Two kinds of recommendations are commonly used:

Homepage Recommendations

Homepage recommendations are personalized to a user based on their known interests. Every user sees different recommendations.

If you go to the Google Play Apps homepage, you may see something like this:

As the name suggests, related items are recommendations similar to a particular item. In the Google Play apps example, users looking at a page for a math app may also see a panel of related apps, such as other math or science apps.

Why Recommendations?

A recommendation system helps users find compelling content in a large corpora. For example, the Google Play Store provides millions of apps, while YouTube provides billions of videos. More apps and videos are added every day. How can users find new compelling new content? Yes, one can use search to access content. However, a recommendation engine can display items that users might not have thought to search for on their own.

Did you know?

Terminology

Before we dive in, there are a few terms that you should know:

Items (also known as documents)

The entities a system recommends. For the Google Play store, the items are apps to install. For YouTube, the items are videos.

Query (also known as context)

The information a system uses to make recommendations. Queries can be a combination of the following:

Embedding

A mapping from a discrete set (in this case, the set of queries, or the set of items to recommend) to a vector space called the embedding space. Many recommendation systems rely on learning an appropriate embedding representation of the queries and items.

Recommendation Systems Overview

One common architecture for recommendation systems consists of the following components:

Candidate Generation

In this first stage, the system starts from a potentially huge corpus and generates a much smaller subset of candidates. For example, the candidate generator in YouTube reduces billions of videos down to hundreds or thousands. The model needs to evaluate queries quickly given the enormous size of the corpus. A given model may provide multiple candidate generators, each nominating a different subset of candidates.

Scoring

Next, another model scores and ranks the candidates in order to select the set of items (on the order of 10) to display to the user. Since this model evaluates a relatively small subset of items, the system can use a more precise model relying on additional queries.

Re-ranking

Finally, the system must take into account additional constraints for the final ranking. For example, the system removes items that the user explicitly disliked or boosts the score of fresher content. Re-ranking can also help ensure diversity, freshness, and fairness.

We will discuss each of these stages over the course of the class and give examples from different recommendation systems, such as YouTube.

Extra Resource: For a more comprehensive account of the technology, architecture, and models used in YouTube, see Covington et al., Deep Neural Networks for YouTube Recommendations.

Candidate Generation Overview

Candidate generation is the first stage of recommendation. Given a query, the system generates a set of relevant candidates. The following table shows two common candidate generation approaches:

Screen Shot 2022-01-25 at 6.06.02 PM

Embedding Space

Both content-based and collaborative filtering map each item and each query (or context) to an embedding vector in a common embedding space E=RdE=\mathbb{R}^d. Typically, the embedding space is low-dimensional (that is, dd is much smaller than the size of the corpus), and captures some latent structure of the item or query set. Similar items, such as YouTube videos that are usually watched by the same user, end up close together in the embedding space. The notion of “closeness” is defined by a similarity measure.

Extra Resource: projector.tensorflow.org is an interactive tool to visualize embeddings.

Similarity Measures

A similarity measure is a function s:E×ERs:E×E \rightarrow \mathbb{R} that takes a pair of embeddings and returns a scalar measuring their similarity. The embeddings can be used for candidate generation as follows: given a query embedding qEq \in E, the system looks for item embeddings xEx \in E that are close to qq, that is, embeddings with high similarity s(q,x)s(q,x).

To determine the degree of similarity, most recommendation systems rely on one or more of the following:

Cosine

This is simply the cosine of the angle between the two vectors, s(q,x)=cos(q,x)s(q,x)=\cos⁡(q,x)

Dot Product

The dot product of two vectors is s(q,x)=q,x=i=1dqixis(q,x)=⟨q,x⟩=\sum\limits^d_{i=1}q_i x_i. It is also given by s(q,x)=x‖‖qcos(q,x)s(q,x)=‖x‖‖q‖\cos⁡(q,x) (the cosine of the angle multiplied by the product of norms). Thus, if the embeddings are normalized, then dot-product and cosine coincide.

Euclidean distance

This is the usual distance in Euclidean space, s(q,x)=qx=[i=1d(qixi)2]12s(q,x)=‖q−x‖=[\sum\limits^d_{i=1} (q_i - x_i)^2]^{\frac{1}{2}}. A smaller distance means higher similarity. Note that when the embeddings are normalized, the squared Euclidean distance coincides with dot-product (and cosine) up to a constant, since in that case 12qx2=1q,x\frac{1}{2}‖q−x‖^2=1−⟨q,x⟩.

Comparing Similarity Measures

Consider the example in the figure below. The black vector illustrates the query embedding. The other three embedding vectors (Item A, Item B, Item C) represent candidate items. Depending on the similarity measure used, the ranking of the items can be different.

Using the image, try to determine the item ranking using all three of the similarity measures: cosine, dot product, and Euclidean distance.

How did you do?

Item A has the largest norm, and is ranked higher according to the dot-product. Item C has the smallest angle with the query, and is thus ranked first according to the cosine similarity. Item B is physically closest to the query so Euclidean distance favors it.

Which Similarity Measure to Choose?

Compared to the cosine, the dot product similarity is sensitive to the norm of the embedding. That is, the larger the norm of an embedding, the higher the similarity (for items with an acute angle) and the more likely the item is to be recommended. This can affect recommendations as follows:

Content-based Filtering

Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.

To demonstrate content-based filtering, let’s hand-engineer some features for the Google Play store. The following figure shows a feature matrix where each row represents an app and each column represents a feature. Features could include categories (such as Education, Casual, Health), the publisher of the app, and many others. To simplify, assume this feature matrix is binary: a non-zero value means the app has that feature.

You also represent the user in the same feature space. Some of the user-related features could be explicitly provided by the user. For example, a user selects “Entertainment apps” in their profile. Other features can be implicit, based on the apps they have previously installed. For example, the user installed another app published by Science R Us.

The model should recommend items relevant to this user. To do so, you must first pick a similarity metric (for example, dot product). Then, you must set up the system to score each candidate item according to this similarity metric. Note that the recommendations are specific to this user, as the model did not use any information about other users.

Using Dot Product as a Similarity Measure

Consider the case where the user embedding xx and the app embedding yy are both binary vectors. Since x,y=i=1dxiyi⟨x,y⟩ = \sum\limits^d_{i=1} x_i y_i, a feature appearing in both xx and yy contributes a 11 to the sum. In other words, x,y⟨x,y⟩ is the number of features that are active in both vectors simultaneously. A high dot product then indicates more common features, thus a higher similarity.

Content-based Filtering Advantages & Disadvantages

Advantages

Disadvantages

Collaborative Filtering

To address some of the limitations of content-based filtering, collaborative filtering uses similarities between users and items simultaneously to provide recommendations. This allows for serendipitous recommendations; that is, collaborative filtering models can recommend an item to user A based on the interests of a similar user B. Furthermore, the embeddings can be learned automatically, without relying on hand-engineering of features.

A Movie Recommendation Example

Consider a movie recommendation system in which the training data consists of a feedback matrix in which:

The feedback about movies falls into one of two categories:

To simplify, we will assume that the feedback matrix is binary; that is, a value of 1 indicates interest in the movie.

When a user visits the homepage, the system should recommend movies based on both:

For the sake of illustration, let’s hand-engineer some features for the movies described in the following table:

Screen Shot 2022-01-26 at 12.19.40 PM

1D Embedding

Suppose we assign to each movie a scalar in [1,1][−1,1] that describes whether the movie is for children (negative values) or adults (positive values). Suppose we also assign a scalar to each user in [1,1][−1,1] that describes the user’s interest in children’s movies (closer to -1) or adult movies (closer to +1). The product of the movie embedding and the user embedding should be higher (closer to 1) for movies that we expect the user to like.

In the diagram below, each checkmark identifies a movie that a particular user watched. The third and fourth users have preferences that are well explained by this feature—the third user prefers movies for children and the fourth user prefers movies for adults. However, the first and second users’ preferences are not well explained by this single feature.

2D Embedding

One feature was not enough to explain the preferences of all users. To overcome this problem, let’s add a second feature: the degree to which each movie is a blockbuster or an arthouse movie. With a second feature, we can now represent each movie with the following two-dimensional embedding:

We again place our users in the same embedding space to best explain the feedback matrix: for each (user, item) pair, we would like the dot product of the user embedding and the item embedding to be close to 1 when the user watched the movie, and to 0 otherwise.

Note: We represented both items and users in the same embedding space. This may seem surprising. After all, users and items are two different entities. However, you can think of the embedding space as an abstract representation common to both items and users, in which we can measure similarity or relevance using a similarity metric.

In this example, we hand-engineered the embeddings. In practice, the embeddings can be learned automatically, which is the power of collaborative filtering models. In the next two sections, we will discuss different models to learn these embeddings, and how to train them.

The collaborative nature of this approach is apparent when the model learns the embeddings. Suppose the embedding vectors for the movies are fixed. Then, the model can learn an embedding vector for the users to best explain their preferences. Consequently, embeddings of users with similar preferences will be close together. Similarly, if the embeddings for the users are fixed, then we can learn movie embeddings to best explain the feedback matrix. As a result, embeddings of movies liked by similar users will be close in the embedding space.

Matrix Factorization

Matrix factorization is a simple embedding model. Given the feedback matrix ARm×nA \in \mathbb{R}^{m×n}, where mm is the number of users (or queries) and nn is the number of items, the model learns:

The embeddings are learned such that the product UVTUV^T is a good approximation of the feedback matrix A. Observe that the (i,j)(i,j) entry of U.VTU.V^T is simply the dot product Ui,Vj⟨U_i,V_j⟩ of the embeddings of user ii and item jj, which you want to be close to Ai,jA_{i,j}.

Note: Matrix factorization typically gives a more compact representation than learning the full matrix. The full matrix has O(nm)O(nm) entries, while the embedding matrices UU, VV have O((n+m)d)O((n+m)d) entries, where the embedding dimension dd is typically much smaller than mm and nn. As a result, matrix factorization finds latent structure in the data, assuming that observations lie close to a low-dimensional subspace. In the preceding example, the values of nn, mm, and dd are so low that the advantage is negligible. In real-world recommendation systems, however, matrix factorization can be significantly more compact than learning the full matrix.

Choosing the Objective Function

One intuitive objective function is the squared distance. To do this, minimize the sum of squared errors over all pairs of observed entries:

minURm×d,VRn×d(i,j)obs(AijUi,Vj)2\min_{U \in \mathbb{R}^{m \times d}, V \in \mathbb{R}^{n \times d}} \sum\limits_{(i,j) \in \text{obs}} (A_{ij} - \langle U_i, V_j \rangle)^2

In this objective function, you only sum over observed pairs (i, j), that is, over non-zero values in the feedback matrix. However, only summing over values of one is not a good idea—a matrix of all ones will have a minimal loss and produce a model that can’t make effective recommendations and that generalizes poorly.

Perhaps you could treat the unobserved values as zero, and sum over all entries in the matrix. This corresponds to minimizing the squared Frobenius distance between AA and its approximation UVTUV^T:

minURm×d,VRn×dAUVTF2\min_{U \in \mathbb{R}^{m \times d}, V \in \mathbb{R}^{n \times d}} ||A - UV^T||^2_F

You can solve this quadratic problem through Singular Value Decomposition (SVD) of the matrix. However, SVD is not a great solution either, because in real applications, the matrix AA may be very sparse. For example, think of all the videos on YouTube compared to all the videos a particular user has viewed. The solution UVTUV^T (which corresponds to the model’s approximation of the input matrix) will likely be close to zero, leading to poor generalization performance.

In contrast, Weighted Matrix Factorization decomposes the objective into the following two sums:

minURm×d,VRn×d(i,j)obs(AijUi,Vj)2+w0(i,j)obs(Ui,Vj)2\min_{U \in \mathbb{R}^{m \times d}, V \in \mathbb{R}^{n \times d}} \sum\limits_{(i,j) \in \text{obs}} (A_{ij} - \langle U_i, V_j \rangle)^2 + w_0 \sum\limits_{(i,j) \notin \text{obs}} (\langle U_i, V_j \rangle)^2

Here, w0w_0 is a hyperparameter that weights the two terms so that the objective is not dominated by one or the other. Tuning this hyperparameter is very important.

Note: In practical applications, you also need to weight the observed pairs carefully. For example, frequent items (for example, extremely popular YouTube videos) or frequent queries (for example, heavy users) may dominate the objective function. You can correct for this effect by weighting training examples to account for item frequency. In other words, you can replace the objective function by:

(i,j)obswij(AijUi,Vj)2+w0(i,j)obs(Ui,Vj)2\sum\limits_{(i,j) \in \text{obs}} w_{ij} (A_{ij} - \langle U_i, V_j \rangle)^2 + w_0 \sum\limits_{(i,j) \notin \text{obs}} (\langle U_i, V_j \rangle)^2

where wi,jw_{i,j} is a function of the frequency of query ii and item jj.

Minimizing the Objective Function

Common algorithms to minimize the objective function include:

The objective is quadratic in each of the two matrices UU and VV. (Note, however, that the problem is not jointly convex.) WALS works by initializing the embeddings randomly, then alternating between:

Each stage can be solved exactly (via solution of a linear system) and can be distributed. This technique is guaranteed to converge because each step is guaranteed to decrease the loss.

SGD vs. WALS

SGD and WALS have advantages and disadvantages. Review the information below to see how they compare:

SGD

✅ Very flexible—can use other loss functions.

✅ Can be parallelized.

❌ Slower—does not converge as quickly.

❌ Harder to handle the unobserved entries (need to use negative sampling or gravity).

WALS

❌ Reliant on Loss Squares only.

✅ Can be parallelized.

✅ Converges faster than SGD.

✅ Easier to handle unobserved entries.

Collaborative Filtering Advantages & Disadvantages

Advantages

No domain knowledge necessary

We don’t need domain knowledge because the embeddings are automatically learned.

Serendipity

The model can help users discover new interests. In isolation, the ML system may not know the user is interested in a given item, but the model might still recommend it because similar users are interested in that item.

Great starting point

To some extent, the system needs only the feedback matrix to train a matrix factorization model. In particular, the system doesn’t need contextual features. In practice, this can be used as one of multiple candidate generators.

Disadvantages

Cannot handle fresh items

The prediction of the model for a given (user, item) pair is the dot product of the corresponding embeddings. So, if an item is not seen during training, the system can’t create an embedding for it and can’t query the model with this item. This issue is often called the cold-start problem. However, the following techniques can address the cold-start problem to some extent:

minvi0RdAi0Uvi0\min_{v_{i_0} \in \mathbb{R}^d} ||A_{i_0} - Uv_{i_0}||

Hard to include side features for query/item

Side features are any features beyond the query or item ID. For movie recommendations, the side features might include country or age. Including available side features improves the quality of the model. Although it may not be easy to include side features in WALS, a generalization of WALS makes this possible.

To generalize WALS, augment the input matrix with features by defining a block matrix Aˉ\bar{A}, where:

Note: Block (1, 1) is typically left empty. If you apply matrix factorization to Aˉ\bar{A}, then the system learns embeddings for side features, in addition to user and item embeddings.

Exercise: Build a Movie Recommendation System

This Colab notebook goes into more detail about Recommendation Systems. Specifically, you will be using matrix factorization to build a movie recommendation system, using the MovieLens dataset. Given a user and their ratings of movies on a scale of 1-5, your system will recommend movies the user is likely to rank highly.

Topics covered:

Note: In the last section of the Colab notebook (Section VI), you will build a softmax model. You will come back to that section at the end of the course, once we have discussed softmax models.

Colab notebook

Recommendation Using Deep Neural Networks

The previous section showed you how to use matrix factorization to learn embeddings. Some limitations of matrix factorization include:

Deep neural network (DNN) models can address these limitations of matrix factorization. DNNs can easily incorporate query features and item features (due to the flexibility of the input layer of the network), which can help capture the specific interests of a user and improve the relevance of recommendations.

Softmax DNN for Recommendation

One possible DNN model is softmax, which treats the problem as a multiclass prediction problem in which:

Input

The input to a DNN can include:

Unlike the matrix factorization approach, you can add side features such as age or country. We’ll denote the input vector by x.

Model Architecture

The model architecture determines the complexity and expressivity of the model. By adding hidden layers and non-linear activation functions (for example, ReLU), the model can capture more complex relationships in the data. However, increasing the number of parameters also typically makes the model harder to train and more expensive to serve. We will denote the output of the last hidden layer by ψ(x)Rd\psi(x) \in \mathbb{R}^d.

Softmax Output: Predicted Probability Distribution

The model maps the output of the last layer, ψ(x)\psi(x), through a softmax layer to a probability distribution p^=h(ψ(x)VT)\hat{p} = h(\psi(x)V^T), where:

The softmax layer maps a vector of scores yRny \in \mathbb{R}^n (sometimes called the logits) to a probability distribution.

Did you know?

The name softmax is a play on words. A “hard” max assigns probability 1 to the item with the largest score yiy_i. By contrast, the softmax assigns a non-zero probability to all items, giving a higher probability to items that have higher scores. When the scores are scaled, the softmax h(αy)h(\alpha y) converges to a “hard” max in the limit α\alpha \rightarrow \infty.

Loss Function

Finally, define a loss function that compares the following:

For example, you can use the cross-entropy loss since you are comparing two probability distributions.

Softmax Embeddings

The probability of item jj is given by p^j=exp(ψ(x),Vj)Z\hat{p}_j = \frac{\exp(\langle \psi(x), V_j \rangle)}{Z}, where ZZ is a normalization constant that does not depend on jj.

In other words, log(p^j)=ψ(x),Vjlog(Z)\log(\hat{p}_j) = \langle \psi(x), V_j \rangle - \log(Z), so the log probability of an item jj is (up to an additive constant) the dot product of two dd-dimensional vectors, which can be interpreted as query and item embeddings:

Note: Since log is an increasing function, items jj with the highest probability p^j\hat{p}_j are the items with the highest dot product ψ(x),Vj\langle \psi(x), V_j \rangle. Therefore, the dot product can be interpreted as a similarity measure in this embedding space.

DNN and Matrix Factorization

In both the softmax model and the matrix factorization model, the system learns one embedding vector VjV_j per item jj. What we called the item embedding matrix VRn×dV \in \mathbb{R}^{n \times d} in matrix factorization is now the matrix of weights of the softmax layer.

The query embeddings, however, are different. Instead of learning one embedding UiU_i per query ii, the system learns a mapping from the query feature xx to an embedding ψ(x)Rd\psi(x) \in \mathbb{R}^d. Therefore, you can think of this DNN model as a generalization of matrix factorization, in which you replace the query side by a nonlinear function ψ(.)\psi(.).

Can You Use Item Features?

Can you apply the same idea to the item side? That is, instead of learning one embedding per item, can the model learn a nonlinear function that maps item features to an embedding? Yes. To do so, use a two-tower neural network, which consists of two neural networks:

The output of the model can be defined as the dot product of ψ(xquery),ϕ(xitem)⟨\psi(x_\text{query}) ,\phi(x_\text{item})⟩. Note that this is not a softmax model anymore. The new model predicts one value per pair (xquery,xitem)(x_\text{query}, x_\text{item}) instead of a probability vector for each query xqueryx_\text{query}.

Softmax Training

The previous page explained how to incorporate a softmax layer into a deep neural network for a recommendation system. This page takes a closer look at the training data for this system.

Training Data

The softmax training data consists of the query features xx and a vector of items the user interacted with (represented as a probability distribution pp). These are marked in blue in the following figure. The variables of the model are the weights in the different layers. These are marked as orange in the following figure. The model is typically trained using any variant of stochastic gradient descent.

Negative Sampling

Since the loss function compares two probability vectors p,p^Rnp,\hat{p} \in \mathbb{R}^n (the ground truth and the output of the model, respectively), computing the gradient of the loss (for a single query xx) can be prohibitively expensive if the corpus size nn is too big.

You could set up a system to compute gradients only on the positive items (items that are active in the ground truth vector). However, if the system only trains on positive pairs, the model may suffer from folding, as explained below.

Screen Shot 2022-01-26 at 1.41.34 PM

Instead of using all items to compute the gradient (which can be too expensive) or using only positive items (which makes the model prone to folding), you can use negative sampling. More precisely, you compute an approximate gradient, using the following items:

There are different strategies for sampling negatives:

Extra Resources:

On Matrix Factorization Vs. Softmax

DNN models solve many limitations of Matrix Factorization, but are typically more expensive to train and query. The table below summarizes some of the important differences between the two models.

Screen Shot 2022-01-26 at 1.41.46 PM

In summary:

Retrieval

Suppose you have an embedding model. Given a user, how would you decide which items to recommend?

At serve time, given a query, you start by doing one of the following:

Once you have the query embedding qq, search for item embeddings VjV_j that are close to qq in the embedding space. This is a nearest neighbor problem. For example, you can return the top kk items according to the similarity score s(q,Vj)s(q,V_j).

You can use a similar approach in related-item recommendations. For example, when the user is watching a YouTube video, the system can first look up the embedding of that item, and then look for embeddings of other items VjV_j that are close in the embedding space.

Large-scale Retrieval

To compute the nearest neighbors in the embedding space, the system can exhaustively score every potential candidate. Exhaustive scoring can be expensive for very large corpora, but you can use either of the following strategies to make it more efficient:

Scoring

After candidate generation, another model scores and ranks the generated candidates to select the set of items to display. The recommendation system may have multiple candidate generators that use different sources, such as the following:

Examples

The system combines these different sources into a common pool of candidates that are then scored by a single model and ranked according to that score. For example, the system can train a model to predict the probability of a user watching a video on YouTube given the following:

The system can then rank the videos in the pool of candidates according to the prediction of the model.

Why Not Let the Candidate Generator Score?

Since candidate generators compute a score (such as the similarity measure in the embedding space), you might be tempted to use them to do ranking as well. However, you should avoid this practice for the following reasons:

Choosing an Objective Function for Scoring

As you may remember from Introduction to ML Problem Framing, ML can act like a mischievous genie: very happy to learn the objective you provide, but you have to be careful what you wish for. This mischievous quality also applies to recommendation systems. The choice of scoring function can dramatically affect the ranking of items, and ultimately the quality of the recommendations.

Example:

Maximize Click Rate

If the scoring function optimizes for clicks, the systems may recommend click-bait videos. This scoring function generates clicks but does not make a good user experience. Users’ interest may quickly fade.

Maximize Watch Time

If the scoring function optimizes for watch time, the system might recommend very long videos, which might lead to a poor user experience. Note that multiple short watches can be just as good as one long watch.

Increase Diversity and Maximize Session Watch Time

Recommend shorter videos, but ones that are more likely to keep the user engaged.

Positional Bias in Scoring

Items that appear lower on the screen are less likely to be clicked than items appearing higher on the screen. However, when scoring videos, the system usually doesn’t know where on the screen a link to that video will ultimately appear. Querying the model with all possible positions is too expensive. Even if querying multiple positions were feasible, the system still might not find a consistent ranking across multiple ranking scores.

Solutions

Re-ranking

In the final stage of a recommendation system, the system can re-rank the candidates to consider additional criteria or constraints. One re-ranking approach is to use filters that remove some candidates.

Example: You can implement re-ranking on a video recommender by doing the following:

  1. Training a separate model that detects whether a video is click-bait.
  2. Running this model on the candidate list.
  3. Removing the videos that the model classifies as click-bait.

Another re-ranking approach is to manually transform the score returned by the ranker.

Example: The system re-ranks videos by modifying the score as a function of:

This section briefly discusses freshness, diversity, and fairness. These factors are among many that can help improve your recommendation system. Some of these factors often require modifying different stages of the process. Each section offers solutions that you might apply individually or collectively.

Freshness

Most recommendation systems aim to incorporate the latest usage information, such as current user history and the newest items. Keeping the model fresh helps the model make good recommendations.

Solutions

Diversity

If the system always recommend items that are “closest” to the query embedding, the candidates tend to be very similar to each other. This lack of diversity can cause a bad or boring user experience. For example, if YouTube just recommends videos very similar to the video the user is currently watching, such as nothing but owl videos (as shown in the illustration), the user will likely lose interest quickly.

Solutions

Fairness

Your model should treat all users fairly. Therefore, make sure your model isn’t learning unconscious biases from the training data.

Solutions

Exercise: Build a Movie Recommendation System (continued)

You are now ready to complete the last section of the Colab. You can reopen the Colab notebook and resume working on Section VI. In this section, you will train a softmax model using the MovieLens data set.

Topics covered:

Colab notebook

Summary

You should now have a better understanding of how to: