How does YouTube know what video you might want to watch next? How does the Google Play Store pick an app just for you? Magic? No, in both cases, an ML-based recommendation model determines how similar videos and apps are to other things you like and then serves up a recommendation. Two kinds of recommendations are commonly used:
Homepage recommendations are personalized to a user based on their known interests. Every user sees different recommendations.
If you go to the Google Play Apps homepage, you may see something like this:
As the name suggests, related items are recommendations similar to a particular item. In the Google Play apps example, users looking at a page for a math app may also see a panel of related apps, such as other math or science apps.
A recommendation system helps users find compelling content in a large corpora. For example, the Google Play Store provides millions of apps, while YouTube provides billions of videos. More apps and videos are added every day. How can users find new compelling new content? Yes, one can use search to access content. However, a recommendation engine can display items that users might not have thought to search for on their own.
Did you know?
Before we dive in, there are a few terms that you should know:
The entities a system recommends. For the Google Play store, the items are apps to install. For YouTube, the items are videos.
The information a system uses to make recommendations. Queries can be a combination of the following:
A mapping from a discrete set (in this case, the set of queries, or the set of items to recommend) to a vector space called the embedding space. Many recommendation systems rely on learning an appropriate embedding representation of the queries and items.
One common architecture for recommendation systems consists of the following components:
In this first stage, the system starts from a potentially huge corpus and generates a much smaller subset of candidates. For example, the candidate generator in YouTube reduces billions of videos down to hundreds or thousands. The model needs to evaluate queries quickly given the enormous size of the corpus. A given model may provide multiple candidate generators, each nominating a different subset of candidates.
Next, another model scores and ranks the candidates in order to select the set of items (on the order of 10) to display to the user. Since this model evaluates a relatively small subset of items, the system can use a more precise model relying on additional queries.
Finally, the system must take into account additional constraints for the final ranking. For example, the system removes items that the user explicitly disliked or boosts the score of fresher content. Re-ranking can also help ensure diversity, freshness, and fairness.
We will discuss each of these stages over the course of the class and give examples from different recommendation systems, such as YouTube.
Extra Resource: For a more comprehensive account of the technology, architecture, and models used in YouTube, see Covington et al., Deep Neural Networks for YouTube Recommendations.
Candidate generation is the first stage of recommendation. Given a query, the system generates a set of relevant candidates. The following table shows two common candidate generation approaches:

Both content-based and collaborative filtering map each item and each query (or context) to an embedding vector in a common embedding space . Typically, the embedding space is low-dimensional (that is, is much smaller than the size of the corpus), and captures some latent structure of the item or query set. Similar items, such as YouTube videos that are usually watched by the same user, end up close together in the embedding space. The notion of “closeness” is defined by a similarity measure.
Extra Resource: projector.tensorflow.org is an interactive tool to visualize embeddings.
A similarity measure is a function that takes a pair of embeddings and returns a scalar measuring their similarity. The embeddings can be used for candidate generation as follows: given a query embedding , the system looks for item embeddings that are close to , that is, embeddings with high similarity .
To determine the degree of similarity, most recommendation systems rely on one or more of the following:
This is simply the cosine of the angle between the two vectors,
The dot product of two vectors is . It is also given by (the cosine of the angle multiplied by the product of norms). Thus, if the embeddings are normalized, then dot-product and cosine coincide.
This is the usual distance in Euclidean space, . A smaller distance means higher similarity. Note that when the embeddings are normalized, the squared Euclidean distance coincides with dot-product (and cosine) up to a constant, since in that case .
Consider the example in the figure below. The black vector illustrates the query embedding. The other three embedding vectors (Item A, Item B, Item C) represent candidate items. Depending on the similarity measure used, the ranking of the items can be different.
Using the image, try to determine the item ranking using all three of the similarity measures: cosine, dot product, and Euclidean distance.
How did you do?
Item A has the largest norm, and is ranked higher according to the dot-product. Item C has the smallest angle with the query, and is thus ranked first according to the cosine similarity. Item B is physically closest to the query so Euclidean distance favors it.
Compared to the cosine, the dot product similarity is sensitive to the norm of the embedding. That is, the larger the norm of an embedding, the higher the similarity (for items with an acute angle) and the more likely the item is to be recommended. This can affect recommendations as follows:
Items that appear very frequently in the training set (for example, popular YouTube videos) tend to have embeddings with large norms. If capturing popularity information is desirable, then you should prefer dot product. However, if you’re not careful, the popular items may end up dominating the recommendations. In practice, you can use other variants of similarity measures that put less emphasis on the norm of the item. For example, define .
Items that appear very rarely may not be updated frequently during training. Consequently, if they are initialized with a large norm, the system may recommend rare items over more relevant items. To avoid this problem, be careful about embedding initialization, and use appropriate regularization. We will detail this problem in the first exercise.
Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.
To demonstrate content-based filtering, let’s hand-engineer some features for the Google Play store. The following figure shows a feature matrix where each row represents an app and each column represents a feature. Features could include categories (such as Education, Casual, Health), the publisher of the app, and many others. To simplify, assume this feature matrix is binary: a non-zero value means the app has that feature.
You also represent the user in the same feature space. Some of the user-related features could be explicitly provided by the user. For example, a user selects “Entertainment apps” in their profile. Other features can be implicit, based on the apps they have previously installed. For example, the user installed another app published by Science R Us.
The model should recommend items relevant to this user. To do so, you must first pick a similarity metric (for example, dot product). Then, you must set up the system to score each candidate item according to this similarity metric. Note that the recommendations are specific to this user, as the model did not use any information about other users.
Consider the case where the user embedding and the app embedding are both binary vectors. Since , a feature appearing in both and contributes a to the sum. In other words, is the number of features that are active in both vectors simultaneously. A high dot product then indicates more common features, thus a higher similarity.
Advantages
Disadvantages
To address some of the limitations of content-based filtering, collaborative filtering uses similarities between users and items simultaneously to provide recommendations. This allows for serendipitous recommendations; that is, collaborative filtering models can recommend an item to user A based on the interests of a similar user B. Furthermore, the embeddings can be learned automatically, without relying on hand-engineering of features.
Consider a movie recommendation system in which the training data consists of a feedback matrix in which:
The feedback about movies falls into one of two categories:
To simplify, we will assume that the feedback matrix is binary; that is, a value of 1 indicates interest in the movie.
When a user visits the homepage, the system should recommend movies based on both:
For the sake of illustration, let’s hand-engineer some features for the movies described in the following table:

Suppose we assign to each movie a scalar in that describes whether the movie is for children (negative values) or adults (positive values). Suppose we also assign a scalar to each user in that describes the user’s interest in children’s movies (closer to -1) or adult movies (closer to +1). The product of the movie embedding and the user embedding should be higher (closer to 1) for movies that we expect the user to like.
In the diagram below, each checkmark identifies a movie that a particular user watched. The third and fourth users have preferences that are well explained by this feature—the third user prefers movies for children and the fourth user prefers movies for adults. However, the first and second users’ preferences are not well explained by this single feature.
One feature was not enough to explain the preferences of all users. To overcome this problem, let’s add a second feature: the degree to which each movie is a blockbuster or an arthouse movie. With a second feature, we can now represent each movie with the following two-dimensional embedding:
We again place our users in the same embedding space to best explain the feedback matrix: for each (user, item) pair, we would like the dot product of the user embedding and the item embedding to be close to 1 when the user watched the movie, and to 0 otherwise.
Note: We represented both items and users in the same embedding space. This may seem surprising. After all, users and items are two different entities. However, you can think of the embedding space as an abstract representation common to both items and users, in which we can measure similarity or relevance using a similarity metric.
In this example, we hand-engineered the embeddings. In practice, the embeddings can be learned automatically, which is the power of collaborative filtering models. In the next two sections, we will discuss different models to learn these embeddings, and how to train them.
The collaborative nature of this approach is apparent when the model learns the embeddings. Suppose the embedding vectors for the movies are fixed. Then, the model can learn an embedding vector for the users to best explain their preferences. Consequently, embeddings of users with similar preferences will be close together. Similarly, if the embeddings for the users are fixed, then we can learn movie embeddings to best explain the feedback matrix. As a result, embeddings of movies liked by similar users will be close in the embedding space.
Matrix factorization is a simple embedding model. Given the feedback matrix , where is the number of users (or queries) and is the number of items, the model learns:
The embeddings are learned such that the product is a good approximation of the feedback matrix A. Observe that the entry of is simply the dot product of the embeddings of user and item , which you want to be close to .
Note: Matrix factorization typically gives a more compact representation than learning the full matrix. The full matrix has entries, while the embedding matrices , have entries, where the embedding dimension is typically much smaller than and . As a result, matrix factorization finds latent structure in the data, assuming that observations lie close to a low-dimensional subspace. In the preceding example, the values of , , and are so low that the advantage is negligible. In real-world recommendation systems, however, matrix factorization can be significantly more compact than learning the full matrix.
One intuitive objective function is the squared distance. To do this, minimize the sum of squared errors over all pairs of observed entries:
In this objective function, you only sum over observed pairs (i, j), that is, over non-zero values in the feedback matrix. However, only summing over values of one is not a good idea—a matrix of all ones will have a minimal loss and produce a model that can’t make effective recommendations and that generalizes poorly.
Perhaps you could treat the unobserved values as zero, and sum over all entries in the matrix. This corresponds to minimizing the squared Frobenius distance between and its approximation :
You can solve this quadratic problem through Singular Value Decomposition (SVD) of the matrix. However, SVD is not a great solution either, because in real applications, the matrix may be very sparse. For example, think of all the videos on YouTube compared to all the videos a particular user has viewed. The solution (which corresponds to the model’s approximation of the input matrix) will likely be close to zero, leading to poor generalization performance.
In contrast, Weighted Matrix Factorization decomposes the objective into the following two sums:
Here, is a hyperparameter that weights the two terms so that the objective is not dominated by one or the other. Tuning this hyperparameter is very important.
Note: In practical applications, you also need to weight the observed pairs carefully. For example, frequent items (for example, extremely popular YouTube videos) or frequent queries (for example, heavy users) may dominate the objective function. You can correct for this effect by weighting training examples to account for item frequency. In other words, you can replace the objective function by:
where is a function of the frequency of query and item .
Common algorithms to minimize the objective function include:
Stochastic gradient descent (SGD) is a generic method to minimize loss functions.
Weighted Alternating Least Squares (WALS) is specialized to this particular objective.
The objective is quadratic in each of the two matrices and . (Note, however, that the problem is not jointly convex.) WALS works by initializing the embeddings randomly, then alternating between:
Each stage can be solved exactly (via solution of a linear system) and can be distributed. This technique is guaranteed to converge because each step is guaranteed to decrease the loss.
SGD and WALS have advantages and disadvantages. Review the information below to see how they compare:
SGD
✅ Very flexible—can use other loss functions.
✅ Can be parallelized.
❌ Slower—does not converge as quickly.
❌ Harder to handle the unobserved entries (need to use negative sampling or gravity).
WALS
❌ Reliant on Loss Squares only.
✅ Can be parallelized.
✅ Converges faster than SGD.
✅ Easier to handle unobserved entries.
✅ No domain knowledge necessary
We don’t need domain knowledge because the embeddings are automatically learned.
✅ Serendipity
The model can help users discover new interests. In isolation, the ML system may not know the user is interested in a given item, but the model might still recommend it because similar users are interested in that item.
✅ Great starting point
To some extent, the system needs only the feedback matrix to train a matrix factorization model. In particular, the system doesn’t need contextual features. In practice, this can be used as one of multiple candidate generators.
❌ Cannot handle fresh items
The prediction of the model for a given (user, item) pair is the dot product of the corresponding embeddings. So, if an item is not seen during training, the system can’t create an embedding for it and can’t query the model with this item. This issue is often called the cold-start problem. However, the following techniques can address the cold-start problem to some extent:
The preceding equation corresponds to one iteration in WALS: the user embeddings are kept fixed, and the system solves for the embedding of item . The same can be done for a new user.
Heuristics to generate embeddings of fresh items. If the system does not have interactions, the system can approximate its embedding by averaging the embeddings of items from the same category, from the same uploader (in YouTube), and so on.
❌ Hard to include side features for query/item
Side features are any features beyond the query or item ID. For movie recommendations, the side features might include country or age. Including available side features improves the quality of the model. Although it may not be easy to include side features in WALS, a generalization of WALS makes this possible.
To generalize WALS, augment the input matrix with features by defining a block matrix , where:
Note: Block (1, 1) is typically left empty. If you apply matrix factorization to , then the system learns embeddings for side features, in addition to user and item embeddings.
This Colab notebook goes into more detail about Recommendation Systems. Specifically, you will be using matrix factorization to build a movie recommendation system, using the MovieLens dataset. Given a user and their ratings of movies on a scale of 1-5, your system will recommend movies the user is likely to rank highly.
Topics covered:
Note: In the last section of the Colab notebook (Section VI), you will build a softmax model. You will come back to that section at the end of the course, once we have discussed softmax models.
The previous section showed you how to use matrix factorization to learn embeddings. Some limitations of matrix factorization include:
Deep neural network (DNN) models can address these limitations of matrix factorization. DNNs can easily incorporate query features and item features (due to the flexibility of the input layer of the network), which can help capture the specific interests of a user and improve the relevance of recommendations.
One possible DNN model is softmax, which treats the problem as a multiclass prediction problem in which:
The input to a DNN can include:
Unlike the matrix factorization approach, you can add side features such as age or country. We’ll denote the input vector by x.
The model architecture determines the complexity and expressivity of the model. By adding hidden layers and non-linear activation functions (for example, ReLU), the model can capture more complex relationships in the data. However, increasing the number of parameters also typically makes the model harder to train and more expensive to serve. We will denote the output of the last hidden layer by .
The model maps the output of the last layer, , through a softmax layer to a probability distribution , where:
The softmax layer maps a vector of scores (sometimes called the logits) to a probability distribution.
Did you know?
The name softmax is a play on words. A “hard” max assigns probability 1 to the item with the largest score . By contrast, the softmax assigns a non-zero probability to all items, giving a higher probability to items that have higher scores. When the scores are scaled, the softmax converges to a “hard” max in the limit .
Finally, define a loss function that compares the following:
For example, you can use the cross-entropy loss since you are comparing two probability distributions.
The probability of item is given by , where is a normalization constant that does not depend on .
In other words, , so the log probability of an item is (up to an additive constant) the dot product of two -dimensional vectors, which can be interpreted as query and item embeddings:
Note: Since log is an increasing function, items with the highest probability are the items with the highest dot product . Therefore, the dot product can be interpreted as a similarity measure in this embedding space.
In both the softmax model and the matrix factorization model, the system learns one embedding vector per item . What we called the item embedding matrix in matrix factorization is now the matrix of weights of the softmax layer.
The query embeddings, however, are different. Instead of learning one embedding per query , the system learns a mapping from the query feature to an embedding . Therefore, you can think of this DNN model as a generalization of matrix factorization, in which you replace the query side by a nonlinear function .
Can you apply the same idea to the item side? That is, instead of learning one embedding per item, can the model learn a nonlinear function that maps item features to an embedding? Yes. To do so, use a two-tower neural network, which consists of two neural networks:
The output of the model can be defined as the dot product of . Note that this is not a softmax model anymore. The new model predicts one value per pair instead of a probability vector for each query .
The previous page explained how to incorporate a softmax layer into a deep neural network for a recommendation system. This page takes a closer look at the training data for this system.
The softmax training data consists of the query features and a vector of items the user interacted with (represented as a probability distribution ). These are marked in blue in the following figure. The variables of the model are the weights in the different layers. These are marked as orange in the following figure. The model is typically trained using any variant of stochastic gradient descent.
Since the loss function compares two probability vectors (the ground truth and the output of the model, respectively), computing the gradient of the loss (for a single query ) can be prohibitively expensive if the corpus size is too big.
You could set up a system to compute gradients only on the positive items (items that are active in the ground truth vector). However, if the system only trains on positive pairs, the model may suffer from folding, as explained below.
Instead of using all items to compute the gradient (which can be too expensive) or using only positive items (which makes the model prone to folding), you can use negative sampling. More precisely, you compute an approximate gradient, using the following items:
There are different strategies for sampling negatives:
Extra Resources:
DNN models solve many limitations of Matrix Factorization, but are typically more expensive to train and query. The table below summarizes some of the important differences between the two models.
In summary:
Suppose you have an embedding model. Given a user, how would you decide which items to recommend?
At serve time, given a query, you start by doing one of the following:
Once you have the query embedding , search for item embeddings that are close to in the embedding space. This is a nearest neighbor problem. For example, you can return the top items according to the similarity score .
You can use a similar approach in related-item recommendations. For example, when the user is watching a YouTube video, the system can first look up the embedding of that item, and then look for embeddings of other items that are close in the embedding space.
To compute the nearest neighbors in the embedding space, the system can exhaustively score every potential candidate. Exhaustive scoring can be expensive for very large corpora, but you can use either of the following strategies to make it more efficient:
After candidate generation, another model scores and ranks the generated candidates to select the set of items to display. The recommendation system may have multiple candidate generators that use different sources, such as the following:
Examples
The system combines these different sources into a common pool of candidates that are then scored by a single model and ranked according to that score. For example, the system can train a model to predict the probability of a user watching a video on YouTube given the following:
The system can then rank the videos in the pool of candidates according to the prediction of the model.
Since candidate generators compute a score (such as the similarity measure in the embedding space), you might be tempted to use them to do ranking as well. However, you should avoid this practice for the following reasons:
As you may remember from Introduction to ML Problem Framing, ML can act like a mischievous genie: very happy to learn the objective you provide, but you have to be careful what you wish for. This mischievous quality also applies to recommendation systems. The choice of scoring function can dramatically affect the ranking of items, and ultimately the quality of the recommendations.
Example:
If the scoring function optimizes for clicks, the systems may recommend click-bait videos. This scoring function generates clicks but does not make a good user experience. Users’ interest may quickly fade.
If the scoring function optimizes for watch time, the system might recommend very long videos, which might lead to a poor user experience. Note that multiple short watches can be just as good as one long watch.
Recommend shorter videos, but ones that are more likely to keep the user engaged.
Items that appear lower on the screen are less likely to be clicked than items appearing higher on the screen. However, when scoring videos, the system usually doesn’t know where on the screen a link to that video will ultimately appear. Querying the model with all possible positions is too expensive. Even if querying multiple positions were feasible, the system still might not find a consistent ranking across multiple ranking scores.
In the final stage of a recommendation system, the system can re-rank the candidates to consider additional criteria or constraints. One re-ranking approach is to use filters that remove some candidates.
Example: You can implement re-ranking on a video recommender by doing the following:
Another re-ranking approach is to manually transform the score returned by the ranker.
Example: The system re-ranks videos by modifying the score as a function of:
This section briefly discusses freshness, diversity, and fairness. These factors are among many that can help improve your recommendation system. Some of these factors often require modifying different stages of the process. Each section offers solutions that you might apply individually or collectively.
Most recommendation systems aim to incorporate the latest usage information, such as current user history and the newest items. Keeping the model fresh helps the model make good recommendations.
If the system always recommend items that are “closest” to the query embedding, the candidates tend to be very similar to each other. This lack of diversity can cause a bad or boring user experience. For example, if YouTube just recommends videos very similar to the video the user is currently watching, such as nothing but owl videos (as shown in the illustration), the user will likely lose interest quickly.
Your model should treat all users fairly. Therefore, make sure your model isn’t learning unconscious biases from the training data.
You are now ready to complete the last section of the Colab. You can reopen the Colab notebook and resume working on Section VI. In this section, you will train a softmax model using the MovieLens data set.
Topics covered:
You should now have a better understanding of how to: