Recommendations: What and Why?

What are Recommendations?

How does YouTube know what video you might want to watch next? How does the Google Play Store pick an app just for you? Magic? No, in both cases, an ML-based recommendation model determines how similar videos and apps are to other things you like and then serves up a recommendation. Two kinds of recommendations are commonly used:

homepage recommendations
related item recommendations

Homepage Recommendations

Homepage recommendations are personalized to a user based on their known interests. Every user sees different recommendations.

If you go to the Google Play Apps homepage, you may see something like this:

Related Item Recommendations

As the name suggests, related items are recommendations similar to a particular item. In the Google Play apps example, users looking at a page for a math app may also see a panel of related apps, such as other math or science apps.

Why Recommendations?

A recommendation system helps users find compelling content in a large corpus. For example, the Google Play Store provides millions of apps, while YouTube provides billions of videos. More apps and videos are added every day. How can users find compelling new content? Yes, one can use search to access content. However, a recommendation engine can display items that users might not have thought to search for on their own.

Did you know?

40% of app installs on Google Play come from recommendations.
60% of watch time on YouTube comes from recommendations.

Terminology

Before we dive in, there are a few terms that you should know:

Items (also known as documents)

The entities a system recommends. For the Google Play store, the items are apps to install. For YouTube, the items are videos.

Query (also known as context)

The information a system uses to make recommendations. Queries can be a combination of the following:

user information
- the id of the user
- items that users previously interacted with
additional context
- time of day
- the user’s device

Embedding

A mapping from a discrete set (in this case, the set of queries, or the set of items to recommend) to a vector space called the embedding space. Many recommendation systems rely on learning an appropriate embedding representation of the queries and items.

Recommendation Systems Overview

One common architecture for recommendation systems consists of the following components:

candidate generation
scoring
re-ranking

Candidate Generation

In this first stage, the system starts from a potentially huge corpus and generates a much smaller subset of candidates. For example, the candidate generator in YouTube reduces billions of videos down to hundreds or thousands. The model needs to evaluate queries quickly given the enormous size of the corpus. A given model may provide multiple candidate generators, each nominating a different subset of candidates.

Scoring

Next, another model scores and ranks the candidates in order to select the set of items (on the order of 10) to display to the user. Since this model evaluates a relatively small subset of items, the system can use a more precise model relying on additional queries.

Re-ranking

Finally, the system must take into account additional constraints for the final ranking. For example, the system removes items that the user explicitly disliked or boosts the score of fresher content. Re-ranking can also help ensure diversity, freshness, and fairness.

We will discuss each of these stages over the course of the class and give examples from different recommendation systems, such as YouTube.

Extra Resource: For a more comprehensive account of the technology, architecture, and models used in YouTube, see Covington et al., Deep Neural Networks for YouTube Recommendations.

Candidate Generation Overview

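Candidate generation is the first stage of recommendation. Given a query, the system generates a set of relevant candidates. Two common candidate generation approaches are content-based filtering, which uses item features to recommend items similar to what the user already likes, and collaborative filtering, which uses similarities between users and items simultaneously.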

Embedding Space

Both content-based and collaborative filtering map each item and each query (or context) to an embedding vector in a common embedding space $E=\mathbb{R}^d$ . Typically, the embedding space is low-dimensional (that is, $d$ is much smaller than the size of the corpus), and captures some latent structure of the item or query set. Similar items, such as YouTube videos that are usually watched by the same user, end up close together in the embedding space. The notion of “closeness” is defined by a similarity measure.

Extra Resource: projector.tensorflow.org is an interactive tool to visualize embeddings.

Similarity Measures

A similarity measure is a function $s: E \times E \rightarrow \mathbb{R}$ that takes a pair of embeddings and returns a scalar measuring their similarity. The embeddings can be used for candidate generation as follows: given a query embedding $q \in E$, the system looks for item embeddings $x \in E$ that are close to $q$, that is, embeddings with high similarity $s(q,x)$.

To determine the degree of similarity, most recommendation systems rely on one or more of the following:

cosine
dot product
Euclidean distance

Cosine

This is simply the cosine of the angle between the two vectors, $s(q,x)=\cos(q,x)$.

Dot Product

The dot product of two vectors is $s(q,x)=\langle q,x \rangle=\sum\limits^d_{i=1}q_i x_i$. It is also given by $s(q,x)=\|x\|\|q\|\cos(q,x)$ (the cosine of the angle multiplied by the product of norms). Thus, if the embeddings are normalized, then dot product and cosine coincide.

Euclidean distance

This is the usual distance in Euclidean space, $s(q,x)=\|q-x\|=\left[\sum\limits^d_{i=1} (q_i - x_i)^2\right]^{\frac{1}{2}}$. A smaller distance means higher similarity. Note that when the embeddings are normalized, the squared Euclidean distance coincides with dot product (and cosine) up to a constant, since in that case $\frac{1}{2}\|q-x\|^2=1-\langle q,x \rangle$.
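The three measures can be written in a few lines of NumPy. The following is a minimal sketch, not a reference implementation: the function names and the example vectors are made up for illustration. It also checks numerically that, for unit-norm embeddings, dot product equals cosine and half the squared distance equals one minus the dot product.

```python
import numpy as np

def cosine(q, x):
    """Cosine of the angle between q and x."""
    return float(np.dot(q, x) / (np.linalg.norm(q) * np.linalg.norm(x)))

def dot_product(q, x):
    """Inner product <q, x>."""
    return float(np.dot(q, x))

def euclidean_distance(q, x):
    """Euclidean distance ||q - x||; smaller means more similar."""
    return float(np.linalg.norm(q - x))

# Hypothetical 2-D embeddings, chosen only for illustration.
q = np.array([3.0, 4.0])
x = np.array([4.0, 3.0])

# Rescale both embeddings to unit norm.
q_unit = q / np.linalg.norm(q)
x_unit = x / np.linalg.norm(x)

# For unit vectors: dot product == cosine, and 0.5 * ||q - x||^2 == 1 - <q, x>.
half_sq_dist = 0.5 * euclidean_distance(q_unit, x_unit) ** 2
one_minus_dot = 1.0 - dot_product(q_unit, x_unit)
```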

Comparing Similarity Measures

Consider the example in the figure below. The black vector illustrates the query embedding. The other three embedding vectors (Item A, Item B, Item C) represent candidate items. Depending on the similarity measure used, the ranking of the items can be different.

Using the image, try to determine the item ranking using all three of the similarity measures: cosine, dot product, and Euclidean distance.

How did you do?

Item A has the largest norm, and is ranked higher according to the dot-product. Item C has the smallest angle with the query, and is thus ranked first according to the cosine similarity. Item B is physically closest to the query so Euclidean distance favors it.
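To see the same effect numerically, here is a small sketch with made-up coordinates (they are assumptions chosen to mimic the situation described, not the values from the figure). Item A has the largest norm, Item B is closest to the query, and Item C points in the same direction as the query.

```python
import numpy as np

# Hypothetical coordinates; not taken from the original figure.
query = np.array([2.0, 2.0])
items = {
    "A": np.array([5.0, 1.0]),   # largest norm
    "B": np.array([2.0, 1.2]),   # closest to the query
    "C": np.array([1.0, 1.0]),   # same direction as the query
}

def rank(score_fn):
    """Return item names sorted from most to least similar."""
    return sorted(items, key=lambda name: score_fn(query, items[name]), reverse=True)

by_dot = rank(lambda q, x: np.dot(q, x))
by_cosine = rank(lambda q, x: np.dot(q, x) / (np.linalg.norm(q) * np.linalg.norm(x)))
# Negate the distance so that a smaller distance gives a higher score.
by_euclidean = rank(lambda q, x: -np.linalg.norm(q - x))
```

With these values, each measure puts a different item first: A for dot product, C for cosine, and B for Euclidean distance.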

Which Similarity Measure to Choose?

Compared to the cosine, the dot product similarity is sensitive to the norm of the embedding. That is, the larger the norm of an embedding, the higher the similarity (for items with an acute angle) and the more likely the item is to be recommended. This can affect recommendations as follows:

Items that appear very frequently in the training set (for example, popular YouTube videos) tend to have embeddings with large norms. If capturing popularity information is desirable, then you should prefer dot product. However, if you’re not careful, the popular items may end up dominating the recommendations. In practice, you can use other variants of similarity measures that put less emphasis on the norm of the item. For example, define $s(q,x)=\|q\|^{\alpha}\|x\|^{\alpha}\cos(q,x)$ for some $\alpha \in (0,1)$.
Items that appear very rarely may not be updated frequently during training. Consequently, if they are initialized with a large norm, the system may recommend rare items over more relevant items. To avoid this problem, be careful about embedding initialization, and use appropriate regularization. We will detail this problem in the first exercise.
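The norm-damped variant can be sketched as follows. The function name `damped_similarity` and the example vectors are hypothetical. Setting `alpha` to 1 recovers the dot product and setting it to 0 recovers the cosine, so values in between trade off the two.

```python
import numpy as np

def damped_similarity(q, x, alpha):
    """Compute ||q||^alpha * ||x||^alpha * cos(q, x)."""
    q_norm = np.linalg.norm(q)
    x_norm = np.linalg.norm(x)
    cos = np.dot(q, x) / (q_norm * x_norm)
    return float((q_norm ** alpha) * (x_norm ** alpha) * cos)

# Hypothetical embeddings: a "popular" item with a large norm and a
# "niche" item with a small norm that is better aligned with the query.
q = np.array([1.0, 0.0])
popular = np.array([4.0, 3.0])
niche = np.array([0.9, 0.1])

dot_like = {name: damped_similarity(q, v, 1.0) for name, v in [("popular", popular), ("niche", niche)]}
damped = {name: damped_similarity(q, v, 0.1) for name, v in [("popular", popular), ("niche", niche)]}
```

In this example the popular item wins when `alpha` is 1, while the niche item wins when `alpha` is 0.1.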

Content-based Filtering

Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.

To demonstrate content-based filtering, let’s hand-engineer some features for the Google Play store. The following figure shows a feature matrix where each row represents an app and each column represents a feature. Features could include categories (such as Education, Casual, Health), the publisher of the app, and many others. To simplify, assume this feature matrix is binary: a non-zero value means the app has that feature.

You also represent the user in the same feature space. Some of the user-related features could be explicitly provided by the user. For example, a user selects “Entertainment apps” in their profile. Other features can be implicit, based on the apps they have previously installed. For example, the user installed another app published by Science R Us.

The model should recommend items relevant to this user. To do so, you must first pick a similarity metric (for example, dot product). Then, you must set up the system to score each candidate item according to this similarity metric. Note that the recommendations are specific to this user, as the model did not use any information about other users.

Using Dot Product as a Similarity Measure

Consider the case where the user embedding $x$ and the app embedding $y$ are both binary vectors. Since $⟨x,y⟩ = \sum\limits^d_{i=1} x_i y_i$ , a feature appearing in both $x$ and $y$ contributes a $1$ to the sum. In other words, $⟨x,y⟩$ is the number of features that are active in both vectors simultaneously. A high dot product then indicates more common features, thus a higher similarity.
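As a small illustration, the sketch below scores three apps against one user with binary features. The feature names and values are invented for this example; only the counting behavior of the dot product comes from the text above.

```python
import numpy as np

# Hypothetical binary features, one column per feature.
feature_names = ["Education", "Casual", "Health", "Science R Us"]

user = np.array([1, 0, 0, 1])   # likes Education, installed a Science R Us app
apps = np.array([
    [1, 0, 0, 1],   # app 0: Education, published by Science R Us
    [0, 1, 0, 0],   # app 1: Casual
    [1, 0, 1, 0],   # app 2: Education, Health
])

# Each score is the number of features shared by the user and the app.
scores = apps @ user            # [2, 0, 1]
ranking = np.argsort(-scores)   # app indices from best to worst match
```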

Content-based Filtering Advantages & Disadvantages

Advantages

The model doesn’t need any data about other users, since the recommendations are specific to this user. This makes it easier to scale to a large number of users.
The model can capture the specific interests of a user, and can recommend niche items that very few other users are interested in.

Disadvantages

Since the feature representation of the items is hand-engineered to some extent, this technique requires a lot of domain knowledge. Therefore, the model can only be as good as the hand-engineered features.
The model can only make recommendations based on existing interests of the user. In other words, the model has limited ability to expand on the user’s existing interests.

Collaborative Filtering

To address some of the limitations of content-based filtering, collaborative filtering uses similarities between users and items simultaneously to provide recommendations. This allows for serendipitous recommendations; that is, collaborative filtering models can recommend an item to user A based on the interests of a similar user B. Furthermore, the embeddings can be learned automatically, without relying on hand-engineering of features.

A Movie Recommendation Example

Consider a movie recommendation system in which the training data consists of a feedback matrix in which:

Each row represents a user.
Each column represents an item (a movie).

The feedback about movies falls into one of two categories:

Explicit— users specify how much they liked a particular movie by providing a numerical rating.
Implicit— if a user watches a movie, the system infers that the user is interested.

To simplify, we will assume that the feedback matrix is binary; that is, a value of 1 indicates interest in the movie.

When a user visits the homepage, the system should recommend movies based on both:

similarity to movies the user has liked in the past
movies that similar users liked

For the sake of illustration, let’s hand-engineer some features for the movies described in the following table:

1D Embedding

Suppose we assign to each movie a scalar in $[−1,1]$ that describes whether the movie is for children (negative values) or adults (positive values). Suppose we also assign a scalar to each user in $[−1,1]$ that describes the user’s interest in children’s movies (closer to -1) or adult movies (closer to +1). The product of the movie embedding and the user embedding should be higher (closer to 1) for movies that we expect the user to like.

In the diagram below, each checkmark identifies a movie that a particular user watched. The third and fourth users have preferences that are well explained by this feature—the third user prefers movies for children and the fourth user prefers movies for adults. However, the first and second users’ preferences are not well explained by this single feature.

2D Embedding

One feature was not enough to explain the preferences of all users. To overcome this problem, let’s add a second feature: the degree to which each movie is a blockbuster or an arthouse movie. With a second feature, we can now represent each movie with the following two-dimensional embedding:

We again place our users in the same embedding space to best explain the feedback matrix: for each (user, item) pair, we would like the dot product of the user embedding and the item embedding to be close to 1 when the user watched the movie, and to 0 otherwise.

Note: We represented both items and users in the same embedding space. This may seem surprising. After all, users and items are two different entities. However, you can think of the embedding space as an abstract representation common to both items and users, in which we can measure similarity or relevance using a similarity metric.

In this example, we hand-engineered the embeddings. In practice, the embeddings can be learned automatically, which is the power of collaborative filtering models. In the next two sections, we will discuss different models to learn these embeddings, and how to train them.

The collaborative nature of this approach is apparent when the model learns the embeddings. Suppose the embedding vectors for the movies are fixed. Then, the model can learn an embedding vector for the users to best explain their preferences. Consequently, embeddings of users with similar preferences will be close together. Similarly, if the embeddings for the users are fixed, then we can learn movie embeddings to best explain the feedback matrix. As a result, embeddings of movies liked by similar users will be close in the embedding space.

Matrix Factorization

Matrix factorization is a simple embedding model. Given the feedback matrix $A \in \mathbb{R}^{m×n}$ , where $m$ is the number of users (or queries) and $n$ is the number of items, the model learns:

A user embedding matrix $U \in \mathbb{R}^{m×d}$ , where row $i$ is the embedding for user $i$ .
An item embedding matrix $V \in \mathbb{R}^{n×d}$ , where row $j$ is the embedding for item $j$ .

The embeddings are learned such that the product $UV^T$ is a good approximation of the feedback matrix $A$. Observe that the $(i,j)$ entry of $UV^T$ is simply the dot product $\langle U_i,V_j \rangle$ of the embeddings of user $i$ and item $j$, which you want to be close to $A_{ij}$.

Note: Matrix factorization typically gives a more compact representation than learning the full matrix. The full matrix has $O(nm)$ entries, while the embedding matrices $U$ , $V$ have $O((n+m)d)$ entries, where the embedding dimension $d$ is typically much smaller than $m$ and $n$ . As a result, matrix factorization finds latent structure in the data, assuming that observations lie close to a low-dimensional subspace. In the preceding example, the values of $n$ , $m$ , and $d$ are so low that the advantage is negligible. In real-world recommendation systems, however, matrix factorization can be significantly more compact than learning the full matrix.
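The following sketch illustrates both points with NumPy: an entry of $UV^T$ is the dot product of one user row and one item row, and the factored form needs far fewer parameters than the full matrix. The sizes used here are arbitrary assumptions, and the embeddings are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny random (untrained) embeddings, only to check the shapes.
m, n, d = 4, 5, 2
U = rng.normal(size=(m, d))   # one row per user
V = rng.normal(size=(n, d))   # one row per item

A_hat = U @ V.T               # model's approximation of A, shape (m, n)
i, j = 1, 3
entry_matches = bool(np.isclose(A_hat[i, j], np.dot(U[i], V[j])))

def parameter_counts(m, n, d):
    """Return (entries in the full matrix, entries in U and V combined)."""
    return m * n, (m + n) * d

# Hypothetical sizes: one million users, one hundred thousand items, d = 64.
full, factored = parameter_counts(1_000_000, 100_000, 64)
```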

Choosing the Objective Function

One intuitive objective function is the squared distance. To do this, minimize the sum of squared errors over all pairs of observed entries:

$\min_{U \in \mathbb{R}^{m \times d}, V \in \mathbb{R}^{n \times d}} \sum\limits_{(i,j) \in \text{obs}} (A_{ij} - \langle U_i, V_j \rangle)^2$

In this objective function, you only sum over observed pairs (i, j), that is, over non-zero values in the feedback matrix. However, only summing over values of one is not a good idea—a matrix of all ones will have a minimal loss and produce a model that can’t make effective recommendations and that generalizes poorly.

Perhaps you could treat the unobserved values as zero, and sum over all entries in the matrix. This corresponds to minimizing the squared Frobenius distance between $A$ and its approximation $UV^T$ :

$\min_{U \in \mathbb{R}^{m \times d}, V \in \mathbb{R}^{n \times d}} ||A - UV^T||^2_F$

You can solve this quadratic problem through Singular Value Decomposition (SVD) of the matrix. However, SVD is not a great solution either, because in real applications, the matrix $A$ may be very sparse. For example, think of all the videos on YouTube compared to all the videos a particular user has viewed. The solution $UV^T$ (which corresponds to the model’s approximation of the input matrix) will likely be close to zero, leading to poor generalization performance.

In contrast, Weighted Matrix Factorization decomposes the objective into the following two sums:

A sum over observed entries.
A sum over unobserved entries (treated as zeroes).

$\min_{U \in \mathbb{R}^{m \times d}, V \in \mathbb{R}^{n \times d}} \sum\limits_{(i,j) \in \text{obs}} (A_{ij} - \langle U_i, V_j \rangle)^2 + w_0 \sum\limits_{(i,j) \notin \text{obs}} (\langle U_i, V_j \rangle)^2$

Here, $w_0$ is a hyperparameter that weights the two terms so that the objective is not dominated by one or the other. Tuning this hyperparameter is very important.

Note: In practical applications, you also need to weight the observed pairs carefully. For example, frequent items (for example, extremely popular YouTube videos) or frequent queries (for example, heavy users) may dominate the objective function. You can correct for this effect by weighting training examples to account for item frequency. In other words, you can replace the objective function by:

$\sum\limits_{(i,j) \in \text{obs}} w_{ij} (A_{ij} - \langle U_i, V_j \rangle)^2 + w_0 \sum\limits_{(i,j) \notin \text{obs}} (\langle U_i, V_j \rangle)^2$

where $w_{ij}$ is a function of the frequency of query $i$ and item $j$.
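A direct NumPy transcription of the weighted objective might look like the sketch below. The function name `wmf_loss`, the boolean `observed` mask, and the optional per-pair weight matrix `W` are naming assumptions made for this example.

```python
import numpy as np

def wmf_loss(A, observed, U, V, w0, W=None):
    """Weighted matrix factorization objective.

    A:        (m, n) feedback matrix.
    observed: (m, n) boolean mask, True where the pair (i, j) is observed.
    W:        optional (m, n) weights w_ij for observed pairs (defaults to 1).
    """
    pred = U @ V.T
    if W is None:
        W = np.ones_like(pred)
    observed_term = np.sum(W[observed] * (A[observed] - pred[observed]) ** 2)
    unobserved_term = np.sum(pred[~observed] ** 2)
    return float(observed_term + w0 * unobserved_term)

# Toy binary feedback matrix where a 1 marks an observed interaction.
A = np.array([[1.0, 0.0],
              [0.0, 1.0]])
observed = A == 1

perfect = wmf_loss(A, observed, np.eye(2), np.eye(2), w0=0.5)
all_ones = wmf_loss(A, observed, np.ones((2, 1)), np.ones((2, 1)), w0=0.5)
```

Here the all-ones prediction fits the observed entries perfectly but is penalized by the second term, which is the behavior the weight $w_0$ is meant to control.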

Minimizing the Objective Function

Common algorithms to minimize the objective function include:

Stochastic gradient descent (SGD) is a generic method to minimize loss functions.
Weighted Alternating Least Squares (WALS) is specialized to this particular objective.

The objective is quadratic in each of the two matrices $U$ and $V$ . (Note, however, that the problem is not jointly convex.) WALS works by initializing the embeddings randomly, then alternating between:

Fixing $U$ and solving for $V$ .
Fixing $V$ and solving for $U$ .

Each stage can be solved exactly (via solution of a linear system) and can be distributed. The loss is guaranteed to converge because each step can only decrease it or leave it unchanged, and it is bounded below by zero.
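The alternating scheme can be sketched in NumPy as shown below. This is a single-machine illustration under several assumptions, not a production WALS: it loops over rows one at a time, uses `np.linalg.lstsq` for each weighted least-squares subproblem, and adds no regularization. All function names are hypothetical.

```python
import numpy as np

def wmf_loss(A, observed, U, V, w0):
    """Observed squared error plus w0 times squared predictions elsewhere."""
    weights = np.where(observed, 1.0, w0)
    target = np.where(observed, A, 0.0)
    return float(np.sum(weights * (target - U @ V.T) ** 2))

def solve_rows(targets, weights, fixed):
    """For each row r, minimize sum_k weights[r, k] * (targets[r, k] - <x, fixed[k]>)^2."""
    solved = np.zeros((targets.shape[0], fixed.shape[1]))
    for r in range(targets.shape[0]):
        sqrt_w = np.sqrt(weights[r])
        # Weighted least squares: scale rows by sqrt(weight), then solve exactly.
        solved[r] = np.linalg.lstsq(sqrt_w[:, None] * fixed,
                                    sqrt_w * targets[r], rcond=None)[0]
    return solved

def wals(A, observed, d, w0, num_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U = rng.normal(scale=0.1, size=(m, d))   # random initialization
    V = rng.normal(scale=0.1, size=(n, d))
    weights = np.where(observed, 1.0, w0)
    target = np.where(observed, A, 0.0)
    losses = [wmf_loss(A, observed, U, V, w0)]
    for _ in range(num_iters):
        V = solve_rows(target.T, weights.T, U)   # fix U, solve for V
        U = solve_rows(target, weights, V)       # fix V, solve for U
        losses.append(wmf_loss(A, observed, U, V, w0))
    return U, V, losses

# Toy feedback matrix with two groups of users and movies.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
U, V, losses = wals(A, observed=(A == 1), d=2, w0=0.5)
```

Because each half-step solves its subproblem exactly, the recorded losses should never increase from one iteration to the next.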

SGD vs. WALS

SGD and WALS have advantages and disadvantages. Review the information below to see how they compare:

SGD

✅ Very flexible—can use other loss functions.

✅ Can be parallelized.

❌ Slower—does not converge as quickly.

❌ Harder to handle the unobserved entries (need to use negative sampling or gravity).

WALS

❌ Restricted to squared loss only.

✅ Can be parallelized.

✅ Converges faster than SGD.

✅ Easier to handle unobserved entries.

Collaborative Filtering Advantages & Disadvantages

Advantages

✅ No domain knowledge necessary

We don’t need domain knowledge because the embeddings are automatically learned.

✅ Serendipity

The model can help users discover new interests. In isolation, the ML system may not know the user is interested in a given item, but the model might still recommend it because similar users are interested in that item.

✅ Great starting point

To some extent, the system needs only the feedback matrix to train a matrix factorization model. In particular, the system doesn’t need contextual features. In practice, this can be used as one of multiple candidate generators.

Disadvantages

❌ Cannot handle fresh items

The prediction of the model for a given (user, item) pair is the dot product of the corresponding embeddings. So, if an item is not seen during training, the system can’t create an embedding for it and can’t query the model with this item. This issue is often called the cold-start problem. However, the following techniques can address the cold-start problem to some extent:

Projection in WALS. Given a new item $i_0$ not seen in training, if the system has a few interactions with users, then the system can easily compute an embedding $v_{i_0}$ for this item without having to retrain the whole model. The system simply has to solve the following equation or the weighted version:

$\min_{v_{i_0} \in \mathbb{R}^d} ||A_{i_0} - Uv_{i_0}||$

The preceding equation corresponds to one iteration in WALS: the user embeddings are kept fixed, and the system solves for the embedding of item $i_0$ . The same can be done for a new user.
Heuristics to generate embeddings of fresh items. If the system has no interactions for a fresh item, it can approximate the item’s embedding by averaging the embeddings of items from the same category, from the same uploader (in YouTube), and so on.
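The projection step for a new item amounts to an ordinary least-squares solve against the fixed user embeddings. The sketch below shows the unweighted version under that assumption; the function name `project_new_item` and the toy data are made up for illustration.

```python
import numpy as np

def project_new_item(U, feedback_column):
    """Solve min_v ||feedback_column - U v|| with the user embeddings U held fixed."""
    v, *_ = np.linalg.lstsq(U, feedback_column, rcond=None)
    return v

# Hypothetical fixed user embeddings (3 users, d = 2).
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Feedback for the new item: users 0 and 2 interacted with it.
a_new = np.array([1.0, 0.0, 1.0])

v_new = project_new_item(U, a_new)
```

In this toy case the feedback is reproduced exactly by the embedding `[1, 0]`, so the solve recovers it.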

❌ Hard to include side features for query/item

Side features are any features beyond the query or item ID. For movie recommendations, the side features might include country or age. Including available side features improves the quality of the model. Although it may not be easy to include side features in WALS, a generalization of WALS makes this possible.

To generalize WALS, augment the input matrix with features by defining a block matrix $\bar{A}$ , where:

Block (0, 0) is the original feedback matrix $A$ .
Block (0, 1) is a multi-hot encoding of the user features.
Block (1, 0) is a multi-hot encoding of the item features.

Note: Block (1, 1) is typically left empty. If you apply matrix factorization to $\bar{A}$ , then the system learns embeddings for side features, in addition to user and item embeddings.
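One way to assemble the augmented matrix is with `np.block`, as in the sketch below. The helper name and the feature encodings are hypothetical. Filling block (1, 1) with zeros is only a placeholder for "empty"; a weighted objective would treat those entries as unobserved.

```python
import numpy as np

def augment_feedback(A, user_features, item_features):
    """Build the block matrix [[A, user_features], [item_features^T, empty]].

    A:             (m, n) feedback matrix.
    user_features: (m, p) multi-hot user features.
    item_features: (n, q) multi-hot item features.
    """
    p = user_features.shape[1]
    q = item_features.shape[1]
    empty = np.zeros((q, p))   # placeholder for the empty block (1, 1)
    return np.block([[A, user_features],
                     [item_features.T, empty]])

# Toy example: 2 users, 3 items, 2 user features, 1 item feature.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
user_features = np.array([[1.0, 0.0],
                          [0.0, 1.0]])
item_features = np.array([[1.0], [0.0], [1.0]])

A_bar = augment_feedback(A, user_features, item_features)
```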

Exercise: Build a Movie Recommendation System

This Colab notebook goes into more detail about recommendation systems. Specifically, you will use matrix factorization to build a movie recommendation system with the MovieLens dataset. Given a user and their ratings of movies on a scale of 1-5, your system will recommend movies the user is likely to rate highly.

Topics covered:

Exploring the MovieLens Data
Matrix factorization using SGD
Embedding Visualization
Regularization in Matrix Factorization

Note: In the last section of the Colab notebook (Section VI), you will build a softmax model. You will come back to that section at the end of the course, once we have discussed softmax models.

Recommendation Using Deep Neural Networks

The previous section showed you how to use matrix factorization to learn embeddings. Some limitations of matrix factorization include:

The difficulty of using side features (that is, any features beyond the query ID/item ID). As a result, the model can only be queried with a user or item present in the training set.
Relevance of recommendations. As you saw in the first Colab, popular items tend to be recommended for everyone, especially when using dot product as a similarity measure. It is better to capture specific user interests.

Deep neural network (DNN) models can address these limitations of matrix factorization. DNNs can easily incorporate query features and item features (due to the flexibility of the input layer of the network), which can help capture the specific interests of a user and improve the relevance of recommendations.

Softmax DNN for Recommendation

One possible DNN model is softmax, which treats the problem as a multiclass prediction problem in which:

The input is the user query.
The output is a probability vector with size equal to the number of items in the corpus, representing the probability to interact with each item; for example, the probability to click on or watch a YouTube video.

Input

The input to a DNN can include:

dense features (for example, watch time and time since last watch)
sparse features (for example, watch history and country)

Unlike the matrix factorization approach, you can add side features such as age or country. We’ll denote the input vector by $x$.

Model Architecture

The model architecture determines the complexity and expressivity of the model. By adding hidden layers and non-linear activation functions (for example, ReLU), the model can capture more complex relationships in the data. However, increasing the number of parameters also typically makes the model harder to train and more expensive to serve. We will denote the output of the last hidden layer by $\psi(x) \in \mathbb{R}^d$ .

Softmax Output: Predicted Probability Distribution

The model maps the output of the last layer, $\psi(x)$ , through a softmax layer to a probability distribution $\hat{p} = h(\psi(x)V^T)$ , where:

$h: \mathbb{R}^n \rightarrow \mathbb{R}^n$ is the softmax function, given by $h(y)_i = \frac{e^{y_i}}{\sum_j e^{y_j}}$
$V \in \mathbb{R}^{n \times d}$ is the matrix of weights of the softmax layer.

The softmax layer maps a vector of scores $y \in \mathbb{R}^n$ (sometimes called the logits) to a probability distribution.

Did you know?

The name softmax is a play on words. A “hard” max assigns probability 1 to the item with the largest score $y_i$ . By contrast, the softmax assigns a non-zero probability to all items, giving a higher probability to items that have higher scores. When the scores are scaled, the softmax $h(\alpha y)$ converges to a “hard” max in the limit $\alpha \rightarrow \infty$ .
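The softmax function and its limiting behavior can be checked with a short sketch. Subtracting the maximum score before exponentiating is a standard numerical-stability step added here; it does not change the result. The example scores and the scale factor of 50 are arbitrary.

```python
import numpy as np

def softmax(y):
    """Map a vector of scores (logits) to a probability distribution."""
    shifted = y - np.max(y)   # avoids overflow; softmax is shift-invariant
    exp = np.exp(shifted)
    return exp / np.sum(exp)

logits = np.array([1.0, 2.0, 3.0])

p = softmax(logits)                # every item gets non-zero probability
p_sharp = softmax(50.0 * logits)   # scaling the scores approaches a "hard" max
```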

Loss Function

Finally, define a loss function that compares the following:

$\hat{p}$ , the output of the softmax layer (a probability distribution)
$p$ , the ground truth, representing the items the user has interacted with (for example, YouTube videos the user clicked or watched). This can be represented as a normalized multi-hot distribution (a probability vector).

For example, you can use the cross-entropy loss since you are comparing two probability distributions.
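As a sketch of this loss, the code below normalizes a multi-hot interaction vector into a distribution $p$ and compares it with two candidate outputs $\hat{p}$. The small constant `eps` is an assumption added to avoid taking the log of zero, and the example vectors are invented.

```python
import numpy as np

def cross_entropy(p, p_hat, eps=1e-12):
    """Cross-entropy between ground truth p and predicted distribution p_hat."""
    return float(-np.sum(p * np.log(p_hat + eps)))

# The user interacted with items 1 and 3 out of 4 (multi-hot vector).
interacted = np.array([0.0, 1.0, 0.0, 1.0])
p = interacted / interacted.sum()   # normalized multi-hot: [0, 0.5, 0, 0.5]

uniform_guess = np.full(4, 0.25)
good_guess = np.array([0.01, 0.49, 0.01, 0.49])

loss_uniform = cross_entropy(p, uniform_guess)
loss_good = cross_entropy(p, good_guess)
```

The prediction that concentrates probability on the interacted items gets the lower loss.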

Softmax Embeddings

The probability of item $j$ is given by $\hat{p}_j = \frac{\exp(\langle \psi(x), V_j \rangle)}{Z}$ , where $Z$ is a normalization constant that does not depend on $j$ .

In other words, $\log(\hat{p}_j) = \langle \psi(x), V_j \rangle - \log(Z)$ , so the log probability of an item $j$ is (up to an additive constant) the dot product of two $d$ -dimensional vectors, which can be interpreted as query and item embeddings:

$\psi(x) \in \mathbb{R}^d$ is the output of the last hidden layer. We call it the embedding of the query $x$ .
$V_j \in \mathbb{R}^d$ is the vector of weights connecting the last hidden layer to output $j$ . We call it the embedding of item $j$ .

Note: Since log is an increasing function, items $j$ with the highest probability $\hat{p}_j$ are the items with the highest dot product $\langle \psi(x), V_j \rangle$ . Therefore, the dot product can be interpreted as a similarity measure in this embedding space.
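The relationship between log probabilities and dot products can be verified numerically. In this sketch the query embedding and the item embeddings are random placeholders rather than the output of a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3

psi_x = rng.normal(size=d)     # stand-in for the query embedding psi(x)
V = rng.normal(size=(n, d))    # stand-in for the softmax weights, one row per item

logits = V @ psi_x                        # dot products <psi(x), V_j>
log_Z = np.log(np.sum(np.exp(logits)))    # log of the normalization constant
log_p_hat = logits - log_Z                # log probabilities

# Ranking by probability and ranking by dot product give the same order.
same_order = bool(np.array_equal(np.argsort(-log_p_hat), np.argsort(-logits)))
```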

DNN and Matrix Factorization

In both the softmax model and the matrix factorization model, the system learns one embedding vector $V_j$ per item $j$ . What we called the item embedding matrix $V \in \mathbb{R}^{n \times d}$ in matrix factorization is now the matrix of weights of the softmax layer.

The query embeddings, however, are different. Instead of learning one embedding $U_i$ per query $i$, the system learns a mapping from the query features $x$ to an embedding $\psi(x) \in \mathbb{R}^d$. Therefore, you can think of this DNN model as a generalization of matrix factorization, in which you replace the query side by a nonlinear function $\psi(\cdot)$.

Can You Use Item Features?

Can you apply the same idea to the item side? That is, instead of learning one embedding per item, can the model learn a nonlinear function that maps item features to an embedding? Yes. To do so, use a two-tower neural network, which consists of two neural networks:

One neural network maps query features $x_\text{query}$ to query embedding $\psi(x_\text{query}) \in \mathbb{R}^d$
One neural network maps item features $x_\text{item}$ to item embedding $\phi(x_\text{item}) \in \mathbb{R}^d$

The output of the model can be defined as the dot product $\langle \psi(x_\text{query}), \phi(x_\text{item}) \rangle$. Note that this is not a softmax model anymore. The new model predicts one value per pair $(x_\text{query}, x_\text{item})$ instead of a probability vector for each query $x_\text{query}$.
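A forward pass of such a two-tower model can be sketched in plain NumPy. Everything about the architecture here is an assumption made for illustration: one hidden ReLU layer per tower, the layer sizes, and random untrained weights.

```python
import numpy as np

def tower(x, W1, W2):
    """A one-hidden-layer network that maps features to a d-dimensional embedding."""
    hidden = np.maximum(0.0, x @ W1)   # ReLU activation
    return hidden @ W2

rng = np.random.default_rng(0)
query_dim, item_dim, hidden_dim, d = 5, 4, 8, 3

# Separate (untrained, random) weights for the query tower and the item tower.
query_params = (rng.normal(size=(query_dim, hidden_dim)), rng.normal(size=(hidden_dim, d)))
item_params = (rng.normal(size=(item_dim, hidden_dim)), rng.normal(size=(hidden_dim, d)))

def score(x_query, x_item):
    """Dot product of the query embedding and the item embedding."""
    psi = tower(x_query, *query_params)
    phi = tower(x_item, *item_params)
    return float(np.dot(psi, phi))

x_query = rng.normal(size=query_dim)
x_item = rng.normal(size=item_dim)
pair_score = score(x_query, x_item)   # one scalar per (query, item) pair
```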

Softmax Training

The previous page explained how to incorporate a softmax layer into a deep neural network for a recommendation system. This page takes a closer look at the training data for this system.

Training Data

The softmax training data consists of the query features $x$ and a vector of items the user interacted with (represented as a probability distribution $p$ ). These are marked in blue in the following figure. The variables of the model are the weights in the different layers. These are marked as orange in the following figure. The model is typically trained using any variant of stochastic gradient descent.

Negative Sampling

Since the loss function compares two probability vectors $p,\hat{p} \in \mathbb{R}^n$ (the ground truth and the output of the model, respectively), computing the gradient of the loss (for a single query $x$ ) can be prohibitively expensive if the corpus size $n$ is too big.

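You could set up a system to compute gradients only on the positive items (items that are active in the ground truth vector). However, if the system only trains on positive pairs, the model may suffer from folding: embeddings of unrelated groups of queries and items can end up in the same region of the embedding space by chance, which leads to spurious recommendations.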

Instead of using all items to compute the gradient (which can be too expensive) or using only positive items (which makes the model prone to folding), you can use negative sampling. More precisely, you compute an approximate gradient, using the following items:

All positive items (the ones that appear in the target label)
A sample of negative items ($j$ in $\{1, \dots, n\}$)

There are different strategies for sampling negatives:

You can sample uniformly.
You can give higher probability to items $j$ with higher score $\langle \psi(x), V_j \rangle$. Intuitively, these are the examples that contribute the most to the gradient; they are often called hard negatives.
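Both sampling strategies can be sketched with one helper. The function name `sample_negatives` is hypothetical, and drawing hard negatives with probability proportional to the exponentiated score is just one possible choice, assumed here for illustration.

```python
import numpy as np

def sample_negatives(num_items, positives, k, rng, scores=None):
    """Sample k distinct negative item ids, excluding the positives.

    With scores=None the sampling is uniform. Otherwise, items with a
    higher score are more likely to be drawn (hard negatives).
    """
    candidates = np.setdiff1d(np.arange(num_items), positives)
    probs = None
    if scores is not None:
        s = scores[candidates]
        weights = np.exp(s - s.max())   # shift for numerical stability
        probs = weights / weights.sum()
    return rng.choice(candidates, size=k, replace=False, p=probs)

rng = np.random.default_rng(0)
positives = np.array([1, 4])

uniform_negatives = sample_negatives(10, positives, k=3, rng=rng)

# Hypothetical model scores in which item 7 scores far above the rest.
scores = np.zeros(10)
scores[7] = 50.0
hard_negative = sample_negatives(10, positives, k=1, rng=rng, scores=scores)
```

The approximate gradient is then computed on the positive items together with the sampled negatives, instead of on all $n$ items.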

Extra Resources:

For a more comprehensive account of the technology, architecture, and models used in YouTube, see Deep Neural Networks for YouTube Recommendations.
See Xin et al., Folding: Why Good Models Sometimes Make Spurious Recommendations for more details on folding.
To learn more about negative sampling, see Bengio and Senecal, Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model.

On Matrix Factorization Vs. Softmax

DNN models solve many limitations of Matrix Factorization, but are typically more expensive to train and query. The table below summarizes some of the important differences between the two models.

In summary:

Matrix factorization is usually the better choice for large corpora. It is easier to scale, cheaper to query, and less prone to folding.
DNN models can better capture personalized preferences, but are harder to train and more expensive to query. DNN models are preferable to matrix factorization for scoring because DNN models can use more features to better capture relevance. Also, it is usually acceptable for DNN models to fold, since you mostly care about ranking a pre-filtered set of candidates assumed to be relevant.

Retrieval

Suppose you have an embedding model. Given a user, how would you decide which items to recommend?

At serve time, given a query, you start by doing one of the following:

For a matrix factorization model, the query (or user) embedding is known statically, and the system can simply look it up from the user embedding matrix.
For a DNN model, the system computes the query embedding $\psi(x)$ at serve time by running the network on the feature vector $x$ .

Once you have the query embedding $q$ , search for item embeddings $V_j$ that are close to $q$ in the embedding space. This is a nearest neighbor problem. For example, you can return the top $k$ items according to the similarity score $s(q,V_j)$ .

You can use a similar approach in related-item recommendations. For example, when the user is watching a YouTube video, the system can first look up the embedding of that item, and then look for embeddings of other items $V_j$ that are close in the embedding space.
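
As a minimal sketch of this nearest neighbor search, the snippet below scores every item with a dot product and keeps the top $k$ . The helper names and the toy embeddings are assumptions made for illustration; a production system would use learned embeddings and whichever similarity measure the model was trained with.

```python
import heapq

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k(query_emb, item_embs, k, exclude=None):
    """Return the ids of the k items with the highest score s(q, V_j)."""
    scored = (
        (dot(query_emb, v), j)
        for j, v in enumerate(item_embs)
        if j != exclude
    )
    return [j for _, j in heapq.nlargest(k, scored)]

item_embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]  # toy V (assumption)

# Homepage-style retrieval: the query is a user embedding q.
q = [1.0, 0.2]
home = top_k(q, item_embs, k=2)

# Related-item retrieval: the query is the embedding of the current item,
# which is excluded from its own results.
related = top_k(item_embs[2], item_embs, k=1, exclude=2)
```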

Large-scale Retrieval

To compute the nearest neighbors in the embedding space, the system can exhaustively score every potential candidate. Exhaustive scoring can be expensive for very large corpora, but you can use either of the following strategies to make it more efficient:

If the query embedding is known statically, the system can perform exhaustive scoring offline, precomputing and storing a list of the top candidates for each query. This is a common practice for related-item recommendation.
Use approximate nearest neighbors.
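
The first strategy can be sketched as an offline job that builds a lookup table of neighbors. The function name and toy data below are hypothetical. The second strategy, approximate nearest neighbors, relies on a dedicated index structure and is not shown here.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def precompute_related(item_embs, k):
    """Exhaustively score all item pairs offline; keep the top-k ids per item."""
    table = {}
    for i, query in enumerate(item_embs):
        scores = [(dot(query, v), j) for j, v in enumerate(item_embs) if j != i]
        scores.sort(reverse=True)
        table[i] = [j for _, j in scores[:k]]
    return table

item_embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]  # toy V (assumption)
related_table = precompute_related(item_embs, k=2)

# At serve time, a dictionary lookup replaces the scoring pass.
neighbors_of_item_0 = related_table[0]
```

The offline pass costs time quadratic in the number of items, which is why it suits cases where the query embeddings are fixed and the table can be refreshed on a schedule.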

Scoring

After candidate generation, another model scores and ranks the generated candidates to select the set of items to display. The recommendation system may have multiple candidate generators that use different sources, such as the following:

Examples

Related items from a matrix factorization model.
User features that account for personalization.
“Local” vs “distant” items; that is, taking geographic information into account.
Popular or trending items.
A social graph; that is, items liked or recommended by friends.

The system combines these different sources into a common pool of candidates that are then scored by a single model and ranked according to that score. For example, the system can train a model to predict the probability of a user watching a video on YouTube given the following:

query features (for example, user watch history, language, country, time)
video features (for example, title, tags, video embedding)

The system can then rank the videos in the pool of candidates according to the prediction of the model.
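
The flow from several candidate generators to a single ranked list might look like the sketch below. Everything in it is an assumption made for illustration: the logistic scorer stands in for whatever trained model the system uses, and the feature values, weights, and bias are invented.

```python
import math

def merge_candidates(*pools):
    """Union the ids from several generators, keeping first-seen order."""
    seen, merged = set(), []
    for pool in pools:
        for item in pool:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged

def predict_watch_probability(query_features, video_features, weights, bias):
    """Hypothetical logistic scorer over concatenated query and video features."""
    x = query_features + video_features
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def rank(candidates, query_features, video_table, weights, bias):
    """Score every candidate with the same model and sort by that score."""
    scored = [
        (predict_watch_probability(query_features, video_table[c], weights, bias), c)
        for c in candidates
    ]
    scored.sort(reverse=True)
    return [c for _, c in scored]

video_table = {"a": [0.1], "b": [0.9], "c": [0.5]}  # toy video features
pool = merge_candidates(["a", "b"], ["b", "c"])       # two generators overlap on "b"
ranked = rank(pool, [1.0], video_table, weights=[0.5, 2.0], bias=-1.0)
```

Because one model scores the whole pool, the resulting scores are comparable even though the candidates came from different generators.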

Why Not Let the Candidate Generator Score?

Since candidate generators compute a score (such as the similarity measure in the embedding space), you might be tempted to use them to do ranking as well. However, you should avoid this practice for the following reasons:

Some systems rely on multiple candidate generators. The scores of these different generators might not be comparable.
With a smaller pool of candidates, the system can afford to use more features and a more complex model that may better capture context.

Choosing an Objective Function for Scoring

As you may remember from Introduction to ML Problem Framing, ML can act like a mischievous genie: very happy to learn the objective you provide, but you have to be careful what you wish for. This mischievous quality also applies to recommendation systems. The choice of scoring function can dramatically affect the ranking of items, and ultimately the quality of the recommendations.

Example:

Maximize Click Rate

If the scoring function optimizes for clicks, the system may recommend click-bait videos. This scoring function generates clicks but does not make for a good user experience, and users’ interest may quickly fade.

Maximize Watch Time

If the scoring function optimizes for watch time, the system might recommend very long videos, which might lead to a poor user experience. Note that multiple short watches can be just as good as one long watch.

Increase Diversity and Maximize Session Watch Time

Recommend shorter videos, but ones that are more likely to keep the user engaged.

Positional Bias in Scoring

Items that appear lower on the screen are less likely to be clicked than items appearing higher on the screen. However, when scoring videos, the system usually doesn’t know where on the screen a link to that video will ultimately appear. Querying the model with all possible positions is too expensive. Even if querying multiple positions were feasible, the system still might not find a consistent ranking across multiple ranking scores.

Solutions

Create position-independent rankings.
Rank all the candidates as if they are in the top position on the screen.
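
One way to picture the second solution is a model that receives the on-screen position as a feature during training and is then queried with the top position for every candidate at serve time. The sketch below assumes such a model; the logistic form, the weights, and the 0-indexed position convention are all hypothetical.

```python
import math

TOP_POSITION = 0  # assumption: positions are 0-indexed from the top of the screen

def click_probability(relevance, position, w_rel=2.0, w_pos=-0.5, bias=-1.0):
    """Hypothetical click model that takes the on-screen position as a feature."""
    z = bias + w_rel * relevance + w_pos * position
    return 1.0 / (1.0 + math.exp(-z))

def rank_position_independent(relevance_by_id):
    """Score every candidate as if it were shown at TOP_POSITION, then sort."""
    return sorted(
        relevance_by_id,
        key=lambda c: click_probability(relevance_by_id[c], TOP_POSITION),
        reverse=True,
    )

# The same item looks less clickable when logged at a lower position.
at_top = click_probability(0.9, position=TOP_POSITION)
lower_down = click_probability(0.9, position=5)

ranking = rank_position_independent({"a": 0.2, "b": 0.9, "c": 0.5})
```

Holding the position fixed means the ordering depends only on the other features, so the system needs a single query per candidate.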

Re-ranking

In the final stage of a recommendation system, the system can re-rank the candidates to consider additional criteria or constraints. One re-ranking approach is to use filters that remove some candidates.

Example: You can implement re-ranking on a video recommender by doing the following:

Training a separate model that detects whether a video is click-bait.
Running this model on the candidate list.
Removing the videos that the model classifies as click-bait.
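
These three steps can be sketched as a filter over the ranked list. The classifier here is a stand-in: it reads a precomputed `clickbait_score` field, which is a hypothetical name, and compares it with an assumed threshold rather than running a trained model.

```python
def is_clickbait(video, threshold=0.5):
    """Stand-in for a separately trained click-bait classifier."""
    return video["clickbait_score"] >= threshold

def filter_candidates(ranked_videos):
    """Drop videos flagged as click-bait while preserving the ranker's order."""
    return [v for v in ranked_videos if not is_clickbait(v)]

ranked_videos = [
    {"id": "v1", "clickbait_score": 0.9},
    {"id": "v2", "clickbait_score": 0.1},
    {"id": "v3", "clickbait_score": 0.3},
]
kept_ids = [v["id"] for v in filter_candidates(ranked_videos)]
```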

Another re-ranking approach is to manually transform the score returned by the ranker.

Example: The system re-ranks videos by modifying the score as a function of:

video age (perhaps to promote fresher content)
video length
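
A score transformation based on video age could look like the following sketch. The exponential decay and the 30-day half-life are assumptions chosen for illustration; the source does not prescribe a particular function.

```python
def rerank_by_freshness(scored_videos, half_life_days=30.0):
    """Multiply each ranker score by a decay factor that halves every half_life_days.

    scored_videos is a list of (video_id, ranker_score, age_in_days) tuples.
    """
    adjusted = [
        (score * 0.5 ** (age_days / half_life_days), video_id)
        for video_id, score, age_days in scored_videos
    ]
    adjusted.sort(reverse=True)
    return [video_id for _, video_id in adjusted]

scored_videos = [("old", 0.9, 90.0), ("new", 0.6, 0.0), ("mid", 0.8, 30.0)]
fresh_first = rerank_by_freshness(scored_videos)
```

In this toy example, the oldest video has the highest raw score but falls to last place once its score is discounted for age.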

This section briefly discusses freshness, diversity, and fairness. These factors are among many that can help improve your recommendation system. Addressing some of them requires modifying different stages of the process. Each section offers solutions that you might apply individually or collectively.

Freshness

Most recommendation systems aim to incorporate the latest usage information, such as current user history and the newest items. Keeping the model fresh helps the model make good recommendations.

Solutions

Re-run training as often as possible to learn on the latest training data. We recommend warm-starting the training so that the model does not have to re-learn from scratch. Warm-starting can significantly reduce training time. For example, in matrix factorization, warm-start the embeddings for items that were present in the previous instance of the model.
Create an “average” user to represent new users in matrix factorization models. You don’t need the same embedding for each user—you can create clusters of users based on user features.
Use a DNN such as a softmax model or two-tower model. Since the model takes feature vectors as input, it can be run on a query or item that was not seen during training.
Add document age as a feature. For example, YouTube can add a video’s age or the time of its last viewing as a feature.
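
The first two solutions can be sketched for a matrix factorization model as follows. The function names, the toy embeddings, and the Gaussian initialization for unseen items are assumptions; the point is only to show reuse of previous embeddings and a fallback embedding for new users.

```python
import random

def average_embedding(embeddings):
    """Component-wise mean of a list of equal-length embedding vectors."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]

def warm_start(previous_item_embs, current_item_ids, dim, rng):
    """Reuse embeddings of items seen by the previous model; randomly
    initialize embeddings for items that are new."""
    warm = {}
    for item_id in current_item_ids:
        if item_id in previous_item_embs:
            warm[item_id] = list(previous_item_embs[item_id])
        else:
            warm[item_id] = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    return warm

# A new user ("u3") falls back to the "average" user embedding.
user_embs = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
average_user = average_embedding(list(user_embs.values()))
new_user_emb = user_embs.get("u3", average_user)

# Item "m1" keeps its previous embedding; item "m2" starts from random values.
previous_item_embs = {"m1": [0.3, 0.7]}
init = warm_start(previous_item_embs, ["m1", "m2"], dim=2, rng=random.Random(0))
```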

Diversity

If the system always recommends items that are “closest” to the query embedding, the candidates tend to be very similar to each other. This lack of diversity can cause a bad or boring user experience. For example, if YouTube just recommends videos very similar to the video the user is currently watching, such as nothing but owl videos (as shown in the illustration), the user will likely lose interest quickly.

Solutions

Train multiple candidate generators using different sources.
Train multiple rankers using different objective functions.
Re-rank items based on genre or other metadata to ensure diversity.
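
The third solution can be sketched as a greedy pass that caps how many items of one genre appear in the final list. The cap, the genre labels, and the function name are hypothetical; other diversity criteria would follow the same pattern.

```python
from collections import Counter

def diversify(ranked_items, genre_of, max_per_genre, k):
    """Walk the ranked list in order, skipping items whose genre
    already has max_per_genre picks, until k items are chosen."""
    counts = Counter()
    picked = []
    for item in ranked_items:
        genre = genre_of[item]
        if counts[genre] < max_per_genre:
            picked.append(item)
            counts[genre] += 1
        if len(picked) == k:
            break
    return picked

ranked_items = ["owl1", "owl2", "owl3", "cat1", "dog1"]
genre_of = {"owl1": "owls", "owl2": "owls", "owl3": "owls",
            "cat1": "cats", "dog1": "dogs"}
diverse = diversify(ranked_items, genre_of, max_per_genre=2, k=4)
```

The ranker's order is preserved within the cap, so the most relevant items of each genre still appear first.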

Fairness

Your model should treat all users fairly. Therefore, make sure your model isn’t learning unconscious biases from the training data.

Solutions

Include diverse perspectives in design and development.
Train ML models on comprehensive data sets. Add auxiliary data when your data is too sparse (for example, when certain categories are under-represented).
Track metrics (for example, accuracy and absolute error) on each demographic to watch for biases.
Make separate models for underserved groups.
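
Tracking a metric per demographic can be as simple as grouping evaluation records before averaging. The sketch below computes mean absolute error per group; the record layout and group labels are assumptions made for illustration.

```python
from collections import defaultdict

def absolute_error_by_group(records):
    """Mean absolute error per group.

    records is an iterable of (group, predicted, actual) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of errors, count]
    for group, predicted, actual in records:
        totals[group][0] += abs(predicted - actual)
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

records = [("A", 4.0, 5.0), ("A", 3.0, 3.0), ("B", 1.0, 4.0)]
errors = absolute_error_by_group(records)
```

A large gap between groups, as in this toy data, is a signal to investigate whether the training data under-represents one of them.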

Exercise: Build a Movie Recommendation System (continued)

You are now ready to complete the last section of the Colab. You can reopen the Colab notebook and resume working on Section VI. In this section, you will train a softmax model using the MovieLens data set.

Topics covered:

Building and training a softmax model.
Visualizing the movie embeddings.

Colab notebook

Summary

You should now have a better understanding of how to:

Describe the purpose of recommendation systems.
Understand the components of a recommender system, including candidate generation, scoring, and re-ranking.
Use embeddings to represent items and queries.
Develop a deeper technical understanding of common techniques used in candidate generation.
Use TensorFlow to develop two models used for recommendation: matrix factorization and softmax.
