
Learning To Rank

Table of Contents

1. Learning To Rank
  1.1. Framing a Ranking Problem
  1.2. Candidate Generation
  1.3. Ranking the Top K
    1.3.1. Learning To Rank
    1.3.2. Learning To Rank Loss Function
  1.4. RankNet
  1.5. LambdaNet
  1.6. nDCG
  1.7. LambdaMART
  1.8. Other Notes

1. Learning To Rank

• In the Recommender Systems section, we talked about how to recommend an item to a user.
  – However, that approach only considered one item at a time per prediction.
  – The prediction, which is a score or probability, helps us gauge whether or not we should recommend an item.
• Let's say that, instead, we want to create a feed or a series of posts (or items) for a user to scroll through on their home screen.
• Now the question is in what order (or ranking) we should place a series of items to provide the best feed for a user.
  – This is quite a different problem from classification/regression.

1.1. Framing a Ranking Problem

• Our first goal is to extract relevant posts (or items) from all of the posts available to the user → this is typically called Candidate Generation or Obtaining the Top $K$.
• Once we have the top $K$, we rank them in some particular order for the feed.

1.2. Candidate Generation

• To perform candidate generation we can again use matrix factorization.
• In matrix factorization, the predicted score for a user-item pair is

$$U \cdot P = \tilde{u}^{T} \tilde{p} + b_{p} + b_{u}$$
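As a rough illustration of how this score can be used to shortlist candidates, here is a minimal brute-force sketch. The function names (`mf_score`, `top_k_candidates`) and the toy embeddings and biases are made up for this example; a real system with many items would more likely use an approximate nearest-neighbor index than a full scan.

```python
import heapq

def mf_score(user_vec, item_vec, user_bias=0.0, item_bias=0.0):
    """Matrix-factorization score: dot(u, p) + b_p + b_u."""
    dot = sum(u * p for u, p in zip(user_vec, item_vec))
    return dot + item_bias + user_bias

def top_k_candidates(user_vec, item_vecs, item_biases, k):
    """Return the ids of the k highest-scoring items for one user."""
    scored = (
        (mf_score(user_vec, vec, item_bias=item_biases[item_id]), item_id)
        for item_id, vec in item_vecs.items()
    )
    # nlargest returns the k best (score, id) tuples, best first.
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# Toy 2-dimensional embeddings (made-up values for illustration).
user = [1.0, 0.5]
items = {"A": [0.9, 0.1], "B": [0.2, 0.8], "C": [-0.5, 0.3], "D": [1.0, 1.0]}
biases = {"A": 0.0, "B": 0.1, "C": 0.0, "D": -0.2}

candidates = top_k_candidates(user, items, biases, k=2)
print(candidates)
```

The user bias is the same for every item of a given user, so it does not change which items land in the top $K$; it is left at zero here for that reason.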

• Note that both $\tilde{u}$ and $\tilde{p}$ are in fact embeddings that represent some characteristics of a user or an item.
• To get the top $K$ items for a user, we can retrieve the items closest to the user using the embedding vectors $\tilde{u}$ and $\tilde{p}$.
  – We can use Euclidean distance, cosine similarity, or the dot product.
• Note: If, instead of matrix factorization, we've used a deep recommender system, then we could use the embedding layers of the DL model.

1.3. Ranking the Top K

• Given the top $K$ items, we now need to rank these candidates.
• The (bad) solution would be to just rank the items by their distance from a particular user (closer items rank higher). There are a couple of problems with this method:
  – There could be tons of users and items overall, but once we extract the top $K$ we're only dealing with a small fraction of them, so we can afford a more sophisticated model than a plain distance.
  – What if we had different systems generating our candidates?
    * We could subset our users into groups (e.g. user traffic coming from Amazon or Google).
    * These groups would exist in separate embedding spaces. If we want to rank items across different embedding spaces, we need some mechanism beyond the distances, because distances don't translate across different embeddings.

1.3.1. Learning To Rank

• Learning to rank models the probability that some item $i$ should appear before some item $j$:

$$P(i > j) = \hat{y}_{ij} = \frac{1}{1 + e^{-(s_{i} - s_{j})}}$$
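To make the formula concrete, here is a small sketch that evaluates it for a few score pairs. The scores are taken as given numbers, and the helper name `pairwise_prob` is just a label chosen for this example.

```python
import math

def pairwise_prob(s_i, s_j):
    """P(i > j): sigmoid of the score difference s_i - s_j."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

print(pairwise_prob(2.0, 0.5))  # i scored higher, so P > 0.5
print(pairwise_prob(1.0, 1.0))  # equal scores give exactly 0.5
print(pairwise_prob(0.5, 2.0))  # j scored higher, so P < 0.5
```

Only the difference of the two scores matters, and $P(i > j) + P(j > i) = 1$ by construction.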

• Here $s_{i} = f(u, d_{i})$ and $s_{j} = f(u, d_{j})$, where $d_{i}$ and $d_{j}$ denote items $i$ and $j$.
• The scoring function $f$ is a neural network.
• The probability is modeled as a sigmoid function of the score difference $s_{i} - s_{j}$.
• Note: The user feature term, $u$, doesn't have to represent users per se. It can also represent queries.
  – In fact, learning to rank was originally used as a search optimization tool.
  – In our case, we're going to treat queries as users.

1.3.2. Learning To Rank Loss Function

• The loss function is the usual binary cross-entropy (log loss) applied to the sigmoid output:

$$L(y_{ij}, \hat{y}_{ij}) = -\sum_{i \neq j} \left[ y_{ij} \log \hat{y}_{ij} + (1 - y_{ij}) \log (1 - \hat{y}_{ij}) \right]$$
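A minimal sketch of this loss, summed over a handful of labeled pairs. The pair labels and scores below are hypothetical, and the small `eps` clamp is an implementation detail added here to avoid taking the log of zero.

```python
import math

def pairwise_prob(s_i, s_j):
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def pairwise_loss(scores, labels, eps=1e-12):
    """Sum of binary cross-entropy terms over labeled pairs.

    scores: dict item -> score s
    labels: dict (i, j) -> y_ij in [0, 1]
    """
    total = 0.0
    for (i, j), y in labels.items():
        y_hat = pairwise_prob(scores[i], scores[j])
        y_hat = min(max(y_hat, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat)
    return total

# Hypothetical labels: A should outrank B, B should outrank C, C should not outrank A.
labels = {("A", "B"): 1.0, ("B", "C"): 1.0, ("C", "A"): 0.0}
good = {"A": 2.0, "B": 1.0, "C": 0.0}  # scores that agree with the labels
bad = {"A": 0.0, "B": 1.0, "C": 2.0}   # scores that contradict them

print(pairwise_loss(good, labels), pairwise_loss(bad, labels))
```

Scores that agree with the labeled order give a lower loss than scores that reverse it, which is what gradient descent exploits.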

• We're also going to use gradient descent to minimize the loss function.

1.4. RankNet

• Let's say we have candidate documents $A$, $B$ and $C$ which we want to rank.
• The first step is to generate all unique pairs of these documents, because our loss function takes into account the probability that document $i$ outranks document $j$ → so, we have to represent all $(i, j)$ pairs.
• Now, let's say the user clicked on $A$ and $B$ when presented with the following feed:

A ✓, C ×, B ✓

→ so, our labeled order should be $A, B, C$ (simply because the user clicked on $A$ and $B$ and not on $C$).
• Now, we assign the target labels $y_{ij}$ as follows:
  – $A, B \to y_{12} = 1$
  – $B, C \to y_{23} = 1$
  – $C, A \to y_{31} = 0$ (because $C$ does NOT rank over $A$)
• Note: If there were more documents (that the user hasn't seen), we'd assign their pairs $0.5$, because we don't know which one the user would click on.
• Now, our neural network is going to look like this: [figure not available]
• Note: The neural network is the function $f$ that gives us $s_{i}$ and $s_{j}$ for each pair of documents.
• Now, we can substitute $s_{i}$ and $s_{j}$ into the pairwise probability and loss equations above to calculate updates based on the gradient of the loss function, with learning rate $r$ →

$$w^{t+1} = w^{t} - r \nabla_{w} L$$
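The update can be sketched end to end on the $A, B, C$ example. To keep it short and dependency-free, this sketch swaps the neural network for a linear scorer $f(x) = w \cdot x$ (an assumption of this example, not part of RankNet itself), and the document features are made up.

```python
import math

def score(w, x):
    """Linear stand-in for the scoring network f."""
    return sum(wk * xk for wk, xk in zip(w, x))

def pairwise_prob(s_i, s_j):
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def ranknet_step(w, pairs, lr):
    """One gradient-descent update over labeled pairs (x_i, x_j, y_ij)."""
    grad = [0.0] * len(w)
    for x_i, x_j, y in pairs:
        y_hat = pairwise_prob(score(w, x_i), score(w, x_j))
        # For the cross-entropy loss, d(loss)/d(s_i - s_j) = y_hat - y,
        # and for a linear scorer d(s_i - s_j)/dw = x_i - x_j.
        for k in range(len(w)):
            grad[k] += (y_hat - y) * (x_i[k] - x_j[k])
    return [wk - lr * gk for wk, gk in zip(w, grad)]

# Made-up 2-d features for documents A, B, C.
docs = {"A": [1.0, 0.2], "B": [0.6, 0.4], "C": [0.1, 0.9]}
pairs = [
    (docs["A"], docs["B"], 1.0),  # A should outrank B
    (docs["B"], docs["C"], 1.0),  # B should outrank C
    (docs["C"], docs["A"], 0.0),  # C should not outrank A
]

w = [0.0, 0.0]
for _ in range(200):
    w = ranknet_step(w, pairs, lr=0.5)

ranking = sorted(docs, key=lambda d: score(w, docs[d]), reverse=True)
print(ranking)
```

Each document is scored on its own, and only the difference of the two scores enters the loss. With a neural network in place of `score`, the same structure holds, with backpropagation supplying the derivatives of the scores.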

  – The only difference here (compared to a standard NN update) is that we feed the documents of a pair into the neural network one at a time and apply the difference in outputs to the overall loss function, instead of using two documents at once as an input.
  – This strategy, which is called RankNet, was designed by Microsoft.
• Now, let's say we trained our NN with these updates and we get new documents we haven't seen.
  – We first generate the document pairs.
  – We then repeat the same process described above, except this time we simply plug $s_{1}$ and $s_{2}$ into the model and get a probability.
    * If the probability $> 0.5$ → we rank item $1$ above item $2$.
  – Then, we go to the next pair of documents and repeat this process.
  – Eventually, we come up with a consistent order in which to place the documents.
    * Consistent means that $P(i > j)$ propagates consistently across the entire ranking. For example, if in training $P(1 > 2) = 0.5$ (i.e. rank uncertainty) AND $P(2 > 3) = 0.5$, then $P(1 > 3) = 0.5$ → i.e. complete uncertainty propagates.
    * The same holds for complete certainty: if $P(1 > 2) = P(2 > 3) = 1$, then $P(1 > 3) = 1$.
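The consistency property follows from the sigmoid form: the log-odds of $P(i > j)$ equal $s_{i} - s_{j}$, and score differences add along a chain. A small sketch (the helper names are chosen for this example) that recovers $P(1 > 3)$ from the other two probabilities:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid; defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def implied_prob(p_12, p_23):
    """P(1 > 3) implied by P(1 > 2) and P(2 > 3) under the sigmoid model."""
    # s_1 - s_3 = (s_1 - s_2) + (s_2 - s_3), and logit(P) recovers each difference.
    return sigmoid(logit(p_12) + logit(p_23))

print(implied_prob(0.5, 0.5))  # 0.5: complete uncertainty propagates
print(implied_prob(0.9, 0.8))  # about 0.973: more confident than either input
```

Complete certainty ($P = 1$) corresponds to an infinite score difference, so it is the limiting case here rather than a value `logit` can take.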

1.5. LambdaNet

• RankNet's pairwise penalty may not be the best one.
  – We may want to consider more than just two documents at a time when trying to rank all the documents.
• It's also pretty inefficient.
  – We have to put every pair of documents through the neural network and perform SGD on each of them.
• In terms of learning to rank, after RankNet came something called LambdaNet (known in the literature as LambdaRank).
• Two improvements came with LambdaNet:
  – It is more efficient. The gradient is factorized so that the update for a document with respect to all the others can be computed without evaluating it against every other pair separately.
  – LambdaNet enables us to use better metrics such as nDCG (normalized Discounted Cumulative Gain) for evaluating a particular ranking.
    * It considers more than just a pair of documents at a time.

1.6. nDCG

• Let's see how nDCG works.
• Let's say we have the following documents, listed in ranked order along with their probabilities:
  – $A \to 0.5$
  – $B \to 1$
  – $C \to 0$
• Here, if we just use the number of inversions, we would count one inversion ($A$ ⇆ $B$) to reach the optimal ranking → i.e. the model will be penalized by one.
• Now, let's consider this case:
  – $C \to 0$
  – $A \to 0.5$
  – $B \to 1$
• The problem is that the inversion ($A$ ⇆ $B$) incurs the same penalty here as when it occurred higher in the list (previous case).
• nDCG considers documents up to some position $p$:

$$DCG_{p} = \sum_{i=1}^{p} \frac{2^{rel_{i}} - 1}{\log_{2}(i + 1)}$$

• where $rel_{i}$ is the relevance of the document at position $i$, which here is given by those probabilities.
• Now, we calculate $DCG_{3}$ (because there are 3 documents): for the first case we get about $1.05$ and for the second case we get about $0.76$.
  – It means that even though counting inversions would penalize ($A$ ⇆ $B$) the same, the $DCG$ is clearly lower for the second case, simply because these elements appear further down the list.
• How do we compare $DCG$ across different users?
  – To do that, we take the $DCG$ for a particular user and divide it by the ideal $DCG$ → i.e. $IDCG$:

$$nDCG_{p} = \frac{DCG_{p}}{IDCG_{p}}$$

• The ideal ordering (used for $IDCG$) in our example is:
  – $B \to 1$
  – $A \to 0.5$
  – $C \to 0$
• So, the $IDCG_{3}$ for our example is about $1.26$ and $nDCG_{3} = \frac{1.05}{1.26} \approx 0.83$.
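The numbers in this example can be checked with a short script. This is a minimal sketch of the DCG and nDCG formulas above, with the relevance lists typed in by hand in ranked order.

```python
import math

def dcg(relevances):
    """DCG over a ranked list; position i (1-based) is discounted by log2(i + 1)."""
    return sum(
        (2.0 ** rel - 1.0) / math.log2(i + 1)
        for i, rel in enumerate(relevances, start=1)
    )

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

case_1 = [0.5, 1.0, 0.0]  # ranked A, B, C
case_2 = [0.0, 0.5, 1.0]  # ranked C, A, B

print(round(dcg(case_1), 3), round(dcg(case_2), 3))
print(round(ndcg(case_1), 3), round(ndcg(case_2), 3))
```

Unrounded, the first case gives a $DCG_{3}$ of roughly $1.045$ and an $nDCG_{3}$ of roughly $0.829$, so small differences in the second decimal place come down to rounding.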

• $nDCG$ allows us to use the same loss for all users.
• The only problem with incorporating $nDCG$ into our loss function is that it's not differentiable.
• What LambdaNet does is that, instead of deriving the gradient from the loss function itself, it directly assigns the gradient a value called lambda ($\lambda$). It is defined as the pairwise sigmoid term scaled by $|\Delta nDCG_{ij}|$, the change in $nDCG$ obtained by swapping documents $i$ and $j$ in the ranking:

$$\lambda_{ij} = \frac{1}{1 + e^{-(s_{i} - s_{j})}} \cdot |\Delta nDCG_{ij}|$$
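Here is a hedged sketch of how $|\Delta nDCG_{ij}|$ and $\lambda_{ij}$ could be computed for one ranked list, following the formula exactly as written above (sign and index conventions differ between presentations of this method). The scores and relevances are made up, and positions are 0-based in the code.

```python
import math

def dcg(relevances):
    return sum((2.0 ** r - 1.0) / math.log2(i + 1)
               for i, r in enumerate(relevances, start=1))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def delta_ndcg(relevances, i, j):
    """|change in nDCG| when the documents at positions i and j swap places."""
    swapped = list(relevances)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(ndcg(swapped) - ndcg(relevances))

def lambda_ij(scores, relevances, i, j):
    """Lambda for one pair: sigmoid(s_i - s_j) * |delta nDCG_ij|."""
    sig = 1.0 / (1.0 + math.exp(-(scores[i] - scores[j])))
    return sig * delta_ndcg(relevances, i, j)

# Current ranking C, A, B (ordered by model score), with made-up scores.
relevances = [0.0, 0.5, 1.0]
scores = [2.0, 1.0, 0.0]

top_swap = delta_ndcg(relevances, 0, 2)     # swap C and B
bottom_swap = delta_ndcg(relevances, 1, 2)  # swap A and B
print(top_swap, bottom_swap)
print(lambda_ij(scores, relevances, 0, 2))
```

A swap that fixes the top of the list changes $nDCG$ far more than a swap near the bottom, so the corresponding pair receives a larger $\lambda$. This is how a list-level metric enters an otherwise pairwise update.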

• At the time, the authors didn't have any mathematical backing for it, but it was later shown that this is sound and that it actually does optimize for $nDCG$.
• Now that we use $\lambda$ instead of the gradients, how do we update our weights?
  – It's extremely similar to our old update function. The only difference is that we're subtracting different terms, as shown below:

$$w_{k}^{t+1} = w_{k}^{t} - r \sum_{i,j} \lambda_{ij} \left( \frac{\partial s_{i}}{\partial w_{k}} - \frac{\partial s_{j}}{\partial w_{k}} \right)$$

• Note: There's no partial derivative with respect to some cost function in the update rule above.

1.7. LambdaMART

• The only difference from LambdaNet is that it uses gradient boosted trees in place of the neural network.
• LambdaMART does appear to be pairwise because it compares pairs of documents, but $nDCG$ considers elements beyond the pair → so, technically, it's not a pairwise model.
• How does MART work?
  – MART takes a single example, $\vec{x}_{i} = [u_{1}, u_{2}, \dots, p_{1}, p_{2}, \dots]$ → i.e. some user features + item features, where features can be embeddings or explicit features.
  – The example is fed to the first tree and ends up at a leaf node after it traverses down the tree.
  – The error is going to be the prediction at the leaf node, $y_{i}$, divided by the weight $w_{i}$.
    * Here, $y_{i}$ is just the $\lambda$, $w_{i}$ is just the gradient of the $\lambda$ with respect to the output at the leaf node, and $\frac{y_{i}}{w_{i}}$ is the error that gets passed on to the next tree to be learned.
    * Note: The index $i$ just indicates that we perform the above calculations for all the leaf nodes.
    * Note: By calculating $\frac{y_{i}}{w_{i}}$ (i.e. $\lambda$ divided by the derivative of $\lambda$), we're taking what's called a Newton step to figure out the error to be trained on in the next tree.
  – Using the concept of MART with our lambda (instead of the actual gradient), we end up with LambdaMART.

1.8. Other Notes

• So far, in our examples, we've only talked about whether or not a user clicked a particular item/post/document.
  – We don't have to use just clicks. We can have things like:
    * A user liked a particular post.
    * A user commented on a particular post.
    * Or a user performed no action.
  – We can map these actions like this:
    * N/A → 0
    * Clicked → 1
    * Liked → 2
    * Commented → 3
    * Messaged → 4
  – This is called Implicit Relevance Feedback (similar to Implicit Response in Recommender Systems).
• In addition to $nDCG$, we can also use Mean Average Precision (MAP) (for binary cases).
• For implicit relevance feedback, we could transform it to binary by choosing some cutoff point (e.g. anything $> 3$ is labeled as relevant (i.e. 1) and irrelevant (i.e. 0) otherwise).
• We could also use Mean Reciprocal Rank (MRR) (for binary cases).
• Note: We can average these metrics across all users to evaluate the model.
• Sometimes it's preferable to encode the $n$-ary labels (as in implicit relevance feedback) as binary so that we can use those binary evaluation metrics just for another perspective.
• All of the evaluation metrics that we talked about can also be used in the lambda, that is,

$$\begin{array}{c} \lambda_{ij} = \frac{1}{1 + e^{-(s_{i} - s_{j})}} \cdot |\Delta nDCG_{ij}| \quad \text{or} \\ \lambda_{ij} = \frac{1}{1 + e^{-(s_{i} - s_{j})}} \cdot |\Delta MAP_{ij}| \quad \text{or} \\ \lambda_{ij} = \frac{1}{1 + e^{-(s_{i} - s_{j})}} \cdot |\Delta MRR_{ij}| \end{array}$$
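As an illustration of the binary metrics mentioned above, here is a minimal sketch that binarizes graded feedback and averages reciprocal rank and average precision across users. The feeds and the cutoff value are hypothetical (the cutoff of 1 is chosen only so that the toy data has several relevant items).

```python
def binarize(labels, cutoff):
    """Map graded labels to 1 (relevant) if label > cutoff, else 0."""
    return [1 if label > cutoff else 0 for label in labels]

def reciprocal_rank(binary):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(binary, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def average_precision(binary):
    """Mean of precision@k over the positions k that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(binary, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hypothetical feeds: graded labels in the order each feed was shown.
feeds = {"user_1": [0, 2, 1, 3], "user_2": [4, 0, 0, 1]}
cutoff = 1  # assumption for this sketch: "liked" or stronger counts as relevant

binary_feeds = {u: binarize(lbls, cutoff) for u, lbls in feeds.items()}
mrr = sum(reciprocal_rank(b) for b in binary_feeds.values()) / len(binary_feeds)
map_score = sum(average_precision(b) for b in binary_feeds.values()) / len(binary_feeds)
print(mrr, map_score)
```

MRR only looks at the first relevant item in each feed, while MAP rewards placing every relevant item early, so the two can disagree on which model is better.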

• Sometimes we want to de-bias our click data.
  – Clicks are going to be susceptible to Presentation/Trust Bias.
  – Basically, appearing lower in the ranking can affect:
    * Whether the post is even seen.
    * Even if it is seen, the fact that it occurs lower in the list could make the user not trust that result and not click on it.
    * This could be more relevant in search engines.
  – How do we account for this bias?
    * We just have to divide by the bias in the lambda equation:

$$\lambda_{ij} = \frac{\frac{1}{1 + e^{-(s_{i} - s_{j})}} \cdot |\Delta nDCG_{ij}|}{b_{i} b_{j}}$$

  – $b_{i}$ is the probability that some item at rank $i$ is clicked over some other relevant item that's ranked lower than it.
  – $b_{j}$ is the same as $b_{i}$ but in terms of irrelevant items.
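The effect of the correction can be seen in a tiny sketch. The bias values below are made up and assumed to be given; how $b_{i}$ and $b_{j}$ are estimated is outside the scope of this example.

```python
import math

def debiased_lambda(s_i, s_j, delta_metric, b_i, b_j):
    """Lambda from the formula above, divided by the bias product b_i * b_j."""
    sig = 1.0 / (1.0 + math.exp(-(s_i - s_j)))
    return sig * abs(delta_metric) / (b_i * b_j)

# Made-up numbers: the same pair without and with a bias correction.
plain = debiased_lambda(1.0, 0.0, 0.2, b_i=1.0, b_j=1.0)
corrected = debiased_lambda(1.0, 0.0, 0.2, b_i=0.5, b_j=0.8)
print(plain, corrected)
```

Pairs whose clicks were less likely to be observed (smaller bias terms) get a proportionally larger $\lambda$, so they count for more in the update.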