1. Learning To Rankβ’ In the Recommender Systems section, ?, we talked about how to recommend an item to a user.β However, it only considered one item at a time per prediction.β The prediction, which is a score or probability, help us gauge whether or not we should we should recommend an item.β’ Let's say, instead of that, we wanted to create a feed or a series of posts (or items) for a user to scroll through on their home screen.β’ Now, this is a question of what order (or ranking) do we place a series of items in to provide the best feed for a user.β This is quite a different problem from classification/regression.1.1. Framing a Ranking Problemβ’ Our first goal is to extract relevant posts (or items) from all of the posts available to the user β This is typically called Candidate Generation or Obtaining the Top K.β’ Once, we have the top K, then we will rank them in some particular order for the feed.1.2. Candidate Generationβ’ To perform candidate generation we can again use matrix factorization.β’ In matrix factorization, we perform this,U.P=uTp+ bp+buβ’ Note that both uT and p are in fact embeddings that represent some characteristics about some user or some item.β’ To get the top K items for a user, we can get the closest adjacent items to the user using the embedding vectors, uT and p.β We can use Euclidean distance or cosine similarity or dot product.β’ Note: If, instead of matrix factorization, we've used a deep recommender system, then we could use the embedding layers in the DL model.1.3. Ranking the Top Kβ’ Given the top K items, now we need to rank these top candidates.β’ The (bad) solution would be to just rank the items by their distance from a particular user (closer distance will rank higher on the top). There are a couple of problems with this method:β There could be tons of users and items where, in reality, if we extract the top K, we could generate a more sophisticated model because now we're only dealing with a small fraction of the users and items.β What if we had different systems generating our candidates? * We could subset our users into groups (e.g. user traffic coming from Amazon or Google).* These groups would exist in separate embedding spaces. However, if we want to apply rankings of items across different embedding spaces, then we would have to have some mechanism to rank beyond just the distances because they don't translate across different embeddings.1.3.1. Learning To Rankβ’ Learning to rank takes some probability that some item i should appear before some item j.P(i>j)=yi,j=1
1+e-(si-sj)β’ Where si=f(u,di) and sj=f(u,dj) and di and dj indicate items i and j.β’ The functions, f, are neural networks.β’ Here, it's modeled as a sigmoid function of si and sj.β’ Note: The user feature term, u, don't have to be users per se. They can also be queries.β In fact, originally, learning to rank, was used as a search optimization tool. β In our case, we're going to treat queries as users.1.3.2. Learning To Rank Loss Functionβ’ The loss function is the same sigmoid loss function,L(yij,yij)=-βiβ jyijlogyij+(1-yij)log(1-yij)β’ We're also going to use gradient descent to minimize the loss function.1.4. RankNetβ’ Let's say we have candidate documents A, B and C which we want to rank.β’ The first step is to generate all unique pairs of these documents because our loss function takes into account the probability that document i outranks document j β so, we have to represent all ij pairs.β’ Now, let's say user clicked on A and B when presented with the following feed: A β,C Γ,B β β so, our labels should be A,B,C (simply because user clicked on A and B and not on C).β’ Now, we assign yijas follows:β A,B βy12=1β B,C βy23=1β C,A βy31=0 (because C does NOT rank over A)β’ Note:If there were more documents (that user haven't seen), we'll assign them 0.5, because we don't know which one the user will click on.β’ Now, our neural network is going to look like this:BABCCAdu......βd1d2s1s2f(u,di)=siβ’ Note: The neural networks are the f functions that'd give us si and sj for each pair of documents.β’ Now, we can replace si and sj in equations (1) and (2) to calculate updates based on the gradient of the loss function β wt+1=wt-rβloss.β The only difference here (compared to NN update) is that we're placing adjacent documents into the neural network one at a time and we're applying those differences in outputs to the overall loss function instead of using two documents at once as an input.β This strategy, which is called RankNet, was designed by Microsoft.β’ Now, let's say we trained our NN with these updates and we get a new document we haven't seen. β We first generate the pairwise documents.β We then repeat the same process described above, except this time with s1 and s2, we just simply plug them into the model and get a probability. * If probability > 0.5 β we rank item 1 above item 2.β Then, we go to the next pair of documents and repeat this process.β Eventually, we can come up with a consistent order of which we should place the documents in.* Consistent means that P(i>j) will propagate consistency across the entire rank. For example, if in training P(1>2)=0.5 (i.e. rank uncertainty), AND P(2>3)=0.5 then P(1>3)=0.5 β i.e. complete uncertainty propagates. * Same thing for complete certainty: if P(1>2)=P(2>3)=1 then P(1>3)= 1.1.5. LambdaNetβ’ RankNet pairwise may not be the best penalty.β This means that we may want to consider more than just two documents at a time when trying to rank all the documents. β’ It's also pretty inefficient.β For all pairs of documents, we have to put all of those pairs into the neural network and perform SGD on every one of them.β’ In terms of learning to rank, after RankNet, there came something called LambdaNet.β’ Two improvements came with LambdaNet:β It is more efficient. They factorize the gradient such that they can find the gradient update for a document in comparison to all the others without having to evaluate itself versus every other pair.β LambdaNet enables us to use better metrics such as nDCG for evaluating a particular ranking, where nDCG β normalized Discounted Cumulative Gain.* It considers more than just a pair of documents at a time. 1.6. nDCGβ’ Let's see how nDCG works.β’ Let's say we have the following documents along with their probabilities:β A β 0.5β B β 1β C β 0β’ Here, if we just use the number of inversions, we would count one inversion (A β B) that would give us the optimal ranking β i.e. the model will be penalized by one.β’ Now, let's consider this case:β C β 0β A β 0.5β B β 1β’ The problem is that the same penalty would be incurred for for the inversion (A β B) vs. if it was higher (previous case).β’ nDCG considers documents up to some position p,DCGp=βp2reli-1
log2(i+1)β’ where reli is the relevance of document i, which is those probabilities.β’ Now, we calculate the DCG3 (because there are 3 documents) for the first case we get 1.04 and for the second case we get 0.76.β It means that even though inversions would penalize (A β B) the same, the DCG is clearly lower for the second case, simply because this elements appeared later down the list.β’ How to compare DCG across different users?β To do that we have to take the DCG for a particular user and divide it by the ideal DCG β i.e. IDCG.nDCGp=DCGp
IDCGpβ’ The IDCG in our example is:β B β 1β A β 0.5β C β 0β’ So, the IDCG3 for our example is 1.26 and nDCG3=1.04
1.26=0.82.β’ nDCG allows us to use the same loss for all users. β’ The only problem with incorporating nDCG into our loss function is that it's not differentiable.β’ What LambdaNet does is that instead of defining the gradient as the gradient of the loss function itself, they just assigned the gradient to a value called lambda (π)which is defined as,β’ πij=1
1+e-(si-sj).|ΞnDCGij|β’ β’ At the time, the authors didn't have any mathematical backing for it, but it was later proven that this was completely fine and it actually does optimize for nDCG.β’ Now that we use π instead of the gradients, how do we update our weights?β It's extremely similar to our old update functions. The only difference is that we're subtracting different terms as shown below,wt+1=wt- rβijπij(βsi
βwk-βsj
βwk)β’ Note: There's no partial derivative in equation (6) with respect to some cost function. 1.7. LambdaMARTβ’ The only difference with LambdaNet is that it uses a gradient boosted tree in place of the neural network.β’ LambdaMART does appear to be pairwise because it compares pairwise documents, but nDCG considers elements beyond the pairs β so, technically, it's not a pairwise model.β’ How does MART work?β MART takes a single example, xi=[u1,u2,β¦,p1,p2,β¦] β i.e. some user features + item features where features can be embeddings or explicit features.β The example is fed to the first tree and finds itself at the leaf node after it traverses down the tree. β The error is going to be the prediction at the leaf node, yi, divided by the weight wi.* Here, yi is just the π and wi is just the gradient of the π with respect to the output at the leaf node and yi
wi is actually the error that gets passed on to the next tree to be learned.* Note: The i index just indicates that we perform the above calculations for all the leaf nodes.* Note: For calculating yi
wi (or π divided by the derivative of π), we're taking what's called a Newton Step to figure out our error to be trained on in the next tree.tree1tree2tree3xiyi
wiyi
wiyi
wi++β’ β Using the concept of MART with our Lambda (instead of the actual gradient), we end up with LambdaMART.1.8. Other Notesβ’ So far, in our examples, we just talked about whether a user clicked or not clicked a particular item/post/document.β We don't have to use just use clicks. We can have things like:* A user liked a particular post.* A user commented on a particular post.* Or a user performed no action.β We can map these actions like this:* N/A β 0* Clicked β 1* Liked β 2* Commented β 3* Messaged β 4β This is called Implicit Relevance Feedback. (similar to Implicit Response in Recommender Systems)β’ In addition to nDCG, we can also use Mean Average Precision (MAP) (for binary cases).β’ For implicit relevance feedback, we could transform it to binary by arranging some cutoff point (e.g. anything > 3 is labeled as relevant (i.e. 1) and irrelevant (i.e. 0) otherwise.β’ We could also use Mean Reciprocal Rank (MRR) (for binary cases). β’ Note: We can average these metrics across all users to evaluate the model.β’ Sometimes it's preferable to encode your n-ary labels (as in implicit relevance feedback) as binary so that we can use those binary evaluation metrics just for another perspective.β’ All of these evaluations metrics that we talked about, can also be used in the Lambda, that is,πij=1
1+e-(si-sj).|ΞnDCGij| or πij=1
1+e-(si-sj).|ΞMAPij| or πij=1
1+e-(si-sj).|ΞMRRij|β’ Sometimes we can to de-bias our clicks data. β Clicks are going to be susceptible to Presentation/Trust Bias.β Basically, the results appearing lower in the ranking can affect:* If the post is even seen* Even if it is seen, the fact that it occurs lower in the list could make the user not trust that result and not click on it.* This could be more relevant in search engines. β How do we account for this bias?* We just have to divide by the bias in the lambda equationπij=1
1+e-(si-sj).|ΞnDCGij|
bi bjβ’ β bi is the probability that some item in rank i is clicked over some other relevant item that's ranked lower than it. β bj is the same as bi but in terms of irrelevant items.Back to Top