Multi-Armed Bandit
Table of Content
1. Multi-Armed Bandit (MAB) 2. Epsilon-greedy algorithm 2.1. How much better is epsilon-greedy over random allocation (A/B testing)? 3. Thompson sampling 3.1. How do we know the probability of being optimal? 3.2. When to stop? 4. When to use MAB over A/B testing? 5. Practical considerations 6. MAB tools
1. Multi-Armed Bandit (MAB)In A/B testing we're usually doing a 50/50 split between the treatment and control groups.What if the treatment that we're experimenting with is absolutely horrible → now, we're stuck with 50% of our users potentially getting a horrible experience for, say, two weeks.One way to try to mitigate this would be assigning 99% of our customers to the control group and only assigning 1% to the treatment group to offset the impact if the treatment happens to really bad.This probably wouldn't work though since eventually we need some certain sample size from both groups in order for the experiment to be measured correctly.The goal here is to minimize the negative business impacts while still experimenting at a reasonable pace.How should we tradeoff exploration vs. exploitation?We could explore the less promising treatment but miss a potential better control OR we could exploit the potentially better control but miss an eventually better treatment.A very simple way to handle this would be using an Epsilon-greedy strategy.2. Epsilon-greedy algorithmIf we measure an experiment, say, two day in, we'll simply route each customer to the better performing experience with the probability of 1-𝜀, and route people to the less performing experience with the probability 𝜀. 2.1. How much better is epsilon-greedy over random allocation (A/B testing)?Reward is the outcome of allocating a user to a particular experience (A or B or ...).The Total reward obtained by a MAB is the sum of the rewards obtained by allocating a user u to an arm/experience aura,uThe expected reward is just the average reward given by some arm to some user → ur¯a,uRegret is the reward obtained from the optimal arm minus the reward obtained from the arm chosen.In general, we want to minimize the regret.Expected regretExp(regret)=ur¯a*,u-ur¯a,uThe Exp(regret) of A/B testing → u(r¯a*-r¯a) where u is the number of users.The Exp(regret) of epsilon-greedy → u𝜀(r¯a*-r¯a) The problem with both of these equations is that they're both linear in regret.So, the expected regret increases linearly with the number of users that we introduce to the experiment. 3. Thompson samplingThompson sampling is a technique we can use to get a logarithmic regret as the number of users grow → logu(r¯a*-r¯a) This means that for every user that comes into our experiment, each subsequent user will experience less and less regret. Thompson sampling is based on the idea that the frequency a user should be allocated to an experience should equal the probability of that experience being optimal.In other words, allocation percentage to a group is equal to that group's probability of being the optimal experience (at any point during the experiment).3.1. How do we know the probability of being optimal?Let's say we have two groups (A and B).We represent the CTP probabilities with a Beta distribution where 𝛼=𝛽=1 (i.e. completely uninformative prior about the CTP).Now, as users come in, we update the Beta distributions (of the assigned arm) based on the outcome (i.e. user clicked or not).As we continue and sample more of the population, the Beta distributions (of each arm) will be updated.How do we determine the probability that one arm is better than other arm?We just sample the Beta distribution each of each arm.The calculate the percent of the time that one arm's CTP is higher than the other arm's CTP. 3.2. When to stop?We stop when there's a 95% chance the value remaining in the experiment is less than 1% → How do we measure that?We start by sampling each arm's Beta distributions.Then, we count how many times one arm's distribution is greater than other arm's distribution (i.e. probability of being optimal/best).Next, we take the maximum of samples across all the arms → CTPmaxWe calculate the value add equation, CTPmax-CTPiCTPi, for all the samples in the optimal arm.Sort all the value add numbers and cut off the bottom 95% of the samples.If 95% of the samples are less than 1% then we would stop the experiment.* This means that there's a 95% chance that the value remaining (or how much value there is left in the experiment) is less than a 1% gain.Note: Both 95% and 1% are tuning parameters which you can decide on. 4. When to use MAB over A/B testing?
MABA/B
+ Many arms+ Few arms
+ Move traffic automatically to best arm+ Good when results are needed long-term
+ Short-term results+ Focus on learning
+ Focus on optimizing- Higher regret
- longer experiment
5. Practical considerationsHow often to update Beta distribution?Depends on how long customer actions take to record?Also, user preferences can be non-stationary.MAB can be contextual → context e.g. time of the day, user device, etc. 6. MAB toolsOptimizelyVisual Web OptimizerVowpal Wabbit (contextual bandits) Back to Top