
Frequentist A/B Testing

Table of Contents

1. Frequentist A/B Testing
2. A/B Testing
3. Frequentist approach to A/B testing
  3.1. Null Hypothesis
  3.2. p-value
    3.2.1. How to measure the p-value?
    3.2.2. Confidence intervals
4. A/A test
5. Tips about A/B testing
6. Infrastructure to run A/B tests
7. Tools for A/B testing

1. Frequentist A/B Testing

• The idea here is that we start with a hypothesis.
• Replace user experience with another
  – The hypothesis describes how we're going to replace one user experience with another and how users will react to the new experience.
  – For instance, we have a recommendation algorithm which just recommends the most popular piece of content, and we want to introduce some degree of personalization with an ML model.
  – This typically involves a UI change to reflect the new experience.
  – However, the change may not be apparent to users, as with demand forecasting → forecasting the traffic on a piece of content and doing the due diligence in terms of hardware and networking to handle that traffic. The opposite, not being able to handle the traffic well, would be very noticeable to users.
• Dependent variable selection
  – After deciding which new user experience we want to test, we also have to decide what the dependent variables should be. Some examples of dependent variables:
    * Incremental profit/revenue
    * Number/rate/probability of ads clicked
    * Listening/screen time
• Directionality of dependent variables
  – We'd also want to talk about the directionality of these dependent variables.
  – We want to be very specific about which way these dependent variables will move.
  – We want to anticipate multiple changes as well → if we assume revenue will go up, we also need an assumption about what will happen to profit.
• Experiment participants
  – We want to decide who will actually participate in the experiment.
• A template of a hypothesis could look like this →

"If we replace X with Y for some set of users, then [a, b, c, ...] will go [up/down] and [invariants] won't change."
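One way to keep hypotheses consistent across experiments is to store the template's slots as structured data. The sketch below is only an illustration: the class name `Hypothesis`, its field names, and the example values are assumptions, not part of any particular tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Hypothesis:
    """One A/B test hypothesis following the template (illustrative field names)."""
    control: str                     # X: the current experience
    treatment: str                   # Y: the new experience
    population: str                  # which users take part in the experiment
    expected_moves: Dict[str, str]   # dependent variable -> "up" or "down"
    invariants: List[str] = field(default_factory=list)  # metrics expected not to change

    def render(self) -> str:
        # Fill the template: "If we replace X with Y for ..., then ... and ... won't change."
        moves = ", ".join(
            f"{metric} will go {direction}"
            for metric, direction in self.expected_moves.items()
        )
        text = (
            f"If we replace {self.control} with {self.treatment} "
            f"for {self.population}, then {moves}"
        )
        if self.invariants:
            text += f" and {', '.join(self.invariants)} won't change"
        return text + "."


# Hypothetical example values.
h = Hypothesis(
    control="most-popular recommendations",
    treatment="personalized recommendations",
    population="10% of logged-in users",
    expected_moves={"ads clicked": "up"},
    invariants=["page load time"],
)
print(h.render())
```

Writing the directionality and the invariants down before the experiment starts makes it harder to reinterpret the result afterwards.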

2. A/B Testing

• A/B testing usually consists of a control group in which the experience is unchanged.
• We compare the users' reactions in the control group to a treatment group which has received the new experience.
• If we see a significant (desired) change in the treatment group compared to the control group, we then decide whether the treatment experience should replace the control experience.

3. Frequentist approach to A/B testing

• Baseline
  – For the frequentist approach, we need to have a baseline.
  – The baseline is what the current control experience offers in terms of some metric that we care about (e.g. click-through rate, CTR).
• Minimum detectable change
  – It's the smallest effect that we want to be able to measure → e.g. 1% in CTR.
  – We need to beat the baseline by some amount (i.e. the minimum detectable change) to cover the cost of our experiment → that's why it's also called the practical significance boundary: if we get anything lower than that, it's practically not worth it.
• Power
  – Power is the percent of the time that the minimum detectable change is found, assuming that it exists.
  – Power is usually represented by 1 − β → it's often 0.8 (i.e. β = 0.2).
• Significance
  – It's the percent of the time that a change is detected, assuming that it doesn't exist (the false positive rate).
  – This is often represented as α → it's often 0.05.
• Sample size
  – The sample size tells us how big each group (control and treatment) needs to be in order to reach the chosen significance and power.
  – Here's the equation to get the sample size:

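n = \frac{\left(z_{\alpha/2} \sqrt{2 p_{1} (1 - p_{1})} + z_{\beta} \sqrt{p_{1} (1 - p_{1}) + p_{2} (1 - p_{2})}\right)^{2}}{| p_{2} - p_{1} |^{2}}

  – p_{1} → baseline
  – p_{2} → baseline + minimum detectable change
  – The first square root is the standard deviation of the difference under the null hypothesis (both groups at the baseline rate); the second is the standard deviation under the alternative.
  – Note: z_{α/2} is the two-tailed critical value. In our case, since we only want to measure a lift and not a change in either direction, a one-tailed z-score is enough, so z_{α/2} is replaced by z_{α}.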
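The sample size calculation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it uses the baseline variance 2·p1·(1 − p1) in the first square root, takes z_β as the normal quantile at the desired power, and relies on `statistics.NormalDist` from the standard library (Python 3.8+). The function name and its parameters are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(baseline, min_detectable_change,
                          alpha=0.05, power=0.8, one_tailed=True):
    """Approximate number of users needed in EACH group (control and treatment)."""
    p1 = baseline
    p2 = baseline + min_detectable_change
    norm = NormalDist()  # standard normal distribution

    # Critical value for the significance level: z_alpha (one-tailed)
    # or z_{alpha/2} (two-tailed).
    z_alpha = norm.inv_cdf(1 - alpha if one_tailed else 1 - alpha / 2)
    # Quantile for the desired power, i.e. 1 - beta.
    z_beta = norm.inv_cdf(power)

    sd_null = sqrt(2 * p1 * (1 - p1))               # both groups at the baseline
    sd_alt = sqrt(p1 * (1 - p1) + p2 * (1 - p2))    # groups at p1 and p2

    n = (z_alpha * sd_null + z_beta * sd_alt) ** 2 / (p2 - p1) ** 2
    return ceil(n)


print(sample_size_per_group(0.07, 0.01))                    # one-tailed
print(sample_size_per_group(0.07, 0.01, one_tailed=False))  # two-tailed
```

For a 7% baseline and a one percentage point minimum detectable change, this comes out at roughly 8,200 users per group for the one-tailed test and roughly 10,400 for the two-tailed test, which shows how much a directional hypothesis reduces the required traffic.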

3.1. Null Hypothesis

• This is the case in which our hypothesis is incorrect, such that there's no difference between control and treatment.
  – We either reject or fail to reject the null hypothesis. Failing to reject it means we did not find a significant difference between control and treatment, so we keep treating them as equivalent (loosely, "accepting" the null hypothesis).

3.2. p-value

• What does a significant difference between the control and treatment groups mean?
  – We're running the experiment on a sample of the population → so we have to treat every measurement with uncertainty.
  – A better way to ask is → How likely would a certain percent change have been, given that the null hypothesis is true?
  – In the frequentist approach, we answer this question by finding the probability of seeing that percent change by random chance if we were to run the experiment many times.
  – We use a p-value to answer that question for us.
  – The p-value is the probability of seeing the result, or a more extreme result, by random chance if we were to run the experiment many times.
  – Typically, we work with a p-value threshold of 0.05 → i.e. if there were truly no difference and we ran the experiment 100 times, we'd see the result (or a more extreme result) in only about 5 of the experiments.

3.2.1. How to measure the p-value?

• Let's say we have the following example:

\begin{array}{lcc} & \text{CTP} & \text{no. of users } (N) \\ \text{Control } (c): & 7\% & 1062 \\ \text{Treatment } (t): & 8\% & 982 \end{array}

• First, we calculate the r value (the pooled CTP across both groups) as follows,

r = \frac{\mathrm{CTP}_{c} N_{c} + \mathrm{CTP}_{t} N_{t}}{N_{c} + N_{t}} = 7.48\%

• Now, we calculate the z value as follows,

z = \frac{\mathrm{CTP}_{t} - \mathrm{CTP}_{c}}{\sqrt{r (1 - r) \left(\frac{1}{N_{c}} + \frac{1}{N_{t}}\right)}} \approx 0.858
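The two formulas above can be checked numerically. The following is a minimal sketch, assuming Python 3.8+ for `statistics.NormalDist`; the function name `one_tailed_z_test` is illustrative. The normal CDF stands in for the z-table lookup, so the function also returns the one-tailed p-value.

```python
from math import sqrt
from statistics import NormalDist


def one_tailed_z_test(ctp_c, n_c, ctp_t, n_t):
    """Return (r, z, p_value) for a lift of treatment over control."""
    # Pooled click-through probability across both groups.
    r = (ctp_c * n_c + ctp_t * n_t) / (n_c + n_t)
    # Standard error of the difference under the null hypothesis.
    standard_error = sqrt(r * (1 - r) * (1 / n_c + 1 / n_t))
    z = (ctp_t - ctp_c) / standard_error
    # One-tailed p-value: probability of a z at least this large by chance.
    p_value = 1 - NormalDist().cdf(z)
    return r, z, p_value


r, z, p_value = one_tailed_z_test(0.07, 1062, 0.08, 982)
print(f"r = {r:.2%}, z = {z:.2f}, p-value = {p_value:.3f}")
```

For the example table this gives r of about 7.48%, z of about 0.86, and a one-tailed p-value of about 0.195. Calling the function with ten times as many users in each group and the same CTPs brings the p-value below 0.005.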

• If we take the z value and look it up in a one-tailed z-score table, we get a p-value of 0.1949 → which is > 0.05 → so we fail to reject the null hypothesis.
• If we had 10 times the number of users in each group (with the same CTPs), we would have got a p-value of about 0.003 and would be able to reject the null hypothesis.

3.2.2. Confidence intervals

• As well as p-values, we'd also want confidence intervals around the CTP values in the above example.
• The reason is that these probabilities came from a sample of the population, so we can't say with 100% confidence that these are the exact values.
• To calculate the confidence interval, we take each CTP and calculate the lower bound as follows,

\mathrm{CTP} - z_{95\%} \sqrt{\frac{\mathrm{CTP} (1 - \mathrm{CTP})}{n}}

• Note: Since we're only concerned with lifts in the CTP, and not with changes in either direction, we only concern ourselves with the lower bound.
• z_{95%} is the one-tailed z-score at 95% (about 1.64).
• Here's the CI for the example above:

\begin{array}{lcccc} & \text{CTP} & \text{no. of users } (N) & \text{CI lower bound} & \text{(margin)} \\ \text{Control } (c): & 7\% & 1062 & 5.72\% & (-1.28\%) \\ \text{Treatment } (t): & 8\% & 982 & 6.58\% & (-1.42\%) \end{array}
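The lower bounds in the table can be reproduced with a short function. This is a sketch under the same normal approximation as the formula above; the function name `ctp_lower_bound` is illustrative, and `statistics.NormalDist` (Python 3.8+) supplies the one-tailed z-score.

```python
from math import sqrt
from statistics import NormalDist


def ctp_lower_bound(ctp, n, confidence=0.95):
    """Return (lower_bound, margin) of a one-sided confidence interval for a CTP."""
    z = NormalDist().inv_cdf(confidence)      # one-tailed z-score at the given level
    margin = z * sqrt(ctp * (1 - ctp) / n)    # z times the standard error
    return ctp - margin, margin


for name, ctp, n in [("Control", 0.07, 1062), ("Treatment", 0.08, 982)]:
    lower, margin = ctp_lower_bound(ctp, n)
    print(f"{name}: lower bound {lower:.2%} (-{margin:.2%})")
```

For the example this gives lower bounds of roughly 5.7% for control and 6.6% for treatment, in line with the table up to rounding.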

• The lower bound means that if we were to run this same experiment 100 times, then in about 95 of those cases (95%) the lower bound computed from that run would fall at or below the true CTP.

4. A/A test

• This is when you have a control and a treatment group, but they both receive the same experience.
• This is a great way to test your A/B testing framework, because there shouldn't be statistically significant differences between the two groups.
• This can verify our A/B testing tool by checking for:
  – Sample bias
  – Incorrect analysis process (e.g. our random assignment function is not actually random)
• A/A testing is also helpful when we're working with overly sensitive metrics.
  – Let's say the metric we want to optimize has a very high variance.
  – We can run an A/A test on that metric and measure the difference between the two groups (A and A).
  – Whatever that difference is, we can set our minimum detectable change to at least that value.

5. Tips about A/B testing

• Result extrapolation: through time
  – You generally want to make sure that the experiment runs for at least two weeks.
  – This will help us observe week-to-week variations if they're there.
  – Generally, you have to run the A/B test for as long as it takes to get your required sample size.
  – However, we should consider that things can vary from week to week.
    * Generally, A/B tests rely on the assumption that we can extrapolate the results.
    * This means that if we run an experiment for two weeks and decide to replace the current control with the new treatment (because there was a significant difference in those two weeks), then we're assuming that the results we saw during the two-week period will hold for as long as that treatment is in service.
    * This can be a dangerous assumption in terms of seasonality → e.g. running experiments around the holidays and expecting the effect to last after the holiday season is over.
• Result extrapolation: through population
  – Since we only ran the experiment for two weeks, we only got a sample of all of our users.
  – So, we're assuming that the users outside the experiment will behave similarly to the users that were sampled.
• Change effects: Novelty effects
  – This happens when users use something just because it's new.
  – This effect could be transient.
• Change effects: Change aversion
  – This is what happens when users don't interact with something just because it's new.
  – Usually, if we see this effect, we can run the experiment longer so that the effect fades.
• Time-intensive feedback
  – Let's say we worked for a university and they wanted us to develop a recommendation algorithm to recommend particular courses to students.
  – If we wanted to measure increased class attendance, that might be a good thing to measure in an A/B test.
  – However, if we wanted to measure graduation rate, this would likely be a bad metric, because at least for the incoming class, four years would have to pass before we got any sort of feedback.
  – Generally, feedback within a month is a good idea.

6. Infrastructure to run A/B tests

• Usually, we have the following components: a user allocator, a database with an ingestion process, data processing steps, and a web interface for monitoring.
• We can represent a user using:
  – Cookies
  – User ID
  – Device ID
  – IP
  – etc.
  – Note: We want to pick something that has the least chance of changing during our A/B test.
• Note: We want to make sure that the user allocation is accompanied by a cookie (indicating the user's A/B group) that's stored in the user's browser, so that every time the user makes a request to our app, we show them the same experience → failing to do so will result in a faulty experiment.
  – We could also cache user allocations within our app.
• The user allocator usually manages several experiments, so in addition to the cookie, it also needs an experiment ID.
• The user allocator will store the actions that users take, as well as the user's group, in a database.
  – That database will have some ingestion process attached to it that could lead into an HDFS cluster.
  – One of the records in the HDFS cluster could look like this → {EXP129_XGH412: {'allocation': 'A', 'action': 'purchase'}}.
  – Here EXP129_XGH412 combines the experiment ID and the cookie ID.
• Next, we'd do some data processing on these records, so that we can count all of the page views that groups A and B had, as well as the number of purchases they made.
• After that, we can run another data processing step in which we take the above counts and calculate the p-value, CTP, and perhaps the CI.
  – This information could be propagated to some web interface so that people can monitor, track and see the history of all of their experiments.

7. Tools for A/B testing

• Optimizely
• Google Optimize
• Facebook PlanOut
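Going back to the allocation and counting steps in section 6, the sketch below shows one possible shape for them. It rests on assumptions that are not in the notes: the allocator assigns groups by hashing the experiment ID together with the user key (so the same user always lands in the same group even without a stored cookie), records are plain dictionaries in the shape of the example record, and the action name `page_view` and the IDs other than EXP129_XGH412 are made up for illustration.

```python
import hashlib
from collections import Counter


def allocate(experiment_id: str, user_key: str) -> str:
    """Deterministically assign a user to group 'A' or 'B' for one experiment."""
    # Hashing experiment ID + user key keeps assignments stable per user
    # and independent across experiments.
    digest = hashlib.sha256(f"{experiment_id}_{user_key}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def count_actions(records):
    """Count actions per group from records shaped like
    {"EXP129_XGH412": {"allocation": "A", "action": "purchase"}}."""
    counts = {"A": Counter(), "B": Counter()}
    for record in records:
        for payload in record.values():
            counts[payload["allocation"]][payload["action"]] += 1
    return counts


# Hypothetical records for illustration.
records = [
    {"EXP129_XGH412": {"allocation": "A", "action": "purchase"}},
    {"EXP129_XGH412": {"allocation": "A", "action": "page_view"}},
    {"EXP129_KLM907": {"allocation": "B", "action": "page_view"}},
]
counts = count_actions(records)
print(counts["A"]["purchase"], counts["A"]["page_view"], counts["B"]["page_view"])
```

The per-group counts are what the later processing step would turn into a CTP, a p-value, and a confidence interval.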
