
Impact Estimation

Table of Contents

1. Impact Estimation
2. How much does the experiment cost?
3. Potential experiment benefits
4. Point of diminishing returns
5. Risks to the business from an experiment
6. Shadow test: Mitigating business risks
7. Post-experiment
   7.1. Are the experiment results valid?
   7.2. Is the experience worth launching?

1. Impact Estimation
• This section is about how to estimate the impact of experiments, both before and after we run them.

2. How much does the experiment cost?
• One of the first things to consider when experimenting is: how much will it cost?
  – We measure cost in:
    * Time
      · Expect at least 2 weeks for the experiment itself
      · 2 weeks to 2 months of support work → time required for preparation, training data collection, model creation, experiment infrastructure, model hosting, post-experiment analysis
    * Headcount → how many people we need
      · 1 SWE
      · 1 data engineer
      · 1 data scientist
    * Opportunity cost
      · Other ML projects
      · Other non-ML projects
  – All of the above cost factors add up to a (rough) estimated $ amount.

3. Potential experiment benefits
• How much will the experiment benefit the company/customers?
• One approach to estimating the potential benefit is to pretend you have built a "perfect" model and work out what the result would look like.
• We want to measure those benefits in terms of important business metrics such as (additional) revenue or profit.
  – This makes it possible to compare all experiments (regardless of each experiment's specific metric).
  – We can take 1%, 2%, 5%, 10% of the perfect result and find out its impact on revenue/profit.
  – Example:
    * 1000 daily visitors to our website
    * 3% sign-up rate on average
    * Revenue per sign-up is $50.
    * The \(\max(\text{revenue})\) per day with a perfect model (i.e. 100% sign-up rate) is $50,000/day.
    * Right now (at 3%) → $1500/day
    * What if our model could get a 2% lift (2 percentage points, i.e. 3% → 5% sign-up rate)?
      · That'd bring an additional $1000/day → $365,000/year.
      · With a 70% profit margin → roughly $250,000/year in net profit ($365,000 × 0.7 = $255,500).
• Note: We want to compare the cost of an experiment to its potential upside.
  – You could go with the 10x rule here → i.e. upside > 10 * costs → This means that even if only 20% of the things you plan on doing go according to plan, you will still double your original investment within a year.

4. Point of diminishing returns
• You can often get 80% of the results that you're looking for with very simple models.
• If you want the remaining 20% of the result, you typically have to invest in far more complex models, which often take a lot more effort.
• It's often a good idea to go in with this mindset, so that you can apply ML to other areas of the business that haven't yet hit the point of diminishing returns.

5. Risks to the business from an experiment
• The obvious risk is a loss in a metric you care about → e.g. CTR (click-through rate), session time, customer trust, etc.
• Customer trust: data leakage
  – Your experiment models may require collecting confidential data.
  – If so, you have to implement security measures around the way you store the data to ensure you can experiment safely.
• Outages
  – Any time we introduce a new component (or change existing software components), we risk an outage.
• How quickly can we stop the experiment?
  – This is hugely important if you're experimenting with a significant portion of your business model.
• Risk of not experimenting
  – Often, companies that are willing to experiment and adjust their business models are the ones that last.
  – You always have to weigh the risks of experimenting against the risk of not experimenting.

6. Shadow test: Mitigating business risks
• One thing we can do is shadow testing.
• With shadow testing, we try to get some understanding of how new experiences behave.
• When a user makes a request, we run and log both the new and current experiences, but send only the current experience to the user.
• The logs from the new experience help us:
  – Do sanity checks → do the new experience's results actually make sense?
  – Measure the differences (between the new and current experience)
  – Look for errors/faults that we didn't catch during the testing phase
  – Measure the CPU, memory, disk usage, and latency of the new experience

7. Post-experiment
• So far, we've talked about the things we can do pre-experiment. In this section, we talk about the things we can do post-experiment.

7.1. Are the experiment results valid?
• Bias correction
  – One thing we have to do to answer this question is bias correction.
  – Sample selection bias happens when the users placed in a particular group have a predisposition to behave a certain way, regardless of the group they're in.
• Extrapolation appropriate
  – If we ran the experiment during holidays, payday, the start of school, etc.
    → it means that the experiment results are probably not valid to extrapolate.
  – If you want to experiment on a holiday, it's good to compare against the same period in the previous year and adjust for the normally expected growth.
    * This is to make sure that the growth (or how much of the growth) is due to the new experience and not the natural year-over-year business growth.
• Valid statistically
  – Frequentist A/B testing → p-value
  – Bayesian A/B testing → probability of \(B > A\), expected loss
  – MAB (multi-armed bandit) → value remaining is sufficiently low
• Experiment collisions
  – Was another experiment running that affected a similar user experience or, even if the user experience was different, the same metrics?
  – It's very important to make sure that there's no experiment collision.
• Carryover effects
  – Let's say that a sale on an item just concluded. A week later, our updated algorithm (the one we're experimenting with) happens to recommend that item a lot more.
    * That item will likely not be bought at its normal price, because the sale that ended a week earlier offered a much lower price.
    * At this point, it could look like your updated recommendation algorithm is unsuccessful.
    * But if you wait a few weeks, you may see different behavior.

7.2. Is the experience worth launching?
• Variants and invariants
  – Did the variants that you expected to change actually change?
  – Did any invariants change?
  – Do any of the changes in variants contradict each other?
• Did the customer service metrics change at all?
  – Spikes can indicate confusion or dissatisfaction.
• Did any cannibalization occur?
  – Pushing one product over another can have unforeseen consequences for the other product, such that it no longer sells as well.
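
The revenue arithmetic from the example in section 3 and the 10x rule can be sketched in a few lines of Python. This is only an illustration: the function names, the `days_per_year` default, and the experiment cost used in the usage note are assumptions, not part of the original notes. The lift is treated as an absolute change in sign-up rate (0.02 means 3% → 5%).

```python
def estimate_uplift(daily_visitors, revenue_per_signup, lift_points,
                    profit_margin, days_per_year=365):
    """Estimate the extra revenue and profit from a lift in sign-up rate.

    lift_points is an absolute change in sign-up rate (0.02 = +2 points).
    Returns (extra revenue per day, extra revenue per year, extra profit per year).
    """
    extra_daily = daily_visitors * lift_points * revenue_per_signup
    extra_yearly = extra_daily * days_per_year
    extra_profit = extra_yearly * profit_margin
    return extra_daily, extra_yearly, extra_profit


def passes_10x_rule(upside, cost, multiple=10):
    """True when the estimated upside exceeds `multiple` times the cost."""
    return upside > multiple * cost


if __name__ == "__main__":
    # Numbers from the section 3 example: 1000 visitors/day, $50 per sign-up,
    # a 2-point lift, and a 70% profit margin.
    daily, yearly, profit = estimate_uplift(1000, 50, 0.02, 0.70)
    print(f"extra revenue/day:  ${daily:,.0f}")
    print(f"extra revenue/year: ${yearly:,.0f}")
    print(f"extra profit/year:  ${profit:,.0f}")
```

As a usage note, with a purely hypothetical experiment cost of $20,000, `passes_10x_rule(profit, 20_000)` compares the yearly profit estimate against $200,000. Running the same function for 1%, 2%, 5%, and 10% lifts gives a quick way to compare experiments on a common revenue/profit scale.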
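
For the "Valid statistically" check in section 7.1, the Bayesian quantities (probability of B > A and expected loss) can be approximated by sampling. The sketch below is one common approach, not necessarily the one the notes have in mind: it assumes a binary conversion metric, a uniform Beta(1, 1) prior on each conversion rate, and Monte Carlo draws from the Beta posteriors. The function name `bayesian_ab`, the number of draws, and the sample counts in the usage note are all hypothetical.

```python
import random


def bayesian_ab(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Approximate P(B > A) and the expected loss of choosing B.

    conv_x / n_x are conversions and visitors for each variant.
    Assumes a Beta(1, 1) prior, so each posterior is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    b_wins = 0
    loss_if_b = 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            b_wins += 1
        # Loss from shipping B in the draws where A is actually better.
        loss_if_b += max(rate_a - rate_b, 0.0)
    return b_wins / draws, loss_if_b / draws


if __name__ == "__main__":
    # Hypothetical counts: A converts 30/1000 (3%), B converts 50/1000 (5%).
    prob_b_better, expected_loss = bayesian_ab(30, 1000, 50, 1000)
    print(f"P(B > A)      ~ {prob_b_better:.3f}")
    print(f"expected loss ~ {expected_loss:.5f}")
```

A typical decision rule is to launch B only when P(B > A) is high and the expected loss is below a threshold the business is willing to accept; the thresholds themselves are a judgment call.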