Introduction

In this section, we’re going to focus on questions such as:

Model-centric AI development vs. Data-centric AI development

Historically, AI development has put a lot of emphasis on choosing the right model (e.g. the NN architecture). For practical projects, it can be even more useful to take a data-centric approach, where you focus not just on improving your ML model but also on making sure you’re feeding your algorithm high-quality data.

Key Challenges

One framework for thinking about an AI system is: AI system = Code + Data

Model development is an iterative process

[Figure: the iterative model development loop]

Because ML is such an empirical process, being able to go through this loop many times very quickly is key to improving performance.

After several iterations, it also helps to carry out a richer error analysis and do an audit to make sure the model is working before you push it to production deployment.

Challenges in model development

When building a model, there are three key milestones that most projects should aspire to accomplish (order is important here).

  1. Doing well on the training set (usually measured by average training error).
  2. Doing well on the dev/test sets.
  3. Doing well on business metrics/project goals.

Why low average error isn’t good enough

As hard as it is to do well on the hold-out dataset, unfortunately, sometimes that’s not enough. Other things need to be done to make a project successful.

In addition to data drift and concept drift, there are some additional challenges we may have to address for a production ML project.

  1. Performance on disproportionately important examples: An ML system may have a low average test error, but if its performance on a set of disproportionately important examples isn’t good enough, then the ML system will still not be acceptable for production deployment.
    • Example: Web search: There are a lot of web search queries like these: “Apple pie recipe”, “Latest movies”, “Wireless data plan”, “Diwali festival”, etc. These types of queries are called informational and transactional queries. The user just wants some information about something they don’t know much about. In such cases, they might be willing to forgive the search engine for not returning the best “apple pie recipe”. A different type of query, such as “Stanford”, “Reddit”, etc., is called a navigational query. Here, the user has a very clear intent to navigate to a specific website, so they tend to be very unforgiving if the search engine returns anything other than the right result (e.g. Stanford --> Stanford.edu). Navigational queries, in this case, are disproportionately important examples.
    • The challenge here is, of course, that average test set accuracy weighs all examples equally.
    • One thing you could do is give disproportionately important examples a higher weight. That could work for some applications, but it doesn’t always solve the entire problem.
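
The weighting idea can be sketched in a few lines of Python. This is a minimal illustration, not a recommended metric: the query outcomes, the navigational flags, and the 5x weight are all made-up assumptions chosen only to show how a weighted average can diverge from plain accuracy.

```python
def weighted_accuracy(correct, weights):
    """Accuracy where each example counts in proportion to its weight."""
    total = sum(weights)
    return sum(w for c, w in zip(correct, weights) if c) / total


# Hypothetical evaluation set: 8 informational queries answered acceptably,
# 2 navigational queries answered wrongly.
correct = [True] * 8 + [False, False]
is_navigational = [False] * 8 + [True, True]

# Plain accuracy weighs every query equally.
plain = sum(correct) / len(correct)

# Assumed weighting: a navigational query counts 5x as much as any other.
weights = [5.0 if nav else 1.0 for nav in is_navigational]
weighted = weighted_accuracy(correct, weights)

print(f"plain accuracy:    {plain:.2f}")
print(f"weighted accuracy: {weighted:.2f}")
```

With these assumed numbers the plain accuracy looks healthy while the weighted figure drops sharply, which is the signal that the important examples are being missed. How large the weight should be is a judgment call for each application.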

  2. Performance on key slices of the dataset (fairness):
    • Example: ML for loan approval: Assume an ML system that predicts who is going to repay a loan and thus recommends which loans to approve.
    • For such a system, you want to make sure it does not unfairly discriminate by ethnicity, gender, location, language, or other protected attributes.
    • Although the AI community has mostly discussed fairness toward individuals, fairness issues can also arise in other settings.
      • Example: Product recommendations from retailers: In the recommendation systems of large retailers, where you work with many vendors and brands, you want to be careful to treat all major user, retailer, and product categories fairly.
      • Even if an ML prediction system has a high average test set accuracy (i.e. it recommends better on average), it may be unacceptable if it gives very irrelevant recommendations to all users of one ethnicity, if it always pushes products from large retailers and ignores smaller brands, or if it never recommends a specific product category.
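
One way to surface this kind of problem is to report accuracy per slice next to the overall average. The sketch below assumes a hypothetical recommender evaluation where each example is tagged with a slice name; the labels, predictions, and slice names (`large_retailer`, `small_brand`) are invented for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(y_true, y_pred, slices):
    """Return {slice_name: accuracy} computed separately for each slice."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for t, p, s in zip(y_true, y_pred, slices):
        counts[s] += 1
        hits[s] += int(t == p)
    return {s: hits[s] / counts[s] for s in counts}


# Hypothetical data: 8 examples from large retailers, 2 from small brands.
y_true = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
slices = ["large_retailer"] * 8 + ["small_brand"] * 2

overall = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
per_slice = accuracy_by_slice(y_true, y_pred, slices)

print(f"overall: {overall:.2f}")
for name, acc in per_slice.items():
    print(f"{name}: {acc:.2f}")
```

In this made-up example the overall accuracy hides the fact that every small-brand example is wrong. Which slices to track (user groups, retailers, product categories) depends on the application.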

  3. Rare classes: Specifically, cases of skewed data distributions.
    • Example: Medical diagnosis: In medical diagnosis, it’s common for most patients not to have a certain disease, so you may end up with a dataset where 99% of examples are negative and only 1% are positive.
      • In such cases, you can achieve very good test set accuracy by writing a program that predicts “0” for everyone!
      • In medical fields, it’s not acceptable to ignore (fail to diagnose) obvious cases of illness.
      • This can also happen when one class (or a few classes) has very few observations. In such cases, even if you predict every example of the rare class wrong, you might still get high average test set accuracy.
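
The 99%/1% case is easy to reproduce. The sketch below builds a synthetic dataset with that split and scores a "predict 0 for everyone" program. Recall on the positive class is used here as one possible complementary metric; that choice is an assumption of this example, not something the notes prescribe.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples where the prediction matches the label."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    actual = sum(1 for t in y_true if t == positive)
    found = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return found / actual if actual else 0.0


# Synthetic skewed dataset: 1 positive example, 99 negative examples.
y_true = [1] + [0] * 99

# A "model" that predicts 0 for everyone.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)

print(f"accuracy: {acc:.2f}")
print(f"recall on the positive class: {rec:.2f}")
```

The always-zero predictor scores 99% accuracy while finding none of the positive cases, which is why accuracy alone is a poor signal on skewed data.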

Bottom line

We need to go beyond just doing well on the test set.

Establish a baseline

What are some of the best practices for quickly establishing a baseline?

Establishing a baseline level of performance

Let’s assume that, for a speech recognition application, you’ve established these four major categories of speech:

[Figure: table of the four speech categories]

Unstructured and structured data

It turns out the best practices for establishing a baseline are quite different depending on whether you’re working on unstructured or structured data.

Unstructured data tends to be data that humans are very good at interpreting. So, measuring human-level performance (HLP) is often a good way to establish a baseline.

In contrast, structured data tends to live in giant databases (e.g. sales transaction datasets), where HLP is usually a less useful baseline.

Ways to establish a baseline

A baseline helps to indicate what might be possible. In some cases (such as HLP), it also gives a sense of the irreducible error/Bayes error.

By giving us a very rough sense of what might be possible, a baseline helps us prioritize what to work on much more efficiently.
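
As a rough illustration of how a baseline can guide prioritization, the sketch below compares model accuracy against HLP for each category and ranks the categories by the remaining gap. The category names and every number are hypothetical placeholders, not values from these notes.

```python
# Hypothetical per-category results; names and numbers are placeholders.
results = {
    "category_a": {"model": 0.94, "hlp": 0.95},
    "category_b": {"model": 0.89, "hlp": 0.93},
    "category_c": {"model": 0.87, "hlp": 0.89},
    "category_d": {"model": 0.70, "hlp": 0.70},
}

# Gap to the baseline: a rough estimate of the room left for improvement.
gaps = {name: r["hlp"] - r["model"] for name, r in results.items()}

# Categories with the largest gap come first.
ranked = sorted(gaps, key=gaps.get, reverse=True)

for name in ranked:
    print(f"{name}: gap to HLP = {gaps[name]:.2f}")
```

A category where the model already matches HLP (here `category_d`) may have little room left, even if its absolute accuracy is the lowest. In practice the gap would usually be weighed against how much of the data each category covers.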

Tips for getting started

Getting started on modeling

Deployment constraints when picking a model

Should you take into account deployment constraints (e.g. compute constraints) when picking a model? Yes, if a baseline is already established and the goal is to build and deploy. No (or not necessarily), if the purpose is to establish a baseline and determine what’s possible and what might be worth pursuing.

Sanity-check for code and algorithm