Data-centric AI development

Model-centric

With the model-centric view, you’d take the data you have and develop a model that does as well as possible on it. Because a lot of academic research in AI was driven by researchers downloading a benchmark dataset and trying to do well on that benchmark, most academic research on AI is model-centric (the benchmark dataset is fixed).

In this view, you’d hold the data fixed and iteratively improve the code/model.

Data-centric

In this view, we treat the quality of the data as paramount. You can use tools such as error analysis or data augmentation to systematically improve the data quality. For many applications, if your data is good enough, there are multiple models that will do just fine.

In this view, you can hold the code fixed and iteratively improve the data.

Note: There’s a role for both of these views in improving the performance of an ML system.

Data Augmentation

A useful picture of data augmentation

Let’s take speech recognition as an example. There are different types of speech input:

Car noise
Plane noise
Train noise
Machine noise
Cafe noise
Library noise
Food court noise

The first four items are similar, as they all involve mechanical noise. The last three are also similar, as they involve environmental noise. The diagram below plots the performance of the ML model against HLP (human-level performance) for each type of input. As shown, augmenting the data for one type of input can lift the performance on other types of input as well, narrowing the gap to HLP across several types rather than only the one that was augmented.

Once you have the updated diagram, it shows you where the next biggest gap is (the one with the highest potential for improvement), and therefore where to augment the data next. In this way, the diagram helps you decide where to focus your data augmentation effort.
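The idea of picking the next biggest gap can be sketched in a few lines of Python. This is only an illustration: the accuracy values (in percent) are made up for the example rather than measured, and `rank_gaps` is a hypothetical helper name.

```python
# Hypothetical accuracies in percent, for illustration only (not measured results).
hlp_acc = {"car noise": 95, "plane noise": 94, "cafe noise": 93, "library noise": 96}
model_acc = {"car noise": 90, "plane noise": 91, "cafe noise": 85, "library noise": 94}


def rank_gaps(model_acc, hlp_acc):
    """Return (category, gap) pairs, largest gap to HLP first."""
    gaps = {category: hlp_acc[category] - model_acc[category] for category in hlp_acc}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


ranked = rank_gaps(model_acc, hlp_acc)
for category, gap in ranked:
    print(f"{category}: {gap} points below HLP")
```

After each round of augmentation and retraining, you would recompute the model accuracies and rank the gaps again, since improving one category may also have moved the others.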

Data Augmentation

Data augmentation can be a very efficient way to get more data, especially for unstructured data problems. When carrying out data augmentation, there are a lot of choices you have to make. For example:

What are the parameters?
How do you design the data augmentation setup?

Data augmentation example: Speech recognition

In speech recognition, you can create new data by adding noise to voice signals. For example, you can add cafe noise to a recording of someone speaking to synthesize a new training example.
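A minimal sketch of this kind of mixing is shown below, assuming both clips are lists of float samples at the same sample rate. The function names, the sine wave standing in for a voice clip, the random samples standing in for cafe noise, and the signal-to-noise ratio (`snr_db`) parameter are all assumptions made for illustration. A real pipeline would more likely load audio files and use an array library.

```python
import math
import random


def mean_power(signal):
    """Average squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_with_noise(speech, noise, snr_db):
    """Add noise to speech, scaled so the mix has the requested SNR in dB."""
    # Repeat or trim the noise so it covers the whole speech clip.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    # Choose scale so that mean_power(speech) / mean_power(scale * noise)
    # equals 10 ** (snr_db / 10). Assumes the noise is not all zeros.
    scale = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


random.seed(0)
sample_rate = 16000
# Stand-ins for real recordings: a tone for the voice, random samples for cafe noise.
speech = [0.5 * math.sin(2 * math.pi * 220 * i / sample_rate) for i in range(sample_rate)]
noise = [random.gauss(0.0, 1.0) for _ in range(sample_rate // 2)]
augmented = mix_with_noise(speech, noise, snr_db=10)
```

The SNR value is one of the parameters you have to choose: too low and the example stops sounding realistic or intelligible, too high and the model already handles it.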

The goal of data augmentation is to create examples that your learning algorithm can learn from. As a framework for doing that, you can think about how you can create realistic examples that the algorithm does poorly on, but humans (or other baselines) do well on.

Here’s a checklist for when you’re creating new data:

Does it sound realistic?
Is the x $\rightarrow$ y mapping clear? (e.g. can humans recognize speech?)
Is the algorithm currently doing poorly on it?

Data augmentation example: Images

Let’s say we have images of smartphones with scratches. Here you can augment the image with:

Flipping (horizontally)
Changing contrast (brightening images)
- Note: Darkening the image wouldn’t work because with darker images even humans cannot see the scratches.
Take a photo of a phone with no scratches and use Photoshop to artificially draw scratches.
Use GANs to synthesize scratches automatically (although this can be overkill. Simpler techniques are much easier to implement).
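The two simplest augmentations in the list, flipping and brightening, can be sketched as follows. This assumes a grayscale image stored as a list of rows of 0-255 pixel values. The tiny example image and the function names are hypothetical, and a real pipeline would more likely use an image library.

```python
def flip_horizontal(image):
    """Mirror a grayscale image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def brighten(image, factor):
    """Scale pixel values by factor, clipping to the valid 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


# A made-up 2x3 grayscale image, for illustration only.
image = [
    [10, 20, 30],
    [40, 200, 60],
]
flipped = flip_horizontal(image)
brighter = brighten(image, 1.5)
```

A factor above 1 brightens the image. Per the note above, factors below 1 (darkening) are worth avoiding here if they make the scratches invisible even to humans.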

Data iteration loop

Here you repeatedly add or remove data (while holding the model fixed), retrain, and do error analysis to see what works.
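As a rough sketch of one pass through this loop, the code below keeps the training code fixed and only swaps the training set, comparing each candidate on the same dev set. The one-dimensional nearest-centroid classifier and all the data points are toy assumptions chosen to keep the example small. They are not part of the original material.

```python
def train(examples):
    """Fixed model code: a 1-D nearest-centroid classifier."""
    centroids = {}
    for label in {y for _, y in examples}:
        values = [x for x, y in examples if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(centroids, examples):
    """Fraction of examples classified correctly."""
    return sum(predict(centroids, x) == y for x, y in examples) / len(examples)


# Toy data, for illustration only: (feature, label) pairs.
base_train = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
extra = [(4.0, 1), (5.0, 1)]  # added examples near the decision boundary
dev = [(1.0, 0), (3.0, 0), (4.5, 1), (6.0, 1)]

candidates = {"base": base_train, "base + boundary examples": base_train + extra}
# The model code stays the same; only the training data changes.
results = {name: accuracy(train(data), dev) for name, data in candidates.items()}
best = max(results, key=results.get)
```

In practice each iteration would be followed by error analysis on the dev set to decide which data to add or remove next.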

Can adding data hurt?

For a lot of ML problems, the distributions of the train/dev/test datasets are reasonably similar. With data augmentation, you add a lot of data to specific parts of the training set, such as many examples with cafe noise, so your training set may now come from a very different distribution than the dev/test sets. Is this going to hurt your learning algorithm’s performance? Usually the answer is no, with some caveats, for unstructured data problems.

For unstructured data problems, if:

The model is large (low bias).
The mapping $x \rightarrow y$ is clear (e.g. given only the input $x$ , humans can make accurate predictions).
- Then, adding data rarely hurts accuracy.
Conversely, adding data could hurt when the model is small or the mapping is not clear.

Bottom line, it’s quite unusual that data augmentation hurts your model, as long as your model is big enough.

Adding features

Sometimes adding new data is difficult. Another useful thing to do in such cases is to take the existing examples and figure out if there are additional features you can add to them.

Structured data

Restaurant recommendation example

Let’s say we have a neural net model that takes customers’ and restaurants’ information and makes a recommendation. Let’s say you find out, after running an error analysis, that:

Vegetarians are frequently recommended restaurants with only meat options.

Here, it’s hard to synthesize new examples of customers or restaurants. So, data augmentation is hard here.

Possible features to add?

Is a person vegetarian (based on past behaviors)?
Does the restaurant have vegetarian options (based on the menu)?
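One way these two features might be derived is sketched below. The record layouts (`contains_meat`, `vegetarian`), the 0.9 threshold, and the function names are assumptions made for illustration. A real system would define them from its own order and menu data, and might use a soft score rather than a hard yes/no.

```python
def is_vegetarian(past_orders, threshold=0.9):
    """Guess a vegetarian preference from the share of past orders without meat."""
    if not past_orders:
        return False  # no order history, so no evidence either way
    meat_free = sum(1 for order in past_orders if not order["contains_meat"])
    return meat_free / len(past_orders) >= threshold


def has_vegetarian_options(menu):
    """True if at least one menu item is marked vegetarian."""
    return any(item["vegetarian"] for item in menu)


# Hypothetical records, for illustration only.
past_orders = [{"contains_meat": False} for _ in range(10)]
menu = [
    {"name": "steak", "vegetarian": False},
    {"name": "salad", "vegetarian": True},
]

# Extra features to feed the recommender alongside its existing inputs.
features = {
    "is_vegetarian": is_vegetarian(past_orders),
    "has_vegetarian_options": has_vegetarian_options(menu),
}
```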

Food delivery example

There are some customers who only order tea/coffee or only order pizza.

What are the added features that can help make a decision?

Product recommendation:

Collaborative filtering $\rightarrow$ content-based filtering.

Unlike content-based filtering, collaborative filtering has a cold-start problem (you don’t know how to recommend new products that few or no users have interacted with yet).

Note: Adding features is more appropriate for structured data problems. For unstructured data problems, we use deep learning models which by themselves find very good features.

Experiment tracking

When you’re improving your model iteratively, it’s very important to make sure that you have a robust experiment tracking system.

What to track?

Algorithm and code versioning
Dataset used
Hyperparameters
Save the results somewhere (metrics and trained models)
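At the text-file end of the spectrum, a record covering these four items could be appended to a JSON Lines file, as in the sketch below. The field names, the `log_experiment` and `dataset_fingerprint` helpers, the placeholder code version string, and the metric value are all hypothetical. Dedicated tracking tools provide their own APIs for this.

```python
import hashlib
import json
import os
import tempfile


def dataset_fingerprint(rows):
    """Hash the dataset contents so a run can be tied to the exact data used."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def log_experiment(path, code_version, rows, hyperparameters, metrics):
    """Append one experiment record as a line of JSON and return it."""
    record = {
        "code_version": code_version,        # e.g. a version control commit id
        "dataset": dataset_fingerprint(rows),
        "hyperparameters": hyperparameters,
        "metrics": metrics,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


# Placeholder values, for illustration only (not real results).
log_path = os.path.join(tempfile.mkdtemp(), "experiments.jsonl")
rows = [{"x": [0.1, 0.2], "y": 1}, {"x": [0.3, 0.4], "y": 0}]
record = log_experiment(
    log_path,
    code_version="abc1234",
    rows=rows,
    hyperparameters={"learning_rate": 0.01, "epochs": 5},
    metrics={"dev_accuracy": 0.91},
)
```

Hashing the data is only one option. For large datasets, a path plus a dataset version identifier may be more practical.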

Tracking tools

Text files
Spreadsheets
Experiment tracking systems, e.g. Weights & Biases, Comet, MLflow, SageMaker Studio.

Desirable features

Information needed to replicate results (if part of the data is fetched from the internet, replicability can suffer because that data may change)
Experiment results, ideally with summary metrics/analysis
Perhaps also: resource monitoring, visualization, model error analysis

From big data to good data