With a model-centric view, you take the data you have and develop a model that does as well as possible on it. Because a lot of academic research in AI was driven by researchers downloading a benchmark dataset and trying to do well on that benchmark, most academic research on AI is model-centric (the benchmark dataset is fixed).
In this view, you’d hold the data fixed and iteratively improve the code/model.
In the data-centric view, the quality of the data is paramount. You can use tools such as error analysis or data augmentation to systematically improve the data quality. For many applications, if your data is good enough, there are multiple models that will do just fine.
In this view, you can hold the code fixed and iteratively improve the data.
Note: There’s a role for both of these views in improving the performance of an ML system.
Let’s take speech recognition as an example. There are different types of speech input:
The first four items are similar, as they all pertain to some kind of mechanical noise. The last three are also similar, as they pertain to environmental noise. The diagram below shows the performance of the ML model vs HLP (human-level performance). As shown, augmenting the data for one type of input can also lift performance on other, similar types of input, narrowing the gap to HLP for those types as well and not only for the augmented one.
Once you have the new diagram, it shows you where the next biggest gap is (the one with the highest potential for improvement), and therefore where to augment the data next. In this way, the diagram helps you decide where to focus your data augmentation effort.
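The gap-based prioritization above can be sketched in a few lines. This is a minimal illustration, not the course's tooling: the category names and accuracy numbers are made up, and `rank_gaps` is a hypothetical helper.

```python
def rank_gaps(model_acc, hlp_acc):
    """Return (category, gap) pairs sorted by the HLP-minus-model gap, largest first."""
    gaps = {cat: hlp_acc[cat] - model_acc[cat] for cat in model_acc}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Illustrative numbers only; they are not taken from the original notes.
model_acc = {"car_noise": 0.92, "cafe_noise": 0.84, "library_noise": 0.88}
hlp_acc = {"car_noise": 0.93, "cafe_noise": 0.90, "library_noise": 0.89}

ranked = rank_gaps(model_acc, hlp_acc)
next_target = ranked[0][0]  # category with the largest gap to HLP
print(next_target)
```

After augmenting the data for the top-ranked category and retraining, you would recompute the accuracies and rank again, since the gaps for neighboring categories may have moved too.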
Data augmentation can be a very efficient way to get more data, especially for unstructured data problems. When carrying out data augmentation, there are a lot of choices you have to make. For example:
In speech recognition, you can create new data by adding noise to voice signals. For example, you can add cafe noise to a recording of someone speaking and synthesize a new training example.
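As a rough sketch of that idea, the snippet below mixes a noise clip into a speech clip at a chosen signal-to-noise ratio. The function name `mix_with_noise`, the `snr_db` parameter, and the synthetic stand-in signals are all assumptions made for illustration; a real pipeline would load recorded audio and would likely use an array library rather than plain lists.

```python
import math
import random


def mix_with_noise(speech, noise, snr_db):
    """Add noise to speech, scaling the noise to reach the requested SNR in dB."""
    # Repeat the noise clip as needed so it covers the whole speech clip.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose scale so that speech_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


# Synthetic stand-ins: a sine tone for the voice and Gaussian noise for the cafe.
random.seed(0)
speech = [0.5 * math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
cafe_noise = [random.gauss(0.0, 0.1) for _ in range(8000)]

augmented = mix_with_noise(speech, cafe_noise, snr_db=10)
```

The SNR is one of the choices you have to make: too little noise adds nothing new, while too much produces clips that even humans cannot transcribe.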
The goal of data augmentation is to create examples that your learning algorithm can learn from. As a framework for doing that, you can think about how you can create realistic examples that the algorithm does poorly on, but humans (or other baselines) do well on.
Here’s a checklist for when you’re creating new data:
Let’s say we have images of smartphones with scratches. Here you can augment the image with:
Here you repeatedly add or remove data (while holding the model fixed) and train and do error analysis to see which works.
For many ML problems, the train, dev, and test sets come from reasonably similar distributions. With data augmentation, however, you add a lot of data to the training set only, such as many examples with cafe noise, so your training set may now come from a very different distribution than the dev/test sets. Is this going to hurt your learning algorithm’s performance? Usually the answer is no, with some caveats, for unstructured data problems.
For unstructured data problems, if:
Bottom line, it’s quite unusual that data augmentation hurts your model, as long as your model is big enough.
Sometimes adding new data is difficult. Another useful thing to do in such cases is to take the existing examples and figure out if there are additional features you can add to them.
Let’s say we have a neural net model that takes customers’ and restaurants’ information and makes a recommendation. Let’s say you find out, after running an error analysis, that:
Here, it’s hard to synthesize new examples of customers or restaurants. So, data augmentation is hard here.
Possible features to add?
There are some customers who only order tea/coffee or only order pizza.
What are the added features that can help make a decision?
Product recommendation:
Unlike collaborative filtering, content-based filtering can handle the cold start problem (recommending new products that hardly anyone has interacted with yet), because it relies on a description of the product rather than on other users’ behavior.
Note: Adding features is more appropriate for structured data problems. For unstructured data problems, we typically use deep learning models, which learn very good features by themselves.
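To make the idea of adding features concrete, here is a minimal sketch based on the observation above that some customers only order tea/coffee or only order pizza. It derives per-category order shares from a customer's history; the category labels, the `share_` prefix, and `order_share_features` are hypothetical names chosen for illustration.

```python
from collections import Counter


def order_share_features(orders, categories):
    """Return the fraction of a customer's past orders that fall in each category."""
    counts = Counter(orders)
    total = len(orders)
    return {
        f"share_{cat}": (counts[cat] / total if total else 0.0)
        for cat in categories
    }


categories = ["tea_coffee", "pizza", "other"]
history = ["tea_coffee", "tea_coffee", "tea_coffee"]  # made-up order history

features = order_share_features(history, categories)
print(features)
```

A share close to 1.0 for one category is a soft signal the model can learn from, which tends to be more flexible than a hard-coded rule such as "never recommend anything else to this customer".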
When you’re improving your model iteratively, it’s very important to make sure that you have a robust experiment tracking system.
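One lightweight way to start is to keep a structured record per run. The sketch below is only an assumption about what such a record might hold (code version, dataset version, hyperparameters, metrics); the class and field names are hypothetical, and a shared spreadsheet or a dedicated tracking tool could serve the same purpose.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ExperimentRecord:
    """One training run; all field names here are illustrative assumptions."""
    name: str
    code_version: str
    dataset_version: str
    hyperparameters: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


def to_log_line(record):
    """Serialize a record as one JSON line, suitable for appending to a log file."""
    return json.dumps(asdict(record), sort_keys=True)


def best_run(records, metric):
    """Return the record with the highest value for the given metric."""
    return max(records, key=lambda r: r.metrics[metric])


# Made-up runs: same code, but the second uses an augmented dataset version.
runs = [
    ExperimentRecord("baseline", "abc123", "v1", {"lr": 0.01}, {"dev_accuracy": 0.88}),
    ExperimentRecord("plus_cafe_noise", "abc123", "v2", {"lr": 0.01}, {"dev_accuracy": 0.91}),
]

best = best_run(runs, "dev_accuracy")
print(to_log_line(best))
```

Recording the dataset version alongside the code version matters in a data-centric workflow, because the data is the part that changes between runs.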
What to track?
Tracking tools
Desirable features