The first time you train a learning algorithm, you can almost guarantee it won’t work. Error analysis is therefore the heart of the ML development process: it tells you where your time is best spent to improve your learning algorithm’s performance.
Example (speech recognition): After training your model, pick a number (say, 100) of examples from the dev set that the model got wrong. Create a spreadsheet and, for each example, record your guess as to why the model got it wrong (see below). This process helps you understand whether specific categories, which may be the source of most errors, merit further effort and attention.
So far, most error analysis is done manually, in a Jupyter notebook or a spreadsheet (like the example above). Emerging MLOps tools are making this process easier.
The goal is to come up with a few categories where you could productively improve the algorithm.
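The tagging step described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the clip names, the tag names, and the `tag_fractions` helper are all hypothetical, and in practice the rows would come from your own spreadsheet or notebook.

```python
from collections import Counter

# Hypothetical error-analysis rows: one per dev-set example the model got wrong.
# Each row lists the tags a reviewer assigned; an example may carry several tags.
error_rows = [
    {"example": "clip_001", "tags": ["car_noise"]},
    {"example": "clip_002", "tags": ["people_noise"]},
    {"example": "clip_003", "tags": ["car_noise", "low_bandwidth"]},
    {"example": "clip_004", "tags": ["people_noise"]},
    {"example": "clip_005", "tags": []},  # no obvious cause identified
]

def tag_fractions(rows):
    """Return {tag: fraction of reviewed errors that carry that tag}."""
    counts = Counter(tag for row in rows for tag in row["tags"])
    total = len(rows)
    return {tag: count / total for tag, count in counts.items()}

fractions = tag_fractions(error_rows)

# List the tags from most to least common among the reviewed errors.
for tag, frac in sorted(fractions.items(), key=lambda kv: -kv[1]):
    print(f"{tag}: {frac:.0%} of reviewed errors")
```

Because one example can carry several tags, the fractions do not have to sum to 1.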
As you’re going through the tags, here are a few useful metrics to look at:
In addition to comparing different tags’ performance to that of the baseline, one other useful metric to look at is the percentage of data with that tag.
In the example below, the percentage of data for each tag tells us that we are better off working on the Clean Speech and People Noise tags, whereas looking solely at the gap to human-level performance (HLP) would suggest working on the Car Noise tag.
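One way to make this trade-off concrete is to multiply each tag's gap to HLP by the share of data carrying that tag, which estimates how much overall accuracy would rise if that tag reached HLP. The sketch below uses illustrative numbers chosen for this example, not measured results, and `potential_gain` is a hypothetical helper.

```python
# Illustrative numbers only (assumptions, not measured results): per-tag accuracy,
# human-level performance (HLP), and the share of the data carrying the tag.
tags = {
    "Clean Speech":  {"accuracy": 0.94, "hlp": 0.95, "data_share": 0.60},
    "Car Noise":     {"accuracy": 0.89, "hlp": 0.93, "data_share": 0.04},
    "People Noise":  {"accuracy": 0.87, "hlp": 0.89, "data_share": 0.30},
    "Low Bandwidth": {"accuracy": 0.70, "hlp": 0.70, "data_share": 0.06},
}

def potential_gain(stats):
    """Overall accuracy gained if this tag's accuracy were raised to HLP."""
    gap_to_hlp = stats["hlp"] - stats["accuracy"]
    return gap_to_hlp * stats["data_share"]

# Rank tags by estimated overall gain, largest first.
ranked = sorted(tags, key=lambda name: potential_gain(tags[name]), reverse=True)

for name in ranked:
    gain = potential_gain(tags[name])
    print(f"{name}: potential overall gain of about {gain:.2%}")
```

With these numbers, Car Noise has the largest gap to HLP but ranks below Clean Speech and People Noise, because so little of the data carries that tag.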
Decide on the most important categories to work on based on:
There’s no mathematical formula to tell you what to work on, but by looking at these factors, you should be able to make more fruitful decisions.
Once you have decided that there’s a category (or a few categories) where you want to improve average performance, consider adding data or improving the quality of the data for that category.
For categories you want to prioritize:
Improving data quality is generally time-consuming and expensive. By carrying out an analysis like the one above, you know what type of data you need to collect, which makes the effort more focused and efficient.
Datasets where the ratio of positive to negative examples is very far from 50-50 are called skewed datasets.
Examples of skewed datasets:
In skewed datasets, accuracy is not a good metric, because a model that always outputs 0 (just print(0)) can achieve very high accuracy. Instead, it’s more useful to build a confusion matrix.
With the confusion matrix, if your algorithm outputs 0 all the time, it has no true positives, so its recall is zero and the problem becomes visible.
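The point can be checked with a small sketch. The dataset here is made up (2 positives out of 100 examples), the helper names are hypothetical, and treating a ratio with a zero denominator as 0.0 is a convention assumed for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    """Precision and recall; a zero denominator yields 0.0 (assumed convention)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up skewed dataset: 2 positive examples, 98 negative examples.
y_true = [1] * 2 + [0] * 98
# A "model" that always predicts 0.
always_zero = [0] * len(y_true)

tp, fp, fn, tn = confusion_counts(y_true, always_zero)
accuracy = (tp + tn) / len(y_true)
precision, recall = precision_recall(tp, fp, fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Accuracy looks excellent while recall is zero, which is why the confusion matrix is the more informative view here.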
Sometimes you have one model with better recall and a different model with better precision. How do you compare these two models? A common way of doing that is the F1 score.
The intuition behind the F1 score is that you want an algorithm to do well on both precision and recall, and if it does worse on either of them, that’s pretty bad. The F1 score is a way of combining precision and recall that emphasizes whichever of the two is worse.
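For reference, the standard definition of the F1 score can be written as follows, using P for precision and R for recall (symbols chosen here for brevity):

```latex
F_1 = \frac{2}{\dfrac{1}{P} + \dfrac{1}{R}} = \frac{2PR}{P + R}
```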
In mathematics, the above formula is technically called the harmonic mean of precision and recall, which is like taking an average but placing more emphasis on whichever number is lower.
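A short sketch shows how the harmonic mean penalizes a model that is strong on one metric and weak on the other. The two models and their precision/recall values are hypothetical, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero (assumed convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs for two models.
models = {
    "model_a": (0.88, 0.79),  # balanced
    "model_b": (0.97, 0.07),  # very high precision, very low recall
}

scores = {}
for name, (precision, recall) in models.items():
    arithmetic_mean = (precision + recall) / 2
    scores[name] = f1_score(precision, recall)
    print(f"{name}: arithmetic mean={arithmetic_mean:.3f}, F1={scores[name]:.3f}")

best = max(scores, key=scores.get)
print("Preferred by F1:", best)
```

The arithmetic mean of model_b still looks moderate, but its F1 score is pulled down close to its poor recall.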
Note: The F1 score is just one way of comparing models based on precision and recall. Some applications call for weighting precision and recall differently.
Let’s say you’re detecting defects in smartphones. You may want to detect several different types of defects.
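With several defect types, one option is to compute precision, recall, and F1 separately for each type. The sketch below assumes each example has exactly one label (with "none" meaning no defect); the defect names, the data, and the `per_class_metrics` helper are hypothetical.

```python
def per_class_metrics(y_true, y_pred, classes):
    """Return {class: {"precision", "recall", "f1"}}, treating each class one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for cls in classes:
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        # Zero denominators yield 0.0 (assumed convention).
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Made-up labels and predictions for eight phones.
y_true = ["scratch", "dent", "none", "scratch", "pit_mark", "none", "dent", "scratch"]
y_pred = ["scratch", "none", "none", "scratch", "pit_mark", "scratch", "dent", "none"]

metrics = per_class_metrics(y_true, y_pred, ["scratch", "dent", "pit_mark"])
for cls, m in metrics.items():
    print(f"{cls}: precision={m['precision']:.2f} "
          f"recall={m['recall']:.2f} f1={m['f1']:.2f}")
```

Looking at the per-type numbers shows which defect types the model handles poorly, which a single aggregate score can hide.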
Even when your algorithm is doing well on F1 or accuracy, it’s often worth doing one last performance audit before you push it to production.
Check for accuracy, fairness/bias, and other problems:
Note: The ways that a system could go wrong tend to be very problem-dependent.
Note: For high-stakes applications, rather than having just one person brainstorm what could go wrong, have a team (or external advisers) help you brainstorm what to watch out for. This reduces the risk of the project not working.
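One simple form such an audit can take is comparing accuracy across slices of the data. The sketch below is only an illustration under stated assumptions: the "accent" field, the records, the `accuracy_by_slice` helper, and the 10-point `MAX_GAP` threshold are all hypothetical, and the right slices and metrics depend on your application.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Return {slice value: accuracy} for records grouped by slice_key."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[slice_key]
        total[group] += 1
        correct[group] += int(record["label"] == record["prediction"])
    return {group: correct[group] / total[group] for group in total}

# Made-up evaluation records, sliced by a hypothetical "accent" field.
records = (
    [{"accent": "accent_a", "label": 1, "prediction": 1}] * 4
    + [{"accent": "accent_b", "label": 1, "prediction": 1}] * 2
    + [{"accent": "accent_b", "label": 1, "prediction": 0}] * 2
)

overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
per_slice = accuracy_by_slice(records, "accent")

# Flag slices whose accuracy trails the overall accuracy by more than MAX_GAP.
MAX_GAP = 0.10  # hypothetical threshold
flagged = [group for group, acc in per_slice.items() if overall - acc > MAX_GAP]

print(f"overall accuracy: {overall:.2f}")
print("per-slice accuracy:", per_slice)
print("slices to investigate:", flagged)
```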