Why is data definition hard?

Iguana detection example

To illustrate that, we’re going to use the Iguana detection example.

Let’s say you’ve collected hundreds of pictures of iguanas and you send these pictures to labelers with instructions to use a bounding box to indicate the position of iguanas.
The labelers might have different interpretations as to where the boundaries of iguanas are in the pictures.

Any one of these three labeling conventions would be fine on its own. What is NOT fine is having 1/3 of your labelers use the first, 1/3 the second, and 1/3 the third convention, because then your labels are not consistent.
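One way to spot this kind of inconsistency is to compare the boxes that different labelers drew for the same image. The sketch below is only an illustration: it assumes boxes are stored as hypothetical `(x_min, y_min, x_max, y_max)` tuples and uses intersection over union (IoU), a common overlap measure that these notes do not prescribe.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is an assumed (x_min, y_min, x_max, y_max) tuple.
    """
    # Corners of the overlapping rectangle (if any)
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width or height is clamped to zero when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes drawn by two labelers for the same iguana
tight_box = (10, 10, 50, 50)
loose_box = (0, 0, 60, 60)
print(iou(tight_box, loose_box))
```

A low IoU between labelers on the same image suggests they are following different conventions, which is a cue to revisit the labeling instructions.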

Phone defect detection example

Different ways of labeling scratches and marks create inconsistent labels if they are all used together.

More label ambiguity examples

Speech recognition example

Sometimes sound bites are transcribed differently (especially when the audio quality is poor or the speech is ambiguous):

“Um, nearest gas station”
“Umm, nearest gas station”
“…, nearest gas station”
“Nearest gas station [unintelligible]”

There are combinatorially many ways to transcribe such cases.

Being able to standardize on one convention will help the speech recognition algorithm.
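Part of a transcription convention can be enforced in code. The following is a minimal sketch under an assumed, hypothetical convention (lower-case text, filler words spelled `um`, ellipses dropped); the function name and the rules are illustrative, and the real convention is whatever your team agrees on.

```python
import re


def normalize_transcript(text):
    """Map variant spellings to one (hypothetical) transcription convention."""
    text = text.lower()
    text = re.sub(r"\bum+\b", "um", text)            # "Umm" -> "um"
    text = text.replace("…", "").replace("...", "")  # drop ellipses
    text = re.sub(r"\s+", " ", text)                 # collapse whitespace
    return text.strip(" ,")                          # trim stray commas/spaces


print(normalize_transcript("Umm, nearest gas station"))
```

Normalization like this only removes spelling differences. It cannot decide whether the speaker actually said "um", so the labeling instructions still have to settle that question.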

Example of structured data

A common application in many large organizations is user ID merge, i.e. when you have multiple data records that you think correspond to the same person and you want to merge the user data records together while making sure there are no duplicates (entity resolution problem).

Entity resolution algorithms need some labeled data (to determine if two IDs are the same person or not). If those labels are not consistent, your algorithm could perform poorly.

Some other examples …

Is it a bot/spam account?
Is it a fraudulent transaction?
Is someone looking for a job? (based on their activity on your website)

In all of the cases above, the ground truth can be ambiguous. If you ask people to take their best guess at the ground-truth label for tasks like these, labeling instructions that produce more consistent, less noisy labels will improve the performance of your learning algorithm.

Data definition questions

What is the input $x$ ?
- Phone defect detection: lighting? contrast? resolution?
- If the quality of your input is not good, you have to improve it in the first place.
- What features need to be included?
What is the target label $y$ ?
- All the examples above.

Major types of data problems

There are several major types of ML projects, and the data definition problems vary with the type.

| | Unstructured | Structured |
| --- | --- | --- |
| Small data (<=10K) | Manufacturing visual inspection from 100 training examples | Housing price prediction based on square footage, etc. from 50 training examples |
| Big data (>10K) | Speech recognition from 50 million training examples | Online shopping recommendations for 1 million users |

Unstructured vs structured data

Unstructured
- May or may not have a huge collection of unlabeled examples $x$ .
- Humans can label more data
- Data augmentation is more likely to be helpful
Structured
- May be more difficult to obtain more data
- Human labeling may not be possible (with some exceptions)

Small vs big data

Small data
- Clean labels are critical
- Can manually look through a dataset and fix labels.
- Can get all the labelers to talk to each other.
Big data
- Emphasis on the data process

Note: If you’re working on a problem from one of these four quadrants, then on average, advice from someone who has worked in the same quadrant will probably be more useful than advice from someone who has worked in a different quadrant.

Small data and label consistency

A lot of AI has recently grown up in large consumer internet companies, which may have 100 million or even a billion users and therefore very large datasets. As a result, some of the practices for dealing with small datasets have not been emphasized as much.

It’s possible to find a good model fit for small data where you have consistent and clean labels.

Big data problems can have small data challenges too

Problems with a large dataset but where there’s a long tail of rare events in the input will have small data challenges too.

Web search (rare queries)
Self-driving cars (rare events)
Product recommendation systems

Improving label consistency

Find a few examples and have multiple labelers label the same examples.
When there is disagreement, have the ML engineer (MLE), subject matter expert (SME), and/or labelers discuss the definition of $y$ to reach an agreement.
If labelers believe that $x$ doesn’t contain enough information, consider changing $x$ .
Iterate until it’s hard to significantly increase agreement.
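To know whether agreement is still improving between iterations, it helps to put a number on it. The sketch below is one simple option, not a method from these notes: it assumes each example comes with a list of labels from several labelers and reports the fraction of labeler pairs that agree. The function name and sample data are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(labels_per_example):
    """Fraction of labeler pairs that gave the same label, over all examples."""
    agree = 0
    total = 0
    for labels in labels_per_example:
        # Compare every pair of labelers on this example
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    return agree / total if total else 0.0


# Hypothetical labels: three labelers, three examples
labels = [
    ["defect", "defect", "defect"],
    ["defect", "no defect", "defect"],
    ["no defect", "no defect", "no defect"],
]
print(pairwise_agreement(labels))
```

Tracking this number after each round of discussion shows when further changes to the instructions stop paying off.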

Examples

Standardize labels
Merge classes, e.g. Deep scratch and shallow scratch $\rightarrow$ scratch
Have a class/label to capture uncertainty
- If class ambiguity cannot be reduced, introduce a new class to capture borderline cases, e.g. in phone defect detection $\rightarrow$ (defect, not defect, borderline)
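Merging classes can be applied to already-collected labels with a simple lookup. This is a minimal sketch assuming labels are plain strings; `MERGE_MAP` and `standardize_label` are hypothetical names introduced for illustration.

```python
# Hypothetical mapping from fine-grained labels to the merged class
MERGE_MAP = {
    "deep scratch": "scratch",
    "shallow scratch": "scratch",
}


def standardize_label(raw_label):
    """Normalize case/whitespace, then merge classes listed in MERGE_MAP."""
    label = raw_label.strip().lower()
    # Labels without an entry in the map are kept unchanged
    return MERGE_MAP.get(label, label)


print([standardize_label(l) for l in ["Deep scratch", "shallow scratch", "dent"]])
```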

Improving label consistency: small vs big data

Small data
- Usually a small number of labelers
- Can ask labelers to discuss specific labels.
Big data
- Get to a consistent definition with a small group
- Then send labeling instructions to labelers
- Can consider having multiple labelers label every example and using voting or consensus labels to increase accuracy
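The voting step for big data can be sketched as a majority vote over the labels collected for each example. This is an illustrative sketch, assuming labels are hashable values; returning `None` on a tie (so the example can be sent for review) is a design choice made here, not something the notes specify.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label, or None if empty or tied for first place."""
    if not labels:
        return None
    counts = Counter(labels).most_common()
    # A tie between the top two labels means there is no clear consensus
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


print(majority_vote(["defect", "defect", "not defect"]))
```

Examples that come back as `None` are natural candidates for the discussion between labelers described earlier.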

Human Level Performance (HLP)

Some ML tasks are trying to predict an inherently ambiguous output, and HLP can establish a useful baseline of performance as a reference.

But, HLP is also sometimes misused.

Why measure HLP?

Estimate Bayes error/irreducible error to help with error analysis and prioritization. HLP is measured by comparing human predictions against the ground truth, which gives a baseline for the learning algorithm.

Note: The ground truth itself was probably created by a human. So, are we really measuring what is possible, or are we just measuring how well two different people happen to agree with each other?

When the ground truth label is itself determined by a person, there’s a very different approach to thinking about HLP.

Other uses of HLP

In academia, establish and beat a respectable benchmark to support a publication.
Business or product owner asks for 99% accuracy. HLP helps establish a more reasonable target.
“Prove” the ML system is superior to humans doing the job and thus the business or product owner should adopt it.
- Note: Although this logically makes sense, in practice, this approach rarely works. So, use this logic with caution (or just don’t use it). Reasons for that would be:
  - Most business applications require more than just high average test accuracy.
  - Learning algorithm may get an unfair advantage when the labeling instructions are inconsistent.
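The unfair advantage can be made concrete with a small worked example. The numbers are hypothetical: suppose the instructions allow two equally valid conventions, and each labeler independently picks convention A with probability $p = 0.7$ and convention B with probability $1 - p = 0.3$. The chance that a human labeler matches a "ground truth" written by another labeler is

$$P(\text{two labelers agree}) = p^2 + (1 - p)^2 = 0.7^2 + 0.3^2 = 0.58$$

A learning algorithm that always outputs the more common convention A matches the ground truth with probability

$$P(\text{algorithm agrees with a labeler}) = \max(p, 1 - p) = 0.7$$

Under these assumptions the algorithm appears to beat HLP by 12 percentage points, even though its outputs are no more useful than a human's: it has only learned which convention is more frequent.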

Raising HLP

When the ground truth is externally defined, there are fewer problems with HLP. For example, for X-ray detection problems, if the label is verified by a biopsy, then it’s very reliable. But if the label comes from another doctor, then HLP only measures how well one doctor can predict another doctor’s label, compared with how well a learning algorithm can predict another doctor’s label.

When the ground truth is externally defined (e.g. biopsy), HLP gives an estimate for Bayes error / irreducible error.
But often ground truth is just another human label.
Rather than only trying to beat HLP, you should also examine why HLP and the ground truth (which is created by another human) don’t agree. Resolving those disagreements can raise HLP, in the limit to 100%. But this creates a problem: how can your ML system beat HLP if it’s already 100%?
- This should be fine, as now you have much more consistent and cleaner labels which allow the learning algorithm to do better in practice $\rightarrow$ aspire to raise HLP instead of just focusing on beating HLP.
When the label $y$ comes from a human label, HLP << 100% may indicate ambiguous labeling instructions.
Improving label consistency will raise HLP.
This makes it harder for ML to beat HLP. But the more consistent labels will raise ML performance, which is ultimately likely to benefit the actual application performance.

HLP on structured data

Structured data problems are less likely to involve human labelers, thus HLP is less frequently used.
Some exceptions:
- User ID merging: same person? (entity resolution algorithm performance depends on the quality of labels)
- Based on the network traffic, is the computer hacked?
- Is the transaction fraudulent?
- Spam account? bot?
- From GPS, what is the mode of transportation - on foot, bike, car, bus?

HLP is important for problems where human-level performance can provide a useful reference. When you measure HLP and find that it is much less than 100%, also ask yourself whether some of the gap between HLP and complete consistency is due to inconsistent labeling instructions. If that’s the case, then improving labeling consistency will raise HLP and also give cleaner data for your learning algorithm, which ultimately results in better ML performance.