Why is data definition hard?

Iguana detection example

To illustrate why data definition is hard, consider an iguana detection example.

Any one of these three labeling conventions would be fine on its own. What is NOT fine is having 1/3 of your labelers use the first convention, 1/3 the second, and 1/3 the third, because then your labels are not consistent.
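To make the cost of mixed conventions concrete, here is a minimal sketch that computes how often two labelers would use the same convention. It assumes each labeler picks a convention independently from a large pool; the function name `agreement_probability` is made up for this illustration.

```python
from fractions import Fraction


def agreement_probability(convention_shares):
    """Probability that two independently chosen labelers use the same convention.

    convention_shares: the fraction of labelers using each convention (should sum to 1).
    """
    return sum(share * share for share in convention_shares)


# Three conventions, each used by one third of the labelers (the case described above).
mixed = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]

# Everyone follows one agreed convention.
standardized = [Fraction(1)]

print(agreement_probability(mixed))         # 1/3
print(agreement_probability(standardized))  # 1
```

Under this assumption, an even three-way split means two labelers follow the same convention only one time in three, while a single agreed convention removes that source of disagreement entirely.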

Phone defect detection example

Scratches and marks can be labeled in several different ways. Each way may be reasonable by itself, but using them all together creates inconsistent labels.

More label ambiguity examples

Speech recognition example

Sometimes sound bites are transcribed differently, especially when the audio quality is poor or the speech is ambiguous.

There are combinatorially many ways to transcribe such cases.

Being able to standardize on one convention will help the speech recognition algorithm.

Example of structured data

A common application in many large organizations is user ID merge: you have multiple data records that you think correspond to the same person, and you want to merge them into one record without leaving duplicates (the entity resolution problem).

Entity resolution algorithms need some labeled data (to determine if two IDs are the same person or not). If those labels are not consistent, your algorithm could perform poorly.
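One simple way to check such labels for consistency is to look for record pairs that received contradictory answers. The sketch below is illustrative only: the record IDs, the `labeled_pairs` data, and the helper `find_conflicting_pairs` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical labeled data: (record_id_a, record_id_b, is_same_person)
labeled_pairs = [
    ("u1", "u7", True),
    ("u7", "u1", False),  # same pair seen by another labeler, conflicting answer
    ("u2", "u9", False),
    ("u2", "u9", False),
]


def find_conflicting_pairs(pairs):
    """Return the record pairs that were labeled both True and False."""
    labels_by_pair = defaultdict(set)
    for id_a, id_b, is_same in pairs:
        key = tuple(sorted((id_a, id_b)))  # (a, b) and (b, a) are the same pair
        labels_by_pair[key].add(is_same)
    return sorted(key for key, labels in labels_by_pair.items() if len(labels) > 1)


print(find_conflicting_pairs(labeled_pairs))  # [('u1', 'u7')]
```

Pairs flagged this way are good candidates for review, since they point to places where the labeling instructions may be ambiguous.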

Some other examples …

In all of the cases above, the ground truth can be ambiguous. If you ask people to take their best guess at the ground-truth label for tasks like these, labeling instructions that produce more consistent, less noisy labels will improve the performance of your learning algorithm.

Data definition questions

Major types of data problems

ML projects fall into a few major types, and the data definition problems vary with the type.

|  | Unstructured | Structured |
|---|---|---|
| Small data (<=10K) | Manufacturing visual inspection from 100 training examples | Housing price prediction based on square footage, etc. from 50 training examples |
| Big data (>10K) | Speech recognition from 50 million training examples | Online shopping recommendations for 1 million users |

Unstructured vs structured data

Small vs big data

Note: If you’re working on a problem from one of these four quadrants, then on average, advice from someone who has worked in the same quadrant will probably be more useful than advice from someone who has worked in a different quadrant.

Small data and label consistency

A lot of AI has recently grown up in large consumer internet companies, which may have 100 million or a billion users and therefore very large datasets. As a result, practices for dealing with small datasets have not been emphasized as much.

With small data, it is still possible to fit a good model, provided the labels are clean and consistent.

Big data problems can have small data challenges too

Problems with a large dataset that has a long tail of rare events in the input will have small data challenges too.
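A quick way to see whether a large dataset hides small data problems is to count examples per category and list the sparse ones. This is a minimal sketch on made-up data; the category names, the threshold of 10, and the helper `rare_categories` are assumptions for illustration.

```python
from collections import Counter


def rare_categories(categories, min_examples):
    """Return {category: count} for categories with fewer than min_examples examples."""
    counts = Counter(categories)
    return {name: n for name, n in counts.items() if n < min_examples}


# Hypothetical input categories: two common ones and a long tail of rare ones.
queries = (
    ["weather"] * 5000
    + ["news"] * 3000
    + ["tide tables"] * 4
    + ["moon phase"] * 2
)

print(rare_categories(queries, min_examples=10))
# {'tide tables': 4, 'moon phase': 2}
```

For the categories this returns, the dataset is effectively small, so label consistency matters as much as it does in a small data project.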

Improving label consistency

Examples

Improving label consistency: small vs big data

Human Level Performance (HLP)

Some ML tasks are trying to predict an inherently ambiguous output, and HLP can establish a useful baseline of performance as a reference.

But HLP is also sometimes misused.

Why measure HLP?

Note: The ground truth itself is probably created by a human. So are we really measuring what is possible, or are we just measuring how well two different people happen to agree with each other?

When the ground truth label is itself determined by a person, there’s a very different approach to thinking about HLP.

Other uses of HLP

Raising HLP

When the ground truth is externally defined, there are fewer problems with HLP. For example, for X-ray detection problems, if the label is verified by a biopsy, then it’s very reliable. But if the label comes from another doctor, then HLP only measures how well one doctor can predict another doctor’s label, compared with how well a learning algorithm can predict that same doctor’s label.
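The doctor-versus-doctor case can be sketched as two agreement rates measured against the same human-made reference. The labels below are invented for illustration, and `agreement_rate` is a hypothetical helper, not a library function.

```python
def agreement_rate(predictions, reference):
    """Fraction of examples where predictions match the reference labels."""
    if len(predictions) != len(reference):
        raise ValueError("predictions and reference must have the same length")
    if not reference:
        raise ValueError("reference must not be empty")
    matches = sum(p == r for p, r in zip(predictions, reference))
    return matches / len(reference)


# Hypothetical binary labels for 10 X-ray images (1 = finding present).
reference_doctor = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # treated as "ground truth"
second_doctor = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
model_output = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]

hlp = agreement_rate(second_doctor, reference_doctor)
model_accuracy = agreement_rate(model_output, reference_doctor)

print(hlp)             # 0.8
print(model_accuracy)  # 0.9
```

In this made-up case the model scores higher than the second doctor, but both numbers only describe agreement with one doctor's labels, not accuracy against an externally verified truth such as a biopsy.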

HLP on structured data

HLP is important for problems where human-level performance can provide a useful reference. When you measure HLP and find that it is much less than 100%, also ask yourself whether some of the gap between HLP and complete consistency is due to inconsistent labeling instructions. If that’s the case, then improving labeling consistency will raise HLP and also give cleaner data for your learning algorithm, which ultimately results in better ML performance.