Obtaining data

How long should you spend obtaining data?

As mentioned before, the steps of the ML iterative process are:
- Model + hyperparameters + (collecting) data
- Training
- Error analysis
For example, if training and error analysis take around 2 days, you don’t want to spend 30 days collecting data, because that delays your first iteration by a whole month.
- You’d want to get into the iteration loop as quickly as possible.
- Instead of asking how long it would take to obtain $m$ examples, ask how much data you can obtain in $k$ days.
- Note: One exception to this rule is if you have already worked on the problem before and from experience you know you need $m$ examples upfront.

Data inventory

Brainstorm a list of data sources
It’s always a good idea to take an inventory of the different ways of getting data, compare them on cost and time, and make data collection decisions accordingly.
Example: Speech recognition

Source	Amount	Cost	Time
Owned	100 hrs	$0	0 days
Crowdsourced - Reading	1000 hrs	$10K	14 days
Pay for labels	100 hrs	$6K	7 days
Purchase data	1000 hrs	$10K	1 day

Note: Other factors:
- Data quality
- Privacy
- Regulatory constraints
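
The inventory above can be turned into a small comparison. This is a minimal sketch, assuming you only compare cost per hour and time to obtain. The `DataSource` class and `available_within` function are hypothetical names introduced for illustration; the figures come from the speech recognition table.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    hours: float     # amount of audio obtained
    cost_usd: float  # total cost of obtaining it
    days: float      # time needed to obtain it

    @property
    def cost_per_hour(self) -> float:
        return self.cost_usd / self.hours


# Figures copied from the speech recognition inventory table.
SOURCES = [
    DataSource("Owned", 100, 0, 0),
    DataSource("Crowdsourced - Reading", 1000, 10_000, 14),
    DataSource("Pay for labels", 100, 6_000, 7),
    DataSource("Purchase data", 1000, 10_000, 1),
]


def available_within(sources, max_days):
    """Return the sources obtainable within max_days, cheapest per hour first."""
    eligible = [s for s in sources if s.days <= max_days]
    return sorted(eligible, key=lambda s: (s.cost_per_hour, s.days))


for s in available_within(SOURCES, max_days=7):
    print(f"{s.name}: {s.hours:.0f} hrs, ${s.cost_per_hour:.0f}/hr, {s.days:.0f} days")
```

A comparison like this covers only cost and time. Data quality, privacy, and regulatory constraints still need a separate judgment call.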

Labeling data

Options:
- In-house
- Outsourced
- Crowdsourced
Having MLEs label data is expensive. But doing this for just a few days is usually fine.
Who is qualified to label?
- Speech recognition: any reasonably fluent speaker
- Factory inspection, medical image diagnosis: SME (Subject Matter Expert)
- Recommender systems: maybe impossible to label well
Note: Don’t increase data by more than 10x at a time.
- You want each round of data collection to finish quickly enough that you can train your model again soon. If you still need more data after that, collect more.
- Do not spend too much time collecting data all at once.
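
The 10x rule can be sketched as a schedule of collection rounds. This is an illustrative sketch only: `collection_targets` and the example sizes are hypothetical, and in practice you would retrain and redo error analysis after each round before deciding whether the next round is needed at all.

```python
def collection_targets(current_size, goal_size, max_factor=10):
    """Return the dataset size to aim for in each round, growing at most max_factor per round."""
    if current_size <= 0:
        raise ValueError("current_size must be positive")
    if max_factor <= 1:
        raise ValueError("max_factor must be greater than 1")
    targets = []
    size = current_size
    while size < goal_size:
        # Never grow by more than max_factor in a single round.
        size = min(size * max_factor, goal_size)
        targets.append(size)
    return targets


# Hypothetical numbers: 1,000 examples today, 250,000 believed to be needed.
print(collection_targets(1_000, 250_000))  # [10000, 100000, 250000]
```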

Data pipelines

Data pipelines (sometimes called data cascades) refer to data that goes through multiple processing steps before reaching the final output.

Data pipeline example:

Note: If you did the data preprocessing with ad hoc “scripts”, one issue you may face when taking the system to production is replicability.
- Data preprocessing scripts tend to be messy and hacky, so they are likely to break when new data comes in.
Note: The amount of effort you spend on writing replicable data preprocessing scripts also depends on the phase of the project.
- PoC phase:
  - The goal is to decide if the application is workable and worth deploying.
  - Focus on getting the prototype to work.
  - It’s OK if data preprocessing is manual. But take extensive notes/comments.
- Production phase:
  - After project utility is established, use more sophisticated tools to make sure the data pipeline is replicable.
  - E.g. TensorFlow Transform, Apache Beam, Airflow, …
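
One lightweight step towards replicability, before adopting the tools above, is to define the preprocessing once as an ordered list of pure functions and call that same definition during development and in production. The sketch below is a minimal illustration, not how any of the listed tools work. The step functions and the `name` and `email` fields are hypothetical.

```python
def strip_whitespace(record):
    """Trim surrounding whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def lowercase_email(record):
    """Normalize the (hypothetical) email field to lowercase."""
    return {**record, "email": record["email"].lower()}


# The ordered list of steps is the single definition of the pipeline.
PIPELINE = [strip_whitespace, lowercase_email]


def preprocess(record, steps=PIPELINE):
    """Apply every step in order, returning a new record and leaving the input untouched."""
    for step in steps:
        record = step(record)
    return record


raw = {"name": "  Ada ", "email": " Ada@Example.COM "}
print(preprocess(raw))  # {'name': 'Ada', 'email': 'ada@example.com'}
```

Because each step returns a new record instead of modifying its input, rerunning the pipeline on the same raw data gives the same result, which is the property that hand-edited scripts tend to lose.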

Meta-data, data provenance and lineage

For some applications, tracking meta-data, data provenance, and data lineage can be a big help.

Data pipeline example:

Task: Predict if someone is looking for a job. ( $x$ = user data, $y$ = looking for a job?)

This level of complexity of a data pipeline is not atypical in large commercial systems.
One of the challenges of working with data pipelines like this: what if, after running the system for months, you discover that the IP address blacklist (i.e. spam dataset) you’re using has some mistakes?
- In particular, what if you discover that some IP addresses were incorrectly blacklisted?
- The question is: having built up this big, complex system, if you update your spam dataset, won’t that change your spam model and therefore all the following steps? How do you go back and fix this problem?
- This problem is exacerbated if the system was developed by different engineers and the files are spread across the laptops of your MLE development team.
To make sure your system is maintainable, especially when a piece of data upstream ends up needing to be changed, it can be very helpful to keep track of data provenance as well as lineage.
Data provenance refers to where the data came from, e.g. who did you purchase the spam IP addresses from?
Data lineage refers to the sequence of steps to get to the end of the pipeline.
At the very least, extensive documentation can help you reconstruct data provenance and lineage, but to build robust, maintainable systems (beyond the PoC stage), there are more sophisticated tools to help you keep track of what happened.
- Note: Tools for data provenance and lineage (as of now, 2021) are still immature. Therefore, extensive documentation can help a lot.
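
Even without dedicated tooling, provenance and lineage can be written down in a form that answers the question "what has to be rebuilt if the spam dataset changes?". The sketch below is a minimal illustration under assumed names: the artifact names, the provenance notes, and the `downstream_of` function are all hypothetical and do not reproduce the actual pipeline.

```python
# Lineage: each derived artifact maps to the upstream artifacts it was built from.
LINEAGE = {
    "spam_model": ["spam_dataset"],
    "despammed_user_data": ["user_data", "spam_model"],
    "job_search_model": ["despammed_user_data"],
}

# Provenance: where each raw dataset came from (hypothetical notes).
PROVENANCE = {
    "spam_dataset": "purchased from a vendor",
    "user_data": "collected in-house",
}


def downstream_of(artifact, lineage=LINEAGE):
    """Return every artifact that depends, directly or indirectly, on `artifact`."""
    affected = []
    frontier = [artifact]
    while frontier:
        current = frontier.pop(0)
        for name, inputs in lineage.items():
            if current in inputs and name not in affected:
                affected.append(name)
                frontier.append(name)
    return affected


# Everything that is affected when the spam dataset is corrected.
print(downstream_of("spam_dataset"))
# ['spam_model', 'despammed_user_data', 'job_search_model']
```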
Note: Make extensive use of meta-data.
- Meta data is data about data.
- Example: In manufacturing visual inspection: time, factory, line #, camera settings, phone model, inspector ID, etc.
- If you don’t store your meta data in a timely way, it could be much harder to go back to recapture and organize that data.
- Meta data can be very useful for:
  - Error analysis: spotting unexpected effects.
  - Keeping track of data provenance.

Balanced train/dev/test splits

It turns out that when your dataset is small, having balanced train/dev/test splits can significantly improve your ML development process.
Visual inspection example: dataset $\rightarrow$ 100 examples, 30 positive (defective) examples.
- Train/dev/test $\rightarrow$ $60\%$ - $20\%$ - $20\%$ (i.e. $60$ - $20$ - $20$ examples)
- Random split $\rightarrow$ $21$ - $2$ - $7$ positive examples $\rightarrow$ $35\%$ - $10\%$ - $35\%$ positive
- In this case, the dev dataset is quite non-representative (only 10% positive examples).
- What you want $\rightarrow$ $18$ - $6$ - $6$ positive examples $\rightarrow$ $30\%$ - $30\%$ - $30\%$ positive $\rightarrow$ Balanced datasets.
- Note: In large datasets, you don’t need to worry about this; the random split will be representative enough.
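
The balanced split in the visual inspection example can be produced by splitting each class separately (often called a stratified split). The following is a minimal sketch using only the standard library. `stratified_split` is a hypothetical helper, and the label list simply recreates the 100-example, 30-positive dataset from the example.

```python
import random


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split example indices into train/dev/test so each split keeps the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    splits = [[], [], []]
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_dev = round(fractions[1] * len(indices))
        splits[0] += indices[:n_train]
        splits[1] += indices[n_train:n_train + n_dev]
        splits[2] += indices[n_train + n_dev:]  # the remainder goes to test
    return splits


# 100 examples, 30 positive (defective), as in the visual inspection example.
labels = [1] * 30 + [0] * 70
train, dev, test = stratified_split(labels)
for name, part in zip(("train", "dev", "test"), (train, dev, test)):
    positives = sum(labels[i] for i in part)
    print(name, len(part), positives)
# train 60 18
# dev 20 6
# test 20 6
```

If scikit-learn is available, its `train_test_split` function accepts a `stratify` argument that serves the same purpose; applying it twice yields three splits.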