Obtaining data
How long should you spend obtaining data?
- As mentioned before, the steps of the iterative ML process are:
- Model + hyperparameters + (collecting) data
- Training
- Error analysis
- If training and error analysis take around 2 days, you don’t want to spend 30 days collecting data, because that delays getting into the iteration loop by a whole month.
- You’d want to get into the iteration loop as quickly as possible.
- Instead of asking "How long would it take to obtain m examples?", ask "How much data can we obtain in k days?"
- Note: One exception to this rule is if you have already worked on the problem before and from experience you know you need m examples upfront.
Data inventory
- Brainstorm list of data sources
- It’s always a good idea to take an inventory of the different ways of getting data, compare them by cost and time, and make data collection decisions accordingly.
- Example: Speech recognition
| Source | Amount | Cost | Time |
|---|---|---|---|
| Owned | 100 hrs | $0 | 0 days |
| Crowdsourced - Reading | 1000 hrs | $10K | 14 days |
| Pay for labels | 100 hrs | $6K | 7 days |
| Purchase data | 1000 hrs | $10K | 1 day |
- Note: Other factors:
- Data quality
- Privacy
- Regulatory constraints
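The inventory above can be compared programmatically. The following is a minimal sketch, assuming the figures from the speech recognition table; `DataSource`, `cost_per_hour`, and `available_within` are hypothetical names introduced for illustration. It answers the "how much data can we obtain in k days" question by filtering on time and ranking by cost per hour. It does not model quality, privacy, or regulatory constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    hours: int      # amount of audio obtained
    cost_usd: int   # total cost in US dollars
    days: int       # time needed to obtain the data

    @property
    def cost_per_hour(self) -> float:
        return self.cost_usd / self.hours


# Figures copied from the speech recognition table above.
INVENTORY = [
    DataSource("Owned", 100, 0, 0),
    DataSource("Crowdsourced - Reading", 1000, 10_000, 14),
    DataSource("Pay for labels", 100, 6_000, 7),
    DataSource("Purchase data", 1000, 10_000, 1),
]


def available_within(sources, k_days):
    """Return the sources obtainable within k_days, cheapest per hour first."""
    feasible = [s for s in sources if s.days <= k_days]
    return sorted(feasible, key=lambda s: s.cost_per_hour)


if __name__ == "__main__":
    for s in available_within(INVENTORY, k_days=7):
        print(f"{s.name}: {s.hours} hrs, ${s.cost_per_hour:.2f}/hr, {s.days} days")
```

With a 7-day budget, the crowdsourced option drops out and the remaining sources are ordered by cost per hour.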
Labeling data
- Options:
- In-house
- Outsourced
- Crowdsourced
- Having MLEs label data is expensive. But doing this for just a few days is usually fine.
- Who is qualified to label?
- Speech recognition: any reasonably fluent speaker
- Factory inspection, medical image diagnosis: SME (Subject Matter Expert)
- Recommender systems: maybe impossible to label well
- Note: Don’t increase data by more than 10x at a time.
- Make sure each round of data collection is quick enough that you can train your model again soon after. If you still need more data, collect more.
- Do not spend too much time collecting data all at once.
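The 10x guideline can be expressed as a simple collection schedule. This is an illustrative sketch only: `next_dataset_size` and `collection_schedule` are hypothetical helpers, and in practice you would retrain and run error analysis after each round before deciding whether the next round is needed at all.

```python
def next_dataset_size(current_size, desired_size, max_growth=10):
    """Cap the next collection round at max_growth times the current size."""
    if current_size <= 0:
        raise ValueError("current_size must be positive")
    return min(desired_size, current_size * max_growth)


def collection_schedule(current_size, desired_size, max_growth=10):
    """List the dataset size after each round until desired_size is reached."""
    sizes = []
    while current_size < desired_size:
        current_size = next_dataset_size(current_size, desired_size, max_growth)
        sizes.append(current_size)
    return sizes


if __name__ == "__main__":
    # Growing from 1,000 to 250,000 examples takes three rounds, not one.
    print(collection_schedule(1_000, 250_000))
```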
Data pipelines
- Data pipelines (sometimes called data cascades) refer to data that goes through multiple processing steps before reaching the final output.
Data pipeline example:
- Note: If you did your data preprocessing with ad hoc “scripts”, one issue you may face when taking the system to production is replicability.
- Data preprocessing scripts tend to be messy and hacky. They are likely to run into issues when new data comes in.
- Note: The amount of effort you spend on writing replicable data preprocessing scripts also depends on the phase of the project.
- PoC phase:
- The goal is to decide if the application is workable and worth deploying.
- Focus on getting the prototype to work.
- It’s OK if data preprocessing is manual. But take extensive notes/comments.
- Production phase:
- After project utility is established, use more sophisticated tools to make sure the data pipeline is replicable.
- E.g. TensorFlow Transform, Apache Beam, Airflow, …
- For some applications, storing and tracking metadata, data provenance, and data lineage can be a big help.
Data pipeline example:
- Task: Predict if someone is looking for a job. (x = user data, y = looking for a job?)
- This level of complexity of a data pipeline is not atypical in large commercial systems.
- One challenge of working with data pipelines like this: what if, after running this system for months, you discover that the IP address blacklist (i.e. spam dataset) you’re using has some mistakes?
- In particular, what if you discover that some IP addresses were incorrectly blacklisted?
- The question is, having built up this big complex system, if you were to update your spam dataset, won’t that change your spam model and therefore all the following steps? How do you go back and fix this problem?
- This problem can be exacerbated if this system is developed by different engineers and you have files spread across the laptops of your MLE development team.
- To make sure your system is maintainable, especially when a piece of data upstream ends up needing to be changed, it can be very helpful to keep track of data provenance as well as lineage.
- Data provenance refers to where the data came from, e.g. who did you purchase the spam IP addresses from?
- Data lineage refers to the sequence of steps to get to the end of the pipeline.
- At the very least, extensive documentation can help you reconstruct data provenance and lineage, but to build robust, maintainable systems (beyond the PoC stage), there are more sophisticated tools to help you keep track of what happened.
- Note: Tools for data provenance and lineage (as of now, 2021) are still immature. Therefore, extensive documentation can help a lot.
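Even without dedicated tooling, provenance and lineage can be recorded in a small structure. The sketch below is an assumption-laden illustration: the `Artifact` class, the `downstream_of` helper, the stage names, and the provenance strings are all hypothetical, loosely modeled on the job-search example. It shows how recorded lineage answers the question "if the spam dataset changes, what has to be regenerated?"

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    name: str
    provenance: str = ""                        # where the data came from
    inputs: list = field(default_factory=list)  # upstream artifact names (lineage)


def downstream_of(artifacts, changed):
    """Return names of all artifacts that depend, directly or indirectly, on `changed`."""
    affected = []
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for art in artifacts:
            if current in art.inputs and art.name not in affected:
                affected.append(art.name)
                frontier.append(art.name)
    return sorted(affected)


# Hypothetical stages for the job-search pipeline; names are assumptions.
PIPELINE = [
    Artifact("spam_dataset", provenance="purchased IP blacklist (hypothetical vendor)"),
    Artifact("user_data", provenance="in-house logs (assumed)"),
    Artifact("spam_model", inputs=["spam_dataset"]),
    Artifact("despammed_user_data", inputs=["user_data", "spam_model"]),
    Artifact("job_search_model", inputs=["despammed_user_data"]),
]

if __name__ == "__main__":
    # Everything listed here must be rebuilt after fixing the blacklist.
    print(downstream_of(PIPELINE, "spam_dataset"))
```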
- Note: Make extensive use of metadata.
- Metadata is data about data.
- Example: In manufacturing visual inspection: time, factory, line #, camera settings, phone model, inspector ID, etc.
- If you don’t store your metadata in a timely way, it can be much harder to go back and recapture and organize it.
- Metadata can be very useful for:
- Error analysis: spotting unexpected effects.
- Keeping track of data provenance.
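As an illustration of metadata-driven error analysis, the sketch below groups evaluation results by a metadata field and reports the error rate per value. The record layout, the `error_rate_by` helper, and the sample records are hypothetical and exist only to show the idea.

```python
from collections import defaultdict


def error_rate_by(records, key):
    """Group evaluation records by a metadata field and return the error rate per value."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rec in records:
        value = rec["meta"][key]
        totals[value] += 1
        if rec["predicted"] != rec["label"]:
            errors[value] += 1
    return {value: errors[value] / totals[value] for value in totals}


# Hypothetical evaluation records for a visual inspection model.
RECORDS = [
    {"label": 1, "predicted": 1, "meta": {"factory": "A", "line": 1}},
    {"label": 0, "predicted": 0, "meta": {"factory": "A", "line": 2}},
    {"label": 1, "predicted": 0, "meta": {"factory": "B", "line": 1}},
    {"label": 0, "predicted": 1, "meta": {"factory": "B", "line": 1}},
    {"label": 0, "predicted": 0, "meta": {"factory": "B", "line": 2}},
    {"label": 1, "predicted": 1, "meta": {"factory": "A", "line": 1}},
]

if __name__ == "__main__":
    print(error_rate_by(RECORDS, "factory"))
    print(error_rate_by(RECORDS, "line"))
```

In this made-up data, all errors come from factory B on line 1, the kind of unexpected effect that is hard to spot without stored metadata.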
Balanced train/dev/test splits
- It turns out that when your dataset is small, having a balanced train/dev/test split can significantly improve your ML development process.
- Visual inspection example: dataset → 100 examples, 30 positive (defective) examples.
- Train/dev/test → 60%-20%-20%
- Random split → 21-2-7 positive examples → 35%-10%-35% positive in train/dev/test
- In this case, the dev dataset is quite non-representative (only 10% positive examples).
- What you want → 18-6-6 positive examples → 30%-30%-30% positive → Balanced datasets.
- Note: In large datasets, you don’t need to worry about this; a random split will be representative enough.
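A balanced split like the one above can be produced by splitting each class separately. This is a minimal standard-library sketch, not a definitive implementation; `stratified_split` is a hypothetical helper, and the labels reproduce the visual inspection example (30 positives out of 100).

```python
import random


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split example indices into train/dev/test, preserving the label ratio.

    Each class is shuffled and divided separately, so every split gets
    (up to rounding) the same share of each class.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, dev, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_dev = round(len(indices) * fractions[1])
        train += indices[:n_train]
        dev += indices[n_train:n_train + n_dev]
        test += indices[n_train + n_dev:]
    return train, dev, test


# 100 examples, 30 positive (defective), as in the example above.
labels = [1] * 30 + [0] * 70
train, dev, test = stratified_split(labels)
positives = [sum(labels[i] for i in split) for split in (train, dev, test)]
print(positives)  # [18, 6, 6]
```

In practice, scikit-learn's `train_test_split` accepts a `stratify` argument that serves the same purpose.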