13_productionization

1. Productionization
• Before going to production, we want to reduce human error as much as practical. For instance:
  – Say we have a team consisting of 3 SWE, 2 DE, 2 DS, 1 Manager, and 1 UX Designer (9 people).
  – Each person has a 1% chance of making a mistake in each experiment run.
  – Running an experiment every month means there is a 66% chance that something goes wrong in a given year: 1 − 0.99^(9 × 12) ≈ 0.66, assuming mistakes are independent.
  – Note: If we increase the team to 6 SWE, 4 DE, and 4 DS (16 people, keeping the Manager and UX Designer), the chance of a mistake rises to 85%. In a fairly large team, it approaches 100%.
  – Mistakes can range from the glaringly obvious to the very subtle.
• In deployment, we have to synchronize everything: data, model, and environment.
• What can we do to reduce human error?
  – Data Processing
    * Have automated tests that verify that performing some action produces the desired response.
      · For instance, we could have an automated test for our data ingestion engine, so that when someone performs an action on the website, we can verify that its data is captured accurately and completely.
      · Note: We can use headless browsers to perform such tests. Chrome and Firefox can both run in headless mode.
      · For data processing jobs, we could use unit tests, backed by a mutation testing tool such as pitest (PIT) to check that the tests actually catch faults, and run those tests against production data, so that we can validate (to some extent at least) that our data processing jobs are behaving correctly.
    * We can also use version control and code reviews.
  – Data Storage
    * We should validate all of our schemas.
    * We should make sure that we have consistency across all of the partitions that we arrange.
    * We should also make sure that the data is preserved exactly as it was entered (in terms of ingestion).
  – Data Orchestration
    * We should make sure that all the joins and aggregations are happening correctly.
    * We have to write unit tests and have version control for our Airflow DAGs.
  – Data/Model Exploration
    * We should make sure that the environment in our workspace is exactly the same environment that will be used to serve production traffic.
      · For instance, if we have used TensorFlow, we should use the same version in both dev and prod environments. For Python packages, we can pin versions in a requirements.txt file.
    * We also have to make sure that the data itself is versioned. This ensures reproducibility.
      · Note: S3 offers object versioning. You could also use DVC (Data Version Control). Delta tables also support data versioning by keeping historical versions of the data readily available.
    * As for the models, we have to know the model parameters used for training, the hyperparameters, and the overall performance of the model.
    * Tools for data and model management:
      · MLflow
      · MLMD (ML Metadata)
      · SageMaker Studio
    * When we want to productionize a model for an experiment:
      · We should at all times be tracking the experiment and always be able to answer: Who is this experiment affecting? What about those people is it affecting? Where is it affecting them? For how long will it be affecting them?
      · This is an effort to avoid experiment collisions.
      · It also provides a way to quickly look up past experiments and see what worked and what didn't (so that we don't repeat the same mistake).
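
The team-size figures at the start of this section (66% and 85%) can be reproduced with a short calculation. This is a minimal sketch under two assumptions that the notes do not state explicitly: mistakes are independent across people and runs, and the larger team keeps its Manager and UX Designer (16 people in total). The function name is illustrative.

```python
def yearly_error_probability(team_size: int,
                             per_run_error: float = 0.01,
                             runs_per_year: int = 12) -> float:
    """Probability that at least one person makes a mistake in at least one run.

    Assumes every person has the same independent chance of a mistake per run.
    """
    chances = team_size * runs_per_year          # person-runs in a year
    nobody_errs = (1.0 - per_run_error) ** chances
    return 1.0 - nobody_errs


small_team = 3 + 2 + 2 + 1 + 1   # SWE + DE + DS + Manager + UX Designer
large_team = 6 + 4 + 4 + 1 + 1   # assumed: Manager and UX Designer stay

print(f"{yearly_error_probability(small_team):.0%}")   # 66%
print(f"{yearly_error_probability(large_team):.0%}")   # 85%
```

Because the exponent grows with both team size and run frequency, the probability approaches 100% quickly, which is the argument for automating checks rather than relying on individual care.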
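
The four tracking questions above (who, what, where, for how long) can be recorded per experiment and used to flag collisions. The sketch below is one hypothetical way to do this, not a feature of any tool listed in these notes. The `Experiment` record, its field names, and the collision rule (shared audience segment, same surface, overlapping dates) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Experiment:
    name: str
    audience: frozenset   # who: user segments affected
    attribute: str        # what about those people is affected
    surface: str          # where they are affected
    start: date           # how long: first day (inclusive)
    end: date             # how long: last day (inclusive)


def collides(a: Experiment, b: Experiment) -> bool:
    """Assumed rule: same people, same place, overlapping time window."""
    same_people = bool(a.audience & b.audience)
    same_place = a.surface == b.surface
    overlap_in_time = a.start <= b.end and b.start <= a.end
    return same_people and same_place and overlap_in_time


def find_collisions(new: Experiment, registry: list) -> list:
    """Names of already-registered experiments that collide with `new`."""
    return [e.name for e in registry if collides(new, e)]


registry = [
    Experiment("ranker_v2", frozenset({"new_users"}), "recommendations",
               "homepage", date(2024, 1, 1), date(2024, 1, 31)),
]
candidate = Experiment("banner_test", frozenset({"new_users", "eu"}), "layout",
                       "homepage", date(2024, 1, 15), date(2024, 2, 15))
print(find_collisions(candidate, registry))
```

Keeping finished experiments in the same registry also gives the lookup of past experiments that the notes mention; the `attribute` field is recorded for that purpose and is not part of the collision rule here.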