MLOps/MLE in Databricks
Table of Contents
1. Intro
2. Why Should I Care About MLOps?
3. Guiding Principles of MLOps
4. Fundamentals of MLOps
  4.1. Semantics of dev, staging and prod
  4.2. Execution Environments
  4.3. Code
  4.4. Models
  4.5. Data
  4.6. ML Deployment Patterns
    4.6.1. Deploy models
    4.6.2. Deploy code
5. MLOps Architecture and Process
  5.1. Architecture Components
  5.2. Reference Architecture
    5.2.1. Dev
    5.2.2. Staging
    5.2.3. Prod
6. Why ML Engineering (MLE)?
7. The Core Tenets of MLE
  7.1. Planning
  7.2. Scoping and Research
    7.2.1. Experimentation
    7.2.2. Development
    7.2.3. Deployment
    7.2.4. Evaluation
  7.3. The Goals of MLE
8. Data Science + MLE
  8.1. A Foundation of Simplicity
  8.2. Principles of Agile
  8.3. The Foundation of MLE
9. How To Automate Your ML Pipeline
  9.1. The Data Lakehouse
  9.2. Why Automate ML? And When?
  9.3. MLflow
  9.4. Streamlining Model Validation
  9.5. Automating Your Entire ML Pipeline
  9.6. Incorporating CI/CD
10. Modern Analytics with Azure Databricks
11. Delta Live Tables
12. The Composable Customer Data Platform
13. Data Management 101 on Databricks
14. Data Engineer's Guide to Apache Spark and Delta Lake
15. Delta Lake Cheat Sheet
1. Intro
MLOps → a set of processes and automation for managing models, data and code to improve performance stability and long-term efficiency in ML systems.
The aim of MLOps → improve the long-term performance stability and success rate of ML systems while maximizing the efficiency of the teams who build them.
MLOps = ModelOps + DataOps + DevOps
MLflow → one of the popular open source tools for MLOps.
Figure 1: MLOps workflow and personas
1. Data Preparation → Prior to any data science or ML work lies the data engineering needed to prepare production data and make it available for consumption. This data may be referred to as “raw data,” and in later steps, data scientists will extract features and labels from the raw data.
2. Exploratory Data Analysis → Analysis is conducted by data scientists to assess statistical properties of the data available, and determine if they address the business question. This requires frequent communication and iteration with business stakeholders.
3. Feature Engineering → Data scientists clean data and apply business logic and specialized transformations to engineer features for model training. These data, or features, are split into training, testing and validation sets.
4. Model Training → Data scientists explore multiple algorithms and hyperparameter configurations using the prepared data, and a best-performing model is determined according to predefined evaluation metric(s).
5. Model Validation → Prior to deployment, a selected model is subjected to a validation step to ensure that it exceeds some baseline level of performance, in addition to meeting any other technical, business or regulatory requirements. This necessitates collaboration between data scientists, business stakeholders and ML engineers.
6. Deployment → ML engineers will deploy a validated model via batch, streaming or online serving, depending on the requirements of the use case.
7. Monitoring → ML engineers will monitor deployed models for signs of performance degradation or errors. Data scientists will often be involved in early monitoring phases to ensure that new models perform as expected after deployment. This will inform if and when the deployed model should be updated by returning to earlier stages in the workflow.
(a) The data governance officer is ultimately responsible for making sure this entire process is compliant with company and regulatory policies.
Figure 2: The different MLOps personas
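The split into training, validation and testing sets mentioned in step 3 can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `split_dataset`, the 70/15/15 proportions and the fixed seed are assumptions made for the example, not values taken from these notes.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows reproducibly and split them into train/validation/test lists."""
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")
    shuffled = list(rows)
    # A seeded generator keeps the split identical across reruns.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything that is left over
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed means a rerun of the pipeline evaluates candidate models on the same held-out rows, which keeps metric comparisons between runs meaningful.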
[Back to TOC]
2. Why Should I Care About MLOps?
Dependencies change over time → model, data, and code must change accordingly.
* Data drift
* Open source libraries become outdated.
* Regulatory environments evolve.
* Teams change.
ML systems should be resilient to these changes.
Two main risks with ML systems:
* Technical risk
* Non-compliance risk
[Back to TOC]
3. Guiding Principles of MLOps
Always keep your business goals in mind → The core purpose of MLOps is to ensure that data-driven applications remain stable, are kept up to date and continue to have a positive impact on the business.
When prioritizing technical work on MLOps, consider the business impact:
* Does it enable new business use cases?
* Does it improve data teams' productivity?
* Does it reduce operational costs or risks?
Take a data-centric approach to machine learning
* Data pipeline = feature engineering + training + inference + monitoring
* ML data pipelines should employ systematic approaches to monitoring and mitigating data quality issues.
* It's highly recommended to develop ML applications on the same platform used to manage production data.
  · For example, don't download data to your laptop and develop locally.
Implement MLOps in modular fashion → Modularized code enables testing of individual components and mitigates difficulties with future code refactoring.
* Define clear steps, e.g. → training, evaluation, deployment
* Supersteps → e.g. training-to-deployment pipeline
* Define modular responsibilities.
Process should guide automation → The goal of automation is to increase productivity and reduce the risk of human error.
* However, not every step of the process should be automated.
[Back to TOC]
4. Fundamentals of MLOps
4.1. Semantics of dev, staging and prod
ML workflows include the following key assets → code + model + data + (execution environment).
These assets need to be:
* developed → dev
* tested → staging
  · Staging → replicates the execution environment of production → code changes in dev are tested prior to being deployed to prod → acts as a gateway for code to reach production.
* deployed → prod
These divisions can be best understood in terms of quality guarantees and access control.
4.2. Execution Environments
Execution Environment → compute instance + runtime + libraries + automated jobs.
In Databricks → an “environment” can be defined via dev/staging/prod separation at a few levels. An organization could create distinct environments across multiple cloud accounts, across multiple Databricks workspaces in the same cloud account, or within a single Databricks workspace.
[Back to TOC]
4.3. Code
ML project code is often stored in a version control repository (such as Git), with most organizations using branches corresponding to the lifecycle phases of development, staging or production. There are a few common patterns:
* Some use only dev branches and one main branch for staging/prod.
* Some use main and dev branches (dev), branches cut for testing potential releases (staging) and branches cut for final releases (prod).
As a best practice, code should only be run in an execution environment that corresponds to it or in one that’s higher. For example, the dev environment can run any code, but the prod environment can only run prod code.
4.4. Models
It is important to note that model and code lifecycle phases often operate asynchronously.
→ i.e. you may want to push a new model version before you push a code change, and vice versa.
Consider the following scenarios:
* To detect fraudulent transactions, you develop an ML pipeline that retrains a model weekly. Deploying the code can be a relatively infrequent process, but each week a new model undergoes its own lifecycle of being generated, tested and marked as “production” to predict on the most recent transactions. In this case the code lifecycle is slower than the model lifecycle.
* To classify documents using large deep neural networks, training and deploying the model is often a one-time process due to cost. Updates to the serving and monitoring code in the project may be deployed more frequently than a new version of the model. In this case the model lifecycle is slower than the code lifecycle.
Since model lifecycles do not correspond one-to-one with code lifecycles, it makes sense for model management to have its own service.
* MLflow and its Model Registry support managing model artifacts directly via UI and APIs.
* The loose coupling of model artifacts and code provides flexibility to update production models without code changes, streamlining the deployment process in many cases.
4.5. Data
Label data as → dev, staging, or prod.
Access to data in each environment is controlled with table access controls and cloud storage permissions.
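The execution rule from section 4.3 can be written as a small check. The sketch below reads the rule through its own example (dev can run any code, prod can run only prod code); the ordering of phases, the function name `can_run` and the treatment of the staging environment are assumptions made for illustration.

```python
# Lifecycle phases ordered from least to most restricted (assumed ordering).
PHASES = ("dev", "staging", "prod")

def can_run(code_phase: str, environment: str) -> bool:
    """Return True if code at `code_phase` may run in `environment`.

    Code must be at least as mature as the environment it runs in, so the
    dev environment accepts any code and prod accepts only prod code.
    """
    return PHASES.index(code_phase) >= PHASES.index(environment)

for env in PHASES:
    allowed = [phase for phase in PHASES if can_run(phase, env)]
    print(env, "can run:", allowed)
```

A CI/CD system could apply a check of this kind before launching a job, for example by comparing the branch a job was built from against the workspace it targets.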
[Back to TOC]
4.6. ML Deployment Patterns
The fact that models and code can be managed separately results in multiple possible patterns for getting ML artifacts through staging and into production.
Figure 3: These two patterns differ in terms of whether the model artifact or the training code that produces the model artifact is promoted toward production.
4.6.1. Deploy models
The model artifact is generated by training code in the dev environment.
→ This artifact is then tested in staging for compliance and performance before finally being deployed into production.
This is a simpler handoff for data scientists, and in cases where model training is prohibitively expensive, training the model once and managing that artifact may be preferable.
NOTE: If production data is not accessible from the development environment (e.g., for security reasons), this architecture may not be viable.
This architecture does not naturally support automated model retraining.
→ While you could automate retraining in the dev environment, you would then be treating “dev” training code as production ready, which many deployment teams would not accept.
This option hides the fact that ancillary code for featurization, inference and monitoring needs to be deployed to production, requiring a separate code deployment path.
4.6.2. Deploy code
The code to train models is developed in the dev environment, and this code is moved to staging and then production.
Models will be trained in each environment: initially in the dev environment as part of model development, in staging (on a limited subset of data) as part of integration tests, and finally in the production environment (on the full production data) to produce the final model.
Since training code goes through code review and testing, it is safer to set up automated retraining.
Handing off might be more difficult as the learning curve is steep → opinionated project templates and workflows are helpful.
Data scientists need visibility into training results from the prod environment.
NOTE: In general, we recommend following the “deploy code” approach, and the reference architecture in this document is aligned to it. Nevertheless, there is no perfect process that covers every scenario, and the options outlined above are not mutually exclusive. Within a single organization, you may find some use cases deploying training code and others deploying model artifacts. Your choice of process will depend on the business use case, resources available and what is most likely to succeed.
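The core idea of the “deploy code” pattern, that the same training code runs in every environment and only its configuration changes, can be sketched as below. The `TrainingConfig` class, the data fractions and the toy mean-value “model” are hypothetical stand-ins chosen to keep the example self-contained; they are not part of any Databricks or MLflow API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    environment: str
    data_fraction: float  # share of the available rows used for training

# Hypothetical settings: staging trains on a small subset, prod on everything.
CONFIGS = {
    "dev": TrainingConfig("dev", data_fraction=0.5),
    "staging": TrainingConfig("staging", data_fraction=0.1),
    "prod": TrainingConfig("prod", data_fraction=1.0),
}

def train_mean_model(values, config):
    """Toy training step: the 'model' is the mean of the sampled values."""
    n_rows = max(1, round(len(values) * config.data_fraction))
    sample = values[:n_rows]
    return {
        "environment": config.environment,
        "n_rows": n_rows,
        "mean": sum(sample) / n_rows,
    }

values = list(range(1, 101))
for env in ("dev", "staging", "prod"):
    # The same function is promoted through every environment unchanged.
    print(train_mean_model(values, CONFIGS[env]))
```

Because `train_mean_model` is identical in all three environments, passing integration tests in staging gives some evidence that the production run will behave the same way on the full data.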
[Back to TOC]
5. MLOps Architecture and Process
This section covers the Databricks features used to facilitate MLOps in the prescribed workflow.
5.1. Architecture Components
Data Lakehouse → A Data Lakehouse architecture unifies the best elements of data lakes and data warehouses.
→ It delivers the data management and performance typically found in data warehouses with the low-cost, flexible object stores offered by data lakes.
Data in the lakehouse are typically organized using a “medallion” architecture of Bronze, Silver and Gold tables of increasing refinement and quality.
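The Bronze, Silver and Gold progression can be illustrated with plain Python data structures. In a real lakehouse these would be tables processed at scale; the record fields (`customer`, `amount`) and the helper names `to_silver` and `to_gold` are invented for this sketch.

```python
def to_silver(bronze_rows):
    """Clean raw records: drop rows with missing amounts and normalize types."""
    silver = []
    for row in bronze_rows:
        if row.get("amount") in (None, ""):
            continue  # a basic data quality rule: skip incomplete records
        silver.append({
            "customer": row["customer"].strip().lower(),
            "amount": float(row["amount"]),
        })
    return silver

def to_gold(silver_rows):
    """Aggregate cleaned records into a per-customer total."""
    totals = {}
    for row in silver_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals

# Bronze: raw data as ingested, including inconsistent and missing values.
bronze = [
    {"customer": " Alice ", "amount": "10.5"},
    {"customer": "bob", "amount": None},
    {"customer": "alice", "amount": "4.5"},
]
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each layer reads only from the one before it, so a quality problem found in Gold can be traced back through Silver to the raw Bronze records.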
MLflow → an open source project for managing the end-to-end machine learning lifecycle. It has the following primary components:
* Tracking → Allows you to track experiments to record and compare parameters, metrics and model artifacts.
* Models ("MLflow Flavors") → Allows you to store and deploy models from any ML library to a variety of model serving and inference platforms.
* Model Registry → Provides a centralized model store for managing models’ full lifecycle stage transitions: from staging to production, with capabilities for versioning and annotating.
  · The registry also provides webhooks for automation and continuous deployment.
NOTE: Databricks also provides a fully managed and hosted version of MLflow with enterprise security features, high availability, and other features.
Databricks and MLflow Autologging → A no-code solution that extends MLflow automatic logging to deliver automatic experiment tracking for machine learning training sessions on Databricks. Databricks Autologging automatically captures model parameters, metrics, files and lineage information when you train models, with training runs recorded as MLflow tracking runs.
Feature Store → A centralized repository of features. It enables feature sharing and discovery across an organization. It also ensures that the same feature computation code is used for model training and inference.
MLflow Model Serving → It allows you to host machine learning models from the Model Registry as REST endpoints that are updated automatically based on the availability of model versions and their stages.
Databricks SQL → It provides a simple experience for SQL users who want to run quick ad hoc queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards.
Databricks Workflows and Jobs → Databricks Workflows (Jobs and Delta Live Tables) can execute pipelines in automated, non-interactive ways. For ML, Jobs can be used to define pipelines for computing features, training models, or other ML steps or pipelines.
[Back to TOC]
5.2. Reference Architecture
A general reference architecture for implementing MLOps on the Databricks Lakehouse platform using the recommended “deploy code” pattern from earlier.
NOTE: It is by no means comprehensive, but it covers most ML use cases.
Figure 4: Overview of an ML system end-to-end
Code source control is the primary conduit for deploying ML pipelines from development to production.
Figure 5: MLOps Reference Architecture
[Back to TOC] 5.2.1. Dev
Figure 6: MLOps - dev environment
Data → Data scientists working in the dev environment have read-only access to production data. They also require read-write access to a separate dev storage environment to develop and experiment with new features and other data tables.
EDA → This process is used to assess whether the available data has the potential to address the business problem. It is also used for figuring out data preparation and feature engineering steps. This ad hoc process is generally not part of a pipeline.
Project Code → A code repository containing all of the pipelines or modules involved in the ML system. Dev branches are used to develop changes to existing pipelines or to create new ones.
Feature Table Refresh → This pipeline reads from raw data tables and feature tables and writes to tables in the Feature Store. The pipeline consists of two steps:
* Data preparation → Corrects any data quality issues prior to featurization.
* Featurization → In the dev environment, new features and updated featurization logic can be tested by writing to feature tables in dev storage, and these dev feature tables can be used for model prototyping.
Model Training
* Training & Tuning → The training process reads features from the feature store and/or Silver- or Gold-level Lakehouse tables, and it logs model parameters, metrics and artifacts to the MLflow tracking server.
  · After training and hyperparameter tuning, the final model artifact is logged to the tracking server to record a robust link between the model, its input data, and the code used to generate it.
* Evaluation → Model quality is evaluated by testing on held-out data.
  · The results of these tests are logged to the MLflow tracking server.
  · If governance requires additional metrics or supplemental documentation about the model, this is the time to add them using MLflow tracking.
  · Model interpretations (e.g., plots produced by SHAP or LIME) and plain text descriptions are common, but defining the specifics for such governance requires input from business stakeholders or a data governance officer.
* Model Output → The output of this pipeline is an ML model artifact stored in the MLflow tracking server.
  · When this training pipeline is run in staging or production, ML engineers (or their CI/CD code) can load the model via the model URI (or path) and then push the model to the Model Registry for management and testing.
Commit Code → Commit the dev branch changes to source control.
[Back to TOC]
5.2.2. Staging
The transition of code from development to production occurs in the staging environment.
This code includes:
* Model training code
* Ancillary code for featurization, inference, etc.
Both data scientists and ML engineers are responsible for writing tests for code and models, but ML engineers manage the continuous integration pipelines and orchestration.
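As an illustration of the kind of test mentioned above, the sketch below unit-tests a single featurization step with Python's built-in `unittest` module. The `add_amount_bucket` function and its threshold of 100 are hypothetical; a real project would test its own featurization logic in the same style.

```python
import unittest

def add_amount_bucket(row):
    """Hypothetical featurization step: bucket a transaction amount."""
    amount = row["amount"]
    if amount < 0:
        raise ValueError("amount must be non-negative")
    bucket = "small" if amount < 100 else "large"
    return {**row, "amount_bucket": bucket}  # return a copy, leave input intact

class AddAmountBucketTest(unittest.TestCase):
    def test_boundary(self):
        self.assertEqual(add_amount_bucket({"amount": 99.99})["amount_bucket"], "small")
        self.assertEqual(add_amount_bucket({"amount": 100})["amount_bucket"], "large")

    def test_negative_amount_rejected(self):
        with self.assertRaises(ValueError):
            add_amount_bucket({"amount": -1})

    def test_input_not_mutated(self):
        row = {"amount": 5}
        add_amount_bucket(row)
        self.assertEqual(row, {"amount": 5})

# Run the suite programmatically, as a CI job might, and keep the result.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddAmountBucketTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all tests passed:", result.wasSuccessful())
```

Small, fast tests like these suit the unit test stage of a merge request, while checks that need real tables or trained models belong in the integration tests.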
Figure 7: MLOps - staging environment
Data → The staging environment may have its own storage area for testing feature tables and ML pipelines. This data is generally temporary and only retained long enough to run tests and to investigate test failures. This data can be made readable from the development environment for debugging.
Merge Code
* Merge Request → The deployment process begins when a merge (or pull) request is submitted against the staging branch of the project in source control.
  · It is common to use the “main” branch as the staging branch.
* Unit Tests (CI) → This merge request automatically builds source code and triggers unit tests.
  · If tests fail, the merge request is rejected.
* Integration Tests (CI) → The merge request then goes through integration tests, which run all pipelines to confirm that they function correctly together.
  → The staging environment should mimic the production environment as much as is reasonable, running and testing pipelines for featurization, model training, inference and monitoring.
  · Integration tests can trade off fidelity of testing for speed and cost. For example, when models are expensive to train, it is common to test model training on small data sets or for fewer iterations to reduce cost. When models are deployed behind REST APIs, some high-SLA models may merit full-scale load testing within these integration tests, whereas other models may be tested with small batch jobs or a few queries to temporary REST endpoints.
  · Once integration tests pass on the staging branch, the code may be promoted toward production.
* Merge → If all tests pass, the new code is merged into the staging branch of the project. If tests fail, the CI/CD system should notify users and post results on the merge (pull) request.
  NOTE: It can be useful to schedule periodic integration tests on the staging branch, especially if the branch is updated frequently with concurrent merge requests.
* Cut Release Branch → Once CI tests have passed on a commit in the staging branch, ML engineers can cut a release branch from that commit.
[Back to TOC]
5.2.3. Prod
The production environment is typically managed by a select set of ML engineers and is where ML pipelines directly serve the business or application. These pipelines compute fresh feature values, train and test new model versions, publish predictions to downstream tables or applications, and monitor the entire process to avoid performance degradation and instability.
Though data scientists may not have write or compute access in the production environment, it is important to provide them with visibility into test results, logs, model artifacts and the status of ML pipelines in production. This visibility allows them to identify and diagnose problems in production.
Figure 8: MLOps - prod environment
Feature Table Refresh → This pipeline transforms the latest production Lakehouse data into production feature tables. It can use batch or streaming computation, depending on the freshness requirements for downstream training and inference. The pipeline can be defined as a Databricks Job which is scheduled, triggered or continuously running.
Model Training → The model training pipeline runs either when code changes affect upstream featurization or training logic, or when automated retraining is scheduled or triggered. This pipeline runs on the full production data.
* Training & Tuning → During the training process, logs are recorded to the MLflow tracking server. These include model metrics, parameters, tags and the model itself.
  · During development, data scientists may test many algorithms and hyperparameters, but it is common to restrict those choices to the top-performing options in the production training code.
  · Restricting tuning can reduce the variance from tuning in automated retraining, and it can make training and tuning faster.
* Evaluation → Model quality is evaluated by testing on held-out production data.
  · The results of these tests are logged to the MLflow tracking server.
  · During development, data scientists will have selected meaningful evaluation metrics for the use case, and those metrics or their custom logic will be used in this step.
* Register and Request Transition → Following model training, the model artifact is registered to the MLflow Model Registry of the production environment, set initially to ‘stage=None’.
  · The final step of this pipeline is to request a transition of the newly registered model to ‘stage=Staging’.
Continuous Deployment (CD) → The CD pipeline is executed when the training pipeline finishes and requests to transition the model to ‘stage=Staging’. There are three key tasks in this pipeline:
* Compliance Checks → These tests load the model from the Model Registry, perform compliance checks (for tags, documentation, etc.), and approve or reject the request based on test results.
  · If compliance checks require human expertise, this automated step can compute statistics or visualizations for people to review in a manual approval step at the end of the CD pipeline.
  · Regardless of the outcome, results for that model version are recorded to the Model Registry through metadata in tags and comments in descriptions.
  · The MLflow UI can be used to manage stage transition requests manually, but requests and transitions can be automated via MLflow APIs and webhooks.
  · If the model passes the compliance checks, then the transition request is approved and the model is promoted to ‘stage=Staging’. If the model fails, the transition request is rejected and the model is moved to ‘stage=Archived’ in the Model Registry.
* Compare Staging vs. Production → To prevent performance degradation, models promoted to ‘stage=Staging’ must be compared to the ‘stage=Production’ models they are meant to replace.
  · The metric(s) for comparison should be defined according to the use case, and the method for comparison can vary from canary deployments to A/B tests.
  · All comparison results are saved to metrics tables in the lakehouse.
  · If this is the first deployment and there is no ‘stage=Production’ model yet, the ‘stage=Staging’ model should be compared to a business heuristic or other threshold as a baseline.
  · For a new version of an existing ‘stage=Production’ model, the ‘stage=Staging’ model is compared with the current ‘stage=Production’ model.
* Request Model Transition to Production → If the candidate model passes the comparison tests, a request is made to transition it to ‘stage=Production’ in the Model Registry.
  · As with other stage transition requests, notifications, approvals and rejections can be managed manually via the MLflow UI or automatically through APIs and webhooks.
  · This is also a good time to consider human oversight, as it is the last step before a model is fully available to downstream applications. A person can manually review the compliance checks and performance comparisons to perform checks which are difficult to automate.
Online Serving (REST APIs) → For lower throughput and lower latency use cases, online serving is generally necessary. With MLflow, it is simple to deploy models to Databricks Model Serving, cloud provider serving endpoints, or on-prem or custom serving layers. In all cases, the serving system loads the production model from the Model Registry upon initialization. On each request, it fetches features from an online Feature Store, scores the data and returns predictions. The serving system, data transport layer or the model itself could log requests and predictions.
Inference: Batch or Streaming → This pipeline is responsible for reading the latest data from the Feature Store, loading the model from ‘stage=Production’ in the Model Registry, performing inference and publishing predictions. For higher throughput, higher latency use cases, batch or streaming inference is generally the most cost-effective option.
* A batch job would likely publish predictions to Lakehouse tables, over a JDBC connection, or to flat files.
* A streaming job would likely publish predictions either to Lakehouse tables or to message queues like Apache Kafka.
Monitoring → Input data and model predictions are monitored, both for:
* Statistical properties (data drift, model performance, etc.)
* Computational performance (errors, throughput, etc.)
These metrics are published for dashboards and alerts.
* Data Ingestion → This pipeline reads in logs from batch, streaming or online inference.
* Check Accuracy and Data Drift → The pipeline then computes metrics about the input data, the model’s predictions and the infrastructure performance.
  · Metrics that measure statistical properties are generally chosen by data scientists during development, whereas metrics for infrastructure are generally chosen by ML engineers.
* Publish Metrics → The pipeline writes to Lakehouse tables for analysis and reporting.
  · Tools such as Databricks SQL are used to produce monitoring dashboards, allowing for health checks and diagnostics.
  · The monitoring job or the dashboarding tool issues notifications when health metrics surpass defined thresholds.
* Trigger Model Training → When the model monitoring metrics indicate performance issues, or when a model inevitably becomes out of date, the data scientist may need to return to the development environment and develop a new model version.
Retraining → This architecture supports automatic retraining using the same model training pipeline above. While we recommend beginning with manually triggered retraining, organizations can add scheduled and/or triggered retraining when needed.
* Scheduled → If fresh data are regularly made available, rerunning model training on a defined schedule can help models to keep up with changing trends and behavior.
* Triggered → If the monitoring pipeline can identify model performance issues and send alerts, it can additionally trigger retraining. For example, if the distribution of incoming data changes significantly or if the model performance degrades, automatic retraining and redeployment can boost model performance with minimal human intervention.
NOTE: When the featurization or retraining pipelines themselves begin to exhibit performance issues, the data scientist may need to return to the dev environment and resume experimentation to address such issues.
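One possible way to turn monitoring metrics into a retraining signal is sketched below. It uses the Population Stability Index (PSI) as the drift measure, which is a technique chosen for this example rather than one named in these notes. The bin edges, the PSI threshold of 0.2 and the minimum accuracy of 0.8 are illustrative assumptions.

```python
import math

def psi(baseline, current, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bins.

    `edges` must be sorted ascending; values below the first edge fall in
    bin 0 and values at or above the last edge fall in the final bin.
    """
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1
        # Floor each share at eps so empty bins do not cause log(0).
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

def should_trigger_retraining(psi_value, accuracy, psi_threshold=0.2, min_accuracy=0.8):
    """Signal retraining on significant drift or degraded model performance."""
    return psi_value > psi_threshold or accuracy < min_accuracy

training_values = list(range(1, 11))   # feature values seen at training time
incoming_values = list(range(6, 16))   # more recent values, shifted upward
drift = psi(training_values, incoming_values, edges=[3, 6, 9])
print(round(drift, 3), should_trigger_retraining(drift, accuracy=0.9))
```

The function only raises a signal; whether that signal starts an automated retraining job or alerts a person for review is a separate decision.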
NOTE: While automated retraining is supported in this architecture, it isn’t required, and caution must be taken in cases where it is implemented. It is inherently difficult to automate selecting the correct action to take from model monitoring alerts. For example, if data drift is observed, does it indicate that we should automatically retrain, or does it indicate that we should engineer additional features to encode some new signal in the data?
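The continuous deployment steps described in section 5.2.3 (compliance checks, then comparison against the current production model or a baseline) can be sketched as plain decision logic. The example below uses dictionaries as stand-ins for registered models and does not call the MLflow API. The required tags, the single higher-is-better `metric` field and the function names are assumptions made for illustration.

```python
REQUIRED_TAGS = ("owner", "use_case")  # hypothetical compliance requirements

def compliance_check(model):
    """Return the list of missing items; an empty list means the check passed."""
    missing = [tag for tag in REQUIRED_TAGS if not model.get("tags", {}).get(tag)]
    if not model.get("description"):
        missing.append("description")
    return missing

def run_cd(candidate, production=None, baseline_metric=0.0):
    """Walk a candidate model through the CD steps and return the outcome."""
    missing = compliance_check(candidate)
    if missing:
        # Failed compliance: the transition request is rejected.
        return {"stage": "Archived", "request_production": False,
                "reason": "missing " + ", ".join(missing)}
    # Passed compliance: the candidate moves to Staging and is compared with
    # the Production model, or with a baseline if this is the first deployment.
    reference = production["metric"] if production else baseline_metric
    return {"stage": "Staging",
            "request_production": candidate["metric"] > reference,
            "reason": f"metric {candidate['metric']} vs reference {reference}"}

candidate = {"metric": 0.91, "description": "Weekly fraud model",
             "tags": {"owner": "ml-team", "use_case": "fraud"}}
production = {"metric": 0.88}
print(run_cd(candidate, production))
```

The function returns a request rather than performing the final transition, which leaves room for the manual approval step that the notes recommend before a model reaches production.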
[Back to TOC]
Topics to be added later...
6. Why ML Engineering (MLE)?
7. The Core Tenets of MLE
7.1. Planning
7.2. Scoping and Research
7.2.1. Experimentation
7.2.2. Development
7.2.3. Deployment
7.2.4. Evaluation
7.3. The Goals of MLE
8. Data Science + MLE
8.1. A Foundation of Simplicity
8.2. Principles of Agile
8.3. The Foundation of MLE
9. How To Automate Your ML Pipeline
9.1. The Data Lakehouse
9.2. Why Automate ML? And When?
9.3. MLflow
9.4. Streamlining Model Validation
9.5. Automating Your Entire ML Pipeline
9.6. Incorporating CI/CD
10. Modern Analytics with Azure Databricks
11. Delta Live Tables
12. The Composable Customer Data Platform
13. Data Management 101 on Databricks
14. Data Engineer's Guide to Apache Spark and Delta Lake
15. Delta Lake Cheat Sheet