Productionalising Machine Learning Models

Kevin Siswandi
7 min read · Apr 22, 2024


Productionalizing a machine learning model demands a considerable engineering effort. While model prototyping typically focuses on accuracy metrics, deploying a model to production also involves optimizing inference against additional operational metrics, such as:

  1. Latency, which is the time taken for the model to process a single input and produce a prediction, i.e. the delay between sending a request to the model and receiving the response. Latency matters for real-time or interactive applications, such as chatbots or live speech-to-text (STT) transcription.
  2. Throughput, which is the number of inference requests the model can handle in a given time period. Throughput matters for applications with a high volume of requests, e.g. social media or e-commerce apps (a rough way to measure both metrics is sketched after this list).
  3. Runtime, the total time needed to score an entire dataset (relevant for batch inference).
  4. Resource utilization, such as the amount of memory the loaded model takes up. This is important, e.g., for deployment on edge devices where the hardware specs are limited.
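These metrics can already be estimated during prototyping by timing the inference call directly. Below is a minimal sketch, assuming a scikit-learn-style model with a predict method and a feature matrix X (both placeholders here):

```python
import time

import numpy as np

def measure_latency_and_throughput(model, X, n_requests=100):
    """Rough latency/throughput estimate for a scikit-learn-style model."""
    latencies = []
    start = time.perf_counter()
    for i in range(n_requests):
        t0 = time.perf_counter()
        model.predict(X[i % len(X):i % len(X) + 1])   # one record per request
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    return {
        "p50_latency_ms": 1000 * float(np.percentile(latencies, 50)),
        "p95_latency_ms": 1000 * float(np.percentile(latencies, 95)),
        "throughput_rps": n_requests / total,
    }
```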

Hence, the productionalization effort must strike a balance between model complexity (which correlates with accuracy) and practical challenges (such as scalability and ease of maintenance), following MLOps best practices.

The Machine Learning Lifecycle. Source: https://www.datascience-pm.com/the-genai-life-cycle/

Deployment Strategies

Typically, the output of a model development phase is twofold:

  1. a model artifact, and
  2. an inference script (perhaps a .py file) that loads the model artifact, performs the necessary data preprocessing/feature engineering, and uses the resulting features to generate predictions.

Now, the main objective of model deployment is to serve the model so that it is available for inference (i.e. it returns predictions upon an inference call). This is usually done with a REST API (e.g. a FastAPI or Flask wrapper) that is packaged into a Docker container, ready to be shipped to production (such as a compute engine).
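A minimal sketch of such a wrapper with FastAPI, assuming a hypothetical pickled scikit-learn model saved as model.pkl and a flat numeric feature payload:

```python
# serve.py -- illustrative FastAPI wrapper around a pickled model
import pickle

import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:      # model artifact produced during development
    model = pickle.load(f)

class Features(BaseModel):
    values: list[float]                 # assumes features arrive as a flat numeric vector

@app.post("/predict")
def predict(payload: Features):
    X = np.asarray(payload.values).reshape(1, -1)
    return {"prediction": model.predict(X).tolist()}
```

Running `uvicorn serve:app` exposes the endpoint locally; the same script is what gets packaged into the Docker image.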

Some popular tools to help with this are the following.

🐦Hummingbird: converts traditional scikit-learn models, which are bound to the CPU, into tensor computations so as to allow for hardware acceleration (such as inference on a GPU).
🌍Modelbit: returns an API endpoint that is ready to serve end users upon a deployment request issued from a Jupyter notebook.
🔦TorchScript: runs PyTorch models in script mode (as opposed to the Python-centric eager mode used for prototyping), making them language-agnostic and more suitable for deployment; see the sketch below.
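For example, a PyTorch model can be traced into TorchScript and saved as a standalone artifact that no longer needs the original Python class definitions (a sketch with a toy model; a real model needs a representative example input):

```python
import torch
import torch.nn as nn

# Toy model standing in for whatever was trained during development
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
model.eval()

example_input = torch.randn(1, 10)                  # representative input shape
scripted = torch.jit.trace(model, example_input)    # compile to TorchScript
scripted.save("model_scripted.pt")                  # artifact loadable without the Python class

# At serving time (possibly even from C++ via libtorch):
loaded = torch.jit.load("model_scripted.pt")
print(loaded(example_input))
```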

Model Compression

In many cases, deploying the exact same model as it was built in development is not practical. Instead, model compression techniques are used to achieve lower inference latency (hence faster predictions), reduced computational cost (better scalability), and a smaller memory footprint in production settings.

The most common ones are quantization (storing weights, and sometimes activations, at lower numerical precision), pruning (removing weights or neurons that contribute little to the output), and knowledge distillation (training a smaller student model to mimic a larger teacher model).
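As a hedged illustration of the first technique, PyTorch's dynamic quantization converts the weights of selected layer types to int8 with a single call (shown here on a toy model; which layers are safe to quantize is model-specific):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
model.eval()

# Store Linear weights as int8; activations are quantized dynamically at runtime
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```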

MLOps

MLOps is a set of best practices and workflows that helps machine learning practitioners move prototypes to production environments effectively. A typical workflow looks like this:

  1. Use version control such as Git to track code changes. The new model code is pushed to a new branch, and a pull request is then opened to merge it into the master branch. This is important because in a typical production environment, the master branch contains the code version that is deployed behind the REST API.
  2. Trigger a Continuous Integration (CI) pipeline (e.g. via GitHub Actions) when the pull request is opened. The CI pipeline runs unit tests, executes the training script to produce model artifacts, and validates the model (using, e.g., Giskard); see the test sketch after this list.
  3. If the CI tests pass, the model is automatically pushed to the model registry and the branch gets merged into master.
  4. Trigger a Continuous Deployment (CD) pipeline (e.g. via GitHub Actions) that builds a Docker image from the Dockerfile in the repo and pushes it to a container registry. Deployment to the desired compute engine can then be done via Kubernetes by running kubectl apply -f on a manifest file.
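The unit tests and model validation in step 2 can include simple behavioural checks on the freshly trained artifact, for example (a pytest-style sketch; the artifact paths, feature count, and threshold are hypothetical):

```python
# tests/test_model.py -- illustrative checks run inside the CI pipeline
import pickle

import numpy as np

def load_model(path="artifacts/model.pkl"):        # hypothetical artifact location
    with open(path, "rb") as f:
        return pickle.load(f)

def test_prediction_shape_and_range():
    model = load_model()
    X = np.random.rand(5, 10)                      # assumes 10 input features
    proba = model.predict_proba(X)
    assert proba.shape == (5, 2)
    assert np.all((proba >= 0.0) & (proba <= 1.0))

def test_accuracy_above_threshold():
    model = load_model()
    X_val = np.load("artifacts/X_val.npy")
    y_val = np.load("artifacts/y_val.npy")
    assert model.score(X_val, y_val) > 0.8         # hypothetical acceptance threshold
```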

Integration Testing: end-to-end testing of an ML system is usually performed by deploying it to a so-called staging environment, which exposes a staging REST API to which requests are sent and where metrics are logged. Here, the accuracy and latency of the responses are evaluated.
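In its simplest form, this can be a script that replays a labelled sample against the staging endpoint and records latency and accuracy (a sketch; the URL and payload format are hypothetical and should match the deployed API):

```python
import time

import requests

STAGING_URL = "https://staging.example.com/predict"   # hypothetical staging endpoint

def run_staging_check(samples):
    """samples: list of (feature_vector, expected_label) pairs."""
    correct, latencies = 0, []
    for features, label in samples:
        t0 = time.perf_counter()
        resp = requests.post(STAGING_URL, json={"values": features}, timeout=5)
        latencies.append(time.perf_counter() - t0)
        resp.raise_for_status()
        correct += int(resp.json()["prediction"][0] == label)
    p95 = sorted(latencies)[int(0.95 * (len(latencies) - 1))]
    print(f"accuracy={correct / len(samples):.3f}  p95_latency={p95:.3f}s")
```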

Production Testing: if the ML system passes the integration tests, it will be further tested in production. There are several variants of this:

  1. A/B Test
  2. Shadow Test
  3. Canary Deployment

In an A/B test and in a canary deployment, there is often an API gateway that routes incoming requests either to the existing model in production (the base model) or to the model being evaluated (the test model). The performance of each model is then monitored and compared. In a canary deployment, only a small fraction of requests (e.g. 5%) is routed to the canary model to mitigate risk.
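The gateway's routing rule can be as simple as hashing a request identifier and sending a fixed fraction of traffic to the test model (a sketch; the 5% fraction and the model handles are illustrative):

```python
import hashlib

CANARY_FRACTION = 0.05   # 5% of traffic goes to the canary model

def route(request_id: str) -> str:
    """Deterministically route a request to 'base' or 'canary'."""
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < CANARY_FRACTION * 100 else "base"

# The same request_id always hits the same model, which keeps per-user
# behaviour consistent while the canary is being evaluated.
print(route("user-42"))
```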

Now, productionalization doesn’t stop at hosting the model somewhere and obtaining an API endpoint…

Post-deployment Considerations

Post-deployment tasks that the team might take on include:

  1. Identifying better features for the model
  2. Fine-tuning the hyperparameters
  3. Optimizing the deployment infrastructure
  4. Collecting and labeling new data
  5. Retraining the model to adapt to changing conditions (e.g. via adaptive learning).

To do those tasks effectively, we will need the tools below.

Version Control

To collaborate effectively in a team and ensure reproducibility of results, we need version control.

In ML projects, we need Git-like functionality not just for the codebase but also for the data, models, and other artifacts. This includes being able to roll back to a previous version of the model when it starts underperforming, and to track the dependencies across multiple versions of code, configurations, data, models, etc.
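Tools such as DVC and MLflow are commonly used here; as one possible sketch (the experiment setup and registry name are hypothetical), a trained model can be logged and registered with MLflow so that a specific version can be promoted or rolled back later:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in training run on synthetic data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=500).fit(X, y)

with mlflow.start_run():
    mlflow.log_param("max_iter", 500)
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="demo-classifier",  # hypothetical registry entry; each run adds a new version
    )
```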

CI/CD

In software engineering, CI/CD involves building, testing, and deploying code efficiently. In a machine learning system, the CI pipeline may include:

  1. Automatic code checking
  2. Automatic builds
  3. Automatic testing

These automated verification processes kick in once changes are committed to the repository. The CD part of an ML system, on the other hand, automates the deployment process (e.g. to the API endpoint) so that the updated model is available to end users.

Model Monitoring

Some risks that machine learning models face in production are:

  1. Training-serving skew due to a mismatch between the training data and the input data seen in production. This manifests itself immediately after moving into production.
  2. Model decay over time. Identifying the model decay rate through analytics informs the optimal retraining strategy.
  3. Excessive latency due to the volume of input data, the data pipeline, or the choice of model.
  4. Concept drift resulting from a shift in the relationship between inputs and outputs, such as changes in preferences or consumer behaviour.
  5. Data drift due to shifting feature distributions, which can occur slowly or quickly over time, e.g. population shifts or adversarial reactions (as in spam detection). Mitigation strategies include reweighting the training data or using domain adaptation techniques.
  6. Nonstationarity. Mitigation strategies include adaptive learning.

To address the issues above, the deployed model needs to be continuously monitored (e.g. using proper logging) for:

  1. Model outputs and predictions
    🟢 Monitoring of model performance metrics over time to identify the optimal retraining frequency.
    🟢 Monitoring of the distribution of predicted values vs. the actual values to identify potential concept drift (or simply bias).
    🟢 Monitoring of model performance across different subgroups to ensure consistent performance in all subgroups.
    🟢 Inspecting the impact of different features on the generated predictions, e.g. with feature importance, SHAP, or LIME.
  2. Input data
    🟢 Basic quality checks such as schema/encoding, expected volume of data, numeric values within range, missing values, and so on.
    🟢 Data distribution monitoring (via visualization or statistical tests) to flag potential data drift; see the sketch after this list.
    🟢 Correlation of features to target to identify potential concept drift.
  3. Data pipeline
    🟢 Monitoring of disparities between training data and data used for inference in production.
    🟢 Monitoring of data distribution pre- and post- processing.
    🟢 Sanity check of feature values prior to model training.
  4. Resource utilization
  5. Latency
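As an example of the statistical tests mentioned for input-data monitoring, a per-feature two-sample Kolmogorov-Smirnov test comparing a training column against recent production values can flag numeric drift (a minimal sketch; the significance threshold is a judgment call):

```python
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(train_values, live_values, alpha=0.01):
    """Two-sample Kolmogorov-Smirnov test on a single numeric feature."""
    stat, p_value = ks_2samp(train_values, live_values)
    return {"ks_statistic": stat, "p_value": p_value, "drift_flag": p_value < alpha}

# Synthetic example: the live distribution has shifted upwards
rng = np.random.default_rng(0)
print(check_feature_drift(rng.normal(0, 1, 1000), rng.normal(0.5, 1, 1000)))
```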

Containerization

Dockerization is often used to encapsulate the entire environment of the ML system, including dependencies, libraries, and configurations. This ensures that the same environment can be replicated across dev, staging, and production by providing a way to document and track the versions of all software components, libraries, and dependencies.

Notebook Prototype vs Live System

Typically, machine learning scientists are very good at building algorithms and running experiments but may not have deep skills in software engineering, MLOps, or infrastructure management. On the other hand, engineers who specialize in deployment may not have in-depth knowledge of the ML algorithms needed to make code changes and optimizations.

This gap makes it challenging to scale ML models in production, which has an impact on performance and reliability.

Example of an ML system on Google Cloud.

In a nutshell, the two primary objectives of productionalization are to:

  1. optimize the model for deployment, and
  2. ensure the robustness and reliability of the system.

This involves testing the system against various edge cases and scenarios to ensure that it can gracefully handle unexpected inputs and situations. Once the model is in production, it needs ongoing support such as:

  • Monitoring systems and processes (e.g. Prometheus + Grafana; see the sketch after this list)
  • Model maintenance (e.g. retraining and updating)
  • Model versioning
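For the monitoring point, a Python service can expose inference metrics for Prometheus to scrape using the prometheus_client library (a sketch; the metric names, port, and the stand-in predict function are hypothetical):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("predictions_total", "Number of predictions served")
LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds")

@LATENCY.time()                       # records the duration of every call
def predict(features):
    PREDICTIONS.inc()
    time.sleep(0.01)                  # stand-in for a real model call
    return random.random()

if __name__ == "__main__":
    start_http_server(8000)           # metrics served at http://localhost:8000/metrics
    while True:
        predict([1.0, 2.0, 3.0])
```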

Such ops support would require commitment from the data science team members. Moreover, the team must also support users of the ML systems by providing documentation, conducting user training, explaining model predictions, and guiding users on recourse for model issues.

Feel free to hit me up if your organization needs help with end-to-end data science and machine learning:

👆 https://www.linkedin.com/in/kevinsiswandi/
👆 https://github.com/physicist91
