Productionalising Machine Learning Models

Kevin Siswandi
5 min read · Apr 22, 2024

Productionalizing a machine learning model demands considerable engineering effort. While model prototyping typically focuses on accuracy metrics, deploying a model to production also involves additional operational metrics that must be optimized for inference, such as:

  1. Latency, which is the time taken for the model to process a single input and produce a prediction, or the delay between sending a request to the model and receiving the response. Latency is important for real-time or interactive applications, such as chatbots or live speech-to-text transcription.
  2. Throughput, which is the number of inference requests the model can handle in a given time period. Throughput is important for applications with a high volume of requests, e.g. social media or e-commerce apps.
  3. Runtime (for batch inference), i.e. the total time needed to score an entire batch of records.
  4. Resource utilization, such as the amount of memory the loaded model takes up. This is important, e.g., for deployment on edge devices where the hardware specs are limited.

Hence, productionalizing the developed model must strike a balance between model complexity (which tends to correlate with accuracy) and practical challenges such as scalability and ease of maintenance.
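As a quick illustration, latency and throughput can be measured directly around a model's predict call. The sketch below is a minimal example; the model, data, and batch sizes are placeholders, not from the article.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder model and data
X = np.random.rand(10_000, 20)
y = np.random.randint(0, 2, size=10_000)
model = LogisticRegression().fit(X, y)

# Latency: time to serve a single prediction request
start = time.perf_counter()
model.predict(X[:1])
latency_ms = (time.perf_counter() - start) * 1_000

# Throughput: predictions served per second on a large batch
start = time.perf_counter()
model.predict(X)
throughput = len(X) / (time.perf_counter() - start)

print(f"latency: {latency_ms:.2f} ms, throughput: {throughput:,.0f} predictions/s")
```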

Deployment Strategies

The main objective of model deployment is to serve the model so that it is available for inference (i.e. returns predictions upon an inference call). This is usually done via a REST API.
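For example, a scikit-learn model can be wrapped in a small REST service. This is a minimal sketch assuming FastAPI/uvicorn as the serving stack and a model saved to "model.joblib"; none of these choices are prescribed by the article.

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical pre-trained scikit-learn model

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    # Return the model's prediction for a single feature vector
    prediction = model.predict([req.features])
    return {"prediction": prediction.tolist()}

# Start locally with, e.g.: uvicorn serve:app --host 0.0.0.0 --port 8000
# (assumes this file is named serve.py)
```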

Some popular tools to help with this are the following.

  • Hummingbird: converts scikit-learn models (which run only on a single CPU core) into tensor computations, which allows for parallelization (e.g. inference on a GPU).
  • Modelbit: returns an API endpoint that is ready to serve end users upon a deployment request from a Jupyter notebook.
  • TorchScript: runs PyTorch models in script mode (cf. the Python-centric eager mode used for prototyping), which is language-agnostic and more suitable for deployment (see the sketch after this list).
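For instance, a PyTorch model can be traced into a TorchScript artifact roughly as follows (a minimal sketch; the model architecture and file name are placeholders):

```python
import torch
import torch.nn as nn

# Placeholder model; any nn.Module can be exported the same way
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
model.eval()

# Trace the model with an example input to produce a TorchScript module
example_input = torch.randn(1, 20)
scripted = torch.jit.trace(model, example_input)

# The saved artifact can be loaded for inference without the original
# Python model definition (e.g. from a C++ runtime)
scripted.save("model_scripted.pt")
loaded = torch.jit.load("model_scripted.pt")
print(loaded(example_input))
```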

Often, deploying the model exactly as it was developed is impractical. Instead, model compression techniques are used to achieve lower inference latency (faster predictions), reduced computational cost (better scalability), and a smaller memory footprint in production.

The most common ones are quantization (storing weights and activations at lower numerical precision), pruning (removing redundant weights or neurons), and knowledge distillation (training a smaller student model to mimic a larger teacher).
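As an illustration of one of these, post-training dynamic quantization in PyTorch stores the weights of linear layers as 8-bit integers, shrinking the memory footprint and often speeding up CPU inference. This is a minimal sketch with a placeholder model:

```python
import torch
import torch.nn as nn

# Placeholder float32 model
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Post-training dynamic quantization: weights of nn.Linear layers are
# converted to int8, activations are quantized dynamically at runtime
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

print(quantized)  # the Linear layers are now dynamically quantized
```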

Now, productionalization doesn’t stop at hosting the model somewhere and obtaining an API endpoint…

Post-deployment Considerations

Post-deployment tasks that the team might undertake include:

  1. Identifying better features for the model
  2. Fine-tuning the hyperparameters
  3. Optimizing the deployment infrastructure
  4. Collecting and labeling new data
  5. Retraining the model to adapt to changing conditions (e.g. via adaptive learning)

To do those tasks effectively, we will need the tools below.

Version Control

To effectively collaborate in a team and ensure reproducibility of results, we need version control.

In ML projects, we need git-style versioning not just for the codebase but also for the data, models, and other artifacts. This includes rolling back to a previous version of the model when it starts underperforming, and tracking dependencies across multiple versions of code, configurations, data, models, etc.
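One way to get this kind of versioning for models is a model registry. The sketch below uses MLflow's registry purely as an example (MLflow is not mentioned in the article, the model name "churn-model" is hypothetical, and a tracking backend that supports the registry is assumed to be configured):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder training run
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
model = LogisticRegression().fit(X, y)

# Log the model and register it under a versioned name; each call
# creates a new model version that can later be rolled back to
with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="churn-model",  # hypothetical registry name
    )

# Load a specific version for inference, e.g. to roll back to version 1
rolled_back = mlflow.sklearn.load_model("models:/churn-model/1")
```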

CI/CD

In software engineering, CI/CD involves building, testing, and deploying code efficiently. In a machine learning system, the CI pipeline may include:

  1. Automatic code checking
  2. Automatic builds
  3. Automatic testing

The automated verification processes kick in once changes are committed to the repository. The CD side of an ML system, on the other hand, automates the deployment process (e.g. to the API endpoint) so that the updated model is available to end users.
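For instance, the automated testing step could include a pytest check that a freshly trained model clears a minimum accuracy bar before the pipeline is allowed to deploy it (a minimal sketch; the data, model, and 0.8 threshold are placeholders):

```python
# test_model.py -- picked up automatically by the CI pipeline (e.g. `pytest`)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def test_model_meets_accuracy_threshold():
    # Placeholder data and model; in a real pipeline these would come from
    # the feature store and the training step under test
    X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)

    accuracy = accuracy_score(y_test, model.predict(X_test))
    assert accuracy >= 0.8, f"accuracy {accuracy:.3f} is below the deployment threshold"
```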

Model Monitoring

Some risks that machine learning models face in production are:

  1. Training-serving skew due to a mismatch between training data and the input data seen in production. This would manifest itself immediately after moving into production.
  2. Model decay over time. Analyzing the decay rate helps determine the optimal retraining strategy.
  3. Excessive latency due to the volume of input data, the data pipeline, or the choice of model.
  4. Concept drift resulting from shift in the relationship between input and output, such as changes in preference or consumer behaviour.
  5. Data drift due to shifting distribution of features that can occur slowly or quickly over time, such as population shifts or adversarial reactions (e.g. spam identification). Mitigation strategies include reweighting the training data, or using domain adaptation techniques.
  6. Nonstationarity. Mitigation strategies include adaptive learning.

To address the issues above, the deployed model needs to be continuously monitored (e.g. using proper logging) for:

  1. Model outputs and predictions
    🟢 Monitoring of model performance metrics over time to identify the optimal retraining frequency.
    🟢 Monitoring of the distribution of predicted values vs. the actual values to identify potential concept drift (or simply bias).
    🟢 Monitoring of model performance across different subgroups to ensure consistent performance in all subgroups.
    🟢 Inspection of the impact of different features on predictions, e.g. with feature importance, SHAP, or LIME.
  2. Input data
    🟢 Basic quality checks such as schema/encoding, expected volume of data, numeric values within range, missing values, and so on.
    🟢 Data distribution monitoring (via visualization or statistical tests) to flag potential data drift (see the drift-detection sketch after this list).
    🟢 Correlation of features to the target to identify potential concept drift.
  3. Data pipeline
    🟢 Monitoring of disparities between training data and the data used for inference in production.
    🟢 Monitoring of data distributions pre- and post-processing.
    🟢 Sanity checks of feature values prior to model training.
  4. Resource utilization
  5. Latency
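As a concrete example of the input-data checks above, a two-sample Kolmogorov-Smirnov test can flag a feature whose production distribution has drifted away from the training distribution. This is a minimal sketch; the significance level of 0.05 is a common but arbitrary choice, and the data is synthetic:

```python
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(train_col: np.ndarray, prod_col: np.ndarray, alpha: float = 0.05) -> bool:
    """Return True if the production distribution of a feature differs
    significantly from its training distribution (potential data drift)."""
    _, p_value = ks_2samp(train_col, prod_col)
    return p_value < alpha

# Synthetic example: the production feature has shifted by +0.5
train_feature = np.random.normal(loc=0.0, scale=1.0, size=5_000)
prod_feature = np.random.normal(loc=0.5, scale=1.0, size=5_000)

if has_drifted(train_feature, prod_feature):
    print("Data drift detected -- consider retraining or reweighting the training data")
```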

Containerization

Dockerization is often used to encapsulate the entire environment of an ML system, including dependencies, libraries, and configurations. This ensures that the same environment can be replicated across development, staging, and production by providing a way to document and track the versions of all software components, libraries, and dependencies.

Notebook Prototype vs Live System

Typically, machine learning scientists are very good at building algorithms and running experiments but may not have deep skills in software engineering, MLOps, or infrastructure management. On the other hand, engineers who specialize in deployment may not have the in-depth knowledge of the ML algorithms needed to make code changes and optimizations.

This skills gap makes it challenging to scale ML models in production, which in turn affects performance and reliability.

Example of an ML system on Google Cloud.

In a nutshell, the two primary objectives of productionalization are

  1. optimize the model for deployment.
  2. ensure robustness and reliability of the system.

This involves testing the system against various edge cases and scenarios to ensure that it can gracefully handle unexpected inputs and situations. Once the model is in production, it needs ongoing support such as:

  • Monitoring systems and processes (e.g. Prometheus + Grafana)
  • Model maintenance (e.g. retraining and updating)
  • Model versioning

Such ops support requires commitment from the data science team members. Moreover, the team must also support users of the ML system by providing documentation, conducting user training, explaining model predictions, and guiding users on recourse for model issues.
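To make the Prometheus + Grafana item above concrete, inference latency and request counts can be exposed as Prometheus metrics that Grafana then charts. This is a minimal sketch using the prometheus_client package; the metric names, port, and the dummy predict function are arbitrary:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics scraped by Prometheus and visualized in Grafana
PREDICTION_LATENCY = Histogram(
    "model_prediction_latency_seconds", "Time spent serving one prediction"
)
PREDICTION_COUNT = Counter("model_predictions_total", "Total predictions served")

@PREDICTION_LATENCY.time()
def predict(features):
    PREDICTION_COUNT.inc()
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for real model inference
    return 0

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        predict([1.0, 2.0, 3.0])
```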

Feel free to hit me up if your organization needs help with end-to-end data science and machine learning:

👆 https://www.linkedin.com/in/kevinsiswandi/
👆 https://github.com/physicist91
