What I learned doing MLOps with SageMaker


Anna Pastushko

A well-designed MLOps setup is essential for organizations deploying AI solutions at scale. Over the years, I’ve designed and implemented numerous MLOps configurations across various client use cases. This experience has shaped my approach, which was initially influenced by the work of Dr. Sokratis Kartakis, Giuseppe Angelo Porcelli, and Shelbee Eigenbrode.

Recently, AWS announced the deprecation of several services that many teams relied on for MLOps: CodeCommit, SageMaker Studio Classic, and SageMaker Experiments. Practical guides on redesigning MLOps setups around these changes are still scarce, which makes navigating the MLOps landscape more challenging for teams.

This is why I decided to share my vision of an effective MLOps setup and the specific AWS infrastructure components that can support it. My goal is to provide practical guidance for both teams that are currently redesigning their MLOps and teams that are building a new setup from scratch.

First of all, let’s decide on the components that make up a mature MLOps process. Various organizations have proposed maturity models, each with its own levels and requirements. By synthesizing these frameworks, we can identify the critical elements that characterize a fully mature MLOps process:

  • Standardized development environment for experiments and collaboration.
  • Project templates that allow data scientists to initiate model development with automatic integration into established training and deployment workflows.
  • Automated ML pipelines for continuous training, evaluation, and deployment of models.
  • Centralized model registry for versioning, lineage tracking, and managing the model lifecycle.
  • Automated testing that covers model performance assessment and integration testing.
  • Monitoring of both pipelines and model endpoints.

Automated model retraining based on metrics, which the Azure MLOps maturity model suggests, is not a default use case in my experience, so I omitted it. The same applies to the feature store suggested by the GCP MLOps maturity model.

Before diving into the actual MLOps setup on AWS, it is important to understand what I mean by the baseline and advanced scenarios in this article.

During my work, I eventually concluded that there are two general scenarios related to model development and deployment, which require slightly different approaches to MLOps setup:

  • Baseline scenario — after developing, training, and testing steps, the final version of the model is approved and deployed to production.
  • Advanced scenario — additional final testing happens in the production environment, with two endpoints deployed simultaneously.

The baseline scenario is typically employed for batch processing, internal-use endpoints, or situations where extensive production testing isn’t required. The advanced scenario is preferred when implementing blue-green, canary, or shadow model deployment strategies in a production environment.

While both scenarios can be expanded to accommodate additional use cases and customizations, I’ve observed that either the baseline or advanced scenario serves as the foundation for most implementations. In the following sections, I’ll describe the setup process for both scenarios.

Git branching strategy

Code versioning is the foundation of every process, so let’s start with the git branching strategy for ML model repositories. Each repository is based on two primary branches: production and development. They contain the code of the models deployed to the corresponding environments.

While developing, data scientists create feature branches with the prefix “feature-”. Once initial development is completed and tested, changes are merged into the development branch for development endpoint deployment. When the model is ready for deployment to the production environment, the development branch is merged into the production branch.

Branching strategy (Image by Author)

Projects setup

We need a project template for new model development that is seamlessly integrated with the ML pipelines after creation. It should contain template code that can be used as a starting point for development. SageMaker Projects is the feature built specifically for this purpose, and sample project code can be based on the numerous examples provided in the SageMaker project templates.

After a user creates the project from the SageMaker Studio UI, the following resources should be created in the baseline scenario:

  • Repository with template code and two branches: prod and dev.
  • Option 1 — SageMaker Model registry: three model groups (prod, dev, and feature).
  • Option 2 — MLflow tracking server: nothing should be created.
Creation of SageMaker project and Model groups.
Baseline scenario (Image by Author)

The prod and dev model groups store models trained with the code in the production and development branches respectively. There is only one model group for all feature-* branches. This approach is more practical because feature-* branches are often short-lived, with frequent merges and deletions, making a separate model group for each branch impractical. Moreover, during development, data scientists can register models directly from locally launched pipelines without committing code to the feature branch, which limits the ability to consistently trace a model back to a specific code version. Finally, SageMaker imposes a soft limit on the number of model groups, so creating unnecessary groups is inefficient.

The advanced scenario is slightly different: we need to create four model groups — champion, challenger, dev, and feature. Champion and challenger are the standard naming conventions used to evaluate ML models in production, so we use them to differentiate between the two endpoints deployed in the production environment.

Creation of SageMaker project and Model groups.
Advanced scenario (Image by Author)

From the infrastructure standpoint, the process of project creation is the same for both the baseline and advanced scenarios. The project lifecycle is managed by a Lambda function, which is triggered each time a project is created or deleted. In SageMaker, projects are implemented as Service Catalog products, which are defined using CloudFormation. Since I don’t like creating a Lambda function for each project and prefer Terraform over CloudFormation, I recommend the following setup:

  • Create the Lambda function and its configuration separately using Terraform.
  • Invoke the Lambda from CloudFormation (for example, as a custom resource).
 ServiceCatalog, CloudFormation, Lambda.
Technical implementation (Image by Author)
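With this split, the Terraform-managed Lambda has to speak the CloudFormation custom resource protocol: it receives Create/Update/Delete events from the project’s CloudFormation stack and must report the result back to the pre-signed ResponseURL. Below is a minimal sketch of that skeleton; the resource properties and the project-specific logic inside the branches are placeholders.

```python
import json
import urllib.request


def handler(event, context):
    """Entry point invoked by the custom resource in the project's CloudFormation template."""
    request_type = event["RequestType"]          # "Create", "Update" or "Delete"
    props = event.get("ResourceProperties", {})  # values passed from the template
    project_name = props.get("ProjectName", "unknown-project")

    try:
        if request_type == "Create":
            # Create the Git repository, prod/dev branches and model groups here
            # (model group creation is sketched in the next snippet).
            pass
        elif request_type == "Delete":
            # Clean up or archive the project's resources here.
            pass
        send_response(event, context, "SUCCESS")
    except Exception as exc:
        # Always report failures, otherwise the stack operation hangs until timeout.
        send_response(event, context, "FAILED", reason=str(exc))


def send_response(event, context, status, reason=""):
    """PUT the mandatory response object to the pre-signed URL provided by CloudFormation."""
    body = json.dumps({
        "Status": status,
        "Reason": reason or f"See CloudWatch log stream {context.log_stream_name}",
        "PhysicalResourceId": event.get("PhysicalResourceId", context.log_stream_name),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }).encode("utf-8")
    request = urllib.request.Request(event["ResponseURL"], data=body, method="PUT")
    urllib.request.urlopen(request)
```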

The Lambda function should create a repository with the prod branch for the project, copy template code from the template repository, and create the dev branch. It should also create model groups if the SageMaker Model Registry is used.
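If the SageMaker Model Registry option is used, creating the model groups is a couple of boto3 calls. A minimal sketch, assuming a <project>-<stage> naming convention; the project tags are optional, but they make the groups visible under the project in Studio.

```python
import boto3

sagemaker = boto3.client("sagemaker")


def create_project_model_groups(project_name: str, project_id: str) -> None:
    """Create the prod/dev/feature model package groups for a new project."""
    for stage in ("prod", "dev", "feature"):
        sagemaker.create_model_package_group(
            ModelPackageGroupName=f"{project_name}-{stage}",
            ModelPackageGroupDescription=f"{stage} models for {project_name}",
            Tags=[
                {"Key": "sagemaker:project-name", "Value": project_name},
                {"Key": "sagemaker:project-id", "Value": project_id},
            ],
        )
```

For the advanced scenario, the same loop would iterate over champion, challenger, dev, and feature instead.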

ML pipelines

For a simple baseline scenario, the ML pipeline follows a straightforward process:

  • The data scientist starts development by creating a new SageMaker project or a feature branch in the existing model code repository.
  • While developing a new model or feature, they can run SageMaker Pipeline code locally or push updates to the feature branch to trigger cloud-based training. The pipeline automates data preparation, model training, and artifact registration in the Model Registry.
  • Once the model is registered in the feature model group, deployment to the dev environment can be triggered via Model Registry approval. To optimize costs, these endpoints can be automatically deleted once per day/week by a dedicated Lambda function.
  • When the new model or feature version is ready for testing, the feature branch is merged into the dev branch, triggering the pipeline, which registers the model in the dev model group.
  • Upon deployment approval (can be automated by registering the model as Approved), the model is deployed to the dev environment, and automated testing is executed.
  • If the model passes all tests, the data scientist merges changes into the prod branch. Instead of retraining, the model artifact from the dev model group is copied to the prod model group (see the sketch after this list).
  • After final deployment approval in the prod model group, the model is deployed to the prod environment.
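The “copy” step mentioned above can be implemented by re-registering the approved dev model package in the prod model group, so the exact same artifact is promoted without retraining. A minimal sketch, assuming both groups live in the same account; depending on how the package was created, some fields of the returned InferenceSpecification may need to be trimmed.

```python
import boto3

sm = boto3.client("sagemaker")


def promote_model_package(dev_package_arn: str, prod_group: str) -> str:
    """Register an already trained model version in the prod model group."""
    source = sm.describe_model_package(ModelPackageName=dev_package_arn)

    kwargs = {
        "ModelPackageGroupName": prod_group,
        # Reuse the container image and model artifact from the dev version.
        "InferenceSpecification": source["InferenceSpecification"],
        # Final approval in the prod group triggers the prod deployment.
        "ModelApprovalStatus": "PendingManualApproval",
        "ModelPackageDescription": f"Promoted from {dev_package_arn}",
    }
    # Carry lineage metadata over if it exists on the dev version.
    if "CustomerMetadataProperties" in source:
        kwargs["CustomerMetadataProperties"] = source["CustomerMetadataProperties"]

    return sm.create_model_package(**kwargs)["ModelPackageArn"]
```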

If you use the MLflow tracking server, note that EventBridge events aren’t generated for model alias assignments. To trigger automated endpoint deployment, you have two options:

  • Use approval logic in the Model Registry — since all models are automatically registered in the Model Registry as well, you can leverage approval-based triggers for deployment.
  • Develop a custom process — this process would monitor model alias assignments in the MLflow tracking server and initiate endpoint deployment accordingly (a sketch follows the note below).

⚠️ The managed MLflow on SageMaker doesn’t expose the underlying webhook capabilities that would be available in a self-deployed MLflow server according to AWS documentation.
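If you go with the second option, the custom process can be as small as a scheduled job that compares the version currently behind the alias with the last deployed one. A minimal sketch, assuming a SageMaker managed MLflow tracking server and a hypothetical deploy-endpoint Lambda:

```python
import json
from typing import Optional

import boto3
import mlflow
from mlflow import MlflowClient

# Hypothetical identifiers -- replace with your tracking server, model name and alias.
TRACKING_SERVER_ARN = "arn:aws:sagemaker:eu-west-1:111122223333:mlflow-tracking-server/my-server"
MODEL_NAME = "churn-model"
ALIAS = "champion"

mlflow.set_tracking_uri(TRACKING_SERVER_ARN)
client = MlflowClient()


def check_alias_and_deploy(last_deployed_version: Optional[str]) -> str:
    """Trigger deployment when the alias points to a new model version.

    Run this on a schedule, e.g. from an EventBridge-scheduled Lambda.
    """
    version = client.get_model_version_by_alias(MODEL_NAME, ALIAS)
    if version.version != last_deployed_version:
        # Hand over to the deployment pipeline; function name and payload are assumptions.
        boto3.client("lambda").invoke(
            FunctionName="deploy-endpoint",
            InvocationType="Event",
            Payload=json.dumps({"model_uri": f"models:/{MODEL_NAME}@{ALIAS}"}).encode(),
        )
    return version.version
```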

Training and deployment MLOps pipelines with SageMaker.
Baseline scenario ML pipeline (Image by Author)

The key difference in the advanced scenario lies in the production deployment process:

  • After merging the dev branch into the prod branch, the model artifact is copied to both the champion and challenger model groups.
  • Approving the model in each model group triggers deployment to the corresponding endpoint.
  • If using the MLflow tracking server, model aliases can be assigned, but a custom process is required to deploy the model to the appropriate endpoint based on the assigned alias.
Training and deployment MLOps pipelines with SageMaker.
Advanced scenario ML pipeline (Image by Author)

The technical implementation consists of two pipelines:

  • Training pipeline — the one that launches training and registers model artifacts and metadata after a commit or merge into a branch. It can be implemented with any CI/CD tool: GitHub Actions, GitLab CI/CD, Bitbucket Pipelines, Jenkins, etc. I recommend sticking with the one already used in your company.
  • Deployment pipeline — the one that deploys an endpoint to the corresponding environment. It can be triggered by an EventBridge rule that tracks model package approval status changes (Model Registry approach). The most cost-efficient way is to use a Lambda function for endpoint deployment (see the sketch below).
Technical implementation of ML pipelines (Image by Author)
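As an illustration, here is a minimal sketch of such a deployment Lambda, assuming it is wired to an EventBridge rule on the “SageMaker Model Package State Change” detail type; the endpoint naming, instance type, and execution role are placeholders.

```python
import time

import boto3

sm = boto3.client("sagemaker")
ROLE_ARN = "arn:aws:iam::111122223333:role/sagemaker-execution-role"  # placeholder


def handler(event, context):
    """Deploy (or update) an endpoint when a model package is approved."""
    detail = event["detail"]
    if detail.get("ModelApprovalStatus") != "Approved":
        return  # ignore pending and rejected versions

    package_arn = detail["ModelPackageArn"]
    endpoint_name = detail["ModelPackageGroupName"]  # naming convention is an assumption
    suffix = str(int(time.time()))

    # 1. Wrap the approved model package in a SageMaker Model.
    model_name = f"{endpoint_name}-{suffix}"
    sm.create_model(
        ModelName=model_name,
        PrimaryContainer={"ModelPackageName": package_arn},
        ExecutionRoleArn=ROLE_ARN,
    )

    # 2. Create a new endpoint configuration for this model version.
    config_name = f"{endpoint_name}-{suffix}"
    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
        }],
    )

    # 3. Create the endpoint on first deployment, update it afterwards.
    existing = sm.list_endpoints(NameContains=endpoint_name)["Endpoints"]
    if any(e["EndpointName"] == endpoint_name for e in existing):
        sm.update_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
    else:
        sm.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
```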

Model registry

When using the SageMaker Model Registry or MLflow tracking server to register new model versions, assigning metadata is essential for model lineage tracking. Key metadata includes:

  • Dataset reference — A link to the dataset or dataset version used for training.
  • Commit ID — Identifies the exact version of the code used for training.
  • Hyperparameters — Stores key hyperparameters for quick reference.
  • Performance metrics — Captures essential evaluation metrics such as accuracy, precision, and recall.
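With the SageMaker Model Registry, this metadata can be attached at registration time via CustomerMetadataProperties and ModelMetrics. A minimal sketch with placeholder values; in a real pipeline they would come from the training job, the Git provider, and the evaluation step.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_model_package(
    ModelPackageGroupName="churn-model-dev",  # placeholder group name
    InferenceSpecification={
        "Containers": [{
            "Image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/churn:latest",
            "ModelDataUrl": "s3://my-bucket/churn/model.tar.gz",
        }],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
    ModelApprovalStatus="PendingManualApproval",
    # Free-form key/value pairs used for lineage tracking.
    CustomerMetadataProperties={
        "dataset_uri": "s3://my-bucket/churn/train/2024-05-01/",
        "commit_id": "9f8c2ab",
        "hyperparameters": '{"max_depth": 6, "eta": 0.2}',
    },
    # Structured evaluation report produced by the evaluation step.
    ModelMetrics={
        "ModelQuality": {
            "Statistics": {
                "ContentType": "application/json",
                "S3Uri": "s3://my-bucket/churn/evaluation/metrics.json",
            }
        }
    },
)
```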

Testing

You can evaluate model performance and data quality using the QualityCheck step in SageMaker Pipelines. Alternatively, you can develop a custom step or a tailored solution to assess model performance based on your specific requirements.
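For illustration, here is a hedged sketch of a model quality check step as it could appear in a pipeline definition; the role, S3 paths, column names, and model package group are placeholders.

```python
from sagemaker.model_monitor.dataset_format import DatasetFormat
from sagemaker.workflow.check_job_config import CheckJobConfig
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.quality_check_step import ModelQualityCheckConfig, QualityCheckStep

pipeline_session = PipelineSession()
role = "arn:aws:iam::111122223333:role/sagemaker-execution-role"  # placeholder

check_job_config = CheckJobConfig(
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    volume_size_in_gb=30,
    sagemaker_session=pipeline_session,
)

model_quality_config = ModelQualityCheckConfig(
    # Validation set with model predictions attached, produced by an earlier step.
    baseline_dataset="s3://my-bucket/churn/validation_with_predictions.csv",
    dataset_format=DatasetFormat.csv(header=True),
    problem_type="BinaryClassification",
    inference_attribute="prediction",   # column holding the model output
    ground_truth_attribute="label",     # column holding the true label
    output_s3_uri="s3://my-bucket/churn/model-quality-check/",
)

model_quality_step = QualityCheckStep(
    name="ModelQualityCheck",
    quality_check_config=model_quality_config,
    check_job_config=check_job_config,
    skip_check=False,             # fail the pipeline when the check is violated
    register_new_baseline=True,   # attach the new baseline to the model version
    model_package_group_name="churn-model-dev",
)
```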

SageMaker does not provide built-in features for automated integration and load testing, so a custom solution is required. As with model performance assessment, it can be implemented as a custom step in the SageMaker Pipeline or as a standalone solution. The best option depends on your specific requirements, but the following approach is a good starting point.

You can automate integration testing after model deployment to a SageMaker endpoint by using an EventBridge rule and a Lambda function (sketched below):

  • Store model testing configurations in a DynamoDB table.
  • When the endpoint is updated, the Lambda function retrieves the configuration and runs the tests.
  • Send test results to a preferred communication channel, such as Slack.
Custom solution for integration testing (Image by Author)
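A minimal sketch of such a Lambda function follows; the DynamoDB schema, table name, Slack webhook, and the exact fields of the endpoint state-change event are assumptions and should be adjusted to your setup.

```python
import json
import urllib.request

import boto3

dynamodb = boto3.resource("dynamodb")
runtime = boto3.client("sagemaker-runtime")

CONFIG_TABLE = "endpoint-test-configs"                              # hypothetical table
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"   # placeholder URL


def handler(event, context):
    """Run the smoke tests stored in DynamoDB once an endpoint becomes InService."""
    detail = event["detail"]
    if detail.get("EndpointStatus") != "IN_SERVICE":
        return

    endpoint_name = detail["EndpointName"]
    config = dynamodb.Table(CONFIG_TABLE).get_item(
        Key={"endpoint_name": endpoint_name}
    ).get("Item", {})

    results = []
    for case in config.get("test_cases", []):
        response = runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType=case.get("content_type", "text/csv"),
            Body=case["payload"],
        )
        prediction = response["Body"].read().decode("utf-8")
        results.append({"case": case.get("name", "unnamed"), "prediction": prediction})

    # Post the summary to the preferred channel, Slack in this example.
    message = json.dumps({"text": f"Integration tests for {endpoint_name}: {results}"}).encode()
    urllib.request.urlopen(urllib.request.Request(
        SLACK_WEBHOOK, data=message, headers={"Content-Type": "application/json"}
    ))
```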

⚠️ Keep in mind that this example cannot be used for some endpoint types, as they do not generate the EventBridge event that is used as a trigger.

Monitoring

SageMaker provides a built-in feature for monitoring data quality, model quality, and data or feature drift — Model Monitor. However, there are a few important considerations:

  • Ground truth data is required to assess model quality. If ground truth data is unavailable for predictions, Model Monitor cannot be used.
  • Limited data type support — Model Monitor is designed primarily for tabular data and offers minimal support for images, text, or video.
  • Fixed monitoring intervals — Monitoring can be scheduled hourly, daily, weekly, or monthly, but real-time continuous monitoring is not supported.
  • Delayed results — There is a lag between inference time and when monitoring results become available, making it unsuitable for immediate alerts.
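If these limitations are acceptable, a data quality monitoring schedule only takes a few SDK calls (data capture must already be enabled on the endpoint). A minimal sketch with placeholder names:

```python
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = "arn:aws:iam::111122223333:role/sagemaker-execution-role"  # placeholder

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# 1. Build baseline statistics and constraints from the training dataset.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/churn/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/churn/monitoring/baseline/",
)

# 2. Attach a scheduled data quality check to the production endpoint.
monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-prod-data-quality",
    endpoint_input="churn-prod-endpoint",
    output_s3_uri="s3://my-bucket/churn/monitoring/reports/",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.daily(),
)
```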

If you only need to track technical metrics like CPU and memory utilization, you can use CloudWatch Endpoint Instance Metrics.

🦊 Thank you for reading till the end. I do hope it was helpful. If you spot any mistakes or have questions — please let me know in the comments.

If you need help with MLOps setup in your company, let’s talk.

If you are interested in biweekly AWS tips — follow me on Substack.
