The next 700 ML model-serving platforms


TLDR: A theoretical framework for understanding and comparing machine learning model serving systems in cloud environments, focusing on SageMaker, Vertex AI, and Azure ML. Just as previous "Next 700" papers sought to distill the essence of programming languages, we extract core concepts underlying ML model deployment systems.

1. Introduction

Today's ML engineers must choose among various serving systems, each with its own abstractions, terminology, and trade-offs. These platforms differ in their approaches to fundamental concepts such as:

  • Model containerization and packaging
  • Scaling and resource allocation
  • Version management and deployment strategies
  • Monitoring and observability
  • Resource optimization and cost management

2. A Calculus for ML Model Serving

Core Concepts

ModelArtifact ::= (code, weights, metadata)
Container     ::= (ModelArtifact, runtime, deps)
Endpoint      ::= (Container, scaling_config, routing)
Version       ::= (Endpoint, traffic_weight)

Operations

package : ModelArtifact → Container
deploy  : Container → Endpoint
scale   : Endpoint × Config → Endpoint
route   : Version × Version × Weight → Version
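
To keep the comparison concrete, here is a minimal sketch of the calculus as Python type stubs. The dataclasses and function signatures below simply mirror the grammar above; they are illustrative only and not part of any platform SDK.

# Platform-neutral sketch of the core grammar as Python types.
# Names and fields mirror the calculus above; not a vendor API.
from dataclasses import dataclass
from typing import Any

@dataclass
class ModelArtifact:
    code: str                       # path or URI to model code
    weights: str                    # path or URI to model weights
    metadata: dict[str, Any]        # framework, version, etc.

@dataclass
class Container:
    artifact: ModelArtifact
    runtime: dict[str, Any]         # image, execution role, accelerator
    deps: dict[str, Any]            # environment variables, packages

@dataclass
class Endpoint:
    container: Container
    scaling_config: dict[str, Any]  # instance type/count, replica bounds
    routing: dict[str, Any]         # variant names and traffic weights

@dataclass
class Version:
    endpoint: Endpoint
    traffic_weight: float

def package(artifact: ModelArtifact) -> Container: ...
def deploy(container: Container) -> Endpoint: ...
def scale(endpoint: Endpoint, config: dict[str, Any]) -> Endpoint: ...
def route(a: Version, b: Version, weight: float) -> Version: ...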

3. Platform Analysis

3.1 Amazon SageMaker

SageMaker's approach closely mirrors our theoretical model, with explicit container building and endpoint management. Key mappings include:

  • Model artifacts are packaged into ECR containers
  • Endpoints provide real-time inference with automatic scaling
  • Production variants enable traffic splitting

Basic Model Deployment

Theoretical Representation:

# SageMaker implementation of our core grammar
ModelArtifact ::= (
    code = "s3://bucket/model.tar.gz",   # Model code and artifacts
    weights = "s3://bucket/weights",     # Model weights
    metadata = {                         # Essential metadata only
        "framework": str,                # e.g., "huggingface"
        "version": str,                  # e.g., "4.37"
        "py_version": str                # e.g., "py310"
    }
)
Container ::= (
    ModelArtifact,
    runtime = {
        "image": str,                    # ECR image URI
        "execution_role": str            # IAM role
    },
    deps = {
        "environment": dict,             # Environment variables
        "entry_point": str               # Inference script
    }
)
Endpoint ::= (
    Container,
    scaling_config = {
        "instance_count": int,
        "instance_type": str
    },
    routing = {
        "variants": list[str],           # Production variant names
        "weights": list[float]           # Traffic weights
    }
)
Version ::= (
    Endpoint,
    traffic_weight = float               # Simple weight for this version
)

# Core operations
package : ModelArtifact → Container              # Create SageMaker model
deploy  : Container → Endpoint                   # Deploy to endpoint
scale   : Endpoint × Config → Endpoint           # Update instance count/type
route   : Version × Version × Weight → Version   # Update traffic split

Implementation:

from sagemaker.huggingface import (
    HuggingFaceModel,
    HuggingFacePredictor,
    get_huggingface_llm_image_uri,
)

# Resolve the HuggingFace serving image
image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.4.2"
)

# Create the model (packaging step); `env` and `SAGEMAKER_ROLE` are defined elsewhere
huggingface_model = HuggingFaceModel(
    env=env,                      # Environment variables for the container
    role=SAGEMAKER_ROLE,          # IAM execution role
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
    image_uri=image_uri
)

# Deploy the model (endpoint creation)
predictor = huggingface_model.deploy(
    initial_instance_count=deployment.instance_count,
    instance_type=deployment.instance_type,
    endpoint_name=endpoint_name,
)

# Inference invocation against the existing endpoint
predictor = HuggingFacePredictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session
)
response = predictor.predict(input)
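
The snippet above does not exercise the `route` operation. As a hedged sketch, traffic between two existing production variants can be re-weighted through the boto3 SageMaker client; the endpoint and variant names below are placeholders.

# Sketch of the `route` operation: shift traffic between two production
# variants on an existing endpoint. "my-endpoint", "VariantA" and
# "VariantB" are illustrative placeholders.
import boto3

sm_client = boto3.client("sagemaker")

sm_client.update_endpoint_weights_and_capacities(
    EndpointName="my-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "VariantA", "DesiredWeight": 0.9},
        {"VariantName": "VariantB", "DesiredWeight": 0.1},
    ],
)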

3.2 Azure ML SDK

Azure ML implements a workspace-centric approach with managed online endpoints, emphasizing environment management and model registry integration.

  • Managed deployments handle container creation implicitly
  • Scaling is defined through deployment configurations
  • Blue-green deployments manage version transitions

Theoretical Representation:

# Azure ML implementation of our core grammar
ModelArtifact ::= (
    code = "model/path",             # Local or registry path
    weights = "weights/path",
    metadata = {
        "name": str,                 # e.g., "hf-model"
        "type": AssetType,           # e.g., CUSTOM_MODEL
        "description": str,
        "registry": optional[str]    # e.g., "HuggingFace"
    }
)
Container ::= (
    ModelArtifact,
    runtime = {
        "image": str,                # e.g., "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04"
        "conda_file": {
            "channels": list[str],
            "dependencies": list[str]
        }
    },
    deps = {
        "environment_variables": dict[str, str],
        "pip_packages": list[str]
    }
)
Endpoint ::= (
    Container,
    scaling_config = {
        "instance_type": str,        # e.g., "Standard_DS3_v2"
        "instance_count": int,
        "min_replicas": int,
        "max_replicas": int
    },
    routing = {
        "deployment_name": str,
        "traffic_percentage": int
    }
)
Version ::= (
    Endpoint,
    traffic_weight = {
        "blue_green_config": {
            "active": str,           # blue or green
            "percentage": int,
            "evaluation_rules": dict
        }
    }
)

# Core operations
package(ModelArtifact) → Container   # Creates Azure container environment
deploy(Container) → Endpoint         # Deploys to Azure managed endpoint
scale(Endpoint × Config) → Endpoint  # Updates endpoint scaling
route(Version × Weight) → Version    # Updates traffic routing

Implementation:

import time

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import (
    Environment,
    Model,
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment
)
from azure.ai.ml.constants import AssetTypes

# Initialize workspace client; subscription, resource group, and workspace
# identifiers are assumed to be defined elsewhere
credential = DefaultAzureCredential()
ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name
)

# Define environment with dependencies
environment = Environment(
    name="bert-env",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    conda_file={
        "channels": ["conda-forge", "pytorch"],
        "dependencies": [
            "python=3.11",
            "pip",
            "pytorch",
            "transformers",
            "numpy"
        ]
    }
)

# Register model from the HuggingFace registry
model = Model(
    path=f"hf://{model_id}",
    type=AssetTypes.CUSTOM_MODEL,
    name="hf-model",
    description="HuggingFace model from Model Hub"
)

# Create and configure endpoint
endpoint_name = f"hf-ep-{int(time.time())}"
ml_client.begin_create_or_update(
    ManagedOnlineEndpoint(name=endpoint_name)
).wait()

# Deploy the registered model
deployment = ml_client.online_deployments.begin_create_or_update(
    ManagedOnlineDeployment(
        name="demo",
        endpoint_name=endpoint_name,
        model=model,
        environment=environment,
        instance_type="Standard_DS3_v2",
        instance_count=1,
    )
).wait()

# Update traffic rules
endpoint = ml_client.online_endpoints.get(endpoint_name)
endpoint.traffic = {"demo": 100}
ml_client.begin_create_or_update(endpoint).result()
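
As a sketch of the `route` operation in blue-green terms: assuming two deployments named "blue" and "green" already exist on the endpoint, traffic can be shifted by updating the endpoint's traffic map (reusing `ml_client` and `endpoint_name` from the block above).

# Sketch of a blue-green traffic shift; the deployment names "blue" and
# "green" are illustrative and must already exist on this endpoint.
endpoint = ml_client.online_endpoints.get(endpoint_name)
endpoint.traffic = {"blue": 90, "green": 10}   # send 10% of traffic to green
ml_client.begin_create_or_update(endpoint).result()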

3.3 Google Cloud Vertex AI

Vertex AI takes a streamlined approach to model deployment, integrating tightly with Google Cloud's container infrastructure and emphasizing GPU acceleration.

Theoretical Representation:

# Vertex AI implementation of our core grammar
ModelArtifact ::= (
    code = "gs://model/path",        # GCS path
    weights = "gs://weights/path",
    metadata = {
        "model_id": str,             # e.g., "hf-bert-base"
        "framework": str,            # e.g., "huggingface"
        "generation_config": dict
    }
)
Container ::= (
    ModelArtifact,
    runtime = {
        "image_uri": str,            # e.g., "us-docker.pkg.dev/vertex-ai/prediction/..."
        "accelerator": str           # e.g., "NVIDIA_TESLA_A100"
    },
    deps = {
        "env_vars": {
            "MODEL_ID": str,
            "MAX_INPUT_LENGTH": str,
            "MAX_TOTAL_TOKENS": str,
            "NUM_SHARD": str
        }
    }
)
Endpoint ::= (
    Container,
    scaling_config = {
        "machine_type": str,         # e.g., "a2-highgpu-4g"
        "min_replica_count": int,
        "max_replica_count": int,
        "accelerator_count": int
    },
    routing = {
        "traffic_split": dict[str, int],
        "prediction_config": dict
    }
)
Version ::= (
    Endpoint,
    traffic_weight = {
        "split_name": str,
        "percentage": int,
        "monitoring_config": dict
    }
)

# Core operations
package(ModelArtifact) → Container   # Creates Vertex AI container
deploy(Container) → Endpoint         # Deploys to Vertex endpoint
scale(Endpoint × Config) → Endpoint  # Updates endpoint scaling
route(Version × Weight) → Version    # Updates traffic routing

Implementation:

from google.cloud import aiplatform


def deploy_hf_model(
    project_id: str,
    location: str,
    model_id: str,
    machine_type: str = "a2-highgpu-4g",
):
    aiplatform.init(project=project_id, location=location)

    env_vars = {
        "MODEL_ID": model_id,
        "MAX_INPUT_LENGTH": "512",
        "MAX_TOTAL_TOKENS": "1024",
        "MAX_BATCH_PREFILL_TOKENS": "2048",
        "NUM_SHARD": "1"
    }

    # Upload model with container configuration (packaging step)
    model = aiplatform.Model.upload(
        display_name=f"hf-{model_id.replace('/', '-')}",
        serving_container_image_uri=(
            "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/"
            "huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310"
        ),
        serving_container_environment_variables=env_vars
    )

    # Deploy model with compute configuration (endpoint creation)
    endpoint = model.deploy(
        machine_type=machine_type,
        min_replica_count=1,
        max_replica_count=1,
        accelerator_type="NVIDIA_TESLA_A100",
        accelerator_count=1,
        sync=True
    )
    return endpoint


def create_completion(
    endpoint,
    prompt: str,
    max_tokens: int = 100,
    temperature: float = 0.7
):
    # Endpoint.predict expects a list of instances
    response = endpoint.predict(
        instances=[{
            "text": prompt,
            "parameters": {
                "max_new_tokens": max_tokens,
                "temperature": temperature,
                "top_p": 0.95,
                "top_k": 40,
            }
        }]
    )
    return response
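
For completeness, here is a brief usage sketch of the two helpers above; the project and model identifiers are placeholders.

# Example usage of deploy_hf_model and create_completion.
# "my-gcp-project" and the model ID are illustrative placeholders.
endpoint = deploy_hf_model(
    project_id="my-gcp-project",
    location="us-central1",
    model_id="mistralai/Mistral-7B-Instruct-v0.2",
)

result = create_completion(endpoint, "Explain model serving in one sentence.")
print(result.predictions)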

4. Hypothetical Frameworks

4.1 ServerlessML

ServerlessML takes a radical approach by completely eliminating the concept of endpoints and containers, instead treating models as pure functions:

Theoretical Representation:

ModelArtifact ::= (code, weights, metadata, scaling_rules)
Function      ::= (ModelArtifact, memory_size, timeout)
Invocation    ::= (Function, cold_start_policy)

# Key innovation: no explicit container or endpoint
deploy : ModelArtifact → Function
invoke : Function → Response
scale  : automatic, based on concurrent invocations

Implementation:

from serverlessml import MLFunction

model = MLFunction(
    model_path="model.pkl",
    framework="pytorch",
    memory_size="2GB",
    scaling_rules={
        "cold_start_policy": "eager_loading",
        "max_concurrent": 1000,
        "idle_timeout": "10m"
    }
)

# Deployment is implicit - the function is ready to serve
function_url = model.deploy()
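
Since ServerlessML (a hypothetical framework) exposes the model as a plain HTTPS function, invocation reduces to a single request. The sketch below assumes the `function_url` returned above accepts a JSON body; the payload shape is invented for illustration.

# Hypothetical invocation of the deployed function; assumes `function_url`
# from the block above and a JSON-in/JSON-out contract.
import requests

response = requests.post(function_url, json={"inputs": [1.0, 2.0, 3.0]})
prediction = response.json()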

Pros:

  • Zero infrastructure management - models are treated as pure functions
  • True pay-per-invocation pricing with no idle costs
  • Automatic scaling from zero to thousands of concurrent requests

Cons:

  • Cold starts can impact latency-sensitive applications
  • Limited control over underlying infrastructure
  • May be more expensive for constant high-throughput workloads

4.2 StatefulML

StatefulML introduces a novel approach by making model state and caching first-class concepts:

Theoretical Representation:

ModelArtifact ::= (code, weights, metadata)
ModelState    ::= (cache, warm_weights, dynamic_config)
Container     ::= (ModelArtifact, ModelState, runtime)
StateManager  ::= (Container, caching_policy, update_strategy)

# Key innovation: explicit state management
deploy        : (ModelArtifact, StateManager) → Container
update_state  : (Container, ModelState) → Container
cache_forward : (Container, Request) → Response

Implementation:

from statefulml import MLContainer, StateManager

state_manager = StateManager(
    caching_policy={
        "strategy": "predictive_cache",
        "cache_size": "4GB",
        "eviction_policy": "feature_based_lru"
    },
    update_strategy={
        "type": "incremental",
        "frequency": "5m",
        "warm_up": True
    }
)

model = MLContainer(
    model_path="model.pkl",
    framework="tensorflow",
    state_manager=state_manager,
    dynamic_config={
        "feature_importance_tracking": True,
        "automatic_cache_tuning": True
    }
)

endpoint = model.deploy()
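
A hypothetical invocation of the endpoint above illustrates the `cache_forward` operation; the request shape is invented for this sketch.

# Hypothetical use of the endpoint above: repeated requests with similar
# features should be served from the predictive cache after the first call.
first = endpoint.predict({"features": [0.2, 0.7, 0.1]})
second = endpoint.predict({"features": [0.2, 0.7, 0.1]})   # likely a cache hit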

Pros:

  • Intelligent caching reduces latency for common patterns
  • State persistence improves warm start performance
  • Dynamic optimization based on actual usage patterns

Cons:

  • More complex deployment and management
  • Higher memory requirements for state maintenance
  • Potential consistency issues with distributed state

5. Future Directions

Several extensions to the calculus stand out for future serving systems:

DynamicResource ::= (
    GranularAllocation,
    ElasticScaling,
    CostAwareScheduling
)

SharedModelConfig ::= (
    CrossEndpointSharing,
    DynamicModelLoading,
    ResourcePooling
)

EnhancedMonitoring ::= (
    PredictiveAlerts,
    AutomaticDiagnosis,
    AdaptiveOptimization
)
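
These concepts can be sketched as types in the same spirit as Section 2; the field names below mirror the grammar above and are speculative, not drawn from any existing SDK.

# Speculative sketch of the future-direction concepts as Python types;
# fields mirror the grammar above and are illustrative only.
from dataclasses import dataclass

@dataclass
class DynamicResource:
    granular_allocation: dict     # e.g., fractional GPU / memory slices
    elastic_scaling: dict         # e.g., scale-to-zero bounds and burst limits
    cost_aware_scheduling: dict   # e.g., spot vs. on-demand placement policy

@dataclass
class SharedModelConfig:
    cross_endpoint_sharing: bool  # one loaded model serving many endpoints
    dynamic_model_loading: bool   # load/unload weights on demand
    resource_pooling: dict        # shared accelerator pool settings

@dataclass
class EnhancedMonitoring:
    predictive_alerts: dict       # alert before SLOs are breached
    automatic_diagnosis: dict     # root-cause hints for regressions
    adaptive_optimization: dict   # feedback loop into scaling decisions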

6. Conclusion

This framework provides a way to understand and compare ML serving systems. While current platforms differ significantly, many of those differences reflect platform-specific constraints rather than fundamental requirements of the domain. Future systems can benefit from this analysis to provide more consistent and powerful abstractions for ML deployment.
