TLDR: A theoretical framework for understanding and comparing machine learning model serving systems in cloud environments, focusing on SageMaker, Vertex AI, and Azure ML. Just as previous "Next 700" papers sought to distill the essence of programming languages, we extract core concepts underlying ML model deployment systems.
1. Introduction
Today's ML engineers must choose among several serving systems, each with its own abstractions, terminology, and trade-offs. These platforms differ in their approaches to fundamental concepts such as:
- Model containerization and packaging
- Scaling and resource allocation
- Version management and deployment strategies
- Monitoring and observability
- Resource optimization and cost management
2. A Calculus for ML Model Serving
Core Concepts
```
ModelArtifact ::= (code, weights, metadata)
Container     ::= (ModelArtifact, runtime, deps)
Endpoint      ::= (Container, scaling_config, routing)
Version       ::= (Endpoint, traffic_weight)
```
Operations
```
package : ModelArtifact → Container
deploy  : Container → Endpoint
scale   : Endpoint × Config → Endpoint
route   : Version × Version × Weight → Version
```
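To make the calculus concrete, here is a minimal sketch of the types and operations as plain Python dataclasses. It is not bound to any platform; the extra runtime, deps, scaling_config, and routing parameters on package and deploy are conveniences added for the sketch, not part of the grammar above.

```python
from dataclasses import dataclass, field

@dataclass
class ModelArtifact:
    code: str                  # path/URI to model code
    weights: str               # path/URI to model weights
    metadata: dict = field(default_factory=dict)

@dataclass
class Container:
    artifact: ModelArtifact
    runtime: dict              # e.g., image URI, accelerator
    deps: dict                 # e.g., environment variables, packages

@dataclass
class Endpoint:
    container: Container
    scaling_config: dict
    routing: dict

@dataclass
class Version:
    endpoint: Endpoint
    traffic_weight: float

def package(artifact: ModelArtifact, runtime: dict, deps: dict) -> Container:
    """package : ModelArtifact → Container"""
    return Container(artifact, runtime, deps)

def deploy(container: Container, scaling_config: dict, routing: dict) -> Endpoint:
    """deploy : Container → Endpoint"""
    return Endpoint(container, scaling_config, routing)

def scale(endpoint: Endpoint, config: dict) -> Endpoint:
    """scale : Endpoint × Config → Endpoint"""
    return Endpoint(endpoint.container, {**endpoint.scaling_config, **config}, endpoint.routing)

def route(current: Version, candidate: Version, weight: float) -> Version:
    """route : Version × Version × Weight → Version"""
    current.traffic_weight = 1.0 - weight
    return Version(candidate.endpoint, weight)

# Example: package, deploy, and shift 10% of traffic to a new version
artifact = ModelArtifact(code="s3://bucket/model.tar.gz", weights="s3://bucket/weights")
endpoint = deploy(package(artifact, runtime={"image": "my-image"}, deps={}),
                  scaling_config={"instance_count": 1}, routing={})
v1, v2 = Version(endpoint, 1.0), Version(endpoint, 0.0)
promoted = route(v1, v2, 0.1)
```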
3. Platform Analysis
3.1 Amazon SageMaker
SageMaker's approach closely mirrors our theoretical model, with explicit container building and endpoint management. Key mappings include:
- Model artifacts are packaged into ECR containers
- Endpoints provide real-time inference with automatic scaling
- Production variants enable traffic splitting
Basic Model Deployment
Theoretical Representation:
```
# SageMaker strict implementation of core grammar
ModelArtifact ::= (
    code = "s3://bucket/model.tar.gz",   # Model code and artifacts
    weights = "s3://bucket/weights",     # Model weights
    metadata = {                         # Essential metadata only
        "framework": str,                # e.g., "huggingface"
        "version": str,                  # e.g., "4.37"
        "py_version": str                # e.g., "py310"
    }
)

Container ::= (
    ModelArtifact,
    runtime = {
        "image": str,                    # ECR image URI
        "execution_role": str            # IAM role
    },
    deps = {
        "environment": dict,             # Environment variables
        "entry_point": str               # Inference script
    }
)

Endpoint ::= (
    Container,
    scaling_config = {
        "instance_count": int,
        "instance_type": str
    },
    routing = {
        "variants": list[str],           # Production variant names
        "weights": list[float]           # Traffic weights
    }
)

Version ::= (
    Endpoint,
    traffic_weight = float               # Simple weight for this version
)

# Core operations
package : ModelArtifact → Container            # Create SageMaker model
deploy  : Container → Endpoint                 # Deploy to endpoint
scale   : Endpoint × Config → Endpoint         # Update instance count/type
route   : Version × Version × Weight → Version # Update traffic split
```
Implementation:
```python
from sagemaker.huggingface import (
    HuggingFaceModel,
    HuggingFacePredictor,
    get_huggingface_llm_image_uri,
)

# Define the model image
image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.4.2"
)

# Create the model (packaging step)
huggingface_model = HuggingFaceModel(
    env=env,                   # Environment variables for the container
    role=SAGEMAKER_ROLE,
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
    image_uri=image_uri
)

# Deploy the model (endpoint creation)
predictor = huggingface_model.deploy(
    initial_instance_count=deployment.instance_count,
    instance_type=deployment.instance_type,
    endpoint_name=endpoint_name,
)

# Inference invocation
predictor = HuggingFacePredictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session
)
response = predictor.predict(input)
```
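The deployment above creates a single production variant; the route operation maps to SageMaker's traffic weights across variants. A minimal sketch using boto3's update_endpoint_weights_and_capacities, assuming an endpoint that already has two variants named "blue" and "green" (the names and weights here are illustrative, not part of the deployment above):

```python
import boto3

sm_client = boto3.client("sagemaker")

# route : Version × Version × Weight → Version
# Shift 20% of traffic to the "green" variant, keep 80% on "blue".
sm_client.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {"VariantName": "blue", "DesiredWeight": 0.8},
        {"VariantName": "green", "DesiredWeight": 0.2},
    ],
)
```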
3.2 Azure ML SDK
Azure ML implements a workspace-centric approach with managed online endpoints, emphasizing environment management and model registry integration.
- Managed deployments handle container creation implicitly
- Scaling is defined through deployment configurations
- Blue-green deployments manage version transitions
Theoretical Representation:
```
# Azure ML implementation of our core grammar
ModelArtifact ::= (
    code = "model/path",                 # Local or registry path
    weights = "weights/path",
    metadata = {
        "name": str,                     # e.g., "hf-model"
        "type": AssetType,               # e.g., CUSTOM_MODEL
        "description": str,
        "registry": optional[str]        # e.g., "HuggingFace"
    }
)

Container ::= (
    ModelArtifact,
    runtime = {
        "image": str,                    # e.g., "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04"
        "conda_file": {
            "channels": list[str],
            "dependencies": list[str]
        }
    },
    deps = {
        "environment_variables": dict[str, str],
        "pip_packages": list[str]
    }
)

Endpoint ::= (
    Container,
    scaling_config = {
        "instance_type": str,            # e.g., "Standard_DS3_v2"
        "instance_count": int,
        "min_replicas": int,
        "max_replicas": int
    },
    routing = {
        "deployment_name": str,
        "traffic_percentage": int
    }
)

Version ::= (
    Endpoint,
    traffic_weight = {
        "blue_green_config": {
            "active": str,               # blue or green
            "percentage": int,
            "evaluation_rules": dict
        }
    }
)

# Core operations
package(ModelArtifact) → Container       # Creates Azure container environment
deploy(Container) → Endpoint             # Deploys to Azure managed endpoint
scale(Endpoint × Config) → Endpoint      # Updates endpoint scaling
route(Version × Weight) → Version        # Updates traffic routing
```
Implementation:
```python
import time

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import (
    Environment,
    Model,
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment
)
from azure.ai.ml.constants import AssetTypes

# Initialize workspace client
credential = DefaultAzureCredential()
ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name
)

# Define environment with dependencies
environment = Environment(
    name="bert-env",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    conda_file={
        "channels": ["conda-forge", "pytorch"],
        "dependencies": [
            "python=3.11",
            "pip",
            "pytorch",
            "transformers",
            "numpy"
        ]
    }
)

# Define model sourced from the HuggingFace registry
model = Model(
    path=f"hf://{model_id}",
    type=AssetTypes.CUSTOM_MODEL,
    name="hf-model",
    description="HuggingFace model from Model Hub"
)

# Create and configure endpoint
endpoint_name = f"hf-ep-{int(time.time())}"
ml_client.begin_create_or_update(
    ManagedOnlineEndpoint(name=endpoint_name)
).wait()

# Deploy model
deployment = ml_client.online_deployments.begin_create_or_update(
    ManagedOnlineDeployment(
        name="demo",
        endpoint_name=endpoint_name,
        model=model,
        environment=environment,
        instance_type="Standard_DS3_v2",
        instance_count=1,
    )
).result()

# Update traffic rules
endpoint = ml_client.online_endpoints.get(endpoint_name)
endpoint.traffic = {"demo": 100}
ml_client.begin_create_or_update(endpoint).result()
```
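Once traffic is routed, inference goes through the endpoint's invoke operation. A brief sketch, where sample-request.json is an assumed local file whose payload matches what the deployment's scoring script expects:

```python
# Invoke the managed online endpoint; "sample-request.json" is an assumed
# local file containing a payload the "demo" deployment can score.
response = ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    deployment_name="demo",
    request_file="sample-request.json",
)
print(response)
```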
3.3 Google Cloud Vertex AI
Vertex AI takes a streamlined approach to model deployment, with strong integration with Google Cloud's container infrastructure and an emphasis on GPU acceleration.
Theoretical Representation:
```
# Vertex AI implementation of our core grammar
ModelArtifact ::= (
    code = "gs://model/path",            # GCS path
    weights = "gs://weights/path",
    metadata = {
        "model_id": str,                 # e.g., "hf-bert-base"
        "framework": str,                # e.g., "huggingface"
        "generation_config": dict
    }
)

Container ::= (
    ModelArtifact,
    runtime = {
        "image_uri": str,                # e.g., "us-docker.pkg.dev/vertex-ai/prediction/..."
        "accelerator": str               # e.g., "NVIDIA_TESLA_A100"
    },
    deps = {
        "env_vars": {
            "MODEL_ID": str,
            "MAX_INPUT_LENGTH": str,
            "MAX_TOTAL_TOKENS": str,
            "NUM_SHARD": str
        }
    }
)

Endpoint ::= (
    Container,
    scaling_config = {
        "machine_type": str,             # e.g., "a2-highgpu-4g"
        "min_replica_count": int,
        "max_replica_count": int,
        "accelerator_count": int
    },
    routing = {
        "traffic_split": dict[str, int],
        "prediction_config": dict
    }
)

Version ::= (
    Endpoint,
    traffic_weight = {
        "split_name": str,
        "percentage": int,
        "monitoring_config": dict
    }
)

# Core operations
package(ModelArtifact) → Container       # Creates Vertex AI container
deploy(Container) → Endpoint             # Deploys to Vertex endpoint
scale(Endpoint × Config) → Endpoint      # Updates endpoint scaling
route(Version × Weight) → Version        # Updates traffic routing
```
Implementation:
```python
from google.cloud import aiplatform


def deploy_hf_model(
    project_id: str,
    location: str,
    model_id: str,
    machine_type: str = "a2-highgpu-4g",
):
    aiplatform.init(project=project_id, location=location)

    env_vars = {
        "MODEL_ID": model_id,
        "MAX_INPUT_LENGTH": "512",
        "MAX_TOTAL_TOKENS": "1024",
        "MAX_BATCH_PREFILL_TOKENS": "2048",
        "NUM_SHARD": "1"
    }

    # Upload model with container configuration
    model = aiplatform.Model.upload(
        display_name=f"hf-{model_id.replace('/', '-')}",
        serving_container_image_uri=(
            "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/"
            "huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310"
        ),
        serving_container_environment_variables=env_vars
    )

    # Deploy model with compute configuration
    endpoint = model.deploy(
        machine_type=machine_type,
        min_replica_count=1,
        max_replica_count=1,
        accelerator_type="NVIDIA_TESLA_A100",
        accelerator_count=1,
        sync=True
    )
    return endpoint


def create_completion(
    endpoint,
    prompt: str,
    max_tokens: int = 100,
    temperature: float = 0.7
):
    response = endpoint.predict(
        instances=[{
            "text": prompt,
            "parameters": {
                "max_new_tokens": max_tokens,
                "temperature": temperature,
                "top_p": 0.95,
                "top_k": 40,
            }
        }]
    )
    return response
```
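For completeness, a short usage sketch of the two helpers above; the project and model identifiers are placeholders rather than values from an actual deployment:

```python
# Hypothetical project/model identifiers for illustration only.
endpoint = deploy_hf_model(
    project_id="my-gcp-project",
    location="us-central1",
    model_id="mistralai/Mistral-7B-Instruct-v0.2",
)

response = create_completion(endpoint, prompt="Explain model serving in one sentence.")
print(response.predictions)
```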
4. Hypothetical Frameworks
4.1 ServerlessML
ServerlessML takes a radical approach by completely eliminating the concept of endpoints and containers, instead treating models as pure functions:
Theoretical Representation:
```
ModelArtifact ::= (code, weights, metadata, scaling_rules)
Function      ::= (ModelArtifact, memory_size, timeout)
Invocation    ::= (Function, cold_start_policy)

# Key innovation: No explicit container or endpoint
deploy : ModelArtifact → Function
invoke : Function → Response
scale  : automatic, based on concurrent invocations
```
Implementation:
```python
from serverlessml import MLFunction

model = MLFunction(
    model_path="model.pkl",
    framework="pytorch",
    memory_size="2GB",
    scaling_rules={
        "cold_start_policy": "eager_loading",
        "max_concurrent": 1000,
        "idle_timeout": "10m"
    }
)

# Deployment is implicit - the function is ready to serve
function_url = model.deploy()
```
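Because ServerlessML is hypothetical, the invocation sketch below is hypothetical too: the deployed function is just an HTTP endpoint at function_url, so invoke reduces to a POST request, and the JSON payload shape is assumed for illustration.

```python
import requests

# invoke : Function → Response
# Payload schema is assumed; billing would be per invocation.
response = requests.post(function_url, json={"inputs": [[0.1, 0.2, 0.3]]})
print(response.json())
```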
Pros:
- Zero infrastructure management - models are treated as pure functions
- True pay-per-invocation pricing with no idle costs
- Automatic scaling from zero to thousands of concurrent requests
Cons:
- Cold starts can impact latency-sensitive applications
- Limited control over underlying infrastructure
- May be more expensive for constant high-throughput workloads
4.2 StatefulML
StatefulML introduces a novel approach by making model state and caching first-class concepts:
Theoretical Representation:
```
ModelArtifact ::= (code, weights, metadata)
ModelState    ::= (cache, warm_weights, dynamic_config)
Container     ::= (ModelArtifact, ModelState, runtime)
StateManager  ::= (Container, caching_policy, update_strategy)

# Key innovation: Explicit state management
deploy        : (ModelArtifact, StateManager) → Container
update_state  : (Container, ModelState) → Container
cache_forward : (Container, Request) → Response
```
Implementation:
```python
from statefulml import MLContainer, StateManager

state_manager = StateManager(
    caching_policy={
        "strategy": "predictive_cache",
        "cache_size": "4GB",
        "eviction_policy": "feature_based_lru"
    },
    update_strategy={
        "type": "incremental",
        "frequency": "5m",
        "warm_up": True
    }
)

model = MLContainer(
    model_path="model.pkl",
    framework="tensorflow",
    state_manager=state_manager,
    dynamic_config={
        "feature_importance_tracking": True,
        "automatic_cache_tuning": True
    }
)

endpoint = model.deploy()
```
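As with ServerlessML, the API below is hypothetical; it sketches how cache_forward might look to a caller, with the request shape and the use_cache flag assumed for illustration.

```python
# cache_forward : (Container, Request) → Response
# The endpoint consults its predictive cache before running the model.
response = endpoint.predict(
    {"features": [0.4, 1.2, 3.1]},
    use_cache=True,   # hypothetical flag exposing cache_forward behavior
)
print(response)
```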
Pros:
- Intelligent caching reduces latency for common patterns
- State persistence improves warm start performance
- Dynamic optimization based on actual usage patterns
Cons:
- More complex deployment and management
- Higher memory requirements for state maintenance
- Potential consistency issues with distributed state
5. Future Directions
The core grammar also suggests where serving systems could go next:

```
DynamicResource ::= (
    GranularAllocation,
    ElasticScaling,
    CostAwareScheduling
)

SharedModelConfig ::= (
    CrossEndpointSharing,
    DynamicModelLoading,
    ResourcePooling
)

EnhancedMonitoring ::= (
    PredictiveAlerts,
    AutomaticDiagnosis,
    AdaptiveOptimization
)
```

6. Conclusion
This framework provides a way to understand and compare ML serving systems. While current platforms have significant differences, many reflect platform-specific constraints rather than fundamental requirements of the domain. Future systems can benefit from this analysis to provide more consistent and powerful abstractions for ML deployment.