The existing machine learning model serving infrastructure was a significant source of operational friction. Each model was wrapped in a bespoke Python Flask application, containerized, and manually deployed. This approach led to a cascade of predictable but severe problems: inconsistent logging formats, non-existent metrics, high memory consumption due to the Python runtime, and, most critically, deployments that required scheduled downtime. Promoting a new model version to production was a high-stakes, manual procedure involving stopping the old service and starting the new one, creating a window of unavailability. This was untenable.
Our objective was to engineer a replacement system governed by three core principles: resilience, performance, and operational simplicity. The system had to support automated, zero-downtime model updates. It needed a minimal resource footprint. And it had to be managed through a simple, declarative orchestration layer without incurring the massive operational overhead of a full-blown Kubernetes cluster, which was overkill for this workload.
The technology selection process was driven by these principles. For model lifecycle management, MLflow was the obvious choice. Its Model Registry provides a canonical source of truth for model versions and their lifecycle stages (e.g., Staging, Production), accessible via a clean REST API. For orchestration, Docker Swarm was selected for its simplicity and “good enough” feature set, including native support for rolling updates and health checks.
The most critical and unconventional decision was for the serving runtime itself. Instead of another Python service, we chose Go, specifically using the go-kit toolkit. The rationale was clear: Go provides compiled, single static binaries with a small memory footprint and excellent concurrency performance, making it ideal for a lean, high-throughput serving layer. Go-kit provides the boilerplate for production-grade microservices—structured logging, Prometheus metrics, distributed tracing, and transport layers (HTTP, gRPC)—enforcing a consistent, observable structure that was sorely lacking in the previous setup. This combination formed the backbone of our new architecture: MLflow for governance, Go-kit for high-performance serving, and Docker Swarm for simple, resilient deployment.
The Foundation: MLflow and Swarm Cluster Setup
Before writing a single line of serving code, the foundational infrastructure must be established. This consists of an MLflow tracking server and a Docker Swarm cluster. For a production-grade setup, MLflow requires a backend store for metadata (like PostgreSQL) and an artifact store for model files (like MinIO or S3).
Here is the docker-compose.yml used to deploy a self-contained MLflow instance via docker stack deploy. This is not a toy example; it includes a persistent backend database and a MinIO S3-compatible artifact store.
# mlflow-stack.yml
# Note: container_name is not supported by `docker stack deploy` and is omitted;
# Swarm names tasks itself (e.g., mlflow_postgres_db.1).
version: '3.7'
services:
  postgres_db:
    image: postgres:13
    environment:
      - POSTGRES_USER=mlflow
      - POSTGRES_PASSWORD=mlflow_password
      - POSTGRES_DB=mlflow_db
    volumes:
      - mlflow_db_data:/var/lib/postgresql/data
    networks:
      - mlflow-net
  minio:
    image: minio/minio:RELEASE.2022-03-17T06-34-49Z
    command: server /data --console-address ":9090"
    environment:
      - MINIO_ROOT_USER=minio_access_key
      - MINIO_ROOT_PASSWORD=minio_secret_key
    volumes:
      - mlflow_minio_data:/data
    networks:
      - mlflow-net
  mlflow_server:
    image: python:3.9-slim
    # Using a simple python image and installing mlflow at startup for flexibility
    command: >
      bash -c "pip install mlflow boto3 psycopg2-binary &&
      mlflow server
      --backend-store-uri postgresql://mlflow:mlflow_password@postgres_db:5432/mlflow_db
      --default-artifact-root s3://mlflow-artifacts/
      --host 0.0.0.0"
    ports:
      - "5000:5000"
    environment:
      - AWS_ACCESS_KEY_ID=minio_access_key
      - AWS_SECRET_ACCESS_KEY=minio_secret_key
      - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
    depends_on:
      - postgres_db
      - minio
    networks:
      - mlflow-net
volumes:
  mlflow_db_data:
    driver: local
  mlflow_minio_data:
    driver: local
networks:
  mlflow-net:
    driver: overlay
    attachable: true
Deploying this stack is straightforward on a Swarm manager node:
# Initialize swarm if not already done
docker swarm init
# Create the mlflow-artifacts bucket in MinIO (can be done via UI or mc client)
# ... manual one-time setup for bucket ...
# Deploy the stack
docker stack deploy -c mlflow-stack.yml mlflow
This provides a stable, network-isolated MLflow environment where our Go service can fetch model metadata and artifacts.
The Core Component: A Production-Grade Go-Kit Serving Service
The heart of this solution is the Go service. On startup, it queries the MLflow Model Registry for the “Production” version of a specified model, downloads the corresponding artifact, loads it into memory, and exposes an endpoint to serve predictions.
Project Structure
A typical go-kit service structure is used to separate concerns:
.
├── cmd
│   └── main.go              # Service entrypoint
├── pkg
│   ├── endpoint             # Defines service endpoints and request/response types
│   ├── logging              # Logging middleware setup
│   ├── metrics              # Prometheus metrics middleware setup
│   ├── model
│   │   ├── loader.go        # Interface for model loading
│   │   └── mlflow_loader.go # MLflow-specific implementation
│   ├── service.go           # Core business logic
│   └── transport            # HTTP/gRPC transport bindings
├── Dockerfile
└── go.mod
The Model Loader Abstraction
To keep the service decoupled from MLflow, we first define an interface for model loading. This allows for future implementations, such as loading from a local file path for testing or another model registry.
// pkg/model/loader.go
package model

import (
	"context"
	"io"
)

// Model represents a loaded machine learning model that can make predictions.
// For this example, we assume a simple interface. In reality, this would be
// tied to a specific inference engine like ONNX Runtime, TensorFlow, etc.
type Model interface {
	Predict(ctx context.Context, inputData map[string]interface{}) (map[string]interface{}, error)
}

// Loader is responsible for fetching and loading a specific model version.
type Loader interface {
	// LoadModel retrieves the specified model at a given stage (e.g., "Production")
	// and returns a reader to its artifact content and any relevant metadata.
	LoadModel(ctx context.Context, modelName string, modelStage string) (io.ReadCloser, map[string]string, error)
}
MLflow Loader Implementation
This is the most critical piece of integration. The implementation uses the MLflow REST API v2.0 to find the latest model version for a given stage and then uses an S3 client to download the artifact. In a real-world project, the code must be robust, with proper error handling, timeouts, and retries.
// pkg/model/mlflow_loader.go
package model

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"net/http"
	"net/url"
	"os"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
	"github.com/sony/gobreaker"
)
// MLflow API response structures
type latestVersionResponse struct {
	ModelVersions []struct {
		Name        string `json:"name"`
		Version     string `json:"version"`
		Source      string `json:"source"` // This is the S3 artifact path
		Status      string `json:"status"`
		Description string `json:"description"`
	} `json:"model_versions"`
}

// mlflowLoaderConfig holds all necessary configuration.
type mlflowLoaderConfig struct {
	MLflowTrackingURI string
	S3EndpointURL     string
	S3AccessKeyID     string
	S3SecretAccessKey string
	S3Region          string
	S3UseSSL          bool
	Logger            *slog.Logger
}

type mlflowLoader struct {
	client       *http.Client
	s3Downloader *s3manager.Downloader
	config       mlflowLoaderConfig
}
// NewMLflowLoader creates a robust loader with circuit breaker protection.
func NewMLflowLoader(config mlflowLoaderConfig) (Loader, error) {
	// A common mistake is not setting ForcePathStyle for S3-compatible storage like MinIO.
	sess, err := session.NewSession(&aws.Config{
		Credentials:      credentials.NewStaticCredentials(config.S3AccessKeyID, config.S3SecretAccessKey, ""),
		Endpoint:         aws.String(config.S3EndpointURL),
		Region:           aws.String(config.S3Region),
		S3ForcePathStyle: aws.Bool(true),
		DisableSSL:       aws.Bool(!config.S3UseSSL),
	})
	if err != nil {
		return nil, fmt.Errorf("failed to create S3 session: %w", err)
	}

	// Wrap HTTP client calls with a circuit breaker. This prevents the service from
	// endlessly retrying a down MLflow server during startup, allowing it to fail fast.
	cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:        "MLflowAPI",
		MaxRequests: 3,
		Timeout:     5 * time.Second,
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			return counts.ConsecutiveFailures > 3
		},
	})

	// This is a simplified way to apply the circuit breaker for this example.
	// In a full go-kit service, this would be endpoint middleware.
	protectedClient := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &circuitBreakerTransport{
			cb: cb,
			// http.Client.Transport is nil by default; fall back to the
			// default transport so RoundTrip never dereferences nil.
			next:   http.DefaultTransport,
			logger: config.Logger,
		},
	}

	return &mlflowLoader{
		client:       protectedClient,
		s3Downloader: s3manager.NewDownloader(sess),
		config:       config,
	}, nil
}
func (m *mlflowLoader) LoadModel(ctx context.Context, modelName string, modelStage string) (io.ReadCloser, map[string]string, error) {
	// 1. Query the MLflow Model Registry API to find the latest version in the requested stage.
	m.config.Logger.Info("querying MLflow for model version", "model", modelName, "stage", modelStage)
	endpoint := fmt.Sprintf("%s/api/2.0/mlflow/registered-models/get-latest-versions", m.config.MLflowTrackingURI)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, nil, fmt.Errorf("failed to create request: %w", err)
	}
	q := req.URL.Query()
	q.Add("name", modelName)
	q.Add("stages", modelStage) // The API accepts a list, but we request one.
	req.URL.RawQuery = q.Encode()

	resp, err := m.client.Do(req)
	if err != nil {
		return nil, nil, fmt.Errorf("MLflow API request failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return nil, nil, fmt.Errorf("MLflow API returned non-200 status: %d - %s", resp.StatusCode, string(body))
	}

	var apiResp latestVersionResponse
	if err := json.NewDecoder(resp.Body).Decode(&apiResp); err != nil {
		return nil, nil, fmt.Errorf("failed to decode MLflow API response: %w", err)
	}
	if len(apiResp.ModelVersions) == 0 {
		return nil, nil, fmt.Errorf("no model version found for model '%s' in stage '%s'", modelName, modelStage)
	}
	targetVersion := apiResp.ModelVersions[0]
	m.config.Logger.Info("found model version", "version", targetVersion.Version, "source", targetVersion.Source)
	// 2. Parse the S3 URI and download the artifact.
	artifactURL, err := url.Parse(targetVersion.Source)
	if err != nil {
		return nil, nil, fmt.Errorf("invalid artifact source URI: %w", err)
	}
	bucket := artifactURL.Host
	// The key is the path without its leading slash. Guard against an empty
	// path rather than slicing blindly.
	key := artifactURL.Path
	if len(key) > 0 && key[0] == '/' {
		key = key[1:]
	}

	// The model artifact might be a directory. MLflow typically includes an MLmodel file.
	// For simplicity, we assume the key points to the specific model file (e.g., model.onnx).
	// A more robust implementation would download the directory or parse the MLmodel file.
	modelFilePath := key

	// Create a temporary file to download the model to. *os.File implements
	// io.WriterAt, which lets the S3 downloader write chunks concurrently.
	tmpFile, err := os.CreateTemp("", "model-artifact-*.bin")
	if err != nil {
		return nil, nil, fmt.Errorf("failed to create temp file: %w", err)
	}

	m.config.Logger.Info("downloading artifact", "bucket", bucket, "key", modelFilePath)
	n, err := m.s3Downloader.Download(tmpFile, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(modelFilePath),
	})
	if err != nil {
		// Clean up the temp file on failure.
		tmpFile.Close()
		os.Remove(tmpFile.Name())
		return nil, nil, fmt.Errorf("failed to download artifact from S3: %w", err)
	}
	m.config.Logger.Info("artifact downloaded successfully", "size_bytes", n)

	// Reset the file pointer to the beginning for reading.
	if _, err := tmpFile.Seek(0, io.SeekStart); err != nil {
		tmpFile.Close()
		os.Remove(tmpFile.Name())
		return nil, nil, fmt.Errorf("failed to seek temp file: %w", err)
	}

	// The caller is responsible for closing the file. We wrap it so the temp
	// file is also deleted on close.
	wrappedCloser := &fileCloserWrapper{File: tmpFile}
	metadata := map[string]string{
		"model_name":    targetVersion.Name,
		"model_version": targetVersion.Version,
		"source_uri":    targetVersion.Source,
	}
	return wrappedCloser, metadata, nil
}
// --- Helper types for circuit breaker and file cleanup ---

type circuitBreakerTransport struct {
	cb     *gobreaker.CircuitBreaker
	next   http.RoundTripper
	logger *slog.Logger
}

func (t *circuitBreakerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	var resp *http.Response
	_, err := t.cb.Execute(func() (interface{}, error) {
		var rtErr error
		resp, rtErr = t.next.RoundTrip(req)
		if rtErr != nil {
			return nil, rtErr
		}
		// Consider 5xx responses as failures for the circuit breaker.
		if resp.StatusCode >= 500 {
			resp.Body.Close() // don't leak the body of a failed response
			return nil, fmt.Errorf("server error: %s", resp.Status)
		}
		return nil, nil
	})
	if err != nil {
		t.logger.Error("circuit breaker rejected request", "error", err)
		return nil, err
	}
	return resp, nil
}

type fileCloserWrapper struct {
	*os.File
}

func (w *fileCloserWrapper) Close() error {
	err := w.File.Close()
	// Best-effort removal of the temp file after it's closed.
	os.Remove(w.Name())
	return err
}
This loader is now a robust, self-contained component for fetching our production model. The inclusion of a circuit breaker on the MLflow API calls is a crucial production-readiness feature, preventing a transient MLflow outage from cascading into a thundering herd of retries from restarting service instances.
The Deployment Strategy: Zero-Downtime Rolling Updates
With the service logic defined, the final piece is the deployment manifest for Docker Swarm. This file defines the service, its resource constraints, health checks, and—most importantly—the update configuration that enables zero-downtime rollouts.
The update flow is visualized as follows:
sequenceDiagram
    participant User
    participant MLflow Registry
    participant CI/CD
    participant Docker Swarm
    participant Service v1
    participant Service v2
    User->>MLflow Registry: Promote model version 2 to 'Production' stage
    MLflow Registry->>CI/CD: Webhook triggers pipeline
    CI/CD->>Docker Swarm: Executes `docker service update --force model-server`
    Docker Swarm->>Service v2: Creates a new container instance
    Service v2->>MLflow Registry: Queries for 'Production' model
    MLflow Registry-->>Service v2: Responds with version 2 details
    Service v2->>S3 Artifact Store: Downloads model version 2
    Service v2-->>Docker Swarm: Reports 'healthy' status
    Note over Docker Swarm: New task is healthy, proceeding with update.
    Docker Swarm->>Service v1: Stops the old container instance
This entire flow is orchestrated by the docker-compose.yml for the service stack.
# model-server-stack.yml
version: '3.7'
services:
  model-server:
    image: my-registry/model-server:latest # The image is assumed to be built and pushed by CI
    networks:
      - mlflow-net # Connect to the same network as MLflow for communication
    environment:
      # --- Service Configuration ---
      - MODEL_NAME=my-fraud-detector
      - MODEL_STAGE=Production
      # --- MLflow Loader Configuration ---
      - MLFLOW_TRACKING_URI=http://mlflow_server:5000
      - AWS_ACCESS_KEY_ID=minio_access_key
      - AWS_SECRET_ACCESS_KEY=minio_secret_key
      - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
    deploy:
      replicas: 3
      update_config:
        # One-at-a-time update: with start-first, a new task is started and must
        # pass its health check before the old task it replaces is stopped.
        parallelism: 1
        # Wait 15 seconds before updating the next replica. Gives time for metrics to settle.
        delay: 15s
        order: start-first
        failure_action: rollback
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    healthcheck:
      # A real healthcheck would hit an actual /health endpoint in the Go service.
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 20s # Give the service time to download and load the model.
networks:
  mlflow-net:
    external: true
    name: mlflow_mlflow-net
The key to zero-downtime deployment lies in the deploy.update_config section:

- parallelism: 1 — Swarm will update only one replica at a time.
- delay: 15s — After a replica is updated successfully, Swarm waits before starting the next one.
- order: start-first — Swarm starts the new task and waits for it to pass its health check before stopping the old task. This ensures service capacity is never reduced during the update.
- healthcheck — This is non-negotiable. Swarm uses the health check to determine whether an updated task is “healthy.” Without a proper health check, Swarm cannot know if the new version with the new model has started successfully. The start_period is critical for models that take time to download and load, preventing Swarm from killing the container prematurely.
When a data scientist promotes a new model to “Production” in the MLflow UI, a CI/CD pipeline is triggered. The pipeline’s only job is to run docker service update --force model-server. The --force flag tells Swarm to re-pull the image and restart the tasks even if the service definition hasn’t changed, triggering the rolling update process. Each new container will then independently execute the startup logic: query MLflow, fetch the new production model, load it, and report as healthy. The entire process is automated, safe, and incurs zero downtime.
This architecture has a few remaining limitations to consider. The model loading process at service startup introduces a latency penalty to container start times. For exceptionally large models (many gigabytes), this could exceed reasonable health check windows. A future iteration might involve a dedicated sidecar container that pre-fetches model artifacts to a shared volume, allowing the main service container to start almost instantly. Furthermore, the service’s startup is tightly coupled to the availability of the MLflow server and the S3 artifact store. While the circuit breaker mitigates cascading failures, a more advanced solution could involve caching the last known good model on the host node, allowing the service to start with a slightly stale model if the registry is unreachable, thus prioritizing availability over model freshness in failure scenarios.