The existing machine learning model serving infrastructure was a significant source of operational friction. Each model was wrapped in a bespoke Python Flask application, containerized, and manually deployed. This approach led to a cascade of predictable but severe problems: inconsistent logging formats, non-existent metrics, high memory consumption due to the Python runtime, and, most critically, deployments that required scheduled downtime. Promoting a new model version to production was a high-stakes, manual procedure involving stopping the old service and starting the new one, creating a window of unavailability. This was untenable.
Our objective was to engineer a replacement system governed by three core principles: resilience, performance, and operational simplicity. The system had to support automated, zero-downtime model updates. It needed a minimal resource footprint. And it had to be managed through a simple, declarative orchestration layer without incurring the massive operational overhead of a full-blown Kubernetes cluster, which was overkill for this workload.
The technology selection process was driven by these principles. For model lifecycle management, MLflow was the obvious choice. Its Model Registry provides a canonical source of truth for model versions and their lifecycle stages (e.g., Staging, Production), accessible via a clean REST API. For orchestration, Docker Swarm was selected for its simplicity and “good enough” feature set, including native support for rolling updates and health checks.
The most critical and unconventional decision was for the serving runtime itself. Instead of another Python service, we chose Go, specifically using the go-kit toolkit. The rationale was clear: Go provides compiled, single static binaries with a small memory footprint and excellent concurrency performance, making it ideal for a lean, high-throughput serving layer. Go-kit provides the boilerplate for production-grade microservices—structured logging, Prometheus metrics, distributed tracing, and transport layers (HTTP, gRPC)—enforcing a consistent, observable structure that was sorely lacking in the previous setup. This combination formed the backbone of our new architecture: MLflow for governance, Go-kit for high-performance serving, and Docker Swarm for simple, resilient deployment.
The Foundation: MLflow and Swarm Cluster Setup
Before writing a single line of serving code, the foundational infrastructure must be established. This consists of an MLflow tracking server and a Docker Swarm cluster. For a production-grade setup, MLflow requires a backend store for metadata (like PostgreSQL) and an artifact store for model files (like MinIO or S3).
Here is the docker-compose.yml used to deploy a self-contained MLflow instance via docker stack deploy. This is not a toy example; it includes a persistent backend database and a MinIO S3-compatible artifact store.
# mlflow-stack.yml
# Note: container_name is not supported by `docker stack deploy` and is omitted;
# Swarm names tasks itself (e.g., mlflow_postgres_db.1).
version: '3.7'
services:
  postgres_db:
    image: postgres:13
    environment:
      - POSTGRES_USER=mlflow
      - POSTGRES_PASSWORD=mlflow_password
      - POSTGRES_DB=mlflow_db
    volumes:
      - mlflow_db_data:/var/lib/postgresql/data
    networks:
      - mlflow-net
  minio:
    image: minio/minio:RELEASE.2022-03-17T06-34-49Z
    command: server /data --console-address ":9090"
    environment:
      - MINIO_ROOT_USER=minio_access_key
      - MINIO_ROOT_PASSWORD=minio_secret_key
    volumes:
      - mlflow_minio_data:/data
    networks:
      - mlflow-net
  mlflow_server:
    image: python:3.9-slim
    # Using a simple python image and installing mlflow at startup for flexibility
    command: >
      bash -c "pip install mlflow boto3 psycopg2-binary &&
      mlflow server
      --backend-store-uri postgresql://mlflow:mlflow_password@postgres_db:5432/mlflow_db
      --default-artifact-root s3://mlflow-artifacts/
      --host 0.0.0.0"
    ports:
      - "5000:5000"
    environment:
      - AWS_ACCESS_KEY_ID=minio_access_key
      - AWS_SECRET_ACCESS_KEY=minio_secret_key
      - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
    depends_on:
      - postgres_db
      - minio
    networks:
      - mlflow-net
volumes:
  mlflow_db_data:
    driver: local
  mlflow_minio_data:
    driver: local
networks:
  mlflow-net:
    driver: overlay
    attachable: true
Deploying this stack is straightforward on a Swarm manager node:
# Initialize swarm if not already done
docker swarm init
# Create the mlflow-artifacts bucket in MinIO (can be done via UI or mc client)
# ... manual one-time setup for bucket ...
# Deploy the stack
docker stack deploy -c mlflow-stack.yml mlflow
This provides a stable, network-isolated MLflow environment where our Go service can fetch model metadata and artifacts.
The Core Component: A Production-Grade Go-Kit Serving Service
The heart of this solution is the Go service. On startup, it queries the MLflow Model Registry for the “Production” version of a specified model, downloads the corresponding artifact, loads it into memory, and exposes an endpoint to serve predictions.
Project Structure
A typical go-kit service structure is used to separate concerns:
.
├── cmd
│   └── main.go              # Service entrypoint
├── pkg
│   ├── endpoint             # Defines service endpoints and request/response types
│   ├── logging              # Logging middleware setup
│   ├── metrics              # Prometheus metrics middleware setup
│   ├── model
│   │   ├── loader.go        # Interface for model loading
│   │   └── mlflow_loader.go # MLflow-specific implementation
│   ├── service.go           # Core business logic
│   └── transport            # HTTP/gRPC transport bindings
├── Dockerfile
└── go.mod
The Model Loader Abstraction
To keep the service decoupled from MLflow, we first define an interface for model loading. This allows for future implementations, such as loading from a local file path for testing or another model registry.
// pkg/model/loader.go
package model

import (
	"context"
	"io"
)

// Model represents a loaded machine learning model that can make predictions.
// For this example, we assume a simple interface. In reality, this would be
// tied to a specific inference engine like ONNX Runtime, TensorFlow, etc.
type Model interface {
	Predict(ctx context.Context, inputData map[string]interface{}) (map[string]interface{}, error)
}

// Loader is responsible for fetching and loading a specific model version.
type Loader interface {
	// LoadModel retrieves the specified model at a given stage (e.g., "Production")
	// and returns a reader to its artifact content and any relevant metadata.
	LoadModel(ctx context.Context, modelName string, modelStage string) (io.ReadCloser, map[string]string, error)
}
MLflow Loader Implementation
This is the most critical piece of integration. The implementation uses the MLflow REST API v2.0 to find the latest model version for a given stage and then uses an S3 client to download the artifact. In a real-world project, the code must be robust, with proper error handling, timeouts, and retries.
// pkg/model/mlflow_loader.go
package model

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"net/http"
	"net/url"
	"os"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
	"github.com/sony/gobreaker"
)
// MLflow API response structures
type latestVersionResponse struct {
	ModelVersions []struct {
		Name        string `json:"name"`
		Version     string `json:"version"`
		Source      string `json:"source"` // This is the S3 artifact path
		Status      string `json:"status"`
		Description string `json:"description"`
	} `json:"model_versions"`
}

// mlflowLoaderConfig holds all necessary configuration.
type mlflowLoaderConfig struct {
	MLflowTrackingURI string
	S3EndpointURL     string
	S3AccessKeyID     string
	S3SecretAccessKey string
	S3Region          string
	S3UseSSL          bool
	Logger            *slog.Logger
}

type mlflowLoader struct {
	client       *http.Client
	s3Downloader *s3manager.Downloader
	config       mlflowLoaderConfig
}
// NewMLflowLoader creates a robust loader with circuit breaker protection.
func NewMLflowLoader(config mlflowLoaderConfig) (Loader, error) {
	// A common mistake is not setting ForcePathStyle for S3-compatible storage like MinIO.
	sess, err := session.NewSession(&aws.Config{
		Credentials:      credentials.NewStaticCredentials(config.S3AccessKeyID, config.S3SecretAccessKey, ""),
		Endpoint:         aws.String(config.S3EndpointURL),
		Region:           aws.String(config.S3Region),
		S3ForcePathStyle: aws.Bool(true),
		DisableSSL:       aws.Bool(!config.S3UseSSL),
	})
	if err != nil {
		return nil, fmt.Errorf("failed to create S3 session: %w", err)
	}

	// Wrap HTTP client calls with a circuit breaker. This prevents the service from
	// endlessly retrying a down MLflow server during startup, allowing it to fail fast.
	cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:        "MLflowAPI",
		MaxRequests: 3,
		Timeout:     5 * time.Second,
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			return counts.ConsecutiveFailures > 3
		},
	})

	// This is a simplified way to apply the circuit breaker for this example.
	// In a full go-kit service, this would be endpoint middleware.
	protectedClient := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &circuitBreakerTransport{
			cb: cb,
			// http.Client.Transport is nil by default; fall back to the
			// default transport so RoundTrip never dereferences nil.
			next:   http.DefaultTransport,
			logger: config.Logger,
		},
	}

	return &mlflowLoader{
		client:       protectedClient,
		s3Downloader: s3manager.NewDownloader(sess),
		config:       config,
	}, nil
}
func (m *mlflowLoader) LoadModel(ctx context.Context, modelName string, modelStage string) (io.ReadCloser, map[string]string, error) {
	// 1. Query the MLflow Model Registry API to find the latest version in the requested stage.
	m.config.Logger.Info("querying MLflow for model version", "model", modelName, "stage", modelStage)
	endpoint := fmt.Sprintf("%s/api/2.0/mlflow/registered-models/get-latest-versions", m.config.MLflowTrackingURI)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, nil, fmt.Errorf("failed to create request: %w", err)
	}
	q := req.URL.Query()
	q.Add("name", modelName)
	q.Add("stages", modelStage) // The API accepts a list, but we request one.
	req.URL.RawQuery = q.Encode()

	resp, err := m.client.Do(req)
	if err != nil {
		return nil, nil, fmt.Errorf("MLflow API request failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return nil, nil, fmt.Errorf("MLflow API returned non-200 status: %d - %s", resp.StatusCode, string(body))
	}

	var apiResp latestVersionResponse
	if err := json.NewDecoder(resp.Body).Decode(&apiResp); err != nil {
		return nil, nil, fmt.Errorf("failed to decode MLflow API response: %w", err)
	}
	if len(apiResp.ModelVersions) == 0 {
		return nil, nil, fmt.Errorf("no model version found for model '%s' in stage '%s'", modelName, modelStage)
	}
	targetVersion := apiResp.ModelVersions[0]
	m.config.Logger.Info("found model version", "version", targetVersion.Version, "source", targetVersion.Source)
	// 2. Parse the S3 URI and download the artifact.
	artifactURL, err := url.Parse(targetVersion.Source)
	if err != nil {
		return nil, nil, fmt.Errorf("invalid artifact source URI: %w", err)
	}
	bucket := artifactURL.Host
	// The key is the path without its leading slash. Guard against an empty
	// path rather than slicing blindly.
	key := artifactURL.Path
	if len(key) > 0 && key[0] == '/' {
		key = key[1:]
	}

	// The model artifact might be a directory. MLflow typically includes an MLmodel file.
	// For simplicity, we assume the key points to the specific model file (e.g., model.onnx).
	// A more robust implementation would download the directory or parse the MLmodel file.
	modelFilePath := key

	// Create a temporary file to download the model to. *os.File implements
	// io.WriterAt, which lets the S3 downloader write chunks concurrently.
	tmpFile, err := os.CreateTemp("", "model-artifact-*.bin")
	if err != nil {
		return nil, nil, fmt.Errorf("failed to create temp file: %w", err)
	}

	m.config.Logger.Info("downloading artifact", "bucket", bucket, "key", modelFilePath)
	n, err := m.s3Downloader.Download(tmpFile, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(modelFilePath),
	})
	if err != nil {
		// Clean up the temp file on failure.
		tmpFile.Close()
		os.Remove(tmpFile.Name())
		return nil, nil, fmt.Errorf("failed to download artifact from S3: %w", err)
	}
	m.config.Logger.Info("artifact downloaded successfully", "size_bytes", n)

	// Reset the file pointer to the beginning for reading.
	if _, err := tmpFile.Seek(0, io.SeekStart); err != nil {
		tmpFile.Close()
		os.Remove(tmpFile.Name())
		return nil, nil, fmt.Errorf("failed to seek temp file: %w", err)
	}

	// The caller is responsible for closing the file. We wrap it so the temp
	// file is also deleted on close.
	wrappedCloser := &fileCloserWrapper{File: tmpFile}
	metadata := map[string]string{
		"model_name":    targetVersion.Name,
		"model_version": targetVersion.Version,
		"source_uri":    targetVersion.Source,
	}
	return wrappedCloser, metadata, nil
}
// --- Helper types for circuit breaker and file cleanup ---

type circuitBreakerTransport struct {
	cb     *gobreaker.CircuitBreaker
	next   http.RoundTripper
	logger *slog.Logger
}

func (t *circuitBreakerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	var resp *http.Response
	_, err := t.cb.Execute(func() (interface{}, error) {
		var rtErr error
		resp, rtErr = t.next.RoundTrip(req)
		if rtErr != nil {
			return nil, rtErr
		}
		// Consider 5xx responses as failures for the circuit breaker.
		if resp.StatusCode >= 500 {
			resp.Body.Close() // don't leak the body of a failed response
			return nil, fmt.Errorf("server error: %s", resp.Status)
		}
		return nil, nil
	})
	if err != nil {
		t.logger.Error("circuit breaker rejected request", "error", err)
		return nil, err
	}
	return resp, nil
}

type fileCloserWrapper struct {
	*os.File
}

func (w *fileCloserWrapper) Close() error {
	err := w.File.Close()
	// Best-effort removal of the temp file after it's closed.
	os.Remove(w.Name())
	return err
}
This loader is now a robust, self-contained component for fetching our production model. The inclusion of a circuit breaker on the MLflow API calls is a crucial production-readiness feature, preventing a transient MLflow outage from cascading into a thundering herd of retries from restarting service instances.
The Deployment Strategy: Zero-Downtime Rolling Updates
With the service logic defined, the final piece is the deployment manifest for Docker Swarm. This file defines the service, its resource constraints, health checks, and—most importantly—the update configuration that enables zero-downtime rollouts.
The update flow is visualized as follows:
sequenceDiagram
    participant User
    participant MLflow Registry
    participant CI/CD
    participant Docker Swarm
    participant Service v1
    participant Service v2
    User->>MLflow Registry: Promote model version 2 to 'Production' stage
    MLflow Registry->>CI/CD: Webhook triggers pipeline
    CI/CD->>Docker Swarm: Executes `docker service update --force model-server`
    Docker Swarm->>Service v2: Creates a new container instance
    Service v2->>MLflow Registry: Queries for 'Production' model
    MLflow Registry-->>Service v2: Responds with version 2 details
    Service v2->>S3 Artifact Store: Downloads model version 2
    Service v2-->>Docker Swarm: Reports 'healthy' status
    Note over Docker Swarm: New task is healthy, proceeding with update.
    Docker Swarm->>Service v1: Stops the old container instance
This entire flow is orchestrated by the docker-compose.yml for the service stack.
# model-server-stack.yml
version: '3.7'
services:
  model-server:
    image: my-registry/model-server:latest # The image is assumed to be built and pushed by CI
    networks:
      - mlflow-net # Connect to the same network as MLflow for communication
    environment:
      # --- Service Configuration ---
      - MODEL_NAME=my-fraud-detector
      - MODEL_STAGE=Production
      # --- MLflow Loader Configuration ---
      - MLFLOW_TRACKING_URI=http://mlflow_server:5000
      - AWS_ACCESS_KEY_ID=minio_access_key
      - AWS_SECRET_ACCESS_KEY=minio_secret_key
      - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
    deploy:
      replicas: 3
      update_config:
        # One-at-a-time update: with start-first, a new task is started and must
        # pass its health check before the old task it replaces is stopped.
        parallelism: 1
        # Wait 15 seconds before updating the next replica. Gives time for metrics to settle.
        delay: 15s
        order: start-first
        failure_action: rollback
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    healthcheck:
      # A real healthcheck would hit an actual /health endpoint in the Go service.
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 20s # Give the service time to download and load the model.
networks:
  mlflow-net:
    external: true
    name: mlflow_mlflow-net
The key to zero-downtime deployment lies in the deploy.update_config section:

- parallelism: 1 — Swarm will update only one replica at a time.
- delay: 15s — After a replica is updated successfully, Swarm waits before starting the next one.
- order: start-first — Swarm starts the new task and waits for it to pass its health check before stopping the old task. This ensures service capacity is never reduced during the update.
- healthcheck — This is non-negotiable. Swarm uses the health check to determine whether an updated task is “healthy.” Without a proper health check, Swarm cannot know if the new version with the new model has started successfully. The start_period is critical for models that take time to download and load, preventing Swarm from killing the container prematurely.
When a data scientist promotes a new model to “Production” in the MLflow UI, a CI/CD pipeline is triggered. The pipeline’s only job is to run docker service update --force model-server. The --force flag tells Swarm to re-pull the image and restart the tasks even if the service definition hasn’t changed, triggering the rolling update process. Each new container will then independently execute the startup logic: query MLflow, fetch the new production model, load it, and report as healthy. The entire process is automated, safe, and incurs zero downtime.
This architecture has a few remaining limitations to consider. The model loading process at service startup introduces a latency penalty to container start times. For exceptionally large models (many gigabytes), this could exceed reasonable health check windows. A future iteration might involve a dedicated sidecar container that pre-fetches model artifacts to a shared volume, allowing the main service container to start almost instantly. Furthermore, the service’s startup is tightly coupled to the availability of the MLflow server and the S3 artifact store. While the circuit breaker mitigates cascading failures, a more advanced solution could involve caching the last known good model on the host node, allowing the service to start with a slightly stale model if the registry is unreachable, thus prioritizing availability over model freshness in failure scenarios.