Implementing End-to-End ML Model Traceability with OpenTelemetry Across Spinnaker and Argo CD


The incident that broke our MLOps process began with a subtle alert. P99 latency for our Go-based image recognition service had spiked by 300ms. Simultaneously, the model’s prediction confidence score for a key category dropped below our business SLO. We were serving a degraded model to production users, but the “why” was a complete black box. The on-call engineer knew the deployment had happened hours ago via our Spinnaker pipeline, but that was it. Which of the dozen model candidate experiments from the data science team had been promoted? What version of the training dataset was it built with? Which commit triggered this entire cascade? It took six hours of manual archaeology—digging through Spinnaker logs, git histories, and S3 artifact buckets—to trace the problem back to a corrupted pre-processing step in a single training run. The process was untenable.

The core problem was a lack of continuity. Our systems were islands. Spinnaker orchestrated the CI/CD pipeline, kicking off a Keras training job in a Kubernetes pod. The resulting model was baked into a Go service container and pushed to our registry. Finally, a manifest update in our GitOps repository triggered an Argo CD sync. Each step was automated, but the context connecting them was lost at every boundary. We decided to treat the entire model lifecycle, from git commit to production inference, as a single, distributed transaction that could be traced. The tool for this had to be OpenTelemetry, leveraging its standardized context propagation to weave a single thread through our entire heterogeneous stack.

Our architecture already used Spinnaker for its powerful and flexible pipeline orchestration capabilities, which are ideal for the multi-stage, conditional logic of MLOps (e.g., run integration tests, trigger training, run model evaluation, promote if accuracy > 95%). We also relied on Argo CD for its strict GitOps, declarative approach to Kubernetes deployments, providing an auditable and consistent state for our production environment. Using them together is a common pattern: Spinnaker acts as the imperative “conductor” that makes decisions and prepares the release, while Argo CD is the declarative “enforcer” that ensures the state defined in Git is what’s running on the cluster. The challenge was to bridge these two paradigms with a consistent trace context.

The last stop for our trace is the Go service that serves the model. In a real-world project, this service must be instrumented for observability from day one. It needs to not only handle inference requests but also propagate trace context and enrich spans with relevant metadata.

First, we establish the OpenTelemetry provider setup. This is boilerplate but crucial for production-grade code. It needs to be configurable, handle shutdown gracefully, and connect to an OTLP endpoint.

// file: internal/telemetry/provider.go
package telemetry

import (
	"context"
	"errors"
	"os"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// InitTracerProvider initializes and registers a new OpenTelemetry TracerProvider.
// It configures an OTLP exporter and sets up a resource with service information.
// A shutdown function is returned to gracefully flush and close the provider.
func InitTracerProvider() (func(context.Context), error) {
	ctx := context.Background()

	// In a real application, the endpoint would come from configuration.
	otelAgentAddr, ok := os.LookupEnv("OTEL_EXPORTER_OTLP_ENDPOINT")
	if !ok {
		otelAgentAddr = "0.0.0.0:4317"
	}

	traceExporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithInsecure(), // Use secure connection in production.
		otlptracegrpc.WithEndpoint(otelAgentAddr),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to create OTLP trace exporter: %w", err)
	}

	// Service name and version are critical for identifying telemetry data.
	// These should be injected at build time.
	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceName("inference-service"),
			semconv.ServiceVersion("1.0.0"),
			attribute.String("environment", "production"),
		),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to create telemetry resource: %w", err)
	}

	bsp := sdktrace.NewBatchSpanProcessor(traceExporter)
	tracerProvider := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.AlwaysSample()), // In production, prefer sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1)).
		sdktrace.WithResource(res),
		sdktrace.WithSpanProcessor(bsp),
	)

	otel.SetTracerProvider(tracerProvider)

	// Set the W3C Trace Context propagator as the global propagator.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))

	// The returned function handles graceful shutdown.
	return func(shutdownCtx context.Context) {
		// Do not make the application hang when it is shutdown.
		ctx, cancel := context.WithTimeout(shutdownCtx, 5*time.Second)
		defer cancel()
		if err := tracerProvider.Shutdown(ctx); err != nil {
			otel.Handle(err)
		}
	}, nil
}

With the provider configured, we need an HTTP middleware to automatically extract incoming trace contexts and create spans for each request. This isolates the tracing logic from the business logic in our handlers.

// file: internal/server/middleware.go
package server

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
)

// telemetryMiddleware wraps the otelhttp.NewHandler to provide standardized tracing.
// It creates a new span for each incoming request, extracting parent context
// from headers if available.
func telemetryMiddleware(next http.Handler, operation string) http.Handler {
	// otelhttp provides a robust, pre-built handler that follows semantic conventions.
	// It extracts context, creates a span, and adds standard HTTP attributes.
	return otelhttp.NewHandler(next, operation,
		otelhttp.WithTracerProvider(otel.GetTracerProvider()),
		otelhttp.WithPropagators(otel.GetTextMapPropagator()),
	)
}

The core prediction handler now only needs to focus on its task: performing inference and adding model-specific attributes to the span that the middleware already created.

// file: internal/server/handlers.go
package server

import (
	"encoding/json"
	"net/http"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

type PredictionHandler struct {
	ModelID string // The ID of the currently loaded model.
}

type PredictionRequest struct {
	ImageData string `json:"image_data"` // Base64 encoded image data
}

type PredictionResponse struct {
	Class       string  `json:"class"`
	Confidence  float32 `json:"confidence"`
	ModelID     string  `json:"model_id"`
}

func (h *PredictionHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// The middleware has already started a span and put it in the request context.
	span := trace.SpanFromContext(r.Context())

	// Add model-specific attributes to the existing span. This is crucial for debugging.
	span.SetAttributes(
		attribute.String("ml.model.id", h.ModelID),
		attribute.String("ml.model.type", "image_recognition"),
	)
	
	// ... (request decoding and validation logic)

	// Simulate inference. In a real service, this would call the Keras model via TF Serving or an embedded library.
	// We add an event to the span to mark the start and end of the core logic.
	span.AddEvent("Starting model inference")
	prediction, confidence := h.runInference()
	span.AddEvent("Model inference complete")

	span.SetAttributes(
		attribute.String("ml.prediction.class", prediction),
		attribute.Float64("ml.prediction.confidence", float64(confidence)),
	)
	
	resp := PredictionResponse{
		Class:      prediction,
		Confidence: confidence,
		ModelID:    h.ModelID,
	}

	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(resp)
}

// runInference is a placeholder for the actual model execution.
func (h *PredictionHandler) runInference() (string, float32) {
	// Simulate some work
	// time.Sleep(50 * time.Millisecond)
	return "cat", 0.987
}

// In main.go (package main; imports elided for brevity), we read the model ID
// from an environment variable set by the deployment and wire up the handler.
func main() {
	// ... flag and config setup
	shutdown, err := telemetry.InitTracerProvider()
	if err != nil {
		log.Fatalf("failed to initialize tracing: %v", err)
	}
	defer shutdown(context.Background())

	modelID := os.Getenv("MODEL_ID")
	if modelID == "" {
		modelID = "unknown"
	}

	handler := &PredictionHandler{ModelID: modelID}
	http.Handle("/predict", telemetryMiddleware(handler, "predict"))

	// ... start the HTTP server, e.g. http.ListenAndServe(":8080", nil)
}

Instrumenting the Keras Training Job

The next challenge was pushing observability upstream into the training process itself. A common mistake is to treat the training job as a black box. But the metadata from that run—hyperparameters, dataset version, final accuracy—is gold during an incident investigation. We needed our Python training script to participate in the distributed trace.

The key is to pass the traceparent header value, generated by Spinnaker, into the training container as an environment variable (TRACEPARENT). The Python script then uses the OpenTelemetry SDK to extract this context and create a child span for the training job.

# file: train.py
import os
import time
import random
import tensorflow as tf
from tensorflow import keras

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.propagate import extract
from opentelemetry.semconv.resource import ResourceAttributes

# In a real project, this would be more robust.
def setup_otel_tracer():
    """Sets up and configures the OpenTelemetry tracer."""
    resource = Resource(attributes={
        ResourceAttributes.SERVICE_NAME: "keras-training-job"
    })
    
    # Configure the OTLP exporter
    # Endpoint should be configured via environment variable for flexibility.
    otel_exporter = OTLPSpanExporter(
        endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "otel-collector:4317"),
        insecure=True  # Use secure connections in production.
    )

    provider = TracerProvider(resource=resource)
    processor = BatchSpanProcessor(otel_exporter)
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)

    return trace.get_tracer(__name__)

def train_model(tracer, parent_context):
    """
    Trains a simple Keras model and records telemetry about the process.
    The parent_context is extracted from the environment variables passed by Spinnaker.
    """
    with tracer.start_as_current_span("keras-training-run", context=parent_context) as span:
        # These attributes are essential for traceability.
        dataset_version = os.getenv("DATASET_VERSION", "v1.2.0")
        learning_rate = float(os.getenv("LEARNING_RATE", "0.001"))
        epochs = int(os.getenv("EPOCHS", "10"))
        
        span.set_attribute("ml.dataset.version", dataset_version)
        span.set_attribute("ml.hyperparams.learning_rate", learning_rate)
        span.set_attribute("ml.hyperparams.epochs", epochs)

        print(f"Starting training with dataset {dataset_version}...")

        # Dummy model training
        (x_train, y_train), _ = keras.datasets.mnist.load_data()
        x_train = x_train.astype("float32") / 255.0
        
        model = keras.Sequential([
            keras.layers.Flatten(input_shape=(28, 28)),
            keras.layers.Dense(128, activation="relu"),
            keras.layers.Dense(10, activation="softmax"),
        ])
        
        model.compile(
            optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"],
        )

        span.add_event("Training started")
        history = model.fit(x_train, y_train, epochs=epochs, batch_size=64, validation_split=0.1)
        span.add_event("Training complete")

        final_accuracy = history.history["val_accuracy"][-1]
        final_loss = history.history["val_loss"][-1]

        print(f"Final validation accuracy: {final_accuracy:.4f}")

        # Record the final metrics on the span.
        span.set_attribute("ml.metric.accuracy", final_accuracy)
        span.set_attribute("ml.metric.loss", final_loss)

        # The model ID is critical for linking this training run to the deployed artifact.
        model_id = f"mnist-model-{int(time.time())}"
        model_path = f"/mnt/models/{model_id}"
        model.save(model_path)

        # This ID must be passed to the next stage of the pipeline.
        print(f"MODEL_ID={model_id}") 
        span.set_attribute("ml.model.id", model_id)
        span.set_attribute("ml.model.path", f"s3://my-models-bucket/{model_id}") # In reality, you'd upload it.


if __name__ == "__main__":
    tracer = setup_otel_tracer()

    # This is the critical link: extract the context passed by the orchestrator.
    # Spinnaker sets the TRACEPARENT environment variable, but the W3C propagator
    # looks for a lowercase "traceparent" key, so we build the carrier explicitly.
    carrier = {"traceparent": os.environ.get("TRACEPARENT", "")}
    parent_context = extract(carrier)

    train_model(tracer, parent_context)

    # A common pitfall is assuming the script exits and flushes all spans.
    # Flush only after the training span has ended so it is actually exported.
    trace.get_tracer_provider().force_flush()

Weaving the Thread Through Spinnaker and Argo CD

This is where the implementation gets complex. We need to propagate context across the boundary of two very different systems.

1. Spinnaker Pipeline Configuration

Our Spinnaker pipeline now has a clear flow:

  1. Trigger (Git): A commit to our model training repository starts the pipeline.
  2. Generate Context: A “Run Job (Manifest)” stage runs a tiny container that simply generates a traceparent string and writes it to a Spinnaker context file. A simpler but less robust method is a Spinnaker pipeline expression (SpEL). A minimal sketch of such a container follows this list.
  3. Train Model: A “Run Job (Manifest)” stage deploys the train.py container. The Kubernetes manifest for this job is configured to pass the traceparent from the previous stage as the TRACEPARENT environment variable. It also passes other parameters like the dataset version.
  4. Build Service: After training, a subsequent stage builds the Go inference service Docker image, baking in the trained model artifact (identified by the MODEL_ID output from the training job).
  5. Update GitOps Repo: This is the hand-off. A final stage in Spinnaker checks out our Argo CD deployment repository, updates the image tag in the deployment.yaml to point to the new container, and—this is the key—adds the traceparent as an annotation. It then commits and pushes this change.
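
For step 2, here is a minimal sketch of what that context-generation container could run. This is an illustration, not our exact job: it assumes the stage captures the container’s stdout into the stage context so later stages can reference the TRACEPARENT value.

// file: cmd/traceparent-gen/main.go (illustrative sketch)
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"log"
)

// main emits a W3C traceparent string (version 00, sampled flag 01) built from
// freshly generated random trace and span IDs. Downstream stages pass this value
// to the training job and, eventually, into the GitOps commit.
func main() {
	traceID := make([]byte, 16) // 128-bit trace ID
	spanID := make([]byte, 8)   // 64-bit parent span ID

	if _, err := rand.Read(traceID); err != nil {
		log.Fatalf("failed to generate trace ID: %v", err)
	}
	if _, err := rand.Read(spanID); err != nil {
		log.Fatalf("failed to generate span ID: %v", err)
	}

	// Format: version-traceid-parentid-flags, e.g.
	// 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
	fmt.Printf("TRACEPARENT=00-%s-%s-01\n",
		hex.EncodeToString(traceID), hex.EncodeToString(spanID))
}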

Here is what the Spinnaker stage for updating the GitOps repo would conceptually do. In practice, this is a “Run Job” stage executing a script.

# Script executed by Spinnaker to update GitOps repo
#!/bin/bash
set -e

GIT_REPO_URL="[email protected]:my-org/inference-service-deploy.git"
MODEL_ID="$1"
NEW_IMAGE_TAG="$2"
TRACEPARENT="$3" # Passed in from Spinnaker context

git clone $GIT_REPO_URL
cd inference-service-deploy

# Use kustomize or yq to patch the deployment manifest. yq (v4) is shown for clarity.
yq -i ".spec.template.metadata.annotations[\"orchestration.trace_id\"] = \"$TRACEPARENT\"" k8s/deployment.yaml
yq -i ".spec.template.spec.containers[0].image = \"my-registry/inference-service:$NEW_IMAGE_TAG\"" k8s/deployment.yaml
# Also set the MODEL_ID env var for the service to pick up
yq -i "(.spec.template.spec.containers[0].env[] | select(.name == \"MODEL_ID\")).value = \"$MODEL_ID\"" k8s/deployment.yaml

git config user.name "Spinnaker"
git config user.email "[email protected]"
git commit -am "Deploy model $MODEL_ID with trace $TRACEPARENT"
git push

The modified deployment.yaml now carries our trace context as metadata:

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  replicas: 3
  template:
    metadata:
      annotations:
        # This annotation carries the trace context from Spinnaker to the running Pods.
        orchestration.trace_id: "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
    spec:
      containers:
      - name: server
        image: my-registry/inference-service:build-123
        env:
        - name: MODEL_ID
          value: "mnist-model-1698338400"
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "otel-collector.observability:4317"

2. Argo CD and Closing the Loop

Argo CD detects the commit, pulls the new manifest, and performs a sync, updating the Deployment. Kubernetes creates new Pods based on this template. The pods are now “stamped” with the trace ID from the Spinnaker pipeline that created them.

The final piece of the puzzle is to connect the live inference traces back to the deployment trace. A simple parent-child relationship isn’t quite right, because the deployment happened in the past. OpenTelemetry provides the perfect mechanism for this: a Link. At startup, our Go service reads the deployment annotation (projected into an environment variable via the Downward API, as shown in the manifest above) and creates a link from every new inference span back to the original deployment trace.

// file: internal/telemetry/link.go
package telemetry

import (
	"context"
	"os"

	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

var deploymentLink trace.Link

// InitDeploymentLink reads the trace context from an environment variable (set from pod annotation)
// and creates a reusable link. This is called once at application startup.
func InitDeploymentLink() {
	// The Downward API can be used to project a Pod annotation into an env var.
	traceParent := os.Getenv("ORCHESTRATION_TRACE_ID")
	if traceParent == "" {
		return
	}

	// Create a context carrier from the traceparent string.
	carrier := make(propagation.MapCarrier)
	carrier.Set("traceparent", traceParent)

	// Use the global propagator to extract the SpanContext.
	propagator := propagation.TraceContext{}
	ctx := propagator.Extract(context.Background(), carrier)
	
	spanContext := trace.SpanContextFromContext(ctx)

	if spanContext.IsValid() {
		deploymentLink = trace.Link{
			SpanContext: spanContext,
			Attributes:  nil, // Can add attributes like git_commit_sha here
		}
	}
}

// WithDeploymentLink returns a SpanStartOption to add the deployment link to a new span.
func WithDeploymentLink() trace.SpanStartOption {
	if deploymentLink.SpanContext.IsValid() {
		return trace.WithLinks(deploymentLink)
	}
	return trace.WithLinks() // return empty option
}

The telemetryMiddleware would then be updated to use this link.

// file: internal/server/middleware.go (updated)
// ...
func telemetryMiddleware(next http.Handler, operation string) http.Handler {
	// ...
	// The link is created once at startup, so adding it to each span is very cheap.
	opts := []trace.SpanStartOption{
		trace.WithSpanKind(trace.SpanKindServer),
		telemetry.WithDeploymentLink(),
	}

	return otelhttp.NewHandler(next, operation,
		// ... other options
		otelhttp.WithSpanOptions(opts...),
	)
}
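
The startup wiring this implies is small. A sketch, as an excerpt of the main function shown earlier: InitDeploymentLink runs once, after the tracer provider and propagators are registered, so the cached link exists before the first request arrives.

// file: main.go (excerpt)
shutdown, err := telemetry.InitTracerProvider()
if err != nil {
	log.Fatalf("failed to initialize tracing: %v", err)
}
defer shutdown(context.Background())

// Read the projected ORCHESTRATION_TRACE_ID value once and cache the deployment link.
telemetry.InitDeploymentLink()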

This completes the circle. A trace from a production inference request now contains a direct, clickable link to the trace of the Spinnaker pipeline that deployed that exact model.

The final trace graph looks like this:

graph TD
    A[Git Commit] -->|Triggers| B(Spinnaker Pipeline);
    subgraph B
        B1(Generate Trace Context) --> B2(Run Keras Training Job);
        B2 --> B3(Build Go Service Container);
        B3 --> B4(Commit to GitOps Repo);
    end
    
    B4 --> |Argo CD Sync| C(K8s Deployment);
    C --> D(Go Inference Pod);
    E[User Request] --> D;

    subgraph Trace View
        T1(Span: Spinnaker Pipeline) --> T2(Child Span: Keras Training);
        T3(Span: Inference Request) -- Link --> T1;
    end

    style B2 fill:#f9f,stroke:#333,stroke-width:2px
    style T2 fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#ccf,stroke:#333,stroke-width:2px
    style T3 fill:#ccf,stroke:#333,stroke-width:2px

This system fundamentally changed our incident response. What was once a six-hour manual investigation is now a 30-second query in our observability platform. The tight coupling of ML artifacts to their CI/CD and operational context is no longer a “nice-to-have” but a core requirement for running stable, production-grade MLOps.

The implementation is not without its own complexities. It relies on a convention of passing context via environment variables and annotations, which can be fragile. A more robust solution might involve a custom Spinnaker plugin or webhook that interacts directly with an observability backend. Furthermore, the trace context is lost if a long-running training job is orphaned or rescheduled by Kubernetes. For truly resilient, long-running asynchronous tasks, a more durable context propagation mechanism, perhaps involving persisting trace state in a database and resuming from it, would be the next architectural iteration.

