Implementing Session-Aware Canary Releases for Mobile APIs Using Spinnaker and Consul Service Mesh


The incident that forced this entire architectural shift was deceptively simple. We rolled out a new version of our user-profile Web API, profile-api-v1.2, using a standard blue-green deployment strategy. The Kubernetes readiness probes passed, smoke tests were green, and we flipped the service selector to point all traffic to the new deployments. For our web front-end clients, the transition was seamless. For our mobile users, it was a catastrophe. A subtle, non-backward-compatible change in the API’s data structure for avatars was crashing older versions of the mobile app. The real problem was that even an hour after the switch, our logs showed a significant percentage of mobile traffic still hammering the decommissioned v1.1 pods, which were in their termination grace period. Our mobile clients, with their aggressive DNS caching and long-lived HTTP connection pools, were effectively blind to our backend changes. A simple rollback wasn’t enough; the damage was done, and it proved that our deployment strategy for mobile-facing services was fundamentally flawed.

Our initial concept was to abandon infrastructure-level traffic shifting (like changing Kubernetes Service selectors or DNS records) and move to application-layer routing. The goal was to control traffic based on context that only the application layer understands, such as user identity, device type, or session tokens. This would allow us to perform a canary release not on a random 5% of traffic, but on 5% of our users. Specifically, we wanted to expose the new API version to internal employee accounts first, then to a small cohort of beta testers, and only then gradually to the general user base. This required a control plane sophisticated enough to inspect L7 headers and dynamically reroute requests.

We were already using HashiCorp Consul for basic service discovery, so extending it to a full service mesh with Envoy proxies was the logical next step. Consul Service Mesh provides the L7 traffic management primitives we needed, like ServiceSplitter and ServiceRouter, which could be configured to shift traffic based on HTTP headers. For orchestration, Spinnaker was the incumbent choice in our organization. Its powerful pipeline-as-code capabilities and built-in support for canary analysis via Kayenta were critical. It could orchestrate not only the deployment of Kubernetes manifests but also the sequence of configuration changes in Consul that would drive the session-aware traffic shifting. The architecture was decided: Spinnaker would be the conductor, Kubernetes the stage, and Consul Service Mesh the traffic director.

Baseline Environment: The Stable API and Service Mesh Foundation

Before automating the canary release, a stable, mesh-enabled baseline is essential. In a real-world project, this setup is non-trivial. Our profile-api is a simple Go service. Version v1.0 is our production baseline.

// main.go - profile-api v1.0
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "profile_api_http_requests_total",
			Help: "Total number of HTTP requests.",
		},
		[]string{"version", "code", "method"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
}

type UserProfile struct {
	UserID    string `json:"userId"`
	Username  string `json:"username"`
	AvatarURL string `json:"avatarUrl"`
	Version   string `json:"version"` // To identify which version is serving
}

func profileHandler(w http.ResponseWriter, r *http.Request) {
	// In a real app, this would come from a database.
	profile := UserProfile{
		UserID:    "user-123",
		Username:  "stable-user",
		AvatarURL: "https://cdn.example.com/avatars/stable.png",
		Version:   "v1.0",
	}

	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(profile); err != nil {
		log.Printf("ERROR: Failed to encode response: %v", err)
		http.Error(w, "Internal Server Error", http.StatusInternalServerError)
		httpRequestsTotal.WithLabelValues("v1.0", "500", "GET").Inc()
		return
	}
	httpRequestsTotal.WithLabelValues("v1.0", "200", "GET").Inc()
}

func main() {
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}

	http.HandleFunc("/profile", profileHandler)
	http.Handle("/metrics", promhttp.Handler())
	log.Printf("INFO: Profile API v1.0 starting on port %s", port)
	if err := http.ListenAndServe(":"+port, nil); err != nil {
		log.Fatalf("FATAL: Server failed to start: %v", err)
	}
}

The key here is the Prometheus metric. A robust canary analysis depends entirely on high-quality, well-tagged metrics. We must differentiate between versions (version label) to compare their performance accurately.
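
As a minimal sketch (the file path, rule group, and rule name are assumptions, not part of our actual setup), a Prometheus recording rule can pre-compute the per-version 5xx error ratio that the canary analysis later compares between baseline and canary:

# prometheus/rules/profile-api-canary.yaml (illustrative)
groups:
  - name: profile-api-canary
    rules:
      # Per-version 5xx error ratio over a 5-minute window. A canary analysis
      # can compare this series for version="v1.0" (baseline) against
      # version="v1.1" (canary).
      - record: profile_api:error_ratio:rate5m
        expr: |
          sum by (version) (rate(profile_api_http_requests_total{code=~"5.."}[5m]))
          /
          sum by (version) (rate(profile_api_http_requests_total[5m]))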

The corresponding Kubernetes Deployment injects the Consul and Envoy sidecars via annotations. This is a standard practice for integrating services into the mesh.

# k8s/profile-api-v1.0-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: profile-api-v1-0
spec:
  replicas: 3
  selector:
    matchLabels:
      app: profile-api
      version: v1-0
  template:
    metadata:
      labels:
        app: profile-api
        version: v1-0
      annotations:
        'consul.hashicorp.com/connect-inject': 'true'
        # Expose the version as Consul service metadata so the
        # ServiceResolver subset filters can match on it.
        'consul.hashicorp.com/service-meta-version': 'v1-0'
        'prometheus.io/scrape': 'true'
        'prometheus.io/port': '8080'
        'prometheus.io/path': '/metrics'
    spec:
      containers:
      - name: profile-api
        image: your-repo/profile-api:1.0
        ports:
        - containerPort: 8080
          name: http
        env:
        - name: PORT
          value: "8080"
---
apiVersion: v1
kind: Service
metadata:
  name: profile-api
spec:
  selector:
    app: profile-api
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080

Notice the Kubernetes Service profile-api uses a general app: profile-api selector. It doesn’t target a specific version. This is critical. The service object becomes a stable entry point, while Consul Service Mesh will manage routing traffic to the correct versioned pods underneath.

To enable mesh-based routing, we define Consul’s configuration using Kubernetes CRDs. ServiceDefaults establishes the protocol, and ServiceResolver tells Consul how to find service instances.

# consul/service-defaults.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceDefaults
metadata:
  name: profile-api
spec:
  protocol: http

# consul/service-resolver.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceResolver
metadata:
  name: profile-api
spec:
  # Initially, resolve all requests for 'profile-api' to the v1.0 subset.
  defaultSubset: v1-0
  subsets:
    v1-0:
      filter: 'Service.Meta.version == "v1-0"'

The ServiceResolver explicitly defines a subset named v1-0 using a filter on the service metadata. That metadata is attached to each instance's Consul registration by the consul.hashicorp.com/service-meta-version annotation on the pod template shown above. At this point, 100% of traffic to the virtual service profile-api is routed to the pods registered as version v1-0.

The Canary Version and L7 Routing Configuration

Now, we introduce v1.1, our canary candidate. This version changes the AvatarURL field to a more structured Avatar object. This is the kind of breaking change that our previous deployment strategy failed to catch safely.

// main.go - profile-api v1.1
package main

// ... (imports and metrics setup are identical to v1.0)

// The breaking change is here.
type Avatar struct {
	Primary   string `json:"primary"`
	Thumbnail string `json:"thumbnail"`
}

type UserProfileV2 struct {
	UserID   string `json:"userId"`
	Username string `json:"username"`
	Avatar   Avatar `json:"avatar"` // Changed from string to object
	Version  string `json:"version"`
}

func profileHandler(w http.ResponseWriter, r *http.Request) {
	profile := UserProfileV2{
		UserID:   "user-123",
		Username: "canary-user",
		Avatar: Avatar{
			Primary:   "https://cdn.example.com/avatars/canary_large.png",
			Thumbnail: "https://cdn.example.com/avatars/canary_small.png",
		},
		Version:  "v1.1",
	}

	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(profile); err != nil {
		log.Printf("ERROR: Failed to encode response: %v", err)
		http.Error(w, "Internal Server Error", http.StatusInternalServerError)
		httpRequestsTotal.WithLabelValues("v1.1", "500", "GET").Inc()
		return
	}
	httpRequestsTotal.WithLabelValues("v1.1", "200", "GET").Inc()
}

// ... (main function is identical)

The Kubernetes deployment for v1.1 is nearly identical to v1.0, differing only in the name, image tag, and version label.

# k8s/profile-api-v1.1-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: profile-api-v1-1
spec:
  replicas: 1 # Start with a single canary instance
  selector:
    matchLabels:
      app: profile-api
      version: v1-1
  template:
    metadata:
      labels:
        app: profile-api
        version: v1-1
      annotations:
        # ... same annotations as v1.0, except the version metadata:
        'consul.hashicorp.com/service-meta-version': 'v1-1'
    spec:
      containers:
      - name: profile-api
        image: your-repo/profile-api:1.1
        # ... rest is the same

To control traffic between these two versions, we introduce two more Consul CRDs: ServiceSplitter and ServiceRouter.
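
Before either of them can reference the canary, the ServiceResolver must define a v1-1 subset; splitters and routers can only target subsets the resolver knows about. A minimal sketch of the updated resolver (the consul/service-resolver-with-canary.yaml applied later by the Spinnaker pipeline) looks like this:

# consul/service-resolver-with-canary.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceResolver
metadata:
  name: profile-api
spec:
  # The stable subset remains the default until promotion.
  defaultSubset: v1-0
  subsets:
    v1-0:
      filter: 'Service.Meta.version == "v1-0"'
    v1-1:
      filter: 'Service.Meta.version == "v1-1"'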

The ServiceSplitter defines the ratio of traffic between different service subsets.

# consul/service-splitter-initial.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceSplitter
metadata:
  name: profile-api
spec:
  splits:
    # 95% of traffic goes to the stable version
    - weight: 95
      serviceSubset: v1-0
    # 5% of traffic goes to the canary version
    - weight: 5
      serviceSubset: v1-1

The ServiceRouter allows for conditional routing based on L7 properties like headers. This is the key to our session-aware strategy.

# consul/service-router-canary.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceRouter
metadata:
  name: profile-api # The config entry name must match the service it routes.
spec:
  routes:
    # Route 1: Internal Testers
    # If the X-User-Cohort header is 'internal', force traffic to the canary.
    - match:
        http:
          header:
            - name: X-User-Cohort
              exact: internal
      destination:
        serviceSubset: v1-1 # Send to canary
    
    # Route 2: Default traffic
    # For all other traffic, let the ServiceSplitter decide.
    - match:
        http:
          pathPrefix: /
      destination:
        service: profile-api # Let the splitter handle it

The pitfall here is route ordering: Consul evaluates routes in order and applies the first match, so the most specific match (X-User-Cohort) must come first, followed by the general catch-all. This configuration establishes a powerful flow: internal testers are always routed to the canary, while a small, random 5% of public traffic is also sent there via the splitter.

Here’s how the traffic flow looks at this stage:

graph TD
    subgraph Mobile Clients
        A[Internal User App]
        B[Beta User App]
        C[Public User App]
    end

    subgraph API Gateway
        GW[Gateway with Envoy Proxy]
    end

    subgraph Kubernetes Cluster
        subgraph Consul Service Mesh
            D(profile-api:80)
        end
        subgraph Service v1.0
            P1_0(Pod v1.0)
            P2_0(Pod v1.0)
        end
        subgraph Service v1.1 - Canary
            P1_1(Pod v1.1)
        end
    end

    A -- "GET /profile\nX-User-Cohort: internal" --> GW
    B -- "GET /profile\nX-User-Id: 2048" --> GW
    C -- "GET /profile" --> GW

    GW -- L7 Routing Rules --> D

    D -- "match(X-User-Cohort: internal)" --> P1_1
    D -- "95% of default traffic" --> P1_0 & P2_0
    D -- "5% of default traffic" --> P1_1

The Spinnaker Orchestration Pipeline

With the building blocks in place, we can construct the Spinnaker pipeline to automate this entire process. A declarative Spinnaker pipeline defined in JSON is the standard for maintainability.

The pipeline has several distinct stages:

  1. Deploy Canary: Deploys the manifest for the profile-api-v1.1 Kubernetes deployment.
  2. Initial Traffic Shift (5%): Applies the ServiceResolver and ServiceSplitter configurations to Consul to start sending a small fraction of traffic to the canary. This requires a Spinnaker stage that can apply raw Kubernetes manifests.
  3. Canary Analysis: A kayentaCanary stage. This stage runs for a defined period (e.g., 30 minutes), continuously comparing metrics from the canary (v1.1) against the baseline (v1.0).
  4. Session-Aware Rollout Loop: This is the most complex part. It’s not a single stage but a series of stages that are repeated. Each iteration uses a “Run Job” stage to execute a script that updates the ServiceRouter to target a new cohort of users.
  5. Promote or Rollback: Based on the success of the canary analysis and rollout, Spinnaker either promotes the canary to 100% traffic or executes a rollback plan.

Here’s a snippet of the Spinnaker pipeline JSON for the key stages:

// spinnaker-pipeline.json (fragment)
{
  "name": "Deploy Profile API (Session-Aware Canary)",
  "stages": [
    {
      "name": "Deploy Canary",
      "type": "deployManifest",
      "account": "my-k8s-account",
      "source": "text",
      "manifests": [
        // ... content of k8s/profile-api-v1.1-deployment.yaml ...
      ]
    },
    {
      "name": "Apply Initial 5% Split",
      "type": "deployManifest",
      "requisiteStageRefIds": ["Deploy Canary"],
      "account": "my-k8s-account",
      "source": "text",
      "manifests": [
        // ... content of consul/service-resolver-with-canary.yaml ...
        // ... content of consul/service-splitter-initial.yaml ...
        // ... content of consul/service-router-canary.yaml ...
      ]
    },
    {
      "name": "Canary Analysis",
      "type": "kayentaCanary",
      "requisiteStageRefIds": ["Apply Initial 5% Split"],
      "canaryConfig": {
        "canaryAnalysisIntervalMins": "5",
        "scopes": [
          {
            "scopeName": "default",
            "controlScope": "profile-api-v1-0",
            "controlLocation": "prod-us-west-1",
            "experimentScope": "profile-api-v1-1",
            "experimentLocation": "prod-us-west-1",
            "extendedScopeParams": {
              "resourceType": "kubernetes"
            }
          }
        ],
        "lifetimeDuration": "PT30M",
        "scoreThresholds": {
          "pass": 95,
          "marginal": 75
        }
      }
    },
    {
      "name": "Rollout to User Cohort 1 (User ID mod 10 == 0)",
      "type": "runJob",
      "requisiteStageRefIds": ["Canary Analysis"],
      "account": "my-k8s-account",
      "application": "profile-api",
      "job": {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
          "generateName": "update-consul-router-"
        },
        "spec": {
          "template": {
            "spec": {
              "containers": [
                {
                  "name": "consul-updater",
                  "image": "your-repo/consul-updater:latest",
                  "command": ["python", "update_router.py", "--match-header", "X-User-Id", "--match-regex", ".*0$"]
                }
              ],
              "restartPolicy": "Never"
            }
          }
        }
      }
    }
    // ... More rollout stages for other cohorts ...
  ]
}

The consul-updater job is crucial. It runs a script that programmatically modifies the ServiceRouter CRD. A common mistake is to manage these complex, multi-stage configurations manually. Automation is key to consistency and safety.

The script itself can use the Kubernetes Python client to patch the CRD.

# update_router.py
import os
import argparse
from kubernetes import client, config

def main():
    parser = argparse.ArgumentParser(description="Update Consul ServiceRouter for canary rollout.")
    parser.add_argument("--match-header", required=True, help="Header to match on.")
    parser.add_argument("--match-regex", required=True, help="Regex for header value.")
    args = parser.parse_args()

    config.load_incluster_config()
    api = client.CustomObjectsApi()

    router_name = "profile-api-router"
    namespace = os.getenv("KUBE_NAMESPACE", "default")
    
    try:
        # Fetch the existing ServiceRouter
        router = api.get_namespaced_custom_object(
            group="consul.hashicorp.com",
            version="v1alpha1",
            name=router_name,
            namespace=namespace,
            plural="servicerouters",
        )
        
        new_route = {
            "match": {
                "http": {
                    "header": [
                        {
                            "name": args.match_header,
                            "regex": args.match_regex,
                        }
                    ]
                }
            },
            "destination": {
                "serviceSubset": "v1-1" # Always route match to canary
            }
        }

        # Prepend the new, specific route to the list of routes.
        # This ensures it's evaluated before the general catch-all.
        router["spec"]["routes"].insert(0, new_route)

        api.patch_namespaced_custom_object(
            group="consul.hashicorp.com",
            version="v1alpha1",
            name=router_name,
            namespace=namespace,
            plural="servicerouters",
            body=router,
        )
        print(f"Successfully added route for {args.match_header}: {args.match_regex} to {router_name}")

    except client.ApiException as e:
        print(f"Error updating ServiceRouter: {e}")
        exit(1)

if __name__ == "__main__":
    main()

Each run of this job, triggered by Spinnaker, adds another layer of specificity to the ServiceRouter, peeling off another cohort of users and directing them to the canary. Between each cohort rollout, another kayentaCanary analysis stage can be run to ensure the stability holds as the user base for the canary expands. This iterative process of expand and verify is the core of a safe canary release.

The final promotion step involves another run-job stage that collapses the ServiceSplitter to a single 100% split and flips the ServiceResolver's default to the v1-1 subset, followed by a deleteManifest stage in Spinnaker to remove the now-idle v1-0 deployment. A sketch of the promoted Consul configuration is shown below.
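
As a minimal sketch (the file name is illustrative), the promoted state looks like this:

# consul/promotion.yaml (illustrative; applied by the promotion job)
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceSplitter
metadata:
  name: profile-api
spec:
  splits:
    # All traffic now goes to the former canary.
    - weight: 100
      serviceSubset: v1-1
---
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceResolver
metadata:
  name: profile-api
spec:
  defaultSubset: v1-1
  subsets:
    v1-1:
      filter: 'Service.Meta.version == "v1-1"'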

The implementation presented solves the “sticky client” problem by making routing decisions at the edge of our infrastructure (the Envoy proxy in the API Gateway), which sees every single request. The state of the rollout isn’t dependent on client behavior but is centrally controlled and orchestrated. The key was leveraging the synergy between Spinnaker’s workflow engine and Consul’s L7 traffic control capabilities.

This pattern, however, introduces its own complexities. The operational burden of managing a service mesh is not insignificant; debugging Envoy configurations can be challenging. Performance is another consideration, as every request now passes through an additional proxy, adding a small amount of latency. The current user-cohort logic is also coupled to the deployment pipeline; a more decoupled approach would use a dedicated feature-flagging system, which the ServiceRouter could query for its decisions. Future iterations will likely focus on integrating such a system to provide business-level control over rollouts, completely separating the “who gets what” from the deployment mechanics.
