The incident that forced this entire architectural shift was deceptively simple. We rolled out a new version of our user-profile Web API, profile-api-v1.2, using a standard blue-green deployment strategy. The Kubernetes readiness probes passed, smoke tests were green, and we flipped the Service selector to point all traffic at the new deployment. For our web front-end clients, the transition was seamless. For our mobile users, it was a catastrophe. A subtle, non-backward-compatible change in the API’s avatar data structure was crashing older versions of the mobile app. The real problem was that even an hour after the switch, our logs showed a significant percentage of mobile traffic still hammering the decommissioned v1.1 pods, which were in their termination grace period. Our mobile clients, with their aggressive DNS caching and long-lived HTTP connection pools, were effectively blind to our backend changes. A simple rollback wasn’t enough; the damage was done, and it proved that our deployment strategy for mobile-facing services was fundamentally flawed.
Our initial concept was to abandon infrastructure-level traffic shifting (like changing Kubernetes Service selectors or DNS records) and move to application-layer routing. The goal was to control traffic based on context that only the application layer understands, such as user identity, device type, or session tokens. This would allow us to perform a canary release not on a random 5% of traffic, but on 5% of our users. Specifically, we wanted to expose the new API version to internal employee accounts first, then to a small cohort of beta testers, and only then gradually to the general user base. This required a control plane sophisticated enough to inspect L7 headers and dynamically reroute requests.
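To make “a percentage of users” concrete rather than “a percentage of requests,” the routing decision has to be keyed to a stable user attribute. The sketch below is an illustration of the idea, not our production code: a hypothetical cohort_assignment.py that an edge or BFF layer could use to stamp each request with the X-User-Cohort header our mesh rules will later match on, using a deterministic hash so a given user always lands in the same bucket. The domain, beta list, and 5% threshold are placeholder assumptions.
# cohort_assignment.py - illustrative sketch of deterministic user bucketing (hypothetical)
import zlib

INTERNAL_DOMAIN = "example.com"               # hypothetical employee email domain
BETA_USER_IDS = {"user-2048", "user-4096"}    # hypothetical beta-tester cohort
CANARY_PERCENT = 5                            # expose 5% of the general user base

def assign_cohort(user_id: str, email: str) -> str:
    """Return the cohort name an edge layer would set as the X-User-Cohort header."""
    if email.endswith("@" + INTERNAL_DOMAIN):
        return "internal"
    if user_id in BETA_USER_IDS:
        return "beta"
    # Deterministic bucketing: the same user always hashes to the same bucket,
    # so "5% of users" stays the same 5% across requests and sessions.
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100
    return "canary" if bucket < CANARY_PERCENT else "general"

if __name__ == "__main__":
    for uid, mail in [("user-123", "dev@example.com"), ("user-777", "someone@gmail.com")]:
        print(uid, assign_cohort(uid, mail))
The exact header name and thresholds are assumptions; the important property is determinism, which is what lets the mesh route users rather than individual requests.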
We were already using HashiCorp Consul for basic service discovery, so extending it to a full service mesh with Envoy proxies was the logical next step. Consul Service Mesh provides the L7 traffic management primitives we needed, like ServiceSplitter and ServiceRouter, which could be configured to shift traffic based on HTTP headers. For orchestration, Spinnaker was the incumbent choice in our organization. Its powerful pipeline-as-code capabilities and built-in support for canary analysis via Kayenta were critical. It could orchestrate not only the deployment of Kubernetes manifests but also the sequence of configuration changes in Consul that would drive the session-aware traffic shifting. The architecture was decided: Spinnaker would be the conductor, Kubernetes the stage, and Consul Service Mesh the traffic director.
Baseline Environment: The Stable API and Service Mesh Foundation
Before automating the canary release, a stable, mesh-enabled baseline is essential. In a real-world project, this setup is non-trivial. Our profile-api is a simple Go service. Version v1.0 is our production baseline.
// main.go - profile-api v1.0
package main

import (
    "encoding/json"
    "log"
    "net/http"
    "os"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "profile_api_http_requests_total",
            Help: "Total number of HTTP requests.",
        },
        []string{"version", "code", "method"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
}

type UserProfile struct {
    UserID    string `json:"userId"`
    Username  string `json:"username"`
    AvatarURL string `json:"avatarUrl"`
    Version   string `json:"version"` // To identify which version is serving
}

func profileHandler(w http.ResponseWriter, r *http.Request) {
    // In a real app, this would come from a database.
    profile := UserProfile{
        UserID:    "user-123",
        Username:  "stable-user",
        AvatarURL: "https://cdn.example.com/avatars/stable.png",
        Version:   "v1.0",
    }
    w.Header().Set("Content-Type", "application/json")
    if err := json.NewEncoder(w).Encode(profile); err != nil {
        log.Printf("ERROR: Failed to encode response: %v", err)
        http.Error(w, "Internal Server Error", http.StatusInternalServerError)
        httpRequestsTotal.WithLabelValues("v1.0", "500", "GET").Inc()
        return
    }
    httpRequestsTotal.WithLabelValues("v1.0", "200", "GET").Inc()
}

func main() {
    port := os.Getenv("PORT")
    if port == "" {
        port = "8080"
    }
    http.HandleFunc("/profile", profileHandler)
    http.Handle("/metrics", promhttp.Handler())
    log.Printf("INFO: Profile API v1.0 starting on port %s", port)
    if err := http.ListenAndServe(":"+port, nil); err != nil {
        log.Fatalf("FATAL: Server failed to start: %v", err)
    }
}
The key here is the Prometheus metric. A robust canary analysis depends entirely on high-quality, well-tagged metrics. We must differentiate between versions (the version label) to compare their performance accurately.
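For intuition on what that analysis consumes, here is a small, hypothetical sanity check (not part of our pipeline) that queries the Prometheus HTTP API directly and compares the 5xx ratio of the two version labels. Kayenta performs a far more rigorous statistical comparison, but the underlying signal is this same metric; the Prometheus URL here is a placeholder.
# error_ratio_check.py - hypothetical sanity check against the version-labelled metric
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # placeholder address

def error_ratio(version: str) -> float:
    """5xx ratio over the last 5 minutes for a given profile-api version label."""
    query = (
        f'sum(rate(profile_api_http_requests_total{{version="{version}",code=~"5.."}}[5m]))'
        f' / sum(rate(profile_api_http_requests_total{{version="{version}"}}[5m]))'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # An empty result set (e.g., no traffic or no errors recorded yet) is treated as 0.
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    baseline, canary = error_ratio("v1.0"), error_ratio("v1.1")
    print(f"baseline 5xx ratio: {baseline:.4f}, canary 5xx ratio: {canary:.4f}")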
The corresponding Kubernetes Deployment injects the Consul and Envoy sidecars via annotations. This is a standard practice for integrating services into the mesh.
# k8s/profile-api-v1.0-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: profile-api-v1-0
spec:
  replicas: 3
  selector:
    matchLabels:
      app: profile-api
      version: v1-0
  template:
    metadata:
      labels:
        app: profile-api
        version: v1-0
      annotations:
        'consul.hashicorp.com/connect-inject': 'true'
        'prometheus.io/scrape': 'true'
        'prometheus.io/port': '8080'
        'prometheus.io/path': '/metrics'
    spec:
      containers:
        - name: profile-api
          image: your-repo/profile-api:1.0
          ports:
            - containerPort: 8080
              name: http
          env:
            - name: PORT
              value: "8080"
---
apiVersion: v1
kind: Service
metadata:
  name: profile-api
spec:
  selector:
    app: profile-api
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
Notice the Kubernetes Service profile-api uses a general app: profile-api selector. It doesn’t target a specific version. This is critical. The Service object becomes a stable entry point, while Consul Service Mesh will manage routing traffic to the correct versioned pods underneath.
To enable mesh-based routing, we define Consul’s configuration using Kubernetes CRDs. ServiceDefaults establishes the protocol, and ServiceResolver tells Consul how to find service instances.
# consul/service-defaults.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceDefaults
metadata:
  name: profile-api
spec:
  protocol: http
# consul/service-resolver.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceResolver
metadata:
  name: profile-api
spec:
  # Initially, send all requests for 'profile-api' to the v1.0 subset.
  defaultSubset: v1-0
  subsets:
    v1-0:
      filter: 'Service.Meta.version == "v1-0"'
The ServiceResolver explicitly defines a subset named v1-0 using a filter on the service metadata. This metadata is automatically populated by Consul from the Kubernetes pod labels. At this point, 100% of traffic to the virtual service profile-api is routed to pods with the label version: v1-0.
The Canary Version and L7 Routing Configuration
Now, we introduce v1.1, our canary candidate. This version changes the AvatarURL field to a more structured Avatar object. This is the kind of breaking change that our previous deployment strategy failed to catch safely.
// main.go - profile-api v1.1
package main

// ... (imports and metrics setup are identical to v1.0)

// The breaking change is here.
type Avatar struct {
    Primary   string `json:"primary"`
    Thumbnail string `json:"thumbnail"`
}

type UserProfileV2 struct {
    UserID   string `json:"userId"`
    Username string `json:"username"`
    Avatar   Avatar `json:"avatar"` // Changed from string to object
    Version  string `json:"version"`
}

func profileHandler(w http.ResponseWriter, r *http.Request) {
    profile := UserProfileV2{
        UserID:   "user-123",
        Username: "canary-user",
        Avatar: Avatar{
            Primary:   "https://cdn.example.com/avatars/canary_large.png",
            Thumbnail: "https://cdn.example.com/avatars/canary_small.png",
        },
        Version: "v1.1",
    }
    w.Header().Set("Content-Type", "application/json")
    if err := json.NewEncoder(w).Encode(profile); err != nil {
        log.Printf("ERROR: Failed to encode response: %v", err)
        http.Error(w, "Internal Server Error", http.StatusInternalServerError)
        httpRequestsTotal.WithLabelValues("v1.1", "500", "GET").Inc()
        return
    }
    httpRequestsTotal.WithLabelValues("v1.1", "200", "GET").Inc()
}

// ... (main function is identical)
The Kubernetes deployment for v1.1 is nearly identical to v1.0, differing only in the name, image tag, and version label.
# k8s/profile-api-v1.1-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: profile-api-v1-1
spec:
  replicas: 1 # Start with a single canary instance
  selector:
    matchLabels:
      app: profile-api
      version: v1-1
  template:
    metadata:
      labels:
        app: profile-api
        version: v1-1
      annotations:
        # ... same annotations as v1.0
    spec:
      containers:
        - name: profile-api
          image: your-repo/profile-api:1.1
          # ... rest is the same
To control traffic between these two versions, we introduce two more Consul CRDs: ServiceSplitter and ServiceRouter.

The ServiceSplitter defines the ratio of traffic between different service subsets.
# consul/service-splitter-initial.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceSplitter
metadata:
  name: profile-api
spec:
  splits:
    # 95% of traffic goes to the stable version
    - weight: 95
      serviceSubset: v1-0
    # 5% of traffic goes to the canary version
    - weight: 5
      serviceSubset: v1-1
The ServiceRouter allows for conditional routing based on L7 properties like headers. This is the key to our session-aware strategy.
# consul/service-router-canary.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceRouter
metadata:
  # The config entry name must match the service the router applies to.
  name: profile-api
spec:
  routes:
    # Route 1: Internal Testers
    # If the X-User-Cohort header is 'internal', force traffic to the canary.
    - match:
        http:
          header:
            - name: X-User-Cohort
              exact: internal
      destination:
        serviceSubset: v1-1 # Send to canary
    # Route 2: Default traffic
    # For all other traffic, let the ServiceSplitter decide.
    - match:
        http:
          pathPrefix: /
      destination:
        service: profile-api # Let the splitter handle it
The pitfall here is the order of routes. Consul processes them sequentially. We must put our most specific match (X-User-Cohort) first, followed by the general catch-all. This configuration establishes a powerful flow: internal testers are always routed to the canary, while a small, random 5% of public traffic is also sent there via the splitter.
Here’s how the traffic flow looks at this stage:
graph TD
  subgraph Mobile Clients
    A[Internal User App]
    B[Beta User App]
    C[Public User App]
  end
  subgraph API Gateway
    GW[Gateway with Envoy Proxy]
  end
  subgraph Kubernetes Cluster
    subgraph Consul Service Mesh
      D(profile-api:80)
    end
    subgraph Service v1.0
      P1_0(Pod v1.0)
      P2_0(Pod v1.0)
    end
    subgraph Service v1.1 - Canary
      P1_1(Pod v1.1)
    end
  end
  A -- "GET /profile\nX-User-Cohort: internal" --> GW
  B -- "GET /profile\nX-User-Id: 2048" --> GW
  C -- "GET /profile" --> GW
  GW -- L7 Routing Rules --> D
  D -- "match(X-User-Cohort: internal)" --> P1_1
  D -- "95% of default traffic" --> P1_0 & P2_0
  D -- "5% of default traffic" --> P1_1
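Before handing this flow to automation, it is worth checking the routing rules by hand. A small sketch like the following (the gateway URL and request count are placeholders) sends one batch of plain requests and one batch tagged as internal testers, then tallies the version field each response reports; plain traffic should split roughly 95/5, while every internal-cohort request should land on v1.1.
# verify_routing.py - hypothetical smoke check for the header-based routing rules
from collections import Counter

import requests

GATEWAY_URL = "https://api.example.com/profile"  # placeholder gateway endpoint
REQUESTS_PER_BATCH = 200

def sample(headers: dict) -> Counter:
    """Issue a batch of requests and count which API version answered each one."""
    versions = Counter()
    for _ in range(REQUESTS_PER_BATCH):
        resp = requests.get(GATEWAY_URL, headers=headers, timeout=5)
        resp.raise_for_status()
        versions[resp.json().get("version", "unknown")] += 1
    return versions

if __name__ == "__main__":
    print("default traffic:", dict(sample({})))
    print("internal cohort:", dict(sample({"X-User-Cohort": "internal"})))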
The Spinnaker Orchestration Pipeline
With the building blocks in place, we can construct the Spinnaker pipeline to automate this entire process. A declarative Spinnaker pipeline defined in JSON is the standard for maintainability.
The pipeline has several distinct stages:
- Deploy Canary: Deploys the manifest for the profile-api-v1.1 Kubernetes deployment.
- Initial Traffic Shift (5%): Applies the ServiceResolver and ServiceSplitter configurations to Consul to start sending a small fraction of traffic to the canary. This requires a Spinnaker stage that can apply raw Kubernetes manifests.
- Canary Analysis: A kayentaCanary stage. This stage runs for a defined period (e.g., 30 minutes), continuously comparing metrics from the canary (v1.1) against the baseline (v1.0).
- Session-Aware Rollout Loop: This is the most complex part. It’s not a single stage but a series of stages that are repeated. Each iteration uses a “Run Job” stage to execute a script that updates the ServiceRouter to target a new cohort of users.
- Promote or Rollback: Based on the success of the canary analysis and rollout, Spinnaker either promotes the canary to 100% traffic or executes a rollback plan.
Here’s a snippet of the Spinnaker pipeline JSON for the key stages:
// spinnaker-pipeline.json (fragment)
{
  "name": "Deploy Profile API (Session-Aware Canary)",
  "stages": [
    {
      "name": "Deploy Canary",
      "type": "deployManifest",
      "account": "my-k8s-account",
      "source": "text",
      "manifests": [
        // ... content of k8s/profile-api-v1.1-deployment.yaml ...
      ]
    },
    {
      "name": "Apply Initial 5% Split",
      "type": "deployManifest",
      "requisiteStageRefIds": ["Deploy Canary"],
      "account": "my-k8s-account",
      "source": "text",
      "manifests": [
        // ... content of consul/service-resolver-with-canary.yaml ...
        // ... content of consul/service-splitter-initial.yaml ...
        // ... content of consul/service-router-canary.yaml ...
      ]
    },
    {
      "name": "Canary Analysis",
      "type": "kayentaCanary",
      "requisiteStageRefIds": ["Apply Initial 5% Split"],
      "canaryConfig": {
        "canaryAnalysisIntervalMins": "5",
        "scopes": [
          {
            "scopeName": "default",
            "controlScope": "profile-api-v1-0",
            "controlLocation": "prod-us-west-1",
            "experimentScope": "profile-api-v1-1",
            "experimentLocation": "prod-us-west-1",
            "extendedScopeParams": {
              "resourceType": "kubernetes"
            }
          }
        ],
        "lifetimeDuration": "PT30M",
        "scoreThresholds": {
          "pass": 95,
          "marginal": 75
        }
      }
    },
    {
      "name": "Rollout to User Cohort 1 (User ID mod 10 == 0)",
      "type": "runJob",
      "requisiteStageRefIds": ["Canary Analysis"],
      "account": "my-k8s-account",
      "application": "profile-api",
      "job": {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
          "generateName": "update-consul-router-"
        },
        "spec": {
          "template": {
            "spec": {
              "containers": [
                {
                  "name": "consul-updater",
                  "image": "your-repo/consul-updater:latest",
                  "command": ["python", "update_router.py", "--match-header", "X-User-Id", "--match-regex", ".*0$"]
                }
              ],
              "restartPolicy": "Never"
            }
          }
        }
      }
    }
    // ... More rollout stages for other cohorts ...
  ]
}
The consul-updater job is crucial. It runs a script that programmatically modifies the ServiceRouter CRD. A common mistake is to manage these complex, multi-stage configurations manually. Automation is key to consistency and safety.
The script itself can use the Kubernetes Python client to patch the CRD.
# update_router.py
import argparse
import os
import sys

from kubernetes import client, config

def main():
    parser = argparse.ArgumentParser(description="Update Consul ServiceRouter for canary rollout.")
    parser.add_argument("--match-header", required=True, help="Header to match on.")
    parser.add_argument("--match-regex", required=True, help="Regex for header value.")
    args = parser.parse_args()

    config.load_incluster_config()
    api = client.CustomObjectsApi()

    # The ServiceRouter config entry is named after the service it routes for.
    router_name = "profile-api"
    namespace = os.getenv("KUBE_NAMESPACE", "default")

    try:
        # Fetch the existing ServiceRouter.
        router = api.get_namespaced_custom_object(
            group="consul.hashicorp.com",
            version="v1alpha1",
            name=router_name,
            namespace=namespace,
            plural="servicerouters",
        )

        new_route = {
            "match": {
                "http": {
                    "header": [
                        {
                            "name": args.match_header,
                            "regex": args.match_regex,
                        }
                    ]
                }
            },
            "destination": {
                "serviceSubset": "v1-1"  # Always route matches to the canary
            },
        }

        # Prepend the new, specific route to the list of routes.
        # This ensures it's evaluated before the general catch-all.
        router["spec"]["routes"].insert(0, new_route)

        api.patch_namespaced_custom_object(
            group="consul.hashicorp.com",
            version="v1alpha1",
            name=router_name,
            namespace=namespace,
            plural="servicerouters",
            body=router,
        )
        print(f"Successfully added route for {args.match_header}: {args.match_regex} to {router_name}")
    except client.ApiException as e:
        print(f"Error updating ServiceRouter: {e}")
        sys.exit(1)

if __name__ == "__main__":
    main()
Each run of this job, triggered by Spinnaker, adds another layer of specificity to the ServiceRouter, peeling off another cohort of users and directing them to the canary. Between each cohort rollout, another kayentaCanary analysis stage can be run to ensure the stability holds as the user base for the canary expands. This iterative process of expand and verify is the core of a safe canary release.
The final promotion step involves another runJob that simplifies the ServiceResolver to point 100% to the v1-1 subset, and then a disableManifest stage in Spinnaker to scale down and remove the v1-0 deployment.
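A promotion job in the same style as update_router.py might look like the sketch below: it patches the ServiceResolver so the default subset becomes v1-1, after which the router and splitter entries can be simplified or removed. The field names mirror the resolver shown earlier; treat this as an outline under those assumptions rather than the exact job we run.
# promote_canary.py - sketch of the promotion job, in the style of update_router.py
import os
import sys

from kubernetes import client, config

def main():
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    namespace = os.getenv("KUBE_NAMESPACE", "default")

    # Point the resolver's default subset at the now-proven canary subset.
    patch = {"spec": {"defaultSubset": "v1-1"}}
    try:
        api.patch_namespaced_custom_object(
            group="consul.hashicorp.com",
            version="v1alpha1",
            namespace=namespace,
            plural="serviceresolvers",
            name="profile-api",
            body=patch,
        )
        print("ServiceResolver 'profile-api' now defaults to subset v1-1")
    except client.ApiException as e:
        print(f"Error promoting canary: {e}")
        sys.exit(1)

if __name__ == "__main__":
    main()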
The implementation presented solves the “sticky client” problem by making routing decisions at the edge of our infrastructure (the Envoy proxy in the API Gateway), which sees every single request. The state of the rollout isn’t dependent on client behavior but is centrally controlled and orchestrated. The key was leveraging the synergy between Spinnaker’s workflow engine and Consul’s L7 traffic control capabilities.
This pattern, however, introduces its own complexities. The operational burden of managing a service mesh is not insignificant; debugging Envoy configurations can be challenging. Performance is another consideration, as every request now passes through an additional proxy, adding a small amount of latency. The current user-cohort logic is also coupled to the deployment pipeline; a more decoupled approach would use a dedicated feature-flagging system, which the ServiceRouter could query for its decisions. Future iterations will likely focus on integrating such a system to provide business-level control over rollouts, completely separating the “who gets what” from the deployment mechanics.