Deploying our primary monolithic frontend had become a high-stakes, all-or-nothing event. A carefully planned release window, a collective deep breath, and a switch-flip that determined whether the next few hours would be calm or a frantic rollback scramble. The core technical pain point was the sheer blast radius of any failure. With a single, large Vue.js application serving all user-facing traffic, even a minor bug in a peripheral component could degrade the entire user experience, forcing a complete and disruptive rollback. The business demanded faster iteration, but engineering couldn’t stomach the increasing risk. The obvious answer was a gradual rollout strategy, specifically canary releasing, but applying this pattern to a stateful monolith isn’t as straightforward as it is for stateless microservices.
Our initial concept was to build a system that allowed us to deploy a new “canary” version of the frontend alongside the stable version and incrementally shift production traffic to it. This process needed to be observable, controllable, and—most importantly—rapidly reversible. A key constraint was to leverage our existing toolchain where possible to minimize operational overhead. This led us to a rather unconventional combination of technologies: Puppet for infrastructure state management, Envoy Proxy for sophisticated traffic control, and a Vue/Pinia frontend as a dedicated control plane for our release engineers.
The decision to use Puppet was rooted in our existing Infrastructure as Code practices. It already managed our servers, so extending it to manage application deployments was a natural step. Puppet’s declarative nature ensures that the desired state—two specific versions of our monolith deployed and running—is continuously enforced. For traffic shifting, standard load balancers like Nginx were considered, but their reload-based configuration updates were too slow and disruptive for the fine-grained, real-time control we needed. Envoy Proxy, with its dynamic configuration discovery service (xDS) API, was the clear winner. It allows for atomic, zero-downtime updates to traffic routing rules on the fly. Finally, to manage Envoy’s dynamic state, we needed a user interface. A simple CLI was an option, but a visual dashboard could provide real-time metric feedback, making the decision to advance or roll back a release data-driven rather than instinct-based. Since our monolith was a Vue application, our team had the expertise to quickly build this internal tool, and Pinia was selected for its simple yet powerful state management, perfectly suited for tracking the state of a release process.
The Foundational State: Application Deployment with Puppet
The first step was to ensure we could reliably deploy two distinct versions of our application onto the same set of hosts. Puppet’s role is to act as the source of truth for what is deployed, not how traffic is routed to it. We defined a custom resource type in our Puppet manifest to represent an application deployment.
# modules/app/manifests/deployment.pp
# Defines a single instance of our monolithic frontend application.
# It handles pulling a specific git version, installing dependencies,
# building the production assets, and running it as a systemd service
# on a specified port.
define app::deployment (
String $version,
Integer $port,
String $service_name,
String $app_root,
String $git_repo,
) {
$app_path = "${app_root}/${service_name}"
# Ensure the root application directory exists
file { $app_root:
ensure => directory,
owner => 'appuser',
group => 'appuser',
}
# 1. Manage the source code via Git
# This ensures the correct version is checked out. If the version changes
# in Puppet, this resource will pull the new tag.
vcsrepo { $app_path:
ensure => present,
provider => git,
source => $git_repo,
revision => $version, # Corresponds to a git tag, e.g., 'v2.1.5'
user => 'appuser',
require => File[$app_root],
}
# 2. Install dependencies
# This command re-runs whenever the vcsrepo resource updates the source
# code (refreshonly plus subscribe), reinstalling dependencies for each new version.
exec { "npm_install_${service_name}":
command => 'npm install --production',
cwd => $app_path,
user => 'appuser',
path => '/usr/bin:/bin/',
refreshonly => true, # Only run on change
subscribe => Vcsrepo[$app_path],
require => Vcsrepo[$app_path],
}
# 3. Build the application
# Similarly, only runs when the source code is updated.
exec { "npm_build_${service_name}":
command => 'npm run build',
cwd => $app_path,
user => 'appuser',
path => '/usr/bin:/bin/',
refreshonly => true,
subscribe => Exec["npm_install_${service_name}"],
require => Exec["npm_install_${service_name}"],
}
# 4. Manage the systemd service
# This file content is templated to inject the correct port.
file { "/etc/systemd/system/${service_name}.service":
ensure => file,
owner => 'root',
group => 'root',
mode => '0644',
content => template('app/service.erb'),
notify => Service[$service_name],
}
# 5. Ensure the service is running and enabled
# It will be restarted if the build or systemd unit file changes.
service { $service_name:
ensure => running,
enable => true,
require => [
Exec["npm_build_${service_name}"],
File["/etc/systemd/system/${service_name}.service"],
],
}
}
The corresponding systemd template (service.erb) would look something like this:
# /etc/systemd/system/<%= @service_name %>.service
# This file is managed by Puppet.
[Unit]
Description=Monolith Frontend Service <%= @service_name %>
After=network.target
[Service]
Type=simple
User=appuser
Group=appuser
WorkingDirectory=<%= @app_root %>/<%= @service_name %>
# We use the built-in Vue development server for simplicity here.
# In a real-world project, this would be a more robust server like 'serve' or a Node.js server.
ExecStart=/usr/bin/npm run serve -- --port <%= @port %>
Restart=on-failure
StandardOutput=journal
StandardError=journal
SyslogIdentifier=<%= @service_name %>
[Install]
WantedBy=multi-user.target
With this defined type, our node classification for the frontend servers becomes incredibly simple and declarative.
# manifests/site.pp
node 'frontend-server-01.example.com', 'frontend-server-02.example.com' {
# Deploy the stable version
app::deployment { 'monolith-stable':
version => 'v2.1.4',
port => 8081,
service_name => 'monolith-stable',
app_root => '/srv/www',
git_repo => 'https://git.example.com/frontend/monolith.git',
}
# Deploy the new canary version
app::deployment { 'monolith-canary':
version => 'v2.1.5-canary',
port => 8082,
service_name => 'monolith-canary',
app_root => '/srv/www',
git_repo => 'https://git.example.com/frontend/monolith.git',
}
# Also ensure Envoy is installed and configured to start
class { 'envoy':
# Puppet module configuration for Envoy...
}
}
When we want to start a canary release, an engineer simply changes the version parameter for monolith-canary and Puppet takes care of the rest. On the next run, it will pull the new git tag, rebuild the application, and restart the service. At this point, we have both versions running on the servers, but no traffic is going to the canary yet. That’s Envoy’s job.
The Dynamic Core: Traffic Shifting with Envoy and xDS
Envoy sits as a reverse proxy on each frontend server, listening on port 80 and routing traffic to either the stable (port 8081) or canary (port 8082) application instance. The key is how we tell Envoy to split the traffic. Instead of reloading a static configuration file, we point Envoy to a custom management server—our xDS server.
Here is the bootstrap envoy.yaml configuration that Puppet deploys. It’s minimal, and its only job is to tell Envoy where to find its real configuration.
# /etc/envoy/envoy.yaml - Managed by Puppet
admin:
address:
socket_address:
address: 127.0.0.1
port_value: 9901
static_resources:
listeners:
- name: listener_0
address:
socket_address:
address: 0.0.0.0
port_value: 80
filter_chains:
- filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
stat_prefix: ingress_http
# This is the crucial part. We are telling Envoy to fetch its route
# configuration from our xDS server via the RDS (Route Discovery Service) protocol.
rds:
  # Envoy must name the route configuration it requests over RDS; this
  # matches the RouteConfiguration "local_route" served by our control plane.
  route_config_name: local_route
  config_source:
api_config_source:
api_type: GRPC
transport_api_version: V3
grpc_services:
- envoy_grpc:
cluster_name: xds_cluster
http_filters:
- name: envoy.filters.http.router
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
# We also statically define the cluster for the xDS server itself.
clusters:
- name: xds_cluster
type: STRICT_DNS
connect_timeout: 0.25s
# The xDS server speaks gRPC, so this cluster must be configured for HTTP/2.
typed_extension_protocol_options:
envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
"@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
explicit_http_config:
http2_protocol_options: {}
load_assignment:
cluster_name: xds_cluster
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
# This points to our custom control plane service.
address: xds-control-plane.service.consul
port_value: 50051
dynamic_resources:
# Tell Envoy to fetch cluster definitions (our app backends) from the xDS server.
# This is CDS (Cluster Discovery Service).
cds_config:
resource_api_version: V3
api_config_source:
api_type: GRPC
transport_api_version: V3
grpc_services:
- envoy_grpc:
cluster_name: xds_cluster
Now for the brains of the operation: a simple Go-based xDS server. This server maintains the current traffic split percentage and generates the appropriate Envoy configuration on demand. It builds on the go-control-plane library; a production implementation would add persistence, error handling, and lifecycle management, but this simplified example illustrates the concept.
// xds-server/main.go
package main
import (
    "context"
    "fmt"
    "log"
    "net"
    "sync"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/protobuf/types/known/anypb"
    "google.golang.org/protobuf/types/known/durationpb"
    "google.golang.org/protobuf/types/known/wrapperspb"

    cluster "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
    core "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
    endpoint "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
    listener "github.com/envoyproxy/go-control-plane/envoy/config/listener/v3"
    route "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
    hcm "github.com/envoyproxy/go-control-plane/envoy/extensions/filters/network/http_connection_manager/v3"
    clusterservice "github.com/envoyproxy/go-control-plane/envoy/service/cluster/v3"
    discoveryservice "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
    routeservice "github.com/envoyproxy/go-control-plane/envoy/service/route/v3"
    "github.com/envoyproxy/go-control-plane/pkg/cache/types"
    "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
    "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
    "github.com/envoyproxy/go-control-plane/pkg/server/v3"
)
// This simple global variable holds our state. In production, this would
// be backed by Redis or etcd for persistence and high availability.
var (
canaryWeight uint32 = 0 // Weight from 0 to 100
stateMutex sync.RWMutex
)
// generateSnapshot creates a full configuration snapshot for Envoy.
func generateSnapshot() (*cache.Snapshot, error) {
stateMutex.RLock()
defer stateMutex.RUnlock()
stableWeight := 100 - canaryWeight
// 1. Define Clusters (CDS)
clusters := []types.Resource{
&cluster.Cluster{
Name: "stable_cluster",
ConnectTimeout: durationpb.New(5 * time.Second),
ClusterDiscoveryType: &cluster.Cluster_Type{Type: cluster.Cluster_STATIC},
LbPolicy: cluster.Cluster_ROUND_ROBIN,
LoadAssignment: createLoadAssignment("stable_cluster", "127.0.0.1", 8081),
},
&cluster.Cluster{
Name: "canary_cluster",
ConnectTimeout: durationpb.New(5 * time.Second),
ClusterDiscoveryType: &cluster.Cluster_Type{Type: cluster.Cluster_STATIC},
LbPolicy: cluster.Cluster_ROUND_ROBIN,
LoadAssignment: createLoadAssignment("canary_cluster", "127.0.0.1", 8082),
},
}
// 2. Define Routes (RDS)
// This is the core of the canary logic.
routes := []types.Resource{
&route.RouteConfiguration{
Name: "local_route",
VirtualHosts: []*route.VirtualHost{
{
Name: "local_service",
Domains: []string{"*"},
Routes: []*route.Route{
{
Match: &route.RouteMatch{
PathSpecifier: &route.RouteMatch_Prefix{
Prefix: "/",
},
},
Action: &route.Route_Route{
Route: &route.RouteAction{
ClusterSpecifier: &route.RouteAction_WeightedClusters{
WeightedClusters: &route.WeightedCluster{
Clusters: []*route.WeightedCluster_ClusterWeight{
{
Name: "stable_cluster",
Weight: wrapperspb.UInt32(stableWeight),
},
{
Name: "canary_cluster",
Weight: wrapperspb.UInt32(canaryWeight),
},
},
},
},
},
},
},
},
},
},
},
}
// Create the snapshot
version := fmt.Sprintf("v%d", time.Now().Unix())
snapshot, err := cache.NewSnapshot(
version,
map[resource.Type][]types.Resource{
resource.ClusterType: clusters,
resource.RouteType: routes,
},
)
if err != nil {
return nil, err
}
return snapshot, nil
}
// ... (omitting gRPC server setup and helper functions for brevity)
// The full server would listen for gRPC connections from Envoy, and whenever
// the state (canaryWeight) changes, it pushes a new snapshot to all connected Envoy clients.
// We also need a simple HTTP API for the Pinia dashboard to call.
func main() {
// ...
}
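The helpers and server wiring elided above carry real mechanics, so they are worth sketching. The following is illustrative rather than drop-in: it assumes go-control-plane's SnapshotCache, assumes every Envoy bootstraps with node.id "frontend-envoy" so that one snapshot fans out to all proxies, and the exact signatures (notably SetSnapshot) vary between library versions.
// xds-server/helpers.go
// A minimal sketch of the pieces elided above (not production code).
package main

import (
    "context"
    "encoding/json"
    "net/http"

    core "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
    endpoint "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
    "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
)

// Assumed to match the node.id set in each Envoy's bootstrap.
const envoyNodeID = "frontend-envoy"

// createLoadAssignment builds the single static endpoint (host:port) that
// generateSnapshot attaches to the stable and canary clusters.
func createLoadAssignment(clusterName, host string, port uint32) *endpoint.ClusterLoadAssignment {
    return &endpoint.ClusterLoadAssignment{
        ClusterName: clusterName,
        Endpoints: []*endpoint.LocalityLbEndpoints{{
            LbEndpoints: []*endpoint.LbEndpoint{{
                HostIdentifier: &endpoint.LbEndpoint_Endpoint{
                    Endpoint: &endpoint.Endpoint{
                        Address: &core.Address{
                            Address: &core.Address_SocketAddress{
                                SocketAddress: &core.SocketAddress{
                                    Address:       host,
                                    PortSpecifier: &core.SocketAddress_PortValue{PortValue: port},
                                },
                            },
                        },
                    },
                },
            }},
        }},
    }
}

// pushSnapshot regenerates the configuration and hands it to the snapshot
// cache, which streams the update to every connected Envoy.
func pushSnapshot(ctx context.Context, snapshots cache.SnapshotCache) error {
    snapshot, err := generateSnapshot()
    if err != nil {
        return err
    }
    return snapshots.SetSnapshot(ctx, envoyNodeID, snapshot)
}

// setWeightHandler is the HTTP API the Pinia dashboard calls:
// POST /api/weight with a JSON body such as {"weight": 25}.
func setWeightHandler(snapshots cache.SnapshotCache) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        var body struct {
            Weight uint32 `json:"weight"`
        }
        if err := json.NewDecoder(r.Body).Decode(&body); err != nil || body.Weight > 100 {
            http.Error(w, "weight must be an integer from 0 to 100", http.StatusBadRequest)
            return
        }
        stateMutex.Lock()
        canaryWeight = body.Weight
        stateMutex.Unlock()
        if err := pushSnapshot(r.Context(), snapshots); err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        w.WriteHeader(http.StatusNoContent)
    }
}
The remaining wiring is standard go-control-plane boilerplate: construct the snapshot cache with cache.NewSnapshotCache, wrap it in server.NewServer from pkg/server/v3, register the CDS and RDS gRPC services on port 50051 (matching the xds_cluster in the bootstrap), and mount setWeightHandler on the HTTP mux the dashboard talks to.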
This setup creates a powerful feedback loop. Puppet defines the available application versions. The xDS server defines how traffic is split between them. Envoy executes the split. The only missing piece is a way for a human to control the canaryWeight.
graph TD
    subgraph "Puppet Managed State"
        A[Puppet Server] -->|manifests| B(Server A);
        A -->|manifests| C(Server B);
        B -- "deploys app v2.1.4 on :8081" --> D{Stable App};
        B -- "deploys app v2.1.5 on :8082" --> E{Canary App};
        C -- "deploys app v2.1.4 on :8081" --> F{Stable App};
        C -- "deploys app v2.1.5 on :8082" --> G{Canary App};
    end
    subgraph "Dynamic Traffic Control"
        H(Operator) -->|uses| I[Pinia Dashboard];
        I -->|POST /api/weight| J[xDS Control Plane];
        J -->|gRPC xDS Stream| K(Envoy on Server A);
        J -->|gRPC xDS Stream| L(Envoy on Server B);
    end
    subgraph "Traffic Flow"
        M[User Traffic] --> K;
        M[User Traffic] --> L;
        K -- "90% traffic" --> D;
        K -- "10% traffic" --> E;
        L -- "90% traffic" --> F;
        L -- "10% traffic" --> G;
    end
The Control Plane: A Pinia-Powered Dashboard
The final component is the human interface. This is where Pinia shines, providing a reactive and centralized state management solution for our control panel. The store holds the complete state of the canary release process.
// src/stores/releaseStore.js
import { defineStore } from 'pinia';
import axios from 'axios';
// A common mistake is to put too much logic in components.
// The store should be the single source of truth and contain all
// business logic related to the release state.
const CONTROL_PLANE_API = 'http://xds-control-plane.service.consul/api';
export const useReleaseStore = defineStore('release', {
state: () => ({
stableVersion: 'v2.1.4',
canaryVersion: 'v2.1.5-canary',
releaseStatus: 'monitoring', // e.g., 'paused', 'advancing', 'failed', 'promoted'
trafficWeight: 0, // Percentage (0-100) sent to canary
isLoading: false,
error: null,
metrics: {
// These would be populated from a Prometheus query
stable: { successRate: 99.9, p95Latency: 120 },
canary: { successRate: 99.9, p95Latency: 125 },
},
}),
actions: {
// Action to update the traffic weight. This is the core interactive feature.
async setTrafficWeight(newWeight) {
if (this.isLoading) return;
this.isLoading = true;
this.error = null;
const weight = Math.max(0, Math.min(100, newWeight));
try {
// In a real-world project, you'd add optimistic updates
// and robust error handling here.
await axios.post(`${CONTROL_PLANE_API}/weight`, { weight });
this.trafficWeight = weight;
this.releaseStatus = 'monitoring';
} catch (e) {
this.error = 'Failed to update traffic weight.';
// trafficWeight was never mutated, so the UI still shows the last applied weight
console.error(e);
} finally {
this.isLoading = false;
}
},
// Action to fully promote the canary version
async promoteCanary() {
// This would first set weight to 100
await this.setTrafficWeight(100);
// Then, it would trigger an external process (e.g., a Jenkins job)
// to update the Puppet manifest, setting the stable version to the
// canary version and removing the canary deployment.
console.log('Promoting canary... Puppet run will be triggered.');
this.releaseStatus = 'promoted';
},
// Action for an emergency rollback
async rollbackCanary() {
await this.setTrafficWeight(0);
// This would also trigger an external process to update Puppet,
// simply removing the canary deployment resource.
console.log('Rolling back canary... Puppet run will be triggered.');
this.releaseStatus = 'rolled_back';
},
// Action to fetch health metrics from our observability stack
async fetchMetrics() {
// This is a mock. A real implementation would query Prometheus
// for metrics tagged by the Envoy cluster ('stable_cluster', 'canary_cluster').
// Example PromQL query:
// sum(rate(envoy_cluster_upstream_rq_xx{envoy_cluster_name="canary_cluster", envoy_response_code_class="2"}[5m]))
// /
// sum(rate(envoy_cluster_upstream_rq_total{envoy_cluster_name="canary_cluster"}[5m]))
this.metrics.canary.successRate = (Math.random() * 0.2 + 99.7).toFixed(2);
this.metrics.canary.p95Latency = Math.floor(Math.random() * 10 + 120);
}
},
});
The Vue component then becomes a straightforward consumer of this store, providing the UI elements for the release engineer.
<!-- src/components/ReleaseControlPanel.vue -->
<template>
<div class="control-panel">
<h2>Canary Release Control: {{ releaseStore.canaryVersion }}</h2>
<div class="status">Current Status: <strong>{{ releaseStore.releaseStatus }}</strong></div>
<div class="traffic-control">
<label for="weight-slider">Canary Traffic: {{ releaseStore.trafficWeight }}%</label>
<input
type="range"
id="weight-slider"
min="0"
max="100"
:value="releaseStore.trafficWeight"
@input="updateWeight($event.target.value)"
:disabled="releaseStore.isLoading"
/>
</div>
<div class="metrics">
<div class="metric-card stable">
<h4>Stable: {{ releaseStore.stableVersion }}</h4>
<p>Success Rate: {{ releaseStore.metrics.stable.successRate }}%</p>
<p>P95 Latency: {{ releaseStore.metrics.stable.p95Latency }}ms</p>
</div>
<div class="metric-card canary">
<h4>Canary: {{ releaseStore.canaryVersion }}</h4>
<p>Success Rate: {{ releaseStore.metrics.canary.successRate }}%</p>
<p>P95 Latency: {{ releaseStore.metrics.canary.p95Latency }}ms</p>
</div>
</div>
<div v-if="releaseStore.error" class="error-message">{{ releaseStore.error }}</div>
<div class="actions">
<button @click="releaseStore.promoteCanary()" :disabled="releaseStore.trafficWeight !== 100">Promote to Stable</button>
<button @click="releaseStore.rollbackCanary()" class="danger">Immediate Rollback</button>
</div>
</div>
</template>
<script setup>
import { useReleaseStore } from '@/stores/releaseStore';
import { onMounted, onUnmounted } from 'vue';
import debounce from 'lodash.debounce';
const releaseStore = useReleaseStore();
// Debounce the slider input to avoid flooding the API
const updateWeight = debounce((value) => {
releaseStore.setTrafficWeight(parseInt(value, 10));
}, 300);
let metricsInterval;
onMounted(() => {
// Periodically refresh metrics
releaseStore.fetchMetrics();
metricsInterval = setInterval(() => releaseStore.fetchMetrics(), 5000);
});
onUnmounted(() => {
clearInterval(metricsInterval);
});
</script>
This architecture provides a complete, closed-loop system. Puppet ensures the physical state of the deployments. Envoy, directed by the xDS server, manages the logical network state. The Pinia-powered dashboard provides the human-in-the-loop control and observability, empowering engineers to make data-informed decisions during the most critical phase of the development lifecycle.
The current implementation, while functional, has clear limitations in a large-scale production environment. The xDS control plane is a single point of failure with an in-memory state; this state should be externalized to a highly available key-value store like etcd or Consul. The promotion and rollback process still requires a manual trigger that kicks off a separate automation to update Puppet; a more advanced version would integrate this into a single, cohesive workflow, potentially managed by a CI/CD orchestrator like ArgoCD or Flux, which could update the git repository that Puppet uses as its source of truth. Furthermore, the reliance on manual metric observation is a stepping stone towards automated canary analysis, where the system itself would advance or halt the rollout based on predefined Service Level Objectives (SLOs), removing the human from the loop for routine releases.
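To make the first of those limitations concrete, the sketch below shows one way to externalize canaryWeight to etcd, assuming the official go.etcd.io/etcd/client/v3 package; the key name and timeout are placeholders. The HTTP handler writes the weight to etcd instead of mutating the global, and every control-plane replica watches the key and pushes fresh snapshots when it changes.
// xds-server/etcd_state.go
// A sketch of externalizing the canary weight to etcd (illustrative only).
package main

import (
    "context"
    "log"
    "strconv"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// Placeholder key; any consistent, replica-visible name works.
const weightKey = "/release/monolith/canary_weight"

// watchWeight keeps the local canaryWeight in sync with etcd, so every
// control-plane replica serves identical snapshots to its Envoys.
func watchWeight(ctx context.Context, cli *clientv3.Client, onChange func()) {
    for resp := range cli.Watch(ctx, weightKey) {
        for _, ev := range resp.Events {
            w, err := strconv.ParseUint(string(ev.Kv.Value), 10, 32)
            if err != nil || w > 100 {
                log.Printf("ignoring invalid weight %q", ev.Kv.Value)
                continue
            }
            stateMutex.Lock()
            canaryWeight = uint32(w)
            stateMutex.Unlock()
            onChange() // e.g. regenerate and push a new snapshot
        }
    }
}

// setWeight is what the HTTP handler would call instead of writing the
// global directly; the watch above applies the change on every replica.
func setWeight(ctx context.Context, cli *clientv3.Client, weight uint32) error {
    ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
    defer cancel()
    _, err := cli.Put(ctx, weightKey, strconv.FormatUint(uint64(weight), 10))
    return err
}
With the release state in etcd, the single point of failure collapses to the etcd cluster itself, which is designed for exactly this kind of replicated, consistent configuration.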