The frontend release process had become our most significant source of production incidents. A seemingly innocuous change in a JavaScript bundle could degrade performance for a subset of users on a specific browser, or a faulty API integration could trigger a cascade of client-side exceptions. Each rollback was a frantic, manual process involving direct cluster access, bypassing our established CI/CD workflows and leaving our system state inconsistent with our version control. The core of the problem was a complete lack of immediate, quantifiable feedback on a release’s health post-deployment. We were flying blind, relying on user complaints as our primary monitoring system.
This led to the conception of a “Log-Driven Canary Promotion” system. The principle was to automate the canary release pattern by using real-time, structured logs from the frontend application as the sole source of truth for decision-making. A new release would be exposed to a small fraction of production traffic. A pipeline would then observe its behavior by analyzing client-side logs. If predefined Service Level Objectives (SLOs)—such as JavaScript error rates or Core Web Vitals—were met, the pipeline would automatically promote the release. If breached, it would trigger an automatic rollback. Human intervention would only be required to investigate a failed release after the automatic rollback had already contained it.
Our existing stack presented a set of constraints. We had a mature Jenkins installation for CI, and migrating was not on the table. For deployment, we had committed to a GitOps model using ArgoCD to ensure our cluster state was declaratively managed and auditable. The missing piece was the feedback loop. For this, we selected Loki. While Prometheus is excellent for numerical metrics, it falls short when analyzing the high-cardinality, contextual data inherent in frontend exceptions. We needed to inspect stack traces, browser user agents, and user session IDs to understand the nature of failures, not just their frequency. Loki’s LogQL, combined with its cost-effective storage model for indexed labels and unindexed log content, made it the pragmatic choice for analyzing the torrent of logs a production frontend application generates.
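For example, once client errors arrive as structured JSON, a single LogQL query can slice failures by any field in the payload. A sketch, assuming the labels and JSON fields emitted by the frontend logger shown later in this post:

# Recent canary errors coming from Firefox, rendered as message plus stack trace.
{app="my-frontend", env="production", level="error", version=~"canary-.*"}
  | json
  | userAgent =~ "(?i).*firefox.*"
  | line_format "{{.message}} {{.stack}}"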
The final architecture formalizes this feedback loop.
graph TD
    subgraph "Git Repositories"
        A[App Repo] -->|Merge to main| B(Jenkins)
        C[Manifest Repo]
    end
    subgraph "CI/CD Orchestration"
        B --"1. Build & Push Image"--> D{Container Registry}
        B --"2. Update Canary Image Tag"--> C
        B --"5. Query Logs"--> E[Loki API]
        B --"6. Decision Gate"--> B
        B --"7. Promote or Rollback"--> C
    end
    subgraph "GitOps Engine"
        C --"Watched by"--> F(ArgoCD)
        F --"3. Reconcile State (Deploy Canary)"--> G[Kubernetes Cluster]
        F --"8. Reconcile State (Promote/Rollback)"--> G
    end
    subgraph "Production Environment"
        G --"Serves Traffic"--> H[Users]
        H --"4. Send Structured Logs"--> E
    end
    style B fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#9cf,stroke:#333,stroke-width:2px
    style E fill:#fc9,stroke:#333,stroke-width:2px
The process is entirely git-driven. Jenkins acts as the orchestrator, modifying manifests in a dedicated GitOps repository. ArgoCD acts as the reconciler, ensuring the cluster state matches the desired state defined in git. Loki provides the critical, real-time data needed for the orchestration logic to make an informed decision.
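On the ArgoCD side, a single Application pointing at the production overlay is enough to close the loop. A sketch, in which the project, destination namespace, and sync policy are assumptions:

# Hypothetical ArgoCD Application; project, namespaces, and sync policy are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-frontend-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: [email protected]:my-org/my-app-manifests.git
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: frontend
  syncPolicy:
    automated:
      prune: true
      selfHeal: true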
Frontend Log Instrumentation
The foundation of this system is high-quality, structured client-side logs. We developed a lightweight logging service in our frontend application to capture relevant events and ship them to a Grafana Agent endpoint, which then forwards them to Loki. A common mistake is to simply log strings; instead, every log entry must be a JSON object with consistent fields and labels that Loki can index.
Here is a simplified but production-ready version of the logger service. It captures uncaught exceptions, promise rejections, and provides a method for instrumenting Core Web Vitals.
// src/services/LokiLogger.js
// Loki's push API expects each timestamp as a string of Unix epoch nanoseconds.
// Date.now() gives millisecond precision, so append six zeros to express it in nanoseconds.
const getNanoTimestamp = () => `${Date.now()}000000`;
class LokiLogger {
constructor(config) {
this.endpoint = config.endpoint;
this.batchSize = config.batchSize || 100;
this.batchInterval = config.batchInterval || 5000; // 5 seconds
// Static labels sent with every log batch. This is crucial for Loki.
// The version label distinguishes canary logs from production logs.
this.staticLabels = {
app: config.appName || 'my-frontend',
env: config.env || 'production',
version: config.version, // e.g., 'stable' or 'canary-a9bcf12'
};
this.logQueue = [];
this.batchTimer = null;
this.init();
}
init() {
// Automatically capture global errors
window.addEventListener('error', event => {
      this.error('uncaught_exception', {
        // Use a distinct key so event.message does not collide with the log entry's own message field.
        errorMessage: event.message,
filename: event.filename,
lineno: event.lineno,
colno: event.colno,
// Avoid logging the full error object as it can be circular
stack: event.error ? event.error.stack : 'N/A',
});
});
window.addEventListener('unhandledrejection', event => {
this.error('unhandled_rejection', {
reason: event.reason instanceof Error ? event.reason.stack : String(event.reason),
});
});
}
// Generic log method
log(level, message, context = {}) {
if (!this.endpoint) {
console.warn('LokiLogger: Endpoint not configured. Skipping log.');
return;
}
const logEntry = {
// Loki's push API requires a stream object and an array of [timestamp, log_line]
stream: {
level,
...this.staticLabels,
},
values: [
[
getNanoTimestamp(),
JSON.stringify({
message,
...context,
// Add some useful browser context to every log
url: window.location.href,
userAgent: navigator.userAgent,
}),
],
],
};
this.logQueue.push(logEntry);
if (this.logQueue.length >= this.batchSize) {
this.flush();
} else if (!this.batchTimer) {
this.batchTimer = setTimeout(() => this.flush(), this.batchInterval);
}
}
info(message, context) {
this.log('info', message, context);
}
warn(message, context) {
this.log('warn', message, context);
}
error(message, context) {
this.log('error', message, context);
}
// Method to capture Core Web Vitals
instrumentWebVitals() {
    try {
      // Assumes the web-vitals library is available.
      // import() is asynchronous, so load failures surface via .catch(), not the surrounding try/catch.
      import('web-vitals')
        .then(({ onLCP, onFID, onCLS }) => {
          onLCP(metric => this.info('web_vitals', { metric: metric.name, value: metric.value }));
          onFID(metric => this.info('web_vitals', { metric: metric.name, value: metric.value }));
          onCLS(metric => this.info('web_vitals', { metric: metric.name, value: metric.value }));
        })
        .catch(e => this.warn('web_vitals_instrumentation_failed', { error: e.message }));
    } catch (e) {
      this.warn('web_vitals_instrumentation_failed', { error: e.message });
    }
}
async flush() {
if (this.batchTimer) {
clearTimeout(this.batchTimer);
this.batchTimer = null;
}
if (this.logQueue.length === 0) {
return;
}
const batch = { streams: this.logQueue.slice() };
this.logQueue = [];
try {
await fetch(this.endpoint, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(batch),
keepalive: true, // Ensures the request is sent even if the page is unloading
});
} catch (error) {
console.error('LokiLogger: Failed to send log batch.', error);
// In a real-world scenario, you might implement a retry mechanism or store logs in localStorage
}
}
}
// Example instantiation in the main application entry point
const logger = new LokiLogger({
endpoint: '/loki/api/v1/push', // Pushing to a relative path handled by a reverse proxy
appName: 'main-webapp',
env: process.env.NODE_ENV,
version: process.env.REACT_APP_VERSION, // This MUST be injected at build time
batchSize: 50,
});
logger.instrumentWebVitals();
export default logger;
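The relative /loki/api/v1/push endpoint above assumes a reverse proxy in front of the application that forwards browser log batches to the collector rather than exposing Loki directly. A minimal NGINX sketch, where the grafana-agent upstream name and port 3500 are assumptions:

# Hypothetical location block in the frontend's NGINX config.
location /loki/api/v1/push {
    # Forward browser log batches to the Grafana Agent, which adds labels and
    # relays them to Loki; Loki itself stays off the public internet.
    proxy_pass http://grafana-agent:3500/loki/api/v1/push;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}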
The critical part is injecting process.env.REACT_APP_VERSION. During the CI build for a canary, this variable is set to a unique identifier, like canary-a9bcf12, while the stable build uses stable. This version label becomes the primary dimension for comparing the canary and stable deployments in Loki.
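How that value reaches the bundle depends on the build tooling. A minimal Dockerfile sketch, assuming a React build served by NGINX (base images and paths are assumptions):

# Hypothetical multi-stage build; adjust base images and paths to your project.
FROM node:20-alpine AS build
WORKDIR /app
# Jenkins passes the canary identifier as a build argument (see the pipeline below).
ARG REACT_APP_VERSION=stable
ENV REACT_APP_VERSION=${REACT_APP_VERSION}
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM nginx:1.25-alpine
COPY --from=build /app/build /usr/share/nginx/html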
GitOps Repository and Deployment Structure
Our GitOps repository uses Kustomize to manage environment-specific configurations. This declarative approach is fundamental to GitOps.
The directory structure looks like this:
my-app-manifests/
├── base/
│ ├── kustomization.yaml
│ ├── deployment.yaml
│ ├── service.yaml
│ └── ingress.yaml
└── overlays/
└── production/
├── kustomization.yaml
├── deployment-stable.yaml
├── deployment-canary.yaml
└── ingress-canary.yaml
The base/deployment.yaml defines the common template.
# my-app-manifests/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-frontend
spec:
replicas: 3
selector:
matchLabels:
app: my-frontend
template:
metadata:
labels:
app: my-frontend
spec:
containers:
- name: my-frontend
image: my-registry/my-frontend:placeholder # This will be overridden
ports:
- containerPort: 80
env:
- name: REACT_APP_VERSION
value: "placeholder" # Overridden by overlays
The production overlay defines two separate Deployments, one for stable and one for canary, declared as full resources in the overlay's kustomization.
# my-app-manifests/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  # The stable and canary Deployments are full manifests, not patches:
  # a patch can only modify an existing resource, it cannot create a second one.
  - deployment-stable.yaml
  - deployment-canary.yaml
  - ingress-canary.yaml
patches:
  # Drop the generic base Deployment so only the stable/canary pair is rendered.
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: my-frontend
      $patch: delete
deployment-stable.yaml pins the main deployment to the stable version.
# my-app-manifests/overlays/production/deployment-stable.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-frontend-stable
spec:
replicas: 10
selector:
matchLabels:
app: my-frontend
version: stable
template:
metadata:
labels:
app: my-frontend
version: stable
spec:
containers:
- name: my-frontend
image: my-registry/my-frontend:v1.2.0 # The current stable image tag
env:
- name: REACT_APP_VERSION
value: "stable"
The deployment-canary.yaml is what Jenkins will manipulate. Initially, it might have replicas: 0.
# my-app-manifests/overlays/production/deployment-canary.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-frontend-canary
spec:
replicas: 0 # Initially disabled
selector:
matchLabels:
app: my-frontend
version: canary
template:
metadata:
labels:
app: my-frontend
version: canary
spec:
containers:
- name: my-frontend
image: my-registry/my-frontend:v1.2.1-a9bcf12 # The new candidate image tag
env:
- name: REACT_APP_VERSION
value: "canary-a9bcf12" # Unique version for logging
Traffic is split using an Ingress controller that supports weighted routing (e.g., NGINX Ingress with canary annotations). The ingress-canary.yaml would configure a 90/10 split between the my-frontend-stable and my-frontend-canary services.
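A sketch of what ingress-canary.yaml could look like with the NGINX Ingress controller; the host, service name, and port are assumptions:

# Hypothetical canary Ingress; host and backend service are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-frontend-canary
  annotations:
    # Marks this Ingress as the canary twin of the primary my-frontend Ingress.
    nginx.ingress.kubernetes.io/canary: "true"
    # Routes roughly 10% of requests to the canary Service.
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  ingressClassName: nginx
  rules:
    - host: app.my-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-frontend-canary
                port:
                  number: 80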
The Jenkins Orchestration Pipeline
The Jenkinsfile is the brain of the operation. It’s a Declarative Pipeline that coordinates the build, canary deployment, analysis, and final promotion or rollback.
// Jenkinsfile
pipeline {
agent any
environment {
// Assume Jenkins credentials are set up for registry and git
REGISTRY = "my-registry"
APP_NAME = "my-frontend"
MANIFEST_REPO_URL = "[email protected]:my-org/my-app-manifests.git"
LOKI_URL = "https://loki.my-domain.com"
// Key for Loki API access, stored in Jenkins secrets
LOKI_API_KEY = credentials('loki-api-key')
}
stages {
stage('1. Build and Push') {
steps {
script {
// Use commit hash for unique image tagging
env.IMAGE_TAG = "${env.REGISTRY}/${env.APP_NAME}:${env.BUILD_ID}-${shortenGitCommit()}"
env.CANARY_VERSION = "canary-${shortenGitCommit()}"
// Build the Docker image with the canary version injected as a build arg
docker.build(env.IMAGE_TAG, "--build-arg REACT_APP_VERSION=${env.CANARY_VERSION} .")
// Push the image to the registry
docker.withRegistry("https://my-registry.com", "registry-credentials") {
docker.image(env.IMAGE_TAG).push()
}
}
}
}
stage('2. Deploy Canary') {
steps {
script {
// This function encapsulates the git operations
updateManifests('deploy_canary')
}
}
}
stage('3. Automated Canary Analysis') {
steps {
script {
// A pitfall here is running a very short analysis. It must be long enough
// to gather statistically significant data.
def analysisDurationMinutes = 15
def checkIntervalSeconds = 60
def errorRateThreshold = 0.01 // 1% error rate
for (int i = 0; i < (analysisDurationMinutes * 60 / checkIntervalSeconds); i++) {
sleep(time: checkIntervalSeconds, unit: 'SECONDS')
echo "Running analysis cycle ${i + 1}/${analysisDurationMinutes * 60 / checkIntervalSeconds}..."
def canaryErrorRate = getLokiErrorRate(env.CANARY_VERSION)
def stableErrorRate = getLokiErrorRate("stable")
echo "Canary Error Rate: ${canaryErrorRate}, Stable Error Rate: ${stableErrorRate}"
// Compare canary against stable baseline plus a threshold
// A common mistake is using a fixed threshold. Comparing to baseline is more robust.
if (canaryErrorRate > stableErrorRate + errorRateThreshold) {
error("Canary analysis failed! Error rate ${canaryErrorRate} exceeds baseline ${stableErrorRate} + threshold ${errorRateThreshold}.")
}
}
echo "Canary analysis passed successfully."
}
}
}
stage('4. Promote to Stable') {
steps {
script {
updateManifests('promote_stable')
}
}
}
}
  post {
    // This block runs regardless of pipeline status
    always {
      // Declarative post conditions need at least one step; the actual canary
      // cleanup happens in the success and failure handlers below.
      echo "Pipeline finished with status: ${currentBuild.currentResult}"
    }
failure {
script {
// If any stage fails, especially the analysis, trigger a rollback.
echo "Pipeline failed. Rolling back canary deployment."
updateManifests('rollback_canary')
// Send notifications (e.g., Slack)
}
}
success {
script {
echo "Pipeline successful. Canary promoted to stable."
// Clean up by setting canary replicas to 0 after promotion
updateManifests('cleanup_canary')
}
}
}
}
// Function to shorten git commit for cleaner tagging
def shortenGitCommit() {
return sh(script: 'git rev-parse --short HEAD', returnStdout: true).trim()
}
// Function to interact with the GitOps repository
void updateManifests(String action) {
def manifestDir = "manifests-checkout"
// Use a unique directory to avoid workspace conflicts
dir(manifestDir) {
    git branch: 'main', credentialsId: 'github-ssh-key', url: env.MANIFEST_REPO_URL
def kustomizePath = "overlays/production"
switch(action) {
      case 'deploy_canary':
        // Set the canary image and version label, then scale it up to one replica.
        // Editing only deployment-canary.yaml keeps the stable Deployment untouched.
        sh """
          yq -i '.spec.template.spec.containers[0].image = "${env.IMAGE_TAG}"' ${kustomizePath}/deployment-canary.yaml
          yq -i '.spec.template.spec.containers[0].env[0].value = "${env.CANARY_VERSION}"' ${kustomizePath}/deployment-canary.yaml
          yq -i '.spec.replicas = 1' ${kustomizePath}/deployment-canary.yaml
        """
        break
      case 'promote_stable':
        // Roll the validated canary image out to the stable Deployment.
        sh """
          yq -i '.spec.template.spec.containers[0].image = "${env.IMAGE_TAG}"' ${kustomizePath}/deployment-stable.yaml
        """
        break
case 'rollback_canary':
case 'cleanup_canary':
// Scale down the canary deployment
sh "yq -i '.spec.replicas = 0' ${kustomizePath}/deployment-canary.yaml"
break
}
    // Commit and push the changes, which ArgoCD will then pick up.
    // Skip the commit when nothing changed (e.g., cleaning up an already scaled-down canary).
    sh """
      git config user.email "[email protected]"
      git config user.name "Jenkins CI"
      git add .
      git diff --cached --quiet || git commit -m "Jenkins: ${action} for build ${env.BUILD_ID}"
      git push origin main
    """
}
}
// Function to query Loki for error rates
// NOTE: deliberately not @NonCPS -- this function calls the sh step, which cannot
// be invoked from inside a @NonCPS method.
def getLokiErrorRate(String version) {
// This LogQL query calculates the rate of error logs vs all logs for a given version.
// It is a powerful way to normalize the error count against traffic volume.
def query = URLEncoder.encode(
"""(sum(rate({app="my-frontend", env="production", version="${version}", level="error"}[5m])) / sum(rate({app="my-frontend", env="production", version="${version}"}[5m]))) or vector(0)""",
"UTF-8"
)
def response = sh(
script: """
      # The query string is already URL-encoded above, so pass it directly rather than re-encoding it with --data-urlencode.
      curl -s "${env.LOKI_URL}/loki/api/v1/query?query=${query}" \
        -H "Authorization: Bearer ${env.LOKI_API_KEY}"
""",
returnStdout: true
).trim()
// A common pitfall is not handling empty results from Loki gracefully.
// The `or vector(0)` in LogQL helps, but we still need robust JSON parsing.
  // JsonSlurperClassic returns plain, serializable maps, which is safer inside CPS-transformed pipeline code.
  def jsonResponse = new groovy.json.JsonSlurperClassic().parseText(response)
if (jsonResponse.data.result.size() > 0 && jsonResponse.data.result[0].value.size() > 1) {
return jsonResponse.data.result[0].value[1].toFloat()
} else {
return 0.0f
}
}
This pipeline codifies the entire process. A developer pushes code, Jenkins builds it, updates the GitOps repo, and then enters a monitoring loop. The getLokiErrorRate function is the core of the analysis, executing a LogQL query to get a normalized error rate and returning a floating-point number that the pipeline logic can use for its decision. The use of or vector(0) in the LogQL query is a critical defensive measure: it prevents the query from returning an empty result when no logs are found, ensuring the pipeline doesn’t crash on missing data.
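Because the logger already ships web_vitals entries as JSON, the same gate could eventually consume performance percentiles as well. A hedged LogQL sketch, using the field names emitted by the logger above (the 15-minute window and the p75 quantile are illustrative):

# 75th-percentile LCP per version label over the last 15 minutes.
quantile_over_time(0.75,
  {app="my-frontend", env="production"}
    | json
    | message = "web_vitals"
    | metric = "LCP"
    | unwrap value [15m]
) by (version)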
The system now functions as an automated release guardian. We have successfully caught multiple bad releases before they impacted more than 10% of our user base, including CSS changes that broke rendering on Firefox and API contract mismatches that spammed the console with errors. The developer experience is improved, as they receive rapid, data-driven feedback on their changes in the production environment without the stress of a manual “big bang” release.
However, this architecture is not without its limitations. The primary dependency is Jenkins, a tool that is not natively designed for this kind of continuous, stateful orchestration. A failed Jenkins master during an analysis phase could leave a canary deployment in a limbo state. Furthermore, the analysis logic is currently limited to a simple error rate comparison. A more sophisticated implementation would analyze performance metrics like LCP or CLS percentiles, which also requires more complex LogQL queries. The next logical evolution of this system is to replace the Jenkins analysis loop with a dedicated GitOps-native progressive delivery tool like Argo Rollouts or Flagger. These tools run as controllers within the Kubernetes cluster, removing Jenkins as a single point of failure and offering more advanced analysis capabilities out-of-the-box, such as querying Prometheus metrics and running A/B tests. This would relegate Jenkins to its optimal role: a pure CI tool that builds artifacts and triggers the delivery workflow.
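For a rough sense of that target state, here is a sketch of an Argo Rollouts AnalysisTemplate expressing a similar error-rate gate; the Prometheus address, metric names, and threshold are assumptions, and a Loki-backed analysis would need a metrics plugin or the generic web provider:

# Hypothetical AnalysisTemplate; metric names and the Prometheus address are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: frontend-error-rate
spec:
  metrics:
    - name: js-error-rate
      interval: 1m
      count: 15
      # Abort the rollout the first time the canary error rate exceeds 1%.
      successCondition: result[0] <= 0.01
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(frontend_js_errors_total{version="canary"}[5m]))
            /
            sum(rate(frontend_requests_total{version="canary"}[5m]))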