Our Spinnaker pipelines were a black box. Deployments to the Docker Swarm cluster would kick off, and unless someone was actively watching the UI, failures went unnoticed for far too long. We had no historical data on pipeline performance, no way to correlate a deployment with a subsequent spike in application errors, and no means of answering simple questions like, “How often does the integration test stage fail for the checkout-service?” The operational cost of this blindness was accumulating, manifesting as longer incident recovery times and an inability to systematically improve our delivery process. The core pain point was a lack of structured, queryable data about the CI/CD process itself.
The initial concept was to leverage Spinnaker’s built-in webhook notifications. For every significant event in a pipeline’s lifecycle—start, stop, stage completion, failure—Spinnaker can dispatch a JSON payload to a specified HTTP endpoint. If we could capture these events, we could build a powerful dataset. The plan was to create a lightweight service to act as this webhook target. This service would need to be secure, resilient, and capable of transforming Spinnaker’s verbose event data into a format optimized for time-series analysis.
Technology selection was driven by pragmatism and our existing ecosystem. For the webhook receiver, Node.js was the obvious choice. Its asynchronous, event-driven nature is perfectly suited for an I/O-bound service that primarily receives HTTP requests and writes to a database. Our team’s proficiency in JavaScript meant we could build and maintain it efficiently. For data storage, InfluxDB was selected over a relational database or a document store. Deployment events are fundamentally time-series data: something happened at a specific point in time. InfluxDB’s data model, which revolves around measurements, tags, and fields, and its powerful Flux query language are purpose-built for this exact use case. Attempting to perform complex windowing or downsampling operations on pipeline durations in PostgreSQL would be clumsy and inefficient by comparison.
The deployment target was a non-negotiable constraint: our entire infrastructure ran on Docker Swarm. While Spinnaker’s integrations with Kubernetes are more mature, our reality was Swarm. This meant our solution had to be containerized and deployable as a Swarm service. Finally, security for the webhook endpoint was critical. Exposing an internal service endpoint to Spinnaker, which lives in a separate network segment, required proper authentication. A simple static API key was deemed too inflexible and a security risk. OAuth 2.0, specifically the Client Credentials Grant flow, was chosen as the standard for machine-to-machine authentication. It provides a robust mechanism for issuing and validating short-lived bearer tokens, which is a far more secure posture.
The final architecture can be visualized as a straightforward data flow:
graph TD
    subgraph Spinnaker
        A[Pipeline Execution] --> B{Run Job Stage};
        B -- Fetches Token --> C[OAuth 2.0 Server];
        C -- Returns JWT --> B;
        B -- Passes Token --> D[Webhook Notification];
    end
    subgraph "Webhook Service (Docker Swarm)"
        E[Node.js / Express App];
    end
    subgraph "Data & Auth"
        F[InfluxDB];
        C;
    end
    D -- POST request with Bearer Token --> E;
    E -- Validates Token --> C;
    E -- Transforms & Writes Data --> F;
This plan felt solid. It addressed the core pain point using a combination of technologies that played to their strengths, while respecting our existing infrastructure constraints. The real work, of course, was in the implementation.
Infrastructure Setup: Docker Swarm Stack
The first step was defining the necessary infrastructure as a Docker Swarm stack. This includes our target database, InfluxDB, and a simple OAuth 2.0 provider for demonstration purposes. In a real-world project, this would be an existing identity provider like Okta, Auth0, or an internal service. For this build, we use a mock OAuth2 server container.
Here is the docker-stack.yml file that defines our core services. Note the use of Docker secrets for managing sensitive information like tokens and passwords.
# docker-stack.yml
version: '3.8'
services:
influxdb:
image: influxdb:2.7
volumes:
- influxdb_data:/var/lib/influxdb2
ports:
- "8086:8086"
environment:
- DOCKER_INFLUXDB_INIT_MODE=setup
- DOCKER_INFLUXDB_INIT_USERNAME=${INFLUXDB_USER}
- DOCKER_INFLUXDB_INIT_PASSWORD=${INFLUXDB_PASS}
- DOCKER_INFLUXDB_INIT_ORG=${INFLUXDB_ORG}
- DOCKER_INFLUXDB_INIT_BUCKET=${INFLUXDB_BUCKET}
- DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=${INFLUXDB_TOKEN}
deploy:
replicas: 1
placement:
constraints: [node.role == manager]
restart_policy:
condition: on-failure
oauth2-provider:
# In a real scenario, use a production-grade OAuth2 server.
# This is a simple mock server for demonstration.
# Image source: a basic express server with `simple-oauth2`
image: my-mock-oauth2-server:1.0
ports:
- "9000:9000"
environment:
- CLIENT_ID=${OAUTH_CLIENT_ID}
- CLIENT_SECRET=${OAUTH_CLIENT_SECRET}
deploy:
replicas: 1
restart_policy:
condition: on-failure
webhook-service:
image: my-spinnaker-webhook-service:1.0
ports:
- "3000:3000"
secrets:
- influxdb_token
- oauth_jwt_secret
environment:
- INFLUX_URL=http://influxdb:8086
- INFLUX_ORG=${INFLUXDB_ORG}
- INFLUX_BUCKET=${INFLUXDB_BUCKET}
- INFLUX_TOKEN_FILE=/run/secrets/influxdb_token
- OAUTH_PROVIDER_URL=http://oauth2-provider:9000
- OAUTH_JWT_SECRET_FILE=/run/secrets/oauth_jwt_secret
- LOG_LEVEL=info
depends_on:
- influxdb
- oauth2-provider
deploy:
replicas: 2
restart_policy:
condition: on-failure
volumes:
influxdb_data:
secrets:
influxdb_token:
external: true
oauth_jwt_secret:
external: true
Before deploying, we create the necessary secrets:
printf "YOUR_SUPER_SECRET_INFLUX_TOKEN" | docker secret create influxdb_token -
printf "ANOTHER_VERY_SECRET_KEY_FOR_JWT" | docker secret create oauth_jwt_secret -
Then, the stack is deployed with docker stack deploy -c docker-stack.yml observability. This provides the foundation upon which our custom service will run.
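The my-mock-oauth2-server image referenced in the stack file is not shown in this post. As a rough illustration of what it might contain, here is a minimal Client Credentials token issuer built with Express and jsonwebtoken. It assumes the mock signs HS256 tokens with the same secret the webhook service later verifies against, which would mean also passing that secret into the container (something the stack file above does not do), so treat it as a sketch rather than the actual image.
// mock-oauth2-server.js -- illustrative only; a real deployment would use an existing IdP (Okta, Auth0, Keycloak, ...)
const express = require('express');
const jwt = require('jsonwebtoken');

const app = express();
app.use(express.urlencoded({ extended: false })); // token requests are form-encoded

// JWT_SECRET is an assumption here; the stack file above only passes CLIENT_ID and CLIENT_SECRET
const { CLIENT_ID, CLIENT_SECRET, JWT_SECRET } = process.env;

// Client Credentials Grant: exchange client_id/client_secret for a short-lived bearer token
app.post('/token', (req, res) => {
  const { grant_type, client_id, client_secret } = req.body;
  if (grant_type !== 'client_credentials') {
    return res.status(400).json({ error: 'unsupported_grant_type' });
  }
  if (client_id !== CLIENT_ID || client_secret !== CLIENT_SECRET) {
    return res.status(401).json({ error: 'invalid_client' });
  }
  const accessToken = jwt.sign({ sub: client_id, scope: 'webhook:write' }, JWT_SECRET, { expiresIn: '10m' });
  res.json({ access_token: accessToken, token_type: 'Bearer', expires_in: 600 });
});

app.listen(9000, () => console.log('Mock OAuth2 server listening on port 9000'));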
The JavaScript Webhook Service
This is the heart of the system. It’s a Node.js application built with Express.js that performs three primary functions:
- Secures an endpoint using OAuth 2.0 Bearer tokens.
- Parses and validates incoming Spinnaker webhook payloads.
- Transforms the payload into InfluxDB Line Protocol and writes it to the database.
Project Structure and Dependencies
The package.json outlines our key dependencies: @influxdata/influxdb-client for database interaction, express as the web server, jsonwebtoken for JWT validation, and winston for structured logging.
{
"name": "spinnaker-webhook-service",
"version": "1.0.0",
"main": "server.js",
"scripts": {
"start": "node server.js"
},
"dependencies": {
"@influxdata/influxdb-client": "^1.33.2",
"express": "^4.18.2",
"jsonwebtoken": "^9.0.2",
"winston": "^3.11.0"
}
}
Core Server and Middleware Implementation
The main server.js file sets up the Express application, logging, and the core routes. A critical piece is the authMiddleware. This function intercepts every request to our protected endpoint, extracts the JWT from the Authorization header, and verifies its signature against our shared secret. A common mistake is to forget to handle missing tokens or verification failures, which would leave the endpoint exposed.
// server.js
const express = require('express');
const jwt = require('jsonwebtoken');
const fs = require('fs');
const InfluxDBWriter = require('./influxWriter');
const SpinnakerPayloadParser = require('./payloadParser');
const logger = require('./logger');
const app = express();
app.use(express.json({ limit: '5mb' })); // Spinnaker payloads can be large
const PORT = process.env.PORT || 3000;
const JWT_SECRET = fs.readFileSync(process.env.OAUTH_JWT_SECRET_FILE, 'utf8').trim();
const influxWriter = new InfluxDBWriter();
// Health check endpoint, does not require authentication
app.get('/health', (req, res) => {
res.status(200).json({ status: 'UP' });
});
// Authentication middleware
const authMiddleware = (req, res, next) => {
const authHeader = req.headers.authorization;
if (!authHeader || !authHeader.startsWith('Bearer ')) {
logger.warn('Authentication attempt with missing or malformed token.');
return res.status(401).json({ error: 'Unauthorized: Missing Bearer token' });
}
const token = authHeader.split(' ')[1];
jwt.verify(token, JWT_SECRET, (err, decoded) => {
if (err) {
logger.error('Token verification failed', { error: err.message });
return res.status(403).json({ error: 'Forbidden: Invalid token' });
}
// You could potentially use decoded payload for more fine-grained access control
req.user = decoded;
next();
});
};
// The primary webhook endpoint
app.post('/webhook/spinnaker', authMiddleware, (req, res) => {
const payload = req.body;
// A simple validation to ensure we're dealing with a Spinnaker event
if (!payload.content || !payload.details || !payload.details.type) {
logger.warn('Received malformed payload', { payload });
return res.status(400).json({ error: 'Bad Request: Malformed Spinnaker payload' });
}
logger.info(`Processing event: ${payload.details.type} for application ${payload.content.application}`);
try {
const points = SpinnakerPayloadParser.parse(payload);
if (points.length > 0) {
influxWriter.writePoints(points);
}
res.status(202).json({ message: 'Accepted' });
} catch (error) {
logger.error('Failed to parse or write payload', { error: error.message, stack: error.stack });
// Don't leak internal error details to the client
res.status(500).json({ error: 'Internal Server Error' });
}
});
app.listen(PORT, () => {
logger.info(`Spinnaker webhook service listening on port ${PORT}`);
});
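Both server.js above and the influxWriter.js module below require a ./logger module that is not shown elsewhere in this post. A minimal sketch, assuming structured JSON on stdout is sufficient for Swarm's logging driver, might look like this:
// logger.js -- minimal winston setup; level is driven by the LOG_LEVEL env var from the stack file
const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json() // one JSON object per line, easy to collect from container stdout
  ),
  transports: [new winston.transports.Console()],
});

module.exports = logger;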
InfluxDB Integration
The influxWriter.js module encapsulates all interaction with InfluxDB. It uses the official client library. A key consideration in a production system is batching. Writing every single data point in a separate HTTP request is highly inefficient. The client library provides a batching API, which we use to collect points and flush them periodically or when a certain batch size is reached. This dramatically reduces network overhead and load on the InfluxDB instance. Error handling here is also vital; if InfluxDB is temporarily unavailable, we must log the failure without crashing the service.
// influxWriter.js
const { InfluxDB, Point } = require('@influxdata/influxdb-client');
const fs = require('fs');
const logger = require('./logger');
class InfluxDBWriter {
constructor() {
const url = process.env.INFLUX_URL;
const token = fs.readFileSync(process.env.INFLUX_TOKEN_FILE, 'utf8').trim();
const org = process.env.INFLUX_ORG;
this.bucket = process.env.INFLUX_BUCKET;
if (!url || !token || !org || !this.bucket) {
throw new Error("InfluxDB configuration is missing. Check environment variables.");
}
this.influxDB = new InfluxDB({ url, token });
this.writeApi = this.influxDB.getWriteApi(org, this.bucket, 'ns', {
batchSize: 100, // Flush after 100 points
flushInterval: 5000, // Or flush every 5 seconds
});
logger.info('InfluxDB writer initialized.');
// Add error handling for the write API
this.writeApi.events.on('writeError', (error) => {
logger.error('InfluxDB write error', {
errorMessage: error.message,
stack: error.stack,
body: error.body
});
});
}
writePoints(points) {
if (!Array.isArray(points) || points.length === 0) {
return;
}
logger.info(`Writing ${points.length} points to InfluxDB bucket: ${this.bucket}`);
for (const point of points) {
this.writeApi.writePoint(point);
}
}
async close() {
try {
await this.writeApi.close();
logger.info('InfluxDB writer closed successfully.');
} catch (error) {
logger.error('Error closing InfluxDB writer', { error: error.message });
}
}
}
module.exports = InfluxDBWriter;
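One consequence of batching is that a replica stopped or rescheduled by Swarm can drop whatever is still sitting in the buffer (up to 100 points, or up to five seconds' worth). A sketch of one mitigation, assuming it lives in server.js where influxWriter and logger are in scope, is to flush on SIGTERM before the container exits:
// In server.js: flush buffered points before Swarm's stop grace period expires.
// close() flushes any pending batch and releases the underlying write API.
process.on('SIGTERM', async () => {
  logger.info('SIGTERM received, flushing InfluxDB write buffer before shutdown.');
  try {
    await influxWriter.close();
  } finally {
    process.exit(0);
  }
});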
Spinnaker Payload Transformation Logic
This is where the real business logic resides. Spinnaker’s webhook payload is rich but noisy. A single event can contain a deeply nested JSON object that is several hundred kilobytes in size. The goal of payloadParser.js is to surgically extract the valuable information and map it to a clean InfluxDB data model.
Our data model for InfluxDB will use a single measurement, pipeline_events.
- Tags (indexed for fast queries): application, pipelineName, stageType, status, triggeredBy.
- Fields (the actual data values): durationMs (integer), isSuccess (boolean, 1 or 0), executionId (string).
The pitfall here is choosing what to make a tag versus a field. Tags should be used for metadata that you will group or filter on; fields are the data you want to aggregate or select. Putting high-cardinality values like executionId in a tag can lead to performance degradation in InfluxDB, a phenomenon known as “series cardinality explosion.” We’ve chosen to make it a field, which is a conscious trade-off: filtering by a specific execution ID will be slower, but the overall health of the database is preserved.
// payloadParser.js
const { Point } = require('@influxdata/influxdb-client');
class SpinnakerPayloadParser {
static parse(payload) {
const { content, details } = payload;
const { execution, context } = content;
const { application, type } = details;
const points = [];
// We only care about stage and pipeline completion events
if (type !== 'orca:stage:complete' && type !== 'orca:pipeline:complete') {
return points;
}
const commonTags = {
application: application,
pipelineName: execution.name,
triggeredBy: execution.trigger.user || 'anonymous',
};
const point = new Point('pipeline_events')
.timestamp(new Date(context.completed || context.startTime)); // Use completion time as timestamp
const status = context.status; // e.g., SUCCEEDED, FAILED, TERMINAL
point.tag('status', status);
const isSuccess = status === 'SUCCEEDED' ? 1 : 0;
point.intField('isSuccess', isSuccess);
point.stringField('executionId', execution.id);
// Calculate duration only if we have start and end times
if (context.startTime && context.endTime) {
const durationMs = context.endTime - context.startTime;
point.intField('durationMs', durationMs);
}
// Differentiate between stage and pipeline events
if (type === 'orca:stage:complete') {
point.tag('stageType', context.type || 'unknown');
} else { // orca:pipeline:complete
point.tag('stageType', 'pipeline_summary');
}
// Add all common tags to the point
for (const [key, value] of Object.entries(commonTags)) {
if (value) {
point.tag(key, value);
}
}
points.push(point);
return points;
}
}
module.exports = SpinnakerPayloadParser;
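A quick way to sanity-check the mapping is to feed the parser a hand-written event and print the resulting line protocol. The payload below is illustrative only; it contains just the fields the parser reads, not the full Spinnaker event schema.
// parser-check.js -- hand-rolled sample event, not a real Spinnaker payload
const SpinnakerPayloadParser = require('./payloadParser');

const samplePayload = {
  details: { type: 'orca:stage:complete', application: 'checkout-service' },
  content: {
    execution: { id: 'exec-01HXYZ', name: 'deploy-to-prod', trigger: { user: 'jane.doe' } },
    context: {
      type: 'deploy',
      status: 'SUCCEEDED',
      startTime: 1714000000000, // epoch millis
      endTime: 1714000042000,
      completed: 1714000042000,
    },
  },
};

const points = SpinnakerPayloadParser.parse(samplePayload);
points.forEach((p) => console.log(p.toLineProtocol()));
// Expect one pipeline_events point tagged with application, pipelineName, stageType=deploy,
// status=SUCCEEDED, carrying the fields isSuccess=1i, durationMs=42000i and executionId.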
Configuring the Spinnaker Pipeline
The final piece of the puzzle is configuring a Spinnaker pipeline to send these notifications. A major challenge with Spinnaker is that its native webhook notification stage does not have built-in support for an OAuth 2.0 Client Credentials flow. In a real-world project, a pragmatic workaround is often necessary.
Our solution is to add a “Run Job” stage before the webhook stage. This stage runs a simple container with curl and jq installed, executes a script to fetch a token from our OAuth provider, and then passes that token to the webhook stage’s payload.
Here’s the relevant JSON snippet for the Spinnaker pipeline configuration:
// Part of a Spinnaker pipeline JSON definition
{
"name": "Get Auth Token",
"refId": "1",
"requisiteStageRefIds": [],
"type": "runJob",
"account": "docker-swarm-account",
"cloudProvider": "docker",
"payload": {
"image": "buildpack-deps:curl",
"command": [
"/bin/bash",
"-c",
"TOKEN_RESPONSE=$(curl -s -X POST 'http://oauth2-provider:9000/token' -H 'Content-Type: application/x-www-form-urlencoded' -d 'grant_type=client_credentials&client_id=spinnaker&client_secret=supersecret'); echo '{\"accessToken\": \"'$(echo $TOKEN_RESPONSE | jq -r .access_token)'\"}'"
]
}
},
{
"name": "Notify Webhook Service",
"refId": "2",
"requisiteStageRefIds": ["1"],
"type": "webhook",
"url": "http://webhook-service:3000/webhook/spinnaker",
"customHeaders": {
"Authorization": "Bearer ${ #stage('Get Auth Token')['context']['artifacts'][0]['results']['accessToken'] }"
},
"payload": {
// Spinnaker populates the full event payload here automatically
},
"status": [
"SUCCEEDED",
"FAILED",
"TERMINAL"
]
}
This implementation detail is crucial. It’s an example of the small, practical hurdles encountered when integrating disparate systems. The use of Spinnaker’s expression language (${...}) to extract the token from the context of the previous stage is powerful but can be brittle if the output format of the “Run Job” stage changes.
Querying the Results
With data flowing into InfluxDB, we can finally answer our initial questions. Using the InfluxDB UI or API with the Flux language, we gain deep insights.
Query 1: Find all failed pipeline runs in the last 7 days for a specific application.
from(bucket: "spinnaker")
|> range(start: -7d)
|> filter(fn: (r) => r._measurement == "pipeline_events")
|> filter(fn: (r) => r.application == "checkout-service")
|> filter(fn: (r) => r.stageType == "pipeline_summary")
|> filter(fn: (r) => r.status == "TERMINAL" or r.status == "FAILED")
|> keep(columns: ["_time", "pipelineName", "executionId", "triggeredBy"])
|> sort(columns: ["_time"], desc: true)
Query 2: Calculate the average and 95th percentile duration for the “Deploy to Prod” stage across all applications.
from(bucket: "spinnaker")
|> range(start: -30d)
|> filter(fn: (r) => r._measurement == "pipeline_events")
|> filter(fn: (r) => r.stageType == "deploy") // Assuming 'deploy' is the stage type
|> filter(fn: (r) => r._field == "durationMs")
|> group(columns: ["application"])
|> aggregateWindow(every: 1d, fn: mean, createEmpty: false)
|> yield(name: "average_duration")
from(bucket: "spinnaker")
|> range(start: -30d)
|> filter(fn: (r) => r._measurement == "pipeline_events")
|> filter(fn: (r) => r.stageType == "deploy")
|> filter(fn: (r) => r._field == "durationMs")
|> group(columns: ["application"])
|> quantile(q: 0.95, method: "exact_mean")
|> yield(name: "p95_duration")
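These queries do not have to live in the InfluxDB UI; the same client library used by the writer exposes a query API, which makes it straightforward to feed results into dashboards or scheduled reports. A sketch that runs the failed-runs query from Node.js, assuming an INFLUX_TOKEN with read access to the bucket:
// query-failures.js -- run the "failed pipeline runs" Flux query programmatically
const { InfluxDB } = require('@influxdata/influxdb-client');

const influxDB = new InfluxDB({ url: process.env.INFLUX_URL, token: process.env.INFLUX_TOKEN });
const queryApi = influxDB.getQueryApi(process.env.INFLUX_ORG);

const fluxQuery = `
from(bucket: "spinnaker")
  |> range(start: -7d)
  |> filter(fn: (r) => r._measurement == "pipeline_events")
  |> filter(fn: (r) => r.application == "checkout-service")
  |> filter(fn: (r) => r.stageType == "pipeline_summary")
  |> filter(fn: (r) => r.status == "TERMINAL" or r.status == "FAILED")
  |> filter(fn: (r) => r._field == "executionId")
  |> keep(columns: ["_time", "pipelineName", "triggeredBy", "_value"])
  |> rename(columns: {_value: "executionId"})
  |> sort(columns: ["_time"], desc: true)`;

async function main() {
  // collectRows gathers every result row into an array of plain objects
  const rows = await queryApi.collectRows(fluxQuery);
  for (const row of rows) {
    console.log(`${row._time} ${row.pipelineName} ${row.executionId} (triggered by ${row.triggeredBy})`);
  }
}

main().catch((err) => console.error('Query failed:', err));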
The ability to perform these queries transforms the CI/CD platform from a simple execution engine into a data source for continuous improvement. We can now set SLOs on pipeline performance and reliability, track them, and make data-driven decisions.
This solution, while functional, is not without its limitations. The method of fetching the OAuth token via a curl
job is a workaround; a more integrated approach would involve developing a custom Spinnaker stage (an Orca plugin), which is a significant undertaking. Furthermore, the webhook service currently has a “fire and forget” error handling model for InfluxDB writes. If InfluxDB is down, the data point is lost. For higher reliability, a message queue like RabbitMQ or Redis Streams could be introduced between the webhook service and the database writer to act as a durable buffer, ensuring that events are eventually processed even during a database outage. Finally, this system captures pipeline metadata, but not the rich log output from each stage. A comprehensive observability solution would require integrating a log aggregation system and correlating log events with the pipeline metrics we now have, providing an even deeper level of insight.