Building an Interactive Canary Judgment Stage in Spinnaker with an SNS-Driven Chakra UI Plugin


The standard “Manual Judgment” stage in a Spinnaker pipeline is functional, but it’s a dead end from a user experience perspective. It sends a Slack or email notification with a link, forcing an on-call engineer to context-switch, navigate a complex UI, and manually piece together information from different monitoring dashboards before clicking “Continue” or “Stop”. For critical canary deployments, this process is slow, error-prone, and lacks the real-time data needed for a confident decision. Our team was hitting this wall repeatedly. Automated canary analysis with Kayenta is the ideal end-state, but setting it up for every single microservice is a significant undertaking. We needed an intermediate solution: an interactive, data-rich judgment stage that brought the decision-making context directly into the pipeline view.

Our initial concept was to replace the default judgment stage with a custom UI plugin built directly into Spinnaker’s Deck interface. This UI would not just be a simple “approve/reject” button. It had to be a live dashboard for the canary analysis, presenting key application metrics, recent logs, and health status for both the canary and baseline deployments side-by-side. The most critical requirement was that this feedback loop had to be event-driven. Polling for status updates every few seconds from the browser felt archaic and inefficient. We wanted deployment events pushed from the Spinnaker pipeline to the UI in real-time.

This led to our technology selection process. Spinnaker’s Deck UI is a React application, making it extensible with custom components. For the UI itself, we chose Chakra UI. In a real-world project, speed of development for internal tools is paramount. Chakra UI’s composition-based components and focus on accessibility meant we could build a polished, professional-looking interface without getting bogged down in CSS-in-JS boilerplate or fighting style overrides. It allowed us to focus on logic, not presentation.

The real architectural decision point was the real-time communication channel. A naive approach would be a dedicated WebSocket server, but that introduces stateful infrastructure that needs to be managed, scaled, and secured. The pitfall here is underestimating the operational overhead of even a “simple” service. We decided to leverage a serverless, event-driven architecture using AWS services. AWS SNS became the obvious choice for the messaging bus. A Spinnaker pipeline can be easily configured to publish a message to an SNS topic at the start of the canary phase. This is a fire-and-forget operation, perfectly decoupling the pipeline from the notification consumers.

But a browser client can’t subscribe to an SNS topic directly. We needed a bridge. The solution was a chain: Spinnaker publishes to SNS, an AWS Lambda function subscribes to that topic, processes the message, and then pushes it to the browser. To get the data to the browser in real-time, we chose AWS IoT Core. While it might sound like an unconventional choice, its MQTT-over-WebSockets protocol provides a managed, scalable, and secure real-time messaging layer that is surprisingly easy to integrate with a frontend application using standard libraries. This serverless stack—SNS, Lambda, IoT Core—is robust, cost-effective for this workload, and requires zero infrastructure maintenance.

The final piece was closing the loop. After the engineer makes a decision in the custom UI, how do we signal back to the Spinnaker pipeline? The answer is the Spinnaker API. The “Promote” and “Rollback” buttons in our Chakra UI component would make authenticated API calls to resume the pipeline execution, passing along the judgment result.

Here is a high-level view of the entire flow:

sequenceDiagram
    participant Spinnaker
    participant SNS
    participant Lambda
    participant AWSIoTCore as AWS IoT Core (MQTT)
    participant ChakraUI as Chakra UI Plugin
    participant SpinnakerAPI as Spinnaker API

    Spinnaker->>+SNS: Publishes Canary Started event (deploymentId, version)
    SNS->>+Lambda: Triggers Lambda function
    Lambda->>+AWSIoTCore: Publishes formatted data to MQTT topic
    AWSIoTCore-->>-ChakraUI: Pushes real-time update
    Note over ChakraUI: Engineer analyzes data and makes a decision
    ChakraUI->>+SpinnakerAPI: Calls API to complete judgment stage (e.g., 'promote')
    SpinnakerAPI-->>-Spinnaker: Resumes pipeline execution

Crafting the Spinnaker Pipeline

The foundation is a Spinnaker pipeline that incorporates stages for publishing the initial event and waiting for the external signal from our UI. We define a canary deployment strategy as usual, but we inject custom stages around the manual analysis period.

A common mistake is to overcomplicate the pipeline logic. We keep it simple. The pipeline’s responsibility is to deploy, notify, wait, and then act upon a received signal.

Here’s a simplified JSON representation of the key stages:

[
  {
    "name": "Start Canary Judgment",
    "type": "awsSnsPublish",
    "refId": "1",
    "requisiteStageRefIds": ["deployCanary"],
    "topicArn": "arn:aws:sns:us-east-1:123456789012:spinnaker-canary-events",
    "subject": "Canary Deployment Started: {{ execution.application }}/{{ trigger.tag }}",
    "message": {
      "executionId": "{{ execution.id }}",
      "application": "{{ execution.application }}",
      "canaryVersion": "{{ trigger.tag }}",
      "pipelineName": "{{ execution.name }}",
      "status": "PENDING_JUDGMENT",
      "timestamp": "{{ #stage('Start Canary Judgment')['startTime'] }}"
    }
  },
  {
    "name": "Wait for Interactive Judgment",
    "type": "manualJudgment",
    "refId": "2",
    "requisiteStageRefIds": ["1"],
    "failPipeline": true,
    "instructions": "Awaiting judgment from the external Canary Dashboard. This stage will be completed via API call.",
    "notifications": [],
    "propagateAuthenticationContext": true,
    "judgmentInputs": []
  },
  {
    "name": "Promote Canary",
    "type": "deploy",
    "refId": "3",
    "requisiteStageRefIds": ["2"],
    "continuePipeline": true,
    "completeOtherBranchesThenContinue": false,
    "restrictedExecutionWindow": null,
    "failPipeline": true,
    "clusters": [
      {
        "account": "my-aws-account",
        "application": "{{ execution.application }}",
        "strategy": "redblack",
        "scaleDown": true,
        "serverGroupName": "{{ execution.application }}-prod"
        // ... more deployment configuration
      }
    ]
  },
  {
    "name": "Rollback Canary",
    "type": "disableCluster",
    "refId": "4",
    "requisiteStageRefIds": ["2"],
    "continuePipeline": false,
    "failPipeline": true,
    "clusters": [
      {
        "account": "my-aws-account",
        "application": "{{ execution.application }}",
        "cloudProvider": "aws",
        "regions": ["us-east-1"],
        "serverGroupName": "{{ #stage('deployCanary')['context']['outputs']['aws.cluster.name'] }}",
        "target": "current_asg_dynamic"
      }
    ]
  }
]

The key stages are:

  1. Start Canary Judgment (awsSnsPublish): This stage fires a message to our SNS topic. We use Spinnaker’s expression language to populate the payload with critical context like the execution ID and application name.
  2. Wait for Interactive Judgment (manualJudgment): This is the crucial part. We use a standard manualJudgment stage, but we configure it to have no notifications and no judgment inputs. It effectively acts as a pause button for the pipeline. Our external UI will complete this stage via an API call, which will then trigger either the “Promote” or “Rollback” branch. The propagateAuthenticationContext is important for the API call later.

The Serverless Backend: SNS to Lambda to IoT Core

The infrastructure to support this flow is straightforward to define using Infrastructure as Code. In a real-world project, this would be managed with Terraform or CloudFormation.

First, the Lambda function. Its role is simple: receive an event from SNS, extract the relevant data, perhaps enrich it with some initial metrics from CloudWatch, and then publish it to a specific MQTT topic on AWS IoT Core. Using a unique MQTT topic per execution ID (spinnaker/canary/{executionId}) allows our UI to subscribe only to the events for the specific deployment it cares about.

Here’s the Node.js code for the Lambda handler. It includes basic error handling and logging, which are non-negotiable in production code.

// lambda-canary-bridge/index.js
const { IoTDataPlaneClient, PublishCommand } = require("@aws-sdk/client-iot-data-plane");
const REGION = process.env.AWS_REGION;
const IOT_ENDPOINT = process.env.IOT_ENDPOINT;

const iotData = new IoTDataPlaneClient({ region: REGION, endpoint: IOT_ENDPOINT });

// Basic structured logging
const log = (level, message, context) => {
  console.log(JSON.stringify({ level, message, ...context }));
};

exports.handler = async (event) => {
  log('info', 'Received SNS event', { eventId: event.Records[0].Sns.MessageId });

  try {
    const snsMessage = JSON.parse(event.Records[0].Sns.Message);
    const { executionId, application, canaryVersion, status } = snsMessage;

    if (!executionId || !application) {
      log('error', 'Invalid SNS message format: missing executionId or application', { message: snsMessage });
      throw new Error("Invalid SNS message format.");
    }
    
    // The MQTT topic is dynamically created based on the execution ID
    const topic = `spinnaker/canary/${executionId}`;

    const payload = {
      timestamp: new Date().toISOString(),
      source: 'SpinnakerPipeline',
      details: {
        executionId,
        application,
        canaryVersion,
        status
        // In a real implementation, you would add more data here.
        // For example, query CloudWatch or Prometheus for initial health metrics.
      }
    };

    const publishParams = {
      topic,
      payload: JSON.stringify(payload),
      qos: 1, // Quality of Service: at least once delivery
    };

    const command = new PublishCommand(publishParams);
    await iotData.send(command);

    log('info', 'Successfully published to IoT Core', { topic, executionId });

    return {
      statusCode: 200,
      body: JSON.stringify({ message: "Event processed successfully." }),
    };

  } catch (error) {
    log('error', 'Failed to process event', { errorMessage: error.message, stack: error.stack });
    // This will cause SNS to retry the delivery based on its policy
    throw error;
  }
};

This Lambda needs an IAM role with permissions to be invoked by SNS and to publish messages to AWS IoT Core. The IOT_ENDPOINT is a critical environment variable that you must fetch from your AWS account (it’s unique per account).

Building the Spinnaker Deck Plugin with Chakra UI

This is where the user-facing component comes to life. A Spinnaker Deck plugin is essentially a standalone JavaScript package that registers new components (like custom stages) with the main Deck application.

We’ll create a new React component, InteractiveCanaryJudgmentStage, that will be rendered by Deck when our custom stage is active. This component will handle the connection to AWS IoT Core and display the real-time data.

For connecting to MQTT over WebSockets, we’ll use the aws-iot-device-sdk-v2 package, which handles the complex authentication signing (SigV4) required to connect to AWS IoT Core from a browser.

Here’s the core logic of the React component.

// src/components/InteractiveCanaryJudgmentStage.tsx

import React, { useState, useEffect, useCallback } from 'react';
import {
  Box,
  Heading,
  VStack,
  HStack,
  Button,
  Spinner,
  Alert,
  AlertIcon,
  Stat,
  StatLabel,
  StatNumber,
  StatGroup,
  useToast,
  Code,
} from '@chakra-ui/react';
import { IExecution, IStage } from '@spinnaker/core';
import { awsIot } from 'aws-iot-device-sdk-v2';
import { once } from 'events';

// These would come from a config service or environment variables in a real app
const AWS_REGION = 'us-east-1';
const IOT_ENDPOINT = 'your-iot-endpoint.iot.us-east-1.amazonaws.com';
const IDENTITY_POOL_ID = 'us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx';

interface CanaryStatus {
  status: 'PENDING_JUDGMENT' | 'ANALYZING' | 'ACTION_REQUIRED';
  canaryVersion?: string;
  application?: string;
  metrics?: { cpu: number; memory: number; errorRate: number; };
}

interface InteractiveCanaryJudgmentStageProps {
  execution: IExecution;
  stage: IStage;
}

// A custom hook to manage the MQTT connection
const useMqttConnection = (topic: string, onMessage: (payload: any) => void) => {
  const [connectionStatus, setConnectionStatus] = useState<'DISCONNECTED' | 'CONNECTING' | 'CONNECTED' | 'ERROR'>('DISCONNECTED');
  
  useEffect(() => {
    let connection: awsIot.MqttClientConnection | null = null;
    
    const connect = async () => {
      setConnectionStatus('CONNECTING');
      try {
        // Fetches temporary credentials from Cognito Identity Pool
        const credentialsProvider = awsIot.AwsCredentialsProvider.fromCognitoIdentity({
          identityPoolId: IDENTITY_POOL_ID,
          region: AWS_REGION,
        });
        
        // This is the core logic for connecting from the browser
        const config = awsIot.AwsIotMqttConnectionConfigBuilder.new_with_websockets({
            region: AWS_REGION,
            credentials_provider: credentialsProvider,
            client_id: `spinnaker-deck-${Date.now()}`,
            endpoint: IOT_ENDPOINT,
        }).build();

        connection = new awsIot.MqttClient(new awsIot.MqttClientConnection(config));
        
        connection.on('connect', () => setConnectionStatus('CONNECTED'));
        connection.on('error', (err) => {
            console.error('MQTT Connection Error:', err);
            setConnectionStatus('ERROR');
        });

        await connection.connect();
        
        const onMessageHandler = (topic: string, payload: ArrayBuffer, dup: boolean, qos: awsIot.QoS, retain: boolean) => {
            const decoder = new TextDecoder('utf8');
            const messageJson = decoder.decode(payload);
            try {
                onMessage(JSON.parse(messageJson));
            } catch (e) {
                console.error("Failed to parse incoming MQTT message", messageJson);
            }
        };

        await connection.subscribe(topic, awsIot.QoS.AtLeastOnce, onMessageHandler);

      } catch (error) {
        console.error("Failed to establish MQTT connection", error);
        setConnectionStatus('ERROR');
      }
    };
    
    connect();
    
    return () => {
      connection?.disconnect();
    };
  }, [topic, onMessage]);
  
  return connectionStatus;
};

export const InteractiveCanaryJudgmentStage = ({ execution, stage }: InteractiveCanaryJudgmentStageProps) => {
  const [canaryStatus, setCanaryStatus] = useState<CanaryStatus>({ status: 'PENDING_JUDGMENT' });
  const [isSubmitting, setIsSubmitting] = useState(false);
  const toast = useToast();
  
  const executionId = execution.id;
  const mqttTopic = `spinnaker/canary/${executionId}`;

  const handleMessage = useCallback((payload: any) => {
    // A robust implementation would have schema validation here
    setCanaryStatus(prev => ({ ...prev, ...payload.details }));
  }, []);

  const connectionStatus = useMqttConnection(mqttTopic, handleMessage);

  const handleJudgment = async (judgment: 'promote' | 'rollback') => {
    setIsSubmitting(true);
    const judgmentStatus = judgment === 'promote' ? 'CONTINUE' : 'STOP';
    
    try {
      // The spinnaker/core package provides a REST service for API calls
      // In a real plugin, you would import and use it.
      // For demonstration, we use fetch.
      const response = await fetch(`/applications/${execution.application}/pipelines/${execution.id}/${stage.refId}`, {
        method: 'PUT',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          judgmentStatus,
          judgmentInput: judgment, // You can pass data back if needed
        }),
      });

      if (!response.ok) {
        const errorData = await response.json();
        throw new Error(errorData.message || 'Failed to submit judgment');
      }
      
      toast({
        title: `Judgment submitted: ${judgment.toUpperCase()}`,
        status: 'success',
        duration: 5000,
        isClosable: true,
      });

    } catch (error) {
      console.error('Failed to submit judgment:', error);
      toast({
        title: 'API Error',
        description: error.message,
        status: 'error',
        duration: 9000,
        isClosable: true,
      });
    } finally {
      setIsSubmitting(false);
    }
  };

  return (
    <Box p={6} borderWidth="1px" borderRadius="lg">
      <VStack spacing={4} align="stretch">
        <Heading size="md">Interactive Canary Judgment: {canaryStatus.application}</Heading>
        
        <Alert status={connectionStatus === 'CONNECTED' ? 'success' : 'warning'}>
          <AlertIcon />
          Real-time Connection Status: {connectionStatus}
        </Alert>
        
        {connectionStatus === 'CONNECTING' && <Spinner />}

        <StatGroup>
          <Stat>
            <StatLabel>Status</StatLabel>
            <StatNumber>{canaryStatus.status}</StatNumber>
          </Stat>
          <Stat>
            <StatLabel>Canary Version</StatLabel>
            <StatNumber>{canaryStatus.canaryVersion || 'N/A'}</StatNumber>
          </Stat>
        </StatGroup>
        
        <Heading size="sm">Live Metrics</Heading>
        <Code p={4} borderRadius="md" whiteSpace="pre-wrap">
          {JSON.stringify(canaryStatus.metrics || { info: "Waiting for metrics..." }, null, 2)}
        </Code>

        <HStack spacing={4} justify="flex-end">
          <Button
            colorScheme="red"
            onClick={() => handleJudgment('rollback')}
            isLoading={isSubmitting}
            disabled={connectionStatus !== 'CONNECTED'}
          >
            Rollback
          </Button>
          <Button
            colorScheme="green"
            onClick={() => handleJudgment('promote')}
            isLoading={isSubmitting}
            disabled={connectionStatus !== 'CONNECTED'}
          >
            Promote to Production
          </Button>
        </HStack>
      </VStack>
    </Box>
  );
};

This component demonstrates the complete loop. The useMqttConnection custom hook encapsulates the complex logic of setting up a secure WebSocket connection to AWS IoT Core using temporary credentials from a Cognito Identity Pool. The main component subscribes to the relevant topic and updates its state as messages arrive. The “Promote” and “Rollback” buttons then use fetch to call the Spinnaker API endpoint for the manual judgment stage, passing the decision. This call effectively un-pauses the pipeline, directing it down the correct execution path.

The use of Chakra UI components like Box, VStack, Alert, and StatGroup allowed for rapid construction of a clean, readable layout without writing a single line of custom CSS.

The result is a seamless experience. When a developer’s deployment hits the canary phase, this interactive panel appears directly within the pipeline view. It automatically connects to the event stream, displays status, and provides clear actions. There’s no need to hunt for links in Slack or cross-reference dashboards. The decision point is brought to the user, enriched with the necessary context, enabling faster, safer, and more confident deployments.

This architecture, however, is not without its limitations and areas for future improvement. The security of the API call from the browser relies on the user’s existing Spinnaker authentication context. For environments with stricter security postures, introducing a backend-for-frontend (BFF) proxy would be a better approach, allowing the API call to be made from a trusted service account instead of the end-user’s browser. The data payload is also currently minimal; a more advanced implementation would have the Lambda function perform data enrichment by querying a time-series database like Prometheus or CloudWatch Metrics to include actual performance data in the event pushed to the UI. Finally, while AWS IoT Core is highly scalable, the client-side connection management could be made more resilient with custom retry logic and backoff strategies within the React component to handle transient network issues more gracefully.


  TOC