Implementing a Real-Time Svelte Dashboard for Visualizing CockroachDB Cluster Resilience Under Simulated Failures


Static monitoring dashboards for distributed systems provide a false sense of security. They show a system’s state in peacetime, but its true value is only revealed during failure. The technical pain point is not a lack of metrics, but the inability to correlate them with live, induced failure events to build an intuitive understanding of a system’s resilience. To address this, we needed to build an interactive control plane—a tool to not just observe, but to actively provoke a distributed database and watch its self-healing mechanisms in real-time, making abstract concepts like the CAP theorem tangible.

Our initial concept was a three-part architecture: a multi-node CockroachDB cluster as the subject, a Go backend acting as the chaos controller and data provider, and a Svelte frontend for visualization and interaction. CockroachDB was the obvious choice: it is designed from the ground up for survivability and, in CAP terms, explicitly chooses consistency over availability when a partition occurs (a CP system). Svelte, compiled and bundled via Vite, offers the performance and reactivity needed for a fluid real-time UI without the overhead of a virtual DOM. Go provides the low-level control and concurrency required for a backend that simultaneously queries database internals, manages a WebSocket stream, and orchestrates Docker containers. Caddy would serve as the unified entry point, handling the Svelte application and proxying API/WebSocket traffic with its characteristic simplicity and automatic HTTPS.

The core challenge was to design a system that could reliably induce failure in one component (a database node) while the other components (the backend, frontend) remained stable enough to observe the outcome. This is not a standard CRUD application; it’s a purpose-built tool for resilience engineering.

The Foundation: A Geo-Distributed CockroachDB Cluster

Before any application code can be written, a plausible testbed is required. A single-node database is useless for this experiment. We need a multi-node cluster where the failure of one node is a non-catastrophic event. docker-compose is sufficient for a local simulation.

This configuration spins up a three-node cluster, with an additional roach-init service to perform the one-time cluster initialization.

# docker-compose.yml
version: '3.8'

services:
  roach1:
    image: cockroachdb/cockroach:v23.1.9
    container_name: roach1
    hostname: roach1
    volumes:
      - roach1-data:/cockroach/cockroach-data
    command: start --insecure --join=roach1,roach2,roach3 --listen-addr=roach1:26257 --http-addr=roach1:8080
    ports:
      - "26257:26257"
      - "8080:8080"
    networks:
      - roachnet

  roach2:
    image: cockroachdb/cockroach:v23.1.9
    container_name: roach2
    hostname: roach2
    volumes:
      - roach2-data:/cockroach/cockroach-data
    command: start --insecure --join=roach1,roach2,roach3 --listen-addr=roach2:26257 --http-addr=roach2:8081
    networks:
      - roachnet

  roach3:
    image: cockroachdb/cockroach:v23.1.9
    container_name: roach3
    hostname: roach3
    volumes:
      - roach3-data:/cockroach/cockroach-data
    command: start --insecure --join=roach1,roach2,roach3 --listen-addr=roach3:26257 --http-addr=roach3:8082
    networks:
      - roachnet

  roach-init:
    image: cockroachdb/cockroach:v23.1.9
    container_name: roach-init
    depends_on:
      - roach1
      - roach2
      - roach3
    command: init --insecure --host=roach1:26257
    networks:
      - roachnet

networks:
  roachnet:

volumes:
  roach1-data:
  roach2-data:
  roach3-data:

After running docker-compose up -d, the roach-init service executes the cockroach init command against roach1. Because all nodes are configured with the same --join flag, they form a single, resilient cluster. The key here is that data (ranges, in CockroachDB terminology) is replicated across all three nodes (the default replication factor is 3), so killing one container will not result in data loss.
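Before writing any backend code, it is worth a quick sanity check that all three nodes joined and report as live. A minimal, standalone sketch (assuming roach1's SQL port is published on localhost:26257, as in the compose file above):

// verify_cluster.go: an illustrative sanity check, not part of the dashboard itself.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Connect through roach1, the only node whose SQL port is published to the host.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257?sslmode=disable")
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer db.Close()

	rows, err := db.Query("SELECT node_id, address, is_live FROM crdb_internal.gossip_nodes ORDER BY node_id")
	if err != nil {
		log.Fatalf("query: %v", err)
	}
	defer rows.Close()

	live := 0
	for rows.Next() {
		var id int
		var addr string
		var isLive bool
		if err := rows.Scan(&id, &addr, &isLive); err != nil {
			log.Fatalf("scan: %v", err)
		}
		if isLive {
			live++
		}
		fmt.Printf("node %d  %-22s live=%v\n", id, addr, isLive)
	}
	fmt.Printf("%d of 3 nodes live\n", live)
}

If the query hangs or errors, the cluster has probably not been initialized yet; docker-compose logs roach-init will show why.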

The Control Plane: A Go Backend for Chaos and Observation

The backend is the heart of this system. It has two primary responsibilities: querying CockroachDB’s extensive internal metrics tables and exposing an API to control the Docker containers running the database nodes.

A common mistake in designing such a system is to couple the data fetching directly to incoming API requests. A better approach is to poll in a background goroutine that caches the latest state, and to have the WebSocket broadcaster push that cached state to clients periodically. This decouples the database polling frequency from the number of connected clients.

Here’s the core structure of the Go backend.

// main.go
package main

import (
	"context"
	"database/sql"
	"encoding/json"
	"log"
	"net/http"
	"strconv"
	"sync"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
	"github.com/gorilla/websocket"
	_ "github.com/lib/pq"
)

// Represents the state of a single CockroachDB node.
type NodeState struct {
	NodeID    int    `json:"node_id"`
	Address   string `json:"address"`
	IsLive    bool   `json:"is_live"`
	Ranges    int    `json:"ranges"`
	LeaderRanges int `json:"leader_ranges"`
	ContainerID string `json:"container_id"`
	ContainerState string `json:"container_state"`
}

// Represents the overall cluster state.
type ClusterState struct {
	Nodes          []NodeState `json:"nodes"`
	RangesTotal    int         `json:"ranges_total"`
	RangesUnavailable int `json:"ranges_unavailable"`
	Timestamp      time.Time   `json:"timestamp"`
}

// Global state cache and WebSocket connection manager.
var (
	currentState = ClusterState{}
	stateMutex   = &sync.RWMutex{}
	upgrader     = websocket.Upgrader{
		CheckOrigin: func(r *http.Request) bool { return true },
	}
	clients      = make(map[*websocket.Conn]bool)
	clientsMutex = &sync.RWMutex{}
	dockerCli    *client.Client
)

func main() {
	// Initialize Docker client
	var err error
	dockerCli, err = client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatalf("Failed to create Docker client: %v", err)
	}

	// Database connection string
	connStr := "postgresql://root@localhost:26257?sslmode=disable"
	db, err := sql.Open("postgres", connStr)
	if err != nil {
		log.Fatalf("Failed to connect to database: %v", err)
	}
	defer db.Close()

	// Start the background poller to update cluster state
	go pollClusterState(db)

	// Start the WebSocket broadcaster
	go broadcastState()

	// Setup HTTP server
	http.HandleFunc("/ws", handleWebSocket)
	http.HandleFunc("/api/chaos/node/stop", handleStopNode)
	http.HandleFunc("/api/chaos/node/start", handleStartNode)

	log.Println("Starting server on :8081...")
	if err := http.ListenAndServe(":8081", nil); err != nil {
		log.Fatalf("Server failed: %v", err)
	}
}

// pollClusterState periodically queries CockroachDB and Docker for state.
func pollClusterState(db *sql.DB) {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)

		// Fetch node liveness and basic info
		rows, err := db.QueryContext(ctx, "SELECT node_id, address, is_live FROM crdb_internal.gossip_nodes")
		if err != nil {
			log.Printf("Error querying gossip_nodes: %v", err)
			cancel()
			continue
		}

		nodeMap := make(map[int]*NodeState)
		for rows.Next() {
			var node NodeState
			if err := rows.Scan(&node.NodeID, &node.Address, &node.IsLive); err != nil {
				log.Printf("Error scanning node row: %v", err)
				continue
			}
			nodeMap[node.NodeID] = &node
		}
		rows.Close()

		// Fetch replica and leaseholder counts per node, summed across stores.
		rows, err = db.QueryContext(ctx, "SELECT node_id, sum(range_count)::INT, sum(lease_count)::INT FROM crdb_internal.kv_store_status GROUP BY node_id")
		if err != nil {
			log.Printf("Error querying ranges: %v", err)
			cancel()
			continue
		}
		for rows.Next() {
			var nodeID, rangeCount, leaderCount int
			if err := rows.Scan(&nodeID, &rangeCount, &leaderCount); err != nil {
				log.Printf("Error scanning range row: %v", err)
				continue
			}
			if node, ok := nodeMap[nodeID]; ok {
				node.Ranges = rangeCount
				node.LeaderRanges = leaderCount
			}
		}
		rows.Close()

		// Fetch unavailability metrics
		var unavailableRanges int
		err = db.QueryRowContext(ctx, `SELECT COALESCE(sum(value), 0)::INT FROM crdb_internal.node_metrics WHERE name = 'ranges.unavailable'`).Scan(&unavailableRanges)
		if err != nil && err != sql.ErrNoRows {
			log.Printf("Error querying unavailable ranges: %v", err)
		}

		// Update state with Docker info
		nodes := make([]NodeState, 0, len(nodeMap))
		containers, err := dockerCli.ContainerList(ctx, types.ContainerListOptions{All: true}) // All: true so stopped containers are still reported
		if err != nil {
			log.Printf("Error listing containers: %v", err)
		}
		for _, node := range nodeMap {
			// A simple heuristic to map DB node to Docker container
			// In a real project, this would be more robust, perhaps using labels.
			for _, container := range containers {
				if container.Names[0] == "/roach"+strconv.Itoa(node.NodeID) {
					node.ContainerID = container.ID
					node.ContainerState = container.State
				}
			}
			nodes = append(nodes, *node)
		}
		
		cancel()

		// Atomically update the global state
		stateMutex.Lock()
		currentState = ClusterState{
			Nodes:          nodes,
			RangesUnavailable: unavailableRanges,
			Timestamp:      time.Now(),
		}
		stateMutex.Unlock()
	}
}

// broadcastState sends the current state to all connected WebSocket clients.
func broadcastState() {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		stateMutex.RLock()
		// A deep copy is not strictly necessary here, since we only marshal
		// under the read lock; it would be if the state were passed around elsewhere.
		msg, err := json.Marshal(currentState)
		stateMutex.RUnlock()

		if err != nil {
			log.Printf("Error marshalling state: %v", err)
			continue
		}

		clientsMutex.Lock()
		for client := range clients {
			if err := client.WriteMessage(websocket.TextMessage, msg); err != nil {
				log.Printf("WebSocket write error: %v", err)
				client.Close()
				delete(clients, client)
			}
		}
		clientsMutex.Unlock()
	}
}

func handleWebSocket(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println(err)
		return
	}

	clientsMutex.Lock()
	clients[conn] = true
	clientsMutex.Unlock()

	// The read loop is mainly for detecting closed connections.
	go func() {
		defer func() {
			clientsMutex.Lock()
			delete(clients, conn)
			clientsMutex.Unlock()
			conn.Close()
		}()
		for {
			if _, _, err := conn.NextReader(); err != nil {
				break
			}
		}
	}()
}

// Chaos handlers for stopping/starting nodes
func handleStopNode(w http.ResponseWriter, r *http.Request) {
	containerName := r.URL.Query().Get("name")
	if containerName == "" {
		http.Error(w, "Missing 'name' query parameter", http.StatusBadRequest)
		return
	}

	log.Printf("Attempting to stop container: %s", containerName)
	if err := dockerCli.ContainerStop(context.Background(), containerName, nil); err != nil {
		log.Printf("Failed to stop container %s: %v", containerName, err)
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("OK"))
}

func handleStartNode(w http.ResponseWriter, r *http.Request) {
	containerName := r.URL.Query().Get("name")
	if containerName == "" {
		http.Error(w, "Missing 'name' query parameter", http.StatusBadRequest)
		return
	}

	log.Printf("Attempting to start container: %s", containerName)
	if err := dockerCli.ContainerStart(context.Background(), containerName, types.ContainerStartOptions{}); err != nil {
		log.Printf("Failed to start container %s: %v", containerName, err)
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("OK"))
}

A critical detail here is querying crdb_internal.gossip_nodes. This table provides a near real-time view of which nodes the cluster considers live. When we stop a container, the is_live flag for that node flips to false after a short timeout. We also query crdb_internal.node_metrics for ranges.unavailable. This metric is key: a range is reported unavailable when it loses a quorum of its replicas. In a three-node cluster with the default replication factor of 3, stopping one node mostly leaves ranges under-replicated rather than unavailable, but the count can blip while the failure is detected and leases move to the surviving nodes. This is the CAP theorem in action: the system prioritizes consistency by making data temporarily unavailable rather than serving stale data.
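Since a single-node failure usually leaves ranges under-replicated rather than unavailable, it is worth surfacing ranges.underreplicated alongside ranges.unavailable in the dashboard. A minimal sketch of how the poller could fetch both in one pass; the helper name fetchRangeHealth is illustrative, and it relies on the imports already present in main.go:

// fetchRangeHealth is an optional extension to pollClusterState: it reads both
// the unavailable and under-replicated range counts in one query. It assumes
// the same crdb_internal.node_metrics table and the standard metric names.
func fetchRangeHealth(ctx context.Context, db *sql.DB) (unavailable, underReplicated int, err error) {
	rows, err := db.QueryContext(ctx, `
		SELECT name, value::INT
		FROM crdb_internal.node_metrics
		WHERE name IN ('ranges.unavailable', 'ranges.underreplicated')`)
	if err != nil {
		return 0, 0, err
	}
	defer rows.Close()

	for rows.Next() {
		var name string
		var value int
		if err := rows.Scan(&name, &value); err != nil {
			return 0, 0, err
		}
		// node_metrics reports one row per store; sum in case a node runs several.
		switch name {
		case "ranges.unavailable":
			unavailable += value
		case "ranges.underreplicated":
			underReplicated += value
		}
	}
	return unavailable, underReplicated, rows.Err()
}

Wiring the two counts into ClusterState and the Svelte header is then a one-field addition on each side.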

The Visualization Layer: A Reactive SvelteKit Frontend

The frontend’s job is to present this stream of data clearly and provide the controls for inducing chaos. SvelteKit provides the structure, and Svelte’s reactivity makes updating the UI from a WebSocket stream trivial.

First, we need a service to manage the WebSocket connection and a Svelte store to hold the cluster state.

// src/lib/clusterState.js
import { writable } from 'svelte/store';
import { browser } from '$app/environment';

// This store will hold the latest state received from the backend.
export const clusterState = writable({ nodes: [], ranges_unavailable: 0 });

const WS_URL = 'ws://localhost:8081/ws'; // In production this would be wss:// and proxied by Caddy

let socket;

function connect() {
    socket = new WebSocket(WS_URL);

    socket.onopen = () => {
        console.log('WebSocket connection established.');
    };

    socket.onmessage = (event) => {
        try {
            const data = JSON.parse(event.data);
            // The magic happens here: updating the store automatically
            // triggers re-renders in any component that uses it.
            clusterState.set(data);
        } catch (error) {
            console.error('Error parsing WebSocket message:', error);
        }
    };

    socket.onclose = (event) => {
        console.warn('WebSocket connection closed. Reconnecting in 3 seconds...', event.reason);
        setTimeout(connect, 3000); // Simple reconnect logic
    };

    socket.onerror = (error) => {
        console.error('WebSocket error:', error);
        socket.close(); // This will trigger the onclose handler for reconnection
    };
}

// Initial connection. Only connect in the browser: during SvelteKit's
// server-side rendering/prerendering there is no WebSocket global.
if (browser) {
    connect();
}

The main page component subscribes to this store and renders the UI. Using Svelte’s reactive $: syntax is cleaner than lifecycle methods for derived data.

<!-- src/routes/+page.svelte -->
<script>
    import { clusterState } from '$lib/clusterState';

    // The '$' prefix provides automatic subscription and unsubscription to the store.
    // These reactive statements re-run whenever `clusterState` changes.
    $: nodes = ($clusterState.nodes ?? [])
        .slice()
        .sort((a, b) => a.node_id - b.node_id);

    $: unavailableRanges = $clusterState.ranges_unavailable ?? 0;

    let apiInProgress = new Set();

    async function stopNode(nodeId) {
        const containerName = `roach${nodeId}`;
        if (apiInProgress.has(containerName)) return;

        apiInProgress.add(containerName);
        apiInProgress = apiInProgress; // Trigger Svelte reactivity

        try {
            await fetch(`/api/chaos/node/stop?name=${containerName}`);
        } catch (error) {
            console.error(`Failed to stop node ${nodeId}:`, error);
        } finally {
            apiInProgress.delete(containerName);
            apiInProgress = apiInProgress;
        }
    }

    async function startNode(nodeId) {
        const containerName = `roach${nodeId}`;
        if (apiInProgress.has(containerName)) return;

        apiInProgress.add(containerName);
        apiInProgress = apiInProgress;

        try {
            await fetch(`/api/chaos/node/start?name=${containerName}`);
        } catch (error) {
            console.error(`Failed to start node ${nodeId}:`, error);
        } finally {
            apiInProgress.delete(containerName);
            apiInProgress = apiInProgress;
        }
    }
</script>

<main class="container">
    <header>
        <h1>CockroachDB Chaos Controller</h1>
        <div class="metrics">
            <span class="metric-label">Unavailable Ranges:</span>
            <span class="metric-value" class:warn={unavailableRanges > 0}>
                {unavailableRanges}
            </span>
        </div>
    </header>

    <div class="node-grid">
        {#each nodes as node (node.node_id)}
            <div class="node-card" class:live={node.is_live} class:down={!node.is_live || node.container_state !== 'running'}>
                <h2>Node {node.node_id}</h2>
                <div class="details">
                    <p><strong>Status:</strong> {node.is_live ? 'Live' : 'Suspect/Dead'}</p>
                    <p><strong>Container:</strong> {node.container_state}</p>
                    <p><strong>Address:</strong> {node.address}</p>
                    <p><strong>Total Ranges:</strong> {node.ranges}</p>
                    <p><strong>Leaseholder Ranges:</strong> {node.leader_ranges}</p>
                </div>
                <div class="actions">
                    {#if node.container_state === 'running'}
                        <button on:click={() => stopNode(node.node_id)} disabled={apiInProgress.has(`roach${node.node_id}`)}>
                            Stop Node
                        </button>
                    {:else}
                        <button on:click={() => startNode(node.node_id)} disabled={apiInProgress.has(`roach${node.node_id}`)} class="start-btn">
                            Start Node
                        </button>
                    {/if}
                </div>
            </div>
        {/each}
    </div>
</main>

<style>
    /* ... (CSS for styling the grid, cards, etc.) ... */
</style>

The Gateway: Caddy for Serving and Proxying

The final piece is Caddy. It serves the static SvelteKit application (built with npm run build using @sveltejs/adapter-static, whose default output directory is build) and acts as a reverse proxy for both the API and WebSocket traffic to the Go backend. This unification is where Caddy shines: its configuration is declarative and vastly simpler than equivalent setups in other web servers.

# Caddyfile
{
    # Enable more detailed logging for debugging
    log {
        level INFO
    }
}

localhost {
    # Route API and WebSocket requests to the Go backend service.
    # Use `handle` rather than `handle_path`: handle_path strips the matched
    # prefix, but the Go mux expects the full /api/... and /ws paths.
    # Caddy automatically handles WebSocket protocol upgrades.
    handle /api/* {
        reverse_proxy localhost:8081
    }

    handle /ws {
        reverse_proxy localhost:8081
    }

    # Serve the SvelteKit static application as the default handler.
    # The 'try_files' directive is crucial for single-page applications (SPAs)
    # to ensure client-side routing works correctly.
    handle {
        root * /path/to/your/sveltekit-project/build
        try_files {path} {path}/index.html /index.html
        file_server
    }
}

With this Caddyfile, requests to / will serve the index.html of our Svelte app. Requests to /api/chaos/... will be seamlessly proxied to the Go backend, as will the /ws connection. In a production environment, changing localhost to a real domain name would enable Caddy to provision and manage a TLS certificate automatically.

Running the full system, the user is presented with a grid of three healthy nodes. Clicking “Stop Node” on one of them initiates the chaos experiment. The API call goes through Caddy to the Go backend, which stops the container through the Docker API. Within seconds, the background poller detects the change via crdb_internal.gossip_nodes and the Docker API. The updated state is broadcast over the WebSocket, and Svelte reactively updates the UI to show the node as “Suspect/Dead” and its container state as “exited”. Crucially, the user can watch the range-health metrics react: leaseholders migrate off the dead node, any ranges that briefly lose quorum show up in the “Unavailable Ranges” count, and the numbers settle back down as CockroachDB re-replicates data onto the two remaining nodes. This entire sequence makes the abstract promise of a self-healing database a concrete, observable reality.

The current implementation, while functional, is a simulation. It relies on docker stop, which is a graceful shutdown rather than a network partition or a hardware failure. The mapping between CockroachDB nodes and Docker containers is based on a naming convention, which is brittle. A more advanced version would need to induce true network chaos with tools like tc or iptables (a rough sketch of one approach follows) and employ a more robust service discovery mechanism, perhaps based on container labels. The polling mechanism also introduces inherent latency; a production system might leverage CockroachDB’s changefeeds for a more event-driven approach to state updates. Nevertheless, as a tool for developing an intuition for distributed system behavior, this architecture proves invaluable.
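As an illustration of that first limitation, here is a rough sketch of a one-directional partition using the same Go Docker client: exec'ing iptables inside a node's container to drop inbound traffic on the inter-node port. The helper name partitionNode is illustrative only; it works only if the container image ships iptables and runs with the NET_ADMIN capability (the stock cockroachdb image and the compose file above provide neither), and it assumes the same Docker SDK option types used in main.go (they moved packages in newer SDK releases).

// partitionNode drops inbound TCP traffic on CockroachDB's inter-node port
// inside the target container, approximating a one-directional network partition.
// Assumptions: the image contains iptables, the container has CAP_NET_ADMIN,
// and the same Docker SDK option types as main.go are in use.
func partitionNode(ctx context.Context, cli *client.Client, containerName string) error {
	execCfg := types.ExecConfig{
		Privileged: true, // needed to modify the container's firewall rules
		Cmd:        []string{"iptables", "-A", "INPUT", "-p", "tcp", "--dport", "26257", "-j", "DROP"},
	}
	resp, err := cli.ContainerExecCreate(ctx, containerName, execCfg)
	if err != nil {
		return err
	}
	return cli.ContainerExecStart(ctx, resp.ID, types.ExecStartCheck{Detach: true})
}

Healing the partition means running the matching iptables -D rule, and a symmetric partition requires the same treatment on the peer containers; both are left out here.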

