Implementing a Reactive Infrastructure Control Plane with Django, Crossplane, and Jotai


The team’s velocity was grinding to a halt under the weight of infrastructure complexity. Every new feature required a developer to navigate a labyrinth of Terraform modules, Helm charts, and bespoke deployment scripts. The cognitive load was immense, and the feedback loop for provisioning even a simple staging environment stretched from hours to days. Our initial solution—a collection of shell scripts wrapped in a Jenkins pipeline—was fragile, opaque, and a nightmare to maintain. It became clear we weren’t just building applications; we were forced to build an entirely new discipline of infrastructure orchestration for every project. This was our core technical pain point: the chasm between application logic and the underlying infrastructure was too wide and too treacherous to cross repeatedly.

We needed an abstraction. A paved road. An Internal Developer Platform (IDP) that could offer developers a simple, declarative way to request resources while providing the platform team with the governance and control to enforce best practices. The concept crystallized around a central API: a developer should be able to send a JSON payload like {"name": "feature-x-review-app", "size": "small"} and, in return, get a fully provisioned, isolated application environment. The crucial requirement was that this process had to be transparent and reactive. Developers needed to see what was happening in real-time, not stare at a static loading spinner for twenty minutes only to see it fail.

This led us to a rather unconventional combination of technologies. For the infrastructure layer, Crossplane was a game-changer. By extending the Kubernetes API to manage external resources, it allowed us to model our entire infrastructure—databases, clusters, storage buckets—as declarative Kubernetes objects. This completely changed the way we worked, turning chaotic imperative scripts into a manageable, state-driven system.

For the API control plane, we chose Django. Our team has deep Python expertise, and the Django REST Framework (DRF) provides an incredibly robust and rapid way to build the API layer that would sit in front of Crossplane. More importantly, Django could house the business logic, validation, and user authentication needed to gatekeep direct access to the Kubernetes API. It would be our smart proxy.

Finally, for the frontend, the requirement for real-time status updates pushed us towards a modern, granular state management solution. This is where Jotai entered the picture. Its atomic, bottom-up approach to state management seemed perfect for our use case. We envisioned a dashboard where each provisioned resource would have its own state, updated independently via WebSockets. We didn’t want a single status update to trigger a re-render of the entire page. Jotai’s philosophy felt like the frontend equivalent of the micro-level control we were getting with Crossplane. This was the blueprint: a reactive data flow from a Crossplane composition, through a Django control plane, to a Jotai-powered UI, and back again.

The Declarative Foundation: Modeling Environments in Crossplane

Before writing a single line of Python or JavaScript, we had to define the “what.” What constitutes an “application environment” in our world? Using Crossplane, we defined this contract with a CompositeResourceDefinition (XRD). This XRD, named XApplicationEnvironment, acts as the schema for our custom infrastructure API.

# xapplicationenvironment.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xapplicationenvironments.platform.acme.io
spec:
  group: platform.acme.io
  names:
    kind: XApplicationEnvironment
    plural: xapplicationenvironments
  claimNames:
    kind: ApplicationEnvironment
    plural: applicationenvironments
  connectionSecretKeys:
    - kubeconfig
    - db_host
    - db_password
  versions:
  - name: v1alpha1
    served: true
    referenceable: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              parameters:
                type: object
                properties:
                  nodeSize:
                    description: "Size of the GKE nodes."
                    type: string
                    enum: ["small", "medium", "large"]
                  storageGB:
                    description: "The size of the DB disk in GB."
                    type: integer
                    default: 10
                  deletionPolicy:
                    description: "Deletion policy for the environment resources."
                    type: string
                    default: "Delete"
                required:
                  - nodeSize
            required:
              - parameters
          status:
            type: object
            properties:
              dbEndpoint:
                description: "The connection endpoint for the CloudSQL instance."
                type: string
              gkeEndpoint:
                description: "The GKE cluster endpoint."
                type: string
              bucketUrl:
                description: "URL of the GCS bucket."
                type: string

This XRD defines a high-level abstraction. A developer doesn’t ask for a GKECluster or a CloudSQLInstance; they ask for an ApplicationEnvironment with a nodeSize. The implementation details are hidden within a Composition. This is the coolest part of Crossplane: it separates the what (the API) from the how (the implementation).
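
Concretely, a claim against this API is tiny. A minimal example of what a developer, or later our Django API, submits (the name and namespace here are illustrative):

# claim.yaml
apiVersion: platform.acme.io/v1alpha1
kind: ApplicationEnvironment
metadata:
  name: feature-x-review-app
  namespace: default
spec:
  compositionSelector:
    matchLabels:
      provider: gcp
  parameters:
    nodeSize: small
    storageGB: 20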

Here is a snippet of our Composition that satisfies the XApplicationEnvironment contract using Google Cloud resources.

# composition.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: gcp.applicationenvironment.platform.acme.io
  labels:
    provider: gcp
spec:
  writeConnectionSecretsToNamespace: crossplane-system
  compositeTypeRef:
    apiVersion: platform.acme.io/v1alpha1
    kind: XApplicationEnvironment
  patchSets:
  - name: metadata
    patches:
    - fromFieldPath: "metadata.labels"
  resources:
    - name: gke-cluster
      base:
        apiVersion: container.gcp.upbound.io/v1beta1
        kind: Cluster
        spec:
          forProvider:
            location: us-central1
            initialNodeCount: 1
            removeDefaultNodePool: true
          deletionPolicy: "Delete"
      patches:
        - fromFieldPath: "spec.parameters.deletionPolicy"
          toFieldPath: "spec.deletionPolicy"
        - fromFieldPath: "metadata.name"
          toFieldPath: "metadata.name"
          transforms:
            - type: string
              string:
                fmt: "%s-gke"
        - fromFieldPath: "spec.parameters.nodeSize"
          toFieldPath: "spec.forProvider.nodePool[0].config.machineType"
          transforms:
            - type: map
              map:
                small: "e2-small"
                medium: "e2-medium"
                large: "e2-standard-4"
      connectionDetails:
        - fromConnectionSecretKey: kubeconfig
    - name: cloudsql-instance
      base:
        apiVersion: sql.gcp.upbound.io/v1beta1
        kind: DatabaseInstance
        spec:
          forProvider:
            databaseVersion: POSTGRES_14
            region: us-central1
            settings:
              - tier: db-g1-small
      patches:
        # ... patches for name, deletion policy, etc. ...
      connectionDetails:
        - fromConnectionSecretKey: "connectionName"
          name: "db_host"
        - fromConnectionSecretKey: "password"
          name: "db_password"

    # ... Resource template for a GCS Bucket ...

With these two manifests applied to our management cluster, Crossplane’s machinery is ready. We can now create ApplicationEnvironment objects (the “claims”), and Crossplane will provision the corresponding GCP resources. The next step is to build an API that can create these claims on behalf of our users.

The Django API as a Smart Control Plane

We deliberately decided against exposing the Kubernetes API directly. Instead, our Django application acts as an intelligent intermediary. This allows us to enforce policies, manage tenancy, and provide a simplified interface that hides the underlying Kubernetes complexity.

First, we configured the Django project to communicate with our Kubernetes cluster using the official Python client.

# settings.py
# ... other Django settings

from kubernetes import config

# This assumes the Django app is running in-cluster with a ServiceAccount
# For local development, it would load from ~/.kube/config
try:
    config.load_incluster_config()
except config.ConfigException:
    config.load_kube_config()

Next, we built a DRF ViewSet to handle the creation of ApplicationEnvironment claims. The create method is where the magic happens. It takes simple JSON from the user, constructs the full Kubernetes object dictionary, and uses the client to apply it.

# idp_app/views.py
import logging
from uuid import uuid4

from rest_framework import viewsets, status
from rest_framework.response import Response
from kubernetes import client
from kubernetes.client.rest import ApiException

logger = logging.getLogger(__name__)

class ApplicationEnvironmentViewSet(viewsets.ViewSet):
    """
    A ViewSet for creating and managing ApplicationEnvironments.
    """
    def create(self, request):
        """
        Creates an ApplicationEnvironment claim in Kubernetes.
        """
        user_id = request.user.username
        env_name = request.data.get("name")
        node_size = request.data.get("nodeSize", "small")

        if not env_name:
            return Response(
                {"error": "Environment name is required."},
                status=status.HTTP_400_BAD_REQUEST,
            )
        
        # A unique ID for tracking and linking.
        request_id = str(uuid4())
        claim_name = f"{env_name}-{user_id}"

        # Construct the claim object dynamically
        claim_obj = {
            "apiVersion": "platform.acme.io/v1alpha1",
            "kind": "ApplicationEnvironment",
            "metadata": {
                "name": claim_name,
                "namespace": "default", # Or a user-specific namespace
                "labels": {
                    "request-id": request_id,
                    "owner": user_id,
                },
            },
            "spec": {
                "compositionSelector": {
                    "matchLabels": {
                        "provider": "gcp"
                    }
                },
                "parameters": {
                    "nodeSize": node_size
                }
            }
        }

        api = client.ApiClient()
        custom_objects_api = client.CustomObjectsApi(api)

        try:
            custom_objects_api.create_namespaced_custom_object(
                group="platform.acme.io",
                version="v1alpha1",
                namespace="default",
                plural="applicationenvironments",
                body=claim_obj,
            )
            logger.info(f"Successfully created claim {claim_name} with request_id {request_id}")
            return Response(
                {"status": "creating", "name": claim_name, "requestId": request_id},
                status=status.HTTP_202_ACCEPTED
            )
        except ApiException as e:
            logger.error(f"Failed to create claim {claim_name}: {e.body}")
            return Response(
                {"error": "Failed to create environment.", "details": e.body},
                status=status.HTTP_500_INTERNAL_SERVER_ERROR,
            )
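
Exposing this ViewSet is standard DRF routing. A minimal sketch, assuming the project package is named idp_project (the module and route names are ours and purely illustrative):

# idp_project/urls.py
from django.urls import include, path
from rest_framework.routers import DefaultRouter

from idp_app.views import ApplicationEnvironmentViewSet

router = DefaultRouter()
# basename is required because this ViewSet does not define a queryset.
router.register(r"environments", ApplicationEnvironmentViewSet, basename="environment")

urlpatterns = [
    path("api/", include(router.urls)),
]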

This is already a huge improvement. Developers can now hit a simple REST endpoint. But we’re missing the reactive part. How does the frontend know when the environment is ready? Polling the Kubernetes API for status is inefficient and slow. We need to push status updates to the client in real-time. This is a perfect job for Django Channels and WebSockets.

We created a WebSocket consumer that uses the Kubernetes client’s watch functionality. It opens a persistent connection to the Kubernetes API and receives an event every time an ApplicationEnvironment object changes. It then pushes a distilled version of that event down the WebSocket to the correct client.

# idp_app/consumers.py
import asyncio
import logging
import time

from channels.generic.websocket import AsyncJsonWebsocketConsumer
from kubernetes import client, config, watch

logger = logging.getLogger(__name__)

class EnvironmentStatusConsumer(AsyncJsonWebsocketConsumer):
    async def connect(self):
        self.user = self.scope["user"]
        self.group_name = None
        self.keep_watching = True
        if self.user.is_anonymous:
            await self.close()
            return

        self.group_name = f"user-{self.user.username}"
        await self.channel_layer.group_add(self.group_name, self.channel_name)
        await self.accept()

        # The Kubernetes watch stream is a blocking generator, so we run it in
        # a worker thread rather than on the event loop.
        loop = asyncio.get_running_loop()
        loop.run_in_executor(None, self.watch_environments, loop)

    async def disconnect(self, close_code):
        # Tell the watcher thread to stop, and leave the group if we joined it.
        self.keep_watching = False
        if self.group_name:
            await self.channel_layer.group_discard(self.group_name, self.channel_name)

    def watch_environments(self, loop):
        """
        Watches for changes in ApplicationEnvironment custom resources and
        sends updates to the client. Runs in a worker thread because the
        watch stream blocks; events are handed back to the event loop.
        """
        # Load config inside the worker thread
        try:
            config.load_incluster_config()
        except config.ConfigException:
            config.load_kube_config()

        custom_api = client.CustomObjectsApi()
        w = watch.Watch()

        # A common mistake is not handling watch stream disconnections.
        # This loop ensures we attempt to reconnect, and the watch timeout
        # lets us notice a client disconnect within ~30 seconds.
        while self.keep_watching:
            try:
                # We filter by label to only watch resources owned by the current user
                stream = w.stream(
                    custom_api.list_namespaced_custom_object,
                    group="platform.acme.io",
                    version="v1alpha1",
                    namespace="default",
                    plural="applicationenvironments",
                    label_selector=f"owner={self.user.username}",
                    timeout_seconds=30,
                )
                for event in stream:
                    event_type = event['type']
                    obj = event['object']
                    name = obj['metadata']['name']

                    # The Crossplane `Ready` condition is the source of truth.
                    status_conditions = obj.get('status', {}).get('conditions', [])
                    is_ready = any(c.get('type') == 'Ready' and c.get('status') == 'True' for c in status_conditions)

                    status_text = "Ready" if is_ready else "Provisioning"
                    if event_type == 'DELETED':
                        status_text = 'Deleting'

                    request_id = obj['metadata']['labels'].get('request-id')

                    # Hand the message back to the event loop for delivery
                    asyncio.run_coroutine_threadsafe(
                        self.send_json({
                            "requestId": request_id,
                            "name": name,
                            "status": status_text,
                            "eventType": event_type,
                        }),
                        loop,
                    )

            except Exception as e:
                logger.error(f"K8s watch stream for user {self.user.username} failed: {e}")
                time.sleep(5)  # Wait before retrying

    # Handler for updates that other parts of the app push to this user's
    # group via channel_layer.group_send
    async def environment_update(self, event):
        await self.send_json(event["message"])
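
The consumer also has to be reachable over ASGI. A sketch of the routing, assuming the project package is named idp_project and that CHANNEL_LAYERS is already configured in settings (for example, the Redis backend):

# idp_project/asgi.py
import os

from channels.auth import AuthMiddlewareStack
from channels.routing import ProtocolTypeRouter, URLRouter
from django.core.asgi import get_asgi_application
from django.urls import re_path

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "idp_project.settings")

# Initialise Django before importing anything that pulls in models or consumers.
django_asgi_app = get_asgi_application()

from idp_app.consumers import EnvironmentStatusConsumer  # noqa: E402

application = ProtocolTypeRouter({
    "http": django_asgi_app,
    # AuthMiddlewareStack populates scope["user"] from the session; the
    # token-in-query-string approach hinted at in the frontend would need a
    # custom middleware here instead.
    "websocket": AuthMiddlewareStack(
        URLRouter([
            re_path(r"^ws/status/$", EnvironmentStatusConsumer.as_asgi()),
        ])
    ),
})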

With this in place, our Django backend is now a fully-fledged, real-time control plane. It can accept requests to create infrastructure and proactively push status updates to any connected client. The final piece is the client itself.

A Reactive Frontend with Jotai

The frontend’s primary job is to provide a clean interface for creating environments and a real-time dashboard reflecting their status. We chose Jotai because its atomic model maps perfectly to our problem. Each environment on the dashboard is an independent unit of state.

First, we defined our core atoms. The environmentsAtom holds a map of environments, where the key is the requestId we generated in the backend. Each value is another atom containing the detailed state for that environment. This “atom-in-atom” pattern is incredibly powerful for dynamic lists.

// src/state/atoms.js
import { atom } from 'jotai';

// An atom to hold the state of the WebSocket connection
export const socketStatusAtom = atom('disconnected');

// The main store for our environments.
// We use a Map for efficient lookups by our unique request ID.
// The value is an atom itself, so only components using that specific
// environment will re-render when it updates.
export const environmentsAtom = atom(new Map());

// A derived atom to get the environments as an array for rendering lists.
export const environmentsListAtom = atom((get) =>
  Array.from(get(environmentsAtom).values())
);

// An action atom for adding or updating an environment.
// This is the primary way we'll mutate our state from the WebSocket handler.
export const upsertEnvironmentAtom = atom(
  null,
  (get, set, update) => {
    const currentEnvironments = get(environmentsAtom);
    const existingAtom = currentEnvironments.get(update.requestId);

    if (existingAtom) {
      // The atom for this environment already exists: update its value only.
      // The Map itself is untouched, so list-level subscribers don't re-render.
      set(existingAtom, (prev) => ({ ...prev, ...update }));
    } else {
      // Otherwise, create a new atom and add it to a fresh copy of the map.
      const newEnvironmentAtom = atom({
        requestId: update.requestId,
        name: update.name,
        status: update.status,
      });
      const next = new Map(currentEnvironments);
      next.set(update.requestId, newEnvironmentAtom);
      set(environmentsAtom, next);
    }
  }
);

Next, we created a WebSocket management component. This component is responsible for establishing the connection and dispatching updates to our Jotai store. It’s designed to be mounted once at the top level of the application.

// src/components/WebSocketManager.jsx
import { useEffect, useRef } from 'react';
import { useSetAtom } from 'jotai';
import { socketStatusAtom, upsertEnvironmentAtom } from '../state/atoms';

const WebSocketManager = () => {
  const setSocketStatus = useSetAtom(socketStatusAtom);
  const upsertEnvironment = useSetAtom(upsertEnvironmentAtom);
  const socket = useRef(null);

  useEffect(() => {
    let active = true; // Prevents reconnect attempts after unmount.

    const connect = () => {
      // In a real app, the token would come from an auth context.
      const authToken = 'your-auth-token';
      const socketUrl = `ws://localhost:8000/ws/status/?token=${authToken}`;

      socket.current = new WebSocket(socketUrl);

      socket.current.onopen = () => {
        console.log('WebSocket connected');
        setSocketStatus('connected');
      };

      socket.current.onmessage = (event) => {
        const data = JSON.parse(event.data);
        console.log('Received update:', data);
        // This is the core integration point. A message from the backend
        // triggers an update to our Jotai state.
        upsertEnvironment(data);
      };

      socket.current.onclose = () => {
        setSocketStatus('disconnected');
        if (active) {
          console.log('WebSocket disconnected. Reconnecting in 5s...');
          // Simple exponential backoff could be added here.
          setTimeout(connect, 5000);
        }
      };

      socket.current.onerror = (err) => {
        console.error('WebSocket error:', err);
        socket.current.close();
      };
    };

    connect();

    return () => {
      active = false;
      if (socket.current) {
        socket.current.close();
      }
    };
  }, [setSocketStatus, upsertEnvironment]);

  return null; // This component does not render anything.
};

export default WebSocketManager;

Finally, the UI component that displays a single environment is incredibly simple. It uses useAtomValue to subscribe only to the atom for the environment it represents. When the WebSocket manager updates that specific atom, only this component will re-render, not the entire list. This is the performance and elegance we were looking for.

// src/components/EnvironmentCard.jsx
import { useAtomValue } from 'jotai';

const StatusIndicator = ({ status }) => {
  const colorMap = {
    'Provisioning': 'bg-yellow-500',
    'Ready': 'bg-green-500',
    'Deleting': 'bg-gray-500',
    'Error': 'bg-red-500',
  };
  return (
    <div className={`w-3 h-3 rounded-full ${colorMap[status] || 'bg-gray-300'}`} />
  );
};

const EnvironmentCard = ({ environmentAtom }) => {
  const environment = useAtomValue(environmentAtom);

  return (
    <div className="p-4 border rounded-lg shadow-sm flex justify-between items-center">
      <div>
        <p className="font-bold text-lg">{environment.name}</p>
        <p className="text-sm text-gray-500">{environment.requestId}</p>
      </div>
      <div className="flex items-center space-x-2">
        <StatusIndicator status={environment.status} />
        <span className="font-mono text-sm">{environment.status}</span>
      </div>
    </div>
  );
};

export default EnvironmentCard;
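
Tying the pieces together is just composition. A minimal dashboard sketch (we read the environments Map directly so the requestId can serve as the React key; the derived list atom would work just as well):

// src/components/Dashboard.jsx
import { useAtomValue } from 'jotai';
import { environmentsAtom } from '../state/atoms';
import WebSocketManager from './WebSocketManager';
import EnvironmentCard from './EnvironmentCard';

const Dashboard = () => {
  // Re-renders only when environments are added or removed; status changes
  // flow through the per-environment atoms instead.
  const environments = useAtomValue(environmentsAtom);

  return (
    <div className="space-y-4">
      {/* Mounted once here for brevity; in a larger app this belongs at the root. */}
      <WebSocketManager />
      {Array.from(environments.entries()).map(([requestId, envAtom]) => (
        <EnvironmentCard key={requestId} environmentAtom={envAtom} />
      ))}
    </div>
  );
};

export default Dashboard;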

This completes the entire reactive loop. The overall architecture can be visualized as follows:

sequenceDiagram
    participant User
    participant Frontend as Frontend (React/Jotai)
    participant Backend as Backend (Django/Channels)
    participant K8sAPI as Kubernetes API
    participant Crossplane
    participant Cloud as Cloud Provider API

    User->>Frontend: Clicks "Create Environment"
    Frontend->>Backend: POST /api/environments/
    Backend->>K8sAPI: CREATE ApplicationEnvironment claim
    K8sAPI->>Crossplane: Notifies of new claim
    Crossplane->>K8sAPI: Creates composed resources (GKE, SQL)
    Crossplane->>Cloud: Provisions actual infrastructure

    Note over Cloud,Crossplane: Long-running provisioning...

    Cloud-->>Crossplane: Infrastructure is ready
    Crossplane-->>K8sAPI: Updates status of composed resources
    Crossplane-->>K8sAPI: Updates claim status (Ready=True)

    participant Watcher as K8s Watcher
    K8sAPI-->>Watcher: Event: claim status changed
    Watcher-->>Backend: Pushes event internally
    Backend-->>Frontend: WebSocket message: {status: 'Ready'}
    Frontend-->>User: UI updates in real-time

This architecture, while complex to set up, delivers an incredible developer experience. It abstracts away the raw infrastructure YAML, provides a secure and governed API, and offers the real-time feedback necessary to build trust in the platform. The choice to combine Crossplane’s declarative power, Django’s robust API capabilities, and Jotai’s granular state management created a system where each component plays to its strengths.

However, this solution is not without its own complexities and limitations. The Django WebSocket consumer, running as part of the main application, opens a separate Kubernetes watch for every connected client, making it both a potential single point of failure and a scalability bottleneck. A more production-grade architecture might decouple this watcher into a separate microservice that communicates with the API server via a message bus like Kafka or NATS. Furthermore, the security model is nascent; a complete implementation would require a sophisticated mapping of Django user permissions to Kubernetes RBAC to ensure true multi-tenancy and least privilege. The current implementation also focuses on the “happy path” of provisioning; handling complex failure modes, rollbacks, and surfacing detailed error messages to the user through this reactive loop remains a significant challenge for future iterations.

