The computer vision team’s development loop was broken. Onboarding a new engineer involved a multi-day, documentation-heavy process of manually configuring a GPU-enabled cloud instance. Engineers were spending more time wrestling with NVIDIA drivers, CUDA versions, and Python dependencies than they were with OpenCV algorithms. Every developer’s environment was a unique snowflake, leading to CI failures and the classic “it works on my machine” standoff. The core pain point was a lack of standardized, reproducible, on-demand development environments. Our initial attempts with shell scripts and wikis only exacerbated the problem, creating more maintenance overhead for the platform team.
Our goal became to provide a fully automated, self-service platform. A developer should be able to request a pre-configured, isolated workspace with a specific GPU type and software stack via a simple web UI and have it provisioned within minutes. The entire lifecycle—creation, status monitoring, and deletion—had to be declarative and managed through a central control plane. This wasn’t just about automation; it was about treating developer environments as ephemeral, cattle-not-pets resources.
We evaluated several technology stacks. A simple Terraform module wrapped in a Jenkins job was an early contender. However, this approach is fundamentally imperative. It’s a “fire-and-forget” script. We needed a system with a continuous reconciliation loop that could actively manage the state of the workspace, ensuring it always matched the desired configuration. This naturally led us to the Kubernetes ecosystem.
Crossplane emerged as the ideal foundation. Instead of using Kubernetes to orchestrate containers, Crossplane allows us to use the Kubernetes API to orchestrate anything; in our case, cloud infrastructure like EC2 instances, security groups, and IAM roles. We could define our own custom resource, kind: GpuDevWorkspace, and let Crossplane translate that abstract request into concrete AWS resources. This provides the declarative, continuously reconciling control plane we were missing. For the user interface, a simple React frontend would serve as the self-service portal, abstracting away all the underlying Kubernetes and cloud complexity from our target users: the CV engineers. OpenCV isn’t a direct part of the platform’s stack, but its demanding requirements (specific drivers, libraries, GPU access) are the entire reason this platform needs to exist.
Defining the Platform’s API with Crossplane
The first step was to define what a GpuDevWorkspace is. This is the contract between the platform and its users. We used Crossplane’s CompositeResourceDefinition (XRD) to create a new Kubernetes API. A common mistake here is to expose every possible cloud provider option. In a real-world project, the goal is to provide a curated, opinionated abstraction. We decided to expose only what was necessary for the CV team.
# filename: 01-xrd-gpudevworkspace.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
name: gpudevworkspaces.platform.acme.io
spec:
group: platform.acme.io
names:
kind: GpuDevWorkspace
plural: gpudevworkspaces
claimNames:
kind: GpuDevWorkspaceClaim
plural: gpudevworkspaceclaims
versions:
- name: v1alpha1
served: true
referenceable: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
parameters:
type: object
properties:
# User-configurable parameters for the workspace
instanceType:
type: string
description: "The AWS EC2 instance type (e.g., g4dn.xlarge)."
default: "g4dn.xlarge"
region:
type: string
description: "AWS region for the workspace."
default: "us-east-1"
owner:
type: string
description: "Identifier for the user who owns this workspace."
opencvVersion:
type: string
description: "The version of OpenCV to pre-install."
default: "4.8.0"
required:
- owner
required:
- parameters
status:
type: object
properties:
# Fields to be populated by the backend/controllers
instanceId:
type: string
publicIp:
type: string
connectionDetails:
type: string
description: "Instructions on how to connect (e.g., SSH command)."
phase:
type: string
description: "The current state of the workspace (Provisioning, Ready, Deleting, Failed)."
This XRD establishes our GpuDevWorkspace resource. Notice the status block; this is crucial for communicating the state of the provisioning process back to the user via our frontend. The claimNames stanza also gives us a namespaced GpuDevWorkspaceClaim, the developer-facing resource our backend creates on a user's behalf (the composite itself is cluster-scoped).
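For reference, here is a minimal sketch of what a workspace request looks like once expressed as such a claim; the name, namespace, and field values are illustrative rather than part of the platform definition.
# Example: a developer-facing claim (illustrative values).
apiVersion: platform.acme.io/v1alpha1
kind: GpuDevWorkspaceClaim
metadata:
  name: ws-jdoe-example
  namespace: default
spec:
  parameters:
    owner: jdoe@acme.io
    instanceType: g4dn.xlarge
    opencvVersion: "4.8.0"
    region: us-east-1
  # Crossplane writes connection details for this claim into a Secret
  # in the claim's own namespace.
  writeConnectionSecretToRef:
    name: ws-jdoe-example-conn
Applying a manifest like this by hand has the same effect as the self-service flow described below; the portal simply automates it.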
With the API defined, we needed to teach Crossplane how to satisfy it. This is done with a Composition. The Composition maps our abstract GpuDevWorkspace to a set of concrete managed resources from a Crossplane provider, in this case provider-aws.
# filename: 02-composition-aws-ec2.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: gpudevworkspace-aws-ec2
  labels:
    provider: aws
spec:
  compositeTypeRef:
    apiVersion: platform.acme.io/v1alpha1
    kind: GpuDevWorkspace
  resources:
    - name: ec2Instance
      base:
        apiVersion: ec2.aws.upbound.io/v1beta1
        kind: Instance
        spec:
          forProvider:
            # We use a pre-baked AMI for faster startup times.
            # This AMI should have Docker, NVIDIA drivers, and CUDA pre-installed.
            ami: "ami-0a123b456c789d0e1"
            tags:
              ManagedBy: "Crossplane"
              WorkspaceOwner: "unknown" # Will be patched
      patches:
        - fromFieldPath: "spec.parameters.instanceType"
          toFieldPath: "spec.forProvider.instanceType"
        - fromFieldPath: "spec.parameters.region"
          toFieldPath: "spec.forProvider.region"
        - fromFieldPath: "spec.parameters.owner"
          toFieldPath: "spec.forProvider.tags.WorkspaceOwner"
        # Use the composite's unique ID in the instance's Name tag.
        # (EC2 instance IDs are assigned by AWS, so we leave the
        # crossplane.io/external-name annotation alone.)
        - fromFieldPath: "metadata.uid"
          toFieldPath: "spec.forProvider.tags.Name"
          transforms:
            - type: string
              string:
                fmt: "gpudev-%s"
        # Patch the startup script into userData
        - type: CombineFromComposite
          combine:
            variables:
              - fromFieldPath: spec.parameters.opencvVersion
            strategy: string
            string:
              fmt: |
                #!/bin/bash
                # cloud-init script to configure the instance on first boot.
                # The base AMI already has NVIDIA drivers, CUDA, and Docker.
                set -e
                export OPENCV_VERSION="%s"
                # Simple logging for debug purposes
                exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
                echo "Starting user-data script for OpenCV v${OPENCV_VERSION}"
                # Update and install dependencies
                apt-get update
                apt-get install -y python3-pip python3-dev
                # Install OpenCV (the contrib wheel bundles the main modules,
                # so we install only one variant to avoid conflicts)
                pip3 install opencv-contrib-python-headless==${OPENCV_VERSION}
                echo "OpenCV installation complete."
                # Create a user for the developer
                useradd -m -s /bin/bash developer
                mkdir -p /home/developer/.ssh
                # NOTE: In a production setup, the public key should be fetched
                # from a secure store, not hardcoded.
                echo "ssh-rsa AAAA..." >> /home/developer/.ssh/authorized_keys
                chown -R developer:developer /home/developer/.ssh
                chmod 700 /home/developer/.ssh
                chmod 600 /home/developer/.ssh/authorized_keys
                echo "Developer user configured. Workspace is ready."
          toFieldPath: spec.forProvider.userData
        # These patches surface outputs from the managed resource back to the
        # composite's status; ToCompositeFieldPath copies values upward.
        - type: ToCompositeFieldPath
          fromFieldPath: "status.atProvider.id"
          toFieldPath: "status.instanceId"
        - type: ToCompositeFieldPath
          fromFieldPath: "status.atProvider.publicIp"
          toFieldPath: "status.publicIp"
        - type: ToCompositeFieldPath
          fromFieldPath: "status.atProvider.publicIp"
          toFieldPath: "status.connectionDetails"
          transforms:
            - type: string
              string:
                fmt: "ssh developer@%s"
The Composition is where the real work happens. The patches array is the most powerful feature, allowing us to map fields from our abstract resource (GpuDevWorkspace) to the concrete ec2.Instance resource. The userData script is critical; it performs the final configuration for the specific OpenCV version requested by the user. A significant pitfall in early designs is making this script too complex. Our strategy is to pre-bake as much as possible into a custom Amazon Machine Image (AMI) to minimize boot time. The script should only handle lightweight, dynamic configuration.
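One gap worth calling out: nothing in the Composition ever populates status.phase, so the UI would report every workspace as Pending indefinitely. Below is a minimal sketch of how this could be wired up, assuming the provider exposes the EC2 state as status.atProvider.instanceState; that field name is an assumption to verify against your provider version.
# Hypothetical extra patch for the ec2Instance entry; maps the raw EC2
# state onto the phase values the frontend understands.
- type: ToCompositeFieldPath
  fromFieldPath: status.atProvider.instanceState
  toFieldPath: status.phase
  transforms:
    - type: map
      map:
        pending: Provisioning
        running: Ready
        shutting-down: Deleting
        terminated: Failed
Until something like this lands, the Pending fallback in the backend is all the UI will ever show.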
The Bridge: A Backend-for-Frontend Service
Exposing the Kubernetes API directly to a web browser is a security nightmare. We need an intermediary service, a Backend-for-Frontend (BFF), that acts as a secure proxy. This Node.js/Express application exposes a simple REST API to our React frontend and uses a service account to interact with the Kubernetes cluster, managing the namespaced GpuDevWorkspaceClaim resources on developers' behalf.
// filename: bff/server.js
const express = require('express');
const { KubeConfig, CustomObjectsApi } = require('@kubernetes/client-node');
const app = express();
app.use(express.json());
const kc = new KubeConfig();
// This loads the configuration from the environment where the BFF is running.
// In Kubernetes, it will automatically use the pod's service account.
kc.loadFromDefault();
const k8sApi = kc.makeApiClient(CustomObjectsApi);
const GROUP = 'platform.acme.io';
const VERSION = 'v1alpha1';
// We create the namespaced claim; the composite (XR) itself is cluster-scoped.
const PLURAL = 'gpudevworkspaceclaims';
const NAMESPACE = 'default'; // Or a dedicated namespace for workspaces
// --- API Endpoints ---
// Create a new workspace
app.post('/api/workspaces', async (req, res) => {
const { owner, instanceType, opencvVersion } = req.body;
if (!owner) {
return res.status(400).json({ error: 'Owner field is required.' });
}
// A real-world project would have more robust validation.
const workspaceName = `ws-${owner.toLowerCase().replace(/[^a-z0-9]/g, '')}-${Date.now()}`;
const workspaceSpec = {
apiVersion: `${GROUP}/${VERSION}`,
    kind: 'GpuDevWorkspaceClaim',
metadata: {
name: workspaceName,
namespace: NAMESPACE,
},
spec: {
parameters: {
owner,
instanceType: instanceType || 'g4dn.xlarge',
opencvVersion: opencvVersion || '4.8.0',
region: 'us-east-1',
},
      // Automatically track new Composition revisions. Deleting the claim removes
      // the composed cloud resources via Crossplane's garbage collection.
      compositionUpdatePolicy: 'Automatic',
      // For claims, the connection Secret is written into the claim's own namespace.
      writeConnectionSecretToRef: {
        name: `${workspaceName}-conn`,
      },
},
};
try {
console.log(`Creating GpuDevWorkspace: ${workspaceName}`);
const { body } = await k8sApi.createNamespacedCustomObject(GROUP, VERSION, NAMESPACE, PLURAL, workspaceSpec);
res.status(202).json(body);
} catch (err) {
console.error('Failed to create workspace:', err.body ? JSON.stringify(err.body) : err.message);
res.status(500).json({ error: 'Failed to communicate with the Kubernetes API.' });
}
});
// List all workspaces
app.get('/api/workspaces', async (req, res) => {
try {
const { body } = await k8sApi.listNamespacedCustomObject(GROUP, VERSION, NAMESPACE, PLURAL);
// We only send a curated list of fields to the frontend.
const workspaces = body.items.map(item => ({
name: item.metadata.name,
owner: item.spec.parameters.owner,
instanceType: item.spec.parameters.instanceType,
status: {
phase: item.status?.phase || 'Pending',
instanceId: item.status?.instanceId,
publicIp: item.status?.publicIp,
connection: item.status?.connectionDetails,
},
creationTimestamp: item.metadata.creationTimestamp,
}));
res.json(workspaces);
} catch (err) {
console.error('Failed to list workspaces:', err.body ? JSON.stringify(err.body) : err.message);
res.status(500).json({ error: 'Failed to list workspaces.' });
}
});
// Delete a workspace
app.delete('/api/workspaces/:name', async (req, res) => {
const { name } = req.params;
try {
console.log(`Deleting GpuDevWorkspace: ${name}`);
await k8sApi.deleteNamespacedCustomObject(GROUP, VERSION, NAMESPACE, PLURAL, name);
res.status(204).send();
} catch (err) {
// Handle 'not found' gracefully
if (err.statusCode === 404) {
return res.status(404).json({ error: 'Workspace not found.' });
}
console.error(`Failed to delete workspace ${name}:`, err.body ? JSON.stringify(err.body) : err.message);
res.status(500).json({ error: `Failed to delete workspace ${name}.` });
}
});
const PORT = process.env.PORT || 3001;
app.listen(PORT, () => {
console.log(`BFF server listening on port ${PORT}`);
});
This BFF is lean by design. Its only job is to translate authenticated HTTP requests into Kubernetes API calls. Error handling and logging are paramount. The code handles missing fields, Kubernetes API errors, and formats the output specifically for the frontend’s needs, preventing the leakage of internal Kubernetes object structures. For production, this BFF would need proper authentication and authorization middleware to identify the user making the request.
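The service account the BFF runs under also needs explicit RBAC on the claim resources. A minimal sketch follows, assuming the BFF's ServiceAccount is named bff and lives in a platform namespace while claims live in default; all of these names are assumptions.
# Hypothetical RBAC for the BFF's service account. Namespace and
# ServiceAccount names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workspace-manager
  namespace: default
rules:
  - apiGroups: ["platform.acme.io"]
    resources: ["gpudevworkspaceclaims"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workspace-manager-bff
  namespace: default
subjects:
  - kind: ServiceAccount
    name: bff
    namespace: platform
roleRef:
  kind: Role
  name: workspace-manager
  apiGroup: rbac.authorization.k8s.io
Scoping this to a Role rather than a ClusterRole keeps the BFF confined to the single namespace where workspaces live.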
The Self-Service Portal: A React Frontend
The final piece is the user-facing portal. We used React with a simple state management solution to keep it lightweight. The UI has two main functions: a form to create new workspaces and a dashboard to view and delete existing ones.
// filename: frontend/src/WorkspaceDashboard.js
import React, { useState, useEffect, useCallback } from 'react';
import axios from 'axios';
// --- Components ---
const WorkspaceForm = ({ onWorkspaceCreated }) => {
const [owner, setOwner] = useState('');
const [instanceType, setInstanceType] = useState('g4dn.xlarge');
const [isSubmitting, setIsSubmitting] = useState(false);
const [error, setError] = useState(null);
const handleSubmit = async (e) => {
e.preventDefault();
setIsSubmitting(true);
setError(null);
try {
await axios.post('/api/workspaces', { owner, instanceType });
setOwner(''); // Reset form
onWorkspaceCreated();
} catch (err) {
setError(err.response?.data?.error || 'Failed to create workspace.');
} finally {
setIsSubmitting(false);
}
};
return (
<form onSubmit={handleSubmit} className="workspace-form">
<h2>Create New Workspace</h2>
{error && <div className="error-banner">{error}</div>}
<div>
<label>Owner Email:</label>
<input type="email" value={owner} onChange={(e) => setOwner(e.target.value)} required />
</div>
<div>
<label>Instance Type:</label>
<select value={instanceType} onChange={(e) => setInstanceType(e.target.value)}>
<option value="g4dn.xlarge">g4dn.xlarge (Standard)</option>
<option value="g4dn.2xlarge">g4dn.2xlarge (Large)</option>
<option value="p3.2xlarge">p3.2xlarge (High Perf)</option>
</select>
</div>
<button type="submit" disabled={isSubmitting}>
{isSubmitting ? 'Provisioning...' : 'Create Workspace'}
</button>
</form>
);
};
const WorkspaceList = ({ workspaces, onDelete }) => {
const handleDelete = async (name) => {
if (window.confirm(`Are you sure you want to delete workspace ${name}?`)) {
try {
await axios.delete(`/api/workspaces/${name}`);
onDelete();
} catch (err) {
alert('Failed to delete workspace.');
}
}
};
return (
<div className="workspace-list">
<h2>Active Workspaces</h2>
<table>
<thead>
<tr>
<th>Name</th>
<th>Owner</th>
<th>Status</th>
<th>Connection Info</th>
<th>Actions</th>
</tr>
</thead>
<tbody>
{workspaces.map(ws => (
<tr key={ws.name}>
<td>{ws.name}</td>
<td>{ws.owner}</td>
<td><span className={`status-pill ${ws.status.phase.toLowerCase()}`}>{ws.status.phase}</span></td>
<td>{ws.status.phase === 'Ready' ? <code>{ws.status.connection}</code> : '...'}</td>
<td>
<button onClick={() => handleDelete(ws.name)} className="delete-btn">Delete</button>
</td>
</tr>
))}
</tbody>
</table>
</div>
);
};
// --- Main App Component ---
export default function WorkspaceDashboard() {
const [workspaces, setWorkspaces] = useState([]);
const [isLoading, setIsLoading] = useState(true);
const fetchWorkspaces = useCallback(async () => {
try {
const { data } = await axios.get('/api/workspaces');
setWorkspaces(data);
} catch (error) {
console.error("Failed to fetch workspaces", error);
} finally {
setIsLoading(false);
}
}, []);
useEffect(() => {
fetchWorkspaces();
// A pitfall is hammering the API. In a production system, this should be
// a WebSocket or Server-Sent Events connection for real-time updates.
// For simplicity, we use polling here.
const intervalId = setInterval(fetchWorkspaces, 5000); // Poll every 5 seconds
return () => clearInterval(intervalId);
}, [fetchWorkspaces]);
if (isLoading) return <div>Loading workspaces...</div>;
return (
<div className="dashboard-container">
<h1>GPU Workspace Platform</h1>
<WorkspaceForm onWorkspaceCreated={fetchWorkspaces} />
<WorkspaceList workspaces={workspaces} onDelete={fetchWorkspaces} />
</div>
);
}
The frontend code is deliberately straightforward. It polls the BFF every few seconds to refresh the state of the workspaces. This is a pragmatic choice for a first version. The status is displayed with a colored “pill,” giving the user immediate visual feedback on the provisioning process (Pending, Provisioning, Ready). Once ready, the SSH connection command appears directly in the UI.
Architectural Flow
The entire system works in a clean, decoupled loop.
sequenceDiagram
    participant FE as Frontend (React)
    participant BFF
    participant K8s as Kubernetes API Server
    participant Crossplane
    participant Cloud as AWS API
    FE->>+BFF: POST /api/workspaces (owner, type)
    BFF->>+K8s: CREATE GpuDevWorkspaceClaim
    K8s-->>-BFF: Acknowledged
    BFF-->>-FE: 202 Accepted
    loop Status Polling
        FE->>+BFF: GET /api/workspaces
        BFF->>+K8s: GET GpuDevWorkspaceClaim list
        K8s-->>-BFF: List of claims with status
        BFF-->>-FE: Formatted list
    end
    K8s->>Crossplane: Notifies of new claim
    Crossplane->>Crossplane: Reconciliation loop starts
    Crossplane->>Cloud: CREATE EC2 Instance
    Crossplane->>Cloud: CREATE Security Group, etc.
    Cloud-->>Crossplane: Returns Instance ID, Public IP
    Crossplane->>+K8s: UPDATE workspace status
    K8s-->>-Crossplane: Status updated
This architecture successfully decouples the user experience from the infrastructure implementation. The CV team interacts with a simple web form, while the platform team manages the underlying complexity through declarative Kubernetes resources.
The current implementation, while functional, has clear limitations and areas for improvement. The frontend’s polling mechanism is inefficient and should be replaced with a WebSocket or SSE stream from the BFF for real-time updates. Cost management is a significant omission; there is no automated cleanup or leasing mechanism. A future iteration must include a ttl (time-to-live) parameter in the XRD and a dedicated controller to automatically de-provision expired workspaces, preventing runaway cloud costs. The userData script, while effective, increases instance boot time. A more mature system would use a tool like Packer to pre-bake AMIs for common configurations, reducing provisioning time from minutes to seconds. Finally, security could be hardened by integrating the BFF with an OIDC provider to map authenticated users to specific Kubernetes service accounts, enforcing finer-grained permissions on who can create or delete workspaces.
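As a sketch of that first item, the expiry could start as nothing more than an extra field in the XRD’s parameters schema, enforced later by a small cleanup controller or CronJob. The field name and bounds below are illustrative, not part of the current platform.
# Hypothetical addition to spec.parameters in the XRD schema.
ttlHours:
  type: integer
  description: "Hours after creation before the workspace is automatically de-provisioned."
  default: 8
  minimum: 1
  maximum: 72
A cleanup controller would then compare each claim’s creationTimestamp plus ttlHours against the current time and delete anything that has expired.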