The computer vision team’s development loop was broken. Onboarding a new engineer involved a multi-day, documentation-heavy process of manually configuring a GPU-enabled cloud instance. Engineers were spending more time wrestling with NVIDIA drivers, CUDA versions, and Python dependencies than they were with OpenCV algorithms. Every developer’s environment was a unique snowflake, leading to CI failures and the classic “it works on my machine” standoff. The core pain point was a lack of standardized, reproducible, on-demand development environments. Our initial attempts with shell scripts and wikis only exacerbated the problem, creating more maintenance overhead for the platform team.
Our goal became to provide a fully automated, self-service platform. A developer should be able to request a pre-configured, isolated workspace with a specific GPU type and software stack via a simple web UI and have it provisioned within minutes. The entire lifecycle—creation, status monitoring, and deletion—had to be declarative and managed through a central control plane. This wasn’t just about automation; it was about treating developer environments as ephemeral, cattle-not-pets resources.
We evaluated several technology stacks. A simple Terraform module wrapped in a Jenkins job was an early contender. However, this approach is fundamentally imperative. It’s a “fire-and-forget” script. We needed a system with a continuous reconciliation loop that could actively manage the state of the workspace, ensuring it always matched the desired configuration. This naturally led us to the Kubernetes ecosystem.
Crossplane emerged as the ideal foundation. Instead of using Kubernetes to orchestrate containers, Crossplane allows us to use the Kubernetes API to orchestrate anything; in our case, cloud infrastructure like EC2 instances, security groups, and IAM roles. We could define our own custom resource, kind: GpuDevWorkspace, and let Crossplane translate that abstract request into concrete AWS resources. This provides the declarative, continuously reconciling control plane we were missing. For the user interface, a simple React frontend would serve as the self-service portal, abstracting away all the underlying Kubernetes and cloud complexity from our target users: the CV engineers. OpenCV isn’t a direct part of the platform’s stack, but its demanding requirements (specific drivers, libraries, GPU access) are the entire reason this platform needs to exist.
Defining the Platform’s API with Crossplane
The first step was to define what a GpuDevWorkspace is. This is the contract between the platform and its users. We used Crossplane’s CompositeResourceDefinition (XRD) to create a new Kubernetes API. A common mistake here is to expose every possible cloud provider option. In a real-world project, the goal is to provide a curated, opinionated abstraction. We decided to expose only what was necessary for the CV team.
# filename: 01-xrd-gpudevworkspace.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
name: gpudevworkspaces.platform.acme.io
spec:
group: platform.acme.io
names:
kind: GpuDevWorkspace
plural: gpudevworkspaces
claimNames:
kind: GpuDevWorkspaceClaim
plural: gpudevworkspaceclaims
versions:
- name: v1alpha1
served: true
referenceable: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
parameters:
type: object
properties:
# User-configurable parameters for the workspace
instanceType:
type: string
description: "The AWS EC2 instance type (e.g., g4dn.xlarge)."
default: "g4dn.xlarge"
region:
type: string
description: "AWS region for the workspace."
default: "us-east-1"
owner:
type: string
description: "Identifier for the user who owns this workspace."
opencvVersion:
type: string
description: "The version of OpenCV to pre-install."
default: "4.8.0"
required:
- owner
required:
- parameters
status:
type: object
properties:
# Fields to be populated by the backend/controllers
instanceId:
type: string
publicIp:
type: string
connectionDetails:
type: string
description: "Instructions on how to connect (e.g., SSH command)."
phase:
type: string
description: "The current state of the workspace (Provisioning, Ready, Deleting, Failed)."
This XRD establishes our GpuDevWorkspace resource. Notice the status block; this is crucial for communicating the state of the provisioning process back to the user via our frontend. The claimNames stanza also gives us a namespaced GpuDevWorkspaceClaim, the developer-facing resource our backend creates on a user's behalf (the composite itself is cluster-scoped).
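For reference, here is a minimal sketch of what a workspace request looks like once expressed as such a claim; the name, namespace, and field values are illustrative rather than part of the platform definition.
# Example: a developer-facing claim (illustrative values).
apiVersion: platform.acme.io/v1alpha1
kind: GpuDevWorkspaceClaim
metadata:
  name: ws-jdoe-example
  namespace: default
spec:
  parameters:
    owner: jdoe@acme.io
    instanceType: g4dn.xlarge
    opencvVersion: "4.8.0"
    region: us-east-1
  # Crossplane writes connection details for this claim into a Secret
  # in the claim's own namespace.
  writeConnectionSecretToRef:
    name: ws-jdoe-example-conn
Applying a manifest like this by hand has the same effect as the self-service flow described below; the portal simply automates it.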
With the API defined, we needed to teach Crossplane how to satisfy it. This is done with a Composition. The Composition maps our abstract GpuDevWorkspace to a set of concrete managed resources from a Crossplane provider, in this case provider-aws.
# filename: 02-composition-aws-ec2.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: gpudevworkspace-aws-ec2
  labels:
    provider: aws
spec:
  compositeTypeRef:
    apiVersion: platform.acme.io/v1alpha1
    kind: GpuDevWorkspace
  resources:
    - name: ec2Instance
      base:
        apiVersion: ec2.aws.upbound.io/v1beta1
        kind: Instance
        spec:
          forProvider:
            # We use a pre-baked AMI for faster startup times.
            # This AMI should have Docker, NVIDIA drivers, and CUDA pre-installed.
            ami: "ami-0a123b456c789d0e1"
            tags:
              ManagedBy: "Crossplane"
              WorkspaceOwner: "unknown" # Will be patched
      patches:
        - fromFieldPath: "spec.parameters.instanceType"
          toFieldPath: "spec.forProvider.instanceType"
        - fromFieldPath: "spec.parameters.region"
          toFieldPath: "spec.forProvider.region"
        - fromFieldPath: "spec.parameters.owner"
          toFieldPath: "spec.forProvider.tags.WorkspaceOwner"
        # Use the composite's unique ID in the instance's Name tag.
        # (EC2 instance IDs are assigned by AWS, so we leave the
        # crossplane.io/external-name annotation alone.)
        - fromFieldPath: "metadata.uid"
          toFieldPath: "spec.forProvider.tags.Name"
          transforms:
            - type: string
              string:
                fmt: "gpudev-%s"
        # Patch the startup script into userData
        - type: CombineFromComposite
          combine:
            variables:
              - fromFieldPath: spec.parameters.opencvVersion
            strategy: string
            string:
              fmt: |
                #!/bin/bash
                # cloud-init script to configure the instance on first boot.
                # The base AMI already has NVIDIA drivers, CUDA, and Docker.
                set -e
                export OPENCV_VERSION="%s"
                # Simple logging for debug purposes
                exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
                echo "Starting user-data script for OpenCV v${OPENCV_VERSION}"
                # Update and install dependencies
                apt-get update
                apt-get install -y python3-pip python3-dev
                # Install OpenCV (the contrib wheel bundles the main modules,
                # so we install only one variant to avoid conflicts)
                pip3 install opencv-contrib-python-headless==${OPENCV_VERSION}
                echo "OpenCV installation complete."
                # Create a user for the developer
                useradd -m -s /bin/bash developer
                mkdir -p /home/developer/.ssh
                # NOTE: In a production setup, the public key should be fetched
                # from a secure store, not hardcoded.
                echo "ssh-rsa AAAA..." >> /home/developer/.ssh/authorized_keys
                chown -R developer:developer /home/developer/.ssh
                chmod 700 /home/developer/.ssh
                chmod 600 /home/developer/.ssh/authorized_keys
                echo "Developer user configured. Workspace is ready."
          toFieldPath: spec.forProvider.userData
        # These patches surface outputs from the managed resource back to the
        # composite's status; ToCompositeFieldPath copies values upward.
        - type: ToCompositeFieldPath
          fromFieldPath: "status.atProvider.id"
          toFieldPath: "status.instanceId"
        - type: ToCompositeFieldPath
          fromFieldPath: "status.atProvider.publicIp"
          toFieldPath: "status.publicIp"
        - type: ToCompositeFieldPath
          fromFieldPath: "status.atProvider.publicIp"
          toFieldPath: "status.connectionDetails"
          transforms:
            - type: string
              string:
                fmt: "ssh developer@%s"
The Composition is where the real work happens. The patches array is the most powerful feature, allowing us to map fields from our abstract resource (GpuDevWorkspace) to the concrete ec2.Instance resource. The userData script is critical; it performs the final configuration for the specific OpenCV version requested by the user. A significant pitfall in early designs is making this script too complex. Our strategy is to pre-bake as much as possible into a custom Amazon Machine Image (AMI) to minimize boot time. The script should only handle lightweight, dynamic configuration.
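One gap worth calling out: nothing in the Composition ever populates status.phase, so the UI would report every workspace as Pending indefinitely. Below is a minimal sketch of how this could be wired up, assuming the provider exposes the EC2 state as status.atProvider.instanceState; that field name is an assumption to verify against your provider version.
# Hypothetical extra patch for the ec2Instance entry; maps the raw EC2
# state onto the phase values the frontend understands.
- type: ToCompositeFieldPath
  fromFieldPath: status.atProvider.instanceState
  toFieldPath: status.phase
  transforms:
    - type: map
      map:
        pending: Provisioning
        running: Ready
        shutting-down: Deleting
        terminated: Failed
Until something like this lands, the Pending fallback in the backend is all the UI will ever show.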
The Bridge: A Backend-for-Frontend Service
Exposing the Kubernetes API directly to a web browser is a security nightmare. We need an intermediary service, a Backend-for-Frontend (BFF), that acts as a secure proxy. This Node.js/Express application exposes a simple REST API to our React frontend and uses a service account to interact with the Kubernetes cluster, managing the namespaced GpuDevWorkspaceClaim resources on developers' behalf.
// filename: bff/server.js
const express = require('express');
const { KubeConfig, CustomObjectsApi } = require('@kubernetes/client-node');
const app = express();
app.use(express.json());
const kc = new KubeConfig();
// This loads the configuration from the environment where the BFF is running.
// In Kubernetes, it will automatically use the pod's service account.
kc.loadFromDefault();
const k8sApi = kc.makeApiClient(CustomObjectsApi);
const GROUP = 'platform.acme.io';
const VERSION = 'v1alpha1';
// We create the namespaced claim; the composite (XR) itself is cluster-scoped.
const PLURAL = 'gpudevworkspaceclaims';
const NAMESPACE = 'default'; // Or a dedicated namespace for workspaces
// --- API Endpoints ---
// Create a new workspace
app.post('/api/workspaces', async (req, res) => {
const { owner, instanceType, opencvVersion } = req.body;
if (!owner) {
return res.status(400).json({ error: 'Owner field is required.' });
}
// A real-world project would have more robust validation.
const workspaceName = `ws-${owner.toLowerCase().replace(/[^a-z0-9]/g, '')}-${Date.now()}`;
const workspaceSpec = {
apiVersion: `${GROUP}/${VERSION}`,
    kind: 'GpuDevWorkspaceClaim',
metadata: {
name: workspaceName,
namespace: NAMESPACE,
},
spec: {
parameters: {
owner,
instanceType: instanceType || 'g4dn.xlarge',
opencvVersion: opencvVersion || '4.8.0',
region: 'us-east-1',
},
      // Automatically track new Composition revisions. Deleting the claim removes
      // the composed cloud resources via Crossplane's garbage collection.
      compositionUpdatePolicy: 'Automatic',
      // For claims, the connection Secret is written into the claim's own namespace.
      writeConnectionSecretToRef: {
        name: `${workspaceName}-conn`,
      },
},
};
try {
console.log(`Creating GpuDevWorkspace: ${workspaceName}`);
const { body } = await k8sApi.createNamespacedCustomObject(GROUP, VERSION, NAMESPACE, PLURAL, workspaceSpec);
res.status(202).json(body);
} catch (err) {
console.error('Failed to create workspace:', err.body ? JSON.stringify(err.body) : err.message);
res.status(500).json({ error: 'Failed to communicate with the Kubernetes API.' });
}
});
// List all workspaces
app.get('/api/workspaces', async (req, res) => {
try {
const { body } = await k8sApi.listNamespacedCustomObject(GROUP, VERSION, NAMESPACE, PLURAL);
// We only send a curated list of fields to the frontend.
const workspaces = body.items.map(item => ({
name: item.metadata.name,
owner: item.spec.parameters.owner,
instanceType: item.spec.parameters.instanceType,
status: {
phase: item.status?.phase || 'Pending',
instanceId: item.status?.instanceId,
publicIp: item.status?.publicIp,
connection: item.status?.connectionDetails,
},
creationTimestamp: item.metadata.creationTimestamp,
}));
res.json(workspaces);
} catch (err) {
console.error('Failed to list workspaces:', err.body ? JSON.stringify(err.body) : err.message);
res.status(500).json({ error: 'Failed to list workspaces.' });
}
});
// Delete a workspace
app.delete('/api/workspaces/:name', async (req, res) => {
const { name } = req.params;
try {
console.log(`Deleting GpuDevWorkspace: ${name}`);
await k8sApi.deleteNamespacedCustomObject(GROUP, VERSION, NAMESPACE, PLURAL, name);
res.status(204).send();
} catch (err) {
// Handle 'not found' gracefully
if (err.statusCode === 404) {
return res.status(404).json({ error: 'Workspace not found.' });
}
console.error(`Failed to delete workspace ${name}:`, err.body ? JSON.stringify(err.body) : err.message);
res.status(500).json({ error: `Failed to delete workspace ${name}.` });
}
});
const PORT = process.env.PORT || 3001;
app.listen(PORT, () => {
console.log(`BFF server listening on port ${PORT}`);
});
This BFF is lean by design. Its only job is to translate authenticated HTTP requests into Kubernetes API calls. Error handling and logging are paramount. The code handles missing fields, Kubernetes API errors, and formats the output specifically for the frontend’s needs, preventing the leakage of internal Kubernetes object structures. For production, this BFF would need proper authentication and authorization middleware to identify the user making the request.
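The service account the BFF runs under also needs explicit RBAC on the claim resources. A minimal sketch follows, assuming the BFF's ServiceAccount is named bff and lives in a platform namespace while claims live in default; all of these names are assumptions.
# Hypothetical RBAC for the BFF's service account. Namespace and
# ServiceAccount names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workspace-manager
  namespace: default
rules:
  - apiGroups: ["platform.acme.io"]
    resources: ["gpudevworkspaceclaims"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workspace-manager-bff
  namespace: default
subjects:
  - kind: ServiceAccount
    name: bff
    namespace: platform
roleRef:
  kind: Role
  name: workspace-manager
  apiGroup: rbac.authorization.k8s.io
Scoping this to a Role rather than a ClusterRole keeps the BFF confined to the single namespace where workspaces live.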
The Self-Service Portal: A React Frontend
The final piece is the user-facing portal. We used React with a simple state management solution to keep it lightweight. The UI has two main functions: a form to create new workspaces and a dashboard to view and delete existing ones.
// filename: frontend/src/WorkspaceDashboard.js
import React, { useState, useEffect, useCallback } from 'react';
import axios from 'axios';
// --- Components ---
const WorkspaceForm = ({ onWorkspaceCreated }) => {
const [owner, setOwner] = useState('');
const [instanceType, setInstanceType] = useState('g4dn.xlarge');
const [isSubmitting, setIsSubmitting] = useState(false);
const [error, setError] = useState(null);
const handleSubmit = async (e) => {
e.preventDefault();
setIsSubmitting(true);
setError(null);
try {
await axios.post('/api/workspaces', { owner, instanceType });
setOwner(''); // Reset form
onWorkspaceCreated();
} catch (err) {
setError(err.response?.data?.error || 'Failed to create workspace.');
} finally {
setIsSubmitting(false);
}
};
return (
<form onSubmit={handleSubmit} className="workspace-form">
<h2>Create New Workspace</h2>
{error && <div className="error-banner">{error}</div>}
<div>
<label>Owner Email:</label>
<input type="email" value={owner} onChange={(e) => setOwner(e.target.value)} required />
</div>
<div>
<label>Instance Type:</label>
<select value={instanceType} onChange={(e) => setInstanceType(e.target.value)}>
<option value="g4dn.xlarge">g4dn.xlarge (Standard)</option>
<option value="g4dn.2xlarge">g4dn.2xlarge (Large)</option>
<option value="p3.2xlarge">p3.2xlarge (High Perf)</option>
</select>
</div>
<button type="submit" disabled={isSubmitting}>
{isSubmitting ? 'Provisioning...' : 'Create Workspace'}
</button>
</form>
);
};
const WorkspaceList = ({ workspaces, onDelete }) => {
const handleDelete = async (name) => {
if (window.confirm(`Are you sure you want to delete workspace ${name}?`)) {
try {
await axios.delete(`/api/workspaces/${name}`);
onDelete();
} catch (err) {
alert('Failed to delete workspace.');
}
}
};
return (
<div className="workspace-list">
<h2>Active Workspaces</h2>
<table>
<thead>
<tr>
<th>Name</th>
<th>Owner</th>
<th>Status</th>
<th>Connection Info</th>
<th>Actions</th>
</tr>
</thead>
<tbody>
{workspaces.map(ws => (
<tr key={ws.name}>
<td>{ws.name}</td>
<td>{ws.owner}</td>
<td><span className={`status-pill ${ws.status.phase.toLowerCase()}`}>{ws.status.phase}</span></td>
<td>{ws.status.phase === 'Ready' ? <code>{ws.status.connection}</code> : '...'}</td>
<td>
<button onClick={() => handleDelete(ws.name)} className="delete-btn">Delete</button>
</td>
</tr>
))}
</tbody>
</table>
</div>
);
};
// --- Main App Component ---
export default function WorkspaceDashboard() {
const [workspaces, setWorkspaces] = useState([]);
const [isLoading, setIsLoading] = useState(true);
const fetchWorkspaces = useCallback(async () => {
try {
const { data } = await axios.get('/api/workspaces');
setWorkspaces(data);
} catch (error) {
console.error("Failed to fetch workspaces", error);
} finally {
setIsLoading(false);
}
}, []);
useEffect(() => {
fetchWorkspaces();
// A pitfall is hammering the API. In a production system, this should be
// a WebSocket or Server-Sent Events connection for real-time updates.
// For simplicity, we use polling here.
const intervalId = setInterval(fetchWorkspaces, 5000); // Poll every 5 seconds
return () => clearInterval(intervalId);
}, [fetchWorkspaces]);
if (isLoading) return <div>Loading workspaces...</div>;
return (
<div className="dashboard-container">
<h1>GPU Workspace Platform</h1>
<WorkspaceForm onWorkspaceCreated={fetchWorkspaces} />
<WorkspaceList workspaces={workspaces} onDelete={fetchWorkspaces} />
</div>
);
}
The frontend code is deliberately straightforward. It polls the BFF every few seconds to refresh the state of the workspaces. This is a pragmatic choice for a first version. The status is displayed with a colored “pill,” giving the user immediate visual feedback on the provisioning process (Pending, Provisioning, Ready). Once ready, the SSH connection command appears directly in the UI.
Architectural Flow
The entire system works in a clean, decoupled loop.
sequenceDiagram
    participant FE as Frontend (React)
    participant BFF
    participant K8s as Kubernetes API Server
    participant Crossplane
    participant Cloud as AWS API
    FE->>+BFF: POST /api/workspaces (owner, type)
    BFF->>+K8s: CREATE GpuDevWorkspaceClaim
    K8s-->>-BFF: Acknowledged
    BFF-->>-FE: 202 Accepted
    loop Status Polling
        FE->>+BFF: GET /api/workspaces
        BFF->>+K8s: GET GpuDevWorkspaceClaim list
        K8s-->>-BFF: List of claims with status
        BFF-->>-FE: Formatted list
    end
    K8s->>Crossplane: Notifies of new claim
    Crossplane->>Crossplane: Reconciliation loop starts
    Crossplane->>Cloud: CREATE EC2 Instance
    Crossplane->>Cloud: CREATE Security Group, etc.
    Cloud-->>Crossplane: Returns Instance ID, Public IP
    Crossplane->>+K8s: UPDATE workspace status
    K8s-->>-Crossplane: Status updated
This architecture successfully decouples the user experience from the infrastructure implementation. The CV team interacts with a simple web form, while the platform team manages the underlying complexity through declarative Kubernetes resources.
The current implementation, while functional, has clear limitations and areas for improvement. The frontend’s polling mechanism is inefficient and should be replaced with a WebSocket or SSE stream from the BFF for real-time updates. Cost management is a significant omission; there is no automated cleanup or leasing mechanism. A future iteration must include a ttl (time-to-live) parameter in the XRD and a dedicated controller to automatically de-provision expired workspaces, preventing runaway cloud costs. The userData script, while effective, increases instance boot time. A more mature system would use a tool like Packer to pre-bake AMIs for common configurations, reducing provisioning time from minutes to seconds. Finally, security could be hardened by integrating the BFF with an OIDC provider to map authenticated users to specific Kubernetes service accounts, enforcing finer-grained permissions on who can create or delete workspaces.
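As a sketch of that first item, the expiry could start as nothing more than an extra field in the XRD’s parameters schema, enforced later by a small cleanup controller or CronJob. The field name and bounds below are illustrative, not part of the current platform.
# Hypothetical addition to spec.parameters in the XRD schema.
ttlHours:
  type: integer
  description: "Hours after creation before the workspace is automatically de-provisioned."
  default: 8
  minimum: 1
  maximum: 72
A cleanup controller would then compare each claim’s creationTimestamp plus ttlHours against the current time and delete anything that has expired.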