The operational friction in our machine learning lifecycle was becoming untenable. Our data science teams were either piling into a single, monolithic MLflow tracking server—creating a noisy and unmanageable environment—or they were forced to submit a multi-day ticket cycle to provision a new MongoDB instance and a VM for a dedicated server. This bottleneck was a direct impediment to rapid experimentation. The core problem wasn’t MLflow or MongoDB; it was the imperative, ticket-driven process for provisioning the coupled infrastructure required to support an ML experiment.
Our initial mandate was to build an internal, self-service platform for MLOps infrastructure. The goal was to empower data scientists to provision a complete, isolated MLflow environment (a tracking server, a dedicated MongoDB database for metadata, and an S3 bucket for artifacts) simply by describing their desired state in a single configuration file and committing it to Git. We settled on our existing Kubernetes ecosystem as the foundation, but a simple Helm chart was insufficient. Helm can deploy the MLflow application, but it has no awareness or control over external, managed resources like a MongoDB Atlas cluster or an AWS S3 bucket. We needed a true control plane, one that could extend the Kubernetes API itself to manage both on-cluster and off-cluster resources cohesively. This led us directly to Crossplane.
The architecture we designed treats an “MLflow Environment” as a new, native Kubernetes resource type. A platform user interacts with a high-level Custom Resource (CR) named `MLflowEnvironment`. Crossplane’s machinery, running in the cluster, observes this CR and orchestrates the complex, multi-provider provisioning workflow in the background: creating a MongoDB Atlas project and cluster, configuring a database user, provisioning an S3 bucket, and finally, deploying the MLflow server pod configured to use these newly created resources.
This is the breakdown of that build process, including the dead ends and the non-obvious configurations required to make it work in a production setting.
The Foundational Layer: Crossplane Providers and Configuration
Before defining our custom API, the control plane needs the capability to communicate with the target infrastructure providers. In this case, that’s MongoDB Atlas for the backend store and AWS for the artifact store. This is handled by installing Crossplane Providers and configuring them with the necessary credentials.
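This assumes Crossplane itself is already running in the cluster; if it isn't, the standard Helm installation is all this setup requires:

```shell
# Install Crossplane into its own namespace (pin the chart version in a
# real GitOps setup).
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
helm install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system --create-namespace
```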
First, the `Provider` objects are installed in the cluster. These are straightforward Kubernetes resources that point to the provider packages.
```yaml
# 01-providers.yaml
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/upbound/provider-aws:v0.42.1
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-mongodb-atlas
spec:
  package: xpkg.upbound.io/upbound/provider-mongodb-atlas:v0.2.1
```
The real challenge is securely providing credentials. A common mistake is to hardcode secrets in these manifests. The production-grade approach is to store provider credentials in a Kubernetes `Secret` and reference them from a `ProviderConfig` object.
For AWS, the provider expects a secret containing the access key ID and secret access key.
```shell
# 02-aws-provider-secret.sh
# This secret should be created via a secure mechanism, not stored in git.
# Use a file (or printf) rather than --from-literal so the newlines in the
# INI-style credentials block are real newlines:
printf '[default]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
  "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" > aws-credentials.ini
kubectl create secret generic aws-creds -n crossplane-system \
  --from-file=credentials=aws-credentials.ini
```
The `ProviderConfig` for AWS then references this secret.
```yaml
# 03-aws-provider-config.yaml
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default-aws-provider
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-creds
      key: credentials
```
For MongoDB Atlas, the provider needs an organization ID, a public key, and a private key. The process is similar.
```shell
# 04-atlas-provider-secret.sh
kubectl create secret generic atlas-creds -n crossplane-system \
  --from-literal=orgId='...' \
  --from-literal=publicApiKey='...' \
  --from-literal=privateApiKey='...'
```
The `ProviderConfig` for Atlas ties it all together.
```yaml
# 05-atlas-provider-config.yaml
apiVersion: mongodb.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default-atlas-provider
spec:
  orgID: "" # ORG_ID - can be hardcoded if single org
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: atlas-creds
      key: privateApiKey
  publicAPIKeySecretRef:
    namespace: crossplane-system
    name: atlas-creds
    key: publicApiKey
```
With these configurations applied, Crossplane is now authorized to manage resources in both AWS and MongoDB Atlas. This setup is the bedrock of our declarative API.
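A quick sanity check is worthwhile before building on top of this; the output below is illustrative, and the exact columns vary by Crossplane version:

```shell
# Both providers should report INSTALLED and HEALTHY before we proceed.
kubectl get providers.pkg.crossplane.io
# NAME                     INSTALLED   HEALTHY   PACKAGE                                                 AGE
# provider-aws             True        True      xpkg.upbound.io/upbound/provider-aws:v0.42.1            3m
# provider-mongodb-atlas   True        True      xpkg.upbound.io/upbound/provider-mongodb-atlas:v0.2.1   3m
```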
Defining the Abstraction: The CompositeResourceDefinition (XRD)
The next step is to define the public-facing API for our data scientists: the `MLflowEnvironment` resource itself. We use a `CompositeResourceDefinition` (XRD) to specify its schema, defining what parameters a user can and must provide. This is a critical design step; the fields exposed here form the contract between the platform team and its users. We decided to keep it minimal to start.
```yaml
# 06-xrd-mlflow-environment.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: mlflowenvironments.mlops.example.com
spec:
  group: mlops.example.com
  names:
    kind: MLflowEnvironment
    plural: mlflowenvironments
  claimNames:
    kind: MLflowEnvironmentClaim
    plural: mlflowenvironmentclaims
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                parameters:
                  type: object
                  properties:
                    # Unique identifier for the environment, used for naming
                    # resources. A validation pattern ensures it's DNS-friendly.
                    environmentID:
                      type: string
                      description: "A unique ID for the MLflow environment. Used to name resources."
                      pattern: "^[a-z0-9]([-a-z0-9]*[a-z0-9])?$"
                    # Region for deploying resources like S3 and potentially
                    # the Atlas cluster.
                    region:
                      type: string
                      description: "The cloud provider region for resource deployment."
                      default: "us-east-1"
                    # Deletion policy to protect against accidental data loss.
                    # A real-world project must handle this carefully.
                    deletionPolicy:
                      description: "Controls what happens to resources when the claim is deleted. 'Delete' or 'Orphan'."
                      type: string
                      enum:
                        - Delete
                        - Orphan
                      default: Orphan
                  required:
                    - environmentID
              required:
                - parameters
            # Fields the Composition writes back (via ToCompositeFieldPath
            # patches) so composed resources can consume generated values.
            status:
              type: object
              properties:
                bucketName:
                  type: string
                dbSrvAddress:
                  type: string
                dbPassword:
                  type: string
```
Key decisions in this XRD:

- `environmentID`: This is the primary user input. We enforce a DNS-friendly pattern because it will be used to construct names for other resources (S3 buckets, Atlas projects, etc.).
- `region`: This provides flexibility for geo-locating resources.
- `deletionPolicy`: This is a crucial safety mechanism. By defaulting to `Orphan`, we prevent a user from accidentally deleting their claim and wiping out the underlying database and artifacts. A production system might have more sophisticated logic here, perhaps based on environment type (e.g., `dev` can be deleted, `prod` must be retained).
Applying this XRD to the Kubernetes cluster creates a new CRD, making `MLflowEnvironment` a usable, albeit non-functional, resource type.
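Applying and verifying is a one-liner each; checking `kubectl api-resources` is a useful habit, since it confirms the claim type was generated alongside the composite:

```shell
kubectl apply -f 06-xrd-mlflow-environment.yaml
# Wait for the XRD to become Established, then confirm both generated types.
kubectl wait --for=condition=Established \
  compositeresourcedefinition/mlflowenvironments.mlops.example.com
kubectl api-resources --api-group=mlops.example.com
```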
The Implementation Logic: The Composition
The `Composition` is the heart of the system. It maps the abstract API defined in the XRD to a concrete set of managed resources. This is where we define how an `MLflowEnvironment` is actually constructed from pieces of MongoDB Atlas, AWS, and Kubernetes resources.
A major implementation detail is how to pass information between the created resources. For instance, the MLflow `Deployment` needs the connection string for the MongoDB database, which is only known after the database and its user have been created. Crossplane solves this through a system of patching and transforms.
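Before the full listing, here is the anatomy of a single patch entry, since the pattern repeats dozens of times below (a minimal sketch; the field paths are from our own XRD):

```yaml
# The default patch type, FromCompositeFieldPath: copy a value from the
# composite resource into the composed resource, transforming it en route.
patches:
  - fromFieldPath: "spec.parameters.environmentID"
    toFieldPath: "spec.forProvider.name"
    transforms:
      - type: string
        string:
          fmt: "mlflow-%s" # "team-a" becomes "mlflow-team-a"
```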
Here is the complete `Composition`, broken down into its constituent parts.
```yaml
# 07-composition-mlflow-environment.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: mlflow-atlas-aws.mlops.example.com
  labels:
    provider: multi-cloud
spec:
  # This Composition implements the XRD we defined earlier.
  compositeTypeRef:
    apiVersion: mlops.example.com/v1alpha1
    kind: MLflowEnvironment
  # This is the list of managed resources that will be created.
  resources:
    # 1. MongoDB Atlas Project
    - name: atlas-project
      base:
        apiVersion: project.atlas.mongodb.com/v1alpha1
        kind: Project
        spec:
          forProvider:
            orgID: "" # YOUR_ORG_ID
          deletionPolicy: Delete # Projects can be transient
      patches:
        - fromFieldPath: "spec.parameters.environmentID"
          toFieldPath: "spec.forProvider.name"
          transforms:
            - type: string
              string:
                fmt: "mlflow-%s"
        - fromFieldPath: "spec.parameters.deletionPolicy"
          toFieldPath: "spec.deletionPolicy"

    # 2. MongoDB Atlas Cluster (M10 tier for this example)
    - name: atlas-cluster
      base:
        apiVersion: cluster.atlas.mongodb.com/v1alpha1
        kind: Cluster
        spec:
          forProvider:
            # Minimal config for a basic cluster
            providerName: "TENANT"
            backingProviderName: "AWS"
            providerInstanceSizeName: "M10"
            providerSettings:
              - regionName: "US_EAST_1"
          deletionPolicy: Delete
      patches:
        - fromFieldPath: "spec.parameters.environmentID"
          toFieldPath: "spec.forProvider.name"
          transforms:
            - type: string
              string:
                fmt: "mlflow-cluster-%s"
        # Use the claim name to reference the project created above.
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "spec.forProvider.projectName"
          transforms:
            - type: string
              string:
                fmt: "mlflow-%s"
        - fromFieldPath: "spec.parameters.deletionPolicy"
          toFieldPath: "spec.deletionPolicy"
      connectionDetails:
        - name: "srvAddress"
          fromConnectionSecretKey: "srvAddress"
        - name: "username"
          fromConnectionSecretKey: "username"
        - name: "password"
          fromConnectionSecretKey: "password"

    # 3. MongoDB Atlas Database User
    - name: atlas-db-user
      base:
        apiVersion: databaseuser.atlas.mongodb.com/v1alpha1
        kind: DatabaseUser
        spec:
          forProvider:
            authDatabaseName: "admin"
            roles:
              - roleName: "readWriteAnyDatabase"
                databaseName: "admin"
            scopes:
              - name: "mlflow-cluster-placeholder" # Will be patched
                type: "CLUSTER"
          deletionPolicy: Delete
          writeConnectionSecretToRef:
            namespace: crossplane-system
            name: "placeholder" # Will be patched
      patches:
        - fromFieldPath: "spec.parameters.environmentID"
          toFieldPath: "spec.forProvider.username"
          transforms:
            - type: string
              string:
                fmt: "mlflow-user-%s"
        - fromFieldPath: "spec.parameters.environmentID"
          toFieldPath: "spec.writeConnectionSecretToRef.name"
          transforms:
            - type: string
              string:
                fmt: "%s-db-connection"
        - fromFieldPath: "spec.parameters.environmentID"
          toFieldPath: "spec.forProvider.scopes[0].name"
          transforms:
            - type: string
              string:
                fmt: "mlflow-cluster-%s"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "spec.forProvider.projectName"
          transforms:
            - type: string
              string:
                fmt: "mlflow-%s"
        - fromFieldPath: "spec.parameters.deletionPolicy"
          toFieldPath: "spec.deletionPolicy"

    # 4. AWS S3 Bucket for MLflow Artifacts
    - name: s3-bucket
      base:
        apiVersion: s3.aws.upbound.io/v1beta1
        kind: Bucket
        spec:
          forProvider: {} # Region will be patched
          deletionPolicy: Delete
      patches:
        # The external-name annotation controls the actual AWS bucket name.
        - fromFieldPath: "spec.parameters.environmentID"
          toFieldPath: "metadata.annotations[crossplane.io/external-name]"
          transforms:
            - type: string
              string:
                fmt: "mlflow-artifacts-%s"
        - fromFieldPath: "spec.parameters.region"
          toFieldPath: "spec.forProvider.region"
        - fromFieldPath: "spec.parameters.deletionPolicy"
          toFieldPath: "spec.deletionPolicy"
        # Write the generated bucket name back to the composite so the
        # config Secret below can consume it. status.bucketName is declared
        # in the XRD's status schema for exactly this purpose.
        - type: ToCompositeFieldPath
          fromFieldPath: "metadata.annotations[crossplane.io/external-name]"
          toFieldPath: "status.bucketName"

    # 5. Consolidated Kubernetes Secret for the MLflow Deployment.
    # This is a critical piece of glue code. NOTE: Crossplane composes
    # managed resources, so plain Kubernetes objects like this Secret (and
    # the Deployment/Service below) are wrapped in provider-kubernetes
    # Object resources in our actual setup; they are shown unwrapped here
    # for readability.
    - name: mlflow-config-secret
      base:
        apiVersion: v1
        kind: Secret
        metadata:
          namespace: default # Patched to the claim's namespace
        type: Opaque
      patches:
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "metadata.name"
          transforms:
            - type: string
              string:
                fmt: "%s-mlflow-config"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-namespace']"
          toFieldPath: "metadata.namespace"
        # Surface the raw password as its own key. Writing to stringData
        # lets the API server handle base64 encoding for us.
        - fromFieldPath: "status.dbPassword"
          toFieldPath: "stringData.MONGO_PASSWORD"
        # Combine details from multiple resources to build the MongoDB URI.
        # NOTE: combine variables only support fromFieldPath, so the password
        # and SRV address must first be surfaced on the composite's status
        # (plumbing not shown; newer Crossplane releases would use a
        # Composition Function for this).
        - type: CombineFromComposite
          combine:
            variables:
              - fromFieldPath: "spec.parameters.environmentID"
              - fromFieldPath: "status.dbPassword"
              - fromFieldPath: "status.dbSrvAddress"
            strategy: string
            string:
              fmt: "mongodb+srv://mlflow-user-%s:%s@%s/?retryWrites=true&w=majority"
          toFieldPath: "stringData.MONGO_URI"
        # Pass the S3 bucket name written back by the s3-bucket resource.
        - fromFieldPath: "status.bucketName"
          toFieldPath: "stringData.ARTIFACT_ROOT"
          policy:
            fromFieldPath: Required # This is a hard dependency
          transforms:
            - type: string
              string:
                fmt: "s3://%s"

    # 6. Kubernetes Deployment for the MLflow Server
    - name: mlflow-server-deployment
      base:
        apiVersion: apps/v1
        kind: Deployment
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: mlflow-server # Patched with a unique ID
          template:
            metadata:
              labels:
                app: mlflow-server # Patched
            spec:
              containers:
                - name: mlflow-server
                  image: "ghcr.io/mlflow/mlflow:v2.8.0"
                  args: ["server", "--host", "0.0.0.0", "--port", "5000"]
                  ports:
                    - containerPort: 5000
                  env:
                    - name: MLFLOW_BACKEND_STORE_URI
                      valueFrom:
                        secretKeyRef:
                          name: "placeholder-secret" # Patched
                          key: MONGO_URI
                    - name: MLFLOW_ARTIFACTS_DESTINATION
                      valueFrom:
                        secretKeyRef:
                          name: "placeholder-secret" # Patched
                          key: ARTIFACT_ROOT
                    # AWS credentials for S3 access. Assumes a copy of the
                    # aws-creds secret exists in the claim's namespace.
                    - name: AWS_ACCESS_KEY_ID
                      valueFrom:
                        secretKeyRef:
                          name: aws-creds
                          key: aws_access_key_id
                    - name: AWS_SECRET_ACCESS_KEY
                      valueFrom:
                        secretKeyRef:
                          name: aws-creds
                          key: aws_secret_access_key
      patches:
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "metadata.name"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-namespace']"
          toFieldPath: "metadata.namespace"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "spec.selector.matchLabels.app"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "spec.template.metadata.labels.app"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "spec.template.spec.containers[0].env[0].valueFrom.secretKeyRef.name"
          transforms:
            - type: string
              string:
                fmt: "%s-mlflow-config"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "spec.template.spec.containers[0].env[1].valueFrom.secretKeyRef.name"
          transforms:
            - type: string
              string:
                fmt: "%s-mlflow-config"

    # 7. Kubernetes Service to expose the MLflow UI
    - name: mlflow-server-service
      base:
        apiVersion: v1
        kind: Service
        spec:
          type: ClusterIP
          ports:
            - port: 80
              targetPort: 5000
              protocol: TCP
          selector:
            app: mlflow-server # Patched
      patches:
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "metadata.name"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-namespace']"
          toFieldPath: "metadata.namespace"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "spec.selector.app"
```
The most complex part here is the `mlflow-config-secret`. It doesn’t map to a single cloud provider resource. Instead, it acts as an intermediary, assembling connection details from multiple sources (`DatabaseUser`, `Cluster`, `Bucket`) into a format that the final `Deployment` can consume. The `CombineFromComposite` transform is particularly powerful, allowing us to construct the full MongoDB connection string by templating values from different places. Because combine variables can only read fields on the composite, those values must first be surfaced on its status, which is the plumbing the comments above call out. This is a common pattern in Crossplane for bridging the gap between what a provider API offers and what an application expects.
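Once an environment is up, the assembled values can be inspected directly, which is invaluable when debugging patch plumbing (the names assume the claim created in the next section):

```shell
# Decode the assembled MongoDB URI from the generated config Secret.
kubectl get secret dev-team-a-project-mlflow-config -n data-science-projects \
  -o jsonpath='{.data.MONGO_URI}' | base64 -d
```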
The User Experience: Creating an MLflowEnvironmentClaim
With the XRD and `Composition` in place, the platform is ready. A data scientist wanting a new environment now only needs to create one simple YAML file: a namespaced claim. In a GitOps flow, they commit this file to a repository, and a tool like ArgoCD applies it to the cluster.
```yaml
# 08-claim-mlflow-dev-team-a.yaml
apiVersion: mlops.example.com/v1alpha1
kind: MLflowEnvironmentClaim # The namespaced claim kind from the XRD
metadata:
  name: dev-team-a-project
  namespace: data-science-projects
spec:
  parameters:
    environmentID: "dev-team-a-project"
    region: "us-west-2"
    deletionPolicy: Orphan # Explicitly set for safety
```
Once this manifest is applied, the following happens automatically:

- Crossplane sees the new claim and creates the corresponding cluster-scoped `MLflowEnvironment` composite resource.
- It finds the `Composition` that matches the composite’s `compositeTypeRef`.
- It begins creating all resources defined in the `Composition`, resolving patches and dependencies as it reconciles.
- Within minutes, a new Atlas project, cluster, and user exist; a new S3 bucket is ready; and a `Deployment` and `Service` for MLflow are running in the `data-science-projects` namespace, fully configured and ready for use.
The user can check the status with `kubectl get mlflowenvironmentclaim dev-team-a-project -n data-science-projects -o yaml`. They will see conditions indicating whether the underlying resources are synced and ready.
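The short form of that check, plus the port-forward to reach the UI, looks like this (output is illustrative):

```shell
kubectl get mlflowenvironmentclaim dev-team-a-project -n data-science-projects
# NAME                 SYNCED   READY   CONNECTION-SECRET   AGE
# dev-team-a-project   True     True                        12m

# The Service name matches the claim name; expose the MLflow UI locally.
kubectl port-forward svc/dev-team-a-project -n data-science-projects 5000:80
# MLflow UI is now available at http://localhost:5000
```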
```mermaid
graph TD
    subgraph Git Repository
        A[User commits Claim YAML]
    end
    subgraph Kubernetes Cluster
        B[GitOps Controller applies Claim] --> C{MLflowEnvironment Claim}
        C --> D[Crossplane Controller]
        D -- Selects --> E[Composition]
        subgraph Managed Resources
            F[Atlas Project]
            G[Atlas Cluster]
            H[Atlas DB User]
            I[AWS S3 Bucket]
        end
        E -- Orchestrates --> F
        E -- Orchestrates --> G
        E -- Orchestrates --> H
        E -- Orchestrates --> I
        subgraph Application Resources
            K[Kubernetes Secret]
            L[Kubernetes Deployment]
            M[Kubernetes Service]
        end
        H -- Connection Details --> K
        G -- Connection Details --> K
        I -- Bucket Name --> K
        K -- Mounts Env Vars --> L
        M -- Selects Pods from --> L
    end
    subgraph Cloud Providers
        N[MongoDB Atlas]
        O[AWS S3]
    end
    F & G & H --> N
    I --> O
```
Lingering Issues and Future Iterations
This solution provides immense value by automating a complex workflow, but it’s not without its own set of challenges. The `deletionPolicy` is a double-edged sword; setting it to `Orphan` prevents data loss but can lead to orphaned, costly resources if not managed properly. A more robust solution might involve a custom cleanup controller or a manual review process for orphaned infrastructure.
Furthermore, updates to the `Composition` itself are a significant concern. If we decide to change the MongoDB cluster size or add a new sidecar to the MLflow deployment, how do we roll that out to dozens of existing environments? Crossplane will attempt to reconcile the changes, but this requires careful testing to avoid breaking existing setups. This is the inherent complexity of managing a stateful control plane.
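Crossplane’s composition revisions give us one control point for this: individual claims can opt out of automatic updates and pin to a specific revision, letting us stage a rollout environment by environment (a sketch; the revision name is illustrative):

```yaml
# On a claim: hold this environment on a known-good Composition revision
# until the new revision has been validated elsewhere.
spec:
  compositionUpdatePolicy: Manual
  compositionRevisionRef:
    name: mlflow-atlas-aws.mlops.example.com-5c3e6ad
```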
Finally, while this automates provisioning, it doesn’t solve for day-2 operations like monitoring, logging, and cost attribution. The next iteration of this platform would need to inject standardized monitoring agents, configure log shipping, and ensure all created resources are tagged with the owning team’s cost center for FinOps tracking. The current design provides the hooks for this (e.g., patching labels and annotations), but the implementation is a non-trivial extension.
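As a concrete example of those hooks, cost attribution could start with a single extra patch on the `s3-bucket` entry, assuming a hypothetical `costCenter` parameter is added to the XRD:

```yaml
# Hypothetical addition to the s3-bucket resource's patches: propagate a
# cost-center parameter from the claim into AWS tags for FinOps reporting.
- fromFieldPath: "spec.parameters.costCenter"
  toFieldPath: "spec.forProvider.tags['cost-center']"
```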