Declarative Provisioning of MLflow Environments Using a Custom Crossplane API for MongoDB Backends


The operational friction in our machine learning lifecycle was becoming untenable. Our data science teams were either piling into a single, monolithic MLflow tracking server—creating a noisy and unmanageable environment—or they were forced to submit a multi-day ticket cycle to provision a new MongoDB instance and a VM for a dedicated server. This bottleneck was a direct impediment to rapid experimentation. The core problem wasn’t MLflow or MongoDB; it was the imperative, ticket-driven process for provisioning the coupled infrastructure required to support an ML experiment.

Our initial mandate was to build an internal, self-service platform for MLOps infrastructure. The goal was to empower data scientists to provision a complete, isolated MLflow environment (a tracking server, a dedicated MongoDB database for metadata, and an S3 bucket for artifacts) simply by describing their desired state in a single configuration file and committing it to Git. We settled on our existing Kubernetes ecosystem as the foundation, but a simple Helm chart was insufficient. Helm can deploy the MLflow application, but it has no awareness or control over external, managed resources like a MongoDB Atlas cluster or an AWS S3 bucket. We needed a true control plane, one that could extend the Kubernetes API itself to manage both on-cluster and off-cluster resources cohesively. This led us directly to Crossplane.

The architecture we designed treats an “MLflow Environment” as a new, native Kubernetes resource type. A platform user interacts with a high-level Custom Resource (CR) named MLflowEnvironment. Crossplane’s machinery, running in the cluster, observes this CR and orchestrates the complex, multi-provider provisioning workflow in the background: creating a MongoDB Atlas project and cluster, configuring a database user, provisioning an S3 bucket, and finally, deploying the MLflow server pod configured to use these newly created resources.

This is the breakdown of that build process, including the dead ends and the non-obvious configurations required to make it work in a production setting.

The Foundational Layer: Crossplane Providers and Configuration

Before defining our custom API, the control plane needs the capability to communicate with the target infrastructure providers. In this case, that’s MongoDB Atlas for the backend store and AWS for the artifact store. This is handled by installing Crossplane Providers and configuring them with the necessary credentials.
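
(This assumes the Crossplane core is already running in the cluster; for reference, it installs as a single Helm release into the crossplane-system namespace.)

# 00-install-crossplane.sh (prerequisite, shown for completeness)
# helm repo add crossplane-stable https://charts.crossplane.io/stable
# helm repo update
# helm install crossplane crossplane-stable/crossplane \
#   --namespace crossplane-system --create-namespace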

First, the Provider objects are installed in the cluster. These are straightforward Kubernetes resources that point to the provider packages.

# 01-providers.yaml
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/upbound/provider-aws:v0.42.1
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-mongodb-atlas
spec:
  package: xpkg.upbound.io/upbound/provider-mongodb-atlas:v0.2.1

The real challenge is providing credentials securely. A common mistake is to hardcode them directly in manifests committed to Git. The production-grade approach is to store provider credentials in a Kubernetes Secret and reference that Secret from a ProviderConfig object.

For AWS, the provider expects a secret containing the access key ID and secret access key.

# 02-aws-provider-secret.yaml
# This secret should be created via a secure mechanism, not stored in git.
# Example (the AWS INI format needs real newlines, so write a file rather than using --from-literal):
#   printf '[default]\naws_access_key_id=...\naws_secret_access_key=...\n' > aws-credentials.ini
#   kubectl create secret generic aws-creds -n crossplane-system --from-file=credentials=./aws-credentials.ini

The ProviderConfig for AWS then references this secret.

# 03-aws-provider-config.yaml
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  # Managed resources that don't set an explicit providerConfigRef fall back
  # to the ProviderConfig named "default".
  name: default
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-creds
      key: credentials

For MongoDB Atlas, the provider needs an organization ID, a public key, and a private key. The process is similar.

# 04-atlas-provider-secret.yaml
# kubectl create secret generic atlas-creds -n crossplane-system \
#   --from-literal=orgId='...' \
#   --from-literal=publicApiKey='...' \
#   --from-literal=privateApiKey='...'

The ProviderConfig for Atlas ties it all together.

# 05-atlas-provider-config.yaml
apiVersion: mongodb.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  # As with AWS, name this "default" so managed resources without an explicit
  # providerConfigRef can find it.
  name: default
spec:
  orgID: # ORG_ID - can be hardcoded if single org
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: atlas-creds
      key: privateApiKey
  publicAPIKeySecretRef:
    namespace: crossplane-system
    name: atlas-creds
    key: publicApiKey

With these configurations applied, Crossplane is now authorized to manage resources in both AWS and MongoDB Atlas. This setup is the bedrock of our declarative API.
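
A quick sanity check at this point saves debugging later: the providers should report as installed and healthy, and both ProviderConfigs should exist.

# Verify the providers and their configs before building on top of them.
# kubectl get providers.pkg.crossplane.io
# kubectl get providerconfigs.aws.upbound.io
# kubectl get providerconfigs.mongodb.upbound.io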

Defining the Abstraction: The CompositeResourceDefinition (XRD)

The next step is to define the public-facing API for our data scientists: the MLflowEnvironment resource itself. We use a CompositeResourceDefinition (XRD) to specify its schema, defining which parameters a user can and must provide. The XRD declares two kinds: a cluster-scoped composite (XMLflowEnvironment) owned by the platform team, and a namespaced claim (MLflowEnvironment) that data scientists create in their own namespaces. This is a critical design step; the fields exposed here form the contract between the platform team and its users. We decided to keep it minimal to start.

# 06-xrd-mlflow-environment.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xmlflowenvironments.mlops.example.com
spec:
  group: mlops.example.com
  names:
    kind: XMLflowEnvironment
    plural: xmlflowenvironments
  claimNames:
    kind: MLflowEnvironment
    plural: mlflowenvironments
  versions:
  - name: v1alpha1
    served: true
    referenceable: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              parameters:
                type: object
                properties:
                  # Unique identifier for the environment, used for naming resources.
                  # A validation pattern ensures it's DNS-friendly.
                  environmentID:
                    type: string
                    description: "A unique ID for the MLflow environment. Used to name resources."
                    pattern: "^[a-z0-9]([-a-z0-9]*[a-z0-9])?$"
                  # Region for deploying resources like S3 and potentially the Atlas cluster.
                  region:
                    type: string
                    description: "The cloud provider region for resource deployment."
                    default: "us-east-1"
                  # Deletion policy to protect against accidental data loss.
                  # A real-world project must handle this carefully.
                  deletionPolicy:
                    description: "Controls what happens to resources when the claim is deleted. 'Delete' or 'Orphan'."
                    type: string
                    enum:
                    - Delete
                    - Orphan
                    default: Orphan
                required:
                - environmentID
            required:
            - parameters

Key decisions in this XRD:

  1. environmentID: This is the primary user input. We enforce a DNS-friendly pattern because it will be used to construct names for other resources (S3 buckets, Atlas projects, etc.).
  2. region: This provides flexibility for geo-locating resources.
  3. deletionPolicy: This is a crucial safety mechanism. By defaulting to Orphan, we prevent a user from accidentally deleting their claim and wiping out the underlying database and artifacts. A production system might have more sophisticated logic here, perhaps based on environment type (e.g., dev can be deleted, prod must be retained).

Applying this XRD to the cluster generates two CRDs (the cluster-scoped XMLflowEnvironment composite and the namespaced MLflowEnvironment claim), making MLflowEnvironment a usable, albeit non-functional, resource type.
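
A quick way to confirm the new API is registered:

# The XRD should report ESTABLISHED and OFFERED, and the new kinds should be listed.
# kubectl get xrd xmlflowenvironments.mlops.example.com
# kubectl api-resources --api-group=mlops.example.com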

The Implementation Logic: The Composition

The Composition is the heart of the system. It maps the abstract API defined in the XRD to a concrete set of managed resources. This is where we define how an MLflowEnvironment is actually constructed from pieces of MongoDB Atlas, AWS, and Kubernetes resources.

A major implementation detail is how to pass information between the created resources. For instance, the MLflow Deployment needs the connection string for the MongoDB database, which is only known after the database and its user have been created. Crossplane solves this through a system of patching and transforms.

Here is the complete Composition, broken down into its constituent parts.

# 07-composition-mlflow-environment.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: mlflow-atlas-aws.mlops.example.com
  labels:
    provider: multi-cloud
spec:
  # This Composition is for the XRD we defined earlier.
  compositeTypeRef:
    apiVersion: mlops.example.com/v1alpha1
    kind: XMLflowEnvironment
  # This is the list of managed resources that will be created.
  resources:
    # 1. MongoDB Atlas Project
    - name: atlas-project
      base:
        apiVersion: project.atlas.mongodb.com/v1alpha1
        kind: Project
        spec:
          forProvider:
            orgID: # YOUR_ORG_ID
          deletionPolicy: Delete # Projects can be transient
      patches:
        - fromFieldPath: "spec.parameters.environmentID"
          toFieldPath: "spec.forProvider.name"
          transforms:
            - type: string
              string:
                fmt: "mlflow-%s"
        - fromFieldPath: "spec.reclaimPolicy"
          toFieldPath: "spec.reclaimPolicy"

    # 2. MongoDB Atlas Cluster (M10 tier for this example)
    - name: atlas-cluster
      base:
        apiVersion: cluster.atlas.mongodb.com/v1alpha1
        kind: Cluster
        spec:
          forProvider:
            # Minimal config for a basic dedicated cluster. Note that Atlas uses
            # its own region naming (US_EAST_1); the region parameter from the
            # claim is applied to the S3 bucket below.
            providerName: "AWS"
            providerInstanceSizeName: "M10"
            providerSettings:
              - regionName: "US_EAST_1"
          deletionPolicy: Delete
      patches:
        - fromFieldPath: "spec.parameters.environmentID"
          toFieldPath: "spec.forProvider.name"
          transforms:
            - type: string
              string:
                fmt: "mlflow-cluster-%s"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']" # Use the claim name for project reference
          toFieldPath: "spec.forProvider.projectName"
          transforms:
            - type: string
              string:
                fmt: "mlflow-%s"
        - fromFieldPath: "spec.reclaimPolicy"
          toFieldPath: "spec.reclaimPolicy"
      connectionDetails:
        - fromConnectionSecretKey: "srvAddress"
          name: "srvAddress"
        - fromConnectionSecretKey: "username"
          name: "username"
        - fromConnectionSecretKey: "password"
          name: "password"
        
    # 3. MongoDB Atlas Database User
    - name: atlas-db-user
      base:
        apiVersion: databaseuser.atlas.mongodb.com/v1alpha1
        kind: DatabaseUser
        spec:
          forProvider:
            authDatabaseName: "admin"
            roles:
              - roleName: "readWriteAnyDatabase"
                databaseName: "admin"
            scopes:
              - name: "mlflow-cluster-placeholder" # Will be patched
                type: "CLUSTER"
          deletionPolicy: Delete
          writeConnectionSecretToRef:
            namespace: crossplane-system
            name: "placeholder" # Will be patched
      patches:
        - fromFieldPath: "spec.parameters.environmentID"
          toFieldPath: "spec.forProvider.username"
          transforms:
            - type: string
              string:
                fmt: "mlflow-user-%s"
        - fromFieldPath: "spec.parameters.environmentID"
          toFieldPath: "spec.writeConnectionSecretToRef.name"
          transforms:
            - type: string
              string:
                fmt: "%s-db-connection"
        - fromFieldPath: "spec.parameters.environmentID"
          toFieldPath: "spec.forProvider.scopes[0].name"
          transforms:
            - type: string
              string:
                fmt: "mlflow-cluster-%s"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "spec.forProvider.projectName"
          transforms:
            - type: string
              string:
                fmt: "mlflow-%s"
        - fromFieldPath: "spec.reclaimPolicy"
          toFieldPath: "spec.reclaimPolicy"

    # 4. AWS S3 Bucket for MLflow Artifacts
    - name: s3-bucket
      base:
        apiVersion: s3.aws.upbound.io/v1beta1
        kind: Bucket
        spec:
          forProvider: {} # region is patched in below
          deletionPolicy: Delete
      patches:
        - fromFieldPath: "spec.parameters.environmentID"
          toFieldPath: "metadata.name"
          transforms:
            - type: string
              string:
                fmt: "mlflow-artifacts-%s"
        - fromFieldPath: "spec.parameters.region"
          toFieldPath: "spec.forProvider.region"
        - fromFieldPath: "spec.reclaimPolicy"
          toFieldPath: "spec.reclaimPolicy"

    # 5. Consolidated Kubernetes Secret for MLflow Deployment
    # This is a critical piece of glue code.
    - name: mlflow-config-secret
      base:
        apiVersion: v1
        kind: Secret
        metadata:
          namespace: default # Should be patched to the claim's namespace
        type: Opaque
      patches:
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "metadata.name"
          transforms:
            - type: string
              string:
                fmt: "%s-mlflow-config"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-namespace']"
          toFieldPath: "metadata.namespace"
        # Combine details from various resources to build the MongoDB URI.
        # NOTE: classic patch-and-transform cannot read a sibling resource's
        # connection secret directly; values like the cluster SRV address and the
        # user's password have to travel through the composite (status fields or
        # aggregated connection details) or be assembled by a Composition Function.
        # The two patches below show the intent.
        - fromConnectionSecretKey: "password"
          toFieldPath: "data.MONGO_PASSWORD"
          transforms:
            - type: string
              string:
                type: Convert
                convert: ToBase64
        - type: CombineFromComposite
          combine:
            variables:
              - fromFieldPath: "spec.parameters.environmentID" # username suffix
              - fromConnectionSecretKey: "password"            # DB user password
              - fromConnectionSecretKey: "srvAddress"          # cluster SRV host
            strategy: string
            string:
              fmt: "mongodb+srv://mlflow-user-%s:%s@%s/?retryWrites=true&w=majority"
          toFieldPath: data.MONGO_URI
          # Base64 encode the final string for the Secret
          transforms:
            - type: string
              string:
                type: Convert
                convert: ToBase64
        # Pass the S3 bucket name. The bucket name is deterministic
        # (mlflow-artifacts-<environmentID>), so build it from the composite
        # rather than trying to reference the Bucket resource directly.
        - fromFieldPath: "spec.parameters.environmentID"
          toFieldPath: data.ARTIFACT_ROOT
          policy:
            fromFieldPath: Required
          transforms:
            - type: string
              string:
                fmt: "s3://mlflow-artifacts-%s"
            - type: string
              string:
                type: Convert
                convert: ToBase64

    # 6. Kubernetes Deployment for MLflow Server
    - name: mlflow-server-deployment
      base:
        apiVersion: apps/v1
        kind: Deployment
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: mlflow-server # Patched with a unique ID
          template:
            metadata:
              labels:
                app: mlflow-server # Patched
            spec:
              containers:
              - name: mlflow-server
                image: "ghcr.io/mlflow/mlflow:v2.8.0"
                args: [
                  "server",
                  "--host", "0.0.0.0",
                  "--port", "5000"
                ]
                ports:
                - containerPort: 5000
                env:
                - name: MLFLOW_BACKEND_STORE_URI
                  valueFrom:
                    secretKeyRef:
                      name: "placeholder-secret" # Patched
                      key: MONGO_URI
                - name: MLFLOW_ARTIFACTS_DESTINATION
                  valueFrom:
                    secretKeyRef:
                      name: "placeholder-secret" # Patched
                      key: ARTIFACT_ROOT
                # AWS credentials for S3 access. This secret must exist in the
                # claim's namespace with these keys; in production, prefer IRSA
                # (IAM Roles for Service Accounts) over static keys.
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-creds
                      key: aws_access_key_id
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-creds
                      key: aws_secret_access_key
      patches:
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "metadata.name"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-namespace']"
          toFieldPath: "metadata.namespace"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "spec.selector.matchLabels.app"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "spec.template.metadata.labels.app"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "spec.template.spec.containers[0].env[0].valueFrom.secretKeyRef.name"
          transforms:
            - type: string
              string:
                fmt: "%s-mlflow-config"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "spec.template.spec.containers[0].env[1].valueFrom.secretKeyRef.name"
          transforms:
            - type: string
              string:
                fmt: "%s-mlflow-config"

    # 7. Kubernetes Service to expose the MLflow UI
    - name: mlflow-server-service
      base:
        apiVersion: v1
        kind: Service
        spec:
          type: ClusterIP
          ports:
          - port: 80
            targetPort: 5000
            protocol: TCP
          selector:
            app: mlflow-server # Patched
      patches:
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "metadata.name"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-namespace']"
          toFieldPath: "metadata.namespace"
        - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
          toFieldPath: "spec.selector.app"

The most complex part here is the mlflow-config-secret. It doesn’t map to a single cloud provider resource; it acts as an intermediary, assembling connection details from multiple sources (DatabaseUser, Cluster, Bucket) into a format the final Deployment can consume. The CombineFromComposite patch lets us template the full MongoDB connection string from several values, a common Crossplane pattern for bridging the gap between what a provider API offers and what an application expects. Two caveats apply in practice, though. First, classic patch-and-transform Compositions can only compose Crossplane managed resources, so plain Kubernetes objects like this Secret (and the Deployment and Service) are actually wrapped in provider-kubernetes Object resources; the manifests above show the inner objects for readability. Second, a composed resource cannot read a sibling's connection secret directly; those values are either aggregated into the claim's combined connection secret via connectionDetails, shuttled through the composite's status, or assembled by a Composition Function.
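
To make the wrapping concrete, here is a minimal sketch (assuming the provider-kubernetes package is installed with a ProviderConfig named default) of how the Secret from step 5 would appear as a composed Object; the existing patch paths simply gain a spec.forProvider.manifest prefix.

# Sketch only: the step-5 Secret wrapped for provider-kubernetes.
- name: mlflow-config-secret
  base:
    apiVersion: kubernetes.crossplane.io/v1alpha1
    kind: Object
    spec:
      providerConfigRef:
        name: default
      forProvider:
        manifest:
          apiVersion: v1
          kind: Secret
          metadata:
            namespace: default # patched to the claim's namespace
          type: Opaque
  patches:
    - fromFieldPath: "metadata.labels['crossplane.io/claim-name']"
      toFieldPath: "spec.forProvider.manifest.metadata.name"
      transforms:
        - type: string
          string:
            fmt: "%s-mlflow-config"
    - fromFieldPath: "metadata.labels['crossplane.io/claim-namespace']"
      toFieldPath: "spec.forProvider.manifest.metadata.namespace"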

The User Experience: Creating an MLflowEnvironment Claim

With the XRD and Composition in place, the platform is ready. A data scientist wanting a new environment now only needs to create one simple YAML file. In a GitOps flow, they would commit this file to a repository, and a tool like ArgoCD would apply it to the cluster.
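
For teams using ArgoCD, the sync side is a single Application watching the claims directory; this sketch uses an illustrative repository URL and path.

# Sketch of the GitOps sync side (repo URL and path are illustrative).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: mlflow-environments
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/mlflow-environments.git
    targetRevision: main
    path: claims/
  destination:
    server: https://kubernetes.default.svc
    namespace: data-science-projects
  syncPolicy:
    automated:
      prune: false # claims guard real infrastructure; avoid automatic pruning

The claim itself is all the data scientist ever touches: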

# 08-claim-mlflow-dev-team-a.yaml
apiVersion: mlops.example.com/v1alpha1
kind: MLflowEnvironment
metadata:
  name: dev-team-a-project
  namespace: data-science-projects
spec:
  parameters:
    environmentID: "dev-team-a-project"
    region: "us-west-2"
    deletionPolicy: Orphan # Explicitly set for safety

Once this manifest is applied, the following happens automatically:

  1. Crossplane sees the new MLflowEnvironment claim and creates the corresponding cluster-scoped XMLflowEnvironment composite resource.
  2. It selects the Composition whose compositeTypeRef matches that composite.
  3. It creates all resources defined in the Composition, continuously reconciling patches and dependencies until each one is ready.
  4. Within minutes, a new Atlas project, cluster, and user exist; a new S3 bucket is ready; and a Deployment and Service for MLflow are running in the data-science-projects namespace, fully configured and ready for use.

The user can check the status with kubectl get mlflowenvironment dev-team-a-project -n data-science-projects -o yaml. They will see conditions indicating whether the underlying resources are synced and ready.
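
For the platform team, the composed resources can be inspected directly; the trace command assumes a Crossplane CLI of v1.14 or newer.

# Drill into what the claim produced (illustrative commands).
# kubectl get managed                     # every Crossplane managed resource in the cluster
# kubectl get secret dev-team-a-project-mlflow-config -n data-science-projects
# crossplane beta trace mlflowenvironment dev-team-a-project -n data-science-projects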

graph TD
    subgraph Git Repository
        A[User commits Claim YAML]
    end

    subgraph Kubernetes Cluster
        B[GitOps Controller applies Claim] --> C{MLflowEnvironment Claim};
        C --> D[Crossplane Controller];
        D -- Selects --> E[Composition];
        E -- Orchestrates --> F[Atlas Project];
        E -- Orchestrates --> G[Atlas Cluster];
        E -- Orchestrates --> H[Atlas DB User];
        E -- Orchestrates --> I[AWS S3 Bucket];
        
        subgraph Managed Resources
            F & G & H & I
        end

        subgraph Application Resources
            K[Kubernetes Secret]
            L[Kubernetes Deployment]
            M[Kubernetes Service]
        end
        
        H -- Connection Details --> K;
        G -- Connection Details --> K;
        I -- Bucket Name --> K;
        K -- Mounts Env Vars --> L;
        M -- Selects Pods from --> L;
    end
    
    subgraph Cloud Providers
        N[MongoDB Atlas]
        O[AWS S3]
    end
    
    F & G & H --> N
    I --> O

Lingering Issues and Future Iterations

This solution provides immense value by automating a complex workflow, but it’s not without its own set of challenges. The deletionPolicy is a double-edged sword; setting it to Orphan prevents data loss but can lead to orphaned, costly resources if not managed properly. A more robust solution might involve a custom cleanup controller or manual review process for orphaned infrastructure.

Furthermore, updates to the Composition itself are a significant concern. If we decide to change the MongoDB cluster size or add a new sidecar to the MLflow deployment, how do we roll that out to dozens of existing environments? Crossplane will attempt to reconcile the changes, but this requires careful testing to avoid breaking existing setups. This is the inherent complexity of managing a stateful control plane.

Finally, while this automates provisioning, it doesn’t solve for day-2 operations like monitoring, logging, and cost attribution. The next iteration of this platform would need to inject standardized monitoring agents, configure log shipping, and ensure all created resources are tagged with the owning team’s cost center for FinOps tracking. The current design provides the hooks for this (e.g., patching labels and annotations), but the implementation is a non-trivial extension.
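
As a concrete example of those hooks, a future iteration could thread a cost-center parameter through the XRD and patch it onto every cloud resource as a tag. A sketch for the S3 bucket, where costCenter is a hypothetical new parameter not present in the XRD above:

# Sketch: propagating a hypothetical spec.parameters.costCenter onto the bucket as a tag.
- fromFieldPath: "spec.parameters.costCenter"
  toFieldPath: "spec.forProvider.tags[cost-center]"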

