Managing Real-Time Feature Store Infrastructure Declaratively with Crossplane, GCP Pub/Sub, and Trino


The operational burden of managing infrastructure for a real-time feature store is non-trivial. Each new feature set requires a message queue topic for ingestion, a storage bucket for historical data, specific IAM policies for access control, and configurations in the query engine. This process, when performed manually or with imperative scripts, is error-prone, slow, and creates a significant bottleneck for data science teams. The core problem is not just provisioning but providing a scalable, self-service platform for creating and managing the lifecycle of these feature sets.

The conventional solution is pipeline-driven Infrastructure as Code with a tool like Terraform: a data scientist submits a request, and a DevOps engineer writes or modifies HCL files, runs terraform plan, and then terraform apply.

Solution A: The Pipeline-Driven IaC Approach (e.g., Terraform)

  • Pros:

    • Maturity: Terraform is a well-established, industry-standard tool for Infrastructure as Code.
    • Ecosystem: A vast provider ecosystem exists, including a well-maintained Google Cloud Platform provider.
    • Predictability: The plan command provides a clear preview of changes before they are applied.
  • Cons:

    • State Drift: The Terraform state file is the source of truth, but manual changes made in the cloud console can cause it to drift from reality. Reconciling this drift is a manual and often complex task.
    • Execution Model: Runs are push-based and require a CI/CD pipeline or manual intervention. Terraform does not continuously enforce the desired state: if a resource is accidentally deleted, it stays deleted until the next run.
    • Operational Silo: It reinforces the separation between the infrastructure team (writing HCL) and the application/data team (consuming the infrastructure). It’s not a true self-service model. The “API” is a pull request, not a real-time, programmable interface.

In a real-world project, this leads to a “ticket-ops” culture where data scientists wait for infrastructure, stifling experimentation. The goal is to create a platform where consumers can provision the composite infrastructure they need without understanding the low-level details of each component.

This brings us to a declarative, Kubernetes-native approach.

Solution B: The Declarative, Kubernetes-native Approach (Crossplane)

  • Pros:

    • Continuous Reconciliation: Crossplane installs controllers into a Kubernetes cluster that constantly watch both the desired state (defined in YAML manifests) and the actual state in the cloud provider. Any deviation is automatically corrected. This eliminates state drift.
    • Unified API: It extends the Kubernetes API itself. Creating a Pub/Sub topic becomes as simple as kubectl apply -f topic.yaml. This turns your Kubernetes cluster into a universal control plane for all infrastructure, not just containers.
    • Abstraction and Self-Service: Crossplane’s Composition model allows platform engineers to define high-level, abstract resources (e.g., a FeatureSet) that encapsulate all the necessary underlying components (Topic, Bucket, IAM Bindings). Consumers interact only with this simple abstraction.
  • Cons:

    • Learning Curve: The concepts of Providers, CompositeResourceDefinitions (XRDs), and Compositions require an initial investment to understand. Debugging reconciliation loops can be more complex than debugging a failed script.
    • Maturity: While rapidly maturing, the Crossplane ecosystem and its providers are younger than Terraform’s. Some edge-case resource attributes might have less coverage.
    • Dependency on Kubernetes: This approach inherently ties your infrastructure management to a Kubernetes cluster, which itself is a complex system to manage.

The final decision was to adopt Solution B. The long-term value of a self-healing, continuously reconciled, and API-driven platform outweighs the initial complexity. It aligns with a broader cloud-native strategy and empowers data teams to operate with greater autonomy, directly contributing to a faster iteration cycle.

Core Implementation Overview

The architecture consists of a Kubernetes management cluster running Crossplane, which provisions and manages Google Cloud resources. A lightweight frontend interacts with the Kubernetes API to allow users to declaratively request new FeatureSet resources. Trino, running in a separate Kubernetes cluster, is configured to query data from the provisioned resources.

graph TD
    subgraph User Interaction
        Frontend -- "1. Defines 'FeatureSet'" --> K8s_API_Server
    end

    subgraph Management Cluster
        K8s_API_Server -- "2. Creates FeatureSet custom resource" --> Crossplane_Controller
        Crossplane_Controller -- "3. Reads Composition" --> Crossplane_Controller
        Crossplane_Controller -- "4. Provisions resources via GCP API" --> GCP
    end

    subgraph GCP
        GCP_PubSub(Pub/Sub Topic)
        GCP_GCS(GCS Bucket)
        GCP_IAM(IAM Bindings)
        GCP --> GCP_PubSub
        GCP --> GCP_GCS
        GCP --> GCP_IAM
    end

    subgraph Data Flow
        Data_Source(External Data Source) -- "5. Pushes feature events" --> GCP_PubSub
        Data_Sink(Cloud Function/Dataflow) -- "6. Archives events" --> GCP_GCS
        Trino_Cluster -- "7. Queries historical data" --> GCP_GCS
        ML_Model(ML Model/Analytics UI) -- "8. Consumes features" --> Trino_Cluster
    end

    K8s_API_Server -- "Notifies" --> Frontend

1. Setting Up Crossplane and the GCP Provider

First, the Crossplane core controllers must be installed on the management cluster via Helm; the GCP provider is then added as a Crossplane package.
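
A minimal sketch of the core install, using the official Crossplane Helm chart (pin a chart version in practice):

# Install the Crossplane core controllers into their own namespace
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
helm install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system \
  --create-namespace

With the core controllers running, the provider package is declared in a manifest: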

# provider-gcp.yaml
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-gcp
spec:
  package: xpkg.upbound.io/upbound/provider-gcp:v0.39.0

Apply this with kubectl apply -f provider-gcp.yaml. Next, we need to configure credentials. A common mistake is to use static keys. In a production environment, Workload Identity is the standard.
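
On GKE, that identity binding is configured outside Crossplane. A rough sketch with hypothetical names (the provider pod's Kubernetes service account is typically pinned via a DeploymentRuntimeConfig or ControllerConfig):

# Hypothetical names: substitute your project, GCP service account, and the
# provider's Kubernetes service account/namespace.
gcloud iam service-accounts add-iam-policy-binding \
  crossplane-gcp@your-gcp-project-id.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:your-gcp-project-id.svc.id.goog[crossplane-system/provider-gcp]"

kubectl annotate serviceaccount provider-gcp --namespace crossplane-system \
  iam.gke.io/gcp-service-account=crossplane-gcp@your-gcp-project-id.iam.gserviceaccount.com

With the binding in place, the ProviderConfig only needs to point at the injected identity: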

# provider-config.yaml
# Assumes Workload Identity is configured between the K8s service account
# and the GCP service account.
apiVersion: gcp.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  projectID: your-gcp-project-id
  credentials:
    source: InjectedIdentity

Apply with kubectl apply -f provider-config.yaml. Crossplane is now ready to manage resources in the specified GCP project.
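
Before moving on, it is worth confirming that the provider package reports as installed and healthy, and that the ProviderConfig exists:

# Both INSTALLED and HEALTHY should read True
kubectl get providers.pkg.crossplane.io
kubectl get providerconfigs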

2. Defining a Custom Infrastructure Abstraction

This is the most critical part: defining our own custom API for a FeatureSet. It is a two-step process: first, a CompositeResourceDefinition (XRD) defines the schema of the new API; then, a Composition (section 3) implements it.

# featureset.xrd.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: featuresets.platform.acme.com
spec:
  group: platform.acme.com
  names:
    kind: FeatureSet
    plural: featuresets
  claimNames:
    kind: FeatureSetClaim
    plural: featuresetclaims
  versions:
  - name: v1alpha1
    served: true
    referenceable: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              # teamName is used for labeling and IAM policies.
              teamName:
                type: string
                description: "The name of the team owning this feature set."
              # retentionDays determines how long data is kept in GCS.
              retentionDays:
                type: integer
                description: "Data retention period in days for the GCS bucket."
                default: 365
            required:
              - teamName
          # Status fields surfaced by the Composition via ToCompositeFieldPath patches
          status:
            type: object
            properties:
              topicName:
                type: string
              bucketUrl:
                type: string

This XRD creates a new, cluster-scoped resource FeatureSet and a namespace-scoped FeatureSetClaim. Application teams will typically use the namespaced Claim, which offers better isolation.
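
Once the XRD is applied and reports Established, both kinds show up as ordinary API resources:

# The XRD generates a cluster-scoped composite CRD and a namespaced claim CRD
kubectl api-resources --api-group=platform.acme.com
kubectl get xrd featuresets.platform.acme.com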

3. Implementing the Abstraction with a Composition

The Composition translates the high-level FeatureSet into concrete, low-level managed resources. This is where the platform logic lives.

# featureset.composition.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: featureset-gcp
  labels:
    provider: gcp
spec:
  compositeTypeRef:
    apiVersion: platform.acme.com/v1alpha1
    kind: FeatureSet
  resources:
    # 1. Google Cloud Pub/Sub Topic for real-time ingestion
    - name: pubsub-topic
      base:
        apiVersion: pubsub.gcp.upbound.io/v1beta1
        kind: Topic
        spec:
          forProvider:
            # Labels are critical for cost allocation and resource tracking
            labels:
              composition-name: "featureset-gcp"
      patches:
        - fromFieldPath: "metadata.name"
          toFieldPath: "metadata.name"
        - fromFieldPath: "spec.teamName"
          toFieldPath: "spec.forProvider.labels.team"
        # Surface the created topic's ID on the FeatureSet status
        - type: ToCompositeFieldPath
          fromFieldPath: "status.atProvider.id"
          toFieldPath: "status.topicName"

    # 2. Google Cloud Storage Bucket for historical data (data lake)
    - name: gcs-bucket
      base:
        apiVersion: storage.gcp.upbound.io/v1beta1
        kind: Bucket
        spec:
          forProvider:
            location: "US-CENTRAL1"
            forceDestroy: false # A safety measure for production buckets
            uniformBucketLevelAccess: true
            lifecycleRule:
              - action:
                  type: "Delete"
                condition:
                  withState: "ANY" # Applies to both live and archived versions
      patches:
        # Patch the bucket name to be unique and predictable
        - fromFieldPath: "metadata.name"
          toFieldPath: "metadata.name"
          transforms:
            - type: string
              string:
                fmt: "%s-feature-archive"
        # Patch the lifecycle rule from the FeatureSet spec
        - fromFieldPath: "spec.retentionDays"
          toFieldPath: "spec.forProvider.lifecycleRule[0].condition.age"
        # Surface the bucket URL on the FeatureSet status
        - type: ToCompositeFieldPath
          fromFieldPath: "status.atProvider.url"
          toFieldPath: "status.bucketUrl"

    # 3. IAM Policy to allow a specific service account to publish to the topic
    # In a real-world project, this would be more granular.
    - name: pubsub-iam-binding
      base:
        apiVersion: pubsub.gcp.upbound.io/v1beta1
        kind: TopicIAMMember
        spec:
          forProvider:
            role: "roles/pubsub.publisher"
            # This should be parameterized or discovered, hardcoded here for simplicity
            member: "serviceAccount:[email protected]"
      patches:
        - fromFieldPath: "metadata.name"
          toFieldPath: "spec.forProvider.topic"

    # 4. Connection details (topic ID, bucket URL) are surfaced on the
    #    FeatureSet status via the ToCompositeFieldPath patches attached to
    #    the topic and bucket resources above.

Applying these two files (the XRD and the Composition) equips our Kubernetes cluster to provision the custom FeatureSet platform primitive.
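
Assuming the manifests are saved under the filenames shown above, registering the new API takes two commands, and the result can be checked immediately:

kubectl apply -f featureset.xrd.yaml
kubectl apply -f featureset.composition.yaml

# Confirm the Composition is registered for the FeatureSet type
kubectl get compositions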

4. Consuming the Abstraction

A data scientist can now provision all the required infrastructure with a single, simple YAML file. They don’t need to know about GCP IAM, bucket lifecycle rules, or Pub/Sub internals.

# my-fraud-detection-features.yaml
apiVersion: platform.acme.com/v1alpha1
kind: FeatureSetClaim
metadata:
  name: fraud-detection-v1
  namespace: data-science-team
spec:
  compositionSelector:
    matchLabels:
      provider: gcp
  # Claim fields map directly onto the XRD's spec schema
  teamName: "fraud-detection"
  retentionDays: 90 # Shorter retention for this experimental feature set

Applying this (kubectl apply -f my-fraud-detection-features.yaml -n data-science-team) will trigger Crossplane. Within minutes, a new Pub/Sub topic and GCS bucket will be created and configured, with their statuses reported back to the FeatureSetClaim object in Kubernetes.
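
Provisioning progress can be followed from the claim itself and from the managed resources the Composition creates:

# Wait for the claim to report READY=True
kubectl get featuresetclaim fraud-detection-v1 -n data-science-team -w

# List the underlying managed resources (Topic, Bucket, TopicIAMMember)
kubectl get managed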

5. Frontend Integration as a Control Plane

A frontend provides a user-friendly interface on top of this API. The key point is that the frontend doesn't talk to GCP; it talks to the Kubernetes API server, which simplifies both authentication (standard Kubernetes RBAC applies) and application logic.

Here is a conceptual React snippet demonstrating how to create a FeatureSetClaim. In a real application, you would use the official Kubernetes JavaScript client or a more robust API layer.

// A conceptual example using fetch against the Kubernetes API proxy.
// Assumes `kubectl proxy` is running or similar auth is in place.

import React, { useState } from 'react';

const FeatureSetCreator = ({ namespace }) => {
    const [name, setName] = useState('');
    const [team, setTeam] = useState('');
    const [retention, setRetention] = useState(365);
    const [status, setStatus] = useState('');

    const handleSubmit = async (e) => {
        e.preventDefault();
        setStatus('Creating...');

        const claimManifest = {
            apiVersion: 'platform.acme.com/v1alpha1',
            kind: 'FeatureSetClaim',
            metadata: {
                name: name,
                namespace: namespace,
            },
            spec: {
                compositionSelector: {
                    matchLabels: { provider: 'gcp' },
                },
                // Claim fields map directly onto the XRD's spec schema
                teamName: team,
                retentionDays: parseInt(retention, 10),
            },
        };

        try {
            const response = await fetch(
                `/apis/platform.acme.com/v1alpha1/namespaces/${namespace}/featuresetclaims`,
                {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify(claimManifest),
                }
            );

            if (!response.ok) {
                const error = await response.json();
                throw new Error(`Failed to create FeatureSet: ${error.message}`);
            }

            setStatus(`FeatureSet '${name}' created successfully. Provisioning in progress.`);
            // In a real app, we would start watching the resource status for updates.
        } catch (error) {
            console.error(error);
            setStatus(`Error: ${error.message}`);
        }
    };

    return (
        <form onSubmit={handleSubmit}>
            {/* Form inputs for name, team, retention */}
            <button type="submit">Create Feature Set</button>
            {status && <p>Status: {status}</p>}
        </form>
    );
};

This demonstrates the power of the architecture: the frontend’s logic is purely about composing a declarative manifest and submitting it to a single, consistent API endpoint.
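
The "watch the resource status" step mentioned in the snippet maps directly onto the API server's watch endpoint. A sketch using the same kubectl proxy assumption as above:

# Stream status updates for a single claim (kubectl proxy listens on 127.0.0.1:8001 by default)
curl -N "http://127.0.0.1:8001/apis/platform.acme.com/v1alpha1/namespaces/data-science-team/featuresetclaims?watch=true&fieldSelector=metadata.name=fraud-detection-v1"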

6. Trino Configuration for Querying

With the infrastructure provisioned, Trino needs to be configured to read from the GCS buckets. This is done by adding a catalog configuration file to the Trino coordinator. We use the Hive connector, which is capable of reading various file formats (Parquet, ORC, Avro) from object storage.

# etc/catalog/gcs_features.properties

# Connector to use
connector.name=hive

# Metastore configuration - can be Glue, an external HMS, or even a file-based one for simple cases
hive.metastore=file
hive.metastore.catalog.dir=file:///etc/trino/metastore

# S3-compatible API settings for GCS
hive.s3.aws-access-key=GOOG...
hive.s3.aws-secret-key=...
hive.s3.endpoint=https://storage.googleapis.com
hive.s3.path-style-access=true

# Allow creating tables on external locations
hive.non-managed-table-creates-enabled=true

A common pitfall is managing credentials here. In a production Kubernetes deployment, these secrets should be mounted from a Kubernetes Secret, which could itself be managed by a tool like Vault or External Secrets Operator.
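
A minimal sketch, assuming GCS HMAC interoperability keys and a hypothetical secret name and namespace; the Trino deployment (or its Helm chart values) would then template these values into the catalog file rather than committing them to version control:

# Hypothetical secret holding the HMAC key pair consumed by the hive.s3.* settings
kubectl create secret generic trino-gcs-hmac \
  --namespace trino \
  --from-literal=access-key='GOOG...' \
  --from-literal=secret-key='...'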

Once the catalog is loaded, a data analyst can create an external table pointing to the GCS bucket provisioned by Crossplane.

-- The S3-compatible endpoint configured above exposes the GCS bucket under the s3:// scheme
CREATE SCHEMA gcs_features.fraud_detection
WITH (location = 's3://fraud-detection-v1-feature-archive/');

-- Assuming data is stored as partitioned Parquet files
CREATE TABLE gcs_features.fraud_detection.transactions (
    transaction_id VARCHAR,
    user_id BIGINT,
    amount DOUBLE,
    timestamp TIMESTAMP(6) WITH TIME ZONE,
    event_date DATE
)
WITH (
    format = 'PARQUET',
    external_location = 's3://fraud-detection-v1-feature-archive/transactions/',
    partitioned_by = ARRAY['event_date']
);

-- Now, query the features
SELECT
    user_id,
    COUNT(transaction_id) as daily_transaction_count
FROM gcs_features.fraud_detection.transactions
WHERE event_date = CURRENT_DATE - INTERVAL '1' DAY
GROUP BY 1;

This completes the loop: infrastructure is declaratively provisioned, data flows into it, and the analytical engine can query it.

Extensibility and Architectural Limitations

This architecture provides a solid foundation but has clear boundaries. The current data path is asynchronous; features are available for query only after they are written to GCS. For true real-time, online feature serving (e.g., for an inference service that needs sub-millisecond latency), this is insufficient. The FeatureSet Composition could be extended to also provision an online key-value store like Redis or Firestore, along with a streaming processor (like a Cloud Function or Flink job) to populate it. The beauty of the model is that this added complexity remains hidden from the consumer, who might only see a new boolean flag onlineServing: true in the FeatureSetClaim spec.

Furthermore, the Trino cluster itself is a stateful application whose lifecycle management is not covered here. In a truly cloud-native setup, a Trino Kubernetes Operator could be used to manage its deployment declaratively as well.

The primary limitation of this Crossplane-centric model is its reliance on the reconciliation loop. For infrastructure that requires complex, multi-step orchestration workflows (e.g., a database migration), a simple declarative state might not be expressive enough. However, for the vast majority of cloud resources that map well to a create-read-update-delete lifecycle, this approach provides a robust and scalable management paradigm that significantly reduces operational toil.

