The operational burden of managing infrastructure for a real-time feature store is non-trivial. Each new feature set requires a message queue topic for ingestion, a storage bucket for historical data, specific IAM policies for access control, and configurations in the query engine. This process, when performed manually or with imperative scripts, is error-prone, slow, and creates a significant bottleneck for data science teams. The core problem is not just provisioning but providing a scalable, self-service platform for creating and managing the lifecycle of these feature sets.
An imperative approach using a tool like Terraform is the conventional solution. A data scientist would submit a request, and a DevOps engineer would write or modify HCL files, run `terraform plan`, and then `terraform apply`.
Solution A: The Imperative Scripting Approach (e.g., Terraform)
Pros:
- Maturity: Terraform is a well-established, industry-standard tool for Infrastructure as Code.
- Ecosystem: A vast provider ecosystem exists, including a well-maintained Google Cloud Platform provider.
- Predictability: The `plan` command provides a clear preview of changes before they are applied.
Cons:
- State Drift: The Terraform state file is the source of truth, but manual changes made in the cloud console can cause it to drift from reality. Reconciling this drift is a manual and often complex task.
- Execution Model: Execution is push-based and requires a CI/CD pipeline or manual intervention to run. It does not continuously ensure the desired state is maintained. If a resource is accidentally deleted, it remains deleted until the next Terraform run.
- Operational Silo: It reinforces the separation between the infrastructure team (writing HCL) and the application/data team (consuming the infrastructure). It’s not a true self-service model. The “API” is a pull request, not a real-time, programmable interface.
In a real-world project, this leads to a “ticket-ops” culture where data scientists wait for infrastructure, stifling experimentation. The goal is to create a platform where consumers can provision the composite infrastructure they need without understanding the low-level details of each component.
This brings us to a declarative, Kubernetes-native approach.
Solution B: The Declarative, Kubernetes-native Approach (Crossplane)
Pros:
- Continuous Reconciliation: Crossplane installs controllers into a Kubernetes cluster that constantly watch both the desired state (defined in YAML manifests) and the actual state in the cloud provider. Any deviation is automatically corrected. This eliminates state drift.
- Unified API: It extends the Kubernetes API itself. Creating a Pub/Sub topic becomes as simple as `kubectl apply -f topic.yaml`. This turns your Kubernetes cluster into a universal control plane for all infrastructure, not just containers.
- Abstraction and Self-Service: Crossplane's `Composition` model allows platform engineers to define high-level, abstract resources (e.g., a `FeatureSet`) that encapsulate all the necessary underlying components (Topic, Bucket, IAM Bindings). Consumers interact only with this simple abstraction.
Cons:
- Learning Curve: The concepts of `Providers`, `CompositeResourceDefinitions` (XRDs), and `Compositions` require an initial investment to understand. Debugging reconciliation loops can be more complex than debugging a failed script.
- Maturity: While rapidly maturing, the Crossplane ecosystem and its providers are younger than Terraform's. Some edge-case resource attributes might have less coverage.
- Dependency on Kubernetes: This approach inherently ties your infrastructure management to a Kubernetes cluster, which itself is a complex system to manage.
The final decision was to adopt Solution B. The long-term value of a self-healing, continuously reconciled, and API-driven platform outweighs the initial complexity. It aligns with a broader cloud-native strategy and empowers data teams to operate with greater autonomy, directly contributing to a faster iteration cycle.
Core Implementation Overview
The architecture consists of a Kubernetes management cluster running Crossplane, which provisions and manages Google Cloud resources. A lightweight frontend interacts with the Kubernetes API to allow users to declaratively request new `FeatureSet` resources. Trino, running in a separate Kubernetes cluster, is configured to query data from the provisioned resources.
graph TD
  subgraph User Interaction
    Frontend -- "1. Defines 'FeatureSet'" --> K8s_API_Server
  end
  subgraph Management Cluster
    K8s_API_Server -- "2. Creates FeatureSet custom resource" --> Crossplane_Controller
    Crossplane_Controller -- "3. Reads Composition" --> Crossplane_Controller
    Crossplane_Controller -- "4. Provisions resources via GCP API" --> GCP
  end
  subgraph GCP
    GCP_PubSub(Pub/Sub Topic)
    GCP_GCS(GCS Bucket)
    GCP_IAM(IAM Bindings)
    GCP --> GCP_PubSub
    GCP --> GCP_GCS
    GCP --> GCP_IAM
  end
  subgraph Data Flow
    Data_Source(External Data Source) -- "5. Pushes feature events" --> GCP_PubSub
    Data_Sink(Cloud Function/Dataflow) -- "6. Archives events" --> GCP_GCS
    Trino_Cluster -- "7. Queries historical data" --> GCP_GCS
    ML_Model(ML Model/Analytics UI) -- "8. Consumes features" --> Trino_Cluster
  end
  K8s_API_Server -- "Notifies" --> Frontend
1. Setting Up Crossplane and the GCP Provider
First, Crossplane itself must be installed on the management cluster, typically via its official Helm chart. The GCP provider is then installed by applying a Provider package manifest:
# provider-gcp.yaml
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-gcp
spec:
  package: xpkg.upbound.io/upbound/provider-gcp:v0.39.0
Apply this with `kubectl apply -f provider-gcp.yaml`. Next, we need to configure credentials. A common mistake is to use static keys. In a production environment, Workload Identity is the standard.
# provider-config.yaml
# Assumes Workload Identity is configured between the K8s service account
# and the GCP service account.
apiVersion: gcp.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  projectID: your-gcp-project-id
  credentials:
    source: InjectedIdentity
Apply with `kubectl apply -f provider-config.yaml`. Crossplane is now ready to manage resources in the specified GCP project.
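Before building abstractions on top, a quick smoke test confirms that the provider and its credentials actually work. A throwaway managed resource is enough; the bucket name below is illustrative and must be globally unique:
# smoke-test-bucket.yaml (illustrative)
apiVersion: storage.gcp.upbound.io/v1beta1
kind: Bucket
metadata:
  name: acme-crossplane-smoke-test # must be globally unique
spec:
  forProvider:
    location: "US-CENTRAL1"
    uniformBucketLevelAccess: true
  providerConfigRef:
    name: default
If the Bucket reports `Synced: True` and `Ready: True` (`kubectl describe buckets.storage.gcp.upbound.io acme-crossplane-smoke-test`), the control plane can talk to GCP; deleting the Kubernetes resource also deletes the bucket.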
2. Defining a Custom Infrastructure Abstraction
This is the most critical part: defining our own custom API for a `FeatureSet`. This is a two-step process. First, the `CompositeResourceDefinition` (XRD) defines the schema of our new API.
# featureset.xrd.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: featuresets.platform.acme.com
spec:
  group: platform.acme.com
  names:
    kind: FeatureSet
    plural: featuresets
  claimNames:
    kind: FeatureSetClaim
    plural: featuresetclaims
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                # teamName is used for labeling and IAM policies.
                teamName:
                  type: string
                  description: "The name of the team owning this feature set."
                # retentionDays determines how long data is kept in GCS.
                retentionDays:
                  type: integer
                  description: "Data retention period in days for the GCS bucket."
                  default: 365
              required:
                - teamName
            status:
              type: object
              properties:
                # Populated by the Composition with details of the provisioned resources.
                topicName:
                  type: string
                bucketUrl:
                  type: string
This XRD creates a new, cluster-scoped resource, `FeatureSet`, and a namespace-scoped `FeatureSetClaim`. Application teams will typically use the namespaced claim, which offers better isolation.
3. Implementing the Abstraction with a Composition
The `Composition` translates the high-level `FeatureSet` into concrete, low-level managed resources. This is where the platform logic lives.
# featureset.composition.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: featureset-gcp
  labels:
    provider: gcp
spec:
  compositeTypeRef:
    apiVersion: platform.acme.com/v1alpha1
    kind: FeatureSet
  resources:
    # 1. Google Cloud Pub/Sub Topic for real-time ingestion
    - name: pubsub-topic
      base:
        apiVersion: pubsub.gcp.upbound.io/v1beta1
        kind: Topic
        spec:
          forProvider:
            # Labels are critical for cost allocation and resource tracking
            labels:
              composition-name: "featureset-gcp"
      patches:
        - fromFieldPath: "metadata.name"
          toFieldPath: "metadata.name"
        - fromFieldPath: "spec.teamName"
          toFieldPath: "spec.forProvider.labels.team"
        # Expose the topic ID back to the FeatureSet resource status
        - type: ToCompositeFieldPath
          fromFieldPath: "status.atProvider.id"
          toFieldPath: "status.topicName"
          policy:
            fromFieldPath: Required
    # 2. Google Cloud Storage Bucket for historical data (data lake)
    - name: gcs-bucket
      base:
        apiVersion: storage.gcp.upbound.io/v1beta1
        kind: Bucket
        spec:
          forProvider:
            location: "US-CENTRAL1"
            forceDestroy: false # A safety measure for production buckets
            uniformBucketLevelAccess: true
            lifecycleRule:
              - action:
                  type: "Delete"
                condition:
                  withState: "ANY" # Applies to both live and archived versions
      patches:
        # Patch the bucket name to be unique and predictable
        - fromFieldPath: "metadata.name"
          toFieldPath: "metadata.name"
          transforms:
            - type: string
              string:
                fmt: "%s-feature-archive"
        # Patch the lifecycle rule from the FeatureSet spec
        - fromFieldPath: "spec.retentionDays"
          toFieldPath: "spec.forProvider.lifecycleRule[0].condition.age"
        # Expose the bucket URL back to the FeatureSet resource status
        - type: ToCompositeFieldPath
          fromFieldPath: "status.atProvider.url"
          toFieldPath: "status.bucketUrl"
          policy:
            fromFieldPath: Required
    # 3. IAM Policy to allow a specific service account to publish to the topic
    # In a real-world project, this would be more granular.
    - name: pubsub-iam-binding
      base:
        apiVersion: pubsub.gcp.upbound.io/v1beta1
        kind: TopicIAMMember
        spec:
          forProvider:
            role: "roles/pubsub.publisher"
            # This should be parameterized or discovered, hardcoded here for simplicity
            member: "serviceAccount:[email protected]"
      patches:
        - fromFieldPath: "metadata.name"
          toFieldPath: "spec.forProvider.topic"
Applying these two files (the XRD and the Composition) arms our Kubernetes cluster with the ability to provision our custom `FeatureSet` platform primitive.
4. Consuming the Abstraction
A data scientist can now provision all the required infrastructure with a single, simple YAML file. They don’t need to know about GCP IAM, bucket lifecycle rules, or Pub/Sub internals.
# my-fraud-detection-features.yaml
apiVersion: platform.acme.com/v1alpha1
kind: FeatureSetClaim
metadata:
  name: fraud-detection-v1
  namespace: data-science-team
spec:
  compositionSelector:
    matchLabels:
      provider: gcp
  # Fields defined by the XRD schema live directly under spec
  teamName: "fraud-detection"
  retentionDays: 90 # Shorter retention for this experimental feature set
Applying this (`kubectl apply -f my-fraud-detection-features.yaml -n data-science-team`) will trigger Crossplane. Within minutes, a new Pub/Sub topic and GCS bucket will be created and configured, with their statuses reported back to the `FeatureSetClaim` object in Kubernetes.
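Provisioning progress is visible on the claim itself: Crossplane surfaces the standard `Synced` and `Ready` conditions, plus whatever the Composition patches into the status. The exact output varies by provider version, but it will look roughly like this (illustrative excerpt):
# Illustrative status of the FeatureSetClaim after reconciliation
status:
  topicName: projects/your-gcp-project-id/topics/fraud-detection-v1
  bucketUrl: gs://fraud-detection-v1-feature-archive
  conditions:
    - type: Synced
      status: "True"
      reason: ReconcileSuccess
    - type: Ready
      status: "True"
      reason: Available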
5. Frontend Integration as a Control Plane
A frontend provides a user-friendly interface on top of this powerful API. The key is that the frontend doesn’t talk to GCP; it talks to the Kubernetes API server. This simplifies authentication (using Kubernetes RBAC) and logic.
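Because the claim is just a namespaced Kubernetes object, granting a team self-service access is plain RBAC. A minimal sketch of the permissions a data science team would need (names and the group subject are illustrative):
# featureset-rbac.yaml (illustrative)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: featureset-editor
  namespace: data-science-team
rules:
  - apiGroups: ["platform.acme.com"]
    resources: ["featuresetclaims"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: featureset-editor-binding
  namespace: data-science-team
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: featureset-editor
subjects:
  - kind: Group
    name: data-science-team # assumed identity group from the cluster's auth provider
    apiGroup: rbac.authorization.k8s.io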
Here is a conceptual React snippet demonstrating how to create a `FeatureSetClaim`. In a real application, you would use the official Kubernetes JavaScript client or a more robust API layer.
// A conceptual example using fetch against the Kubernetes API proxy.
// Assumes `kubectl proxy` is running or similar auth is in place.
import React, { useState } from 'react';

const FeatureSetCreator = ({ namespace }) => {
  const [name, setName] = useState('');
  const [team, setTeam] = useState('');
  const [retention, setRetention] = useState(365);
  const [status, setStatus] = useState('');

  const handleSubmit = async (e) => {
    e.preventDefault();
    setStatus('Creating...');

    const claimManifest = {
      apiVersion: 'platform.acme.com/v1alpha1',
      kind: 'FeatureSetClaim',
      metadata: {
        name: name,
        namespace: namespace,
      },
      spec: {
        compositionSelector: {
          matchLabels: { provider: 'gcp' },
        },
        // Fields defined by the XRD schema live directly under spec
        teamName: team,
        retentionDays: parseInt(retention, 10),
      },
    };

    try {
      // Custom resources live under /apis/<group>/<version>/..., not /api/v1
      const response = await fetch(
        `/apis/platform.acme.com/v1alpha1/namespaces/${namespace}/featuresetclaims`,
        {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify(claimManifest),
        }
      );
      if (!response.ok) {
        const error = await response.json();
        throw new Error(`Failed to create FeatureSet: ${error.message}`);
      }
      setStatus(`FeatureSet '${name}' created successfully. Provisioning in progress.`);
      // In a real app, we would start watching the resource status for updates.
    } catch (error) {
      console.error(error);
      setStatus(`Error: ${error.message}`);
    }
  };

  return (
    <form onSubmit={handleSubmit}>
      {/* Form inputs for name, team, retention */}
      <button type="submit">Create Feature Set</button>
      {status && <p>Status: {status}</p>}
    </form>
  );
};
This demonstrates the power of the architecture: the frontend’s logic is purely about composing a declarative manifest and submitting it to a single, consistent API endpoint.
6. Trino Configuration for Querying
With the infrastructure provisioned, Trino needs to be configured to read from the GCS buckets. This is done by adding a catalog configuration file to the Trino coordinator. We use the Hive connector, which is capable of reading various file formats (Parquet, ORC, Avro) from object storage.
# etc/catalog/gcs_features.properties
# Connector to use
connector.name=hive
# Metastore configuration - can be Glue, an external HMS, or even a file-based one for simple cases
hive.metastore=file
hive.metastore.catalog.dir=file:///etc/trino/metastore
# S3-compatible API settings for GCS
hive.s3.aws-access-key=GOOG...
hive.s3.aws-secret-key=...
hive.s3.endpoint=https://storage.googleapis.com
hive.s3.path-style-access=true
# Allow creating tables on external locations
hive.non-managed-table-creation-enabled=true
A common pitfall is managing credentials here. In a production Kubernetes deployment, these secrets should be mounted from a Kubernetes Secret, which could itself be managed by a tool like Vault or External Secrets Operator.
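A minimal sketch of that pattern, assuming the S3-interoperability HMAC key pair is kept in a Kubernetes Secret and exposed to the Trino pods as environment variables (names are illustrative):
# trino-gcs-credentials.yaml (illustrative)
apiVersion: v1
kind: Secret
metadata:
  name: trino-gcs-hmac
  namespace: trino
type: Opaque
stringData:
  GCS_ACCESS_KEY: "GOOG..." # GCS HMAC access key for the S3-compatible endpoint
  GCS_SECRET_KEY: "..."     # corresponding HMAC secret
The Trino coordinator and workers would load this Secret via `envFrom`, and the catalog file would then reference the variables through Trino's secrets support, e.g. `hive.s3.aws-access-key=${ENV:GCS_ACCESS_KEY}`, so no plaintext keys live in the catalog file.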
Once the catalog is loaded, a data analyst can create an external table pointing to the GCS bucket provisioned by Crossplane.
-- With the S3-compatible endpoint configured above, the GCS bucket is addressed via the s3:// scheme
CREATE SCHEMA gcs_features.fraud_detection
WITH (location = 's3://fraud-detection-v1-feature-archive/');
-- Assuming data is stored as partitioned Parquet files
CREATE TABLE gcs_features.fraud_detection.transactions (
transaction_id VARCHAR,
user_id BIGINT,
amount DOUBLE,
timestamp TIMESTAMP(6) WITH TIME ZONE,
event_date DATE
)
WITH (
format = 'PARQUET',
external_location = 's3://fraud-detection-v1-feature-archive/transactions/',
partitioned_by = ARRAY['event_date']
);
-- Now, query the features
SELECT
user_id,
COUNT(transaction_id) as daily_transaction_count
FROM gcs_features.fraud_detection.transactions
WHERE event_date = CURRENT_DATE - INTERVAL '1' DAY
GROUP BY 1;
This completes the loop: infrastructure is declaratively provisioned, data flows into it, and the analytical engine can query it.
Extensibility and Architectural Limitations
This architecture provides a solid foundation but has clear boundaries. The current data path is asynchronous; features are available for query only after they are written to GCS. For true real-time, online feature serving (e.g., for an inference service that needs sub-millisecond latency), this is insufficient. The `FeatureSet` Composition could be extended to also provision an online key-value store like Redis or Firestore, along with a streaming processor (like a Cloud Function or Flink job) to populate it. The beauty of the model is that this added complexity remains hidden from the consumer, who might only see a new boolean flag `onlineServing: true` in the `FeatureSetClaim` spec, as sketched below.
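A sketch of what that could look like from the consumer's side; the `onlineServing` field is hypothetical and would also need to be added to the XRD schema and handled by the Composition:
# Hypothetical extension of the claim from section 4
apiVersion: platform.acme.com/v1alpha1
kind: FeatureSetClaim
metadata:
  name: fraud-detection-v1
  namespace: data-science-team
spec:
  compositionSelector:
    matchLabels:
      provider: gcp
  teamName: "fraud-detection"
  retentionDays: 90
  onlineServing: true # hypothetical flag; would provision e.g. a Memorystore (Redis) instance and a streaming job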
Furthermore, the Trino cluster itself is a stateful application whose lifecycle management is not covered here. In a truly cloud-native setup, a Trino Kubernetes Operator could be used to manage its deployment declaratively as well.
The primary limitation of this Crossplane-centric model is its reliance on the reconciliation loop. For infrastructure that requires complex, multi-step orchestration workflows (e.g., a database migration), a simple declarative state might not be expressive enough. However, for the vast majority of cloud resources that map well to a create-read-update-delete lifecycle, this approach provides a robust and scalable management paradigm that significantly reduces operational toil.