Constructing an Immutable ML Serving Image with Packer, BentoML, and an Embedded LevelDB Metadata Cache


The production similarity search service was a consistent source of on-call alerts. Not because of failures, but because of unpredictable, creeping latency. Our architecture was standard: a user query would hit a BentoML service, which would generate an embedding, query a Pinecone index for the top N most similar item IDs, and then perform a second network hop to a centralized Redis cluster to fetch the metadata for those N items. The root cause was clear: two network round trips for every request, with the second one subject to the noisy neighbor problem in our shared Redis instance. Compounding this was deployment drift. Configuration tweaks made directly on EC2 instances to mitigate performance issues were rarely back-ported to our deployment scripts, leading to “works on my machine” style failures during rollouts.

The proposed solution was radical: eliminate the second network hop entirely and enforce deployment consistency by moving to a fully immutable infrastructure model. We would build a self-contained, “golden” Amazon Machine Image (AMI) for each new model version. This AMI would bundle not just the serving code and the model, but also a complete, local copy of the required metadata. At runtime, the service would query Pinecone for IDs and then read the corresponding metadata directly from its local disk. This approach promised to slash latency and guarantee that every deployed instance was an identical, tested artifact.

The technology selection process was mostly straightforward, with one critical decision point. Packer was the obvious choice for building AMIs programmatically. BentoML and Scikit-learn were our established standards for model packaging and training. Pinecone remained our vector database of choice. The real debate was about the embedded metadata store. A simple flat file like JSON or CSV was dismissed due to inefficient lookups; loading a multi-gigabyte file into memory was not an option, and on-disk searching would be slow. SQLite was a contender, but felt like overkill with its relational features and potential for write contention issues, even though our use case was read-only at runtime. We settled on LevelDB. Its simple key-value model, high-performance reads, and battle-tested C++ core (via the plyvel Python binding) made it a perfect fit for a pre-populated, read-heavy local cache. It was just a directory of files, making it trivial to create offline and copy into the image during the Packer build.
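
The runtime access pattern we were optimizing for is nothing more than a handful of point lookups per request against that local directory. A minimal sketch of that pattern with plyvel (the script name is illustrative; the path and document ID refer to the sample artifacts built later in this post):

# scripts/peek_metadata.py (illustrative only)

import json
import plyvel

# The cache is just a directory of files on local disk; no server process is involved.
db = plyvel.DB("./build_artifacts/item_metadata.db", create_if_missing=False)
try:
    raw = db.get(b"doc_004")  # point lookup by item ID; keys and values are bytes
    if raw is not None:
        print(json.loads(raw.decode("utf-8"))["text"])
finally:
    db.close()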

Our build process was designed as a three-phase pipeline. First, an offline data preparation script would train the model, populate the Pinecone index, and build the LevelDB database. Second, the BentoML service code would be written to orchestrate the hybrid query flow. Third, a Packer template would define the assembly of these components into a bootable AMI.

Phase 1: Offline Data and Model Preparation

This initial step is the foundation of the entire immutable artifact. It’s a script that runs in our CI environment whenever the source data is updated or the model needs to be retrained. Its outputs are the artifacts that will be baked into the final image: a serialized Scikit-learn model, and a directory containing the LevelDB database files.

# scripts/prepare_artifacts.py

import os
import json
import logging
import shutil
from contextlib import contextmanager

import numpy as np
import plyvel
import pinecone
import bentoml
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
import joblib

# --- Configuration ---
# In a real project, these would come from environment variables or a config file.
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PINECONE_ENVIRONMENT = os.getenv("PINECONE_ENVIRONMENT", "gcp-starter")
PINECONE_INDEX_NAME = "immutable-metadata-service"
ARTIFACTS_DIR = "./build_artifacts"
MODEL_DIR = os.path.join(ARTIFACTS_DIR, "model")
LEVELDB_PATH = os.path.join(ARTIFACTS_DIR, "item_metadata.db")
DIMENSIONS = 100 # Upper bound on the TF-IDF vocabulary size; the fitted dimension may be smaller

# --- Logging Setup ---
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler()]
)

# --- Sample Data ---
# In a real-world project, this would be loaded from a database or data warehouse.
SAMPLE_DOCUMENTS = [
    {"id": "doc_001", "text": "The quick brown fox jumps over the lazy dog.", "category": "animals"},
    {"id": "doc_002", "text": "Packer simplifies the creation of any type of machine image.", "category": "devops"},
    {"id": "doc_003", "text": "BentoML is an open-source platform for high-performance ML model serving.", "category": "mlops"},
    {"id": "doc_004", "text": "LevelDB is a fast key-value storage library written at Google.", "category": "database"},
    {"id": "doc_005", "text": "A lazy cat naps in the warm afternoon sun.", "category": "animals"},
    {"id": "doc_006", "text": "Pinecone provides a managed vector database for similarity search.", "category": "database"},
]


@contextmanager
def get_leveldb_client(db_path, create_if_missing=False):
    """Context manager for safe LevelDB connection handling."""
    db = None
    try:
        db = plyvel.DB(db_path, create_if_missing=create_if_missing)
        yield db
    except Exception as e:
        logging.error(f"LevelDB operation failed: {e}")
        raise
    finally:
        if db:
            db.close()
            logging.info(f"LevelDB connection to {db_path} closed.")

def setup_directories():
    """Cleans and creates directories for build artifacts."""
    if os.path.exists(ARTIFACTS_DIR):
        logging.warning(f"Removing existing artifacts directory: {ARTIFACTS_DIR}")
        shutil.rmtree(ARTIFACTS_DIR)
    os.makedirs(MODEL_DIR, exist_ok=True)
    logging.info(f"Created artifact directories at {ARTIFACTS_DIR}")

def train_and_save_model():
    """Trains a simple TF-IDF model and saves it."""
    logging.info("Starting model training...")
    
    # We use a simple TF-IDF vectorizer. In a real scenario, this could be a more
    # complex model like SBERT or a fine-tuned transformer.
    vectorizer = TfidfVectorizer(max_features=DIMENSIONS, stop_words='english')
    pipeline = Pipeline([('tfidf', vectorizer)])
    
    texts = [doc['text'] for doc in SAMPLE_DOCUMENTS]
    pipeline.fit(texts)
    
    model_path = os.path.join(MODEL_DIR, "vectorizer.joblib")
    joblib.dump(pipeline, model_path)

    # Register the pipeline in the local BentoML model store so `bentoml build` can
    # reference it. A plain vectorizer pipeline has no `predict`, so we expose `transform`.
    bento_model = bentoml.sklearn.save_model(
        "vectorizer", pipeline, signatures={"transform": {"batchable": False}}
    )
    # Export a portable archive so the Packer build can import it on the build instance.
    export_path = os.path.join(MODEL_DIR, "vectorizer.bentomodel")
    bentoml.models.export_model(str(bento_model.tag), export_path)

    logging.info(f"Model trained and saved to {model_path}; BentoML model exported to {export_path}")
    return pipeline

def create_embeddings(model, documents):
    """Generates embeddings for the documents."""
    logging.info("Generating embeddings...")
    texts = [doc['text'] for doc in documents]
    embeddings = model.transform(texts).toarray()
    logging.info(f"Generated {len(embeddings)} embeddings of dimension {embeddings.shape[1]}")
    return embeddings

def populate_pinecone(embeddings):
    """Initializes Pinecone index and upserts data."""
    if not PINECONE_API_KEY:
        raise ValueError("PINECONE_API_KEY environment variable not set.")

    logging.info(f"Initializing Pinecone with environment: {PINECONE_ENVIRONMENT}")
    pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)

    if PINECONE_INDEX_NAME in pinecone.list_indexes():
        logging.warning(f"Pinecone index '{PINECONE_INDEX_NAME}' already exists. Deleting.")
        pinecone.delete_index(PINECONE_INDEX_NAME)

    logging.info(f"Creating new Pinecone index '{PINECONE_INDEX_NAME}' with dimension {DIMENSIONS}")
    pinecone.create_index(PINECONE_INDEX_NAME, dimension=DIMENSIONS, metric="cosine")
    
    index = pinecone.Index(PINECONE_INDEX_NAME)
    
    ids = [doc['id'] for doc in SAMPLE_DOCUMENTS]
    vectors_to_upsert = list(zip(ids, embeddings.tolist()))

    # Upsert in batches for better performance
    batch_size = 100
    for i in range(0, len(vectors_to_upsert), batch_size):
        batch = vectors_to_upsert[i:i + batch_size]
        index.upsert(vectors=batch)
        logging.info(f"Upserted batch {i//batch_size + 1} to Pinecone.")

    logging.info(f"Successfully populated Pinecone index. Index stats: {index.describe_index_stats()}")

def populate_leveldb():
    """Populates the LevelDB instance with document metadata."""
    logging.info(f"Populating LevelDB at {LEVELDB_PATH}...")
    try:
        with get_leveldb_client(LEVELDB_PATH, create_if_missing=True) as db:
            with db.write_batch() as wb:
                for doc in SAMPLE_DOCUMENTS:
                    # Key: document ID (bytes)
                    # Value: JSON string of metadata (bytes)
                    key = doc['id'].encode('utf-8')
                    # We store the full document as metadata for this example
                    value = json.dumps(doc).encode('utf-8')
                    wb.put(key, value)
            logging.info(f"Wrote {len(SAMPLE_DOCUMENTS)} records to LevelDB in a single batch.")
    except Exception as e:
        logging.error(f"Failed to populate LevelDB: {e}")
        # Clean up a partially created DB on failure
        if os.path.exists(LEVELDB_PATH):
            shutil.rmtree(LEVELDB_PATH)
        raise

if __name__ == "__main__":
    logging.info("--- Starting Artifact Preparation Pipeline ---")
    setup_directories()
    model_pipeline = train_and_save_model()
    doc_embeddings = create_embeddings(model_pipeline, SAMPLE_DOCUMENTS)
    populate_pinecone(doc_embeddings)
    populate_leveldb()
    logging.info("--- Artifact Preparation Complete ---")
    logging.info(f"Model saved in: {MODEL_DIR}")
    logging.info(f"LevelDB database created at: {LEVELDB_PATH}")

Running this script leaves us with a build_artifacts directory containing the model and the LevelDB files. This is the payload for our image.
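
Because these artifacts get baked into an image that cannot be patched after the fact, it pays to verify them in CI before invoking Packer. A minimal sanity check along these lines (the script name is illustrative and not part of the pipeline above; the expected count comes from SAMPLE_DOCUMENTS):

# scripts/verify_artifacts.py (illustrative CI gate)

import json
import plyvel

LEVELDB_PATH = "./build_artifacts/item_metadata.db"
EXPECTED_RECORDS = 6  # len(SAMPLE_DOCUMENTS) in prepare_artifacts.py

db = plyvel.DB(LEVELDB_PATH, create_if_missing=False)
try:
    count = 0
    for key, value in db.iterator():
        record = json.loads(value.decode("utf-8"))
        # Every value must round-trip to JSON and carry the id it is keyed by.
        assert record["id"] == key.decode("utf-8")
        count += 1
    assert count == EXPECTED_RECORDS, f"expected {EXPECTED_RECORDS} records, found {count}"
    print(f"LevelDB artifact OK: {count} records")
finally:
    db.close()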

Phase 2: The BentoML Service with Hybrid Lookups

The service code is where the magic happens at runtime. It needs to be aware of the co-located LevelDB database. We accomplish this by initializing the LevelDB client within the BentoML service’s lifecycle. A critical consideration in a production environment is resource management. We must ensure the database connection is properly opened and closed. While LevelDB is thread-safe, managing a single client instance for the service’s lifetime is the most efficient approach.

Here’s the service.py that will be packaged by BentoML.

# service/service.py

import os
import json
import logging
from typing import List

import bentoml
import numpy as np
import plyvel
import pinecone
from bentoml.io import JSON
from pydantic import BaseModel

# --- Configuration ---
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PINECONE_ENVIRONMENT = os.getenv("PINECONE_ENVIRONMENT", "gcp-starter")
PINECONE_INDEX_NAME = "immutable-metadata-service"
# This path is where Packer will place the DB inside the final image.
LEVELDB_PATH = "/opt/ml_service/data/item_metadata.db"

# --- Logging ---
# BentoML has its own logging, but we can add our own for clarity.
logger = logging.getLogger(__name__)

# --- Runner and Service Definition ---
# Load the Scikit-learn model we trained as a BentoML Runner
vectorizer_runner = bentoml.sklearn.get("vectorizer:latest").to_runner()

svc = bentoml.Service(
    "similarity_search_with_local_meta",
    runners=[vectorizer_runner]
)

# --- Service Context and Lifecycle Hooks ---
# We use the service context to manage the lifecycle of our database connections.
# This ensures we don't open/close connections on every single request.
@svc.on_startup
def initialize_connections(context: bentoml.Context):
    """
    Initialize Pinecone and LevelDB clients on service startup.
    These clients are stored in the service context dictionary.
    """
    logger.info("Initializing external connections on service startup...")
    # Pinecone Client
    if not PINECONE_API_KEY:
        raise RuntimeError("PINECONE_API_KEY is not set. Service cannot start.")
    pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
    context.state["pinecone_index"] = pinecone.Index(PINECONE_INDEX_NAME)
    logger.info("Pinecone connection initialized.")

    # LevelDB Client
    try:
        # We expect the DB to exist. If not, it's a critical packaging error.
        context.state["leveldb_client"] = plyvel.DB(LEVELDB_PATH, create_if_missing=False)
        logger.info(f"Successfully connected to local LevelDB at {LEVELDB_PATH}")
    except Exception as e:
        logger.error(f"FATAL: Could not open LevelDB at {LEVELDB_PATH}. Is the image built correctly? Error: {e}")
        # This will prevent the service from starting up correctly.
        raise

@svc.on_shutdown
def cleanup_connections(context: bentoml.Context):
    """
    Clean up resources on service shutdown.
    """
    logger.info("Closing connections on service shutdown...")
    if "leveldb_client" in context.state and not context.state["leveldb_client"].closed:
        context.state["leveldb_client"].close()
        logger.info("LevelDB connection closed.")
    # Pinecone client does not require explicit closing.

# --- API Definition ---

class QueryRequest(BaseModel):
    query_text: str
    top_k: int = 5

class MetadataResponse(BaseModel):
    id: str
    score: float
    metadata: dict

class SearchResponse(BaseModel):
    results: List[MetadataResponse]

@svc.api(input=JSON(pydantic_model=QueryRequest), output=JSON(pydantic_model=SearchResponse))
async def search(request: QueryRequest, ctx: bentoml.Context) -> SearchResponse:
    """
    The main API endpoint. It orchestrates the vector search and local metadata retrieval.
    """
    logger.info(f"Received search request for query: '{request.query_text}' with top_k={request.top_k}")
    
    # Step 1: Generate the query embedding using the Scikit-learn runner.
    # The model was saved with a `transform` signature, so we call that method explicitly.
    query_embedding_sparse = await vectorizer_runner.transform.async_run([request.query_text])
    # Pinecone expects a flat list of floats, not a 1 x N nested list.
    query_embedding = query_embedding_sparse.toarray()[0].tolist()

    # Step 2: Query Pinecone to get top_k similar item IDs
    pinecone_index = ctx.state["pinecone_index"]
    try:
        query_response = pinecone_index.query(
            vector=query_embedding,
            top_k=request.top_k,
            include_values=False # We only need IDs
        )
        matches = query_response['matches']
        logger.info(f"Pinecone returned {len(matches)} matches.")
    except Exception as e:
        logger.error(f"Error querying Pinecone: {e}")
        # In a production system, you'd return a proper HTTP error code.
        return SearchResponse(results=[])

    # Step 3: Fetch metadata for each ID from the local LevelDB instance
    leveldb_client = ctx.state["leveldb_client"]
    results = []
    for match in matches:
        item_id = match['id']
        try:
            # LevelDB keys/values are bytes.
            metadata_bytes = leveldb_client.get(item_id.encode('utf-8'))
            if metadata_bytes:
                metadata = json.loads(metadata_bytes.decode('utf-8'))
                results.append(
                    MetadataResponse(id=item_id, score=match['score'], metadata=metadata)
                )
            else:
                # This indicates a data consistency issue between Pinecone and LevelDB.
                # A critical metric to monitor.
                logger.warning(f"ID '{item_id}' found in Pinecone but not in local LevelDB cache. Skipping.")
        except Exception as e:
            logger.error(f"Error fetching metadata for ID '{item_id}' from LevelDB: {e}")
            # Decide whether to skip or fail. Skipping is more resilient.
            continue
    
    return SearchResponse(results=results)

We also need a bentofile.yaml to define the service package.

# service/bentofile.yaml
service: "service:svc"
labels:
  owner: ml-platform-team
  project: immutable-search
include:
  - "*.py"
python:
  packages:
    - scikit-learn
    - pandas
    - pydantic
    - pinecone-client
    - plyvel
models:
  - vectorizer:latest

Before proceeding locally, note that the preparation script has already registered the trained pipeline in the BentoML model store via bentoml.sklearn.save_model (a raw joblib file cannot be fed to bentoml models import, which expects an exported .bentomodel archive). Running bentoml build in the service directory then creates the Bento.

Phase 3: The Packer Image Build

This is the final assembly line. Packer reads a HashiCorp Configuration Language (HCL) file that declaratively defines the machine image. It will start a temporary EC2 instance, run a series of provisioners (shell scripts, file uploads), and then capture a snapshot of the instance’s disk as a new AMI.

# packer/image.pkr.hcl

packer {
  required_plugins {
    amazon = {
      version = ">= 1.0.0"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "instance_type" {
  type    = string
  default = "t3.medium"
}

// Use the latest Amazon Linux 2 AMI as our base
source "amazon-ebs" "ml_service_base" {
  ami_name      = "immutable-ml-service-{{timestamp}}"
  instance_type = var.instance_type
  region        = var.aws_region
  source_ami_filter {
    filters = {
      name                = "amzn2-ami-hvm-*-x86_64-gp2"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    most_recent = true
    owners      = ["amazon"]
  }
  ssh_username = "ec2-user"
  tags = {
    Name        = "Immutable ML Service Image"
    Service     = "SimilaritySearch"
    Version     = "{{timestamp}}"
  }
}

build {
  name    = "build-immutable-ml-service"
  sources = ["source.amazon-ebs.ml_service_base"]

  // Provisioner 1: Install system packages and the Python runtime dependencies
  provisioner "shell" {
    inline = [
      "sudo yum update -y",
      "sudo amazon-linux-extras install epel -y", // plyvel links against LevelDB, which ships via EPEL on Amazon Linux 2
      "sudo yum install -y gcc-c++ python3-devel leveldb-devel snappy-devel",
      "sudo pip3 install --upgrade pip",
      // `bentoml serve` runs in the system Python environment, so the runtime
      // packages listed in bentofile.yaml must be present on the image as well.
      "sudo pip3 install bentoml scikit-learn pandas pydantic pinecone-client plyvel",
    ]
  }

  // Provisioner 2: Create directories for the data, service code, and model artifact
  provisioner "shell" {
    inline = [
      "sudo mkdir -p /opt/ml_service/data",
      "sudo chown -R ec2-user:ec2-user /opt/ml_service",
      "mkdir -p /tmp/service_build /tmp/model_build",
    ]
  }

  // Upload the contents of the BentoML service directory
  provisioner "file" {
    source      = "../service/"
    destination = "/tmp/service_build"
  }

  // Upload the exported BentoML model archive
  provisioner "file" {
    source      = "../build_artifacts/model/vectorizer.bentomodel"
    destination = "/tmp/model_build/vectorizer.bentomodel"
  }

  // Upload the LevelDB data directory; with no trailing slash on the source,
  // Packer creates /opt/ml_service/data/item_metadata.db on the instance
  provisioner "file" {
    source      = "../build_artifacts/item_metadata.db"
    destination = "/opt/ml_service/data"
  }

  // Provisioner 3: Import the model and build the Bento on the instance
  provisioner "shell" {
    inline = [
      // Import the exported model into the instance's local BentoML store
      "bentoml models import /tmp/model_build/vectorizer.bentomodel",
      // Building on the machine ensures architecture compatibility and registers
      // the Bento in the local store, ready for `bentoml serve`
      "cd /tmp/service_build && bentoml build -f bentofile.yaml",
      // Clean up build inputs
      "rm -rf /tmp/service_build /tmp/model_build",
    ]
  }

  // Provisioner 4: Set up the systemd service to run BentoML on boot
  provisioner "shell" {
    inline = [
      // Create the systemd service file
      "sudo bash -c 'cat > /etc/systemd/system/bentoml.service <<EOF\n[Unit]\nDescription=BentoML Service for Similarity Search\nAfter=network.target\n\n[Service]\nUser=ec2-user\nWorkingDirectory=/home/ec2-user\n# WARNING: In a true production environment, secrets like API keys MUST be injected securely,\n# e.g., via EC2 Instance Profile, AWS Secrets Manager, or HashiCorp Vault, not as env vars here.\nEnvironment=\"PINECONE_API_KEY=${env "PINECONE_API_KEY"}\"\nEnvironment=\"PINECONE_ENVIRONMENT=${env "PINECONE_ENVIRONMENT"}\"\nExecStart=/usr/local/bin/bentoml serve similarity_search_with_local_meta:latest --production\nRestart=always\n\n[Install]\nWantedBy=multi-user.target\nEOF'",

      // Enable the service
      "sudo systemctl enable bentoml.service"
    ]
  }
}

The final result is an AMI, identified by its ID (e.g., ami-0123456789abcdef0). Deploying a new version is now a trivial infrastructure-as-code operation: update the AMI ID in our Terraform or CloudFormation template for the Auto Scaling Group’s Launch Template and apply the change. Rollbacks are equally simple, just a matter of pointing back to the previous AMI ID. The latency problem was solved; p99 latency for the metadata lookup step dropped from ~30ms to under 1ms, as it was now a local disk read. Deployment failures due to configuration drift were entirely eliminated.
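
In practice, the "update the AMI ID" step can itself be scripted. A hedged boto3 sketch, assuming the Auto Scaling Group consumes a launch template named similarity-search-lt (the template name and AMI ID below are placeholders; our real pipeline drives this through Terraform):

# scripts/roll_ami.py (illustrative sketch, not part of the pipeline above)

import boto3

LAUNCH_TEMPLATE_NAME = "similarity-search-lt"   # placeholder
NEW_AMI_ID = "ami-0123456789abcdef0"            # placeholder from the Packer build output

ec2 = boto3.client("ec2")

# Create a new launch template version pointing at the freshly baked AMI...
response = ec2.create_launch_template_version(
    LaunchTemplateName=LAUNCH_TEMPLATE_NAME,
    SourceVersion="$Latest",
    LaunchTemplateData={"ImageId": NEW_AMI_ID},
)
new_version = str(response["LaunchTemplateVersion"]["VersionNumber"])

# ...and make it the default so the Auto Scaling Group picks it up on its next refresh.
ec2.modify_launch_template(
    LaunchTemplateName=LAUNCH_TEMPLATE_NAME,
    DefaultVersion=new_version,
)
print(f"Launch template {LAUNCH_TEMPLATE_NAME} now defaults to version {new_version}")

A rollback is the same two calls with the previous AMI ID.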

The primary trade-off of this architecture is data freshness. The metadata cache is only as recent as the last AMI build. For our use case, where metadata changes infrequently (daily at most), this is an acceptable price to pay for the immense gains in performance and reliability. If sub-hour data freshness were a requirement, this pattern would be unsuitable. The AMI build process also takes longer than a simple code push, typically 5-10 minutes, which slightly increases the lead time for changes. Future iterations could explore a hybrid model where the AMI is baked with a base version of the LevelDB cache, and a startup script fetches a delta patch from S3 to apply on boot, offering a compromise between true immutability and the need for more frequent data updates.
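
That boot-time delta patch could look roughly like the sketch below, assuming the patch is published as a JSON-lines object in S3; the bucket, key, and the "deleted" tombstone field are hypothetical, and nothing in the current pipeline produces this object yet.

# scripts/apply_metadata_patch.py (hypothetical boot-time hook)

import json
import boto3
import plyvel

PATCH_BUCKET = "ml-metadata-patches"                       # hypothetical
PATCH_KEY = "item_metadata/latest-delta.jsonl"             # hypothetical
LEVELDB_PATH = "/opt/ml_service/data/item_metadata.db"

s3 = boto3.client("s3")
body = s3.get_object(Bucket=PATCH_BUCKET, Key=PATCH_KEY)["Body"].read().decode("utf-8")

db = plyvel.DB(LEVELDB_PATH, create_if_missing=False)
try:
    # Apply the whole patch atomically in one write batch.
    with db.write_batch() as wb:
        for line in body.splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            key = record["id"].encode("utf-8")
            if record.get("deleted"):
                wb.delete(key)   # tombstone entries remove stale items
            else:
                wb.put(key, json.dumps(record).encode("utf-8"))
finally:
    db.close()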

