Implementing a Persistent Milvus Connection Manager for Node.js AWS Lambda to Mitigate Cold Start Latency


The initial performance metrics for our semantic search endpoint were unacceptable. Deployed as a Node.js function on AWS Lambda, serving vector similarity searches against a Milvus collection, the endpoint's P99 latency hit several seconds on initial load. The root cause was immediately obvious to anyone who has worked with serverless functions and stateful backends: we were establishing a new database connection on every invocation. Each request that landed on a new, uninitialized container paid the full price of TCP/IP handshakes, authentication, and session setup with the Milvus cluster, and as we will see, warm containers were paying it too. This is a classic anti-pattern in serverless design.

Our first, naive implementation looked something like this: straightforward code that would work fine in a long-running process but is fundamentally flawed in a Lambda environment.

// src/handler_v1.ts
// DO NOT USE THIS IN PRODUCTION. THIS IS THE FLAWED INITIAL APPROACH.

import { MilvusClient } from "@zilliz/milvus2-sdk-node";
import { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

const MILVUS_ADDRESS = process.env.MILVUS_ADDRESS!;
const COLLECTION_NAME = "document_embeddings";

export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  if (!event.body) {
    return { statusCode: 400, body: JSON.stringify({ error: "Request body is missing" }) };
  }

  try {
    const { queryVector } = JSON.parse(event.body);
    if (!queryVector || !Array.isArray(queryVector)) {
      return { statusCode: 400, body: JSON.stringify({ error: "Invalid query vector" }) };
    }

    // Anti-pattern: Creating a new client on every single invocation.
    // This adds hundreds of milliseconds, or even seconds, to every cold start.
    console.log("Creating new Milvus client...");
    const milvusClient = new MilvusClient(MILVUS_ADDRESS);
    console.log("Milvus client created. Connecting...");

    // With a fresh client on every invocation, this first call pays the full
    // connection setup cost plus a collection-load round trip each time.
    await milvusClient.loadCollection({ collection_name: COLLECTION_NAME });
    console.log("Collection loaded.");

    const searchParams = {
      collection_name: COLLECTION_NAME,
      expr: "doc_type == 'public'",
      vectors: [queryVector],
      search_params: {
        anns_field: "embedding",
        topk: "5",
        metric_type: "L2",
        params: JSON.stringify({ nprobe: 10 }),
      },
      output_fields: ["doc_id", "source"],
      vector_type: 101, // DataType.FloatVector
    };

    const searchResults = await milvusClient.search(searchParams);

    // Equally problematic: No explicit close/release.
    // Relies on the Lambda environment cleanup, which can lead to connection leaks or exhaustion on the Milvus side.
    
    return {
      statusCode: 200,
      body: JSON.stringify(searchResults),
    };

  } catch (error) {
    console.error("Failed to execute Milvus search:", error);
    return {
      statusCode: 500,
      body: JSON.stringify({ error: "Internal server error during search." }),
    };
  }
};

The problem is clear. The Lambda execution model freezes the container after an invocation and can reuse it for subsequent requests (a “warm start”). Any objects or state declared in the global scope, outside the handler function, persist across these warm invocations. By instantiating MilvusClient inside the handler, we were forfeiting this crucial optimization and forcing a new connection on every single invocation, warm or cold.
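
To see this reuse in action, here is a minimal, hypothetical probe handler (not part of our service) whose module-scoped counter survives warm invocations and resets only when a new execution environment is created.

// probe-handler.ts (hypothetical, for illustration only)
import { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

let invocationCount = 0; // module scope: initialized once per execution environment

export const handler = async (_event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  invocationCount += 1;
  // invocationCount === 1 means this container just cold-started;
  // anything higher means the execution context was reused (a warm start).
  return {
    statusCode: 200,
    body: JSON.stringify({ coldStart: invocationCount === 1, invocationCount }),
  };
};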

The first step to fixing this is hoisting the client instantiation out of the handler. This is a common and necessary practice for any database client in a serverless function.

// A slightly better, but still incomplete, approach.
import { MilvusClient } from "@zilliz/milvus2-sdk-node";

const MILVUS_ADDRESS = process.env.MILVUS_ADDRESS!;
const milvusClient = new MilvusClient(MILVUS_ADDRESS); // Hoisted to global scope

// ... handler implementation ...

This improves the situation for warm starts but introduces new complexities. When is the connection actually established? What if the first invocation fails mid-connection, or times out while the connection attempt is still in flight, leaving the frozen container in a half-connected state? What if the handler itself issues several parallel operations that each need the client? A simple global variable isn't robust enough: it doesn't manage connection state. We need a dedicated connection manager that ensures only one connection attempt occurs at a time and that every caller can reliably await and reuse the established connection.

This led to the development of a singleton connection manager. Its responsibilities are:

  1. Encapsulate a single MilvusClient instance.
  2. Track the connection state (DISCONNECTED, CONNECTING, CONNECTED).
  3. Ensure that if multiple concurrent callers (for example, parallel operations within one invocation, or a new invocation resuming while a previous attempt is still in flight) trigger a connection, only the first performs the connection logic while the others await the same attempt.
  4. Provide a simple getClient() method that abstracts this logic away from the handler.

Here is the production-ready implementation of milvus-connection-manager.ts.

// src/milvus-connection-manager.ts

import { MilvusClient } from "@zilliz/milvus2-sdk-node";
import { promisify } from "util";

// Using a timer to add a timeout to our connection logic.
const sleep = promisify(setTimeout);

enum ConnectionStatus {
  DISCONNECTED,
  CONNECTING,
  CONNECTED,
  ERROR,
}

class MilvusConnectionManager {
  private static instance: MilvusConnectionManager;
  private client: MilvusClient;
  private status: ConnectionStatus = ConnectionStatus.DISCONNECTED;
  private connectionPromise: Promise<MilvusClient> | null = null;

  private constructor() {
    const milvusAddress = process.env.MILVUS_ADDRESS;
    if (!milvusAddress) {
      console.error("FATAL: MILVUS_ADDRESS environment variable is not set.");
      this.status = ConnectionStatus.ERROR;
      // In a real project, you might throw here to fail the container initialization entirely.
      this.client = null as any; // Left unset; getClient() will reject via the ERROR state.
    } else {
      console.log(`Initializing Milvus client for address: ${milvusAddress}`);
      this.client = new MilvusClient(milvusAddress);
    }
  }

  public static getInstance(): MilvusConnectionManager {
    if (!MilvusConnectionManager.instance) {
      MilvusConnectionManager.instance = new MilvusConnectionManager();
    }
    return MilvusConnectionManager.instance;
  }

  private async connect(): Promise<MilvusClient> {
    if (this.status === ConnectionStatus.ERROR) {
      throw new Error("Milvus client is in an unrecoverable error state.");
    }
    
    this.status = ConnectionStatus.CONNECTING;
    console.log("Attempting to connect to Milvus...");

    try {
      // The Milvus Node SDK doesn't have an explicit connect() method.
      // Operations like checkHealth() or loadCollection() implicitly connect.
      // We'll use checkHealth as a lightweight connection verifier.
      const healthCheck = await Promise.race([
        this.client.checkHealth(),
        sleep(5000).then(() => { throw new Error("Milvus connection timed out after 5 seconds"); })
      ]);

      if (!healthCheck.isHealthy) {
        throw new Error("Milvus reported as unhealthy.");
      }

      console.log("Successfully connected to Milvus and it is healthy.");
      this.status = ConnectionStatus.CONNECTED;
      this.connectionPromise = null; // Clear the promise, subsequent calls will get the client directly.
      return this.client;
    } catch (error) {
      console.error("Failed to connect to Milvus:", error);
      this.status = ConnectionStatus.DISCONNECTED;
      this.connectionPromise = null; // Clear the promise to allow for retry on next invocation.
      throw error; // Propagate the error to the caller (the Lambda handler).
    }
  }

  public async getClient(): Promise<MilvusClient> {
    switch (this.status) {
      case ConnectionStatus.CONNECTED:
        // If we are already connected, return the client immediately.
        return this.client;

      case ConnectionStatus.CONNECTING:
        // If a connection attempt is already in progress (e.g., from a concurrent invocation),
        // wait for it to complete instead of starting a new one.
        console.log("Connection in progress, awaiting result...");
        // The non-null assertion is safe here because connectionPromise is always set when status is CONNECTING.
        return this.connectionPromise!;

      case ConnectionStatus.DISCONNECTED:
        // If disconnected, initiate the connection.
        console.log("Client is disconnected. Initiating new connection...");
        // Store the promise so concurrent callers can await the same connection attempt.
        this.connectionPromise = this.connect();
        return this.connectionPromise;
        
      case ConnectionStatus.ERROR:
        // If in an error state (e.g., missing config), fail immediately.
        throw new Error("Milvus client is in an error state.");
    }
  }
}

export const milvusManager = MilvusConnectionManager.getInstance();

The key piece of logic is the handling of the CONNECTING state. By storing the Promise returned by the connect() method, any concurrent calls to getClient() while the first is in-flight will simply await that existing promise. This prevents redundant connection attempts when several callers need the client at once: for example, when the handler fans out parallel Milvus operations, or when a previous invocation was frozen mid-connection and a later one resumes while that attempt is still pending.
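
A quick, hypothetical sketch (not part of the deployed handler) makes this concrete: two concurrent getClient() calls share a single connection attempt and resolve to the same client instance.

// concurrency-sketch.ts (hypothetical, for illustration only)
import { milvusManager } from "./milvus-connection-manager";

async function demonstrateSharedConnection(): Promise<void> {
  // Both calls start while the manager is DISCONNECTED. The first flips the state
  // to CONNECTING and stores the in-flight promise; the second simply awaits it.
  const [clientA, clientB] = await Promise.all([
    milvusManager.getClient(),
    milvusManager.getClient(),
  ]);
  console.log("Same client instance reused:", clientA === clientB); // expected: true
}

demonstrateSharedConnection().catch((err) => console.error("Sketch failed:", err));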

The refactored handler now becomes much cleaner and is decoupled from the connection management logic.

// src/handler_v2.ts

import { milvusManager } from "./milvus-connection-manager";
import { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

const COLLECTION_NAME = process.env.COLLECTION_NAME || "document_embeddings";
let isCollectionLoaded = false; // State to avoid reloading the collection on every warm invocation

export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  if (!event.body) {
    return { statusCode: 400, body: JSON.stringify({ error: "Request body is missing" }) };
  }

  try {
    const { queryVector } = JSON.parse(event.body);
    if (!queryVector || !Array.isArray(queryVector)) {
      return { statusCode: 400, body: JSON.stringify({ error: "Invalid query vector" }) };
    }

    // The handler now only cares about getting a ready-to-use client.
    const milvusClient = await milvusManager.getClient();

    // Optimization: Only load the collection into memory if it hasn't been done
    // in this execution environment yet.
    if (!isCollectionLoaded) {
      console.log(`Loading collection: ${COLLECTION_NAME}`);
      await milvusClient.loadCollection({ collection_name: COLLECTION_NAME });
      isCollectionLoaded = true;
      console.log("Collection loaded successfully.");
    }

    const searchParams = {
      collection_name: COLLECTION_NAME,
      expr: "doc_type == 'public'",
      vectors: [queryVector],
      search_params: {
        anns_field: "embedding",
        topk: "5",
        metric_type: "L2",
        params: JSON.stringify({ nprobe: 10 }),
      },
      output_fields: ["doc_id", "source"],
      vector_type: 101, // DataType.FloatVector
    };

    const searchResults = await milvusClient.search(searchParams);
    
    return {
      statusCode: 200,
      body: JSON.stringify(searchResults.results),
    };

  } catch (error: any) {
    console.error("Error in handler:", error.message, error.stack);
    // If the connection attempt failed, the manager has already reset itself to
    // DISCONNECTED (see connect()), so the next invocation in this container can
    // retry with a fresh connection attempt.
    return {
      statusCode: 500,
      body: JSON.stringify({ error: "Internal server error." }),
    };
  }
};

This architecture solves the connection latency problem. However, another performance bottleneck in serverless environments is code package size. A typical Node.js project involves zipping the entire node_modules directory, which can easily be tens or hundreds of megabytes. During a cold start, Lambda must download and unzip this package before it can even start the Node.js runtime. This file I/O is a significant, and often overlooked, contributor to latency.

This is where the Rome toolchain became critical for our project. Rome positions itself as an all-in-one toolchain for web projects, covering compiling, linting, formatting and, most importantly for this use case, bundling. We use it to transpile our TypeScript and bundle it with its dependencies into a single, minified JavaScript file.

Our project structure:

.
├── rome.json
├── package.json
├── tsconfig.json
└── src
    ├── handler_v2.ts
    └── milvus-connection-manager.ts

The rome.json configuration is minimal. We just need to tell it to enable the bundler.

// rome.json
{
  "$schema": "./node_modules/rome/configuration_schema.json",
  "organizeImports": {
    "enabled": true
  },
  "linter": {
    "enabled": true,
    "rules": {
      "recommended": true
    }
  },
  "bundler": {
    "enabled": true
  }
}

And the build script in package.json:

// package.json
{
  "name": "lambda-milvus-rome",
  "version": "1.0.0",
  "scripts": {
    "build": "rome bundle src/handler_v2.ts --out-dir dist --sourcemap-option hidden",
    "package": "npm run build && cd dist && zip -r ../lambda-package.zip . && cd .."
  },
  "dependencies": {
    "@zilliz/milvus2-sdk-node": "^2.2.16"
  },
  "devDependencies": {
    "@types/aws-lambda": "^8.10.119",
    "rome": "^12.1.3",
    "typescript": "^5.1.6"
  }
}

Running npm run package now produces a lambda-package.zip containing a single handler_v2.js file and its sourcemap. The size of this zip file is typically less than a megabyte, compared to the 50MB+ of the unbundled node_modules directory. This drastically reduces the cold start I/O overhead.

The difference in the invocation flow can be visualized.

The naive approach:

sequenceDiagram
    participant Client
    participant APIGW as API Gateway
    participant ColdLambda as Lambda (Cold)
    participant Milvus

    Client->>APIGW: POST /search
    APIGW->>ColdLambda: Invoke
    Note over ColdLambda: Download & unzip large package
    Note over ColdLambda: Start Node.js runtime
    ColdLambda->>ColdLambda: new MilvusClient()
    ColdLambda->>Milvus: Establish TCP connection
    Milvus-->>ColdLambda: Connection OK
    ColdLambda->>Milvus: search()
    Milvus-->>ColdLambda: Search results
    ColdLambda-->>APIGW: Response
    APIGW-->>Client: Response (high latency)

The optimized approach with connection management and bundling:

sequenceDiagram
    participant Client
    participant APIGW as API Gateway
    participant ColdLambda as Lambda (Cold)
    participant WarmLambda as Lambda (Warm)
    participant Milvus

    %% Cold start invocation
    Client->>APIGW: POST /search (Request 1)
    APIGW->>ColdLambda: Invoke
    Note over ColdLambda: Download & unzip small bundle
    Note over ColdLambda: Start Node.js runtime
    Note over ColdLambda: milvusManager initializes
    ColdLambda->>Milvus: Establish TCP connection (once)
    Milvus-->>ColdLambda: Connection OK
    ColdLambda->>Milvus: search()
    Milvus-->>ColdLambda: Search results
    ColdLambda-->>APIGW: Response
    APIGW-->>Client: Response (moderate latency)

    %% Warm start invocation
    Client->>APIGW: POST /search (Request 2)
    APIGW->>WarmLambda: Invoke (reuses container)
    Note over WarmLambda: Execution context is reused
    Note over WarmLambda: milvusManager has a CONNECTED client
    WarmLambda->>Milvus: search()
    Milvus-->>WarmLambda: Search results
    WarmLambda-->>APIGW: Response
    APIGW-->>Client: Response (low latency)
This final architecture—combining a stateful connection manager that leverages the Lambda execution context with a build process optimized for minimal package size using Rome—is robust and performant. It balances the operational benefits of serverless with the performance requirements of stateful, low-latency database interactions.

The solution is not without its limitations. The connection manager handles concurrent callers within a single container, but it does not provide connection pooling across the entire fleet of Lambda containers. If traffic scales to hundreds of concurrent containers, this could still mean hundreds of connections to Milvus. In such extreme-scale scenarios, a dedicated connection proxy (analogous to PgBouncer for Postgres) deployed as a separate, long-running service might be necessary. Furthermore, the current implementation lacks a proactive health check; a stale connection would only be discovered when a query fails. A future iteration could add a lightweight background timer inside the Lambda (if using provisioned concurrency) or a check-on-use policy that validates the connection's health before getClient returns it, as sketched below.
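
As a rough illustration of that check-on-use idea, here is a hypothetical wrapper around the manager; the getVerifiedClient name and the STALE_AFTER_MS budget are assumptions for this sketch, not part of the implementation above.

// get-verified-client.ts (hypothetical sketch of a check-on-use policy)
import { MilvusClient } from "@zilliz/milvus2-sdk-node";
import { milvusManager } from "./milvus-connection-manager";

const STALE_AFTER_MS = 60_000; // assumed staleness budget
let lastVerifiedAt = 0;        // module scope: persists across warm invocations

export async function getVerifiedClient(): Promise<MilvusClient> {
  const client = await milvusManager.getClient();

  if (Date.now() - lastVerifiedAt > STALE_AFTER_MS) {
    // Cheap liveness probe before handing the client back to the handler.
    const health = await client.checkHealth();
    if (!health.isHealthy) {
      throw new Error("Milvus connection is stale or unhealthy.");
    }
    lastVerifiedAt = Date.now();
  }
  return client;
}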

