The initial performance metrics for our semantic search endpoint were unacceptable. Deployed as a Node.js function on AWS Lambda, serving requests that required vector similarity searches against a Milvus collection, P99 latency was hitting several seconds under cold-start conditions. The root cause was immediately obvious to anyone who has worked with serverless functions and stateful backends: we were establishing a new database connection inside the handler, so every cold start (and, as it turned out, every invocation) paid the full price of TCP handshakes, authentication, and session setup with the Milvus cluster. This is a classic anti-pattern in serverless design.
Our first, naive implementation looked something like this, a direct and straightforward piece of code that would work fine in a long-running process but is fundamentally flawed for a Lambda environment.
// src/handler_v1.ts
// DO NOT USE THIS IN PRODUCTION. THIS IS THE FLAWED INITIAL APPROACH.
import { MilvusClient } from "@zilliz/milvus2-sdk-node";
import { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";
const MILVUS_ADDRESS = process.env.MILVUS_ADDRESS!;
const COLLECTION_NAME = "document_embeddings";
export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
if (!event.body) {
return { statusCode: 400, body: JSON.stringify({ error: "Request body is missing" }) };
}
try {
const { queryVector } = JSON.parse(event.body);
if (!queryVector || !Array.isArray(queryVector)) {
return { statusCode: 400, body: JSON.stringify({ error: "Invalid query vector" }) };
}
// Anti-pattern: Creating a new client on every single invocation.
// This adds hundreds of milliseconds, or even seconds, to every cold start.
console.log("Creating new Milvus client...");
const milvusClient = new MilvusClient(MILVUS_ADDRESS);
console.log("Milvus client created. Connecting...");
// This is the primary source of latency.
await milvusClient.loadCollection({ collection_name: COLLECTION_NAME });
console.log("Collection loaded.");
const searchParams = {
collection_name: COLLECTION_NAME,
expr: "doc_type == 'public'",
vectors: [queryVector],
search_params: {
anns_field: "embedding",
topk: "5",
metric_type: "L2",
params: JSON.stringify({ nprobe: 10 }),
},
output_fields: ["doc_id", "source"],
vector_type: 101, // DataType.FloatVector (our embeddings are float vectors searched with L2)
};
const searchResults = await milvusClient.search(searchParams);
// Equally problematic: No explicit close/release.
// Relies on the Lambda environment cleanup, which can lead to connection leaks or exhaustion on the Milvus side.
return {
statusCode: 200,
body: JSON.stringify(searchResults),
};
} catch (error) {
console.error("Failed to execute Milvus search:", error);
return {
statusCode: 500,
body: JSON.stringify({ error: "Internal server error during search." }),
};
}
};
The problem is clear. The Lambda execution model freezes the container after an invocation and can reuse it for subsequent requests (a “warm start”). Any objects or state declared in the global scope, outside the handler function, persist across these warm invocations. By placing the `MilvusClient` instantiation inside the handler, we were forfeiting this crucial optimization and forcing a new connection on every invocation, warm or cold.
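To make the execution-context behavior concrete, here is a minimal, hypothetical illustration (not part of the service code): module-scope state survives warm starts on the same container, while anything created inside the handler is rebuilt on every invocation.
// Hypothetical illustration of execution-context reuse; not part of the search service.
import { APIGatewayProxyResult } from "aws-lambda";
let invocationCount = 0; // module scope: initialized once per container (cold start), then reused
export const handler = async (): Promise<APIGatewayProxyResult> => {
  invocationCount += 1; // keeps counting across warm starts; resets to 1 only in a fresh container
  const perRequestTimestamp = Date.now(); // handler scope: recreated on every invocation
  return {
    statusCode: 200,
    body: JSON.stringify({ invocationCount, perRequestTimestamp }),
  };
};
Invoking this function repeatedly in quick succession returns an increasing invocationCount from reused containers, which is exactly the behavior the connection manager below exploits.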
The first step to fixing this is hoisting the client instantiation out of the handler. This is a common and necessary practice for any database client in a serverless function.
// A slightly better, but still incomplete, approach.
import { MilvusClient } from "@zilliz/milvus2-sdk-node";
const MILVUS_ADDRESS = process.env.MILVUS_ADDRESS!;
const milvusClient = new MilvusClient(MILVUS_ADDRESS); // Hoisted to global scope
// ... handler implementation ...
This improves the situation for warm starts but introduces new complexities. When is the connection actually established? What if the first invocation fails mid-connection? What if concurrent invocations hit a new container simultaneously? A simple global variable isn’t robust enough. It doesn’t manage the connection state. We need a dedicated connection manager that ensures only one connection attempt occurs and that subsequent requests can reliably await and reuse the established connection.
This led to the development of a singleton connection manager. Its responsibilities are:
- Encapsulate a single `MilvusClient` instance.
- Track the connection state (`DISCONNECTED`, `CONNECTING`, `CONNECTED`).
- Ensure that if multiple concurrent invocations trigger a connection, only the first one performs the connection logic, while the others wait for the result of that attempt.
- Provide a simple `getClient()` method that abstracts this logic away from the handler.
Here is the production-ready implementation of `milvus-connection-manager.ts`.
// src/milvus-connection-manager.ts
import { MilvusClient } from "@zilliz/milvus2-sdk-node";
import { promisify } from "util";
// Using a timer to add a timeout to our connection logic.
const sleep = promisify(setTimeout);
enum ConnectionStatus {
DISCONNECTED,
CONNECTING,
CONNECTED,
ERROR,
}
class MilvusConnectionManager {
private static instance: MilvusConnectionManager;
private client: MilvusClient;
private status: ConnectionStatus = ConnectionStatus.DISCONNECTED;
private connectionPromise: Promise<MilvusClient> | null = null;
private constructor() {
const milvusAddress = process.env.MILVUS_ADDRESS;
if (!milvusAddress) {
console.error("FATAL: MILVUS_ADDRESS environment variable is not set.");
this.status = ConnectionStatus.ERROR;
// In a real project, you might throw here to fail the container initialization entirely.
this.client = null as any; // getClient() will reject because the status is ERROR
} else {
console.log(`Initializing Milvus client for address: ${milvusAddress}`);
this.client = new MilvusClient(milvusAddress);
}
}
public static getInstance(): MilvusConnectionManager {
if (!MilvusConnectionManager.instance) {
MilvusConnectionManager.instance = new MilvusConnectionManager();
}
return MilvusConnectionManager.instance;
}
private async connect(): Promise<MilvusClient> {
if (this.status === ConnectionStatus.ERROR) {
throw new Error("Milvus client is in an unrecoverable error state.");
}
this.status = ConnectionStatus.CONNECTING;
console.log("Attempting to connect to Milvus...");
try {
// The Milvus Node SDK doesn't have an explicit connect() method.
// Operations like checkHealth() or loadCollection() implicitly connect.
// We'll use checkHealth as a lightweight connection verifier.
const healthCheck = await Promise.race([
this.client.checkHealth(),
sleep(5000).then(() => { throw new Error("Milvus connection timed out after 5 seconds"); })
]);
if (!healthCheck.isHealthy) {
throw new Error("Milvus reported as unhealthy.");
}
console.log("Successfully connected to Milvus and it is healthy.");
this.status = ConnectionStatus.CONNECTED;
this.connectionPromise = null; // Clear the promise, subsequent calls will get the client directly.
return this.client;
} catch (error) {
console.error("Failed to connect to Milvus:", error);
this.status = ConnectionStatus.DISCONNECTED;
this.connectionPromise = null; // Clear the promise to allow for retry on next invocation.
throw error; // Propagate the error to the caller (the Lambda handler).
}
}
public async getClient(): Promise<MilvusClient> {
switch (this.status) {
case ConnectionStatus.CONNECTED:
// If we are already connected, return the client immediately.
return this.client;
case ConnectionStatus.CONNECTING:
// If a connection attempt is already in progress (e.g., from a concurrent invocation),
// wait for it to complete instead of starting a new one.
console.log("Connection in progress, awaiting result...");
// The non-null assertion is safe here because connectionPromise is always set when status is CONNECTING.
return this.connectionPromise!;
case ConnectionStatus.DISCONNECTED:
// If disconnected, initiate the connection.
console.log("Client is disconnected. Initiating new connection...");
// Store the promise so concurrent callers can await the same connection attempt.
this.connectionPromise = this.connect();
return this.connectionPromise;
case ConnectionStatus.ERROR:
// If in an error state (e.g., missing config), fail immediately.
throw new Error("Milvus client is in an error state.");
}
}
}
export const milvusManager = MilvusConnectionManager.getInstance();
The key piece of logic is the handling of the `CONNECTING` state. By storing the `Promise` returned by the `connect()` method, any concurrent calls to `getClient()` while the first is in flight will simply await that existing promise. This prevents a thundering herd of connection attempts against the Milvus cluster from a single, newly spawned Lambda container handling its first burst of traffic.
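To see this de-duplication in action, here is a short illustrative snippet (hypothetical, and assuming MILVUS_ADDRESS points at a reachable Milvus instance): two simultaneous getClient() calls resolve to the same client, and Milvus receives only one health check.
// Illustrative only: concurrent getClient() calls share one in-flight connection attempt.
import { milvusManager } from "./milvus-connection-manager";
async function demoConcurrentAccess(): Promise<void> {
  const [clientA, clientB] = await Promise.all([
    milvusManager.getClient(), // initiates the connection and stores the promise
    milvusManager.getClient(), // observes CONNECTING and awaits the same promise
  ]);
  console.log("Same client instance:", clientA === clientB); // logs: true
}
demoConcurrentAccess().catch(console.error);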
The refactored handler now becomes much cleaner and is decoupled from the connection management logic.
// src/handler_v2.ts
import { milvusManager } from "./milvus-connection-manager";
import { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";
const COLLECTION_NAME = process.env.COLLECTION_NAME || "document_embeddings";
let isCollectionLoaded = false; // State to avoid reloading the collection on every warm invocation
export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
if (!event.body) {
return { statusCode: 400, body: JSON.stringify({ error: "Request body is missing" }) };
}
try {
const { queryVector } = JSON.parse(event.body);
if (!queryVector || !Array.isArray(queryVector)) {
return { statusCode: 400, body: JSON.stringify({ error: "Invalid query vector" }) };
}
// The handler now only cares about getting a ready-to-use client.
const milvusClient = await milvusManager.getClient();
// Optimization: Only load the collection into memory if it hasn't been done
// in this execution environment yet.
if (!isCollectionLoaded) {
console.log(`Loading collection: ${COLLECTION_NAME}`);
await milvusClient.loadCollection({ collection_name: COLLECTION_NAME });
isCollectionLoaded = true;
console.log("Collection loaded successfully.");
}
const searchParams = {
collection_name: COLLECTION_NAME,
expr: "doc_type == 'public'",
vectors: [queryVector],
search_params: {
anns_field: "embedding",
topk: "5",
metric_type: "L2",
params: JSON.stringify({ nprobe: 10 }),
},
output_fields: ["doc_id", "source"],
vector_type: 101, // DataType.FloatVector
};
const searchResults = await milvusClient.search(searchParams);
return {
statusCode: 200,
body: JSON.stringify(searchResults.results),
};
} catch (error: any) {
console.error("Error in handler:", error.message, error.stack);
// If the error was a connection failure, we might want to reset the manager's state
// to allow a fresh connection attempt on the next invocation. This is an advanced
// resiliency pattern (sketched after this handler), but for now, we'll let it fail
// and rely on a new container.
return {
statusCode: 500,
body: JSON.stringify({ error: "Internal server error." }),
};
}
};
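If you do want that reset, one possible sketch is shown below. The markDisconnected() method is hypothetical and not part of the manager above, and the string match on the error message is a deliberately naive placeholder for real error classification.
// Hypothetical addition inside MilvusConnectionManager (not in the version shown above):
// discard the cached connection state so the next getClient() call reconnects.
public markDisconnected(): void {
  this.status = ConnectionStatus.DISCONNECTED;
  this.connectionPromise = null;
}
// The handler's catch block could then do something like (illustrative only):
// if (String(error.message).includes("UNAVAILABLE")) {
//   milvusManager.markDisconnected(); // the next invocation on this container reconnects
// }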
This architecture solves the connection latency problem. However, another performance bottleneck in serverless environments is code package size. A typical Node.js project involves zipping the entire `node_modules` directory, which can easily be tens or hundreds of megabytes. During a cold start, Lambda must download and unzip this package before it can even start the Node.js runtime. This file I/O is a significant, and often overlooked, contributor to latency.
This is where the Rome toolchain became critical for our project. Rome is an all-in-one frontend toolchain that includes a compiler, linter, formatter, and, most importantly for this use case, a bundler. We can use Rome to transpile our TypeScript and bundle it with all its dependencies into a single, minified JavaScript file.
Our project structure:
.
├── rome.json
├── package.json
├── tsconfig.json
└── src
├── handler_v2.ts
└── milvus-connection-manager.ts
The `rome.json` configuration is minimal. We just need to tell it to enable the bundler.
// rome.json
{
"$schema": "./node_modules/rome/configuration_schema.json",
"organizeImports": {
"enabled": true
},
"linter": {
"enabled": true,
"rules": {
"recommended": true
}
},
"bundler": {
"enabled": true
}
}
And the build script in `package.json`:
// package.json
{
"name": "lambda-milvus-rome",
"version": "1.0.0",
"scripts": {
"build": "rome bundle src/handler_v2.ts --out-dir dist --sourcemap-option hidden",
"package": "npm run build && cd dist && zip -r ../lambda-package.zip . && cd .."
},
"dependencies": {
"@zilliz/milvus2-sdk-node": "^2.2.16"
},
"devDependencies": {
"@types/aws-lambda": "^8.10.119",
"rome": "^12.1.3",
"typescript": "^5.1.6"
}
}
Running `npm run package` now produces a `lambda-package.zip` containing a single `handler_v2.js` file and its sourcemap. The size of this zip file is typically less than a megabyte, compared to the 50MB+ of the unbundled `node_modules` directory. This drastically reduces the cold start I/O overhead.
The difference in the invocation flow can be visualized.
The naive approach:
sequenceDiagram
    participant Client
    participant API Gateway
    participant Lambda (Cold)
    participant Milvus
    Client->>API Gateway: POST /search
    API Gateway->>Lambda (Cold): Invoke
    Note over Lambda (Cold): Download & Unzip large package
    Note over Lambda (Cold): Start Node.js runtime
    Lambda (Cold)->>Lambda (Cold): **new MilvusClient()**
    Lambda (Cold)->>Milvus: **Establish TCP Connection**
    Milvus-->>Lambda (Cold): Connection OK
    Lambda (Cold)->>Milvus: search()
    Milvus-->>Lambda (Cold): Search Results
    Lambda (Cold)-->>API Gateway: Response
    API Gateway-->>Client: Response (High Latency)
The optimized approach with connection management and bundling:
sequenceDiagram
    participant Client
    participant API Gateway
    participant Lambda (Cold)
    participant Lambda (Warm)
    participant Milvus
    %% Cold Start Invocation
    Client->>API Gateway: POST /search (Request 1)
    API Gateway->>Lambda (Cold): Invoke
    Note over Lambda (Cold): Download & Unzip small bundle
    Note over Lambda (Cold): Start Node.js runtime
    Note over Lambda (Cold): `milvusManager` initializes
    Lambda (Cold)->>Milvus: **Establish TCP Connection (once)**
    Milvus-->>Lambda (Cold): Connection OK
    Lambda (Cold)->>Milvus: search()
    Milvus-->>Lambda (Cold): Search Results
    Lambda (Cold)-->>API Gateway: Response
    API Gateway-->>Client: Response (Moderate Latency)
    %% Warm Start Invocation
    Client->>API Gateway: POST /search (Request 2)
    API Gateway->>Lambda (Warm): Invoke (reuses container)
    Note over Lambda (Warm): Execution context is reused
    Note over Lambda (Warm): `milvusManager` has a CONNECTED client
    Lambda (Warm)->>Milvus: search()
    Milvus-->>Lambda (Warm): Search Results
    Lambda (Warm)-->>API Gateway: Response
    API Gateway-->>Client: Response (Low Latency)
This final architecture—combining a stateful connection manager that leverages the Lambda execution context with a build process optimized for minimal package size using Rome—is robust and performant. It balances the operational benefits of serverless with the performance requirements of stateful, low-latency database interactions.
The solution is not without its limitations. The connection manager, while handling concurrent invocations within a single container, does not solve for broader connection pooling across the entire fleet of Lambda containers. If traffic scales to hundreds of concurrent containers, this could still result in hundreds of connections to Milvus. In such extreme-scale scenarios, a dedicated connection proxy service (like PgBouncer for Postgres) deployed as a separate, long-running service might be necessary. Furthermore, the current implementation lacks a proactive health check; a stale connection would only be discovered upon a failed query. A future iteration could involve a lightweight background timer within the Lambda (if using provisioned concurrency) or a more sophisticated check-on-use policy within the `getClient()` method to validate the connection's health before returning it.
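As a sketch of that check-on-use idea, the field and method below could be added to MilvusConnectionManager. They are hypothetical, not part of the implementation above, and assume checkHealth() is cheap enough to run whenever the cached connection has not been verified recently.
// Hypothetical check-on-use additions to MilvusConnectionManager (not in the version above).
private lastVerifiedAt = 0; // epoch millis of the last successful health check
public async getVerifiedClient(maxAgeMs = 60_000): Promise<MilvusClient> {
  const client = await this.getClient();
  // Only re-verify if the connection has not been checked within maxAgeMs.
  if (Date.now() - this.lastVerifiedAt > maxAgeMs) {
    const health = await client.checkHealth();
    if (!health.isHealthy) {
      // Drop the cached state so the next invocation re-establishes the connection.
      this.status = ConnectionStatus.DISCONNECTED;
      this.connectionPromise = null;
      throw new Error("Cached Milvus connection failed its health check.");
    }
    this.lastVerifiedAt = Date.now();
  }
  return client;
}
The handler would then call getVerifiedClient() instead of getClient(), trading an occasional extra round trip for earlier detection of stale connections.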