Engineering a Cost-Effective Serverless NLP Inference Endpoint with spaCy, Kong, and AWS Lambda


The mandate was straightforward: expose a Named Entity Recognition (NER) capability as a new internal microservice. The usage pattern, however, was anything but. We anticipated intense, sporadic bursts of traffic from batch processing jobs, followed by long periods of complete inactivity. A fleet of always-on EC2 instances running a spaCy model felt like burning money. The obvious architectural path led to serverless with AWS Lambda, leveraging its pay-per-invocation model. This decision, while correct on paper, led us down a rabbit hole of non-trivial engineering challenges revolving around deployment size, cold start latency, and cost management. This is the log of how we built a production-ready, serverless NLP endpoint, fronting it with our existing Kong API Gateway.

Our initial technology selection was firm. AWS Lambda for compute, as its event-driven nature was a perfect fit for the unpredictable workload. spaCy, specifically the en_core_web_lg model, for its balance of performance and accuracy. And Kong, because it was already the gatekeeper for our entire microservices ecosystem, providing unified authentication, routing, and rate-limiting. The problem was never what to use, but how to make these components work together under the severe constraints of a serverless environment.

The First Wall: The 250MB Deployment Package Limit

The first and most immediate obstacle was Lambda’s hard limit on deployment package size. A standard Python project setup for this task would look something like this:

# Initial project structure
.
├── lambda_function
│   ├── app.py
│   └── requirements.txt
└── scripts
    └── build.sh

With a simple requirements.txt:

# requirements.txt
spacy==3.5.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0.tar.gz

And a build script to package it:

# scripts/build.sh
#!/bin/bash
set -e

# Clean up previous build
rm -rf package deployment.zip

# Create packaging directory
mkdir -p package

# Install dependencies
pip install --target ./package -r lambda_function/requirements.txt

# Copy application code
cp lambda_function/app.py ./package/

# Create the deployment zip
cd package
zip -r9 ../deployment.zip .
cd ..

# Report size
echo "Deployment package size: $(ls -lh deployment.zip | awk '{print $5}')"

Running this script yielded a deployment.zip well over 500MB; even compressed, that is more than double the 250MB limit AWS Lambda imposes on the unzipped package. Our first attempt at a solution was naive: manually pruning the installed spacy and model directories, removing documentation, tests, and unused components. This was a dead end. It made the build process incredibly brittle, and any library update would require a painful, manual re-evaluation of what could be safely deleted. A common mistake is to invest too much time in these fragile micro-optimizations.
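For the record, the pruning looked roughly like this (illustrative paths only; the exact set of deletions shifted with every library release, which is exactly why it was so fragile):

# Fragile pruning we experimented with and abandoned (illustrative only)
find package -type d -name "tests" -prune -exec rm -rf {} +
find package -type d -name "__pycache__" -prune -exec rm -rf {} +
# Deleting *.dist-info saves little and can break packages that rely on importlib.metadata
rm -rf package/*.dist-info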

The correct, production-grade solution was to leverage Lambda Layers for the dependencies and a more robust storage solution for the model itself. However, even with spacy moved to a layer, the en_core_web_lg model alone is over 500MB. The real bottleneck wasn’t the library; it was the model artifact.

This forced a fundamental rethink. The model data doesn’t belong in the deployment package. It’s a static asset. The proper tool for this job in the Lambda ecosystem is Amazon EFS (Elastic File System). By mounting an EFS volume to our Lambda function, we could treat the model as if it were on a local disk, completely bypassing all package size limitations.

This required a more sophisticated infrastructure setup, which we managed with Terraform.

# infra/main.tf

provider "aws" {
  region = "us-east-1"
}

# Networking
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}

resource "aws_security_group" "lambda_sg" {
  name   = "lambda-efs-sg"
  vpc_id = aws_vpc.main.id
}

resource "aws_security_group" "efs_sg" {
  name   = "efs-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port       = 2049
    to_port         = 2049
    protocol        = "tcp"
    security_groups = [aws_security_group.lambda_sg.id]
  }
}

# EFS for spaCy model storage
resource "aws_efs_file_system" "spacy_models" {
  creation_token = "spacy-models-fs"
  tags = {
    Name = "SpacyModels"
  }
}

resource "aws_efs_mount_target" "mount" {
  count           = 2
  file_system_id  = aws_efs_file_system.spacy_models.id
  subnet_id       = aws_subnet.private[count.index].id
  security_groups = [aws_security_group.efs_sg.id]
}

resource "aws_efs_access_point" "spacy_ap" {
  file_system_id = aws_efs_file_system.spacy_models.id
  posix_user {
    gid = 1001
    uid = 1001
  }
  root_directory {
    path = "/models"
    creation_info {
      owner_gid   = 1001
      owner_uid   = 1001
      permissions = "750"
    }
  }
}

# IAM Role for Lambda
resource "aws_iam_role" "lambda_exec_role" {
  name = "lambda_spacy_ner_role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Action = "sts:AssumeRole",
      Effect = "Allow",
      Principal = {
        Service = "lambda.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "lambda_policy" {
  role       = aws_iam_role.lambda_exec_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

resource "aws_iam_role_policy_attachment" "efs_policy" {
  role       = aws_iam_role.lambda_exec_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonElasticFileSystemClientFullAccess"
}

# Lambda Function
resource "aws_lambda_function" "spacy_ner" {
  function_name = "spacy-ner-service"
  role          = aws_iam_role.lambda_exec_role.arn
  handler       = "app.handler"
  runtime       = "python3.9"
  memory_size   = 2048 # spaCy models need significant memory
  timeout       = 30

  # Code is zipped and uploaded separately, e.g. via S3
  s3_bucket = "my-lambda-code-bucket"
  s3_key    = "spacy_ner/deployment.zip"

  vpc_config {
    subnet_ids         = aws_subnet.private[*].id
    security_group_ids = [aws_security_group.lambda_sg.id]
  }

  file_system_config {
    arn              = aws_efs_access_point.spacy_ap.arn
    local_mount_path = "/mnt/models"
  }

  environment {
    variables = {
      SPACY_MODEL_PATH = "/mnt/models/en_core_web_lg-3.5.0"
    }
  }

  depends_on = [aws_efs_mount_target.mount]
}

This infrastructure code defines a VPC, security groups for the Lambda and EFS to communicate, the EFS volume itself, an access point for proper permissions, and finally the Lambda function configured to mount the EFS volume at /mnt/models.
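One detail not shown above is the HTTP endpoint Kong will eventually call. We exposed the function via a Lambda Function URL, which in Terraform looks roughly like the following sketch (file and resource names are illustrative):

# infra/function_url.tf (sketch; resource names are illustrative)

resource "aws_lambda_function_url" "spacy_ner_url" {
  function_name      = aws_lambda_function.spacy_ner.function_name
  # Kong enforces authentication and rate limiting in front of this URL;
  # "AWS_IAM" would add defense in depth at the cost of request signing in Kong.
  authorization_type = "NONE"
}

output "ner_function_url" {
  # This value is what the Kong service definition points at
  value = aws_lambda_function_url.spacy_ner_url.function_url
}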

The model now needed to be placed onto the EFS volume. This is a one-time setup task, which we handled from a small EC2 instance with the EFS volume temporarily mounted, using a simple Python script.
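Mounting the volume on that instance went through the access point using the amazon-efs-utils mount helper, roughly as follows (the file system and access point IDs are placeholders, and this assumes the instance's security group is permitted to reach the mount targets on port 2049):

# One-time mount on the provisioning EC2 instance (IDs are placeholders)
sudo yum install -y amazon-efs-utils
sudo mkdir -p /efs/models
# Mounting through the access point lands us at its /models root with the correct POSIX ownership
sudo mount -t efs -o tls,accesspoint=fsap-xxxxxxxxxxxxxxxxx fs-xxxxxxxxxxxxxxxxx:/ /efs/models

With the mount in place, the provisioning script below installs the model package and serializes it onto the volume.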

# scripts/provision_efs_model.py
import importlib
import os

import spacy
from spacy.cli import download

# This path corresponds to the mount point on the EC2 instance
EFS_MOUNT_PATH = "/efs/models"

MODEL_NAME = "en_core_web_lg"
MODEL_PATH = os.path.join(EFS_MOUNT_PATH, f"{MODEL_NAME}-3.5.0")


def main():
    if os.path.exists(MODEL_PATH):
        print(f"Model already exists at {MODEL_PATH}. Skipping download.")
        return

    print(f"Downloading model '{MODEL_NAME}'...")
    # Installs the model package into the instance's Python environment
    download(MODEL_NAME)
    # Refresh the import system's caches so the freshly installed package is importable
    importlib.invalidate_caches()

    print(f"Writing model to EFS at {MODEL_PATH}...")
    # Serialize the pipeline onto the EFS mount so the Lambda can load it
    # directly with spacy.load(MODEL_PATH)
    nlp = spacy.load(MODEL_NAME)
    nlp.to_disk(MODEL_PATH)
    print("Model provisioning complete.")


if __name__ == "__main__":
    main()

With the model on EFS, the Lambda deployment package was now trivial, containing only our application code and a requirements.txt with just spacy. This easily fit within the limits.
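The trimmed dependency file is barely worth showing, but for completeness (same spacy pin as before):

# requirements.txt (after moving the model to EFS)
spacy==3.5.0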

The Second Wall: The Cold Start Catastrophe

We solved the size problem, but immediately hit the next one: performance. A “cold start” refers to the first invocation of a Lambda function after a period of inactivity. For our service, a cold start took between 10 and 15 seconds. This was unacceptable.

The latency comes from several sources: AWS provisioning the container, initializing the Python runtime, and—the main culprit—our code loading the large spaCy model from EFS into memory. The line nlp = spacy.load(MODEL_PATH) was responsible for the vast majority of this delay.

A common mistake here is to place the model loading logic inside the Lambda handler.

# ANTI-PATTERN: DO NOT DO THIS
import spacy
import os

def handler(event, context):
    # Model is loaded on every single invocation. Terrible for performance.
    model_path = os.environ["SPACY_MODEL_PATH"]
    nlp = spacy.load(model_path)
    # ... process text
    return result

The correct approach is to leverage the Lambda execution context. Objects defined in the global scope are initialized once when the container starts and are reused across subsequent invocations for that “warm” instance.

Our refined app.py looked like this:

# lambda_function/app.py
import os
import json
import time
import logging

# Set up logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# --- CRITICAL SECTION: Global Scope Initialization ---
# This code runs ONLY during a cold start.
# It initializes the model and stores it in the global scope.
MODEL_PATH = os.environ.get("SPACY_MODEL_PATH")
if not MODEL_PATH:
    raise ValueError("SPACY_MODEL_PATH environment variable not set.")

logger.info(f"Starting model load from path: {MODEL_PATH}...")
model_load_start_time = time.time()

try:
    import spacy
    # The heavy lifting happens here. This is the source of cold start latency.
    nlp_model = spacy.load(MODEL_PATH)
    model_load_end_time = time.time()
    logger.info(f"Model loaded successfully in {model_load_end_time - model_load_start_time:.2f} seconds.")
except Exception as e:
    logger.error(f"Failed to load spaCy model: {e}")
    nlp_model = None
# --- END CRITICAL SECTION ---


def handler(event, context):
    """
    Main Lambda handler function.
    Processes text from the event body to perform Named Entity Recognition.
    """
    handler_start_time = time.time()
    
    # Check if the model failed to load during initialization
    if nlp_model is None:
        return {
            "statusCode": 500,
            "body": json.dumps({"error": "Model is not available. Check logs for initialization errors."})
        }

    try:
        # A common mistake is to not validate the input format rigorously.
        # In a real-world project, this is where you'd have robust input validation.
        body = json.loads(event.get("body") or "{}")
        text = body.get("text")

        if not text or not isinstance(text, str):
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "Missing or invalid 'text' field in request body."})
            }

        logger.info("Processing text with spaCy model...")
        doc = nlp_model(text)
        
        entities = [{
            "text": ent.text,
            "label": ent.label_,
            "start_char": ent.start_char,
            "end_char": ent.end_char
        } for ent in doc.ents]
        
        handler_end_time = time.time()
        logger.info(f"Handler execution time: {(handler_end_time - handler_start_time)*1000:.2f} ms")

        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"entities": entities})
        }

    except json.JSONDecodeError:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "Invalid JSON in request body."})
        }
    except Exception as e:
        logger.error(f"An unexpected error occurred: {e}", exc_info=True)
        return {
            "statusCode": 500,
            "body": json.dumps({"error": "Internal server error."})
        }

This change ensured the expensive spacy.load() operation happens only once per container lifecycle. A warm invocation was now consistently under 200ms. However, this didn’t eliminate the initial 10-15 second penalty.

To solve that, we had to make a direct trade-off between cost and latency by using Provisioned Concurrency. This AWS feature keeps a specified number of Lambda instances initialized and ready to serve requests, effectively eliminating cold starts for traffic that fits within that concurrency limit.

The trade-off is financial. You are no longer purely “pay-per-invocation”; you are paying an hourly fee to keep those instances warm, whether they receive traffic or not. We configured a provisioned concurrency of 5, which was enough to handle the initial burst from our batch jobs. This was a business decision: the cost was justified by the improved user experience.
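For reference, wiring this up in Terraform looks roughly like the following sketch; resource names are illustrative, and it assumes publish = true is added to the function resource so there is a published version to attach the alias to:

# infra/provisioned_concurrency.tf (sketch; assumes `publish = true` on aws_lambda_function.spacy_ner)

resource "aws_lambda_alias" "live" {
  name             = "live"
  function_name    = aws_lambda_function.spacy_ner.function_name
  function_version = aws_lambda_function.spacy_ner.version
}

resource "aws_lambda_provisioned_concurrency_config" "ner_warm" {
  function_name                     = aws_lambda_function.spacy_ner.function_name
  qualifier                         = aws_lambda_alias.live.name
  provisioned_concurrent_executions = 5
}

Traffic only benefits if it actually hits the alias; whatever invokes the function (the Function URL, in our case, via its qualifier) must target the same alias rather than $LATEST.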

The Third Wall: Secure and Managed Exposure via Kong

With a functional, performant Lambda, the final piece was exposing it through our Kong API Gateway. We needed to ensure it was discoverable, secured with an API key, and protected from abuse with rate limiting. The pitfall here is connecting Kong directly to the raw Lambda Function URL without proper controls.

We manage Kong declaratively. The configuration for our new service was added to our central kong.yaml.

# kong/kong.yaml

_format_version: "2.1"

services:
- name: nlp-ner-service
  # This URL comes from the AWS Lambda Function URL configuration
  url: https://<lambda-function-url-id>.lambda-url.us-east-1.on.aws/
  connect_timeout: 15000 # Increased timeout to handle potential cold starts
  read_timeout: 30000
  retries: 2
  routes:
  - name: nlp-ner-route
    paths:
    - /nlp/ner
    strip_path: true
    methods:
    - POST

  plugins:
  # Plugin 1: Authentication. Only clients with a valid key can access.
  - name: key-auth
    config:
      key_names:
      - apikey
      run_on_preflight: false

  # Plugin 2: Rate Limiting. A crucial cost control mechanism.
  # Prevents a runaway client from causing a massive Lambda bill.
  - name: rate-limiting
    config:
      minute: 100
      policy: local

# Create a consumer and API key for a client application
consumers:
- username: batch-processor
  keyauth_credentials:
  - key: <generated-secure-api-key>

This configuration does three critical things:

  1. Service & Route: It maps the public path /nlp/ner on our Kong gateway to the backend AWS Lambda Function URL.
  2. Authentication (key-auth): It mandates that every request must contain a valid apikey header. Requests without it are rejected by Kong with a 401 Unauthorized before they ever reach our Lambda, saving us an invocation cost.
  3. Rate Limiting (rate-limiting): It prevents any single client from hammering the service with more than 100 requests per minute. This is not just a performance safeguard; for serverless architectures, it is a primary cost control mechanism.
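A client request through the gateway then looks like this (hypothetical gateway host; the key is whatever was provisioned for the batch-processor consumer):

# Example call through Kong (hypothetical host and key)
curl -s -X POST https://api.internal.example.com/nlp/ner \
  -H "apikey: <generated-secure-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"text": "Apple is looking at buying a U.K. startup for $1 billion."}'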

The final architecture can be visualized as follows:

sequenceDiagram
    participant Client
    participant KongGateway as Kong API Gateway
    participant AWSLambda as AWS Lambda (NER Service)
    participant AWSEFS as Amazon EFS

    Client->>+KongGateway: POST /nlp/ner (with API Key)
    KongGateway->>KongGateway: Verify API Key & Check Rate Limit
    Note over KongGateway: Authentication & Rate Limiting Plugins
    KongGateway->>+AWSLambda: Forward Request
    Note over AWSLambda: Cold Start? (Load model from EFS)
    AWSLambda-->>AWSEFS: Read spaCy model
    AWSEFS-->>AWSLambda: Model data
    AWSLambda->>AWSLambda: Perform NER
    AWSLambda-->>-KongGateway: 200 OK (JSON with entities)
    KongGateway-->>-Client: 200 OK (JSON with entities)

The result is a robust service endpoint. It scales automatically, is cost-effective for its sporadic usage pattern, and is securely integrated into our existing infrastructure.

Lingering Issues and Applicability Boundaries

This architecture, while effective for our specific use case, is not a silver bullet. The use of Provisioned Concurrency introduces a fixed monthly cost, eroding some of the pure “pay-per-use” benefit of serverless. At a certain threshold of sustained traffic, the cost model would flip, making a container-based solution on Fargate or even a small, auto-scaling EC2 group more economical.

Furthermore, EFS introduces its own complexities. We are using the default “Bursting” throughput mode, which is sufficient for our needs. A high-traffic service might require “Provisioned” throughput, adding another layer of cost and configuration. There is also a minor latency overhead for the initial model read from the network-attached storage compared to a model packaged directly with the function.
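If provisioned throughput ever becomes necessary, the change is at least contained to the file system resource, roughly as follows (a sketch; the MiB/s value is illustrative):

# Sketch: switching the EFS volume to provisioned throughput (value illustrative)
resource "aws_efs_file_system" "spacy_models" {
  creation_token                  = "spacy-models-fs"
  throughput_mode                 = "provisioned"
  provisioned_throughput_in_mibps = 64
  tags = {
    Name = "SpacyModels"
  }
}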

A potential future optimization path is to investigate model quantization or distillation. Creating a smaller, custom-trained spaCy model could potentially shrink it enough to fit within a Lambda Layer, removing the need for EFS entirely. This would simplify the architecture and could further reduce cold start times. The trade-off, of course, is a potential drop in model accuracy and the significant engineering effort required for model optimization. The current solution remains the pragmatic choice, balancing performance, cost, and operational complexity for our bursty workload.
