Taming Cold Starts in AWS Lambda for a Production Spring Boot Clean Architecture


The initial deployment was a failure. A production mandate to leverage our team’s deep expertise in Spring Boot and our established Clean Architecture patterns on a new serverless platform seemed straightforward. The reality was a synchronous API Gateway endpoint backed by a Lambda function with a P99 cold start latency of over 12 seconds. For a user-facing API, this was unacceptable. The core tension was immediately apparent: the feature-rich, reflection-heavy context initialization of Spring Boot was fundamentally at odds with the ephemeral, fast-startup philosophy of AWS Lambda. This is the log of how we solved it, not with a different framework, but by bending Spring Boot to the will of the serverless environment.

Our starting point was a standard spring-cloud-function-adapter-aws implementation. The project structure adhered strictly to Clean Architecture principles: domain for core entities, application for use cases (interactors), infrastructure for data persistence and external services, and a lambda-adapter module to house the AWS handler and Spring Boot configuration.
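
To make those boundaries concrete, here is a compressed, single-file sketch of the use-case side of the design. The shapes are simplified and partly illustrative (the real project splits these across modules and uses richer value objects such as OrderId), but the point is the dependency direction: nothing below knows about Spring, AWS, or the database.

// Single-file sketch of the layering (simplified; names partly illustrative)
package com.example.application.usecase;

import java.util.UUID;

// domain: the core entity, free of framework and infrastructure types.
record Order(UUID id, String customerId) {}

// application: boundary DTOs for the use case.
record OrderInputData(String customerId) {}
record OrderOutputData(String orderId) {}

// application: outbound port, implemented by the infrastructure module.
interface OrderRepository {
    Order save(Order order);
}

// application: the interactor that the lambda-adapter module invokes.
class CreateOrderUseCase {

    private final OrderRepository orderRepository;

    CreateOrderUseCase(OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
    }

    OrderOutputData execute(OrderInputData input) {
        Order saved = orderRepository.save(new Order(UUID.randomUUID(), input.customerId()));
        return new OrderOutputData(saved.id().toString());
    }
}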

The initial Maven dependency for the adapter looked like this:

<!-- pom.xml in lambda-adapter module -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-function-adapter-aws</artifactId>
    <version>4.0.4</version>
</dependency>
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-lambda-java-events</artifactId>
    <version>3.11.0</version>
</dependency>
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-lambda-java-core</artifactId>
    <version>1.2.2</version>
</dependency>

The handler itself was a simple bridge, letting Spring Cloud Function do the heavy lifting of routing the request to the correct @Bean of type Function.

// initial/naive/OrderHandler.java
package com.example.adapter.lambda;

import org.springframework.cloud.function.adapter.aws.SpringBootApiGatewayRequestHandler;

/**
 * In the initial approach, this class is the entry point defined in the Lambda configuration.
 * It's a generic handler from Spring Cloud Function that bootstraps the entire Spring context during the cold-start init phase.
 */
public class OrderHandler extends SpringBootApiGatewayRequestHandler {
    // No code needed here, the parent class handles everything.
}

The configuration class would define the function bean.

// initial/naive/LambdaConfiguration.java
@Configuration
public class LambdaConfiguration {

    @Bean
    public Function<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> createOrder(
        CreateOrderUseCase createOrderUseCase) {
        
        return request -> {
            // Deserialization and validation logic here...
            OrderInputData input = ...;
            OrderOutputData output = createOrderUseCase.execute(input);
            // Serialization to response...
            return new APIGatewayProxyResponseEvent().withStatusCode(201).withBody(...);
        };
    }
}

Running this and viewing the CloudWatch logs was sobering. A typical cold start log entry:

REPORT RequestId: xyz... Duration: 12451.32 ms Billed Duration: 12452 ms Memory Size: 1024 MB Max Memory Used: 489 MB Init Duration: 11893.11 ms

The Init Duration of nearly 12 seconds was the killer. This phase includes the JVM starting, but the vast majority of it is Spring scanning the classpath, identifying bean definitions, resolving dependencies, and wiring the entire application context together. In a real-world project, with dozens of controllers, services, repositories, and configuration properties, this process is unavoidably slow.

Our first attempt at mitigation involved minor tweaks. We used @Lazy annotations on non-essential beans and ran mvn dependency:tree to ruthlessly prune unused dependencies from the fat JAR. This is good practice but yielded only marginal gains, shaving maybe two seconds off the total. The core problem—the synchronous, blocking nature of Spring’s context initialization—remained. The application was simply too complex to initialize in the time budget for a synchronous API.
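
For reference, the lazy-initialization tweak was as simple as it sounds. A sketch of the pattern (ReportingClient is a hypothetical stand-in for our non-essential beans):

// lambda-adapter module — deferring non-essential beans (sketch; bean is illustrative)
package com.example.adapter.lambda;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Lazy;

@Configuration
public class LazyBeansConfiguration {

    // Hypothetical stand-in for a non-essential collaborator (reporting, admin tooling, etc.).
    public static class ReportingClient { }

    /**
     * Marking the bean @Lazy defers its construction until first use, so it no longer
     * contributes to the init phase. This only helps for beans that are off the hot
     * request path; anything the first invocation touches simply moves its cost from
     * init into that request's Duration.
     */
    @Bean
    @Lazy
    public ReportingClient reportingClient() {
        return new ReportingClient();
    }
}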

The turning point was deciding to fully embrace an AWS-specific feature: Lambda SnapStart. The concept is simple but powerful. Instead of starting the JVM and running the application initialization on every cold start, SnapStart executes the full init phase once during function deployment. It then takes a snapshot of the memory and disk state of the initialized virtual machine using Firecracker’s microVM technology. Subsequent cold starts then resume from this snapshot, which is dramatically faster than starting from scratch.

This, however, is not a simple toggle switch. In a real-world project, it requires careful handling of state, especially network connections.

First, the build configuration needs to be updated. SnapStart itself is enabled on the Lambda function, not in the build, but the deployment artifact has to be a single JAR containing the application and all of its dependencies, with Spring's META-INF metadata from those dependencies merged rather than overwritten. We used the maven-shade-plugin for this. Note that the PropertiesMergingResourceTransformer referenced below ships with the spring-boot-maven-plugin, so that plugin has to be declared as a dependency of the shade plugin itself.

<!-- pom.xml: Final version with SnapStart preparation -->
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.4.1</version>
    <configuration>
        <createDependencyReducedPom>false</createDependencyReducedPom>
        <transformers>
            <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                <resource>META-INF/spring.handlers</resource>
            </transformer>
            <transformer implementation="org.springframework.boot.maven.PropertiesMergingResourceTransformer">
                <resource>META-INF/spring.factories</resource>
            </transformer>
            <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                <resource>META-INF/spring.schemas</resource>
            </transformer>
            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                <!-- 
                    This is critical for SnapStart. It tells the JVM to use a more
                    performant class data sharing (CDS) archive and enables tiered compilation
                    aggressively during the init phase, which is then captured in the snapshot.
                -->
                <mainClass>com.example.Application</mainClass>
                <manifestEntries>
                    <Spring-Boot-Layers-Index>BOOT-INF/layers.idx</Spring-Boot-Layers-Index>
                    <Lambda-SnapStart-Version>1.0</Lambda-SnapStart-Version>
                </manifestEntries>
            </transformer>
        </transformers>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Next, the Lambda function’s infrastructure-as-code (we use Terraform) must be updated to enable SnapStart. This is done on the function resource itself. A crucial detail is that SnapStart only applies to published function versions, so an alias pointing to $LATEST won’t benefit. You must publish a version and point your API Gateway to that specific version or an alias pointing to it.

# main.tf
resource "aws_lambda_function" "order_service_lambda" {
  # ... other configurations like role, memory_size, timeout

  function_name = "production-order-service"
  handler       = "org.springframework.cloud.function.adapter.aws.FunctionInvoker::handleRequest"
  runtime       = "java17"

  # ... S3 bucket for the artifact

  # Publishing a version on every code change is what makes SnapStart effective:
  # snapshots are created only for published versions, never for $LATEST.
  publish          = true
  source_code_hash = filebase64sha256("${path.module}/target/my-app-1.0.0.jar")

  snap_start {
    apply_on = "PublishedVersions"
  }
}

resource "aws_lambda_alias" "order_service_live_alias" {
  name          = "live"
  function_name = aws_lambda_function.order_service_lambda.function_name
  # Tracks the version published by the most recent deployment.
  function_version = aws_lambda_function.order_service_lambda.version
}

Notice the handler has changed. We are no longer using our custom class but Spring Cloud Function's generic FunctionInvoker, which looks up the target function from the application context. This decouples the Lambda configuration from any class name in our code; if the context exposes more than one Function bean, the one to invoke is selected with the spring.cloud.function.definition property (createOrder in our case).

With SnapStart enabled, we deployed and hit the first major pitfall. Our application used a HikariCP database connection pool. During the init phase, Spring and Hikari dutifully created the pool and established initial connections to our RDS instance. SnapStart snapshotted this state. When the function was invoked hours later, it restored from the snapshot, holding stale, dead TCP connections. The first request timed out with a SQLTransientConnectionException.

The solution is to hook into the snapshot/restore lifecycle using the CRaC (Coordinated Restore at Checkpoint) API, which SnapStart is built upon. Any bean that manages a resource like a network connection must implement the org.crac.Resource interface.

// infrastructure/db/CracAwareDataSource.java

package com.example.infrastructure.db;

import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import jakarta.annotation.PostConstruct;
import jakarta.annotation.PreDestroy;

import java.sql.Connection;

@Component
public class CracAwareDataSource implements Resource {

    private static final Logger log = LoggerFactory.getLogger(CracAwareDataSource.class);
    private final HikariDataSource delegate;

    public CracAwareDataSource(HikariDataSource delegate) {
        this.delegate = delegate;
    }

    @PostConstruct
    public void registerCracResource() {
        log.info("Registering DataSource as a CRaC resource.");
        Core.getGlobalContext().register(this);
    }
    
    // The PreDestroy annotation is good practice for graceful shutdown in non-SnapStart environments.
    @PreDestroy
    public void close() {
        log.info("Closing DataSource on context shutdown.");
        this.delegate.close();
    }

    /**
     * This method is invoked by the runtime just before the snapshot is taken. The goal
     * is to ensure no live TCP sockets are captured in the snapshot. Note that we do NOT
     * close the pool here: a closed HikariDataSource cannot be reopened, so the restored
     * function would be left without a usable DataSource. Instead we evict the pooled
     * connections; the pool object itself survives the snapshot, just empty.
     */
    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        log.info("CRaC Hook: beforeCheckpoint. Evicting pooled connections before snapshot.");
        HikariPoolMXBean pool = this.delegate.getHikariPoolMXBean();
        if (pool != null) {
            // The pool is created lazily, so it only exists if a connection was requested during init.
            pool.softEvictConnections();
        }
    }

    /**
     * This method is invoked by the runtime right after the function is restored from a
     * snapshot. Because the pool was drained rather than closed, Hikari re-establishes
     * connections on demand; borrowing one here validates connectivity and moves the
     * reconnect cost out of the first user-facing request. Crucially, we mutate the
     * existing bean instead of creating a new DataSource, which would not be the instance
     * already injected into the rest of the application.
     */
    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        log.info("CRaC Hook: afterRestore. Warming the Hikari connection pool.");
        try (Connection connection = this.delegate.getConnection()) {
            connection.isValid(5);
        }
    }
}

This required adding the CRaC dependency:

<dependency>
    <groupId>io.github.crac</groupId>
    <artifactId>org-crac</artifactId>
    <version>0.1.3</version>
</dependency>

Our database configuration now creates the HikariDataSource bean and our wrapper bean which manages its lifecycle.

// infrastructure/config/DataSourceConfiguration.java
@Configuration
public class DataSourceConfiguration {
    
    @Bean
    @ConfigurationProperties("spring.datasource.hikari")
    public HikariDataSource hikariDataSource() {
        return new HikariDataSource();
    }
    
    @Bean
    public CracAwareDataSource cracAwareDataSource(HikariDataSource hikariDataSource) {
        return new CracAwareDataSource(hikariDataSource);
    }
}

The interplay with Clean Architecture here is subtle but important. The CracAwareDataSource is purely an infrastructure concern. The application and domain layers remain blissfully unaware of SnapStart, CRaC, or even that they are running on AWS Lambda. This validates the architectural choice, as the core business logic remains portable and highly testable. We can write standard JUnit tests for our use cases, mocking the repository interfaces.

// application/usecase/CreateOrderUseCaseTest.java
@ExtendWith(MockitoExtension.class)
class CreateOrderUseCaseTest {

    @Mock
    private OrderRepository orderRepository;

    @InjectMocks
    private CreateOrderUseCase createOrderUseCase;

    @Test
    void execute_shouldSaveAndReturnOrder() {
        // Given
        OrderInputData input = new OrderInputData("customer-123", List.of(...));
        Order domainOrder = new Order(new OrderId(UUID.randomUUID()), "customer-123", ...);
        
        when(orderRepository.save(any(Order.class))).thenReturn(domainOrder);

        // When
        OrderOutputData output = createOrderUseCase.execute(input);

        // Then
        assertNotNull(output);
        assertEquals(domainOrder.getId().getValue().toString(), output.getOrderId());
        verify(orderRepository, times(1)).save(any(Order.class));
    }
}

The final flow with SnapStart looks quite different from a traditional serverless invocation.

sequenceDiagram
    participant Deployer
    participant AWS Lambda Service
    participant Firecracker VM
    participant Spring Application

    Deployer->>AWS Lambda Service: Deploy new function version
    AWS Lambda Service->>Firecracker VM: Start new microVM
    Firecracker VM->>Spring Application: Start JVM & run init()
    Spring Application-->>Spring Application: Initialize Spring Context (10s)
    Note over Spring Application: beforeCheckpoint hook runs<br/>(e.g., drain DB connections)
    AWS Lambda Service->>Firecracker VM: Take Snapshot
    AWS Lambda Service-->>Deployer: Deployment Complete
    participant User
    participant API Gateway
    User->>API Gateway: POST /orders
    API Gateway->>AWS Lambda Service: Invoke function
    AWS Lambda Service->>Firecracker VM: Restore from Snapshot (200ms)
    Firecracker VM->>Spring Application: Resume execution
    Note over Spring Application: afterRestore hook runs<br/>(e.g., re-establish DB connections)
    Spring Application->>Spring Application: Handle request
    Spring Application-->>AWS Lambda Service: Return Response
    AWS Lambda Service-->>API Gateway: Return Response
    API Gateway-->>User: 201 Created

After implementing the CRaC hook and deploying a new version, the results were exactly what we needed.

REPORT RequestId: abc... Duration: 489.12 ms Billed Duration: 490 ms Memory Size: 1024 MB Max Memory Used: 512 MB Init Duration: 10982.45 ms Restore Duration: 231.55 ms

The Init Duration remains long, but it is now paid once, when the version is published during deployment. The customer-facing latency of a cold invocation is the Restore Duration plus the Duration of that first request: roughly 230 ms + 490 ms here, well under a second end to end, compared with more than 12 seconds before. Subsequent warm invocations drop back to the Duration alone.

This approach is not without its limitations. Any unique state generated during the init phase is captured in the snapshot and then shared by every execution environment restored from it. A pseudo-random generator such as java.util.Random seeded during init will replay the same sequence in each restored environment, and unique IDs or tokens created during initialization will be identical across all of them. (AWS documents java.security.SecureRandom as snap-resilient, so it keeps producing unpredictable values after restore, but that guarantee does not extend to other generators or to values cached from them.) Such state must be generated lazily or regenerated in an afterRestore hook.
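
One way to do that is the same CRaC hook pattern used for the connection pool: keep identity and seed material out of the snapshot, or regenerate it after restore. A minimal sketch, assuming a hypothetical InstanceIdentity bean that other components consult for a per-environment ID:

// infrastructure/runtime/InstanceIdentity.java (hypothetical sketch)
package com.example.infrastructure.runtime;

import java.util.UUID;
import java.util.concurrent.atomic.AtomicReference;

import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;
import org.springframework.stereotype.Component;

/**
 * Holds a per-environment identifier. Because the value is regenerated in afterRestore,
 * every execution environment restored from the same snapshot gets its own ID instead of
 * sharing the one captured at checkpoint time.
 */
@Component
public class InstanceIdentity implements Resource {

    private final AtomicReference<UUID> instanceId = new AtomicReference<>(UUID.randomUUID());

    public InstanceIdentity() {
        Core.getGlobalContext().register(this);
    }

    public UUID current() {
        return instanceId.get();
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) {
        // Nothing to release; the stale value is simply replaced after restore.
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) {
        instanceId.set(UUID.randomUUID());
    }
}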

Furthermore, the future evolution of this pattern points towards Ahead-of-Time (AOT) compilation with GraalVM, using the native image support built into Spring Boot 3 (the successor to the experimental Spring Native project). This would compile the Spring application to a native executable, effectively eliminating the long JVM startup and context initialization time altogether, shifting that work to the build pipeline. However, this comes with its own set of trade-offs, including longer compile times and stricter constraints on reflection, which may require significant code and dependency updates. For our project, with its existing libraries that relied heavily on reflection, SnapStart provided a less intrusive, more pragmatic path to achieving the required performance. The architecture was preserved, the core logic remained unchanged, and the performance target was met by introducing a single, well-encapsulated infrastructure concern.

