Implementing BASE-Compliant Sagas for Micro-Frontend Backends Using AWS Lambda and a Centralized MySQL Instance


The architectural decision to decouple the front end into micro-frontends was straightforward. The subsequent choice to back each micro-frontend with its own AWS Lambda function, following the Backend-for-Frontend (BFF) pattern, seemed a natural extension for scalability and team autonomy. The complication arose from a legacy constraint: all core business data resides in a single, monolithic MySQL database. This intersection of a distributed, serverless compute layer with a centralized, ACID-compliant relational database created an immediate and significant challenge.

A single user action on the UI, such as completing a multi-step onboarding process, now needed to orchestrate writes across three distinct domains, managed by three separate Lambda functions: auth-service, profile-service, and notifications-service. The initial, naive implementation involved a chain of synchronous Lambda invocations. The auth-service would create the user, then synchronously invoke the profile-service, wait for its response, and finally invoke the notifications-service.

// WARNING: This is an anti-pattern for serverless architectures.
// auth-service/handler.js

const { LambdaClient, InvokeCommand } = require("@aws-sdk/client-lambda");
const mysql = require('mysql2/promise');

const lambdaClient = new LambdaClient({});
// Connection details should be managed securely, e.g., via Secrets Manager
const dbConfig = { /* ... */ }; 

exports.completeOnboarding = async (event) => {
    const { email, password, fullName, marketingOptIn } = JSON.parse(event.body);
    let connection;
    
    try {
        connection = await mysql.createConnection(dbConfig);
        await connection.beginTransaction();

        // Step 1: Create user in auth table
        const [userResult] = await connection.execute(
            'INSERT INTO users (email, password_hash, status) VALUES (?, ?, ?)',
            [email, hash(password), 'PENDING'] // Hashing function omitted for brevity
        );
        const userId = userResult.insertId;

        // Step 2: Synchronously invoke Profile service
        const profilePayload = { userId, fullName };
        const invokeProfileCommand = new InvokeCommand({
            FunctionName: 'profile-service-dev-createProfile',
            InvocationType: 'RequestResponse',
            Payload: JSON.stringify(profilePayload),
        });
        const profileResponse = await lambdaClient.send(invokeProfileCommand);
        const profileResult = JSON.parse(Buffer.from(profileResponse.Payload).toString());

        // Note: a handled function error still returns StatusCode 200;
        // the FunctionError field is what actually signals failure.
        if (profileResponse.FunctionError || profileResult.status !== 'success') {
            throw new Error('Profile creation failed. Rolling back user.');
        }

        // Step 3: Synchronously invoke Notifications service
        const notificationsPayload = { userId, email, marketingOptIn };
        // ... similar invocation for notifications-service ...

        // If all successful, update user status
        await connection.execute('UPDATE users SET status = ? WHERE id = ?', ['ACTIVE', userId]);
        await connection.commit();

        return { statusCode: 201, body: JSON.stringify({ userId }) };

    } catch (error) {
        if (connection) await connection.rollback();
        console.error('Onboarding failed:', error);
        // This rollback is only local. The other services are unaware.
        return { statusCode: 500, body: 'Internal Server Error' };
    } finally {
        if (connection) await connection.end();
    }
};

This approach is fundamentally flawed in a serverless environment. It introduces temporal coupling; a failure or slowdown in the profile-service directly impacts the auth-service. It creates long-running Lambda functions, which increases cost and risks hitting the 15-minute execution limit. Most critically, it fails to handle partial failures gracefully. If the notifications-service fails after the profile-service has already committed its transaction, there is no clean, automated mechanism to roll back the profile creation. This is a direct path to data inconsistency. We were attempting to force distributed transactions onto an architecture ill-suited for them, violating the core principle of microservice independence.

The solution required a paradigm shift away from ACID guarantees at the transaction level towards the BASE model: Basically Available, Soft state, Eventually consistent. The Saga pattern emerged as the logical implementation. We opted for a choreography-based Saga, using AWS SNS and SQS for asynchronous communication, keeping each service decoupled and resilient.

The Event-Driven Saga Architecture

The design is centered around a single SNS topic, user-events, which acts as a central event bus. Each service’s Lambda function is triggered by its own dedicated SQS queue, which subscribes to the user-events topic. This SQS layer provides durability and allows for independent processing, retries, and dead-letter queue (DLQ) configuration for each service.
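The queue-to-topic wiring and DLQ configuration normally live in infrastructure-as-code; as an illustrative sketch using the AWS SDK (the queue URL and DLQ ARN environment variables here are placeholders, not part of our deployment):

// Sketch: attach a redrive policy so messages that repeatedly fail
// processing are moved to a dead-letter queue instead of retrying forever.
const { SQSClient, SetQueueAttributesCommand } = require("@aws-sdk/client-sqs");

const sqsClient = new SQSClient({});

async function configureProfileQueueDlq() {
    await sqsClient.send(new SetQueueAttributesCommand({
        QueueUrl: process.env.PROFILE_QUEUE_URL, // placeholder
        Attributes: {
            RedrivePolicy: JSON.stringify({
                deadLetterTargetArn: process.env.PROFILE_DLQ_ARN, // placeholder
                maxReceiveCount: '5' // deliveries before the message moves to the DLQ
            })
        }
    }));
}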

sequenceDiagram
    participant Client
    participant AuthLambda as Auth Service (Lambda)
    participant UserEvents as User Events (SNS)
    participant ProfileQueue as Profile SQS Queue
    participant ProfileLambda as Profile Service (Lambda)
    participant NotificationsQueue as Notifications SQS Queue
    participant NotificationsLambda as Notifications Service (Lambda)
    participant DB as MySQL Database

    Client->>+AuthLambda: POST /onboard (user data)
    Note over AuthLambda, DB: Starts DB Transaction
    AuthLambda->>DB: INSERT INTO users (status='PENDING')
    AuthLambda->>-UserEvents: Publish `USER_CREATED` event
    Note over AuthLambda, DB: Commits DB Transaction
    UserEvents-->>ProfileQueue: Fan-out message
    UserEvents--xNotificationsQueue: USER_CREATED dropped by filter policy

    ProfileQueue-->>+ProfileLambda: Consume `USER_CREATED` event
    Note over ProfileLambda, DB: Starts DB Transaction
    ProfileLambda->>DB: INSERT INTO user_profiles
    Note over ProfileLambda, DB: Commits DB Transaction
    ProfileLambda->>-UserEvents: Publish `PROFILE_CREATED` event

    UserEvents-->>NotificationsQueue: Fan-out message
    NotificationsQueue-->>+NotificationsLambda: Consume `PROFILE_CREATED` event
    Note over NotificationsLambda, DB: Starts DB Transaction
    NotificationsLambda->>DB: INSERT INTO user_preferences
    Note over NotificationsLambda, DB: Commits DB Transaction
    NotificationsLambda->>-UserEvents: Publish `ONBOARDING_COMPLETED` event
    
    UserEvents-->>AuthLambda: Deliver `ONBOARDING_COMPLETED` (via its own SQS queue, not shown)
    Note over AuthLambda: Consumes `ONBOARDING_COMPLETED`
    AuthLambda->>DB: UPDATE users SET status='ACTIVE'

The state of the Saga is tracked implicitly within our database. The users table gains a status column (PENDING, ACTIVE, PROFILE_CREATION_FAILED, ONBOARDING_FAILED) that serves as the source of truth for the overall process.
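Because the status column is the source of truth, a periodic scanner can surface sagas that have stalled mid-flight. A minimal sketch of such a query (the one-hour threshold is an arbitrary choice, not a value from our system):

-- Surface onboarding sagas that never reached ACTIVE
SELECT id, email, status, created_at
FROM users
WHERE status <> 'ACTIVE'
  AND created_at < NOW() - INTERVAL 1 HOUR;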

Core Implementation: The Initiating Service

The auth-service now has a much simpler responsibility. It validates the input, creates the user record in a PENDING state, and publishes a USER_CREATED event. That’s it. Its execution is fast, and its transaction is localized. The crucial part is keeping the database write and the event publish as close to atomic as possible. Here the publish happens inside the local transaction, so a failed publish rolls back the insert; the rarer failure mode, a commit that fails after the event has already been published, still leaks a phantom event. A real-world project needs a reconciliation process for that case (e.g., a transactional outbox or a periodic scanner), but for this implementation, we handle only the immediate failure.
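As a point of reference, a transactional outbox would add a table like the one below, written in the same local transaction as the users insert and drained by a separate publisher process. This sketch is hypothetical and not part of the implemented schema:

-- Hypothetical outbox table: rows are inserted atomically with the business
-- write; a separate process publishes them to SNS and stamps published_at.
CREATE TABLE `events_outbox` (
  `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  `event_type` VARCHAR(64) NOT NULL,
  `payload` JSON NOT NULL,
  `published_at` TIMESTAMP NULL DEFAULT NULL,
  `created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;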

Database Schema:

-- users table
CREATE TABLE `users` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `email` VARCHAR(255) NOT NULL,
  `status` ENUM('PENDING', 'ACTIVE', 'PROFILE_CREATION_FAILED', 'ONBOARDING_FAILED') NOT NULL DEFAULT 'PENDING',
  `created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  UNIQUE KEY `email_idx` (`email`)
) ENGINE=InnoDB;

-- user_profiles table
CREATE TABLE `user_profiles` (
  `user_id` INT UNSIGNED NOT NULL,
  `full_name` VARCHAR(255) NOT NULL,
  `updated_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`user_id`),
  CONSTRAINT `fk_user_profiles_user_id` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB;

-- user_preferences table
CREATE TABLE `user_preferences` (
  `user_id` INT UNSIGNED NOT NULL,
  `marketing_opt_in` TINYINT(1) NOT NULL DEFAULT 0,
  `updated_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`user_id`),
  CONSTRAINT `fk_user_preferences_user_id` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB;

auth-service/handler.js:

// auth-service/handler.js

const { SNSClient, PublishCommand } = require("@aws-sdk/client-sns");
const mysql = require('mysql2/promise');
const { v4: uuidv4 } = require('uuid');

const snsClient = new SNSClient({});
// In production, use RDS Proxy for connection pooling
const dbConfig = {
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME,
    waitForConnections: true,
    connectionLimit: 10,
    queueLimit: 0
};
const pool = mysql.createPool(dbConfig);

const USER_EVENTS_TOPIC_ARN = process.env.USER_EVENTS_TOPIC_ARN;

// Structured logger
const log = (level, message, context) => {
    console.log(JSON.stringify({ level, message, ...context }));
};

exports.startOnboarding = async (event) => {
    const { email, fullName, marketingOptIn } = JSON.parse(event.body);
    const requestId = uuidv4();
    log('info', 'Onboarding process started', { requestId, email });

    let connection;
    try {
        connection = await pool.getConnection();
        await connection.beginTransaction();

        // 1. Create user record in PENDING state
        const [userResult] = await connection.execute(
            'INSERT INTO users (email, status) VALUES (?, ?)',
            [email, 'PENDING']
        );
        const userId = userResult.insertId;
        log('info', 'User record created', { requestId, userId });

        // 2. Prepare event payload
        const eventPayload = {
            eventType: 'USER_CREATED',
            data: {
                userId,
                email,
                fullName,
                marketingOptIn,
                // Include a correlation ID for tracing across services
                correlationId: requestId
            }
        };

        // 3. Publish event to SNS
        const publishCommand = new PublishCommand({
            TopicArn: USER_EVENTS_TOPIC_ARN,
            Message: JSON.stringify(eventPayload),
            MessageAttributes: {
                eventType: {
                    DataType: 'String',
                    StringValue: 'USER_CREATED'
                }
            }
        });
        await snsClient.send(publishCommand);
        log('info', 'USER_CREATED event published', { requestId, userId });

        // 4. Commit local transaction
        await connection.commit();

        return {
            statusCode: 202, // Accepted for processing
            body: JSON.stringify({
                status: 'Onboarding process initiated',
                userId
            })
        };

    } catch (error) {
        if (connection) {
            await connection.rollback();
        }
        log('error', 'Onboarding initiation failed', { requestId, error: error.message });
        // Specific error handling for duplicate email
        if (error.code === 'ER_DUP_ENTRY') {
            return { statusCode: 409, body: 'Email already exists.' };
        }
        return { statusCode: 500, body: 'Internal server error.' };
    } finally {
        if (connection) {
            connection.release();
        }
    }
};

Note the use of MessageAttributes in the SNS publish command. This allows SQS subscriptions to filter messages. The profile-service queue can be configured with a filter policy to only accept messages where eventType is USER_CREATED.
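As an illustration of that wiring (subscriptions are usually declared in infrastructure-as-code; the queue ARN environment variable here is a placeholder):

// Sketch: subscribe the profile-service queue to the topic with a filter
// policy so it only receives USER_CREATED events.
const { SNSClient, SubscribeCommand } = require("@aws-sdk/client-sns");

const snsClient = new SNSClient({});

async function subscribeProfileQueue() {
    await snsClient.send(new SubscribeCommand({
        TopicArn: process.env.USER_EVENTS_TOPIC_ARN,
        Protocol: 'sqs',
        Endpoint: process.env.PROFILE_QUEUE_ARN, // placeholder
        Attributes: {
            // Messages whose eventType attribute doesn't match are never delivered
            FilterPolicy: JSON.stringify({ eventType: ['USER_CREATED'] })
        }
    }));
}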

Participating Services and Compensating Logic

Each subsequent service in the chain follows a similar pattern: consume, process, publish. The critical addition is handling failures. If the profile-service fails to create a profile, it must publish a PROFILE_CREATION_FAILED event. This is the compensating action trigger.

profile-service/handler.js:

// profile-service/handler.js

const { SNSClient, PublishCommand } = require("@aws-sdk/client-sns");
const mysql = require('mysql2/promise');

const snsClient = new SNSClient({});
const pool = mysql.createPool({ /* ... db config ... */ });
const USER_EVENTS_TOPIC_ARN = process.env.USER_EVENTS_TOPIC_ARN;

const log = (level, message, context) => { /* ... structured logger ... */ };

exports.handleUserCreatedEvent = async (event) => {
    // SQS messages are in event.Records
    for (const record of event.Records) {
        const messageBody = JSON.parse(record.body);
        const eventPayload = JSON.parse(messageBody.Message); // SNS message is nested
        
        const { userId, email, fullName, marketingOptIn, correlationId } = eventPayload.data;
        log('info', 'Received USER_CREATED event', { correlationId, userId });

        try {
            // Idempotency Check: A common mistake is to re-process messages.
            // Check if a profile already exists for this user ID.
            const [rows] = await pool.execute('SELECT user_id FROM user_profiles WHERE user_id = ?', [userId]);
            if (rows.length > 0) {
                log('warn', 'Profile already exists for user. Skipping.', { correlationId, userId });
                // Skip this record without failing, so the message is removed
                // from the queue, but keep processing the rest of the batch.
                continue;
            }

            // Create the profile
            await pool.execute(
                'INSERT INTO user_profiles (user_id, full_name) VALUES (?, ?)',
                [userId, fullName]
            );
            log('info', 'User profile created successfully', { correlationId, userId });

            // Publish the next event in the saga, forwarding the fields
            // the notifications-service will need downstream
            await publishEvent('PROFILE_CREATED', { userId, email, marketingOptIn, correlationId });

        } catch (error) {
            log('error', 'Profile creation failed', { correlationId, userId, error: error.message });
            
            // CRITICAL: Publish a failure event for compensation
            await publishEvent('PROFILE_CREATION_FAILED', { userId, correlationId, reason: error.message });
        }
    }
};

async function publishEvent(eventType, data) {
    const command = new PublishCommand({
        TopicArn: USER_EVENTS_TOPIC_ARN,
        Message: JSON.stringify({ eventType, data }),
        MessageAttributes: {
            eventType: { DataType: 'String', StringValue: eventType }
        }
    });
    await snsClient.send(command);
    log('info', `Published ${eventType} event`, { correlationId: data.correlationId, userId: data.userId });
}

The auth-service would then have another Lambda function subscribed to the user-events topic (via its own SQS queue) with a filter for PROFILE_CREATION_FAILED. This function’s sole job is to execute the compensating transaction: updating the user’s status.

auth-service/compensationHandler.js:

// auth-service/compensationHandler.js
const mysql = require('mysql2/promise');
const pool = mysql.createPool({ /* ... db config ... */ });

exports.handleProfileCreationFailed = async (event) => {
    for (const record of event.Records) {
        const messageBody = JSON.parse(record.body);
        const eventPayload = JSON.parse(messageBody.Message); // SNS envelope, same as in profile-service
        const { userId, correlationId, reason } = eventPayload.data;
        
        try {
            await pool.execute(
                "UPDATE users SET status = 'PROFILE_CREATION_FAILED' WHERE id = ? AND status = 'PENDING'",
                [userId]
            );
            console.log(JSON.stringify({ level: 'info', message: 'Compensation applied: user marked PROFILE_CREATION_FAILED', correlationId, userId, reason }));
        } catch (error) {
            // Rethrow so SQS retries the message; once maxReceiveCount is
            // exhausted it lands in the DLQ, which requires manual intervention.
            console.error(JSON.stringify({ level: 'error', message: 'Compensation update failed', correlationId, userId, error: error.message }));
            throw error;
        }
    }
};

This completes the loop, ensuring the system state reflects the partial failure and can be handled appropriately (e.g., by notifying the user or a support team).

The Unseen Enemy: Database Connection Management

A significant operational pitfall when using Lambda with a traditional RDBMS like MySQL is connection management. Each Lambda invocation runs in a separate execution environment. Naively creating a new database connection on every invocation will quickly exhaust the max_connections limit on your MySQL server, leading to cascading failures.

The initial code snippets used a basic pool, which helps within a single warm container, but doesn’t solve the problem across thousands of concurrent executions. The production-grade solution is to use AWS RDS Proxy. The proxy sits between your Lambda functions and the database, pooling and sharing database connections efficiently. It also improves security by allowing functions to connect via IAM roles instead of storing database credentials in environment variables.

To use it, you create an RDS Proxy in the AWS console, configure its target group to point to your MySQL instance, and set up IAM authentication. The Lambda function’s execution role needs permissions to connect to the proxy (rds-db:connect).
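A sketch of that permission statement; the region, account ID, proxy resource ID, and database user name are placeholders:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "rds-db:connect",
    "Resource": "arn:aws:rds-db:us-east-1:123456789012:dbuser:prx-EXAMPLE1234567890/db_iam_user"
  }]
}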

The connection code in the Lambda functions then changes slightly:

// Example of connecting via RDS Proxy with IAM Auth
// Note: in AWS SDK v3 the signer lives in its own package, @aws-sdk/rds-signer
const { Signer } = require("@aws-sdk/rds-signer");
const mysql = require('mysql2/promise');

const signer = new Signer({
    hostname: process.env.DB_PROXY_ENDPOINT,
    port: 3306,
    username: process.env.DB_IAM_USER, // An IAM database user
});

// The password is a short-lived auth token; getAuthToken() is async in SDK v3
const connection = await mysql.createConnection({
    host: process.env.DB_PROXY_ENDPOINT,
    user: process.env.DB_IAM_USER,
    password: await signer.getAuthToken(),
    database: process.env.DB_NAME,
    ssl: 'Amazon RDS' // TLS is required for IAM authentication
});

This is a non-negotiable step for any production serverless application backed by an RDBMS. Without it, the system is brittle and will not scale.

The implemented Saga pattern, while introducing eventual consistency, successfully decouples our micro-frontend backends. Failures are now isolated, retries are managed by SQS, and the system as a whole is far more resilient. The business logic dictates that it’s acceptable for a user’s profile to be created a few hundred milliseconds after their auth record; this trade-off is a small price for the scalability and resilience gained.

This choreography-based Saga model, however, is not without its own challenges. As the number of services involved in a transaction grows, tracking the state of a Saga becomes difficult. The logic is distributed across multiple services and event handlers, making it hard to answer “where is onboarding for user X currently stuck?”. Debugging a failed multi-step process requires correlating logs across several Lambda functions, emphasizing the need for a shared correlationId.

A potential next step for a more complex workflow would be to adopt an orchestration-based Saga using a tool like AWS Step Functions. This would centralize the Saga’s state machine definition, providing explicit control over the flow, error handling, and retries. This approach trades the simplicity of choreographed events for explicit visibility and control, a trade-off that becomes more appealing as business logic complexity increases. Furthermore, the architecture still hinges on a single MySQL database. True service autonomy would eventually require a database-per-service model, which would necessitate more complex data replication or API-based composition strategies, introducing an entirely new class of consistency challenges.
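For a sense of the shape of an orchestrated alternative, here is an illustrative Amazon States Language fragment; the state names and function ARNs are placeholders, not part of our system:

{
  "StartAt": "CreateUser",
  "States": {
    "CreateUser": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:auth-createUser",
      "Next": "CreateProfile"
    },
    "CreateProfile": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:profile-createProfile",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "MarkUserFailed" }],
      "Next": "CreatePreferences"
    },
    "CreatePreferences": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:notifications-createPreferences",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "MarkUserFailed" }],
      "Next": "ActivateUser"
    },
    "ActivateUser": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:auth-activateUser",
      "End": true
    },
    "MarkUserFailed": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:auth-markFailed",
      "End": true
    }
  }
}

Each Catch clause routes directly to the compensating state, centralizing the logic that is currently scattered across queues and event handlers.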

