Implementing a Cancellable Saga Pattern Across NestJS Microservices with Envoy and Recoil for State Synchronization


Our monolithic order management system was a constant source of data integrity fires. A single transaction to create an order involved a database insert, a third-party payment gateway call, an inventory update, and a shipping notification. When the inventory update failed due to a stock discrepancy, the payment was already processed. The result was a successful charge for an unfulfillable order, requiring manual refunds and customer service intervention. This scenario, repeated at scale, became an operational bottleneck. The move to microservices was inevitable, but it amplified the core problem: how do you maintain transactional consistency across distributed, independent services without resorting to brittle, high-latency two-phase commits?

The initial concept was to break down the process into Order, Payment, and Inventory microservices. The Order service would act as the orchestrator. To solve the consistency problem, we settled on the Saga pattern. A saga is a sequence of local transactions. Each transaction updates data within a single service and publishes a message or event to trigger the next transaction. If a local transaction fails, the saga executes a series of compensating transactions that revert the preceding transactions. This ensures atomicity at the business process level without distributed locking.

Our technology selection was driven by production needs: stability, performance, and observability.

  1. Backend Services (NestJS): NestJS provides a structured, opinionated framework for building microservices. Its dependency injection system and first-class support for various transporters (we chose RabbitMQ for its reliability) make it ideal for creating maintainable, testable services.
  2. Saga Orchestration (Saga Pattern): We implemented an orchestrator-based saga within the Order service. This centralizes the workflow logic, making it easier to understand and debug compared to a choreographed approach where services react to each other’s events.
  3. Service Mesh (Envoy Proxy): Direct service-to-service communication introduces complexities in discovery, load balancing, retries, and circuit breaking. Envoy, as a sidecar proxy, handles these cross-cutting concerns at the infrastructure layer, keeping the application code clean. Critically, it also injects and propagates tracing headers (x-request-id), which is the foundation of our observability strategy.
  4. API Gateway (NestJS with Fastify): The entry point to our system needed to be fast. While NestJS’s default Express adapter is robust, we chose the FastifyAdapter for its superior performance, particularly for handling long-lived connections like the Server-Sent Events (SSE) stream we needed for front-end updates.
  5. Frontend State (Recoil): The front-end needs to reflect the multi-step, asynchronous nature of the saga. A user can’t be left staring at a spinner for 30 seconds. Recoil’s atomic state model is a perfect fit. We can create an atom family to track the state of each saga instance, allowing for granular, reactive updates to the UI as events stream in from the backend.

The Saga Orchestrator: Core of the Backend Logic

The Order service houses the saga orchestrator. It’s a state machine that drives the entire process. When a new order request arrives, it initiates the saga, persists its initial state, and begins executing the steps.

Here is the core structure of our OrderSaga. It defines the transaction steps and their corresponding compensation steps.

// src/order/order.saga.ts (in the Order Service)
import { Injectable, Logger } from '@nestjs/common';
import { ClientProxy } from '@nestjs/microservices';
import { v4 as uuidv4 } from 'uuid';

// In a real application, this would be persisted to a DB (e.g., Redis, PostgreSQL)
// For this demonstration, we use an in-memory store.
const sagaStateStore = new Map<string, OrderSagaState>();

export interface OrderSagaState {
  transactionId: string;
  orderId: string;
  userId: string;
  productId: string;
  quantity: number;
  totalPrice: number;
  status: 'PENDING' | 'PAYMENT_COMPLETE' | 'INVENTORY_RESERVED' | 'COMPLETED' | 'FAILED';
  history: { step: string; status: 'SUCCESS' | 'FAILURE' | 'COMPENSATING' | 'COMPENSATED'; timestamp: Date }[];
  failureReason?: string;
}

type SagaStep = {
  name: string;
  execute: (state: OrderSagaState) => Promise<any>;
  compensate: (state: OrderSagaState) => Promise<any>;
};

@Injectable()
export class OrderSaga {
  private readonly logger = new Logger(OrderSaga.name);
  private steps: SagaStep[] = [];

  constructor(
    // These would be injected clients for our microservices.
    // In NestJS, this is typically done via `@Inject('SERVICE_NAME')`
    private readonly paymentClient: ClientProxy,
    private readonly inventoryClient: ClientProxy,
  ) {
    this.initializeSteps();
  }

  private initializeSteps() {
    this.steps = [
      {
        name: 'ProcessPayment',
        execute: this.processPayment.bind(this),
        compensate: this.refundPayment.bind(this),
      },
      {
        name: 'ReserveInventory',
        execute: this.reserveInventory.bind(this),
        compensate: this.releaseInventory.bind(this),
      },
    ];
  }

  public async startSaga(orderData: {
    orderId: string;
    userId: string;
    productId: string;
    quantity: number;
    totalPrice: number;
  }): Promise<string> {
    const transactionId = uuidv4();
    const initialState: OrderSagaState = {
      ...orderData,
      transactionId,
      status: 'PENDING',
      history: [{ step: 'SagaInitiated', status: 'SUCCESS', timestamp: new Date() }],
    };
    sagaStateStore.set(transactionId, initialState);
    this.logger.log(`[${transactionId}] Saga started for order ${orderData.orderId}`);

    // Asynchronously execute the saga
    this.executeSaga(transactionId);

    return transactionId;
  }

  private async executeSaga(transactionId: string) {
    const state = sagaStateStore.get(transactionId);
    if (!state) return;

    for (let i = 0; i < this.steps.length; i++) {
      const step = this.steps[i];
      try {
        this.logger.log(`[${transactionId}] Executing step: ${step.name}`);
        await step.execute(state);
        state.history.push({ step: step.name, status: 'SUCCESS', timestamp: new Date() });
        
        // This is a naive state update. In a real system, you'd have more granular states.
        if (step.name === 'ProcessPayment') state.status = 'PAYMENT_COMPLETE';
        if (step.name === 'ReserveInventory') state.status = 'INVENTORY_RESERVED';
        
        sagaStateStore.set(transactionId, state);
      } catch (error) {
        this.logger.error(`[${transactionId}] Step ${step.name} failed: ${error.message}`);
        state.status = 'FAILED';
        state.failureReason = error.message;
        state.history.push({ step: step.name, status: 'FAILURE', timestamp: new Date() });
        sagaStateStore.set(transactionId, state);
        
        // Start compensation flow
        await this.compensate(transactionId, i);
        return; // Stop execution
      }
    }

    state.status = 'COMPLETED';
    sagaStateStore.set(transactionId, state);
    this.logger.log(`[${transactionId}] Saga completed successfully for order ${state.orderId}`);
  }

  private async compensate(transactionId: string, failedStepIndex: number) {
    const state = sagaStateStore.get(transactionId);
    this.logger.warn(`[${transactionId}] Starting compensation flow from step index ${failedStepIndex}`);

    for (let i = failedStepIndex; i >= 0; i--) {
      const step = this.steps[i];
      // Only compensate steps that have successfully completed before the failure.
      const hasSucceeded = state.history.some(h => h.step === step.name && h.status === 'SUCCESS');
      if (hasSucceeded) {
          try {
            this.logger.log(`[${transactionId}] Compensating step: ${step.name}`);
            state.history.push({ step: step.name, status: 'COMPENSATING', timestamp: new Date() });
            await step.compensate(state);
            state.history.push({ step: step.name, status: 'COMPENSATED', timestamp: new Date() });
            sagaStateStore.set(transactionId, state);
          } catch (error) {
            this.logger.error(`[${transactionId}] CRITICAL: Compensation for step ${step.name} failed: ${error.message}`);
            // This is a critical failure. It requires manual intervention.
            // You might push this to a dead-letter queue or trigger an alert.
          }
      }
    }
  }

  // --- Step Implementations ---

  private async processPayment(state: OrderSagaState): Promise<void> {
    // In a real system, you'd send a message to the Payment service.
    // The `send` method returns an Observable, which we convert to a Promise.
    return new Promise((resolve, reject) => {
      this.paymentClient.send('process_payment', {
        transactionId: state.transactionId,
        userId: state.userId,
        amount: state.totalPrice,
      }).subscribe({
        next: (response) => response.status === 'APPROVED' ? resolve() : reject(new Error(response.reason)),
        error: (err) => reject(err),
      });
    });
  }

  private async refundPayment(state: OrderSagaState): Promise<void> {
    this.paymentClient.emit('refund_payment', {
      transactionId: state.transactionId,
      userId: state.userId,
      amount: state.totalPrice,
    });
    // emit is fire-and-forget, fine for compensations that don't block.
  }

  private async reserveInventory(state: OrderSagaState): Promise<void> {
    return new Promise((resolve, reject) => {
      this.inventoryClient.send('reserve_inventory', {
        transactionId: state.transactionId,
        productId: state.productId,
        quantity: state.quantity,
      }).subscribe({
        next: (response) => response.status === 'RESERVED' ? resolve() : reject(new Error(response.reason)),
        error: (err) => reject(err),
      });
    });
  }
  
  private async releaseInventory(state: OrderSagaState): Promise<void> {
    this.inventoryClient.emit('release_inventory', {
      transactionId: state.transactionId,
      productId: state.productId,
      quantity: state.quantity,
    });
  }
}

The key pitfall here is saga state management. Using an in-memory map is fine for a demo, but in production, this state must be persisted transactionally after each step. If the orchestrator crashes mid-saga, it must be able to recover and resume from where it left off. A common pattern is to use a database table with a row for each saga instance, updating its state and history atomically.

Infrastructure as Code: Envoy Proxy Configuration

Envoy acts as the smart network layer. All inter-service communication flows through it. This configuration sets up listeners for incoming traffic, routes requests to the correct service clusters, and, crucially, implements resilience patterns.

# envoy.yaml
static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 10000 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                # Generate x-request-id for tracing if not present
                generate_request_id: true
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: local_service
                      domains: ["*"]
                      routes:
                        # Route to Order Service API Gateway
                        - match: { prefix: "/orders" }
                          route: { cluster: order_service_cluster }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
    # Cluster for the main API Gateway (Order Service)
    - name: order_service_cluster
      connect_timeout: 5s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: order_service_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      # This should resolve to the Order service container
                      address: host.docker.internal 
                      port_value: 3001
    
    # We are using RabbitMQ for internal comms, so Envoy isn't proxying
    # service-to-service calls directly in this setup. However, if we were using
    # HTTP/gRPC, clusters for payment and inventory would be defined here with
    # circuit breakers and retry policies like this:
    #
    # - name: payment_service_cluster
    #   connect_timeout: 1s
    #   type: STRICT_DNS
    #   lb_policy: ROUND_ROBIN
    #   circuit_breakers:
    #     thresholds:
    #       - priority: DEFAULT
    #         max_connections: 100
    #         max_pending_requests: 10
    #         max_requests: 50
    #         max_retries: 3
    #   load_assignment:
    #     ...

A common mistake is to put too much business logic into Envoy. Envoy should handle network-level concerns. Retrying a failed payment transaction, for example, is a business decision that belongs in the saga orchestrator. Retrying a 503 Service Unavailable error from a transient network issue is an infrastructure concern that belongs in Envoy. Here, we’ve focused Envoy on its primary role as an edge proxy and trace initiator. In a fully HTTP-based microservice architecture, it would also manage internal traffic, where circuit breakers would be critical to prevent cascading failures.

The Gateway: High-Performance Communication with Fastify and SSE

The gateway is the bridge between the client and the internal microservices. It initiates the saga and provides a real-time status stream. We use Server-Sent Events (SSE) for this, a simple and efficient protocol for pushing data from server to client over a standard HTTP connection.

// src/app.controller.ts (in the API Gateway/Order Service)
import { Controller, Post, Body, Sse, Param, MessageEvent } from '@nestjs/common';
import { Observable, interval, map, fromEvent } from 'rxjs';
import { EventEmitter2 } from '@nestjs/event-emitter';
import { OrderSaga, OrderSagaState } from './order/order.saga';

@Controller('orders')
export class AppController {
  constructor(
    private readonly orderSaga: OrderSaga,
    private readonly eventEmitter: EventEmitter2,
  ) {}

  @Post()
  async createOrder(@Body() createOrderDto: any): Promise<{ transactionId: string }> {
    // Basic validation should be here
    const transactionId = await this.orderSaga.startSaga({
      orderId: createOrderDto.orderId,
      userId: 'user-123',
      // ... more data
    });

    // We also need a way to push updates for this transactionId
    // The saga orchestrator will emit events like 'saga.updated.TRANSACTION_ID'
    // This is a simplified version of that concept.
    
    return { transactionId };
  }

  @Sse(':transactionId/status')
  sse(@Param('transactionId') transactionId: string): Observable<MessageEvent> {
    // This is the key part for real-time updates.
    // We listen for events specific to this transactionId.
    // The Saga Orchestrator needs to be modified to emit these events.
    // For example, after each step: this.eventEmitter.emit(`saga.update.${transactionId}`, newState);
    return fromEvent(this.eventEmitter, `saga.update.${transactionId}`).pipe(
      map((state: OrderSagaState) => {
        return {
          data: { 
            status: state.status,
            history: state.history,
            failureReason: state.failureReason
          },
        } as MessageEvent;
      }),
    );
  }
}

The main.ts file must be configured to use FastifyAdapter.

// src/main.ts
import { NestFactory } from '@nestjs/core';
import {
  FastifyAdapter,
  NestFastifyApplication,
} from '@nestjs/platform-fastify';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create<NestFastifyApplication>(
    AppModule,
    new FastifyAdapter(),
  );
  await app.listen(3001);
}
bootstrap();

Using Fastify provides a noticeable performance improvement for I/O-bound operations like managing hundreds or thousands of concurrent SSE connections, compared to the default Express adapter. This choice is a direct trade-off for performance over familiarity.

Frontend State Synchronization with Recoil

The final piece is the user interface. It must react to the saga’s state changes without complex manual state management. Recoil’s model shines here.

First, we define an atom family to hold the state for any given order transaction.

// src/state/orderState.js (in a React app)
import { atomFamily } from 'recoil';

export const orderSagaStateFamily = atomFamily({
  key: 'orderSagaState',
  default: {
    status: 'IDLE',
    history: [],
    failureReason: null,
  },
});

Next, a component uses a useEffect hook to subscribe to the SSE stream when a transaction starts. As data arrives, it updates the Recoil atom, and the UI re-renders automatically.

// src/components/OrderTracker.js
import React, { useEffect, useState } from 'react';
import { useRecoilState } from 'recoil';
import { orderSagaStateFamily } from '../state/orderState';

function OrderTracker({ transactionId }) {
  const [sagaState, setSagaState] = useRecoilState(orderSagaStateFamily(transactionId));

  useEffect(() => {
    if (!transactionId) return;

    // Set initial state to pending when we start tracking
    setSagaState({ status: 'PENDING', history: [{ step: 'RequestSent', status: 'SUCCESS' }] });

    const eventSource = new EventSource(`/api/orders/${transactionId}/status`);
    
    eventSource.onmessage = (event) => {
      const data = JSON.parse(event.data);
      // The state update is trivial. Recoil handles the rest.
      setSagaState(data);
    };

    eventSource.onerror = (error) => {
      console.error('SSE Error:', error);
      setSagaState((prev) => ({ ...prev, status: 'ERROR', failureReason: 'Connection lost' }));
      eventSource.close();
    };

    // Cleanup on component unmount
    return () => {
      eventSource.close();
    };
  }, [transactionId, setSagaState]);

  return (
    <div>
      <h3>Order Status (Tx: {transactionId})</h3>
      <p>Overall Status: <strong>{sagaState.status}</strong></p>
      {sagaState.failureReason && <p style={{ color: 'red' }}>Reason: {sagaState.failureReason}</p>}
      <ul>
        {sagaState.history.map((item, index) => (
          <li key={index}>
            {item.timestamp}: [{item.step}] - {item.status}
          </li>
        ))}
      </ul>
    </div>
  );
}

// Parent component that initiates the order
export function OrderPlacement() {
    const [transactionId, setTransactionId] = useState(null);

    const handlePlaceOrder = async () => {
        // This would call the POST /orders endpoint
        const response = await fetch('/api/orders', { method: 'POST', /*...*/ });
        const { transactionId: newTxId } = await response.json();
        setTransactionId(newTxId);
    };

    return (
        <div>
            <button onClick={handlePlaceOrder} disabled={!!transactionId}>
                Place a Test Order (Inventory will fail)
            </button>
            {transactionId && <OrderTracker transactionId={transactionId} />}
        </div>
    );
}

This architecture provides a clean separation of concerns. The React component knows nothing about Sagas or microservices. It only knows how to subscribe to a data stream and update a piece of state. Recoil’s declarative nature makes this flow far simpler to manage than traditional state management libraries that might require complex reducer logic for handling streaming updates.

sequenceDiagram
    participant Client as React UI (Recoil)
    participant Envoy as Envoy Proxy
    participant Gateway as NestJS Gateway (Fastify)
    participant Orchestrator as Order Saga
    participant PaymentSvc as Payment Service
    participant InventorySvc as Inventory Service

    Client->>+Envoy: POST /orders
    Envoy->>+Gateway: POST /orders
    Gateway->>+Orchestrator: startSaga()
    Orchestrator-->>-Gateway: { transactionId }
    Gateway-->>-Envoy: { transactionId }
    Envoy-->>-Client: { transactionId }
    
    Client->>+Envoy: GET /orders/{id}/status (SSE)
    Envoy->>+Gateway: GET /orders/{id}/status (SSE)
    Gateway-->>-Client: SSE Connection Opened
    
    Note over Orchestrator: Saga Execution Starts
    Orchestrator->>+PaymentSvc: process_payment
    PaymentSvc-->>-Orchestrator: payment_approved
    Orchestrator-->>Gateway: emit('saga.update', state)
    Gateway-->>Client: SSE: { status: 'PAYMENT_COMPLETE' }
    
    Orchestrator->>+InventorySvc: reserve_inventory
    InventorySvc-->>-Orchestrator: insufficient_stock (FAIL)
    Note over Orchestrator: Step Failed. Start Compensation.
    Orchestrator-->>Gateway: emit('saga.update', state)
    Gateway-->>Client: SSE: { status: 'FAILED', reason: '...' }

    Orchestrator->>+PaymentSvc: refund_payment
    Orchestrator-->>Gateway: emit('saga.update', state)
    Gateway-->>Client: SSE: { history: [..., {step: 'ProcessPayment', status:'COMPENSATING'}] }
    
    Note right of Client: UI re-renders reactively at each SSE event

The presented solution works, but it’s not without its limitations. The saga orchestrator’s state is held in-memory, making it a single point of failure and unsuitable for production without a persistent store like Redis or a relational database. Recovering from an orchestrator crash mid-saga is a non-trivial problem that requires careful state management and idempotency in all service calls. Furthermore, our SSE implementation is basic; a production system would need robust heartbeating, reconnection logic, and potentially a more scalable solution like WebSockets managed via a dedicated service if client numbers grow large. Finally, while logging with a correlation ID is a huge step forward for observability, true insight comes from integrating this with a distributed tracing system like Jaeger or OpenTelemetry, allowing visualization of the entire call graph and latency analysis for each step of the saga.


  TOC