Our monolithic order management system was a constant source of data integrity fires. A single transaction to create an order involved a database insert, a third-party payment gateway call, an inventory update, and a shipping notification. When the inventory update failed due to a stock discrepancy, the payment was already processed. The result was a successful charge for an unfulfillable order, requiring manual refunds and customer service intervention. This scenario, repeated at scale, became an operational bottleneck. The move to microservices was inevitable, but it amplified the core problem: how do you maintain transactional consistency across distributed, independent services without resorting to brittle, high-latency two-phase commits?
The initial concept was to break down the process into Order
, Payment
, and Inventory
microservices. The Order
service would act as the orchestrator. To solve the consistency problem, we settled on the Saga pattern. A saga is a sequence of local transactions. Each transaction updates data within a single service and publishes a message or event to trigger the next transaction. If a local transaction fails, the saga executes a series of compensating transactions that revert the preceding transactions. This ensures atomicity at the business process level without distributed locking.
Our technology selection was driven by production needs: stability, performance, and observability.
- Backend Services (
NestJS
): NestJS provides a structured, opinionated framework for building microservices. Its dependency injection system and first-class support for various transporters (we chose RabbitMQ for its reliability) make it ideal for creating maintainable, testable services. - Saga Orchestration (
Saga Pattern
): We implemented an orchestrator-based saga within theOrder
service. This centralizes the workflow logic, making it easier to understand and debug compared to a choreographed approach where services react to each other’s events. - Service Mesh (
Envoy Proxy
): Direct service-to-service communication introduces complexities in discovery, load balancing, retries, and circuit breaking. Envoy, as a sidecar proxy, handles these cross-cutting concerns at the infrastructure layer, keeping the application code clean. Critically, it also injects and propagates tracing headers (x-request-id
), which is the foundation of our observability strategy. - API Gateway (
NestJS
withFastify
): The entry point to our system needed to be fast. While NestJS’s default Express adapter is robust, we chose theFastifyAdapter
for its superior performance, particularly for handling long-lived connections like the Server-Sent Events (SSE) stream we needed for front-end updates. - Frontend State (
Recoil
): The front-end needs to reflect the multi-step, asynchronous nature of the saga. A user can’t be left staring at a spinner for 30 seconds. Recoil’s atomic state model is a perfect fit. We can create an atom family to track the state of each saga instance, allowing for granular, reactive updates to the UI as events stream in from the backend.
The Saga Orchestrator: Core of the Backend Logic
The Order
service houses the saga orchestrator. It’s a state machine that drives the entire process. When a new order request arrives, it initiates the saga, persists its initial state, and begins executing the steps.
Here is the core structure of our OrderSaga
. It defines the transaction steps and their corresponding compensation steps.
// src/order/order.saga.ts (in the Order Service)
import { Injectable, Logger } from '@nestjs/common';
import { ClientProxy } from '@nestjs/microservices';
import { v4 as uuidv4 } from 'uuid';
// In a real application, this would be persisted to a DB (e.g., Redis, PostgreSQL)
// For this demonstration, we use an in-memory store.
const sagaStateStore = new Map<string, OrderSagaState>();
export interface OrderSagaState {
transactionId: string;
orderId: string;
userId: string;
productId: string;
quantity: number;
totalPrice: number;
status: 'PENDING' | 'PAYMENT_COMPLETE' | 'INVENTORY_RESERVED' | 'COMPLETED' | 'FAILED';
history: { step: string; status: 'SUCCESS' | 'FAILURE' | 'COMPENSATING' | 'COMPENSATED'; timestamp: Date }[];
failureReason?: string;
}
type SagaStep = {
name: string;
execute: (state: OrderSagaState) => Promise<any>;
compensate: (state: OrderSagaState) => Promise<any>;
};
@Injectable()
export class OrderSaga {
private readonly logger = new Logger(OrderSaga.name);
private steps: SagaStep[] = [];
constructor(
// These would be injected clients for our microservices.
// In NestJS, this is typically done via `@Inject('SERVICE_NAME')`
private readonly paymentClient: ClientProxy,
private readonly inventoryClient: ClientProxy,
) {
this.initializeSteps();
}
private initializeSteps() {
this.steps = [
{
name: 'ProcessPayment',
execute: this.processPayment.bind(this),
compensate: this.refundPayment.bind(this),
},
{
name: 'ReserveInventory',
execute: this.reserveInventory.bind(this),
compensate: this.releaseInventory.bind(this),
},
];
}
public async startSaga(orderData: {
orderId: string;
userId: string;
productId: string;
quantity: number;
totalPrice: number;
}): Promise<string> {
const transactionId = uuidv4();
const initialState: OrderSagaState = {
...orderData,
transactionId,
status: 'PENDING',
history: [{ step: 'SagaInitiated', status: 'SUCCESS', timestamp: new Date() }],
};
sagaStateStore.set(transactionId, initialState);
this.logger.log(`[${transactionId}] Saga started for order ${orderData.orderId}`);
// Asynchronously execute the saga
this.executeSaga(transactionId);
return transactionId;
}
private async executeSaga(transactionId: string) {
const state = sagaStateStore.get(transactionId);
if (!state) return;
for (let i = 0; i < this.steps.length; i++) {
const step = this.steps[i];
try {
this.logger.log(`[${transactionId}] Executing step: ${step.name}`);
await step.execute(state);
state.history.push({ step: step.name, status: 'SUCCESS', timestamp: new Date() });
// This is a naive state update. In a real system, you'd have more granular states.
if (step.name === 'ProcessPayment') state.status = 'PAYMENT_COMPLETE';
if (step.name === 'ReserveInventory') state.status = 'INVENTORY_RESERVED';
sagaStateStore.set(transactionId, state);
} catch (error) {
this.logger.error(`[${transactionId}] Step ${step.name} failed: ${error.message}`);
state.status = 'FAILED';
state.failureReason = error.message;
state.history.push({ step: step.name, status: 'FAILURE', timestamp: new Date() });
sagaStateStore.set(transactionId, state);
// Start compensation flow
await this.compensate(transactionId, i);
return; // Stop execution
}
}
state.status = 'COMPLETED';
sagaStateStore.set(transactionId, state);
this.logger.log(`[${transactionId}] Saga completed successfully for order ${state.orderId}`);
}
private async compensate(transactionId: string, failedStepIndex: number) {
const state = sagaStateStore.get(transactionId);
this.logger.warn(`[${transactionId}] Starting compensation flow from step index ${failedStepIndex}`);
for (let i = failedStepIndex; i >= 0; i--) {
const step = this.steps[i];
// Only compensate steps that have successfully completed before the failure.
const hasSucceeded = state.history.some(h => h.step === step.name && h.status === 'SUCCESS');
if (hasSucceeded) {
try {
this.logger.log(`[${transactionId}] Compensating step: ${step.name}`);
state.history.push({ step: step.name, status: 'COMPENSATING', timestamp: new Date() });
await step.compensate(state);
state.history.push({ step: step.name, status: 'COMPENSATED', timestamp: new Date() });
sagaStateStore.set(transactionId, state);
} catch (error) {
this.logger.error(`[${transactionId}] CRITICAL: Compensation for step ${step.name} failed: ${error.message}`);
// This is a critical failure. It requires manual intervention.
// You might push this to a dead-letter queue or trigger an alert.
}
}
}
}
// --- Step Implementations ---
private async processPayment(state: OrderSagaState): Promise<void> {
// In a real system, you'd send a message to the Payment service.
// The `send` method returns an Observable, which we convert to a Promise.
return new Promise((resolve, reject) => {
this.paymentClient.send('process_payment', {
transactionId: state.transactionId,
userId: state.userId,
amount: state.totalPrice,
}).subscribe({
next: (response) => response.status === 'APPROVED' ? resolve() : reject(new Error(response.reason)),
error: (err) => reject(err),
});
});
}
private async refundPayment(state: OrderSagaState): Promise<void> {
this.paymentClient.emit('refund_payment', {
transactionId: state.transactionId,
userId: state.userId,
amount: state.totalPrice,
});
// emit is fire-and-forget, fine for compensations that don't block.
}
private async reserveInventory(state: OrderSagaState): Promise<void> {
return new Promise((resolve, reject) => {
this.inventoryClient.send('reserve_inventory', {
transactionId: state.transactionId,
productId: state.productId,
quantity: state.quantity,
}).subscribe({
next: (response) => response.status === 'RESERVED' ? resolve() : reject(new Error(response.reason)),
error: (err) => reject(err),
});
});
}
private async releaseInventory(state: OrderSagaState): Promise<void> {
this.inventoryClient.emit('release_inventory', {
transactionId: state.transactionId,
productId: state.productId,
quantity: state.quantity,
});
}
}
The key pitfall here is saga state management. Using an in-memory map is fine for a demo, but in production, this state must be persisted transactionally after each step. If the orchestrator crashes mid-saga, it must be able to recover and resume from where it left off. A common pattern is to use a database table with a row for each saga instance, updating its state and history atomically.
Infrastructure as Code: Envoy Proxy Configuration
Envoy acts as the smart network layer. All inter-service communication flows through it. This configuration sets up listeners for incoming traffic, routes requests to the correct service clusters, and, crucially, implements resilience patterns.
# envoy.yaml
static_resources:
listeners:
- name: listener_0
address:
socket_address: { address: 0.0.0.0, port_value: 10000 }
filter_chains:
- filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
stat_prefix: ingress_http
# Generate x-request-id for tracing if not present
generate_request_id: true
route_config:
name: local_route
virtual_hosts:
- name: local_service
domains: ["*"]
routes:
# Route to Order Service API Gateway
- match: { prefix: "/orders" }
route: { cluster: order_service_cluster }
http_filters:
- name: envoy.filters.http.router
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
clusters:
# Cluster for the main API Gateway (Order Service)
- name: order_service_cluster
connect_timeout: 5s
type: STRICT_DNS
lb_policy: ROUND_ROBIN
load_assignment:
cluster_name: order_service_cluster
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
# This should resolve to the Order service container
address: host.docker.internal
port_value: 3001
# We are using RabbitMQ for internal comms, so Envoy isn't proxying
# service-to-service calls directly in this setup. However, if we were using
# HTTP/gRPC, clusters for payment and inventory would be defined here with
# circuit breakers and retry policies like this:
#
# - name: payment_service_cluster
# connect_timeout: 1s
# type: STRICT_DNS
# lb_policy: ROUND_ROBIN
# circuit_breakers:
# thresholds:
# - priority: DEFAULT
# max_connections: 100
# max_pending_requests: 10
# max_requests: 50
# max_retries: 3
# load_assignment:
# ...
A common mistake is to put too much business logic into Envoy. Envoy should handle network-level concerns. Retrying a failed payment transaction, for example, is a business decision that belongs in the saga orchestrator. Retrying a 503 Service Unavailable
error from a transient network issue is an infrastructure concern that belongs in Envoy. Here, we’ve focused Envoy on its primary role as an edge proxy and trace initiator. In a fully HTTP-based microservice architecture, it would also manage internal traffic, where circuit breakers would be critical to prevent cascading failures.
The Gateway: High-Performance Communication with Fastify and SSE
The gateway is the bridge between the client and the internal microservices. It initiates the saga and provides a real-time status stream. We use Server-Sent Events (SSE) for this, a simple and efficient protocol for pushing data from server to client over a standard HTTP connection.
// src/app.controller.ts (in the API Gateway/Order Service)
import { Controller, Post, Body, Sse, Param, MessageEvent } from '@nestjs/common';
import { Observable, interval, map, fromEvent } from 'rxjs';
import { EventEmitter2 } from '@nestjs/event-emitter';
import { OrderSaga, OrderSagaState } from './order/order.saga';
@Controller('orders')
export class AppController {
constructor(
private readonly orderSaga: OrderSaga,
private readonly eventEmitter: EventEmitter2,
) {}
@Post()
async createOrder(@Body() createOrderDto: any): Promise<{ transactionId: string }> {
// Basic validation should be here
const transactionId = await this.orderSaga.startSaga({
orderId: createOrderDto.orderId,
userId: 'user-123',
// ... more data
});
// We also need a way to push updates for this transactionId
// The saga orchestrator will emit events like 'saga.updated.TRANSACTION_ID'
// This is a simplified version of that concept.
return { transactionId };
}
@Sse(':transactionId/status')
sse(@Param('transactionId') transactionId: string): Observable<MessageEvent> {
// This is the key part for real-time updates.
// We listen for events specific to this transactionId.
// The Saga Orchestrator needs to be modified to emit these events.
// For example, after each step: this.eventEmitter.emit(`saga.update.${transactionId}`, newState);
return fromEvent(this.eventEmitter, `saga.update.${transactionId}`).pipe(
map((state: OrderSagaState) => {
return {
data: {
status: state.status,
history: state.history,
failureReason: state.failureReason
},
} as MessageEvent;
}),
);
}
}
The main.ts
file must be configured to use FastifyAdapter
.
// src/main.ts
import { NestFactory } from '@nestjs/core';
import {
FastifyAdapter,
NestFastifyApplication,
} from '@nestjs/platform-fastify';
import { AppModule } from './app.module';
async function bootstrap() {
const app = await NestFactory.create<NestFastifyApplication>(
AppModule,
new FastifyAdapter(),
);
await app.listen(3001);
}
bootstrap();
Using Fastify provides a noticeable performance improvement for I/O-bound operations like managing hundreds or thousands of concurrent SSE connections, compared to the default Express adapter. This choice is a direct trade-off for performance over familiarity.
Frontend State Synchronization with Recoil
The final piece is the user interface. It must react to the saga’s state changes without complex manual state management. Recoil’s model shines here.
First, we define an atom family to hold the state for any given order transaction.
// src/state/orderState.js (in a React app)
import { atomFamily } from 'recoil';
export const orderSagaStateFamily = atomFamily({
key: 'orderSagaState',
default: {
status: 'IDLE',
history: [],
failureReason: null,
},
});
Next, a component uses a useEffect
hook to subscribe to the SSE stream when a transaction starts. As data arrives, it updates the Recoil atom, and the UI re-renders automatically.
// src/components/OrderTracker.js
import React, { useEffect, useState } from 'react';
import { useRecoilState } from 'recoil';
import { orderSagaStateFamily } from '../state/orderState';
function OrderTracker({ transactionId }) {
const [sagaState, setSagaState] = useRecoilState(orderSagaStateFamily(transactionId));
useEffect(() => {
if (!transactionId) return;
// Set initial state to pending when we start tracking
setSagaState({ status: 'PENDING', history: [{ step: 'RequestSent', status: 'SUCCESS' }] });
const eventSource = new EventSource(`/api/orders/${transactionId}/status`);
eventSource.onmessage = (event) => {
const data = JSON.parse(event.data);
// The state update is trivial. Recoil handles the rest.
setSagaState(data);
};
eventSource.onerror = (error) => {
console.error('SSE Error:', error);
setSagaState((prev) => ({ ...prev, status: 'ERROR', failureReason: 'Connection lost' }));
eventSource.close();
};
// Cleanup on component unmount
return () => {
eventSource.close();
};
}, [transactionId, setSagaState]);
return (
<div>
<h3>Order Status (Tx: {transactionId})</h3>
<p>Overall Status: <strong>{sagaState.status}</strong></p>
{sagaState.failureReason && <p style={{ color: 'red' }}>Reason: {sagaState.failureReason}</p>}
<ul>
{sagaState.history.map((item, index) => (
<li key={index}>
{item.timestamp}: [{item.step}] - {item.status}
</li>
))}
</ul>
</div>
);
}
// Parent component that initiates the order
export function OrderPlacement() {
const [transactionId, setTransactionId] = useState(null);
const handlePlaceOrder = async () => {
// This would call the POST /orders endpoint
const response = await fetch('/api/orders', { method: 'POST', /*...*/ });
const { transactionId: newTxId } = await response.json();
setTransactionId(newTxId);
};
return (
<div>
<button onClick={handlePlaceOrder} disabled={!!transactionId}>
Place a Test Order (Inventory will fail)
</button>
{transactionId && <OrderTracker transactionId={transactionId} />}
</div>
);
}
This architecture provides a clean separation of concerns. The React component knows nothing about Sagas or microservices. It only knows how to subscribe to a data stream and update a piece of state. Recoil’s declarative nature makes this flow far simpler to manage than traditional state management libraries that might require complex reducer logic for handling streaming updates.
sequenceDiagram participant Client as React UI (Recoil) participant Envoy as Envoy Proxy participant Gateway as NestJS Gateway (Fastify) participant Orchestrator as Order Saga participant PaymentSvc as Payment Service participant InventorySvc as Inventory Service Client->>+Envoy: POST /orders Envoy->>+Gateway: POST /orders Gateway->>+Orchestrator: startSaga() Orchestrator-->>-Gateway: { transactionId } Gateway-->>-Envoy: { transactionId } Envoy-->>-Client: { transactionId } Client->>+Envoy: GET /orders/{id}/status (SSE) Envoy->>+Gateway: GET /orders/{id}/status (SSE) Gateway-->>-Client: SSE Connection Opened Note over Orchestrator: Saga Execution Starts Orchestrator->>+PaymentSvc: process_payment PaymentSvc-->>-Orchestrator: payment_approved Orchestrator-->>Gateway: emit('saga.update', state) Gateway-->>Client: SSE: { status: 'PAYMENT_COMPLETE' } Orchestrator->>+InventorySvc: reserve_inventory InventorySvc-->>-Orchestrator: insufficient_stock (FAIL) Note over Orchestrator: Step Failed. Start Compensation. Orchestrator-->>Gateway: emit('saga.update', state) Gateway-->>Client: SSE: { status: 'FAILED', reason: '...' } Orchestrator->>+PaymentSvc: refund_payment Orchestrator-->>Gateway: emit('saga.update', state) Gateway-->>Client: SSE: { history: [..., {step: 'ProcessPayment', status:'COMPENSATING'}] } Note right of Client: UI re-renders reactively at each SSE event
The presented solution works, but it’s not without its limitations. The saga orchestrator’s state is held in-memory, making it a single point of failure and unsuitable for production without a persistent store like Redis or a relational database. Recovering from an orchestrator crash mid-saga is a non-trivial problem that requires careful state management and idempotency in all service calls. Furthermore, our SSE implementation is basic; a production system would need robust heartbeating, reconnection logic, and potentially a more scalable solution like WebSockets managed via a dedicated service if client numbers grow large. Finally, while logging with a correlation ID is a huge step forward for observability, true insight comes from integrating this with a distributed tracing system like Jaeger or OpenTelemetry, allowing visualization of the entire call graph and latency analysis for each step of the saga.