Implementing a Durable SQL-Backed Saga Relay for Zero-Downtime Deployments with Spring and Spinnaker


The transition from a monolith to microservices introduced a critical data consistency problem. A single business transaction, like processing an order, was now fragmented across order-service, payment-service, and inventory-service. The initial approach using synchronous, chained REST calls was a predictable failure. A timeout in the inventory-service would cascade, leaving a completed payment for an order that was never recorded as fulfilled. Standard two-phase commit (2PC) was considered and immediately dismissed; the operational overhead and performance degradation from distributed locks were unacceptable for our availability requirements.

Our first attempt to solve this involved implementing the Saga pattern. The concept was sound: a sequence of local transactions, where each step has a corresponding compensating action to undo it. The initial implementation was an in-memory orchestrator within the order-service. It was a simple state machine held in a ConcurrentHashMap, mapping a correlation ID to the current state of the saga.

// DO NOT USE THIS IN PRODUCTION - FLAWED INITIAL ATTEMPT
@Service
public class InMemorySagaCoordinator {

    // A map to hold the state of all in-flight sagas.
    // This is the core flaw: state is volatile.
    private final Map<UUID, SagaState> activeSagas = new ConcurrentHashMap<>();

    private final RestTemplate restTemplate;

    // ... constructor ...

    public void beginOrderSaga(OrderRequest order) {
        UUID sagaId = UUID.randomUUID();
        SagaState initialState = new SagaState(sagaId, order);
        activeSagas.put(sagaId, initialState);
        
        // Asynchronously execute the first step
        CompletableFuture.runAsync(() -> executePaymentStep(sagaId));
    }

    private void executePaymentStep(UUID sagaId) {
        SagaState state = activeSagas.get(sagaId);
        if (state == null) return; // Saga was completed or aborted

        try {
            // Step 1: Call Payment Service
            ResponseEntity<String> response = restTemplate.postForEntity(
                "http://payment-service/charge", 
                state.getPaymentDetails(), 
                String.class
            );

            if (response.getStatusCode().is2xxSuccessful()) {
                state.setCurrentStep(SagaStep.DEBIT_INVENTORY);
                CompletableFuture.runAsync(() -> executeInventoryStep(sagaId));
            } else {
                // Begin rollback - no compensation needed for the first step
                activeSagas.remove(sagaId);
            }
        } catch (Exception e) {
            // Network error, etc.
            // A real-world project needs a retry mechanism here, which adds even more complexity
            // to this flawed in-memory model.
            activeSagas.remove(sagaId);
        }
    }
    
    // ... executeInventoryStep and compensation methods omitted for brevity ...
}

This design worked perfectly in controlled tests. The first production deployment, however, revealed the fatal flaw. A routine pod restart in our Kubernetes cluster, part of a minor configuration update, wiped the ConcurrentHashMap clean. Dozens of in-flight sagas vanished. We were left with charged payments for orders that the inventory service knew nothing about, forcing a painful manual reconciliation process. The core takeaway was absolute: saga state must be as durable as the application data it coordinates.

This led to the design of a durable, SQL-backed “Saga Relay.” Instead of relying on volatile application memory, we decided to persist the entire state of every saga in a PostgreSQL database. A common mistake is to over-engineer this persistence layer. We explicitly chose standard SQL over event-streaming infrastructure such as Kafka or a specialized workflow store. In a real-world project, the familiarity, the transactional integrity of the single-row state updates the relay performs, and the powerful querying capabilities of SQL provide a stable and auditable foundation. The database becomes our durable transaction log.

The schema is the bedrock of this new design. It needs to track not only the overall saga instance but also the state of each individual step.

-- Schema for the Saga Relay persistence layer in PostgreSQL

CREATE TABLE saga_instance (
    id UUID PRIMARY KEY,
    saga_type VARCHAR(100) NOT NULL,
    status VARCHAR(20) NOT NULL CHECK (status IN ('RUNNING', 'COMPENSATING', 'COMPLETED', 'FAILED')),
    current_step INT NOT NULL DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE saga_step (
    id SERIAL PRIMARY KEY,
    saga_id UUID NOT NULL REFERENCES saga_instance(id) ON DELETE CASCADE,
    step_name VARCHAR(100) NOT NULL,
    -- Position of the step within the saga; the JPA mapping orders by this column.
    step_order INT NOT NULL,
    status VARCHAR(20) NOT NULL CHECK (status IN ('PENDING', 'SUCCESS', 'FAILED', 'COMPENSATED')),
    execution_payload JSONB,
    compensation_payload JSONB,
    -- The endpoints for the forward and backward actions.
    -- Storing them here makes the relay more generic.
    forward_endpoint VARCHAR(255) NOT NULL,
    compensation_endpoint VARCHAR(255) NOT NULL,
    -- For optimistic locking to prevent concurrent modification.
    version INT NOT NULL DEFAULT 0,
    -- Keep track of retry attempts.
    attempts INT NOT NULL DEFAULT 0,
    last_attempt_at TIMESTAMPTZ
);

CREATE INDEX idx_saga_instance_status ON saga_instance(status);
CREATE INDEX idx_saga_step_saga_id_status ON saga_step(saga_id, status);
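
The status columns map one-to-one onto two small Java enums persisted as strings via @Enumerated(EnumType.STRING). A minimal sketch; the values mirror the CHECK constraints above and nothing more:

// SagaStatus.java - lifecycle of a saga instance, mirroring the saga_instance CHECK constraint.
public enum SagaStatus {
    RUNNING, COMPENSATING, COMPLETED, FAILED
}

// SagaStepStatus.java - lifecycle of an individual step, mirroring the saga_step CHECK constraint.
public enum SagaStepStatus {
    PENDING, SUCCESS, FAILED, COMPENSATED
}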

With the durable storage defined, we implemented the orchestrator logic in Spring Boot, using Spring Data JPA for persistence. The core of the system is no longer an in-memory map but a polling-based job that queries the database for work to be done. This is a deliberate trade-off: polling introduces some latency compared to a push model, but it is incredibly resilient and simple to implement correctly.

Here are the JPA entities mapping to our schema:

// SagaInstance.java
@Entity
@Table(name = "saga_instance")
public class SagaInstance {
    @Id
    private UUID id;

    @Column(name = "saga_type", nullable = false)
    private String sagaType;
    
    @Enumerated(EnumType.STRING)
    @Column(nullable = false)
    private SagaStatus status;
    
    @Column(name = "current_step", nullable = false)
    private int currentStep;
    
    // Using a one-to-many relationship with an order column
    // to ensure steps are processed in the correct sequence.
    @OneToMany(mappedBy = "sagaInstance", cascade = CascadeType.ALL, fetch = FetchType.EAGER)
    @OrderBy("stepOrder ASC")
    private List<SagaStep> steps = new ArrayList<>();
    
    // Getters, Setters, Timestamps...
}

// SagaStep.java
@Entity
@Table(name = "saga_step")
public class SagaStep {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    
    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "saga_id", nullable = false)
    private SagaInstance sagaInstance;
    
    @Column(name = "step_name", nullable = false)
    private String stepName;
    
    @Column(name = "step_order", nullable = false)
    private int stepOrder;

    @Enumerated(EnumType.STRING)
    @Column(nullable = false)
    private SagaStepStatus status;
    
    @Type(JsonBinaryType.class)
    @Column(name = "execution_payload", columnDefinition = "jsonb")
    private String executionPayload;

    @Type(JsonBinaryType.class)
    @Column(name = "compensation_payload", columnDefinition = "jsonb")
    private String compensationPayload;

    @Column(name = "forward_endpoint", nullable = false)
    private String forwardEndpoint;
    
    @Column(name = "compensation_endpoint", nullable = false)
    private String compensationEndpoint;

    @Version
    private int version;
    
    @Column(nullable = false)
    private int attempts = 0;

    @Column(name = "last_attempt_at")
    private Instant lastAttemptAt;

    // Getters, Setters...
}
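
The repositories are plain Spring Data JPA interfaces. The only non-trivial piece is the findRunnableSagas query the processor relies on below; this is one way to write it (the JPQL and helper method shown here are an illustrative sketch, not a transcription of the production query):

// SagaInstanceRepository.java
public interface SagaInstanceRepository extends JpaRepository<SagaInstance, UUID> {

    // A bounded page of sagas in the given status that still have a step in the
    // given step status; the Pageable caps the work done per polling cycle.
    @Query("""
           select distinct s from SagaInstance s join s.steps st
           where s.status = :sagaStatus and st.status = :stepStatus
           """)
    List<SagaInstance> findRunnable(@Param("sagaStatus") SagaStatus sagaStatus,
                                    @Param("stepStatus") SagaStepStatus stepStatus,
                                    Pageable pageable);

    // Convenience wrapper matching the call made by the SagaProcessor below.
    default List<SagaInstance> findRunnableSagas(Pageable pageable) {
        return findRunnable(SagaStatus.RUNNING, SagaStepStatus.PENDING, pageable);
    }

    // Used by the compensation job sketched further below.
    List<SagaInstance> findByStatus(SagaStatus status);
}

// SagaStepRepository.java
public interface SagaStepRepository extends JpaRepository<SagaStep, Long> {
}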

The orchestrator logic now centers around a scheduled job that drives the state machine forward. It finds RUNNING sagas, identifies the next PENDING step, and executes it.

// SagaProcessor.java
@Service
@EnableScheduling
public class SagaProcessor {

    private static final Logger log = LoggerFactory.getLogger(SagaProcessor.class);
    private final SagaInstanceRepository sagaInstanceRepository;
    private final SagaStepExecutor sagaStepExecutor;
    // Control flag for deployment safety. New instances can be configured to start
    // paused (saga.processor.start.paused=true) so they stay idle until the
    // pipeline explicitly resumes them.
    @Value("${saga.processor.start.paused:false}")
    private volatile boolean isPaused;

    // ... constructor ...

    @Scheduled(fixedDelayString = "${saga.processor.polling.delay:5000}")
    @Transactional
    public void processPendingSagas() {
        if (isPaused) {
            log.info("Saga processor is paused, skipping execution cycle.");
            return;
        }

        // Find sagas that are running and have steps to be processed.
        // The query is crucial for performance.
        List<SagaInstance> activeSagas = sagaInstanceRepository.findRunnableSagas(PageRequest.of(0, 10));

        for (SagaInstance saga : activeSagas) {
            saga.getSteps().stream()
                .filter(step -> step.getStatus() == SagaStepStatus.PENDING)
                .findFirst()
                .ifPresent(sagaStepExecutor::executeStep);
        }
    }
    
    // Method to pause/resume the processor via an actuator endpoint.
    public void pause() { this.isPaused = true; }
    public void resume() { this.isPaused = false; }
    public boolean isPaused() { return this.isPaused; }
}

// SagaStepExecutor.java
@Service
public class SagaStepExecutor {

    private static final Logger log = LoggerFactory.getLogger(SagaStepExecutor.class);
    private final SagaStepRepository stepRepository;
    private final SagaInstanceRepository sagaInstanceRepository;
    private final WebClient webClient;

    // ... constructor ...
    
    @Transactional(propagation = Propagation.REQUIRES_NEW) // Each step execution is its own transaction.
    public void executeStep(SagaStep step) {
        log.info("Executing step {} for saga {}", step.getStepName(), step.getSagaInstance().getId());
        step.setAttempts(step.getAttempts() + 1);
        step.setLastAttemptAt(Instant.now());

        try {
            // Using Spring's WebClient for non-blocking I/O.
            webClient.post()
                .uri(step.getForwardEndpoint())
                .contentType(MediaType.APPLICATION_JSON)
                .bodyValue(step.getExecutionPayload())
                .retrieve()
                .toBodilessEntity()
                .subscribe(
                    // Note: these callbacks invoke the handlers on 'this', so the
                    // REQUIRES_NEW annotations below are bypassed (Spring proxies do
                    // not intercept self-invocation). Production code should route
                    // the calls through the proxy or a separate handler bean.
                    response -> {
                        if (response.getStatusCode().is2xxSuccessful()) {
                            handleSuccess(step.getId());
                        } else {
                            handleFailure(step.getId(), "HTTP " + response.getStatusCode().value());
                        }
                    },
                    error -> handleFailure(step.getId(), error.getMessage())
                );

        } catch (Exception e) {
            // This outer catch is for synchronous errors like URI syntax issues.
            log.error("Critical error during step execution setup for saga {}", step.getSagaInstance().getId(), e);
            handleFailure(step.getId(), "Pre-flight execution error");
        }
    }

    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void handleSuccess(Long stepId) {
        SagaStep step = stepRepository.findById(stepId).orElseThrow();
        step.setStatus(SagaStepStatus.SUCCESS);
        stepRepository.save(step);
        // Logic to advance the parent saga's current_step would go here.
    }

    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void handleFailure(Long stepId, String reason) {
        SagaStep step = stepRepository.findById(stepId).orElseThrow();
        log.warn("Step {} for saga {} failed. Reason: {}", step.getStepName(), step.getSagaInstance().getId(), reason);
        
        // A real implementation would have more sophisticated retry logic based on 'attempts'.
        // For simplicity, we fail immediately and trigger compensation.
        step.setStatus(SagaStepStatus.FAILED);
        stepRepository.save(step);
        
        // Flip the parent saga to COMPENSATING and persist it explicitly so the
        // status change is not lost if dirty checking never flushes it.
        SagaInstance saga = step.getSagaInstance();
        saga.setStatus(SagaStatus.COMPENSATING);
        sagaInstanceRepository.save(saga);
        // The compensation itself is handled by another scheduled job looking for 'COMPENSATING' sagas.
    }
}
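
Compensation mirrors this flow in reverse: a separate scheduled job looks for COMPENSATING sagas, walks their successful steps backwards, and calls each step's compensation endpoint. The following is a simplified sketch of that companion job (it ignores the pause flag and retry bookkeeping shown earlier, and assumes a compensation payload was recorded when the step was created):

// CompensationProcessor.java - companion job to SagaProcessor.
@Service
public class CompensationProcessor {

    private static final Logger log = LoggerFactory.getLogger(CompensationProcessor.class);
    private final SagaInstanceRepository sagaInstanceRepository;
    private final SagaStepRepository stepRepository;
    private final WebClient webClient;

    // ... constructor ...

    @Scheduled(fixedDelayString = "${saga.compensator.polling.delay:5000}")
    @Transactional
    public void compensateFailedSagas() {
        for (SagaInstance saga : sagaInstanceRepository.findByStatus(SagaStatus.COMPENSATING)) {
            // Walk the steps in reverse order and undo every step that succeeded.
            List<SagaStep> toUndo = new ArrayList<>(saga.getSteps());
            Collections.reverse(toUndo);

            boolean allCompensated = true;
            for (SagaStep step : toUndo) {
                if (step.getStatus() != SagaStepStatus.SUCCESS) {
                    continue; // Nothing to undo for PENDING, FAILED, or already COMPENSATED steps.
                }
                try {
                    // A blocking call keeps the compensation loop simple; throughput
                    // is rarely a concern on this path.
                    webClient.post()
                        .uri(step.getCompensationEndpoint())
                        .contentType(MediaType.APPLICATION_JSON)
                        .bodyValue(step.getCompensationPayload())
                        .retrieve()
                        .toBodilessEntity()
                        .block();
                    step.setStatus(SagaStepStatus.COMPENSATED);
                    stepRepository.save(step);
                } catch (Exception e) {
                    log.warn("Compensation of step {} failed, will retry next cycle", step.getStepName(), e);
                    allCompensated = false;
                    break; // Stop here; the saga stays COMPENSATING and is retried later.
                }
            }
            if (allCompensated) {
                saga.setStatus(SagaStatus.FAILED); // Terminal state: the business transaction is rolled back.
                sagaInstanceRepository.save(saga);
            }
        }
    }
}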

This durable implementation was a massive improvement. Service restarts were no longer a threat. However, our next production deployment surfaced a more subtle and dangerous problem. We used a standard blue-green deployment strategy managed by Spinnaker. The pipeline looked like this: bake a new machine image, deploy a new server group (the “green” one), add it to the load balancer, and then disable the old server group (the “blue” one).

The pitfall here is the period during which both blue and green server groups are active. Both sets of pods were running their own SagaProcessor scheduled job. Both were polling the same saga_instance table. This led to a classic race condition: two different application instances would pick up the same pending step and execute it twice. For a payment charge, this is catastrophic. For an inventory decrement, it’s a data corruption nightmare.

The deployment itself was creating data inconsistency.

We needed a deployment strategy that was aware of this stateful, singleton-like background process. The solution was to customize our Spinnaker pipeline to orchestrate the lifecycle of the SagaProcessor job across the deployment boundary. The key was to ensure only one processor was active at any given time system-wide.

We achieved this by adding a simple control mechanism to our Spring Boot application via custom Actuator endpoints.

// SagaControlEndpoint.java
@Component
@Endpoint(id = "sagacontrol")
public class SagaControlEndpoint {

    private final SagaProcessor sagaProcessor;

    public SagaControlEndpoint(SagaProcessor sagaProcessor) {
        this.sagaProcessor = sagaProcessor;
    }

    // A single write operation with a @Selector maps to
    // POST /actuator/sagacontrol/{action}, e.g. /actuator/sagacontrol/pause.
    // Two parameterless @WriteOperations on the same endpoint would be rejected
    // at startup as duplicate operations.
    @WriteOperation
    public Map<String, String> control(@Selector String action) {
        if ("pause".equalsIgnoreCase(action)) {
            sagaProcessor.pause();
            return Map.of("status", "PAUSED");
        }
        if ("resume".equalsIgnoreCase(action)) {
            sagaProcessor.resume();
            return Map.of("status", "RESUMED");
        }
        return Map.of("status", "UNKNOWN_ACTION");
    }
    
    @ReadOperation
    public Map<String, Object> status() {
        return Map.of("paused", sagaProcessor.isPaused());
    }
}

This endpoint allows an external system—in our case, Spinnaker—to safely pause and resume the background job; note that a custom Actuator endpoint must be explicitly exposed over HTTP via management.endpoints.web.exposure.include. Our Spinnaker pipeline was re-architected to use these controls.

graph TD
    subgraph "Spinnaker Pipeline"
        A(Bake AMI) --> B(Deploy New Version - Green);
        B --> C{Manual Judgment: Proceed?};
        C --> D[Run Job: Pause Old Version];
        D --> E{Wait for Drain};
        E --> F[Disable Old Version - Blue];
        F --> G[Run Job: Resume New Version];
    end

    subgraph "Old Server Group (Blue)"
        B1(App v1 - Running)
        D --> B2(App v1 - Paused)
        F --> B3(App v1 - Terminated)
    end
    
    subgraph "New Server Group (Green)"
        B --> G1(App v2 - Deployed but Paused by default)
        G --> G2(App v2 - Resumed & Running)
    end

    style D fill:#f9f,stroke:#333,stroke-width:2px
    style G fill:#f9f,stroke:#333,stroke-width:2px

The new Spinnaker pipeline flow is as follows:

  1. Bake & Deploy Green: A new server group with the updated application code is deployed. Crucially, the application starts with the SagaProcessor in a paused state by default, which can be controlled via a property: saga.processor.start.paused=true.
  2. Run Job: Pause Old Version: Before routing traffic, a “Run Job (CloudFoundry)” or “Run Job (Kubernetes)” stage is added. This stage executes a script that iterates through all instances in the old server group and sends a POST request to the /actuator/sagacontrol/pause endpoint. This gracefully stops the existing processors from picking up new work.
  3. Wait for Drain: A “Wait” stage is added for a configurable period (e.g., 30 seconds) to allow any steps that were already in-flight on the old instances to complete their current transaction.
  4. Disable Blue: The old server group is disabled and removed from the load balancer.
  5. Run Job: Resume New Version: A final “Run Job” stage targets the new server group and sends a POST request to /actuator/sagacontrol/resume. This activates the processor on the new instances, which then safely take over processing from the database.

Here is a conceptual snippet of what a “Run Job” stage’s manifest might look like in a Kubernetes context within Spinnaker’s pipeline JSON. The container image shown is a placeholder for one that bundles both kubectl (to look up the old pods) and curl, and the job's service account needs permission to list pods in the namespace:

{
  "type": "runJob",
  "name": "Pause Saga Processor on Old Version",
  "refId": "pauseOldVersion",
  "requisiteStageRefIds": ["deployNewVersion"],
  "account": "my-k8s-account",
  "application": "my-app",
  "cloudProvider": "kubernetes",
  "manifest": {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
      "namespace": "my-namespace",
      "name": "pause-saga-job"
    },
    "spec": {
      "template": {
        "spec": {
          "containers": [
            {
              "name": "control-job",
              "image": "curlimages/curl:7.72.0",
              "command": [
                "/bin/sh",
                "-c",
                "for pod_ip in $(kubectl get pods -l app=my-app,version=v1 -o jsonpath='{.items[*].status.podIP}'); do curl -X POST http://$pod_ip:8080/actuator/sagacontrol/pause; done"
              ]
            }
          ],
          "restartPolicy": "Never"
        }
      }
    }
  },
  "propertyFile": "output.properties",
  "predecessors": [
    {
      "refId": "deployNewVersion",
      "failure": "fail"
    }
  ]
}

This combined architecture of a durable SQL-backed relay and a deployment-aware lifecycle managed by Spinnaker finally gave us the resilience we needed. In-flight transactions are no longer lost on restarts, and deployments no longer cause data corruption through race conditions. The solution isn’t glamorous; it relies on established technologies like PostgreSQL and a simple polling mechanism. Its strength lies in its robustness and operational simplicity.

The current design, however, is not without its limitations. The database is a single point of contention and could become a performance bottleneck if the saga transaction volume becomes extremely high. The polling mechanism, while safe, inherently introduces a minimum latency to step execution equal to the polling interval. For future iterations, we are exploring a hybrid model. One potential path is to introduce a lightweight message queue; the processor would write a message to the queue to trigger the next step immediately, while still using the SQL database as the ultimate source of truth for recovery. This would reduce latency but requires careful implementation to avoid message duplication issues. Another avenue is to investigate database-level locking, such as using SELECT ... FOR UPDATE SKIP LOCKED, to allow multiple processor instances to run concurrently, each safely grabbing a distinct set of sagas to work on. This would eliminate the need for the pause/resume mechanism in the deployment pipeline but transfers complexity from the deployment process to the data access logic.
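
As a rough illustration of that second avenue, a native query along the following lines (a sketch, not something we have productionized) would let several processor instances each claim a disjoint batch of sagas. It would live on SagaInstanceRepository alongside the other query methods:

// Claim a batch of runnable sagas with row-level locks. SKIP LOCKED makes
// concurrent pollers skip rows another instance already holds, so each
// processor works on a disjoint set for the duration of its transaction.
// Must be called within an open transaction so the locks are held for the
// whole polling cycle.
@Query(value = """
        SELECT * FROM saga_instance
        WHERE status = 'RUNNING'
        ORDER BY created_at
        LIMIT :batchSize
        FOR UPDATE SKIP LOCKED
        """, nativeQuery = true)
List<SagaInstance> claimRunnableSagas(@Param("batchSize") int batchSize);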

