A business operation that spans several services raises a question SQL has been answering for fifty years inside a single database: what happens when one step succeeds and the next one fails? As long as everything lives in the same database, BEGIN ... ROLLBACK is enough. The moment you call an external service, a third-party API or another database, that safety net disappears.

The Saga pattern answers that question. Rather than attempting an impossible ACID transaction, it breaks the operation into local steps, each paired with a compensating transaction that knows how to undo its effect. If step 4 fails, the compensations for steps 3, 2 and 1 run in that order, the reverse of execution.

This article opens a series on distributed architecture patterns. We start with the most foundational one, the one that changes the way you think about a business operation the moment it crosses a service boundary. It belongs to the same family as the anti-corruption layer: isolate side effects and keep the business code in charge of its boundaries.

The problem: no rollback beyond the database

Imagine an online store checkout. The operation chains four steps:

  1. Create the order in the database
  2. Reserve the items in the inventory service
  3. Charge the customer’s card through Stripe
  4. Schedule shipping with the carrier

With a single PostgreSQL database, you would wrap the whole thing in a transaction and any failure would roll it back. But step 3 calls the Stripe API. Once the charge has been accepted, no SQL ROLLBACK can undo what happened outside. The customer has been charged, the Stripe webhook will come back, the event has left the boundary.

If step 4 fails, you end up with a created order, reserved inventory, a charged customer, and no shipment scheduled. Without an explicit plan, that order stays half-finished forever.

The principle: compensate rather than roll back

The Saga starts from a simple observation: any step that modifies an external system must know how to undo its own effect. Undoing is not the same as rolling back. Refunding a Stripe charge means calling a refund API, which leaves a trace on the Stripe side, emits an accounting event, and shows up on the customer’s statement. It is an explicit business operation, not an invisible deletion.

A Saga is therefore a sequence of steps, where each step S_i is paired with a compensation C_i. On failure at step n, you run C_{n-1}, C_{n-2}, ..., C_1 in reverse. The invariant to maintain: every compensation must be idempotent, and the system must be able to resume where it left off after a crash.

Choreography or orchestration

Two implementation styles coexist.

Choreography relies on events. Each service listens to what the others publish and reacts. OrderCreated triggers the inventory module, which publishes StockReserved, which triggers the payment module, and so on. No conductor. It is elegant as long as the workflow stays small. Beyond three or four steps, following what happens becomes hard: the Saga logic is scattered across as many services as there are steps, and debugging a failure requires reconstructing the thread from logs.
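
To make the choreography style concrete, here is a minimal in-process sketch. The EventBus class, the handler functions and the PaymentFailed and StockReleased events are hypothetical names introduced for illustration; a real system would publish through a message broker, with each handler living in its own service.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Toy in-memory bus standing in for a real message broker."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)
        self.log: list[str] = []  # every event published, in order

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        self.log.append(event)
        for handler in self._handlers[event]:
            handler(payload)


bus = EventBus()


def reserve_stock(payload: dict) -> None:
    # Inventory module: reacts to OrderCreated, knows nothing about payment.
    bus.publish("StockReserved", payload)


def charge_payment(payload: dict) -> None:
    # Payment module: reacts to StockReserved. A failure is itself an event.
    if payload.get("card_declined"):
        bus.publish("PaymentFailed", payload)
    else:
        bus.publish("PaymentCharged", payload)


def release_stock(payload: dict) -> None:
    # Inventory module again: compensates its own reservation.
    bus.publish("StockReleased", payload)


bus.subscribe("OrderCreated", reserve_stock)
bus.subscribe("StockReserved", charge_payment)
bus.subscribe("PaymentFailed", release_stock)

bus.publish("OrderCreated", {"order_id": 1, "card_declined": True})
print(bus.log)
```

Even in this toy version, the compensation path only appears once you read every subscription. That is the traceability cost described above.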

Orchestration centralizes the logic in a single component, a SagaOrchestrator, which knows every step, triggers them in order, and drives compensations on failure. More explicit, more testable, better suited to complex workflows. For a checkout that touches inventory, payment and shipping, it is almost always the right choice.

A minimal orchestration in Python

Here is a stripped-down Saga orchestrator. Each step implements execute and compensate.

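import logging
from abc import ABC, abstractmethod

logger = logging.getLogger(__name__)


class SagaStep(ABC):
    """One unit of work paired with the action that undoes it."""

    @abstractmethod
    def execute(self, context: dict) -> None: ...

    @abstractmethod
    def compensate(self, context: dict) -> None: ...


class SagaFailed(Exception):
    """Raised after compensations have run; carries the failing step and its cause."""

    def __init__(self, step: str, cause: Exception) -> None:
        # A plain Exception subclass keeps a readable message in tracebacks.
        super().__init__(f"saga failed at {step}: {cause!r}")
        self.step = step
        self.cause = cause


class SagaOrchestrator:
    def __init__(self, steps: list[SagaStep]) -> None:
        self.steps = steps

    def run(self, context: dict) -> None:
        executed: list[SagaStep] = []
        current: SagaStep | None = None
        try:
            for step in self.steps:
                current = step
                step.execute(context)
                executed.append(step)
        except Exception as exc:
            # Undo completed steps, most recent first. The step that raised
            # is not in `executed`, so it is not compensated.
            for done in reversed(executed):
                try:
                    done.compensate(context)
                except Exception:
                    # Log and continue: a failing compensation
                    # should not block the others.
                    logger.exception(
                        "compensation failed for %s", type(done).__name__
                    )
            raise SagaFailed(step=type(current).__name__, cause=exc) from exc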

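And the e-commerce order steps (the inventory reservation step is left out here, so the example chains three steps instead of four):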

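import stripe

# Order (the project's Django model) and shipping_api (a carrier client)
# are assumed to be defined elsewhere.


class CreateOrderStep(SagaStep):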
    def execute(self, context):
        order = Order.objects.create(**context["data"])
        context["order_id"] = order.id

    def compensate(self, context):
        Order.objects.filter(id=context["order_id"]).update(
            status="cancelled"
        )


class ChargePaymentStep(SagaStep):
    def execute(self, context):
        charge = stripe.Charge.create(
            amount=context["amount"],
            currency="eur",
            customer=context["customer_id"],
            idempotency_key=f"order-{context['order_id']}",
        )
        context["charge_id"] = charge.id

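    def compensate(self, context):
        # Same key on every replay, so a retried refund is not issued twice.
        stripe.Refund.create(
            charge=context["charge_id"],
            idempotency_key=f"refund-order-{context['order_id']}",
        )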


class ScheduleShipmentStep(SagaStep):
    def execute(self, context):
        shipment = shipping_api.create(context["order_id"])
        context["shipment_id"] = shipment.id

    def compensate(self, context):
        shipping_api.cancel(context["shipment_id"])


saga = SagaOrchestrator([
    CreateOrderStep(),
    ChargePaymentStep(),
    ScheduleShipmentStep(),
])
saga.run({"data": {...}, "amount": 4990, "customer_id": "cus_..."})

If ScheduleShipmentStep.execute raises, the orchestrator calls ChargePaymentStep.compensate and then CreateOrderStep.compensate. Stripe refunds the customer, the order moves to cancelled. The system stays consistent.

The pitfalls you discover in production

The code above works in the ideal case. Three problems show up the moment you deploy it for real.

Idempotency is mandatory. A compensation can be replayed. If the process crashes after compensating step 2 but before marking the Saga finished, the restart will attempt the compensation again. stripe.Refund.create must be safe to call twice without refunding twice. That is exactly why Stripe exposes an idempotency_key on its endpoints: use it systematically, on charges and on refunds alike.
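
The replay scenario can be illustrated without touching a real payment provider. In the sketch below, FakeRefundGateway is a hypothetical stand-in that mimics how an idempotency key is expected to behave (same key, same result, no second side effect). It is not Stripe's API, and the key format is an assumption.

```python
class FakeRefundGateway:
    """Hypothetical stand-in for a provider that honours idempotency keys."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}  # idempotency key -> refund id
        self.refunds_issued = 0

    def refund(self, charge_id: str, idempotency_key: str) -> str:
        # A replay with a known key returns the original result
        # instead of issuing a second refund.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.refunds_issued += 1
        refund_id = f"re_{self.refunds_issued}"
        self._seen[idempotency_key] = refund_id
        return refund_id


def compensate_payment(gateway: FakeRefundGateway, context: dict) -> None:
    # The key is derived from stable Saga data, never generated at random,
    # so a replay after a crash sends exactly the same key.
    key = f"refund-order-{context['order_id']}"
    context["refund_id"] = gateway.refund(context["charge_id"], key)


gateway = FakeRefundGateway()
context = {"order_id": 42, "charge_id": "ch_test"}
compensate_payment(gateway, context)
compensate_payment(gateway, context)  # replayed after a simulated crash
print(gateway.refunds_issued, context["refund_id"])
```

The important design choice is where the key comes from: deriving it from the order identifier makes the replay indistinguishable from the first call.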

A failed compensation is a real possibility. The Stripe API or the shipping service may be down when you want to compensate. The sensible strategy: retry with backoff, and if after several attempts the compensation still fails, move the Saga into a compensation_failed status that alerts a human. A silently failing compensation leaves the system in a corrupted state.
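
One possible shape for that strategy is sketched below. The function name compensate_with_retry, its parameters and the "compensated" status string are assumptions made for the example; only compensation_failed comes from the text above. The sleep function is injectable so the backoff can be observed without waiting.

```python
import time
from typing import Callable


def compensate_with_retry(
    compensate: Callable[[], None],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run one compensation; return 'compensated' or 'compensation_failed'."""
    for attempt in range(max_attempts):
        try:
            compensate()
            return "compensated"
        except Exception:
            if attempt == max_attempts - 1:
                break
            # Exponential backoff: 0.5s, 1s, 2s with the defaults.
            sleep(base_delay * 2 ** attempt)
    # Out of attempts: report the status so the caller can alert a human.
    return "compensation_failed"


calls = {"n": 0}


def flaky_refund() -> None:
    # Fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("payment API unavailable")


def always_down() -> None:
    raise ConnectionError("shipping API unavailable")


delays: list[float] = []
status_ok = compensate_with_retry(flaky_refund, sleep=delays.append)
status_ko = compensate_with_retry(always_down, sleep=lambda _: None)
print(status_ok, delays, status_ko)
```

Returning a status instead of raising keeps the decision (alert, park the Saga, escalate) with the caller, which is where the Saga state lives.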

The Saga state must survive a process crash. The example above keeps state in memory. If the worker dies between two steps, you lose track of what has been done. In practice, you persist state after every step: a SagaInstance model in the database, with each step’s status and the serialized context. On restart, you resume where you left off, or launch compensation.

Wiring it into Django and Celery

On a Django plus Celery stack, the pattern maps naturally. Each step becomes a Celery task, chained with the next one through an execute_step task that maintains state in the database.

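from celery import shared_task
from django.db import transaction

# SagaInstance, STEPS and compensate_saga are assumed to be defined elsewhere.


@shared_task(bind=True, max_retries=3)
def execute_step(self, saga_id: int, step_index: int):
    try:
        with transaction.atomic():
            saga = SagaInstance.objects.select_for_update().get(id=saga_id)
            # is_step_done is an assumed helper: without this check, a retry
            # that was waiting on the row lock would run the step again.
            if saga.is_step_done(step_index):
                return
            STEPS[step_index].execute(saga.context)
            saga.mark_step_done(step_index)
            if step_index + 1 == len(STEPS):
                saga.mark_completed()
    except Exception as exc:
        # The atomic block has already rolled back when we get here.
        # max_retries only applies when the task calls self.retry().
        if self.request.retries < self.max_retries:
            raise self.retry(exc=exc, countdown=2 ** self.request.retries)
        compensate_saga.delay(saga_id, up_to=step_index)
        raise

    # Outside the transaction: enqueue only once the step is committed.
    if step_index + 1 < len(STEPS):
        execute_step.delay(saga_id, step_index + 1)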

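select_for_update serializes concurrent executions against the same Saga row, but on its own it only makes the second execution wait, which is why the task should also check, once the lock is held, that the step is not already marked done. The lock requires an open transaction, otherwise Django raises TransactionManagementError. Scheduling the next step (execute_step.delay) is intentionally outside the atomic block, so a Celery task is never emitted if the transaction ends up rolled back. The compensation itself is a Celery task, so it can use Celery's retry mechanism and, with late acknowledgement (acks_late) enabled, be redelivered after a worker crash. Neither comes for free: both have to be configured.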

A Saga step often needs to publish an event (OrderCreated, PaymentCharged) so other parts of the system react. Publishing directly to a broker from inside the step raises a classic problem: if the publication succeeds but the local transaction fails, you emit an event that matches nothing. The reverse is also true.

The Outbox pattern solves that misalignment. The step writes the event to an outbox table within the same SQL transaction as its own business write. An external publisher reads that table and emits to the broker. Both states stay consistent whatever happens. A later article in this series will dig into this mechanism.
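
Pending that article, here is a minimal sketch of the idea using Python's built-in sqlite3 module. The orders and outbox tables, their columns, and the create_order and publish_pending functions are assumptions made for the example; the send callback stands in for a real broker client.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
db.execute(
    "CREATE TABLE outbox ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " event TEXT, payload TEXT, published INTEGER DEFAULT 0)"
)


def create_order(order_id: int, fail: bool = False) -> None:
    # `with db` commits on success and rolls back on exception, so the
    # business row and the event row succeed or fail together.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'created')", (order_id,))
        db.execute(
            "INSERT INTO outbox (event, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )
        if fail:
            raise RuntimeError("simulated failure before commit")


def publish_pending(send) -> int:
    """Relay: push unpublished rows to the broker, then mark them published."""
    rows = db.execute(
        "SELECT id, event, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event, payload in rows:
        send(event, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent: list[tuple[str, dict]] = []
create_order(1)
try:
    create_order(2, fail=True)  # rolled back: no order row, no outbox row
except RuntimeError:
    pass
published = publish_pending(lambda event, payload: sent.append((event, payload)))
print(published, sent)
```

If the relay crashes after sending but before marking a row as published, the event goes out again on the next pass. Delivery is therefore at-least-once, and consumers need to tolerate duplicates.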

When not to reach for a Saga

The pattern has a cost: more code, state to persist, compensation logic to write and test for every step. For a workflow that fits inside a single database and a single transaction, it is overengineering. A Django transaction.atomic does the job, more simply, with actual ACID guarantees.

A Saga becomes relevant once:

  • a business operation touches several systems (different databases, external APIs, message queues)
  • steps cannot be made atomic together
  • the operation must be resumable after a crash
  • each step has a business notion of “undo” that makes sense

Conversely, for a purely local workflow, for an idempotent operation that can simply be replayed on failure, or for a case where “doing nothing” is an acceptable compensation, the Saga pattern is too heavy.

Conclusion

The Saga does not replace transactions. It accepts that they are no longer available and offers a framework for living without them. Every step carries its own undo logic, the orchestrator enforces ordering and persists state, and the system can return to a consistent state even when an external service goes down in the middle of a workflow.

The need for the pattern becomes obvious the moment you start tracing what actually happens in a distributed business operation. Compensation is never free, but it is explicit, traceable and auditable, where a silently corrupted state never is.