The failure scenario nobody designs for

Your automated process has five steps. Step 1 validates the input. Step 2 checks eligibility. Step 3 reserves a resource. Step 4 processes the payment. Step 5 sends the confirmation.

Step 4 fails.

Now what? Step 3 already reserved a resource that should be released. Step 2's eligibility check created a record. Step 1 logged the attempt. The system is in an inconsistent state — half-done work scattered across multiple services and databases.
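The shape of the problem fits in a few lines. This sketch uses hypothetical step names and a stub failure in place of a real payment call:

```python
completed = []  # side effects that have already happened, with no undo path

def validate_input(order):    completed.append("validate_input")
def check_eligibility(order): completed.append("check_eligibility")
def reserve_resource(order):  completed.append("reserve_resource")
def process_payment(order):   raise RuntimeError("card declined")
def send_confirmation(order): completed.append("send_confirmation")

try:
    for step in (validate_input, check_eligibility, reserve_resource,
                 process_payment, send_confirmation):
        step({"id": 42})
except RuntimeError:
    pass  # in a real system: an alert, an on-call engineer, a long night

# Three steps' side effects are now stranded, with nothing recording how to undo them:
print(completed)  # ['validate_input', 'check_eligibility', 'reserve_resource']
```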

In a monolithic application with a single database, you'd rely on a database transaction: if anything fails, everything rolls back. But in distributed systems — and especially in AI-agent-driven workflows that interact with multiple external services — there's no global transaction to rely on.

This is the rollback problem. It's not new. But with the rise of AI agents that autonomously execute multi-step workflows, it's become significantly more dangerous — because agents can initiate actions faster and across more systems than any manual process, and they don't naturally pause to ask "can I undo this?"


A well-documented engineering problem

The Saga pattern, first described by Hector Garcia-Molina and Kenneth Salem in a 1987 Princeton University technical report ("SAGAS", TR-070-87), addresses exactly this scenario. Microsoft's Azure Architecture Center, AWS Prescriptive Guidance, and Chris Richardson's widely referenced microservices.io all document the same core concept: when a multi-step process can't rely on a single ACID transaction, each step must define a compensating transaction — a reverse operation that semantically undoes the work.

The key word is "semantically." As Microsoft's Compensating Transaction pattern explains, you can't always roll back data changes with a simple database rollback; compensation is an application-specific process that applies business logic to undo previously completed work. You can't un-send an email. You can't un-call an API. But you can cancel a reservation, reverse a charge, or mark a record as void.
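In code, that distinction shows up as a lookup into business logic rather than a database primitive. A minimal sketch, with hypothetical action names; note what's missing from the table:

```python
# Semantic compensators: business operations that undo other business
# operations. There is deliberately no entry for send_email --
# an email cannot be unsent, so the table refuses to pretend otherwise.
COMPENSATORS = {
    "reserve_room":  lambda booking: booking.update(status="cancelled"),
    "charge_card":   lambda payment: payment.update(refund=payment["amount"]),
    "create_record": lambda record:  record.update(void=True),
}

def compensate(action_name, payload):
    comp = COMPENSATORS.get(action_name)
    if comp is None:
        raise RuntimeError(f"{action_name} is irreversible; escalate to a human")
    comp(payload)
```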

AWS documentation distinguishes platform-level failures (forward recovery via retry and continue) from application-level failures (backward recovery via compensating transactions). The choice between them is a design decision that must be made per step, not per system.
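One way to make that per-step decision explicit is a declared policy table, consulted mechanically at failure time. A sketch, assuming hypothetical step names:

```python
from enum import Enum

class Recovery(Enum):
    FORWARD = "retry and continue"     # platform-level: timeouts, transient errors
    BACKWARD = "compensate and abort"  # application-level: business rejection

# Declared at design time, per step -- not guessed at runtime.
RECOVERY_POLICY = {
    "reserve_resource": Recovery.FORWARD,   # a flaky network call is worth retrying
    "process_payment":  Recovery.BACKWARD,  # a declined card is a business fact
}

def on_failure(step_name, retry, compensate_all):
    """Dispatch on the step's declared policy, not on runtime judgment."""
    if RECOVERY_POLICY[step_name] is Recovery.FORWARD:
        return retry(step_name)
    return compensate_all()
```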

Microsoft's Compensating Transaction pattern documentation adds an important nuance: "A compensating transaction might not have to undo the work in the exact reverse order of the original operation. It might be possible to perform some of the undo steps in parallel." This matters for performance — if your process has eight steps and step 6 fails, you may be able to compensate steps 3, 4, and 5 simultaneously rather than sequentially.
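A sketch of that optimization, assuming the set of mutually independent compensators is known at design time (all names hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def compensate(completed, compensators, independent):
    """Undo completed steps, newest first, batching the independent ones.

    `independent` names steps whose compensators touch disjoint systems
    and may safely run concurrently -- a design-time judgment, not
    something inferred at runtime.
    """
    parallel = [s for s in reversed(completed) if s in independent]
    sequential = [s for s in reversed(completed) if s not in independent]
    if parallel:
        with ThreadPoolExecutor() as pool:       # run the batch concurrently
            list(pool.map(lambda s: compensators[s](), parallel))
    for s in sequential:                         # everything else keeps strict reverse order
        compensators[s]()
```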


Why AI agents make this worse

Traditional workflows — built with tools like Camunda, Temporal, or AWS Step Functions — at least force developers to think about the execution path. When you draw a BPMN diagram or write a state machine, you see the steps, you think about failure modes, and (sometimes) you define compensation logic.

AI agents invert this. In most agent frameworks, the LLM decides which tools to call and in what order, dynamically, at runtime. There's no pre-defined execution graph. There's no explicit declaration of "if this step fails, undo that step." The agent reasons about what to do next, and if something fails, it reasons about how to recover, stacking one layer of non-determinism on top of another.

An Edstellar article offers an illustrative cascade-failure scenario: a mislabeled supplier risk rating triggers a contract termination; a mishandled email kicks off automated reactions across procurement, legal, and finance. The scenario is hypothetical, but the pattern is recognizable: unlike rule-based automation that halts at failure, agents can push forward and compound bad decisions without oversight.

In an IBM Think interview, Maryam Ashoori cited figures suggesting only about 19% of organizations focus on observability and monitoring of agents in production — implying many teams still lack mature tracing for agent workflows and may not detect when a failure leaves downstream systems in an inconsistent state.


The design principles that work

The Saga pattern literature converges on a set of principles that apply directly to any automated workflow — whether human-coded or AI-driven:

1. Every step declares its compensator. Before a step executes, the system knows how to undo it. This isn't an afterthought — it's a precondition for execution. Microsoft's documentation puts it plainly: "Use this pattern only for operations that must be undone if they fail. If possible, design solutions to avoid the complexity of requiring compensating transactions." When that's not possible, the compensating logic should be defined upfront.

2. Compensation runs in reverse order. When step 4 fails, you compensate step 3, then step 2, then step 1 — or in parallel where dependencies allow. The Saga Execution Coordinator (described in the Baeldung architecture guide) inspects the saga log to identify impacted components and the correct compensation sequence.

3. Compensating actions must be idempotent and retryable. A compensation can itself fail. The system must be able to retry it without causing additional inconsistency. DevX's practical analysis puts it well: "Compensating actions should be retryable, observable, and honest about what cannot be undone. Some actions are irreversible, and your system must handle that with follow-up workflows or human intervention."

4. State is preserved between steps. If the process pauses (waiting for an external response, waiting for human approval), it must preserve full state so it can resume without losing context. This is the "durable execution" concept that Temporal has popularized — but the principle is universal.

5. The entire execution is traced. A consulting write-up on saga implementation describes an organization where manual investigation and rollback of failed multi-service transactions took hours per incident. After introducing orchestration with a dedicated saga log and automated compensating transactions, recovery was reduced to minutes. The specific figures are anecdotal, but the pattern is consistent with the broader Saga literature: automated compensation with a trace log dramatically reduces mean time to recovery.
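Taken together, the five principles fit in a small orchestrator. The sketch below is illustrative rather than production-ready: every name is hypothetical, the saga log is an in-memory list standing in for a durable store, and a real system would add backoff, timeouts, and alerting:

```python
class Step:
    """Principle 1: a step cannot be registered without its compensator."""
    def __init__(self, name, action, compensator):
        if compensator is None:
            raise ValueError(f"step {name!r} declares no compensator")
        self.name, self.action, self.compensator = name, action, compensator

def run_saga(steps, ctx, log, max_retries=3):
    """Run steps in order; on failure, compensate in reverse (Principle 2).

    `log` is the saga log (Principle 5). It is owned by the caller and
    survives the run, so an operator -- or a resumed process holding
    preserved state, per Principle 4 -- can see exactly which actions
    committed and which compensations ran.
    """
    done = []
    try:
        for step in steps:
            step.action(ctx)
            done.append(step)
            log.append(("did", step.name))
    except Exception as exc:
        log.append(("failed", str(exc)))
        for step in reversed(done):
            for attempt in range(max_retries):     # Principle 3: retryable...
                try:
                    step.compensator(ctx)          # ...so it must be idempotent
                    log.append(("undid", step.name))
                    break
                except Exception:
                    if attempt == max_retries - 1:
                        log.append(("stuck", step.name))  # escalate to a human
        return "compensated"
    return "completed"
```

A durable-execution engine such as Temporal or AWS Step Functions supplies the persistent log and retry machinery for you, but the principles it enforces are the same ones sketched here.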


What this means for production AI systems

If you're building or deploying AI agents that take actions across multiple systems, ask three questions:

For every action the agent can take — what's the undo? If the agent can create a record, what cancels it? If it can call an external API, what reverses that call? If there's no undo, that action needs human approval gates, not autonomous execution.

When something fails — does the system know what already happened? If step 4 fails, can the system enumerate steps 1-3 and their compensating actions? Or does failure mean "call an engineer and figure out what state things are in"?

Is the failure recovery deterministic or does it depend on the AI's judgment? If the LLM decides how to recover from a failure, you've compounded one source of unpredictability with another. If compensating transactions are predefined and execute mechanically, recovery is reliable regardless of what caused the failure.

The Saga pattern exists precisely because distributed systems can't pretend they have global transactions. AI agents exist in an even more distributed, less predictable environment. The rollback problem isn't going away — it's scaling with every new agent you deploy.