Compensation Patterns: When Rollback Isn't Possible

In a database transaction, rollback is simple. Issue ROLLBACK and every uncommitted change disappears, as if it had never been made.

In a distributed system, there is no rollback. Once you've called an external API, that call happened. The email was sent. The payment was charged. The order was placed.

You can't undo these actions. But you can compensate for them.

The Saga Pattern

The saga pattern (originally described by Hector Garcia-Molina and Kenneth Salem in 1987) replaces a single distributed transaction with a sequence of local transactions, each paired with a compensating action.

Forward flow:
  reserve_inventory → charge_payment → ship_order → send_confirmation

If ship_order fails:
  refund_payment → release_inventory
  (reverse order, skip what doesn't need compensation)

TIATON implements this automatically through skill compensators.

Declaring Compensators

Each skill can declare a compensator — a function that reverses its effects:

{
  "key": "charge_payment",
  "handler": "charge_customer",
  "compensator": "refund_payment",
  "requires": ["inventory_reserved"],
  "ensures": ["payment_charged"]
}

The compensator is a regular handler with the same signature:

def charge_customer(ctx, state):
    payload = new("payments.v1.ChargeRequest", {
        "customer_id": state["customer_id"],
        "amount": state["order_total"],
        "currency": "USD",
        # Deterministic key, so a retried charge is not applied twice
        "idempotency_key": state["order_id"] + "_charge",
    })
    return RUNNING, submit_job(
        "payments.v1.PaymentService/Charge", payload
    )

def on_charge_complete(ctx, state):
    # Keep what the compensator will need to reverse the charge
    state["payment_id"] = ctx.event.result.payment_id
    state["charged_amount"] = ctx.event.result.amount
    return SUCCESS

def refund_payment(ctx, state):
    """Compensator: reverse the charge."""
    if not state.get("payment_id"):
        return SUCCESS  # Nothing to refund

    payload = new("payments.v1.RefundRequest", {
        "payment_id": state["payment_id"],
        "amount": state["charged_amount"],
        "reason": "Order processing failure - automatic compensation",
        # Assumption: RefundRequest accepts an idempotency key like ChargeRequest
        "idempotency_key": state["order_id"] + "_refund",
    })
    return RUNNING, submit_job(
        "payments.v1.PaymentService/Refund", payload
    )

Automatic Compensation Cascade

When a skill fails, TIATON triggers compensation for all previously completed skills in reverse order:

Execution order:
  1. validate_order    → SUCCESS
  2. reserve_inventory → SUCCESS (compensator: release_inventory)
  3. charge_payment    → SUCCESS (compensator: refund_payment)
  4. ship_order        → FAILURE ← failure here

Compensation (automatic, reverse order):
  3. refund_payment    → SUCCESS ← charge reversed
  2. release_inventory → SUCCESS ← inventory freed
  1. (no compensator)  ← validation is stateless

You don't write this orchestration logic. The agent tracks CompletedSkills in execution order and reverses through them on failure.
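The cascade above can be sketched in a few lines of plain Python. This is a minimal illustration, not TIATON's implementation: the Skill dataclass, run_with_compensation, and the synchronous single-argument handlers are all assumptions made for the example (real handlers take ctx and state and may return RUNNING).

```python
from dataclasses import dataclass
from typing import Callable, Optional

SUCCESS, FAILURE = "SUCCESS", "FAILURE"


@dataclass
class Skill:
    """Hypothetical pairing of a handler with its optional compensator."""
    key: str
    handler: Callable[[dict], str]
    compensator: Optional[Callable[[dict], str]] = None


def run_with_compensation(skills, state):
    """Run skills in order; on failure, compensate completed ones in reverse."""
    completed = []  # stand-in for the agent's CompletedSkills list
    log = []
    for skill in skills:
        status = skill.handler(state)
        log.append((skill.key, status))
        if status != SUCCESS:
            for done in reversed(completed):
                if done.compensator is None:
                    continue  # nothing to reverse, e.g. stateless validation
                log.append((done.compensator.__name__, done.compensator(state)))
            return FAILURE, log
        completed.append(skill)
    return SUCCESS, log


def validate_order(state):
    return SUCCESS


def reserve_inventory(state):
    state["reservation_id"] = "res_1"
    return SUCCESS


def release_inventory(state):
    state.pop("reservation_id", None)
    return SUCCESS


def charge_customer(state):
    state["payment_id"] = "pay_1"
    return SUCCESS


def refund_payment(state):
    state.pop("payment_id", None)
    return SUCCESS


def ship_order(state):
    return FAILURE  # simulate the shipping failure from the listing above


skills = [
    Skill("validate_order", validate_order),
    Skill("reserve_inventory", reserve_inventory, release_inventory),
    Skill("charge_payment", charge_customer, refund_payment),
    Skill("ship_order", ship_order),
]
state = {}
status, log = run_with_compensation(skills, state)
for entry in log:
    print(entry)
```

The failed skill is never added to the completed list, so only skills that finished successfully are compensated, and skills without a compensator are skipped.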

Compensation Strategies

Different scenarios need different approaches:

Full Reversal

The simplest case — undo everything:

def release_inventory(ctx, state):
    """Release all reserved items."""
    payload = new("inventory.v1.ReleaseRequest", {
        "reservation_id": state["reservation_id"],
    })
    return RUNNING, submit_job(
        "inventory.v1.InventoryService/Release", payload
    )

Partial Compensation

Sometimes the original action can't be undone, so you compensate with a different action that neutralizes it:

def void_credit_check(ctx, state):
    """Mark credit inquiry as voided (can't delete it, but can flag it)."""
    payload = new("credit.v1.VoidInquiryRequest", {
        "inquiry_id": state["credit_inquiry_id"],
        "reason": "Application cancelled due to processing error",
    })
    return RUNNING, submit_job(
        "credit.v1.CreditService/VoidInquiry", payload
    )

Notification-Based Compensation

When you can't reverse an action, notify the affected parties:

def notify_cancellation(ctx, state):
    """Can't un-send the approval email, but can send cancellation."""
    payload = new("notifications.v1.SendRequest", {
        "to": state["applicant_email"],
        "template": "application_cancelled",
        "data": {
            "reason": "Processing error — your application will be re-evaluated",
            "reference": state["application_id"],
        },
    })
    return RUNNING, submit_job(
        "notifications.v1.NotificationService/Send", payload
    )

The Audit Trail

Every compensation is recorded in the session trace:

{
  "tick": 5,
  "type": "compensation",
  "trigger": "ship_order FAILURE",
  "compensations": [
    {
      "skill": "charge_payment",
      "compensator": "refund_payment",
      "status": "success",
      "duration_ms": 1203,
      "result": {
        "refund_id": "ref_8841",
        "amount": 299.99
      }
    },
    {
      "skill": "reserve_inventory",
      "compensator": "release_inventory",
      "status": "success",
      "duration_ms": 45,
      "result": {
        "items_released": 3
      }
    }
  ]
}

No detective work. No manual investigation. The trace shows exactly what failed, what was compensated, and whether compensation succeeded.
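Because the trace is plain JSON, checking it programmatically is straightforward. The following sketch assumes the entry has exactly the shape shown above; summarize_compensation is a hypothetical helper, not part of TIATON.

```python
import json

# The compensation entry shown above, trimmed to the fields the helper reads.
TRACE_ENTRY = json.loads("""
{
  "tick": 5,
  "type": "compensation",
  "trigger": "ship_order FAILURE",
  "compensations": [
    {"skill": "charge_payment", "compensator": "refund_payment",
     "status": "success", "duration_ms": 1203},
    {"skill": "reserve_inventory", "compensator": "release_inventory",
     "status": "success", "duration_ms": 45}
  ]
}
""")


def summarize_compensation(entry):
    """Condense one compensation trace entry into a short summary."""
    steps = entry["compensations"]
    return {
        "trigger": entry["trigger"],
        "compensated": [step["skill"] for step in steps],
        "all_succeeded": all(step["status"] == "success" for step in steps),
        "total_duration_ms": sum(step["duration_ms"] for step in steps),
    }


summary = summarize_compensation(TRACE_ENTRY)
print(summary)
```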

When Compensation Fails

What if the compensator itself fails? TIATON records the failure and marks the session as requiring manual intervention:

{
  "status": "compensation_failed",
  "failed_compensations": [
    {
      "skill": "charge_payment",
      "compensator": "refund_payment",
      "error": "Payment provider timeout after 30s",
      "payment_id": "pay_7721",
      "amount": 299.99
    }
  ]
}

This creates an actionable alert: "Refund of $299.99 for payment pay_7721 failed — manual refund required." The system doesn't silently swallow the failure.
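One way to turn such a record into alerts is sketched below. It assumes the record shape shown above and treats every field other than skill, compensator, and error as context for the operator; build_alerts and that field convention are assumptions for illustration, not TIATON APIs.

```python
STANDARD_FIELDS = ("skill", "compensator", "error")  # assumed convention


def build_alerts(session):
    """Return one alert per failed compensation, or an empty list."""
    if session.get("status") != "compensation_failed":
        return []
    alerts = []
    for item in session["failed_compensations"]:
        # Anything beyond the standard fields is passed along as context.
        context = {k: v for k, v in item.items() if k not in STANDARD_FIELDS}
        alerts.append({
            "message": (
                f"{item['compensator']} failed for {item['skill']}: "
                f"{item['error']}. Manual intervention required."
            ),
            "context": context,
        })
    return alerts


session = {
    "status": "compensation_failed",
    "failed_compensations": [
        {
            "skill": "charge_payment",
            "compensator": "refund_payment",
            "error": "Payment provider timeout after 30s",
            "payment_id": "pay_7721",
            "amount": 299.99,
        }
    ],
}
alerts = build_alerts(session)
print(alerts[0]["message"])
```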

Design Principles

Every side effect needs a compensator: if a skill calls an external system, declare how to reverse it.
Compensators must be idempotent: they might be called more than once (retries).
Use idempotency keys: external APIs should handle duplicate compensation requests gracefully.
Log everything: the compensation trail is as important as the execution trail.
Accept imperfection: some compensations are "best effort" (notifications). That's okay. The audit trail captures it.
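The idempotency principles can be illustrated with a small in-memory example. FakePaymentService is a hypothetical stand-in for an external payment API that deduplicates on an idempotency key, and this refund_payment is a simplified synchronous version of a compensator; both are assumptions for the sketch.

```python
class FakePaymentService:
    """In-memory stand-in for an external payment API (hypothetical)."""

    def __init__(self):
        self.refunds = {}  # idempotency_key -> refund record

    def refund(self, payment_id, amount, idempotency_key):
        # A repeated key returns the original record instead of refunding again.
        if idempotency_key not in self.refunds:
            self.refunds[idempotency_key] = {
                "refund_id": f"ref_{len(self.refunds) + 1}",
                "payment_id": payment_id,
                "amount": amount,
            }
        return self.refunds[idempotency_key]


def refund_payment(service, state):
    """Compensator sketch: safe to call any number of times."""
    if not state.get("payment_id"):
        return None  # nothing was charged, so nothing to refund
    key = state["order_id"] + "_refund"  # deterministic per order
    return service.refund(state["payment_id"], state["charged_amount"], key)


service = FakePaymentService()
state = {"order_id": "ord_42", "payment_id": "pay_7721", "charged_amount": 299.99}

first = refund_payment(service, state)
second = refund_payment(service, state)  # simulated retry
print(first == second, len(service.refunds))
```

Deriving the key from the order ID rather than generating a random one is what makes the retry harmless: both calls present the same key, so the service issues a single refund.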

The goal isn't perfect rollback — that's impossible in distributed systems. The goal is predictable recovery with complete visibility.