Pause, Wait, Resume: Async Workflows That Survive Restarts
Your workflow calls an external payment provider. The provider takes 3 seconds to respond. Easy — you wait for the HTTP response.
Now your workflow submits a document for manual approval. The approver is on vacation. The response comes in 5 days. Not so easy anymore.
The Traditional Problem
Most workflow engines handle async in one of three painful ways:
1. Polling loops
```python
# Please don't do this
while not is_approved(document_id):
    time.sleep(60)  # Check every minute
    if timeout_reached():
        raise TimeoutError("Approval timed out")
```
Wastes resources. Doesn't survive restarts. Timeout logic becomes a second workflow.
2. Callback spaghetti
```javascript
// Please definitely don't do this
submitForApproval(doc, (approvalResult) => {
  if (approvalResult.approved) {
    processPayment(doc, (paymentResult) => {
      if (paymentResult.success) {
        sendConfirmation(doc, (confirmResult) => {
          // 6 levels deep, who knows what state we're in
        })
      }
    })
  }
})
```
Unreadable. Error handling is scattered. State is implicit.
3. Database-backed state machines
```sql
UPDATE workflows SET status = 'waiting_for_approval'
WHERE id = 'wf_123';
-- ... separate cron job checks for approved workflows ...
-- ... another service picks up and continues ...
-- ... state scattered across 4 tables ...
```
Works, but you reinvent half a workflow engine every time.
The TIATON Approach
In TIATON, async is a first-class concept. A skill returns RUNNING with a job submission. The agent serializes its complete state and pauses. When the external event arrives, the agent resumes exactly where it stopped.
```python
def request_approval(ctx, state):
    payload = new("approvals.v1.ApprovalRequest", {
        "document_id": state["document_id"],
        "approver_group": "compliance",
        "deadline_hours": 72,
    })
    return RUNNING, submit_job(
        "approvals.v1.ApprovalService/RequestApproval",
        payload
    )

def on_approval_received(ctx, state):
    result = ctx.event.result
    state["approval_status"] = result.status
    state["approved_by"] = result.approver_id
    state["approval_timestamp"] = result.timestamp
    if result.status == "rejected":
        state["rejection_reason"] = result.reason
        return FAILURE
    return SUCCESS
```
That's it. No polling. No callbacks. No state machine tables. The runtime handles serialization, deserialization, event matching, and state restoration.
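The pairing between a pending job and its completion handler can be pictured with a small dispatch sketch (hypothetical names, not TIATON's actual runtime wiring): when an external event arrives, the runtime looks up which handler is waiting on the matching job_id and invokes it with the event and the restored state.

```python
from types import SimpleNamespace

SUCCESS, FAILURE = "SUCCESS", "FAILURE"

def on_approval_received(ctx, state):
    # Same shape as the completion handler above
    result = ctx.event.result
    state["approval_status"] = result.status
    if result.status == "rejected":
        return FAILURE
    return SUCCESS

# Illustrative: the runtime tracks which handler waits on which job_id
pending_jobs = {"job_44f1": on_approval_received}

def dispatch(event, state):
    """Route an external event to the handler waiting on its job_id."""
    handler = pending_jobs.pop(event.job_id)
    ctx = SimpleNamespace(event=event)
    return handler(ctx, state)

state = {}
event = SimpleNamespace(job_id="job_44f1",
                        result=SimpleNamespace(status="approved"))
status = dispatch(event, state)  # → "SUCCESS"
```

The point of the sketch: the handler never polls and never blocks; it only runs when the runtime hands it a matching event.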
What Gets Serialized
When an agent pauses, the entire session is serialized:
```json
{
  "session_id": "sess_8f2a3c",
  "agent_key": "document_review",
  "state": {
    "data": {
      "document_id": "doc_771",
      "document_type": "loan_agreement",
      "applicant_id": "app_392",
      "approval_status": null
    },
    "completed_skills": ["validate_document", "extract_terms"],
    "predicates": {
      "document_validated": true,
      "terms_extracted": true,
      "approval_received": false
    }
  },
  "pending_jobs": [
    {
      "job_id": "job_44f1",
      "job_type": "approvals.v1.ApprovalService/RequestApproval",
      "node_id": "n7",
      "submitted_at": "2025-02-10T14:32:00Z"
    }
  ]
}
```
This JSON can be stored in PostgreSQL, Redis, S3 — anywhere. The session has no dependency on the process that created it.
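Because the session is plain JSON, persistence is a one-column problem. A minimal sketch using stdlib sqlite3 standing in for any durable store (the table name and schema are assumptions, not part of TIATON):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")  # PostgreSQL, Redis, S3 work the same way
db.execute("CREATE TABLE sessions (session_id TEXT PRIMARY KEY, body TEXT)")

session = {
    "session_id": "sess_8f2a3c",
    "agent_key": "document_review",
    "state": {"data": {"document_id": "doc_771"}},
    "pending_jobs": [{"job_id": "job_44f1", "node_id": "n7"}],
}

# Pause: serialize the whole session and write it out.
db.execute("INSERT OR REPLACE INTO sessions VALUES (?, ?)",
           (session["session_id"], json.dumps(session)))

# Resume (possibly in a different process): load and deserialize.
row = db.execute("SELECT body FROM sessions WHERE session_id = ?",
                 ("sess_8f2a3c",)).fetchone()
restored = json.loads(row[0])
assert restored == session  # the session carries everything it needs
```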
Resuming
When the approval arrives (hours or days later), you resume with an event:
```bash
curl -X POST /api/sessions/sess_8f2a3c/resume \
  -d '{
    "events": [{
      "type": "job_succeeded",
      "job_id": "job_44f1",
      "result": {
        "status": "approved",
        "approver_id": "user_88",
        "timestamp": "2025-02-12T09:15:00Z"
      }
    }]
  }'
```
The runtime:
- Loads the serialized session
- Rebuilds the behavior tree (deterministic node IDs ensure correct mapping)
- Injects the event
- Resumes ticking from the exact point where it paused
- Calls the on_approval_received completion handler
- Continues to the next skill
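The resume path can be sketched end to end (hypothetical helper names, not TIATON's internals; the node-to-handler mapping stands in for the rebuilt behavior tree):

```python
import json

def resume(serialized, events, handlers):
    """Load a paused session, match events to pending jobs, run handlers.

    `handlers` maps node_id -> completion handler; in the real runtime this
    mapping comes from rebuilding the tree with deterministic node IDs.
    """
    session = json.loads(serialized)
    by_job = {j["job_id"]: j for j in session["pending_jobs"]}
    for event in events:
        job = by_job.pop(event["job_id"])      # match event to pending job
        handler = handlers[job["node_id"]]     # deterministic node mapping
        handler(event["result"], session["state"]["data"])
    session["pending_jobs"] = list(by_job.values())
    return session

def on_approval_received(result, data):
    data["approval_status"] = result["status"]

serialized = json.dumps({
    "session_id": "sess_8f2a3c",
    "state": {"data": {"approval_status": None}},
    "pending_jobs": [{"job_id": "job_44f1", "node_id": "n7"}],
})
events = [{"type": "job_succeeded", "job_id": "job_44f1",
           "result": {"status": "approved"}}]
session = resume(serialized, events, {"n7": on_approval_received})
```

Nothing here references the process that originally paused the session; the serialized JSON plus the handler mapping is enough.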
Timers
Scheduled delays use the same mechanism:
```python
def schedule_reminder(ctx, state):
    """Send a reminder if no response in 48 hours."""
    delay = new("tiaton.v1.DelayRequest", {
        "timer_key": "approval_reminder",
        "deadline_ms": 48 * 60 * 60 * 1000,  # 48 hours
    })
    return RUNNING, submit_job("tiaton.v1.System/Delay", delay)

def on_reminder_fired(ctx, state):
    """Timer expired — send reminder notification."""
    state["reminder_sent"] = True
    # Submit another async job to send the reminder
    payload = new("notifications.v1.ReminderRequest", {
        "approver_group": "compliance",
        "document_id": state["document_id"],
    })
    return RUNNING, submit_job(
        "notifications.v1.NotificationService/SendReminder",
        payload
    )
```
Timers are just jobs. The runtime intercepts tiaton.v1.System/Delay, schedules a timer through the infrastructure, and fires a timer_fired event when the deadline passes.
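One way such an interceptor might work (a sketch under stated assumptions, not TIATON's implementation): keep a min-heap of deadlines, and on each poll emit a timer_fired event for every deadline at or before "now".

```python
import heapq

class TimerService:
    """Minimal delay scheduler: accepts Delay jobs, fires timer events."""

    def __init__(self):
        self._heap = []  # (absolute deadline ms, job_id, timer_key)

    def submit_delay(self, job_id, timer_key, deadline_ms, now_ms):
        heapq.heappush(self._heap, (now_ms + deadline_ms, job_id, timer_key))

    def poll(self, now_ms):
        """Return timer_fired events for every deadline that has passed."""
        fired = []
        while self._heap and self._heap[0][0] <= now_ms:
            _, job_id, timer_key = heapq.heappop(self._heap)
            fired.append({"type": "timer_fired", "job_id": job_id,
                          "timer_key": timer_key})
        return fired

svc = TimerService()
svc.submit_delay("job_t1", "approval_reminder",
                 deadline_ms=48 * 60 * 60 * 1000, now_ms=0)
assert svc.poll(now_ms=1_000) == []            # not due yet
events = svc.poll(now_ms=48 * 60 * 60 * 1000)  # at the deadline: fires
```

The fired events then feed into the same resume path as any other job result, which is why the waiting session needs no process alive during the delay.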
Surviving Restarts
Because sessions are fully serialized:
- Server restart — Sessions resume from persistent storage
- Horizontal scaling — Any server instance can pick up any session
- Version upgrades — Deploy new code, existing sessions continue with their original behavior
- Disaster recovery — Restore sessions from backup, they continue where they left off
The session doesn't know or care which process started it. It only needs its serialized state and the tree definition (which is deterministic from the workflow JSON).
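The "deterministic from the workflow JSON" property can be illustrated with a sketch: assign node IDs by a stable preorder walk over the workflow definition, so any process that rebuilds the tree from the same JSON arrives at the same IDs (the n-prefix scheme and field names here are assumptions for illustration).

```python
import copy

def assign_node_ids(node, counter=None):
    """Assign IDs by preorder position; same workflow JSON -> same IDs."""
    if counter is None:
        counter = [0]
    node["id"] = f"n{counter[0]}"
    counter[0] += 1
    for child in node.get("children", []):
        assign_node_ids(child, counter)
    return node

workflow = {"type": "sequence", "children": [
    {"type": "skill", "name": "validate_document"},
    {"type": "skill", "name": "request_approval"},
]}

# Two independent rebuilds (e.g. on different servers) agree on every ID,
# so a pending job's node_id maps to the same node wherever it resumes.
a = assign_node_ids(copy.deepcopy(workflow))
b = assign_node_ids(copy.deepcopy(workflow))
assert a == b
```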
The Mental Model
Think of each session as a saved game:
- Player (agent) makes progress through levels (skills)
- Game saves automatically at checkpoints (async boundaries)
- Player can quit (process stops) and resume later
- Save file (serialized session) is portable across consoles (servers)
- Progress is never lost
No polling. No callbacks. No distributed state machines. Just declare what you need, pause when waiting, resume when ready.