Observability for Decisions: Beyond Logs

Your system logs say: "loan_application_123: rejected". Great. Why was it rejected?

You grep through more logs. You find "risk_score: 0.73, threshold: 0.70". Okay, the risk score was too high. But why was the risk score 0.73? Which rules contributed to it? Was this the correct threshold for this customer segment? When was this threshold last changed?

This is the observability gap. Traditional application monitoring tells you what happened. Decision observability tells you why it happened.

The Anatomy of a Decision Trace

In TIATON, every decision session produces a trace — a complete record of every step, every rule evaluation, and every state change.

A session trace contains (see the sketch after this list):

  • Input facts — the data that entered the system
  • Tick-by-tick execution — what the agent did at each step
  • DMN evaluations — which rules matched, which didn't, and why
  • State transitions — how the state changed after each skill
  • Timing data — how long each step took
  • Final decision — the outcome with full reasoning chain
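
Concretely, a trace has a predictable shape. Here is a minimal sketch of that structure as Python dataclasses, mirroring the example trace shown next; any names beyond the fields visible in the JSON are illustrative, not TIATON's actual schema:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class Phase:
    tick: int                      # which step of the session this was
    skill: str                     # the skill the agent executed
    status: str                    # "success" or "failure"
    duration_ms: int               # how long the step took
    state_after: dict[str, Any]    # state snapshot after the skill ran
    dmn_evaluation: dict[str, Any] | None = None  # rule matches, if any

@dataclass
class SessionTrace:
    session_id: str
    agent_key: str
    status: str
    ticks: int
    duration_ms: int
    input_facts: dict[str, Any]    # the data that entered the system
    phases: list[Phase] = field(default_factory=list)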

Reading a Trace

Here's a simplified trace for a loan decision:

{
  "session_id": "sess_a7f3c9e2",
  "agent_key": "loan_processing",
  "status": "done",
  "ticks": 4,
  "duration_ms": 1847,
  "input_facts": {
    "applicant_name": "John Doe",
    "credit_score": 620,
    "annual_income": 45000,
    "loan_amount": 75000
  },
  "phases": [
    {
      "tick": 1,
      "skill": "validate_application",
      "status": "success",
      "duration_ms": 12,
      "state_after": {
        "application_validated": true
      }
    },
    {
      "tick": 2,
      "skill": "check_credit",
      "status": "success",
      "duration_ms": 834,
      "state_after": {
        "credit_checked": true,
        "credit_report_id": "cr_8821"
      }
    },
    {
      "tick": 3,
      "skill": "evaluate_risk",
      "status": "success",
      "duration_ms": 45,
      "dmn_evaluation": {
        "domain": "lending",
        "tables_evaluated": ["loan_eligibility", "risk_scoring"],
        "results": {
          "loan_eligibility": {
            "matched_rule": 3,
            "decision": "manual_review",
            "note": "Borderline — needs review"
          },
          "risk_scoring": {
            "matched_rule": 5,
            "risk_level": "medium",
            "risk_score": 0.62
          }
        }
      },
      "state_after": {
        "risk_evaluated": true,
        "decision": "manual_review"
      }
    },
    {
      "tick": 4,
      "skill": "notify_applicant",
      "status": "success",
      "duration_ms": 956,
      "state_after": {
        "applicant_notified": true,
        "notification_id": "ntf_3391"
      }
    }
  ]
}

An auditor opens this trace and sees: the applicant had a credit score of 620 and an annual income of 45,000. Rule 3 in the loan_eligibility table matched ("Borderline — needs review"), and the risk_scoring table returned a medium risk level with a score of 0.62. The final decision was manual_review.

No guessing. No log correlation. No "let me check with the dev team."
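
That reading can also be automated. The following sketch walks an exported trace and prints the reasoning chain; it assumes only the JSON structure shown above, and the file name is hypothetical:

import json

# Load a trace exported as JSON (file name is hypothetical).
with open("sess_a7f3c9e2.json") as f:
    trace = json.load(f)

print(f"Session {trace['session_id']} -> {trace['status']} in {trace['ticks']} ticks")

for phase in trace["phases"]:
    print(f"  tick {phase['tick']}: {phase['skill']} "
          f"({phase['status']}, {phase['duration_ms']} ms)")
    # Where a skill evaluated DMN tables, show which rules drove the decision.
    results = phase.get("dmn_evaluation", {}).get("results", {})
    for table, result in results.items():
        print(f"    {table}: rule {result['matched_rule']} matched -> {result}")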

The 30-Second Audit

The TIATON management UI makes this visual. For any session:

  1. Click the session in the list
  2. See the execution graph with phases
  3. Click any node — see the DMN evaluation, matched rules, state changes
  4. Export the trace as JSON for compliance records

What used to take a days-long compliance investigation now takes 30 seconds.
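
The export step is scriptable as well. As a rough sketch, assuming your deployment serves traces over HTTP at a hypothetical endpoint like /sessions/{id}/trace (the real path, port, and authentication depend on your TIATON installation):

import json
import urllib.request

SESSION_ID = "sess_a7f3c9e2"
# Hypothetical endpoint; check your deployment for the actual path and auth.
URL = f"http://localhost:8080/sessions/{SESSION_ID}/trace"

with urllib.request.urlopen(URL) as resp:
    trace = json.load(resp)

# Archive the full trace for compliance records.
with open(f"{SESSION_ID}.json", "w") as f:
    json.dump(trace, f, indent=2)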

Error Traces

Traces are even more valuable when things go wrong. When a skill fails and compensation triggers:

{
  "tick": 3,
  "skill": "open_position",
  "status": "failure",
  "error": "Market closed: NASDAQ after-hours rejected",
  "compensation": [
    {
      "skill": "reserve_margin",
      "compensator": "release_margin",
      "status": "success",
      "duration_ms": 23
    }
  ]
}

You see exactly what failed, what error occurred, and how the system cleaned up. The compensation trail is as visible as the happy path.
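
Error traces lend themselves to the same tooling. A small sketch, again assuming only the fields shown in the fragment above, that flags failed phases and their compensation trail:

def report_failures(trace: dict) -> None:
    """Print every failed phase and the compensators that cleaned up after it."""
    for phase in trace["phases"]:
        if phase["status"] != "failure":
            continue
        print(f"tick {phase['tick']}: {phase['skill']} failed: {phase['error']}")
        for comp in phase.get("compensation", []):
            print(f"  compensated {comp['skill']} via {comp['compensator']} "
                  f"({comp['status']}, {comp['duration_ms']} ms)")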

Connecting Rules to Outcomes

The real power is connecting the dots between rule versions and production outcomes.

"After we changed rule 5 in the risk_scoring table last Tuesday, what percentage of applications shifted from 'approve' to 'manual_review'?"

Because every session records which rule version was used, you can answer this directly. You can compare outcomes across rule versions and spot unintended consequences before they become problems.
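
As a sketch of that query over a batch of exported traces: the rule_version field below is hypothetical (the example trace above doesn't show where versions are recorded), so adjust the names to your deployment's trace schema:

from collections import Counter, defaultdict

def decision_mix_by_version(traces: list[dict], table: str) -> dict[str, Counter]:
    """Group final decisions by the version of one DMN table.

    Assumes each table result carries a hypothetical 'rule_version' field
    alongside 'decision'; both names may differ in your trace schema.
    """
    mix: dict[str, Counter] = defaultdict(Counter)
    for trace in traces:
        for phase in trace["phases"]:
            result = phase.get("dmn_evaluation", {}).get("results", {}).get(table)
            if result:
                mix[result["rule_version"]][result["decision"]] += 1
    return mix

# e.g. compare the 'approve' vs 'manual_review' share before and after a change:
# for version, counts in decision_mix_by_version(traces, "loan_eligibility").items():
#     total = sum(counts.values())
#     print(version, {d: f"{n / total:.0%}" for d, n in counts.items()})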

What Good Decision Observability Looks Like

Metric                                Typical System            With Decision Traces
Time to explain a decision            Hours to days             30 seconds
Time to find a rule-related bug       Days                      Minutes
Compliance audit preparation          Weeks                     Automated export
Impact analysis of rule changes       Manual + guesswork        Query across sessions
Root cause of rejected applications   Developer investigation   Click the session

This isn't just about compliance — though compliance teams love it. It's about building trust in automated decisions. When anyone in the organization can understand why a decision was made, they trust the system to make those decisions.