ADR-054: DLQ + retry semantics — exponential backoff with jitter, in-place replay with `failure_history` audit
Status: Accepted (2026-04-25)
Context
ADR-044 (TA-5) locked the substrate: outbox table, four-state status taxonomy (pending → in_flight → delivered | failed), idempotency via core.event_handled + idempotency_key, polling fallback past LISTEN/NOTIFY cap. ADR-017 + the envelope module docstring already commit DLQ to status='failed' filter on core.outbox rather than a separate DLQ table. This ADR adds the retry curve, replay protocol, and operator-surface contract that complete the failure-handling story.
Decision
D1 — DLQ as status='failed' filter on core.outbox (ADR-044 D4 ratification)
No separate DLQ table. Failed-out rows stay in core.outbox with status='failed'. Producer-owned recovery surface per ADR-017. Query pattern: SELECT ... FROM core.outbox WHERE status='failed' AND deleted_at IS NULL.
D2 — Retry curve: exponential with jitter
Default policy: base delay 1 s, multiplier 2, max delay 5 min, full jitter applied per attempt. Schedule: ~1 s, ~2 s, ~4 s, ~8 s, ~16 s (capped at 300 s). Jitter spreads retries to avoid thundering-herd on transient failures.
D3 — Max retries: 5 (default; per-handler override allowed)
Five attempts before transition to status='failed'. Per-handler override via RetryPolicy injection. Reasoning: 5 covers most transient-failure recovery windows (cumulative ~30 s of attempts) without burning hours of latency on terminally-broken events.
D4 — Transient vs terminal classifier
Default classifier (in spectral.core.events.retry):
- Terminal (immediate
status='failed', no retry):IntegrityError,ValueErrorfrom payload validation,pydantic.ValidationError, explicitTerminalHandlerErrorsubclass - Transient (retry per D2/D3):
ConnectionError,TimeoutError,OperationalError, default for any other exception - Per-handler override: handlers may register additional terminal exception types via
RetryPolicy.terminal_errors
D5 — Replay protocol: in-place reset to PENDING + failure_history audit column
Schema extension on core.outbox (migration 20260425014302_core_outbox_failure_history.sql):
failure_history jsonb default '[]'— accumulates terminal-failure-cycle snapshotsattempts integer not null default 0— current-cycle retry countlast_error text null— current-cycle last error messagefirst_failed_at timestamptz null— current-cycle first-failure timestamp
Replay function core.outbox_replay(event_id uuid, p_new_generation bigint, p_replayed_by text) performs atomically:
- Snapshot current cycle into
failure_historyarray entry. - Clear current-cycle fields (
attempts=0,last_error=null,first_failed_at=null). - Update
generationto the caller-suppliedp_new_generation(operator explicitly opts in to “process with current code”; the gen-N-stays-on-gen-N invariant is a rolling-deploy window guarantee, not an after-the-fact-replay invariant). - Reset
status='pending'.
p_new_generation is a caller-supplied argument (TA-26 D7/D8 placement principle — SQL functions take values explicitly rather than embedding runtime-context assumptions). p_replayed_by is text to accommodate both human-operator emails and SQL-direct-caller identifiers (e.g., "sql-direct").
failure_history entries are typed via spectral.core.events.failure.FailureCycle. Additive-only schema per ADR-044 D11.
D6 — Operator surface: substrate-only at alpha; tool surface deferred to ADR-060
At alpha: operator inspects FAILED rows via Supabase Studio (or direct psql). Calls core.outbox_replay(event_id, ...) via SQL. Sentry alert (D7) fires on → FAILED transition with row context. Runbook documents the workflow.
Ops Agent tool surface (DLQ inspection + replay) is delivered by ADR-060 D7 (list_dlq_events, get_dlq_event_detail, replay_dlq_event). The TA-6 contract surface is the substrate; the operator UX layer is TA-15’s domain.
Forward triggers to upgrade the operator surface:
- ADR-060 D7 lands the tool surface
- Volume of FAILED rows exceeds what direct SQL can comfortably triage (signal: > ~10 FAILED rows per week)
D7 — Sentry alert on → FAILED transition
Implementation: structlog + Sentry breadcrumb fire when handler classifies failure as terminal OR retry budget exhausts. Includes event_id, event_type, source, target, handler_name, last_error, attempts. ADR-036 observability machinery covers this; TA-6 specifies the obligation.
D8 — Idempotency requirements ratified from ADR-044 D10
Consumer handlers MUST check core.event_handled before performing non-idempotent work; replayed events carry the same idempotency_key, so dedup is structural. The replay protocol (D5) preserves event_id and idempotency_key; the consumer’s event_handled row from a prior successful processing — if any — would dedup the replay.
Alternatives considered
Separate core.outbox_dlq table. Rejected; ADR-017 commits to producer-owned recovery via status filter; a separate table fragments the row history.
Per-attempt failure history (every retry logged on the row). Rejected; structlog + Sentry already capture per-attempt; the row only needs cycle-level audit.
No failure_history column at alpha (audit lives only in logs). Rejected; on-row durable audit is high-leverage at trivial schema cost.
Replay = new outbox row. Rejected; idempotency_key collision concerns; loses provenance.
Eager Ops Agent tool surface in TA-6. Rejected; pre-anchors TA-15; SQL-via-Studio is sufficient operator path at alpha.
Consequences
- Failure handling is structural, not procedural.
- Replay preserves audit trail on the row itself.
- Retry curve covers transient failures without burning latency on terminal ones.
- Operator path is unblocked at alpha (no waiting for ADR-060).
failure_historyis a jsonb append column — bounded but unbounded in pathological cases (clusters of repeat-replay). Storage growth is tracked; no enforcement at alpha.- Operator surface is SQL-driven at alpha; not as ergonomic as a UI.
- Default classifier is a heuristic; per-handler overrides may be needed in practice.
- ADR-060 D7 tool surface, when consumer epic lands, may want different operator workflow primitives than the SQL-at-alpha shape; runbook updates accordingly.
References
- ADR-017 — producer-owned DLQ recovery
- ADR-065 —
spectral.coreadmission discipline - ADR-036 — Sentry alert substrate
- ADR-044 — outbox + status taxonomy + idempotency
- ADR-048 — generation column on outbox
- ADR-053 — D7/D8 placement principle (
p_new_generationas explicit arg) - ADR-060 — D7 DLQ inspection tools
- TA-6 disposition — SPEC-309 comment
fe6167b9 - TA-6 verification — SPEC-309 comment
3c324876 src/spectral/core/events/retry.py—RetryPolicy,DEFAULT_RETRY_POLICY, classifiersrc/spectral/core/events/failure.py—FailureCyclesupabase/migrations/20260425014302_core_outbox_failure_history.sqldocs/runbooks/event-substrate.md— operator workflow- Codex
system-design/foundations/contract-surfaces/event-substrate.mdx— close-pass updates