ADR-054: DLQ + retry semantics — exponential backoff with jitter, in-place replay with `failure_history` audit

Status: Accepted (2026-04-25)

Context

ADR-044 (TA-5) locked the substrate: outbox table, four-state status taxonomy (pending → in_flight → delivered | failed), idempotency via core.event_handled + idempotency_key, polling fallback past LISTEN/NOTIFY cap. ADR-017 + the envelope module docstring already commit DLQ to status='failed' filter on core.outbox rather than a separate DLQ table. This ADR adds the retry curve, replay protocol, and operator-surface contract that complete the failure-handling story.

Decision

D1 — DLQ as status='failed' filter on core.outbox (ADR-044 D4 ratification)

No separate DLQ table. Failed-out rows stay in core.outbox with status='failed'. Producer-owned recovery surface per ADR-017. Query pattern: SELECT ... FROM core.outbox WHERE status='failed' AND deleted_at IS NULL.

D2 — Retry curve: exponential with jitter

Default policy: base delay 1 s, multiplier 2, max delay 5 min (300 s), full jitter applied per attempt. Nominal schedule: ~1 s, ~2 s, ~4 s, ~8 s, ~16 s; the 300 s cap is not reached within the default five retries and only matters for policies that run longer. Jitter spreads retries out to avoid a thundering herd on transient failures.
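A minimal sketch of the D2 curve, assuming "full jitter" has its common meaning (each delay is drawn uniformly between 0 and the capped exponential value). BackoffPolicy and its method names are hypothetical and are not the real RetryPolicy in spectral.core.events.retry.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BackoffPolicy:
    """Hypothetical holder for the D2 defaults (not the real RetryPolicy)."""

    base_delay_s: float = 1.0
    multiplier: float = 2.0
    max_delay_s: float = 300.0

    def nominal_delay(self, retry: int) -> float:
        """Un-jittered delay before retry number `retry` (1-based), capped."""
        raw = self.base_delay_s * self.multiplier ** (retry - 1)
        return min(self.max_delay_s, raw)

    def jittered_delay(self, retry: int, rng: random.Random | None = None) -> float:
        """Full jitter (assumed meaning): uniform draw from [0, nominal_delay]."""
        source = rng if rng is not None else random
        return source.uniform(0.0, self.nominal_delay(retry))


policy = BackoffPolicy()
nominal = [policy.nominal_delay(n) for n in range(1, 6)]
print(nominal)       # [1.0, 2.0, 4.0, 8.0, 16.0]
print(sum(nominal))  # 31.0
```

Under this reading the listed schedule values are upper bounds per retry, and the actual sleep is anywhere between 0 and that bound.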

D3 — Max retries: 5 (default; per-handler override allowed)

Five attempts before transition to status='failed'. Per-handler override via RetryPolicy injection. Reasoning: 5 covers most transient-failure recovery windows (cumulative ~30 s of attempts) without burning hours of latency on terminally-broken events.

D4 — Transient vs terminal classifier

Default classifier (in spectral.core.events.retry):

  • Terminal (immediate status='failed', no retry): IntegrityError, ValueError from payload validation, pydantic.ValidationError, explicit TerminalHandlerError subclass
  • Transient (retry per D2/D3): ConnectionError, TimeoutError, OperationalError, default for any other exception
  • Per-handler override: handlers may register additional terminal exception types via RetryPolicy.terminal_errors
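The classifier rules above can be sketched as follows. This is an illustration only: RetryPolicySketch, decide, and the stand-in exception classes are hypothetical names, the database errors are local placeholders so the block runs on the standard library alone, pydantic.ValidationError is omitted for the same reason, and every ValueError is treated as terminal (the ADR limits this to payload validation). How the retry budget is counted is also an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


class TerminalHandlerError(Exception):
    """Stand-in for the explicit terminal marker named in D4."""


class IntegrityError(Exception):
    """Placeholder for the database driver's IntegrityError (assumption)."""


class OperationalError(Exception):
    """Placeholder for the database driver's OperationalError (assumption)."""


# Default terminal set; anything else is treated as transient.
DEFAULT_TERMINAL: tuple[type[BaseException], ...] = (
    IntegrityError,
    ValueError,
    TerminalHandlerError,
)


@dataclass(frozen=True)
class RetryPolicySketch:
    """Hypothetical policy object; not the real RetryPolicy."""

    max_retries: int = 5
    terminal_errors: tuple[type[BaseException], ...] = ()

    def is_terminal(self, exc: BaseException) -> bool:
        # Per-handler terminal types extend the default set.
        return isinstance(exc, DEFAULT_TERMINAL + self.terminal_errors)

    def decide(self, exc: BaseException, retries_used: int) -> str:
        """Return "fail" (move to status='failed') or "retry"."""
        if self.is_terminal(exc):
            return "fail"
        if retries_used >= self.max_retries:
            return "fail"  # retry budget exhausted (D3)
        return "retry"


default_policy = RetryPolicySketch()
print(default_policy.decide(TimeoutError("slow"), retries_used=0))   # retry
print(default_policy.decide(ValueError("bad payload"), retries_used=0))  # fail
```

A handler that considers, say, KeyError unrecoverable would pass terminal_errors=(KeyError,) in this sketch.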

D5 — Replay protocol: in-place reset to PENDING + failure_history audit column

Schema extension on core.outbox (migration 20260425014302_core_outbox_failure_history.sql):

  • failure_history jsonb default '[]' — accumulates terminal-failure-cycle snapshots
  • attempts integer not null default 0 — current-cycle retry count
  • last_error text null — current-cycle last error message
  • first_failed_at timestamptz null — current-cycle first-failure timestamp

Replay function core.outbox_replay(event_id uuid, p_new_generation bigint, p_replayed_by text) performs atomically:

  1. Snapshot current cycle into failure_history array entry.
  2. Clear current-cycle fields (attempts=0, last_error=null, first_failed_at=null).
  3. Update generation to the caller-supplied p_new_generation (operator explicitly opts in to “process with current code”; the gen-N-stays-on-gen-N invariant is a rolling-deploy window guarantee, not an after-the-fact-replay invariant).
  4. Reset status='pending'.

p_new_generation is a caller-supplied argument (TA-26 D7/D8 placement principle — SQL functions take values explicitly rather than embedding runtime-context assumptions). p_replayed_by is text to accommodate both human-operator emails and SQL-direct-caller identifiers (e.g., "sql-direct").

failure_history entries are typed via spectral.core.events.failure.FailureCycle. Additive-only schema per ADR-044 D11.
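The four replay steps can be modelled in memory as below. This is a sketch, not the SQL function: OutboxRow and replay are hypothetical names, the snapshot keys are guesses at what a FailureCycle entry might carry, and the guard that only failed rows can be replayed is an assumption the ADR does not state.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class OutboxRow:
    """Hypothetical in-memory mirror of the relevant core.outbox columns."""

    event_id: str
    idempotency_key: str
    generation: int
    status: str = "pending"
    attempts: int = 0
    last_error: Optional[str] = None
    first_failed_at: Optional[datetime] = None
    failure_history: list[dict[str, Any]] = field(default_factory=list)


def replay(row: OutboxRow, new_generation: int, replayed_by: str) -> None:
    """Apply the D5 steps to one row (the real function does this atomically in SQL)."""
    if row.status != "failed":  # assumption: replay is only valid from 'failed'
        raise ValueError(f"row {row.event_id} is not in status 'failed'")
    # 1. Snapshot the current cycle (entry shape is hypothetical).
    row.failure_history.append({
        "attempts": row.attempts,
        "last_error": row.last_error,
        "first_failed_at": row.first_failed_at.isoformat() if row.first_failed_at else None,
        "generation": row.generation,
        "replayed_by": replayed_by,
        "replayed_at": datetime.now(timezone.utc).isoformat(),
    })
    # 2. Clear current-cycle fields.
    row.attempts, row.last_error, row.first_failed_at = 0, None, None
    # 3. Adopt the caller-supplied generation.
    row.generation = new_generation
    # 4. Make the row eligible for delivery again.
    row.status = "pending"


row = OutboxRow("evt-1", "key-1", generation=3, status="failed", attempts=5,
                last_error="boom", first_failed_at=datetime(2026, 4, 25, tzinfo=timezone.utc))
replay(row, new_generation=4, replayed_by="sql-direct")
print(row.status, row.generation, len(row.failure_history))  # pending 4 1
```

Note that event_id and idempotency_key are never touched, which is what lets D8 deduplicate a replayed event.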

D6 — Operator surface: substrate-only at alpha; tool surface deferred to ADR-060

At alpha: the operator inspects FAILED rows via Supabase Studio (or direct psql) and calls core.outbox_replay(event_id, ...) via SQL. A Sentry alert (D7) fires on the → FAILED transition with row context. The runbook documents the workflow.

Ops Agent tool surface (DLQ inspection + replay) is delivered by ADR-060 D7 (list_dlq_events, get_dlq_event_detail, replay_dlq_event). The TA-6 contract surface is the substrate; the operator UX layer is TA-15’s domain.

Forward triggers to upgrade the operator surface:

  • ADR-060 D7 lands the tool surface
  • Volume of FAILED rows exceeds what direct SQL can comfortably triage (signal: > ~10 FAILED rows per week)

D7 — Sentry alert on → FAILED transition

Implementation: structlog + Sentry breadcrumb fire when handler classifies failure as terminal OR retry budget exhausts. Includes event_id, event_type, source, target, handler_name, last_error, attempts. ADR-036 observability machinery covers this; TA-6 specifies the obligation.

D8 — Idempotency requirements ratified from ADR-044 D10

Consumer handlers MUST check core.event_handled before performing non-idempotent work; replayed events carry the same idempotency_key, so dedup is structural. The replay protocol (D5) preserves event_id and idempotency_key; the consumer’s event_handled row from a prior successful processing — if any — would dedup the replay.
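A small sketch of why replay is safe for a consumer that already succeeded. EventHandledStore is an in-memory stand-in for core.event_handled, and keying it by (handler name, idempotency_key) is an assumption; a real consumer would perform the check, the work, and the insert inside one database transaction.

```python
from typing import Any, Callable


class EventHandledStore:
    """In-memory stand-in for core.event_handled (hypothetical shape)."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()

    def already_handled(self, handler_name: str, idempotency_key: str) -> bool:
        return (handler_name, idempotency_key) in self._seen

    def mark_handled(self, handler_name: str, idempotency_key: str) -> None:
        self._seen.add((handler_name, idempotency_key))


def consume(
    store: EventHandledStore,
    handler_name: str,
    event: dict[str, Any],
    handler: Callable[[dict[str, Any]], None],
) -> bool:
    """Run `handler` at most once per idempotency_key; return True if it ran."""
    key = event["idempotency_key"]
    if store.already_handled(handler_name, key):
        return False  # duplicate delivery or replay: skip the non-idempotent work
    handler(event)
    store.mark_handled(handler_name, key)
    return True


store = EventHandledStore()
sent: list[str] = []
event = {"event_id": "evt-1", "idempotency_key": "key-1"}

first = consume(store, "send_email", event, lambda e: sent.append(e["event_id"]))
# A replayed row keeps the same idempotency_key, so the second delivery is a no-op.
second = consume(store, "send_email", event, lambda e: sent.append(e["event_id"]))
print(first, second, sent)  # True False ['evt-1']
```

If the handler raises, nothing is marked as handled in this sketch, so a later retry or replay runs the work again.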

Alternatives considered

Separate core.outbox_dlq table. Rejected; ADR-017 commits to producer-owned recovery via status filter; a separate table fragments the row history.

Per-attempt failure history (every retry logged on the row). Rejected; structlog + Sentry already capture per-attempt; the row only needs cycle-level audit.

No failure_history column at alpha (audit lives only in logs). Rejected; on-row durable audit is high-leverage at trivial schema cost.

Replay = new outbox row. Rejected; idempotency_key collision concerns; loses provenance.

Eager Ops Agent tool surface in TA-6. Rejected; pre-anchors TA-15; SQL-via-Studio is sufficient operator path at alpha.

Consequences

  • Failure handling is structural, not procedural.
  • Replay preserves audit trail on the row itself.
  • Retry curve covers transient failures without burning latency on terminal ones.
  • Operator path is unblocked at alpha (no waiting for ADR-060).
  • failure_history is a jsonb append column: small in normal operation but unbounded in pathological cases (clusters of repeated replays). Storage growth is tracked; no enforcement at alpha.
  • Operator surface is SQL-driven at alpha; not as ergonomic as a UI.
  • Default classifier is a heuristic; per-handler overrides may be needed in practice.
  • When the consumer epic lands, the ADR-060 D7 tool surface may want different operator workflow primitives than the SQL-at-alpha shape; the runbook will be updated accordingly.

References

  • ADR-017 — producer-owned DLQ recovery
  • ADR-065 — spectral.core admission discipline
  • ADR-036 — Sentry alert substrate
  • ADR-044 — outbox + status taxonomy + idempotency
  • ADR-048 — generation column on outbox
  • ADR-053 — D7/D8 placement principle (p_new_generation as explicit arg)
  • ADR-060 — D7 DLQ inspection tools
  • TA-6 disposition — SPEC-309 comment fe6167b9
  • TA-6 verification — SPEC-309 comment 3c324876
  • src/spectral/core/events/retry.py — RetryPolicy, DEFAULT_RETRY_POLICY, classifier
  • src/spectral/core/events/failure.py — FailureCycle
  • supabase/migrations/20260425014302_core_outbox_failure_history.sql
  • docs/runbooks/event-substrate.md — operator workflow
  • Codex system-design/foundations/contract-surfaces/event-substrate.mdx — close-pass updates