# Event substrate runbook
Operator runbook for Spectral’s event substrate — outbox inspection, DLQ triage, replay, listener restart, and the operator tools that surround them. See ADR-044 for the substrate doctrine, ADR-054 for the DLQ-as-status-filter + retry + replay model, and ADR-017 for the origin of the producer-owned recovery posture.
At alpha the operator surface is substrate-only: Supabase Studio (or psql) for inspection, SQL function calls for replay. The Ops Agent tool surface — `list_failed_outbox_rows()`, `replay_outbox_row()`, etc. — is the planned upgrade per ADR-060. This runbook documents the SQL-driven flow that operators use from day zero until that upgrade lands.
## Substrate at a glance

- Outbox table: `core.outbox` — every event lives here through its full lifecycle (`pending` → `in_flight` → `delivered` | `failed`).
- Dedup table: `core.event_handled`, keyed on `(handler_name, idempotency_key)`. Consumer handlers check this before performing non-idempotent work.
- Channel routing: `pg_notify(NEW.channel, NEW.id)` fires on insert. Publishers compose the channel as `outbox_gen_<N>` for per-generation listener isolation.
- Per-row generation stamp: `core.outbox.generation` — workers filter claims by their own `SPECTRAL_GENERATION` env var. Structurally guarantees gen-N events are processed by gen-N code.
- Per-row retry state: `core.outbox.attempts`, `last_error`, `first_failed_at` — current-cycle retry tracking. Cleared on replay.
- Per-row failure audit: `core.outbox.failure_history` (jsonb) — append-only list of terminal-failure-cycle snapshots, populated by `core.outbox_replay()` at the moment the operator resets a row.
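For orientation, the columns above imply a table shape roughly like the following. This is a hypothetical sketch only — column types, defaults, and constraints are assumptions; the actual migration is authoritative:

```sql
-- Hypothetical sketch of core.outbox; the real migration is authoritative.
create table core.outbox (
  id              uuid primary key default gen_random_uuid(),
  event_type      text not null,
  source          text not null,
  target          text not null,
  workspace_id    uuid,
  channel         text not null,            -- e.g. 'outbox_gen_<N>'
  generation      int  not null,            -- stamped from SPECTRAL_GENERATION
  payload         jsonb not null,           -- full event envelope
  status          text not null default 'pending'
                    check (status in ('pending','in_flight','delivered','failed')),
  attempts        int  not null default 0,  -- current-cycle retry count
  last_error      text,
  first_failed_at timestamptz,
  failure_history jsonb not null default '[]'::jsonb,  -- prior replay cycles
  claimed_at      timestamptz,
  deleted_at      timestamptz               -- tombstone for forever-discard
);
```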
## DLQ inspection

The DLQ is the set of `core.outbox` rows in `status = 'failed'` (per ADR-054 D1 — producer-owned recovery; there is no separate DLQ table).
### List recent failed events

```sql
select id, event_type, source, target, workspace_id, attempts,
       last_error, first_failed_at, claimed_at as failed_at,
       jsonb_array_length(failure_history) as prior_replay_count
from core.outbox
where status = 'failed' and deleted_at is null
order by claimed_at desc
limit 50;
```

### Inspect a specific failed event

```sql
select *
from core.outbox
where id = '<event-id>';
```

The full envelope payload is in the `payload` jsonb column. The audit trail of any prior replay cycles is in `failure_history`.
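To read prior replay cycles as rows rather than raw jsonb, the `failure_history` array can be expanded. A sketch, assuming the snapshot field names listed under the replay procedure:

```sql
-- Expand failure_history into one row per prior replay cycle.
select h ->> 'cycle'       as cycle,
       h ->> 'attempts'    as attempts,
       h ->> 'last_error'  as last_error,
       h ->> 'replayed_at' as replayed_at,
       h ->> 'replayed_by' as replayed_by
from core.outbox,
     jsonb_array_elements(failure_history) as h
where id = '<event-id>';
```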
### Triage by source / target / handler

```sql
-- Failures grouped by event_type
select event_type, source, target, count(*)
from core.outbox
where status = 'failed' and deleted_at is null
group by event_type, source, target
order by count(*) desc;
```

```sql
-- Failures grouped by error class
-- (heuristic: the text before the first colon in last_error)
select split_part(last_error, ':', 1) as error_class, count(*)
from core.outbox
where status = 'failed' and deleted_at is null
group by error_class
order by count(*) desc;
```

## Replay procedure
When the operator decides a failed event should be retried (e.g., a misconfigured upstream dependency has since been fixed and the event should be reprocessed), invoke `core.outbox_replay()`:
```sql
-- Replay against the current production generation.
-- Look up SPECTRAL_GENERATION first; for production, query
-- core.deployments for the most recent live generation.
select max(generation) from core.deployments;
```

```sql
-- Then replay the specific event:
select core.outbox_replay(
  p_event_id       => '<event-id>',
  p_new_generation => <current-generation>,
  p_replayed_by    => 'operator-identity'  -- nullable; caller's choice
);
```

The function performs the following steps atomically:

- Snapshots the current cycle into `failure_history`: `{cycle, attempts, last_error, first_failed_at, failed_at, replayed_at, replayed_by}`.
- Clears the current-cycle columns (`attempts = 0`, `last_error = null`, `first_failed_at = null`).
- Updates `generation` to the supplied value (the operator explicitly opts in to “process with current code”).
- Resets `status = 'pending'` and clears `claimed_at`.
The listener at the new generation picks up the row on its next `SELECT ... FOR UPDATE SKIP LOCKED` poll cycle.
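The poll-side claim can be sketched roughly as follows — a hypothetical illustration of the `SKIP LOCKED` pattern with an assumed batch size, not the worker’s actual query:

```sql
-- Hypothetical worker claim query; generation filter per SPECTRAL_GENERATION.
with claimed as (
  select id
  from core.outbox
  where status = 'pending'
    and generation = <current-generation>
    and deleted_at is null
  order by id
  limit 10                    -- assumed batch size
  for update skip locked      -- concurrent workers skip each other's rows
)
update core.outbox o
set status = 'in_flight', claimed_at = now()
from claimed c
where o.id = c.id
returning o.id;
```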
## Forever-discard a failed event
If the operator decides the event should NOT be replayed (e.g., business logic has moved on, the event is stale, or the failure is data corruption that no replay will fix), tombstone the row:
```sql
update core.outbox
set deleted_at = now()
where id = '<event-id>' and status = 'failed';
```

The row stays in the table until the retention sweep (per the outbox retention policy) hard-deletes it. The audit trail in `failure_history` is preserved through the tombstone window.
## Idempotency invariant

Replayed events carry the same `idempotency_key` as the original. Consumer handlers MUST check `core.event_handled` before performing non-idempotent work. If a replay matches an already-processed event (the consumer’s `event_handled` row exists), the consumer dedups and the row transitions to `delivered` without side-effects. This is correct behavior — the operator may inadvertently replay an event that previously succeeded; the dedup makes the replay safe.
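The consumer-side check can be sketched as a single insert-or-skip against `core.event_handled` — a hypothetical illustration assuming the two-column key described above, not the actual handler code:

```sql
-- Hypothetical handler dedup: claim the (handler, key) pair exactly once.
insert into core.event_handled (handler_name, idempotency_key)
values ('<handler-name>', '<idempotency-key>')
on conflict (handler_name, idempotency_key) do nothing;
-- If this inserts 0 rows, the event was already handled:
-- skip the non-idempotent work and mark the outbox row delivered.
```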
## Listener restart

Listeners hold a dedicated direct-to-Postgres connection (port 5432, not Supavisor) per ADR-044. If a listener falls behind or disconnects, the polling fallback drains via `SELECT ... FOR UPDATE SKIP LOCKED`. Manual restart procedure:

- Verify the backlog: count pending rows per channel.

  ```sql
  select channel, count(*)
  from core.outbox
  where status = 'pending' and deleted_at is null
  group by channel;
  ```

- Restart the worker service via the Render API or `render deploy`.
- The new listener picks up where the prior one left off — outbox rows in `pending` are re-claimable; rows in `in_flight` past the claim TTL (300s per ADR-046 D8) are reset to `pending` by the reaper.
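The reaper behavior described above amounts to roughly the following — a sketch assuming the 300s claim TTL from ADR-046 D8, not the reaper’s actual implementation:

```sql
-- Hypothetical reaper pass: return stale in_flight claims to pending.
update core.outbox
set status = 'pending', claimed_at = null
where status = 'in_flight'
  and claimed_at < now() - interval '300 seconds'
  and deleted_at is null;
```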
## Diagnostic queries

```sql
-- Backlog by status
select status, count(*)
from core.outbox
where deleted_at is null
group by status;
```

```sql
-- Stale in-flight (potential reaper-attention candidates)
select id, claimed_at, age(now(), claimed_at) as age
from core.outbox
where status = 'in_flight' and claimed_at < now() - interval '5 minutes'
order by claimed_at;
```

```sql
-- Notify queue health (per ADR-044)
select pg_notification_queue_usage();
-- Result is a fraction in [0, 1]; > 0.5 = listener falling behind.
```

```sql
-- Generation distribution (legacy-drain candidates)
select generation, status, count(*)
from core.outbox
where deleted_at is null
group by generation, status
order by generation;
```

## Sentry alert pathway
Per ADR-054 D7 + ADR-036, listener implementations fire a Sentry breadcrumb on the transition to `failed`. The breadcrumb includes `event_id`, `event_type`, `source`, `target`, `handler_name`, `last_error`, and `attempts`. Alert routing is configured at the Sentry project level; runbook readers should subscribe to the relevant Sentry alert channel for proactive notification rather than polling `core.outbox` for `failed` rows.
## Forward-trigger evolution
The substrate-evolution and Ops-Agent-tool triggers are owned by ADR-044 D11 and ADR-060 — see those ADRs for the authoritative trigger lists. When a trigger fires, this runbook updates to the new substrate’s inspection path or to the Ops Agent tool surface.
## Related
- ADR-044 — event substrate doctrine.
- ADR-054 — DLQ + retry + replay.
- ADR-055 — Curation → Worlds interface contract.
- ADR-056 — T3 Memory → Worlds routing.
- ADR-057 — Failure cluster → rule-candidate signal.
- ADR-060 — Agent tool invocation (operator-surface upgrade).
- ADR-046 — D8 generation-stamping + drain parameters.
- `secrets-management.md` — Render Env Group rotation that drives generation bumps.
- `deployment.md` — production cutover sequence; generation-stamp protocol step 5 (allocate generation).
- `legacy-drain.md` — drains outbox rows at a prior generation after rollback.