
Event substrate runbook

Operator runbook for Spectral’s event substrate — outbox inspection, DLQ triage, replay, listener restart, and the operator tools that surround them. See ADR-044 for the substrate doctrine, ADR-054 for the DLQ-as-status-filter + retry + replay model, and ADR-017 for the producer-owned recovery posture’s origin.

At alpha the operator surface is substrate-only: Supabase Studio (or psql) for inspection, SQL function calls for replay. The Ops Agent tool surface — list_failed_outbox_rows(), replay_outbox_row(), etc. — is the planned upgrade per ADR-060. This runbook documents the SQL-driven flow that operates day-zero through that upgrade landing.

Substrate at a glance

  • Outbox table: core.outbox — every event lives here through its full lifecycle (pending → in_flight → delivered | failed).
  • Dedup table: core.event_handled keyed on (handler_name, idempotency_key). Consumer handlers check this before performing non-idempotent work.
  • Channel routing: pg_notify(NEW.channel, NEW.id) fires on insert. Publishers compose channel as outbox_gen_<N> for per-generation listener isolation.
  • Per-row generation stamp: core.outbox.generation — workers filter claims by their own SPECTRAL_GENERATION env var. Structurally guarantees gen-N events are processed by gen-N code.
  • Per-row retry state: core.outbox.attempts, last_error, first_failed_at — current-cycle retry tracking. Cleared on replay.
  • Per-row failure audit: core.outbox.failure_history jsonb — append-only list of terminal-failure-cycle snapshots, populated by core.outbox_replay() at the moment the operator resets a row.
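For orientation, the operative columns above can be sketched as a single table definition. This is an illustrative summary assembled from the descriptions in this runbook, not the authoritative DDL — actual types, defaults, and constraints live in the migration:

-- Illustrative sketch of core.outbox (not the authoritative DDL)
create table core.outbox (
  id uuid primary key,
  event_type text not null,
  source text not null,
  target text not null,
  workspace_id uuid,
  channel text not null,                    -- outbox_gen_<N>
  generation int not null,                  -- workers filter claims on this
  payload jsonb not null,                   -- full event envelope
  idempotency_key text not null,
  status text not null default 'pending',   -- pending | in_flight | delivered | failed
  attempts int not null default 0,
  last_error text,
  first_failed_at timestamptz,
  failure_history jsonb not null default '[]',
  claimed_at timestamptz,
  deleted_at timestamptz
);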

DLQ inspection

The DLQ is core.outbox rows in status='failed' (per ADR-054 D1 — producer-owned recovery; no separate DLQ table).

List recent failed events

select
  id,
  event_type,
  source,
  target,
  workspace_id,
  attempts,
  last_error,
  first_failed_at,
  claimed_at as failed_at,
  jsonb_array_length(failure_history) as prior_replay_count
from core.outbox
where status = 'failed'
  and deleted_at is null
order by claimed_at desc
limit 50;

Inspect a specific failed event

select *
from core.outbox
where id = '<event-id>';

The full envelope payload is in the payload jsonb column. The audit trail of any prior replay cycles is in failure_history.
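Individual envelope fields can be pulled out with the standard jsonb operators. The key names below (data) are illustrative — use whatever the envelope schema actually defines:

select
  id,
  payload ->> 'event_type' as envelope_event_type,  -- text extraction
  payload -> 'data' as data,                        -- nested object, if present
  jsonb_pretty(payload) as pretty_payload
from core.outbox
where id = '<event-id>';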

Triage by source / target / handler

-- Failures grouped by event_type
select event_type, source, target, count(*)
from core.outbox
where status = 'failed' and deleted_at is null
group by event_type, source, target
order by count(*) desc;
-- Failures grouped by error class (heuristic: first line of last_error)
select
  split_part(last_error, ':', 1) as error_class,
  count(*)
from core.outbox
where status = 'failed' and deleted_at is null
group by error_class
order by count(*) desc;

Replay procedure

When the operator decides a failed event should be retried (e.g., upstream dependency was misconfigured, fix landed, want to reprocess), invoke core.outbox_replay():

-- Replay against the current production generation.
-- Look up SPECTRAL_GENERATION first; for production, query
-- core.deployments for the most recent live generation.
select max(generation) from core.deployments;
-- Then replay the specific event:
select core.outbox_replay(
  p_event_id => '<event-id>',
  p_new_generation => <current-generation>,
  p_replayed_by => 'operator-identity' -- nullable; caller's choice
);

The function performs the following steps atomically:

  1. Snapshots current cycle into failure_history: {cycle, attempts, last_error, first_failed_at, failed_at, replayed_at, replayed_by}.
  2. Clears current-cycle columns (attempts=0, last_error=null, first_failed_at=null).
  3. Updates generation to the supplied value (operator explicitly opts in to “process with current code”).
  4. Resets status='pending' and clears claimed_at.

The listener at the new generation picks up the row on its next SELECT ... FOR UPDATE SKIP LOCKED poll cycle.
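For reference, the listener's claim query follows the standard skip-locked pattern, filtered by the worker's own generation. This is a sketch of the shape only — the worker implementation is authoritative, and the batch size here is a placeholder:

-- Claim a batch of pending rows for this worker's generation.
update core.outbox
set status = 'in_flight',
    claimed_at = now()
where id in (
  select id
  from core.outbox
  where status = 'pending'
    and deleted_at is null
    and generation = <worker-generation>   -- SPECTRAL_GENERATION
  order by id
  limit 10                                 -- placeholder batch size
  for update skip locked
)
returning id, payload;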

Forever-discard a failed event

If the operator decides the event should NOT be replayed (e.g., business logic has moved on, the event is stale, the failure is data-corruption that no replay will fix), tombstone the row:

update core.outbox
set deleted_at = now()
where id = '<event-id>'
and status = 'failed';

The row stays in the table until the retention sweep (per the outbox retention policy) hard-deletes it. The audit trail in failure_history is preserved through the tombstone window.
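The retention sweep itself is owned by the outbox retention policy; its shape is roughly a hard delete of tombstoned rows past the window. Illustrative only — the interval below is a placeholder, not the policy's actual value:

-- Hard-delete tombstoned rows past the retention window (placeholder interval).
delete from core.outbox
where deleted_at is not null
  and deleted_at < now() - interval '<retention-window>';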

Idempotency invariant

Replayed events carry the same idempotency_key as the original. Consumer handlers MUST check core.event_handled before performing non-idempotent work. If a replay matches an already-processed event (consumer’s event_handled row exists), the consumer dedups and the row transitions to delivered without side-effects. This is correct behavior — the operator may inadvertently replay an event that previously succeeded; the dedup makes the replay safe.
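The consumer-side dedup check reduces to an insert-or-skip against core.event_handled. A sketch of the pattern, assuming the two-column key described in the substrate summary:

-- Attempt to record the (handler, key) pair; zero rows inserted
-- means the event was already handled.
insert into core.event_handled (handler_name, idempotency_key)
values ('<handler-name>', '<idempotency-key>')
on conflict (handler_name, idempotency_key) do nothing;
-- If zero rows were inserted, skip the non-idempotent work and
-- transition the outbox row straight to delivered.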

Listener restart

Listeners hold a dedicated direct-to-Postgres connection (port 5432, not Supavisor) per ADR-044. If a listener falls behind or disconnects, the polling fallback drains via SELECT ... FOR UPDATE SKIP LOCKED. Manual restart procedure:

  1. Verify backlog: count of pending rows on the listener’s channel
    select channel, count(*)
    from core.outbox
    where status = 'pending' and deleted_at is null
    group by channel;
  2. Restart the worker service via Render API or render deploy.
  3. New listener picks up where the prior one left off — outbox rows in pending are re-claimable; rows in in_flight past the claim TTL (300s per ADR-046 D8) are reset to pending by the reaper.
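The reaper's reset of stale in_flight rows (step 3) is roughly the following. The 300s TTL is per ADR-046 D8; the rest of the shape is illustrative:

-- Return stale in_flight claims to pending so a live listener can re-claim them.
update core.outbox
set status = 'pending',
    claimed_at = null
where status = 'in_flight'
  and claimed_at < now() - interval '300 seconds';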

Diagnostic queries

-- Backlog by status
select status, count(*)
from core.outbox
where deleted_at is null
group by status;
-- Stale in-flight (potential reaper-attention candidates)
select id, claimed_at, age(now(), claimed_at) as age
from core.outbox
where status = 'in_flight' and claimed_at < now() - interval '5 minutes'
order by claimed_at;
-- Notify queue health (per ADR-044)
select pg_notification_queue_usage();
-- Result is a fraction in [0, 1]; > 0.5 = listener falling behind.
-- Generation distribution (legacy-drain candidates)
select generation, status, count(*)
from core.outbox
where deleted_at is null
group by generation, status
order by generation;

Sentry alert pathway

Per ADR-054 D7 + ADR-036, listener implementations fire a Sentry breadcrumb on the → FAILED transition. The breadcrumb includes event_id, event_type, source, target, handler_name, last_error, and attempts. Alert routing is configured at the Sentry project level; runbook readers should subscribe to the relevant Sentry alert channel for proactive notification rather than polling core.outbox for FAILED rows.

Forward-trigger evolution

The substrate-evolution and Ops-Agent-tool triggers are owned by ADR-044 D11 and ADR-060 — see those ADRs for the authoritative trigger lists. When a trigger fires, this runbook updates to the new substrate’s inspection path or to the Ops Agent tool surface.

  • ADR-044 — event substrate doctrine.
  • ADR-054 — DLQ + retry + replay.
  • ADR-055 — Curation → Worlds interface contract.
  • ADR-056 — T3 Memory → Worlds routing.
  • ADR-057 — Failure cluster → rule-candidate signal.
  • ADR-060 — Agent tool invocation (operator-surface upgrade).
  • ADR-046 — D8 generation-stamping + drain parameters.
  • secrets-management.md — Render Env Group rotation that drives generation bumps.
  • deployment.md — production cutover sequence; generation-stamp protocol step 5 (allocate generation).
  • legacy-drain.md — drains outbox rows at a prior generation after rollback.