Skip to content
GitHub
Decisions

ADR-044: Event substrate — Postgres LISTEN/NOTIFY + transactional outbox; single `core.outbox`; deployment-generation routing

Status: Accepted (2026-04-21; D5 partially superseded by ADR-063 — notification-flow portion stands; SQL grant default does not; D3 superseded by ADR-048 D2 — single core.outbox table; D8 channel taxonomy refined by ADR-048 D5 — outbox_default plus per-generation outbox_gen_<N> channels)

Context

This ADR captures TA-5 plus the substantive supersessions and refinements that came in through TA-19 D2 / D5 and TA-15 D-inter-context. The original TA-5 disposition picked LISTEN/NOTIFY + transactional outbox and per-producer-context outbox tables. TA-19 collapsed the per-context outbox tables to a single core.outbox and added a deployment-generation routing layer; TA-15 reframed the inter-context default away from SQL grants. The result is one canonical event substrate ADR; ADR-048 (TA-19) carries the deployment-topology dimension that this substrate runs in, and ADR-063 carries the inter-context access pattern that the substrate is one half of.

The substrate covers two flows:

  • Signal path between contexts (per ADR-017: at-least-once + idempotent + producer-owned DLQ + best-effort ordering)
  • Within-context worker dispatch (API → worker, AgentTask surface rolled forward from ADR-043 / TA-14 D11)

Research convergence: landscape survey (six substrates — LISTEN/NOTIFY, polling-only outbox, Supabase Realtime, Redis Streams, NATS JetStream, Kafka/Redpanda) plus an adversarial steelman that argued for NATS JetStream from day 1. Both confirmed LISTEN/NOTIFY + transactional outbox as the right alpha pick. Supabase Realtime was disqualified because its delivery guarantee is not published in primary sources. ADR-041 D9 pre-committed one dedicated direct-to-Postgres connection per listener process, bypassing Supavisor transaction mode (LISTEN is incompatible with transaction-mode pooling).

Decision

D1 — Primary substrate: Postgres LISTEN/NOTIFY + transactional outbox

Producer writes the outbox row inside the business transaction; an AFTER INSERT trigger fires NOTIFY <channel>, '<row_id>' at commit; a dedicated listener pulls via SELECT ... FOR UPDATE SKIP LOCKED. Substrate piggybacks on Supabase Postgres; zero new infrastructure at alpha.

D2 — Automatic polling fallback

If the dedicated LISTEN connection cannot establish or drops for longer than the reconnect-backoff cap (30s), the consumer falls back to polling at 5s cadence. Polling is not a second substrate; it is a degradation mode of D1 that keeps correctness under listener failure. NOTIFY is a latency optimization, not a correctness primitive.

D3 — Single core.outbox table

Superseded the original per-producer-context outbox tables in TA-19 D2. One table in core with the row schema described in D4; producers tag their rows with the source column. Schema-layer DRY plus envelope-contract consistency plus simpler permission model outweighed the per-context aesthetic argument at alpha scale.

D4 — Outbox row schema

Columns:

  • id uuid pk
  • event_type text
  • event_version int
  • occurred_at timestamptz
  • source text (renamed from source_bc per TA-19 D2)
  • target text null (renamed from target_bc; null = broadcast)
  • content_class text (future-proofs per-class retention divergence; today uniform)
  • channel text default 'outbox_default'
  • generation bigint (TA-19 D5; stamped by publisher at write time)
  • workspace_id uuid null
  • payload jsonb
  • idempotency_key text
  • trace_context text null
  • status text check in ('pending','in_flight','delivered','failed')
  • attempts int default 0
  • last_error text null
  • delivered_at timestamptz null
  • deleted_at timestamptz null (per ADR-042 D3)

Plus the failure-history columns from TA-6 D5: failure_history jsonb default '[]', first_failed_at timestamptz null. DLQ = status='failed' subset (ADR-017 producer-owned recovery; no separate table).

D5 — Inter-context access

Inter-context read pattern superseded by ADR-063. ADR-063 establishes that no SQL grants between contexts ship at any layer; call flow between contexts uses callee-owned OHS Protocols + DI through the framework-layer composition seam (per ADR-065 D3 + D5). The notification-flow portion (events flow through this substrate) stands. The SECURITY DEFINER status-transition function for outbox row updates is preserved (substrate detail, not an inter-context application read).

D6 — EventEnvelope lives in spectral.core.events

Frozen pydantic model. Fields: event_id: UUID, event_type: str, event_version: int, occurred_at: datetime, source: str, target: str | None, workspace_id: UUID | None, payload: dict[str, Any], idempotency_key: str, trace_context: str | None. The payload field is dict[str, Any] at the substrate layer; typed payload location superseded by ADR-065 D2 — producer-owned typed payload models live in <producer>.contracts.events.*, not spectral.core.events.signals.*. The generation value lives on the outbox row only — the envelope stays substrate-agnostic.

D7 — Publisher / Listener protocols in spectral.core.events.protocols

EventPublisher (async def publish(envelope) -> None), EventListener (async def listen(*, channel, handler_name, handler) -> None), EventHandler = Callable[[EventEnvelope], Awaitable[None]]. Per-context implementations live in spectral.<context>.infrastructure.events.

D8 — NOTIFY channel taxonomy (refined by TA-19 D5)

Originally one channel per producer context (<producer_bc>_outbox_new). Refined: column-value-driven routing with outbox_default as the baseline channel plus per-generation channels outbox_gen_<N> (TA-19 D5). The core.outbox_notify() trigger fires pg_notify(NEW.channel, NEW.id::text) on INSERT; the trigger function stays frozen. Workers LISTEN only on their own generation’s channel — V2 worker is structurally incapable of receiving V1 NOTIFY. The split axis is processing-characteristic (hot/cold, latency class) plus deployment-generation, not context.

NOTIFY payload is the outbox row ID only (UUID string, ~36 bytes — trivially within the 8 KB Postgres cap). Full payload lives in the outbox row, pulled by the handler.

D9 — Listener lifecycle

Dedicated direct-to-Postgres connection (port 5432, autocommit mode) per listening context process (per ADR-041 D9). On startup: drain existing PENDING rows once via SKIP LOCKED, then LISTEN <channel>. On each NOTIFY: drain batch N=10 via SKIP LOCKED. Reconnect with exponential backoff (1s → 30s cap); polling fallback (D2) activates when backoff exceeds the cap. Listener health is a monitored metric.

D10 — At-least-once + idempotent semantics

Inherited from ADR-017. Idempotency key = event_id by default. Consumers persist handled event IDs in a single core.event_handled table with UNIQUE(handler_name, idempotency_key) (refined from per-context event_handled by TA-19 D2). handler_name must be scope-qualified (e.g., worlds.scan_completed_indexer) to guarantee global uniqueness; protocol docstring enforces. Duplicate redelivery is caught by the ON CONFLICT path. The event_handled retention policy (D13) must strictly exceed the outbox active+grace floor to guarantee dedup correctness across the full outbox window.

D11 — Schema versioning

event_version integer on envelope; bumped when payload shape changes. Consumers ignore unknown fields (forward-compatible). Producers never remove fields within a major version (additive-only). Breaking payload change = new event_type with suffix (e.g., scan.completed.v2) plus deprecation deadline on the old type. Contract test pins the envelope shape.

D12 — AgentTask dispatch (closes ADR-043 / TA-14 D11)

platform.agent_tasks remains a separate workspace-scoped table carrying business state (status, HITL approval linkage, conversation_id, result back-reference, retention). Dispatch flows through core.outbox with event_type='agent.task.dispatched' and payload={agent_task_id}. The worker listens on the appropriate channel, reads the payload, pulls the agent_tasks row, executes. State transitions on agent_tasks that require worker action emit their own outbox rows. Separates transactional business state from transport. Trigger event ID for proactive conversations (conversations.trigger_event_id) = outbox row ID when applicable.

D13 — Retention integration

Two POLICY_REGISTRY entries (per ADR-042 D4):

  • (outbox_row, *) → RetentionPolicy(active_ttl_days=45, tombstoned_grace_days=7, disposal=HARD_DELETE) — 45-day window gives real forensic runway without a separate audit archive. Content-class-agnostic.
  • (event_handled, *) → RetentionPolicy(active_ttl_days=60, tombstoned_grace_days=7, disposal=HARD_DELETE) — must strictly exceed the outbox active+grace floor (52 days) to guarantee dedup correctness. 8-day buffer above floor; reducible only with a matching outbox reduction.

A contract test pins the inequality: event_handled.active_ttl > outbox.active_ttl + outbox.grace.

D14 — Observability

Per-publish structlog entry with the seven ADR-036 D5 canonical fields plus {event_id, event_type, source, target, workspace_id}. Per-handle structlog entry adds {attempts, duration_ms, status_result, handler_name}. OTel span linkage: producer emits a span; envelope carries W3C traceparent in trace_context; consumer continues the span. pg_notification_queue_usage() is exported as a gauge via the OTel Collector; alert at 25% fill.

D15 — Forward triggers (two-step migration ladder)

  • Step 1 (in-Postgres, cheap). Outbox > 5M rows steady-state → partition by time-range on occurred_at (native Postgres range partitioning, monthly). Operationally routine; no event contract change.
  • Step 2 (substrate swap, substantive). Any of: outbox > 20M rows sustained after partitioning + retention sweep; consumer ack latency p99 > 500 ms sustained over a 1-week window; first production incident attributable to LISTEN/NOTIFY queue-fill, lock contention, or outbox bloat → swap relay to NATS JetStream or Redis Streams. Event schemas (owned by spectral.core.events) unchanged; consumer API swaps from LISTEN to broker subscription. ~2–3 engineer-weeks when triggered.

D16 — Queue-depth observability

pg_notification_queue_usage() polled on 30s cadence by a dedicated metrics emitter, emitted as an OTel gauge. Alert at 25% fill. Addresses the Recall.ai-style failure mode (queue fills under sustained load or stuck listener) before user-visible impact. PG 15+ already fixes the pre-PG13 AccessExclusiveLock serialization that caused the Recall.ai postmortem; the queue-fill failure mode persists and monitoring catches it.

Alternatives considered

NATS JetStream from day 1. Rejected. Adversarial’s own threshold-for-change-of-mind is met by D15 + D13 + the queued runbook. Adopting NATS adds a second stateful component to ops for alpha volumes that do not warrant it.

Redis Streams. Rejected for similar reasons, plus no native Kafka-mirror path.

Supabase Realtime on the outbox. Disqualified — delivery guarantee not published in any primary Supabase source.

Polling-only outbox (no LISTEN/NOTIFY). Rejected as primary; accepted as the D2 fallback.

CDC via Debezium + logical replication slot. Rejected. Logical slots are shared cluster resources; a stuck consumer can block WAL cleanup cluster-wide.

Separate DLQ table per producer context. Rejected. DLQ = outbox rows with status='failed'; producer observes the failed subset directly via query.

Per-producer-context outbox tables (the original TA-5 D3). Superseded by the single core.outbox table during TA-19 disposition.

Column-level UPDATE grants for inter-context writes. Rejected during TA-5 (D5 chose SECURITY DEFINER); the broader inter-context default was further superseded by ADR-063.

Channel-per-context NOTIFY (the original TA-5 D8). Superseded by outbox_default plus per-generation channels in TA-19 D5.

AgentTask via Supabase Realtime channel. Rejected (D12).

Typed generic payload (EventEnvelope[TPayload]). Deferred. Typed payload models land additively in <producer>.contracts.events.* per ADR-065 D2 (the original signals.* placement in spectral.core was superseded).

Consequences

  • Zero new infrastructure at alpha. Substrate is Supabase Postgres.
  • Each consumer context process consumes one direct-to-Postgres connection slot. At alpha (3 contexts × ~2 processes each = ~6 slots out of Supabase Pro’s 60-slot direct pool), well-margined.
  • Per-context outbox + event_handled migrations were collapsed by TA-19 D2 into a single core.outbox and single core.event_handled. Migrations landed: 20260422170200_core_outbox.sql, 20260422170300_core_event_handled.sql, plus 20260425014302_core_outbox_failure_history.sql (TA-6 D5).
  • spectral.core.events is the substrate transport surface for events between contexts in spectral.core per ADR-065 admission discipline (D1, core/events/ functional area). 13 contract tests landed at commit 5494ccf under the contract-requirement-test discipline in force at the time (now superseded by ADR-065’s structural admission); retention contract tests extended at the same commit.
  • POLICY_REGISTRY got its first real entries at the same commit (previously empty post-ADR-042).
  • TA-14 D11 resolved: conversations.trigger_event_id = outbox row ID; AgentTask persists as a separate table with dispatch via outbox.
  • Listener process is a new daemon per listening context. Lifespan-scoped; joins FastAPI/worker startup.
  • The 8 KB NOTIFY cap is structurally unreachable — only UUIDs ship.
  • Recall.ai-class incident mitigated by PG 15+ + D16 queue monitoring + D2 polling fallback.
  • TA-6 / TA-7 / TA-8 / TA-9 follow-on dispositions extended the substrate with failure-history (TA-6), typed signal payloads in <producer>.contracts.events.* (per ADR-065 D2; the original spectral.core.events.signals.<source>.* placement is superseded), and consumer invariants (TA-7 outcome write; TA-8 persistent consumer; TA-9 cluster materialization). See ADR-054 / ADR-055 / ADR-056 / ADR-057.
  • ADR-017 off-by-one noted in continue.md sweeps and corrected in Codex agent-architecture.mdx close-pass (prior pages mis-referenced ADR-018 for the signal path).

References

  • ADR-017 — at-least-once + idempotent + producer-owned DLQ semantics
  • ADR-065spectral.core admission discipline
  • ADR-031 — single-library structure
  • ADR-032core schema
  • ADR-036 — canonical fields; OTel substrate
  • ADR-041 — D9 dedicated listener connection
  • ADR-042POLICY_REGISTRY shape
  • ADR-043 — D11 closure (AgentTask via outbox)
  • ADR-048 — TA-19 single-outbox refactor + generation routing
  • ADR-054 — TA-6 failure-history extension
  • ADR-055 — TA-7 typed payloads
  • ADR-056 — TA-8 typed payloads
  • ADR-057 — TA-9 typed payloads
  • ADR-063 — D5 partial supersession; SQL grant default between contexts does not ship
  • TA-5 disposition — SPEC-308 comment 0da0b862
  • TA-5 verification — SPEC-308 comment c8ed3fec
  • TA-19 supersession bookkeeping — SPEC-308 comment b759620d
  • TA-6/7/8/9 batch bookkeeping — SPEC-308 comment 0bc9bc85
  • TA-15 D5 partial supersession — SPEC-308 comment d3dfe930
  • src/spectral/core/events/ (commit 5494ccf) — landed contract surface
  • Codex system-design/foundations/contract-surfaces/event-substrate.mdx — close-pass new page
  • docs/runbooks/event-substrate.md — operational runbook