ADR-044: Event substrate — Postgres LISTEN/NOTIFY + transactional outbox; single `core.outbox`; deployment-generation routing
Status: Accepted (2026-04-21; D5 partially superseded by ADR-063 — notification-flow portion stands; SQL grant default does not; D3 superseded by ADR-048 D2 — single core.outbox table; D8 channel taxonomy refined by ADR-048 D5 — outbox_default plus per-generation outbox_gen_<N> channels)
Context
This ADR captures TA-5 plus the substantive supersessions and refinements that came in through TA-19 D2 / D5 and TA-15 D-inter-context. The original TA-5 disposition picked LISTEN/NOTIFY + transactional outbox and per-producer-context outbox tables. TA-19 collapsed the per-context outbox tables to a single core.outbox and added a deployment-generation routing layer; TA-15 reframed the inter-context default away from SQL grants. The result is one canonical event substrate ADR; ADR-048 (TA-19) carries the deployment-topology dimension that this substrate runs in, and ADR-063 carries the inter-context access pattern that the substrate is one half of.
The substrate covers two flows:
- Signal path between contexts (per ADR-017: at-least-once + idempotent + producer-owned DLQ + best-effort ordering)
- Within-context worker dispatch (API → worker,
AgentTasksurface rolled forward from ADR-043 / TA-14 D11)
Research convergence: landscape survey (six substrates — LISTEN/NOTIFY, polling-only outbox, Supabase Realtime, Redis Streams, NATS JetStream, Kafka/Redpanda) plus an adversarial steelman that argued for NATS JetStream from day 1. Both confirmed LISTEN/NOTIFY + transactional outbox as the right alpha pick. Supabase Realtime was disqualified because its delivery guarantee is not published in primary sources. ADR-041 D9 pre-committed one dedicated direct-to-Postgres connection per listener process, bypassing Supavisor transaction mode (LISTEN is incompatible with transaction-mode pooling).
Decision
D1 — Primary substrate: Postgres LISTEN/NOTIFY + transactional outbox
Producer writes the outbox row inside the business transaction; an AFTER INSERT trigger fires NOTIFY <channel>, '<row_id>' at commit; a dedicated listener pulls via SELECT ... FOR UPDATE SKIP LOCKED. Substrate piggybacks on Supabase Postgres; zero new infrastructure at alpha.
D2 — Automatic polling fallback
If the dedicated LISTEN connection cannot establish or drops for longer than the reconnect-backoff cap (30s), the consumer falls back to polling at 5s cadence. Polling is not a second substrate; it is a degradation mode of D1 that keeps correctness under listener failure. NOTIFY is a latency optimization, not a correctness primitive.
D3 — Single core.outbox table
Superseded the original per-producer-context outbox tables in TA-19 D2. One table in core with the row schema described in D4; producers tag their rows with the source column. Schema-layer DRY plus envelope-contract consistency plus simpler permission model outweighed the per-context aesthetic argument at alpha scale.
D4 — Outbox row schema
Columns:
id uuid pkevent_type textevent_version intoccurred_at timestamptzsource text(renamed fromsource_bcper TA-19 D2)target text null(renamed fromtarget_bc; null = broadcast)content_class text(future-proofs per-class retention divergence; today uniform)channel text default 'outbox_default'generation bigint(TA-19 D5; stamped by publisher at write time)workspace_id uuid nullpayload jsonbidempotency_key texttrace_context text nullstatus text check in ('pending','in_flight','delivered','failed')attempts int default 0last_error text nulldelivered_at timestamptz nulldeleted_at timestamptz null(per ADR-042 D3)
Plus the failure-history columns from TA-6 D5: failure_history jsonb default '[]', first_failed_at timestamptz null. DLQ = status='failed' subset (ADR-017 producer-owned recovery; no separate table).
D5 — Inter-context access
Inter-context read pattern superseded by ADR-063. ADR-063 establishes that no SQL grants between contexts ship at any layer; call flow between contexts uses callee-owned OHS Protocols + DI through the framework-layer composition seam (per ADR-065 D3 + D5). The notification-flow portion (events flow through this substrate) stands. The SECURITY DEFINER status-transition function for outbox row updates is preserved (substrate detail, not an inter-context application read).
D6 — EventEnvelope lives in spectral.core.events
Frozen pydantic model. Fields: event_id: UUID, event_type: str, event_version: int, occurred_at: datetime, source: str, target: str | None, workspace_id: UUID | None, payload: dict[str, Any], idempotency_key: str, trace_context: str | None. The payload field is dict[str, Any] at the substrate layer; typed payload location superseded by ADR-065 D2 — producer-owned typed payload models live in <producer>.contracts.events.*, not spectral.core.events.signals.*. The generation value lives on the outbox row only — the envelope stays substrate-agnostic.
D7 — Publisher / Listener protocols in spectral.core.events.protocols
EventPublisher (async def publish(envelope) -> None), EventListener (async def listen(*, channel, handler_name, handler) -> None), EventHandler = Callable[[EventEnvelope], Awaitable[None]]. Per-context implementations live in spectral.<context>.infrastructure.events.
D8 — NOTIFY channel taxonomy (refined by TA-19 D5)
Originally one channel per producer context (<producer_bc>_outbox_new). Refined: column-value-driven routing with outbox_default as the baseline channel plus per-generation channels outbox_gen_<N> (TA-19 D5). The core.outbox_notify() trigger fires pg_notify(NEW.channel, NEW.id::text) on INSERT; the trigger function stays frozen. Workers LISTEN only on their own generation’s channel — V2 worker is structurally incapable of receiving V1 NOTIFY. The split axis is processing-characteristic (hot/cold, latency class) plus deployment-generation, not context.
NOTIFY payload is the outbox row ID only (UUID string, ~36 bytes — trivially within the 8 KB Postgres cap). Full payload lives in the outbox row, pulled by the handler.
D9 — Listener lifecycle
Dedicated direct-to-Postgres connection (port 5432, autocommit mode) per listening context process (per ADR-041 D9). On startup: drain existing PENDING rows once via SKIP LOCKED, then LISTEN <channel>. On each NOTIFY: drain batch N=10 via SKIP LOCKED. Reconnect with exponential backoff (1s → 30s cap); polling fallback (D2) activates when backoff exceeds the cap. Listener health is a monitored metric.
D10 — At-least-once + idempotent semantics
Inherited from ADR-017. Idempotency key = event_id by default. Consumers persist handled event IDs in a single core.event_handled table with UNIQUE(handler_name, idempotency_key) (refined from per-context event_handled by TA-19 D2). handler_name must be scope-qualified (e.g., worlds.scan_completed_indexer) to guarantee global uniqueness; protocol docstring enforces. Duplicate redelivery is caught by the ON CONFLICT path. The event_handled retention policy (D13) must strictly exceed the outbox active+grace floor to guarantee dedup correctness across the full outbox window.
D11 — Schema versioning
event_version integer on envelope; bumped when payload shape changes. Consumers ignore unknown fields (forward-compatible). Producers never remove fields within a major version (additive-only). Breaking payload change = new event_type with suffix (e.g., scan.completed.v2) plus deprecation deadline on the old type. Contract test pins the envelope shape.
D12 — AgentTask dispatch (closes ADR-043 / TA-14 D11)
platform.agent_tasks remains a separate workspace-scoped table carrying business state (status, HITL approval linkage, conversation_id, result back-reference, retention). Dispatch flows through core.outbox with event_type='agent.task.dispatched' and payload={agent_task_id}. The worker listens on the appropriate channel, reads the payload, pulls the agent_tasks row, executes. State transitions on agent_tasks that require worker action emit their own outbox rows. Separates transactional business state from transport. Trigger event ID for proactive conversations (conversations.trigger_event_id) = outbox row ID when applicable.
D13 — Retention integration
Two POLICY_REGISTRY entries (per ADR-042 D4):
(outbox_row, *) → RetentionPolicy(active_ttl_days=45, tombstoned_grace_days=7, disposal=HARD_DELETE)— 45-day window gives real forensic runway without a separate audit archive. Content-class-agnostic.(event_handled, *) → RetentionPolicy(active_ttl_days=60, tombstoned_grace_days=7, disposal=HARD_DELETE)— must strictly exceed the outbox active+grace floor (52 days) to guarantee dedup correctness. 8-day buffer above floor; reducible only with a matching outbox reduction.
A contract test pins the inequality: event_handled.active_ttl > outbox.active_ttl + outbox.grace.
D14 — Observability
Per-publish structlog entry with the seven ADR-036 D5 canonical fields plus {event_id, event_type, source, target, workspace_id}. Per-handle structlog entry adds {attempts, duration_ms, status_result, handler_name}. OTel span linkage: producer emits a span; envelope carries W3C traceparent in trace_context; consumer continues the span. pg_notification_queue_usage() is exported as a gauge via the OTel Collector; alert at 25% fill.
D15 — Forward triggers (two-step migration ladder)
- Step 1 (in-Postgres, cheap). Outbox > 5M rows steady-state → partition by time-range on
occurred_at(native Postgres range partitioning, monthly). Operationally routine; no event contract change. - Step 2 (substrate swap, substantive). Any of: outbox > 20M rows sustained after partitioning + retention sweep; consumer ack latency p99 > 500 ms sustained over a 1-week window; first production incident attributable to LISTEN/NOTIFY queue-fill, lock contention, or outbox bloat → swap relay to NATS JetStream or Redis Streams. Event schemas (owned by
spectral.core.events) unchanged; consumer API swaps fromLISTENto broker subscription. ~2–3 engineer-weeks when triggered.
D16 — Queue-depth observability
pg_notification_queue_usage() polled on 30s cadence by a dedicated metrics emitter, emitted as an OTel gauge. Alert at 25% fill. Addresses the Recall.ai-style failure mode (queue fills under sustained load or stuck listener) before user-visible impact. PG 15+ already fixes the pre-PG13 AccessExclusiveLock serialization that caused the Recall.ai postmortem; the queue-fill failure mode persists and monitoring catches it.
Alternatives considered
NATS JetStream from day 1. Rejected. Adversarial’s own threshold-for-change-of-mind is met by D15 + D13 + the queued runbook. Adopting NATS adds a second stateful component to ops for alpha volumes that do not warrant it.
Redis Streams. Rejected for similar reasons, plus no native Kafka-mirror path.
Supabase Realtime on the outbox. Disqualified — delivery guarantee not published in any primary Supabase source.
Polling-only outbox (no LISTEN/NOTIFY). Rejected as primary; accepted as the D2 fallback.
CDC via Debezium + logical replication slot. Rejected. Logical slots are shared cluster resources; a stuck consumer can block WAL cleanup cluster-wide.
Separate DLQ table per producer context. Rejected. DLQ = outbox rows with status='failed'; producer observes the failed subset directly via query.
Per-producer-context outbox tables (the original TA-5 D3). Superseded by the single core.outbox table during TA-19 disposition.
Column-level UPDATE grants for inter-context writes. Rejected during TA-5 (D5 chose SECURITY DEFINER); the broader inter-context default was further superseded by ADR-063.
Channel-per-context NOTIFY (the original TA-5 D8). Superseded by outbox_default plus per-generation channels in TA-19 D5.
AgentTask via Supabase Realtime channel. Rejected (D12).
Typed generic payload (EventEnvelope[TPayload]). Deferred. Typed payload models land additively in <producer>.contracts.events.* per ADR-065 D2 (the original signals.* placement in spectral.core was superseded).
Consequences
- Zero new infrastructure at alpha. Substrate is Supabase Postgres.
- Each consumer context process consumes one direct-to-Postgres connection slot. At alpha (3 contexts × ~2 processes each = ~6 slots out of Supabase Pro’s 60-slot direct pool), well-margined.
- Per-context outbox + event_handled migrations were collapsed by TA-19 D2 into a single
core.outboxand singlecore.event_handled. Migrations landed:20260422170200_core_outbox.sql,20260422170300_core_event_handled.sql, plus20260425014302_core_outbox_failure_history.sql(TA-6 D5). spectral.core.eventsis the substrate transport surface for events between contexts inspectral.coreper ADR-065 admission discipline (D1,core/events/functional area). 13 contract tests landed at commit5494ccfunder the contract-requirement-test discipline in force at the time (now superseded by ADR-065’s structural admission); retention contract tests extended at the same commit.POLICY_REGISTRYgot its first real entries at the same commit (previously empty post-ADR-042).- TA-14 D11 resolved:
conversations.trigger_event_id= outbox row ID;AgentTaskpersists as a separate table with dispatch via outbox. - Listener process is a new daemon per listening context. Lifespan-scoped; joins FastAPI/worker startup.
- The 8 KB NOTIFY cap is structurally unreachable — only UUIDs ship.
- Recall.ai-class incident mitigated by PG 15+ + D16 queue monitoring + D2 polling fallback.
- TA-6 / TA-7 / TA-8 / TA-9 follow-on dispositions extended the substrate with failure-history (TA-6), typed signal payloads in
<producer>.contracts.events.*(per ADR-065 D2; the originalspectral.core.events.signals.<source>.*placement is superseded), and consumer invariants (TA-7 outcome write; TA-8 persistent consumer; TA-9 cluster materialization). See ADR-054 / ADR-055 / ADR-056 / ADR-057. - ADR-017 off-by-one noted in continue.md sweeps and corrected in Codex
agent-architecture.mdxclose-pass (prior pages mis-referenced ADR-018 for the signal path).
References
- ADR-017 — at-least-once + idempotent + producer-owned DLQ semantics
- ADR-065 —
spectral.coreadmission discipline - ADR-031 — single-library structure
- ADR-032 —
coreschema - ADR-036 — canonical fields; OTel substrate
- ADR-041 — D9 dedicated listener connection
- ADR-042 —
POLICY_REGISTRYshape - ADR-043 — D11 closure (AgentTask via outbox)
- ADR-048 — TA-19 single-outbox refactor + generation routing
- ADR-054 — TA-6 failure-history extension
- ADR-055 — TA-7 typed payloads
- ADR-056 — TA-8 typed payloads
- ADR-057 — TA-9 typed payloads
- ADR-063 — D5 partial supersession; SQL grant default between contexts does not ship
- TA-5 disposition — SPEC-308 comment
0da0b862 - TA-5 verification — SPEC-308 comment
c8ed3fec - TA-19 supersession bookkeeping — SPEC-308 comment
b759620d - TA-6/7/8/9 batch bookkeeping — SPEC-308 comment
0bc9bc85 - TA-15 D5 partial supersession — SPEC-308 comment
d3dfe930 src/spectral/core/events/(commit5494ccf) — landed contract surface- Codex
system-design/foundations/contract-surfaces/event-substrate.mdx— close-pass new page docs/runbooks/event-substrate.md— operational runbook