# Connection pooling runbook
Operational procedures for psycopg pools, Supavisor utilization, and session-var leak diagnosis.
System reference: Codex `reference/connection-pooling.mdx` · ADR-041.
## Substrate
- Pooler endpoint: Supavisor transaction mode (port 6543). Three pools per process, one per role (`spectral_core_app`, `spectral_worlds_app`, `spectral_platform_app`).
- Driver: psycopg 3 async with `psycopg_pool.AsyncConnectionPool`; `prepare_threshold=None` and `autocommit=False` set in the configure callback.
- Listener (event substrate): dedicated direct-to-Postgres connection (port 5432; one per listener process).
- Tenancy: `SET LOCAL app.account_id` + `SET LOCAL app.workspace_id` (and `app.user_id` where ADR-059 D6 applies) inside an explicit transaction via `db.request_scope(...)`.
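For orientation, a minimal sketch of this substrate, with the API sizing from the table below. The conninfo host and the `request_scope` body are illustrative stand-ins (the real `db.request_scope` lives in the app packages); `set_config(..., true)` is the bind-parameter-friendly equivalent of `SET LOCAL`.

```python
from contextlib import asynccontextmanager

import psycopg
from psycopg_pool import AsyncConnectionPool


async def _configure(conn: psycopg.AsyncConnection) -> None:
    # Supavisor transaction mode multiplexes server backends, so server-side
    # prepared statements are unsafe; disable them per connection.
    conn.prepare_threshold = None
    await conn.set_autocommit(False)  # all work happens in explicit transactions


# One pool per role (API sizing shown; see the Pool sizing table).
core_pool = AsyncConnectionPool(
    conninfo="postgresql://spectral_core_app:...@<supavisor-host>:6543/postgres",
    min_size=2,
    max_size=5,
    timeout=10,
    configure=_configure,
    open=False,  # opened explicitly at app startup
)


@asynccontextmanager
async def request_scope(pool: AsyncConnectionPool, account_id: str, workspace_id: str):
    """Hypothetical stand-in for db.request_scope(...)."""
    async with pool.connection() as conn:
        async with conn.transaction():  # explicit BEGIN … COMMIT
            # set_config(..., is_local=true) == SET LOCAL: the vars die with
            # this transaction and cannot leak across pooled backends.
            await conn.execute(
                "SELECT set_config('app.account_id', %s, true)", (account_id,)
            )
            await conn.execute(
                "SELECT set_config('app.workspace_id', %s, true)", (workspace_id,)
            )
            yield conn
```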
## Pool sizing
Per-process, per role:
| Surface | min | max | checkout timeout |
|---|---|---|---|
| API (FastAPI) | 2 | 5 | 10 s |
| Workers | 2 | 10 | 30 s |
| Listener | dedicated single connection (not in any pool) | — | — |
Single-instance API + single-instance worker budget: 3 × 5 + 3 × 10 = 45 client-side sockets on the Supavisor transaction-mode endpoint.
## Scaling trigger
Total client load = (3 × 5 × API instances) + (3 × 10 × worker instances). When this approaches 80% of the 200-client Supavisor Pro Micro cap (160 sockets), scale the Supabase compute tier, not the pools. Pool sizing stays fixed.
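As a sanity check, the same arithmetic in code (constants taken from the sizing table above):

```python
POOLS_PER_PROCESS = 3            # one pool per role
API_MAX, WORKER_MAX = 5, 10      # per-pool max_size from the sizing table
SUPAVISOR_CAP = 200              # Pro Micro client-connection cap
SCALE_THRESHOLD = 0.8 * SUPAVISOR_CAP  # 160 sockets


def client_load(api_instances: int, worker_instances: int) -> int:
    """Worst-case client-side sockets opened against Supavisor."""
    return POOLS_PER_PROCESS * (API_MAX * api_instances + WORKER_MAX * worker_instances)


assert client_load(1, 1) == 45               # the single-instance budget above
assert client_load(2, 3) == 120              # still under the 160-socket trigger
assert client_load(4, 4) >= SCALE_THRESHOLD  # time to scale the compute tier
```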
## Health checks
### Inspect Supavisor utilization
Supabase dashboard → Database → Pooler shows per-pool client/server connection counts. The Spectral baseline at alpha is well under 60 sockets.
### Inspect Postgres backends
```sql
SELECT application_name, usename, state, count(*)
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY application_name, usename, state
ORDER BY count DESC;
```
`spectral_core_app` / `spectral_worlds_app` / `spectral_platform_app` should appear; the listener's `spectral_listener` should appear with `state='idle'` between LISTEN/NOTIFY events (Postgres delivers notifications only outside a transaction, so a listener pinned at `idle in transaction` is itself a bug).
### Inspect pool stats from app
`pool.get_stats()` is exposed via a future `/ops/db-stats` endpoint, gated to the ops role. At alpha, the same data is observable via the Render service logs (the configure callback logs pool size on init).
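A minimal sketch of what that future endpoint could look like, assuming the per-role pools are kept in a module-level dict; `require_ops_role` is a placeholder for the real ops-role gate:

```python
from fastapi import APIRouter, Depends

router = APIRouter()

pools: dict = {}  # assumed registry: role name -> AsyncConnectionPool


async def require_ops_role() -> None:
    # Placeholder; the real check lives in the app's auth layer.
    pass


@router.get("/ops/db-stats", dependencies=[Depends(require_ops_role)])
async def db_stats() -> dict:
    # get_stats() counters are cumulative since pool creation;
    # requests_waiting and pool_available are the fields to watch in triage.
    return {role: pool.get_stats() for role, pool in pools.items()}
```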
## Session-var leak post-mortem
Symptom: a query returns rows for the wrong tenant (RLS bypass) or returns zero rows when it should return rows (RLS over-deny).
Root-cause classes:
- Code outside `db.request_scope` acquired a raw pool connection. The post-alpha lint catches this structurally; until then, code review. Look for `await pool.connection()` outside `spectral_<app>.db.*`.
- A `SET` (without `LOCAL`) leaked across the transaction boundary. `SET LOCAL` inside `BEGIN … COMMIT` is the only sanctioned form (contrasted in the sketch after this list). Look for `cur.execute("SET app.workspace_id = ...")` without `LOCAL`.
- A worker handler ran without going through the per-job `request_scope` wrapper. Look for handler bodies that acquire connections directly.
- Pool checkout hook regression. The configure callback should set `autocommit=False`; a regression that re-enables autocommit means queries can run outside a transaction, which fails closed under RLS. But the failure mode is "empty results", which can mask the root cause.
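The `SET` vs `SET LOCAL` distinction as a sketch; the `documents` table, the literal id, and the function name are placeholders, and `set_config(..., true)` is the parameterizable form of `SET LOCAL`:

```python
import psycopg


async def tenant_query(conn: psycopg.AsyncConnection, workspace_id: str):
    # BAD (leaks): a session-scoped SET survives COMMIT. Under Supavisor
    # transaction mode the server backend is handed to another client right
    # after the transaction ends, carrying app.workspace_id with it.
    #
    #   await conn.execute("SET app.workspace_id = 'ws_123'")
    #
    # GOOD: transaction-scoped; dies with the enclosing transaction.
    async with conn.transaction():
        await conn.execute(
            "SELECT set_config('app.workspace_id', %s, true)", (workspace_id,)
        )
        cur = await conn.execute("SELECT id FROM documents")  # RLS-scoped
        return await cur.fetchall()
```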
Forensic queries:
```sql
-- See what session vars are set on a backend
SELECT current_setting('app.account_id', true),
       current_setting('app.workspace_id', true),
       current_setting('app.user_id', true);

-- See current role
SELECT current_user, session_user;
```
## Listener crash recovery
Listener processes hold a direct-to-Postgres connection. On crash:
- Render restarts the worker service; the listener re-establishes LISTEN on its channel.
- The listener startup sequence drains existing PENDING rows once via SKIP LOCKED, then issues `LISTEN <channel>`.
- The reaper re-PENDs stuck IN_FLIGHT rows within its own generation (claim TTL = 300 s).
- If the LISTEN connection cannot be established for more than 30 s, a polling fallback activates at a 5 s cadence.
Health metric: listener mode is one of `listening`, `reconnecting`, `fallback-polling`. Sentry alerts on sustained `fallback-polling`.
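A minimal sketch of that startup-and-fallback sequence; `drain_pending` and `handle` are stand-ins, and the real listener also reports its mode and handles reconnects:

```python
import asyncio

import psycopg

LISTEN_DEADLINE = 30  # s to establish the LISTEN connection before fallback
POLL_INTERVAL = 5     # s, fallback-polling cadence


async def run_listener(dsn: str, channel: str, drain_pending, handle) -> None:
    await drain_pending()  # one SKIP LOCKED sweep of PENDING rows at startup
    try:
        # Direct to Postgres (port 5432): LISTEN does not survive Supavisor's
        # transaction-mode backend multiplexing.
        conn = await asyncio.wait_for(
            psycopg.AsyncConnection.connect(dsn, autocommit=True),
            timeout=LISTEN_DEADLINE,
        )
    except (asyncio.TimeoutError, psycopg.OperationalError):
        while True:  # fallback-polling mode
            await drain_pending()
            await asyncio.sleep(POLL_INTERVAL)
    await conn.execute(f"LISTEN {channel}")  # channel comes from trusted config
    async for notification in conn.notifies():  # listening mode
        await handle(notification)
```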
## Pool exhaustion triage
Symptom: `psycopg_pool.PoolTimeout` raised; requests time out at 10 s (API) or 30 s (workers).
- Inspect pool stats: if `requests_waiting` is high, demand exceeds `max_size` (see the sketch after this list).
- Inspect Postgres backends: if the `state='active'` count is high, queries are running long; check for missing indexes, slow LLM calls inside transactions, or transactions left open.
- Inspect Supavisor: if total clients are near the cap, scale the Supabase compute tier.
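A small helper combining the first two checks, as a sketch: `requests_waiting` is a real `get_stats()` counter, and the long-query SQL is plain `pg_stat_activity`.

```python
from psycopg_pool import AsyncConnectionPool

LONG_QUERIES_SQL = """
    SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
    FROM pg_stat_activity
    WHERE state = 'active' AND datname = current_database()
    ORDER BY runtime DESC
    LIMIT 10
"""


async def triage(pool: AsyncConnectionPool) -> None:
    stats = pool.get_stats()
    if stats.get("requests_waiting", 0) > 0:
        # Checkouts are queueing: demand exceeds max_size for this pool.
        print("queued checkouts:", stats["requests_waiting"], stats)
    async with pool.connection() as conn:
        cur = await conn.execute(LONG_QUERIES_SQL)
        for pid, runtime, query in await cur.fetchall():
            # Long runtimes point at missing indexes, slow LLM calls inside
            # transactions, or transactions left open.
            print(pid, runtime, query)
```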
Never paper over exhaustion with `max_size += 5`. Sizing is sacred at alpha; the 200-client cap is not soft.
## See also
- ADR-041 — Connection pooling architecture
- Codex connection pooling
- Codex storage topology
- Codex access control
- `docs/runbooks/event-substrate.md` — listener-direct-Postgres pattern