Connection pooling runbook

Operational procedures for psycopg pools, Supavisor utilization, and session-var leak diagnosis.

System reference: Codex reference/connection-pooling.mdx · ADR-041.


Substrate

  • Pooler endpoint: Supavisor transaction mode (port 6543). Three pools per process — one per role (spectral_core_app, spectral_worlds_app, spectral_platform_app).
  • Driver: psycopg 3 async with psycopg_pool.AsyncConnectionPool; prepare_threshold=None; autocommit=False set in the configure callback (a sketch follows this list).
  • Listener (event substrate): dedicated direct-to-Postgres connection (port 5432; one per listener process).
  • Tenancy: SET LOCAL app.account_id + SET LOCAL app.workspace_id (and app.user_id where ADR-059 D6 applies) inside an explicit transaction via db.request_scope(...).
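
A minimal sketch of the per-role pool setup, assuming a hypothetical dsn_for_role helper and module layout — the role names and knob values come from this runbook; everything else is illustrative:

```python
import psycopg
from psycopg_pool import AsyncConnectionPool

ROLES = ("spectral_core_app", "spectral_worlds_app", "spectral_platform_app")

async def _configure(conn: psycopg.AsyncConnection) -> None:
    # Pin autocommit off so every query runs inside an explicit transaction,
    # which the SET LOCAL tenancy vars require.
    await conn.set_autocommit(False)

def build_pools(dsn_for_role) -> dict[str, AsyncConnectionPool]:
    # dsn_for_role is an assumed helper mapping a role to its port-6543
    # (Supavisor transaction mode) DSN.
    return {
        role: AsyncConnectionPool(
            conninfo=dsn_for_role(role),
            kwargs={"prepare_threshold": None},  # no server-side prepares under transaction mode
            min_size=2,
            max_size=5,        # API surface sizing; workers use max_size=10
            timeout=10,        # seconds to wait for a free connection; workers use 30
            configure=_configure,
            open=False,        # open explicitly at app startup
        )
        for role in ROLES
    }
```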

Pool sizing

Per-process, per role:

| Surface | min | max | timeout |
| --- | --- | --- | --- |
| API (FastAPI) | 2 | 5 | 10 s |
| Workers | 2 | 10 | 30 s |
| Listener | dedicated single connection (not in any pool) | | |

Single-instance API + single-instance worker budget: 3 × 5 + 3 × 10 = 45 client-side sockets on the Supavisor transaction-mode endpoint.

Scaling trigger

Total client load = (3 × 5 × API instances) + (3 × 10 × worker instances). When this approaches 80% of the 200-client Supavisor Pro Micro cap (160 sockets), scale the Supabase compute tier, not the pool; pool sizing stays fixed. A worked check follows.
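
A worked version of the trigger check — constants are from this runbook; the function name is illustrative:

```python
ROLES_PER_PROCESS = 3      # one pool per role
API_MAX, WORKER_MAX = 5, 10
SUPAVISOR_CAP = 200        # Pro Micro client cap

def should_scale_compute(api_instances: int, worker_instances: int) -> bool:
    total = (ROLES_PER_PROCESS * API_MAX * api_instances
             + ROLES_PER_PROCESS * WORKER_MAX * worker_instances)
    return total >= 0.8 * SUPAVISOR_CAP  # 160-socket trigger

# Example: 4 API + 3 worker instances -> 60 + 90 = 150 sockets; no action yet.
```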


Health checks

Inspect Supavisor utilization

Supabase dashboard → Database → Pooler shows per-pool client/server connection counts. Spectral baseline at alpha is well under 60 sockets.

Inspect Postgres backends

```sql
SELECT
  application_name,
  usename,
  state,
  count(*)
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY application_name, usename, state
ORDER BY count(*) DESC;
```

spectral_core_app / spectral_worlds_app / spectral_platform_app should appear; the listener’s spectral_listener should appear with state='idle' while it waits between LISTEN/NOTIFY events (a LISTEN connection must stay out of a transaction to receive notifications promptly).

Inspect pool stats from app

pool.get_stats() will be exposed via a future ops-role-gated /ops/db-stats endpoint (sketched below). At alpha, the same data is observable via the Render service logs (the configure callback logs pool size on init).
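
A sketch of what that endpoint could look like, assuming a FastAPI router, a require_ops_role dependency, and the pools mapping from the substrate sketch — none of these names are final:

```python
from fastapi import APIRouter, Depends

# `pools` and `require_ops_role` are assumed to come from the app's db and
# auth modules; the module paths below are hypothetical.
from app.db import pools
from app.auth import require_ops_role

router = APIRouter()

@router.get("/ops/db-stats")
async def db_stats(_=Depends(require_ops_role)):
    # AsyncConnectionPool.get_stats() returns counters such as pool_size,
    # pool_available, requests_waiting, and requests_wait_ms per pool.
    return {role: pool.get_stats() for role, pool in pools.items()}
```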


Session-var leak post-mortem

Symptom: a query returns rows for the wrong tenant (RLS bypass) or returns zero rows when it should return rows (RLS over-deny).

Root-cause classes:

  1. Code outside db.request_scope acquired a raw pool connection. The post-alpha lint will catch this structurally; until then, rely on code review. Look for await pool.connection() outside spectral_<app>.db.*.
  2. A SET (without LOCAL) leaked across the transaction boundary. SET LOCAL inside BEGIN…COMMIT is the only sanctioned form (sketched after this list). Look for cur.execute("SET app.workspace_id = ...") without LOCAL.
  3. A worker handler ran without going through the per-job request_scope wrapper. Look for handler bodies that acquire connections directly.
  4. Pool checkout hook regression. The configure callback should set autocommit=False; a regression that re-enables autocommit means queries can run outside a transaction, which fails closed under RLS — but the failure mode is “empty results,” which can mask the root cause.
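
A minimal sketch of the sanctioned pattern, assuming db.request_scope is an async context manager over one pool; set_config(..., true) is the parameterized equivalent of SET LOCAL:

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def request_scope(pool, account_id: str, workspace_id: str):
    async with pool.connection() as conn:       # pool checkout
        async with conn.transaction():          # explicit BEGIN ... COMMIT
            # is_local=true scopes the vars to this transaction, so they
            # cannot leak into the next checkout of the pooled connection.
            await conn.execute(
                "SELECT set_config('app.account_id', %s, true)", (account_id,)
            )
            await conn.execute(
                "SELECT set_config('app.workspace_id', %s, true)", (workspace_id,)
            )
            yield conn                          # all queries run tenant-scoped
```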

Forensic queries:

```sql
-- See which session vars are set on a backend
SELECT
  current_setting('app.account_id', true),
  current_setting('app.workspace_id', true),
  current_setting('app.user_id', true);

-- See the current role
SELECT current_user, session_user;
```

Listener crash recovery

Listener processes hold a direct-to-Postgres connection. On crash, recovery proceeds as follows (a startup sketch follows the list):

  1. Render restarts the worker service; the listener re-establishes LISTEN on its channel.
  2. The listener startup sequence drains existing PENDING rows once via SKIP LOCKED, then LISTEN <channel>.
  3. The reaper re-PENDs stuck IN_FLIGHT rows within its own generation (claim TTL = 300 s).
  4. If the LISTEN connection cannot establish for > 30 s, polling fallback activates at 5 s cadence.
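
A sketch of the startup sequence (step 2 above), assuming drain_pending and handle_event are the listener's own hooks — the names are illustrative:

```python
import psycopg

async def run_listener(dsn: str, channel: str) -> None:
    # Direct-to-Postgres (port 5432). Autocommit is required here:
    # notifications are only delivered to a connection that is not
    # inside a transaction.
    conn = await psycopg.AsyncConnection.connect(dsn, autocommit=True)
    await drain_pending(conn)                 # one SKIP LOCKED sweep of PENDING rows
    await conn.execute(f"LISTEN {channel}")   # channel name is trusted config, not user input
    async for notify in conn.notifies():      # block until the next NOTIFY
        await handle_event(notify.payload)
```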

Health metric: listener mode is one of listening, reconnecting, fallback-polling. Sentry alert on sustained fallback-polling.


Pool exhaustion triage

Symptom: psycopg_pool.PoolTimeout raised; requests time out at 10 s (API) or 30 s (workers).

  1. Inspect pool stats — if requests_waiting is high, demand exceeds max_size (a logging helper is sketched after this list).
  2. Inspect Postgres backends — if state='active' count is high, queries are running long; check for missing indexes, slow LLM calls inside transactions, or transactions left open.
  3. Inspect Supavisor — if total clients are near cap, scale Supabase compute tier.
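
A quick helper for step 1, assuming the pools mapping from the substrate sketch:

```python
def log_pool_stats(pools) -> None:
    # psycopg_pool counters; requests_waiting > 0 means callers are queued.
    for role, pool in pools.items():
        s = pool.get_stats()
        print(f"{role}: size={s['pool_size']} available={s['pool_available']} "
              f"waiting={s['requests_waiting']}")
```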

Never paper over with max_size += 5. Sizing is sacred at alpha — the 200-client cap is not soft.


See also