Connection pooling runbook

Operational procedures for psycopg pools, Supavisor utilization, and session-var leak diagnosis.

System reference: Codex reference/connection-pooling.mdx · ADR-041.


Substrate

  • Pooler endpoint: Supavisor transaction mode (port 6543). Three pools per process — one per role (spectral_core_app, spectral_worlds_app, spectral_platform_app).
  • Driver: psycopg 3 async with psycopg_pool.AsyncConnectionPool; prepare_threshold=None; autocommit=False set in the configure callback (a sketch follows this list).
  • Listener (event substrate): dedicated direct-to-Postgres connection (port 5432; one per listener process).
  • Tenancy: SET LOCAL app.account_id + SET LOCAL app.workspace_id (and app.user_id where ADR-059 D6 applies) inside an explicit transaction via db.request_scope(...).
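
A minimal sketch of the per-role pool setup, assuming a hypothetical dsn_for_role helper and module layout — the role names and knob values come from this runbook; everything else is illustrative:

```python
import psycopg
from psycopg_pool import AsyncConnectionPool

ROLES = ("spectral_core_app", "spectral_worlds_app", "spectral_platform_app")

async def _configure(conn: psycopg.AsyncConnection) -> None:
    # Pin autocommit off so every query runs inside an explicit transaction,
    # which the SET LOCAL tenancy vars require.
    await conn.set_autocommit(False)

def build_pools(dsn_for_role) -> dict[str, AsyncConnectionPool]:
    # dsn_for_role is an assumed helper mapping a role to its port-6543
    # (Supavisor transaction mode) DSN.
    return {
        role: AsyncConnectionPool(
            conninfo=dsn_for_role(role),
            kwargs={"prepare_threshold": None},  # no server-side prepares under transaction mode
            min_size=2,
            max_size=5,        # API surface sizing; workers use max_size=10
            timeout=10,        # seconds to wait for a free connection; workers use 30
            configure=_configure,
            open=False,        # open explicitly at app startup
        )
        for role in ROLES
    }
```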

Pool sizing

Per-process, per role:

| Surface | min | max | timeout |
| --- | --- | --- | --- |
| API (FastAPI) | 2 | 5 | 10 s |
| Workers | 2 | 10 | 30 s |
| Listener | dedicated single connection (not in any pool) | | |

Single-instance API + single-instance worker budget: 3 × 5 + 3 × 10 = 45 client-side sockets on the Supavisor transaction-mode endpoint.

Scaling trigger

Total client load = (3 × 5 × API instances) + (3 × 10 × worker instances). When this approaches 80% of the 200-client Supavisor Pro Micro cap (160 sockets), scale the Supabase compute tier, not the pool; pool sizing stays fixed. A worked check follows.
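
A worked version of the trigger check — constants are from this runbook; the function name is illustrative:

```python
ROLES_PER_PROCESS = 3      # one pool per role
API_MAX, WORKER_MAX = 5, 10
SUPAVISOR_CAP = 200        # Pro Micro client cap

def should_scale_compute(api_instances: int, worker_instances: int) -> bool:
    total = (ROLES_PER_PROCESS * API_MAX * api_instances
             + ROLES_PER_PROCESS * WORKER_MAX * worker_instances)
    return total >= 0.8 * SUPAVISOR_CAP  # 160-socket trigger

# Example: 4 API + 3 worker instances -> 60 + 90 = 150 sockets; no action yet.
```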


Health checks

Inspect Supavisor utilization

Supabase dashboard → Database → Pooler shows per-pool client/server connection counts. Spectral baseline at alpha is well under 60 sockets.

Inspect Postgres backends

```sql
SELECT
  application_name,
  usename,
  state,
  count(*)
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY application_name, usename, state
ORDER BY count(*) DESC;
```

spectral_core_app / spectral_worlds_app / spectral_platform_app should appear; the listener’s spectral_listener should appear with state='idle' while it waits between LISTEN/NOTIFY events (a LISTEN connection must stay out of a transaction to receive notifications promptly).

Inspect pool stats from app

pool.get_stats() will be exposed via a future ops-role-gated /ops/db-stats endpoint (sketched below). At alpha, the same data is observable via the Render service logs (the configure callback logs pool size on init).
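
A sketch of what that endpoint could look like, assuming a FastAPI router, a require_ops_role dependency, and the pools mapping from the substrate sketch — none of these names are final:

```python
from fastapi import APIRouter, Depends

# `pools` and `require_ops_role` are assumed to come from the app's db and
# auth modules; the module paths below are hypothetical.
from app.db import pools
from app.auth import require_ops_role

router = APIRouter()

@router.get("/ops/db-stats")
async def db_stats(_=Depends(require_ops_role)):
    # AsyncConnectionPool.get_stats() returns counters such as pool_size,
    # pool_available, requests_waiting, and requests_wait_ms per pool.
    return {role: pool.get_stats() for role, pool in pools.items()}
```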


Session-var leak post-mortem

Symptom: a query returns rows for the wrong tenant (RLS bypass) or returns zero rows when it should return rows (RLS over-deny).

Root-cause classes:

  1. Code outside db.request_scope acquired a raw pool connection. The post-alpha lint will catch this structurally; until then, rely on code review. Look for await pool.connection() outside spectral_<app>.db.*.
  2. A SET (without LOCAL) leaked across the transaction boundary. SET LOCAL inside BEGIN…COMMIT is the only sanctioned form (sketched after this list). Look for cur.execute("SET app.workspace_id = ...") without LOCAL.
  3. A worker handler ran without going through the per-job request_scope wrapper. Look for handler bodies that acquire connections directly.
  4. Pool checkout hook regression. The configure callback should set autocommit=False; a regression that re-enables autocommit means queries can run outside a transaction, which fails closed under RLS — but the failure mode is “empty results,” which can mask the root cause.
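
A minimal sketch of the sanctioned pattern, assuming db.request_scope is an async context manager over one pool; set_config(..., true) is the parameterized equivalent of SET LOCAL:

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def request_scope(pool, account_id: str, workspace_id: str):
    async with pool.connection() as conn:       # pool checkout
        async with conn.transaction():          # explicit BEGIN ... COMMIT
            # is_local=true scopes the vars to this transaction, so they
            # cannot leak into the next checkout of the pooled connection.
            await conn.execute(
                "SELECT set_config('app.account_id', %s, true)", (account_id,)
            )
            await conn.execute(
                "SELECT set_config('app.workspace_id', %s, true)", (workspace_id,)
            )
            yield conn                          # all queries run tenant-scoped
```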

Forensic queries:

```sql
-- See which session vars are set on a backend
SELECT
  current_setting('app.account_id', true),
  current_setting('app.workspace_id', true),
  current_setting('app.user_id', true);

-- See the current role
SELECT current_user, session_user;
```

Listener crash recovery

Listener processes hold a direct-to-Postgres connection. On crash, recovery proceeds as follows (a startup sketch follows the list):

  1. Render restarts the worker service; the listener re-establishes LISTEN on its channel.
  2. The listener startup sequence drains existing PENDING rows once via SKIP LOCKED, then LISTEN <channel>.
  3. The reaper re-PENDs stuck IN_FLIGHT rows within its own generation (claim TTL = 300 s).
  4. If the LISTEN connection cannot establish for > 30 s, polling fallback activates at 5 s cadence.
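
A sketch of the startup sequence (step 2 above), assuming drain_pending and handle_event are the listener's own hooks — the names are illustrative:

```python
import psycopg

async def run_listener(dsn: str, channel: str) -> None:
    # Direct-to-Postgres (port 5432). Autocommit is required here:
    # notifications are only delivered to a connection that is not
    # inside a transaction.
    conn = await psycopg.AsyncConnection.connect(dsn, autocommit=True)
    await drain_pending(conn)                 # one SKIP LOCKED sweep of PENDING rows
    await conn.execute(f"LISTEN {channel}")   # channel name is trusted config, not user input
    async for notify in conn.notifies():      # block until the next NOTIFY
        await handle_event(notify.payload)
```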

Health metric: listener mode is one of listening, reconnecting, fallback-polling. Sentry alert on sustained fallback-polling.


Pool exhaustion triage

Symptom: psycopg_pool.PoolTimeout raised; requests time out at 10 s (API) or 30 s (workers).

  1. Inspect pool stats — if requests_waiting is high, demand exceeds max_size (a logging helper is sketched after this list).
  2. Inspect Postgres backends — if state='active' count is high, queries are running long; check for missing indexes, slow LLM calls inside transactions, or transactions left open.
  3. Inspect Supavisor — if total clients are near cap, scale Supabase compute tier.
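
A quick helper for step 1, assuming the pools mapping from the substrate sketch:

```python
def log_pool_stats(pools) -> None:
    # psycopg_pool counters; requests_waiting > 0 means callers are queued.
    for role, pool in pools.items():
        s = pool.get_stats()
        print(f"{role}: size={s['pool_size']} available={s['pool_available']} "
              f"waiting={s['requests_waiting']}")
```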

Never paper over with max_size += 5. Sizing is sacred at alpha — the 200-client cap is not soft.


See also