Testing runbook
Operational procedures for the local + CI test substrate, per-test isolation, the D13 first-integration validation pass, the D14 trigger ladder, and the backup-nightly bats harness.
System reference: Codex how-to/testing.mdx · ADR-045 · ADR-061.
Local dev DB
supabase start (Supabase CLI) brings up the full Supabase stack on Docker (Postgres + Auth + Storage + Realtime). Migrations apply via supabase db reset.
    supabase start      # boot the stack
    supabase db reset   # apply all migrations from supabase/migrations/
    supabase status     # show connection strings; ports used

Local-CI divergence is bounded because CI tests do not exercise Auth / Storage / Realtime services.
CI DB
testcontainers-python plus supabase/postgres:15 (Postgres-only). Session-scoped:
- Boot testcontainer.
- Apply migrations.
- AsyncPostgresSaver.setup() for langgraph.*.
- Install SECURITY DEFINER functions.
- Apply inter-context grants.
Per-test isolation: transaction rollback on the async psycopg3 connection. @pytest.mark.no_rollback opt-out for DDL-only tests.
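The rollback-per-test pattern can be sketched as a small async context manager. This is a minimal illustration, not the project's db fixture: the name rollback_after and the keep_changes flag (standing in for the no_rollback opt-out) are hypothetical. It assumes a psycopg3 AsyncConnection with autocommit off, where the first statement opens a transaction implicitly.

```python
import contextlib


@contextlib.asynccontextmanager
async def rollback_after(conn, *, keep_changes=False):
    """Yield conn for one test, then discard whatever the test wrote.

    conn is expected to behave like psycopg.AsyncConnection with
    autocommit off, so the first statement opens a transaction.
    keep_changes=True is a hypothetical stand-in for the
    @pytest.mark.no_rollback opt-out: the transaction is committed.
    """
    try:
        yield conn
    finally:
        if keep_changes:
            await conn.commit()
        else:
            # Runs on success and on failure, so no writes outlive the test.
            await conn.rollback()
```

A function-scoped fixture could wrap the test body in this manager and choose keep_changes from request.node.get_closest_marker("no_rollback"); that wiring is likewise an assumption.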
Marker enforcement
Root tests/conftest.py fails collection on any test item missing one of the primary markers (unit, contract, integration, e2e). A fifth primary marker live_drift is used only by the nightly LLM live-drift workflow per ADR-061.
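One way the collection-time check could look, as a hedged sketch rather than the actual contents of tests/conftest.py. The helper name primary_markers_of is hypothetical, and treating live_drift as satisfying the check is an assumption; the hook name, item.iter_markers(), and pytest.UsageError are standard pytest.

```python
# Assumption: live_drift counts as a primary marker for this check.
PRIMARY_MARKERS = frozenset({"unit", "contract", "integration", "e2e", "live_drift"})


def primary_markers_of(marker_names):
    """Return the primary markers present among a test item's marker names."""
    return PRIMARY_MARKERS.intersection(marker_names)


def pytest_collection_modifyitems(config, items):
    """Abort the run if any collected item carries no primary marker."""
    import pytest  # imported lazily so the helper above stays stdlib-only

    unmarked = [
        item.nodeid
        for item in items
        if not primary_markers_of(m.name for m in item.iter_markers())
    ]
    if unmarked:
        raise pytest.UsageError(
            "tests missing a primary marker: " + ", ".join(unmarked)
        )
```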
    @pytest.mark.unit
    async def test_thing(): ...

Run a tier in isolation:
    uv run pytest -m "unit"
    uv run pytest -m "unit or contract"
    uv run pytest -m integration

Role and auth fixtures
| Fixture | Scope | Purpose |
|---|---|---|
| postgres_test | session | testcontainer + migrations + langgraph + SECURITY DEFINER fns |
| db | function | async psycopg3 conn with per-test auto-rollback txn |
| as_workspace_member(account_id, workspace_id) | context mgr | SET LOCAL app.account_id / app.workspace_id |
| as_context_role(context) | context mgr | SET LOCAL ROLE spectral_{context}_app (txn-scoped) |
| jwt_for(user_id, workspace_id, scopes) | function | PyJWT-signed test JWT with controlled claims |
| llm_replay_client(fixture_path) | function | recorded-response LLM client |
SET LOCAL ROLE is chosen over SET ROLE so the role switch is txn-scoped and rolls back with the test transaction.
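A hedged sketch of the two transaction-scoped context managers. The explicit conn parameter and the name-validation regex are assumptions added for illustration; the real fixtures may obtain the connection differently. Because SET LOCAL does not accept bind parameters, the sketch uses PostgreSQL's set_config(name, value, true), which has the same transaction-local effect.

```python
import contextlib
import re

# Hypothetical guard: context names are interpolated into SQL, so restrict them.
_CONTEXT_NAME = re.compile(r"[a-z][a-z0-9_]*")


@contextlib.asynccontextmanager
async def as_workspace_member(conn, account_id, workspace_id):
    """Set app.account_id / app.workspace_id for the current transaction."""
    await conn.execute(
        "SELECT set_config('app.account_id', %s, true), "
        "set_config('app.workspace_id', %s, true)",
        (str(account_id), str(workspace_id)),
    )
    yield conn


@contextlib.asynccontextmanager
async def as_context_role(conn, context):
    """Switch to spectral_{context}_app until the transaction ends."""
    if not _CONTEXT_NAME.fullmatch(context):
        raise ValueError(f"unsafe context name: {context!r}")
    await conn.execute(f"SET LOCAL ROLE spectral_{context}_app")
    yield conn
```

Neither manager undoes its setting on exit in this sketch; both rely on the per-test ROLLBACK to clear the session variables and the role.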
D13 first-integration validation pass
Before the first real integration test merges, verify (gate items):
- supabase/postgres:15 ships with pgvector enabled (CREATE EXTENSION vector succeeds).
- auth.users exists in the image (needed for the core.users mirror FK target). If absent, migrations synthesize a minimal auth.users in a test-only migration.
- Asymmetric JWT signing supported (per ADR-039 D4a).
- SET LOCAL inside nested transactions (savepoints) preserves the session-var across savepoint boundaries and resets on ROLLBACK, with no leakage to subsequent tests.
- SET LOCAL ROLE inside per-test txn resets cleanly on ROLLBACK, with no residual role on the connection returned to the pool.
- PKCE cookie split-reassembly through Cloudflare proxy + Pages Function (per ADR-052 carry-forward).
- JWT header-size end-to-end through Cloudflare upstream buffer (per ADR-052 carry-forward).
- Branch lifecycle exercise: create branch, apply migrations, run integration tests against branch URL, delete branch. Round-trips via the Supabase Management API + the supabase CLI invocation pattern in tools/ops/premerge_dryrun.sh.
- Smoke-test invocation contract: single CLI call that takes a branch connection string and exits 0 on green / non-zero on any failure.
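The smoke-test invocation contract in the last gate item can be illustrated with a small runner. This is a sketch under stated assumptions: run_checks, main, and the shape of a check (a callable that takes the connection string and raises on failure) are hypothetical, not the project's actual CLI.

```python
import argparse
import sys


def run_checks(checks, dsn):
    """Run every (name, check) pair against dsn; return 0 if all pass, else 1."""
    failed = 0
    for name, check in checks:
        try:
            check(dsn)
            print(f"ok   {name}")
        except Exception as exc:  # any failure must make the exit code non-zero
            failed += 1
            print(f"FAIL {name}: {exc}", file=sys.stderr)
    return 0 if failed == 0 else 1


def main(argv=None, checks=()):
    """Parse a single positional connection string and run the checks."""
    parser = argparse.ArgumentParser(description="branch smoke test (sketch)")
    parser.add_argument("dsn", help="branch connection string")
    args = parser.parse_args(argv)
    return run_checks(checks, args.dsn)
```

A real entry point would pass the return value of main to sys.exit, so the caller sees 0 on green and non-zero otherwise.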
Fallback
If supabase/postgres:15 disappoints: vanilla postgres:15 + CREATE EXTENSION vector in a test-setup migration + a synthesized minimal auth.users table.
D14 trigger ladder
| Trigger | Response |
|---|---|
| First-integration-test image-contents anomaly (extension, auth.users, asymmetric JWT) | Fall back to vanilla postgres:15 + manual extension install per D13 |
| First SET LOCAL / SET LOCAL ROLE nested-txn leakage observed under per-test rollback | Re-open per-test isolation mechanism; alternatives include schema-reset-per-test or database-per-test |
| First xdist-attributable flake (deadlock, ordering, shared-state interference) | Move to schema-per-worker (not retry-in-place) |
| First DDL-testing parallelism bottleneck | Move to database-per-worker |
| First partner pilot requiring shared-staging functional-test gate | Reconsider D12 (staging is not a CI target at alpha); add staging CI target |
Coverage floors
Domain ≥ 90%, application ≥ 80%, infrastructure ≥ 60%. Floors land as targets in tools/quality/check_coverage.py scaffold; enforcement starts disabled for the first month after the first real test suite lands. Month 1: tracking only (PR comments). Month 2+: enforce.
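The tracking-then-enforce rollout can be sketched as below. This is an illustration only, not the contents of tools/quality/check_coverage.py; the function signature and the per-layer input dict are assumptions.

```python
# Floors from this runbook, in percent.
FLOORS = {"domain": 90.0, "application": 80.0, "infrastructure": 60.0}


def check_coverage(measured, *, enforce):
    """Compare per-layer coverage percentages with the floors.

    Returns (exit_code, messages). With enforce=False (month 1) shortfalls
    are reported but the exit code stays 0; with enforce=True they fail.
    """
    messages = []
    for layer, floor in FLOORS.items():
        actual = measured.get(layer, 0.0)  # a missing layer counts as 0%
        if actual < floor:
            messages.append(
                f"{layer}: {actual:.1f}% is below the {floor:.0f}% floor"
            )
    exit_code = 1 if (messages and enforce) else 0
    return exit_code, messages
```

In tracking mode the messages would feed the PR comment; switching enforce on in month 2 changes only the exit code.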
LLM testing posture (three tiers)
Per ADR-061:
- Unit / contract: FakeLLMProvider (in spectral.core.llm.testing); deterministic; zero external calls
- Integration: pytest-recording per-test cassettes at tests/<context>/_fixtures/llm/<test-id>.yaml; replay byte-perfect
- Live drift detection: nightly workflow (.github/workflows/nightly-live-drift.yml per ADR-061; lands with the deploy substrate); LIVE_PROVIDER=1 env bypasses VCR; compares to recorded cassettes via similarity threshold (0.85 default; per-test override)
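The threshold comparison in the live-drift tier might look like the sketch below. The similarity metric is not specified in this runbook, so difflib.SequenceMatcher is used purely as a stand-in; the function names are hypothetical.

```python
from difflib import SequenceMatcher

DEFAULT_THRESHOLD = 0.85  # default from this runbook


def similarity(recorded, live):
    """Character-level similarity in [0, 1]; a stand-in metric, not the real one."""
    return SequenceMatcher(None, recorded, live).ratio()


def has_drifted(recorded, live, threshold=DEFAULT_THRESHOLD):
    """True when the live response falls below the similarity threshold."""
    return similarity(recorded, live) < threshold
```

A per-test override then amounts to passing a different threshold for tests whose responses are known to vary more.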
Cassette recording sessions:
    RECORD_NEW_FIXTURES=1 uv run pytest tests/platform/integration/test_scan.py -m integration

Always review the fresh cassette diff for sensitive content before commit; the cassette redaction lint blocks Authorization: Bearer ... patterns.
See docs/runbooks/llm-testing.md for drift triage + threshold calibration.
Backup-nightly bats + fake-gcs harness
tools/ops/backup/backup-nightly.sh runs pg_dump → age → rclone rcat to GCS. The integration test harness uses bats (Bash Automated Testing System) plus fake-gcs-server to exercise the full pipe locally in CI.
The harness lives under tests/ops/backup/ (close-pass scaffold; lands when first integration test consumer needs it). Compose profile backup in infra/local/compose.yml brings up backup-nightly + fake-gcs-server for local exercise:
    pnpm compose:up:backup
    pnpm compose:run backup-nightly bash tools/ops/backup/backup-nightly.sh
    # Verify the dump uploaded to fake-gcs:
    curl http://localhost:4443/storage/v1/b/spectral-backups/o

Cassette redaction lint
tools/quality/check_cassette_redaction.py blocks Authorization: Bearer ... patterns and similar in committed cassettes. Wired into pre-push gate. Lands with the first cassette commit (until then, a dead lint with no inputs).
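A minimal sketch of the kind of scan such a lint could perform. It is not the actual check_cassette_redaction.py: the regex, the find_leaks name, and the REDACTED placeholder convention are all assumptions. The pattern matches the token itself rather than a single-line header, because YAML cassettes commonly store header values as list items on the line after the header name.

```python
import re

# Bearer followed by a token-like string, unless it is the (assumed)
# REDACTED placeholder.
BEARER_TOKEN = re.compile(
    r"\bBearer\s+(?!REDACTED\b)[A-Za-z0-9._~+/-]+=*", re.IGNORECASE
)


def find_leaks(text):
    """Return (line_number, stripped_line) pairs that look like live tokens."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if BEARER_TOKEN.search(line)
    ]
```

A pre-push hook would run this over each committed cassette and exit non-zero if any file returns a non-empty list.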
See also
- ADR-045: Test substrate
- ADR-061: LLM testing strategy
- ADR-062: CI secrets handling
- Codex testing
- docs/runbooks/llm-testing.md: recording sessions + drift triage
- docs/runbooks/ci-secrets.md: Environment scoping + rotation