Test Failures

Local qa-replay run before merge catches latent drift the unit suites cannot see

Problem

The SPEC-604 merge gate ran the full-stack qa replay suite locally and found four latent breaks already on main — none caused by the branch under review. The suite had simply not been executed since several earlier merges (FORCE-RLS, the masked-identifier pass, the version-history overhaul): under direct-merge-to-main, CI only runs when something is pushed, so the repo accrues qa drift invisibly while every unit suite stays green.

Root Cause(s) — the four drift classes

SET ROLE platform_role reads of worlds tables without app.world_id (tools/dev/qa_customer_seed.py::_deploy). Under FORCE ROW LEVEL SECURITY (SPEC-564) the enshrined-rule SELECT returned zero rows → NoEnshrinedRulesError. This is the second confirmed instance of the class predicted when SPEC-564 landed (the first was integration-test teardown). Fix: run SELECT set_config('app.world_id', <world>, false) after SET ROLE in any tooling that reads or writes worlds tables as platform_role.
qa helpers coupled to UI display conventions. _worlds.ts createWorld read the created world id from the success panel’s <code> text, which the masked-identifier pass (730218e1) reduced to a last-6 handle — so every world-scoped test drove /worlds/<6-chars>/… and 422’d. Fix on both sides of the convention: the masked <code> carries the full id in data-id/title (which the convention promises), and the helper reads data-id with a text fallback. Corollary: a test that asserts toContainText(worldId) against a masked display only “passed” while the helper was returning the short id — assert worldId.slice(-6) visible + [data-id="${worldId}"] present instead.
Copy assertions vs. overhauled surfaces. publish-deploy.spec.ts asserted “Version 1” / “Published:” — pre-version-history-overhaul row copy. Any surface rework must re-run the suite or the assertions rot.
Bare or() locators with co-visible alternatives. Playwright or() fails strict mode when more than one alternative is visible at once (empty-state text + section heading; alert + version line). The failure is also racy, because parallel workers change data mid-run and so change which alternatives render. Fix: append .first() when alternatives can co-render.
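The first fix in the list above can be sketched in Python. This is a minimal illustration under stated assumptions, not the project's actual _deploy code: enter_platform_role is a hypothetical helper, the cursor is assumed to be a DB-API cursor that accepts psycopg-style %s placeholders, and the world id is bound as a parameter rather than interpolated into the SQL.

```python
from typing import Any, Protocol


class Cursor(Protocol):
    """Minimal DB-API cursor surface this sketch relies on."""

    def execute(self, query: str, params: Any = ...) -> Any: ...


def enter_platform_role(cur: Cursor, world_id: str) -> None:
    """Switch to platform_role and scope the session to one world.

    Hypothetical helper. Under FORCE ROW LEVEL SECURITY, reads made as
    platform_role return zero rows unless app.world_id is set first.
    """
    cur.execute("SET ROLE platform_role")
    # Third argument (is_local) = false: the setting lasts for the session,
    # not only for the current transaction.
    cur.execute("SELECT set_config('app.world_id', %s, false)", (world_id,))
```

Keeping both statements in one helper means a caller cannot issue SET ROLE and forget the world scope, which is the failure this drift class describes.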

Solution

Run the replay gate locally as part of every merge gate while merges go directly to main (sequence and gotchas):

bash tools/dev/start.sh --stop && supabase db reset
SPECTRAL_LLM_CASSETTE_MODE=replay SPECTRAL_LLM_CASSETTE_DIR=qa/cassettes \
  XAI_API_KEY=placeholder-for-replay tools/dev/start.sh --full
set -a; eval "$(bash tools/dev/resolve_supabase_env.sh)"; set +a   # qa needs SUPABASE_ANON_KEY
uv run python tools/dev/cold_start_seed.py
SPECTRAL_MODULE_STORE_ROOT="$PWD/.local/module-store" \
  uv run python tools/dev/qa_customer_seed.py                      # must match the booted store
SPECTRAL_OPERATOR_PASSWORD=… SPECTRAL_QA_CUSTOMER_PASSWORD=… SPECTRAL_QA_DECISION_KEY=… \
  pnpm exec playwright test --config qa/playwright.config.ts

Boot-state pitfall: a lingering old API on :8000 makes the replay-mode boot log “Address already in use” while /health still answers from the stale process. Kill the pid-file processes AND whatever holds the ports before re-booting, or the suite runs against a non-replay API.
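A pre-boot check for this pitfall could look like the sketch below. It is an assumption-laden illustration, not part of tools/dev: port_in_use and held_ports are hypothetical names, the port list is taken from the ports this document mentions, and the check only probes IPv4 loopback, so a process listening solely on IPv6 would be missed.

```python
import socket

# Ports named in this document (8000, 3000, 3001); adjust to the real stack.
STACK_PORTS = (8000, 3000, 3001)


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def held_ports(ports=STACK_PORTS) -> list[int]:
    """Return the subset of ports that still have a listener."""
    return [p for p in ports if port_in_use(p)]
```

Calling held_ports() after start.sh --stop and before the replay boot should return an empty list; anything else means a stale process would answer /health in place of the replay-mode API.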

Three more run-state pitfalls (the first two from the Wave 0 merge train, 2026-06-09; the third from the Wave 1 cockpit train):

Stale spectral_workers from a prior session starve chat replay. start.sh --stop does not reliably kill an orphaned worker runtime; a leftover one competes for agent-task chat rows, so World-Agent-driven scenarios (authoring-loop, candidate-review, world-model-card, publish-deploy) time out waiting for turn-assistant while every other test passes. Before the replay boot: kill the port holders (8000/3000/3001) AND pkill -f spectral_workers. (One worker runtime shows as two processes — the multiprocessing spawn child is normal.)
The first suite run against a cold stack flakes on chat-heavy beforeAll hooks; a per-surface warm-up setup project now absorbs this. The first chat-driving hook on a freshly booted stack pays every one-time cost at once: Vite on-demand compilation of the Assistant route graph, the worker’s first agent-task (LangGraph compile + model-client init + cassette load + Realtime channel join), and cold DB pools. Together these exceeded the 60s turn-assistant wait and failed whichever chat spec ran first. Boot-script health-waits only prove that the HTTP listeners answer, not that those paths are warm. The config now wires a Playwright setup project per surface (<surface>/tests/_warmup.setup.ts, listed in the surface project’s dependencies) that pays that cost ONCE against a generous bound: the operations warm-up drives a full chat propose round-trip, and the customer warm-up signs in and loads the portfolio, so every timed spec runs warm. The first run is now the real verdict; there is no “throwaway run” step. A cold full run goes green in one pass (operations 67 passed, customer 60 passed, 0 fail). Any first-run failure is real (selector, assertion, 4xx, or genuine cold-path regression), so investigate it; the Wave 0 train caught a real rail-status bug exactly this way. Note: CI sets retries: 2, which used to silently mask this class on the first retry; locally retries: 0 exposed it.
The worker Realtime bridge degrades after many warm runs in one session. Across a long merge train (~10+ qa-replay runs without a reboot, Wave 1 cockpit), the worker’s Realtime WebSocket connections start failing (code: 1006 + join push timeout for channel realtime:world-agent:conversation:… in .workers.log). That starves every chat-seed beforeAll, so suite runtime balloons (35s → 1.8m → 3.6m) and progressively more chat-dependent specs fail. This is NOT branch drift; it is accumulated stack state. Re-running against the warm stack does not recover it; a clean reboot does (start.sh --stop + kill port-holders + pkill -f spectral_workers + supabase db reset + reboot + reseed). Reboot the stack every few merges across a long train, and reboot-then-verify the moment runtime balloons or the failing set grows between runs.
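The degradation described in the last item leaves two signatures in .workers.log, so a quick scan can tell accumulated stack state apart from branch drift before anyone starts debugging specs. The sketch below is hypothetical: the function names are invented for illustration, only the two quoted substrings come from this document, and the surrounding log line format is not assumed.

```python
import re

# Substrings quoted in this document; everything else about the log
# format is unknown, so the patterns match anywhere in a line.
_CLOSE_1006 = re.compile(r"code:\s*1006")
_JOIN_TIMEOUT = re.compile(r"join push timeout")


def realtime_failure_counts(log_text: str) -> dict[str, int]:
    """Count lines carrying each Realtime failure signature."""
    counts = {"close_1006": 0, "join_timeout": 0}
    for line in log_text.splitlines():
        if _CLOSE_1006.search(line):
            counts["close_1006"] += 1
        if _JOIN_TIMEOUT.search(line):
            counts["join_timeout"] += 1
    return counts


def needs_reboot(log_text: str) -> bool:
    """Treat any signature hit as a reason to reboot before re-running."""
    return any(realtime_failure_counts(log_text).values())
```

Feeding it the text of .workers.log after a run whose runtime has grown gives a cheap first check; the "any hit" threshold is a deliberately conservative guess rather than a measured cutoff.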

Prevention

Treat “unit suites green” as necessary, not sufficient, for changes that touch UI copy, display conventions, seeds, or RLS posture — each has a qa-side consumer the unit suites never execute.
A UI copy/convention sweep must cover qa/ — generated *.spec.ts, the NL spec sources under qa/*/specs/ (regen would resurrect stale assertions), and the helper files (_*.ts — they are not *.spec.ts and a spec-glob grep misses them).
Grep SET ROLE platform_role across tools/ and tests/ whenever an RLS posture change lands; every hit needs an app.world_id audit.
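The grep in the last prevention item can be turned into a coarse triage script. The sketch below is illustrative only: audit_platform_role is a hypothetical name, and the check works per file, so a file that sets app.world_id in one code path but not another still passes and every hit still needs the manual audit.

```python
from pathlib import Path

ROLE_MARKER = "SET ROLE platform_role"
WORLD_MARKER = "app.world_id"


def audit_platform_role(roots: list[Path]) -> list[Path]:
    """Return files that mention SET ROLE platform_role but never app.world_id."""
    suspects = []
    for root in roots:
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
            if ROLE_MARKER in text and WORLD_MARKER not in text:
                suspects.append(path)
    return suspects
```

Run from the repository root as audit_platform_role([Path("tools"), Path("tests")]); the returned paths are the ones to look at first, and files that contain both markers are the second pass.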

References

SPEC-604 merge a31899ea; gate-repair commit 283b465
SPEC-564 FORCE-RLS merge 7a72ec99; masked-identifier pass 730218e1
