Test Failures

Local qa-replay run before merge catches latent drift the unit suites cannot see

Problem

The SPEC-604 merge gate ran the full-stack qa replay suite locally and found four latent breaks already on main — none caused by the branch under review. The suite had simply not been executed since several earlier merges (FORCE-RLS, the masked-identifier pass, the version-history overhaul): under direct-merge-to-main, CI only runs when something is pushed, so the repo accrues qa drift invisibly while every unit suite stays green.

Root Cause(s) — the four drift classes

  1. Reads of worlds tables under SET ROLE platform_role without app.world_id set (tools/dev/qa_customer_seed.py::_deploy). Under FORCE ROW LEVEL SECURITY (SPEC-564) the enshrined-rule SELECT returned zero rows, which surfaced as NoEnshrinedRulesError. This is the second confirmed instance of the class predicted when SPEC-564 landed (the first was integration-test teardown). Fix: run SELECT set_config('app.world_id', <world>, false) after SET ROLE in any tooling that reads or writes worlds tables as platform_role.
  2. qa helpers coupled to UI display conventions. createWorld in _worlds.ts read the created world id from the success panel’s <code> text, which the masked-identifier pass (730218e1) reduced to a last-6 handle, so every world-scoped test drove /worlds/<6-chars>/… and got a 422. The fix touches both sides of the convention: the masked <code> now carries the full id in data-id/title (which the convention promises), and the helper reads data-id with a text fallback. Corollary: a test that asserts toContainText(worldId) against a masked display only “passed” while the helper was returning the short id. Assert instead that worldId.slice(-6) is visible and that [data-id="${worldId}"] is present.
  3. Copy assertions vs. overhauled surfaces. publish-deploy.spec.ts asserted “Version 1” / “Published:” — pre-version-history-overhaul row copy. Any surface rework must re-run the suite or the assertions rot.
  4. Bare or() locators with co-visible alternatives. Playwright’s or() fails strict mode when more than one alternative is visible at once (empty-state text + section heading; alert + version line). The failure is racy because parallel workers can change data mid-run. Fix: append .first() when alternatives can co-render.
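
The fix for drift class 1 can be sketched from the shell as a single psql session. This is a minimal sketch, not the seed script's actual code: the helper name platform_role_sql, the DATABASE_URL variable, and the enshrined_rules table name in the usage comment are assumptions for illustration. Only SET ROLE platform_role and set_config('app.world_id', …, false) come from the text above.

```shell
# Emit a SQL script that switches to platform_role, pins app.world_id for the
# rest of the session, then runs the caller's query.
# Hypothetical helper: the name and argument order are illustrative.
platform_role_sql() {
  world_id=$1
  query=$2
  # Double any single quote so the id is a safe SQL string literal.
  escaped=$(printf '%s' "$world_id" | sed "s/'/''/g")
  printf 'SET ROLE platform_role;\n'
  printf "SELECT set_config('app.world_id', '%s', false);\n" "$escaped"
  printf '%s\n' "$query"
}

# Usage (assumed table name and connection variable):
#   platform_role_sql "$WORLD_ID" 'SELECT count(*) FROM enshrined_rules;' | psql "$DATABASE_URL"
```

The third argument to set_config is is_local; passing false keeps the setting for the whole session rather than only the current transaction, which is why one call after SET ROLE covers every statement that follows in the same connection.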

Solution

Run the replay gate locally as part of every merge gate while merges go directly to main (sequence and gotchas):

bash tools/dev/start.sh --stop && supabase db reset
SPECTRAL_LLM_CASSETTE_MODE=replay SPECTRAL_LLM_CASSETTE_DIR=qa/cassettes \
XAI_API_KEY=placeholder-for-replay tools/dev/start.sh --full
set -a; eval "$(bash tools/dev/resolve_supabase_env.sh)"; set +a # qa needs SUPABASE_ANON_KEY
uv run python tools/dev/cold_start_seed.py
SPECTRAL_MODULE_STORE_ROOT="$PWD/.local/module-store" \
uv run python tools/dev/qa_customer_seed.py # must match the booted store
SPECTRAL_OPERATOR_PASSWORD= SPECTRAL_QA_CUSTOMER_PASSWORD= SPECTRAL_QA_DECISION_KEY= \
pnpm exec playwright test --config qa/playwright.config.ts

Boot-state pitfall: a lingering old API on :8000 makes the replay-mode boot log Address already in use while /health still answers from the stale process — kill the pid-file processes AND whatever holds the ports before re-booting, or the suite runs against a non-replay API.

Three more run-state pitfalls (the first two from the Wave 0 merge train, 2026-06-09; the third from Wave 1 cockpit):

  • Stale spectral_workers from a prior session starve chat replay. start.sh --stop does not reliably kill an orphaned worker runtime; a leftover one competes for agent-task chat rows, so World-Agent-driven scenarios (authoring-loop, candidate-review, world-model-card, publish-deploy) time out waiting for turn-assistant while every other test passes. Before the replay boot: kill the port holders (8000/3000/3001) AND pkill -f spectral_workers. (One worker runtime shows as two processes — the multiprocessing spawn child is normal.)
  • The first suite run against a cold stack flakes on chat-heavy beforeAll hooks; a per-surface warm-up setup project now absorbs this. The first chat-driving hook on a freshly booted stack pays every one-time cost at once: Vite on-demand compilation of the Assistant route graph, the worker’s first agent-task (LangGraph compile + model-client init + cassette load + Realtime channel join), and cold DB pools. Together these exceeded the 60s turn-assistant wait and failed whichever chat spec ran first. Boot-script health-waits only prove the HTTP listeners answer, not that those paths are warm. The config now wires a Playwright setup project per surface (<surface>/tests/_warmup.setup.ts, listed in the surface project’s dependencies) that pays that cost ONCE against a generous bound: the operations warm-up drives a full chat propose round-trip, and the customer warm-up signs in and loads the portfolio, so every timed spec runs warm. The first run is now the real verdict; there is no “throwaway run” step. A cold full run goes green in one pass (operations 67 passed, customer 60 passed, 0 fail). Any first-run failure is real (selector, assertion, 4xx, or genuine cold-path regression), so investigate it; the Wave 0 train caught a real rail-status bug exactly this way. Note: CI sets retries: 2, which used to silently mask this class on the first retry; locally, retries: 0 exposed it.
  • The worker Realtime bridge degrades after many warm runs in one session. Across a long merge train (~10+ qa-replay runs without a reboot, Wave 1 cockpit), the worker’s Realtime WebSocket connections start failing (code: 1006 + join push timeout for channel realtime:world-agent:conversation:… in .workers.log), which starves every chat-seed beforeAll and makes the suite runtime balloon (35s → 1.8m → 3.6m) with progressively more chat-dependent specs failing. This is NOT branch drift — it is accumulated stack state. The warm-re-run rule alone does not recover it; a clean reboot does (start.sh --stop + kill port-holders + pkill -f spectral_workers + supabase db reset + reboot + reseed). Reboot the stack every few merges across a long train, and reboot-then-verify the moment runtime balloons or the failing set grows between runs.
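
The pre-boot cleanup these pitfalls call for can be sketched as a few shell functions. This is a sketch under stated assumptions: the function names and the QA_PORTS variable are illustrative, lsof is assumed to be installed, and none of this is part of tools/dev/start.sh. The port numbers and the pkill pattern come from the text above.

```shell
# Ports named above: the API on 8000, plus 3000 and 3001.
QA_PORTS="8000 3000 3001"

# Print the PIDs listening on a TCP port (prints nothing when the port is free).
# Caveat: if lsof is missing this also prints nothing, so confirm it is installed.
port_holders() {
  lsof -t -iTCP:"$1" -sTCP:LISTEN 2>/dev/null || true
}

# Report and terminate whatever still holds the qa ports.
free_qa_ports() {
  for port in $QA_PORTS; do
    pids=$(port_holders "$port")
    if [ -n "$pids" ]; then
      echo "port $port held by: $(echo $pids)"
      # Unquoted on purpose: $pids may contain several PIDs.
      kill $pids 2>/dev/null || true
    fi
  done
}

# Terminate orphaned worker runtimes left over from a prior session.
kill_stale_workers() {
  pkill -f spectral_workers || true
}
```

Calling free_qa_ports and then kill_stale_workers between start.sh --stop and the replay boot covers both the stale-API and the stale-worker cases. Because pkill -f matches against full command lines, run these from a script file rather than an inline bash -c string that itself contains the pattern.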

Prevention

  • Treat “unit suites green” as necessary, not sufficient, for changes that touch UI copy, display conventions, seeds, or RLS posture — each has a qa-side consumer the unit suites never execute.
  • A UI copy/convention sweep must cover qa/ — generated *.spec.ts, the NL spec sources under qa/*/specs/ (regen would resurrect stale assertions), and the helper files (_*.ts — they are not *.spec.ts and a spec-glob grep misses them).
  • Grep SET ROLE platform_role across tools/ and tests/ whenever an RLS posture change lands; every hit needs an app.world_id audit.
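
The SET ROLE audit can be narrowed with a small grep wrapper. This is a heuristic sketch: the function name audit_platform_role is illustrative, and the check is file-level only, so a file that mentions app.world_id somewhere can still contain an unguarded code path. Every hit from the plain grep still deserves a manual look.

```shell
# List files under the given roots that issue SET ROLE platform_role but never
# mention app.world_id. These are the first candidates for the audit.
audit_platform_role() {
  grep -rl 'SET ROLE platform_role' "$@" 2>/dev/null | while IFS= read -r file; do
    if ! grep -q 'app\.world_id' "$file"; then
      echo "$file"
    fi
  done
}

# Usage, from the repository root:
#   audit_platform_role tools tests
```

An empty result means every file that switches to platform_role at least references app.world_id; it does not prove the setting is applied on every path.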

References

  • SPEC-604 merge a31899ea; gate-repair commit 283b465
  • SPEC-564 FORCE-RLS merge 7a72ec99; masked-identifier pass 730218e1