Full-stack-in-CI replay gate — Supabase signing-key generation broke on modern CLIs
Problem
SPEC-590 (D6 of the UX QA harness) added a PR-blocking CI job that boots the entire app stack
(Supabase + API + workers + operator cockpit + customer dashboard) in cassette replay mode and
runs the Playwright NL-spec suite. ci.yml had never booted an app stack before, so the job
surfaced a chain of environment-specific failures — the load-bearing one being that
supabase start died at boot:

    postgrest: FatalError {fatalErrorMessage = "user error (The JWT secret must be at least 32 characters long.)"}
    Error status 503

Locally everything worked, but only because the local Supabase stack had been started in a prior session with an older CLI; a fresh boot had never been exercised this cycle.
Investigation
Steps Tried
- Suspected a missing symmetric jwt_secret in supabase/config.toml, but the config uses asymmetric ES256 signing (signing_keys_path = "./signing_keys.json") with no symmetric secret, and that is correct for the SPEC-552 auth model. Not the cause.
- Suspected the CI CLI version (latest) vs local (2.102.0). Pinning was a candidate, but before guessing, tore down the local stack and ran a fresh supabase start to reproduce. That was the right move (don't burn CI runs guessing).
- Inspected supabase gen signing-key behavior directly. Found the root cause (below).
Root Cause
supabase gen signing-key changed behavior on modern CLIs (≥ 2.102). It now:
- writes the key to the configured signing_keys_path (and reads that file first), and
- prints nothing to stdout (older CLIs printed the JWK to stdout).
start.sh (and the first CI draft) generated the key with:

    printf '[%s]\n' "$(supabase gen signing-key --algorithm ES256)" > supabase/signing_keys.json

On a modern CLI the command substitution captures empty stdout, so the file becomes [], an empty
key set. With no signing key, Supabase falls back to a symmetric JWT secret that resolves shorter
than 32 chars, and PostgREST fatals. start.sh only "worked" for developers carrying a
signing_keys.json left over from an older CLI; a genuinely fresh checkout was latently broken.
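
The empty-capture failure can be reproduced without the Supabase CLI. The following is a minimal sketch, assuming a stand-in function fake_gen_signing_key (hypothetical, not part of the CLI) that mimics the modern behavior described above: it writes a key to a file and prints nothing to stdout.

```shell
#!/bin/sh
# Reproduce the empty-capture failure with a stand-in for the CLI.
workdir=$(mktemp -d)

# Hypothetical stand-in for `supabase gen signing-key` on a modern CLI:
# it writes a (fake) key to a file and prints nothing to stdout.
fake_gen_signing_key() {
  printf '[{"kty":"EC","alg":"ES256"}]\n' > "$workdir/written_by_cli.json"
}

# The old recipe: wrap whatever the command prints in [ ... ].
printf '[%s]\n' "$(fake_gen_signing_key)" > "$workdir/signing_keys.json"

# The command substitution captured an empty string, so the redirect
# target holds an empty key set.
captured=$(cat "$workdir/signing_keys.json")
echo "signing_keys.json contains: $captured"
```

The command itself succeeds and exits 0, which is why nothing flagged the problem until PostgREST refused to boot.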
Two smaller CI-only failures rode along:
- resolve_supabase_env.sh is designed to be sourced, so it carries no executable bit; invoking it directly (tools/dev/resolve_supabase_env.sh >> "$GITHUB_ENV") → Permission denied (126).
- Cassette replay is credential-free at the HTTP layer (recorded responses are returned by a MockTransport), but building the ChatXAI client still calls resolve_bearer_source, which needs a bearer. With no key the chat model is unbound and the chat route 503s.
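
The exit-126 failure is easy to demonstrate in isolation. This is a sketch under stated assumptions: resolve_env_demo.sh and DEMO_VAR are hypothetical stand-ins, not the real script or its output.

```shell
#!/bin/sh
# Show why a sourced-only script fails when invoked directly.
workdir=$(mktemp -d)
script="$workdir/resolve_env_demo.sh"   # hypothetical stand-in script

# A script meant to be sourced: mode 644, no executable bit.
printf 'echo "DEMO_VAR=demo-value"\n' > "$script"
chmod 644 "$script"

# Direct invocation: the shell cannot exec the file and reports an error.
direct_status=0
"$script" 2>/dev/null || direct_status=$?

# Invoking through bash only needs read permission on the file.
bash_status=0
via_bash=$(bash "$script") || bash_status=$?

echo "direct invocation exit status: $direct_status"
echo "bash invocation output: $via_bash (status $bash_status)"
```

On a typical Linux runner the direct invocation reports status 126, matching the Permission denied failure seen in CI, while the bash-prefixed form prints the variable line.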
Solution
Signing-key generation (CI ci.yml + deploy system gate + tools/dev/start.sh)
Seed an empty array first (the CLI reads the file before writing it), then let the CLI write:

    # Before (empty [] on modern CLIs → PostgREST <32-char JWT secret)
    printf '[%s]\n' "$(supabase gen signing-key --algorithm ES256)" > supabase/signing_keys.json

    # After
    echo '[]' > supabase/signing_keys.json
    supabase gen signing-key --algorithm ES256 --yes   # --yes answers the overwrite prompt

This is robust across CLI versions (it always ends with the CLI writing a valid [{kty:EC,alg:ES256,…}]).
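
Because the original failure was silent, a cheap check after generation may be worth adding. The sketch below assumes a hypothetical helper, signing_keys_nonempty, that is not in the repository. It only rejects the degenerate cases (missing file, empty file, or a bare []); it does not validate the JWK itself.

```shell
#!/bin/sh
# Hypothetical guard: succeed only if the key file holds more than "[]".
signing_keys_nonempty() {
  keys_file=$1
  [ -s "$keys_file" ] || return 1                 # missing or zero-length
  compact=$(tr -d ' \t\r\n' < "$keys_file")        # strip all whitespace
  [ -n "$compact" ] && [ "$compact" != "[]" ]     # reject the empty array
}

# Demo against two throwaway files.
workdir=$(mktemp -d)
printf '[]\n' > "$workdir/empty.json"
printf '[{"kty":"EC","alg":"ES256"}]\n' > "$workdir/ok.json"

for f in "$workdir/empty.json" "$workdir/ok.json"; do
  if signing_keys_nonempty "$f"; then
    echo "$(basename "$f"): key set present"
  else
    echo "$(basename "$f"): empty key set, refusing to boot"
  fi
done
```

Run before supabase start, a guard like this would turn the PostgREST fatal into an immediate and more descriptive failure.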
resolve_supabase_env.sh in CI
Invoke via bash so no executable bit is required:

    run: bash tools/dev/resolve_supabase_env.sh >> "$GITHUB_ENV"

Credential-free replay
Default a placeholder XAI_API_KEY whenever the stack boots in replay mode, so the client
constructs (the bearer is never sent — cassettes intercept). Done in the CI boot helper
tools/ci/qa_replay_up.sh:

    if [ "$SPECTRAL_LLM_CASSETTE_MODE" = "replay" ]; then
      export XAI_API_KEY="${XAI_API_KEY:-replay-not-used-in-cassette-mode}"
    fi

Implementation Notes
- Validate the Supabase boot path locally with a true supabase stop --all && supabase start, not a stack left running from a prior session; the stale stack hides fresh-boot breakage.
- The full gate can be validated locally without burning CI runs: qa_replay_up.sh reaches "stack ready", then cold_start_seed.py → qa_customer_seed.py → pnpm exec playwright test reproduces the CI result (54 pass / 16 documented skips / 0 fail).
Prevention
Best Practices
- When a CI job runs a dev-shell command (signing-key gen, env resolution, stack boot), prefer the exact bash-invoked form and a fresh-state assumption; don't rely on a developer's accumulated local state.
- Treat "works locally" as suspect when the local long-running service predates the change; re-boot from scratch.
- Keep CI and start.sh generating the signing key the same way (a single canonical recipe) so they can't drift.
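
One way to keep a single canonical recipe is a small function that both start.sh and the CI boot script source. This is only a sketch: the function name generate_signing_keys and its path argument are hypothetical, and the path must match the signing_keys_path configured in supabase/config.toml, since that is where the CLI writes.

```shell
#!/bin/sh
# Hypothetical shared helper, intended to be sourced by both start.sh and
# the CI boot script so the recipe lives in exactly one place.
generate_signing_keys() {
  # Assumed default; must agree with signing_keys_path in config.toml.
  keys_file=${1:-supabase/signing_keys.json}

  # Seed an empty array first: the CLI reads the file before writing it.
  mkdir -p "$(dirname "$keys_file")"
  echo '[]' > "$keys_file"

  # Let the CLI write the key itself; never capture its stdout.
  supabase gen signing-key --algorithm ES256 --yes || return 1
}
```

Callers would source the file and run generate_signing_keys before supabase start, so a future change in CLI behavior is fixed once rather than in each script.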
Warning Signs
- A CLI tool that used to print its result to stdout now prints nothing, or only status text, so command substitution silently captures the wrong thing.
- git ls-files -s <script> shows 100644 for a script you invoke directly in CI → it needs the exec bit or a bash prefix.
Latent-drift note
Main had not run CI since 2026-06-01 (direct-merge-to-main policy; CI runs only on push-to-main / PR
/ workflow_dispatch). Several unrelated checks (biome on generated tests, a stale validator
self-test, a migration-compat marker) had quietly gone red in the interim. The first CI-gated change
to force a green run must expect to sweep that accumulated drift.
References
- Merge:
c263e34(SPEC-590); branchmatt/spec-590-d6-ci-gate-scheduled-recordverify-pass .github/workflows/ci.yml(qa-replayjob),tools/ci/qa_replay_up.sh,tools/dev/start.shsrc/spectral/core/llm/infrastructure/cassette.py(derive_cassette_version/wire_cassette_version)