Full-stack-in-CI replay gate — Supabase signing-key generation broke on modern CLIs

Problem

SPEC-590 (D6 of the UX QA harness) added a PR-blocking CI job that boots the entire app stack (Supabase + API + workers + operator cockpit + customer dashboard) in cassette replay mode and runs the Playwright NL-spec suite. ci.yml had never booted an app stack before, so the job surfaced a chain of environment-specific failures — the load-bearing one being that supabase start died at boot:

postgrest: FatalError {fatalErrorMessage = "user error (The JWT secret must be at least 32 characters long.)"}
Error status 503

Locally everything worked — but only because the local Supabase stack had been started in a prior session with an older CLI; a fresh boot had never been exercised this cycle.

Investigation

Steps Tried

  1. Suspected a missing symmetric jwt_secret in supabase/config.toml — but the config uses asymmetric ES256 signing (signing_keys_path = "./signing_keys.json"), no symmetric secret, and that is correct for the SPEC-552 auth model. Not the cause.
  2. Suspected the CI CLI version (latest) vs. local (2.102.0). Pinning was a candidate, but rather than burn CI runs on guesses, tore down the local stack and ran a fresh supabase start to reproduce the failure locally.
  3. Inspected supabase gen signing-key behavior directly. Found the root cause (below).

Root Cause

supabase gen signing-key changed behavior on modern CLIs (≥ 2.102). It now:

  • writes the key to the configured signing_keys_path (and reads that file first), and
  • prints nothing to stdout (older CLIs printed the JWK to stdout).

start.sh (and the first CI draft) generated the key with:

printf '[%s]\n' "$(supabase gen signing-key --algorithm ES256)" > supabase/signing_keys.json

On a modern CLI the command substitution captures empty stdout, so the file becomes [] — an empty key set. With no signing key, Supabase falls back to a symmetric JWT secret that resolves shorter than 32 chars, and PostgREST fatals. start.sh only “worked” for developers carrying a signing_keys.json left over from an older CLI; a genuinely fresh checkout was latently broken.
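
The failure mode can be reproduced without the Supabase CLI. The sketch below uses two hypothetical stand-in functions, old_cli_gen and new_cli_gen (neither is a real command), to mimic a tool that prints a JWK to stdout versus one that prints nothing, and shows what the printf wrapper emits in each case. The JWK fields are abbreviated for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the two CLI behaviours; neither is a real command.
old_cli_gen() { printf '%s' '{"kty":"EC","alg":"ES256"}'; }  # prints a JWK to stdout
new_cli_gen() { :; }                                          # prints nothing to stdout

# Same wrapper shape as the original start.sh line: whatever stdout
# contained is placed between the brackets.
wrap_keyset() { printf '[%s]\n' "$("$1")"; }

wrap_keyset old_cli_gen   # prints [{"kty":"EC","alg":"ES256"}]
wrap_keyset new_cli_gen   # prints []
```

The second call exits 0 and writes well-formed JSON, which is why nothing flagged the problem until PostgREST started.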

Two smaller CI-only failures rode along:

  • resolve_supabase_env.sh is designed to be sourced, so it carries no executable bit; invoking it directly (tools/dev/resolve_supabase_env.sh >> "$GITHUB_ENV") → Permission denied (126).
  • Cassette replay is credential-free at the HTTP layer (recorded responses are returned by a MockTransport), but building the ChatXAI client still calls resolve_bearer_source, which needs a bearer. With no key the chat model is unbound and the chat route 503s.

Solution

Signing-key generation (CI ci.yml + deploy system gate + tools/dev/start.sh)

Seed an empty array first (the CLI reads the file before writing it), then let the CLI write:

# Before (empty [] on modern CLIs → PostgREST <32-char JWT secret)
printf '[%s]\n' "$(supabase gen signing-key --algorithm ES256)" > supabase/signing_keys.json
# After
echo '[]' > supabase/signing_keys.json
supabase gen signing-key --algorithm ES256 --yes # --yes answers the overwrite prompt

This is robust across CLI versions (it always ends with the CLI writing a valid [{kty:EC,alg:ES256,…}]).
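
Because the original bug produced a syntactically valid but empty file, a cheap guard after generation can fail fast before supabase start runs. The following is a minimal sketch and not part of the fix described here: keyset_is_nonempty is a hypothetical helper, and it only checks that the file holds something other than an empty array; it does not parse or validate the JWK.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the file exists and holds something
# other than an empty JSON array. It does not parse or validate the JWK.
keyset_is_nonempty() {
  local file="$1" compact
  [ -s "$file" ] || return 1                  # missing or zero-length file
  compact="$(tr -d '[:space:]' < "$file")"    # drop all whitespace before comparing
  [ -n "$compact" ] && [ "$compact" != "[]" ]
}
```

One way to use it is to call keyset_is_nonempty supabase/signing_keys.json right after the generation step and abort the boot with a clear message if it returns non-zero.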

resolve_supabase_env.sh in CI

Invoke via bash so no executable bit is required:

run: bash tools/dev/resolve_supabase_env.sh >> "$GITHUB_ENV"
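
The difference between the two invocation forms can be reproduced in isolation. This sketch creates a throwaway script (a stand-in, not the real resolve_supabase_env.sh) with mode 644 and compares direct invocation with bash <script>. The variable name it prints is made up for illustration.

```shell
#!/usr/bin/env bash
# Throwaway stand-in for a script that is meant to be sourced (mode 644).
script="$(mktemp)"
printf '%s\n' 'echo "EXAMPLE_VAR=example"' > "$script"
chmod 644 "$script"

# Direct invocation needs the executable bit, so the shell refuses to run it.
direct_status=0
"$script" 2>/dev/null || direct_status=$?

# Passing the file to bash only needs read permission.
bash_status=0
bash_output="$(bash "$script")" || bash_status=$?

echo "direct=$direct_status bash=$bash_status output=$bash_output"
rm -f "$script"
```

The direct call fails with status 126, the same code seen in CI, while the bash form succeeds and emits the KEY=value line that would be appended to "$GITHUB_ENV".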

Credential-free replay

Default a placeholder XAI_API_KEY whenever the stack boots in replay mode, so the client constructs (the bearer is never sent — cassettes intercept). Done in the CI boot helper tools/ci/qa_replay_up.sh:

if [ "$SPECTRAL_LLM_CASSETTE_MODE" = "replay" ]; then
  export XAI_API_KEY="${XAI_API_KEY:-replay-not-used-in-cassette-mode}"
fi
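
The ${XAI_API_KEY:-…} expansion is what keeps a real key intact when one is already set. A small sketch of that behaviour, wrapped in a hypothetical resolve_replay_key function so it can be exercised with different inputs (the function and the "other" mode value are illustrative, not part of qa_replay_up.sh):

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the same default expansion, for illustration.
# $1 = cassette mode, $2 = existing key (may be empty)
resolve_replay_key() {
  local mode="$1" key="$2"
  if [ "$mode" = "replay" ]; then
    key="${key:-replay-not-used-in-cassette-mode}"   # fill only if unset or empty
  fi
  printf '%s\n' "$key"
}

resolve_replay_key replay ""          # prints the placeholder
resolve_replay_key replay "real-key"  # prints real-key (existing value kept)
resolve_replay_key other ""           # prints an empty line (no default outside replay)
```

Scoping the default to replay mode means a missing key in any other mode still surfaces as an error instead of being masked by the placeholder.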

Implementation Notes

  • Validate the Supabase boot path locally with a true supabase stop --all && supabase start, not a stack left running from a prior session — the stale stack hides fresh-boot breakage.
  • The full gate can be validated locally without burning CI runs: once qa_replay_up.sh reaches “stack ready”, running cold_start_seed.py, then qa_customer_seed.py, then pnpm exec playwright test reproduces the CI result (54 pass / 16 documented skips / 0 fail).

Prevention

Best Practices

  • When a CI job runs a dev-shell command (signing-key gen, env resolution, stack boot), prefer the exact bash-invoked form and a fresh-state assumption — don’t rely on a developer’s accumulated local state.
  • Treat “works locally” as suspect when the local long-running service predates the change; re-boot from scratch.
  • Keep CI and start.sh generating the signing key the same way (a single canonical recipe) so they can’t drift.

Warning Signs

  • A CLI tool that used to print its result to stdout now prints nothing, or only status text; command substitution silently captures the wrong thing.
  • git ls-files -s <script> shows 100644 for a script you invoke directly in CI → it needs the exec bit or a bash prefix.

Latent-drift note

Main had not run CI since 2026-06-01 (direct-merge-to-main policy; CI runs only on push-to-main / PR / workflow_dispatch). Several unrelated checks (biome on generated tests, a stale validator self-test, a migration-compat marker) had quietly gone red in the interim. The first CI-gated change to force a green run must expect to sweep that accumulated drift.

References

  • Merge: c263e34 (SPEC-590); branch matt/spec-590-d6-ci-gate-scheduled-recordverify-pass
  • .github/workflows/ci.yml (qa-replay job), tools/ci/qa_replay_up.sh, tools/dev/start.sh
  • src/spectral/core/llm/infrastructure/cassette.py (derive_cassette_version / wire_cassette_version)