
Live (non-test) boot of API + workers shakes out env-wiring gaps the test harness papered over

Live boot env-wiring shake-out

Problem

The closed-loop validation (SPEC-498) was the first time the API and workers were booted on real local services outside the test harness. Every prior green check ran through tests/integration/conftest.py, whose _load_local_dotenv helper loads .env into os.environ and defaults the DB DSN. A live tools/dev/start.sh boot loaded no dotenv, so a cascade of required environment variables was simply absent, and the failures stayed silent until the first request rather than being loud at boot.

Symptoms, in the order they surfaced:

  1. Operator/customer containers built to None (build_*_from_env returns None when SUPABASE_DB_URL / SUPABASE_URL / SPECTRAL_MODULE_STORE_ROOT are unset), so the routes 500 on first request with no boot-time signal.
  2. RuntimeError: SPECTRAL_GENERATION not set on chat dispatch / deploy / decision recording.
  3. world_agent_chat container not wired on app.state — the Realtime subscriber needs SUPABASE_ANON_KEY / SUPABASE_PUBLISHABLE_KEY.
  4. Even with the API fixed, the workers skipped the chat consumer for the same missing keys, so dispatched tasks accumulated unconsumed in core.outbox.

Investigation

  • Confirmed nothing in src/apps calls load_dotenv — env reaches a process only from its launching shell. The only .env reader is the integration conftest.
  • supabase status and supabase status -o env (the documented machine-readable source) crash intermittently in this environment with a Bun SyntaxError: JSON Parse error. As a result, start.sh’s key resolution fell back silently: the URL and DSN took their hardcoded defaults (which happened to be correct), but the keys, which have no default, were left empty.
  • The reliable local source for the anon/service JWT keys turned out to be the running container env: docker inspect supabase_storage_spectral exposes ANON_KEY.
  • The deploy generation was originally read under two names: the API dispatch + every publisher read SPECTRAL_GENERATION (fail-loud if unset); the workers read SPECTRAL_OUTBOX_GENERATION (silent default 0). They only agreed locally by luck (both 0). SPEC-570 unified both sides on the one SPECTRAL_GENERATION (see below).

Root cause

The system had only ever been exercised through a harness that injected its environment. The real boot/config seam — which env each service requires, and how it is supplied without the test harness — was never exercised, so it had drifted into a state where: (a) required vars had no live source, (b) absence failed late and silently (container = None → 500) instead of at boot, and (c) the same concept (generation) was configured by two different variable names with mismatched failure modes.

Solution

Local boot wiring landed in tools/dev/start.sh (commit 1539038 + follow-ups): a --full mode that boots workers + the operator cockpit, and an env-resolution block that exports SUPABASE_URL, SUPABASE_DB_URL, SPECTRAL_MODULE_STORE_ROOT, SPECTRAL_GENERATION (the single generation var both sides read, post-SPEC-570) for every API-booting mode, with precedence shell-env > .env > supabase status > local default.
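
The precedence order above can be illustrated in Python. This is only a sketch of the lookup rule, not the shell logic in start.sh: the function names resolve_var and parse_dotenv, the four mapping arguments, and the sample values are all hypothetical.

```python
from typing import Dict, Mapping, Optional


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; skips blank lines and # comments."""
    result: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip().strip('"').strip("'")
    return result


def resolve_var(
    name: str,
    shell_env: Mapping[str, str],
    dotenv: Mapping[str, str],
    supabase_status: Mapping[str, str],
    local_defaults: Mapping[str, str],
) -> Optional[str]:
    """Return the first non-empty value for `name`, highest precedence first.

    Order: shell-env > .env > supabase status > local default.
    Returns None when no source supplies the variable.
    """
    for source in (shell_env, dotenv, supabase_status, local_defaults):
        value = source.get(name)
        if value:  # an empty string counts as unset
            return value
    return None
```

With this shape, a variable that has no local default (as was the case for the keys) resolves to None when every higher-precedence source is empty, which makes the gap explicit instead of leaving an empty export.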

The durable fixes are tracked, not hacked into place:

  • SPEC-569 — a declared per-service runtime-env contract + fail-fast at startup on a missing required var (replace the silent container = None → 500), plus a robust key source for start.sh when supabase status is down.
  • SPEC-570 (resolved) — both the API/publisher side and the workers now resolve the generation through the shared spectral.core.events.infrastructure.generation.resolve_generation, reading the one SPECTRAL_GENERATION and failing loud when it is unset; the workers-only SPECTRAL_OUTBOX_GENERATION and its silent default 0 are gone.
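
A minimal sketch of what a shared, fail-loud generation resolver could look like. It is not the code in spectral.core.events.infrastructure.generation: the name resolve_generation_sketch, the optional env parameter, and the assumption that the generation is an integer are illustrative only. The error text for the unset case mirrors the RuntimeError quoted in the symptoms.

```python
import os
from typing import Mapping, Optional


def resolve_generation_sketch(env: Optional[Mapping[str, str]] = None) -> int:
    """Read SPECTRAL_GENERATION and fail loudly when it is unset.

    Assumption: the generation is an integer. There is deliberately no
    default, so the API side and the workers cannot silently disagree.
    """
    env = os.environ if env is None else env
    raw = env.get("SPECTRAL_GENERATION")
    if raw is None or raw.strip() == "":
        raise RuntimeError("SPECTRAL_GENERATION not set")
    try:
        return int(raw)
    except ValueError as exc:
        raise RuntimeError(
            f"SPECTRAL_GENERATION must be an integer, got {raw!r}"
        ) from exc
```

Because every caller goes through one function, there is a single variable name and a single failure mode to reason about.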

Implementation notes

  • For local keys when the Supabase CLI is broken: run docker inspect supabase_storage_spectral and read ANON_KEY / SERVICE_KEY from Config.Env.
  • The module-store root MUST be identical for the operator deploy path (deposit) and the customer /api/decide path (load) — they share SPECTRAL_MODULE_STORE_ROOT.
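
The docker inspect fallback can be scripted. The sketch below assumes the standard docker inspect output (a JSON array whose first element has a Config.Env list of NAME=value strings); the helper names env_from_inspect and read_container_env are hypothetical and are not part of start.sh.

```python
import json
import subprocess
from typing import Dict, List


def env_from_inspect(inspect_json: str) -> Dict[str, str]:
    """Turn `docker inspect` JSON output into a {NAME: value} dict."""
    containers = json.loads(inspect_json)  # docker inspect prints a JSON array
    entries: List[str] = containers[0]["Config"]["Env"]
    env: Dict[str, str] = {}
    for entry in entries:
        # Split on the first "=" only; values may contain "=" themselves.
        name, _, value = entry.partition("=")
        env[name] = value
    return env


def read_container_env(container: str = "supabase_storage_spectral") -> Dict[str, str]:
    """Run `docker inspect <container>` and return its environment."""
    completed = subprocess.run(
        ["docker", "inspect", container],
        check=True,
        capture_output=True,
        text=True,
    )
    return env_from_inspect(completed.stdout)
```

A caller would then read read_container_env()["ANON_KEY"] and export it under whichever variable name the service expects.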

Prevention

Best practices

  • Treat the first non-harness boot of any service as a distinct validation step — a green integration suite proves logic, not that the service can start from its real environment.
  • A required env var should fail loud at boot, not produce a None collaborator that 500s on first request.
  • One concept = one env var name across all services that share it.
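
As an illustration of the fail-loud-at-boot practice, here is a minimal sketch of a startup check. It is not the SPEC-569 design: MissingEnvError, require_env, and the API_REQUIRED tuple are hypothetical names, and the tuple simply reuses the variables that start.sh exports for API-booting modes.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


class MissingEnvError(RuntimeError):
    """Raised at startup when a required environment variable is absent."""


def require_env(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return the required variables, or raise listing every missing one."""
    env = os.environ if env is None else env
    names = list(names)
    # Unset and empty-string values are both treated as missing.
    missing = sorted(n for n in names if not env.get(n))
    if missing:
        raise MissingEnvError("missing required env vars: " + ", ".join(missing))
    return {n: env[n] for n in names}


# Hypothetical contract for the API process (assumed, not the real one).
API_REQUIRED = (
    "SUPABASE_URL",
    "SUPABASE_DB_URL",
    "SPECTRAL_MODULE_STORE_ROOT",
    "SPECTRAL_GENERATION",
)
```

Calling require_env(API_REQUIRED) as the first step of process startup turns a missing variable into an immediate, named error rather than a None collaborator that only surfaces as a 500 on the first request.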

Warning signs

  • Routes that pass in integration tests but 500 on a live boot.
  • build_*_from_env helpers that return None on missing env (silent degradation).
  • Any config read as os.environ[...] in one service and os.environ.get(..., default) in a peer that must agree with it.

References

  • tools/dev/start.sh (env resolution + --full)
  • docs/runbooks/cold-start.md
  • Linear: SPEC-498 (walkthrough), SPEC-569 (env contract + fail-fast), SPEC-570 (generation unification), SPEC-568 (live chat-streaming bridge — still open)