Live (non-test) boot of API + workers shakes out env-wiring gaps the test harness papered over
Live boot env-wiring shake-out
Problem
The closed-loop validation (SPEC-498) was the first time the API and workers were
booted on real local services outside the test harness. Every prior green check
ran through tests/integration/conftest.py, whose _load_local_dotenv helper loads
.env into os.environ and defaults the DB DSN. A live tools/dev/start.sh boot
loads no dotenv, so a cascade of required environment variables were simply absent —
and the failures were silent-until-first-request rather than loud-at-boot.
Symptoms, in the order they surfaced:
- Operator/customer containers built to None (build_*_from_env returns None when SUPABASE_DB_URL / SUPABASE_URL / SPECTRAL_MODULE_STORE_ROOT are unset), so the routes 500 on first request with no boot-time signal.
- RuntimeError: SPECTRAL_GENERATION not set on chat dispatch / deploy / decision recording.
- world_agent_chat container not wired on app.state: the Realtime subscriber needs SUPABASE_ANON_KEY / SUPABASE_PUBLISHABLE_KEY.
- Even with the API fixed, the workers skipped the chat consumer for the same missing keys, so dispatched tasks accumulated unconsumed in core.outbox.
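The first symptom can be reproduced in miniature. The sketch below is illustrative only: Container, build_container_from_env, and handle_request are hypothetical stand-ins, not the project's code. It shows why a builder that returns None on missing env gives no signal at boot and fails only when a request first touches the container.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Container:
    """Hypothetical stand-in for an operator/customer container."""
    db_url: str


def build_container_from_env(env: Mapping[str, str]) -> Optional[Container]:
    # Silent degradation: a missing variable yields None instead of an error.
    db_url = env.get("SUPABASE_DB_URL")
    if not db_url:
        return None
    return Container(db_url=db_url)


def handle_request(container: Optional[Container]) -> int:
    # Hypothetical route handler that returns an HTTP-style status code.
    try:
        return 200 if container.db_url else 500
    except AttributeError:
        # The None only surfaces here, on the first request.
        return 500


# "Boot" succeeds with an empty environment; nothing is raised.
container = build_container_from_env({})
status = handle_request(container)
```

With an empty environment the build step completes without complaint and the failure appears only in handle_request, which is the late, quiet failure mode described above.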
Investigation
- Confirmed nothing in src/apps calls load_dotenv; env reaches a process only from its launching shell. The only .env reader is the integration conftest.
- supabase status and supabase status -o env (the documented machine-readable source) crash intermittently with a Bun SyntaxError: JSON Parse error in this environment, so start.sh’s key resolution silently fell back to hardcoded URL/DSN defaults (which happened to be correct) but left the keys empty (no default).
- The reliable local source for the anon/service JWT keys turned out to be the running container env: docker inspect supabase_storage_spectral exposes ANON_KEY.
- The deploy generation was originally read under two names: the API dispatch + every publisher read SPECTRAL_GENERATION (fail-loud if unset); the workers read SPECTRAL_OUTBOX_GENERATION (silent default 0). They only agreed locally by luck (both 0). SPEC-570 unified both sides on the one SPECTRAL_GENERATION (see below).
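The two-name generation read is easy to see in a small sketch. The function names here are hypothetical and the generation is assumed to be an integer; only the variable names and the fail-loud versus silent-default behaviour come from the findings above.

```python
from typing import Mapping


def publisher_generation(env: Mapping[str, str]) -> int:
    # Fail-loud read, as described for the API dispatch and the publishers.
    if "SPECTRAL_GENERATION" not in env:
        raise RuntimeError("SPECTRAL_GENERATION not set")
    return int(env["SPECTRAL_GENERATION"])


def worker_generation_legacy(env: Mapping[str, str]) -> int:
    # Pre-SPEC-570 worker behaviour: a different name with a silent default.
    return int(env.get("SPECTRAL_OUTBOX_GENERATION", "0"))


# Locally both sides resolve to 0, so the mismatch is invisible.
local_env = {"SPECTRAL_GENERATION": "0"}
agree_locally = publisher_generation(local_env) == worker_generation_legacy(local_env)

# Bumping only the publisher's variable silently splits the two sides.
bumped_env = {"SPECTRAL_GENERATION": "3"}
agree_after_bump = publisher_generation(bumped_env) == worker_generation_legacy(bumped_env)
```

Under these assumptions the two sides agree only while the publisher's value happens to equal the worker's default, and nothing reports the divergence once it changes.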
Root cause
The system had only ever been exercised through a harness that injected its
environment. The real boot/config seam — which env each service requires, and how it
is supplied without the test harness — was never exercised, so it had drifted into a
state where: (a) required vars had no live source, (b) absence failed late and silently
(container = None → 500) instead of at boot, and (c) the same concept (generation)
was configured by two different variable names with mismatched failure modes.
Solution
Local boot wiring landed in tools/dev/start.sh (commit 1539038 + follow-ups): a
--full mode that boots workers + the operator cockpit, and an env-resolution block
that exports SUPABASE_URL, SUPABASE_DB_URL, SPECTRAL_MODULE_STORE_ROOT,
SPECTRAL_GENERATION (the single generation var both sides read, post-SPEC-570) for
every API-booting mode, with precedence shell-env > .env > supabase status > local
default.
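The precedence chain can be sketched as a first-non-empty lookup across ordered sources. This is a Python illustration of the idea, not the contents of start.sh (which is a shell script); the resolve helper and all sample values are hypothetical placeholders.

```python
from typing import Mapping, Optional, Sequence


def resolve(name: str, sources: Sequence[Mapping[str, str]],
            default: Optional[str] = None) -> Optional[str]:
    # Walk the sources from highest to lowest precedence.
    # Empty strings count as unset.
    for source in sources:
        value = source.get(name)
        if value:
            return value
    return default


# Placeholder data standing in for the three live sources.
shell_env = {"SUPABASE_URL": "http://shell.example"}
dotenv = {"SUPABASE_URL": "http://dotenv.example", "SPECTRAL_GENERATION": "0"}
supabase_status = {}  # e.g. the CLI crashed and produced nothing

sources = [shell_env, dotenv, supabase_status]
url = resolve("SUPABASE_URL", sources, default="http://local-default.example")
generation = resolve("SPECTRAL_GENERATION", sources)
anon_key = resolve("SUPABASE_ANON_KEY", sources)  # no default, so it stays None
```

A variable with no value in any source and no local default resolves to None, which is how the keys ended up empty when supabase status was down.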
The durable fixes are tracked, not hacked into place:
- SPEC-569: a declared per-service runtime-env contract + fail-fast at startup on a missing required var (replace the silent container = None → 500), plus a robust key source for start.sh when supabase status is down.
- SPEC-570 (resolved): both the API/publisher side and the workers now resolve the generation through the shared spectral.core.events.infrastructure.generation.resolve_generation, reading the one SPECTRAL_GENERATION and failing loud when it is unset; the workers-only SPECTRAL_OUTBOX_GENERATION and its silent default 0 are gone.
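A fail-fast check of the kind SPEC-569 calls for might look like the following minimal sketch. It is not the tracked implementation: MissingEnvError, API_REQUIRED, check_env, and boot_error are hypothetical names, and grouping these four variables under the API is an assumption based on the variables start.sh exports.

```python
from typing import Iterable, Mapping


class MissingEnvError(RuntimeError):
    """Raised at startup when required variables are absent."""


# Hypothetical contract: the variable names come from this document,
# but the per-service grouping is an assumption.
API_REQUIRED = (
    "SUPABASE_URL",
    "SUPABASE_DB_URL",
    "SPECTRAL_MODULE_STORE_ROOT",
    "SPECTRAL_GENERATION",
)


def check_env(required: Iterable[str], env: Mapping[str, str]) -> None:
    # Collect every missing name so a single boot reports the whole gap.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise MissingEnvError("missing required env: " + ", ".join(missing))


def boot_error(env: Mapping[str, str]) -> str:
    # Demonstration helper: returns the boot error message, or "" if boot is clean.
    try:
        check_env(API_REQUIRED, env)
    except MissingEnvError as exc:
        return str(exc)
    return ""
```

In a real service the call would be something like check_env(API_REQUIRED, os.environ) before any container is built, so a missing variable stops the process at boot rather than surfacing as a 500.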
Implementation notes
- For local keys when the Supabase CLI is broken: docker inspect supabase_storage_spectral → Config.Env → ANON_KEY / SERVICE_KEY.
- The module-store root MUST be identical for the operator deploy path (deposit) and the customer /api/decide path (load), because they share SPECTRAL_MODULE_STORE_ROOT.
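The docker inspect fallback can be scripted. The sketch below assumes the standard docker inspect output shape (a JSON array with a Config.Env list of NAME=value strings); keys_from_inspect and read_local_keys are hypothetical helpers, and the sample values are placeholders, not real keys.

```python
import json
import subprocess
from typing import Dict, Iterable


def keys_from_inspect(inspect_json: str,
                      wanted: Iterable[str] = ("ANON_KEY", "SERVICE_KEY")) -> Dict[str, str]:
    # `docker inspect` prints a JSON array with one object per inspected container.
    containers = json.loads(inspect_json)
    env_entries = containers[0]["Config"]["Env"] or []
    # Each entry is a "NAME=value" string; split on the first "=" only.
    pairs = dict(entry.split("=", 1) for entry in env_entries if "=" in entry)
    return {name: pairs[name] for name in tuple(wanted) if name in pairs}


def read_local_keys(container: str = "supabase_storage_spectral") -> Dict[str, str]:
    # Requires a running Docker daemon; defined here but not executed in this sketch.
    result = subprocess.run(
        ["docker", "inspect", container],
        check=True, capture_output=True, text=True,
    )
    return keys_from_inspect(result.stdout)


# Placeholder input shaped like `docker inspect` output.
sample = json.dumps(
    [{"Config": {"Env": ["ANON_KEY=abc.def", "SERVICE_KEY=x=y", "PATH=/bin"]}}]
)
keys = keys_from_inspect(sample)
```

Keeping the parsing separate from the subprocess call lets the extraction logic be checked without Docker running.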
Prevention
Best practices
- Treat the first non-harness boot of any service as a distinct validation step — a green integration suite proves logic, not that the service can start from its real environment.
- A required env var should fail loud at boot, not produce a None collaborator that 500s on first request.
- One concept = one env var name across all services that share it.
Warning signs
- Routes that pass in integration tests but 500 on a live boot.
- build_*_from_env helpers that return None on missing env (silent degradation).
- Any config read as os.environ[...] in one service and os.environ.get(..., default) in a peer that must agree with it.
References
- tools/dev/start.sh (env resolution + --full)
- docs/runbooks/cold-start.md
- Linear: SPEC-498 (walkthrough), SPEC-569 (env contract + fail-fast), SPEC-570 (generation unification), SPEC-568 (live chat-streaming bridge, still open)