# Hosting runbook
Alpha operational runbook for Spectral’s hosting topology — per-deployable substrate map, revisit triggers, smoke-test protocol, and rollback posture. See ADR-046 for decision context and full rationale.
## Decision summary
- PaaS substrate: Render, Virginia region (co-located with Supabase us-east-1 for sub-ms hot path).
- DB substrate: Supabase managed (per ADR-046).
- Docs substrate: Cloudflare Pages.
- Edge / DNS substrate: Cloudflare. DNS zone authoritative on Cloudflare; Cloudflare Registrar holds the domain; TLS terminated at Cloudflare for proxied hostnames; blue/green cutover via CNAME flip on the public hostname. See `edge.md` for the full posture.
- Orchestration: GitHub Actions drives all deploys; `autoDeploy: false` across every Render service, Cloudflare Pages project, and Supabase Management API call.
- Environments: two — `staging` (push-main-triggered) and `production` (SemVer-tag-triggered with deployment protection rules).
## Per-deployable hosting map
| Deployable | Substrate | Environment | Notes |
|---|---|---|---|
| `api` | Render web service | both | FastAPI + uvicorn; blue/green pairs in prod, single-color in staging |
| `dashboard` | Render web service | both | TanStack Start; serves app.runspectral.com / app-staging.runspectral.com |
| `operations` | Render web service | both | TanStack Start; serves ops.runspectral.com / ops-staging.runspectral.com; JWKS-local auth gate (Pattern A, per ADR-046 D9) |
| `workers` | Render background worker | both | LISTEN/NOTIFY consumer + outbox drainer + scan orchestration |
| `retention-run` | Render cron | both | Nightly DELETE-per-policy sweep (per ADR-042) |
| `backup-nightly` | Render cron | both | Nightly `pg_dump -Fc` → age → GCS (per ADR-040) |
| `docs-user` | Cloudflare Pages | both | Public; no auth. docs.runspectral.com / docs-staging.runspectral.com |
| `docs-codex` | Cloudflare Pages + Pages Function | both | JWKS-local auth via Pages Function; `SCOPE_*_OPERATIONS` gate. codex.runspectral.com / codex-staging.runspectral.com |
| `core.db` | Supabase main branch | production | Production Postgres + Auth + pgvector |
| `core.db` (staging) | Supabase persistent preview branch | staging | Schema-synced to prod via ADR-048 D4 Management API merge flow |
Six Render services per environment (api, dashboard, operations, workers, retention-run, backup-nightly) plus two Cloudflare Pages projects (docs-user, docs-codex). Production adds blue/green pairs for the three web services, bringing the production Render-service count to nine (api-blue, api-green, dashboard-blue, dashboard-green, operations-blue, operations-green, workers, retention-run, backup-nightly). Staging runs single-color.
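The staging/production counts above can be sanity-checked with a small sketch. The deployable names come from the hosting map; the blue/green expansion logic is illustrative, not the actual blueprint generator:

```python
# Base deployables per environment, from the hosting map above.
BASE_SERVICES = ["api", "dashboard", "operations", "workers", "retention-run", "backup-nightly"]
BLUE_GREEN = {"api", "dashboard", "operations"}  # web services paired in production


def render_services(env: str) -> list[str]:
    """Expand the deployable list into concrete Render service names."""
    services: list[str] = []
    for name in BASE_SERVICES:
        if env == "production" and name in BLUE_GREEN:
            services += [f"{name}-blue", f"{name}-green"]
        else:
            services.append(name)
    return services


assert len(render_services("staging")) == 6      # single-color staging
assert len(render_services("production")) == 9   # three blue/green pairs + 3 singles
```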
## Auth-domain architecture
- Cookie scope: `runspectral.com` eTLD+1. The session cookie set by the Supabase Auth PKCE flow is visible across every subdomain (app., ops., codex., docs.).
- JWKS-local validation on the operations Start app (Pattern A, per ADR-046 D9) and on the `docs-codex` Pages Function. Both consume `@supabase/supabase-js` `getClaims()` against the Supabase project’s JWKS.
- FastAPI auth middleware validates JWTs via JWKS-local on API requests. A contract test enforces parity between FastAPI’s and Start’s validation outputs on the same inputs.
## Smoke-test protocol (R1)
R1 is the merge-gate for changes exercising Render-specific behavior. Two stages: desk research (R1-lite) and live-env validation (R1-full).
### R1-lite (desk research)
Pre-deploy verification against Render documentation:
- Background Worker semantics: stable process, SIGTERM → `maxShutdownDelaySeconds` grace, SIGKILL fallback.
- `render.yaml` blueprint spec accommodates the six-service + two-cron topology cleanly.
- Pricing envelope verified (~$121/mo at alpha).
### R1-full (live-env validation)
Runs against the live Render staging environment on the first cross-feature dev build that exercises workers + auth. Checks:
- LISTEN/NOTIFY 24h hold: worker holds stable connection to Supabase Postgres for 24 hours. Reconnect-after-restart matches ADR-044.
- SIGTERM grace: a simulated SIGTERM completes the outbox cursor commit + `UNLISTEN` within the 90s `maxShutdownDelaySeconds` window.
- Auth correctness: the operations Start app + `docs-codex` Pages Function return correct decisions on valid / invalid / expired JWTs.
- Schema-version gate: deploy verification blocks Cloudflare cutover when green services don’t report the expected post-migration schema.
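The schema-version gate reduces to a pure comparison over the green services' `/version` responses. A minimal sketch; the payload shape and the `schema_version` field name are assumptions, not the actual response contract:

```python
# Hypothetical /version payload shape: {"schema_version": "...", "git_sha": "..."}.
# The real gate runs inside deploy verification, before the Cloudflare flip.
def cutover_allowed(expected_schema: str, green_payloads: list[dict]) -> bool:
    """Block cutover unless every green service reports the post-migration schema."""
    return bool(green_payloads) and all(
        p.get("schema_version") == expected_schema for p in green_payloads
    )


assert cutover_allowed("0042", [{"schema_version": "0042"}, {"schema_version": "0042"}])
assert not cutover_allowed("0042", [{"schema_version": "0042"}, {"schema_version": "0041"}])
assert not cutover_allowed("0042", [])  # no reports means no cutover
```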
## SIGTERM / drain contract
Per ADR-046 D8 parameters:
- `HANDLER_MAX = 60s` — bound on any single worker handler (`asyncio.wait_for`)
- `maxShutdownDelaySeconds = 90s` — Render’s shutdown grace after SIGTERM
- Reaper interval: 30s — re-PENDs stuck IN_FLIGHT rows
- Claim TTL: 300s — 5× HANDLER_MAX safety buffer
- `SPECTRAL_DRAIN_COOLING_SECONDS = 60s` — default for legacy-drain workers
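The `HANDLER_MAX` bound is the standard `asyncio.wait_for` pattern. A minimal sketch, where the handler bodies and the "re-PEND" return value are illustrative stand-ins for the real row-state update:

```python
import asyncio

HANDLER_MAX = 60  # seconds, per ADR-046 D8


async def run_handler(handler, event, handler_max: float = HANDLER_MAX) -> str:
    """Run one worker handler under the HANDLER_MAX bound.

    Returns the resulting row state: DONE on success, PENDING when the
    bound fires (so reaper/claim logic can hand the event elsewhere).
    """
    try:
        await asyncio.wait_for(handler(event), timeout=handler_max)
        return "DONE"
    except asyncio.TimeoutError:
        return "PENDING"  # illustrative stand-in for the real re-PEND update


async def fast(event):   # completes well inside the bound
    await asyncio.sleep(0)


async def stuck(event):  # never completes; wait_for cancels it
    await asyncio.sleep(3600)


async def main():
    assert await run_handler(fast, {}) == "DONE"
    assert await run_handler(stuck, {}, handler_max=0.01) == "PENDING"


asyncio.run(main())
```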
Render’s rolling deploy sequence for a worker:
- New instance starts, registers heartbeat in `core.workers`.
- 60s overlap — both old and new running; SKIP LOCKED + generation filter keeps them from processing each other’s events.
- SIGTERM to old instance.
- Old instance sets `draining=true`, stops claiming new work, finishes the in-flight handler (bounded by HANDLER_MAX), unsubscribes LISTEN, commits the outbox cursor, exits 0.
- 90s grace window → SIGKILL fallback (does not fire under normal operation).
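The drain step amounts to a SIGTERM handler that flips a draining flag so the claim loop stops taking work. A minimal sketch, with the real database steps reduced to comments (the `Event`-based flag and the claim function are illustrative):

```python
import signal
import threading

draining = threading.Event()


def on_sigterm(signum, frame):
    # Stop claiming new work. The in-flight handler finishes (bounded by
    # HANDLER_MAX), then the worker UNLISTENs, commits the outbox cursor,
    # and exits 0, all inside the 90s grace window.
    draining.set()


signal.signal(signal.SIGTERM, on_sigterm)


def claim_next_event():
    """Claim-loop gate: never claim once draining has been signalled."""
    if draining.is_set():
        return None
    # real worker: SELECT ... FOR UPDATE SKIP LOCKED with a generation filter
    return {"event": "illustrative"}


assert claim_next_event() is not None
on_sigterm(signal.SIGTERM, None)  # simulate signal delivery for the sketch
assert claim_next_event() is None
```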
## Rollback procedure (production)
Rollback scope depends on what failed:
- Cloudflare cutover complete, observed issue — flip the public CNAME back to the blue origin. The 60-second TTL bounds the recovery window. Blue stays warm during the legacy-drain window.
- `/version` check failed on green before cutover — abort; blue is still serving. Investigate the green failure; rebuild as needed.
- Deploy-generation-specific data issue — redeploy the previous code at a new generation (e.g., if the tagged release is bad, deploy the prior code as gen N+2 tagged `v0.3.1-rollback`). Stranded outbox rows from the bad generation drain via the `drain-legacy-generation.yml` workflow (per ADR-053).
- Migration-caused issue — migrations are expand/contract and must be backward-compatible (per ADR-048 D4). Old code works against the new schema; no data-destructive migration ships without an explicit `-- compat: breaking` marker. Rollback is code-level only — schema migrations are forward-only (per ADR-032 D4).
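The branches above can be condensed into a small decision table. The failure-class keys here are shorthand invented for this sketch, not identifiers from the rollback runbook:

```python
# Shorthand decision table for the rollback branches above.
ROLLBACK_ACTIONS = {
    "post-cutover-issue":    "flip public CNAME back to blue; 60s TTL bounds recovery",
    "green-version-failed":  "abort before cutover; blue still serving",
    "generation-data-issue": "redeploy prior code at a new generation; drain stranded outbox rows",
    "migration-issue":       "roll back code only; schema migrations are forward-only",
}


def rollback_action(failure_class: str) -> str:
    """Map a failure class to its rollback action (raises KeyError on unknown class)."""
    return ROLLBACK_ACTIONS[failure_class]


assert "blue" in rollback_action("post-cutover-issue")
assert "forward-only" in rollback_action("migration-issue")
```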
## Revisit triggers
### Hard (open a new hosting-choice spike immediately)
- Render outage materially impacting operations
- LISTEN/NOTIFY behavior on Render Background Worker fails R1-full
- Render acquired by entity triggering reputation concern
- Pricing changes that double cost or restructure billing
### Soft (evaluate without committing to move)
- Render workspace cost > $400/mo sustained
- Multi-region requirement appears (non-US pilot, regulatory, p99 floor)
- Sustained worker memory pressure exceeding tier capacity
- A first design-partner SLA that Render’s status-page history can’t credibly support
## uvicorn startup convention
API service production entrypoint:
```
uvicorn spectral_api.main:app --host 0.0.0.0 --port $PORT --workers $WORKERS
```

No gunicorn wrapper. Render handles process supervision at the orchestrator layer; uvicorn alone is sufficient when the orchestrator owns restart-on-crash, health checks, and rolling deploys. Worker count is env-configurable per service tier (`$WORKERS`), sized against the plan memory budget.
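Sizing `$WORKERS` against the plan memory budget can be sketched as a one-line formula. The per-worker footprint and headroom figures below are placeholders, not measured values:

```python
def size_workers(plan_memory_mb: int, per_worker_mb: int, headroom: float = 0.2) -> int:
    """Pick $WORKERS for a uvicorn service: fit within plan memory minus
    a safety headroom, never dropping below one worker."""
    usable = plan_memory_mb * (1 - headroom)
    return max(1, int(usable // per_worker_mb))


# e.g. a 2 GB plan with a ~300 MB per-worker footprint (placeholder figures)
assert size_workers(2048, 300) == 5
# a tiny plan still gets one worker rather than zero
assert size_workers(256, 300) == 1
```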
## Related
- ADR-046 — alpha hosting choice (Render).
- ADR-048 — deployment topology.
- ADR-053 — CD pipeline orchestration (workflows, migration compat lint, Cloudflare LB flip).
- ADR-037 — secrets management.
- ADR-039 — auth substrate.
- `docs/runbooks/secrets-management.md` — rotation + audit for runtime secrets delivered via Render Environment Groups.
- `docs/runbooks/edge.md` — Cloudflare DNS + TLS + edge-rules + blue/green CNAME-flip mechanism + Pages Function JWKS architecture.
- `docs/runbooks/deployment.md` — staging + production deploy flows; 12-step cutover sequence; observability and abort handling.
- `docs/runbooks/rollback.md` — production rollback decision tree by failure class.
- `docs/runbooks/legacy-drain.md` — legacy-generation outbox-drain protocol used after rollback.
- `infra/render/production.yaml` + `infra/render/staging.yaml` — declarative Render blueprints.
- `.github/deploy-manifest.yml` — path-filtered rollout map.