Hosting runbook

Alpha operational runbook for Spectral’s hosting topology — per-deployable substrate map, revisit triggers, smoke-test protocol, and rollback posture. See ADR-046 for decision context and full rationale.

Decision summary

  • PaaS substrate: Render, Virginia region (co-located with Supabase us-east-1 for sub-ms hot path).
  • DB substrate: Supabase managed (per ADR-046).
  • Docs substrate: Cloudflare Pages.
  • Edge / DNS substrate: Cloudflare. DNS zone authoritative on Cloudflare; Cloudflare Registrar holds the domain; TLS terminated at Cloudflare for proxied hostnames; blue/green cutover via CNAME flip on the public hostname. See edge.md for the full posture.
  • Orchestration: GitHub Actions drives all deploys; autoDeploy: false across every Render service, Cloudflare Pages project, and Supabase Management API call.
  • Environments: two — staging (push-main-triggered) and production (SemVer-tag-triggered with deployment protection rules).

Per-deployable hosting map

| Deployable | Substrate | Environment | Notes |
| --- | --- | --- | --- |
| api | Render web service | both | FastAPI + uvicorn; blue/green pairs in prod, single-color in staging |
| dashboard | Render web service | both | TanStack Start; serves app.runspectral.com / app-staging.runspectral.com |
| operations | Render web service | both | TanStack Start; serves ops.runspectral.com / ops-staging.runspectral.com; JWKS-local auth gate (Pattern A, per ADR-046 D9) |
| workers | Render background worker | both | LISTEN/NOTIFY consumer + outbox drainer + scan orchestration |
| retention-run | Render cron | both | Nightly DELETE-per-policy sweep (per ADR-042) |
| backup-nightly | Render cron | both | Nightly pg_dump -Fc → age → GCS (per ADR-040) |
| docs-user | Cloudflare Pages | both | Public; no auth. docs.runspectral.com / docs-staging.runspectral.com |
| docs-codex | Cloudflare Pages + Pages Function | both | JWKS-local auth via Pages Function; SCOPE_*_OPERATIONS gate. codex.runspectral.com / codex-staging.runspectral.com |
| core.db | Supabase main branch | production | Production Postgres + Auth + pgvector |
| core.db (staging) | Supabase persistent preview branch | staging | Schema-synced to prod via ADR-048 D4 Management API merge flow |

Six Render services per environment (api, dashboard, operations, workers, retention-run, backup-nightly) plus two Cloudflare Pages projects (docs-user, docs-codex). Production adds blue/green pairs for the three web services, bringing the production Render-service count to nine (api-blue, api-green, dashboard-blue, dashboard-green, operations-blue, operations-green, workers, retention-run, backup-nightly). Staging runs single-color.
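
The service counts can be derived mechanically from the hosting map. The following is a minimal sketch, not the repository's tooling: the service names come from the table, while the constants and the `render_services` helper are hypothetical names introduced for illustration.

```python
# Service names are taken from the hosting map; the structure is illustrative.
WEB_SERVICES = ["api", "dashboard", "operations"]
SINGLETON_SERVICES = ["workers", "retention-run", "backup-nightly"]
COLORS = ["blue", "green"]


def render_services(environment: str) -> list[str]:
    """Return the Render service names expected in one environment."""
    if environment == "production":
        # Each web service runs as a blue/green pair in production.
        web = [f"{name}-{color}" for name in WEB_SERVICES for color in COLORS]
    elif environment == "staging":
        # Staging runs single-color.
        web = list(WEB_SERVICES)
    else:
        raise ValueError(f"unknown environment: {environment!r}")
    return web + SINGLETON_SERVICES
```

A check like `len(render_services("production")) == 9` could guard a blueprint against drifting from the documented topology.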

Auth-domain architecture

  • Cookie scope: runspectral.com (eTLD+1). The session cookie set by the Supabase Auth PKCE flow is visible across every subdomain (app., ops., codex., docs.).
  • JWKS-local validation on operations Start (Pattern A, per ADR-046 D9) and on the docs-codex Pages Function. Both consume @supabase/supabase-js getClaims() against the Supabase project’s JWKS.
  • FastAPI auth middleware validates JWTs via JWKS-local on API requests. A contract test enforces parity between FastAPI’s and Start’s validation outputs on the same inputs.
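
The parity contract can be pictured as running both validators over the same token fixtures and reporting any disagreement. This is a hedged sketch only: `Decision`, `Validator`, and `parity_mismatches` are hypothetical names, and because Start's validator runs in TypeScript, a real test would presumably compare against its recorded outputs rather than call it directly.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Decision:
    """Normalized validation outcome (hypothetical shape)."""
    allowed: bool
    reason: Optional[str] = None  # e.g. "expired" or "bad_signature"


Validator = Callable[[str], Decision]


def parity_mismatches(
    tokens: list[str],
    fastapi_validate: Validator,
    start_validate: Validator,
) -> list[tuple[str, Decision, Decision]]:
    """Return every token on which the two validators disagree."""
    mismatches = []
    for token in tokens:
        left = fastapi_validate(token)
        right = start_validate(token)
        if left != right:
            mismatches.append((token, left, right))
    return mismatches
```

An empty result means the two implementations agree on every fixture; any entry pinpoints the token and both decisions.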

Smoke-test protocol (R1)

R1 is the merge-gate for changes exercising Render-specific behavior. Two stages: desk research (R1-lite) and live-env validation (R1-full).

R1-lite (desk research)

Pre-deploy verification against Render documentation:

  • Background Worker semantics: stable process, SIGTERM → maxShutdownDelaySeconds grace, SIGKILL fallback.
  • render.yaml blueprint spec accommodates the six-service topology (three web services, one background worker, two crons) cleanly.
  • Pricing envelope verified (~$121/mo at alpha).

R1-full (live-env validation)

Runs against the live Render staging environment on the first cross-feature dev build that exercises workers + auth. Checks:

  • LISTEN/NOTIFY 24h hold: worker holds stable connection to Supabase Postgres for 24 hours. Reconnect-after-restart matches ADR-044.
  • SIGTERM grace: simulated SIGTERM completes outbox cursor commit + UNLISTEN within the 90s maxShutdownDelaySeconds window.
  • Auth correctness: Operations Start + docs-codex Pages Function return correct decisions on valid / invalid / expired JWTs.
  • Schema-version gate: deploy verification blocks Cloudflare cutover when green services don’t report the expected post-migration schema.
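
The schema-version gate in the last check reduces to a comparison between what each green service reports and what the migration is expected to produce. A minimal sketch, assuming each service's reported schema version has already been collected (for example from its /version endpoint); the `cutover_blockers` helper, the report shape, and the version strings in the test are hypothetical.

```python
from typing import Optional


def cutover_blockers(
    reports: dict[str, Optional[str]], expected_schema: str
) -> list[str]:
    """Return the green services that should block cutover.

    `reports` maps a service name to the schema version it reported,
    or None if the service could not be reached. An empty result means
    every green service reports the expected post-migration schema.
    """
    return sorted(
        name for name, schema in reports.items() if schema != expected_schema
    )
```

Treating an unreachable service (None) as a blocker keeps the gate fail-closed, which seems consistent with aborting while blue is still serving.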

SIGTERM / drain contract

Per ADR-046 D8 parameters:

  • HANDLER_MAX = 60s — bound on any single worker handler (asyncio.wait_for)
  • maxShutdownDelaySeconds = 90s — Render’s shutdown grace after SIGTERM
  • Reaper interval: 30s — re-PENDs stuck IN_FLIGHT rows
  • Claim TTL: 300s — 5× HANDLER_MAX safety buffer
  • SPECTRAL_DRAIN_COOLING_SECONDS = 60s — default for legacy-drain workers
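
The relationships between these parameters can be written down as checkable invariants. The sketch below is illustrative: the values are the ones listed above, the `DrainParams` and `invariant_violations` names are hypothetical, and the third rule (reaper interval shorter than claim TTL) is an assumption rather than something this runbook states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrainParams:
    handler_max: int = 60          # HANDLER_MAX, seconds
    max_shutdown_delay: int = 90   # maxShutdownDelaySeconds
    reaper_interval: int = 30      # seconds between reaper passes
    claim_ttl: int = 300           # seconds before a claim expires
    drain_cooling: int = 60        # SPECTRAL_DRAIN_COOLING_SECONDS


def invariant_violations(p: DrainParams) -> list[str]:
    """Return a human-readable list of broken timing invariants."""
    problems = []
    # An in-flight handler must be able to finish inside the SIGTERM grace.
    if p.handler_max >= p.max_shutdown_delay:
        problems.append("handler_max must be shorter than the shutdown grace")
    # The runbook describes the claim TTL as 5x HANDLER_MAX.
    if p.claim_ttl != 5 * p.handler_max:
        problems.append("claim_ttl should equal 5 * handler_max")
    # Assumption: the reaper should run more often than claims expire.
    if p.reaper_interval >= p.claim_ttl:
        problems.append("reaper_interval should be shorter than claim_ttl")
    return problems
```

With the defaults the list is empty; raising `handler_max` to 120 without touching the other values would trip the first two rules.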

Render’s rolling deploy sequence for a worker:

  1. New instance starts, registers heartbeat in core.workers.
  2. 60s overlap — both old and new running; SKIP LOCKED + generation filter keeps them from processing each other’s events.
  3. SIGTERM to old instance.
  4. Old instance sets draining=true, stops claiming new work, finishes in-flight handler (bounded by HANDLER_MAX), unsubscribes LISTEN, commits outbox cursor, exits 0.
  5. 90s grace window → SIGKILL fallback (does not fire under normal operation).
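
Step 4 of the sequence can be sketched with asyncio. This is a simplified illustration under stated assumptions, not the worker's actual code: the `Worker` class and its `claim_next`, `handle`, `unlisten`, and `commit_cursor` methods are hypothetical stubs standing in for the real queue, handler, and database calls.

```python
import asyncio
import signal

HANDLER_MAX = 60.0  # seconds, per the parameters above


class Worker:
    def __init__(self, handler_max: float = HANDLER_MAX) -> None:
        self.handler_max = handler_max
        self.draining = asyncio.Event()
        self.steps: list[str] = []  # records shutdown order for inspection

    async def claim_next(self):
        """Hypothetical stub: claim one event from the outbox."""
        await asyncio.sleep(0.01)
        return {"id": 1}

    async def handle(self, event) -> None:
        """Hypothetical stub for the real event handler."""
        await asyncio.sleep(0.01)

    async def unlisten(self) -> None:
        self.steps.append("unlisten")       # real code would issue UNLISTEN

    async def commit_cursor(self) -> None:
        self.steps.append("commit_cursor")  # real code would persist the cursor

    async def run(self) -> int:
        # Stop claiming new work as soon as draining is set.
        while not self.draining.is_set():
            event = await self.claim_next()
            try:
                # Bound every handler so a drain cannot outlast the grace window.
                await asyncio.wait_for(self.handle(event), timeout=self.handler_max)
            except asyncio.TimeoutError:
                self.steps.append("handler_timeout")
        await self.unlisten()
        await self.commit_cursor()
        return 0  # exit code


async def main() -> int:
    worker = Worker()
    loop = asyncio.get_running_loop()
    # On SIGTERM, flip the draining flag; the run loop exits after the
    # in-flight handler completes. add_signal_handler is Unix-only.
    loop.add_signal_handler(signal.SIGTERM, worker.draining.set)
    return await worker.run()
```

Because the handler is capped at HANDLER_MAX (60s) and the grace window is 90s, the cleanup steps have roughly 30s of slack before SIGKILL would fire.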

Rollback procedure (production)

Rollback scope depends on what failed:

  1. Cloudflare cutover complete, observed issue — flip the public CNAME back to the blue origin. The 60-second TTL bounds the recovery window. Blue stays warm during the legacy-drain window.
  2. /version check failed on green before cutover — abort; blue still serving. Investigate green failure; rebuild as needed.
  3. Deploy-generation-specific data issue — redeploy previous code at a new generation (e.g., if the tagged release is bad, deploy the prior code as gen N+2 tagged v0.3.1-rollback). Stranded outbox rows from the bad generation drain via the drain-legacy-generation.yml workflow (per ADR-053).
  4. Migration-caused issue — migrations are expand/contract and must be backward-compatible (per ADR-048 D4). Old code works against the new schema; no data-destructive migration ships without an explicit -- compat: breaking marker. Rollback is code-level only — schema migrations are forward-only (per ADR-032 D4).
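
The four cases above amount to a lookup from failure class to rollback action. A minimal sketch, with hypothetical enum and function names; the action strings paraphrase the list and are not taken from any workflow file.

```python
from enum import Enum


class FailureClass(Enum):
    POST_CUTOVER = "issue observed after Cloudflare cutover"
    GREEN_VERSION_CHECK = "/version check failed on green before cutover"
    GENERATION_DATA = "deploy-generation-specific data issue"
    MIGRATION = "migration-caused issue"


ROLLBACK_ACTION = {
    FailureClass.POST_CUTOVER: "flip the public CNAME back to the blue origin",
    FailureClass.GREEN_VERSION_CHECK: "abort the deploy; blue keeps serving",
    FailureClass.GENERATION_DATA: (
        "redeploy prior code at a new generation, then drain legacy rows"
    ),
    # Schema migrations are forward-only, so only code is rolled back.
    FailureClass.MIGRATION: "roll back code only; leave the schema as is",
}


def rollback_action(failure: FailureClass) -> str:
    """Return the rollback action for a given failure class."""
    return ROLLBACK_ACTION[failure]
```

Note that no branch touches the schema, which mirrors the forward-only migration rule.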

Revisit triggers

Hard (open a new hosting-choice spike immediately)

  • Render outage materially impacting operations
  • LISTEN/NOTIFY behavior on Render Background Worker fails R1-full
  • Render is acquired by an entity that raises reputational concern
  • Pricing changes that double cost or restructure billing

Soft (evaluate without committing to move)

  • Render workspace cost > $400/mo sustained
  • Multi-region requirement appears (non-US pilot, regulatory, p99 floor)
  • Sustained worker memory pressure exceeding tier capacity
  • First design-partner SLA that Render’s status-page history can’t credibly support

uvicorn startup convention

API service production entrypoint:

uvicorn spectral_api.main:app --host 0.0.0.0 --port $PORT --workers $WORKERS

No gunicorn wrapper. Render handles process supervision at the orchestrator layer; uvicorn alone is sufficient when the orchestrator owns restart-on-crash, health checks, and rolling deploys. Worker count is env-configurable per service tier ($WORKERS), sized against plan memory budget.
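
One way to size $WORKERS against a plan's memory budget is to reserve some headroom and divide the remainder by the per-worker footprint. The helper below is a hypothetical sketch; the 20% headroom default and the megabyte figures in the usage note are illustrative assumptions, not measured values for any Render plan.

```python
def uvicorn_workers(
    plan_memory_mb: int, per_worker_mb: int, headroom: float = 0.2
) -> int:
    """Suggest a uvicorn worker count for a given memory budget.

    Reserves `headroom` (a fraction of plan memory) for spikes, then
    fits as many workers as the remainder allows, with a minimum of one.
    """
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_mb = plan_memory_mb * (1 - headroom)
    return max(1, int(usable_mb // per_worker_mb))
```

For instance, with an assumed 512 MB plan and 150 MB per worker, the function suggests two workers; the result would be exported as WORKERS for the service.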

Related

  • ADR-046 — alpha hosting choice (Render).
  • ADR-048 — deployment topology.
  • ADR-053 — CD pipeline orchestration (workflows, migration compat lint, Cloudflare LB flip).
  • ADR-037 — secrets management.
  • ADR-039 — auth substrate.
  • docs/runbooks/secrets-management.md — rotation + audit for runtime secrets delivered via Render Environment Groups.
  • docs/runbooks/edge.md — Cloudflare DNS + TLS + edge-rules + blue/green CNAME-flip mechanism + Pages Function JWKS architecture.
  • docs/runbooks/deployment.md — staging + production deploy flows; 12-step cutover sequence; observability and abort handling.
  • docs/runbooks/rollback.md — production rollback decision tree by failure class.
  • docs/runbooks/legacy-drain.md — legacy-generation outbox-drain protocol used after rollback.
  • infra/render/production.yaml + infra/render/staging.yaml — declarative Render blueprints.
  • .github/deploy-manifest.yml — path-filtered rollout map.