
Deployment runbook

Operational runbook for Spectral’s CD pipeline — staging flow, production cutover sequence, observability, and abort handling. See ADR-053 for the full rationale.

Decision summary

  • Orchestrator: GitHub Actions drives every deploy. Render and Cloudflare are deploy targets, not orchestrators; autoDeploy: false across the board.
  • Trigger model: push to main deploys staging; tag push (prod-N or vX.Y.Z) deploys production after a same-SHA staging-success gate.
  • Composition: composite actions for primitives (Render auth + deploy + poll, version-poll, CNAME flip, KV bind); thin per-target jobs in each workflow file.
  • Concurrency: staging cancels in-progress on supersede; production queues, never cancels.
  • Cutover mechanism: CNAME flip in Cloudflare DNS with 60-second TTL, pre-lowered ≥ 2 hours before the planned cutover (per edge.md).
  • Generation correctness: SPECTRAL_GENERATION is a per-service env var set at deploy time, not an Env Group entry.

Trigger model

The deploy and drain workflow files (deploy-staging.yml, deploy-prod.yml, drain-legacy-generation.yml) are the contract defined in ADR-053 and land under .github/workflows/ together with the deploy substrate. Today only ci.yml, release.yml, and generate-sbom.yml exist.

| Event | Workflow | Environment | Concurrency |
| --- | --- | --- | --- |
| push to main | deploy-staging.yml | staging | deploy-staging-${{ github.ref }}, cancel-in-progress |
| push of tag matching prod-* | deploy-prod.yml | production | deploy-prod, queued |
| push of tag matching v*.*.* | deploy-prod.yml + release.yml + generate-sbom.yml | production | deploy-prod, queued |
| workflow_dispatch on legacy-drain | drain-legacy-generation.yml | (none) | deploy-prod (shares prod mutex) |

The production deploy workflow gates on a same-SHA staging-success marker (GH commit-status check) before any deploy step runs. Separate concurrency groups don’t order across each other; the marker provides the ordering primitive.
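
The gate itself reduces to a commit-status lookup on the tag's SHA. A minimal sketch in Python, assuming the workflow has already fetched the status list for that SHA (the dict shape follows GitHub's commit-status API; the helper name is illustrative):

```python
def staging_gate_passed(statuses, context="staging-success"):
    """Return True iff a success commit status with the expected
    context exists among the statuses reported for the tag's SHA.

    `statuses` is a list of dicts shaped like the items returned by
    GitHub's GET /repos/{owner}/{repo}/commits/{sha}/statuses.
    """
    return any(
        s.get("context") == context and s.get("state") == "success"
        for s in statuses
    )
```

A failed or pending marker (or no marker at all) must abort the production workflow before any deploy step runs.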

Staging deploy flow

  1. Push to main.
  2. Workflow’s first job parses .github/deploy-manifest.yml, runs git diff from the last successful staging deploy SHA to the current SHA, maps changed paths to the affected target set via globs, and expands per coupling rules. Output: a JSON list of targets.
  3. Per-target jobs gate on the affected set via if: and run in parallel.
  4. Each target job authenticates to Render with the staging-environment RENDER_API_KEY via the render-deploy composite action and triggers a deploy with SPECTRAL_GENERATION=<N> set per-service.
  5. The composite action polls the Render API for the deploy until status='live' or terminal failure. Hard timeout 25 min per service; no auto-retry on terminal failure.
  6. After Render reports live, the workflow polls /version on each green Render origin until commit_sha matches.
  7. Cloudflare Pages targets (docs-user, docs-codex) deploy via cloudflare/wrangler-action@v3 with --branch=staging.
  8. On all-green, the workflow records a commit-status marker (staging-success) on the SHA.

Staging is single-color per service; no blue/green pair, no CNAME flip, no hold window. Failures abort cleanly; the next push supersedes the in-flight workflow via cancel-in-progress: true.
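
Steps 2–3 above (manifest parse, diff-to-target mapping, coupling expansion) can be sketched as pure logic. The manifest shape, target names, and path globs below are hypothetical; the real schema lives in .github/deploy-manifest.yml:

```python
import fnmatch

# Hypothetical manifest: target -> {"paths": [globs], "couples": [targets]}
MANIFEST = {
    "api":       {"paths": ["services/api/*"],     "couples": ["workers"]},
    "workers":   {"paths": ["services/workers/*"], "couples": ["api"]},
    "dashboard": {"paths": ["apps/dashboard/*"],   "couples": []},
}

def affected_targets(changed_paths, manifest):
    """Map changed file paths to deploy targets via globs, then expand
    the set per coupling rules (e.g. workers + api deploy together)."""
    hit = {
        name
        for name, spec in manifest.items()
        for path in changed_paths
        if any(fnmatch.fnmatch(path, glob) for glob in spec["paths"])
    }
    # Expand coupled targets transitively until the set is stable.
    frontier = set(hit)
    while frontier:
        coupled = {c for t in frontier for c in manifest[t]["couples"]} - hit
        hit |= coupled
        frontier = coupled
    return sorted(hit)
```

The sorted JSON-serialisable list is what the per-target jobs gate on via if:.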

Production cutover sequence

The 12-step sequence below is the contract. Steps 1–5 are abort-safe (nothing visible to users). Steps 6–9 are abort-recoverable (green is broken; do not flip; investigate and rebuild). Step 11 failure triggers the rollback procedure (rollback.md).

  1. Acquire concurrency mutex. Workflow declares concurrency: { group: deploy-prod, cancel-in-progress: false }. Queued behind any in-flight prod workflow.
  2. Verify staging-success marker. Read the GH commit-status for the tag’s commit SHA. Abort with clear error if absent.
  3. Pre-merge dry-run. Invoke tools/ops/premerge_dryrun.sh (per ADR-045 D13 and legacy-drain.md coordination). Abort on failure.
  4. Apply schema. Call Supabase Management API POST /v1/branches/{staging_branch_id}/merge. After response, query the production project’s supabase_migrations.schema_migrations and assert the row-count delta matches the pending-migration count. Mismatch indicates a partial apply — abort and run the recovery path documented under “Recovery” below.
  5. Allocate generation. Run INSERT INTO core.deployments (reference, tag, sha, ...) VALUES (...) RETURNING generation. Capture <N> from the returned row. Sequence gaps are normal under aborted deploys (Postgres nextval() is non-transactional); do not alert on gaps.
  6. Per-service deploy of green. For each target in the affected set, call Render API POST /v1/services/{service_id}/deploys with environment variable SPECTRAL_GENERATION=<N> set per-service. The composite action handles the per-service env var, image build, and pod start. Workers + api deploy in the same generation per coupling.
  7. Poll Render API for each green deploy. GET /v1/services/{service_id}/deploys/{deploy_id} every 10–15 seconds until status='live'. Hard timeout 25 min per service; fail fast on build_failed or update_failed.
  8. Poll /version on each green Render origin. Hit https://<service>-green.onrender.com/version until the JSON body reports commit_sha, schema_version, and generation all matching expected. Polling cadence 10s; hard timeout 5 min after Render reports live (allows for cold-start delay).
  9. Workers heartbeat verification. Query core.workers until workers at the new generation report state='running'. Polling cadence 5s; hard timeout 2 min.
  10. 30-second sanity check. Query Render API for recent deploys on every blue service in the affected set. Assert no blue service has started a new deploy in the last 30 seconds. This catches the env-group race (an unintended env-group bump kicking off blue redeploys with the new generation).
  11. CNAME flip. Update the public CNAMEs in Cloudflare DNS to point at green origins (app.runspectral.com → dashboard-green.onrender.com, etc.) via the cloudflare-cname-flip composite action. The TTL was pre-lowered to 60 seconds ≥ 2 hours before this step (per edge.md). Cutover propagation completes for most users within the TTL window; long-tail resolvers may cache for hours, which is why the hold window is measured by traffic drop on blue, not by “DNS propagated.”
  12. Hold blue warm 24 hours, then sync. Blue continues running 24 hours post-cutover. After the hold window, sync blue to match green so the next deploy starts from a clean standby. The legacy-drain window for outbox rows at gen-(N-1) coincides with this hold per ADR-046 D8.
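
Step 8's convergence loop can be sketched as follows; fetch, sleep, and clock are injected so the logic is testable offline, and the field names match the public /version contract. The helper name is illustrative, not the composite action's real implementation:

```python
import time

def poll_version(fetch, expected, timeout=300, interval=10,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll a green origin's /version endpoint until commit_sha,
    schema_version, and generation all match the expected values, or
    the hard timeout (5 min after Render reports live) elapses.

    `fetch` wraps the HTTP GET and returns the parsed JSON body.
    """
    deadline = clock() + timeout
    while True:
        body = fetch()
        if all(body.get(k) == expected[k]
               for k in ("commit_sha", "schema_version", "generation")):
            return body
        if clock() >= deadline:
            raise TimeoutError(f"/version never converged; last body: {body!r}")
        sleep(interval)
```

The same pattern (poll, match, hard timeout, fail fast) applies to steps 7 and 9 with different predicates.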

Failure modes during cutover

| Step | Failure mode | Action |
| --- | --- | --- |
| 1–2 | Mutex unavailable / staging-success absent | Abort cleanly; nothing user-visible; restart the workflow after the fix |
| 3 | Pre-merge dry-run fails | Abort; investigate the migration; nothing user-visible |
| 4 | schema_migrations delta mismatch | Abort; run the recovery path (below); do not retry blindly |
| 5 | Generation INSERT fails | Abort; investigate the core.deployments substrate; the sequence gap is harmless |
| 6 | Render deploy build_failed / update_failed | Abort; investigate the green build; nothing user-visible (blue still serving) |
| 7 | Render API timeout (25 min) | Abort; investigate the Render-side issue; nothing user-visible |
| 8 | /version mismatch persists past timeout | Abort; investigate green pod state; nothing user-visible |
| 9 | Workers heartbeat absent | Abort; investigate worker startup; nothing user-visible |
| 10 | Blue redeploy detected | Abort; investigate env-group state; verify autoDeploy: false; do NOT flip |
| 11 | CNAME flip API failure | Trigger rollback per rollback.md |

Recovery: schema_migrations delta mismatch (step 4)

Supabase Management API branches/{name}/merge is idempotent at the migration-name level (skips already-applied migrations) but not transactional across migrations. A timeout or 5xx during merge can leave the production schema in a partial-apply state.

  1. Query production supabase_migrations.schema_migrations directly to see which migrations applied and which did not.
  2. If the missing migrations are forward-compat with what applied, re-run branches/{name}/merge — it skips applied and applies missing.
  3. If a migration applied partially mid-statement (rare; only happens with non-transactional DDL like CREATE INDEX CONCURRENTLY), manually fix forward via a follow-up migration, then re-run merge.
  4. Do NOT blindly retry the merge call without inspecting state — the API is idempotent at the migration-name level, but you may be chasing a different failure.
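
The choice between re-running the merge and stopping to inspect can be framed as a comparison of the pending-migration set against the before/after rows in schema_migrations. A simplified, illustrative classifier (the real decision also weighs forward-compatibility, which this sketch cannot see):

```python
def merge_recovery_action(pending, applied_before, applied_after):
    """Classify the outcome of a branches/{name}/merge call from the
    production schema_migrations rows before and after the call.

    "ok"      -> every pending migration landed.
    "rerun"   -> a clean prefix of the pending set applied; the merge
                 endpoint skips already-applied names, so re-invoking
                 is safe once the missing ones are confirmed compatible.
    "inspect" -> unexpected rows, or nothing applied at all: inspect
                 state before touching the API again.
    """
    newly = set(applied_after) - set(applied_before)
    missing = [m for m in pending if m not in newly]
    if not missing:
        return "ok"
    if newly and newly <= set(pending):
        return "rerun"
    return "inspect"
```

Nothing-applied deliberately maps to "inspect" rather than "rerun", matching the rule above about not retrying blindly.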

/version and /version/detail contract

Every Render web service exposes both endpoints.

/version (public):

```json
{
  "commit_sha": "3fc271e8a...",
  "schema_version": "20260422170300_core_event_handled",
  "generation": 42,
  "deployed_at": "2026-04-25T14:32:11Z"
}
```

/version/detail (auth-gated via deploy-key registry):

```json
{
  "commit_sha": "3fc271e8a...",
  "short_sha": "3fc271e",
  "describe": "v0.3.0-12-g3fc271e",
  "built_at": "2026-04-25T14:30:42Z",
  "uv_lock_sha": "abc123...",
  "pnpm_lock_sha": "def456...",
  "generation": 42,
  "schema_version": "20260422170300_core_event_handled",
  "service": "api",
  "service_color": "green",
  "deployed_at": "2026-04-25T14:32:11Z"
}
```

The deploy-key registry validates Authorization: Bearer sk_deploy_<...> against the env-group registry per ADR-046 D7. Operations users authenticate via session-based auth with SCOPE_OPERATIONS_READ.
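
A hedged sketch of the bearer-key check, assuming the registry is materialised as a set of valid keys (the function name and registry shape are illustrative). The comparison is constant-time via hmac.compare_digest so timing does not leak key prefixes:

```python
import hmac

def authorize_version_detail(auth_header, registry):
    """Check an Authorization header against the deploy-key registry.

    `registry` is the set of valid sk_deploy_<...> keys sourced from
    the env group. Returns True only for a well-formed Bearer header
    carrying a key present in the registry.
    """
    prefix = "Bearer "
    if not auth_header or not auth_header.startswith(prefix):
        return False
    token = auth_header[len(prefix):]
    if not token.startswith("sk_deploy_"):
        return False
    # Constant-time comparison against every registered key.
    return any(hmac.compare_digest(token, key) for key in registry)
```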

A contract test asserts the JSON shape is identical across api, dashboard, and operations.
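
The contract test's core assertion can be sketched as a key-set comparison across the per-service JSON bodies (helper name illustrative; values naturally differ per service, only the shape must match):

```python
def assert_same_shape(bodies):
    """Assert that the /version JSON bodies keyed by service name
    (api, dashboard, operations) expose an identical key set."""
    shapes = {service: frozenset(body) for service, body in bodies.items()}
    assert len(set(shapes.values())) == 1, f"shape drift: {shapes}"
```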

Pre-deploy DNS TTL pre-lowering

Cloudflare TTL on production CNAMEs is permanently set to 60 seconds. Upstream resolvers keep honouring the TTL that was in effect when they cached the record, so they can ignore a TTL change for hours; lowering the TTL “as part of cutover” therefore provides zero benefit at the moment of the flip.

When provisioning a new public hostname, set TTL to 60s before the record receives any non-trivial production traffic. Production CNAMEs are configured with this TTL from creation per edge.md.
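
The precondition step 11 of the cutover relies on can be expressed as a small guard, illustrative only, assuming the workflow records when the TTL was last lowered:

```python
from datetime import datetime, timedelta, timezone

def cname_flip_ttl_ok(ttl_seconds, ttl_lowered_at, now=None):
    """Gate for the CNAME flip: the TTL must already be 60s and must
    have been lowered at least 2 hours before the flip, since upstream
    resolvers keep serving the previously cached TTL for a while."""
    now = now or datetime.now(timezone.utc)
    return ttl_seconds <= 60 and now - ttl_lowered_at >= timedelta(hours=2)
```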

References

  • ADR-053 — CD pipeline orchestration.
  • ADR-048 — deployment topology + generation-stamping (D8), schema-version gate (D4), deploy-key registry (D7), env topology (D10).
  • ADR-049 — container strategy + two-tag-lineage (D5).
  • ADR-046 — alpha hosting choice + Render PaaS.
  • ADR-052 — edge + CNAME-flip mechanism (D5).
  • ADR-045 — first-integration validation pass (informs pre-merge dry-run smoke scope).
  • rollback.md — production rollback decision tree.
  • legacy-drain.md — legacy-generation drain protocol.
  • hosting.md — per-deployable hosting map.
  • edge.md — Cloudflare DNS + TLS + edge-rules.
  • secrets-management.md — secret rotation and env-group placement principle.
  • disaster-recovery.md — DR substrate when rollback paths exhaust.