# Deployment runbook
Operational runbook for Spectral’s CD pipeline — staging flow, production cutover sequence, observability, and abort handling. See ADR-053 for the full rationale.
## Decision summary

- Orchestrator: GitHub Actions drives every deploy. Render and Cloudflare are deploy targets, not orchestrators; `autoDeploy: false` across the board.
- Trigger model: push to `main` deploys staging; tag push (`prod-N` or `vX.Y.Z`) deploys production after a same-SHA staging-success gate.
- Composition: composite actions for primitives (Render auth + deploy + poll, version-poll, CNAME flip, KV bind); thin per-target jobs in each workflow file.
- Concurrency: staging cancels in-progress on supersede; production queues, never cancels.
- Cutover mechanism: CNAME flip in Cloudflare DNS with 60-second TTL, pre-lowered ≥ 2 hours before the planned cutover (per `edge.md`).
- Generation correctness: `SPECTRAL_GENERATION` is a per-service env var set at deploy time, not an Env Group entry.
## Trigger model

The deploy + drain workflow files (`deploy-staging.yml`, `deploy-prod.yml`, `drain-legacy-generation.yml`) are the contract per ADR-053 and land under `.github/workflows/` with the deploy substrate. Today only `ci.yml`, `release.yml`, and `generate-sbom.yml` exist.
| Event | Workflow | Environment | Concurrency |
|---|---|---|---|
| push to `main` | `deploy-staging.yml` | staging | `deploy-staging-${{ github.ref }}`, cancel-in-progress |
| push of tag matching `prod-*` | `deploy-prod.yml` | production | `deploy-prod`, queued |
| push of tag matching `v*.*.*` | `deploy-prod.yml` + `release.yml` + `generate-sbom.yml` | production | `deploy-prod`, queued |
| `workflow_dispatch` on `legacy-drain` | `drain-legacy-generation.yml` | (none) | `deploy-prod` (shares prod mutex) |
The production deploy workflow gates on a same-SHA staging-success marker (GH commit-status check) before any deploy step runs. Separate concurrency groups don’t order across each other; the marker provides the ordering primitive.
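As a minimal sketch of how this trigger model and the marker gate could be wired together in `deploy-prod.yml` (the gate step, job layout, and `jq` filter are illustrative assumptions, not the shipped workflow):

```yaml
# deploy-prod.yml (sketch) -- production queues behind the mutex, never cancels
on:
  push:
    tags: ["prod-*", "v*.*.*"]

concurrency:
  group: deploy-prod
  cancel-in-progress: false

jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      # Same-SHA ordering primitive: read the commit status on this tag's
      # SHA and fail fast unless the staging-success marker is present.
      - name: Verify staging-success marker
        run: |
          gh api "repos/${GITHUB_REPOSITORY}/commits/${GITHUB_SHA}/status" \
            --jq '.statuses[] | select(.context == "staging-success") | .state' \
            | grep -qx success
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

All deploy jobs would then declare `needs: gate`, so no deploy step runs before the marker check passes.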
## Staging deploy flow

1. Push to `main`.
2. The workflow's first job parses `.github/deploy-manifest.yml`, runs `git diff` from the last successful staging deploy SHA to the current SHA, maps changed paths to the affected target set via globs, and expands per `coupling` rules. Output: a JSON list of targets.
3. Per-target jobs gate on the affected set via `if:` and run in parallel.
4. Each target job authenticates to Render with the staging-environment `RENDER_API_KEY` via the `render-deploy` composite action and triggers a deploy with `SPECTRAL_GENERATION=<N>` set per-service.
5. The composite action polls the Render API for the deploy until `status='live'` or terminal failure. Hard timeout 25 min per service; no auto-retry on terminal failure.
6. After Render reports live, the workflow polls `/version` on each green Render origin until `commit_sha` matches.
7. Cloudflare Pages targets (`docs-user`, `docs-codex`) deploy via `cloudflare/wrangler-action@v3` with `--branch=staging`.
8. On all-green, the workflow records a commit-status marker (`staging-success`) on the SHA.
Staging is single-color per service: no blue/green pair, no CNAME flip, no hold window. Failures abort cleanly; the next push supersedes the in-flight workflow via `cancel-in-progress: true`.
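The path-to-target mapping in the first job can be sketched as a pure function. The manifest shape below is a hypothetical illustration; the real schema lives in `.github/deploy-manifest.yml`:

```python
from fnmatch import fnmatch

# Hypothetical manifest shape -- illustration only, not the real schema.
MANIFEST = {
    "targets": {
        "api":       ["services/api/**"],
        "workers":   ["services/workers/**"],
        "dashboard": ["apps/dashboard/**"],
    },
    # coupling: deploying one target forces its partners into the set
    "coupling": {"api": ["workers"], "workers": ["api"]},
}

def affected_targets(changed_paths, manifest=MANIFEST):
    """Map changed paths to targets via globs, then expand coupling rules."""
    hit = {
        name
        for name, globs in manifest["targets"].items()
        for path in changed_paths
        if any(fnmatch(path, g) for g in globs)
    }
    # Expand coupling transitively until the set is stable.
    while True:
        extra = {c for t in hit for c in manifest["coupling"].get(t, [])} - hit
        if not extra:
            return sorted(hit)
        hit |= extra
```

A change under `services/api/` would pull in `workers` as well, which is the coupling behavior the manifest encodes.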
## Production cutover sequence

The 12-step sequence below is the contract. Steps 1–5 are abort-safe (nothing visible to users). Steps 6–9 are abort-recoverable (green is broken; do not flip; investigate and rebuild). Step 11 failure triggers the rollback procedure (`rollback.md`).
1. Acquire concurrency mutex. The workflow declares `concurrency: { group: deploy-prod, cancel-in-progress: false }` and queues behind any in-flight prod workflow.
2. Verify staging-success marker. Read the GH commit-status for the tag's commit SHA. Abort with a clear error if absent.
3. Pre-merge dry-run. Invoke `tools/ops/premerge_dryrun.sh` (per ADR-045 D13 and `legacy-drain.md` coordination). Abort on failure.
4. Apply schema. Call the Supabase Management API `POST /v1/branches/{staging_branch_id}/merge`. After the response, query the production project's `supabase_migrations.schema_migrations` and assert the row-count delta matches the pending-migration count. A mismatch indicates a partial apply: abort and run the recovery path documented under "Recovery" below.
5. Allocate generation. Run `INSERT INTO core.deployments (reference, tag, sha, ...) VALUES (...) RETURNING generation`. Capture `<N>` from the returned row. Sequence gaps are normal under aborted deploys (Postgres `nextval()` is non-transactional); do not alert on gaps.
6. Per-service deploy of green. For each target in the affected set, call the Render API `POST /v1/services/{service_id}/deploys` with the environment variable `SPECTRAL_GENERATION=<N>` set per-service. The composite action handles the per-service env var, image build, and pod start. Workers + api deploy in the same generation per `coupling`.
7. Poll the Render API for each green deploy. `GET /v1/services/{service_id}/deploys/{deploy_id}` every 10–15 seconds until `status='live'`. Hard timeout 25 min per service; fail fast on `build_failed` or `update_failed`.
8. Poll `/version` on each green Render origin. Hit `https://<service>-green.onrender.com/version` until the JSON body reports `commit_sha`, `schema_version`, and `generation` all matching expected values. Polling cadence 10s; hard timeout 5 min after Render reports live (allows for cold-start delay).
9. Workers heartbeat verification. Query `core.workers` until workers at the new generation report `state='running'`. Polling cadence 5s; hard timeout 2 min.
10. 30-second sanity check. Query the Render API for recent deploys on every blue service in the affected set and assert no blue service has started a new deploy in the last 30 seconds. This catches the env-group race (an unintended env-group bump kicking off blue redeploys with the new generation).
11. CNAME flip. Update the public CNAMEs in Cloudflare DNS to point at green origins (`app.runspectral.com` → `dashboard-green.onrender.com`, etc.) via the `cloudflare-cname-flip` composite action. The TTL was pre-lowered to 60 seconds ≥ 2 hours before this step (per `edge.md`). Cutover propagation completes for most users within the TTL window; long-tail resolvers may cache for hours, which is why the hold window is measured by traffic drop on blue, not by "DNS propagated."
12. Hold blue warm 24 hours, then sync. Blue continues running for 24 hours post-cutover. After the hold window, sync blue to match green so the next deploy starts from a clean standby. The legacy-drain window for outbox rows at gen-(N-1) coincides with this hold per ADR-046 D8.
## Failure modes during cutover
| Step | Failure mode | Action |
|---|---|---|
| 1–2 | Mutex unavailable / staging-success absent | Abort cleanly; nothing user-visible; restart workflow after fix |
| 3 | Pre-merge dry-run fails | Abort; investigate migration; nothing user-visible |
| 4 | schema_migrations delta mismatch | Abort; run recovery path (below); do not retry blindly |
| 5 | Generation INSERT fails | Abort; investigate core.deployments substrate; gap is harmless |
| 6 | Render deploy build_failed / update_failed | Abort; investigate green build; nothing user-visible (blue still serving) |
| 7 | Render API timeout (25 min) | Abort; investigate Render-side issue; nothing user-visible |
| 8 | /version mismatch persists past timeout | Abort; investigate green pod state; nothing user-visible |
| 9 | Workers heartbeat absent | Abort; investigate worker startup; nothing user-visible |
| 10 | Blue redeploy detected | Abort; investigate env-group state; verify `autoDeploy: false`; do NOT flip |
| 11 | CNAME flip API failure | Trigger rollback per rollback.md |
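The step-10 check behind the "Blue redeploy detected" row can be sketched as a pure function. The input shape (service name to most-recent deploy start time) is a hypothetical distillation of what the Render deploys API returns:

```python
from datetime import datetime, timedelta, timezone

def blue_redeploys_in_window(recent_deploys, now=None, window_s=30):
    """Return blue services that started a deploy inside the window.

    recent_deploys: {service_name: datetime of most recent deploy start}
    (hypothetical shape parsed from the Render deploys API).
    Any hit means an env-group bump likely kicked off blue redeploys
    with the new generation: abort, do NOT flip.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=window_s)
    return sorted(svc for svc, started in recent_deploys.items()
                  if started >= cutoff)
```

An empty return means the sanity check passes and the flip may proceed.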
## Recovery: schema_migrations delta mismatch (step 4)

The Supabase Management API `branches/{name}/merge` is idempotent at the migration-name level (it skips already-applied migrations) but not transactional across migrations. A timeout or 5xx during the merge can leave the production schema in a partial-apply state.
1. Query production `supabase_migrations.schema_migrations` directly to see which migrations applied and which did not.
2. If the missing migrations are forward-compatible with what applied, re-run `branches/{name}/merge`; it skips applied migrations and applies the missing ones.
3. If a migration applied partially mid-statement (rare; this only happens with non-transactional DDL like `CREATE INDEX CONCURRENTLY`), manually fix forward via a follow-up migration, then re-run the merge.
4. Do NOT blindly retry the merge call without inspecting state: the API is idempotent at the migration-name level, but you may be chasing a different failure.
## /version and /version/detail contract

Every Render web service exposes both endpoints.

`/version` (public):

```json
{
  "commit_sha": "3fc271e8a...",
  "schema_version": "20260422170300_core_event_handled",
  "generation": 42,
  "deployed_at": "2026-04-25T14:32:11Z"
}
```

`/version/detail` (auth-gated via the deploy-key registry):

```json
{
  "commit_sha": "3fc271e8a...",
  "short_sha": "3fc271e",
  "describe": "v0.3.0-12-g3fc271e",
  "built_at": "2026-04-25T14:30:42Z",
  "uv_lock_sha": "abc123...",
  "pnpm_lock_sha": "def456...",
  "generation": 42,
  "schema_version": "20260422170300_core_event_handled",
  "service": "api",
  "service_color": "green",
  "deployed_at": "2026-04-25T14:32:11Z"
}
```

The deploy-key registry validates `Authorization: Bearer sk_deploy_<...>` against the env-group registry per ADR-046 D7. Operations users authenticate via session-based auth with `SCOPE_OPERATIONS_READ`.

A contract test asserts the JSON shape is identical across api, dashboard, and operations.
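The shape assertion in that contract test can be sketched as a pure key-set comparison (the expected key set matches the public `/version` example above; fetching the payloads is out of band):

```python
EXPECTED_VERSION_KEYS = {"commit_sha", "schema_version", "generation", "deployed_at"}

def check_version_contract(payloads, expected=EXPECTED_VERSION_KEYS):
    """Assert every service's /version body exposes exactly the same keys.

    payloads: {service_name: parsed JSON body}. Raises AssertionError
    naming the first service whose key set diverges.
    """
    for service, body in payloads.items():
        if set(body) != expected:
            raise AssertionError(
                f"{service}: /version keys {sorted(body)} != {sorted(expected)}"
            )
```

The same function, pointed at `/version/detail` payloads with the larger expected key set, covers the detail endpoint.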
## Pre-deploy DNS TTL pre-lowering

Cloudflare TTL on production CNAMEs is permanently set to 60 seconds. Upstream resolvers ignore TTL changes for hours after the change is made; lowering the TTL “as part of cutover” produces zero benefit at the moment of flip.
When provisioning a new public hostname, set TTL to 60s before the
record receives any non-trivial production traffic. Production CNAMEs
are configured with this TTL from creation per edge.md.
## Related
- ADR-053 — CD pipeline orchestration.
- ADR-048 — deployment topology + generation-stamping (D8), schema-version gate (D4), deploy-key registry (D7), env topology (D10).
- ADR-049 — container strategy + two-tag-lineage (D5).
- ADR-046 — alpha hosting choice + Render PaaS.
- ADR-052 — edge + CNAME-flip mechanism (D5).
- ADR-045 — first-integration validation pass (informs pre-merge dry-run smoke scope).
- `rollback.md` — production rollback decision tree.
- `legacy-drain.md` — legacy-generation drain protocol.
- `hosting.md` — per-deployable hosting map.
- `edge.md` — Cloudflare DNS + TLS + edge-rules.
- `secrets-management.md` — secret rotation and env-group placement principle.
- `disaster-recovery.md` — DR substrate when rollback paths exhaust.