ADR-053: CD pipeline orchestration — composite actions, 12-step cutover, migration-compat lint, release-only changelog
Status: Accepted (2026-04-24)
Context
ADR-048 (deployment topology), ADR-046 (hosting), ADR-049 (containers), and ADR-052 (edge) all landed before this ADR. TA-26 defines how deploys are triggered, sequenced, gated, and rolled back.
Much of the scope was pre-shaped by upstream bookkeeping comments:
- ADR-048: workflow machinery, generation counter, legacy-drain, schema-version gate, migration-compat lint, pre-merge dry-run, Cloudflare Pages, deploy keys.
- ADR-049: two tag lineages, git-cliff workflow, GH Release policy, product-version cadence, Dependabot app-dep extension.
- ADR-052: CNAME-flip step, KV provisioning, Cache-Control contract test, CF API token, CNAME rollback.
An adversarial pass surfaced 13 concrete operator-pain failure modes: five fold into the cutover-sequence ordering, one triggers a material refinement to ADR-049 D5 (SPECTRAL_GENERATION placement), and the rest drive specific quality lints and workflow gates.
Decision
D1 — Workflow composition: composite actions for primitives + thin per-target jobs
Composite actions at .github/actions/<name>/action.yml for the small primitives (Render auth + deploy + poll, version-poll, CNAME flip, KV bind, env-group helpers). One job per deployable target in each top-level workflow, each declaring its own environment: block. Reusable workflows (workflow_call) NOT used. Matrix job NOT used.
D2 — Concurrency model: separate groups; prod queued; staging cancel-on-supersede
- Staging: concurrency: { group: deploy-staging-${{ github.ref }}, cancel-in-progress: true }. Latest commit wins.
- Production: concurrency: { group: deploy-prod, cancel-in-progress: false }. Queued, never canceled.
- Production gated on same-SHA staging-success marker (commit-status check). GH concurrency does not order across separate groups; the gate provides the ordering primitive.
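The same-SHA gate can be sketched as a check over the commit statuses recorded for the tagged SHA. This is a minimal illustration, not the workflow's actual code: the context name `deploy/staging` is a hypothetical marker name (the ADR does not name it), and the input is assumed to be ordered newest-first, which is how GitHub's list-commit-statuses endpoint returns entries.

```python
# Hypothetical context name for the staging-success marker; the ADR does not name it.
STAGING_CONTEXT = "deploy/staging"


def staging_marker_present(statuses, context=STAGING_CONTEXT):
    """Return True if the newest status for `context` on this SHA is a success.

    `statuses` is the list of commit statuses for one SHA, newest first.
    """
    for status in statuses:
        if status["context"] == context:
            # Only the most recent entry for the context counts.
            return status["state"] == "success"
    return False  # No marker at all: the production job must abort.
```

Taking only the newest entry means a failed staging re-run on the same SHA withdraws the marker, which seems the safer reading of "staging-success".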
D3 — Path-filtered rollout via .github/deploy-manifest.yml
The workflow’s first job parses the manifest, runs git diff from deployed_sha to github.sha, maps changed paths to an affected target set via globs, expands per coupling rules (api ↔ workers), produces outputs.targets JSON. Per-target jobs gate via if: contains(fromJSON(needs.changes.outputs.targets), '<name>'). Custom resolver matches the existing manifest schema; dorny/paths-filter rejected (does not express coupling rules cleanly).
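The resolver's core logic (glob matching followed by coupling expansion) can be sketched as follows. The manifest shape below is an assumption for illustration: the `targets`, `paths`, and `coupled` key names and the example globs are hypothetical and may not match the real .github/deploy-manifest.yml schema. Python's `fnmatch` is used as a stand-in glob matcher; its `*` also matches `/`.

```python
import json
from fnmatch import fnmatch

# Hypothetical manifest shape; key names and globs are illustrative only.
MANIFEST = {
    "targets": {
        "api": {"paths": ["apps/api/*", "src/spectral/*"]},
        "workers": {"paths": ["apps/workers/*"]},
        "dashboard": {"paths": ["apps/dashboard/*"]},
    },
    # Coupling rule from D3: api and workers always deploy together.
    "coupled": [["api", "workers"]],
}


def resolve_targets(changed_paths, manifest):
    """Map changed paths to affected targets, then expand coupling groups."""
    affected = set()
    for name, target in manifest["targets"].items():
        if any(fnmatch(p, g) for p in changed_paths for g in target["paths"]):
            affected.add(name)
    # Repeat until stable so that chained couplings are also honoured.
    grew = True
    while grew:
        grew = False
        for group in manifest["coupled"]:
            members = set(group)
            if affected & members and not members <= affected:
                affected |= members
                grew = True
    return sorted(affected)


# The JSON string a first job could publish as outputs.targets.
targets_json = json.dumps(resolve_targets(["apps/workers/loop.py"], MANIFEST))
```

In a real run, `changed_paths` would come from the git diff between deployed_sha and github.sha; here it is passed in directly so the mapping can be tested in isolation.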
D4 — Pre-merge dry-run gate
Workflow:
- Create branch via Supabase Management API POST /v1/projects/{ref}/branches.
- Apply pending migrations to branch.
- Assert schema_migrations row-count delta matches expected (catches partial-apply on retry).
- Run smoke-test suite against branch URL (empty at alpha; auto-engages when the ADR-045 D13 first-integration suite lands).
- Delete branch.
- Any failure → abort merge; report.
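The row-count assertion in the third step amounts to comparing the number of newly recorded migrations with the number that were pending. A minimal sketch, assuming the caller has already counted schema_migrations rows before and after the apply (the function and parameter names are hypothetical):

```python
def assert_migration_delta(rows_before, rows_after, pending_migrations):
    """Raise if the schema_migrations growth differs from the pending count.

    A smaller delta suggests a partial apply; a larger one suggests that
    something other than the pending set was applied.
    """
    expected = len(pending_migrations)
    actual = rows_after - rows_before
    if actual != expected:
        raise RuntimeError(
            f"schema_migrations delta {actual} != expected {expected}"
        )
    return actual
```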
D5 — Migration-compat lint (tools/quality/check_migration_compat.py)
Rejects in supabase/migrations/*.sql:
- DROP COLUMN
- DROP TABLE
- ALTER COLUMN ... TYPE to incompatible target
- ADD COLUMN ... NOT NULL without DEFAULT
- ADD ... UNIQUE constraint on populated column
Override: -- compat: breaking (reason: <reason>) marker on the file. The override forces explicit human review; an unannotated breaking change blocks PR. The V1-against-V2-schema corner is prevented at PR time, not deploy time.
Wired into ci.yml (PR + push-main) and tools/dev/precheck.sh. Lives alongside check_migration_naming.py.
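A regex-based sketch of the rejection rules is shown below. It is not the contents of check_migration_compat.py; it is an illustration with stated simplifications: every ALTER COLUMN ... TYPE is flagged (type compatibility is not analysed), every ADD ... UNIQUE is flagged (whether the column is populated cannot be known from SQL text alone), and SQL inside comments or string literals is not excluded.

```python
import re

# Statement classes D5 rejects, in simplified form (see the note above).
BREAKING = [
    ("DROP COLUMN", re.compile(r"\bDROP\s+COLUMN\b", re.I)),
    ("DROP TABLE", re.compile(r"\bDROP\s+TABLE\b", re.I)),
    ("ALTER COLUMN ... TYPE",
     re.compile(r"\bALTER\s+COLUMN\s+\S+\s+(SET\s+DATA\s+)?TYPE\b", re.I)),
    ("ADD ... UNIQUE", re.compile(r"\bADD\b[^;]*\bUNIQUE\b", re.I)),
]
# One ADD COLUMN clause, up to the next comma or semicolon. Types such as
# numeric(10,2) will cut the clause short; acceptable for a sketch.
ADD_COLUMN = re.compile(r"\bADD\s+COLUMN\b[^;,]*", re.I)
# Override marker; a non-empty reason is required.
OVERRIDE = re.compile(r"--\s*compat:\s*breaking\s*\(reason:\s*\S[^)]*\)", re.I)


def check_migration(sql):
    """Return the labels of breaking statements found; [] means the file passes."""
    if OVERRIDE.search(sql):
        return []  # Annotated: left to explicit human review.
    findings = [label for label, pattern in BREAKING if pattern.search(sql)]
    for match in ADD_COLUMN.finditer(sql):
        clause = match.group(0).upper()
        if "NOT NULL" in clause and "DEFAULT" not in clause:
            findings.append("ADD COLUMN ... NOT NULL without DEFAULT")
            break
    return findings
```

As the ADR notes, a marker check like this can only enforce that the reason is present, not that it is a good one.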
D6 — /version and /version/detail contract
Refines ADR-048 D7. Both endpoints on every Render web service.
/version (public): { commit_sha, schema_version, generation, deployed_at }. JSON. Used by deploy verification + simple operator inspection.
/version/detail (auth-gated via deploy-key registry): full version.json from ADR-049 D5 build script + runtime generation + service identity.
Contract test asserts JSON shape across api / dashboard / operations.
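The shape assertion for the public /version endpoint could look roughly like this. The four key names come from D6; treating any extra key as a contract violation is an assumption of this sketch, and value types are deliberately not checked because the ADR does not specify them.

```python
# Keys of the public /version payload, as listed in D6.
REQUIRED_KEYS = {"commit_sha", "schema_version", "generation", "deployed_at"}


def version_shape_errors(service, payload):
    """Return human-readable contract violations for one service's payload."""
    keys = set(payload)
    missing = sorted(REQUIRED_KEYS - keys)
    extra = sorted(keys - REQUIRED_KEYS)  # Assumption: extras are violations.
    return (
        [f"{service}: missing {k}" for k in missing]
        + [f"{service}: unexpected {k}" for k in extra]
    )
```

A contract test would call this once per service (api, dashboard, operations) and fail if any list is non-empty.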
D7 — SPECTRAL_GENERATION placement: per-service env var
Material refinement of ADR-049 D5 step 4. SPECTRAL_GENERATION set per-service via Render API at deploy time, NOT in the env group. Other shared values (DB URL, Sentry DSN, OTel endpoint, deploy-key registry) stay in the env group per ADR-037 / ADR-046.
Eliminates env-group auto-redeploy hazard structurally rather than via autoDeploy=false discipline. Materially mitigates ADR-049 Race C scope for generation specifically — generation is now baked into the service deploy; pod crash-restart sees the same generation.
D8 — Env-group placement principle
| Class | Location | Why |
|---|---|---|
| Code-coupled (correctness depends on running image) | Per-service env var, set at deploy | Atomic with image; rolling-restart safe |
| Operationally shared, infrequent rotation, tolerates rolling-restart inconsistency | Render Env Group | ADR-037 / ADR-046 secrets posture |
| Per-service operational config | Per-service env var | Logically per-service; future-proofs tuning |
Sweep against known values:
- Per-service: SPECTRAL_GENERATION, HANDLER_MAX, SPECTRAL_DRAIN_COOLING_SECONDS, reaper interval, claim TTL
- Env group: SUPABASE_URL, SUPABASE_ANON_KEY, SUPABASE_SERVICE_ROLE_KEY, SENTRY_DSN, OTEL_EXPORTER_ENDPOINT, sk_deploy_* registry
- Image-baked (strongest form): schema_version, build metadata
- GH-side: Cloudflare API token, Render API key, Supabase Management PAT
D9 — Production cutover sequence (12 steps)
For tag push (prod-N or v*.*.*):
1. Acquire concurrency: deploy-prod mutex.
2. Verify same-SHA staging-success marker; abort if absent.
3. Pre-merge dry-run per D4. Abort on failure.
4. Apply schema: Supabase Management API branches/staging/merge → main. Assert schema_migrations row-count delta matches expected.
5. Allocate generation: INSERT INTO core.deployments RETURNING generation. Capture <N>.
6. Per-service deploy of green (Render API authenticated, NOT deploy hook): for each affected target, deploy with SPECTRAL_GENERATION=<N> set per-service. Workers + api in same generation.
7. Poll Render API for each green deploy until status='live'. Hard timeout 25 min. Fail fast on build_failed / update_failed.
8. Poll /version on each green Render origin until {commit_sha, schema_version, generation} all match expected.
9. Workers heartbeat verification: poll core.workers until workers at new generation report state='running'.
10. 30 s sanity check: assert no blue service has started a new deploy in the last 30 s (env-group race detection).
11. CNAME flip (per ADR-052 D5): public CNAMEs point at green origins. TTL must be pre-lowered to 60 s at least 2 hours before this step.
12. Hold blue warm 24 h (D11), then sync blue to green.
Failure modes:
- Steps 1–5: nothing visible to users; abort cleanly, restart workflow after fix.
- Steps 6–9: green is broken; do NOT flip; investigate; legacy-drain not needed.
- Step 11: rollback per D14.
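The /version verification step of the sequence can be sketched as a pure comparison over the green origins. The three compared keys are the ones named in the step; `fetch_version` is a hypothetical injected callable (for example, an HTTP GET that returns the parsed JSON), so the matching logic can be exercised without a network.

```python
# Fields that must equal the expected values before the CNAME flip.
MATCH_KEYS = ("commit_sha", "schema_version", "generation")


def mismatched_origins(fetch_version, origins, expected):
    """Return the origins whose /version payload does not yet match `expected`.

    `fetch_version(origin)` is a caller-supplied function returning a dict.
    """
    pending = []
    for origin in origins:
        payload = fetch_version(origin)
        if any(payload.get(key) != expected[key] for key in MATCH_KEYS):
            pending.append(origin)
    return pending
```

A polling loop would repeat this until the returned list is empty, and treat a timeout as a green-is-broken failure, so the flip never happens.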
D10 — Render API call discipline
Authenticated POST /v1/services/{id}/deploys + poll GET /v1/services/{id}/deploys/{deployId} until status='live'. Deploy hooks NOT used — fire-and-forget, return 200 even when later builds fail. Composite action render-deploy wraps auth + deploy + poll. Hard timeout 25 min per service; no auto-retry on terminal failure.
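The poll half of the render-deploy action can be sketched as a loop with a hard deadline and fail-fast on terminal states. This is an illustration only: `get_status` is a hypothetical callable that would wrap the authenticated GET and return the deploy's status string, the 10 s interval is an assumed value, and only the two failure states named in this ADR are treated as terminal.

```python
import time

# Terminal failure states named in D9 step 7; other states keep polling.
TERMINAL_FAILURES = {"build_failed", "update_failed"}


def wait_until_live(get_status, timeout_s=25 * 60, interval_s=10,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll `get_status()` until it returns 'live'.

    Raises RuntimeError on a terminal failure (no auto-retry) and
    TimeoutError once the hard timeout has elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "live":
            return status
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"deploy ended in {status}; not retrying")
        if clock() >= deadline:
            raise TimeoutError(
                f"deploy not live after {timeout_s} s (last status: {status})"
            )
        sleep(interval_s)
```

Injecting `clock` and `sleep` keeps the loop testable without waiting in real time.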
D11 — Hold window: 24 h fixed at alpha
Blue stays warm for 24 h post-cutover. Cutover completion measured by core.workers heartbeat at new generation + /version reflecting expected SHA — not by “DNS propagated” (upstream resolvers cache hours-to-day regardless of TTL). After 24 h, sync blue to green.
Forward trigger: traffic-driven measurement when traffic volume produces meaningful “blue traffic dropped below X%” signal in <24 h.
D12 — Cloudflare Pages deploy via cloudflare/wrangler-action@v3
Production deployments via wrangler pages deploy ./dist --project-name=<project> --branch=main. Preview deployments per PR via branch name. Custom domains attached once via dashboard; production-branch deploys flow through automatically. Custom-domain cert is a pre-flight gate (workflow asserts cert green-light before considering Pages “live”).
D13 — Legacy-drain workflow contract
drain-legacy-generation.yml with workflow_dispatch input target_generation: integer.
- Read core.deployments for reference at target_generation. Abort if not found.
- Checkout repo at that reference.
- Deploy temporary Render worker service workers-drain-gen-<N> with SPECTRAL_GENERATION=<N>, SPECTRAL_DRAIN_AND_EXIT=true, optional SPECTRAL_DRAIN_COOLING_SECONDS override.
- Monitor authoritative drain-complete signal: log line "drain complete, exiting". Fallback: poll core.outbox for zero pending+in_flight at target generation.
- On signal: Render API delete temporary service.
- On workflow failure mid-drain: temporary service stays up; runbook documents manual cleanup.
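The drain-complete decision (authoritative log line first, outbox counts as fallback) might be expressed like this. The function and parameter names are hypothetical; how log lines and outbox counts are actually fetched is outside the sketch.

```python
# Authoritative signal emitted by the draining worker, per D13.
DRAIN_LOG_LINE = "drain complete, exiting"


def drain_complete(log_lines, pending=None, in_flight=None):
    """Decide whether the temporary drain service can be deleted.

    `pending` and `in_flight` are core.outbox row counts at the target
    generation, or None when the fallback query has not been run.
    """
    if any(DRAIN_LOG_LINE in line for line in log_lines):
        return True
    # Fallback: both counts must be known and zero.
    return pending == 0 and in_flight == 0
```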
D14 — Rollback decision tree
1. Cutover incomplete (CNAME not flipped): abort; blue still serving; investigate green; rebuild as needed. No legacy-drain.
2. Post-cutover behavior-only issue: flip CNAME back to blue. 60 s TTL bounds the window. Blue warm during hold. Outbox at gen-N drains naturally.
3. Post-cutover deploy-generation-specific data issue: redeploy prior code at new generation N+2 tagged vX.Y.Z-rollback. Stranded gen-(N+1) outbox rows drained via drain-legacy-generation.yml with target_generation=N+1.
4. Migration-caused issue: migrations are forward-only (per ADR-032 D4) + expand/contract (per ADR-048 D4) + compat-linted (D5). Old code works against new schema by design. Rollback is code-level only via path 3.
5. Past Render image retention OR upstream-yanked dep: declare DR per ADR-040.
The V1-against-V2-schema corner is prevented at PR time by D5 lint; rollback path stays simple.
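The decision tree can be encoded as a small function, which makes the branching explicit. One interpretation is assumed and should be read as such: path 5 is taken when a code-level redeploy (path 3) is needed but the prior build can no longer be reproduced. The enum and parameter names are hypothetical.

```python
from enum import Enum


class Issue(Enum):
    BEHAVIOR_ONLY = "behavior-only"
    GENERATION_DATA = "deploy-generation-specific data"
    MIGRATION = "migration-caused"


def rollback_path(cname_flipped, issue, prior_build_reproducible=True):
    """Return the D14 path number to execute (1, 2, 3, or 5).

    Path 4 (migration-caused) never appears as a result because D14
    routes it through path 3.
    """
    if not cname_flipped:
        return 1  # Abort; blue still serving; no legacy-drain.
    if issue is Issue.BEHAVIOR_ONLY:
        return 2  # Flip CNAME back to blue.
    # Data and migration issues both need a code-level redeploy.
    if not prior_build_reproducible:
        return 5  # Assumed trigger for declaring DR per ADR-040.
    return 3  # Redeploy prior code at N+2; drain gen N+1.
```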
D15 — Product-version tag policy: curated cadence
Cofounder cuts vX.Y.Z at feature-bundle moments. No automatic version bump on prod deploys. prod-N tags are mechanical (every prod deploy); vX.Y.Z are curated. Two tag lineages coexist per ADR-049 D5.
D16 — Release-page-only changelog (no CHANGELOG.md commit-back)
git-cliff output goes to GH Release body only via step-output piping. CHANGELOG.md does NOT exist in repo. Drops the workflow-loop hazard (commit-back lands on main and triggers downstream workflows), drops write-token requirement, drops bot-author commits in history. The GH Releases page IS the canonical changelog.
Forward trigger: published changelog page on docs-user → revisit (cliff renders server-side without committing to repo).
D17 — GH Environments + protection rules
Two environments at TA-26 disposition time: staging, production. ADR-062 (TA-25) added a third (test-live) for the nightly LLM live-drift workflow.
Staging: secrets RENDER_API_KEY (staging-scoped), SUPABASE_MANAGEMENT_PAT (staging-scoped), CLOUDFLARE_API_TOKEN. Protection: none.
Production: same secret names, prod-scoped values (environment-secrets-not-repo-secrets). Protection: required reviewer = self; deployment-tags rule with explicit ref type: tag set, restricted to prod-* and v*.*.*; no wait timer at alpha.
Forward triggers: required reviewer = peer when team grows; wait timer when coordination matters.
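The effect of the production deployment-tags rule can be approximated locally. This is only a sketch of the intent: GitHub evaluates the rule itself, and its pattern semantics may differ in detail from Python's `fnmatchcase`, which is used here as a stand-in.

```python
from fnmatch import fnmatchcase

# Tag patterns allowed to deploy to production, per D17.
ALLOWED_TAG_PATTERNS = ("prod-*", "v*.*.*")


def may_deploy_to_production(ref_type, ref_name):
    """True only for tag refs whose name matches an allowed pattern."""
    if ref_type != "tag":
        return False  # Branch refs are never eligible.
    return any(fnmatchcase(ref_name, p) for p in ALLOWED_TAG_PATTERNS)
```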
D18 — Preview environments per PR
- Database: Supabase branching via GH integration handles automatically.
- Docs: Cloudflare Pages preview deploys per branch automatic.
- Apps (api/dashboard/operations): Render preview environments NOT in scope at alpha.
D19 — Deploy-manifest coverage lint (tools/quality/check_deploy_manifest_coverage.py)
Asserts every directory under apps/ and src/spectral/ is mapped to at least one target’s paths glob in .github/deploy-manifest.yml, or declared non_deployed:. Catches silent-non-deploy. Runs in ci.yml.
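The coverage check reduces to "does any target glob match something inside this directory, or is the directory declared non_deployed". A minimal sketch under assumptions: the `targets` top-level key is hypothetical (only `paths` and `non_deployed` are named in this ADR), directory listing is left to the caller, and coverage is tested by matching a synthetic probe path inside each directory.

```python
from fnmatch import fnmatch


def uncovered_dirs(dirs, manifest):
    """Return directories neither matched by a target glob nor declared non_deployed."""
    globs = [g for target in manifest["targets"].values() for g in target["paths"]]
    declared = set(manifest.get("non_deployed", []))
    missing = []
    for directory in dirs:
        if directory in declared:
            continue
        # A synthetic file inside the directory stands in for its contents.
        probe = f"{directory}/__probe__"
        if not any(fnmatch(probe, g) for g in globs):
            missing.append(directory)
    return sorted(missing)
```

A non-empty result is the silent-non-deploy case: code that changes but would never trigger a target.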
D20 — Substrate cleanup
- Delete .github/workflows/nightly-backup.yml: ADR-048 / ADR-049 moved backup-nightly to a Render cron running tools/ops/backup/backup-nightly.sh. The GH Actions workflow is superseded substrate. Operational provisioning of the Render cron lives in SPEC-330.
- Update .github/workflows/ci.yml: replace mypy --strict invocation with uv run ty check to match the established gate per ADR-051. Wire D5 + D19 lints.
Alternatives considered
Reusable workflows (workflow_call). Rejected; primitive-shape mismatch + environment-scoped secrets edge case.
Matrix job over targets. Rejected at alpha; couples per-target approval; chosen only if target count grows.
Render deploy hooks. Rejected; fire-and-forget; 200-on-later-failure.
SPECTRAL_GENERATION in env group (the ADR-049 D5 literal text). Rejected per D7.
Cancel-in-progress on prod concurrency. Rejected; mid-cutover cancel is worst-case.
Single concurrency group across staging+prod. Rejected; ordering not guaranteed within group; separate-with-marker is the pattern.
dorny/paths-filter as resolver. Rejected; does not express coupling rules.
Postgres advisory lock around generation allocation. Rejected at alpha; workflow concurrency suffices.
CHANGELOG.md commit-back. Rejected per D16.
Render preview environments per PR. Rejected per D18.
pgroll for migration management. Deferred; D5 lint is the alpha-tier safety net; pgroll is a substrate replacement, forward-trigger if migrations consistently complex.
Consequences
- Cutover sequence is explicit, ordered, and validated against 13 known operator failure modes.
- Five safety items folded structurally into the workflow (Render API not hooks; schema_migrations delta assert; 30 s pre-flip race check; pre-lowered TTL; traffic-drop measurement).
- Migration-compat lint catches V1-vs-V2 corner at PR time.
- Deploy-manifest coverage lint catches silent-non-deploy.
- Release-only changelog removes the workflow-loop hazard.
- Per-service SPECTRAL_GENERATION removes the env-group race entirely for generation.
- Pre-merge dry-run useful from day 0; lights up further on integration-suite landing.
- All deploy secrets environment-scoped — leak-resistant structurally.
- 24 h hold window is fixed (could be over- or under-sized) until traffic-driven measurement lights up.
- Release-only changelog: no CHANGELOG.md in repo; readers must use GH Releases.
- Migration-compat override marker requires human discipline (the lint enforces presence, not correctness of reason:).
- Composite-action approach means flatter logs than reusable workflows produce.
- Pre-merge dry-run smoke-test scope is empty at alpha. Real risk: migrations that pass apply but break in app code; only mitigated when ADR-045 D13 lands.
- Cloudflare Pages first-deploy custom-domain cert stall — workflow has a pre-flight but the pre-flight can itself stall. Mitigation: provision domains ≥7 days before relevant tag push (per ADR-052 doctrine).
References
- ADR-012: tiered hooks; mypy → ty per ADR-051
- ADR-065: spectral.core admission discipline
- ADR-032: forward-only migrations
- ADR-037: env-group placement principle ratified
- ADR-040: DR runbook for path 5 rollback
- ADR-046: Render PaaS; TA-21
- ADR-048: generation stamping; deploy manifest
- ADR-049: D5 SPECTRAL_GENERATION placement correction (via D7 here)
- ADR-051: ci.yml ty check swap (D20)
- ADR-052: CNAME-flip step (D9 step 11); Cache-Control contract test
- ADR-062: test-live Environment extension (third Environment)
- TA-26 disposition: SPEC-329 comment 2089020b
- TA-26 verification: SPEC-329 comment 1f7d85c0
- TA-25 Environment scoping confirmation: SPEC-329 comment 0309025f
- tools/quality/check_migration_compat.py: D5 lint
- tools/quality/check_deploy_manifest_coverage.py: D19 lint
- docs/runbooks/deployment.md: 12-step cutover runbook
- docs/runbooks/rollback.md: rollback decision tree
- docs/runbooks/legacy-drain.md: legacy-drain runbook
- Codex system-design/topology/infrastructure/cd-pipeline-overview.mdx: close-pass new page
Addendum: ADR-021 — CI redesign from scratch
ADR-021 (Accepted 2026-04-20; retired by this ADR) ratified that CI for the 0.3.0 rebuild would be designed correctly from the first commit rather than ported from the v0.2 configuration. The v0.2 CI had been added late, immediately disabled, and never operated in its intended form; the rebuild was the opportunity to set automatic-trigger PR gates, enforced coverage thresholds, and integration-test acceptance criteria as first-class requirements rather than aspirational targets.
Why a future reader should know about ADR-021:
- The “PR gates from day one” posture is preserved here and in ADR-062: no merge to main without passing CI; integration-test coverage is a non-negotiable acceptance criterion for any path between contexts.
- The “Done != Done” failure mode that ADR-021 named (tickets marked complete without integration coverage) is structurally prevented by the per-epic Definition of Done in AGENTS.md and the integration-test AC rule in the Epic Template & DoD.
- CI substrate specifics (composite actions, the 12-step cutover, migration-compat lint, release-only changelog) are settled by this ADR. CI secrets handling (fork-PR safety, environment scoping) is settled by ADR-062.
Git history at the commit retiring ADR-021 preserves the original text.