ADR-053: CD pipeline orchestration — composite actions, 12-step cutover, migration-compat lint, release-only changelog

Status: Accepted (2026-04-24)

Context

ADR-048 (deployment topology), ADR-046 (hosting), ADR-049 (containers), and ADR-052 (edge) all landed before this ADR. TA-26 defines how deploys are triggered, sequenced, gated, and rolled back. Much of the scope was pre-shaped by upstream bookkeeping comments:

  • ADR-048: workflow machinery, generation counter, legacy-drain, schema-version gate, migration-compat lint, pre-merge dry-run, Cloudflare Pages, deploy keys.
  • ADR-049: two tag lineages, git-cliff workflow, GH Release policy, product-version cadence, Dependabot app-dep extension.
  • ADR-052: CNAME-flip step, KV provisioning, Cache-Control contract test, CF API token, CNAME rollback.

An adversarial pass surfaced 13 concrete operator-pain failure modes: five fold into the cutover-sequence ordering, one triggers a material refinement to ADR-049 D5 (SPECTRAL_GENERATION placement), and the rest drive specific quality lints and workflow gates.

Decision

D1 — Workflow composition: composite actions for primitives + thin per-target jobs

Composite actions at .github/actions/<name>/action.yml for the small primitives (Render auth + deploy + poll, version-poll, CNAME flip, KV bind, env-group helpers). One job per deployable target in each top-level workflow, each declaring its own environment: block. Reusable workflows (workflow_call) NOT used. Matrix job NOT used.

D2 — Concurrency model: separate groups; prod queued; staging cancel-on-supersede

  • Staging: concurrency: { group: deploy-staging-${{ github.ref }}, cancel-in-progress: true }. Latest commit wins.
  • Production: concurrency: { group: deploy-prod, cancel-in-progress: false }. Queued, never canceled.
  • Production gated on same-SHA staging-success marker (commit-status check). GH concurrency does not order across separate groups; the gate provides the ordering primitive.
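
As an illustration of the same-SHA gate, the sketch below decides whether a commit carries a staging-success marker. It assumes the marker is a commit status whose context name is `deploy/staging` (a hypothetical name; this ADR does not specify one) and that statuses arrive newest-first, as GitHub's "list commit statuses for a reference" endpoint returns them.

```python
from typing import Iterable, Mapping

# Hypothetical context name; the ADR does not name the marker.
STAGING_CONTEXT = "deploy/staging"


def staging_succeeded(
    statuses: Iterable[Mapping[str, str]],
    context: str = STAGING_CONTEXT,
) -> bool:
    """Return True if the newest status for `context` is 'success'.

    `statuses` is assumed to be ordered newest-first.
    """
    for status in statuses:
        if status.get("context") == context:
            # Only the most recent entry counts: an older success must
            # not mask a newer failure on the same SHA.
            return status.get("state") == "success"
    # No marker at all: the production gate stays closed.
    return False
```

A production job would call this with the statuses fetched for `github.sha` and abort when it returns False.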

D3 — Path-filtered rollout via .github/deploy-manifest.yml

The workflow’s first job parses the manifest, runs git diff from deployed_sha to github.sha, maps changed paths to an affected target set via globs, expands per coupling rules (api ↔ workers), produces outputs.targets JSON. Per-target jobs gate via if: contains(fromJSON(needs.changes.outputs.targets), '<name>'). Custom resolver matches the existing manifest schema; dorny/paths-filter rejected (does not express coupling rules cleanly).
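
A minimal sketch of such a resolver follows. The manifest shape (`targets`, `paths`, `coupled`) and the example globs are assumptions made for illustration; the real .github/deploy-manifest.yml schema is not reproduced in this ADR.

```python
from fnmatch import fnmatch

# Hypothetical manifest shape, shown as the dict a YAML loader would yield.
MANIFEST = {
    "targets": {
        "api": {"paths": ["apps/api/*", "src/spectral/*"]},
        "workers": {"paths": ["apps/workers/*"]},
        "dashboard": {"paths": ["apps/dashboard/*"]},
    },
    # Coupling rule from D3: api and workers deploy together.
    "coupled": [["api", "workers"]],
}


def resolve_targets(changed_paths, manifest=MANIFEST):
    """Map changed paths to the sorted list of targets to deploy."""
    affected = {
        name
        for name, spec in manifest["targets"].items()
        for path in changed_paths
        # fnmatch's "*" also matches "/", so one glob covers nested paths.
        if any(fnmatch(path, glob) for glob in spec["paths"])
    }
    # Expand coupling groups: if any member is affected, all members are.
    for group in manifest["coupled"]:
        if affected & set(group):
            affected |= set(group)
    return sorted(affected)
```

Serialising the result with `json.dumps` would give the kind of `outputs.targets` value that the per-target `contains(fromJSON(...))` conditions consume. The single expansion pass is enough for disjoint coupling groups; overlapping groups would need iteration to a fixed point.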

D4 — Pre-merge dry-run gate

Workflow:

  1. Create branch via Supabase Management API POST /v1/projects/{ref}/branches.
  2. Apply pending migrations to branch.
  3. Assert schema_migrations row-count delta matches expected (catches partial-apply on retry).
  4. Run smoke-test suite against branch URL — empty at alpha; auto-engages when ADR-045 D13 first-integration suite lands.
  5. Delete branch.
  6. Any failure → abort merge; report.
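
The steps above can be sketched as a small orchestration function. The `branch` adapter and its method names are hypothetical stand-ins for the Supabase Management API calls, and deleting the branch even on failure is an assumption of this sketch rather than something the ADR states.

```python
class DryRunFailed(Exception):
    """Raised when the dry-run must block the merge."""


def check_migration_delta(rows_before: int, rows_after: int, expected: int) -> None:
    """Step 3: the schema_migrations row-count delta must equal `expected`."""
    delta = rows_after - rows_before
    if delta != expected:
        raise DryRunFailed(
            f"schema_migrations delta {delta} != expected {expected}; "
            "possible partial apply"
        )


def dry_run(branch, pending_count: int) -> None:
    """Run steps 2-5 against an already-created branch.

    `branch` is a hypothetical adapter exposing count_rows(),
    apply_pending(), smoke_test() and delete().
    """
    try:
        before = branch.count_rows()
        branch.apply_pending()                      # step 2
        check_migration_delta(before, branch.count_rows(), pending_count)
        branch.smoke_test()                         # step 4 (empty at alpha)
    finally:
        # Assumption: clean up the branch whether or not the run passed.
        branch.delete()                             # step 5
```

Any exception that escapes `dry_run` corresponds to step 6: the caller reports it and blocks the merge.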

D5 — Migration-compat lint (tools/quality/check_migration_compat.py)

Rejects in supabase/migrations/*.sql:

  • DROP COLUMN
  • DROP TABLE
  • ALTER COLUMN ... TYPE to incompatible target
  • ADD COLUMN ... NOT NULL without DEFAULT
  • ADD ... UNIQUE constraint on populated column

Override: -- compat: breaking (reason: <reason>) marker on the file. The override forces explicit human review; an unannotated breaking change blocks PR. The V1-against-V2-schema corner is prevented at PR time, not deploy time.

Wired into ci.yml (PR + push-main) and tools/dev/precheck.sh. Lives alongside check_migration_naming.py.
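
To make the rule set concrete, here is a regex-based sketch of how such a lint could work. It is not the contents of check_migration_compat.py. Two simplifications are assumptions of the sketch: every ALTER COLUMN ... TYPE and every ADD ... UNIQUE is flagged, because type compatibility and column population cannot be judged from the SQL text alone.

```python
import re

# Override marker from D5; a non-empty reason is required.
OVERRIDE = re.compile(
    r"--\s*compat:\s*breaking\s*\(reason:\s*[^\s)][^)]*\)", re.I
)

RULES = [
    ("DROP COLUMN", re.compile(r"\bDROP\s+COLUMN\b", re.I)),
    ("DROP TABLE", re.compile(r"\bDROP\s+TABLE\b", re.I)),
    ("ALTER COLUMN TYPE",
     re.compile(r"\bALTER\s+COLUMN\s+\S+\s+(SET\s+DATA\s+)?TYPE\b", re.I)),
    ("ADD UNIQUE",
     re.compile(r"\bADD\s+(CONSTRAINT\s+\S+\s+)?UNIQUE\b", re.I)),
]
# Captures an ADD COLUMN clause up to the end of its statement.
ADD_COLUMN = re.compile(r"\bADD\s+COLUMN\b[^;]*", re.I)


def find_violations(sql: str) -> list[str]:
    """Return the names of breaking patterns found in one migration file."""
    if OVERRIDE.search(sql):
        return []  # Annotated as breaking: defer to human review.
    # Strip line comments so commented-out statements are not flagged.
    body = re.sub(r"--[^\n]*", "", sql)
    found = [name for name, pattern in RULES if pattern.search(body)]
    for match in ADD_COLUMN.finditer(body):
        clause = match.group(0).upper()
        if "NOT NULL" in clause and "DEFAULT" not in clause:
            found.append("ADD COLUMN NOT NULL without DEFAULT")
    return found
```

Regexes are a heuristic: a statement that adds several columns at once, or SQL inside string literals, can fool them. A SQL parser would be the sturdier choice if false results become a problem.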

D6 — /version and /version/detail contract

Refines ADR-048 D7. Both endpoints on every Render web service.

/version (public): { commit_sha, schema_version, generation, deployed_at }. JSON. Used by deploy verification + simple operator inspection.

/version/detail (auth-gated via deploy-key registry): full version.json from ADR-049 D5 build script + runtime generation + service identity.

Contract test asserts JSON shape across api / dashboard / operations.
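
A shape check of this kind might look like the sketch below. It validates only the four keys listed for /version; field types and the treatment of extra keys are not specified in this ADR, so the sketch leaves them unchecked.

```python
REQUIRED_KEYS = {"commit_sha", "schema_version", "generation", "deployed_at"}
SERVICES = ("api", "dashboard", "operations")


def contract_errors(payloads: dict[str, dict]) -> list[str]:
    """Return one message per service whose /version payload is malformed.

    `payloads` maps a service name to its decoded /version JSON body.
    """
    errors = []
    for service in SERVICES:
        payload = payloads.get(service)
        if payload is None:
            errors.append(f"{service}: no /version payload")
            continue
        missing = sorted(REQUIRED_KEYS - payload.keys())
        if missing:
            errors.append(f"{service}: missing {', '.join(missing)}")
    return errors
```

A test would fetch /version from each service, pass the decoded bodies in, and assert that the returned list is empty.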

D7 — SPECTRAL_GENERATION placement: per-service env var

Material refinement of ADR-049 D5 step 4. SPECTRAL_GENERATION set per-service via Render API at deploy time, NOT in the env group. Other shared values (DB URL, Sentry DSN, OTel endpoint, deploy-key registry) stay in the env group per ADR-037 / ADR-046.

Eliminates env-group auto-redeploy hazard structurally rather than via autoDeploy=false discipline. Materially mitigates ADR-049 Race C scope for generation specifically — generation is now baked into the service deploy; pod crash-restart sees the same generation.

D8 — Env-group placement principle

| Class | Location | Why |
| --- | --- | --- |
| Code-coupled (correctness depends on running image) | Per-service env var, set at deploy | Atomic with image; rolling-restart safe |
| Operationally shared, infrequent rotation, tolerates rolling-restart inconsistency | Render Env Group | ADR-037 / ADR-046 secrets posture |
| Per-service operational config | Per-service env var | Logically per-service; future-proofs tuning |

Sweep against known values:

  • Per-service: SPECTRAL_GENERATION, HANDLER_MAX, SPECTRAL_DRAIN_COOLING_SECONDS, reaper interval, claim TTL
  • Env group: SUPABASE_URL, SUPABASE_ANON_KEY, SUPABASE_SERVICE_ROLE_KEY, SENTRY_DSN, OTEL_EXPORTER_ENDPOINT, sk_deploy_* registry
  • Image-baked (strongest form): schema_version, build metadata
  • GH-side: Cloudflare API token, Render API key, Supabase Management PAT

D9 — Production cutover sequence (12 steps)

For tag push (prod-N or v*.*.*):

  1. Acquire concurrency: deploy-prod mutex.
  2. Verify same-SHA staging-success marker; abort if absent.
  3. Pre-merge dry-run per D4. Abort on failure.
  4. Apply schema: Supabase Management API branches/staging/merge → main. Assert schema_migrations row-count delta matches expected.
  5. Allocate generation: INSERT INTO core.deployments RETURNING generation. Capture <N>.
  6. Per-service deploy of green (Render API authenticated, NOT deploy hook): for each affected target, deploy with SPECTRAL_GENERATION=<N> set per-service. Workers + api in same generation.
  7. Poll Render API for each green deploy until status='live'. Hard timeout 25 min. Fail fast on build_failed / update_failed.
  8. Poll /version on each green Render origin until {commit_sha, schema_version, generation} all match expected.
  9. Workers heartbeat verification: poll core.workers until workers at new generation report state='running'.
  10. 30 s sanity check: assert no blue service has started a new deploy in the last 30 s (env-group race detection).
  11. CNAME flip (per ADR-052 D5): public CNAMEs point at green origins. TTL must be pre-lowered to 60 s at least 2 hours before this step.
  12. Hold blue warm 24 h (D11), then sync blue to green.

Failure modes:

  • Steps 1–5: nothing visible to users; abort cleanly, restart workflow after fix.
  • Steps 6–9: green is broken; do NOT flip; investigate; legacy-drain not needed.
  • Step 11: rollback per D14.
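
Step 8 of the sequence can be sketched as a polling loop. `fetch` is a hypothetical callable that performs the GET against one green origin's /version and returns the decoded JSON; the timeout and interval values are placeholders, since the ADR gives no numbers for this step.

```python
import time
from typing import Any, Callable, Mapping


def wait_for_version(
    fetch: Callable[[], Mapping[str, Any]],
    expected: Mapping[str, Any],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every key in `expected` matches, or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        try:
            payload = fetch()
        except OSError:
            # Green may not be accepting connections yet; keep polling.
            payload = {}
        if all(payload.get(key) == value for key, value in expected.items()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Passing `expected` as `{commit_sha, schema_version, generation}` means a green origin that serves the right commit at the wrong generation still fails the check, which is the case this step exists to catch.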

D10 — Render API call discipline

Authenticated POST /v1/services/{id}/deploys + poll GET /v1/services/{id}/deploys/{deployId} until status='live'. Deploy hooks NOT used — fire-and-forget, return 200 even when later builds fail. Composite action render-deploy wraps auth + deploy + poll. Hard timeout 25 min per service; no auto-retry on terminal failure.
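
The poll half of that discipline can be sketched as follows. `get_status` is a hypothetical callable wrapping the authenticated GET on the deploy resource and returning its status string; the 10 s interval is an assumption, while the 25-minute limit and the two terminal failure states come from this ADR.

```python
import time
from typing import Callable

TERMINAL_FAILURES = {"build_failed", "update_failed"}
HARD_TIMEOUT_S = 25 * 60  # 25 min per service


class DeployFailed(Exception):
    """The deploy reached a terminal failure state."""


def wait_until_live(
    get_status: Callable[[], str],
    timeout_s: float = HARD_TIMEOUT_S,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Block until the deploy is live; raise on failure or timeout."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "live":
            return
        if status in TERMINAL_FAILURES:
            # Fail fast and do not retry: a terminal failure needs a human.
            raise DeployFailed(f"deploy ended in {status}")
        if clock() >= deadline:
            raise TimeoutError(f"deploy not live after {timeout_s} s")
        sleep(interval_s)
```

Raising rather than returning a flag keeps the composite action's failure path simple: any exception fails the step, and nothing downstream runs against a deploy that never went live.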

D11 — Hold window: 24 h fixed at alpha

Blue stays warm for 24 h post-cutover. Cutover completion is measured by core.workers heartbeat at the new generation plus /version reflecting the expected SHA, not by “DNS propagated” (upstream resolvers can cache for hours to a day regardless of TTL). After 24 h, sync blue to green.

Forward trigger: traffic-driven measurement when traffic volume produces meaningful “blue traffic dropped below X%” signal in <24 h.

D12 — Cloudflare Pages deploy via cloudflare/wrangler-action@v3

Production deployments via wrangler pages deploy ./dist --project-name=<project> --branch=main. Preview deployments per PR via branch name. Custom domains attached once via dashboard; production-branch deploys flow through automatically. Custom-domain cert is a pre-flight gate (workflow asserts cert green-light before considering Pages “live”).

D13 — Legacy-drain workflow contract

drain-legacy-generation.yml with workflow_dispatch input target_generation: integer.

  1. Read core.deployments for reference at target_generation. Abort if not found.
  2. Checkout repo at that reference.
  3. Deploy temporary Render worker service workers-drain-gen-<N> with SPECTRAL_GENERATION=<N>, SPECTRAL_DRAIN_AND_EXIT=true, optional SPECTRAL_DRAIN_COOLING_SECONDS override.
  4. Monitor authoritative drain-complete signal: log line drain complete, exiting. Fallback: poll core.outbox for zero pending+in_flight at target generation.
  5. On signal: Render API delete temporary service.
  6. On workflow failure mid-drain: temporary service stays up; runbook documents manual cleanup.
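
The completion check in step 4 can be expressed as one predicate. In this sketch `outbox_counts` is a hypothetical mapping of row counts by state at the target generation (for example the result of a grouped query on core.outbox); how logs and counts are fetched is left out.

```python
from typing import Iterable, Mapping, Optional

DRAIN_LOG_LINE = "drain complete, exiting"


def drain_complete(
    log_lines: Iterable[str],
    outbox_counts: Optional[Mapping[str, int]] = None,
) -> bool:
    """Return True once the temporary drain worker has finished.

    The log line is authoritative. `outbox_counts` is the fallback;
    pass None when the outbox has not been queried.
    """
    if any(DRAIN_LOG_LINE in line for line in log_lines):
        return True
    if outbox_counts is None:
        return False
    remaining = outbox_counts.get("pending", 0) + outbox_counts.get("in_flight", 0)
    return remaining == 0
```

Only when this returns True would the workflow move on to step 5 and delete the temporary service.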

D14 — Rollback decision tree

  1. Cutover incomplete (CNAME not flipped): abort; blue still serving; investigate green; rebuild as needed. No legacy-drain.
  2. Post-cutover behavior-only issue: flip CNAME back to blue. 60 s TTL bounds the window. Blue warm during hold. Outbox at gen-N drains naturally.
  3. Post-cutover deploy-generation-specific data issue: redeploy prior code at new generation N+2 tagged vX.Y.Z-rollback. Stranded gen-(N+1) outbox rows drained via drain-legacy-generation.yml with target_generation=N+1.
  4. Migration-caused issue: migrations are forward-only (per ADR-032 D4) + expand/contract (per ADR-048 D4) + compat-linted (D5). Old code works against new schema by design. Rollback is code-level only via path 3.
  5. Past Render image retention OR upstream-yanked dep: declare DR per ADR-040.

The V1-against-V2-schema corner is prevented at PR time by D5 lint; rollback path stays simple.
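
One way to read the decision tree is as a function from incident facts to a rollback path. The field names and the precedence between paths are this sketch's interpretation, not wording from the ADR. In particular, it treats an unavailable prior image as relevant only when a code redeploy would otherwise be chosen.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ABORT_KEEP_BLUE = 1       # path 1
    FLIP_CNAME_BACK = 2       # path 2
    REDEPLOY_PRIOR_CODE = 3   # paths 3 and 4
    DECLARE_DR = 5            # path 5


@dataclass(frozen=True)
class Incident:
    cname_flipped: bool
    data_issue: bool = False              # generation-specific data issue
    migration_caused: bool = False
    prior_image_unavailable: bool = False  # past retention or yanked dep


def choose_rollback(incident: Incident) -> Action:
    """Pick a rollback path for the given incident facts."""
    if not incident.cname_flipped:
        return Action.ABORT_KEEP_BLUE
    if incident.data_issue or incident.migration_caused:
        # Both cases roll back at code level; that needs the prior image.
        if incident.prior_image_unavailable:
            return Action.DECLARE_DR
        return Action.REDEPLOY_PRIOR_CODE
    # Behavior-only issue: blue is still warm during the hold window.
    return Action.FLIP_CNAME_BACK
```

The function is a reading aid for the runbook, not a replacement for operator judgment. The facts it takes as input are the hard part to establish during an incident.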

D15 — Product-version tag policy: curated cadence

Cofounder cuts vX.Y.Z at feature-bundle moments. No automatic version bump on prod deploys. prod-N tags are mechanical (every prod deploy); vX.Y.Z are curated. Two tag lineages coexist per ADR-049 D5.

D16 — Release-page-only changelog (no CHANGELOG.md commit-back)

git-cliff output goes to GH Release body only via step-output piping. CHANGELOG.md does NOT exist in repo. Drops the workflow-loop hazard (commit-back lands on main and triggers downstream workflows), drops write-token requirement, drops bot-author commits in history. The GH Releases page IS the canonical changelog.

Forward trigger: published changelog page on docs-user → revisit (cliff renders server-side without committing to repo).

D17 — GH Environments + protection rules

Two environments at TA-26 disposition time: staging, production. ADR-062 (TA-25) added a third (test-live) for the nightly LLM live-drift workflow.

Staging: secrets RENDER_API_KEY (staging-scoped), SUPABASE_MANAGEMENT_PAT (staging-scoped), CLOUDFLARE_API_TOKEN. Protection: none.

Production: same secret names, prod-scoped values (environment-secrets-not-repo-secrets). Protection: required reviewer = self; deployment-tags rule with explicit ref type: tag set, restricted to prod-* and v*.*.*; no wait timer at alpha.
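
For local pre-checks, the deployment-tags restriction can be approximated in a few lines. This is only a sketch: GitHub evaluates the real rule with its own pattern syntax, which may differ from Python's `fnmatch` in edge cases such as tags containing `/`.

```python
from fnmatch import fnmatchcase

ALLOWED_TAG_PATTERNS = ("prod-*", "v*.*.*")
TAG_PREFIX = "refs/tags/"


def is_deployable_ref(ref: str) -> bool:
    """Return True if `ref` is a tag matching an allowed pattern."""
    if not ref.startswith(TAG_PREFIX):
        return False  # Branch refs never qualify for production.
    tag = ref[len(TAG_PREFIX):]
    # fnmatchcase keeps matching case-sensitive on every platform.
    return any(fnmatchcase(tag, pattern) for pattern in ALLOWED_TAG_PATTERNS)
```

A helper like this could warn before a tag is pushed that the production environment would reject it.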

Forward triggers: required reviewer = peer when team grows; wait timer when coordination matters.

D18 — Preview environments per PR

  • Database: Supabase branching via GH integration handles automatically.
  • Docs: Cloudflare Pages preview deploys per branch automatic.
  • Apps (api/dashboard/operations): Render preview environments NOT in scope at alpha.

D19 — Deploy-manifest coverage lint (tools/quality/check_deploy_manifest_coverage.py)

Asserts every directory under apps/ and src/spectral/ is mapped to at least one target’s paths glob in .github/deploy-manifest.yml, or declared non_deployed:. Catches silent-non-deploy. Runs in ci.yml.
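
The core of such a coverage check is small. The sketch below is illustrative, not the contents of check_deploy_manifest_coverage.py, and it assumes a manifest shape (`targets`, `paths`, `non_deployed`) that this ADR does not spell out.

```python
from fnmatch import fnmatch


def uncovered_dirs(directories, manifest):
    """Return directories that no target glob matches and that are not
    declared under non_deployed.

    `manifest` is the dict a YAML loader would yield (hypothetical shape).
    """
    globs = [
        glob
        for spec in manifest["targets"].values()
        for glob in spec["paths"]
    ]
    non_deployed = set(manifest.get("non_deployed", []))
    return sorted(
        directory
        for directory in directories
        if directory not in non_deployed
        # Match the directory itself and its contents ("dir/").
        and not any(
            fnmatch(directory, glob) or fnmatch(directory + "/", glob)
            for glob in globs
        )
    )
```

In CI the directory list would come from walking apps/ and src/spectral/, and a non-empty result would fail the job with the offending directories named.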

D20 — Substrate cleanup

  • Delete .github/workflows/nightly-backup.yml — ADR-048 / ADR-049 moved backup-nightly to a Render cron running tools/ops/backup/backup-nightly.sh. The GH Actions workflow is superseded substrate. Operational provisioning of the Render cron lives in SPEC-330.
  • Update .github/workflows/ci.yml — replace mypy --strict invocation with uv run ty check to match the established gate per ADR-051. Wire D5 + D19 lints.

Alternatives considered

Reusable workflows (workflow_call). Rejected; primitive-shape mismatch + environment-scoped secrets edge case.

Matrix job over targets. Rejected at alpha; couples per-target approval; chosen only if target count grows.

Render deploy hooks. Rejected; fire-and-forget; 200-on-later-failure.

SPECTRAL_GENERATION in env group (the ADR-049 D5 literal text). Rejected per D7.

Cancel-in-progress on prod concurrency. Rejected; mid-cutover cancel is worst-case.

Single concurrency group across staging+prod. Rejected; ordering not guaranteed within group; separate-with-marker is the pattern.

dorny/paths-filter as resolver. Rejected; does not express coupling rules.

Postgres advisory lock around generation allocation. Rejected at alpha; workflow concurrency suffices.

CHANGELOG.md commit-back. Rejected per D16.

Render preview environments per PR. Rejected per D18.

pgroll for migration management. Deferred; D5 lint is the alpha-tier safety net; pgroll is a substrate replacement, forward-trigger if migrations consistently complex.

Consequences

  • Cutover sequence is explicit, ordered, and validated against 13 known operator failure modes.
  • Five safety items folded structurally into the workflow (Render API not hooks; schema_migrations delta assert; 30 s pre-flip race check; pre-lowered TTL; traffic-drop measurement).
  • Migration-compat lint catches V1-vs-V2 corner at PR time.
  • Deploy-manifest coverage lint catches silent-non-deploy.
  • Release-only changelog removes the workflow-loop hazard.
  • Per-service SPECTRAL_GENERATION removes the env-group race entirely for generation.
  • Pre-merge dry-run useful from day 0; lights up further on integration-suite landing.
  • All deploy secrets environment-scoped — leak-resistant structurally.
  • 24 h hold window is fixed (could be over- or under-sized) until traffic-driven measurement lights up.
  • Release-only changelog: no CHANGELOG.md in repo; readers must use GH Releases.
  • Migration-compat override marker requires human discipline (the lint enforces presence, not correctness of reason:).
  • Composite-action approach means flatter logs than reusable workflows produce.
  • Pre-merge dry-run smoke-test scope is empty at alpha. Real risk: migrations that pass apply but break in app code; only mitigated when ADR-045 D13 lands.
  • Cloudflare Pages first-deploy custom-domain cert stall — workflow has a pre-flight but the pre-flight can itself stall. Mitigation: provision domains ≥7 days before relevant tag push (per ADR-052 doctrine).

References

  • ADR-012 — tiered hooks; mypy → ty per ADR-051
  • ADR-065 — spectral.core admission discipline
  • ADR-032 — forward-only migrations
  • ADR-037 — env-group placement principle ratified
  • ADR-040 — DR runbook for path 5 rollback
  • ADR-046 — Render PaaS; TA-21
  • ADR-048 — generation stamping; deploy manifest
  • ADR-049 — D5 SPECTRAL_GENERATION placement correction (via D7 here)
  • ADR-051 — ci.yml ty check swap (D20)
  • ADR-052 — CNAME-flip step (D9 step 11); Cache-Control contract test
  • ADR-062 — test-live Environment extension (third Environment)
  • TA-26 disposition — SPEC-329 comment 2089020b
  • TA-26 verification — SPEC-329 comment 1f7d85c0
  • TA-25 Environment scoping confirmation — SPEC-329 comment 0309025f
  • tools/quality/check_migration_compat.py — D5 lint
  • tools/quality/check_deploy_manifest_coverage.py — D19 lint
  • docs/runbooks/deployment.md — 12-step cutover runbook
  • docs/runbooks/rollback.md — rollback decision tree
  • docs/runbooks/legacy-drain.md — legacy-drain runbook
  • Codex system-design/topology/infrastructure/cd-pipeline-overview.mdx — close-pass new page

Addendum: ADR-021 — CI redesign from scratch

ADR-021 (Accepted 2026-04-20; retired by this ADR) ratified that CI for the 0.3.0 rebuild would be designed correctly from the first commit rather than ported from the v0.2 configuration. The v0.2 CI had been added late, immediately disabled, and never operated in its intended form; the rebuild was the opportunity to set automatic-trigger PR gates, enforced coverage thresholds, and integration-test acceptance criteria as first-class requirements rather than aspirational targets.

Why a future reader should know about ADR-021:

  • The “PR gates from day one” posture is preserved here and in ADR-062: no merge to main without passing CI; integration-test coverage is a non-negotiable acceptance criterion for any path between contexts.
  • The “Done != Done” failure mode that ADR-021 named (tickets marked complete without integration coverage) is structurally prevented by the per-epic Definition of Done in AGENTS.md and the integration-test AC rule in the Epic Template & DoD.
  • CI substrate specifics — composite actions, the 12-step cutover, migration-compat lint, release-only changelog — are settled by this ADR. CI secrets handling (fork-PR safety, environment scoping) is settled by ADR-062.

Git history at the commit retiring ADR-021 preserves the original text.