
Rollback runbook

Production rollback procedure for Spectral. Decision tree by failure class, with explicit reference to the legacy-drain workflow wherever generation cleanup is needed. See ADR-053 D14 for the disposition.

Failure-class decision tree

The rollback path depends on where the deploy is in the cutover sequence and on what failed. Match the failure to the most specific class below.

Class 1 — Cutover incomplete (CNAME not flipped yet)

Symptoms: the deploy workflow aborted at any of steps 1–10 of the deployment runbook. The CNAME still points at blue. Blue is still serving traffic and is healthy.

Action: abort the workflow. Investigate green; rebuild as needed. Re-tag and re-run when fixed.

No legacy-drain needed. No outbox rows have been stamped at the new generation, because workers at the new generation never reached state='running' (verified at step 9 of the cutover sequence).
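The clean-abort claim can be spot-checked directly before re-running. A minimal sketch, assuming an outbox table with a generation column; the real table and column names may differ:

```shell
# Verify no outbox rows were stamped at the aborted generation.
# Assumes `outbox(generation, ...)`; adjust the query to the real schema.
assert_no_new_gen_rows() {
  local gen="${1:?usage: assert_no_new_gen_rows <new_generation>}"
  local count
  count=$(psql "$DATABASE_URL" -Atc \
    "SELECT count(*) FROM outbox WHERE generation = ${gen}")
  if [ "$count" = 0 ]; then
    echo "clean abort: no gen-${gen} outbox rows"
  else
    echo "found ${count} gen-${gen} rows; run legacy-drain before re-deploy" >&2
    return 1
  fi
}
```

If this ever reports rows for the new generation after a Class 1 abort, treat the abort as Class 2 and drain before re-tagging.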

Class 2 — Post-cutover, behavior-only issue

Symptoms: the CNAME flipped successfully (cutover completed), but green is exhibiting a behavior issue — wrong responses, a performance regression, Sentry alerts spiking. The schema is not at fault; the issue is code-level on green.

Action: flip the CNAME back to blue via the cloudflare-cname-flip composite action. The 60-second TTL bounds the recovery window for well-behaved resolvers; long-tail resolvers may cache for hours, but blue stays warm during the hold window.

After flipping back:

  1. Investigate green’s behavior issue.
  2. Fix forward; tag a new prod release; re-run the cutover sequence.
  3. Outbox rows stamped at gen-(N+1) during the brief green-traffic window are stranded once traffic returns to blue. Green workers were running gen-(N+1) code and stamping gen-(N+1) events, and outbox routing is by generation, so those events remain claimable only by gen-(N+1) workers. After the flip back those workers are no longer running, and blue's gen-N workers will not pick the rows up.

Use legacy-drain (legacy-drain.md) with target_generation=N+1 after flipping back.
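The Class 2 sequence can be condensed into one helper. The workflow file names and inputs below are assumptions for illustration; this runbook only names the cloudflare-cname-flip composite action and the legacy-drain workflow. The helper prints the commands rather than running them:

```shell
# Dry-run sketch of the Class 2 recovery sequence.
# Workflow names (cname-flip.yml, legacy-drain.yml) are hypothetical.
class2_rollback() {
  local stranded_gen="${1:?usage: class2_rollback <abandoned generation, i.e. N+1>}"
  # 1. Send traffic back to blue (the cloudflare-cname-flip composite action).
  echo "gh workflow run cname-flip.yml -f target=blue"
  # 2. Drain outbox rows stamped at the abandoned generation.
  echo "gh workflow run legacy-drain.yml -f target_generation=${stranded_gen}"
}
```

Wire the echoed commands to gh (or whatever dispatches the real workflows) once the names are confirmed.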

Class 3 — Post-cutover, deploy-generation-specific data issue

Symptoms: green has corrupted or unexpected data behavior that cannot be fixed by simply flipping back. Migration may have applied correctly but the new code is producing unintended state at the data layer.

Action:

  1. Tag the prior code at a new generation: git tag vX.Y.Z-rollback <prior_sha>.
  2. Push the tag, which triggers deploy-prod.yml and allocates generation N+2.
  3. The cutover sequence runs against the prior code; the CNAME flips from green (gen-(N+1)) to a new green (gen-(N+2), running the prior code).
  4. Stranded outbox rows from gen-(N+1) drain via legacy-drain.md with target_generation=N+1.
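Steps 1–2 can be wrapped in a small helper. The tag name follows the runbook's vX.Y.Z-rollback example; the push alone triggers deploy-prod.yml, so nothing else is needed locally:

```shell
# Re-tag the prior code so the pipeline allocates a fresh generation.
# The tag name convention here mirrors the vX.Y.Z-rollback example above.
retag_prior() {
  local prior_sha="${1:?prior sha}" tag="${2:?rollback tag, e.g. v1.2.3-rollback}"
  git tag "$tag" "$prior_sha"   # new tag, same old code
  git push origin "$tag"        # triggers deploy-prod.yml, allocating gen N+2
}
```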

Schema is not rolled back. Migrations are forward-only (per ADR-032 D4). The prior code works against the new schema by design (expand/contract migrations per ADR-048 D4 + compat lint per ADR-053 D5). If the schema itself is the problem, the next class applies.

Class 4 — Migration-caused issue

Symptoms: the migration that applied at step 4 of the cutover sequence is itself the source of the issue. Either the migration is incompatible with N-1 code in a way the compat lint missed, or it has performance / locking issues that only surface in production.

Action:

  1. Schema is forward-only. Do not attempt to “roll back the migration.” Postgres + the migration pipeline are not designed for schema rollback.
  2. Determine whether the prior code can run against the new schema:
    • If yes: rollback per Class 3.
    • If no: this should never happen because the migration-compat lint (ADR-053 D5) rejects breaking changes at PR time. If it does happen, the lint missed something; see Class 5.
  3. Fix forward via a follow-up migration that restores the compatibility property (e.g., make a NOT NULL column nullable again until backfill completes). Tag a new prod release that includes both the follow-up migration and the corrected code.
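As an illustration of step 3, a compat-restoring follow-up migration applied via psql. The table and column names are invented; the real fix depends on which constraint broke N-1 compatibility:

```shell
# Hypothetical follow-up migration: relax the offending constraint until
# the backfill lands. `orders.customer_ref` is an invented example.
apply_compat_fix() {
  psql "$DATABASE_URL" <<'SQL'
-- compat-restoring change: make the column nullable again until backfill completes
ALTER TABLE orders ALTER COLUMN customer_ref DROP NOT NULL;
SQL
}
```

In practice this DDL ships as a normal migration in the new prod release, not as an ad-hoc psql invocation; the helper only makes the shape of the fix concrete.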

Class 5 — Catastrophic / DR

Symptoms: any of the above paths fail because:

  • Render image retention has expired and the prior code is no longer rebuildable (upstream-yanked dep, base-image gone, etc.).
  • Production database is corrupted in a way that requires PITR or backup restoration.
  • Region-wide Render or Supabase outage affecting cutover machinery.

Action: declare DR per the disaster-recovery runbook (per ADR-040). Notify cofounder. Engage Supabase support if database state is involved. Backup substrate is the nightly pg_dump → age → GCS archive (per ADR-040 D2).
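Under incident pressure the tree above can be walked mechanically. A minimal triage sketch; the questions mirror the five classes, and the output wording is illustrative:

```shell
# Walk the failure-class tree. Each argument is y or n:
#   $1 cutover_complete   (did the CNAME flip happen?)
#   $2 migration_at_fault (is the migration itself the problem?)
#   $3 data_issue         (is green corrupting data, not just misbehaving?)
#   $4 dr_condition       (image unrebuildable / DB corrupt / region outage?)
triage() {
  local cutover="$1" migration="$2" data="$3" dr="$4"
  if [ "$dr" = y ]; then        echo "Class 5: declare DR per ADR-040"; return; fi
  if [ "$cutover" = n ]; then   echo "Class 1: abort, fix green, re-run"; return; fi
  if [ "$migration" = y ]; then echo "Class 4: forward-fix migration"; return; fi
  if [ "$data" = y ]; then      echo "Class 3: re-tag prior code"; return; fi
  echo "Class 2: flip CNAME back, then legacy-drain"
}
```

The DR check comes first on purpose: every other path assumes the cutover machinery and the prior image are still usable.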

V1-against-V2-schema corner

The classic rollback failure mode is “V1 code shipped against V2 schema breaks during the rolling-deploy window” — e.g., V2 added a NOT NULL column that V1 doesn’t populate. This corner is structurally prevented at PR time by the migration-compat lint (ADR-053 D5), which rejects:

  • ADD COLUMN ... NOT NULL without DEFAULT
  • DROP COLUMN, DROP TABLE
  • ALTER COLUMN ... TYPE
  • ADD ... UNIQUE on existing tables

unless the migration carries the -- compat: breaking (reason: ...) override marker. Override files require explicit reviewer signoff.
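A toy version of the lint's pattern check, to make the override semantics concrete. The real lint (ADR-053 D5) is assumed to parse DDL properly; this sketch only greps, and simplifies the first rule (ADD COLUMN ... NOT NULL is exempt when the same line carries a DEFAULT):

```shell
# Grep-level sketch of the migration-compat lint. Returns 0 (pass) when the
# file carries the override marker or contains none of the rejected patterns.
compat_lint() {
  local f="${1:?migration file}"
  # Explicit operator opt-in bypasses the lint entirely.
  grep -qi -- '-- compat: breaking' "$f" && return 0
  # ADD COLUMN ... NOT NULL is only breaking without a DEFAULT.
  if grep -Ei 'ADD COLUMN' "$f" | grep -Ei 'NOT NULL' | grep -Eqvi 'DEFAULT'; then
    echo "compat: ADD COLUMN ... NOT NULL without DEFAULT in $f" >&2
    return 1
  fi
  # The remaining rejected patterns are breaking unconditionally.
  if grep -Eqi 'DROP COLUMN|DROP TABLE|ALTER COLUMN.+TYPE|ADD .*UNIQUE' "$f"; then
    echo "compat: breaking DDL in $f" >&2
    return 1
  fi
  return 0
}
```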

The rollback path stays simple because the corner cannot be entered without explicit operator opt-in.

Communication

For any rollback that touches production traffic (Class 2, 3, 4, 5):

  1. Note the rollback in the ops chat (or equivalent) with the deploy tag, the generation numbers involved, and the failure class.
  2. Update the GitHub Release if it was created (mark as pre-release or attach a note explaining the rollback).
  3. After resolution, write a brief incident note in the ADR / forward-considerations log if the failure class indicates a gap in deploy-pipeline doctrine.