Rollback runbook

How to recover from a bad production deploy. Cutover is generation-stamped and the production branch is fast-forward-only (ADR-109 D5), so rollback is a forward fix: there is no blue/green CNAME flip and no legacy-generation drain.

The model

  • A container deploy ships a new generation (one container). The new container claims its generation’s outbox rows; the prior generation’s rows age out unclaimed — there is no reaper to run.
  • A Pages deploy ships a new static/Function deployment for one frontend/docs project; it does not affect the API/workers generation.
  • Schema migrations are forward-only (ADR-032 D4) and follow expand/contract, so the prior generation’s code keeps working against the new schema.
  • The production branch is fast-forward-only (the ruleset blocks force-push), so you cannot move it backward; you roll back by deploying a new commit that reverts the change.
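
To make the generation-stamped claim concrete, here is a minimal sketch using an in-memory SQLite table. The table name `outbox`, the columns `generation` and `claimed_by`, and the helper `claim_outbox` are all hypothetical; the real outbox schema and claim logic are not described in this runbook.

```python
import sqlite3


def claim_outbox(conn: sqlite3.Connection, generation: int, worker: str) -> int:
    """Claim every unclaimed row stamped with `generation`; return how many."""
    cur = conn.execute(
        "UPDATE outbox SET claimed_by = ? "
        "WHERE generation = ? AND claimed_by IS NULL",
        (worker, generation),
    )
    conn.commit()
    return cur.rowcount


# Hypothetical schema: each outbox row carries the generation that wrote it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY, generation INTEGER NOT NULL, claimed_by TEXT)"
)
conn.executemany(
    "INSERT INTO outbox (generation) VALUES (?)",
    [(1,), (1,), (2,), (2,), (2,)],
)

# The generation-2 container claims only its own rows.
claimed = claim_outbox(conn, generation=2, worker="container-gen2")

# Generation-1 rows are left untouched; nothing has to reap them.
leftover = conn.execute(
    "SELECT COUNT(*) FROM outbox WHERE generation = 1 AND claimed_by IS NULL"
).fetchone()[0]
```

Because the claim filters on the generation stamp, a redeployed revert (itself a new generation) never competes with the bad generation for rows.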

By failure class

Behavior regression (code, schema fine)

The new generation serves wrong responses or otherwise regresses, but the schema is not at fault.

  1. git revert the offending commit(s) on main.
  2. Fast-forward production to the revert and push (redeploy per deployment.md).
  3. The reverted (prior-generation) code runs safely against the already-migrated schema by the expand/contract guarantee.
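
The steps above can be sketched as a small helper that builds the git command sequence without running it. This is an illustration under assumptions: the remote name `origin`, direct pushes to main (your process may require a PR instead), and the function name `forward_fix_plan` are not from this runbook. Reverting a merge commit would additionally need `git revert -m`, which this sketch does not handle.

```python
import shlex
from typing import List


def forward_fix_plan(
    bad_shas: List[str],
    main: str = "main",
    production: str = "production",
    remote: str = "origin",  # assumed remote name
) -> List[List[str]]:
    """Return git commands for a revert-and-fast-forward rollback.

    `bad_shas` lists the offending commits oldest first.
    """
    if not bad_shas:
        raise ValueError("need at least one commit to revert")
    plan = [
        ["git", "fetch", remote],
        ["git", "checkout", main],
        ["git", "pull", "--ff-only", remote, main],
    ]
    # Revert newest first so each revert applies on top of the state
    # that the later commits left behind.
    for sha in reversed(bad_shas):
        plan.append(["git", "revert", "--no-edit", sha])
    plan += [
        ["git", "push", remote, main],
        ["git", "checkout", production],
        # --ff-only refuses anything that is not a fast-forward,
        # matching the branch ruleset; no force-push is ever issued.
        ["git", "merge", "--ff-only", main],
        ["git", "push", remote, production],
    ]
    return plan


if __name__ == "__main__":
    # Placeholder SHAs for illustration only.
    for cmd in forward_fix_plan(["aaa1111", "bbb2222"]):
        print(shlex.join(cmd))
```

Printing the plan first lets the operator review each command before running it; the push to production is what triggers the redeploy described in deployment.md.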

Pages frontend/docs regression

The bad deploy affects only one of the Pages sites (app., ops., or docs.).

  1. Revert the offending frontend/docs commit on main.
  2. Fast-forward production to the revert and push so deploy-pages.yml redeploys the affected Pages project.
  3. If the bad deploy is already live and the revert is not ready, use Cloudflare Pages’ deployment rollback for the affected project, then still land the revert/fix in git.

Migration-caused issue

The migration that applied is itself the problem.

  1. Do not roll back the schema — migrations are forward-only.
  2. If the prior code can run against the new schema (expand/contract held), revert the code as above.
  3. Otherwise fix forward: a follow-up migration that restores the compatibility property (e.g. make a NOT NULL column nullable again until backfill completes), shipped with the corrected code.

Catastrophic / DR

The database is corrupted or a region is down. Declare DR per disaster-recovery.md (PITR or managed-backup restore).

The V1-against-V2-schema corner

The classic “prior code breaks against the new schema” failure is structurally prevented by expand/contract discipline (ADR-109 D5): a schema change must leave the prior generation working. Breaking DDL — ADD COLUMN … NOT NULL without a default, DROP COLUMN/DROP TABLE, ALTER COLUMN … TYPE, ADD … UNIQUE on an existing table — is rejected at PR time by the migration-compat lint (ADR-053 D5) unless the migration carries an explicit -- compat: breaking (reason: …) override with reviewer signoff. The rollback path stays simple because the corner cannot be entered without explicit opt-in.
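
As an illustration of the rule, here is a minimal sketch of such a check. It is not the actual migration-compat lint from ADR-053 D5, whose implementation is not shown here; the function name `breaking_findings` and the regex heuristics are assumptions, and a real lint would use a SQL parser and verify reviewer signoff separately.

```python
import re
from typing import List

# Override marker quoted in the runbook: -- compat: breaking (reason: …)
OVERRIDE = re.compile(r"--\s*compat:\s*breaking\s*\(reason:\s*[^)]+\)", re.I)


def breaking_findings(sql: str) -> List[str]:
    """Return reasons a migration looks breaking; empty if it passes."""
    if OVERRIDE.search(sql):
        return []  # explicit opt-in; signoff is checked by humans, not here
    body = re.sub(r"--[^\n]*", "", sql)  # drop line comments
    findings = []
    for stmt in body.split(";"):
        s = " ".join(stmt.split()).upper()  # normalize whitespace and case
        if not s:
            continue
        if "ADD COLUMN" in s and "NOT NULL" in s and "DEFAULT" not in s:
            findings.append("ADD COLUMN ... NOT NULL without a default")
        if re.search(r"\bDROP (COLUMN|TABLE)\b", s):
            findings.append("DROP COLUMN/DROP TABLE")
        if re.search(r"\bALTER COLUMN\b.*\bTYPE\b", s):
            findings.append("ALTER COLUMN ... TYPE")
        # Only ALTER TABLE counts: UNIQUE in a CREATE TABLE is a new table.
        if s.startswith("ALTER TABLE") and re.search(r"\bADD\b.*\bUNIQUE\b", s):
            findings.append("ADD ... UNIQUE on an existing table")
    return findings


if __name__ == "__main__":
    print(breaking_findings("ALTER TABLE users ADD COLUMN age INT NOT NULL;"))
    print(breaking_findings("ALTER TABLE users ALTER COLUMN age DROP NOT NULL;"))
```

Note that the fix-forward example from the migration-caused case (making a NOT NULL column nullable again) passes this check, since relaxing a constraint does not break the prior generation.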

Communication

For any production rollback: note it (deploy SHA, generation, failure class) in the ops channel; annotate the GitHub Release if one was created; and, after resolution, record any gap the incident exposed in the deploy doctrine.