# Disaster recovery runbook
Alpha operational runbook for Spectral’s DR posture. Implements ADR-040 disciplines. See the ADR for the full decision context.
## Current backup substrate
PITR is not enabled at alpha (see ADR-040 D1 for activation triggers). Daily-snapshot cadence + nightly `pg_dump` covers the alpha posture.
| Target | Value |
|---|---|
| RPO (in-provider) | 24 h (daily snapshot cadence) |
| RTO (in-provider) | 60 min (Duplicate Project restore + verification) |
| RPO (provider-failure) | 24 h (restore from nightly pg_dump) |
| RTO (provider-failure) | Hours-to-a-day, best-effort |
No taxpayer data in scope — the IRS / tax-prep world-agent is a development-demo asset only. Customer-domain data from design-partner co-implementation is the only non-public data that enters the system during alpha.
## What is backed up
| Layer | Cadence | Retention | Source of truth |
|---|---|---|---|
| Supabase daily snapshot | Daily (Supabase-managed) | 7 days (Pro tier) | Supabase platform |
| Nightly `pg_dump` → GCS | Daily 03:17 UTC | 30 days (bucket lifecycle) | `.github/workflows/nightly-backup.yml` (per ADR-040 D2; lands with the deploy substrate) |
| Partner rule corpora (post-partner-landing) | On commit | Unbounded (git history) | `spectral db export-rules` → `data/rules/<partner>/` |
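The nightly job's core pipeline is a single dump → encrypt → upload stream. A minimal sketch, not the authoritative implementation (that lives in `.github/workflows/nightly-backup.yml`): the function name, flags, and object-naming scheme here are assumptions, with credential names mirroring the inventory below.

```shell
# Sketch of the nightly backup pipeline (assumed shape; the authoritative
# version is .github/workflows/nightly-backup.yml).
nightly_backup() {
  db_url="$1" bucket="$2" recipient="$3"
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  # Custom-format dump -> age-encrypt to the public recipient -> stream to GCS.
  pg_dump "$db_url" --format=custom \
    | age -r "$recipient" \
    | gcloud storage cp - "gs://${bucket}/nightly/${ts}.dump.age"
}

# Example:
#   nightly_backup "$SUPABASE_DB_SUPERUSER_URL" "$BACKUP_BUCKET" "$BACKUP_AGE_RECIPIENT"
```

Only the public `BACKUP_AGE_RECIPIENT` key is needed at backup time; the private `BACKUP_AGE_IDENTITY` key stays out of the workflow entirely, as the credential inventory notes.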
## PITR activation triggers
Triggers are owned by ADR-040 D1 — see the ADR for the authoritative list. When any trigger fires, run the activation playbook below.
## PITR activation playbook
- Verify the Small compute add-on (~$5/mo net after Pro credit) is provisioned on the Supabase project.
- Enable 7-day PITR in the Supabase dashboard → Database → Backups. Enabling PITR disables the daily snapshot subsystem (PITR supersedes it).
- Update the “Current backup substrate” section of this runbook to the post-PITR alpha posture: RPO 2 min in-window, RTO 60 min.
- Trigger a functional restore drill within 30 days to verify the restore-to-seconds UX.
- Reconcile nightly `pg_dump` retention with retention decisions per ADR-042 when those land. The 30-day `pg_dump` lifecycle extends beyond the 7-day PITR window by design — late-corruption scenarios need the longer tail.
## Restore playbooks
### Scenario: accidental DELETE or bad migration (self-inflicted)
Pre-PITR mode:
- Identify the damage; decide acceptable data loss against the 24-hour RPO gap.
- If < 24 h of legitimate work would be lost: consider manual replay of the lost writes.
- Otherwise: restore from the most recent nightly `pg_dump` via the provider-failure playbook below (even though the provider is healthy — the mechanics are the same).
Post-PITR mode:
- Identify the exact timestamp just before the damage.
- Supabase dashboard → Database → Backups → Point in Time → pick timestamp → in-place restore.
- Expected downtime: 10-30 min.
- Verify schema + row counts against pre-damage expectations.
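The row-count verification step can be scripted. A minimal sketch, assuming `psql` access to the restored database; the function name and the `table:expected-count` argument convention are invented for illustration:

```shell
# Hypothetical post-restore spot-check: compare row counts in a restored
# database against recorded pre-damage expectations ("table:count" pairs).
verify_row_counts() {
  conn="$1"; shift
  for pair in "$@"; do
    table=${pair%%:*} expected=${pair##*:}
    actual=$(psql "$conn" -At -c "SELECT count(*) FROM ${table};")
    if [ "$actual" -ge "$expected" ]; then
      echo "OK ${table}: ${actual} rows (expected >= ${expected})"
    else
      echo "FAIL ${table}: ${actual} rows (expected >= ${expected})"
      return 1
    fi
  done
}

# Example:
#   verify_row_counts "$RESTORED_DATABASE_URL" core.users:42 core.llm_usage:1000
```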
### Scenario: regional outage (Supabase us-east-1 hard down)
- Wait it out if the outage is expected to resolve within 4 hours (historically most AWS regional incidents resolve within 2 h).
- If extended or business-critical: restore the most recent nightly `pg_dump` to a fresh Supabase project in a different region, then update the `SUPABASE_URL` + `DATABASE_URL` secrets via `tools/provision/setup.sh --mode=rotate`.
- Communicate the cutover to active design partners; re-issue invites per the `core.users` mirror recovery note below.
### Scenario: provider failure / EOL / acquisition-sunset
- Restore the most recent nightly `pg_dump` into a managed Postgres elsewhere (GCP Cloud SQL for Postgres, Neon, Crunchy Bridge).
- Supabase-managed surfaces (`auth.users`, `storage.objects`) do not restore cleanly into a non-Supabase host. Consequences:
  - `auth.users` does not move; users re-authenticate via the new provider’s flow.
  - The `core.users` mirror is populated by re-invite acceptance.
  - `workspace_invites` must be re-issued; accepted invites in the dump are historical and non-actionable on the new host.
- Update FastAPI middleware to point at the new host; JWT verification may shift provider (per ADR-039 D12, the abstraction preserves the `AuthContext` shape).
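One way to avoid the Supabase-managed surfaces during the cross-provider restore is to restore only the app-owned schemas. A sketch, assuming a custom-format dump and the schema names used elsewhere in this runbook (`spectral`, `worlds`, `core`); verify the actual schema list with `pg_restore --list` before running:

```shell
# Restore only the app-owned schemas into a non-Supabase Postgres, skipping
# the Supabase-managed ones (auth, storage). Schema names are assumptions
# drawn from this runbook; confirm against `pg_restore --list <dump>`.
restore_app_schemas() {
  dump="$1" target="$2"
  pg_restore --dbname="$target" --no-owner --no-privileges \
    --schema=spectral --schema=worlds --schema=core \
    "$dump"
}

# Example:
#   restore_app_schemas /path/to/nightly.dump "postgres://user@newhost/db"
```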
### Scenario: compromised service-role credential
- Rotate the credential per the `docs/runbooks/secrets-management.md` emergency-rotation flow.
- Determine the blast-radius window (when was the credential live; what could it reach).
- Post-PITR mode: restore to a timestamp just before the suspected compromise to undo any unauthorized writes.
- Pre-PITR mode: if destructive action occurred, restore from the most recent nightly `pg_dump` (up to 24 h of legitimate work lost).
- Log the incident per the secrets-management runbook.
### Scenario: corruption or bad data found > 7 days after write
The 7-day PITR window does not reach this. Rely on nightly `pg_dump` retention (30 days).
**Late-corruption floor commitment (per ADR-042 D10).** The 30-day `pg_dump` retention is the load-bearing recovery window for any corruption discovered after the PITR cutoff. So long as every workspace-scoped entity’s `RetentionPolicy.active_ttl_days` remains ≥ 30 days, the dump covers the gap. If future retention research shortens any PLATFORM-class `active_ttl_days` below 30 days, the `pg_dump` lifecycle rule in `.github/workflows/nightly-backup.yml` must extend to match: `active_ttl + tombstoned_grace` is the floor, never less. Tracked alongside the ADR-042 D11 forward triggers.
- Locate the last-known-good dump in `gs://<BACKUP_BUCKET>/nightly/`.
- Restore the dump to a throwaway Supabase Duplicate Project via `pg_restore`.
- Spot-check the affected table in the restored clone.
- Extract corrected rows; apply to production via targeted `INSERT ... ON CONFLICT ... UPDATE` or equivalent.
- Do not perform a full restore — days of legitimate writes since the corruption would be lost.
- Tear down the clone within the Supabase billing hour.
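The targeted-upsert step might look like the following sketch. The table and column names (`spectral.rules`, `id`, `body`, `updated_at`) are purely illustrative; adapt them to the affected table:

```shell
# Targeted repair: upsert corrected rows into production without a full
# restore. Table and column names below are hypothetical examples.
repair_rows() {
  prod_url="$1"
  psql "$prod_url" <<'SQL'
-- Corrected values extracted from the restored clone; adapt per incident.
INSERT INTO spectral.rules (id, body, updated_at)
VALUES ('rule-123', 'corrected body', now())
ON CONFLICT (id) DO UPDATE
  SET body = EXCLUDED.body,
      updated_at = EXCLUDED.updated_at;
SQL
}
```

The upsert form is what keeps this surgical: rows untouched by the corruption are never rewritten, so legitimate writes made since the incident survive.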
### Scenario: accidental schema drop
- Pre-PITR mode: restore from the most recent nightly `pg_dump`.
- Post-PITR mode: PITR to just before the drop.
- After restore: verify `core.*` tables (`llm_usage`, `embedding_profile`, `users`) are intact — these are shared contracts; `worlds` and `platform` will fail-closed on missing data.
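The `core.*` verification can be a one-liner per table using Postgres's `to_regclass`. A sketch with an invented function name:

```shell
# Post-restore guard: confirm the shared-contract core.* tables survived.
# to_regclass() returns NULL when the relation does not exist.
verify_core_tables() {
  conn="$1"
  for t in llm_usage embedding_profile users; do
    exists=$(psql "$conn" -At -c \
      "SELECT to_regclass('core.${t}') IS NOT NULL;")
    [ "$exists" = "t" ] || { echo "MISSING core.${t}"; return 1; }
  done
  echo "core.* intact"
}
```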
## Restore-drill cadence (D4)
| Type | Cadence | Effort |
|---|---|---|
| Functional drill | Quarterly | ~2 hours |
| Tabletop review | Monthly | ~15 min |
| Post-migration drill | After any multi-schema migration | ~1 hour |
This cadence aligns with NIST SP 800-34 Rev. 1 (annual functional minimum; more frequent for high-impact systems). AWS Well-Architected REL13-BP02 treats multi-schema migrations as “significant workload changes” requiring a game-day-style drill.
### Quarterly functional drill procedure
- Pick the most recent nightly `pg_dump` from `gs://<BACKUP_BUCKET>/nightly/`.
- Duplicate the staging Supabase project (dashboard → Duplicate Project).
- Decrypt and restore the dump into the clone:

  ```
  gcloud storage cp gs://<BACKUP_BUCKET>/nightly/<ts>.dump.age - \
    | age -d -i <identity-path> \
    | pg_restore -d <clone-connection-string> --no-owner --no-privileges
  ```

- Run a schema checksum comparison across `spectral`, `worlds`, `core`.
- Row-count spot-check five load-bearing tables (e.g., `core.llm_usage`, `core.embedding_profile`, `core.users`, and two tables in `worlds` or `platform` once those schemas are scaffolded).
- Tear down the clone within the Supabase billing hour.
- Log the drill result (pass / fail / findings) in the history section below.
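The schema-checksum comparison can be implemented by hashing schema-only dumps per schema. A sketch; the `pg_dump` flags shown are standard, but the hash-and-compare convention is an assumption:

```shell
# Hash the DDL of one schema so production and the restored clone can be
# compared digest-to-digest. Identical digests => identical schema dumps.
schema_checksum() {
  conn="$1" schema="$2"
  pg_dump "$conn" --schema-only --schema="$schema" \
    | sha256sum | cut -d' ' -f1
}

# Example comparison for one schema:
#   [ "$(schema_checksum "$PROD_URL" spectral)" = \
#     "$(schema_checksum "$CLONE_URL" spectral)" ] && echo "spectral: match"
```

One caveat worth knowing: `pg_dump --schema-only` output can differ across `pg_dump` versions, so run both sides with the same client binary.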
### Tabletop review checklist
- Most recent nightly workflow run is green (check Actions tab).
- Last object in `gs://<BACKUP_BUCKET>/nightly/` is < 26 hours old.
- Backup credential rotations are on schedule (see `secrets-management.md`).
- age identity key still retrievable from GCP Secret Manager (`BACKUP_AGE_IDENTITY`).
- No schema changes since last drill (if yes, schedule a post-migration drill).
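The freshness check (< 26 hours) can be automated. A sketch that assumes GNU `date` and a `gcloud storage ls -l` listing whose lines read "size, timestamp, URL"; verify both assumptions against your environment:

```shell
# Tabletop freshness check: newest object under nightly/ must be < 26 h old.
# Assumes GNU date and `gcloud storage ls -l` printing "<size> <timestamp> <url>".
check_backup_freshness() {
  bucket="$1"
  newest=$(gcloud storage ls -l "gs://${bucket}/nightly/" \
    | awk 'NF >= 3 {print $2}' | sort | tail -n 1)
  [ -n "$newest" ] || { echo "STALE: no backups found"; return 1; }
  age_s=$(( $(date -u +%s) - $(date -u -d "$newest" +%s) ))
  if [ "$age_s" -lt 93600 ]; then   # 26 h = 93600 s
    echo "fresh (${age_s}s old)"
  else
    echo "STALE (${age_s}s old)"; return 1
  fi
}
```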
## First-partner onboarding DR checklist
Execute before a first design-partner co-implementation session persists their domain data:
- Activate PITR per D8 playbook above.
- Implement `spectral db export-rules` for the partner’s corpus schema (D3 concrete implementation lands per the partner-corpus migration epic).
- Create a `data/rules/<partner>/` directory with a README noting the partner, onboarding date, and regeneration procedure.
- Wire a git pre-commit hook (or rule-corpus-migration hook) that re-exports on mutation.
- Run a functional drill within 30 days of partner onboarding using their corpus.
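The re-export hook might look like this sketch. `spectral db export-rules` is the planned D3 command; the `--partner` flag and the function wrapper are assumptions about its eventual interface:

```shell
# Sketch of a pre-commit hook body (run from the repo root): re-export each
# partner's rule corpus so data/rules/<partner>/ tracks the live rules.
reexport_rule_corpora() {
  for partner_dir in data/rules/*/; do
    [ -d "$partner_dir" ] || continue   # no corpora onboarded yet
    partner=$(basename "$partner_dir")
    # --partner is an assumed flag on the planned D3 command.
    spectral db export-rules --partner "$partner" || return 1
    git add "$partner_dir"
  done
}
```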
## Backup credential inventory (D7)
| Credential | Scope | Storage | Used by |
|---|---|---|---|
| `SUPABASE_DB_SUPERUSER_URL` | @scope=staging,production | GCP Secret Manager | Nightly backup workflow (via WIF) |
| `BACKUP_BUCKET` | @scope=shared | GitHub repo variable | Nightly backup workflow |
| `BACKUP_AGE_RECIPIENT` | @scope=shared | GitHub repo variable (public key; not secret) | Nightly backup workflow |
| `BACKUP_AGE_IDENTITY` | @scope=shared | GCP Secret Manager (private key) | Restore operations only — never loaded in nightly workflow |
| `GCP_WORKLOAD_IDENTITY_PROVIDER` | @scope=shared | GitHub repo variable | Nightly backup workflow auth |
| `GCP_BACKUP_SERVICE_ACCOUNT` | @scope=shared | GitHub repo variable | Nightly backup workflow auth |
`SUPABASE_DB_SUPERUSER_URL` runs as the `postgres` superuser (`BYPASSRLS` required for `pg_dump` completeness per Supabase docs) and is held under a distinct service account from the app service-role credential. Bucket-level object versioning + retention lock ensure a compromised write credential cannot delete backup history.
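The versioning + retention-lock hardening can be applied with `gsutil` (newer `gcloud storage buckets update` flags offer equivalents). A sketch; note that locking a retention policy is irreversible, so confirm the period before running:

```shell
# Harden the backup bucket: object versioning plus a locked retention
# period, so a compromised write credential cannot destroy backup history.
harden_backup_bucket() {
  bucket="$1"
  gsutil versioning set on "gs://${bucket}"
  gsutil retention set 30d "gs://${bucket}"   # match the pg_dump lifecycle
  gsutil retention lock "gs://${bucket}"      # IRREVERSIBLE; prompts first
}
```

If the ADR-042 reconciliation ever lengthens the `pg_dump` lifecycle, the retention period here must be extended first, since a locked period can be increased but never shortened.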
## Runbook history
Drills, incidents, and activation events logged in reverse-chronological order.
(no entries yet)
## Related
- ADR-040 — DR substrate (decision context).
- `secrets-management.md` — credential rotation backing the “compromised credential” playbook.
- `.github/workflows/nightly-backup.yml` — nightly `pg_dump` workflow implementing ADR-040 D2.
- ADR-032 — single-DB topology informs restore unit shape.
- ADR-042 — retention policies intersect with PITR + nightly dump retention; late-corruption floor reconciled above.
- ADR-037 — credential provisioning via `tools/provision/setup.sh`.
- ADR-039 — auth abstraction enables the provider-failure-restore path without re-architecting authz.