Disaster recovery runbook
Operational runbook for Spectral’s DR posture. Implements ADR-040 — see the ADR for the decision context.
Posture
DR is Supabase-native managed backups + PITR — there is no self-run backup pipeline (no pg_dump, no object-store, no encryption job). The posture tiers by data value:
- Pre-customer alpha (now): the database is regenerable from tools/dev/cold_start_seed.py + the authoring harness; the tax-prep world is a dev/demo asset and no irreplaceable data is in the system. Recovery is re-seed, not restore. (The production project is on Supabase Free.)
- Before customer launch / first partner data: adopt Supabase Pro → managed daily backups + PITR (~2-minute RPO in-window). Partner rule corpora are load-bearing and not regenerable. Adopting Pro also unblocks a persistent staging branch.
| Target | Pre-PITR | Post-PITR |
|---|---|---|
| RPO | re-seedable (alpha) / ~24 h daily snapshot (Pro) | ~2 min in-window |
| RTO | ~60 min (restore + verify) | ~60 min |
Activate Supabase Pro + PITR
Triggers are owned by ADR-040 D5: first partner data; a compromised-credential incident or near-miss; sustained > 10% daily change for > 14 days; or the first hit from a failure that PITR would cover. Customer launch is the floor: adopt before then regardless. When one fires:
- Upgrade the Supabase project to Pro.
- Enable PITR in the dashboard → Database → Backups.
- Update the Posture section above to Post-PITR.
- Run a functional restore drill within 30 days to verify the restore-to-seconds UX.
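Of the D5 triggers listed above, the sustained-change one is the only one that reduces to arithmetic. A minimal sketch of that check, assuming the daily change is available as one fraction per day (how that fraction is measured is not specified here; the function name and parameters are hypothetical):

```python
from typing import Sequence


def sustained_change_trigger(
    daily_change: Sequence[float],
    threshold: float = 0.10,
    min_days: int = 14,
) -> bool:
    """Return True if the change fraction stayed above `threshold`
    for more than `min_days` consecutive days.

    `daily_change` is assumed to hold one fraction per day, oldest first
    (e.g. 0.12 means 12% of the data changed that day).
    """
    run = 0  # length of the current streak of above-threshold days
    for fraction in daily_change:
        run = run + 1 if fraction > threshold else 0
        if run > min_days:
            return True
    return False
```

Both comparisons are strict, matching the "> 10%" and "> 14 days" wording, so a streak of exactly 14 days does not fire.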
Restore playbooks
Self-inflicted (accidental DELETE / bad migration)
- Pre-PITR alpha: if the affected data is regenerable, re-seed: supabase db reset → cold_start_seed.py → the authoring harness. If partner data is present (post-Pro), restore the most recent managed backup, or PITR to just before the damage.
- Post-PITR: dashboard → Database → Backups → Point in Time → a timestamp just before the damage → in-place restore (10–30 min). Verify schema + row counts.
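The row-count half of the verification step can be sketched as a comparison of two count maps. This assumes the counts were collected separately (for example with SELECT count(*) per table) from a known-good reference and from the restored database; the function name is hypothetical and the table names in the usage note are illustrative only.

```python
from typing import Dict, Optional, Tuple


def row_count_drift(
    reference: Dict[str, int],
    restored: Dict[str, int],
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return {table: (reference_count, restored_count)} for every table
    whose counts differ. A table missing on one side shows up as None."""
    drift = {}
    for table in sorted(set(reference) | set(restored)):
        ref_count = reference.get(table)
        new_count = restored.get(table)
        if ref_count != new_count:
            drift[table] = (ref_count, new_count)
    return drift
```

An empty result means the spot-checked tables match. For instance, `row_count_drift({"core.users": 3}, {"core.users": 2})` reports `core.users` with the pair `(3, 2)`.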
Regional outage (Supabase us-east-1 hard down)
- Wait it out if it is expected to resolve within a few hours.
- If extended / business-critical: restore the most recent managed backup to a fresh Supabase project in another region, update the Supabase URL/DSN op refs in infra/environments.toml [production], re-run tools/provision/provision.sh --env production, and redeploy. Communicate the cutover to active partners; re-issue invites (see the core.users note below).
Provider failure / EOL
- Restore a Supabase backup into a managed Postgres elsewhere (Neon, Crunchy Bridge, Cloud SQL). Supabase-managed surfaces (auth.users) do not move cleanly: auth.users does not transfer; users re-authenticate via the new provider’s flow. core.users repopulates from re-invite acceptance; re-issue domain_invites.
- Point FastAPI at the new host; the ADR-039 auth abstraction preserves the AuthContext shape so authz does not need re-architecting.
Compromised service-role credential
- Rotate the credential per secrets-management.md.
- Determine the blast-radius window (when the credential was live; what it could reach).
- Post-PITR: PITR to just before the suspected compromise to undo unauthorized writes. Pre-PITR alpha: re-seed if the affected data is regenerable.
- Log the incident per the secrets-management runbook.
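Picking the PITR timestamp for the steps above comes down to stepping back slightly from the earliest suspected misuse and confirming the result is still recoverable. A minimal sketch, assuming the earliest recoverable time is read off the dashboard by hand; the function name, the `earliest_recoverable` parameter, and the one-minute default margin are all hypothetical:

```python
from datetime import datetime, timedelta, timezone


def pitr_target(
    suspected_at: datetime,
    earliest_recoverable: datetime,
    margin: timedelta = timedelta(minutes=1),
) -> datetime:
    """Return a UTC restore target just before the suspected compromise.

    Raises ValueError if a timestamp is naive or if the target falls
    before the earliest point that can still be restored.
    """
    if suspected_at.tzinfo is None or earliest_recoverable.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    target = (suspected_at - margin).astimezone(timezone.utc)
    if target < earliest_recoverable:
        raise ValueError("target predates the recoverable window")
    return target
```

Rejecting naive timestamps is deliberate: an incident timeline assembled from logs in mixed time zones is an easy way to restore to the wrong side of the damage.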
Accidental schema drop
- Post-PITR: PITR to just before the drop. Pre-PITR alpha: re-seed. After restore, verify the shared core.* tables are intact (worlds and platform fail closed on missing core contracts).
Restore-drill cadence (ADR-040 D4)
| Type | Cadence | Effort |
|---|---|---|
| Functional restore drill | Quarterly (once on Pro) | ~2 h |
| Tabletop review | Monthly | ~15 min |
| Post-migration drill | After any multi-schema migration | ~1 h |
Functional drill: restore the latest managed backup to a throwaway project (dashboard → restore / Duplicate Project), run a schema-checksum comparison across core/worlds/platform, row-count spot-check a few load-bearing tables, tear down within the billing hour, and log the result below.
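One way to do the schema-checksum comparison is to hash a sorted listing of columns from each database and compare the digests. The sketch below assumes the rows come from Postgres's information_schema.columns as (table_schema, table_name, column_name, data_type) tuples; the function name is hypothetical, and this is an illustration rather than the checksum the drill necessarily uses.

```python
import hashlib
from typing import Iterable, Sequence, Tuple

# (table_schema, table_name, column_name, data_type)
Column = Tuple[str, str, str, str]


def schema_checksum(
    columns: Iterable[Column],
    schemas: Sequence[str] = ("core", "worlds", "platform"),
) -> str:
    """Return a SHA-256 hex digest over the columns in `schemas`.

    Rows are sorted first, so the digest does not depend on query order.
    """
    digest = hashlib.sha256()
    for row in sorted(c for c in columns if c[0] in schemas):
        # Unit separator between fields, newline between rows.
        digest.update("\x1f".join(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Run it once against production and once against the throwaway project; equal digests mean the column-level shape of the three schemas matches. It does not cover constraints, indexes, or policies, which would need their own listings.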
First-partner onboarding DR checklist
Execute before a first design-partner co-implementation persists their domain data:
- Supabase Pro + PITR active.
- spectral db export-rules implemented for the partner corpus (ADR-040 D3).
- data/rules/<partner>/ created with a README (partner, onboarding date, regeneration procedure); a hook re-exports on mutation.
- Functional drill within 30 days using their corpus.
Runbook history
Drills, incidents, and activation events, reverse-chronological.
(no entries yet)
Related
- ADR-040: DR decision.
- secrets-management.md: credential rotation behind the compromised-credential playbook.
- ADR-032: single-DB topology informs the restore unit shape.
- ADR-042: retention intersects with the PITR window.
- ADR-039: auth abstraction enabling the provider-failure path.