Disaster recovery runbook

Operational runbook for Spectral’s DR posture. Implements ADR-040 — see the ADR for the decision context.

Posture

DR relies on Supabase-native managed backups plus point-in-time recovery (PITR); there is no self-run backup pipeline (no pg_dump, no object store, no encryption job). The posture is tiered by data value:

  • Pre-customer alpha (now): the database is regenerable from tools/dev/cold_start_seed.py + the authoring harness; the tax-prep world is a dev/demo asset and no irreplaceable data is in the system. Recovery is re-seed, not restore. (The production project is on Supabase Free.)
  • Before customer launch / first partner data: adopt Supabase Pro → managed daily backups + PITR (~2-minute RPO in-window). Partner rule corpora are load-bearing and not regenerable. Adopting Pro also unblocks a persistent staging branch.

| Target | Pre-PITR | Post-PITR |
|--------|----------|-----------|
| RPO | re-seedable (alpha) / ~24 h daily snapshot (Pro) | ~2 min in-window |
| RTO | ~60 min (restore + verify) | ~60 min |

Activate Supabase Pro + PITR

Triggers are owned by ADR-040 D5: first partner data; a compromised-credential incident or near-miss; a sustained daily change rate above 10% for more than 14 days; or the first failure of a kind that PITR covers. Customer launch is the floor: adopt before then regardless. When one fires:

  1. Upgrade the Supabase project to Pro.
  2. Enable PITR in the dashboard → Database → Backups.
  3. Update the Posture section above to Post-PITR.
  4. Run a functional restore drill within 30 days to verify the restore-to-seconds UX.
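
The sustained-change trigger (more than 10% daily change for more than 14 days) is the only trigger that needs arithmetic. A minimal sketch of that check follows. It assumes "sustained" means consecutive days and that a per-day change fraction is already available. The runbook does not define how that fraction is measured (ADR-040 D5 owns the definition), so the function name and input shape are hypothetical.

```python
from typing import Sequence

CHANGE_THRESHOLD = 0.10  # "> 10% daily change", from the trigger list
SUSTAINED_DAYS = 14      # "> 14 days", from the trigger list


def sustained_change_trigger(
    daily_change_fractions: Sequence[float],
    threshold: float = CHANGE_THRESHOLD,
    min_days: int = SUSTAINED_DAYS,
) -> bool:
    """Return True once the change rate has exceeded `threshold`
    on more than `min_days` consecutive days.

    `daily_change_fractions` is one value per day, oldest first,
    e.g. 0.12 for a day on which 12% of the data changed.
    """
    streak = 0
    for fraction in daily_change_fractions:
        # A day at or below the threshold resets the streak.
        streak = streak + 1 if fraction > threshold else 0
        if streak > min_days:
            return True
    return False
```

Fifteen consecutive days at 11% fires the trigger; fourteen days do not, and neither does any run sitting exactly at 10%.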

Restore playbooks

Self-inflicted (accidental DELETE / bad migration)

  • Pre-PITR alpha: if the affected data is regenerable, re-seed (supabase db reset → cold_start_seed.py → the authoring harness). If partner data is present (post-Pro), restore the most recent managed backup, or PITR to just before the damage.
  • Post-PITR: dashboard → Database → Backups → Point in Time → a timestamp just before the damage → in-place restore (10–30 min). Verify schema + row counts.
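
Picking "a timestamp just before the damage" is easy to get wrong under pressure, especially when the incident time comes from logs in a local timezone. The sketch below normalizes to UTC and subtracts a safety margin. The 30-second default margin and the function name are assumptions, not values from ADR-040; confirm which timezone the dashboard expects before entering the result.

```python
from datetime import datetime, timedelta, timezone


def pitr_target(
    damage_at: datetime,
    margin: timedelta = timedelta(seconds=30),
) -> datetime:
    """Return a UTC restore target `margin` before `damage_at`.

    `damage_at` must be timezone-aware so that a local-time log
    entry is never silently treated as UTC.
    """
    if damage_at.tzinfo is None or damage_at.utcoffset() is None:
        raise ValueError("damage_at must be timezone-aware")
    if margin <= timedelta(0):
        raise ValueError("margin must be positive")
    return (damage_at - margin).astimezone(timezone.utc)


# Example: a bad migration logged at 14:05:10 in a UTC+1 timezone.
incident = datetime(2025, 3, 1, 14, 5, 10, tzinfo=timezone(timedelta(hours=1)))
print(pitr_target(incident).isoformat())  # 2025-03-01T13:04:40+00:00
```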

Regional outage (Supabase us-east-1 hard down)

  • Wait it out if it is expected to resolve within a few hours.
  • If extended / business-critical: restore the most recent managed backup to a fresh Supabase project in another region, update the Supabase URL/DSN op refs in infra/environments.toml [production], re-run tools/provision/provision.sh --env production, and redeploy. Communicate the cutover to active partners; re-issue invites (see the core.users note below).

Provider failure / EOL

  • Restore a Supabase backup into a managed Postgres elsewhere (Neon, Crunchy Bridge, Cloud SQL). Supabase-managed surfaces (auth.users) do not move cleanly:
    • auth.users does not transfer; users re-authenticate via the new provider’s flow.
    • core.users repopulates from re-invite acceptance; re-issue domain_invites.
  • Point FastAPI at the new host; the ADR-039 auth abstraction preserves the AuthContext shape so authz does not need re-architecting.

Compromised service-role credential

  1. Rotate the credential per secrets-management.md.
  2. Determine the blast-radius window (when the credential was live; what it could reach).
  3. Post-PITR: PITR to just before the suspected compromise to undo unauthorized writes. Pre-PITR alpha: re-seed if the affected data is regenerable.
  4. Log the incident per the secrets-management runbook.
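
Step 2 can be made concrete by treating the blast radius as a half-open time window and listing the writes that fall inside it. This is a sketch only: the `BlastWindow` type and the `(timestamp, description)` event shape are hypothetical, and the runbook does not say where write events would be sourced from.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Tuple

# Hypothetical event shape: (when the write happened, what it was).
WriteEvent = Tuple[datetime, str]


@dataclass(frozen=True)
class BlastWindow:
    start: datetime  # earliest moment the credential could have been misused
    end: datetime    # moment the rotation took effect

    def __post_init__(self) -> None:
        if self.end <= self.start:
            raise ValueError("window end must be after window start")

    def contains(self, ts: datetime) -> bool:
        # Half-open interval: writes at the rotation instant are excluded.
        return self.start <= ts < self.end


def writes_in_window(events: Iterable[WriteEvent], window: BlastWindow) -> List[WriteEvent]:
    """Return the events inside the window, oldest first."""
    return sorted(e for e in events if window.contains(e[0]))
```

The earliest event returned is a candidate for "just before the suspected compromise" when choosing the PITR target in step 3.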

Accidental schema drop

  • Post-PITR: PITR to just before the drop. Pre-PITR alpha: re-seed. After restore, verify the shared core.* tables are intact (worlds and platform fail-closed on missing core contracts).
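
The "core.* tables are intact" check can be reduced to a set difference between the tables that should exist and the tables that do. The sketch assumes `core` is a Postgres schema and that the query result is fetched with whatever client is already in use. Only core.users is named in this runbook; every other table name in the example is hypothetical.

```python
from typing import Iterable, List

# Standard information_schema query; run it against the restored database
# and pass the resulting table names as `present`.
CORE_TABLES_SQL = """
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'core' AND table_type = 'BASE TABLE'
"""


def missing_core_tables(expected: Iterable[str], present: Iterable[str]) -> List[str]:
    """Return expected table names that are absent after the restore."""
    return sorted(set(expected) - set(present))


# Example with a hypothetical expected list ("example_contracts" is made up).
expected = ["users", "example_contracts"]
present = ["users"]
print(missing_core_tables(expected, present))  # ['example_contracts']
```

An empty result means every expected table is present; it does not confirm the table contents.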

Restore-drill cadence (ADR-040 D4)

| Type | Cadence | Effort |
|------|---------|--------|
| Functional restore drill | Quarterly (once on Pro) | ~2 h |
| Tabletop review | Monthly | ~15 min |
| Post-migration drill | After any multi-schema migration | ~1 h |

Functional drill: restore the latest managed backup to a throwaway project (dashboard → restore / Duplicate Project), run a schema-checksum comparison across core/worlds/platform, row-count spot-check a few load-bearing tables, tear down within the billing hour, and log the result below.
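
One way to do the schema-checksum comparison and the row-count spot-check is sketched below. It assumes core, worlds, and platform are Postgres schemas and that a checksum over column definitions is an acceptable proxy for "same schema" (it ignores indexes, constraints, and policies). The function names and the example column rows are hypothetical; run the query against both the source and the restored project and compare the two digests.

```python
import hashlib
from typing import Dict, Iterable, Optional, Tuple

# (schema, table, column, data_type), as returned by COLUMNS_SQL.
Column = Tuple[str, str, str, str]

COLUMNS_SQL = """
SELECT table_schema, table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema IN ('core', 'worlds', 'platform')
"""


def schema_checksum(columns: Iterable[Column]) -> str:
    """SHA-256 over the sorted column definitions, so row order
    in the query result does not affect the digest."""
    digest = hashlib.sha256()
    for row in sorted(columns):
        digest.update("\x1f".join(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def row_count_mismatches(
    source: Dict[str, int], restored: Dict[str, int]
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Map each table whose counts differ to (source, restored).
    A table missing on one side shows up with None on that side."""
    tables = sorted(set(source) | set(restored))
    return {
        t: (source.get(t), restored.get(t))
        for t in tables
        if source.get(t) != restored.get(t)
    }
```

Matching digests plus an empty mismatch map are the pass condition to record in the runbook history.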

First-partner onboarding DR checklist

Execute before a first design-partner co-implementation persists their domain data:

  • Supabase Pro + PITR active.
  • spectral db export-rules implemented for the partner corpus (ADR-040 D3).
  • data/rules/<partner>/ created with a README (partner, onboarding date, regeneration procedure); a hook re-exports on mutation.
  • Functional drill within 30 days using their corpus.
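
The data/rules/<partner>/ item can be scripted so the README always carries the three required fields. This is a sketch under assumptions: the README.md filename, the field layout, and the function name are hypothetical, and it does not implement the re-export hook or spectral db export-rules.

```python
from datetime import date
from pathlib import Path


def init_partner_rules_dir(
    repo_root: Path,
    partner: str,
    onboarding_date: date,
    regeneration_procedure: str,
) -> Path:
    """Create data/rules/<partner>/ with a README and return the README path."""
    # Reject names that would escape data/rules/.
    if not partner or "/" in partner or "\\" in partner or partner in {".", ".."}:
        raise ValueError(f"invalid partner name: {partner!r}")

    partner_dir = repo_root / "data" / "rules" / partner
    partner_dir.mkdir(parents=True, exist_ok=True)

    readme = partner_dir / "README.md"  # filename is an assumption
    readme.write_text(
        f"Partner: {partner}\n"
        f"Onboarding date: {onboarding_date.isoformat()}\n"
        f"Regeneration procedure: {regeneration_procedure}\n",
        encoding="utf-8",
    )
    return readme
```

Calling it a second time for the same partner overwrites the README rather than failing, which keeps the checklist step repeatable.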

Runbook history

Drills, incidents, and activation events, reverse-chronological.

(no entries yet)

Related

  • ADR-040 — DR decision.
  • secrets-management.md — credential rotation behind the compromised-credential playbook.
  • ADR-032 — single-DB topology informs the restore unit shape.
  • ADR-042 — retention intersects with the PITR window.
  • ADR-039 — auth abstraction enabling the provider-failure path.

External references