
Disaster recovery runbook

Alpha operational runbook for Spectral’s DR posture. Implements ADR-040 disciplines. See the ADR for the full decision context.

Current backup substrate

PITR is not enabled at alpha (see ADR-040 D1 for activation triggers). Daily-snapshot cadence + nightly pg_dump covers the alpha posture.

| Target | Value |
| --- | --- |
| RPO (in-provider) | 24 h (daily snapshot cadence) |
| RTO (in-provider) | 60 min (Duplicate Project restore + verification) |
| RPO (provider-failure) | 24 h (restore from nightly pg_dump) |
| RTO (provider-failure) | Hours-to-a-day, best-effort |

No taxpayer data is in scope: the IRS / tax-prep world-agent is a development demo asset only. Customer-domain data from design-partner co-implementation is the only non-public data that enters the system during alpha.

What is backed up

| Layer | Cadence | Retention | Source of truth |
| --- | --- | --- | --- |
| Supabase daily snapshot | Daily (Supabase-managed) | 7 days (Pro tier) | Supabase platform |
| Nightly pg_dump → GCS | Daily 03:17 UTC | 30 days (bucket lifecycle) | .github/workflows/nightly-backup.yml (per ADR-040 D2; lands with the deploy substrate) |
| Partner rule corpora (post-partner-landing) | On commit | Unbounded (git history) | spectral db export-rules → data/rules/<partner>/ |

PITR activation triggers

Triggers are owned by ADR-040 D1 — see the ADR for the authoritative list. When any trigger fires, run the activation playbook below.

PITR activation playbook

  1. Verify the Small compute add-on (~$5 /mo net after Pro credit) is provisioned on the Supabase project.
  2. Enable PITR 7-day in the Supabase dashboard → Database → Backups. Enabling PITR disables the daily snapshot subsystem (PITR supersedes it).
  3. Update the “Current backup substrate” section of this runbook to the post-PITR alpha posture: RPO 2 min in-window, RTO 60 min.
  4. Trigger a functional restore drill within 30 days to verify the restore-to-seconds UX.
  5. Reconcile nightly pg_dump retention with retention decisions per ADR-042 when those land. The 30-day pg_dump lifecycle extends beyond the 7-day PITR window by design — late-corruption scenarios need the longer tail.

Restore playbooks

Scenario: accidental DELETE or bad migration (self-inflicted)

Pre-PITR mode:

  • Identify the damage; decide acceptable data loss against the 24-hour RPO gap.
  • If < 24 h of legitimate work would be lost: consider manual replay of the lost writes.
  • Otherwise: restore from most recent nightly pg_dump via the provider-failure playbook below (even though provider is healthy — the mechanics are the same).

Post-PITR mode:

  • Identify the exact timestamp just before the damage.
  • Supabase dashboard → Database → Backups → Point in Time → pick timestamp → in-place restore.
  • Expected downtime: 10-30 min.
  • Verify schema + row counts against pre-damage expectations.

Scenario: regional outage (Supabase us-east-1 hard down)

  • Wait it out if the outage is expected to resolve within 4 hours (historically most AWS regional incidents resolve within 2 h).
  • If extended or business-critical: restore the most recent nightly pg_dump to a fresh Supabase project in a different region, then update the SUPABASE_URL + DATABASE_URL secrets via tools/provision/setup.sh --mode=rotate.
  • Communicate the cutover to active design partners; re-issue invites per the core.users mirror recovery note below.

Scenario: provider failure / EOL / acquisition-sunset

  • Restore the most recent nightly pg_dump into a managed Postgres elsewhere (GCP Cloud SQL for Postgres, Neon, Crunchy Bridge).
  • Supabase-managed surfaces (auth.users, storage.objects) do not restore cleanly into a non-Supabase host. Consequence:
    • auth.users does not move; users re-authenticate via the new provider’s flow.
    • core.users mirror is populated by re-invite acceptance.
    • workspace_invites must be re-issued; accepted invites in the dump are historical and non-actionable on the new host.
  • Update FastAPI middleware to point at the new host; JWT verification may shift provider (the ADR-039 D12 abstraction preserves the AuthContext shape).

Scenario: compromised service-role credential

  1. Rotate the credential per docs/runbooks/secrets-management.md emergency-rotation flow.
  2. Determine blast-radius window (when was the credential live; what could it reach).
  3. Post-PITR mode: restore to a timestamp just before the suspected compromise to undo any unauthorized writes.
  4. Pre-PITR mode: if destructive action occurred, restore from the most recent nightly pg_dump (up to 24 h of legitimate work lost).
  5. Log the incident per the secrets-management runbook.

Scenario: corruption or bad data found > 7 days after write

The 7-day PITR window does not reach this. Rely on nightly pg_dump retention (30 days).

Late-corruption floor commitment (per ADR-042 D10). The 30-day pg_dump retention is the load-bearing recovery window for any corruption discovered after the PITR cutoff. So long as every workspace-scoped entity’s RetentionPolicy.active_ttl_days remains ≥ 30 days, the dump covers the gap. If future retention research shortens any PLATFORM-class active_ttl_days below 30 days, the pg_dump lifecycle rule in .github/workflows/nightly-backup.yml must extend to match: active_ttl + tombstoned_grace is the floor, never less. Tracked alongside the ADR-042 D11 forward triggers.
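The floor invariant above can be encoded as a small pre-flight check. A sketch only: the 30-day lifecycle value is from this runbook, but the policy list and the exact field names (`active_ttl_days`, `tombstoned_grace_days`) are illustrative stand-ins for the ADR-042 RetentionPolicy shape.

```python
# Sketch of the ADR-042 D10 floor check: the nightly pg_dump bucket lifecycle
# must never be shorter than any workspace-scoped entity's
# active_ttl + tombstoned_grace. Policy entries here are hypothetical.

PG_DUMP_LIFECYCLE_DAYS = 30  # bucket lifecycle rule in nightly-backup.yml

def required_dump_floor_days(policies):
    """Smallest lifecycle (in days) that still covers every retention policy."""
    return max(p["active_ttl_days"] + p["tombstoned_grace_days"] for p in policies)

policies = [
    {"entity": "workspace_doc", "active_ttl_days": 30, "tombstoned_grace_days": 0},
    {"entity": "rule_draft", "active_ttl_days": 14, "tombstoned_grace_days": 7},
]

floor = required_dump_floor_days(policies)
assert PG_DUMP_LIFECYCLE_DAYS >= floor, f"Extend the pg_dump lifecycle: floor is {floor} days"
print(f"pg_dump lifecycle {PG_DUMP_LIFECYCLE_DAYS}d covers a floor of {floor}d")
```

A check like this could run in CI next to nightly-backup.yml so a retention-policy change that undercuts the floor fails loudly instead of silently shrinking the recovery window.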

  1. Locate the last-known-good dump in gs://<BACKUP_BUCKET>/nightly/.
  2. Restore the dump to a throwaway Supabase Duplicate Project via pg_restore.
  3. Spot-check the affected table in the restored clone.
  4. Extract corrected rows; apply to production via targeted INSERT ... ON CONFLICT ... UPDATE or equivalent.
  5. Do not perform a full restore — days of legitimate writes since the corruption would be lost.
  6. Tear down the clone within the Supabase billing hour.
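The targeted upsert in step 4 can be templated rather than hand-typed under incident pressure. A sketch: the table and column names are hypothetical, and real values should go through parameter binding (psql variables or a driver), never string interpolation.

```python
# Render a targeted upsert for rows recovered from the restored clone.
# Table/columns are hypothetical examples; feed the rendered statement to
# psql or a driver with bound parameters for the actual row values.

def render_upsert(table, key, columns):
    """Build an INSERT ... ON CONFLICT ... DO UPDATE statement skeleton."""
    cols = ", ".join(columns)
    placeholders = ", ".join(f"%({c})s" for c in columns)
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key)
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
        f"ON CONFLICT ({key}) DO UPDATE SET {updates}"
    )

print(render_upsert("worlds.rule", "id", ["id", "body", "updated_at"]))
```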

Scenario: accidental schema drop

  • Pre-PITR mode: restore from the most recent nightly pg_dump.
  • Post-PITR mode: PITR to just before the drop.
  • After restore: verify core.* tables (llm_usage, embedding_profile, users) are intact — these are shared contracts; worlds and platform will fail-closed on missing data.
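The core.* verification in the last bullet can be scripted. A sketch: only the three table names come from this runbook; the `restored_tables` listing would in practice be pulled from `information_schema.tables`, and that plumbing is left out.

```python
# Post-restore sanity check: the shared-contract tables in core.* must exist,
# or worlds/platform will fail closed. Input rows would come from a query
# against information_schema.tables on the restored database.

REQUIRED_CORE_TABLES = {"llm_usage", "embedding_profile", "users"}

def missing_core_tables(restored_tables):
    """Return required core.* tables absent from the restored database."""
    present = {table for (schema, table) in restored_tables if schema == "core"}
    return sorted(REQUIRED_CORE_TABLES - present)

restored = [("core", "llm_usage"), ("core", "users"), ("spectral", "runs")]
print(missing_core_tables(restored))  # embedding_profile is missing here
```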

Restore-drill cadence (D4)

| Type | Cadence | Effort |
| --- | --- | --- |
| Functional drill | Quarterly | ~2 hours |
| Tabletop review | Monthly | ~15 min |
| Post-migration drill | After any multi-schema migration | ~1 hour |

This cadence aligns with NIST SP 800-34 Rev. 1 (annual functional drills as the minimum, with more frequent drills for high-impact systems). AWS Well-Architected REL13-BP02 treats multi-schema migrations as “significant workload changes” that warrant a game-day-style drill.

Quarterly functional drill procedure

  1. Pick the most recent nightly pg_dump from gs://<BACKUP_BUCKET>/nightly/.
  2. Duplicate the staging Supabase project (dashboard → Duplicate Project).
  3. Decrypt and restore the dump into the clone:
    gcloud storage cp gs://<BACKUP_BUCKET>/nightly/<ts>.dump.age - \
      | age -d -i <identity-path> \
      | pg_restore -d <clone-connection-string> --no-owner --no-privileges
  4. Run schema checksum comparison across spectral, worlds, core.
  5. Row-count spot-check five load-bearing tables (e.g., core.llm_usage, core.embedding_profile, core.users, and two tables in worlds or platform once those schemas are scaffolded).
  6. Tear down the clone within the Supabase billing hour.
  7. Log the drill result (pass / fail / findings) in the history section below.
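Step 4's checksum comparison can be as simple as an order-independent digest over the column catalog. A sketch, assuming the (schema, table, column, data_type) rows are fetched from `information_schema.columns` on both production and the clone; the query itself is omitted.

```python
import hashlib

def schema_checksum(columns):
    """Order-independent digest of (schema, table, column, data_type) rows."""
    canonical = "\n".join(";".join(row) for row in sorted(columns))
    return hashlib.sha256(canonical.encode()).hexdigest()

prod = [("core", "users", "id", "uuid"), ("core", "users", "email", "text")]
clone = [("core", "users", "email", "text"), ("core", "users", "id", "uuid")]
assert schema_checksum(prod) == schema_checksum(clone)  # row order is irrelevant
```

Run it once per schema (spectral, worlds, core) so a mismatch points at the schema that diverged rather than a single opaque digest for the whole database.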

Tabletop review checklist

  • Most recent nightly workflow run is green (check Actions tab).
  • Last object in gs://<BACKUP_BUCKET>/nightly/ is < 26 hours old.
  • Backup credential rotations are on schedule (see secrets-management.md).
  • age identity key still retrievable from GCP Secret Manager (BACKUP_AGE_IDENTITY).
  • No schema changes since last drill (if yes, schedule a post-migration drill).
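The 26-hour freshness check lends itself to automation once the newest object's timestamp is in hand (e.g. parsed from a gcloud storage listing). Only the threshold below comes from this runbook; the helper itself is a sketch.

```python
from datetime import datetime, timedelta, timezone

# 24 h cadence plus a 2 h grace window for workflow jitter (26 h is the
# tabletop-review threshold from this runbook).
FRESHNESS_LIMIT = timedelta(hours=26)

def dump_is_fresh(newest_iso, now=None):
    """True if the newest nightly dump object is under 26 hours old."""
    now = now or datetime.now(timezone.utc)
    newest = datetime.fromisoformat(newest_iso)
    return (now - newest) < FRESHNESS_LIMIT

now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
print(dump_is_fresh("2025-01-02T03:17:00+00:00", now))  # True: ~9 h old
print(dump_is_fresh("2024-12-31T03:17:00+00:00", now))  # False: > 26 h old
```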

First-partner onboarding DR checklist

Execute before a first design-partner co-implementation session persists their domain data:

  • Activate PITR per the D8 activation playbook above.
  • Implement spectral db export-rules for the partner’s corpus schema (D3 concrete implementation lands per the partner-corpus migration epic).
  • Create data/rules/<partner>/ directory with a README noting the partner, onboarding date, and regeneration procedure.
  • Wire a git pre-commit hook (or rule-corpus-migration hook) that re-exports on mutation.
  • Run a functional drill within 30 days of partner onboarding using their corpus.

Backup credential inventory (D7)

| Credential | Scope | Storage | Used by |
| --- | --- | --- | --- |
| SUPABASE_DB_SUPERUSER_URL | staging, production | GCP Secret Manager | Nightly backup workflow (via WIF) |
| BACKUP_BUCKET | shared | GitHub repo variable | Nightly backup workflow |
| BACKUP_AGE_RECIPIENT | shared | GitHub repo variable (public key; not secret) | Nightly backup workflow |
| BACKUP_AGE_IDENTITY | shared | GCP Secret Manager (private key) | Restore operations only; never loaded in the nightly workflow |
| GCP_WORKLOAD_IDENTITY_PROVIDER | shared | GitHub repo variable | Nightly backup workflow auth |
| GCP_BACKUP_SERVICE_ACCOUNT | shared | GitHub repo variable | Nightly backup workflow auth |

SUPABASE_DB_SUPERUSER_URL runs as the postgres superuser (BYPASSRLS required for pg_dump completeness per Supabase docs) and is held under a distinct service account from the app service-role credential. Bucket-level object versioning + retention lock ensure a compromised write credential cannot delete backup history.

Runbook history

Drills, incidents, and activation events logged in reverse-chronological order.

(no entries yet)

Related documents

  • ADR-040 — DR substrate (decision context).
  • secrets-management.md — credential rotation backing the “compromised credential” playbook.
  • .github/workflows/nightly-backup.yml — nightly pg_dump workflow implementing ADR-040 D2.
  • ADR-032 — single-DB topology informs restore unit shape.
  • ADR-042 — retention policies intersect with PITR + nightly dump retention; late-corruption floor reconciled above.
  • ADR-037 — credential provisioning via tools/provision/setup.sh.
  • ADR-039 — auth abstraction enables the provider-failure-restore path without re-architecting authz.

External references

  • NIST SP 800-34 Rev. 1 — Contingency Planning Guide for Federal Information Systems (drill-cadence baseline).
  • AWS Well-Architected Framework, Reliability Pillar REL13-BP02 — drill after significant workload changes.