# Disaster recovery runbook
Alpha operational runbook for Spectral’s DR posture. Implements ADR-040 disciplines. See the ADR for the full decision context.
## Current backup substrate
PITR is not enabled at alpha (see ADR-040 D1 for activation triggers). Daily-snapshot cadence + nightly `pg_dump` covers the alpha posture.
| Target | Value |
|---|---|
| RPO (in-provider) | 24 h (daily snapshot cadence) |
| RTO (in-provider) | 60 min (Duplicate Project restore + verification) |
| RPO (provider-failure) | 24 h (restore from nightly pg_dump) |
| RTO (provider-failure) | Hours-to-a-day, best-effort |
No taxpayer data in scope — the IRS / tax-prep world-agent is a development-demo asset only. Customer-domain data from design-partner co-implementation is the only non-public data that enters the system during alpha.
## What is backed up
| Layer | Cadence | Retention | Source of truth |
|---|---|---|---|
| Supabase daily snapshot | Daily (Supabase-managed) | 7 days (Pro tier) | Supabase platform |
| Nightly `pg_dump` → GCS | Daily 03:17 UTC | 30 days (bucket lifecycle) | `.github/workflows/nightly-backup.yml` (per ADR-040 D2; lands with the deploy substrate) |
| Partner rule corpora (post-partner-landing) | On commit | Unbounded (git history) | `spectral db export-rules` → `data/rules/<partner>/` |
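The nightly job's core pipeline is a single dump → encrypt → upload stream. A minimal sketch, not the authoritative implementation (that lives in `.github/workflows/nightly-backup.yml`): the function name, flags, and object-naming scheme here are assumptions, with credential names mirroring the inventory below.

```shell
# Sketch of the nightly backup pipeline (assumed shape; the authoritative
# version is .github/workflows/nightly-backup.yml).
nightly_backup() {
  db_url="$1" bucket="$2" recipient="$3"
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  # Custom-format dump -> age-encrypt to the public recipient -> stream to GCS.
  pg_dump "$db_url" --format=custom \
    | age -r "$recipient" \
    | gcloud storage cp - "gs://${bucket}/nightly/${ts}.dump.age"
}

# Example:
#   nightly_backup "$SUPABASE_DB_SUPERUSER_URL" "$BACKUP_BUCKET" "$BACKUP_AGE_RECIPIENT"
```

Only the public `BACKUP_AGE_RECIPIENT` key is needed at backup time; the private `BACKUP_AGE_IDENTITY` key stays out of the workflow entirely, as the credential inventory notes.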
## PITR activation triggers
Triggers are owned by ADR-040 D1 — see the ADR for the authoritative list. When any trigger fires, run the activation playbook below.
## PITR activation playbook
- Verify the Small compute add-on (~$5/mo net after Pro credit) is provisioned on the Supabase project.
- Enable 7-day PITR in the Supabase dashboard → Database → Backups. Enabling PITR disables the daily snapshot subsystem (PITR supersedes it).
- Update the “Current backup substrate” section of this runbook to the post-PITR alpha posture: RPO 2 min in-window, RTO 60 min.
- Trigger a functional restore drill within 30 days to verify the restore-to-seconds UX.
- Reconcile nightly `pg_dump` retention with retention decisions per ADR-042 when those land. The 30-day `pg_dump` lifecycle extends beyond the 7-day PITR window by design — late-corruption scenarios need the longer tail.
## Restore playbooks
### Scenario: accidental DELETE or bad migration (self-inflicted)
Pre-PITR mode:
- Identify the damage; decide acceptable data loss against the 24-hour RPO gap.
- If < 24 h of legitimate work would be lost: consider manual replay of the lost writes.
- Otherwise: restore from the most recent nightly `pg_dump` via the provider-failure playbook below (even though the provider is healthy — the mechanics are the same).
Post-PITR mode:
- Identify the exact timestamp just before the damage.
- Supabase dashboard → Database → Backups → Point in Time → pick timestamp → in-place restore.
- Expected downtime: 10-30 min.
- Verify schema + row counts against pre-damage expectations.
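The row-count verification step can be scripted. A minimal sketch, assuming `psql` access to the restored database; the function name and the `table:expected-count` argument convention are invented for illustration:

```shell
# Hypothetical post-restore spot-check: compare row counts in a restored
# database against recorded pre-damage expectations ("table:count" pairs).
verify_row_counts() {
  conn="$1"; shift
  for pair in "$@"; do
    table=${pair%%:*} expected=${pair##*:}
    actual=$(psql "$conn" -At -c "SELECT count(*) FROM ${table};")
    if [ "$actual" -ge "$expected" ]; then
      echo "OK ${table}: ${actual} rows (expected >= ${expected})"
    else
      echo "FAIL ${table}: ${actual} rows (expected >= ${expected})"
      return 1
    fi
  done
}

# Example:
#   verify_row_counts "$RESTORED_DATABASE_URL" core.users:42 core.llm_usage:1000
```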
### Scenario: regional outage (Supabase us-east-1 hard down)
- Wait it out if the outage is expected to resolve within 4 hours (historically most AWS regional incidents resolve within 2 h).
- If extended or business-critical: restore the most recent nightly `pg_dump` to a fresh Supabase project in a different region, then update the `SUPABASE_URL` + `DATABASE_URL` secrets via `tools/provision/setup.sh --mode=rotate`.
- Communicate the cutover to active design partners; re-issue invites per the `core.users` mirror recovery note below.
### Scenario: provider failure / EOL / acquisition-sunset
- Restore the most recent nightly `pg_dump` into a managed Postgres elsewhere (GCP Cloud SQL for Postgres, Neon, Crunchy Bridge).
- Supabase-managed surfaces (`auth.users`, `storage.objects`) do not restore cleanly into a non-Supabase host. Consequences:
  - `auth.users` does not move; users re-authenticate via the new provider’s flow.
  - The `core.users` mirror is populated by re-invite acceptance.
  - `workspace_invites` must be re-issued; accepted invites in the dump are historical and non-actionable on the new host.
- Update FastAPI middleware to point at the new host; JWT verification may shift provider (per ADR-039 D12, the abstraction preserves the `AuthContext` shape).
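One way to avoid the Supabase-managed surfaces during the cross-provider restore is to restore only the app-owned schemas. A sketch, assuming a custom-format dump and the schema names used elsewhere in this runbook (`spectral`, `worlds`, `core`); verify the actual schema list with `pg_restore --list` before running:

```shell
# Restore only the app-owned schemas into a non-Supabase Postgres, skipping
# the Supabase-managed ones (auth, storage). Schema names are assumptions
# drawn from this runbook; confirm against `pg_restore --list <dump>`.
restore_app_schemas() {
  dump="$1" target="$2"
  pg_restore --dbname="$target" --no-owner --no-privileges \
    --schema=spectral --schema=worlds --schema=core \
    "$dump"
}

# Example:
#   restore_app_schemas /path/to/nightly.dump "postgres://user@newhost/db"
```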
### Scenario: compromised service-role credential
- Rotate the credential per the `docs/runbooks/secrets-management.md` emergency-rotation flow.
- Determine the blast-radius window (when was the credential live; what could it reach).
- Post-PITR mode: restore to a timestamp just before the suspected compromise to undo any unauthorized writes.
- Pre-PITR mode: if destructive action occurred, restore from the most recent nightly `pg_dump` (up to 24 h of legitimate work lost).
- Log the incident per the secrets-management runbook.
### Scenario: corruption or bad data found > 7 days after write
The 7-day PITR window does not reach this. Rely on nightly `pg_dump` retention (30 days).
**Late-corruption floor commitment (per ADR-042 D10).** The 30-day `pg_dump` retention is the load-bearing recovery window for any corruption discovered after the PITR cutoff. So long as every workspace-scoped entity’s `RetentionPolicy.active_ttl_days` remains ≥ 30 days, the dump covers the gap. If future retention research shortens any PLATFORM-class `active_ttl_days` below 30 days, the `pg_dump` lifecycle rule in `.github/workflows/nightly-backup.yml` must extend to match: `active_ttl + tombstoned_grace` is the floor, never less. Tracked alongside the ADR-042 D11 forward triggers.
- Locate the last-known-good dump in `gs://<BACKUP_BUCKET>/nightly/`.
- Restore the dump to a throwaway Supabase Duplicate Project via `pg_restore`.
- Spot-check the affected table in the restored clone.
- Extract corrected rows; apply to production via targeted `INSERT ... ON CONFLICT ... UPDATE` or equivalent.
- Do not perform a full restore — days of legitimate writes since the corruption would be lost.
- Tear down the clone within the Supabase billing hour.
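The targeted-upsert step might look like the following sketch. The table and column names (`spectral.rules`, `id`, `body`, `updated_at`) are purely illustrative; adapt them to the affected table:

```shell
# Targeted repair: upsert corrected rows into production without a full
# restore. Table and column names below are hypothetical examples.
repair_rows() {
  prod_url="$1"
  psql "$prod_url" <<'SQL'
-- Corrected values extracted from the restored clone; adapt per incident.
INSERT INTO spectral.rules (id, body, updated_at)
VALUES ('rule-123', 'corrected body', now())
ON CONFLICT (id) DO UPDATE
  SET body = EXCLUDED.body,
      updated_at = EXCLUDED.updated_at;
SQL
}
```

The upsert form is what keeps this surgical: rows untouched by the corruption are never rewritten, so legitimate writes made since the incident survive.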
### Scenario: accidental schema drop
- Pre-PITR mode: restore from the most recent nightly `pg_dump`.
- Post-PITR mode: PITR to just before the drop.
- After restore: verify `core.*` tables (`llm_usage`, `embedding_profile`, `users`) are intact — these are shared contracts; `worlds` and `platform` will fail-closed on missing data.
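The `core.*` verification can be a one-liner per table using Postgres's `to_regclass`. A sketch with an invented function name:

```shell
# Post-restore guard: confirm the shared-contract core.* tables survived.
# to_regclass() returns NULL when the relation does not exist.
verify_core_tables() {
  conn="$1"
  for t in llm_usage embedding_profile users; do
    exists=$(psql "$conn" -At -c \
      "SELECT to_regclass('core.${t}') IS NOT NULL;")
    [ "$exists" = "t" ] || { echo "MISSING core.${t}"; return 1; }
  done
  echo "core.* intact"
}
```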
## Restore-drill cadence (D4)
| Type | Cadence | Effort |
|---|---|---|
| Functional drill | Quarterly | ~2 hours |
| Tabletop review | Monthly | ~15 min |
| Post-migration drill | After any multi-schema migration | ~1 hour |
This cadence aligns with NIST SP 800-34 Rev. 1 (annual functional minimum; more frequent for high-impact systems). AWS Well-Architected REL13-BP02 treats multi-schema migrations as “significant workload changes” requiring a game-day-style drill.
### Quarterly functional drill procedure
- Pick the most recent nightly `pg_dump` from `gs://<BACKUP_BUCKET>/nightly/`.
- Duplicate the staging Supabase project (dashboard → Duplicate Project).
- Decrypt and restore the dump into the clone:

  ```
  gcloud storage cp gs://<BACKUP_BUCKET>/nightly/<ts>.dump.age - \
    | age -d -i <identity-path> \
    | pg_restore -d <clone-connection-string> --no-owner --no-privileges
  ```

- Run a schema checksum comparison across `spectral`, `worlds`, `core`.
- Row-count spot-check five load-bearing tables (e.g., `core.llm_usage`, `core.embedding_profile`, `core.users`, and two tables in `worlds` or `platform` once those schemas are scaffolded).
- Tear down the clone within the Supabase billing hour.
- Log the drill result (pass / fail / findings) in the history section below.
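The schema-checksum comparison can be implemented by hashing schema-only dumps per schema. A sketch; the `pg_dump` flags shown are standard, but the hash-and-compare convention is an assumption:

```shell
# Hash the DDL of one schema so production and the restored clone can be
# compared digest-to-digest. Identical digests => identical schema dumps.
schema_checksum() {
  conn="$1" schema="$2"
  pg_dump "$conn" --schema-only --schema="$schema" \
    | sha256sum | cut -d' ' -f1
}

# Example comparison for one schema:
#   [ "$(schema_checksum "$PROD_URL" spectral)" = \
#     "$(schema_checksum "$CLONE_URL" spectral)" ] && echo "spectral: match"
```

One caveat worth knowing: `pg_dump --schema-only` output can differ across `pg_dump` versions, so run both sides with the same client binary.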
### Tabletop review checklist
- Most recent nightly workflow run is green (check Actions tab).
- Last object in `gs://<BACKUP_BUCKET>/nightly/` is < 26 hours old.
- Backup credential rotations are on schedule (see `secrets-management.md`).
- age identity key still retrievable from GCP Secret Manager (`BACKUP_AGE_IDENTITY`).
- No schema changes since last drill (if yes, schedule a post-migration drill).
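The freshness check (< 26 hours) can be automated. A sketch that assumes GNU `date` and a `gcloud storage ls -l` listing whose lines read "size, timestamp, URL"; verify both assumptions against your environment:

```shell
# Tabletop freshness check: newest object under nightly/ must be < 26 h old.
# Assumes GNU date and `gcloud storage ls -l` printing "<size> <timestamp> <url>".
check_backup_freshness() {
  bucket="$1"
  newest=$(gcloud storage ls -l "gs://${bucket}/nightly/" \
    | awk 'NF >= 3 {print $2}' | sort | tail -n 1)
  [ -n "$newest" ] || { echo "STALE: no backups found"; return 1; }
  age_s=$(( $(date -u +%s) - $(date -u -d "$newest" +%s) ))
  if [ "$age_s" -lt 93600 ]; then   # 26 h = 93600 s
    echo "fresh (${age_s}s old)"
  else
    echo "STALE (${age_s}s old)"; return 1
  fi
}
```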
## First-partner onboarding DR checklist
Execute before a first design-partner co-implementation session persists their domain data:
- Activate PITR per D8 playbook above.
- Implement `spectral db export-rules` for the partner’s corpus schema (D3 concrete implementation lands per the partner-corpus migration epic).
- Create a `data/rules/<partner>/` directory with a README noting the partner, onboarding date, and regeneration procedure.
- Wire a git pre-commit hook (or rule-corpus-migration hook) that re-exports on mutation.
- Run a functional drill within 30 days of partner onboarding using their corpus.
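The re-export hook might look like this sketch. `spectral db export-rules` is the planned D3 command; the `--partner` flag and the function wrapper are assumptions about its eventual interface:

```shell
# Sketch of a pre-commit hook body (run from the repo root): re-export each
# partner's rule corpus so data/rules/<partner>/ tracks the live rules.
reexport_rule_corpora() {
  for partner_dir in data/rules/*/; do
    [ -d "$partner_dir" ] || continue   # no corpora onboarded yet
    partner=$(basename "$partner_dir")
    # --partner is an assumed flag on the planned D3 command.
    spectral db export-rules --partner "$partner" || return 1
    git add "$partner_dir"
  done
}
```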
## Backup credential inventory (D7)
| Credential | Scope | Storage | Used by |
|---|---|---|---|
| `SUPABASE_DB_SUPERUSER_URL` | @scope=staging,production | GCP Secret Manager | Nightly backup workflow (via WIF) |
| `BACKUP_BUCKET` | @scope=shared | GitHub repo variable | Nightly backup workflow |
| `BACKUP_AGE_RECIPIENT` | @scope=shared | GitHub repo variable (public key; not secret) | Nightly backup workflow |
| `BACKUP_AGE_IDENTITY` | @scope=shared | GCP Secret Manager (private key) | Restore operations only — never loaded in nightly workflow |
| `GCP_WORKLOAD_IDENTITY_PROVIDER` | @scope=shared | GitHub repo variable | Nightly backup workflow auth |
| `GCP_BACKUP_SERVICE_ACCOUNT` | @scope=shared | GitHub repo variable | Nightly backup workflow auth |
`SUPABASE_DB_SUPERUSER_URL` runs as the `postgres` superuser (`BYPASSRLS` required for `pg_dump` completeness per Supabase docs) and is held under a distinct service account from the app service-role credential. Bucket-level object versioning + retention lock ensure a compromised write credential cannot delete backup history.
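The versioning + retention-lock hardening can be applied with `gsutil` (newer `gcloud storage buckets update` flags offer equivalents). A sketch; note that locking a retention policy is irreversible, so confirm the period before running:

```shell
# Harden the backup bucket: object versioning plus a locked retention
# period, so a compromised write credential cannot destroy backup history.
harden_backup_bucket() {
  bucket="$1"
  gsutil versioning set on "gs://${bucket}"
  gsutil retention set 30d "gs://${bucket}"   # match the pg_dump lifecycle
  gsutil retention lock "gs://${bucket}"      # IRREVERSIBLE; prompts first
}
```

If the ADR-042 reconciliation ever lengthens the `pg_dump` lifecycle, the retention period here must be extended first, since a locked period can be increased but never shortened.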
## Runbook history
Drills, incidents, and activation events logged in reverse-chronological order.
(no entries yet)
## Related
- ADR-040 — DR substrate (decision context).
- `secrets-management.md` — credential rotation backing the “compromised credential” playbook.
- `.github/workflows/nightly-backup.yml` — nightly `pg_dump` workflow implementing ADR-040 D2.
- ADR-032 — single-DB topology informs restore unit shape.
- ADR-042 — retention policies intersect with PITR + nightly dump retention; late-corruption floor reconciled above.
- ADR-037 — credential provisioning via `tools/provision/setup.sh`.
- ADR-039 — auth abstraction enables the provider-failure-restore path without re-architecting authz.