ADR-040: Baseline disaster recovery and backup posture — regenerability-dominant alpha; PITR triggered
Status: Accepted (2026-04-21) — D2 and D7 partially superseded by ADR-072.
Context
Spectral runs a single Supabase Pro project + pgvector with three schemas (core, worlds, platform) in one DB. US-only NDA alpha. Solo builder; bootstrap-plausible funding; DR comfort band $20–100/month.
A critical clarification from disposition: the IRS / tax-prep world-agent is a development and demo asset, not a planned production surface. No taxpayer data ever in the system. Alpha = solo builder + design partners; the partnership model is landing a partner first, then discovering their domain and co-implementing a world model with them. Partner domain data enters the system during co-implementation, not before.
The landscape survey recommended ~$133/month (PITR 7-day Day 1 + nightly pg_dump + git-backed corpus + quarterly drill). The adversarial pair argued pure defer, citing regenerability-dominant architecture. Synthesis splits along the “first partner co-implementation persisting their data” threshold: cheap controls now (operational value + provider independence); PITR deferred to a named trigger.
Decision
D1 — PITR activation deferred to named triggers
Until one fires, run on Supabase Pro baseline (7-day daily-snapshot retention). Triggers:
- First design-partner co-implementation session persisting their domain data into the system.
- Any compromised-credential near-miss or realized incident.
- Sustained daily change volume >10% of DB size for >14 consecutive days.
- First PITR-covered failure actually hit (restore-from-snapshot vs restore-to-seconds materially different).
PITR cost when activated: ~$105/month (7-day add-on + Small compute floor net). RPO drops from ~24 h to ~2 min within the 7-day window.
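The change-volume trigger is the only one in the list that can be evaluated mechanically. A minimal sketch of that check follows. It assumes a daily series of changed bytes is already being collected somehow (the ADR does not say how change volume is measured) and that a single DB-size figure is a good enough denominator. The function and parameter names are hypothetical.

```python
from typing import Sequence


def change_volume_trigger_fired(
    daily_change_bytes: Sequence[int],
    db_size_bytes: int,
    ratio_threshold: float = 0.10,
    min_consecutive_days: int = 14,
) -> bool:
    """Return True once daily change volume has stayed above the threshold
    for MORE than `min_consecutive_days` days in a row (D1, third trigger)."""
    if db_size_bytes <= 0:
        raise ValueError("db_size_bytes must be positive")

    streak = 0
    for changed in daily_change_bytes:
        if changed / db_size_bytes > ratio_threshold:
            streak += 1
            # ">14 consecutive days" means the 15th qualifying day fires.
            if streak > min_consecutive_days:
                return True
        else:
            # Any day at or below the threshold resets the run.
            streak = 0
    return False
```

The other three triggers are judgment calls by the builder and are not modeled here.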
D2 — Nightly pg_dump backup pipeline lands alpha Day 1
Storage destination superseded by ADR-072. Backup destination is Cloudflare R2 (was GCS); see ADR-072 D1 for the bucket configuration. The pipeline shape (pg_dump -Fc compression, age encryption with recipient key sourced from the runtime secrets backend per ADR-046), schedule, and “covers what PITR structurally cannot” framing — regional outage, provider failure or EOL, corruption found later than the PITR window, vendor-independent recovery — remain authoritative.
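The pipeline shape described above (custom-format dump piped into age) can be sketched as below. This is an illustration only: the real job lives in the GitHub Actions workflow, the function names are hypothetical, and the upload to the bucket is left out because its configuration belongs to ADR-072. The sketch assumes the pg_dump and age binaries are on PATH.

```python
import subprocess
from pathlib import Path


def build_backup_commands(
    db_url: str, age_recipient: str, out_path: Path
) -> tuple[list[str], list[str]]:
    """Build the two argv lists for `pg_dump | age`."""
    # Custom format (-Fc) is compressed by default and restorable with pg_restore.
    # Simplification: a URL on argv is visible in the process list; a real job
    # would more likely pass credentials through the environment.
    pg_dump_cmd = ["pg_dump", "--format=custom", "--dbname", db_url]
    # With no input file argument, age encrypts stdin to the output file.
    age_cmd = ["age", "--recipient", age_recipient, "--output", str(out_path)]
    return pg_dump_cmd, age_cmd


def run_backup(db_url: str, age_recipient: str, out_path: Path) -> None:
    """Stream the dump straight into age so plaintext never touches disk."""
    pg_dump_cmd, age_cmd = build_backup_commands(db_url, age_recipient, out_path)
    dump = subprocess.Popen(pg_dump_cmd, stdout=subprocess.PIPE)
    try:
        age = subprocess.run(age_cmd, stdin=dump.stdout)
    finally:
        if dump.stdout is not None:
            dump.stdout.close()
        dump_rc = dump.wait()
    if dump_rc != 0 or age.returncode != 0:
        # Do not leave a truncated artifact that could be mistaken for a backup.
        out_path.unlink(missing_ok=True)
        raise RuntimeError(
            f"backup failed: pg_dump={dump_rc}, age={age.returncode}"
        )
```

Only the recipient (public) key is needed on the backup side; the matching identity stays wherever restores are performed.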
D3 — Git-backed rule-corpus serialization: stub reserved now, activated per partner
- The IRS world-agent rule corpus is a regenerable dev/demo asset — does not need git-backed serialization.
- The first design-partner corpus built during co-implementation is load-bearing and is NOT regenerable via the synthetic eval-generation path (no internal eval path for partner domains).
- Contract shape lands now: a spectral db export-rules resolver stub that raises NotImplementedError, mirroring the TA-11 contract-shape precedent (Protocol declared, concrete implementation deferred). Concrete implementation lands alongside the first partner-corpus migration.
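A rough sketch of what "Protocol declared, concrete implementation deferred" could look like. Every name here (RuleCorpusExporter, export_rules, resolve_exporter, the parameters) is hypothetical; the ADR fixes only the command name and the NotImplementedError behavior.

```python
from pathlib import Path
from typing import Protocol


class RuleCorpusExporter(Protocol):
    """Contract for serializing one partner's rule corpus into a git-trackable tree."""

    def export_rules(self, partner_id: str, dest_dir: Path) -> list[Path]:
        """Write the corpus under dest_dir and return the files written."""
        ...


class StubRuleCorpusExporter:
    """Placeholder that reserves the shape until the first partner corpus exists."""

    def export_rules(self, partner_id: str, dest_dir: Path) -> list[Path]:
        raise NotImplementedError(
            "export-rules is reserved; implementation lands with the first "
            "partner-corpus migration"
        )


def resolve_exporter() -> RuleCorpusExporter:
    """Resolver used by the CLI entry point; swaps to a real exporter later."""
    return StubRuleCorpusExporter()
```

Callers depend only on the Protocol, so replacing the stub later should not change the command surface.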
D4 — Restore-drill cadence
- Quarterly functional drill: restore the most recent nightly pg_dump to a throwaway Supabase Duplicate Project, schema-checksum compare, row-count spot-check, tear down within the billing hour. Budget ~2 hours/quarter.
- Monthly tabletop: walk the DR runbook mentally; verify secrets still rotate, bucket still writable, workflow still green. ~15 min.
- Mandatory drill after any multi-schema migration: treat as “significant workload change” per AWS Well-Architected REL13-BP02.
NIST SP 800-34 alignment (annual functional minimum; more frequent for high-impact).
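The two verification steps in the quarterly drill could be sketched as below. This assumes the column metadata has already been pulled from each database (for example from information_schema.columns) and that source row counts were recorded at dump time, since live counts will have drifted since then. Function names and input shapes are illustrative, not part of the runbook.

```python
import hashlib
from typing import Iterable, Mapping, Optional

# (schema, table, column, data_type), one row per column.
ColumnRow = tuple[str, str, str, str]


def schema_checksum(columns: Iterable[ColumnRow]) -> str:
    """Order-independent SHA-256 over the column inventory of a database."""
    digest = hashlib.sha256()
    for row in sorted(columns):
        # Unit separator keeps ("ab", "c") distinct from ("a", "bc").
        digest.update("\x1f".join(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def row_count_mismatches(
    source: Mapping[str, int], restored: Mapping[str, int]
) -> dict[str, tuple[Optional[int], Optional[int]]]:
    """Return {table: (source_count, restored_count)} for every table that differs.

    A table missing on one side is reported with None for that side.
    """
    mismatches: dict[str, tuple[Optional[int], Optional[int]]] = {}
    for table in sorted(set(source) | set(restored)):
        src, dst = source.get(table), restored.get(table)
        if src != dst:
            mismatches[table] = (src, dst)
    return mismatches
```

A drill would pass when the two checksums match and the mismatch dictionary is empty for the spot-checked tables.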
D5 — Per-data-class posture
| Class | Alpha posture | Recovery path |
|---|---|---|
| IRS world-agent rule corpus | Daily snapshot only; regenerable dev asset | Re-run synthetic eval-generation |
| Partner rule corpora (once partners land) | Daily snapshot + per-partner git-backed export (D3 activation) | Replay git serialization against fresh project |
| Trace data (platform) | Daily snapshot only; replaceable | Re-scan |
| Memory tiers (pgvector) | Daily snapshot; embeddings regen via the ADR-038 D11 ladder | Re-embed from preserved source |
| core.llm_usage | Daily snapshot + nightly pg_dump for retention beyond 7 d | Restore dump |
| core.users mirror | Daily snapshot; auth.users is Supabase-managed primary | Re-invite users; mirror rebuilds from invite acceptance |
D6 — RPO / RTO targets
- Pre-PITR alpha (current): RPO 24 h (daily snapshot cadence); RTO 60 min (Duplicate Project restore + verification).
- Post-PITR activation (first trigger fires): RPO 2 min; RTO 60 min.
- Provider-failure scenario (regional outage + off-Supabase rebuild from nightly dump): RPO 24 h; RTO best-effort (“hours to a day”). Acceptable pre-revenue; tighten before first revenue-bearing tenant.
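The RPO figures above can be read as "maximum tolerated age of the newest recovery point". A small sketch of that interpretation, useful for example as a monitoring assertion; the mode keys and the function name are hypothetical, and the timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta

# RPO targets from D6, keyed by hypothetical mode names.
RPO_TARGETS = {
    "pre_pitr": timedelta(hours=24),
    "post_pitr": timedelta(minutes=2),
    "provider_failure": timedelta(hours=24),
}


def rpo_breached(mode: str, last_recovery_point: datetime, now: datetime) -> bool:
    """True when the newest recovery point is older than the mode's RPO target."""
    return now - last_recovery_point > RPO_TARGETS[mode]
```

RTO is not modeled because it depends on how long a restore drill takes in practice.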
D7 — Backup credential isolation
Storage-side controls superseded by ADR-072. Bucket-level versioning and retention lock are configured per ADR-072 D1; storage identity is an R2 Object R/W API token rather than a GCP service account. The discipline of “distinct identity from the app service-role” + FTC Safeguards Rule rationale + pg_dump superuser context remains authoritative.
D8 — PITR activation playbook
When a D1 trigger fires:
- Verify Small compute add-on (~$5/month net) is provisioned.
- Enable PITR 7-day in the Supabase dashboard (disables daily snapshots; PITR supersedes).
- Update docs/runbooks/disaster-recovery.md to flip the “Pre-PITR / Post-PITR” sections.
- Trigger a functional drill within 30 days (verify restore-to-seconds UX).
- Review whether nightly pg_dump retention extends beyond the PITR window (it does: 30 d versus 7 d); reconcile with ADR-042 retention decisions.
D9 — Restore runbook as the operational contract
docs/runbooks/disaster-recovery.md is the operational source of truth. Sections:
- DR posture by mode (Pre-PITR alpha / Post-PITR / Provider-failure)
- Restore playbooks by failure scenario (accidental DELETE, schema drop, compromised credential, regional outage, provider EOL, late corruption)
- Quarterly drill checklist
- PITR activation playbook (D8)
- First-partner onboarding checklist (reconfirm D1 trigger; activate D3 per-partner export)
D10 — Known revisit triggers codified
- DB > 50 GB → recompute GCS egress math; consider Nearline tier for dumps > 14 days old.
- First SOC2 engagement conversation → re-evaluate Team tier ($599/month adds SOC2 reports, 14-day daily retention, SSO).
- Restore drills repeatedly painful due to Duplicate Project cost / UX → evaluate Neon branching UX as a migration target (the “re-work not re-architecture” bar is met by the ADR-039 D12 auth abstraction + the ADR-033 session-var RLS).
- Daily WAL volume makes the PITR add-on exceed $200/month → evaluate self-hosted Postgres on managed infra.
- Cross-region replication → deferred to SPEC-302 post-alpha hardening.
Alternatives considered
PITR 7-day on Day 1 (~$105/month). Landscape survey baseline. Rejected for pre-first-partner alpha because the high-probability failures (bad migration, accidental DELETE) are blast-radius-bounded to “solo builder loses a day of their own dev work” — annoying, not fatal. Re-enabled by D1 trigger when data value crosses threshold.
PITR 14 d or 28 d. 2× and 4× the 7-day cost for linear risk improvement. Not defensible at alpha. Re-evaluate at first revenue-bearing tenant.
Team tier ($599/month). Adds SOC2 paperwork (not yet needed) plus 14-day daily snapshot retention (achievable more cheaply via 14-day GCS retention). $574/month delta over Pro buys paperwork, not DR capability.
Neon migration for branching-based DR UX. Real advantage, but migration cost at pre-alpha is wasted motion when no DR pain exists yet. Named revisit trigger (D10).
Weekly-only dump of core.llm_usage plus customer-submission tables (the adversarial Position A-prime). Narrower scope than nightly full-DB dump. Rejected because the cost delta is <$2/month and the operational simplicity of “one backup job, one bucket” beats “three targeted jobs with different schedules.”
Defer nightly pg_dump entirely (pure adversarial Position A). Rejected. Near-zero cost (~$3/month at alpha volume + solo-builder time is just the GH Actions YAML), covers the scenarios PITR structurally cannot, aligns operationally with IRS Pub 4557-adjacent controls even without taxpayer-data scope (FTC Safeguards Rule reasonableness standard for customer-domain data post-partner-landing).
Defer git-backed rule corpus entirely. Rejected in favor of the reserved-stub approach. Reserving the export-command shape follows the ADR-038 resolver precedent; zero code cost; preserves the provider-independent recovery path.
Restore-drill cadence annual (NIST minimum). Rejected. Quarterly functional is only 2 hours/quarter; catches drift far faster than annual.
Consequences
- Alpha monthly DR cost: ~$28/month (Pro base $25 + GCS ~$3). Sits near the low end of the $20–100 comfort band.
- Post-PITR cost: ~$133/month (Pro + PITR 7-day + Small compute + GCS). Sits above the $100 top of the band, but activation is triggered by a revenue-adjacent event.
- GCS bucket becomes a second persistent-state surface alongside Supabase DB and the runtime secrets backend. Minor operational complexity delta; carries forward through the ADR-046 hosting swap because the GCS bucket is decoupled from the compute platform.
- GitHub Actions secret/variable surface expands by the pg_dump credentials, age recipient ref, and GCS service account configuration. Provisioned via tools/provision/setup.sh per ADR-037 D4, @scope=shared.
- core.users mirror recovery is via user re-invite, not DB restore. Row-count rebuilds from workspace_invites acceptance after each user re-auths. Acknowledged UX cost in a provider-failure scenario; documented in the runbook.
- Partner co-implementation onboarding gets a DR checklist item. Step 1 of landing a partner: activate PITR; step 2: activate per-partner git-backed export (D3).
- No new spectral.core surface. DR is operational. No ADR-065 admission triggered.
- ADR-042 (TA-4 retention) reconciles the PITR window, daily-snapshot retention, nightly-dump lifecycle, and core.llm_usage retention.
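The monthly totals quoted in the cost bullets can be reproduced with a few lines of arithmetic. The figures are the approximate ones stated in this ADR (the $105 is the combined PITR 7-day add-on plus net Small compute from D1); the constant and function names are illustrative.

```python
# Approximate monthly figures in USD, as quoted in this ADR.
PRO_BASE = 25
DUMP_STORAGE = 3
PITR_WITH_COMPUTE = 105  # PITR 7-day add-on + Small compute floor, net
COMFORT_BAND = (20, 100)


def band_position(monthly_cost: float, band: tuple[int, int] = COMFORT_BAND) -> str:
    """Classify a monthly cost relative to the DR comfort band."""
    low, high = band
    if monthly_cost < low:
        return "below"
    if monthly_cost > high:
        return "above"
    return "within"


alpha_total = PRO_BASE + DUMP_STORAGE               # 28
post_pitr_total = alpha_total + PITR_WITH_COMPUTE   # 133
```

Calling band_position on each total shows where it falls relative to the $20–100 band from the Context section.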
References
- ADR-065: spectral.core admission discipline (no surface added here)
- ADR-032: single-DB topology that informs backup unit shape
- ADR-037: credential store for backup creds (D7)
- ADR-038: resolver-stub precedent for D3
- ADR-042: TA-4 retention reconciliation
- ADR-046: hosting; GCS dump target unchanged by hosting swap
- TA-2 disposition: SPEC-305 comment 870fdb73
- TA-2 verification: SPEC-305 comment 038837af
- docs/runbooks/disaster-recovery.md: operational contract
- .github/workflows/nightly-backup.yml: nightly pg_dump → age → GCS workflow