ADR-040: Baseline disaster recovery and backup posture — regenerability-dominant alpha; PITR triggered

Status: Accepted (2026-04-21) — D2 and D7 partially superseded by ADR-072.

Context

Spectral runs a single Supabase Pro project + pgvector with three schemas (core, worlds, platform) in one DB. US-only NDA alpha. Solo builder; bootstrap-plausible funding; DR comfort band $20–100/month.

A critical clarification from disposition: the IRS / tax-prep world-agent is a development and demo asset, not a planned production surface. No taxpayer data ever enters the system. Alpha = solo builder + design partners; the partnership model is to land a partner first, then discover their domain and co-implement a world model with them. Partner domain data enters the system during co-implementation, not before.

The landscape survey recommended ~$133/month (PITR 7-day Day 1 + nightly pg_dump + git-backed corpus + quarterly drill). The adversarial pair argued pure defer, citing regenerability-dominant architecture. Synthesis splits along the “first partner co-implementation persisting their data” threshold: cheap controls now (operational value + provider independence); PITR deferred to a named trigger.

Decision

D1 — PITR activation deferred to named triggers

Until one fires, run on Supabase Pro baseline (7-day daily-snapshot retention). Triggers:

  • First design-partner co-implementation session persisting their domain data into the system.
  • Any compromised-credential near-miss or realized incident.
  • Sustained daily change volume >10% of DB size for >14 consecutive days.
  • First failure actually hit that PITR would have covered (restore-from-snapshot and restore-to-seconds outcomes materially different).
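
The change-volume trigger is the only one of the four that can be checked mechanically. A minimal sketch of that check follows, assuming one measurement per day is already being collected somewhere; the `DailySample` type, its field names, and the function name are hypothetical and not part of the Spectral codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailySample:
    """One day's measurement, both fields in bytes (hypothetical schema)."""
    changed_bytes: int
    db_size_bytes: int


def change_volume_trigger_fired(
    samples: list[DailySample],
    ratio_threshold: float = 0.10,
    min_consecutive_days: int = 14,
) -> bool:
    """Return True if daily change volume exceeded ratio_threshold of DB size
    for MORE than min_consecutive_days days in a row (samples oldest first)."""
    streak = 0
    for sample in samples:
        over = (
            sample.db_size_bytes > 0
            and sample.changed_bytes / sample.db_size_bytes > ratio_threshold
        )
        if over:
            streak += 1
            if streak > min_consecutive_days:
                return True
        else:
            # Any day at or under the threshold resets the streak.
            streak = 0
    return False
```

Both comparisons are strict, matching the ">10%" and ">14 consecutive days" wording, so exactly 14 qualifying days does not fire the trigger.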

PITR cost when activated: ~$105/month (7-day add-on + Small compute floor net). RPO drops from ~24 h to ~2 min within the 7-day window.

D2 — Nightly pg_dump backup pipeline lands alpha Day 1

Storage destination superseded by ADR-072. Backup destination is Cloudflare R2 (was GCS); see ADR-072 D1 for the bucket configuration. The pipeline shape (pg_dump -Fc compression, age encryption with recipient key sourced from the runtime secrets backend per ADR-046), schedule, and “covers what PITR structurally cannot” framing — regional outage, provider failure or EOL, corruption found later than the PITR window, vendor-independent recovery — remain authoritative.
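
The dump-then-encrypt portion of that pipeline shape can be sketched as follows. This is an illustration under stated assumptions, not the contents of the actual workflow: the function names, the output naming scheme, and the example connection string are hypothetical, and the upload step is omitted because the bucket configuration lives in ADR-072. Only well-known flags are used: `pg_dump -Fc` (custom archive format) and `age -r` / `-o` (recipient and output file).

```python
import shlex
from datetime import date


def build_backup_commands(
    db_url: str, age_recipient: str, day: date
) -> tuple[list[str], list[str], str]:
    """Return (pg_dump argv, age argv, output filename) for one nightly run."""
    out_name = f"spectral-{day.isoformat()}.dump.age"  # hypothetical naming scheme
    # -Fc selects pg_dump's custom archive format, which is compressed by default.
    pg_dump_cmd = ["pg_dump", "-Fc", "--dbname", db_url]
    # age reads stdin when no input file is given; -r encrypts to a recipient key.
    age_cmd = ["age", "-r", age_recipient, "-o", out_name]
    return pg_dump_cmd, age_cmd, out_name


def as_shell_pipeline(pg_dump_cmd: list[str], age_cmd: list[str]) -> str:
    """Render the two commands as a single quoted shell pipeline."""
    return f"{shlex.join(pg_dump_cmd)} | {shlex.join(age_cmd)}"


if __name__ == "__main__":
    dump, encrypt, name = build_backup_commands(
        "postgresql://backup@db.example/postgres",  # placeholder URL
        "age1example",                              # placeholder recipient
        date(2026, 4, 21),
    )
    print(as_shell_pipeline(dump, encrypt))
```

Piping straight into `age` means the plaintext dump never touches the runner's disk, which is one reasonable way to honor the encryption requirement.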

D3 — Git-backed rule-corpus serialization: stub reserved now, activated per partner

  • The IRS world-agent rule corpus is a regenerable dev/demo asset — does not need git-backed serialization.
  • The first design-partner corpus built during co-implementation is load-bearing and is NOT regenerable via the synthetic eval-generation path (no internal eval path for partner domains).
  • Contract shape lands now: spectral db export-rules resolver stub that raises NotImplementedError, mirroring the TA-11 contract-shape precedent (Protocol declared, concrete implementation deferred). Concrete implementation lands alongside the first partner-corpus migration.
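
The "Protocol declared, concrete implementation deferred" shape might look like the sketch below. The class names, method signature, and error message are hypothetical; the source specifies only that the stub raises NotImplementedError.

```python
from pathlib import Path
from typing import Protocol


class RuleCorpusExporter(Protocol):
    """Contract for serializing one partner's rule corpus to a git-trackable directory."""

    def export_rules(self, partner_id: str, dest: Path) -> list[Path]:
        """Write the corpus under dest and return the files written."""
        ...


class StubRuleCorpusExporter:
    """Reserved stub: matches the Protocol structurally, implements nothing yet."""

    def export_rules(self, partner_id: str, dest: Path) -> list[Path]:
        raise NotImplementedError(
            "export-rules is reserved; the implementation lands with the "
            "first partner-corpus migration"
        )


# Callers can depend on the Protocol type today and swap in a real exporter later.
exporter: RuleCorpusExporter = StubRuleCorpusExporter()
```

Because `Protocol` uses structural typing, a later concrete exporter needs no inheritance from the stub; it only has to provide a matching `export_rules` method.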

D4 — Restore-drill cadence

  • Quarterly functional drill — restore the most recent nightly pg_dump to a throwaway Supabase Duplicate Project, schema-checksum compare, row-count spot-check, tear down within billing hour. Budget ~2 hours/quarter.
  • Monthly tabletop — walk the DR runbook mentally; verify secrets still rotate, bucket still writable, workflow still green. ~15 min.
  • Mandatory drill after any multi-schema migration — treat as “significant workload change” per AWS Well-Architected REL13-BP02.
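
The two verification steps of the quarterly drill, schema-checksum compare and row-count spot-check, can be sketched as pure functions. This assumes the DDL statements and per-table row counts have already been collected from both sides, with the source counts recorded at dump time; the function names and input shapes are hypothetical.

```python
from __future__ import annotations

import hashlib


def schema_checksum(ddl_statements: list[str]) -> str:
    """Order-insensitive SHA-256 over whitespace-normalized DDL statements."""
    normalized = sorted(" ".join(stmt.split()) for stmt in ddl_statements)
    digest = hashlib.sha256()
    for stmt in normalized:
        digest.update(stmt.encode("utf-8"))
        digest.update(b"\n")  # separator so statement boundaries affect the hash
    return digest.hexdigest()


def row_count_mismatches(
    source: dict[str, int], restored: dict[str, int]
) -> dict[str, tuple[int | None, int | None]]:
    """Return tables whose counts differ, or that exist on only one side."""
    tables = source.keys() | restored.keys()
    return {
        table: (source.get(table), restored.get(table))
        for table in sorted(tables)
        if source.get(table) != restored.get(table)
    }
```

A drill passes this check when the two checksums are equal and `row_count_mismatches` returns an empty dict for the spot-checked tables.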

This cadence aligns with NIST SP 800-34 (annual functional test as the minimum; more frequent for high-impact systems).

D5 — Per-data-class posture

| Class | Alpha posture | Recovery path |
|---|---|---|
| IRS world-agent rule corpus | Daily snapshot only; regenerable dev asset | Re-run synthetic eval-generation |
| Partner rule corpora (once partners land) | Daily snapshot + per-partner git-backed export (D3 activation) | Replay git serialization against fresh project |
| Trace data (platform) | Daily snapshot only; replaceable | Re-scan |
| Memory tiers (pgvector) | Daily snapshot; embeddings regen via the ADR-038 D11 ladder | Re-embed from preserved source |
| core.llm_usage | Daily snapshot + nightly pg_dump for retention beyond 7 d | Restore dump |
| core.users mirror | Daily snapshot; auth.users is Supabase-managed primary | Re-invite users; mirror rebuilds from invite acceptance |

D6 — RPO / RTO targets

  • Pre-PITR alpha (current): RPO 24 h (daily snapshot cadence); RTO 60 min (Duplicate Project restore + verification).
  • Post-PITR activation (first trigger fires): RPO 2 min; RTO 60 min.
  • Provider-failure scenario (regional outage + off-Supabase rebuild from nightly dump): RPO 24 h; RTO best-effort (“hours to a day”). Acceptable pre-revenue; tighten before first revenue-bearing tenant.
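
To make the RPO figures concrete: the data-loss window is the time between the last recovery point and the failure, and a scenario meets its target when that window does not exceed the RPO. The sketch below encodes the D6 targets; the mode keys and function names are hypothetical labels.

```python
from datetime import datetime, timedelta

# RPO targets copied from D6; the dictionary keys are hypothetical labels.
RPO_TARGETS = {
    "pre_pitr": timedelta(hours=24),
    "post_pitr": timedelta(minutes=2),
    "provider_failure": timedelta(hours=24),
}


def data_loss_window(last_recovery_point: datetime, failure_at: datetime) -> timedelta:
    """Time span of writes that cannot be recovered."""
    return failure_at - last_recovery_point


def meets_rpo(mode: str, last_recovery_point: datetime, failure_at: datetime) -> bool:
    """True when the unrecoverable window is within the mode's RPO target."""
    return data_loss_window(last_recovery_point, failure_at) <= RPO_TARGETS[mode]


if __name__ == "__main__":
    snapshot = datetime(2026, 4, 21, 2, 0)   # illustrative snapshot time
    failure = datetime(2026, 4, 21, 20, 0)   # illustrative failure time
    print(data_loss_window(snapshot, failure))  # 18:00:00
```

With a snapshot 18 hours before the failure, the pre-PITR target is met but the post-PITR target is not, which is the gap the D1 triggers are meant to close.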

D7 — Backup credential isolation

Storage-side controls superseded by ADR-072. Bucket-level versioning and retention lock are configured per ADR-072 D1; storage identity is an R2 Object R/W API token rather than a GCP service account. The discipline of “distinct identity from the app service-role” + FTC Safeguards Rule rationale + pg_dump superuser context remains authoritative.

D8 — PITR activation playbook

When a D1 trigger fires:

  1. Verify Small compute add-on (~$5/month net) is provisioned.
  2. Enable PITR 7-day in the Supabase dashboard (disables daily snapshots; PITR supersedes).
  3. Update docs/runbooks/disaster-recovery.md to flip the “Pre-PITR / Post-PITR” sections.
  4. Trigger a functional drill within 30 days (verify restore-to-seconds UX).
  5. Review whether nightly pg_dump retention extends beyond the PITR window (it does — 30 d versus 7 d); reconcile with ADR-042 retention decisions.

D9 — Restore runbook as the operational contract

docs/runbooks/disaster-recovery.md is the operational source of truth. Sections:

  • DR posture by mode (Pre-PITR alpha / Post-PITR / Provider-failure)
  • Restore playbooks by failure scenario (accidental DELETE, schema drop, compromised credential, regional outage, provider EOL, late corruption)
  • Quarterly drill checklist
  • PITR activation playbook (D8)
  • First-partner onboarding checklist (reconfirm D1 trigger; activate D3 per-partner export)

D10 — Known revisit triggers codified

  • DB > 50 GB → recompute GCS egress math; consider Nearline tier for dumps > 14 days old.
  • First SOC2 engagement conversation → re-evaluate Team tier ($599/month adds SOC2 reports, 14-day daily retention, SSO).
  • Restore drills repeatedly painful due to Duplicate Project cost / UX → evaluate Neon branching UX as a migration target (the “re-work not re-architecture” bar is met by the ADR-039 D12 auth abstraction + the ADR-033 session-var RLS).
  • Daily WAL volume makes the PITR add-on exceed $200/month → evaluate self-hosted Postgres on managed infra.
  • Cross-region replication → deferred to SPEC-302 post-alpha hardening.

Alternatives considered

PITR 7-day on Day 1 (~$105/month). Landscape survey baseline. Rejected for pre-first-partner alpha because the high-probability failures (bad migration, accidental DELETE) are blast-radius-bounded to “solo builder loses a day of their own dev work” — annoying, not fatal. Re-enabled by D1 trigger when data value crosses threshold.

PITR 14 d or 28 d. 2× and 4× the 7-day cost for linear risk improvement. Not defensible at alpha. Re-evaluate at first revenue-bearing tenant.

Team tier ($599/month). Adds SOC2 paperwork (not yet needed) plus 14-day daily snapshot retention (achievable more cheaply via 14-day GCS retention). $574/month delta over Pro buys paperwork, not DR capability.

Neon migration for branching-based DR UX. Real advantage, but migration cost at pre-alpha is wasted motion when no DR pain exists yet. Named revisit trigger (D10).

Weekly-only dump of core.llm_usage plus customer-submission tables (the adversarial Position A-prime). Narrower scope than nightly full-DB dump. Rejected because the cost delta is <$2/month and the operational simplicity of “one backup job, one bucket” beats “three targeted jobs with different schedules.”

Defer nightly pg_dump entirely (pure adversarial Position A). Rejected. The pipeline costs almost nothing (~$3/month at alpha volume; the only solo-builder time is writing the GH Actions YAML), covers the scenarios PITR structurally cannot, and aligns operationally with IRS Pub 4557-adjacent controls even without taxpayer-data scope (FTC Safeguards Rule reasonableness standard for customer-domain data post-partner-landing).

Defer git-backed rule corpus entirely. Rejected in favor of the reserved-stub approach. Reserving the export-command shape follows the ADR-038 resolver precedent, has zero code cost, and preserves the provider-independent recovery path.

Restore-drill cadence annual (NIST minimum). Rejected. Quarterly functional is only 2 hours/quarter; catches drift far faster than annual.

Consequences

  • Alpha monthly DR cost: ~$28/month (Pro base $25 + GCS ~$3). Sits comfortably within the $20–100 comfort band, near its lower end.
  • Post-PITR cost: ~$133/month (Pro + PITR 7-day + Small compute + GCS). Exceeds the $100 top of the band, but activation is triggered by a revenue-adjacent event.
  • GCS bucket becomes a second persistent-state surface alongside Supabase DB and the runtime secrets backend. Minor operational complexity delta; carries forward through the ADR-046 hosting swap because the GCS bucket is decoupled from the compute platform.
  • GitHub Actions secret/variable surface expands by the pg_dump credentials, age recipient ref, and GCS service account configuration. Provisioned via tools/provision/setup.sh per ADR-037 D4, @scope=shared.
  • core.users mirror recovery is via user re-invite, not DB restore. Row-count rebuilds from workspace_invites acceptance after each user re-auths. Acknowledged UX cost in a provider-failure scenario; documented in the runbook.
  • Partner co-implementation onboarding gets a DR checklist item. Step 1 of landing a partner: activate PITR; step 2: activate per-partner git-backed export (D3).
  • No new spectral.core surface. DR is operational. No ADR-065 admission triggered.
  • ADR-042 (TA-4 retention) reconciles the PITR window, daily-snapshot retention, nightly-dump lifecycle, and core.llm_usage retention.
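
The cost figures in this ADR can be reproduced with simple arithmetic. In the sketch below, the $100 PITR add-on price is an inference (the ~$105 activation figure in D1 minus the ~$5 Small compute net in D8) and is not stated directly in the text; the constant and function names are hypothetical.

```python
PRO_BASE = 25           # Pro base, per the Consequences list
DUMP_STORAGE = 3        # ~$3/month dump storage at alpha volume
SMALL_COMPUTE_NET = 5   # ~$5/month net, per D8
PITR_7_DAY = 100        # inferred: ~$105 (D1) minus ~$5 Small compute (D8)
TEAM_TIER = 599         # Team tier, per Alternatives considered
COMFORT_BAND = (20, 100)

alpha_total = PRO_BASE + DUMP_STORAGE
post_pitr_total = PRO_BASE + PITR_7_DAY + SMALL_COMPUTE_NET + DUMP_STORAGE
team_delta_over_pro = TEAM_TIER - PRO_BASE


def position_in_band(cost: int, band: tuple[int, int] = COMFORT_BAND) -> str:
    """Classify a monthly cost as below, within, or above the comfort band."""
    low, high = band
    if cost < low:
        return "below"
    if cost > high:
        return "above"
    return "within"


if __name__ == "__main__":
    for label, cost in [("alpha", alpha_total), ("post-PITR", post_pitr_total)]:
        print(f"{label}: ${cost}/month, {position_in_band(cost)} the band")
```

The totals come out to $28 and $133 per month, and the Team tier delta to $574, matching the figures quoted in this ADR.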

References

  • ADR-065 — spectral.core admission discipline (no surface added here)
  • ADR-032 — single-DB topology that informs backup unit shape
  • ADR-037 — credential store for backup creds (D7)
  • ADR-038 — resolver-stub precedent for D3
  • ADR-042 — TA-4 retention reconciliation
  • ADR-046 — hosting; GCS dump target unchanged by hosting swap
  • TA-2 disposition — SPEC-305 comment 870fdb73
  • TA-2 verification — SPEC-305 comment 038837af
  • docs/runbooks/disaster-recovery.md — operational contract
  • .github/workflows/nightly-backup.yml — nightly pg_dump → age → GCS workflow