ADR-072: Cloudflare R2 for backups and Terraform state
Status: Accepted (2026-05-06)
Supersedes: ADR-040 D2 and D7 (partial — GCS-specific elements)
Context
ADR-040 D2 chose GCS as the destination for the nightly pg_dump backup pipeline; D7 paired it with bucket-level versioning, retention lock, and a distinct service-account identity for backup credential isolation. Both decisions were taken in April 2026 when the working assumption was that GCP would also host backend compute (per SPEC-62, March 2026).
SPEC-324 (TA-21 hosting choice, April 2026) selected Render for backend compute. SPEC-325 (TA-22 edge / CDN / DNS) selected Cloudflare for edge and DNS. GCP exited the active stack except for the backup bucket. The backup-destination decision was not reconsidered at that pivot — the GCS choice persisted by inertia rather than by re-evaluation.
A second forcing function: ADR-073 adopts OpenTofu as the resource-provisioning layer (extending ADR-037 D9). OpenTofu needs a remote state backend, and ours must provide native locking, versioning, and encryption at rest. The economics favour reusing the same object store as the backups — one vendor relationship covering both purposes.
R2 reaches feature parity for both purposes:
- Bucket Locks (GA 2025-03) provide write-once-read-many immutability with prefix-scoped retention rules, functionally equivalent to S3 Object Lock Compliance mode (no admin override path documented).
- Lifecycle rules support prefix-based expiration and Standard → Infrequent Access transitions.
- The HashiCorp `s3` Terraform backend supports R2 via endpoint override, with `use_lockfile = true` providing native conditional-PUT locking (no DynamoDB equivalent required).
- Egress is zero-cost — material for backup-restore drills under ADR-040 D4's quarterly cadence.
- Server-side encryption is on by default. Per-bucket-scoped Object R/W API tokens provide IAM granularity.
Vendor count: keeping GCS adds GCP as a fifth vendor for one bucket. Folding into R2 keeps us at four (Cloudflare, Render, Supabase, GitHub). The vendor-consolidation gain is real; the capability gap is not.
Decision
D1 — Backup destination is Cloudflare R2
Nightly `pg_dump` writes to an R2 bucket (`spectral-backups-prod`; `spectral-backups-staging` if staging backup is desired). Bucket configuration:
- Bucket Lock with prefix-scoped retention. Three retention tiers via prefix conventions: `daily/` (30 d), `weekly/` (90 d), `monthly/` (1 y). Lock rules attach at the bucket via the Cloudflare-native API (not the S3 Object Lock API surface).
- Versioning enabled for accidental-overwrite recovery.
- Server-side encryption (R2-managed) — default; no caller action required.
- Lifecycle rule transitioning `daily/` to Infrequent Access at 14 d and expiring at 30 d; `weekly/` and `monthly/` follow analogous shapes.
ADR-040 D5 per-data-class posture remains authoritative; only the destination changes.
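D1 does not prescribe how dumps reach the `weekly/` and `monthly/` prefixes. One possible convention, sketched below purely as an illustration (the calendar-based routing is an assumption, not part of the decision), has the nightly job pick the prefix itself:

```sh
# Illustration only: one way the nightly job could select the retention-tier
# prefix assumed in D1. The Sunday / first-of-month routing is an assumption.
PREFIX="daily"
[ "$(date -u +%u)" = "7" ]  && PREFIX="weekly"   # Sunday dumps kept 90 d
[ "$(date -u +%d)" = "01" ] && PREFIX="monthly"  # first-of-month dumps kept 1 y
KEY="${PREFIX}/$(date -u +%Y-%m-%d).dump.age"
```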
D2 — Terraform state backend is Cloudflare R2 via the HashiCorp s3 backend
State lives in a separate R2 bucket (`spectral-tfstate`) with versioning enabled. Backend configuration uses the canonical Cloudflare-documented R2 + HashiCorp `s3` backend pattern:
```hcl
terraform {
  backend "s3" {
    bucket = "spectral-tfstate"
    key    = "<stack-prefix>/terraform.tfstate"
    region = "auto"
    endpoints = {
      s3 = "https://<account_id>.r2.cloudflarestorage.com"
    }
    use_lockfile                = true
    skip_credentials_validation = true
    skip_metadata_api_check     = true
    skip_region_validation      = true
    skip_requesting_account_id  = true
    skip_s3_checksum            = true
    use_path_style              = true
  }
}
```

`use_lockfile = true` provides native state locking via R2's conditional-PUT semantics; no DynamoDB or external locking service is required. `skip_s3_checksum = true` is mandatory for non-AWS S3-compatible endpoints.
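As a usage sketch (the credential variable names are the standard ones read by the `s3` backend; the token values are placeholders), the per-bucket R2 token is supplied through the usual AWS credential environment variables and the backend is initialised with `tofu init`:

```sh
# Sketch only: supplying the spectral-tfstate bucket's Object R/W token to the
# s3 backend via the standard AWS credential environment variables.
export AWS_ACCESS_KEY_ID="<r2-token-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<r2-token-secret-access-key>"

tofu init   # configures the R2 backend and validates access to the state bucket
tofu plan   # state reads/writes and the lockfile now live in spectral-tfstate
```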
D3 — Backups bucket and state bucket are separate
Different blast radii and different IAM credentials. The backups bucket is retention-locked; the state bucket is mutable (TF must be able to overwrite state on every apply). A misconfigured retention lock on the state bucket would brick tofu apply. Per-bucket Object R/W tokens isolate compromise.
D4 — Backup pipeline rewrites from gcloud storage to R2 S3 API
`tools/ops/backup/backup-nightly.sh` switches from `gcloud storage cp` to `aws s3 cp` with R2 endpoint override (or `wrangler r2 object put`; `aws s3` is the canonical S3-API shape and is the default choice). The encryption flow (`pg_dump | age | upload`) is unchanged. The Render cron service entrypoint per ADR-053 D20 is unchanged.
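A minimal sketch of the rewritten upload leg, assuming the existing `pg_dump | age` flow; the recipient, database URL, and key layout shown are placeholders, not prescribed here:

```sh
# Sketch only: the upload step of backup-nightly.sh after the switch to the
# R2 S3 API. Variable names and the key layout are illustrative placeholders.
STAMP="$(date -u +%Y-%m-%d)"
pg_dump "$DATABASE_URL" \
  | age --encrypt -r "$AGE_RECIPIENT" \
  | aws s3 cp - "s3://spectral-backups-prod/daily/${STAMP}.dump.age" \
      --endpoint-url "https://${R2_ACCOUNT_ID}.r2.cloudflarestorage.com"
```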
D5 — GitHub Actions auth pattern: scoped, rotated API tokens
R2 has no native OIDC / Workload Identity Federation. Per-bucket-scoped Object R/W API tokens, stored as GitHub Actions repository or Environment secrets per audience, rotated quarterly per ADR-062 D5. Acceptable trade — GitHub Actions Secrets is already our CI auth surface for everything else; one more secret follows the same rotation cadence.
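As an illustration of the rotation mechanics (secret names are placeholders; the actual names belong to the workflow definitions), a freshly minted per-bucket token can be pushed into repository secrets with the GitHub CLI:

```sh
# Illustration only: quarterly rotation step pushing a newly minted per-bucket
# R2 Object R/W token into GitHub Actions repository secrets. Secret names are
# placeholders; the superseded token is revoked in Cloudflare afterwards.
gh secret set R2_BACKUPS_ACCESS_KEY_ID     --body "$NEW_ACCESS_KEY_ID"
gh secret set R2_BACKUPS_SECRET_ACCESS_KEY --body "$NEW_SECRET_ACCESS_KEY"
```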
D6 — Restore-drill UX benefits from R2 zero egress
ADR-040 D4 quarterly functional drill restores the most recent dump to a Supabase Duplicate Project. R2’s zero-egress property removes per-restore egress cost; drill cadence is unaffected, but the operational friction drops.
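A restore-drill sketch under stated assumptions: plain-format dumps, an age identity file at a placeholder path, and the Duplicate Project connection string supplied by the operator.

```sh
# Sketch only: pull the most recent daily dump from R2 (zero egress cost),
# decrypt it, and load it into the Supabase Duplicate Project. Paths and
# variable names are placeholders; assumes a plain-format pg_dump.
LATEST="$(aws s3 ls "s3://spectral-backups-prod/daily/" \
            --endpoint-url "https://${R2_ACCOUNT_ID}.r2.cloudflarestorage.com" \
          | sort | tail -n 1 | awk '{print $4}')"
aws s3 cp "s3://spectral-backups-prod/daily/${LATEST}" - \
    --endpoint-url "https://${R2_ACCOUNT_ID}.r2.cloudflarestorage.com" \
  | age --decrypt -i "$AGE_IDENTITY_FILE" \
  | psql "$DUPLICATE_PROJECT_DATABASE_URL"
```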
Alternatives considered
Status quo (GCS for backups; new GCS bucket for state). Reject — GCP returns as a fifth vendor for two buckets. R2 covers both purposes equivalently with no new vendor relationship.
Supabase Storage. Reject — no Object Lock or retention-lock equivalent for compliance-grade backup immutability. The HashiCorp `s3` backend has not been validated against Supabase Storage's S3-compatible surface; pioneering risk on a load-bearing primitive.
Local Terraform state + age/sops encryption + git commit. Reject — state contains secret values for non-ephemeral resources (per ADR-073 D7). Encryption-key compromise leaks all historical secrets; no native locking; merge conflicts on state become a real problem when the team grows past one operator.
AWS S3. Reject — adds AWS as a vendor for one purpose. R2 is functionally equivalent for our needs without the addition; egress on AWS S3 is non-zero (a tax on every restore drill).
Terraform Cloud / HCP free tier. Reject — the free tier is constrained under sustained use and trends toward a paid plan; vendor footprint expands. Self-hosted backend on R2 has no recurring vendor cost.
Single bucket for backups + state. Reject — different blast radii. The retention lock that protects backups would prevent state writes.
Consequences
- ADR-040 D2 and D7 are partially superseded by this ADR. ADR-040’s other D-points remain authoritative.
- ADR-053 D20 (`backup-nightly` runs as a Render cron service) remains authoritative; only the destination changes.
- SPEC-330 ACs that reference GCS (AC23–AC32, inclusive of the backup-bucket controls and IAM identity) require rewrite to R2 equivalents.
- SPEC-471 (S5 backup-nightly) requires substantial scope rewrite: bucket creation, token issuance, lifecycle and lock configuration, script rewrite.
- New runbook content in `docs/runbooks/disaster-recovery.md`: R2 token rotation, R2 Bucket Lock administration, R2 lifecycle rule conventions.
- `tools/ops/backup/backup-nightly.sh` rewrite from `gcloud storage` to `aws s3` with R2 endpoint.
- ADR-073 introduces the OpenTofu workflow; this ADR provides the state backend that workflow depends on. The two ADRs ship together.
- Vendor count: 4 (Cloudflare, Render, Supabase, GitHub). GCP exits the active stack.
- Trade accepted: HashiCorp’s “best effort” stance on non-AWS S3-compatible backends. No R2-specific `use_lockfile` regressions reported in 2025–2026 community sources; risk is small but present.