Decisions

ADR-073: Provisioning orchestrator — OpenTofu, setup.sh, 1Password, manual_step

Status: Accepted (2026-05-06)
Supersedes: ADR-037 D5 (partial: names 1Password as the documented default operator-workstation secrets store, rather than treating tool choice as private), ADR-037 D9

Context

ADR-037 chose a deployer-operated provisioning script with a target-swap seam as the alpha-stage secrets-management substrate. D9 explicitly framed declarative IaC (Terraform / Pulumi) as post-alpha — the script bridges manual console clicks to declarative infra at alpha scale.

That framing assumed declarative IaC required a maturity threshold the alpha team had not crossed: stable target implementations, a team larger than three engineers, the operational appetite for HCL and state management. Operator practice and a focused research pass have shifted the calculus:

Provisioning surface has multiplied. Cloudflare Pages projects (×2), KV namespaces, Access apps and policies, DNS records, WAF, rate-limit rules, zone settings; GitHub Environments, branch protection, repo settings, Actions secrets; Render web / cron / worker services, env groups, custom domains; Supabase project + auth + custom domains + edge functions; the new R2 buckets per ADR-072. Each adds idempotency, drift-detection, and re-runnable-from-empty surface. Building all of it bespoke in bash compounds engineering cost across every deployable epic.
OpenTofu provider coverage is where it needs to be. Cloudflare and GitHub providers are mature, official, and cover the resource shapes Spectral needs end-to-end. Render’s official provider has known plan-noise and env-group bugs but is workable with lifecycle.ignore_changes discipline and single-root ownership. Supabase’s official provider has permanent gaps for OAuth provider config, RLS policies, DB migrations, and Vault — all CLI-shaped operations that should not be TF-shaped anyway.
Hybrid is the right shape. Roughly 70% of resources land natively in OpenTofu; 20% require CLI gap-fills (Supabase OAuth via API PATCH; RLS and migrations via supabase db push); 10% require human-in-loop steps (account-level OAuth grants and similar). The script orchestrates all three.
The secrets cache is genuine attack surface. ADR-037 D4’s .env.provision is a plaintext file at rest on the operator machine. 1Password’s op run injects secret references as env vars at apply time without the values touching disk. Eliminating the on-disk plaintext for the documented default path is a meaningful security improvement.
Naming the default tool requires a partial supersession of ADR-037 D5. D5 kept specific upstream key-source tooling private to avoid cofounder-personal preferences leaking into system docs. Naming 1Password as the documented default for operator workstations changes that posture: it is now a published recommendation, not personal practice. The privacy concern is satisfied differently — by recommending a default for any operator, not by hiding which tool a particular operator uses.
Contributor optionality must be preserved. Locking provisioning into a single password manager harms contributors who use Bitwarden, KeePass, 1Password Teams via a different account, or no password manager at all. The fallback path (.env.provision) survives as an explicit second-class option, with the security caveat made plain in operator docs.

Decision

D1 — `tools/provision/setup.sh` is the operator’s single entry point for environment provisioning

The script orchestrates three layers of work: declarative resource provisioning via OpenTofu (D2), CLI gap-fills for resources without OpenTofu coverage (D3), and human-in-loop manual steps with shell-command verification (D4). The provider-swap-seam discipline from ADR-037 D11 stays — push_<target>(env, name, value) for secret push — and is joined by a parallel provision_<resource>(...) family for non-secret resources, plus the manual_step helper.

This extends ADR-037 D4 in scope; the modes (init, update, rotate, verify), scope annotation pattern, and cache discipline are unchanged.
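As a rough illustration of the seam described above, the sketch below shows how a bash entry point might route to the push_<target> and provision_<resource> families. Only the mode names and the two family prefixes come from this ADR; the function bodies, the `dispatch` helper, `provision_example`, and the log lines are hypothetical.

```shell
# Hypothetical skeleton for tools/provision/setup.sh (bash assumed).

# Secret push seam (ADR-037 D11): one function per target.
push_github() {  # push_github <env> <name> <value>
  local env="$1" name="$2"
  echo "push github env=${env} name=${name}"   # a real body would call the target
}

# Non-secret resource seam: one function per resource kind.
provision_example() {  # provision_example <env> ...
  echo "provision example env=$1"
}

# Resolve "<family>_<suffix>" to a function and call it, failing loudly
# when no implementation exists for that target or resource.
dispatch() {
  local family="$1" suffix="$2"
  shift 2
  local fn="${family}_${suffix}"
  if ! declare -F "$fn" >/dev/null; then
    echo "setup.sh: no ${family} implementation for '${suffix}'" >&2
    return 1
  fi
  "$fn" "$@"
}

# Accept only the four modes named in the ADR.
main() {
  local mode="${1:-}"
  case "$mode" in
    init|update|rotate|verify) echo "mode=${mode}" ;;
    *) echo "usage: setup.sh {init|update|rotate|verify}" >&2; return 2 ;;
  esac
}
```

Swapping a target then means adding or replacing one `push_*` or `provision_*` function; callers such as `dispatch push github test-live API_KEY "$value"` stay unchanged.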

D2 — OpenTofu is the declarative resource-provisioning layer

Per-provider stack roots under infra/tofu/: cloudflare/, github/, render/, supabase/, r2/. Each plans and applies independently. Cross-stack values flow via outputs and terraform_remote_state data sources.

Conventions, locked at the base-structure sub-issue of the provisioning-orchestrator epic:

Per-stack-root variables.tf is the operator’s interface; locals.tf for derived values.
Repetitive resources are driven by data files (YAML / JSON + for_each) where appropriate. The current infra/cloudflare/zone-records.md markdown table becomes a YAML data file consumed by a single cloudflare_dns_record for_each resource.
Modules for genuine repeats only (e.g. a pages_site module instantiated for the two docs sites). Not module-everything.
One workspace OR one directory per environment — chosen at the base-structure sub-issue.
State on R2 via the s3 backend per ADR-072 D2.

Bootstrap: setup.sh creates the state bucket via wrangler r2 bucket create as a pre-step on first init, idempotent (check-then-create). No bootstrap-TF-with-local-state dance.
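The check-then-create bootstrap could look roughly like the following. This is a sketch: the function name is hypothetical, and it assumes that `wrangler r2 bucket list` prints each bucket name somewhere in its output, which may vary between wrangler versions.

```shell
# ensure_state_bucket <bucket-name>
# Lists existing R2 buckets and creates the state bucket only when absent.
ensure_state_bucket() {
  local bucket="$1"
  # Match the name as a whole token so "state" does not match "state-old".
  # Bucket names contain only lowercase letters, digits, and hyphens, so the
  # name is safe to embed in the regular expression.
  if wrangler r2 bucket list \
      | grep -Eq "(^|[^A-Za-z0-9-])${bucket}([^A-Za-z0-9-]|\$)"; then
    echo "state bucket '${bucket}' already exists; skipping"
    return 0
  fi
  wrangler r2 bucket create "$bucket"
}
```

Because the function returns early when the bucket is present, re-running init against an already bootstrapped account makes no changes.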

D3 — CLI gap-fills via setup.sh for resources without OpenTofu coverage

Each gap-fill is a function in setup.sh matching the provision_<resource>(env, ...) signature. Each is idempotent at the target (check, then create or update) and feeds into the same cache, verification, and manual_step framework as everything else.

Permanent CLI gap-fills (provider has no first-class TF resource):

Supabase OAuth provider config — curl -X PATCH /v1/projects/$REF/config/auth with external_google_client_id, etc.
Supabase RLS policies + DB migrations — supabase db push against the linked project. Schema is SQL-shaped, not TF-shaped.
Supabase Vault secrets (post-alpha BYOK per ADR-037 D6) — SQL migration: select vault.create_secret(...).
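A gap-fill in the check-then-create-or-update shape might be sketched as below, using the Supabase OAuth config as the example. The helper names (`ensure_resource`, `supabase_oauth_in_sync`, `supabase_oauth_patch`), the `SUPABASE_ACCESS_TOKEN` variable, and the `https://api.supabase.com` base URL are assumptions for illustration; the ADR specifies only the path and the `external_google_client_id` field. `jq` is assumed to be installed.

```shell
# ensure_resource <label> <check_fn> <apply_fn> [args...]
# Runs check_fn; calls apply_fn only when the target is out of sync.
ensure_resource() {
  local label="$1" check_fn="$2" apply_fn="$3"
  shift 3
  if "$check_fn" "$@"; then
    echo "${label}: up to date"
    return 0
  fi
  "$apply_fn" "$@" || return 1
  echo "${label}: applied"
}

# Assumed Management API base URL; override via SUPABASE_API.
supabase_auth_url() {
  echo "${SUPABASE_API:-https://api.supabase.com}/v1/projects/$1/config/auth"
}

# supabase_oauth_in_sync <project-ref> <google-client-id>
supabase_oauth_in_sync() {
  local current
  current="$(curl -fsS -H "Authorization: Bearer ${SUPABASE_ACCESS_TOKEN}" \
    "$(supabase_auth_url "$1")" | jq -r '.external_google_client_id // empty')"
  [ "$current" = "$2" ]
}

# supabase_oauth_patch <project-ref> <google-client-id>
supabase_oauth_patch() {
  curl -fsS -X PATCH "$(supabase_auth_url "$1")" \
    -H "Authorization: Bearer ${SUPABASE_ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg id "$2" '{external_google_client_id: $id}')" >/dev/null
}

# provision_supabase_oauth <env> <project-ref> <google-client-id>
provision_supabase_oauth() {
  local env="$1"
  shift
  ensure_resource "[${env}] supabase oauth" \
    supabase_oauth_in_sync supabase_oauth_patch "$@"
}
```

Keeping the check and the apply as separate functions lets the same wrapper serve other gap-fills, and a second run against an unchanged project issues only the read request.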

Transient CLI gap-fills (provider issue expected to resolve):

Cloudflare v5 cloudflare_pages_domain “already added” race (cloudflare/terraform-provider-cloudflare#5619). Tolerate one retry; import-then-manage if persistent. Re-evaluate at each Cloudflare provider minor.
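The "tolerate one retry" rule could be expressed as a small wrapper like this one. It is a sketch: the `retry_once` name and the `RETRY_DELAY_SECONDS` variable are hypothetical, and whether the retry wraps a whole apply or a narrower command is left open by the ADR.

```shell
# retry_once <command> [args...]
# Runs the command; on failure waits briefly and runs it exactly once more.
# The exit status is that of the last attempt.
retry_once() {
  if "$@"; then
    return 0
  fi
  echo "retry_once: '$*' failed; retrying once" >&2
  sleep "${RETRY_DELAY_SECONDS:-5}"
  "$@"
}
```

A call might look like `retry_once tofu -chdir=infra/tofu/cloudflare apply`; a second failure propagates to the caller, which is the point at which the import-then-manage route applies.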

D4 — `manual_step` helper for human-in-loop steps

Signature: manual_step <id> <description> <verification_command>.

Idempotent via cache marker. The cache entry manual:<id>=done@<ISO8601> parallels the existing <scope>:<NAME> namespace. Re-runs detect the marker and skip with a one-line log.
Verification by default. The verification command must exit 0 before the cache marker is written. Examples: dig +short NS runspectral.com | grep -q cloudflare; gh api /repos/.../environments/test-live | jq -e '.protection_rules | length > 0'.
Explicit --unverifiable escape hatch for steps with no programmatic check. Drift-prone; documented in the runbook with a recommendation to minimize the unverifiable surface.
Each step carries a delete_when: field naming the condition under which it can be retired (e.g., “Cloudflare provider adds resource X”). Manual steps are debt; the field captures the retirement trigger.

Initial manual steps include the Cloudflare → GitHub OAuth source-connection grant (one-time per Cloudflare account; no API surface).
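A minimal sketch of the helper, assuming bash, might look like the following. The marker format and the verify-before-write rule come from the text above; the `CACHE_FILE` variable, the position of the `--unverifiable` flag, and the prompt wording are assumptions, and the delete_when: field is not modelled here.

```shell
# Assumption: markers live in the same cache file as other entries.
CACHE_FILE="${CACHE_FILE:-.env.provision}"

# manual_step [--unverifiable] <id> <description> [<verification_command>]
manual_step() {
  local unverifiable=0
  if [ "${1:-}" = "--unverifiable" ]; then
    unverifiable=1
    shift
  fi
  local id="$1" description="$2" verify_cmd="${3:-}"

  # Skip when a marker for this id is already cached.
  if grep -qF "manual:${id}=done@" "$CACHE_FILE" 2>/dev/null; then
    echo "manual_step ${id}: already done; skipping"
    return 0
  fi

  echo "MANUAL STEP ${id}: ${description}"
  # Wait for the operator only when attached to a terminal.
  if [ -t 0 ]; then
    read -r -p "Press Enter when done... " _
  fi

  if [ "$unverifiable" -eq 0 ]; then
    if [ -z "$verify_cmd" ]; then
      echo "manual_step ${id}: verification command required" >&2
      return 2
    fi
    # The verification command must exit 0 before the marker is written.
    if ! bash -c "$verify_cmd"; then
      echo "manual_step ${id}: verification failed; marker not written" >&2
      return 1
    fi
  fi

  printf 'manual:%s=done@%s\n' "$id" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$CACHE_FILE"
}
```

With this shape, the DNS example above would be invoked as `manual_step ns-delegation "Delegate runspectral.com to Cloudflare" "dig +short NS runspectral.com | grep -q cloudflare"`, where the id and description are illustrative.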

D5 — 1Password is the documented default operator-workstation secrets source

The default secrets path on operator workstations is 1Password Individual or higher with the op CLI. setup.sh apply invokes op run --env-file=.env.example -- tofu apply; .env.example carries committed op://VAULT/ITEM/FIELD references. Initial population: setup.sh init prompts and writes to 1Password via op item create and op item edit rather than to a local cache file. Rotation: edit the value in 1Password; re-run apply.
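The apply path described above might be wrapped as follows. This is a sketch under assumptions: both function names are hypothetical, and the pre-flight check (every value in the committed file must be an unquoted op:// reference) is an illustrative safeguard, not something the ADR mandates.

```shell
# env_file_has_only_references <file>
# Succeeds when every assignment's value starts with op://, so the
# committed file never carries a literal secret. Blank lines and
# comment lines are ignored; quoted values are treated as non-references.
env_file_has_only_references() {
  local file="$1" line value
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;
    esac
    value="${line#*=}"
    case "$value" in
      op://*) ;;
      *) echo "non-reference value for ${line%%=*}" >&2; return 1 ;;
    esac
  done < "$file"
}

# apply_with_1password [env-file]
# Resolves the references at run time via op run; values never touch disk.
apply_with_1password() {
  local env_file="${1:-.env.example}"
  env_file_has_only_references "$env_file" || return 1
  op run --env-file="$env_file" -- tofu apply
}
```

A matching `.env.example` line would look like `TF_VAR_example_token=op://VAULT/ITEM/FIELD`, where the variable name is a placeholder.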

Personal / Individual plan is sufficient — verified. The op CLI features used (op run, op read, op item create, op item edit) are not plan-tier-gated. Service accounts (Business plan and above) are not required, because CI uses GitHub Actions Secrets (populated by TF), not 1Password.

This partially supersedes ADR-037 D5: specific upstream key-source tooling is no longer treated as private when it is the published default. The privacy intent of D5 — avoiding cofounder-personal tooling leaking into system docs — is satisfied differently, by naming a default that is recommended for any operator rather than describing one cofounder’s preference.

D6 — `.env.provision` is the fallback secrets path

Auto-detected: if op CLI is unavailable or no active session resolves, the script falls back with a one-line banner. An explicit --no-1password flag opts out when op is available but the operator chooses not to use it.

Same dispatch interface (store_value, read_value); two implementations sharing one signature, dispatched by which path is active. Parity is required: any new feature works on both paths or it does not ship.
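One way to realize "two implementations sharing one signature" is sketched below. `store_value` and `read_value` are the names from this ADR; everything else (the `_op` and `_file` suffixes, `detect_secrets_path`, `SECRETS_PATH`, `OP_VAULT`, `OP_ITEM`, and storing every value as a field on a single 1Password item) is assumed for illustration.

```shell
CACHE_FILE="${CACHE_FILE:-.env.provision}"
OP_VAULT="${OP_VAULT:-VAULT}"   # placeholder vault name
OP_ITEM="${OP_ITEM:-ITEM}"      # placeholder item name

# detect_secrets_path [--no-1password]
# Prints "op" when the CLI is installed and signed in, otherwise "file".
detect_secrets_path() {
  if [ "${1:-}" = "--no-1password" ]; then
    echo file
    return 0
  fi
  if command -v op >/dev/null 2>&1 && op whoami >/dev/null 2>&1; then
    echo op
  else
    echo "setup.sh: 1Password unavailable; falling back to ${CACHE_FILE}" >&2
    echo file
  fi
}

# 1Password implementation. Note: the value is visible in the process
# argument list while op runs.
store_value_op() { op item edit "$OP_ITEM" --vault "$OP_VAULT" "$1[password]=$2" >/dev/null; }
read_value_op()  { op read "op://${OP_VAULT}/${OP_ITEM}/$1"; }

# Fallback implementation: NAME=value lines in a chmod 600 file.
store_value_file() {
  local name="$1" value="$2" tmp
  touch "$CACHE_FILE"
  chmod 600 "$CACHE_FILE"
  tmp="$(mktemp)"
  grep -v "^${name}=" "$CACHE_FILE" > "$tmp" || true   # drop any old entry
  printf '%s=%s\n' "$name" "$value" >> "$tmp"
  cat "$tmp" > "$CACHE_FILE"
  rm -f "$tmp"
}
read_value_file() { sed -n "s/^$1=//p" "$CACHE_FILE"; }

# Single interface; SECRETS_PATH is set once from detect_secrets_path.
SECRETS_PATH="${SECRETS_PATH:-file}"
store_value() { "store_value_${SECRETS_PATH}" "$@"; }
read_value()  { "read_value_${SECRETS_PATH}" "$@"; }
```

Callers use only `store_value` and `read_value`, so a feature built on them works on both paths by construction, which is one way to keep the parity requirement cheap to enforce.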

Operator discipline for the fallback path is documented in the operator runbook: chmod 600 (script-enforced), gitignored (already), back up to the operator’s chosen password manager, delete the local file after backup. The “no plaintext on disk” property holds for the documented default and degrades gracefully on opt-out.

D7 — Pattern C secrets architecture: TF_VAR_* injection at apply time

Secret values flow: 1Password (or .env.provision fallback) → TF_VAR_* env vars (via op run or shell-injection from the cache) → tofu apply. TF state contains values for non-ephemeral resources; the Cloudflare, GitHub, Render, and Supabase providers do not yet ship ephemeral resources (provider-side support for the ephemeral values introduced in Terraform 1.10 is currently limited to AWS / Azure / Kubernetes / Google).
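For the fallback path, the shell-injection step could be sketched as below. The `<scope>:<NAME>` entry shape comes from the cache namespace described earlier; the `tofu` scope name, the `export_tf_vars` function, and the lowercasing of names are assumptions made for the example.

```shell
# export_tf_vars <cache-file> [scope]
# Exports TF_VAR_<lowercased NAME> for every "<scope>:<NAME>=<value>" entry.
# Assumes NAME is a valid shell identifier once lowercased.
export_tf_vars() {
  local cache_file="$1" scope="${2:-tofu}" line key value
  while IFS= read -r line || [ -n "$line" ]; do
    # Skip entries from other scopes, such as manual:<id> markers.
    case "$line" in
      "${scope}:"*=*) ;;
      *) continue ;;
    esac
    key="${line%%=*}"          # "<scope>:<NAME>"
    key="${key#"${scope}:"}"   # "<NAME>"
    value="${line#*=}"         # everything after the first "="
    key="$(printf '%s' "$key" | tr '[:upper:]' '[:lower:]')"
    export "TF_VAR_${key}=${value}"
  done < "$cache_file"
}
```

After `export_tf_vars .env.provision`, a subsequent `tofu apply` in the same shell sees the values as input variables, mirroring what op run provides on the default path.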

Mitigations for state-stored values:

R2 state bucket per ADR-072 D2 with server-side encryption and bucket versioning.
Object R/W API token scoped to the state bucket only, held by the operator (not by CI).
State bucket separate from backups bucket per ADR-072 D3 — different blast radii.
Ephemeral values are used with the google provider where supported; the surface is small given that ADR-072 reduces GCP usage.

D8 — Roll-out is iterative across the alpha milestone

Each deployable epic ships its own resources first (manually if needed), then a parallel sub-issue under the provisioning-orchestrator epic brings those resources under script control. Sub-issues stack up as we work through alpha; the epic closes when every alpha-required deployable can be provisioned end-to-end from setup.sh against an empty target.

Codex pages and runbooks describing “manually create X” prerequisites get swept and rewritten as the corresponding sub-issues land.

Alternatives considered

Bespoke multi-provider orchestrator entirely in bash. Reject — every capability OpenTofu provides natively (idempotent diff, drift detection, dependency graph, plan/apply pattern, audit trail) becomes engineering we own. Bespoke orchestrators require sustained engineering as targets evolve and as new providers join the stack. The break-even for IaC is roughly ≥3 providers + secrets + repeatability + multi-environment; we cleared that bar long before this ADR.

Pulumi. Reject — comparable feature set to OpenTofu, but Pulumi’s default state backend (Pulumi Cloud) requires an account and trends paid for sustained use; self-hosted state needs setup that R2 already provides. Smaller community than Terraform / OpenTofu for our specific providers.

All-CLI / no-IaC. Reject — drift detection and re-runnable-from-empty are real value; lost without a declarative layer. This is what we have today, and the friction will compound as the provisioning surface grows.

Pattern A (bash-only secrets, TF avoids secret-containing resources). Reject — loses TF drift detection on GitHub Actions secrets, Render env groups, Supabase auth. A secret rotated via dashboard becomes invisible to tofu plan.

Pattern B (secrets via external secret manager + TF data sources). Reject — adds GCP Secret Manager (or equivalent) as a new vendor at alpha; does not actually keep secret values out of state for many providers’ resources; more moving parts than the gain.

1Password mandatory; no fallback. Reject — locks contributors into a single password manager; the fallback path is small surface area to maintain and preserves contributor optionality. Per the parity requirement (D6), maintenance burden stays bounded.

Wait-and-see — defer this decision until post-alpha as ADR-037 D9 originally framed. Reject — the cost of waiting is sustained bespoke engineering across every deployable epic. The alpha-stage maturity bar that D9 cited was empirical, not principled; the empirical evidence has shifted (research confirmed scale fit; the OpenTofu skill ramp is short with current LLM tooling).

Consequences

tools/provision/setup.sh scope expands from secrets-only to a multi-layer orchestrator. The provider-swap-seam discipline from ADR-037 D11 is preserved; new families (provision_<resource>, manual_step) join push_<target>.
New infra/tofu/ directory tree with per-provider stack roots; state on R2 via ADR-072 D2.
.env.provision cache survives as the fallback secrets path; the documented primary path is 1Password.
ADR-037 D5 partially superseded. Specific upstream key-source tooling (1Password) is named as the documented default. The privacy intent is satisfied differently.
ADR-037 D9 superseded. Declarative IaC adopted at alpha rather than deferred to post-alpha. The script-as-IaC framing is replaced by script-as-thin-orchestrator-around-IaC.
Hard dependency on the op CLI for the documented default secrets path. Operator workstations need op installed and authenticated. The fallback path has no op dependency.
HCL becomes part of the operator skill set — small ramp; well within agent-LLM and operator capability.
Codex sweep required for content describing the provisioning model: any “manually create X” prerequisite, .env.provision references, secrets rotation flow, runbook narratives for setup.
Provider version pins are required in each TF root configuration, locked at the base-structure sub-issue.
The provisioning-orchestrator epic captures the work; sub-issues land iteratively across alpha.
Trade accepted: state-at-rest contains secret values for non-ephemeral resources; mitigated by R2 server-side encryption + a bucket-scoped, operator-only API token + bucket versioning + a state bucket separate from the backups bucket per ADR-072.
