
ADR-061: LLM testing strategy

Status: Accepted (2026-04-25)

Context

LLM-invoking code in Spectral spans multiple contexts — Spectral Agent’s scan-analysis tools, World Agent’s domain-exploration reasoning, Operations Agent’s task tools, plus the LLMProvider substrate itself in spectral.core.llm. Each path needs a testing posture that is deterministic enough for CI, faithful enough to catch real regressions, and cost-bounded.

Three candidate test substrates surfaced during disposition:

  • Mocks — fast, deterministic, cost-zero, but cannot catch regressions in real provider behavior.
  • Local Ollama — deterministic-ish (floating-point variance persists even at temperature 0), self-hosted, but drifts independently of production providers; passing Ollama tests do not predict Anthropic / OpenAI / Google behavior. Resource overhead in CI (GPU runners are expensive; CPU inference is 10–100× slower) and maintenance burden (model pulls, version pinning, container images) do not earn their keep.
  • Recorded cassettes (VCR-style) — byte-perfect deterministic on replay, production-aligned at recording time, cost-zero on replay. Requires recording discipline when prompts change.

A real-provider safety net is also needed: cassettes capture recorded behavior; a prompt or model regression can slip silently if no test ever hits the real provider after the cassette was recorded. The safety net cannot run on every PR (cost prohibitive; flaky on rate-limits) but must run frequently enough to catch regressions before they accumulate.

This ADR lands a three-tier test posture, names the recording mechanism and drift threshold, sets cost controls, and places test helpers shared between contexts at spectral.core.llm.testing per ADR-065 D1 admission discipline.

Decision

D1 — Three-tier test posture for LLM-invoking code

  • Unit + contract = mock LLM via FakeLLMProvider impl (implements LLMProvider protocol from spectral.core.llm.protocols); returns canned responses; deterministic; zero external calls.
  • Integration = real LLM provider via VCR-style recorded cassettes (D2). Replay is byte-perfect deterministic.
  • Live-provider drift detection (D4) = separate nightly workflow that bypasses VCR and hits real providers; compares to recorded outputs.

The three tiers map cleanly onto the existing unit / contract / integration marker enforcement (TA-23 D2 root conftest). Live-drift is a fourth marker (live_drift) used only by the nightly workflow.
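The tier-to-marker mapping can be sketched as follows. This is a hedged illustration, not committed code: the test names and bodies are hypothetical; the unit / integration markers come from TA-23, live_drift from D1, and the vcr marker from pytest-recording.

```python
import pytest


@pytest.mark.unit
def test_summarize_with_fake_provider():
    # Tier 1: FakeLLMProvider canned response, zero external calls.
    ...


@pytest.mark.integration
@pytest.mark.vcr  # pytest-recording: replays the committed cassette.
def test_summarize_against_recorded_provider():
    # Tier 2: byte-perfect deterministic replay.
    ...


@pytest.mark.live_drift
def test_summarize_drift_nightly():
    # Tier 3: selected only by the nightly workflow (D4).
    ...
```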

D2 — Recording mechanism: pytest-recording per-test cassettes

Cassettes stored at tests/<context>/_fixtures/llm/<test-id>.yaml. Per-test (not global) avoids merge conflicts on prompt edits — a prompt change only invalidates the affected tests’ cassettes. Updated via RECORD_NEW_FIXTURES=1 env flag during local recording sessions; committed to the repo as test artifacts.
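A minimal sketch of these conventions on top of pytest-recording's real override points (the vcr_cassette_dir and vcr_config fixtures). The path layout and RECORD_NEW_FIXTURES flag come from D2; the helper function names are illustrative.

```python
import os

import pytest


def cassette_dir_for(test_file: str) -> str:
    """Per-context cassette directory: tests/<context>/_fixtures/llm/."""
    return os.path.join(os.path.dirname(test_file), "_fixtures", "llm")


def record_mode_from_env(env) -> str:
    """'none' = strict replay of committed cassettes (the CI default);
    'once' records any missing interactions during a local session."""
    return "once" if env.get("RECORD_NEW_FIXTURES") == "1" else "none"


@pytest.fixture(scope="module")
def vcr_cassette_dir(request):
    # pytest-recording override point: where cassettes are read/written.
    return cassette_dir_for(str(request.fspath))


@pytest.fixture(scope="module")
def vcr_config():
    # Passed through to vcrpy's VCR(**config).
    return {"record_mode": record_mode_from_env(os.environ)}
```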

D3 — Ollama is not part of the test contract or CI infrastructure

Cassettes are stricter (byte-perfect replay; no FP variance) and production-aligned (real Anthropic / OpenAI / Google output). Ollama models drift independently of production providers; passing Ollama tests do not predict production behavior. Resource overhead in CI and maintenance burden do not earn their keep.

Developer-local Ollama use for prompt iteration or sandbox exploration is a workflow choice (not an architectural decision) — the LLMProvider protocol is swappable; a developer-local OllamaLLMProvider impl is permissible as a workflow tool, just not committed as a test fixture.

Forward trigger: revisit if property-based fuzzing of LLM-mediated code emerges as a need that cassettes cannot cover (e.g., generate 1000 inputs; need real LLM behavior on each; cassettes infeasible for unbounded inputs).

D4 — Live-provider drift detection runs in a nightly workflow

.github/workflows/nightly-live-drift.yml (lands jointly with TA-25 D10):

  • Triggers: schedule (nightly cron) + workflow_dispatch + on-merge-to-main
  • Runs the integration test suite with LIVE_PROVIDER=1 env that bypasses VCR replay
  • Compares live outputs against recorded cassettes via similarity (D5)
  • Drift surfaces as a Sentry alert (per TA-16) plus a nightly summary issue posted via the GitHub API
  • Cost bounded by the TA-10 D5 rate-limit plus a dedicated test-account daily cap (D6)
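One plausible shape for the LIVE_PROVIDER bypass, sketched under the assumption that the flag simply switches the VCR record mode so requests reach the real provider; the actual workflow wiring may differ.

```python
import os


def vcr_record_mode(env=None) -> str:
    """'none' = strict replay of committed cassettes (every PR run);
    'all' re-records, routing every request to the real provider —
    one plausible implementation of the D4 bypass."""
    env = os.environ if env is None else env
    return "all" if env.get("LIVE_PROVIDER") == "1" else "none"
```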

D5 — Drift similarity threshold = 0.85 text / structural exact-match tool calls

Initial calibration starting point. Threshold is config (workflow env var plus per-test override via pytest marker), not contract.

  • Text outputs compared via difflib.SequenceMatcher.ratio(); threshold 0.85 is the default
  • Tool-call outputs compared structurally — tool name plus argument values; exact match. Soft similarity on tool calls would mask real behavioral regressions (tool calls are the agent’s actions)
  • Per-test override via marker: @pytest.mark.llm_drift_threshold(0.7) — for tests where outputs are inherently more variable (creative-writing prompts; multi-valid-phrasing summarization)
  • Calibration triggers: tune the global default when (a) a first false-positive blocks a legitimate change, OR (b) a first regression slips through. Tracked in the workflow’s evidence log.
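The comparison rules above can be sketched directly; SequenceMatcher.ratio() is the named mechanism, while the function signatures and the tool-call dict shape are assumptions for illustration.

```python
from difflib import SequenceMatcher


def text_similarity(actual: str, recorded: str) -> float:
    """D5 text comparison: difflib.SequenceMatcher ratio in [0, 1]."""
    return SequenceMatcher(None, actual, recorded).ratio()


def assert_llm_output_similar(actual: str, recorded: str, threshold: float = 0.85) -> None:
    """Raise on drift below the (overridable) similarity threshold."""
    ratio = text_similarity(actual, recorded)
    if ratio < threshold:
        raise AssertionError(f"drift: similarity {ratio:.3f} < {threshold}")


def tool_calls_match(actual: list, recorded: list) -> bool:
    """Tool calls are the agent's actions: tool name and argument values
    must match exactly — soft similarity would mask real regressions."""
    return actual == recorded
```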

D6 — Cost controls inherit TA-10 D5 plus a dedicated test-account

Per-day cap on the test-account; GitHub Actions concurrency group serializes live-drift runs (one at a time); VCR-replay tests (the bulk) cost zero. Test-account API keys live in the test-live GitHub Environment per TA-25 D1.

D7 — Test helpers in spectral.core.llm.testing

Test fixtures live under each functional-area subdir’s testing/ sub-namespace per ADR-065 D1 admission discipline. LLM test helpers live at spectral.core.llm.testing.*. Initial surface:

  • FakeLLMProvider — implements spectral.core.llm.protocols.LLMProvider; returns canned responses keyed by purpose plus content-class; useful in unit/contract tests
  • vcr_cassette pytest fixture — wraps pytest-recording with Spectral conventions: cassette path resolution by test ID; sensitive-header redaction; record-mode controlled by RECORD_NEW_FIXTURES
  • assert_llm_output_similar(actual, recorded, threshold=0.85) — used by the live-drift workflow’s drift-comparison logic; respects per-test threshold override
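A minimal sketch of FakeLLMProvider under stated assumptions: the (purpose, content-class) keying is from D7, but the complete() method name and signature are hypothetical — the real LLMProvider protocol in spectral.core.llm.protocols defines the actual surface.

```python
from dataclasses import dataclass, field


@dataclass
class FakeLLMProvider:
    """Canned-response provider for unit/contract tests: deterministic,
    zero external calls, responses keyed by (purpose, content_class)."""

    responses: dict = field(default_factory=dict)

    def complete(self, prompt: str, *, purpose: str, content_class: str) -> str:
        key = (purpose, content_class)
        if key not in self.responses:
            raise KeyError(f"no canned response for {key}")
        return self.responses[key]
```

A test would seed the canned map and assert on application behavior, never on provider internals.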

Implementation lands in consumer-epic integration (per the TA-12 / TA-14 / TA-15 precedent) — FakeLLMProvider lands when the first test consumer needs it (likely SPEC-242 Spectral Agent integration); vcr_cassette lands with the first integration test that records LLM output; the helper for drift comparison lands with the nightly workflow.

Each addition was approved under the contract-requirement-test discipline in force at the time (a contract-requirement test plus an inter-context requirement statement at PR time); that discipline is now superseded by ADR-065’s admission discipline.

D8 — Sensitive content in cassettes redacted at recording time

API keys plus PII stripped via pytest-recording filter hooks (filter_headers plus before_record_response). Cassette files are committed to the repo; redaction is contract.
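A sketch of the two hook layers, using vcrpy's real filter_headers and before_record_response options as passed through pytest-recording's vcr_config fixture; the sk- key pattern is illustrative of one provider's key shape, not an exhaustive scrubber.

```python
import re

# Illustrative credential shape; a real scrubber would cover more patterns.
SECRET = re.compile(r"sk-[A-Za-z0-9-]+")


def scrub_body(text: str) -> str:
    return SECRET.sub("[REDACTED]", text)


def scrub_response(response):
    """before_record_response hook: redact secrets in recorded bodies."""
    body = response.get("body", {}).get("string")
    if isinstance(body, str):
        response["body"]["string"] = scrub_body(body)
    return response


VCR_CONFIG = {
    # (header, replacement) pairs are redacted at recording time.
    "filter_headers": [("authorization", "REDACTED"), ("x-api-key", "REDACTED")],
    "before_record_response": scrub_response,
}
```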

A new quality lint tools/quality/check_cassette_redaction.py (lands with the first cassette commit; before then a dead lint) blocks Authorization: Bearer ... and similar patterns from being committed. Wired into the pre-push gate per TA-26.
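The lint could be as simple as a regex scan over committed cassette YAML for credential-shaped strings — a hedged sketch, with illustrative patterns only; the real check_cassette_redaction.py would cover more shapes.

```python
import re
import sys
from pathlib import Path

LEAK_PATTERNS = [
    re.compile(r"Authorization:\s*Bearer\s+\S+", re.IGNORECASE),
    re.compile(r"sk-[A-Za-z0-9-]{16,}"),  # illustrative API-key shape
]


def find_leaks(text: str) -> list:
    """Return every credential-shaped match found in cassette text."""
    return [m.group(0) for p in LEAK_PATTERNS for m in p.finditer(text)]


def main(paths) -> int:
    exit_code = 0
    for path in paths:
        for leak in find_leaks(Path(path).read_text()):
            print(f"{path}: unredacted secret: {leak[:20]}...")
            exit_code = 1
    return exit_code


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```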

D9 — Coverage expectations follow TA-23 D3 layer floors (90 / 80 / 60)

LLM-invoking application code is covered via mock unit tests plus cassette integration tests; live-drift is a separate workflow, not a coverage source. The mock-vs-cassette split honors the unit-vs-integration test marker distinction.

D10 — Test-agents are reference implementations, not a test harness

apps/test-agents/ hosts working agent code for exploration plus demonstration. Automated test composition that uses test-agents (cassettes plus FakeLLMProvider) is acceptable but secondary to their primary purpose. The earlier draft “test-agent E2E lifecycle in TA-24 scope” is dropped after the founder reframe captured during disposition.

Alternatives considered

Ollama as CI substrate. Rejected per D3. Drift versus production providers; FP variance even at temperature 0; resource overhead in CI; maintenance load.

Global VCR cassette file. Rejected per D2. Merge-conflict storm on prompt edits; per-test cassettes scope blast radius to the affected test.

Live-provider on every PR. Rejected. Cost prohibitive; flaky on provider rate-limits; nightly plus on-merge gives the same regression coverage at a fraction of the cost.

No drift detection at alpha. Rejected. The whole point of cassettes is byte-perfect determinism on replay; without drift detection, prompt-or-model regressions slip silently and accumulate.

Test-agent E2E lifecycle in TA-24 scope (was draft D9). Dropped after the founder reframe. Test-agents are reference implementations for exploration plus demonstration, not a test harness.

Soft similarity on tool calls. Rejected per D5. Tool calls are the agent’s actions; soft similarity masks behavioral regressions.

Manual cassette curation instead of redaction filter hooks. Rejected per D8. Manual curation has a leakage failure mode (forget to redact a header); filter hooks plus a CI lint provide layered defense.

Consequences

  • Deterministic CI for LLM-invoking code without sacrificing real-provider regression coverage.
  • Cost bounded by structure — cassettes for bulk; live-drift in a dedicated workflow with a daily-cap account.
  • Test posture aligns with existing TA-23 marker discipline. Adds one new marker (live_drift).
  • spectral.core.<functional-area>.testing.* pattern establishes the home for test helpers shared between contexts; LLM helpers live at spectral.core.llm.testing; future shared fixtures for other functional areas (events, db, retention, etc.) follow the same pattern under their respective subdirs.
  • Cassette regeneration discipline — when prompts intentionally change, recording sessions are required (RECORD_NEW_FIXTURES=1); developer workflow includes this step. Documented in docs/runbooks/llm-testing.md at close-pass.
  • Threshold calibration — 0.85 is a starting point; the first false-positive or regression-slip will tune it. The workflow’s evidence log tracks both the trigger and the new threshold.
  • Test-account credential management — daily-capped account, scoped to the test-live GitHub Environment, rotation per TA-25.
  • tools/quality/check_cassette_redaction.py lands with the first cassette commit. Before then it is a dead lint with no inputs to evaluate; queued in tools/quality/ and noted in the test-infrastructure follow-on list.
  • .github/workflows/nightly-live-drift.yml lands with the first cassette commit per the consumer-epic sequencing — the workflow has nothing to compare against until cassettes exist.

References

  • ADR-065 — spectral.core admission discipline; spectral.core.llm.testing placement complies with the functional-area subdir killer test
  • ADR-035 — TA-10 LLM stack (LLMProvider protocol; cost tracking; rate-limit defaults)
  • ADR-036 — TA-16 Sentry substrate (drift alert routing)
  • ADR-045 — TA-23 test substrate (marker enforcement; coverage floors)
  • ADR-053 — TA-26 pre-push gate wiring
  • ADR-062 — TA-25 (test-live GitHub Environment; secrets scoping)
  • TA-24 disposition — SPEC-327 comment b445660f
  • TA-24 verification — SPEC-327 comment 462b983b
  • docs/runbooks/ci-secrets.md — operational scaffold (joint TA-24 / TA-25)
  • docs/runbooks/llm-testing.md — close-pass runbook (recording sessions; drift triage; threshold calibration)
  • Codex developer-guide/testing.mdx — close-pass update folds three-tier posture, cassette discipline, drift workflow, cost controls