ADR-061: LLM testing strategy
Status: Accepted (2026-04-25)
Context
LLM-invoking code in Spectral spans multiple contexts — Spectral Agent’s scan-analysis tools, World Agent’s domain-exploration reasoning, Operations Agent’s task tools, plus the LLMProvider substrate itself in spectral.core.llm. Each path needs a testing posture that is deterministic enough for CI, faithful enough to catch real regressions, and cost-bounded.
Three candidate test substrates surfaced during disposition:
- Mocks: fast, deterministic, cost-zero, but cannot catch regressions in real provider behavior.
- Local Ollama: deterministic-ish (still shows floating-point variance even at temperature 0), self-hosted, but drifts independently of production providers; passing Ollama tests do not predict Anthropic / OpenAI / Google behavior. Resource overhead in CI (GPU runners expensive; CPU inference 10–100× slower) and maintenance burden (model pulls, version pinning, container images) do not earn their keep.
- Recorded cassettes (VCR-style): byte-perfect deterministic on replay, production-aligned at recording time, cost-zero on replay. Requires recording discipline when prompts change.
A real-provider safety net is also needed: cassettes capture recorded behavior; a prompt or model regression can slip silently if no test ever hits the real provider after the cassette was recorded. The safety net cannot run on every PR (cost prohibitive; flaky on rate-limits) but must run frequently enough to catch regressions before they accumulate.
This ADR lands a three-tier test posture, names the recording mechanism and the drift threshold, sets cost controls, and places test helpers shared between contexts at spectral.core.llm.testing per ADR-065 D1 admission discipline.
Decision
D1 — Three-tier test posture for LLM-invoking code
- Unit + contract = mock LLM via FakeLLMProvider impl (implements LLMProvider protocol from spectral.core.llm.protocols); returns canned responses; deterministic; zero external calls.
- Integration = real LLM provider via VCR-style recorded cassettes (D2). Replay is byte-perfect deterministic.
- Live-provider drift detection (D4) = separate nightly workflow that bypasses VCR and hits real providers; compares to recorded outputs.
The three tiers map cleanly onto the existing unit / contract / integration marker enforcement (TA-23 D2 root conftest). Live-drift is a fourth marker (live_drift) used only by the nightly workflow.
D2 — Recording mechanism: pytest-recording per-test cassettes
Cassettes are stored at tests/<context>/_fixtures/llm/<test-id>.yaml. Per-test (rather than global) cassettes avoid merge conflicts on prompt edits: a prompt change only invalidates the affected tests’ cassettes. Cassettes are updated via the RECORD_NEW_FIXTURES=1 env flag during local recording sessions and committed to the repo as test artifacts.
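The path convention above could be resolved along these lines. This is only a sketch: the function name cassette_path, the repo_root parameter, the example context name, and the rule for turning a pytest node ID into a file name are all illustrative assumptions, not part of the decision.

```python
import re
from pathlib import Path


def cassette_path(repo_root: Path, context: str, test_id: str) -> Path:
    """Return tests/<context>/_fixtures/llm/<test-id>.yaml for one test.

    Assumption: characters that are awkward in file names (such as the
    "::" and "[...]" found in pytest node IDs) are collapsed to "_".
    """
    slug = re.sub(r"[^A-Za-z0-9_.-]+", "_", test_id).strip("_")
    return repo_root / "tests" / context / "_fixtures" / "llm" / f"{slug}.yaml"


# Hypothetical context and test ID, for illustration only.
path = cassette_path(Path("."), "spectral_agent", "test_scan.py::test_summary[short]")
print(path.as_posix())
```

Because each test resolves to its own file, re-recording one test touches exactly one cassette in the diff.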
D3 — Ollama is not part of the test contract or CI infrastructure
Cassettes are stricter (byte-perfect replay; no FP variance) and production-aligned (real Anthropic / OpenAI / Google output). Ollama models drift independently of production providers; passing Ollama tests do not predict production behavior. Resource overhead in CI and maintenance burden do not earn their keep.
Developer-local Ollama use for prompt iteration or sandbox exploration is a workflow choice (not an architectural decision) — the LLMProvider protocol is swappable; a developer-local OllamaLLMProvider impl is permissible as a workflow tool, just not committed as a test fixture.
Forward trigger: revisit if property-based fuzzing of LLM-mediated code emerges as a need that cassettes cannot cover (e.g., generate 1000 inputs; need real LLM behavior on each; cassettes infeasible for unbounded inputs).
D4 — Live-provider drift detection runs in a nightly workflow
.github/workflows/nightly-live-drift.yml (lands jointly with TA-25 D10):
- Triggers: schedule (nightly cron) + workflow_dispatch + on-merge-to-main
- Runs the integration test suite with LIVE_PROVIDER=1 env that bypasses VCR replay
- Compares live outputs against recorded cassettes via similarity (D5)
- Drift surfaces as a Sentry alert (per TA-16) plus a nightly summary issue posted via the GitHub API
- Cost bounded by the TA-10 D5 rate-limit plus a dedicated test-account daily cap (D6)
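One way to picture how the two env flags interact is a single mode switch read by the test helpers. The sketch below is an assumption about how that could look: the function name llm_test_mode, the mode strings, and the rule that LIVE_PROVIDER takes precedence over RECORD_NEW_FIXTURES are not specified by this ADR.

```python
from typing import Mapping


def llm_test_mode(env: Mapping[str, str]) -> str:
    """Pick how an integration test talks to the provider.

    "live"   -> bypass cassettes and call the real provider (nightly drift run)
    "record" -> call the provider and write new cassettes (local sessions)
    "replay" -> replay cassettes only (the default, e.g. on PR CI)
    """
    # Assumption: the live flag wins if both flags are set.
    if env.get("LIVE_PROVIDER") == "1":
        return "live"
    if env.get("RECORD_NEW_FIXTURES") == "1":
        return "record"
    return "replay"


print(llm_test_mode({}))
print(llm_test_mode({"LIVE_PROVIDER": "1"}))
```

Passing the environment in as a mapping, rather than reading os.environ inside the function, keeps the switch itself unit-testable.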
D5 — Drift similarity threshold = 0.85 text / structural exact-match tool calls
Initial calibration starting point. Threshold is config (workflow env var plus per-test override via pytest marker), not contract.
- Text outputs compared via difflib.SequenceMatcher.ratio(); threshold 0.85 is the default
- Tool-call outputs compared structurally (tool name plus argument values); exact match. Soft similarity on tool calls would mask real behavioral regressions (tool calls are the agent’s actions)
- Per-test override via marker: @pytest.mark.llm_drift_threshold(0.7), for tests where outputs are inherently more variable (creative-writing prompts; multi-valid-phrasing summarization)
- Calibration triggers: tune the global default when (a) a first false-positive blocks a legitimate change, OR (b) a first regression slips through. Tracked in the workflow’s evidence log.
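The comparison rules above can be sketched in a few lines of Python. The helper names (text_similarity, tool_calls_match, has_drifted) and the dict shape used for tool calls are assumptions for illustration; only difflib.SequenceMatcher.ratio(), the 0.85 default, and the exact-match rule for tool calls come from this decision.

```python
from difflib import SequenceMatcher
from typing import Any, Optional

DEFAULT_THRESHOLD = 0.85  # starting point per D5; config, not contract


def text_similarity(actual: str, recorded: str) -> float:
    """Similarity in [0, 1] between a live output and the cassette output."""
    return SequenceMatcher(None, actual, recorded).ratio()


def tool_calls_match(actual: list[dict[str, Any]], recorded: list[dict[str, Any]]) -> bool:
    """Exact structural match: same tool names and argument values, in order."""
    return actual == recorded


def has_drifted(
    actual_text: str,
    recorded_text: str,
    actual_calls: list[dict[str, Any]],
    recorded_calls: list[dict[str, Any]],
    marker_threshold: Optional[float] = None,
) -> bool:
    """True when tool calls differ at all, or text falls below the threshold."""
    if not tool_calls_match(actual_calls, recorded_calls):
        return True
    # A per-test marker value, when present, overrides the global default.
    threshold = DEFAULT_THRESHOLD if marker_threshold is None else marker_threshold
    return text_similarity(actual_text, recorded_text) < threshold


print(text_similarity("abcd", "abcf"))  # 3 of 4 characters match on each side
```

Under the default, a ratio of 0.75 counts as drift; the same pair passes for a test that carries a 0.7 override.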
D6 — Cost controls inherit TA-10 D5 plus a dedicated test-account
Per-day cap on the test-account; GitHub Actions concurrency group serializes live-drift runs (one at a time); VCR-replay tests (the bulk) cost zero. Test-account API keys live in the test-live GitHub Environment per TA-25 D1.
D7 — Test helpers in spectral.core.llm.testing
Test fixtures live under each functional-area subdir’s testing/ sub-namespace per ADR-065 D1 admission discipline. LLM test helpers live at spectral.core.llm.testing.*. Initial surface:
- FakeLLMProvider: implements spectral.core.llm.protocols.LLMProvider; returns canned responses keyed by purpose plus content-class; useful in unit/contract tests
- vcr_cassette pytest fixture: wraps pytest-recording with Spectral conventions (cassette path resolution by test ID; sensitive-header redaction; record-mode controlled by RECORD_NEW_FIXTURES)
- assert_llm_output_similar(actual, recorded, threshold=0.85): used by the live-drift workflow’s drift-comparison logic; respects per-test threshold override
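A minimal sketch of what a canned-response fake might look like follows. The real LLMProvider protocol is not reproduced in this ADR, so the protocol below is a stand-in: the complete method, its parameters, and the example purpose and content-class values are hypothetical.

```python
from typing import Protocol


class LLMProvider(Protocol):
    # Stand-in for spectral.core.llm.protocols.LLMProvider. The real
    # protocol's method names and signature are not given in this ADR.
    def complete(self, prompt: str, *, purpose: str, content_class: str) -> str: ...


class FakeLLMProvider:
    """Returns canned responses keyed by (purpose, content_class)."""

    def __init__(self, canned: dict[tuple[str, str], str]) -> None:
        self._canned = dict(canned)
        # Recorded so tests can assert on what the code under test asked for.
        self.calls: list[tuple[str, str, str]] = []

    def complete(self, prompt: str, *, purpose: str, content_class: str) -> str:
        self.calls.append((purpose, content_class, prompt))
        key = (purpose, content_class)
        if key not in self._canned:
            # Fail loudly rather than invent a response for an unknown key.
            raise LookupError(f"no canned response for {key!r}")
        return self._canned[key]


fake: LLMProvider = FakeLLMProvider({("scan_summary", "public"): "No issues found."})
print(fake.complete("Summarize scan s-1", purpose="scan_summary", content_class="public"))
```

Raising on an unknown key keeps unit tests deterministic: a test that triggers an unplanned LLM call fails instead of silently receiving a default.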
Implementation lands in consumer-epic integration (per the TA-12 / TA-14 / TA-15 precedent) — FakeLLMProvider lands when the first test consumer needs it (likely SPEC-242 Spectral Agent integration); vcr_cassette lands with the first integration test that records LLM output; the helper for drift comparison lands with the nightly workflow.
Each addition was approved under the contract-requirement-test discipline in force at the time (a contract-requirement test plus an inter-context requirement statement at PR time); that discipline is now superseded by ADR-065’s admission discipline.
D8 — Sensitive content in cassettes redacted at recording time
API keys plus PII stripped via pytest-recording filter hooks (filter_headers plus before_record_response). Cassette files are committed to the repo; redaction is contract.
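As a sketch of the two hooks named above: filter_headers and before_record_response are VCR.py configuration keys that pytest-recording picks up from a vcr_config fixture. The header list, the e-mail pattern standing in for "PII", and the REDACTED placeholder are assumptions; the function is shown undecorated so the block runs without pytest installed.

```python
import re

# Assumed PII pattern for illustration; a real list would be broader.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_response(response: dict) -> dict:
    """before_record_response hook: redact e-mail addresses in the body."""
    body = response.get("body", {}).get("string")
    if isinstance(body, bytes):
        try:
            text = body.decode("utf-8")
        except UnicodeDecodeError:
            # Binary or compressed body: leave untouched rather than corrupt it.
            return response
        response["body"]["string"] = _EMAIL.sub("REDACTED", text).encode("utf-8")
    return response


def vcr_config() -> dict:
    # With pytest-recording this would be a pytest fixture named vcr_config.
    return {
        # (name, replacement) tuples replace the header value before it is saved.
        "filter_headers": [("authorization", "REDACTED"), ("x-api-key", "REDACTED")],
        "before_record_response": scrub_response,
    }


sample = {"status": {"code": 200}, "headers": {}, "body": {"string": b'{"user": "ana@example.com"}'}}
print(scrub_response(sample)["body"]["string"])
```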
A new quality lint tools/quality/check_cassette_redaction.py (lands with the first cassette commit; before then a dead lint) blocks Authorization: Bearer ... and similar patterns from being committed. Wired into the pre-push gate per TA-26.
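The lint itself could be as small as the sketch below. The two secret patterns, the REDACTED placeholder that is allowed through, and the function names are assumptions; the actual tools/quality/check_cassette_redaction.py may look quite different.

```python
import re
from pathlib import Path

# Assumed secret shapes: a bearer token, or an "sk-" style API key.
SECRET_PATTERNS = [
    re.compile(r"Bearer\s+(?!REDACTED)[A-Za-z0-9._~+/=-]{8,}", re.IGNORECASE),
    re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"),
]


def find_leaks(text: str) -> list[str]:
    """Return every substring of a cassette that looks like a live secret."""
    return [m.group(0) for p in SECRET_PATTERNS for m in p.finditer(text)]


def check_files(paths: list[str]) -> int:
    """Return 1 if any cassette contains a suspected secret, else 0."""
    exit_code = 0
    for name in paths:
        leaks = find_leaks(Path(name).read_text(encoding="utf-8"))
        if leaks:
            # Report the count only, so the secret is not echoed into CI logs.
            print(f"{name}: {len(leaks)} possible secret(s)")
            exit_code = 1
    return exit_code


print(find_leaks("Authorization: REDACTED"))
```

A pre-push hook would call check_files with the staged cassette paths and block the push on a non-zero result.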
D9 — Coverage expectations follow TA-23 D3 layer floors (90 / 80 / 60)
LLM-invoking application code is covered via mock unit tests plus cassette integration tests; live-drift is a separate workflow, not a coverage source. The mock-vs-cassette split honors the unit-vs-integration test marker distinction.
D10 — Test-agents are reference implementations, not a test harness
apps/test-agents/ hosts working agent code for exploration plus demonstration. Automated test composition that uses test-agents (cassettes plus FakeLLMProvider) is acceptable but secondary to their primary purpose. The earlier draft “test-agent E2E lifecycle in TA-24 scope” is dropped after the founder reframe captured during disposition.
Alternatives considered
Ollama as CI substrate. Rejected per D3. Drift versus production providers; FP variance even at temperature 0; resource overhead in CI; maintenance load.
Global VCR cassette file. Rejected per D2. Merge-conflict storm on prompt edits; per-test cassettes scope blast radius to the affected test.
Live-provider on every PR. Rejected. Cost prohibitive; flaky on provider rate-limits; nightly plus on-merge gives the same regression coverage at a fraction of the cost.
No drift detection at alpha. Rejected. The whole point of cassettes is byte-perfect determinism on replay; without drift detection, prompt-or-model regressions slip silently and accumulate.
Test-agent E2E lifecycle in TA-24 scope (was draft D9). Dropped after the founder reframe. Test-agents are reference implementations for exploration plus demonstration, not a test harness.
Soft similarity on tool calls. Rejected per D5. Tool calls are the agent’s actions; soft similarity masks behavioral regressions.
Manual cassette curation instead of redaction filter hooks. Rejected per D8. Manual curation has a leakage failure mode (forget to redact a header); filter hooks plus a CI lint provide layered defense.
Consequences
- Deterministic CI for LLM-invoking code without sacrificing real-provider regression coverage.
- Cost bounded by structure — cassettes for bulk; live-drift in a dedicated workflow with a daily-cap account.
- Test posture aligns with existing TA-23 marker discipline. Adds one new marker (live_drift).
- spectral.core.<functional-area>.testing.* pattern establishes the home for test helpers shared between contexts; LLM helpers live at spectral.core.llm.testing; future shared fixtures for other functional areas (events, db, retention, etc.) follow the same pattern under their respective subdirs.
- Cassette regeneration discipline: when prompts intentionally change, recording sessions are required (RECORD_NEW_FIXTURES=1); developer workflow includes this step. Documented in docs/runbooks/llm-testing.md at close-pass.
- Threshold calibration: 0.85 is a starting point; the first false-positive or regression-slip will tune it. The workflow’s evidence log tracks both the trigger and the new threshold.
- Test-account credential management: daily-capped account, scoped to the test-live GitHub Environment, rotation per TA-25.
- tools/quality/check_cassette_redaction.py lands with the first cassette commit. Before then it is a dead lint with no inputs to evaluate; queued in tools/quality/ and noted in the test-infrastructure follow-on list.
- .github/workflows/nightly-live-drift.yml lands with the first cassette commit per the consumer-epic sequencing; the workflow has nothing to compare against until cassettes exist.
References
- ADR-065: spectral.core admission discipline; spectral.core.llm.testing placement complies with the functional-area subdir killer test
- ADR-035: TA-10 LLM stack (LLMProvider protocol; cost tracking; rate-limit defaults)
- ADR-036: TA-16 Sentry substrate (drift alert routing)
- ADR-045: TA-23 test substrate (marker enforcement; coverage floors)
- ADR-053: TA-26 pre-push gate wiring
- ADR-062: TA-25 (test-live GitHub Environment; secrets scoping)
- TA-24 disposition: SPEC-327 comment b445660f
- TA-24 verification: SPEC-327 comment 462b983b
- docs/runbooks/ci-secrets.md: operational scaffold (joint TA-24 / TA-25)
- docs/runbooks/llm-testing.md: close-pass runbook (recording sessions; drift triage; threshold calibration)
- Codex developer-guide/testing.mdx: close-pass update folds three-tier posture, cassette discipline, drift workflow, cost controls