ADR-061: LLM testing strategy
Context
LLM-invoking code in Spectral spans multiple contexts — Spectral Agent’s scan-analysis tools, World Agent’s domain-exploration reasoning, Operations Agent’s task tools, plus the LLMProvider substrate itself in spectral.core.llm. Each path needs a testing posture that is deterministic enough for CI, faithful enough to catch real regressions, and cost-bounded.
Three candidate test substrates surfaced during disposition:
- Mocks: fast, deterministic, cost-zero, but cannot catch regressions in real provider behavior.
- Local Ollama: mostly deterministic (output still shows floating-point variance even at temperature 0) and self-hosted, but it drifts independently of production providers, so passing Ollama tests do not predict Anthropic / OpenAI / Google behavior. The CI resource overhead (GPU runners are expensive; CPU inference is 10–100× slower) and the maintenance burden (model pulls, version pinning, container images) do not earn their keep.
- Recorded cassettes (VCR-style): byte-perfect deterministic on replay, production-aligned at recording time, cost-zero on replay. Requires recording discipline when prompts change.
A real-provider safety net is still useful, but it is an operator action rather than a standing CI concern. Cassettes capture recorded behavior; prompt/model changes regenerate the affected cassettes during the same change that needs them. CI replays committed cassettes deterministically and never owns provider credentials just to refresh fixtures.
This ADR lands a two-tier automated posture plus manual live-recording discipline, names the recording mechanism, and places test helpers shared between contexts under tests/<context>/<area>/ mirroring source layout (test substrate is categorically outside the kernel per ADR-065 D1).
Decision
D1 — Automated LLM test posture for LLM-invoking code
- Unit + contract = mock LLM via a FakeLLMProvider impl (implements the LLMProvider protocol from spectral.core.llm.protocols); returns canned responses; deterministic; zero external calls.
- Integration = real LLM provider via VCR-style recorded cassettes (D2). Replay is byte-perfect deterministic.
- Manual live recording (D4) = operator-run recording sessions when prompts, fixtures, or model/provider choices intentionally change.
The automated tiers map cleanly onto the existing unit / contract / integration marker enforcement (TA-23 D2 root conftest). Tests that require live provider credentials use the existing live_llm exclusion bucket and are not scheduled nightly.
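A minimal sketch of the mock tier, assuming the LLMProvider protocol exposes a single complete(prompt) -> str method. That signature is an assumption made for illustration: the real protocol in spectral.core.llm.protocols may look different, and the Protocol class below is only a local stand-in for it.

```python
from typing import Callable, List, Protocol, Sequence, Union


class LLMProvider(Protocol):
    """Local stand-in for spectral.core.llm.protocols.LLMProvider (signature assumed)."""

    def complete(self, prompt: str) -> str: ...


# A canned reply can be one string, an ordered sequence, or a callable.
CannedResponses = Union[str, Sequence[str], Callable[[str], str]]


class FakeLLMProvider:
    """Returns canned responses and records every prompt; makes no external calls."""

    def __init__(self, responses: CannedResponses) -> None:
        self._responses = responses
        self._cursor = 0
        self.prompts: List[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        if callable(self._responses):
            # Callable form: compute the reply from the prompt.
            return self._responses(prompt)
        if isinstance(self._responses, str):
            # Single-response form: the same reply every time.
            return self._responses
        if self._cursor >= len(self._responses):
            # Sequence form: fail loudly instead of silently repeating.
            raise AssertionError("FakeLLMProvider ran out of canned responses")
        reply = self._responses[self._cursor]
        self._cursor += 1
        return reply
```

Recording the prompts lets a unit test assert on what the code under test sent, not only on what it did with the reply.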
D2 — Recording mechanism: pytest-recording per-test cassettes
Cassettes stored at the area-scoped path tests/core/llm/cassettes/<rel-test-path>/<test-id>.yaml, so the redaction-lint CI step scans a single tree (linear path-walk; one root). Per-test (not global) avoids merge conflicts on prompt edits — a prompt change only invalidates the affected tests’ cassettes. Recording is triggered via pytest-recording’s native --record-mode={once,all} CLI flag during local recording sessions — a per-invocation command-line override, clearer than env-var state and with no Spectral-custom surface to maintain; cassettes are committed to the repo as test artifacts.
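The per-test layout can be sketched as a path derivation. This sketch assumes that <rel-test-path> means the test module's path relative to tests/ with the .py suffix removed; the ADR does not spell that out, and cassette_path is a hypothetical helper name, not part of pytest-recording.

```python
from pathlib import PurePosixPath

# Single root, so a redaction lint only has to walk one tree.
CASSETTE_ROOT = PurePosixPath("tests/core/llm/cassettes")


def cassette_path(test_file: str, test_id: str, tests_root: str = "tests") -> PurePosixPath:
    """Map one test to its own cassette file under the shared root."""
    # Assumed convention: module path relative to tests/, without ".py".
    rel_test_path = PurePosixPath(test_file).relative_to(tests_root).with_suffix("")
    return CASSETTE_ROOT / rel_test_path / f"{test_id}.yaml"
```

Because each test owns one file, a prompt edit touches only the cassettes of the tests that exercise that prompt.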
D3 — Ollama is not part of the test contract or CI infrastructure
Cassettes are stricter (byte-perfect replay; no FP variance) and production-aligned (real Anthropic / OpenAI / Google output). Ollama models drift independently of production providers; passing Ollama tests do not predict production behavior. Resource overhead in CI and maintenance burden do not earn their keep.
Developer-local Ollama use for prompt iteration or sandbox exploration is a workflow choice (not an architectural decision) — the LLMProvider protocol is swappable; a developer-local OllamaLLMProvider impl is permissible as a workflow tool, just not committed as a test fixture.
Forward trigger: revisit if property-based fuzzing of LLM-mediated code emerges as a need that cassettes cannot cover (e.g., generate 1000 inputs; need real LLM behavior on each; cassettes infeasible for unbounded inputs).
D4 — Live recording is manual, not scheduled
There is no nightly live-provider re-record workflow and no dedicated drift marker. The previous nightly smoke provided little product signal: it exercised the provider transport/cassette seam, not the deployed platform. Refreshing cassettes is therefore an explicit operator action:
- Run the narrow recording command for the affected suite (--record-mode=once for new cassettes, --record-mode=all only when intentionally regenerating the selected surface).
- Review the cassette diff before commit.
- Run tools/quality/check_cassette_redaction.py before commit.
- Commit cassette changes with the prompt/fixture/model change that requires them.
D5 — Similarity helper = 0.85 text / structural exact-match tool calls
The 0.85 default is an initial calibration starting point. The threshold is helper configuration with a per-test override via pytest marker, not a workflow contract.
- Text outputs compared via difflib.SequenceMatcher.ratio(); threshold 0.85 is the default.
- Tool-call outputs compared structurally (tool name plus argument values); exact match. Soft similarity on tool calls would mask real behavioral regressions (tool calls are the agent’s actions).
- Per-test override via marker: @pytest.mark.llm_drift_threshold(0.7), for tests where outputs are inherently more variable (creative-writing prompts; multi-valid-phrasing summarization).
- Calibration triggers: tune the global default when (a) a first false-positive blocks a legitimate change, OR (b) a first regression slips through. Track the reason in the test or runbook entry that changes the threshold.
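The two comparison modes above can be sketched with the standard library alone. The assert_llm_output_similar signature follows the one named in D7; assert_tool_calls_equal and the {"name": ..., "arguments": ...} tool-call shape are assumptions made for illustration, not the actual Spectral helper surface.

```python
import difflib
from typing import Any, Mapping, Sequence


def assert_llm_output_similar(actual: str, recorded: str, *, threshold: float = 0.85) -> None:
    """Soft comparison for free-text completions."""
    ratio = difflib.SequenceMatcher(None, actual, recorded).ratio()
    if ratio < threshold:
        raise AssertionError(
            f"similarity {ratio:.3f} is below threshold {threshold:.2f}"
        )


def assert_tool_calls_equal(
    actual: Sequence[Mapping[str, Any]],
    recorded: Sequence[Mapping[str, Any]],
) -> None:
    """Exact comparison for tool calls: same count, names, and argument values."""
    if len(actual) != len(recorded):
        raise AssertionError(
            f"expected {len(recorded)} tool calls, got {len(actual)}"
        )
    for index, (got, want) in enumerate(zip(actual, recorded)):
        # Assumed shape: each call is a mapping with "name" and "arguments".
        if got["name"] != want["name"] or got["arguments"] != want["arguments"]:
            raise AssertionError(f"tool call {index} differs: {got!r} != {want!r}")
```

Keeping the two assertions separate makes it hard to apply the soft threshold to a tool call by accident.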
D6 — Cost controls inherit TA-10 D5 and avoid scheduled provider spend
Automated CI uses mocks and cassette replay, so it makes zero live-provider calls for LLM tests. Operator-run recording sessions use local/operator credentials and are intentionally scoped to the affected cassettes.
D7 — Test helpers under tests/<context>/<area>/
Shared test substrate lives in the test tree, mirroring the source tree it tests. LLM test helpers live at tests/core/llm/* — paired with the LLM source area (src/spectral/core/llm/) and the existing LLM tests (test_contract_llm.py, test_contract_llm_usage.py). Initial surface (SPEC-428):
- FakeLLMProvider (in fake_provider.py): implements spectral.core.llm.protocols.LLMProvider; canned single response, sequence, or callable form; useful in unit/contract tests
- vcr_cassette pytest fixture (in recording.py): wraps pytest-recording with Spectral conventions: area-scoped cassette directory under tests/core/llm/cassettes/<rel-test-path>/; sensitive-header redaction at record time; record-mode controlled by pytest-recording’s native --record-mode={once,all} CLI flag (per D2)
- assert_llm_output_similar(actual, recorded, *, threshold=0.85) (in similarity.py): difflib.SequenceMatcher.ratio()-based similarity assertion (per D5) for non-deterministic completions; respects per-test threshold override
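One way the per-test threshold override could be resolved is sketched below. resolve_drift_threshold is a hypothetical helper name; it relies only on pytest's get_closest_marker method on a test node and on the marker's args tuple. How similarity.py actually wires this up is not specified here.

```python
DEFAULT_THRESHOLD = 0.85


def resolve_drift_threshold(node, default: float = DEFAULT_THRESHOLD) -> float:
    """Return the threshold from @pytest.mark.llm_drift_threshold(x), else the default.

    `node` is expected to be a pytest test item, e.g. `request.node` inside a fixture.
    """
    marker = node.get_closest_marker("llm_drift_threshold")
    if marker is None or not marker.args:
        return default
    threshold = float(marker.args[0])
    if not 0.0 <= threshold <= 1.0:
        # SequenceMatcher.ratio() is always in [0, 1]; anything else is a typo.
        raise ValueError(f"llm_drift_threshold must be in [0, 1], got {threshold}")
    return threshold
```

In a conftest.py, a fixture could return resolve_drift_threshold(request.node) and pass the result as the threshold argument of assert_llm_output_similar.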
Pytest’s test_*.py collection convention separates collected tests from substrate modules within the same directory; substrate filenames (fake_provider.py, recording.py, similarity.py) do not get collected and are imported by consumers as tests.core.llm.*.
Test helpers live in the test tree, mirroring the source area they pair with, and are kept out of src/spectral/. Placing them inside src/spectral/ (e.g. at spectral.core.llm.testing.*) would put pytest fixtures and assertion helpers — code with logic and a test-time dependency surface (pytest, pytest-recording) — into the production wheel, and would force an implicit relaxation of the kernel admission discipline (no top-level functions; frozen models only) to admit them. Test code is categorically outside runtime substrate; mirroring source from the test side keeps the kernel narrow and the production wheel clean.
D8 — Sensitive content in cassettes redacted at recording time
API keys plus PII stripped via pytest-recording filter hooks (filter_headers plus before_record_response). Cassette files are committed to the repo; redaction is contract.
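A sketch of what the two filter hooks might look like. filter_headers and before_record_response are VCR.py options that pytest-recording accepts through its vcr_config fixture. The header lists, the email-only PII pattern, and the placeholder values are illustrative assumptions, not the project's actual redaction policy.

```python
import re

REDACTED = "REDACTED"

# Request headers replaced at record time (illustrative list).
SENSITIVE_REQUEST_HEADERS = ["authorization", "x-api-key", "cookie"]
# Response headers replaced at record time (illustrative list).
SENSITIVE_RESPONSE_HEADERS = {"set-cookie"}
# Deliberately simple PII example: email addresses in the response body.
EMAIL_PATTERN = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_response(response: dict) -> dict:
    """before_record_response hook: redact a VCR.py response dict before it is saved."""
    headers = response.get("headers", {})
    for name in list(headers):
        if name.lower() in SENSITIVE_RESPONSE_HEADERS:
            headers[name] = [REDACTED]
    body = response.get("body", {}).get("string")
    if isinstance(body, bytes):
        response["body"]["string"] = EMAIL_PATTERN.sub(b"redacted@example.invalid", body)
    return response


# In conftest.py, a `vcr_config` fixture would return this dict.
VCR_CONFIG = {
    "filter_headers": [(name, REDACTED) for name in SENSITIVE_REQUEST_HEADERS],
    "before_record_response": scrub_response,
}
```

The body scrub only sees uncompressed bytes, so compressed responses would need to be decoded before a pattern like this could match.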
A new quality lint tools/quality/check_cassette_redaction.py (lands with the first cassette commit; before then a dead lint) blocks Authorization: Bearer ... and similar patterns from being committed. Wired into the pre-push gate per TA-26.
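A sketch of how such a lint could work, assuming cassettes are YAML files under tests/core/llm/cassettes/. The patterns are illustrative assumptions rather than the real rule set. In VCR-style YAML a header name and its value usually sit on separate lines, so the sketch matches the Bearer token itself instead of a full "Authorization: Bearer ..." line.

```python
import re
import sys
from pathlib import Path

# Illustrative patterns only; a real lint would carry the project's own list.
SECRET_PATTERNS = [
    re.compile(r"\bBearer\s+[A-Za-z0-9._~+/=-]{8,}", re.IGNORECASE),
    re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),  # assumed API-key prefix shape
]


def scan_text(text: str) -> list:
    """Return 1-based line numbers that match any secret pattern."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if any(pattern.search(line) for pattern in SECRET_PATTERNS)
    ]


def find_leaks(root: Path) -> list:
    """Walk the single cassette root and report 'path:line' for each hit."""
    findings = []
    for path in sorted(root.rglob("*.yaml")):
        for lineno in scan_text(path.read_text(encoding="utf-8")):
            findings.append(f"{path}:{lineno}")
    return findings


def main(argv: list) -> int:
    """Exit code 1 when any cassette contains an unredacted secret."""
    root = Path(argv[1]) if len(argv) > 1 else Path("tests/core/llm/cassettes")
    findings = find_leaks(root)
    for finding in findings:
        print(f"unredacted secret: {finding}", file=sys.stderr)
    return 1 if findings else 0
```

A script entry point would call sys.exit(main(sys.argv)), so a pre-push gate can block on the non-zero exit code.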
D9 — Coverage expectations follow TA-23 D3 layer floors (90 / 80 / 60)
LLM-invoking application code is covered via mock unit tests plus cassette integration tests. Manual live recording is not a coverage source. The mock-vs-cassette split honors the unit-vs-integration test marker distinction.
D10 — Test-agents are reference implementations, not a test harness
apps/test-agents/ hosts working agent code for exploration plus demonstration. Automated test composition that uses test-agents (cassettes plus FakeLLMProvider) is acceptable but secondary to their primary purpose. The earlier draft “test-agent E2E lifecycle in TA-24 scope” is dropped after the founder reframe captured during disposition.
Alternatives considered
Ollama as CI substrate. Rejected per D3. Drift versus production providers; FP variance even at temperature 0; resource overhead in CI; maintenance load.
Global VCR cassette file. Rejected per D2. Merge-conflict storm on prompt edits; per-test cassettes scope blast radius to the affected test.
Live-provider on every PR. Rejected. Cost prohibitive and flaky on provider rate-limits.
Nightly live-provider drift workflow. Rejected after implementation review. The only drift-marked test was a low-level cassette smoke, not product validation, and the QA re-record path did not justify a standing GitHub provider key. Re-record cassettes manually when prompts, fixtures, or models intentionally change.
Test-agent E2E lifecycle in TA-24 scope (was draft D9). Dropped after the founder reframe. Test-agents are reference implementations for exploration plus demonstration, not a test harness.
Soft similarity on tool calls. Rejected per D5. Tool calls are the agent’s actions; soft similarity masks behavioral regressions.
Manual cassette curation instead of redaction filter hooks. Rejected per D8. Manual curation has a leakage failure mode (forget to redact a header); filter hooks plus a CI lint provide layered defense.
Consequences
- Deterministic CI for LLM-invoking code without scheduled live-provider spend.
- Cost bounded by structure: cassettes for automated coverage; manual recording for fixture refresh.
- Test posture aligns with existing TA-23 marker discipline. No extra nightly drift marker is required.
- tests/<context>/<area>/ pattern establishes the home for test helpers shared between contexts, mirroring source layout from the test side; LLM helpers live at tests/core/llm/; future shared substrate for other functional areas (events, db, retention, etc.) follows the same pattern under their respective tests/core/<area>/ subdirs. Production source under src/spectral/ carries zero test-only dependencies.
- Cassette regeneration discipline: when prompts intentionally change, targeted recording sessions are required. Documented in docs/runbooks/llm-testing.md.
- Threshold calibration: 0.85 is a starting point; the first false-positive or regression-slip will tune it.
- tools/quality/check_cassette_redaction.py lands with the first cassette commit. Before then it is a dead lint with no inputs to evaluate; queued in tools/quality/ and noted in the test-infrastructure follow-on list.
References
- ADR-065: spectral.core admission discipline; D1 places test substrate in tests/<context>/<area>/, not in the kernel
- ADR-035: TA-10 LLM stack (LLMProvider protocol; cost tracking; rate-limit defaults)
- ADR-036: TA-16 Sentry substrate (drift alert routing)
- ADR-045: TA-23 test substrate (marker enforcement; coverage floors)
- ADR-053: TA-26 pre-push gate wiring
- ADR-062: CI secrets handling
- TA-24 disposition: SPEC-327 comment b445660f
- TA-24 verification: SPEC-327 comment 462b983b
- docs/runbooks/ci-secrets.md: CI secret handling
- docs/runbooks/llm-testing.md: recording sessions and cassette discipline
- Codex developer-guide/testing.mdx: testing posture and cassette discipline