ADR-061: LLM testing strategy

Context

LLM-invoking code in Spectral spans multiple contexts — Spectral Agent’s scan-analysis tools, World Agent’s domain-exploration reasoning, Operations Agent’s task tools, plus the LLMProvider substrate itself in spectral.core.llm. Each path needs a testing posture that is deterministic enough for CI, faithful enough to catch real regressions, and cost-bounded.

Three candidate test substrates surfaced during disposition:

  • Mocks — fast, deterministic, cost-zero, but cannot catch regressions in real provider behavior.
  • Local Ollama — near-deterministic (floating-point variance persists even at temperature 0) and self-hosted, but it drifts independently of production providers; passing Ollama tests do not predict Anthropic / OpenAI / Google behavior. The resource overhead in CI (GPU runners are expensive; CPU inference is 10–100× slower) and the maintenance burden (model pulls, version pinning, container images) do not earn their keep.
  • Recorded cassettes (VCR-style) — byte-perfect deterministic on replay, production-aligned at recording time, cost-zero on replay. Requires recording discipline when prompts change.

A real-provider safety net is still useful, but it is an operator action rather than a standing CI concern. Cassettes capture recorded behavior; prompt/model changes regenerate the affected cassettes during the same change that needs them. CI replays committed cassettes deterministically and never owns provider credentials just to refresh fixtures.

This ADR lands a two-tier automated posture plus manual live-recording discipline, names the recording mechanism, and places test helpers shared between contexts under tests/<context>/<area>/ mirroring source layout (test substrate is categorically outside the kernel per ADR-065 D1).

Decision

D1 — Automated LLM test posture for LLM-invoking code

  • Unit + contract = mock LLM via FakeLLMProvider impl (implements LLMProvider protocol from spectral.core.llm.protocols); returns canned responses; deterministic; zero external calls.
  • Integration = real LLM provider via VCR-style recorded cassettes (D2). Replay is byte-perfect deterministic.
  • Manual live recording (D4) = operator-run recording sessions when prompts, fixtures, or model/provider choices intentionally change.

The automated tiers map cleanly onto the existing unit / contract / integration marker enforcement (TA-23 D2 root conftest). Tests that require live provider credentials use the existing live_llm exclusion bucket and are not scheduled nightly.

D2 — Recording mechanism: pytest-recording per-test cassettes

Cassettes are stored at the area-scoped path tests/core/llm/cassettes/<rel-test-path>/<test-id>.yaml, so the redaction-lint CI step scans a single tree (linear path-walk; one root). Per-test (not global) cassettes avoid merge conflicts on prompt edits: a prompt change only invalidates the affected tests’ cassettes. Recording is triggered via pytest-recording’s native --record-mode={once,all} CLI flag during local recording sessions. This is a per-invocation command-line override, clearer than env-var state and with no Spectral-custom surface to maintain. Cassettes are committed to the repo as test artifacts.
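The path convention above can be sketched as a small helper. This is an illustration only: the function name cassette_path, the choice to make <rel-test-path> relative to tests/ with the .py suffix dropped, and the id sanitization rule are all assumptions, not something this ADR specifies.

```python
from pathlib import PurePosixPath

# Root taken from the path pattern in D2.
CASSETTE_ROOT = PurePosixPath("tests/core/llm/cassettes")


def cassette_path(test_file: str, test_id: str, tests_root: str = "tests") -> PurePosixPath:
    """Map a test file and test id to a per-test cassette path (hypothetical helper)."""
    # Assumption: <rel-test-path> is the test module path relative to tests/, without ".py".
    rel = PurePosixPath(test_file).relative_to(tests_root).with_suffix("")
    # Assumption: characters outside [A-Za-z0-9-_.] (e.g. "[" in parametrized ids) become "_".
    safe_id = "".join(c if c.isalnum() or c in "-_." else "_" for c in test_id)
    return CASSETTE_ROOT / rel / f"{safe_id}.yaml"
```

Because every cassette resolves under one root, a lint can walk that single tree, and a prompt edit touches only the files belonging to the affected tests.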

D3 — Ollama is not part of the test contract or CI infrastructure

Cassettes are stricter (byte-perfect replay; no FP variance) and production-aligned (real Anthropic / OpenAI / Google output). Ollama models drift independently of production providers; passing Ollama tests do not predict production behavior. Resource overhead in CI and maintenance burden do not earn their keep.

Developer-local Ollama use for prompt iteration or sandbox exploration is a workflow choice (not an architectural decision) — the LLMProvider protocol is swappable; a developer-local OllamaLLMProvider impl is permissible as a workflow tool, just not committed as a test fixture.

Forward trigger: revisit if property-based fuzzing of LLM-mediated code emerges as a need that cassettes cannot cover (e.g., generate 1000 inputs; need real LLM behavior on each; cassettes infeasible for unbounded inputs).

D4 — Live recording is manual, not scheduled

There is no nightly live-provider re-record workflow and no dedicated drift marker. The previous nightly smoke provided little product signal: it exercised the provider transport/cassette seam, not the deployed platform. Refreshing cassettes is therefore an explicit operator action:

  • Run the narrow recording command for the affected suite (--record-mode=once for new cassettes, --record-mode=all only when intentionally regenerating the selected surface).
  • Review the cassette diff before commit.
  • Run tools/quality/check_cassette_redaction.py before commit.
  • Commit cassette changes with the prompt/fixture/model change that requires them.

D5 — Similarity helper = 0.85 text / structural exact-match tool calls

Initial calibration starting point. Threshold is helper config plus per-test override via pytest marker, not a workflow contract.

  • Text outputs compared via difflib.SequenceMatcher.ratio(); threshold 0.85 is the default.
  • Tool-call outputs compared structurally — tool name plus argument values; exact match. Soft similarity on tool calls would mask real behavioral regressions (tool calls are the agent’s actions).
  • Per-test override via marker: @pytest.mark.llm_drift_threshold(0.7) — for tests where outputs are inherently more variable (creative-writing prompts; multi-valid-phrasing summarization).
  • Calibration triggers: tune the global default when (a) a first false-positive blocks a legitimate change, OR (b) a first regression slips through. Track the reason in the test or runbook entry that changes the threshold.
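A minimal sketch of the two comparison modes, assuming text outputs are plain strings and tool calls are dicts with "name" and "arguments" keys. That dict shape and the name assert_tool_calls_equal are hypothetical; the marker-based override is represented here only by the threshold keyword argument.

```python
from difflib import SequenceMatcher
from typing import Any, Mapping, Sequence

DEFAULT_THRESHOLD = 0.85  # starting point from D5, expected to be recalibrated


def text_similarity(actual: str, recorded: str) -> float:
    """Return the SequenceMatcher ratio in [0.0, 1.0]."""
    return SequenceMatcher(None, actual, recorded).ratio()


def assert_llm_output_similar(actual: str, recorded: str, *, threshold: float = DEFAULT_THRESHOLD) -> None:
    """Fail when the text similarity falls below the threshold."""
    score = text_similarity(actual, recorded)
    if score < threshold:
        raise AssertionError(f"similarity {score:.3f} is below threshold {threshold:.2f}")


def assert_tool_calls_equal(
    actual: Sequence[Mapping[str, Any]], recorded: Sequence[Mapping[str, Any]]
) -> None:
    """Exact structural match on tool name and argument values, in call order."""
    got = [(c["name"], c["arguments"]) for c in actual]
    want = [(c["name"], c["arguments"]) for c in recorded]
    if got != want:
        raise AssertionError(f"tool calls differ: {got!r} != {want!r}")
```

For example, "abcd" against "abcf" scores 0.75, which fails at the 0.85 default and passes with a per-test threshold of 0.7. A single changed tool argument always fails.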

D6 — Cost controls inherit TA-10 D5 and avoid scheduled provider spend

Automated CI uses mocks and cassette replay, so it makes zero live-provider calls for LLM tests. Operator-run recording sessions use local/operator credentials and are intentionally scoped to the affected cassettes.

D7 — Test helpers under tests/<context>/<area>/

Shared test substrate lives in the test tree, mirroring the source tree it tests. LLM test helpers live at tests/core/llm/* — paired with the LLM source area (src/spectral/core/llm/) and the existing LLM tests (test_contract_llm.py, test_contract_llm_usage.py). Initial surface (SPEC-428):

  • FakeLLMProvider (in fake_provider.py) — implements spectral.core.llm.protocols.LLMProvider; canned single response, sequence, or callable form; useful in unit/contract tests
  • vcr_cassette pytest fixture (in recording.py) — wraps pytest-recording with Spectral conventions: area-scoped cassette directory under tests/core/llm/cassettes/<rel-test-path>/; sensitive-header redaction at record time; record-mode controlled by pytest-recording’s native --record-mode={once,all} CLI flag (per D2)
  • assert_llm_output_similar(actual, recorded, *, threshold=0.85) (in similarity.py) — difflib.SequenceMatcher.ratio()-based similarity assertion (per D5) for non-deterministic completions; respects per-test threshold override
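The three canned-response forms of FakeLLMProvider could look roughly like the sketch below. The real LLMProvider protocol is not reproduced in this ADR, so the single complete(prompt) -> str method is a hypothetical stand-in, as is the calls attribute used for inspection.

```python
from typing import Callable, Protocol, Sequence, Union


class LLMProvider(Protocol):
    """Hypothetical stand-in for spectral.core.llm.protocols.LLMProvider."""

    def complete(self, prompt: str) -> str: ...


Canned = Union[str, Sequence[str], Callable[[str], str]]


class FakeLLMProvider:
    """Returns canned responses and makes no external calls."""

    def __init__(self, canned: Canned) -> None:
        self.calls = []  # prompts seen so far, for assertions in tests
        if callable(canned):
            self._respond = canned  # callable form: compute a reply from the prompt
        elif isinstance(canned, str):
            self._respond = lambda _prompt: canned  # single form: same reply every time
        else:
            remaining = list(canned)  # sequence form: one reply per call, in order
            self._respond = lambda _prompt: self._pop(remaining)

    @staticmethod
    def _pop(remaining: list) -> str:
        if not remaining:
            raise AssertionError("FakeLLMProvider ran out of canned responses")
        return remaining.pop(0)

    def complete(self, prompt: str) -> str:
        self.calls.append(prompt)
        return self._respond(prompt)
```

Failing loudly when the sequence is exhausted is a design choice in this sketch: an unexpected extra LLM call then surfaces as a test failure instead of silently reusing a response.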

Pytest’s test_*.py collection convention separates collected tests from substrate modules within the same directory; substrate filenames (fake_provider.py, recording.py, similarity.py) do not get collected and are imported by consumers as tests.core.llm.*.

Test helpers live in the test tree, mirroring the source area they pair with, and are kept out of src/spectral/. Placing them inside src/spectral/ (e.g. at spectral.core.llm.testing.*) would put pytest fixtures and assertion helpers — code with logic and a test-time dependency surface (pytest, pytest-recording) — into the production wheel, and would force an implicit relaxation of the kernel admission discipline (no top-level functions; frozen models only) to admit them. Test code is categorically outside runtime substrate; mirroring source from the test side keeps the kernel narrow and the production wheel clean.

D8 — Sensitive content in cassettes redacted at recording time

API keys and PII are stripped via pytest-recording filter hooks (filter_headers plus before_record_response). Cassette files are committed to the repo, so redaction is part of the contract.
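As a rough sketch of the record-time hooks: a before_record_response callback receives the recorded response as a dict whose body bytes sit under response["body"]["string"], and returns the (possibly modified) dict. The header names and regular expressions below are illustrative assumptions, not Spectral's actual redaction rules.

```python
import re

# Illustrative patterns only; the real rule set is not specified in this ADR.
EMAIL = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER = re.compile(rb"Bearer\s+[A-Za-z0-9._\-]+")


def scrub_response(response: dict) -> dict:
    """Redact e-mail addresses and bearer tokens from a recorded response body."""
    body = response.get("body", {}).get("string")
    if isinstance(body, bytes):
        body = EMAIL.sub(b"<redacted-email>", body)
        body = BEARER.sub(b"Bearer <redacted>", body)
        response["body"]["string"] = body
    return response


# Keyword arguments of the kind a vcr_config fixture would return.
# Header names are assumptions about which providers are recorded.
VCR_CONFIG = {
    "filter_headers": [("authorization", "<redacted>"), ("x-api-key", "<redacted>")],
    "before_record_response": scrub_response,
}
```

Header filtering covers request credentials, while the response hook covers anything a provider echoes back in the body.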

A new quality lint, tools/quality/check_cassette_redaction.py, blocks Authorization: Bearer ... and similar patterns from being committed. It lands with the first cassette commit; before then it is a dead lint with no inputs to evaluate. It is wired into the pre-push gate per TA-26.
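The lint's core check might be sketched as follows. The actual contents of check_cassette_redaction.py are not given in this ADR; the function names, the two patterns, and the exit-code convention here are assumptions for illustration.

```python
import re
from pathlib import Path

# Illustrative patterns only: an unredacted bearer token and an "sk-" style key.
SECRET_PATTERNS = [
    re.compile(r"Bearer\s+[A-Za-z0-9._\-]{8,}"),
    re.compile(r"\bsk-[A-Za-z0-9_\-]{16,}"),
]


def find_leaks(text: str) -> list:
    """Return (line number, pattern) pairs for every suspicious line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in SECRET_PATTERNS:
            if pattern.search(line):
                hits.append((lineno, pattern.pattern))
    return hits


def main(root: str = "tests/core/llm/cassettes") -> int:
    """Walk the single cassette tree; return 1 if anything leaked, else 0."""
    failures = 0
    for path in sorted(Path(root).rglob("*.yaml")):
        for lineno, pattern in find_leaks(path.read_text(encoding="utf-8")):
            print(f"{path}:{lineno}: matches {pattern}")
            failures += 1
    return 1 if failures else 0
```

A script entry point would pass the return value of main() to sys.exit so the pre-push gate can block on a non-zero status. A redacted placeholder such as "Bearer <redacted>" does not match, so correctly scrubbed cassettes pass.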

D9 — Coverage expectations follow TA-23 D3 layer floors (90 / 80 / 60)

LLM-invoking application code is covered via mock unit tests plus cassette integration tests. Manual live recording is not a coverage source. The mock-vs-cassette split honors the unit-vs-integration test marker distinction.

D10 — Test-agents are reference implementations, not a test harness

apps/test-agents/ hosts working agent code for exploration plus demonstration. Automated test composition that uses test-agents (cassettes plus FakeLLMProvider) is acceptable but secondary to their primary purpose. The earlier draft “test-agent E2E lifecycle in TA-24 scope” is dropped after the founder reframe captured during disposition.

Alternatives considered

Ollama as CI substrate. Rejected per D3. Drift versus production providers; FP variance even at temperature 0; resource overhead in CI; maintenance load.

Global VCR cassette file. Rejected per D2. Merge-conflict storm on prompt edits; per-test cassettes scope blast radius to the affected test.

Live-provider on every PR. Rejected. Cost prohibitive and flaky on provider rate-limits.

Nightly live-provider drift workflow. Rejected after implementation review. The only drift-marked test was a low-level cassette smoke, not product validation, and the QA re-record path did not justify a standing GitHub provider key. Re-record cassettes manually when prompts, fixtures, or models intentionally change.

Test-agent E2E lifecycle in TA-24 scope (was draft D9). Dropped after the founder reframe. Test-agents are reference implementations for exploration plus demonstration, not a test harness.

Soft similarity on tool calls. Rejected per D5. Tool calls are the agent’s actions; soft similarity masks behavioral regressions.

Manual cassette curation instead of redaction filter hooks. Rejected per D8. Manual curation has a leakage failure mode (forget to redact a header); filter hooks plus a CI lint provide layered defense.

Consequences

  • Deterministic CI for LLM-invoking code without scheduled live-provider spend.
  • Cost bounded by structure — cassettes for automated coverage; manual recording for fixture refresh.
  • Test posture aligns with existing TA-23 marker discipline. No extra nightly drift marker is required.
  • tests/<context>/<area>/ pattern establishes the home for test helpers shared between contexts, mirroring source layout from the test side; LLM helpers live at tests/core/llm/; future shared substrate for other functional areas (events, db, retention, etc.) follows the same pattern under their respective tests/core/<area>/ subdirs. Production source under src/spectral/ carries zero test-only dependencies.
  • Cassette regeneration discipline — when prompts intentionally change, targeted recording sessions are required. Documented in docs/runbooks/llm-testing.md.
  • Threshold calibration — 0.85 is a starting point; the first false-positive or regression-slip will tune it.
  • tools/quality/check_cassette_redaction.py lands with the first cassette commit. Before then it is a dead lint with no inputs to evaluate; queued in tools/quality/ and noted in the test-infrastructure follow-on list.

References

  • ADR-065 — spectral.core admission discipline; D1 places test substrate in tests/<context>/<area>/, not in the kernel
  • ADR-035 — TA-10 LLM stack (LLMProvider protocol; cost tracking; rate-limit defaults)
  • ADR-036 — TA-16 Sentry substrate (drift alert routing)
  • ADR-045 — TA-23 test substrate (marker enforcement; coverage floors)
  • ADR-053 — TA-26 pre-push gate wiring
  • ADR-062 — CI secrets handling
  • TA-24 disposition — SPEC-327 comment b445660f
  • TA-24 verification — SPEC-327 comment 462b983b
  • docs/runbooks/ci-secrets.md — CI secret handling
  • docs/runbooks/llm-testing.md — recording sessions and cassette discipline
  • Codex developer-guide/testing.mdx — testing posture and cassette discipline