# LLM testing runbook
Operational procedures for the three-tier LLM test posture — recording sessions, drift triage, threshold calibration, cost controls.
System reference: Codex how-to/testing.mdx · ADR-061 · ADR-062.
## Three tiers
| Tier | Substrate | Marker | Cost |
|---|---|---|---|
| Unit + contract | `FakeLLMProvider` (mock) | `unit` / `contract` | $0 |
| Integration | pytest-recording cassettes (replay) | `integration` | $0 |
| Live drift detection | Real provider via nightly workflow | `live_drift` | Bounded by daily-capped test account |
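The markers in the table are what make each tier independently selectable. A minimal sketch of registering them, assuming standard pytest marker registration in a shared `conftest.py` (the help strings are illustrative):

```python
# conftest.py -- register the tier markers so a tier can be selected
# with `pytest -m unit`, `-m integration`, or `-m live_drift`.
def pytest_configure(config):
    for marker, help_text in [
        ("unit", "tier 1: FakeLLMProvider, zero cost"),
        ("contract", "tier 1: contract tests against the fake"),
        ("integration", "tier 2: pytest-recording cassette replay, zero cost"),
        ("live_drift", "tier 3: nightly real-provider drift detection"),
    ]:
        config.addinivalue_line("markers", f"{marker}: {help_text}")
```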
## Recording sessions
When a prompt intentionally changes, regenerate the affected cassettes.
### Local recording
```bash
# Set env flag, run the affected test (or test file/dir)
RECORD_NEW_FIXTURES=1 uv run pytest tests/platform/integration/test_scan_diagnose.py -m integration

# Inspect the diff before commit
git diff tests/platform/_fixtures/llm/
```

The redaction lint (`tools/quality/check_cassette_redaction.py`) lands with the first cassette commit and runs as a pre-push gate per ADR-061 D8; before that point it has no inputs and is a dead lint.
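A minimal sketch of how the `RECORD_NEW_FIXTURES` flag could map onto pytest-recording, using its `vcr_config` fixture hook; the record modes are vcrpy's, and the header list is illustrative:

```python
# conftest.py -- translate the env flag into a VCR record mode.
import os

import pytest

@pytest.fixture(scope="module")
def vcr_config():
    return {
        # "all" re-records every interaction; "none" enforces pure replay.
        "record_mode": "all" if os.environ.get("RECORD_NEW_FIXTURES") else "none",
        # Redaction at recording time: drop known-sensitive headers.
        "filter_headers": ["authorization", "x-api-key"],
    }
```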
### Discipline
- Always review the diff before commit. New cassettes can carry sensitive content; redaction at recording time strips known-sensitive headers, but custom fields may leak through (see the lint sketch after this list).
- Don’t bulk-regenerate. Per-test cassettes scope blast radius; bulk regeneration loses the change record.
- Commit the cassette in the same change as the prompt edit. Detecting drift later is harder once the cassette and the code disagree.
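A minimal sketch of what the redaction lint could check once cassettes exist, assuming vcrpy's YAML cassette layout; the sensitive-header set and fixture path are illustrative:

```python
# tools/quality/check_cassette_redaction.py -- fail the push when a
# cassette still carries a known-sensitive request header.
import pathlib
import sys

import yaml

SENSITIVE = {"authorization", "x-api-key", "cookie"}

def main() -> int:
    exit_code = 0
    for path in pathlib.Path("tests/platform/_fixtures/llm").rglob("*.yaml"):
        cassette = yaml.safe_load(path.read_text()) or {}
        for interaction in cassette.get("interactions", []):
            headers = interaction.get("request", {}).get("headers", {})
            leaked = SENSITIVE & {name.lower() for name in headers}
            if leaked:
                print(f"{path}: unredacted headers {sorted(leaked)}")
                exit_code = 1
    return exit_code

if __name__ == "__main__":
    sys.exit(main())
```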
## Drift detection (nightly)
`.github/workflows/nightly-live-drift.yml` runs with `LIVE_PROVIDER=1` against the integration suite, bypasses VCR replay, and compares live output against recorded cassettes.
### Triggers
- Schedule: nightly at 02:00 UTC
- `workflow_dispatch` (manual)
- On merge to main
### Drift comparison
- Text outputs: `difflib.SequenceMatcher.ratio()`; threshold 0.85 by default.
- Tool-call outputs: structural exact match (tool name + argument values); soft similarity on tool calls would mask behavioral regressions.
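A minimal sketch of both checks, assuming `recorded` and `live` are the response payloads pulled from the cassette and the live call (names and shapes are illustrative):

```python
import difflib

def text_drifted(recorded: str, live: str, threshold: float = 0.85) -> bool:
    """True when the live text falls below the similarity threshold."""
    return difflib.SequenceMatcher(None, recorded, live).ratio() < threshold

def tool_call_drifted(recorded: dict, live: dict) -> bool:
    """Structural exact match: tool name and argument values must agree."""
    return any(recorded.get(k) != live.get(k) for k in ("name", "arguments"))
```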
### Per-test threshold override
```python
@pytest.mark.llm_drift_threshold(0.7)
async def test_creative_summarization(): ...
```

For tests whose outputs are inherently more variable (creative writing, summarization with multiple valid phrasings).
### Drift signal
- Sentry alert (per ADR-036)
- Nightly summary issue posted via the GitHub API listing all drifted tests + similarity scores
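A sketch of the summary-issue post, assuming the standard GitHub REST issues endpoint; the repo slug, token source, and entry shape are placeholders:

```python
import os

import requests

def post_drift_summary(drifted: list[dict]) -> None:
    """drifted entries: {"test", "threshold", "similarity", "run_url"}."""
    body = "\n".join(
        f"- `{d['test']}`: similarity {d['similarity']:.2f} "
        f"(threshold {d['threshold']}), [run]({d['run_url']})"
        for d in drifted
    )
    requests.post(
        "https://api.github.com/repos/OWNER/REPO/issues",  # placeholder slug
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        json={"title": "Nightly LLM drift summary", "body": body},
        timeout=30,
    ).raise_for_status()
```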
## Drift triage
Day-after-drift workflow:
1. Open the nightly summary issue. Each entry lists test name, threshold, observed similarity, and a link to the workflow run.
2. For each drifted test, classify:
   - Provider regression — same prompt, materially different output, similarity well below threshold. Action: file a ticket; consider provider fallback per ADR-035 D7.
   - Cassette stale — prompt or model intentionally changed but the cassette wasn’t regenerated. Action: regenerate the cassette per the recording session above.
   - False positive at current threshold — output is semantically equivalent but textually different (e.g., different phrasing). Action: adjust the per-test threshold via the marker; document the calibration in this runbook’s calibration log below.
3. Close the issue when all entries are addressed.
## Threshold calibration
The alpha default is 0.85 similarity for text outputs and structural exact match for tool calls. Calibration triggers:
- First false positive blocking a legitimate change → adjust the per-test threshold via `@pytest.mark.llm_drift_threshold(...)`. Track in the log below with date + test + new threshold + reason.
- First regression slipping through → tune the global default downward. Track in the log.
### Calibration log
| Date | Test | Old | New | Reason |
|---|---|---|---|---|
(Empty at alpha until the first calibration fires.)
## Cost controls
- VCR-replay tests are zero-cost (the bulk of integration tests).
- The live-drift workflow runs against a dedicated test account on the `test-live` GitHub Environment per ADR-062 D1.
- Daily cap on the test account; a concurrency group serializes drift runs (one at a time).
- Provider rate-limit defaults inherit from ADR-035 D5.
Monitor live-drift cost:
```sql
-- Last 7 days, live-drift only (provider keys live in the test-live Environment)
SELECT date_trunc('day', recorded_at) AS day,
       sum(cost_estimate) AS daily_cost
FROM core.llm_usage
WHERE workspace_id IS NULL  -- platform-tier (no workspace)
  AND model IN ('claude-opus-4-7', 'gpt-5-4', 'gemini-3-1-pro')  -- live providers, not Fake
  AND recorded_at > now() - interval '7 days'
GROUP BY day
ORDER BY day DESC;
```

## FakeLLMProvider usage
For unit + contract tests that don’t need real-provider semantics:
```python
import pytest

from spectral.core.llm.testing import FakeLLMProvider

@pytest.fixture
def fake_llm() -> FakeLLMProvider:
    return FakeLLMProvider(canned={
        ("scoring", "PLATFORM"): "score=8.5; reasoning=...",
        ("reasoning", "OPERATIONS"): "diagnosis=...",
    })

async def test_scan_diagnose(fake_llm): ...
```

Returns canned responses keyed by `(purpose, content_class)`. Implements the `LLMProvider` protocol from `spectral.core.llm.protocols`. Implementation lands with the first consumer per the deferred-protocol pattern.
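Because the implementation is deferred, the following is only a sketch of what the fake could look like; the `complete` method name and signature are assumptions about the eventual `LLMProvider` protocol:

```python
# Illustrative only; the real protocol in spectral.core.llm.protocols may differ.
class FakeLLMProvider:
    def __init__(self, canned: dict[tuple[str, str], str]) -> None:
        self._canned = canned
        self.calls: list[tuple[str, str]] = []  # recorded lookups, for assertions

    async def complete(self, purpose: str, content_class: str, prompt: str) -> str:
        # Key on (purpose, content_class); an un-canned pair raises KeyError
        # so the test fails loudly instead of silently passing.
        self.calls.append((purpose, content_class))
        return self._canned[(purpose, content_class)]
```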
## See also
- ADR-061 — LLM testing strategy
- ADR-062 — CI secrets handling (`test-live` Environment)
- ADR-035 — Rate-limit + cost controls
- ADR-036 — Sentry alert substrate
- Codex testing
- Codex LLM platform
- `docs/runbooks/testing.md` — broader test posture
- `docs/runbooks/ci-secrets.md` — Environment scoping