LLM testing runbook
Operational procedures for the LLM test posture: cassette recording sessions, replay discipline, threshold calibration, and cost controls.
System reference: Codex how-to/testing.mdx · ADR-061 · ADR-062.
Automated tiers
| Tier | Substrate | Marker | Cost |
|---|---|---|---|
| Unit + contract | FakeLLMProvider (mock) | unit / contract | $0 |
| Integration | pytest-recording cassettes (replay) | integration | $0 |
Tests that require a live provider credential carry the live_llm marker and run manually from an operator machine. CI does not hold provider credentials for cassette refresh.
Recording sessions
When a prompt intentionally changes, regenerate the affected cassettes.
Local recording
    # Pass `--record-mode=once` to capture missing cassettes (first record),
    # or `--record-mode=all` to re-record everything the selected tests touch.
    uv run pytest tests/core/llm/test_my_agent.py --record-mode=once

    # Inspect the diff before commit
    git diff tests/core/llm/cassettes/

The redaction lint (tools/quality/check_cassette_redaction.py) lands with the first cassette commit and runs as a pre-push gate per ADR-061 D8; before that point it has no inputs and is a dead lint.
Discipline
- Always review the diff before commit. New cassettes can carry sensitive content; redaction at recording time strips known-sensitive headers, but custom fields may leak through.
- Don’t bulk-regenerate. Per-test cassettes scope blast radius; bulk regeneration loses the change record.
- Commit the cassette in the same change as the prompt edit. Detecting drift later is harder once the cassette and the code disagree.
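As an aid to the manual diff review, a small scan for obviously sensitive strings can be run over a cassette's text before commit. The sketch below is illustrative only: it is not the check_cassette_redaction.py lint, and the pattern set and the `find_leaks` name are assumptions chosen for the example.

```python
import re

# Hypothetical patterns; the real lint's rule set is not described in this runbook.
SENSITIVE_PATTERNS = {
    "bearer token": re.compile(r"(?i)bearer\s+[a-z0-9._\-]{16,}"),
    "api key": re.compile(r"\bsk-[A-Za-z0-9_\-]{16,}"),
}


def find_leaks(cassette_text: str) -> list[str]:
    """Return the names of the patterns that match somewhere in the text."""
    return [
        name
        for name, pattern in SENSITIVE_PATTERNS.items()
        if pattern.search(cassette_text)
    ]
```

An empty result does not prove a cassette is clean; custom fields still need a human read of the diff.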
Drift comparison
- Text outputs: difflib.SequenceMatcher.ratio(); threshold 0.85 default
- Tool-call outputs: structural exact-match (tool name + argument values); soft similarity on tool calls would mask behavioral regressions
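The two comparison rules can be sketched as follows. This is a minimal illustration, not the project's helper: the function names, the `>=` comparison, and the representation of a tool call as a dict with `name` and `arguments` keys are all assumptions.

```python
from difflib import SequenceMatcher

DEFAULT_TEXT_THRESHOLD = 0.85  # alpha default stated in this runbook


def text_within_threshold(
    recorded: str, fresh: str, threshold: float = DEFAULT_TEXT_THRESHOLD
) -> bool:
    """Return True when the similarity ratio meets the threshold."""
    ratio = SequenceMatcher(None, recorded, fresh).ratio()
    return ratio >= threshold


def tool_calls_match(recorded: list[dict], fresh: list[dict]) -> bool:
    """Structural exact-match: same tool names and argument values, in order."""
    if len(recorded) != len(fresh):
        return False
    return all(
        r["name"] == f["name"] and r["arguments"] == f["arguments"]
        for r, f in zip(recorded, fresh)
    )
```

Tool calls get no similarity score at all here: a single changed argument value fails the comparison, which is the point of the exact-match rule.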
Per-test threshold override
    @pytest.mark.llm_drift_threshold(0.7)
    async def test_creative_summarization(): ...

For tests where outputs are inherently more variable (creative-writing, multi-valid-phrasing summarization). This helper is available to tests that need tolerant assertions; cassette refresh itself is reviewed by inspecting git diff.
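One way a helper could pick up the per-test override is to read the marker from the test item and fall back to the default. The sketch below assumes that approach; `resolve_drift_threshold` is a hypothetical name, and the range check is an added safeguard rather than documented behavior. It relies only on pytest's `get_closest_marker(name)` method on test items.

```python
DEFAULT_DRIFT_THRESHOLD = 0.85  # alpha default stated in this runbook


def resolve_drift_threshold(node, default: float = DEFAULT_DRIFT_THRESHOLD) -> float:
    """Read llm_drift_threshold from a pytest item, else return the default.

    `node` is any object exposing pytest's `get_closest_marker(name)`,
    for example `request.node` inside a fixture.
    """
    marker = node.get_closest_marker("llm_drift_threshold")
    if marker is None or not marker.args:
        return default
    threshold = float(marker.args[0])
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"llm_drift_threshold must be in [0, 1], got {threshold}")
    return threshold
```

Inside a conftest fixture, `request.node` could be passed as `node`, so a test decorated with the marker gets its own threshold and every other test gets the default.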
Drift triage
When a manual recording session changes cassettes, classify the diff before commit:
- Expected fixture drift — prompt, template, fixture, or model intentionally changed. Commit the cassette diff with that code/doc change.
- Provider regression — same prompt produces materially different behavior. File a ticket; consider provider fallback per ADR-035 D7.
- False positive at current threshold — output is semantically equivalent but textually different. Adjust the per-test threshold via marker and document the calibration below.
Threshold calibration
The alpha defaults are 0.85 for text similarity and structural exact-match for tool calls. Calibration triggers:
- First false-positive blocking a legitimate change → adjust per-test threshold via @pytest.mark.llm_drift_threshold(...). Track in the log below with date + test + new threshold + reason.
- First regression slips through → tighten the global default (raise the ratio threshold, since a higher ratio is stricter). Track in the log.
Calibration log
| Date | Test | Old | New | Reason |
|---|---|---|---|---|
(Empty at alpha until the first calibration fires.)
Cost controls
- VCR-replay tests are zero-cost (the bulk of integration tests).
- Manual live recording is scoped to the affected cassette surface and uses local/operator credentials.
- Provider rate-limit defaults inherit from ADR-035 D5.
Monitor manual live-recording cost when using metered provider keys:
    -- Last 7 days, platform-tier LLM usage
    SELECT date_trunc('day', recorded_at) AS day,
           sum(cost_estimate) AS daily_cost
    FROM core.llm_usage
    WHERE domain_id IS NULL                       -- platform-tier (no domain)
      AND model IN ('grok-4.3')                   -- live providers, not Fake
      AND recorded_at > now() - interval '7 days'
    GROUP BY day
    ORDER BY day DESC;

FakeLLMProvider usage
For unit + contract tests that don’t need real-provider semantics:
    import pytest

    from tests.core.llm import FakeLLMProvider


    @pytest.fixture
    def fake_llm() -> FakeLLMProvider:
        # Canned responses are keyed by (purpose, content_class).
        return FakeLLMProvider(canned={
            ("scoring", "PLATFORM"): "score=8.5; reasoning=...",
            ("reasoning", "OPERATIONS"): "diagnosis=...",
        })


    async def test_scan_diagnose(fake_llm): ...

FakeLLMProvider returns canned responses keyed by (purpose, content_class) and implements the LLMProvider protocol from spectral.core.llm.protocols. The implementation lands with the first consumer per the deferred-protocol pattern.
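Because the real implementation is deferred, the following is only a sketch of what a canned-response fake keyed by (purpose, content_class) might look like. The class name `FakeLLMProviderSketch`, the `complete` method, its parameters, and the `calls` list are assumptions; the actual LLMProvider protocol may differ.

```python
import asyncio


class FakeLLMProviderSketch:
    """Illustrative stand-in; not the project's FakeLLMProvider."""

    def __init__(self, canned: dict[tuple[str, str], str]) -> None:
        self._canned = dict(canned)
        # Every call is recorded so tests can assert on what was requested.
        self.calls: list[tuple[str, str, str]] = []

    async def complete(self, purpose: str, content_class: str, prompt: str) -> str:
        # `complete` and its parameters are assumed names, not the real protocol.
        self.calls.append((purpose, content_class, prompt))
        try:
            return self._canned[(purpose, content_class)]
        except KeyError:
            raise LookupError(
                f"no canned response for ({purpose!r}, {content_class!r})"
            ) from None


fake = FakeLLMProviderSketch(
    canned={("scoring", "PLATFORM"): "score=8.5; reasoning=..."}
)
reply = asyncio.run(fake.complete("scoring", "PLATFORM", "Rate this scan"))
```

Failing loudly on a missing key, rather than returning a default string, keeps a test from passing on a response it never configured.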
OpenAI subscription (Codex) live smoke — builder-run, EXPERIMENTAL (SPEC-719)
The org-BYO OpenAI subscription OAuth path conforms to OpenAI’s Codex request-shape (the
chatgpt.com/backend-api/codex/responses Responses endpoint + the Codex instructions envelope +
the ChatGPT-Account-ID/originator headers). It cannot be exercised in CI or from any hosted
deployment: OpenAI Cloudflare-challenges datacenter/headless origins. So AC2’s “a live smoke
confirms a real subscription call succeeds” is a manual, builder-run check from a non-datacenter
origin (your own machine), not an automated gate. OpenAI’s supported production path stays the
org-BYO API key.
Procedure (needs a real ChatGPT Plus/Pro subscription):
- Acquire a bundle locally: uv run python tools/dev/oauth_login.py --provider openai. This completes the auth.openai.com PKCE login in your browser and prints a bundle JSON carrying account_id.
- Boot the local stack (tools/dev/start.sh --full) and seed a customer org/domain (tools/dev/customer_seed.py, per the cold-start runbook).
- In the customer dashboard /org-settings, choose provider OpenAI, model (e.g. gpt-5-codex), credential type Subscription (OAuth bundle), paste the bundle, accept the ToS, save.
- Run an authoring turn for that org’s world and confirm it routes to OpenAI on the subscription credential (a real Codex response, not a degrade).
- If the call is rejected with an auth/instructions error, OpenAI’s Codex base instructions have drifted. Set SPECTRAL_OPENAI_CODEX_INSTRUCTIONS to the current Codex base-instructions prompt (the single absorption point; no code change) and retry.
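Before pasting the bundle into the dashboard, it can help to confirm that the printed text is well-formed. The sketch below is a hypothetical local check, not part of oauth_login.py: it assumes only what the procedure states, that the bundle is JSON carrying account_id, and further assumes that account_id is a top-level string field.

```python
import json


def check_bundle(raw: str) -> str:
    """Parse a pasted bundle and return its account_id, or raise ValueError."""
    try:
        bundle = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"bundle is not valid JSON: {exc}") from exc
    if not isinstance(bundle, dict):
        raise ValueError("bundle must be a JSON object")
    # Assumption: account_id sits at the top level of the bundle.
    account_id = bundle.get("account_id")
    if not isinstance(account_id, str) or not account_id:
        raise ValueError("bundle is missing a non-empty account_id")
    return account_id
```

A truncated copy-paste fails here with a clear message rather than surfacing later as an auth error during the authoring turn.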
Record the outcome on the SPEC-719 issue. A 403/Cloudflare challenge from a datacenter origin is the expected limitation, not a regression.
See also
- ADR-061 — LLM testing strategy
- ADR-062 — CI secrets handling
- ADR-035 — Rate-limit + cost controls
- ADR-036 — Sentry alert substrate
- Codex testing
- Codex LLM platform
- docs/runbooks/testing.md: broader test posture
- docs/runbooks/ci-secrets.md: environment scoping