
LLM testing runbook

Operational procedures for the LLM test posture: cassette recording sessions, replay discipline, threshold calibration, and cost controls.

System reference: Codex how-to/testing.mdx · ADR-061 · ADR-062.


Automated tiers

Tier             Substrate                            Marker           Cost
Unit + contract  FakeLLMProvider (mock)               unit / contract  $0
Integration      pytest-recording cassettes (replay)  integration      $0

Tests that require a live provider credential use live_llm and run manually from an operator machine. CI does not own provider credentials for cassette refresh.


Recording sessions

When a prompt intentionally changes, regenerate the affected cassettes.

Local recording

# Pass `--record-mode=once` to capture missing cassettes (first record),
# or `--record-mode=all` to re-record everything the selected tests touch.
uv run pytest tests/core/llm/test_my_agent.py --record-mode=once
# Inspect the diff before commit
git diff tests/core/llm/cassettes/

The redaction lint (tools/quality/check_cassette_redaction.py) lands with the first cassette commit and runs as a pre-push gate per ADR-061 D8; before that point it has no inputs and is a dead lint.

Discipline

  • Always review the diff before commit. New cassettes can carry sensitive content; redaction at recording time strips known-sensitive headers, but custom fields may leak through.
  • Don’t bulk-regenerate. Per-test cassettes scope blast radius; bulk regeneration loses the change record.
  • Commit the cassette in the same change as the prompt edit. Detecting drift later is harder once the cassette and the code disagree.
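
The diff review in the first bullet can be partly mechanized. The following is a minimal sketch of a scan for token-shaped strings in cassette text. The patterns and the name `find_suspect_lines` are hypothetical and chosen for illustration; they are not the rules implemented by tools/quality/check_cassette_redaction.py, and a clean result does not replace reading the diff.

```python
import re

# Hypothetical patterns for illustration only; not the rules of the real lint.
SUSPECT_PATTERNS = {
    "bearer token": re.compile(r"Bearer\s+[A-Za-z0-9._\-]{16,}"),
    "api key": re.compile(r"\bsk-[A-Za-z0-9_\-]{16,}"),
}


def find_suspect_lines(cassette_text: str) -> list[tuple[int, str]]:
    """Return (line_number, label) for every line matching a suspect pattern."""
    hits = []
    for lineno, line in enumerate(cassette_text.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits


# A cassette fragment where the Authorization header was not redacted.
sample = (
    "headers:\n"
    "  Authorization:\n"
    "  - Bearer abcdefghijklmnop1234\n"
    "  X-Trace:\n"
    "  - ok\n"
)
hits = find_suspect_lines(sample)
```

Custom fields that carry secrets under other shapes would need their own patterns, which is why the manual diff review stays in the procedure.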

Drift comparison

  • Text outputs: difflib.SequenceMatcher.ratio(); threshold 0.85 default
  • Tool-call outputs: structural exact-match (tool name + argument values); soft similarity on tool calls would mask behavioral regressions
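
The two comparison modes can be sketched as follows. This is an illustration under stated assumptions, not the project's helper: the function names are hypothetical, and tool calls are assumed to be dicts with `name` and `arguments` keys.

```python
from difflib import SequenceMatcher
from typing import Any

DEFAULT_TEXT_THRESHOLD = 0.85  # the runbook's default for text outputs


def text_within_threshold(
    recorded: str, actual: str, threshold: float = DEFAULT_TEXT_THRESHOLD
) -> bool:
    """Soft comparison: pass when the similarity ratio meets the threshold."""
    return SequenceMatcher(None, recorded, actual).ratio() >= threshold


def tool_calls_match(
    recorded: list[dict[str, Any]], actual: list[dict[str, Any]]
) -> bool:
    """Structural exact-match: same tool names and argument values, in order.

    The {"name": ..., "arguments": ...} shape is an assumption for this sketch.
    """
    if len(recorded) != len(actual):
        return False
    return all(
        r["name"] == a["name"] and r["arguments"] == a["arguments"]
        for r, a in zip(recorded, actual)
    )
```

The text check tolerates rephrasing up to the threshold, while the tool-call check fails on any difference in tool name or argument value, which is the behavior the second bullet asks for.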

Per-test threshold override

import pytest


@pytest.mark.llm_drift_threshold(0.7)
async def test_creative_summarization():
    ...

Use the override for tests whose outputs are inherently more variable (creative writing, summarization with several valid phrasings). The marker is available to any test that needs tolerant assertions; cassette refresh itself is still reviewed by inspecting git diff.
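
One way a comparison helper could pick up the marker value is sketched here. The function name `resolve_drift_threshold` is hypothetical; the only pytest API it relies on is `get_closest_marker`, which pytest items provide, and the marker's `args` tuple. The stand-in objects at the end exist only so the sketch runs without pytest installed.

```python
from types import SimpleNamespace

DEFAULT_DRIFT_THRESHOLD = 0.85  # the runbook's default for text outputs


def resolve_drift_threshold(node, default: float = DEFAULT_DRIFT_THRESHOLD) -> float:
    """Return the per-test threshold from the marker, or the default.

    `node` is expected to be a pytest item (for example `request.node`
    inside a fixture); only its `get_closest_marker` method is used.
    """
    marker = node.get_closest_marker("llm_drift_threshold")
    if marker is None or not marker.args:
        return default
    threshold = float(marker.args[0])
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"llm_drift_threshold must be in [0, 1], got {threshold}")
    return threshold


# Stand-ins for pytest items, used here instead of a real test run.
marked = SimpleNamespace(
    get_closest_marker=lambda name: SimpleNamespace(args=(0.7,), kwargs={})
)
unmarked = SimpleNamespace(get_closest_marker=lambda name: None)
```

In a real suite, a fixture would typically pass `request.node` to a helper like this, and the custom marker would be registered in the pytest configuration so that strict marker checks do not reject it.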


Drift triage

When a manual recording session changes cassettes, classify the diff before commit:

  1. Expected fixture drift — prompt, template, fixture, or model intentionally changed. Commit the cassette diff with that code/doc change.
  2. Provider regression — same prompt produces materially different behavior. File a ticket; consider provider fallback per ADR-035 D7.
  3. False positive at current threshold — output is semantically equivalent but textually different. Adjust the per-test threshold via marker and document the calibration below.

Threshold calibration

The alpha defaults are a 0.85 ratio threshold for text outputs and structural exact-match for tool calls. Calibration triggers:

  • First false-positive blocking a legitimate change → adjust per-test threshold via @pytest.mark.llm_drift_threshold(...). Track in the log below with date + test + new threshold + reason.
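  • First regression slips through → tighten the global default by raising the ratio threshold (a higher value is stricter; lower values, like the 0.7 override, are more tolerant). Track in the log.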

Calibration log

Date  Test  Old  New  Reason

(Empty at alpha until the first calibration fires.)


Cost controls

  • VCR-replay tests are zero-cost (the bulk of integration tests).
  • Manual live recording is scoped to the affected cassette surface and uses local/operator credentials.
  • Provider rate-limit defaults inherit from ADR-035 D5.

Monitor manual live-recording cost when using metered provider keys:

-- Last 7 days, platform-tier LLM usage
SELECT date_trunc('day', recorded_at) AS day, sum(cost_estimate) AS daily_cost
FROM core.llm_usage
WHERE domain_id IS NULL -- platform-tier (no domain)
AND model IN ('grok-4.3') -- live providers, not Fake
AND recorded_at > now() - interval '7 days'
GROUP BY day
ORDER BY day DESC;

FakeLLMProvider usage

For unit + contract tests that don’t need real-provider semantics:

import pytest

from tests.core.llm import FakeLLMProvider


@pytest.fixture
def fake_llm() -> FakeLLMProvider:
    # Canned responses are keyed by (purpose, content_class).
    return FakeLLMProvider(canned={
        ("scoring", "PLATFORM"): "score=8.5; reasoning=...",
        ("reasoning", "OPERATIONS"): "diagnosis=...",
    })


async def test_scan_diagnose(fake_llm):
    ...

FakeLLMProvider returns canned responses keyed by (purpose, content_class) and implements the LLMProvider protocol from spectral.core.llm.protocols. The implementation lands with the first consumer, per the deferred-protocol pattern.
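
Since the implementation is deferred, the following is only a sketch of how a keyed canned-response fake could behave. The class name `FakeLLMProviderSketch`, the `complete` method, and its signature are assumptions made for illustration; the real LLMProvider protocol in spectral.core.llm.protocols may look different.

```python
import asyncio


class FakeLLMProviderSketch:
    """Illustrative stand-in; not the class in tests.core.llm."""

    def __init__(self, canned: dict[tuple[str, str], str]) -> None:
        self._canned = dict(canned)
        # Every lookup key is recorded so tests can assert on call routing.
        self.calls: list[tuple[str, str]] = []

    async def complete(self, purpose: str, content_class: str, prompt: str) -> str:
        # Hypothetical method name and signature; the real protocol may differ.
        key = (purpose, content_class)
        self.calls.append(key)
        try:
            return self._canned[key]
        except KeyError:
            raise LookupError(f"no canned response for {key!r}") from None


fake = FakeLLMProviderSketch({("scoring", "PLATFORM"): "score=8.5; reasoning=..."})
result = asyncio.run(fake.complete("scoring", "PLATFORM", "rate this"))
```

Raising on an unknown key, rather than returning a default string, makes a test fail loudly when code under test calls the provider with a purpose the fixture did not anticipate.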


OpenAI subscription (Codex) live smoke — builder-run, EXPERIMENTAL (SPEC-719)

The org-BYO OpenAI subscription OAuth path conforms to OpenAI’s Codex request shape: the chatgpt.com/backend-api/codex/responses Responses endpoint, the Codex instructions envelope, and the ChatGPT-Account-ID and originator headers. It cannot be exercised in CI or from any hosted deployment, because OpenAI serves Cloudflare challenges to datacenter and headless origins. AC2’s “a live smoke confirms a real subscription call succeeds” is therefore a manual, builder-run check from a non-datacenter origin (your own machine), not an automated gate. OpenAI’s supported production path stays the org-BYO API key.

Procedure (needs a real ChatGPT Plus/Pro subscription):

  1. Acquire a bundle locally: uv run python tools/dev/oauth_login.py --provider openai → completes the auth.openai.com PKCE login in your browser and prints a bundle JSON carrying account_id.
  2. Boot the local stack (tools/dev/start.sh --full) and seed a customer org/domain (tools/dev/customer_seed.py, per the cold-start runbook).
  3. In the customer dashboard /org-settings, choose provider OpenAI, model (e.g. gpt-5-codex), credential type Subscription (OAuth bundle), paste the bundle, accept the ToS, save.
  4. Run an authoring turn for that org’s world and confirm it routes to OpenAI on the subscription credential (a real Codex response, not a degrade).
  5. If the call is rejected with an auth/instructions error, OpenAI’s Codex base instructions have drifted — set SPECTRAL_OPENAI_CODEX_INSTRUCTIONS to the current Codex base-instructions prompt (the single absorption point; no code change) and retry.

Record the outcome on the SPEC-719 issue. A 403/Cloudflare challenge from a datacenter origin is the expected limitation, not a regression.
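
Before pasting the bundle in step 3, a quick local sanity check can catch a truncated copy. The sketch below assumes only what step 1 states, that the bundle is JSON carrying account_id, and further assumes account_id is a top-level string field; the function name is hypothetical and no other bundle fields are inspected.

```python
import json


def bundle_account_id(bundle_json: str) -> str:
    """Parse a bundle string and return its non-empty account_id.

    Assumes account_id is a top-level string field of the bundle object.
    """
    try:
        bundle = json.loads(bundle_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"bundle is not valid JSON: {exc}") from exc
    if not isinstance(bundle, dict):
        raise ValueError("bundle must be a JSON object")
    account_id = bundle.get("account_id")
    if not isinstance(account_id, str) or not account_id:
        raise ValueError("bundle has no account_id")
    return account_id
```

A ValueError here points at a copy or login problem on the operator machine, which is cheaper to find than a failed authoring turn in step 4.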


See also

  • ADR-061 — LLM testing strategy
  • ADR-062 — CI secrets handling
  • ADR-035 — Rate-limit + cost controls
  • ADR-036 — Sentry alert substrate
  • Codex testing
  • Codex LLM platform
  • docs/runbooks/testing.md — broader test posture
  • docs/runbooks/ci-secrets.md — Environment scoping