
LLM testing runbook

Operational procedures for the three-tier LLM test posture — recording sessions, drift triage, threshold calibration, cost controls.

System reference: Codex how-to/testing.mdx · ADR-061 · ADR-062.


Three tiers

| Tier | Substrate | Marker | Cost |
|---|---|---|---|
| Unit + contract | FakeLLMProvider (mock) | unit / contract | $0 |
| Integration | pytest-recording cassettes (replay) | integration | $0 |
| Live drift detection | Real provider via nightly workflow | live_drift | Bounded by daily-capped test-account |
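
Assuming the markers in the table above are registered in the project's pytest configuration, a tier can be selected locally by marker (the invocation style mirrors the recording command later in this runbook):

```bash
# Zero-cost tiers: mocks and cassette replay
uv run pytest -m "unit or contract"
uv run pytest -m integration

# The live tier normally runs only in the nightly workflow; a manual run
# needs real provider credentials scoped to the test-account
LIVE_PROVIDER=1 uv run pytest -m live_drift
```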

Recording sessions

When a prompt intentionally changes, regenerate the affected cassettes.

Local recording

```bash
# Set the env flag, then run the affected test (or test file/dir)
RECORD_NEW_FIXTURES=1 uv run pytest tests/platform/integration/test_scan_diagnose.py -m integration

# Inspect the diff before committing
git diff tests/platform/_fixtures/llm/
```

The redaction lint (tools/quality/check_cassette_redaction.py) lands with the first cassette commit and runs as a pre-push gate per ADR-061 D8; before that point it has no cassettes to check and is effectively a no-op.
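
The script's contents aren't specified here; a minimal sketch of what such a check could look like, assuming VCR-style YAML cassettes and an illustrative header deny-list (both assumptions, not the actual script):

```python
#!/usr/bin/env python3
"""Hypothetical sketch of tools/quality/check_cassette_redaction.py."""
import pathlib
import sys

import yaml  # assumes cassettes are VCR-style YAML files

# Illustrative deny-list; the real list is defined by ADR-061 D8.
SENSITIVE_HEADERS = {"authorization", "x-api-key", "cookie", "set-cookie"}


def main() -> int:
    failures = []
    for cassette in pathlib.Path("tests/platform/_fixtures/llm").rglob("*.yaml"):
        data = yaml.safe_load(cassette.read_text()) or {}
        for interaction in data.get("interactions", []):
            headers = interaction.get("request", {}).get("headers", {})
            leaked = SENSITIVE_HEADERS & {h.lower() for h in headers}
            if leaked:
                failures.append(f"{cassette}: unredacted header(s) {sorted(leaked)}")
    for line in failures:
        print(line, file=sys.stderr)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```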

Discipline

  • Always review the diff before committing. New cassettes can carry sensitive content; redaction at recording time strips known-sensitive headers (see the vcr_config sketch after this list), but custom fields may leak through.
  • Don’t bulk-regenerate. Per-test cassettes scope blast radius; bulk regeneration loses the change record.
  • Commit the cassette in the same change as the prompt edit. Detecting drift later is harder once the cassette and the code disagree.
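
pytest-recording forwards VCR options through a vcr_config fixture, which is where recording-time header redaction typically lives. A minimal sketch, assuming the header names below (the project's actual deny-list may differ):

```python
import pytest


@pytest.fixture(scope="session")
def vcr_config():
    # vcrpy options passed through by pytest-recording; filter_headers
    # rewrites these headers in cassettes at recording time.
    return {
        "filter_headers": [
            ("authorization", "REDACTED"),
            ("x-api-key", "REDACTED"),
        ],
    }
```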

Drift detection (nightly)

.github/workflows/nightly-live-drift.yml runs the integration suite with LIVE_PROVIDER=1, bypassing VCR replay and comparing live output against the recorded cassettes.

Triggers

  • Schedule: nightly at 02:00 UTC
  • workflow_dispatch (manual)
  • On merge to main

Drift comparison

  • Text outputs: difflib.SequenceMatcher.ratio() with a default threshold of 0.85
  • Tool-call outputs: structural exact match on tool name + argument values; soft similarity on tool calls would mask behavioral regressions (a sketch of both checks follows this list)
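
A sketch of how the two comparisons could be implemented; function names are illustrative, and the workflow's actual comparison code isn't shown here:

```python
import difflib

DEFAULT_TEXT_THRESHOLD = 0.85


def text_drifted(recorded: str, live: str,
                 threshold: float = DEFAULT_TEXT_THRESHOLD) -> bool:
    # Fuzzy comparison for free-text outputs.
    similarity = difflib.SequenceMatcher(None, recorded, live).ratio()
    return similarity < threshold


def tool_calls_drifted(recorded: list[dict], live: list[dict]) -> bool:
    # Structural exact match: any difference in tool name or
    # argument values counts as drift.
    return recorded != live
```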

Per-test threshold override

```python
import pytest


@pytest.mark.llm_drift_threshold(0.7)
async def test_creative_summarization():
    ...
```

Use this for tests whose outputs are inherently more variable (creative writing, summarization with multiple valid phrasings).
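
On the collection side, the per-test threshold could be resolved with pytest's marker API; a sketch, where the fixture name is illustrative:

```python
import pytest

DEFAULT_TEXT_THRESHOLD = 0.85


@pytest.fixture
def drift_threshold(request) -> float:
    # Nearest llm_drift_threshold marker wins; otherwise the global default.
    marker = request.node.get_closest_marker("llm_drift_threshold")
    return marker.args[0] if marker else DEFAULT_TEXT_THRESHOLD
```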

Drift signal

  • Sentry alert (per ADR-036)
  • Nightly summary issue posted via the GitHub API listing all drifted tests + similarity scores (a posting sketch follows this list)
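
A sketch of the issue-posting step against the standard GitHub REST endpoint (POST /repos/{owner}/{repo}/issues); the function name and drift-record shape are assumptions:

```python
import json
import os
import urllib.request


def post_summary_issue(repo: str, drifted: list[dict]) -> None:
    # Each record is assumed to carry test name, similarity, and threshold.
    body = "\n".join(
        f"- {d['test']}: similarity {d['similarity']:.2f} (threshold {d['threshold']})"
        for d in drifted
    )
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/issues",
        data=json.dumps({"title": "Nightly LLM drift summary", "body": body}).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)
```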

Drift triage

Day-after-drift workflow:

  1. Open the nightly summary issue. Each entry lists test name, threshold, observed similarity, link to the workflow run.
  2. For each drifted test, classify:
    • Provider regression — same prompt, materially different output, similarity well below threshold. Action: file a ticket; consider provider fallback per ADR-035 D7.
    • Cassette stale — prompt or model intentionally changed but cassette wasn’t regenerated. Action: regenerate the cassette per the recording session above.
    • False positive at current threshold — output is semantically equivalent but textually different (e.g., different phrasing). Action: adjust per-test threshold via marker; document the calibration in this runbook’s calibration log below.
  3. Close the issue when all entries are addressed.

Threshold calibration

The alpha default is 0.85 for text and structural exact match for tool calls. Calibration triggers:

  • First false-positive blocking a legitimate change → adjust per-test threshold via @pytest.mark.llm_drift_threshold(...). Track in the log below with date + test + new threshold + reason.
  • First regression slips through → tune the global default downward. Track in the log.

Calibration log

| Date | Test | Old | New | Reason |
|---|---|---|---|---|

(Empty at alpha until the first calibration fires.)


Cost controls

  • VCR-replay tests (the bulk of integration tests) are zero-cost.
  • The live-drift workflow runs against a dedicated test-account on the test-live GitHub Environment per ADR-062 D1.
  • The test-account has a daily spend cap, and a concurrency group serializes drift runs (one at a time).
  • Provider rate-limit defaults inherit from ADR-035 D5.

Monitor live-drift cost:

```sql
-- Last 7 days, live-drift only (provider keys live in the test-live Environment)
SELECT date_trunc('day', recorded_at) AS day,
       sum(cost_estimate) AS daily_cost
FROM core.llm_usage
WHERE workspace_id IS NULL  -- platform-tier (no workspace)
  AND model IN ('claude-opus-4-7', 'gpt-5-4', 'gemini-3-1-pro')  -- live providers, not Fake
  AND recorded_at > now() - interval '7 days'
GROUP BY day
ORDER BY day DESC;
```

FakeLLMProvider usage

For unit + contract tests that don’t need real-provider semantics:

```python
import pytest

from spectral.core.llm.testing import FakeLLMProvider


@pytest.fixture
def fake_llm() -> FakeLLMProvider:
    return FakeLLMProvider(canned={
        ("scoring", "PLATFORM"): "score=8.5; reasoning=...",
        ("reasoning", "OPERATIONS"): "diagnosis=...",
    })


async def test_scan_diagnose(fake_llm):
    ...
```

FakeLLMProvider returns canned responses keyed by (purpose, content_class) and implements the LLMProvider protocol from spectral.core.llm.protocols. The implementation lands with the first consumer per the deferred-protocol pattern.
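
Since the class is deferred, the following is only a sketch of how the canned lookup might behave; the complete method and its signature are assumptions, not the protocol's actual shape:

```python
class FakeLLMProvider:
    """Hypothetical sketch; the real class lands with its first consumer."""

    def __init__(self, canned: dict[tuple[str, str], str]):
        # Responses keyed by (purpose, content_class).
        self._canned = canned

    async def complete(self, purpose: str, content_class: str, prompt: str) -> str:
        # Method name and signature are illustrative only.
        try:
            return self._canned[(purpose, content_class)]
        except KeyError:
            raise KeyError(
                f"No canned response for ({purpose!r}, {content_class!r})"
            ) from None
```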


See also

  • ADR-061 — LLM testing strategy
  • ADR-062 — CI secrets handling (test-live Environment)
  • ADR-035 — Rate-limit + cost controls
  • ADR-036 — Sentry alert substrate
  • Codex testing
  • Codex LLM platform
  • docs/runbooks/testing.md — broader test posture
  • docs/runbooks/ci-secrets.md — Environment scoping