
LLM testing runbook

Operational procedures for the three-tier LLM test posture — recording sessions, drift triage, threshold calibration, cost controls.

System reference: Codex how-to/testing.mdx · ADR-061 · ADR-062.


Three tiers

| Tier | Substrate | Marker | Cost |
|---|---|---|---|
| Unit + contract | FakeLLMProvider (mock) | unit / contract | $0 |
| Integration | pytest-recording cassettes (replay) | integration | $0 |
| Live drift detection | Real provider via nightly workflow | live_drift | Bounded by daily-capped test-account |
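
Assuming the markers in the table above are registered in the project's pytest configuration, a tier can be selected locally by marker (the invocation style mirrors the recording command later in this runbook):

```bash
# Zero-cost tiers: mocks and cassette replay
uv run pytest -m "unit or contract"
uv run pytest -m integration

# The live tier normally runs only in the nightly workflow; a manual run
# needs real provider credentials scoped to the test-account
LIVE_PROVIDER=1 uv run pytest -m live_drift
```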

Recording sessions

When a prompt intentionally changes, regenerate the affected cassettes.

Local recording

```bash
# Set the env flag, then run the affected test (or test file/dir)
RECORD_NEW_FIXTURES=1 uv run pytest tests/platform/integration/test_scan_diagnose.py -m integration

# Inspect the diff before committing
git diff tests/platform/_fixtures/llm/
```

The redaction lint (tools/quality/check_cassette_redaction.py) lands with the first cassette commit and runs as a pre-push gate per ADR-061 D8; before that point it has no cassettes to check and is effectively a no-op.
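
The script's contents aren't specified here; a minimal sketch of what such a check could look like, assuming VCR-style YAML cassettes and an illustrative header deny-list (both assumptions, not the actual script):

```python
#!/usr/bin/env python3
"""Hypothetical sketch of tools/quality/check_cassette_redaction.py."""
import pathlib
import sys

import yaml  # assumes cassettes are VCR-style YAML files

# Illustrative deny-list; the real list is defined by ADR-061 D8.
SENSITIVE_HEADERS = {"authorization", "x-api-key", "cookie", "set-cookie"}


def main() -> int:
    failures = []
    for cassette in pathlib.Path("tests/platform/_fixtures/llm").rglob("*.yaml"):
        data = yaml.safe_load(cassette.read_text()) or {}
        for interaction in data.get("interactions", []):
            headers = interaction.get("request", {}).get("headers", {})
            leaked = SENSITIVE_HEADERS & {h.lower() for h in headers}
            if leaked:
                failures.append(f"{cassette}: unredacted header(s) {sorted(leaked)}")
    for line in failures:
        print(line, file=sys.stderr)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```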

Discipline

  • Always review the diff before committing. New cassettes can carry sensitive content; redaction at recording time strips known-sensitive headers (see the vcr_config sketch after this list), but custom fields may leak through.
  • Don’t bulk-regenerate. Per-test cassettes scope blast radius; bulk regeneration loses the change record.
  • Commit the cassette in the same change as the prompt edit. Detecting drift later is harder once the cassette and the code disagree.
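
pytest-recording forwards VCR options through a vcr_config fixture, which is where recording-time header redaction typically lives. A minimal sketch, assuming the header names below (the project's actual deny-list may differ):

```python
import pytest


@pytest.fixture(scope="session")
def vcr_config():
    # vcrpy options passed through by pytest-recording; filter_headers
    # rewrites these headers in cassettes at recording time.
    return {
        "filter_headers": [
            ("authorization", "REDACTED"),
            ("x-api-key", "REDACTED"),
        ],
    }
```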

Drift detection (nightly)

.github/workflows/nightly-live-drift.yml runs the integration suite with LIVE_PROVIDER=1, bypassing VCR replay and comparing live output against the recorded cassettes.

Triggers

  • Schedule: nightly at 02:00 UTC
  • workflow_dispatch (manual)
  • On merge to main

Drift comparison

  • Text outputs: difflib.SequenceMatcher.ratio() with a default threshold of 0.85
  • Tool-call outputs: structural exact match on tool name + argument values; soft similarity on tool calls would mask behavioral regressions (a sketch of both checks follows this list)
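
A sketch of how the two comparisons could be implemented; function names are illustrative, and the workflow's actual comparison code isn't shown here:

```python
import difflib

DEFAULT_TEXT_THRESHOLD = 0.85


def text_drifted(recorded: str, live: str,
                 threshold: float = DEFAULT_TEXT_THRESHOLD) -> bool:
    # Fuzzy comparison for free-text outputs.
    similarity = difflib.SequenceMatcher(None, recorded, live).ratio()
    return similarity < threshold


def tool_calls_drifted(recorded: list[dict], live: list[dict]) -> bool:
    # Structural exact match: any difference in tool name or
    # argument values counts as drift.
    return recorded != live
```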

Per-test threshold override

```python
import pytest


@pytest.mark.llm_drift_threshold(0.7)
async def test_creative_summarization():
    ...
```

Use this for tests whose outputs are inherently more variable (creative writing, summarization with multiple valid phrasings).
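
On the collection side, the per-test threshold could be resolved with pytest's marker API; a sketch, where the fixture name is illustrative:

```python
import pytest

DEFAULT_TEXT_THRESHOLD = 0.85


@pytest.fixture
def drift_threshold(request) -> float:
    # Nearest llm_drift_threshold marker wins; otherwise the global default.
    marker = request.node.get_closest_marker("llm_drift_threshold")
    return marker.args[0] if marker else DEFAULT_TEXT_THRESHOLD
```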

Drift signal

  • Sentry alert (per ADR-036)
  • Nightly summary issue posted via the GitHub API listing all drifted tests + similarity scores (a posting sketch follows this list)
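
A sketch of the issue-posting step against the standard GitHub REST endpoint (POST /repos/{owner}/{repo}/issues); the function name and drift-record shape are assumptions:

```python
import json
import os
import urllib.request


def post_summary_issue(repo: str, drifted: list[dict]) -> None:
    # Each record is assumed to carry test name, similarity, and threshold.
    body = "\n".join(
        f"- {d['test']}: similarity {d['similarity']:.2f} (threshold {d['threshold']})"
        for d in drifted
    )
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/issues",
        data=json.dumps({"title": "Nightly LLM drift summary", "body": body}).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)
```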

Drift triage

Day-after-drift workflow:

  1. Open the nightly summary issue. Each entry lists test name, threshold, observed similarity, link to the workflow run.
  2. For each drifted test, classify:
    • Provider regression — same prompt, materially different output, similarity well below threshold. Action: file a ticket; consider provider fallback per ADR-035 D7.
    • Cassette stale — prompt or model intentionally changed but cassette wasn’t regenerated. Action: regenerate the cassette per the recording session above.
    • False positive at current threshold — output is semantically equivalent but textually different (e.g., different phrasing). Action: adjust per-test threshold via marker; document the calibration in this runbook’s calibration log below.
  3. Close the issue when all entries are addressed.

Threshold calibration

The alpha default is 0.85 for text and structural exact match for tool calls. Calibration triggers:

  • First false-positive blocking a legitimate change → adjust per-test threshold via @pytest.mark.llm_drift_threshold(...). Track in the log below with date + test + new threshold + reason.
  • First regression slips through → tune the global default downward. Track in the log.

Calibration log

| Date | Test | Old | New | Reason |
|---|---|---|---|---|

(Empty at alpha until the first calibration fires.)


Cost controls

  • VCR-replay tests (the bulk of integration tests) are zero-cost.
  • The live-drift workflow runs against a dedicated test-account on the test-live GitHub Environment per ADR-062 D1.
  • The test-account has a daily spend cap, and a concurrency group serializes drift runs (one at a time).
  • Provider rate-limit defaults inherit from ADR-035 D5.

Monitor live-drift cost:

```sql
-- Last 7 days, live-drift only (provider keys live in the test-live Environment)
SELECT date_trunc('day', recorded_at) AS day,
       sum(cost_estimate) AS daily_cost
FROM core.llm_usage
WHERE workspace_id IS NULL  -- platform-tier (no workspace)
  AND model IN ('claude-opus-4-7', 'gpt-5-4', 'gemini-3-1-pro')  -- live providers, not Fake
  AND recorded_at > now() - interval '7 days'
GROUP BY day
ORDER BY day DESC;
```

FakeLLMProvider usage

For unit + contract tests that don’t need real-provider semantics:

```python
import pytest

from spectral.core.llm.testing import FakeLLMProvider


@pytest.fixture
def fake_llm() -> FakeLLMProvider:
    return FakeLLMProvider(canned={
        ("scoring", "PLATFORM"): "score=8.5; reasoning=...",
        ("reasoning", "OPERATIONS"): "diagnosis=...",
    })


async def test_scan_diagnose(fake_llm):
    ...
```

FakeLLMProvider returns canned responses keyed by (purpose, content_class) and implements the LLMProvider protocol from spectral.core.llm.protocols. The implementation lands with the first consumer per the deferred-protocol pattern.
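
Since the class is deferred, the following is only a sketch of how the canned lookup might behave; the complete method and its signature are assumptions, not the protocol's actual shape:

```python
class FakeLLMProvider:
    """Hypothetical sketch; the real class lands with its first consumer."""

    def __init__(self, canned: dict[tuple[str, str], str]):
        # Responses keyed by (purpose, content_class).
        self._canned = canned

    async def complete(self, purpose: str, content_class: str, prompt: str) -> str:
        # Method name and signature are illustrative only.
        try:
            return self._canned[(purpose, content_class)]
        except KeyError:
            raise KeyError(
                f"No canned response for ({purpose!r}, {content_class!r})"
            ) from None
```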


See also

  • ADR-061 — LLM testing strategy
  • ADR-062 — CI secrets handling (test-live Environment)
  • ADR-035 — Rate-limit + cost controls
  • ADR-036 — Sentry alert substrate
  • Codex testing
  • Codex LLM platform
  • docs/runbooks/testing.md — broader test posture
  • docs/runbooks/ci-secrets.md — Environment scoping