Testing
Testing strategy for spectral.core, spectral.worlds, and spectral.platform. Every
agent-written test follows this page. The short form is in CONTRIBUTING.md → Testing in the repo root; this is the full reference for what test lives where, what to assert, and what CI enforces.
Testing principles
Non-negotiable rules. Every other decision on this page descends from these.
- Test behavior, not implementation. Tests assert what the system does, not how it does it internally. A refactor that preserves behavior should not break tests.
- Every test must have a reason to exist. If you can’t articulate what production failure a test catches, delete it. “It increases coverage” is not a reason.
- Test at the right boundary. Unit for pure logic; property-based for invariants; integration for cross-collaborator interactions; contract for external surfaces; E2E for user-visible paths. Don’t test internal wiring — test inputs and outputs at meaningful boundaries.
- Failing tests must be actionable. A test failure names what broke and where without requiring the developer to debug the test itself.
- Tests are production code. Same quality bar: clear names, no duplication, no dead tests. A flaky test is a bug.
- Integration tests hit real infrastructure. No DB mocks. Past incident: mocked tests passed while the prod migration broke. See CONTRIBUTING.md → Testing in the repo root.
- Fewer focused tests beat many shallow ones. Ten scenarios that matter beat a hundred that don’t.
Strategy per layer
Every test file declares one layer via pytestmark. The root conftest.py enforces the marker — unmarked tests block the suite.
| Layer | Primary strategy | Notes |
|---|---|---|
| Domain (all three contexts) | Unit + property-based for invariants | State machines, statistical uniqueness, bootstrap-CI properties, blend arithmetic — via Hypothesis. |
| Application | Mock at service-abstraction boundaries; fakes preferred over mocks | Only across protocols declared in application/shared/protocols/; never internal collaborators. |
| Infrastructure | Integration tests against real Supabase + pgvector + Ollama | No DB or LLM mocking at this layer. Past incident makes this non-negotiable. |
| API / workers | Contract tests against OpenAPI; E2E on critical paths | E2E covers the operator-walkthrough + first-customer walkthrough paths. |
| Test agent (apps/test-agents, tax_prep) | Scan-pipeline E2E backbone — runs in CI | See Test Agents — Pluggable OTEL emitter for the parameterization matrix. |
| Agent workflows (Spectral Agent, Operations Agent, WorldAgent) | Unit-test tools; integration-test conversation flows with recorded LLM fixtures; live LLM gated nightly | Live LLM exercise runs nightly only — never in PR CI. |
| Dual-occupant flows (Ops Agent + human operator on the same workflow) | Integration test: API action + UI action, assert both see updates | Real DB + real Realtime + scripted Ops Agent. New category with the Operations app. |
Mocks vs fakes. Hand-written in-memory implementations survive refactors; MagicMock does
not. Mock only across shared/protocols/; do not mock internal collaborators within the same
layer.
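A minimal sketch of the preferred shape. The RuleRepository protocol and its methods below are hypothetical, introduced only to illustrate the pattern; real protocols live in application/shared/protocols/:

```python
# Illustrative only: RuleRepository and its methods are hypothetical.
from typing import Protocol


class RuleRepository(Protocol):
    def save(self, rule_id: str, payload: dict) -> None: ...
    def get(self, rule_id: str) -> dict | None: ...


class InMemoryRuleRepository:
    """Hand-written fake: real behavior, no I/O; survives refactors where MagicMock breaks."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}

    def save(self, rule_id: str, payload: dict) -> None:
        self._rows[rule_id] = payload

    def get(self, rule_id: str) -> dict | None:
        return self._rows.get(rule_id)
```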
Property-tested domain invariants. Rule + ChangeSet state machines, EvalSet statistical uniqueness, bootstrap-CI properties, conformity-gate self-consistency, CompositeScore blend arithmetic.
Test-agent role. Single tax_prep agent + pluggable OTEL emitter (instrumentation framework
× LLM-vendor span shape) is the canonical full-scan-pipeline assertion.
Coverage floors
CI-enforced from commit one. These are floors, not ceilings. A layer above its floor is not an invitation to stop writing tests; a layer below the floor blocks merge.
| Layer | Floor | Rationale |
|---|---|---|
| Domain | ≥ 90% | Pure business logic with no I/O. There is no reason not to cover it. Domain bugs are the most expensive to miss because they propagate into everything above. |
| Application | ≥ 80% | Orchestration logic. Slightly lower floor because some branches exist purely to translate domain results into API-layer concerns, and those are already covered by API-layer tests. |
| Infrastructure | ≥ 60% | Adapters to external systems. Exhaustive coverage is uneconomic (much of the work is already done by the upstream library), but the contract-facing edges and error paths must be covered. |
Coverage is measured per package via pytest-cov and reported to CI. A PR that drops any package
below its floor fails the coverage job before tests finish.
Test layer markers
| Marker | What it tests | Typical latency | Network | Runs in CI |
|---|---|---|---|---|
| unit | One subject, collaborators stubbed or faked | < 1s | None | PR, merge, nightly |
| contract | Consumer-facing agreement remains stable (OpenAPI, event payloads in <context>.contracts.events.*, OHS Protocols in <context>.contracts.protocols.*, spectral.core substrate types) | < 1s | None | PR, merge, nightly |
| integration | Two+ subjects interact correctly against real infra (DB / pgvector / cassette LLM) | < 5s | Local | PR, merge, nightly |
| e2e | Walkthrough paths end-to-end | 5–30s | Local | Merge, nightly |
| live_drift | Nightly LLM live-drift detection — bypasses VCR cassettes; hits real providers; compares to recorded outputs | varies | External | Nightly + merge only |
```python
import pytest

pytestmark = pytest.mark.unit  # or contract, integration, e2e, live_drift
```

The root tests/conftest.py rejects any file missing one of the four primary markers (unit, contract, integration, e2e); live_drift is the marker the nightly workflow filters on.
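A sketch of what that enforcement hook can look like; the actual tests/conftest.py is authoritative and may differ:

```python
# Sketch of a collection hook that rejects unmarked test files; details assumed.
import pytest

PRIMARY_MARKERS = {"unit", "contract", "integration", "e2e"}


def pytest_collection_modifyitems(config, items):
    unmarked = [
        item.nodeid
        for item in items
        if not PRIMARY_MARKERS.intersection(m.name for m in item.iter_markers())
    ]
    if unmarked:
        raise pytest.UsageError(
            "Tests missing a layer marker (unit/contract/integration/e2e):\n"
            + "\n".join(unmarked)
        )
```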
Three-tier LLM test posture
ADR-061 defines three tiers:
- Unit / contract — FakeLLMProvider (in spectral.core.llm.testing) implements the LLMProvider protocol; deterministic; zero external calls.
- Integration — pytest-recording per-test cassettes at tests/<context>/_fixtures/llm/<test-id>.yaml. Replay is byte-perfect deterministic.
- Live drift detection — .github/workflows/nightly-live-drift.yml runs LIVE_PROVIDER=1 against the integration suite, bypasses VCR replay, and compares outputs against recorded cassettes via a similarity threshold (0.85 text / structural exact-match for tool calls; per-test override).
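A tier-1 sketch. FakeLLMProvider's constructor and complete() signature shown here are assumptions, as is the summarize_rule() helper; the real surface lives in spectral.core.llm.testing:

```python
# Tier-1 sketch: inject the fake into the unit under test, no network involved.
# FakeLLMProvider's constructor/method names and summarize_rule() are assumptions.
import pytest

from spectral.core.llm.testing import FakeLLMProvider

pytestmark = pytest.mark.unit


def summarize_rule(provider, rule_text: str) -> str:
    # Stand-in for an application-layer helper that depends on the LLMProvider protocol.
    return provider.complete(system="Summarize the rule.", user=rule_text).strip()


def test_summarize_rule_returns_provider_text() -> None:
    provider = FakeLLMProvider(responses=["  canned summary  "])
    assert summarize_rule(provider, "some rule text") == "canned summary"
```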
Cassette recording sessions: RECORD_NEW_FIXTURES=1 uv run pytest <path> -m integration. Always
review the cassette diff before commit; redaction at recording time strips known-sensitive
headers, but custom fields can leak through. The tools/quality/check_cassette_redaction.py
lint blocks Authorization: Bearer ... patterns; it lands with the first cassette commit per
ADR-061 D8 (dead lint until then). Detailed playbook
in docs/runbooks/llm-testing.md.
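A tier-2 sketch of a cassette-backed test. pytest-recording's stock @pytest.mark.vcr marker is real, but the llm_client fixture, its complete() surface, and the project's cassette-path wiring are assumptions:

```python
# Integration-tier sketch: records on a deliberate recording run,
# then replays the cassette deterministically in every later run.
import pytest

pytestmark = pytest.mark.integration


@pytest.mark.vcr
def test_diagnosis_prompt_roundtrip(llm_client):
    # llm_client and its complete() signature are assumed for illustration.
    response = llm_client.complete(system="Diagnose this trace.", user="trace excerpt")
    assert response  # cassette replay makes this deterministic in CI
```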
Mock-first PR CI; live secrets gated to non-PR triggers
ADR-062 sets the policy:
- Default PR CI: unit + contract + integration with FakeLLMProvider + cassettes; no external service calls; mock-first by default.
- Live-secret runs are gated to push-to-main, schedule, or workflow_dispatch. Fork PRs never trigger live-secret workflows. pull_request_target is not used.
- GitHub Environments scope secrets: staging, production, test-live (the latter holds the LLM provider keys for the nightly drift workflow). See docs/runbooks/ci-secrets.md.
Bilateral contract tests
Events that flow between contexts are pinned by bilateral contract tests under tests/contracts/ (per ADR-065 D6 and ADR-066). This directory is the only place in the codebase exempt from the import discipline that prevents worlds and platform from importing each other (validator rule 6) — bilateral tests legitimately import both the producer’s typed payload (from <producer>.contracts.events.*) and the consumer’s local model (from the consuming flow).
The pattern, demonstrated by tests/contracts/test_failure_cluster_detected.py:
"""Bilateral contract test for platform.failure_cluster.detected."""from spectral.platform.contracts.events.failure_cluster_detected import ( FailureClusterDetectedPayload, FailureRef,)# Consumer-narrow local model — declares only the fields platform's curation# intake actually needs. In production, this lives with the consuming flow# (e.g. spectral.platform.curation.intake.failure_cluster_event).class FailureClusterEvent(BaseModel): model_config = ConfigDict(frozen=True) cluster_id: UUID snapshot_hash: str rule_id: UUID workspace_id: UUID severity: Literal["low", "medium", "high"]
def test_consumer_parses_producer_emit_shape() -> None: """Round-trip invariant between contexts per ADR-065 D4.""" producer = FailureClusterDetectedPayload(...) # producer-rich wire = producer.model_dump(mode="json") consumer = FailureClusterEvent.model_validate(wire) assert consumer.cluster_id == producer.cluster_id # Consumer-narrow: producer-only fields silently dropped, consumer # never depends on them.Two complementary tests per event between contexts:
- Round-trip — verify the consumer’s local <EventName>Event model parses the producer’s model_dump(mode="json") output. Catches structural mismatch at PR time.
- Schema-drift snapshot (per ADR-066, syrupy) — once syrupy is wired in as a dev-dep, snapshot the producer’s model_json_schema(). First run creates the baseline (pytest --snapshot-update); subsequent runs detect drift. Intentional changes update the snapshot in the same commit; the reviewer validates intent from the diff.
Snapshot first-run discipline: the first run of a new contract test creates the syrupy baseline. The author runs pytest --snapshot-update once and commits both the test file and the generated __snapshots__/ directory; subsequent runs verify against the committed baseline. Until syrupy lands, the round-trip test alone is the load-bearing check.
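Once syrupy lands, the drift-snapshot half could look roughly like this; the payload import follows the bilateral example above, and the snapshot fixture is syrupy's standard plugin fixture:

```python
# Drift-snapshot sketch: pins the producer's JSON schema with syrupy's snapshot fixture.
# First run: `pytest --snapshot-update` writes the baseline under __snapshots__/.
import pytest

from spectral.platform.contracts.events.failure_cluster_detected import (
    FailureClusterDetectedPayload,
)

pytestmark = pytest.mark.contract


def test_failure_cluster_payload_schema_has_not_drifted(snapshot) -> None:
    assert FailureClusterDetectedPayload.model_json_schema() == snapshot
```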
See Events and Protocols for the catalogs of existing events and Protocols. New events or Protocols land alongside their owning epic and the catalog page is updated by hand at the same time.
Lookup table: I’m changing X, run Y
| What you’re changing | Run these tests | Marker | Command |
|---|---|---|---|
| Domain entity (pydantic model, value object) | Construction, validation, serialization | unit | pytest -m unit |
| Entity state machine (lifecycle transitions) | Valid transitions + invalid-state rejection | unit | pytest -m unit |
| Domain invariant (uniqueness, monotonicity, blend arithmetic) | Property-based via Hypothesis | unit | pytest -m unit |
| spectral.core substrate type | Contract test pinning the type’s wire shape | contract | pytest -m contract |
| Producer-owned event payload (<context>.contracts.events.*) | Producer wire-shape test under tests/<context>/contracts/events/ + bilateral round-trip + drift snapshot under tests/contracts/ | contract | pytest -m contract |
| OHS Protocol (<context>.contracts.protocols.*) | Protocol-conformance test under tests/<context>/contracts/protocols/ (structural isinstance against a stub) | contract | pytest -m contract |
| Application use case | Use case orchestration with injected fakes | unit | pytest -m unit |
| Application protocol in application/shared/protocols/ | Protocol conformance (structural) | unit | pytest -m unit |
| Supabase migration / RLS policy | RLS isolation per role, constraint enforcement | integration | pytest -m integration |
| Infrastructure adapter (DB, LLM, notification) | Round-trip against real local service | integration | pytest -m integration |
| OTEL ingestion path | Trace parsing + sample derivation | unit + integration | pytest -m "unit or integration" |
| LLM prompt construction | Recorded-response replay | unit | pytest -m unit |
| Scan pipeline phase | Phase behavior with fakes for collaborators | unit | pytest -m unit |
| Full scan pipeline | Test-agent E2E backbone, diagonal emitter slice (3 cells) | e2e | pytest -m e2e |
| API endpoint / router | Request/response contract, auth, validation | contract + unit | pytest -m "contract or unit" |
| Agent conversation flow (Spectral / Ops / World) | Recorded fixture replay | integration | pytest -m integration |
| Dual-occupant flow (Ops Agent + human UI) | API action + UI action, both see updates | integration | pytest -m integration |
| Frontend component (dashboard / operations) | Playwright NL specs with mocked LLM | e2e | pytest -m e2e |
Property-based testing with Hypothesis
Invariants are the strongest form of test because they hold for all inputs in the described space, not just hand-picked examples. We use Hypothesis for every invariant we can express.
Required property-based coverage:
| Subject | Invariant |
|---|---|
| Rule status machine | Only declared transitions are reachable; every reachable state has a valid predecessor |
| ChangeSet status machine | Same structural guarantees; terminal states are absorbing |
| EvalSet sample generation | Statistical uniqueness — no two generated samples collide within a corpus above chance |
| Bootstrap CI | Coverage properties — computed intervals contain the population statistic at the declared confidence rate across many seeds |
| Conformity gate | Self-consistency — gate output is deterministic given inputs; gate never contradicts its own prior decision on the same inputs |
| CompositeScore | Blend arithmetic — weighted combinations respect monotonicity; bounds stay within [0, 1]; no silent NaN propagation |
Each of these lives as a @given(...) Hypothesis test under the owning package’s tests/unit/
tree. Shrinking failure cases is the point; a flaky Hypothesis test with an unshrunk
counter-example is a bug to investigate, not to retry.
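For illustration, the CompositeScore bounds invariant might be expressed like this; blend here is a stand-in defined inline for the sketch, not the real domain function:

```python
# Hypothesis sketch for the CompositeScore bounds invariant.
# `blend` is an illustrative stand-in; the real arithmetic lives in the domain layer.
import pytest
from hypothesis import given, strategies as st

pytestmark = pytest.mark.unit


def blend(scores: list[float], weights: list[float]) -> float:
    """Weighted average of component scores (stand-in for the real blend)."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total


@given(
    pairs=st.lists(
        st.tuples(
            st.floats(min_value=0.0, max_value=1.0),   # component score
            st.floats(min_value=0.1, max_value=10.0),  # positive weight
        ),
        min_size=1,
        max_size=5,
    )
)
def test_blend_stays_within_unit_interval(pairs) -> None:
    scores, weights = zip(*pairs)
    result = blend(list(scores), list(weights))
    assert 0.0 <= result <= 1.0
```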
The test-agent scan-pipeline backbone
apps/test-agents is the alpha home for a single subject agent — tax_prep — whose sole
purpose is to exercise the full scan pipeline end-to-end in CI. The prior plan of one test agent
per workflow shape was consolidated because ingestion diversity is better expressed at the
emitter level. The directory currently holds a scaffold; the working tax_prep agent + pluggable
OTEL emitter described here land under the test-substrate epic — see Test Agents
for the current-state framing.
Pluggable OTEL emitter. The tax_prep agent is parameterized over an OTEL emitter that
varies along two axes — instrumentation framework (LangChain / OpenLLMetry / Manual SDK) ×
LLM-vendor span shape (Anthropic / OpenAI / raw OTLP). The full coverage matrix and CI tier
policy are canonical in
Test Agents — Pluggable OTEL emitter.
The agent runs a deterministic tax-prep workflow (fixed inputs, fixed expected outputs) against each emitter cell. The scan pipeline ingests, calibrates, diagnoses, evaluates, optimizes, checks safety, and renders a verdict. Every phase has assertions on the intermediate state.
This is the one E2E path that the scan pipeline is obligated to pass on every merge. The per-push diagonal slice covers both axes; a cell failure localizes the ingestion bug.
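As a concrete illustration, the diagonal slice could be expressed as a parametrized E2E test. The run_tax_prep_scan fixture and the exact cell identifiers below are assumptions; the canonical matrix lives on the Test Agents page:

```python
# Sketch of the per-push diagonal slice: three cells that together touch both axes.
import pytest

pytestmark = pytest.mark.e2e

DIAGONAL_CELLS = [
    ("langchain", "anthropic"),
    ("openllmetry", "openai"),
    ("manual-sdk", "raw-otlp"),
]


@pytest.mark.parametrize(("framework", "span_shape"), DIAGONAL_CELLS)
def test_tax_prep_scan_renders_a_verdict(framework, span_shape, run_tax_prep_scan):
    # run_tax_prep_scan is an assumed fixture wrapping the deterministic workflow.
    result = run_tax_prep_scan(framework=framework, span_shape=span_shape)
    assert result.verdict is not None
```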
Agent workflow testing
Three agents (Spectral Agent, Operations Agent, WorldAgent) are each tested in three passes:
- Unit tests on tools. Agent tools are pure functions or thin wrappers over application services. Unit-test them directly. These are the first things to write when adding a tool.
- Integration tests on conversation flows with recorded LLM fixtures. Record once against a live LLM, replay deterministically in CI. See LLM fixture recording below. This is the default test for end-to-end conversation behavior.
- Live-LLM exercise gated nightly. The nightly job runs representative conversation flows against the real LLM provider and checks that the recorded fixtures still hold. Divergence from the fixture is a signal — either the provider drifted (re-record) or the prompt regressed (fix). Never run live LLM tests in PR CI: cost, flakiness, and provider-incident blast radius.
Dual-occupant flows (Ops Agent + human operator)
When the Operations app shipped, a new test category emerged: flows where the Ops Agent and a human operator occupy the same workflow concurrently. The critical invariant is that both occupants see the same updates.
Minimum coverage for every dual-occupant surface:
```python
import pytest

pytestmark = pytest.mark.integration


def test_ops_agent_sees_human_ui_action(ops_agent_session, human_ui_client, workflow_id):
    # 1. Human triggers a UI action (via a scripted dashboard client)
    human_ui_client.approve_rule_candidate(workflow_id, rule_id="...")

    # 2. Ops Agent reads the workflow state (via its own tools)
    state = ops_agent_session.read_workflow_state(workflow_id)

    # 3. Assert the agent's view reflects the human action
    assert state.pending_approvals == []
    assert state.last_decision.actor_id == human_ui_client.user_id


def test_human_sees_ops_agent_action(ops_agent_session, human_ui_client, workflow_id):
    # Same test, reversed roles.
    ops_agent_session.flag_rule_for_review(workflow_id, rule_id="...")
    ui_state = human_ui_client.get_workflow(workflow_id)
    assert any(r.status == "flagged" for r in ui_state.rules)
```

These run against the real Supabase project and exercise the real Realtime channel. They are slow (seconds, not milliseconds). Keep them few and sharp — one per meaningful dual-occupant surface.
Mutation testing
Mutation testing verifies that tests actually catch bugs, not just cover lines. We use cosmic-ray for this. It runs nightly, not per-PR.
Mutation scope (nightly):
| Module | Why it’s in scope |
|---|---|
| Verdict engine | The final decision layer. A silent regression here produces wrong approvals. |
| Conformity gate | Gates what reaches customers. An unnoticed weakness here lets non-conforming rules through. |
| CompositeScore arithmetic | Blend math. Off-by-one or sign errors in weights are both invisible and expensive. |
| Holdout selection | Determines what’s evaluated. Biased selection silently corrupts downstream metrics. |
A survivor is a mutant that the full test suite fails to kill — a bug was introduced and no test caught it. Surviving mutants are tracked as a small standing task; triage each survivor and either add a test that kills it, or document why the mutation is semantically equivalent (rare).
A target survival rate is not declared up-front — the goal is to drive survivors to zero on the scoped modules. Outside the scoped modules, mutation testing is not run in CI.
Conftest fixtures
The root tests/conftest.py and each package’s tests/conftest.py provide the standard fixtures.
Integration-test fixtures hit a dedicated test-Supabase instance whose schema-isolation lifecycle
is defined by ADR-045 — per-test rollback inside
a shared schema-isolated database, with migration parity to production.
Default names (may vary slightly by package — follow the conftest, not this list):
- supabase_db — raw connection with per-test rollback
- as_owner, as_member, as_operator — authenticated role contexts
- as_service_role — admin / backend access, bypasses RLS
- as_anon — unauthenticated access
- ollama_client — local Ollama client for embedding / small-model tests
- llm_replay — replay-mode LLM client, reads a recorded fixture path
RLS integration test pattern
Every workspace-scoped table needs integration tests validating the four RLS shapes:
```python
import pytest

pytestmark = pytest.mark.integration


class TestMyTableRLS:
    def test_member_sees_own_workspace(self, supabase_db):
        # 1. Create workspace + membership as postgres (bypasses RLS)
        # 2. Insert test data
        # 3. Switch to authenticated role with _set_role()
        # 4. Assert data is visible
        ...

    def test_cross_tenant_invisible(self, supabase_db):
        # 1. Create workspace + data for a DIFFERENT workspace
        # 2. Switch to your test user
        # 3. Assert the other workspace's data is NOT visible (zero rows — not "different rows")
        ...

    def test_anon_sees_nothing(self, supabase_db):
        # Switch to anon role, assert zero rows
        ...

    def test_service_role_sees_all(self, supabase_db):
        # Switch to service_role, assert all rows visible
        ...
```

Smoke-level RLS coverage is alpha-required. Adversarial RLS testing is a future item — see future considerations.
LLM fixture recording
Record once against a real provider, then replay deterministically in every subsequent run:
```python
from spectral.core.llm.testing.fixture import llm_recorder, llm_replay

# Record once (creates the fixture file; run against real LLM)
with llm_recorder("tests/fixtures/llm/my_agent/scenario.json", real_client) as client:
    result = call_llm(client, "system prompt", "user prompt")

# Replay in tests (no network, deterministic)
client = llm_replay("tests/fixtures/llm/my_agent/scenario.json")
result = call_llm(client, "system prompt", "user prompt")
```

Recording is a deliberate action — never automatic. A recorded fixture that drifts from the live provider is caught by the nightly live-LLM job.
Developer commands
```bash
# Fast loop (unit only)
uv run --all-packages pytest -m unit -q

# PR-equivalent CI pipeline
uv run --all-packages pytest -m "unit or integration or contract"

# Everything including E2E
uv run --all-packages pytest -v

# Coverage report (per-package)
uv run --all-packages pytest --cov --cov-report=term-missing

# Mutation testing (nightly-equivalent, expensive)
cosmic-ray init cosmic-ray.toml session.sqlite
cosmic-ray exec session.sqlite
cosmic-ray dump session.sqlite | cr-report
```

Or run the tiered pre-push script, which bundles fast checks into the same gate CI uses:

```bash
bash tools/dev/precheck.sh            # matches CI pre-push gate
bash tools/dev/precheck.sh --install  # install as pre-push hook
```

CI pipeline
| Trigger | Layers run | Extras | Timeout |
|---|---|---|---|
| Pull request | unit + contract + integration | Coverage floors enforced | 10 min |
| Push to main | + e2e (test-agent backbone, diagonal emitter slice) | Coverage floors enforced | 15 min |
| Nightly (03:00 UTC) | All layers | + full-matrix emitter coverage (all 7 cells), Live LLM conversation-flow replay, mutation testing on scoped modules | 60 min |
| Manual (workflow_dispatch) | Configurable | Configurable | 60 min |
Test results are reported via JUnit XML in the Actions job summary. Mutation survivors are filed as open items on the standing mutation-triage issue.
When a test is wrong
- If the test breaks during a refactor that preserved behavior: the test was coupled to implementation. Rewrite it to assert on behavior at the boundary, not internal wiring.
- If the test flakes: treat it as a bug. Root-cause it (timing, ordering, shared state, hidden I/O). Don’t @pytest.mark.flaky it.
- If the test is hard to understand: the test is the bug. Rewrite for clarity. Tests are production code.
- If you can’t articulate what failure the test catches: delete it.
Related reading
- CONTRIBUTING.md → Testing (repo root) — short form of the testing rules
- Epic Template & DoD — integration-test AC requirement per epic
- Architecture — three-context topology
- Future considerations — future hardening items (adversarial RLS, observability stack choice)
- docs/runbooks/testing.md — operational playbook for the test-Supabase instance, fixture-recording sessions, and CI gate troubleshooting
- docs/runbooks/llm-testing.md — cassette redaction + nightly drift workflow detail