Testing
Testing strategy for spectral.core, spectral.worlds, and spectral.platform. Every
agent-written test follows this page. The short form is in CONTRIBUTING.md → Testing in the repo root; this is the full reference for what test lives where, what to assert, and what CI enforces.
Testing principles
Non-negotiable rules. Every other decision on this page descends from these.
- Test behavior, not implementation. Tests assert what the system does, not how it does it internally. A refactor that preserves behavior should not break tests.
- Every test must have a reason to exist. If you can’t articulate what production failure a test catches, delete it. “It increases coverage” is not a reason.
- Test at the right boundary. Unit for pure logic; property-based for invariants; integration for cross-collaborator interactions; contract for external surfaces; E2E for user-visible paths. Don’t test internal wiring — test inputs and outputs at meaningful boundaries.
- Failing tests must be actionable. A test failure names what broke and where without requiring the developer to debug the test itself.
- Tests are production code. Same quality bar: clear names, no duplication, no dead tests. A flaky test is a bug.
- Integration tests hit real infrastructure. No DB mocks. Past incident: mocked tests passed while the prod migration broke. See CONTRIBUTING.md → Testing in the repo root.
- Fewer focused tests beat many shallow ones. Ten scenarios that matter beat a hundred that don’t.
Strategy per layer
Every test file declares one layer via pytestmark. The root conftest.py enforces the marker — unmarked tests block the suite.
| Layer | Primary strategy | Notes |
|---|---|---|
| Domain (all three contexts) | Unit + property-based for invariants | State machines, statistical uniqueness, bootstrap-CI properties, blend arithmetic — via Hypothesis. |
| Application | Mock at service-abstraction boundaries; fakes preferred over mocks | Only across protocols declared in application/shared/protocols/; never internal collaborators. |
| Infrastructure | Integration tests against real Supabase + pgvector + Ollama | No DB or LLM mocking at this layer. Past incident makes this non-negotiable. |
| API / workers | Contract tests against OpenAPI; E2E on critical paths | E2E covers the operator-walkthrough + first-customer walkthrough paths. |
| Test agent (apps/test-agents, tax_prep) | Scan-pipeline E2E backbone — runs in CI | See Test Agents — Pluggable OTEL emitter for the parameterization matrix. |
| Agent workflows (Spectral Agent, Operations Agent, WorldAgent) | Unit-test tools; integration-test conversation flows with recorded LLM fixtures; live LLM gated nightly | Live LLM exercise runs nightly only — never in PR CI. |
| Dual-occupant flows (Ops Agent + human operator on the same workflow) | Integration test: API action + UI action, assert both see updates | Real DB + real Realtime + scripted Ops Agent. New category with the Operations app. |
Mocks vs fakes. Hand-written in-memory implementations survive refactors; MagicMock does
not. Mock only across shared/protocols/; do not mock internal collaborators within the same
layer.
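A minimal sketch of the preferred shape. The RuleRepository protocol and its methods below are hypothetical, introduced only to illustrate the pattern; real protocols live in application/shared/protocols/:

```python
# Illustrative only: RuleRepository and its methods are hypothetical.
from typing import Protocol


class RuleRepository(Protocol):
    def save(self, rule_id: str, payload: dict) -> None: ...
    def get(self, rule_id: str) -> dict | None: ...


class InMemoryRuleRepository:
    """Hand-written fake: real behavior, no I/O; survives refactors where MagicMock breaks."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}

    def save(self, rule_id: str, payload: dict) -> None:
        self._rows[rule_id] = payload

    def get(self, rule_id: str) -> dict | None:
        return self._rows.get(rule_id)
```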
Property-tested domain invariants. Rule + ChangeSet state machines, EvalSet statistical uniqueness, bootstrap-CI properties, conformity-gate self-consistency, CompositeScore blend arithmetic.
Test-agent role. Single tax_prep agent + pluggable OTEL emitter (instrumentation framework
× LLM-vendor span shape) is the canonical full-scan-pipeline assertion.
Coverage floors
CI-enforced from commit one. These are floors, not ceilings. A layer above its floor is not an invitation to stop writing tests; a layer below the floor blocks merge.
| Layer | Floor | Rationale |
|---|---|---|
| Domain | ≥ 90% | Pure business logic with no I/O. There is no reason not to cover it. Domain bugs are the most expensive to miss because they propagate into everything above. |
| Application | ≥ 80% | Orchestration logic. Slightly lower floor because some branches exist purely to translate domain results into API-layer concerns, and those are already covered by API-layer tests. |
| Infrastructure | ≥ 60% | Adapters to external systems. Exhaustive coverage is uneconomic (much of the work is already done by the upstream library), but the contract-facing edges and error paths must be covered. |
Coverage is measured per package via pytest-cov and reported to CI. A PR that drops any package
below its floor fails the coverage job before tests finish.
Test layer markers
| Marker | What it tests | Typical latency | Network | Runs in CI |
|---|---|---|---|---|
| unit | One subject, collaborators stubbed or faked | < 1s | None | PR, merge, nightly |
| contract | Consumer-facing agreement remains stable (OpenAPI, event payloads in <context>.contracts.events.*, OHS Protocols in <context>.contracts.protocols.*, spectral.core substrate types) | < 1s | None | PR, merge, nightly |
| integration | Two+ subjects interact correctly against real infra (DB / pgvector / cassette LLM) | < 5s | Local | PR, merge, nightly |
| e2e | Walkthrough paths end-to-end | 5–30s | Local | Merge, nightly |
| live_drift | Nightly LLM live-drift detection — bypasses VCR cassettes; hits real providers; compares to recorded outputs | varies | External | Nightly + merge only |
```python
import pytest

pytestmark = pytest.mark.unit  # or contract, integration, e2e, live_drift
```

The root tests/conftest.py rejects any file missing one of the four primary markers (unit, contract, integration, e2e); live_drift is the marker the nightly workflow filters on.
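A sketch of what that enforcement hook can look like; the actual tests/conftest.py is authoritative and may differ:

```python
# Sketch of a collection hook that rejects unmarked test files; details assumed.
import pytest

PRIMARY_MARKERS = {"unit", "contract", "integration", "e2e"}


def pytest_collection_modifyitems(config, items):
    unmarked = [
        item.nodeid
        for item in items
        if not PRIMARY_MARKERS.intersection(m.name for m in item.iter_markers())
    ]
    if unmarked:
        raise pytest.UsageError(
            "Tests missing a layer marker (unit/contract/integration/e2e):\n"
            + "\n".join(unmarked)
        )
```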
Three-tier LLM test posture
ADR-061 defines three tiers:
- Unit / contract — FakeLLMProvider (in spectral.core.llm.testing) implements the LLMProvider protocol; deterministic; zero external calls.
- Integration — pytest-recording per-test cassettes at tests/<context>/_fixtures/llm/<test-id>.yaml. Replay is byte-perfect deterministic.
- Live drift detection — .github/workflows/nightly-live-drift.yml runs LIVE_PROVIDER=1 against the integration suite, bypasses VCR replay, and compares outputs against recorded cassettes via a similarity threshold (0.85 text / structural exact-match for tool calls; per-test override).
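A tier-1 sketch. FakeLLMProvider's constructor and complete() signature shown here are assumptions, as is the summarize_rule() helper; the real surface lives in spectral.core.llm.testing:

```python
# Tier-1 sketch: inject the fake into the unit under test, no network involved.
# FakeLLMProvider's constructor/method names and summarize_rule() are assumptions.
import pytest

from spectral.core.llm.testing import FakeLLMProvider

pytestmark = pytest.mark.unit


def summarize_rule(provider, rule_text: str) -> str:
    # Stand-in for an application-layer helper that depends on the LLMProvider protocol.
    return provider.complete(system="Summarize the rule.", user=rule_text).strip()


def test_summarize_rule_returns_provider_text() -> None:
    provider = FakeLLMProvider(responses=["  canned summary  "])
    assert summarize_rule(provider, "some rule text") == "canned summary"
```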
Cassette recording sessions: RECORD_NEW_FIXTURES=1 uv run pytest <path> -m integration. Always
review the cassette diff before commit; redaction at recording time strips known-sensitive
headers, but custom fields can leak through. The tools/quality/check_cassette_redaction.py
lint blocks Authorization: Bearer ... patterns; it lands with the first cassette commit per
ADR-061 D8 (dead lint until then). Detailed playbook
in docs/runbooks/llm-testing.md.
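A tier-2 sketch of a cassette-backed test. pytest-recording's stock @pytest.mark.vcr marker is real, but the llm_client fixture, its complete() surface, and the project's cassette-path wiring are assumptions:

```python
# Integration-tier sketch: records on a deliberate recording run,
# then replays the cassette deterministically in every later run.
import pytest

pytestmark = pytest.mark.integration


@pytest.mark.vcr
def test_diagnosis_prompt_roundtrip(llm_client):
    # llm_client and its complete() signature are assumed for illustration.
    response = llm_client.complete(system="Diagnose this trace.", user="trace excerpt")
    assert response  # cassette replay makes this deterministic in CI
```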
Mock-first PR CI; live secrets gated to non-PR triggers
ADR-062 sets the policy:
- Default PR CI: unit + contract + integration with FakeLLMProvider + cassettes; no external service calls; mock-first by default.
- Live-secret runs are gated to push-to-main, schedule, or workflow_dispatch. Fork PRs never trigger live-secret workflows. pull_request_target is not used.
- GitHub Environments scope secrets: staging, production, test-live (the latter holds the LLM provider keys for the nightly drift workflow). See docs/runbooks/ci-secrets.md.
Bilateral contract tests
Events that flow between contexts are pinned by bilateral contract tests under tests/contracts/ (per ADR-065 D6 and ADR-066). This directory is the only place in the codebase exempt from the import discipline that prevents worlds and platform from importing each other (validator rule 6) — bilateral tests legitimately import both the producer’s typed payload (from <producer>.contracts.events.*) and the consumer’s local model (from the consuming flow).
The pattern, demonstrated by tests/contracts/test_failure_cluster_detected.py:
"""Bilateral contract test for platform.failure_cluster.detected."""from spectral.platform.contracts.events.failure_cluster_detected import ( FailureClusterDetectedPayload, FailureRef,)# Consumer-narrow local model — declares only the fields platform's curation# intake actually needs. In production, this lives with the consuming flow# (e.g. spectral.platform.curation.intake.failure_cluster_event).class FailureClusterEvent(BaseModel): model_config = ConfigDict(frozen=True) cluster_id: UUID snapshot_hash: str rule_id: UUID workspace_id: UUID severity: Literal["low", "medium", "high"]
def test_consumer_parses_producer_emit_shape() -> None: """Round-trip invariant between contexts per ADR-065 D4.""" producer = FailureClusterDetectedPayload(...) # producer-rich wire = producer.model_dump(mode="json") consumer = FailureClusterEvent.model_validate(wire) assert consumer.cluster_id == producer.cluster_id # Consumer-narrow: producer-only fields silently dropped, consumer # never depends on them.Two complementary tests per event between contexts:
- Round-trip — verify the consumer’s local <EventName>Event model parses the producer’s model_dump(mode="json") output. Catches structural mismatch at PR time.
- Schema-drift snapshot (per ADR-066, syrupy) — once syrupy is wired in as a dev-dep, snapshot the producer’s model_json_schema(). First run creates the baseline (pytest --snapshot-update); subsequent runs detect drift. Intentional changes update the snapshot in the same commit; the reviewer validates intent from the diff.
Snapshot first-run discipline: the first run of a new contract test creates the syrupy baseline. The author runs pytest --snapshot-update once and commits both the test file and the generated __snapshots__/ directory; subsequent runs verify against the committed baseline. Until syrupy lands, the round-trip test alone is the load-bearing check.
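Once syrupy lands, the drift-snapshot half could look roughly like this; the payload import follows the bilateral example above, and the snapshot fixture is syrupy's standard plugin fixture:

```python
# Drift-snapshot sketch: pins the producer's JSON schema with syrupy's snapshot fixture.
# First run: `pytest --snapshot-update` writes the baseline under __snapshots__/.
import pytest

from spectral.platform.contracts.events.failure_cluster_detected import (
    FailureClusterDetectedPayload,
)

pytestmark = pytest.mark.contract


def test_failure_cluster_payload_schema_has_not_drifted(snapshot) -> None:
    assert FailureClusterDetectedPayload.model_json_schema() == snapshot
```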
See Events and Protocols for the catalogs of existing events and Protocols. New events or Protocols land alongside their owning epic and the catalog page is updated by hand at the same time.
Lookup table: I’m changing X, run Y
| What you’re changing | Run these tests | Marker | Command |
|---|---|---|---|
| Domain entity (pydantic model, value object) | Construction, validation, serialization | unit | pytest -m unit |
| Entity state machine (lifecycle transitions) | Valid transitions + invalid-state rejection | unit | pytest -m unit |
| Domain invariant (uniqueness, monotonicity, blend arithmetic) | Property-based via Hypothesis | unit | pytest -m unit |
| spectral.core substrate type | Contract test pinning the type’s wire shape | contract | pytest -m contract |
| Producer-owned event payload (<context>.contracts.events.*) | Producer wire-shape test under tests/<context>/contracts/events/ + bilateral round-trip + drift snapshot under tests/contracts/ | contract | pytest -m contract |
| OHS Protocol (<context>.contracts.protocols.*) | Protocol-conformance test under tests/<context>/contracts/protocols/ (structural isinstance against a stub) | contract | pytest -m contract |
| Application use case | Use case orchestration with injected fakes | unit | pytest -m unit |
| Application protocol in application/shared/protocols/ | Protocol conformance (structural) | unit | pytest -m unit |
| Supabase migration / RLS policy | RLS isolation per role, constraint enforcement | integration | pytest -m integration |
| Infrastructure adapter (DB, LLM, notification) | Round-trip against real local service | integration | pytest -m integration |
| OTEL ingestion path | Trace parsing + sample derivation | unit + integration | pytest -m "unit or integration" |
| LLM prompt construction | Recorded-response replay | unit | pytest -m unit |
| Scan pipeline phase | Phase behavior with fakes for collaborators | unit | pytest -m unit |
| Full scan pipeline | Test-agent E2E backbone, diagonal emitter slice (3 cells) | e2e | pytest -m e2e |
| API endpoint / router | Request/response contract, auth, validation | contract + unit | pytest -m "contract or unit" |
| Agent conversation flow (Spectral / Ops / World) | Recorded fixture replay | integration | pytest -m integration |
| Dual-occupant flow (Ops Agent + human UI) | API action + UI action, both see updates | integration | pytest -m integration |
| Frontend component (dashboard / operations) | Playwright NL specs with mocked LLM | e2e | pytest -m e2e |
Property-based testing with Hypothesis
Invariants are the strongest form of test because they hold for all inputs in the described space, not just hand-picked examples. We use Hypothesis for every invariant we can express.
Required property-based coverage:
| Subject | Invariant |
|---|---|
| Rule status machine | Only declared transitions are reachable; every reachable state has a valid predecessor |
| ChangeSet status machine | Same structural guarantees; terminal states are absorbing |
| EvalSet sample generation | Statistical uniqueness — no two generated samples collide within a corpus above chance |
| Bootstrap CI | Coverage properties — computed intervals contain the population statistic at the declared confidence rate across many seeds |
| Conformity gate | Self-consistency — gate output is deterministic given inputs; gate never contradicts its own prior decision on the same inputs |
| CompositeScore | Blend arithmetic — weighted combinations respect monotonicity; bounds stay within [0, 1]; no silent NaN propagation |
Each of these lives as a @given(...) Hypothesis test under the owning package’s tests/unit/
tree. Shrinking failure cases is the point; a flaky Hypothesis test with an unshrunk
counter-example is a bug to investigate, not to retry.
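For illustration, the CompositeScore bounds invariant might be expressed like this; blend here is a stand-in defined inline for the sketch, not the real domain function:

```python
# Hypothesis sketch for the CompositeScore bounds invariant.
# `blend` is an illustrative stand-in; the real arithmetic lives in the domain layer.
import pytest
from hypothesis import given, strategies as st

pytestmark = pytest.mark.unit


def blend(scores: list[float], weights: list[float]) -> float:
    """Weighted average of component scores (stand-in for the real blend)."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total


@given(
    pairs=st.lists(
        st.tuples(
            st.floats(min_value=0.0, max_value=1.0),   # component score
            st.floats(min_value=0.1, max_value=10.0),  # positive weight
        ),
        min_size=1,
        max_size=5,
    )
)
def test_blend_stays_within_unit_interval(pairs) -> None:
    scores, weights = zip(*pairs)
    result = blend(list(scores), list(weights))
    assert 0.0 <= result <= 1.0
```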
The test-agent scan-pipeline backbone
apps/test-agents is the alpha home for a single subject agent — tax_prep — whose sole
purpose is to exercise the full scan pipeline end-to-end in CI. The prior plan of one test agent
per workflow shape was consolidated because ingestion diversity is better expressed at the
emitter level. The directory currently holds a scaffold; the working tax_prep agent + pluggable
OTEL emitter described here land under the test-substrate epic — see Test Agents
for the current-state framing.
Pluggable OTEL emitter. The tax_prep agent is parameterized over an OTEL emitter that
varies along two axes — instrumentation framework (LangChain / OpenLLMetry / Manual SDK) ×
LLM-vendor span shape (Anthropic / OpenAI / raw OTLP). The full coverage matrix and CI tier
policy are canonical in
Test Agents — Pluggable OTEL emitter.
The agent runs a deterministic tax-prep workflow (fixed inputs, fixed expected outputs) against each emitter cell. The scan pipeline ingests, calibrates, diagnoses, evaluates, optimizes, checks safety, and renders a verdict. Every phase has assertions on the intermediate state.
This is the one E2E path that the scan pipeline is obligated to pass on every merge. The per-push diagonal slice covers both axes; a cell failure localizes the ingestion bug.
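As a concrete illustration, the diagonal slice could be expressed as a parametrized E2E test. The run_tax_prep_scan fixture and the exact cell identifiers below are assumptions; the canonical matrix lives on the Test Agents page:

```python
# Sketch of the per-push diagonal slice: three cells that together touch both axes.
import pytest

pytestmark = pytest.mark.e2e

DIAGONAL_CELLS = [
    ("langchain", "anthropic"),
    ("openllmetry", "openai"),
    ("manual-sdk", "raw-otlp"),
]


@pytest.mark.parametrize(("framework", "span_shape"), DIAGONAL_CELLS)
def test_tax_prep_scan_renders_a_verdict(framework, span_shape, run_tax_prep_scan):
    # run_tax_prep_scan is an assumed fixture wrapping the deterministic workflow.
    result = run_tax_prep_scan(framework=framework, span_shape=span_shape)
    assert result.verdict is not None
```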
Agent workflow testing
Three agents (Spectral Agent, Operations Agent, WorldAgent) are each tested in three passes:
- Unit tests on tools. Agent tools are pure functions or thin wrappers over application services. Unit-test them directly. These are the first things to write when adding a tool.
- Integration tests on conversation flows with recorded LLM fixtures. Record once against a live LLM, replay deterministically in CI. See LLM fixture recording below. This is the default test for end-to-end conversation behavior.
- Live-LLM exercise gated nightly. The nightly job runs representative conversation flows against the real LLM provider and checks that the recorded fixtures still hold. Divergence from the fixture is a signal — either the provider drifted (re-record) or the prompt regressed (fix). Never run live LLM tests in PR CI: cost, flakiness, and provider-incident blast radius.
Dual-occupant flows (Ops Agent + human operator)
When the Operations app shipped, a new test category emerged: flows where the Ops Agent and a human operator occupy the same workflow concurrently. The critical invariant is that both occupants see the same updates.
Minimum coverage for every dual-occupant surface:
```python
import pytest

pytestmark = pytest.mark.integration


def test_ops_agent_sees_human_ui_action(ops_agent_session, human_ui_client, workflow_id):
    # 1. Human triggers a UI action (via a scripted dashboard client)
    human_ui_client.approve_rule_candidate(workflow_id, rule_id="...")

    # 2. Ops Agent reads the workflow state (via its own tools)
    state = ops_agent_session.read_workflow_state(workflow_id)

    # 3. Assert the agent's view reflects the human action
    assert state.pending_approvals == []
    assert state.last_decision.actor_id == human_ui_client.user_id


def test_human_sees_ops_agent_action(ops_agent_session, human_ui_client, workflow_id):
    # Same test, reversed roles.
    ops_agent_session.flag_rule_for_review(workflow_id, rule_id="...")
    ui_state = human_ui_client.get_workflow(workflow_id)
    assert any(r.status == "flagged" for r in ui_state.rules)
```

These run against the real Supabase project and exercise the real Realtime channel. They are slow (seconds, not milliseconds). Keep them few and sharp — one per meaningful dual-occupant surface.
Mutation testing
Mutation testing verifies that tests actually catch bugs, not just cover lines. We use cosmic-ray for this. It runs nightly, not per-PR.
Mutation scope (nightly):
| Module | Why it’s in scope |
|---|---|
| Verdict engine | The final decision layer. A silent regression here produces wrong approvals. |
| Conformity gate | Gates what reaches customers. An unnoticed weakness here lets non-conforming rules through. |
| CompositeScore arithmetic | Blend math. Off-by-one or sign errors in weights are both invisible and expensive. |
| Holdout selection | Determines what’s evaluated. Biased selection silently corrupts downstream metrics. |
A survivor is a mutant that the full test suite fails to kill — a bug was introduced and no test caught it. Surviving mutants are tracked as a small standing task; triage each survivor and either add a test that kills it, or document why the mutation is semantically equivalent (rare).
A target survival rate is not declared up-front — the goal is to drive survivors to zero on the scoped modules. Outside the scoped modules, mutation testing is not run in CI.
Conftest fixtures
The root tests/conftest.py and each package’s tests/conftest.py provide the standard fixtures.
Integration-test fixtures hit a dedicated test-Supabase instance whose schema-isolation lifecycle
is defined by ADR-045 — per-test rollback inside
a shared schema-isolated database, with migration parity to production.
Default names (may vary slightly by package — follow the conftest, not this list):
- supabase_db — raw connection with per-test rollback
- as_owner, as_member, as_operator — authenticated role contexts
- as_service_role — admin / backend access, bypasses RLS
- as_anon — unauthenticated access
- ollama_client — local Ollama client for embedding / small-model tests
- llm_replay — replay-mode LLM client, reads a recorded fixture path
RLS integration test pattern
Every workspace-scoped table needs integration tests validating the four RLS shapes:
```python
import pytest

pytestmark = pytest.mark.integration


class TestMyTableRLS:
    def test_member_sees_own_workspace(self, supabase_db):
        # 1. Create workspace + membership as postgres (bypasses RLS)
        # 2. Insert test data
        # 3. Switch to authenticated role with _set_role()
        # 4. Assert data is visible
        ...

    def test_cross_tenant_invisible(self, supabase_db):
        # 1. Create workspace + data for a DIFFERENT workspace
        # 2. Switch to your test user
        # 3. Assert the other workspace's data is NOT visible (zero rows — not "different rows")
        ...

    def test_anon_sees_nothing(self, supabase_db):
        # Switch to anon role, assert zero rows
        ...

    def test_service_role_sees_all(self, supabase_db):
        # Switch to service_role, assert all rows visible
        ...
```

Smoke-level RLS coverage is alpha-required. Adversarial RLS testing is a future item — see future considerations.
LLM fixture recording
Record once against a real provider, then replay deterministically in every subsequent run:
```python
from spectral.core.llm.testing.fixture import llm_recorder, llm_replay

# Record once (creates the fixture file; run against real LLM)
with llm_recorder("tests/fixtures/llm/my_agent/scenario.json", real_client) as client:
    result = call_llm(client, "system prompt", "user prompt")

# Replay in tests (no network, deterministic)
client = llm_replay("tests/fixtures/llm/my_agent/scenario.json")
result = call_llm(client, "system prompt", "user prompt")
```

Recording is a deliberate action — never automatic. A recorded fixture that drifts from the live provider is caught by the nightly live-LLM job.
Developer commands
```bash
# Fast loop (unit only)
uv run --all-packages pytest -m unit -q

# PR-equivalent CI pipeline
uv run --all-packages pytest -m "unit or integration or contract"

# Everything including E2E
uv run --all-packages pytest -v

# Coverage report (per-package)
uv run --all-packages pytest --cov --cov-report=term-missing

# Mutation testing (nightly-equivalent, expensive)
cosmic-ray init cosmic-ray.toml session.sqlite
cosmic-ray exec session.sqlite
cosmic-ray dump session.sqlite | cr-report
```

Or run the tiered pre-push script, which bundles fast checks into the same gate CI uses:

```bash
bash tools/dev/precheck.sh            # matches CI pre-push gate
bash tools/dev/precheck.sh --install  # install as pre-push hook
```

CI pipeline
| Trigger | Layers run | Extras | Timeout |
|---|---|---|---|
| Pull request | unit + contract + integration | Coverage floors enforced | 10 min |
| Push to main | + e2e (test-agent backbone, diagonal emitter slice) | Coverage floors enforced | 15 min |
| Nightly (03:00 UTC) | All layers | + full-matrix emitter coverage (all 7 cells), Live LLM conversation-flow replay, mutation testing on scoped modules | 60 min |
| Manual (workflow_dispatch) | Configurable | Configurable | 60 min |
Test results are reported via JUnit XML in the Actions job summary. Mutation survivors are filed as open items on the standing mutation-triage issue.
When a test is wrong
- If the test breaks during a refactor that preserved behavior: the test was coupled to implementation. Rewrite it to assert on behavior at the boundary, not internal wiring.
- If the test flakes: treat it as a bug. Root-cause it (timing, ordering, shared state, hidden I/O). Don’t @pytest.mark.flaky it.
- If the test is hard to understand: the test is the bug. Rewrite for clarity. Tests are production code.
- If you can’t articulate what failure the test catches: delete it.
Related reading
- CONTRIBUTING.md → Testing (repo root) — short form of the testing rules
- Epic Template & DoD — integration-test AC requirement per epic
- Architecture — three-context topology
- Future considerations — future hardening items (adversarial RLS, observability stack choice)
- docs/runbooks/testing.md — operational playbook for the test-Supabase instance, fixture-recording sessions, and CI gate troubleshooting
- docs/runbooks/llm-testing.md — cassette redaction + nightly drift workflow detail