Testing

Testing strategy for spectral.core, spectral.worlds, and spectral.platform. Every agent-written test follows this page. The short form is in CONTRIBUTING.md → Testing in the repo root; this is the full reference for what test lives where, what to assert, and what CI enforces.


Non-negotiable rules. Every other decision on this page descends from these.

  1. Test behavior, not implementation. Tests assert what the system does, not how it does it internally. A refactor that preserves behavior should not break tests.
  2. Every test must have a reason to exist. If you can’t articulate what production failure a test catches, delete it. “It increases coverage” is not a reason.
  3. Test at the right boundary. Unit for pure logic; property-based for invariants; integration for cross-collaborator interactions; contract for external surfaces; E2E for user-visible paths. Don’t test internal wiring — test inputs and outputs at meaningful boundaries.
  4. Failing tests must be actionable. A test failure names what broke and where without requiring the developer to debug the test itself.
  5. Tests are production code. Same quality bar: clear names, no duplication, no dead tests. A flaky test is a bug.
  6. Integration tests hit real infrastructure. No DB mocks. Past incident: mocked tests passed while the prod migration broke. See CONTRIBUTING.md → Testing in the repo root.
  7. Fewer focused tests beat many shallow ones. Ten scenarios that matter beat a hundred that don’t.
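
To make rule 1 concrete, a small hypothetical contrast (every name below is invented for illustration):

```python
# Behavior-level test: asserts the observable outcome at the service boundary.
def test_expired_api_key_is_rejected(auth_service):
    result = auth_service.authenticate(api_key="expired-key")
    assert result.allowed is False
    assert result.reason == "expired"

# Implementation-coupled anti-pattern (breaks on harmless refactors):
#   auth_service._key_store.lookup.assert_called_once_with("expired-key")
```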

Every test file declares one layer via pytestmark. The root conftest.py enforces the marker — unmarked tests block the suite.

| Layer | Primary strategy | Notes |
| --- | --- | --- |
| Domain (all three contexts) | Unit + property-based for invariants | State machines, statistical uniqueness, bootstrap-CI properties, blend arithmetic — via Hypothesis. |
| Application | Mock at service-abstraction boundaries; fakes preferred over mocks | Only across protocols declared in application/shared/protocols/; never internal collaborators. |
| Infrastructure | Integration tests against real Supabase + pgvector + Ollama | No DB or LLM mocking at this layer. Past incident makes this non-negotiable. |
| API / workers | Contract tests against OpenAPI; E2E on critical paths | E2E covers the operator-walkthrough + first-customer walkthrough paths. |
| Test agent (apps/test-agents, tax_prep) | Scan-pipeline E2E backbone — runs in CI | See Test Agents — Pluggable OTEL emitter for the parameterization matrix. |
| Agent workflows (Spectral Agent, Operations Agent, WorldAgent) | Unit-test tools; integration-test conversation flows with recorded LLM fixtures; live LLM gated nightly | Live LLM exercise runs nightly only — never in PR CI. |
| Dual-occupant flows (Ops Agent + human operator on the same workflow) | Integration test: API action + UI action, assert both see updates | Real DB + real Realtime + scripted Ops Agent. New category with the Operations app. |

Mocks vs fakes. Hand-written in-memory implementations survive refactors; MagicMock does not. Mock only across shared/protocols/; do not mock internal collaborators within the same layer.
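
A minimal sketch of the fake-over-mock preference, using an invented protocol and entity (the real protocols live in application/shared/protocols/):

```python
from dataclasses import dataclass
from typing import Protocol
from uuid import UUID


@dataclass(frozen=True)
class Rule:
    id: UUID
    name: str


class RuleRepository(Protocol):
    def get(self, rule_id: UUID) -> Rule | None: ...
    def save(self, rule: Rule) -> None: ...


class FakeRuleRepository:
    """Hand-written in-memory fake: it honours the protocol's behavior
    (save-then-get returns the rule), so refactors that preserve behavior
    stay green. MagicMock call-order assertions would not."""

    def __init__(self) -> None:
        self._rules: dict[UUID, Rule] = {}

    def get(self, rule_id: UUID) -> Rule | None:
        return self._rules.get(rule_id)

    def save(self, rule: Rule) -> None:
        self._rules[rule.id] = rule
```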

Property-tested domain invariants. Rule + ChangeSet state machines, EvalSet statistical uniqueness, bootstrap-CI properties, conformity-gate self-consistency, CompositeScore blend arithmetic.

Test-agent role. Single tax_prep agent + pluggable OTEL emitter (instrumentation framework × LLM-vendor span shape) is the canonical full-scan-pipeline assertion.


CI-enforced from commit one. These are floors, not ceilings. A layer above its floor is not an invitation to stop writing tests; a layer below the floor blocks merge.

| Layer | Floor | Rationale |
| --- | --- | --- |
| Domain | ≥ 90% | Pure business logic with no I/O. There is no reason not to cover it. Domain bugs are the most expensive to miss because they propagate into everything above. |
| Application | ≥ 80% | Orchestration logic. Slightly lower floor because some branches exist purely to translate domain results into API-layer concerns, and those are already covered by API-layer tests. |
| Infrastructure | ≥ 60% | Adapters to external systems. Exhaustive coverage is uneconomic (much of the work is already done by the upstream library), but the contract-facing edges and error paths must be covered. |

Coverage is measured per package via pytest-cov and reported to CI. A PR that drops any package below its floor fails the coverage job before tests finish.


| Marker | What it tests | Typical latency | Network | Runs in CI |
| --- | --- | --- | --- | --- |
| unit | One subject, collaborators stubbed or faked | < 1s | None | PR, merge, nightly |
| contract | Consumer-facing agreement remains stable (OpenAPI, event payloads in <context>.contracts.events.*, OHS Protocols in <context>.contracts.protocols.*, spectral.core substrate types) | < 1s | None | PR, merge, nightly |
| integration | Two+ subjects interact correctly against real infra (DB / pgvector / cassette LLM) | < 5s | Local | PR, merge, nightly |
| e2e | Walkthrough paths end-to-end | 5–30s | Local | Merge, nightly |
| live_drift | Nightly LLM live-drift detection — bypasses VCR cassettes; hits real providers; compares to recorded outputs | varies | External | Nightly + merge only |

```python
import pytest

pytestmark = pytest.mark.unit  # or contract, integration, e2e, live_drift
```

The root tests/conftest.py rejects any file missing one of the four primary markers (unit, contract, integration, e2e); live_drift is the marker the nightly workflow filters on.
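
A rough sketch of how that enforcement might look inside the root conftest (the real hook may differ in details):

```python
import pytest

# live_drift is applied on top of these and filtered by the nightly workflow.
PRIMARY_MARKERS = {"unit", "contract", "integration", "e2e"}


def pytest_collection_modifyitems(config, items):
    # Block the whole suite if any collected test lacks a layer marker.
    unmarked = [
        item for item in items
        if not PRIMARY_MARKERS & {m.name for m in item.iter_markers()}
    ]
    if unmarked:
        names = ", ".join(item.nodeid for item in unmarked[:5])
        raise pytest.UsageError(f"tests missing a layer marker: {names}")
```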

ADR-061 defines three tiers:

  • Unit / contract — FakeLLMProvider (in spectral.core.llm.testing) implements the LLMProvider protocol; deterministic; zero external calls.
  • Integration — pytest-recording per-test cassettes at tests/<context>/_fixtures/llm/<test-id>.yaml. Replay is byte-perfect deterministic.
  • Live drift detection — .github/workflows/nightly-live-drift.yml runs LIVE_PROVIDER=1 against the integration suite, bypasses VCR replay, compares outputs against recorded cassettes via similarity threshold (0.85 text / structural exact-match tool calls; per-test override).
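
As an illustration only, the text half of the drift comparison could be expressed along these lines (the real threshold logic lives in the nightly workflow and supports per-test overrides):

```python
from difflib import SequenceMatcher


def assert_no_text_drift(live: str, recorded: str, threshold: float = 0.85) -> None:
    # Compare live provider output against the recorded cassette output.
    score = SequenceMatcher(None, live, recorded).ratio()
    assert score >= threshold, f"live output drifted from cassette (similarity {score:.2f})"
```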

Cassette recording sessions: RECORD_NEW_FIXTURES=1 uv run pytest <path> -m integration. Always review the cassette diff before commit; redaction at recording time strips known-sensitive headers, but custom fields can leak through. The tools/quality/check_cassette_redaction.py lint blocks Authorization: Bearer ... patterns; it lands with the first cassette commit per ADR-061 D8 (dead lint until then). Detailed playbook in docs/runbooks/llm-testing.md.
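
For orientation, a minimal sketch of what such a redaction lint could look like (the shipped tools/quality/check_cassette_redaction.py may scan more patterns and paths):

```python
import re
import sys
from pathlib import Path

SENSITIVE = re.compile(r"Authorization:\s*Bearer\s+\S+", re.IGNORECASE)


def main(root: str = "tests") -> int:
    # Flag any cassette that still contains a bearer credential.
    leaks = [
        str(path)
        for path in Path(root).rglob("_fixtures/llm/*.yaml")
        if SENSITIVE.search(path.read_text())
    ]
    for leak in leaks:
        print(f"unredacted credential in cassette: {leak}")
    return 1 if leaks else 0


if __name__ == "__main__":
    sys.exit(main())
```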

Mock-first PR CI; live secrets gated to non-PR triggers


ADR-062 sets the policy:

  • Default PR CI: unit + contract + integration with FakeLLMProvider + cassettes; no external service calls; mock-first by default.
  • Live-secret runs are gated to push-to-main, schedule, or workflow_dispatch. Fork PRs never trigger live-secret workflows. pull_request_target is not used.
  • GitHub Environments scope secrets: staging, production, test-live (the latter holds the LLM provider keys for the nightly drift workflow). See docs/runbooks/ci-secrets.md.

Events that flow between contexts are pinned by bilateral contract tests under tests/contracts/ (per ADR-065 D6 and ADR-066). This directory is the only place in the codebase exempt from the import discipline that prevents worlds and platform from importing each other (validator rule 6) — bilateral tests legitimately import both the producer’s typed payload (from <producer>.contracts.events.*) and the consumer’s local model (from the consuming flow).

The pattern, demonstrated by tests/contracts/test_failure_cluster_detected.py:

"""Bilateral contract test for platform.failure_cluster.detected."""
from spectral.platform.contracts.events.failure_cluster_detected import (
FailureClusterDetectedPayload,
FailureRef,
)
# Consumer-narrow local model — declares only the fields platform's curation
# intake actually needs. In production, this lives with the consuming flow
# (e.g. spectral.platform.curation.intake.failure_cluster_event).
class FailureClusterEvent(BaseModel):
model_config = ConfigDict(frozen=True)
cluster_id: UUID
snapshot_hash: str
rule_id: UUID
workspace_id: UUID
severity: Literal["low", "medium", "high"]
def test_consumer_parses_producer_emit_shape() -> None:
"""Round-trip invariant between contexts per ADR-065 D4."""
producer = FailureClusterDetectedPayload(...) # producer-rich
wire = producer.model_dump(mode="json")
consumer = FailureClusterEvent.model_validate(wire)
assert consumer.cluster_id == producer.cluster_id
# Consumer-narrow: producer-only fields silently dropped, consumer
# never depends on them.

Two complementary tests per event between contexts:

  1. Round-trip — verify the consumer’s local <EventName>Event model parses the producer’s model_dump(mode="json") output. Catches structural mismatch at PR time.
  2. Schema-drift snapshot (per ADR-066, syrupy) — once syrupy is wired in as a dev-dep, snapshot the producer’s model_json_schema(). First run creates the baseline (pytest --snapshot-update); subsequent runs detect drift. Intentional changes update the snapshot in the same commit; reviewer validates intent from the diff.

Snapshot first-run discipline: the first run of a new contract test creates the syrupy baseline. Author runs pytest --snapshot-update once, commits both the test file and the generated __snapshots__/ directory; subsequent runs verify against the committed baseline. Until syrupy lands, the round-trip test alone is the load-bearing check.
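
Once syrupy is available, the drift half can be a one-liner; a sketch assuming syrupy's default snapshot fixture:

```python
def test_producer_schema_is_stable(snapshot) -> None:
    # Any change to the producer's JSON schema shows up as a snapshot diff.
    assert FailureClusterDetectedPayload.model_json_schema() == snapshot
```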

See Events and Protocols for the catalogs of existing events and Protocols. New events or Protocols land alongside their owning epic and the catalog page is updated by hand at the same time.


| What you’re changing | Run these tests | Marker | Command |
| --- | --- | --- | --- |
| Domain entity (pydantic model, value object) | Construction, validation, serialization | unit | pytest -m unit |
| Entity state machine (lifecycle transitions) | Valid transitions + invalid-state rejection | unit | pytest -m unit |
| Domain invariant (uniqueness, monotonicity, blend arithmetic) | Property-based via Hypothesis | unit | pytest -m unit |
| spectral.core substrate type | Contract test pinning the type’s wire shape | contract | pytest -m contract |
| Producer-owned event payload (<context>.contracts.events.*) | Producer wire-shape test under tests/<context>/contracts/events/ + bilateral round-trip + drift snapshot under tests/contracts/ | contract | pytest -m contract |
| OHS Protocol (<context>.contracts.protocols.*) | Protocol-conformance test under tests/<context>/contracts/protocols/ (structural isinstance against a stub) | contract | pytest -m contract |
| Application use case | Use case orchestration with injected fakes | unit | pytest -m unit |
| Application protocol in application/shared/protocols/ | Protocol conformance (structural) | unit | pytest -m unit |
| Supabase migration / RLS policy | RLS isolation per role, constraint enforcement | integration | pytest -m integration |
| Infrastructure adapter (DB, LLM, notification) | Round-trip against real local service | integration | pytest -m integration |
| OTEL ingestion path | Trace parsing + sample derivation | unit + integration | pytest -m "unit or integration" |
| LLM prompt construction | Recorded-response replay | unit | pytest -m unit |
| Scan pipeline phase | Phase behavior with fakes for collaborators | unit | pytest -m unit |
| Full scan pipeline | Test-agent E2E backbone, diagonal emitter slice (3 cells) | e2e | pytest -m e2e |
| API endpoint / router | Request/response contract, auth, validation | contract + unit | pytest -m "contract or unit" |
| Agent conversation flow (Spectral / Ops / World) | Recorded fixture replay | integration | pytest -m integration |
| Dual-occupant flow (Ops Agent + human UI) | API action + UI action, both see updates | integration | pytest -m integration |
| Frontend component (dashboard / operations) | Playwright NL specs with mocked LLM | e2e | pytest -m e2e |
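
Several rows above call for structural protocol-conformance tests (isinstance against a stub). A minimal sketch of that pattern, with a hypothetical protocol and stub:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class NotificationSender(Protocol):
    def send(self, recipient: str, message: str) -> None: ...


class StubSender:
    def send(self, recipient: str, message: str) -> None:
        pass


def test_stub_conforms_to_notification_sender() -> None:
    # Structural check: the stub exposes every method the protocol declares.
    assert isinstance(StubSender(), NotificationSender)
```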

Invariants are the strongest form of test because they hold for all inputs in the described space, not just hand-picked examples. We use Hypothesis for every invariant we can express.

Required property-based coverage:

| Subject | Invariant |
| --- | --- |
| Rule status machine | Only declared transitions are reachable; every reachable state has a valid predecessor |
| ChangeSet status machine | Same structural guarantees; terminal states are absorbing |
| EvalSet sample generation | Statistical uniqueness — no two generated samples collide within a corpus above chance |
| Bootstrap CI | Coverage properties — computed intervals contain the population statistic at the declared confidence rate across many seeds |
| Conformity gate | Self-consistency — gate output is deterministic given inputs; gate never contradicts its own prior decision on the same inputs |
| CompositeScore | Blend arithmetic — weighted combinations respect monotonicity; bounds stay within [0, 1]; no silent NaN propagation |

Each of these lives as a @given(...) Hypothesis test under the owning package’s tests/unit/ tree. Shrinking failure cases is the point; a flaky Hypothesis test with an unshrunk counter-example is a bug to investigate, not to retry.
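
As a flavour of the style (not the real CompositeScore code), a bounds property for a weighted blend might look like this:

```python
from hypothesis import given
from hypothesis import strategies as st

score = st.floats(min_value=0.0, max_value=1.0)    # bounded floats exclude NaN/inf
weight = st.floats(min_value=0.01, max_value=1.0)


@given(st.lists(st.tuples(score, weight), min_size=1, max_size=8))
def test_weighted_blend_stays_in_unit_interval(pairs):
    total = sum(w for _, w in pairs)
    blended = sum(s * w for s, w in pairs) / total
    assert 0.0 <= blended <= 1.0
```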


apps/test-agents is the alpha home for a single subject agent — tax_prep — whose sole purpose is to exercise the full scan pipeline end-to-end in CI. The prior plan of one test agent per workflow shape was consolidated because ingestion diversity is better expressed at the emitter level. The directory currently holds a scaffold; the working tax_prep agent + pluggable OTEL emitter described here land under the test-substrate epic — see Test Agents for the current-state framing.

Pluggable OTEL emitter. The tax_prep agent is parameterized over an OTEL emitter that varies along two axes — instrumentation framework (LangChain / OpenLLMetry / Manual SDK) × LLM-vendor span shape (Anthropic / OpenAI / raw OTLP). The full coverage matrix and CI tier policy are canonical in Test Agents — Pluggable OTEL emitter.

The agent runs a deterministic tax-prep workflow (fixed inputs, fixed expected outputs) against each emitter cell. The scan pipeline ingests, calibrates, diagnoses, evaluates, optimizes, checks safety, and renders a verdict. Every phase has assertions on the intermediate state.

This is the one E2E path that the scan pipeline is obligated to pass on every merge. The per-push diagonal slice covers both axes; a cell failure localises the ingestion bug.
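
A sketch of how the per-push diagonal slice might be parameterized; the three cell pairings below are illustrative only, the canonical matrix is in Test Agents — Pluggable OTEL emitter:

```python
import pytest

pytestmark = pytest.mark.e2e

# Illustrative diagonal cells: (instrumentation framework, vendor span shape).
DIAGONAL_CELLS = [
    ("langchain", "anthropic"),
    ("openllmetry", "openai"),
    ("manual_sdk", "raw_otlp"),
]


@pytest.mark.parametrize(("framework", "span_shape"), DIAGONAL_CELLS)
def test_tax_prep_full_scan(framework, span_shape):
    # Run the deterministic tax_prep workflow through this emitter cell and
    # assert on every pipeline phase's intermediate state (ingest through verdict).
    ...
```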


Three agents (Spectral Agent, Operations Agent, WorldAgent) are each tested in three passes:

  1. Unit tests on tools. Agent tools are pure functions or thin wrappers over application services. Unit-test them directly. These are the first things to write when adding a tool.
  2. Integration tests on conversation flows with recorded LLM fixtures. Record once against a live LLM, replay deterministically in CI. See LLM fixture recording below. This is the default test for end-to-end conversation behavior.
  3. Live-LLM exercise gated nightly. The nightly job runs representative conversation flows against the real LLM provider and checks that the recorded fixtures still hold. Divergence from the fixture is a signal — either the provider drifted (re-record) or the prompt regressed (fix). Never run live LLM tests in PR CI: cost, flakiness, and provider-incident blast radius.
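
For pass 1, tools are small enough to test directly; a self-contained sketch with invented names:

```python
from dataclasses import dataclass, field


@dataclass
class FakeWorkflowService:
    flagged: list[tuple[str, str]] = field(default_factory=list)

    def flag_rule(self, workflow_id: str, rule_id: str) -> str:
        self.flagged.append((workflow_id, rule_id))
        return "flagged"


def flag_rule_for_review(service, workflow_id: str, rule_id: str) -> str:
    """Agent tool: a thin wrapper over an application service."""
    return service.flag_rule(workflow_id, rule_id)


def test_flag_rule_tool_delegates_to_service():
    service = FakeWorkflowService()
    assert flag_rule_for_review(service, "wf-1", "r-1") == "flagged"
    assert service.flagged == [("wf-1", "r-1")]
```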

Dual-occupant flows (Ops Agent + human operator)


When the Operations app shipped, a new test category emerged: flows where the Ops Agent and a human operator occupy the same workflow concurrently. The critical invariant is that both occupants see the same updates.

Minimum coverage for every dual-occupant surface:

```python
pytestmark = pytest.mark.integration


def test_ops_agent_sees_human_ui_action(ops_agent_session, human_ui_client, workflow_id):
    # 1. Human triggers a UI action (via a scripted dashboard client)
    human_ui_client.approve_rule_candidate(workflow_id, rule_id="...")

    # 2. Ops Agent reads the workflow state (via its own tools)
    state = ops_agent_session.read_workflow_state(workflow_id)

    # 3. Assert the agent's view reflects the human action
    assert state.pending_approvals == []
    assert state.last_decision.actor_id == human_ui_client.user_id


def test_human_sees_ops_agent_action(ops_agent_session, human_ui_client, workflow_id):
    # Same test, reversed roles.
    ops_agent_session.flag_rule_for_review(workflow_id, rule_id="...")
    ui_state = human_ui_client.get_workflow(workflow_id)
    assert any(r.status == "flagged" for r in ui_state.rules)
```

These run against the real Supabase project and exercise the real Realtime channel. They are slow (seconds, not milliseconds). Keep them few and sharp — one per meaningful dual-occupant surface.


Mutation testing verifies that tests actually catch bugs, not just cover lines. We use cosmic-ray for this. It runs nightly, not per-PR.

Mutation scope (nightly):

| Module | Why it’s in scope |
| --- | --- |
| Verdict engine | The final decision layer. A silent regression here produces wrong approvals. |
| Conformity gate | Gates what reaches customers. An unnoticed weakness here lets non-conforming rules through. |
| CompositeScore arithmetic | Blend math. Off-by-one or sign errors in weights are both invisible and expensive. |
| Holdout selection | Determines what’s evaluated. Biased selection silently corrupts downstream metrics. |

A survivor is a mutant that passes the test suite — meaning a bug was introduced that no test caught. Surviving mutants are tracked as a small standing task; triage each survivor and either add a test that kills it, or document why the mutation is semantically equivalent (rare).

A target survival rate is not declared up-front — the goal is to drive survivors to zero on the scoped modules. Outside the scoped modules, mutation testing is not run in CI.


The root tests/conftest.py and each package’s tests/conftest.py provide the standard fixtures. Integration-test fixtures hit a dedicated test-Supabase instance whose schema-isolation lifecycle is defined by ADR-045 — per-test rollback inside a shared schema-isolated database, with migration parity to production.

Default names (may vary slightly by package — follow the conftest, not this list):

  • supabase_db — raw connection with per-test rollback
  • as_owner, as_member, as_operator — authenticated role contexts
  • as_service_role — admin / backend access, bypasses RLS
  • as_anon — unauthenticated access
  • ollama_client — local Ollama client for embedding / small-model tests
  • llm_replay — replay-mode LLM client, reads a recorded fixture path

Every workspace-scoped table needs integration tests validating the four RLS shapes:

```python
pytestmark = pytest.mark.integration


class TestMyTableRLS:
    def test_member_sees_own_workspace(self, supabase_db):
        # 1. Create workspace + membership as postgres (bypasses RLS)
        # 2. Insert test data
        # 3. Switch to authenticated role with _set_role()
        # 4. Assert data is visible
        ...

    def test_cross_tenant_invisible(self, supabase_db):
        # 1. Create workspace + data for a DIFFERENT workspace
        # 2. Switch to your test user
        # 3. Assert the other workspace's data is NOT visible (zero rows — not "different rows")
        ...

    def test_anon_sees_nothing(self, supabase_db):
        # Switch to anon role, assert zero rows
        ...

    def test_service_role_sees_all(self, supabase_db):
        # Switch to service_role, assert all rows visible
        ...
```

Smoke-level RLS coverage is alpha-required. Adversarial RLS testing is a future item — see future considerations.

Record once against a real provider, then replay deterministically in every subsequent run:

```python
from spectral.core.llm.testing.fixture import llm_recorder, llm_replay

# Record once (creates the fixture file; run against real LLM)
with llm_recorder("tests/fixtures/llm/my_agent/scenario.json", real_client) as client:
    result = call_llm(client, "system prompt", "user prompt")

# Replay in tests (no network, deterministic)
client = llm_replay("tests/fixtures/llm/my_agent/scenario.json")
result = call_llm(client, "system prompt", "user prompt")
```

Recording is a deliberate action — never automatic. A recorded fixture that drifts from the live provider is caught by the nightly live-LLM job.


```bash
# Fast loop (unit only)
uv run --all-packages pytest -m unit -q

# PR-equivalent CI pipeline
uv run --all-packages pytest -m "unit or integration or contract"

# Everything including E2E
uv run --all-packages pytest -v

# Coverage report (per-package)
uv run --all-packages pytest --cov --cov-report=term-missing

# Mutation testing (nightly-equivalent, expensive)
cosmic-ray init cosmic-ray.toml session.sqlite
cosmic-ray exec session.sqlite
cosmic-ray dump session.sqlite | cr-report
```

Or run the tiered pre-push script, which bundles fast checks into the same gate CI uses:

```bash
bash tools/dev/precheck.sh            # matches CI pre-push gate
bash tools/dev/precheck.sh --install  # install as pre-push hook
```

| Trigger | Layers run | Extras | Timeout |
| --- | --- | --- | --- |
| Pull request | unit + contract + integration | Coverage floors enforced | 10 min |
| Push to main | + e2e (test-agent backbone, diagonal emitter slice) | Coverage floors enforced | 15 min |
| Nightly (03:00 UTC) | All layers | + full-matrix emitter coverage (all 7 cells), live LLM conversation-flow replay, mutation testing on scoped modules | 60 min |
| Manual (workflow_dispatch) | Configurable | Configurable | 60 min |

Test results are reported via JUnit XML in the Actions job summary. Mutation survivors are filed as open items on the standing mutation-triage issue.


  • If the test breaks during a refactor that preserved behavior: the test was coupled to implementation. Rewrite it to assert on behavior at the boundary, not internal wiring.
  • If the test flakes: treat as a bug. Root-cause it (timing, ordering, shared state, hidden I/O). Don’t @pytest.mark.flaky.
  • If the test is hard to understand: the test is the bug. Rewrite for clarity. Tests are production code.
  • If you can’t articulate what failure the test catches: delete it.

  • CONTRIBUTING.md → Testing (repo root) — short form of the testing rules
  • Epic Template & DoD — integration-test AC requirement per epic
  • Architecture — three-context topology
  • Future considerations — a future hardening (adversarial RLS, observability stack choice)
  • docs/runbooks/testing.md — operational playbook for the test-Supabase instance, fixture-recording sessions, and CI gate troubleshooting
  • docs/runbooks/llm-testing.md — cassette redaction + nightly drift workflow detail