Testing

Testing strategy for spectral.core, spectral.worlds, and spectral.platform. Every agent-written test follows this page. The short form is in CONTRIBUTING.md → Testing in the repo root; this is the full reference for what test lives where, what to assert, and what CI enforces.


Non-negotiable rules. Every other decision on this page descends from these.

  1. Test behavior, not implementation. Tests assert what the system does, not how it does it internally. A refactor that preserves behavior should not break tests.
  2. Every test must have a reason to exist. If you can’t articulate what production failure a test catches, delete it. “It increases coverage” is not a reason.
  3. Test at the right boundary. Unit for pure logic; property-based for invariants; integration for cross-collaborator interactions; contract for external surfaces; E2E for user-visible paths. Don’t test internal wiring — test inputs and outputs at meaningful boundaries.
  4. Failing tests must be actionable. A test failure names what broke and where without requiring the developer to debug the test itself.
  5. Tests are production code. Same quality bar: clear names, no duplication, no dead tests. A flaky test is a bug.
  6. Integration tests hit real infrastructure. No DB mocks. Past incident: mocked tests passed while the prod migration broke. See CONTRIBUTING.md → Testing in the repo root.
  7. Fewer focused tests beat many shallow ones. Ten scenarios that matter beat a hundred that don’t.

Every test file declares one layer via pytestmark. The root conftest.py enforces the marker — unmarked tests block the suite.

| Layer | Primary strategy | Notes |
| --- | --- | --- |
| Domain (all three contexts) | Unit + property-based for invariants | State machines, statistical uniqueness, bootstrap-CI properties, blend arithmetic, all via Hypothesis. |
| Application | Mock at service-abstraction boundaries; fakes preferred over mocks | Only across protocols declared in application/shared/protocols/; never internal collaborators. |
| Infrastructure | Integration tests against real Supabase + pgvector + Ollama | No DB or LLM mocking at this layer. Past incident makes this non-negotiable. |
| API / workers | Contract tests against OpenAPI; E2E on critical paths | E2E covers the operator-walkthrough + decision-time tax_prep paths. |
| Test agent (apps/test-agents, tax_prep) | Decision-time E2E backbone: the reproducible success-bar gate, a documented release gate run against a deployed world (not in per-push CI) | See Test Agents: the LangGraph agent, its system: validation gate, and the interactive shell. |
| Agent workflows (World Agent) | Unit-test tools; integration-test conversation flows with recorded LLM fixtures; manual live-provider recording only when fixture inputs intentionally change | Live provider calls never run in PR CI or scheduled CI. |

Mocks vs fakes. Hand-written in-memory implementations survive refactors; MagicMock does not. Mock only across shared/protocols/; do not mock internal collaborators within the same layer.
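A minimal sketch of the fake-over-mock preference, using only the standard library. The protocol, the use case, and every name below are hypothetical stand-ins for something declared under shared/protocols/; none of them come from the codebase.

```python
from dataclasses import dataclass, field
from typing import Protocol


class NotificationSender(Protocol):
    """Hypothetical protocol; stands in for one declared under shared/protocols/."""

    def send(self, recipient: str, message: str) -> None: ...


@dataclass
class FakeNotificationSender:
    """Hand-written in-memory fake: real behavior, no network, inspectable state."""

    sent: list[tuple[str, str]] = field(default_factory=list)

    def send(self, recipient: str, message: str) -> None:
        self.sent.append((recipient, message))


def publish_rule(rule_id: str, owner: str, notifier: NotificationSender) -> str:
    """Hypothetical use case: publish a rule and notify its owner."""
    notifier.send(owner, f"Rule {rule_id} published")
    return "published"


def test_publish_notifies_owner() -> None:
    notifier = FakeNotificationSender()
    assert publish_rule("r-1", "owner@example.com", notifier) == "published"
    # Assert on the observable outcome, not on how the collaborator was called.
    assert notifier.sent == [("owner@example.com", "Rule r-1 published")]
```

The fake satisfies the protocol structurally, so renaming internals of publish_rule leaves the test untouched as long as the owner still receives the same message.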

Property-tested domain invariants. Rule status machine, action-module lifecycle, aggregation-mode determinism (T1 hard-floor preserved under every mode), conformity-gate self-consistency, predicate-purity invariants (no I/O, no nondeterminism), audit-chain entry shape, work-frame contract completeness.

Test-agent role. The tax_prep agent run against a deployed world — its reproducible system: validation gate per Test Agents — is the canonical full-decision-time-pipeline assertion.


CI-enforced from commit one. These are floors, not ceilings. A layer above its floor is not an invitation to stop writing tests; a layer below the floor blocks merge.

| Layer | Floor | Rationale |
| --- | --- | --- |
| Domain | ≥ 90% | Pure business logic with no I/O. There is no reason not to cover it. Domain bugs are the most expensive to miss because they propagate into everything above. |
| Application | ≥ 80% | Orchestration logic. Slightly lower floor because some branches exist purely to translate domain results into API-layer concerns, and those are already covered by API-layer tests. |
| Infrastructure | ≥ 60% | Adapters to external systems. Exhaustive coverage is uneconomic (much of the work is already done by the upstream library), but the contract-facing edges and error paths must be covered. |

Coverage is measured per package via pytest-cov and reported to CI. A PR that drops any package below its floor fails the coverage job before tests finish.
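The floor check itself is simple arithmetic. The sketch below is an illustration only: the function name is hypothetical, it keys coverage by layer rather than by package, and it assumes coverage arrives as fractions rather than as a pytest-cov report.

```python
# Floors from the coverage table, expressed as fractions.
FLOORS = {"domain": 0.90, "application": 0.80, "infrastructure": 0.60}


def below_floor(measured: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Return {layer: (measured, floor)} for every layer under its floor.

    A layer with no measurement counts as 0.0, so a missing report fails
    loudly instead of passing silently (an assumption of this sketch).
    """
    return {
        layer: (measured.get(layer, 0.0), floor)
        for layer, floor in FLOORS.items()
        if measured.get(layer, 0.0) < floor
    }
```

For example, below_floor({"domain": 0.93, "application": 0.78, "infrastructure": 0.61}) reports only the application layer, which sits two points under its floor.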


| Marker | What it tests | Typical latency | Network | Runs in CI |
| --- | --- | --- | --- | --- |
| unit | One subject, collaborators stubbed or faked | < 1s | None | PR, merge, nightly |
| contract | Consumer-facing agreement remains stable (OpenAPI, event payloads in <context>.contracts.events.*, OHS Protocols in <context>.contracts.protocols.*, spectral.core substrate types) | < 1s | None | PR, merge, nightly |
| integration | Two+ subjects interact correctly against real infra (DB / pgvector / cassette LLM) | < 5s | Local | PR, merge, nightly |
| e2e | Walkthrough paths end-to-end | 5–30s | Local | Merge, nightly |

import pytest

pytestmark = pytest.mark.unit  # or contract, integration, e2e

The root tests/conftest.py rejects any file missing one of the four primary markers (unit, contract, integration, e2e).
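One way such a check could be wired is a collection hook. This is a sketch under assumptions, not the contents of the real conftest: the helper name is hypothetical, and the choice to abort with pytest.UsageError is illustrative.

```python
from typing import Optional

PRIMARY_MARKERS = frozenset({"unit", "contract", "integration", "e2e"})


def layer_marker_problem(marker_names: set[str]) -> Optional[str]:
    """Return a readable problem, or None when a primary layer marker is present."""
    if PRIMARY_MARKERS & marker_names:
        return None
    return "no layer marker; add pytestmark = pytest.mark.<layer>"


def pytest_collection_modifyitems(config, items):
    """pytest hook: abort the run when any collected test lacks a layer marker."""
    import pytest  # imported lazily so the helper above works without pytest

    problems = []
    for item in items:
        problem = layer_marker_problem({mark.name for mark in item.iter_markers()})
        if problem:
            problems.append(f"{item.nodeid}: {problem}")
    if problems:
        raise pytest.UsageError("\n".join(problems))
```

Keeping the decision in a plain function means the rule itself can be unit-tested without spinning up a pytest session.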

ADR-061 defines three tiers:

  • Unit / contract: FakeLLMProvider (in tests.core.llm) implements the LLMProvider protocol; it is deterministic and makes zero external calls.
  • Integration: pytest-recording per-test cassettes at tests/core/llm/cassettes/<rel-test-path>/<test-id>.yaml. Replay is deterministic, byte for byte.
  • Manual live recording: operator-run recording sessions refresh only the affected cassettes when prompts, fixtures, or model/provider choices intentionally change. A similarity helper (assert_llm_output_similar, 0.85 default per ADR-061 D5) is available for tests that need a paraphrase-tolerant assertion within the suite; cassette refresh itself is reviewed by inspecting git diff.
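To make the idea of a paraphrase-tolerant assertion concrete, here is a stand-in built on the standard library's difflib. It is not assert_llm_output_similar: this page does not say which metric the real helper uses, so the character-level ratio, the normalization step, and the function names below are all assumptions. Only the 0.85 default is taken from the text above.

```python
from difflib import SequenceMatcher


def similarity(expected: str, actual: str) -> float:
    """Character-level similarity in [0, 1] after case and whitespace normalization."""

    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return SequenceMatcher(None, normalize(expected), normalize(actual)).ratio()


def assert_similar(expected: str, actual: str, threshold: float = 0.85) -> None:
    """Fail with an actionable message when the outputs diverge too far."""
    score = similarity(expected, actual)
    assert score >= threshold, (
        f"LLM output similarity {score:.2f} is below {threshold:.2f}\n"
        f"expected: {expected!r}\n"
        f"actual:   {actual!r}"
    )
```

The failure message carries both strings and the score, so a red test says what diverged without a debugging session.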

Cassette recording sessions: uv run pytest <path> --record-mode=once. Always review the cassette diff before commit; redaction at recording time strips known-sensitive headers, but custom fields can leak through. The tools/quality/check_cassette_redaction.py lint blocks Authorization: Bearer ... patterns and a broad set of provider key formats per ADR-061 D8. Detailed playbook in docs/runbooks/llm-testing.md.
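The kind of check a redaction lint performs can be sketched in a few lines. This is not tools/quality/check_cassette_redaction.py: the two patterns, the REDACTED placeholder, and the function name are assumptions chosen for illustration, and the real lint covers a broader set of key formats.

```python
import re

# Illustrative patterns only (assumptions, not the real lint's rule set).
SECRET_PATTERNS = {
    # Matches the header on one line or in YAML list form
    # ("Authorization:" followed by "- Bearer <token>" on the next line).
    "bearer-token": re.compile(
        r"Authorization:\s*(?:-\s*)?Bearer\s+(?!REDACTED\b)\S+", re.IGNORECASE
    ),
    # Assumed example of a provider key shape.
    "provider-key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
}


def find_unredacted(cassette_text: str) -> list[tuple[int, str]]:
    """Return sorted (line_number, pattern_label) pairs for every suspected secret."""
    hits = []
    for label, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(cassette_text):
            line_number = cassette_text.count("\n", 0, match.start()) + 1
            hits.append((line_number, label))
    return sorted(hits)
```

Reporting a label and a line number, never the matched text, keeps the secret itself out of CI logs.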

Mock-first PR CI; live secrets gated to non-PR triggers


ADR-062 sets the policy:

  • Default PR CI: unit + contract + integration with FakeLLMProvider + cassettes; no external service calls; mock-first by default.
  • Live-secret runs are gated to trusted non-PR triggers. Fork PRs never trigger live-secret workflows. pull_request_target is not used.
  • GitHub Environments scope secrets: staging and production. See docs/runbooks/ci-secrets.md.

Events that flow between contexts are pinned by bilateral contract tests under tests/contracts/ (per ADR-065 D6 + ADR-066). This directory is the only place in the codebase exempt from the import discipline that prevents worlds and platform from importing each other (the validator’s tests-contracts-exempt rule): bilateral tests legitimately import both the producer’s typed payload (from <producer>.contracts.events.*) and the consumer’s local model (from the consuming flow).

The pattern, demonstrated by tests/contracts/test_world_model_card_published.py:

"""Bilateral contract test for worlds.world_model_card.published."""
from pydantic import BaseModel, ConfigDict

from spectral.worlds.contracts.events.world_model_card_published import (
    WorldModelCardPublishedPayload,
)


# Consumer-narrow local model: declares only the fields platform's System
# Card projection intake actually needs. In production, this lives with the
# consuming flow (e.g. spectral.platform.system_card.intake.card_published_event).
class WorldModelCardPublishedEvent(BaseModel):
    model_config = ConfigDict(frozen=True)

    org_id: str
    domain_id: str
    world_model_version: int
    authority_summary: str
    provenance_summary: dict[str, int]


def test_consumer_parses_producer_emit_shape() -> None:
    """Round-trip invariant between contexts per ADR-065 D4."""
    producer = WorldModelCardPublishedPayload(...)  # producer-rich
    wire = producer.model_dump(mode="json")
    consumer = WorldModelCardPublishedEvent.model_validate(wire)
    assert consumer.world_model_version == producer.world_model_version
    # Consumer-narrow: producer-only fields are silently dropped, and the
    # consumer never depends on them.

Two complementary tests per event between contexts:

  1. Round-trip — verify the consumer’s local <EventName>Event model parses the producer’s model_dump(mode="json") output. Catches structural mismatch at PR time.
  2. Schema-drift snapshot (per ADR-066, syrupy) — once syrupy is wired in as a dev-dep, snapshot the producer’s model_json_schema(). First run creates the baseline (pytest --snapshot-update); subsequent runs detect drift. Intentional changes update the snapshot in the same commit; reviewer validates intent from the diff.

Snapshot first-run discipline: the first run of a new contract test creates the syrupy baseline. Author runs pytest --snapshot-update once, commits both the test file and the generated __snapshots__/ directory; subsequent runs verify against the committed baseline. Until syrupy lands, the round-trip test alone is the load-bearing check.
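The mechanics of a drift snapshot can be shown without syrupy. The sketch below is a stdlib stand-in, not the planned implementation: the function names and file layout are hypothetical, and it compares only the properties section of a JSON schema such as the one model_json_schema() returns.

```python
import json
from pathlib import Path


def diff_properties(old: dict, new: dict) -> list[str]:
    """List added, removed, and changed properties between two JSON schemas."""
    old_props, new_props = old.get("properties", {}), new.get("properties", {})
    changes = []
    for name in sorted(old_props.keys() | new_props.keys()):
        if name not in new_props:
            changes.append(f"removed: {name}")
        elif name not in old_props:
            changes.append(f"added: {name}")
        elif old_props[name] != new_props[name]:
            changes.append(f"changed: {name}")
    return changes


def check_schema_drift(schema: dict, baseline: Path, update: bool = False) -> list[str]:
    """Compare a schema with its committed baseline; rewrite the baseline only on request."""
    if update:
        baseline.parent.mkdir(parents=True, exist_ok=True)
        baseline.write_text(json.dumps(schema, indent=2, sort_keys=True))
        return []
    if not baseline.exists():
        raise FileNotFoundError(f"no baseline at {baseline}; rerun with update=True")
    return diff_properties(json.loads(baseline.read_text()), schema)
```

As with the snapshot workflow described above, the baseline is written only on an explicit update, so an accidental schema change surfaces as a non-empty drift list rather than as a silently refreshed file.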

See Events and Protocols for the catalogs of existing events and Protocols. New events or Protocols land alongside their owning epic and the catalog page is updated by hand at the same time.


| What you’re changing | Run these tests | Marker | Command |
| --- | --- | --- | --- |
| Domain entity (pydantic model, value object) | Construction, validation, serialization | unit | pytest -m unit |
| Entity state machine (lifecycle transitions) | Valid transitions + invalid-state rejection | unit | pytest -m unit |
| Domain invariant (uniqueness, monotonicity, blend arithmetic) | Property-based via Hypothesis | unit | pytest -m unit |
| spectral.core substrate type | Contract test pinning the type’s wire shape | contract | pytest -m contract |
| Producer-owned event payload (<context>.contracts.events.*) | Producer wire-shape test under tests/<context>/contracts/events/ + bilateral round-trip + drift snapshot under tests/contracts/ | contract | pytest -m contract |
| OHS Protocol (<context>.contracts.protocols.*) | Protocol-conformance test under tests/<context>/contracts/protocols/ (structural isinstance against a stub) | contract | pytest -m contract |
| Application use case | Use case orchestration with injected fakes | unit | pytest -m unit |
| Application protocol in application/shared/protocols/ | Protocol conformance (structural) | unit | pytest -m unit |
| Supabase migration / RLS policy | RLS isolation per role, constraint enforcement | integration | pytest -m integration |
| Infrastructure adapter (DB, LLM, notification) | Round-trip against real local service | integration | pytest -m integration |
| Decision-execution phase | Phase behavior with fakes for collaborators (validation, derivation, predicate eval, aggregation) | unit | pytest -m unit |
| Action-module composition root | Codegen output + composition behavior with fakes | unit | pytest -m unit |
| LLM prompt construction | Recorded-response replay | unit | pytest -m unit |
| Full decision-time pipeline | Test-agent backbone: the tax_prep agent against a deployed world | system | pytest -m system |
| API endpoint / router | Request/response contract, auth, validation | contract + unit | pytest -m "contract or unit" |
| Agent conversation flow (World Agent) | Recorded fixture replay | integration | pytest -m integration |
| Frontend component (dashboard / operations) | Playwright NL specs with mocked LLM | e2e | pytest -m e2e |

Invariants are the strongest form of test because they hold for all inputs in the described space, not just hand-picked examples. We use Hypothesis for every invariant we can express.

Required property-based coverage:

| Subject | Invariant |
| --- | --- |
| Rule status machine | Only declared transitions are reachable; every reachable state has a valid predecessor |
| Action-module lifecycle | Same structural guarantees; published modules are immutable (content-hash addressed) |
| Aggregation mode determinism | T1 hard-floor preserved under every implemented and reserved aggregation mode |
| Predicate purity | Generated predicate code admits no I/O, no mutation, no nondeterminism (per ADR-083 D2 AST analysis) |
| Conformity gate | Self-consistency: gate output is deterministic given inputs; gate never contradicts its own prior decision on the same inputs |
| Audit-chain entry shape | Every /decide invocation produces a single audit-chain row carrying the full decision metadata |
| Work-frame contract | forbidden_actions always includes the by-construction anti-patterns (reinterpret_policy, override_spectral); llm_policy_decision: false always present |

Each of these lives as a @given(...) Hypothesis test under the owning package’s tests/unit/ tree. Shrinking failure cases is the point; a flaky Hypothesis test with an unshrunk counter-example is a bug to investigate, not to retry.
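The shape of such an invariant can be illustrated with the standard library alone. The sketch below replaces Hypothesis generation with exhaustive enumeration over a small input space so that it runs without dependencies; in a real test the sequence of requested targets would come from a Hypothesis strategy (for example st.lists(st.sampled_from(...)) under @given). The statuses and transition table are hypothetical and are not the actual rule status machine.

```python
from itertools import product

# Hypothetical statuses and declared transitions (assumptions for illustration).
INITIAL = "draft"
TRANSITIONS = {
    "draft": {"proposed"},
    "proposed": {"active", "draft"},
    "active": {"retired"},
    "retired": set(),
}


def apply(state: str, target: str) -> str:
    """Move to target if the transition is declared; otherwise reject."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"undeclared transition {state} -> {target}")
    return target


def run(targets: tuple[str, ...]) -> list[str]:
    """Replay requested targets, skipping rejected ones; return the visited states."""
    state, visited = INITIAL, [INITIAL]
    for target in targets:
        try:
            state = apply(state, target)
        except ValueError:
            continue
        visited.append(state)
    return visited


def check_only_declared_transitions(max_len: int = 4) -> int:
    """Check every request sequence up to max_len; return how many were checked."""
    checked = 0
    for length in range(max_len + 1):
        for targets in product(TRANSITIONS, repeat=length):
            visited = run(targets)
            for before, after in zip(visited, visited[1:]):
                # The failing triple doubles as a minimal reproduction.
                assert after in TRANSITIONS[before], (before, after, targets)
            checked += 1
    return checked
```

Enumeration only scales to tiny spaces; Hypothesis earns its place by sampling much larger ones and shrinking any failure to a minimal sequence.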


apps/test-agents is the home for the subject agent — tax_prep — whose purpose is to exercise the full decision-time pipeline end-to-end as a real external customer. See Test Agents for the full backbone reference.

The reproducible success-bar gate. The tax_prep agent is one LangGraph orchestrator run two ways over the same graph: a reproducible validation driver that fail-loud asserts a set of full-taxpayer personas (spanning GREEN, YELLOW-held, RED-recovered) against a deployed world, and an interactive shell for live human runs. The validation driver is the standing system: gate.

The agent discovers the deployed action set, routes every consequential judgment to /decide, receives { status, work_frame, decision_metadata }, and honors the four-state outcome — it makes no tax judgment of its own. The interview LLM is pinned to recorded cassettes, so the gate reproduces from a clean database with no LLM credential. Every phase of decision execution has assertions on the intermediate state, and every routed decision record is read back for provability.

This is the one full-pipeline path the decision-time stack is obligated to pass — today a documented, reproducible release gate run against the deployed world.


The World Agent is tested in three passes:

  1. Unit tests on tools. Agent tools are pure functions or thin wrappers over application services. Unit-test them directly. These are the first things to write when adding a tool.
  2. Integration tests on conversation flows with recorded LLM fixtures. Record once against a live LLM, replay deterministically in CI. See LLM fixture recording below. This is the default test for end-to-end conversation behavior.
  3. Manual live-provider recording. Operator-run recording sessions refresh the affected fixtures when prompts, fixtures, or provider/model choices intentionally change. Divergence in replay is a signal to inspect the cassette diff and either accept the intentional refresh or fix the prompt. Never run live LLM tests in PR or scheduled CI: cost, flakiness, and provider-incident blast radius.

Mutation testing verifies that tests actually catch bugs, not just cover lines. We use cosmic-ray for this. It runs nightly, not per-PR.

Mutation scope (nightly):

| Module | Why it’s in scope |
| --- | --- |
| Composition root (action-module codegen) | The final decision layer. A silent regression here produces wrong work frames. |
| Conformity gate | Gates what reaches customers. An unnoticed weakness here lets non-conforming rules through. |
| Implementation-readiness gate | Verifies generated predicate code matches natural-language intent. Silent regression lets divergent code into modules. |
| Aggregation modes | T1 hard-floor preservation across every implemented and reserved mode. Mutation here would silently let T2/T3 rules suppress a T1 match. |

A survivor is a mutant that passes the test suite — meaning a bug was introduced that no test caught. Surviving mutants are tracked as a small standing task; triage each survivor and either add a test that kills it, or document why the mutation is semantically equivalent (rare).
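A toy example of a survivor and the test that kills it. The function below is hypothetical and unrelated to the real aggregation code; the mutant is written out by hand to mirror the kind of comparison-operator change a mutation tool generates.

```python
def t1_floor_holds(tier: int) -> bool:
    """Hypothetical check: only tier 1 is protected by the hard floor."""
    return tier <= 1


def t1_floor_holds_mutant(tier: int) -> bool:
    """The same function after a comparison-operator mutation (<= became <)."""
    return tier < 1


def weak_test(check) -> bool:
    # Probes only tier 3, where original and mutant agree: the mutant survives.
    return check(3) is False


def strong_test(check) -> bool:
    # Probes the boundary, where they differ: the mutant is killed.
    return check(1) is True and check(2) is False
```

Both tests cover the same line, so line coverage cannot tell them apart; only the boundary assertion distinguishes the original from the mutant.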

A target survival rate is not declared up-front — the goal is to drive survivors to zero on the scoped modules. Outside the scoped modules, mutation testing is not run in CI.


The root tests/conftest.py and each package’s tests/conftest.py provide the standard fixtures. Integration-test fixtures hit a dedicated test-Supabase instance whose schema-isolation lifecycle is defined by ADR-045 — per-test rollback inside a shared schema-isolated database, with migration parity to production.

Default names (may vary slightly by package — follow the conftest, not this list):

  • supabase_db — raw connection with per-test rollback
  • as_owner, as_member, as_operator — authenticated role contexts
  • as_service_role — admin / backend access, bypasses RLS
  • as_anon — unauthenticated access
  • ollama_client — local Ollama client for embedding / small-model tests
  • llm_replay — replay-mode LLM client, reads a recorded fixture path

Every domain-scoped table needs integration tests validating the four RLS shapes:

import pytest

pytestmark = pytest.mark.integration


class TestMyTableRLS:
    def test_member_sees_own_domain(self, supabase_db):
        # 1. Create domain + membership as postgres (bypasses RLS)
        # 2. Insert test data
        # 3. Switch to authenticated role with _set_role()
        # 4. Assert data is visible
        ...

    def test_cross_tenant_invisible(self, supabase_db):
        # 1. Create domain + data for a DIFFERENT domain
        # 2. Switch to your test user
        # 3. Assert the other domain's data is NOT visible (zero rows, not "different rows")
        ...

    def test_anon_sees_nothing(self, supabase_db):
        # Switch to anon role, assert zero rows
        ...

    def test_service_role_sees_all(self, supabase_db):
        # Switch to service_role, assert all rows visible
        ...

Smoke-level RLS coverage is required. Adversarial RLS testing is a future item — see future considerations.

Record once against a real provider, then replay deterministically in every subsequent run:

import pytest

from tests.core.llm import FakeLLMProvider, FakeResponse

# Unit / contract tests: deterministic FakeLLMProvider, no recording needed
provider = FakeLLMProvider(source=FakeResponse(text="canned response"))


# Integration tests: cassette replay via @pytest.mark.vcr (see ADR-061 D2)
@pytest.mark.vcr
async def test_my_agent_first_turn(vcr_cassette):
    ...
    result = call_llm(client, "system prompt", "user prompt")

Recording is a deliberate action — never automatic. A recorded fixture is refreshed manually when the prompt, fixture, or provider/model choice intentionally changes.


# Fast loop (unit only)
uv run --all-packages pytest -m unit -q
# PR-equivalent CI pipeline
uv run --all-packages pytest -m "unit or integration or contract"
# Everything including E2E
uv run --all-packages pytest -v
# Coverage report (per-package)
uv run --all-packages pytest --cov --cov-report=term-missing
# Mutation testing (nightly-equivalent, expensive)
cosmic-ray init cosmic-ray.toml session.sqlite
cosmic-ray exec session.sqlite
cosmic-ray dump session.sqlite | cr-report

Or run the tiered pre-push script, which bundles fast checks into the same gate CI uses:

bash tools/dev/precheck.sh # matches CI pre-push gate
bash tools/dev/precheck.sh --install # install as pre-push hook

| Trigger | Layers run | Extras | Timeout |
| --- | --- | --- | --- |
| Pull request | unit + contract + integration | Coverage floors enforced | 10 min |
| Push to main | PR layers + e2e (operator-walkthrough + decision-time critical paths) | Coverage floors enforced | 15 min |
| Nightly (03:00 UTC) | All layers | Mutation testing on scoped modules; no live-provider calls | 60 min |
| Manual (workflow_dispatch) | Configurable | Configurable | 60 min |

Test results are reported via JUnit XML in the Actions job summary. Mutation survivors are filed as open items on the standing mutation-triage issue.


  • If the test breaks during a refactor that preserved behavior: the test was coupled to implementation. Rewrite it to assert on behavior at the boundary, not internal wiring.
  • If the test flakes: treat it as a bug. Root-cause it (timing, ordering, shared state, hidden I/O). Don’t paper over it with @pytest.mark.flaky.
  • If the test is hard to understand: the test is the bug. Rewrite for clarity. Tests are production code.
  • If you can’t articulate what failure the test catches: delete it.

  • CONTRIBUTING.md → Testing (repo root) — short form of the testing rules
  • Epic Template & DoD — integration-test AC requirement per epic
  • Architecture — three-context topology
  • Future considerations: planned hardening items (adversarial RLS, observability stack choice)
  • docs/runbooks/testing.md — operational playbook for the test-Supabase instance, fixture-recording sessions, and CI gate troubleshooting
  • docs/runbooks/llm-testing.md — cassette redaction + manual live-provider recording detail