Testing
Testing strategy for spectral.core, spectral.worlds, and spectral.platform. Every
agent-written test follows this page. The short form is in CONTRIBUTING.md → Testing in the repo root; this is the full reference for what test lives where, what to assert, and what CI enforces.
Testing principles
Non-negotiable rules. Every other decision on this page descends from these.
- Test behavior, not implementation. Tests assert what the system does, not how it does it internally. A refactor that preserves behavior should not break tests.
- Every test must have a reason to exist. If you can’t articulate what production failure a test catches, delete it. “It increases coverage” is not a reason.
- Test at the right boundary. Unit for pure logic; property-based for invariants; integration for cross-collaborator interactions; contract for external surfaces; E2E for user-visible paths. Don’t test internal wiring — test inputs and outputs at meaningful boundaries.
- Failing tests must be actionable. A test failure names what broke and where without requiring the developer to debug the test itself.
- Tests are production code. Same quality bar: clear names, no duplication, no dead tests. A flaky test is a bug.
- Integration tests hit real infrastructure. No DB mocks. Past incident: mocked tests passed while the prod migration broke. See CONTRIBUTING.md → Testing in the repo root.
- Fewer focused tests beat many shallow ones. Ten scenarios that matter beat a hundred that don’t.
Strategy per layer
Every test file declares one layer via pytestmark. The root conftest.py enforces the marker:
unmarked tests block the suite.
| Layer | Primary strategy | Notes |
|---|---|---|
| Domain (all three contexts) | Unit + property-based for invariants | State machines, statistical uniqueness, bootstrap-CI properties, blend arithmetic — via Hypothesis. |
| Application | Mock at service-abstraction boundaries; fakes preferred over mocks | Only across protocols declared in application/shared/protocols/; never internal collaborators. |
| Infrastructure | Integration tests against real Supabase + pgvector + Ollama | No DB or LLM mocking at this layer. Past incident makes this non-negotiable. |
| API / workers | Contract tests against OpenAPI; E2E on critical paths | E2E covers the operator-walkthrough + decision-time tax_prep paths. |
| Test agent (apps/test-agents, tax_prep) | Decision-time E2E backbone: the reproducible success-bar gate, a documented release gate run against a deployed world (not in per-push CI) | See Test Agents for the LangGraph agent, its system: validation gate, and the interactive shell. |
| Agent workflows (World Agent) | Unit-test tools; integration-test conversation flows with recorded LLM fixtures; manual live-provider recording only when fixture inputs intentionally change | Live provider calls never run in PR CI or scheduled CI. |
Mocks vs fakes. Hand-written in-memory implementations survive refactors; MagicMock does
not. Mock only across shared/protocols/; do not mock internal collaborators within the same
layer.
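A minimal sketch of the fake-over-mock preference. It assumes a hypothetical NotificationSender protocol and NotifyOwner use case; neither name comes from this codebase. The fake is a hand-written in-memory implementation of the protocol, so the test asserts on the observable outcome rather than on call signatures.

```python
from dataclasses import dataclass, field
from typing import Protocol


class NotificationSender(Protocol):
    """Hypothetical protocol, standing in for one declared in shared/protocols/."""

    def send(self, recipient: str, body: str) -> None: ...


@dataclass
class InMemoryNotificationSender:
    """Hand-written fake: records what was sent instead of doing I/O."""

    sent: list[tuple[str, str]] = field(default_factory=list)

    def send(self, recipient: str, body: str) -> None:
        self.sent.append((recipient, body))


@dataclass
class NotifyOwner:
    """Hypothetical use case that depends only on the protocol."""

    sender: NotificationSender

    def execute(self, owner: str, rule_id: str) -> None:
        self.sender.send(owner, f"Rule {rule_id} was published")


def test_owner_is_notified_on_publish() -> None:
    fake = InMemoryNotificationSender()
    NotifyOwner(sender=fake).execute(owner="owner@example.com", rule_id="r-1")
    # Assert on the outcome, not on how send() was invoked internally.
    assert fake.sent == [("owner@example.com", "Rule r-1 was published")]
```

If NotifyOwner is later refactored to batch or reorder its internal calls, this test keeps passing as long as the same message reaches the same recipient, which is the property that matters.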
Property-tested domain invariants. Rule status machine, action-module lifecycle, aggregation-mode determinism (T1 hard-floor preserved under every mode), conformity-gate self-consistency, predicate-purity invariants (no I/O, no nondeterminism), audit-chain entry shape, work-frame contract completeness.
Test-agent role. The tax_prep agent run against a deployed world — its reproducible system:
validation gate per Test Agents — is the canonical full-decision-time-pipeline
assertion.
Coverage floors
CI-enforced from commit one. These are floors, not ceilings. A layer above its floor is not an invitation to stop writing tests; a layer below the floor blocks merge.
| Layer | Floor | Rationale |
|---|---|---|
| Domain | ≥ 90% | Pure business logic with no I/O. There is no reason not to cover it. Domain bugs are the most expensive to miss because they propagate into everything above. |
| Application | ≥ 80% | Orchestration logic. Slightly lower floor because some branches exist purely to translate domain results into API-layer concerns, and those are already covered by API-layer tests. |
| Infrastructure | ≥ 60% | Adapters to external systems. Exhaustive coverage is uneconomic (much of the work is already done by the upstream library), but the contract-facing edges and error paths must be covered. |
Coverage is measured per package via pytest-cov and reported to CI. A PR that drops any package
below its floor fails the coverage job before tests finish.
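The floor check can be sketched as a small comparison. This is an illustration only: the function name is hypothetical, and how the real coverage job maps packages to layers and reads percentages out of pytest-cov is not described on this page.

```python
# Floors from the table above, in percent.
FLOORS = {"domain": 90.0, "application": 80.0, "infrastructure": 60.0}


def layers_below_floor(measured: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Return {layer: (measured, floor)} for every layer under its floor.

    A layer exactly at its floor passes, matching the ">=" in the table.
    """
    return {
        layer: (value, FLOORS[layer])
        for layer, value in measured.items()
        if layer in FLOORS and value < FLOORS[layer]
    }


failures = layers_below_floor(
    {"domain": 92.4, "application": 78.9, "infrastructure": 60.0}
)
for layer, (value, floor) in failures.items():
    print(f"{layer}: {value:.1f}% is below the {floor:.0f}% floor")
```

With the sample numbers only the application layer is reported; infrastructure sits exactly on its floor and passes.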
Test layer markers
| Marker | What it tests | Typical latency | Network | Runs in CI |
|---|---|---|---|---|
| unit | One subject, collaborators stubbed or faked | < 1s | None | PR, merge, nightly |
| contract | Consumer-facing agreement remains stable (OpenAPI, event payloads in <context>.contracts.events.*, OHS Protocols in <context>.contracts.protocols.*, spectral.core substrate types) | < 1s | None | PR, merge, nightly |
| integration | Two+ subjects interact correctly against real infra (DB / pgvector / cassette LLM) | < 5s | Local | PR, merge, nightly |
| e2e | Walkthrough paths end-to-end | 5–30s | Local | Merge, nightly |

```python
import pytest

pytestmark = pytest.mark.unit  # or contract, integration, e2e
```

The root tests/conftest.py rejects any file missing one of the four primary markers (unit,
contract, integration, e2e).
LLM test posture
ADR-061 defines three tiers:
- Unit / contract: FakeLLMProvider (in tests.core.llm) implements the LLMProvider protocol; deterministic; zero external calls.
- Integration: pytest-recording per-test cassettes at tests/core/llm/cassettes/<rel-test-path>/<test-id>.yaml. Replay is byte-perfect deterministic.
- Manual live recording: operator-run recording sessions refresh only the affected cassettes when prompts, fixtures, or model/provider choices intentionally change. Similarity helper (assert_llm_output_similar, 0.85 default per ADR-061 D5) is available for tests that need a paraphrase-tolerant assertion within the suite; cassette refresh itself is reviewed by inspecting git diff.
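To make the idea of a paraphrase-tolerant assertion concrete, here is a stand-in built on the standard library's difflib. It is not the project's assert_llm_output_similar: that helper's metric is not described on this page, so the token-level ratio and the function names below are assumptions. Only the 0.85 default comes from the text above.

```python
from difflib import SequenceMatcher


def similarity(expected: str, actual: str) -> float:
    """Token-level similarity in [0, 1], ignoring case and extra whitespace."""
    expected_tokens = expected.lower().split()
    actual_tokens = actual.lower().split()
    return SequenceMatcher(None, expected_tokens, actual_tokens).ratio()


def assert_similar(expected: str, actual: str, threshold: float = 0.85) -> None:
    """Fail with the measured score when the texts diverge too far."""
    score = similarity(expected, actual)
    assert score >= threshold, (
        f"similarity {score:.2f} is below threshold {threshold:.2f}"
    )


# Case and spacing differences do not count against the score.
assert_similar("The rule is active", "the rule   is ACTIVE")
```

A lexical ratio like this tolerates reordering and small edits but not true paraphrase; an embedding-based comparison would be a different design with different failure modes.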
Cassette recording sessions: uv run pytest <path> --record-mode=once. Always review the
cassette diff before commit; redaction at recording time strips known-sensitive headers, but
custom fields can leak through. The tools/quality/check_cassette_redaction.py lint blocks
Authorization: Bearer ... patterns and a broad set of provider key formats per
ADR-061 D8. Detailed playbook in docs/runbooks/llm-testing.md.
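A redaction lint of this kind can be sketched with two regular expressions. This is an illustration, not tools/quality/check_cassette_redaction.py: the pattern set, labels, and function name are assumptions, and the real lint covers a broader set of key formats. The bearer pattern spans line breaks because YAML cassettes typically put the header value on the line after the header name.

```python
import re

# Illustrative patterns only.
LEAK_PATTERNS = {
    "bearer-token": re.compile(
        r"Authorization:\s*(?:-\s*)?Bearer\s+\S+", re.IGNORECASE
    ),
    "sk-style-key": re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"),
}


def find_leaks(cassette_text: str) -> list[tuple[int, str]]:
    """Return sorted (line_number, pattern_label) pairs for every match.

    The matched text is deliberately not returned, so the lint output
    does not repeat the secret it found.
    """
    hits = []
    for label, pattern in LEAK_PATTERNS.items():
        for match in pattern.finditer(cassette_text):
            line_number = cassette_text.count("\n", 0, match.start()) + 1
            hits.append((line_number, label))
    return sorted(hits)


clean = (
    "interactions:\n"
    "- request:\n"
    "    headers:\n"
    "      Content-Type:\n"
    "      - application/json\n"
)
leaky = clean + "      Authorization:\n      - Bearer abc123\n"
print(find_leaks(leaky))
```

A lint like this complements, rather than replaces, reviewing the cassette diff: it only catches formats it already knows about.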
Mock-first PR CI; live secrets gated to non-PR triggers
ADR-062 sets the policy:
- Default PR CI: unit + contract + integration with FakeLLMProvider + cassettes; no external service calls; mock-first by default.
- Live-secret runs are gated to trusted non-PR triggers. Fork PRs never trigger live-secret workflows. pull_request_target is not used.
- GitHub Environments scope secrets: staging and production. See docs/runbooks/ci-secrets.md.
Bilateral contract tests
Events that flow between contexts are pinned by bilateral contract tests under
tests/contracts/ (per ADR-065 D6 and ADR-066). This directory is the only place in the codebase
exempt from the import discipline that prevents worlds and platform from importing each other
(the validator’s tests-contracts-exempt rule): bilateral tests legitimately import both the
producer’s typed payload (from <producer>.contracts.events.*) and the consumer’s local model (from the consuming flow).
The pattern, demonstrated by tests/contracts/test_world_model_card_published.py:

```python
"""Bilateral contract test for worlds.world_model_card.published."""

from pydantic import BaseModel, ConfigDict

from spectral.worlds.contracts.events.world_model_card_published import (
    WorldModelCardPublishedPayload,
)


# Consumer-narrow local model: declares only the fields platform's System
# Card projection intake actually needs. In production, this lives with the
# consuming flow (e.g. spectral.platform.system_card.intake.card_published_event).
class WorldModelCardPublishedEvent(BaseModel):
    model_config = ConfigDict(frozen=True)

    org_id: str
    domain_id: str
    world_model_version: int
    authority_summary: str
    provenance_summary: dict[str, int]


def test_consumer_parses_producer_emit_shape() -> None:
    """Round-trip invariant between contexts per ADR-065 D4."""
    producer = WorldModelCardPublishedPayload(...)  # producer-rich
    wire = producer.model_dump(mode="json")
    consumer = WorldModelCardPublishedEvent.model_validate(wire)
    assert consumer.world_model_version == producer.world_model_version
    # Consumer-narrow: producer-only fields silently dropped, consumer
    # never depends on them.
```

Two complementary tests per event between contexts:
- Round-trip: verify the consumer’s local <EventName>Event model parses the producer’s model_dump(mode="json") output. Catches structural mismatch at PR time.
- Schema-drift snapshot (per ADR-066, syrupy): once syrupy is wired in as a dev-dep, snapshot the producer’s model_json_schema(). First run creates the baseline (pytest --snapshot-update); subsequent runs detect drift. Intentional changes update the snapshot in the same commit; reviewer validates intent from the diff.
Snapshot first-run discipline: the first run of a new contract test creates the
syrupy baseline. The author runs pytest --snapshot-update once and commits both the test file and the generated __snapshots__/ directory; subsequent runs verify against the committed baseline. Until syrupy lands, the round-trip test alone is the load-bearing check.
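The snapshot discipline can be illustrated without syrupy. The sketch below is a hand-rolled stand-in, not syrupy's API: the function name, the `update` flag, and the sample schema are assumptions. It shows the same three outcomes: no baseline yet, baseline matches, baseline drifted.

```python
import json
from pathlib import Path


def check_schema_snapshot(schema: dict, snapshot_path: Path, update: bool = False) -> bool:
    """Compare a JSON schema against a committed baseline file.

    Returns True when the schema matches the baseline. With update=True the
    baseline is (re)written and True is returned. A missing baseline without
    update=True is reported as a failure, so a baseline is never created
    silently.
    """
    rendered = json.dumps(schema, indent=2, sort_keys=True) + "\n"
    if update:
        snapshot_path.parent.mkdir(parents=True, exist_ok=True)
        snapshot_path.write_text(rendered, encoding="utf-8")
        return True
    if not snapshot_path.exists():
        return False
    return snapshot_path.read_text(encoding="utf-8") == rendered


# Hypothetical schema fragment standing in for model_json_schema() output.
baseline_schema = {
    "title": "ExamplePayload",
    "type": "object",
    "properties": {"world_model_version": {"type": "integer"}},
    "required": ["world_model_version"],
}
# A drifted version: the field's type changed from integer to string.
drifted_schema = {
    **baseline_schema,
    "properties": {"world_model_version": {"type": "string"}},
}
```

Serializing with sorted keys keeps the committed file stable across runs, so the diff a reviewer sees contains only real schema changes.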
See Events and Protocols for the catalogs of existing events and Protocols. New events or Protocols land alongside their owning epic and the catalog page is updated by hand at the same time.
Lookup table: I’m changing X, run Y
| What you’re changing | Run these tests | Marker | Command |
|---|---|---|---|
| Domain entity (pydantic model, value object) | Construction, validation, serialization | unit | pytest -m unit |
| Entity state machine (lifecycle transitions) | Valid transitions + invalid-state rejection | unit | pytest -m unit |
| Domain invariant (uniqueness, monotonicity, blend arithmetic) | Property-based via Hypothesis | unit | pytest -m unit |
| spectral.core substrate type | Contract test pinning the type’s wire shape | contract | pytest -m contract |
| Producer-owned event payload (<context>.contracts.events.*) | Producer wire-shape test under tests/<context>/contracts/events/ + bilateral round-trip + drift snapshot under tests/contracts/ | contract | pytest -m contract |
| OHS Protocol (<context>.contracts.protocols.*) | Protocol-conformance test under tests/<context>/contracts/protocols/ (structural isinstance against a stub) | contract | pytest -m contract |
| Application use case | Use case orchestration with injected fakes | unit | pytest -m unit |
| Application protocol in application/shared/protocols/ | Protocol conformance (structural) | unit | pytest -m unit |
| Supabase migration / RLS policy | RLS isolation per role, constraint enforcement | integration | pytest -m integration |
| Infrastructure adapter (DB, LLM, notification) | Round-trip against real local service | integration | pytest -m integration |
| Decision-execution phase | Phase behavior with fakes for collaborators (validation, derivation, predicate eval, aggregation) | unit | pytest -m unit |
| Action-module composition root | Codegen output + composition behavior with fakes | unit | pytest -m unit |
| LLM prompt construction | Recorded-response replay | unit | pytest -m unit |
| Full decision-time pipeline | Test-agent backbone — the tax_prep agent against a deployed world | system | pytest -m system |
| API endpoint / router | Request/response contract, auth, validation | contract + unit | pytest -m "contract or unit" |
| Agent conversation flow (World Agent) | Recorded fixture replay | integration | pytest -m integration |
| Frontend component (dashboard / operations) | Playwright NL specs with mocked LLM | e2e | pytest -m e2e |
Property-based testing with Hypothesis
Invariants are the strongest form of test because they hold for all inputs in the described space, not just hand-picked examples. We use Hypothesis for every invariant we can express.
Required property-based coverage:
| Subject | Invariant |
|---|---|
| Rule status machine | Only declared transitions are reachable; every reachable state has a valid predecessor |
| Action-module lifecycle | Same structural guarantees; published modules are immutable (content-hash addressed) |
| Aggregation mode determinism | T1 hard-floor preserved under every implemented and reserved aggregation mode |
| Predicate purity | Generated predicate code admits no I/O, no mutation, no nondeterminism (per ADR-083 D2 AST analysis) |
| Conformity gate | Self-consistency — gate output is deterministic given inputs; gate never contradicts its own prior decision on the same inputs |
| Audit-chain entry shape | Every /decide invocation produces a single audit-chain row carrying the full decision metadata |
| Work-frame contract | forbidden_actions always includes the by-construction anti-patterns (reinterpret_policy, override_spectral); llm_policy_decision: false always present |
Each of these lives as a @given(...) Hypothesis test under the owning package’s tests/unit/
tree. Shrinking failure cases is the point; a flaky Hypothesis test with an unshrunk
counter-example is a bug to investigate, not to retry.
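A minimal sketch of what such a @given test can look like for a status machine. The three states and the transition table are hypothetical and the real rule status machine may differ; only the Hypothesis calls (`given`, `strategies.lists`, `strategies.sampled_from`) are the library's actual API.

```python
from enum import Enum

from hypothesis import given
from hypothesis import strategies as st


class RuleStatus(Enum):
    DRAFT = "draft"
    ACTIVE = "active"
    RETIRED = "retired"


# Hypothetical transition table, for illustration only.
ALLOWED = {
    RuleStatus.DRAFT: {RuleStatus.ACTIVE, RuleStatus.RETIRED},
    RuleStatus.ACTIVE: {RuleStatus.RETIRED},
    RuleStatus.RETIRED: set(),
}


class InvalidTransition(Exception):
    pass


def transition(current: RuleStatus, target: RuleStatus) -> RuleStatus:
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target


def replay(targets: list[RuleStatus]) -> list[RuleStatus]:
    """Apply each requested transition, skipping rejected ones; return visited states."""
    visited = [RuleStatus.DRAFT]
    for target in targets:
        try:
            visited.append(transition(visited[-1], target))
        except InvalidTransition:
            continue
    return visited


@given(st.lists(st.sampled_from(list(RuleStatus)), max_size=30))
def test_status_machine_invariants(targets: list[RuleStatus]) -> None:
    visited = replay(targets)
    # Every state after the first was reached through a declared transition.
    for previous, current in zip(visited, visited[1:]):
        assert current in ALLOWED[previous]
    # RETIRED is terminal: nothing may follow its first occurrence.
    if RuleStatus.RETIRED in visited:
        assert visited.index(RuleStatus.RETIRED) == len(visited) - 1
```

Hypothesis generates arbitrary request sequences, including ones that try to leave RETIRED or jump backwards, and shrinks any failing sequence to a short counter-example.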
The test-agent decision-time backbone
apps/test-agents is the home for the subject agent, tax_prep, whose purpose is to exercise
the full decision-time pipeline end-to-end as a real external customer. See
Test Agents for the full backbone reference.
The reproducible success-bar gate. The tax_prep agent is one LangGraph orchestrator run two
ways over the same graph: a reproducible validation driver that fail-loud asserts a set of
full-taxpayer personas (spanning GREEN, YELLOW-held, RED-recovered) against a deployed world, and an
interactive shell for live human runs. The validation driver is the standing system: gate.
The agent discovers the deployed action set, routes every consequential judgment to /decide,
receives { status, work_frame, decision_metadata }, and honors the four-state outcome — it makes no
tax judgment of its own. The interview LLM is pinned to recorded cassettes, so the gate reproduces
from a clean database with no LLM credential. Every phase of
decision execution has assertions on the
intermediate state, and every routed decision record is read back for provability.
This is the one full-pipeline path the decision-time stack is obligated to pass — today a documented, reproducible release gate run against the deployed world.
Agent workflow testing
The World Agent is tested in three passes:
- Unit tests on tools. Agent tools are pure functions or thin wrappers over application services. Unit-test them directly. These are the first things to write when adding a tool.
- Integration tests on conversation flows with recorded LLM fixtures. Record once against a live LLM, replay deterministically in CI. See LLM fixture recording below. This is the default test for end-to-end conversation behavior.
- Manual live-provider recording. Operator-run recording sessions refresh the affected fixtures when prompts, fixtures, or provider/model choices intentionally change. Divergence in replay is a signal to inspect the cassette diff and either accept the intentional refresh or fix the prompt. Never run live LLM tests in PR or scheduled CI: cost, flakiness, and provider-incident blast radius.
Mutation testing
Mutation testing verifies that tests actually catch bugs, not just cover lines. We use cosmic-ray for this. It runs nightly, not per-PR.
Mutation scope (nightly):
| Module | Why it’s in scope |
|---|---|
| Composition root (action-module codegen) | The final decision layer. A silent regression here produces wrong work frames. |
| Conformity gate | Gates what reaches customers. An unnoticed weakness here lets non-conforming rules through. |
| Implementation-readiness gate | Verifies generated predicate code matches natural-language intent. Silent regression lets divergent code into modules. |
| Aggregation modes | T1 hard-floor preservation across every implemented and reserved mode. Mutation here would silently let T2/T3 rules suppress a T1 match. |
A survivor is a mutant that passes the test suite — meaning a bug was introduced that no test caught. Surviving mutants are tracked as a small standing task; triage each survivor and either add a test that kills it, or document why the mutation is semantically equivalent (rare).
A target survival rate is not declared up-front — the goal is to drive survivors to zero on the scoped modules. Outside the scoped modules, mutation testing is not run in CI.
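To make "survivor" concrete, the sketch below writes out by hand the kind of mutant a mutation tool generates and runs two tests against both versions. The gate function is hypothetical and is not one of the scoped modules.

```python
from typing import Callable

Gate = Callable[[float, float], bool]


def meets_floor(score: float, floor: float) -> bool:
    """Hypothetical gate: a score passes when it is at or above the floor."""
    return score >= floor


def meets_floor_mutant(score: float, floor: float) -> bool:
    """The same gate with `>=` mutated to `>`."""
    return score > floor


def weak_test(gate: Gate) -> bool:
    # Executes every line of the gate, yet never probes the boundary.
    return gate(0.9, 0.5) is True and gate(0.1, 0.5) is False


def boundary_test(gate: Gate) -> bool:
    # Pins the boundary itself, the only input where the two versions differ.
    return gate(0.5, 0.5) is True


print("weak test passes on mutant (survivor):", weak_test(meets_floor_mutant))
print("boundary test passes on mutant:", boundary_test(meets_floor_mutant))
```

The weak test gives full line coverage and still lets the mutant survive; adding the boundary test kills it. Triage of a real survivor follows the same shape: find the input where the mutant and the original disagree, then assert on it.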
Conftest fixtures
The root tests/conftest.py and each package’s tests/conftest.py provide the standard fixtures.
Integration-test fixtures hit a dedicated test-Supabase instance whose schema-isolation lifecycle
is defined by ADR-045 — per-test rollback inside
a shared schema-isolated database, with migration parity to production.
Default names (may vary slightly by package; follow the conftest, not this list):
- supabase_db: raw connection with per-test rollback
- as_owner, as_member, as_operator: authenticated role contexts
- as_service_role: admin / backend access, bypasses RLS
- as_anon: unauthenticated access
- ollama_client: local Ollama client for embedding / small-model tests
- llm_replay: replay-mode LLM client, reads a recorded fixture path
RLS integration test pattern
Every domain-scoped table needs integration tests validating the four RLS shapes:

```python
import pytest

pytestmark = pytest.mark.integration


class TestMyTableRLS:
    def test_member_sees_own_domain(self, supabase_db):
        # 1. Create domain + membership as postgres (bypasses RLS)
        # 2. Insert test data
        # 3. Switch to authenticated role with _set_role()
        # 4. Assert data is visible
        ...

    def test_cross_tenant_invisible(self, supabase_db):
        # 1. Create domain + data for a DIFFERENT domain
        # 2. Switch to your test user
        # 3. Assert the other domain's data is NOT visible (zero rows, not "different rows")
        ...

    def test_anon_sees_nothing(self, supabase_db):
        # Switch to anon role, assert zero rows
        ...

    def test_service_role_sees_all(self, supabase_db):
        # Switch to service_role, assert all rows visible
        ...
```

Smoke-level RLS coverage is required. Adversarial RLS testing is a future item; see future considerations.
LLM fixture recording
Record once against a real provider, then replay deterministically in every subsequent run:

```python
import pytest

from tests.core.llm import FakeLLMProvider, FakeResponse

# Unit / contract tests: deterministic FakeLLMProvider, no recording needed
provider = FakeLLMProvider(source=FakeResponse(text="canned response"))


# Integration tests: cassette replay via @pytest.mark.vcr (see ADR-061 D2)
@pytest.mark.vcr
async def test_my_agent_first_turn(vcr_cassette):
    ...
    result = call_llm(client, "system prompt", "user prompt")
```

Recording is a deliberate action, never automatic. A recorded fixture is refreshed manually when the prompt, fixture, or provider/model choice intentionally changes.
Developer commands

```bash
# Fast loop (unit only)
uv run --all-packages pytest -m unit -q

# PR-equivalent CI pipeline
uv run --all-packages pytest -m "unit or integration or contract"

# Everything including E2E
uv run --all-packages pytest -v

# Coverage report (per-package)
uv run --all-packages pytest --cov --cov-report=term-missing

# Mutation testing (nightly-equivalent, expensive)
cosmic-ray init cosmic-ray.toml session.sqlite
cosmic-ray exec session.sqlite
cosmic-ray dump session.sqlite | cr-report
```

Or run the tiered pre-push script, which bundles fast checks into the same gate CI uses:

```bash
bash tools/dev/precheck.sh            # matches CI pre-push gate
bash tools/dev/precheck.sh --install  # install as pre-push hook
```

CI pipeline
| Trigger | Layers run | Extras | Timeout |
|---|---|---|---|
| Pull request | unit + contract + integration | Coverage floors enforced | 10 min |
| Push to main | + e2e (operator-walkthrough + decision-time critical paths) | Coverage floors enforced | 15 min |
| Nightly (03:00 UTC) | All layers | Mutation testing on scoped modules; no live-provider calls | 60 min |
| Manual (workflow_dispatch) | Configurable | Configurable | 60 min |
Test results are reported via JUnit XML in the Actions job summary. Mutation survivors are filed as open items on the standing mutation-triage issue.
When a test is wrong
- If the test breaks during a refactor that preserved behavior: the test was coupled to implementation. Rewrite it to assert on behavior at the boundary, not internal wiring.
- If the test flakes: treat as a bug. Root-cause it (timing, ordering, shared state, hidden I/O). Don’t @pytest.mark.flaky.
- If the test is hard to understand: the test is the bug. Rewrite for clarity. Tests are production code.
- If you can’t articulate what failure the test catches: delete it.
Related reading
- CONTRIBUTING.md → Testing (repo root): short form of the testing rules
- Epic Template & DoD: integration-test AC requirement per epic
- Architecture: three-context topology
- Future considerations: future hardening items (adversarial RLS, observability stack choice)
- docs/runbooks/testing.md: operational playbook for the test-Supabase instance, fixture-recording sessions, and CI gate troubleshooting
- docs/runbooks/llm-testing.md: cassette redaction + manual live-provider recording detail