
Test Agents

apps/test-agents/ is the target home for a single subject agent — tax_prep — whose purpose is to drive the full scan pipeline end-to-end in CI. The prior shape of one agent per workflow archetype was consolidated to a single pluggable agent because ingestion diversity is better expressed at the OTEL emitter level than by maintaining parallel agent codebases.


tax_prep runs a deterministic tax-preparation workflow against a fixed input set with fixed expected outputs. It is the one end-to-end path the scan pipeline must pass on every merge — ingestion through verdict — and the assertion surface for every phase along the way. Determinism is the point: when the pipeline regresses, the failure isolates to the offending phase rather than to subject-agent variance.

The agent lives at apps/test-agents/src/spectral_test_agents/tax_prep/ with shared scaffolding under apps/test-agents/src/spectral_test_agents/shared/. Per ADR-031, spectral_test_agents is the app’s snake-case leaf namespace; subject modules import as from spectral_test_agents.tax_prep import ….


tax_prep is parameterized over an OTEL emitter that varies along two independent axes. Real-world customer agents combine these axes — a customer using LangChain to call Anthropic emits LangChain-shaped spans wrapping an Anthropic-shaped LLM call. The ingestion path has to accept whatever combinations customers throw at it, so the test-agent backbone exercises a covering matrix.

The first axis, framework, is how OTEL spans are produced. Different libraries produce structurally different span trees.

| Emitter class | Framework | What it exercises |
| --- | --- | --- |
| LangChainFormatEmitter | LangChain OTEL conventions | LangChain-style chain / tool / LLM span hierarchy and langchain.* attribute namespace |
| OpenLLMetryFormatEmitter | OpenLLMetry / Traceloop conventions | gen_ai.* semantic conventions plus workflow / task span attribution |
| ManualSdkFormatEmitter | Hand-rolled OTEL SDK calls | Bare OpenTelemetry-spec spans without framework conventions; the floor of what ingestion must accept |

The second axis, vendor, is what the underlying LLM API call looks like. Each framework class accepts a vendor parameter that shapes the LLM-call span attributes:

| Vendor | Span shape exercised |
| --- | --- |
| anthropic | Anthropic Messages-API span attributes; tool-use, stop-reason, gen_ai.system=anthropic |
| openai | OpenAI Responses-API span attributes; function-calls, finish-reason, gen_ai.system=openai |
| raw_otlp | No LLM-vendor-specific attributes; Manual SDK only (degenerate for framework wrappers) |
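To make the vendor parameter concrete, here is a minimal sketch of how a vendor could shape the attributes of an LLM-call span. It uses plain dictionaries rather than the OpenTelemetry SDK, and the class name `ManualSdkFormatEmitterSketch`, the function `llm_span_attributes`, the span name `llm.call`, and the `model` argument are all assumptions made for illustration; only the vendor names and the `gen_ai.system` values come from the table above.

```python
from dataclasses import dataclass

# Vendor names are taken from the vendor table; the rest is illustrative.
VENDORS = ("anthropic", "openai", "raw_otlp")


def llm_span_attributes(vendor: str, model: str) -> dict[str, str]:
    """Return hypothetical LLM-call span attributes for one vendor."""
    if vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {vendor!r}")
    if vendor == "raw_otlp":
        # raw_otlp carries no LLM-vendor-specific attributes at all.
        return {}
    return {"gen_ai.system": vendor, "gen_ai.request.model": model}


@dataclass(frozen=True)
class ManualSdkFormatEmitterSketch:
    """Hypothetical stand-in for an emitter class that accepts a vendor."""

    vendor: str

    def llm_span(self, model: str) -> dict[str, object]:
        # A real emitter would start an OTEL span; a dict keeps the sketch
        # dependency-free.
        return {
            "name": "llm.call",
            "attributes": llm_span_attributes(self.vendor, model),
        }
```

Under these assumptions, `ManualSdkFormatEmitterSketch("raw_otlp").llm_span("any-model")` yields a span with an empty attribute set, which is the floor case the table describes.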
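| Framework | anthropic | openai | raw_otlp |
| --- | --- | --- | --- |
| LangChain | ✓ | ✓ | absent (LangChain without an LLM call is degenerate) |
| OpenLLMetry | ✓ | ✓ | absent (gen_ai.* implies an LLM call) |
| Manual SDK | ✓ | ✓ | ✓ |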

Seven cells. Two cells are intentionally absent: a framework wrapper without an LLM call inside it is a degenerate configuration that real customers do not produce.
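The rule behind the seven cells can be sketched as a filter over the cross product of the two axes. The lowercase framework identifiers (`langchain`, `openllmetry`, `manual_sdk`) and the helper `is_valid_cell` are hypothetical names chosen for this example; the validity rule itself restates the text above.

```python
from itertools import product

# Hypothetical identifiers for the three framework emitters.
FRAMEWORKS = ("langchain", "openllmetry", "manual_sdk")
VENDORS = ("anthropic", "openai", "raw_otlp")


def is_valid_cell(framework: str, vendor: str) -> bool:
    """A framework wrapper without an LLM call is degenerate.

    raw_otlp means "no LLM-vendor attributes", so it is only kept for the
    hand-rolled SDK emitter.
    """
    return vendor != "raw_otlp" or framework == "manual_sdk"


# Cross product of both axes, minus the two degenerate combinations.
FULL_MATRIX = [
    (framework, vendor)
    for framework, vendor in product(FRAMEWORKS, VENDORS)
    if is_valid_cell(framework, vendor)
]

print(len(FULL_MATRIX))  # 7
```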

The full matrix is expensive to run on every push. Coverage runs in tiers:

  • Per-push: diagonal slice (3 cells): LangChain + anthropic, OpenLLMetry + openai, Manual SDK + raw_otlp. Every framework and every vendor appears in at least one cell of each per-push run.
  • Nightly + pre-release: the full 7-cell matrix.

A push-tier failure isolates to a single cell; the diagonal narrows the ingestion bug to its framework × vendor pair. A nightly-only failure indicates a non-diagonal combination regressed without taking down the diagonal — typically a subtle attribute-shape interaction.
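The triage logic described here can be sketched as follows. The diagonal cells are taken from the per-push tier above; the lowercase identifiers, the function `tier_of_failure`, and its return labels are assumptions made for illustration rather than names from the codebase.

```python
# The three per-push cells, written with hypothetical lowercase identifiers.
DIAGONAL = frozenset(
    {
        ("langchain", "anthropic"),
        ("openllmetry", "openai"),
        ("manual_sdk", "raw_otlp"),
    }
)


def tier_of_failure(failing_cells: set[tuple[str, str]]) -> str:
    """Classify a set of failing (framework, vendor) cells.

    Any failing diagonal cell would already have shown up in the per-push
    tier; otherwise only the nightly full-matrix run can have caught it.
    """
    if not failing_cells:
        return "none"
    if failing_cells & DIAGONAL:
        return "per-push"
    return "nightly-only"


# The diagonal touches every value on both axes.
frameworks_covered = {framework for framework, _ in DIAGONAL}
vendors_covered = {vendor for _, vendor in DIAGONAL}
```

For example, `tier_of_failure({("langchain", "openai")})` returns `"nightly-only"`, the case the text attributes to a subtle attribute-shape interaction.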

Emitter selection is a CLI flag / env var on the spectral-tax-prep entrypoint; one process exercises one (framework, vendor) cell at a time. All cells drive the same tax_prep workflow against the same fixed inputs. The scan pipeline ingests each, calibrates, diagnoses, evaluates, optimizes, checks safety, and renders a verdict. If a cell fails, comparing its (framework, vendor) pair against the passing cells localizes the ingestion bug.
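A minimal sketch of what flag-or-env-var selection might look like with `argparse`. The actual flag and variable names are not given in this page, so `--framework`, `--vendor`, `SPECTRAL_EMITTER_FRAMEWORK`, and `SPECTRAL_EMITTER_VENDOR` are hypothetical, as are the lowercase framework identifiers and the `parse_cell` helper.

```python
import argparse
from collections.abc import Mapping, Sequence

# Hypothetical identifiers; only the vendor names appear in the docs.
FRAMEWORKS = ("langchain", "openllmetry", "manual_sdk")
VENDORS = ("anthropic", "openai", "raw_otlp")


def parse_cell(argv: Sequence[str], env: Mapping[str, str]) -> tuple[str, str]:
    """Resolve one (framework, vendor) cell from flags, falling back to env."""
    parser = argparse.ArgumentParser(prog="spectral-tax-prep")
    # Flag and env var names below are assumptions, not the real interface.
    parser.add_argument("--framework", default=env.get("SPECTRAL_EMITTER_FRAMEWORK"))
    parser.add_argument("--vendor", default=env.get("SPECTRAL_EMITTER_VENDOR"))
    args = parser.parse_args(list(argv))

    # Validate explicitly so env-sourced defaults are checked as well.
    if args.framework not in FRAMEWORKS:
        parser.error(f"framework must be one of {FRAMEWORKS}")
    if args.vendor not in VENDORS:
        parser.error(f"vendor must be one of {VENDORS}")
    if args.vendor == "raw_otlp" and args.framework != "manual_sdk":
        parser.error("raw_otlp is only meaningful with manual_sdk")
    return args.framework, args.vendor
```

An entrypoint could call `parse_cell(sys.argv[1:], os.environ)`; an explicit flag takes precedence over the environment variable because the latter is only used as the default.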


apps/test-agents/ is a reference implementation, not a test harness. The agent code is written to demonstrate realistic agent shape and to host failure modes the optimization loop should hunt — not to be a synthetic fixture stripped of agent characteristics. The package is excluded from default pytest runs and from the import graphs of spectral.core, spectral.worlds, and spectral.platform; production code never imports test-agents.
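One way the import boundary could be checked is a static scan of production source for imports of the test-agents namespace. This is a sketch under that assumption, not a description of how the repository enforces the rule; the helper name `forbidden_imports` is hypothetical.

```python
import ast

# Namespace that production code must never import.
FORBIDDEN_ROOT = "spectral_test_agents"


def forbidden_imports(source: str) -> list[str]:
    """Return every absolute import in `source` that targets test-agents."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports cannot
            # reach a different top-level package.
            names = [node.module]
        else:
            continue
        found.extend(
            name
            for name in names
            if name == FORBIDDEN_ROOT or name.startswith(FORBIDDEN_ROOT + ".")
        )
    return found
```

A guard test could read each file under the production packages and assert that `forbidden_imports(text)` is empty.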

Test composition that automates the agent’s exercise (recorded traces, fixture replay, system runs) is acceptable but secondary — the primary value is a working agent that operators and engineers can read, run, and reason about.


# Install dependencies from the repo root
uv sync --all-packages
# Unit tests (single subject, collaborators stubbed)
uv run --package spectral-test-agents pytest tests/ -m unit
# Integration tests (recorded traces, deterministic replay)
uv run --package spectral-test-agents pytest tests/ -m integration
# System tests (live agent runs through the full optimization loop)
uv run --package spectral-test-agents pytest tests/ -m system

The package adds a local system marker for live full-loop runs alongside the unit and integration markers; the canonical layer ladder (unit, contract, integration, e2e, live_drift) lives in Testing.
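For readers unfamiliar with custom markers, the following sketch shows one common way to register them in a `conftest.py` using pytest's `pytest_configure` hook. Whether this package registers its markers in `conftest.py` or in project configuration is not stated here, and the marker descriptions are paraphrased from the command comments above.

```python
# conftest.py (sketch): register the three markers used by the commands above.
MARKERS = {
    "unit": "single subject, collaborators stubbed",
    "integration": "recorded traces, deterministic replay",
    "system": "live agent runs through the full optimization loop",
}


def pytest_configure(config) -> None:
    """pytest hook: declare markers so `-m unit` etc. select known names."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

Registering markers this way lets pytest flag unknown marker names when it is run with `--strict-markers`.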


tax_prep maps to a dedicated local-dev workspace with its own API key for trace ingestion. The workspace ID and key are seeded by the local development environment (tools/dev/start.sh). Each emitter configuration uses the same workspace; ingestion routing distinguishes them by span attributes, not by workspace.