Test Agents
apps/test-agents/ is the target home for a single subject agent — tax_prep — whose purpose
is to drive the full scan pipeline end-to-end in CI. The prior shape of one agent per workflow
archetype was consolidated to a single pluggable agent because ingestion diversity is better
expressed at the OTEL emitter level than by maintaining parallel agent codebases.
The tax_prep agent
tax_prep runs a deterministic tax-preparation workflow against a fixed input set with fixed
expected outputs. It is the one end-to-end path the scan pipeline must pass on every merge —
ingestion through verdict — and the assertion surface for every phase along the way. Determinism
is the point: when the pipeline regresses, the failure isolates to the offending phase rather than
to subject-agent variance.
The agent lives at apps/test-agents/src/spectral_test_agents/tax_prep/ with shared scaffolding
under apps/test-agents/src/spectral_test_agents/shared/. Per
ADR-031, spectral_test_agents is the
app’s snake-case leaf namespace; subject modules import as
from spectral_test_agents.tax_prep import ….
Pluggable OTEL emitter
tax_prep is parameterized over an OTEL emitter that varies along two independent axes.
Real-world customer agents combine these axes — a customer using LangChain to call Anthropic
emits LangChain-shaped spans wrapping an Anthropic-shaped LLM call. The ingestion path has to
accept whatever combinations customers throw at it, so the test-agent backbone exercises a
covering matrix.
Axis A — Instrumentation framework
How OTEL spans are produced. Different libraries produce structurally different span trees.
| Emitter class | Framework | What it exercises |
|---|---|---|
| LangChainFormatEmitter | LangChain OTEL conventions | LangChain-style chain / tool / LLM span hierarchy and langchain.* attribute namespace |
| OpenLLMetryFormatEmitter | OpenLLMetry / Traceloop conventions | gen_ai.* semantic conventions plus workflow / task span attribution |
| ManualSdkFormatEmitter | Hand-rolled OTEL SDK calls | Bare OpenTelemetry-spec spans without framework conventions: the floor of what ingestion must accept |
Axis B — LLM-vendor span shape
What the underlying LLM API call looks like. Each framework class accepts a vendor parameter
that shapes the LLM-call span attributes:
| Vendor | Span shape exercised |
|---|---|
| anthropic | Anthropic Messages-API span attributes; tool-use, stop-reason, gen_ai.system=anthropic |
| openai | OpenAI Responses-API span attributes; function-calls, finish-reason, gen_ai.system=openai |
| raw_otlp | No LLM-vendor-specific attributes; Manual SDK only (degenerate for framework wrappers) |
Coverage matrix
| | anthropic | openai | raw_otlp |
|---|---|---|---|
| LangChain | ✅ | ✅ | absent (LangChain without an LLM call is degenerate) |
| OpenLLMetry | ✅ | ✅ | absent (gen_ai.* implies an LLM call) |
| Manual SDK | ✅ | ✅ | ✅ |
Seven cells. Two cells are intentionally absent: a framework wrapper without an LLM call inside it is a degenerate configuration that real customers do not produce.
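The matrix above can be derived mechanically rather than listed by hand. The following is a minimal sketch, assuming lowercase string identifiers for the frameworks (`langchain`, `openllmetry`, `manual_sdk`); those identifiers and the function names are hypothetical and not taken from the package.

```python
from itertools import product

# Hypothetical identifiers for the two axes; the real package may name them differently.
FRAMEWORKS = ("langchain", "openllmetry", "manual_sdk")
VENDORS = ("anthropic", "openai", "raw_otlp")


def is_degenerate(framework: str, vendor: str) -> bool:
    """A framework wrapper with no LLM call inside it is excluded from the matrix."""
    return vendor == "raw_otlp" and framework != "manual_sdk"


def coverage_matrix() -> list[tuple[str, str]]:
    """Return every (framework, vendor) cell that is not degenerate."""
    return [
        (framework, vendor)
        for framework, vendor in product(FRAMEWORKS, VENDORS)
        if not is_degenerate(framework, vendor)
    ]


if __name__ == "__main__":
    for cell in coverage_matrix():
        print(cell)
```

Encoding the exclusion as a predicate keeps the "two absent cells" rule in one place if a new framework or vendor is added to either axis.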
CI coverage policy
The full matrix is expensive to run on every push. Coverage runs in tiers:
- Per-push: diagonal slice (3 cells): LangChain + anthropic, OpenLLMetry + openai, Manual SDK + raw_otlp. Every framework and every vendor shape is covered by at least one cell on every merge.
- Nightly + pre-release: the full 7-cell matrix.
A push-tier failure isolates to a single cell; the diagonal narrows the ingestion bug to its framework × vendor pair. A nightly-only failure indicates a non-diagonal combination regressed without taking down the diagonal — typically a subtle attribute-shape interaction.
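The tiering policy can be expressed as a small lookup plus a sanity check that the per-push slice touches every value on both axes. This is an illustrative sketch only: the tier names (`push`, `nightly`, `pre-release`), the lowercase cell identifiers, and the function names are assumptions, not the CI configuration itself.

```python
# Hypothetical cell identifiers; each tuple is (framework, vendor).
FULL_MATRIX = [
    ("langchain", "anthropic"),
    ("langchain", "openai"),
    ("openllmetry", "anthropic"),
    ("openllmetry", "openai"),
    ("manual_sdk", "anthropic"),
    ("manual_sdk", "openai"),
    ("manual_sdk", "raw_otlp"),
]

# The diagonal slice that runs on every push.
PUSH_DIAGONAL = [
    ("langchain", "anthropic"),
    ("openllmetry", "openai"),
    ("manual_sdk", "raw_otlp"),
]


def cells_for_tier(tier: str) -> list[tuple[str, str]]:
    """Map a CI tier name to the cells it should run."""
    if tier == "push":
        return list(PUSH_DIAGONAL)
    if tier in ("nightly", "pre-release"):
        return list(FULL_MATRIX)
    raise ValueError(f"unknown tier: {tier!r}")


def covers_both_axes(cells: list[tuple[str, str]]) -> bool:
    """True when every framework and every vendor appears in at least one cell."""
    all_frameworks = {framework for framework, _ in FULL_MATRIX}
    all_vendors = {vendor for _, vendor in FULL_MATRIX}
    seen_frameworks = {framework for framework, _ in cells}
    seen_vendors = {vendor for _, vendor in cells}
    return seen_frameworks == all_frameworks and seen_vendors == all_vendors
```

A check like `covers_both_axes(cells_for_tier("push"))` guards against someone shrinking the per-push slice in a way that silently drops a framework or vendor.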
Selection mechanics
Emitter selection is a CLI flag / env var on the spectral-tax-prep entrypoint; one process
exercises one (framework, vendor) cell at a time. All cells drive the same tax_prep workflow
against the same fixed inputs. The scan pipeline ingests each, calibrates, diagnoses, evaluates,
optimizes, checks safety, and renders a verdict. If a cell fails, the (framework, vendor) diff
localises the ingestion bug.
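One way the selection could be resolved is sketched below. The flag names (`--framework`, `--vendor`), the environment variable names, and the rule that a flag overrides the environment are all assumptions made for illustration; the actual spectral-tax-prep entrypoint may differ on each point.

```python
import argparse
from collections.abc import Mapping, Sequence

FRAMEWORKS = ("langchain", "openllmetry", "manual_sdk")
VENDORS = ("anthropic", "openai", "raw_otlp")

# Hypothetical environment variable names.
ENV_FRAMEWORK = "SPECTRAL_EMITTER_FRAMEWORK"
ENV_VENDOR = "SPECTRAL_EMITTER_VENDOR"


def select_cell(argv: Sequence[str], env: Mapping[str, str]) -> tuple[str, str]:
    """Resolve one (framework, vendor) cell; a CLI flag wins over the env var."""
    parser = argparse.ArgumentParser(prog="spectral-tax-prep")
    parser.add_argument("--framework")
    parser.add_argument("--vendor")
    args = parser.parse_args(list(argv))

    framework = args.framework or env.get(ENV_FRAMEWORK)
    vendor = args.vendor or env.get(ENV_VENDOR)

    if framework not in FRAMEWORKS:
        raise ValueError(f"unknown or missing framework: {framework!r}")
    if vendor not in VENDORS:
        raise ValueError(f"unknown or missing vendor: {vendor!r}")
    # raw_otlp is only meaningful for the Manual SDK emitter.
    if vendor == "raw_otlp" and framework != "manual_sdk":
        raise ValueError(f"degenerate cell: {framework} + {vendor}")
    return framework, vendor
```

For example, `select_cell(["--framework", "langchain"], {"SPECTRAL_EMITTER_VENDOR": "openai"})` would resolve to the LangChain + openai cell under these assumptions. Rejecting degenerate cells at startup keeps a misconfigured CI job from reporting a pass on a combination the matrix deliberately excludes.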
Reference-implementation discipline
apps/test-agents/ is a reference implementation, not a test harness. The agent code is
written to demonstrate realistic agent shape and to host failure modes the optimization loop
should hunt — not to be a synthetic fixture stripped of agent characteristics. The package is
excluded from default pytest runs and from the import graphs of spectral.core, spectral.worlds,
and spectral.platform; production code never imports test-agents.
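The import-graph rule lends itself to a static check. The sketch below uses only the standard-library `ast` module to flag a source file that imports the spectral_test_agents namespace; the function name is hypothetical, and this is not necessarily how the repository enforces the boundary.

```python
import ast

FORBIDDEN_ROOT = "spectral_test_agents"


def imports_test_agents(source: str) -> bool:
    """Return True if the source statically imports the test-agents namespace."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            if node.level > 0:
                # Relative imports stay inside the importing package.
                continue
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name == FORBIDDEN_ROOT or name.startswith(FORBIDDEN_ROOT + "."):
                return True
    return False
```

A check like this only sees static `import` statements; dynamic imports through `importlib` would need a separate guard.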
Test composition that automates the agent’s exercise (recorded traces, fixture replay, system runs) is acceptable but secondary — the primary value is a working agent that operators and engineers can read, run, and reason about.
Running the agent

    # Install dependencies from the repo root
    uv sync --all-packages

    # Unit tests (single subject, collaborators stubbed)
    uv run --package spectral-test-agents pytest tests/ -m unit

    # Integration tests (recorded traces, deterministic replay)
    uv run --package spectral-test-agents pytest tests/ -m integration

    # System tests (live agent runs through the full optimization loop)
    uv run --package spectral-test-agents pytest tests/ -m system

The package adds a local system marker for live full-loop runs alongside the unit and
integration markers; the canonical layer ladder (unit, contract, integration, e2e,
live_drift) lives in Testing.
Workspace and API key mapping
tax_prep maps to a dedicated local-dev workspace with its own API key for trace ingestion. The
workspace ID and key are seeded by the local development environment (tools/dev/start.sh). Each
emitter configuration uses the same workspace; ingestion routing distinguishes them by span
attributes, not by workspace.
See also
- Testing: the test-agent backbone in the broader testing posture
- Optimization Engine: the pipeline tax_prep exercises
- Source: apps/test-agents/