Test Agents
apps/test-agents/ is the home for reference-implementation customer agents — working agent
code that demonstrates how a customer integrates with Spectral. The subject agent makes
operational decisions, invokes /decide before acting, and operates within the binding work frame
Spectral returns.
The subject agent — tax_prep — is a LangGraph tax-preparation agent that consumes a deployed
Spectral world as a real external customer. It takes a taxpayer through to a filled, submission-ready
Form 1040 PDF, and every consequential judgment is the platform’s: the agent routes each one to
/decide, receives { status, work_frame, decision_metadata }, and honors the four-state outcome.
The agent holds no tax-judgment logic of its own — that separation is the whole point, and it is what
lets a decision-server regression isolate to a phase of
decision execution rather than hide inside agent
variance.
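As a rough illustration of the response shape named above, here is a minimal sketch of a typed container for a /decide result. The three field names (status, work_frame, decision_metadata) come from the text; the DecisionResult class, the parse_decision helper, and the sample payloads are hypothetical and are not the agent's actual client code.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# The three fields the text says /decide returns.
REQUIRED_FIELDS = {"status", "work_frame", "decision_metadata"}


@dataclass(frozen=True)
class DecisionResult:
    """Hypothetical typed view of a /decide response."""
    status: str
    work_frame: Mapping[str, Any]
    decision_metadata: Mapping[str, Any]


def parse_decision(payload: Mapping[str, Any]) -> DecisionResult:
    """Fail loudly if the response is missing any expected field."""
    missing = REQUIRED_FIELDS - set(payload)
    if missing:
        raise ValueError(f"/decide response missing fields: {sorted(missing)}")
    return DecisionResult(
        status=str(payload["status"]),
        work_frame=dict(payload["work_frame"]),
        decision_metadata=dict(payload["decision_metadata"]),
    )
```

Parsing into a frozen type at the boundary keeps the rest of the agent from reaching into raw JSON, which fits the idea that the agent only consumes the platform's judgment.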
The tax_prep agent
tax_prep is one LangGraph orchestrator over an accumulating return state:

    discover_actions → interview → ( propose_determination → adjudicate → honor_outcome )* → compute_1040 → render_pdf → verify

- discover_actions reads the deployed world’s action set from /actions. The agent routes only the actions it discovered; it never hardcodes the vocabulary.
- interview confines the agent’s LLM to the conversational layer: it turns the taxpayer’s free-text answers into structured facts (and narrates outcomes). It has no decision path, so the LLM never makes a tax judgment.
- the per-judgment loop proposes a determination (projecting the collected facts onto the discovered action’s declared attributes: a deterministic mapping, not a decision), adjudicates it by routing to /decide, and honors the four-state outcome:
  - GREEN: accept the treatment into the return.
  - YELLOW: record it as held for review; never silently upgraded to GREEN.
  - RED: do not carry the blocked treatment. The reproducible validation driver recovers to a permitted alternative carried as persona data; the interactive shell surfaces the block (the taxpayer revises and re-runs), because recovery alternatives are tax knowledge the agent does not hold.
- compute_1040 / render_pdf run a deterministic TY2025 calculator and render the Form 1040 (plus required schedules) PDF. These are arithmetic and presentation, not judgment.
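The per-judgment loop above can be sketched roughly as follows. This is an illustration under stated assumptions, not the agent's actual graph code: ReturnState, project_facts, run_judgments, and the decide callable are hypothetical names. Only GREEN, YELLOW, and RED are described in this section, so the sketch fails loudly on any other status rather than guessing its semantics.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Mapping


@dataclass
class ReturnState:
    """Hypothetical accumulating return state."""
    accepted: list = field(default_factory=list)  # GREEN treatments
    held: list = field(default_factory=list)      # YELLOW, held for review
    blocked: list = field(default_factory=list)   # RED, not carried


def project_facts(facts: Mapping[str, Any], declared_attributes: list) -> dict:
    # Deterministic mapping: keep only the attributes the action declares.
    return {name: facts[name] for name in declared_attributes if name in facts}


def honor_outcome(state: ReturnState, action: str, status: str) -> None:
    if status == "GREEN":
        state.accepted.append(action)
    elif status == "YELLOW":
        state.held.append(action)  # never upgraded to GREEN here
    elif status == "RED":
        state.blocked.append(action)
    else:
        # Any other status is not covered by this sketch.
        raise ValueError(f"unhandled status {status!r} for {action!r}")


def run_judgments(
    actions: Mapping[str, list],
    facts: Mapping[str, Any],
    decide: Callable[[str, dict], str],
) -> ReturnState:
    """Route one decide call per discovered action and honor each outcome."""
    state = ReturnState()
    for action, declared_attributes in actions.items():
        determination = project_facts(facts, declared_attributes)
        status = decide(action, determination)  # stands in for POST /decide
        honor_outcome(state, action, status)
    return state
```

Passing decide in as a callable keeps the loop free of any judgment of its own: swapping in a stub for tests or the real client for a live run changes nothing about how outcomes are honored.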
The platform-facing seam is a single typed decision client with dual auth: a customer JWT gates
discovery and the decision-record read; the sp_live_ decision key gates /decide
(ADR-086 D3: keys are minted per (org, domain)). The agent never overrides a YELLOW or RED: its job is to
ask before acting; Spectral’s job is to answer.
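To make the dual-auth split concrete, the following sketch picks which credential a request would carry based on its path. It is an assumption-laden illustration: the credential_for helper is hypothetical, and how each credential is actually transmitted (header names, schemes) is not specified here, so the sketch deliberately stops at credential selection.

```python
DECISION_KEY_PREFIX = "sp_live_"  # prefix named in the text


def credential_for(path: str, customer_jwt: str, decision_key: str) -> str:
    """Return the credential that gates the given path (hypothetical helper).

    /decide is gated by the decision key; every other path in this sketch
    (discovery, decision-record read) is gated by the customer JWT.
    """
    if path == "/decide":
        if not decision_key.startswith(DECISION_KEY_PREFIX):
            raise ValueError("decision key must start with 'sp_live_'")
        return decision_key
    return customer_jwt
```

For example, `credential_for("/actions", jwt, key)` returns the JWT, while `credential_for("/decide", jwt, key)` returns the decision key or raises if the key does not carry the expected prefix.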
The agent lives at apps/test-agents/src/spectral_test_agents/tax_prep/ with shared scaffolding
under apps/test-agents/src/spectral_test_agents/shared/. Per
ADR-031, spectral_test_agents is the
app’s snake-case leaf namespace; subject modules import as
from spectral_test_agents.tax_prep import …. The interview LLM is built through the shared
build_chat_model factory (ADR-102 — the sanctioned
langchain-* provider path), which is what binds the cassette transport for credential-free replay.
The shared reference-agent harness
Every subject agent is built on a small shared harness (spectral_test_agents.shared) so it behaves
like a real customer integration and so new agents are cheap and consistent:
- the platform seam (decision_client): the typed GET /actions · POST /decide · decision-read client every agent uses;
- config + --local (config, local_defaults): an agent reads its platform coordinates and credentials with the precedence explicit flag > --local known dev values > ambient env, so a local run needs no shell preamble;
- connect-or-provision (connection): an agent connects to a deployed world by its credentials and falls back to provisioning a fresh world only against a local target (so a mistyped key against a real platform fails loud rather than silently authoring a world);
- the CLI base (cli): the shared options + session resolution each agent’s typer app composes.
A subject agent adds a console script (tax-prep) and a thin command module; the harness supplies the
rest.
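The configuration precedence described above (explicit flag > --local known dev values > ambient env) could look roughly like this. The sketch is illustrative only: resolve_setting, LOCAL_DEFAULTS, the base_url setting, and the localhost value are hypothetical, and the exact environment-variable names behind the SPECTRAL_AGENT_ prefix are assumed.

```python
import os
from typing import Mapping, Optional

# Hypothetical stand-in for the known local-dev values that --local applies.
LOCAL_DEFAULTS = {"base_url": "http://localhost:8000"}

ENV_PREFIX = "SPECTRAL_AGENT_"  # prefix from the text; suffixes are assumed


def resolve_setting(
    name: str,
    explicit: Optional[str],
    use_local: bool,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Resolve one setting: explicit flag > --local default > ambient env."""
    if explicit is not None:
        return explicit
    if use_local and name in LOCAL_DEFAULTS:
        return LOCAL_DEFAULTS[name]
    source = os.environ if env is None else env
    return source.get(ENV_PREFIX + name.upper())
```

Under this ordering, a stray environment variable cannot silently redirect a `--local` run, and an explicit flag always wins over both.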
Two surfaces over the same core
The agent’s graph + preparation core is exercised two ways, neither holding tax-judgment logic.
The validation gate — the reproducible success bar
drivers/validation.py runs the agent against the deployed six-action TY2025 world for a set of
full-taxpayer personas (collectively spanning GREEN, YELLOW-held, and RED-recovered) and fail-loud
asserts the whole end-to-end claim: the agent discovers exactly the deployed action set, routes one
/decide per attempt, honors each four-state outcome, every routed decision record is provable
(it carries the deployed world-model version and a cited matched rule — including a recovered-from RED
block), and the computed Form 1040 lines + schedule set + rendered PDF match the hand-verified return.
The interview LLM is pinned to recorded cassettes, so the run reproduces from a clean database with no
LLM credential. This is the standing system gate, the program success bar.
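A few of the fail-loud checks listed above can be sketched as a single assertion helper. This is not the contents of drivers/validation.py: the assert_gate function and the record keys world_model_version and matched_rule are hypothetical names chosen to mirror the prose, and the comparison against the hand-verified Form 1040 is omitted.

```python
from typing import Any, Iterable, Mapping


def assert_gate(
    discovered: Iterable[str],
    deployed: Iterable[str],
    attempts: int,
    decide_calls: int,
    records: Iterable[Mapping[str, Any]],
    deployed_version: str,
) -> None:
    """Raise AssertionError on the first violated claim (hypothetical helper)."""
    # The agent must discover exactly the deployed action set.
    if set(discovered) != set(deployed):
        raise AssertionError("discovered action set differs from deployed set")
    # Exactly one /decide call per attempt.
    if decide_calls != attempts:
        raise AssertionError(f"{decide_calls} /decide calls for {attempts} attempts")
    # Every routed decision record must be provable.
    for record in records:
        if record.get("world_model_version") != deployed_version:
            raise AssertionError("record does not carry the deployed version")
        if not record.get("matched_rule"):
            raise AssertionError("record does not cite a matched rule")
```

Raising on the first failure rather than collecting warnings matches the fail-loud intent: a run either proves the whole claim or stops.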
The tax-prep CLI — the human-facing surface
The console script runs the same core for a real taxpayer: an interactive Textual
shell (tax-prep) and a headless mode (tax-prep run, a description in → decisions + a PDF out), plus
tax-prep seed to ensure (reuse or author) a local world. A person’s input feeds the live interview
LLM, the agent routes every discovered action to /decide, and the surface shows the platform’s
four-state outcomes (GREEN carried · YELLOW held · RED blocked) before rendering the PDF — the live
human + live LLM counterpart to the reproducible gate.
Reference-implementation discipline
apps/test-agents/ is a reference implementation, not a test harness, per project memory
project_test_agents_are_reference_implementations. The agent code is written to demonstrate
realistic customer-agent shape and to host integration patterns engineers can read, run, and
reason about — not to be a synthetic fixture stripped of agent characteristics. The package is
excluded from default pytest runs and from the import graphs of spectral.core,
spectral.worlds, and spectral.platform; production code never imports test-agents.
Test composition that automates the agent’s exercise (recorded responses, fixture replay, system runs) is acceptable but secondary — the primary value is a working customer agent that engineers can read to understand the integration shape.
Running the agent
The agent ships a console script (tax-prep) runnable from the repo root. --local applies the known
local-dev values, so no shell preamble is needed:

    uv run tax-prep --local                       # interactive Textual shell (the default)
    uv run tax-prep --local run --input case.txt  # headless: a description in → decisions + a PDF
    uv run tax-prep --local seed                  # ensure the dedicated local world is deployed, print its credentials

The agent connects to a deployed world by its credentials (--org/--domain/--token/--key or the
SPECTRAL_AGENT_* env vars), falling back to provisioning a fresh local world when none work;
--cassette replays the interview LLM credential-free.
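The connect-or-provision behaviour can be sketched as follows, assuming the fallback is allowed only against a local target, as the harness description states. Everything named here is hypothetical: ConnectionFailed, is_local_target, connect_or_provision, and the rule that "local" means a loopback host are illustrative assumptions, not the harness's connection module.

```python
from typing import Callable, TypeVar
from urllib.parse import urlparse

T = TypeVar("T")

# Assumption: a "local target" is one whose host is a loopback address.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


class ConnectionFailed(Exception):
    """Hypothetical error raised when the supplied credentials do not work."""


def is_local_target(base_url: str) -> bool:
    return urlparse(base_url).hostname in LOOPBACK_HOSTS


def connect_or_provision(
    base_url: str,
    connect: Callable[[], T],
    provision: Callable[[], T],
) -> T:
    """Try the given credentials first; provision only against a local target."""
    try:
        return connect()
    except ConnectionFailed:
        if not is_local_target(base_url):
            # Against a real platform, a bad key fails loud.
            raise
        return provision()
```

Re-raising for non-local targets is the point of the design: a mistyped key against a real platform surfaces as an error instead of quietly authoring a new world.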
The test suites are excluded from default pytest runs; invoke them explicitly:
    # Unit + CLI tests (collaborators stubbed; the decision client runs over a mock transport)
    uv run pytest apps/test-agents/ -c apps/test-agents/pyproject.toml -m "not system"

    # The system gate: the reproducible success bar against a deployed world. Reset the database
    # BEFORE the workers boot, or the outbox consumer never materializes the routing projection.
    FIX="$PWD/apps/test-agents/src/spectral_test_agents/tax_prep/fixtures"
    supabase db reset
    source tools/dev/resolve_supabase_env.sh
    uv run python tools/dev/cold_start_seed.py
    SPECTRAL_LLM_CASSETTE_MODE=replay SPECTRAL_LLM_CASSETTE_DIR="$FIX/codegen_cassettes" \
      tools/dev/start.sh --full
    source tools/dev/resolve_supabase_env.sh
    SPECTRAL_LLM_CASSETTE_MODE=replay SPECTRAL_LLM_CASSETTE_DIR="$FIX/interview_cassettes" \
      uv run --project apps/test-agents pytest apps/test-agents/tests/test_validation_system.py -m system

The package adds a local system marker for live decision-server runs alongside the unit and
integration markers; the canonical layer ladder (unit, contract, integration, e2e,
manual cassette recording) lives in Testing. apps/test-agents/README.md carries the full
record/replay walkthrough.
Domain and API-key setup
The agent consumes a deployed world through a customer (org_id, domain_id) tuple with its own API
key bound per ADR-086 D3 (keys are minted per (org, domain)). The validation driver provisions this the same no-shortcut way
a real customer would — author → publish → provision through the onboarding API → deploy → approve —
so the run reproduces the whole customer surface from a clean database rather than seeding tenancy
directly. The decision-server routes by (org, domain, action, world_model_version) and does not
distinguish callers by integration shape.
See also
- Testing: the test-agent backbone in the broader testing posture
- Decision Execution: the five-phase pipeline the test-agents exercise
- Agent Architecture: distinct topic, covering Spectral’s own World Agent rather than customer-shaped test-agents
- Source: apps/test-agents/