Test Agents

apps/test-agents/ is the home for reference-implementation customer agents — working agent code that demonstrates how a customer integrates with Spectral. The subject agent makes operational decisions, invokes /decide before acting, and operates within the binding work frame Spectral returns.

The subject agent — tax_prep — is a LangGraph tax-preparation agent that consumes a deployed Spectral world as a real external customer. It takes a taxpayer through to a filled, submission-ready Form 1040 PDF, and every consequential judgment is the platform’s: the agent routes each one to /decide, receives { status, work_frame, decision_metadata }, and honors the four-state outcome. The agent holds no tax-judgment logic of its own — that separation is the whole point, and it is what lets a decision-server regression isolate to a phase of decision execution rather than hide inside agent variance.

The `tax_prep` agent

tax_prep is one LangGraph orchestrator over an accumulating return state:

discover_actions → interview → ( propose_determination → adjudicate → honor_outcome )* →
compute_1040 → render_pdf → verify

- discover_actions reads the deployed world’s action set from /actions. The agent routes only the actions it discovered; it never hardcodes the vocabulary.
- interview confines the agent’s LLM to the conversational layer: it turns the taxpayer’s free-text answers into structured facts (and narrates outcomes). It has no decision path, so the LLM never makes a tax judgment.
- The per-judgment loop proposes a determination (projecting the collected facts onto the discovered action’s declared attributes, which is a deterministic mapping, not a decision), adjudicates it by routing to /decide, and honors the four-state outcome:
  - GREEN: accept the treatment into the return.
  - YELLOW: record it as held for review; never silently upgraded to GREEN.
  - RED: do not carry the blocked treatment. The reproducible validation driver recovers to a permitted alternative carried as persona data; the interactive shell surfaces the block (the taxpayer revises and re-runs), because recovery alternatives are tax knowledge the agent does not hold.
- compute_1040 / render_pdf run a deterministic TY2025 calculator and render the Form 1040 (plus required schedules) PDF. These are arithmetic and presentation, not judgment.
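
The per-judgment loop described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the agent’s actual code: ReturnState, propose_determination, honor_outcome, run_judgments, and the stubbed decide callable are hypothetical names, and the response is assumed to be a dict carrying the status, work_frame, and decision_metadata keys named earlier. Only the three statuses listed above are handled; any other status raises rather than being guessed at.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ReturnState:
    """Accumulating return state (hypothetical shape, for illustration only)."""
    accepted: list[str] = field(default_factory=list)
    held: list[str] = field(default_factory=list)
    blocked: list[str] = field(default_factory=list)


def propose_determination(facts: dict, declared_attributes: list[str]) -> dict:
    """Project collected facts onto an action's declared attributes.

    A deterministic mapping: it selects keys and makes no judgment.
    """
    return {name: facts.get(name) for name in declared_attributes}


def honor_outcome(state: ReturnState, action: str, response: dict) -> None:
    """Record the platform's answer; never upgrade YELLOW or override RED."""
    status = response["status"]
    if status == "GREEN":
        state.accepted.append(action)
    elif status == "YELLOW":
        state.held.append(action)
    elif status == "RED":
        state.blocked.append(action)
    else:
        # Fail loud on anything this sketch does not know how to honor.
        raise ValueError(f"unhandled status: {status!r}")


def run_judgments(
    actions: dict[str, list[str]],
    facts: dict,
    decide: Callable[[str, dict], dict],
    state: ReturnState,
) -> ReturnState:
    """Propose, adjudicate, and honor once per discovered action."""
    for action, declared_attributes in actions.items():
        determination = propose_determination(facts, declared_attributes)
        response = decide(action, determination)  # one /decide per attempt
        honor_outcome(state, action, response)
    return state


def stub_decide(action: str, determination: dict) -> dict:
    """Stand-in for POST /decide; the canned statuses are illustrative only."""
    canned = {"action_a": "GREEN", "action_b": "YELLOW", "action_c": "RED"}
    return {"status": canned[action], "work_frame": {}, "decision_metadata": {}}
```

Passing the decide callable in, rather than importing a client, keeps the loop free of any decision logic of its own: swapping stub_decide for a real client changes nothing else in the sketch.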

The platform-facing seam is a single typed decision client with dual auth: a customer JWT gates discovery and the decision-record read; the sp_live_ decision key gates /decide (ADR-086 D3 — keys are minted per (org, domain)). The agent never overrides a YELLOW or RED: its job is to ask before acting; Spectral’s job is to answer.
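
As a rough illustration of the dual-auth split, the sketch below builds (but does not send) one request per credential. Everything concrete here is an assumption made for the example: the placeholder host, the use of an Authorization: Bearer header for both credentials, the payload shape, and the function names. The real decision_client may carry the credentials differently.

```python
import json
from urllib.request import Request

BASE_URL = "https://spectral.example"  # placeholder host, not a real endpoint


def build_actions_request(customer_jwt: str) -> Request:
    """Discovery (GET /actions) is gated by the customer JWT."""
    return Request(
        f"{BASE_URL}/actions",
        headers={"Authorization": f"Bearer {customer_jwt}"},  # assumed scheme
        method="GET",
    )


def build_decide_request(decision_key: str, payload: dict) -> Request:
    """POST /decide is gated by the sp_live_ decision key."""
    if not decision_key.startswith("sp_live_"):
        # Refuse early so a JWT is never sent where a decision key belongs.
        raise ValueError("expected an sp_live_ decision key")
    return Request(
        f"{BASE_URL}/decide",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {decision_key}",  # assumed scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Keeping one builder per credential makes it hard to send the wrong secret to the wrong endpoint by accident.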

The agent lives at apps/test-agents/src/spectral_test_agents/tax_prep/ with shared scaffolding under apps/test-agents/src/spectral_test_agents/shared/. Per ADR-031, spectral_test_agents is the app’s snake-case leaf namespace; subject modules import as from spectral_test_agents.tax_prep import …. The interview LLM is built through the shared build_chat_model factory (ADR-102 — the sanctioned langchain-* provider path), which is what binds the cassette transport for credential-free replay.

The shared reference-agent harness

Every subject agent is built on a small shared harness (spectral_test_agents.shared) so it behaves like a real customer integration and so new agents are cheap and consistent:

- the platform seam (decision_client): the typed GET /actions · POST /decide · decision-read client every agent uses;
- config + --local (config, local_defaults): an agent reads its platform coordinates and credentials with the precedence explicit flag > --local known dev values > ambient env, so a local run needs no shell preamble;
- connect-or-provision (connection): an agent connects to a deployed world by its credentials and falls back to provisioning a fresh world only against a local target (so a mistyped key against a real platform fails loud rather than silently authoring a world);
- the CLI base (cli): the shared options + session resolution each agent’s typer app composes.

A subject agent adds a console script (tax-prep) and a thin command module; the harness supplies the rest.
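
The configuration precedence (explicit flag > --local known dev values > ambient env) can be sketched as a single lookup. This is an illustration under stated assumptions, not the harness’s config module: resolve_setting, the LOCAL_DEFAULTS contents, and the exact environment variable names are hypothetical; only the SPECTRAL_AGENT_ prefix comes from this page.

```python
import os
from typing import Mapping, Optional

# Hypothetical stand-in for the known local-dev values applied by --local.
LOCAL_DEFAULTS = {"base_url": "http://localhost:8000"}


def resolve_setting(
    name: str,
    flag_value: Optional[str],
    use_local: bool,
    env: Optional[Mapping[str, str]] = None,
    env_prefix: str = "SPECTRAL_AGENT_",
) -> Optional[str]:
    """Resolve one setting: explicit flag, then --local value, then env."""
    env = os.environ if env is None else env
    if flag_value is not None:
        return flag_value  # an explicit flag always wins
    if use_local and name in LOCAL_DEFAULTS:
        return LOCAL_DEFAULTS[name]  # --local beats the ambient environment
    return env.get(f"{env_prefix}{name.upper()}")  # may be None if unset
```

Because --local outranks the ambient environment, a stale exported variable cannot redirect a local run to another target.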

Two surfaces over the same core

The agent’s graph + preparation core is exercised two ways, neither holding tax-judgment logic.

The validation gate — the reproducible success bar

drivers/validation.py runs the agent against the deployed six-action TY2025 world for a set of full-taxpayer personas (collectively spanning GREEN, YELLOW-held, and RED-recovered) and asserts the whole end-to-end claim, failing loud on any miss: the agent discovers exactly the deployed action set; it routes one /decide per attempt; it honors each four-state outcome; every routed decision record is provable (it carries the deployed world-model version and a cited matched rule, including for a recovered-from RED block); and the computed Form 1040 lines, schedule set, and rendered PDF match the hand-verified return. The interview LLM is pinned to recorded cassettes, so the run reproduces from a clean database with no LLM credential. This is the standing system gate, the program success bar.
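
To make the "provable" check concrete, here is a small sketch of the kind of fail-loud assertion the gate describes. The record field names (world_model_version, matched_rule) and both helper names are assumptions for illustration; the real decision-record schema may differ.

```python
def assert_provable(record: dict, deployed_version: str) -> None:
    """Fail loud unless the record cites the deployed version and a rule."""
    version = record.get("world_model_version")  # assumed field name
    if version != deployed_version:
        raise AssertionError(
            f"record cites version {version!r}, expected {deployed_version!r}"
        )
    if not record.get("matched_rule"):  # assumed field name
        raise AssertionError("record does not cite a matched rule")


def assert_one_decide_per_attempt(attempts: list[str], records: list[dict]) -> None:
    """Each attempt, including a blocked one, must have exactly one record."""
    if len(records) != len(attempts):
        raise AssertionError(
            f"{len(attempts)} attempts but {len(records)} decision records"
        )
```

Raising AssertionError directly, instead of using bare assert statements, keeps the checks active even when Python runs with optimizations enabled.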

The `tax-prep` CLI — the human-facing surface

The console script runs the same core for a real taxpayer: an interactive Textual shell (tax-prep) and a headless mode (tax-prep run, a description in → decisions + a PDF out), plus tax-prep seed to ensure (reuse or author) a local world. A person’s input feeds the live interview LLM, the agent routes every discovered action to /decide, and the surface shows the platform’s four-state outcomes (GREEN carried · YELLOW held · RED blocked) before rendering the PDF — the live human + live LLM counterpart to the reproducible gate.

Reference-implementation discipline

apps/test-agents/ is a reference implementation, not a test harness (per project memory project_test_agents_are_reference_implementations). The agent code is written to demonstrate realistic customer-agent shape and to host integration patterns engineers can read, run, and reason about; it is not a synthetic fixture stripped of agent characteristics. The package is excluded from default pytest runs and from the import graphs of spectral.core, spectral.worlds, and spectral.platform; production code never imports test-agents.

Test composition that automates the agent’s exercise (recorded responses, fixture replay, system runs) is acceptable but secondary — the primary value is a working customer agent that engineers can read to understand the integration shape.

Running the agent

The agent ships a console script (tax-prep) runnable from the repo root. --local applies the known local-dev values, so no shell preamble is needed:

uv run tax-prep --local                       # interactive Textual shell (the default)
uv run tax-prep --local run --input case.txt  # headless: a description in → decisions + a PDF
uv run tax-prep --local seed                  # ensure the dedicated local world is deployed, print its credentials

The agent connects to a deployed world by its credentials (--org/--domain/--token/--key or the SPECTRAL_AGENT_* env vars), falling back to provisioning a fresh local world when the supplied credentials do not connect; --cassette replays the interview LLM credential-free.
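
The connect-or-provision fallback can be sketched as below. This is an illustration only: ConnectionFailed, is_local_target, connect_or_provision, and the set of hosts treated as local are hypothetical, and the real connection module may detect a local target differently.

```python
from typing import Callable, TypeVar
from urllib.parse import urlparse

T = TypeVar("T")

# Assumed definition of "local target" for this sketch.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


class ConnectionFailed(Exception):
    """Raised by the hypothetical connect step when credentials do not work."""


def is_local_target(base_url: str) -> bool:
    return urlparse(base_url).hostname in LOCAL_HOSTS


def connect_or_provision(
    base_url: str,
    connect: Callable[[], T],
    provision: Callable[[], T],
) -> T:
    """Connect by credentials; provision a fresh world only on a local target."""
    try:
        return connect()
    except ConnectionFailed:
        if not is_local_target(base_url):
            # A bad key against a real platform must fail loud,
            # never silently author a new world.
            raise
        return provision()
```

The non-local branch re-raises the original error, so the operator sees why the connection failed instead of a generic fallback message.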

The test suites are excluded from default pytest runs; invoke them explicitly:

# Unit + CLI tests (collaborators stubbed; the decision client runs over a mock transport)
uv run pytest apps/test-agents/ -c apps/test-agents/pyproject.toml -m "not system"

# The system gate — the reproducible success bar against a deployed world. Reset the database
# BEFORE the workers boot, or the outbox consumer never materializes the routing projection.
FIX="$PWD/apps/test-agents/src/spectral_test_agents/tax_prep/fixtures"
supabase db reset
source tools/dev/resolve_supabase_env.sh
uv run python tools/dev/cold_start_seed.py
SPECTRAL_LLM_CASSETTE_MODE=replay SPECTRAL_LLM_CASSETTE_DIR="$FIX/codegen_cassettes" \
  tools/dev/start.sh --full
source tools/dev/resolve_supabase_env.sh
SPECTRAL_LLM_CASSETTE_MODE=replay SPECTRAL_LLM_CASSETTE_DIR="$FIX/interview_cassettes" \
  uv run --project apps/test-agents pytest apps/test-agents/tests/test_validation_system.py -m system

The package adds a local system marker for live decision-server runs alongside the unit and integration markers; the canonical layer ladder (unit, contract, integration, e2e, manual cassette recording) lives in Testing. apps/test-agents/README.md carries the full record/replay walkthrough.

Domain and API-key setup

The agent consumes a deployed world through a customer (org_id, domain_id) tuple with its own API key bound per ADR-086 D3 (keys are minted per (org, domain)). The validation driver provisions this the same no-shortcut way a real customer would — author → publish → provision through the onboarding API → deploy → approve — so the run reproduces the whole customer surface from a clean database rather than seeding tenancy directly. The decision-server routes by (org, domain, action, world_model_version) and does not distinguish callers by integration shape.
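
A small sketch of the two ideas in this section: the no-shortcut provisioning order, and a routing key that has no field for the caller’s integration shape. RoutingKey, PROVISIONING_STEPS, and next_step are hypothetical names used only for illustration; the step labels and key fields are taken from the paragraph above.

```python
from typing import NamedTuple, Optional


class RoutingKey(NamedTuple):
    """What the decision-server routes by; note there is no caller field."""
    org: str
    domain: str
    action: str
    world_model_version: str


# The order a real customer would follow; no step may be skipped.
PROVISIONING_STEPS = ("author", "publish", "provision", "deploy", "approve")


def next_step(completed: list[str]) -> Optional[str]:
    """Return the next provisioning step, or None when all are done."""
    expected = list(PROVISIONING_STEPS[: len(completed)])
    if completed != expected:
        raise ValueError(f"steps out of order: {completed!r}")
    if len(completed) == len(PROVISIONING_STEPS):
        return None
    return PROVISIONING_STEPS[len(completed)]
```

Two callers that present the same (org, domain, action, world_model_version) build equal keys, so a dictionary keyed by RoutingKey resolves them to the same route regardless of whether the request came from the validation driver or the CLI.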