
ADR-060: Agent tool invocation, framework-layer composition, and LLM-mediated error handling

Status: Accepted (2026-04-25)

Context

Spectral runs three LangGraph-driven agents — Spectral Agent (customer-facing scan analysis), World Agent (domain exploration), Operations Agent (operator workflow). Each has its own tool registry: Spectral Agent’s tools live in system-design/agents/agent-architecture.mdx, World Agent’s in world-agent.mdx, Ops Agent’s in operations-agent.mdx (per S9 ground truth in SPEC-266). Per-agent registries already existed; the cross-cutting contract did not. The three runtimes had drifted in small ways — error shapes inconsistent across agents, observability metadata inconsistent, approval payloads ad hoc, and (the load-bearing question) no settled answer to where each agent runtime would actually live or how inter-context tool dependencies would compose.

Three architectural questions were entangled:

  1. Where do the three agent runtimes live? Spectral Agent had been provisioned in workers (per TA-5 D12 AgentTask + TA-14 checkpointer + TA-19 D1). World Agent and Ops Agent were unspecified. A single answer was needed before tool patterns could be designed.
  2. How does a tool in one context reach data or behavior in another context? TA-12 D13 had pencilled in framework-layer composition for ask_world_agent; TA-7 D3 + TA-8 D3 had pencilled in SQL grants between contexts for outcome reads + T3 body fetches. Two different mechanisms, no settled default.
  3. How does an agent recover from tool errors? Default LangGraph and pydantic-ai behavior surfaces tool errors back to the LLM as tool messages; the LLM decides next action. An earlier draft proposed a hand-rolled retry middleware at the agent layer with explicit per-tool budgets. The two paths fight each other.

This ADR resolves the cross-cutting contract for agent tool invocation, including the three architectural questions above.

Decision

The decision is structured as three load-bearing architectural ratifications followed by ten per-mechanism decisions. The ratifications are stated separately because their reach extends beyond TA-15 itself — they bind subsequent dispositions and supersede prior ones.

Architectural ratification — Agent runtime placement: all three agents run in workers

apps/workers hosts the LangGraph orchestrators for all three agents. apps/api becomes thin: authentication, AgentTask dispatch via outbox (per TA-5 D12), and SSE streaming proxy. Workers consumes AgentTask events, loads checkpointer state (per TA-14), runs the orchestrator, executes tool calls, writes memory, and streams output via Supabase Realtime channel keyed by conversation_id; apps/api proxies the Realtime channel as SSE to the client. Approval interrupts use LangGraph’s interrupt() to suspend the run; the checkpointer persists state; an operator response (HTTP into apps/api) resumes via Command(resume=...).
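The suspend/resume shape above can be simulated with a stdlib generator. The real runtime uses LangGraph's `interrupt()` to suspend and `Command(resume=...)` to resume, with the checkpointer persisting state in between; the sketch below only illustrates the control flow, and all names and payloads are hypothetical:

```python
# Minimal stdlib simulation of the approval suspend/resume shape. Not the
# landed code: the real flow is LangGraph interrupt() + Command(resume=...).
def agent_run():
    # interrupt(): surface the approval request and suspend until resumed.
    decision = yield {
        "tool_name": "replay_dlq_event",
        "effect_description": "Re-dispatch one DLQ event through its handler",
    }
    # Command(resume=...): the operator decision re-enters here.
    return "executed" if decision.get("approved") else "aborted: approval denied"

run = agent_run()
request = next(run)  # run suspends; the request is surfaced to the operator
try:
    run.send({"approved": True})  # operator approves via HTTP into apps/api
except StopIteration as stop:
    outcome = stop.value
```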

Architectural ratification — Inter-context composition: notifications via events; calls via DI; no SQL grants between contexts

The architectural axis for inter-context mechanism choice is flow shape, not sync vs async function semantics:

  • Notification flow (one-way push; producer doesn’t await a result) → typed event payloads in <producer>.contracts.events.* published onto the TA-5 substrate (per ADR-065 D2).
  • Call flow (caller dispatches a request and needs the result) → callee-owned OHS Protocol in <callee>.contracts.protocols.* (per ADR-065 D3); impl in callee context’s application layer; bridge tool lives in apps/* per ADR-065 D5 (composes the Protocol into the caller agent’s tool list via DI).

Both flow shapes are implemented with async def Python functions in workers; transport choice is orthogonal to function-definition semantics. No SQL grants between contexts at any layer. This is the agent-tool-invocation projection of ADR-063, where the canonical statement lives.
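A notification-flow payload can be sketched with a stdlib dataclass (real payloads are pydantic VOs under `<producer>.contracts.events.*`; the event name and fields here are hypothetical, not a landed contract):

```python
from dataclasses import dataclass, asdict

# Producer-owned, typed, one-way: the producer publishes onto the TA-5
# substrate and does not await a result. Illustrative names only.
@dataclass(frozen=True)
class ScanCompleted:
    scan_id: str
    world_id: str

event = ScanCompleted(scan_id="scan-123", world_id="world-456")
payload = asdict(event)  # the shape that would be published
```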

Architectural ratification — TA-27 collapses to ratification

Founder-lens challenge: scaffolding exceptions during architectural planning indicate the default isn’t right. TA-7 D3 (worlds_outcomes_reader grant) and TA-8 D3 (worlds_t3_reader grant) are both removed. Inter-context outcome reads + T3 body reads are notification-shaped — the Reader Protocol path was retired by ADR-064 D3 (broadened 2026-04-30) in favor of event-driven local replicas at each consumer context. TA-27 (SPEC-331) lands as a ratification of the inter-context composition decision above rather than a fresh disposition. See ADR-063 for the canonical statement.

D1 — Tool envelope is a pattern, not a heavy value object

Tools remain plain async callables produced by closed-over-DI factories (the existing Spectral Agent pattern; extended to all three agents). Cross-cutting metadata is captured at call time via a lightweight ToolCallMetadata pydantic VO emitted by an observed_tool decorator. There is no ToolCallEnvelope wrapper around every tool body.
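The closed-over-DI factory shape can be sketched as follows (hypothetical tool and repository names; not the landed code):

```python
import asyncio

# Dependencies are closed over at composition time; the tool itself stays a
# plain async callable whose docstring is what the LLM sees.
def make_get_candidate_detail(repo):
    async def get_candidate_detail(candidate_id: str) -> str:
        """Return a summary of a rule candidate."""
        row = await repo.fetch(candidate_id)
        return f"candidate {candidate_id}: {row}"
    return get_candidate_detail

class FakeRepo:
    async def fetch(self, candidate_id: str) -> str:
        return "status=pending"

tool = make_get_candidate_detail(FakeRepo())
result = asyncio.run(tool("rc-42"))  # → "candidate rc-42: status=pending"
```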

D2 — Error taxonomy: four classes in spectral.core.tools.errors

  • ToolUserError — invalid input from user/operator (bad args, missing context); user-visible
  • ToolPolicyError — policy/scope/approval denied; user-visible with framing
  • ToolTransientError — infrastructure transient (DB blip, brief LLM provider rate-limit)
  • ToolTerminalError — non-recoverable (invariant violation)

The taxonomy’s role is to shape what the LLM sees via the tool message, not to drive a hand-rolled retry dispatcher.
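The taxonomy can be sketched as plain exception classes (the landed surface lives in spectral.core.tools.errors; the `user_visible` attribute here is an illustrative stand-in for however visibility is actually encoded):

```python
# D2 taxonomy sketch -- four classes under a common ToolError base.
class ToolError(Exception):
    user_visible = False

class ToolUserError(ToolError):
    user_visible = True   # invalid input from user/operator

class ToolPolicyError(ToolError):
    user_visible = True   # policy/scope/approval denied

class ToolTransientError(ToolError):
    pass                  # infrastructure transient (DB blip, rate-limit)

class ToolTerminalError(ToolError):
    pass                  # non-recoverable (invariant violation)
```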

D3 — Error propagation is LLM-mediated; LangGraph recursion-limit is the circuit breaker

Tool errors flow back to the LLM as tool messages with error class plus human-readable description per D2. The LLM decides next action: retry as-is, retry with modified args, surface to operator, abandon. LangGraph orchestrator-level recursion limit (default 25; configurable per agent) caps runaway loops. There is no agent-layer retry budget — the LLM is the retry decision-maker. Tool implementations may include single-retry-on-transient-IO as an implementation detail (e.g., a DB connection blip); that is not contract.
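The propagation shape: errors become tool messages, not raised exceptions. A minimal wrapper sketch (all names hypothetical; LangGraph's prebuilt tool node applies a similar catch-and-report by default, and the recursion limit is a per-run config value):

```python
import asyncio

class ToolTransientError(Exception):
    """Stand-in for spectral.core.tools.errors.ToolTransientError."""

async def run_tool_for_llm(tool, **kwargs) -> str:
    try:
        return await tool(**kwargs)
    except Exception as exc:
        # Error class + human-readable description flow back as the tool
        # message; the LLM decides: retry, modify args, surface, or abandon.
        return f"error[{type(exc).__name__}]: {exc}"

async def flaky_tool() -> str:
    raise ToolTransientError("DB connection blip")

message = asyncio.run(run_tool_for_llm(flaky_tool))
```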

D4 — Approval via LangGraph interrupt(); standardized ToolApprovalRequest payload

ToolApprovalRequest (pydantic VO in spectral.core.tools.approval) carries:

  • tool_name: str
  • agent_name: str
  • args_summary: str (sanitized; PII-stripped; safe to display to the operator)
  • effect_description: str (human-readable description of what will change)
  • correlation_id: UUID

Operator response: approve / deny / request-revision. On approve: Command(resume=ApprovalGranted(...)). On deny: tool aborts with ToolPolicyError(reason=APPROVAL_DENIED). On revision: the agent revises the proposed action and re-emits the approval request. All paths audit-logged per TA-16.
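The D4 payload can be sketched as a frozen dataclass (the landed VO is pydantic, in spectral.core.tools.approval; field values below are illustrative):

```python
from dataclasses import dataclass
from uuid import UUID, uuid4

@dataclass(frozen=True)
class ToolApprovalRequest:
    tool_name: str
    agent_name: str
    args_summary: str        # sanitized; PII-stripped; operator-safe
    effect_description: str  # human-readable description of what will change
    correlation_id: UUID

req = ToolApprovalRequest(
    tool_name="replay_dlq_event",
    agent_name="operations_agent",
    args_summary="event_id=<redacted>",
    effect_description="Re-dispatch one DLQ event through its handler",
    correlation_id=uuid4(),
)
```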

D5 — Per-tool classification ground truth stays in S9 / agent-architecture pages

This ADR specifies the mechanism. SPEC-266 (S9 Ops Agent tool registry), agent-architecture.mdx (Spectral Agent), and world-agent.mdx (World Agent) remain the per-agent tool ground truth. This ADR does not restate the lists.

D6 — ask_world_agent composes via in-process DI through the workers entrypoint

WorldAgentRunner Protocol lives in spectral.worlds.contracts.protocols.world_agent per ADR-065 D3 (callee-owned OHS Protocol; the original spectral.core.tools.protocols placement is superseded). Two methods:

  • ask(question: str, *, world_id: UUID) -> str — stateless mode (no session, no memory; per S10)
  • chat(message: str, *, session_id: UUID, world_id: UUID) -> str — stateful mode

Impl lives in spectral.worlds.application. Per ADR-065 D5, bridge tools (e.g. an Ops-Agent ask_world_agent callable) live in apps/* framework deliverables, never in caller-context code; the bridge imports WorldAgentRunner (framework-to-context, allowed under validator rule 7) and is composed into the Ops Agent tool list via DI at workers startup. Tool body: await runner.ask(question, world_id=...). OTel trace context flows in-process. No correlation_id, no events, no suspend/resume.

WorldAgentRunner is the reference example; the same shape (callee-owned Protocol in <callee>.contracts.protocols.*; impl in callee context; bridge tool in apps/*) applies to any future inter-context call-shaped tool. Notification-shaped reads (e.g., rule-candidate outcomes, T3 memory body) follow event-driven local replicas instead per ADR-064 D3.
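The Protocol-plus-bridge shape can be sketched as follows (the Protocol signature follows D6; the factory name and stub are hypothetical, not the landed code):

```python
import asyncio
from typing import Protocol
from uuid import UUID, uuid4

# Callee-owned OHS Protocol; the real one lives in
# spectral.worlds.contracts.protocols.world_agent.
class WorldAgentRunner(Protocol):
    async def ask(self, question: str, *, world_id: UUID) -> str: ...

# Bridge tool factory, composed in apps/* at workers startup. The tool body
# is the plain in-process await from D6: normal stack traces, no events.
def make_ask_world_agent(runner: WorldAgentRunner):
    async def ask_world_agent(question: str, world_id: UUID) -> str:
        return await runner.ask(question, world_id=world_id)
    return ask_world_agent

class StubRunner:
    async def ask(self, question: str, *, world_id: UUID) -> str:
        return f"answer to {question!r}"

tool = make_ask_world_agent(StubRunner())
answer = asyncio.run(tool("what changed?", uuid4()))
```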

D7 — DLQ inspection tools added to the Ops Agent surface

Resolves the TA-6 D6 deferral:

  • list_dlq_events(handler_name?, age_range?, limit) — read
  • get_dlq_event_detail(event_id) — read; returns event payload + sanitized failure_history
  • replay_dlq_event(event_id, reason: str) — mutate with call-time approval; calls core.outbox_replay() per TA-6 D5; reason field captured in audit log

Backed by OutboxReader / OutboxReplayer protocols injected at the workers entrypoint.

D8 — Cluster triage tools added to the Ops Agent surface; symmetric with D7

Resolves the TA-9 D5 deferral:

  • list_failure_clusters(severity?, status?) — read
  • get_cluster_detail(cluster_id) — read; returns snapshot + linked failures
  • triage_cluster(cluster_id, status: dismissed|escalate|wait, notes) — mutate with call-time approval; updates platform.rule_candidates_pending operator-managed columns per TA-9 D3

Symmetric with D7 under S9’s mutate-with-call-time-approval pattern. Cluster triage is an operational status change (not a governance gate like approve_candidate); both fit the same shape.

D9 — Workshop discipline at the tool → memory boundary is doctrine plus a repository wrapper

Per the workshop framing crystallized in TA-13, tool outputs containing canonical content (rule body via get_candidate_detail, scan trace via cluster detail, customer PII anywhere) are not round-tripped into memory rows verbatim. The agent uses content in-context for reasoning; the memory-write path stores meta-knowledge (“operator asked about candidate X” / “agent inspected cluster Y”), not content. The repository gateway (TA-13 D11 / TA-12 D11) enforces typology-driven classification; the trigram trigger (TA-12 D8 / TA-13 D4) backstops doctrine drift. There is no separate sanitization decorator on tools — discipline lives in the memory-write path, not the tool surface.

D10 — Helpers landing in spectral.core.tools

  • ToolCallMetadata (metadata.py) — pydantic VO: tool_name, agent_name, latency_ms, ok bool, error_class (nullable), trace_id, started_at, ended_at
  • ToolError base + four subclasses (errors.py) — D2 taxonomy
  • ToolApprovalRequest (approval.py) — D4 payload
  • observed_tool decorator — wraps an async tool callable; emits ToolCallMetadata per call to structlog and OTel; integrates with TA-10 LLM cost tracking when the tool body invokes an LLM. Decorator implementation lands with the first consumer (SPEC-242 Spectral Agent integration) per TA-12 / TA-14 precedent — this ADR fixes the contract surface (metadata + approval + error types).
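The decorator's shape can be sketched with stdlib only (the landed decorator emits to structlog + OTel; here a plain list stands in for that sink, and only a subset of the ToolCallMetadata fields is shown):

```python
import asyncio
import functools
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolCallMetadata:
    tool_name: str
    latency_ms: float
    ok: bool
    error_class: Optional[str]

def observed_tool(sink: list):
    """Wrap an async tool callable; emit one metadata record per call."""
    def decorate(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = await fn(*args, **kwargs)
            except Exception as exc:
                sink.append(ToolCallMetadata(
                    fn.__name__, (time.perf_counter() - start) * 1000,
                    False, type(exc).__name__))
                raise
            sink.append(ToolCallMetadata(
                fn.__name__, (time.perf_counter() - start) * 1000, True, None))
            return result
        return wrapper
    return decorate

calls: list = []

@observed_tool(calls)
async def ping() -> str:
    return "pong"

asyncio.run(ping())
```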

Alternatives considered

Inter-context tool calls via events (request-reply pattern). Considered for ask_world_agent and similar call-flow tools. Rejected: request-reply over events introduces correlation IDs, timeouts, suspend-resume orchestration, and lost-response handling for what is structurally an in-process function call. DI at the framework-layer seam uses primitives that already exist (closed-over factories), preserves normal stack traces, and incurs no substrate-handler overhead. ADR-063 captures the broader inter-context framing.

HTTP through apps/api for ask_world_agent. Rejected after the agent-runtime-placement ratification: workers IS the framework-layer composition seam for workers-resident tool calls; an HTTP roundtrip would be a needless network hop.

Hand-rolled retry middleware at agent layer with explicit per-tool budget. Rejected. Bypasses LLM judgment, fights LangGraph default behavior, and adds machinery that would not earn its keep. The LLM is already the decision-maker for “what to do about a tool error” — wrapping its judgment in a budget table re-implements what the LLM does naturally.

Asymmetric D7 / D8 (replay-as-tool; triage-as-UI-only). Rejected. Cluster triage is not a governance gate; it is symmetric with DLQ replay under S9’s mutate-with-call-time-approval pattern. Splitting the surface across tool and UI layers would force operators to context-switch between agent chat and an ops dashboard for closely related actions.

Per-tool typed event pair for inter-context tool calls (alternative to a generic ToolInvocationRequested/Answered shape if events-default had survived). Moot under the inter-context composition ratification above (notification flow → events; call flow → DI).

Sanitization decorator on tool functions (D9 alternative). Rejected. Discipline belongs at the memory-write path (where the typology decision is made), not at the tool surface (which legitimately surfaces canonical content for in-context reasoning).

Inter-context SQL grants kept as exception list (TA-7 D3 + TA-8 D3 retained). Rejected after the founder-lens challenge that produced the TA-27 ratification above. ADR-063 captures the canonical reframing.

Consequences

  • Single inter-context composition mechanism for tool calls (DI at framework-layer seam) — minimal substrate footprint, standard async/await semantics, normal stack traces, easy testing.
  • TA-27 (SPEC-331) collapses to ratification. Inter-context SQL grants don’t ship at any layer. Captured canonically in ADR-063.
  • TA-7 D3 + TA-8 D3 grants removed. Reimplementations happen in the consumer epics (SPEC-310 outcome read; SPEC-311 T3 body fetch via DI-injected reader).
  • TA-5 D5 partially superseded — notification-flow portion holds; inter-context SQL default does not. ADR-044 (TA-5) carries the supersession status line.
  • TA-19 D2 inheritance from TA-5 D5 superseded. ADR-048 (TA-19) reflects that the workers tier composes inter-context dependencies via DI; no shared DB role assumes an inter-context grant.
  • spectral.core.tools package landed (commit 5eabc3c): errors.py, metadata.py, approval.py, protocols.py, plus 20 contract tests pinning the surface (test count: 114 → 134).
  • apps/api becomes thin. Auth + AgentTask dispatch + SSE streaming proxy. Operationally simpler at the cost of a streaming roundtrip via Supabase Realtime (workers → Realtime → API → SSE). Latency penalty is negligible vs LLM token latency.
  • Workers entrypoint is load-bearing. Inter-context composition for all agent-resident call flows lives there. The composition module is more substrate to maintain than a monolithic process, but the context seal is enforced structurally — agent context code never imports another context.
  • LLM prompts must handle error tool-messages gracefully — implementation discipline carried into the consumer epics.
  • observed_tool decorator implementation deferred to first consumer (SPEC-242) per TA-12 / TA-14 precedent. The contract surface is settled now; the decorator wires TA-16 substrate (structlog + OTel) and TA-10 cost tracking when first integrated.
  • Approval audit trail. Every ToolApprovalRequest and operator response is logged through TA-16; approval timing and reason fields support post-hoc review.

References

  • ADR-007 — LangGraph agent architecture; closed-over-DI tool factory pattern
  • ADR-065 — spectral.core admission discipline (the new tool surface ships under core)
  • ADR-031 — single-library + app-as-framework-layer-leaves; framework-layer composition
  • ADR-043 — TA-14 LangGraph checkpointer (approval interrupts depend on checkpointer behavior)
  • ADR-044 — TA-5 event substrate (carries the D5 supersession from this ADR)
  • ADR-048 — TA-19 deployment topology; workers tier
  • ADR-058 — TA-12 World Agent memory + agent-memory-primitives
  • ADR-059 — TA-13 Ops Agent memory
  • ADR-063 — canonical inter-context access pattern
  • TA-15 disposition — SPEC-318 comment 66b07620
  • TA-15 verification — SPEC-318 comment c6868aa3
  • src/spectral/core/tools/ — landed contract surface
  • tests/core/test_contract_tools.py — 20 contract tests
  • Codex system-design/agents/agent-tool-invocation.mdx — declarative pattern documentation
  • Codex system-design/agents/agent-architecture.mdx — runtime placement + streaming pattern updates