
LLM Platform

Spectral’s LLM stack is in-process Python, built on pydantic-ai as the canonical SDK abstraction for direct LLM calls (scan pipeline, rule distillation, any non-LangGraph path). The control plane — routing, profile resolution, rate limits, budgets, graceful degradation — runs in-process; there is no out-of-process HTTP LLM gateway. Decision lineage in ADR-008 (pydantic-ai adoption) and ADR-035 (current stack composition).

langchain-anthropic, langchain-openai, and langchain-google are permitted only as LangChain chat-model adapters passed to LangGraph’s init_chat_model for agent orchestration (per ADR-007). LiteLLM is not in the dependency graph — pydantic-ai covers the provider surface Spectral uses, and the March 2026 supply-chain compromise plus the pain points already named in ADR-008 settled the decision against it.


spectral.core.llm.purposes.PurposeKey is the contract every LLM call carries across worlds and platform:

  • scoring — evaluation scoring (high volume, cost-sensitive)
  • detection — anti-deception, parse validation
  • reasoning — diagnosis, optimization, calibration rewrites, rule distillation
  • agent_turn — conversational turn (Spectral / Ops Agent)
  • agent_tool — tool-invocation call within an agent turn
  • world_agent — World Agent exploration/hypothesis
  • customer_replay — re-executing customer agents during observe
  • embedding — embedding generation (full policy in embeddings)

Events, cost rollups, and observability all aggregate on this key.
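
A minimal sketch of the key, assuming a StrEnum shape (the canonical definition lives in spectral.core.llm.purposes; the value set below is the normative part):

```python
# Hypothetical sketch of spectral.core.llm.purposes.PurposeKey; the value
# set is fixed by the taxonomy above, the StrEnum shape is an assumption.
from enum import StrEnum


class PurposeKey(StrEnum):
    SCORING = "scoring"
    DETECTION = "detection"
    REASONING = "reasoning"
    AGENT_TURN = "agent_turn"
    AGENT_TOOL = "agent_tool"
    WORLD_AGENT = "world_agent"
    CUSTOMER_REPLAY = "customer_replay"
    EMBEDDING = "embedding"
```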

The purpose taxonomy is also the cost-control lever. A scan exercises every purpose: scoring runs on the high-volume cost-optimized tier (cheap models, lots of calls); reasoning and world_agent run on the highest-capability tier (fewer calls, more expensive); detection and agent_* sit between. The intentional asymmetry — capability where it pays off, throughput where it doesn’t — is what keeps per-scan unit economics workable as workspace scale grows. Concrete unit-cost ranges will sharpen as Spectral observes real workspace usage; the structure that makes those ranges defensible (purpose-level routing plus workspace-override governance) is in place today.


Resolution precedence: customer-fixed > workspace-override > global-default, at per-purpose granularity.

Override classification per purpose:

  • locked — platform-only (default for detection, customer_replay)
  • operator_allowed — workspace can override within platform-approved IDs (default for scoring, reasoning, agent_*)
  • open — any supported model (enterprise contract tier)

ResolvedProfile carries the chosen model ID plus the enforcement envelope (rate limit, budget, fallback chain, defaults). The resolver is a pure function in each context’s application layer: resolve_profile(workspace_id, purpose) → ResolvedProfile.
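
A sketch of the precedence walk. The layer inputs are passed explicitly here to keep the function pure and self-contained; the real application layer fetches them per workspace, and the envelope field names are representative, not quoted:

```python
# Illustrative resolver sketch; only the precedence order is normative.
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedProfile:
    model_id: str
    rate_limit_rpm: int
    daily_budget_usd: float
    fallback_chain: tuple[str, ...]


def resolve_profile(
    purpose: str,
    *,
    customer_fixed: dict[str, ResolvedProfile],
    workspace_overrides: dict[str, ResolvedProfile],
    global_defaults: dict[str, ResolvedProfile],
) -> ResolvedProfile:
    """Precedence walk: customer-fixed > workspace-override > global-default."""
    for layer in (customer_fixed, workspace_overrides, global_defaults):
        if purpose in layer:
            return layer[purpose]
    raise LookupError(f"no profile configured for purpose {purpose!r}")
```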

ModelProfile.credential_source ∈ {"spectral_managed", "workspace_byo"} plus credential_ref: str | None. Today, all workspace credentials default to spectral_managed (platform-provisioned LLM credentials); workspace_byo reserves the path for customer-supplied subscriptions or API keys, with the AEAD-encrypted storage and binding mechanics to be populated when customer subscription federation ships.
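
A minimal sketch of those fields; credential_source and credential_ref are the documented names, the surrounding dataclass shape is illustrative:

```python
# Hypothetical ModelProfile shape; only the two credential fields
# are taken from the spec above.
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ModelProfile:
    model_id: str
    credential_source: Literal["spectral_managed", "workspace_byo"] = "spectral_managed"
    credential_ref: str | None = None  # reserved until subscription federation ships
```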


Rate limiting is a token bucket per (workspace_id, purpose), plus a per-workspace per-purpose daily spend cap. A TenantScopedLLMProvider wrapper around pydantic-ai applies the rate-limit and budget envelope; repositories in worlds or platform cannot bypass it.
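
A minimal in-memory sketch of the bucket, assuming illustrative capacity and refill numbers (the production ledger also keys on context; see the isolation note below):

```python
# Token bucket keyed per (workspace_id, purpose); numbers are illustrative.
import time
from collections import defaultdict


class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.updated = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per (workspace_id, purpose): 60-call burst, 1 call/sec sustained.
buckets: dict[tuple[str, str], TokenBucket] = defaultdict(
    lambda: TokenBucket(capacity=60, refill_per_sec=1.0)
)
```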

Cost source: genai-prices, with a fallback registry in spectral.core.llm.pricing for models the package does not yet know.
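
A sketch of the lookup order, using genai-prices’ documented calc_price/Usage API (hedged: exact return shape may differ by version); the fallback-registry entries here are illustrative, not real pricing:

```python
# genai-prices first, then the spectral.core.llm.pricing fallback registry.
from genai_prices import Usage, calc_price

# Illustrative fallback entries: model_id -> (USD/input token, USD/output token).
FALLBACK_USD_PER_TOKEN: dict[str, tuple[float, float]] = {
    "example-model-preview": (1.0e-6, 4.0e-6),
}


def cost_usd(model_id: str, input_tokens: int, output_tokens: int) -> float:
    try:
        price = calc_price(  # assumed genai-prices call shape
            Usage(input_tokens=input_tokens, output_tokens=output_tokens),
            model_id,
        )
        return float(price.total_price)
    except Exception:
        # Model unknown to genai-prices: fall back to the local registry.
        in_rate, out_rate = FALLBACK_USD_PER_TOKEN[model_id]
        return input_tokens * in_rate + output_tokens * out_rate
```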

Every call emits (see the sketch after this list):

  • An OTel span with GenAI-semconv plus Spectral attributes (spectral.purpose, spectral.workspace_id, spectral.account_id, spectral.context, spectral.scan_id?, spectral.agent_turn_id?).
  • A row to core.llm_usage for in-app budget enforcement.
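
A sketch of the span emission using the opentelemetry-api package; attribute names are the ones listed above, values and the function shape are illustrative:

```python
# Per-call telemetry sketch; the core.llm_usage row write is elided.
from opentelemetry import trace

tracer = trace.get_tracer("spectral.core.llm")


def record_llm_call(purpose: str, workspace_id: str, model_id: str,
                    input_tokens: int, output_tokens: int) -> None:
    with tracer.start_as_current_span("llm.call") as span:
        # GenAI semantic-convention attributes.
        span.set_attribute("gen_ai.request.model", model_id)
        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        # Spectral attributes (scan_id / agent_turn_id added when present).
        span.set_attribute("spectral.purpose", purpose)
        span.set_attribute("spectral.workspace_id", workspace_id)
```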

Budget enforcement reads rolling spend from core.llm_usage; the per-call check fails closed with Result[BudgetExceeded] when the daily cap is exceeded.
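
A sketch of the fail-closed check. The Result wrapper is elided, and the BudgetExceeded shape and rolling-spend input are assumptions:

```python
# Fail-closed budget check: any spend at or above the cap blocks the call.
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetExceeded:
    workspace_id: str
    purpose: str
    spent_usd: float
    cap_usd: float


def check_daily_budget(workspace_id: str, purpose: str,
                       rolling_spend_usd: float,
                       cap_usd: float) -> BudgetExceeded | None:
    """Returns the error value at/over the cap; None means proceed."""
    if rolling_spend_usd >= cap_usd:
        return BudgetExceeded(workspace_id, purpose, rolling_spend_usd, cap_usd)
    return None
```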

Admin queries against core.llm_usage are app-layer-gated to workspace admins; RLS scopes the rows to the calling workspace, and admin-only access is enforced via scope checks on the route. Ops staff query platform-wide via the Supabase service_role connection (which is exempt from RLS policies). A core.llm_usage_daily rollup serves cheap “spend this month” queries.

Rate-limit budgets are independent per context — a spectral.worlds purpose burning its budget must not starve a spectral.platform purpose, and vice versa. TenantScopedLLMProvider keys its budget ledger on (workspace_id, context, purpose), not just (workspace_id, purpose). An isolation test exercises this directly: the test exhausts one context’s budget for a purpose, then verifies the same purpose still resolves successfully under the other context’s budget. The test fails if either context’s quota leaks into the other’s accounting.
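
A pytest-style sketch of that isolation test. The real test drives TenantScopedLLMProvider; the stand-in ledger here exists only to show the (workspace_id, context, purpose) keying:

```python
# Context-isolation test sketch; BudgetLedger is a hypothetical stand-in.
class BudgetLedger:
    def __init__(self, cap_per_key: int) -> None:
        self.cap = cap_per_key
        self.spent: dict[tuple[str, str, str], int] = {}

    def try_acquire(self, workspace_id: str, context: str, purpose: str) -> bool:
        key = (workspace_id, context, purpose)
        if self.spent.get(key, 0) >= self.cap:
            return False
        self.spent[key] = self.spent.get(key, 0) + 1
        return True


def test_budget_isolation_between_contexts() -> None:
    ledger = BudgetLedger(cap_per_key=5)

    # Exhaust the worlds context's budget for the purpose.
    while ledger.try_acquire("ws-1", "spectral.worlds", "reasoning"):
        pass
    assert not ledger.try_acquire("ws-1", "spectral.worlds", "reasoning")

    # The same purpose must still succeed under the platform context.
    assert ledger.try_acquire("ws-1", "spectral.platform", "reasoning")
```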


Degradation operates at two levels:

  • Provider-level (transport). HTTP 5xx/429 → exponential backoff retry (max 3) on the same provider, then the next provider in the purpose’s fallback chain. A per-tenant retry budget prevents one bad workspace from exhausting provider rate limits for the whole population.
  • Purpose-level (quality). Consecutive validation/refusal failures on a purpose within a window trigger an upgrade rather than a lateral fallback: DegradationPolicy(trigger=N_failures_in_window, action=upgrade_to_purpose), sketched after this list. A per-eval-dimension override carries forward as scan-scoped policy.
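
A minimal sketch of the quality-level policy; the field names follow the DegradationPolicy(trigger=…, action=…) shape quoted above, and the concrete numbers are illustrative:

```python
# Quality-level degradation policy sketch.
from dataclasses import dataclass


@dataclass(frozen=True)
class DegradationPolicy:
    n_failures: int          # consecutive validation/refusal failures...
    window_seconds: float    # ...within this window...
    upgrade_to_purpose: str  # ...escalate to this higher-capability purpose


# Illustrative: three scoring failures inside a minute escalate to reasoning.
SCORING_UPGRADE = DegradationPolicy(n_failures=3, window_seconds=60.0,
                                    upgrade_to_purpose="reasoning")
```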

Circuit transitions emit domain events (llm.circuit.opened, llm.circuit.closed) consumed by the observability stack and the Spectral Agent.


core.llm_profiles persists (id, version, active_at, deactivated_at, created_by, audit_log). Activation is append-only; rollback re-activates a prior version. The schema reserves a variant: str | None column for A/B routing without committing the resolver to branch on it today.
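
A sketch of the append-only activation semantics over the columns named above; the in-memory row list stands in for the table, and the storage layer is elided:

```python
# Append-only activation: rows are only timestamped, never deleted.
from datetime import datetime, timezone


def activate(rows: list[dict], profile_id: str, version: int) -> None:
    """Activate one version; rollback is this same call with a prior version."""
    now = datetime.now(timezone.utc)
    for row in rows:
        if row["id"] == profile_id and row["deactivated_at"] is None:
            row["deactivated_at"] = now  # close out the currently active version
    for row in rows:
        if row["id"] == profile_id and row["version"] == version:
            row["active_at"] = now
            row["deactivated_at"] = None  # the newly active version
```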

Governance: locked-override modifications require a platform-role audit entry; operator_allowed changes are workspace-admin-auditable.


Default fallback chains stay within the same provider family to preserve prompt-caching affinity:

  • reasoning, agent_turn, agent_tool, world_agent: Claude Opus 4.7 → Claude Sonnet 4.6 → Claude Haiku 4.5
  • scoring, detection: Gemini 2.5 Flash → Gemini 3.1 Flash-Lite
  • customer_replay: matches the customer agent’s model
  • embedding: per embeddings D11 ladder

Cross-provider fallback is configurable per profile, not a default.
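
The defaults above expressed as data; the model-ID strings are illustrative placeholders, not pinned provider API identifiers:

```python
# Default fallback chains per purpose (illustrative model-ID strings).
_CLAUDE_LADDER = ("claude-opus-4.7", "claude-sonnet-4.6", "claude-haiku-4.5")
_GEMINI_LADDER = ("gemini-2.5-flash", "gemini-3.1-flash-lite")

DEFAULT_FALLBACK_CHAINS: dict[str, tuple[str, ...]] = {
    "reasoning": _CLAUDE_LADDER,
    "agent_turn": _CLAUDE_LADDER,
    "agent_tool": _CLAUDE_LADDER,
    "world_agent": _CLAUDE_LADDER,
    "scoring": _GEMINI_LADDER,
    "detection": _GEMINI_LADDER,
    # customer_replay mirrors the customer agent's model at resolve time;
    # embedding follows the embeddings D11 ladder.
}
```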


tools/quality/check_llm_sdk_allowlist.py (pre-push tier) asserts that litellm is absent from the uv dependency tree and forbids direct imports of raw provider SDKs (anthropic, openai, google.generativeai) in Spectral code. All calls flow through pydantic-ai, or through LangGraph/init_chat_model with the langchain-anthropic / langchain-openai / langchain-google adapters.
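
A condensed sketch of what that check asserts; the real script’s structure may differ, and the src/spectral source root is an assumption:

```python
# Allowlist check sketch: no litellm in the tree, no raw provider SDK imports.
import ast
import pathlib
import subprocess
import sys

FORBIDDEN_IMPORTS = {"anthropic", "openai", "google.generativeai"}


def main() -> int:
    # 1. litellm must not appear anywhere in the dependency tree.
    tree = subprocess.run(["uv", "tree"], capture_output=True, text=True).stdout
    if "litellm" in tree:
        print("litellm found in uv tree")
        return 1

    # 2. Spectral code must not import raw provider SDKs directly.
    for path in pathlib.Path("src/spectral").rglob("*.py"):  # assumed layout
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            for name in names:
                if any(name == f or name.startswith(f + ".")
                       for f in FORBIDDEN_IMPORTS):
                    print(f"{path}: forbidden import {name}")
                    return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```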


FakeLLMProvider (in spectral.core.llm.testing) implements the LLMProvider protocol and returns canned responses keyed by purpose plus content-class. Integration tests use VCR-style cassettes; nightly drift detection compares live-provider output against cassettes. See testing for the full posture.
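
A sketch of the fake, assuming an LLMProvider protocol with a single complete() method; the real protocol surface is defined in spectral.core.llm, and the canned payloads are illustrative:

```python
# FakeLLMProvider sketch: canned responses keyed by (purpose, content-class).
from typing import Protocol


class LLMProvider(Protocol):  # assumed protocol shape
    def complete(self, purpose: str, prompt: str,
                 content_class: str = "default") -> str: ...


class FakeLLMProvider:
    def __init__(self, canned: dict[tuple[str, str], str]) -> None:
        self._canned = canned

    def complete(self, purpose: str, prompt: str,
                 content_class: str = "default") -> str:
        return self._canned[(purpose, content_class)]


fake = FakeLLMProvider({("scoring", "default"): '{"score": 0.8}'})
assert fake.complete("scoring", "rate this transcript") == '{"score": 0.8}'
```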