ADR-036: Observability stack — OTel substrate, three-stream LLM trace architecture, content-class routing

Status: Accepted (2026-04-20)

Context

SPEC-285 fixed Spectral’s observability principles: structured JSON logs via structlog with seven canonical fields (workspace_id, account_id, scan_id, world_model_version, bc, phase, trace_id); an LLM-telemetry minimum shape (10 fields plus a closed purpose enum); W3C Trace Context propagation across HTTP, domain events, agent tool calls, and worker-job dispatch; and the customer-data versus operational-data category boundary. Tool choices were explicitly deferred to TA-16.

Alpha posture is the hard constraint: solo-builder growing to 2–3 engineers, business hours only, no paging, no out-of-hours on-call. Founder time is the scarce resource. TA-10 (ADR-035) flagged Pydantic Logfire and Langfuse as strong LLM-observability candidates; the spike’s research expanded the field.

The spike produced several material findings:

Pydantic Logfire: 10M spans/month on the free tier, same vendor as pydantic-ai, OTel-native (its data model is the OTel data model). Not named in the original decision surface.
Langfuse acquired by ClickHouse (Jan 2026). MIT core preserved; EE directory proprietary. Self-host path unchanged; Cloud EU-residency posture uncertain.
Helicone — Mintlify acquisition → maintenance mode. Dead.
LangSmith: ~$840/month for three engineers at our trace volume, and it deepens LangChain exposure. Rule out.
Langfuse self-host — heavy (Postgres + ClickHouse + Redis + blob + 2 containers; ~16 GiB RAM recommended for prod). Real five-datastore ops tax during alpha.
Sentry self-host — 20+ containers, Kafka, ClickHouse, Snuba. Hard no at alpha.
Class-based content classification: most Spectral LLM traffic is operations-owned (World Agent, Ops Agent) or test-agent-generated synthetic content. Only customer scans and Spectral Agent conversations carry customer PII, so blanket payload stripping would be over-protection.

The ContentClass taxonomy was retroactively rebalanced by TA-3 D11 / SPEC-306 naming coherence. This ADR uses the post-rebalance taxonomy (PLATFORM / OPERATIONS / SYNTHETIC).

Decision

D1 — OpenTelemetry is the telemetry substrate

All spans, metrics, and logs-with-context emit through the OTel SDK. The OTel Collector is the fan-out point. Every downstream vendor is swappable via OTEL_EXPORTER_OTLP_ENDPOINT — no proprietary agent lock-in in the critical path.
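To illustrate why a single environment variable is the swap point, the following is a minimal sketch of how an OTLP/HTTP trace exporter resolves its target URL, based on the OpenTelemetry exporter specification as commonly implemented. The function name `resolve_traces_endpoint` and the host `collector` are hypothetical; in practice the OTel SDK performs this resolution itself.

```python
import os
from typing import Mapping, Optional

# Default OTLP/HTTP base endpoint when nothing is configured.
DEFAULT_OTLP_HTTP_ENDPOINT = "http://localhost:4318"


def resolve_traces_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the URL an OTLP/HTTP trace exporter would post spans to.

    Hypothetical helper for illustration only; not part of the OTel SDK.
    """
    env = os.environ if env is None else env

    # A signal-specific variable is used verbatim and takes precedence.
    specific = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if specific:
        return specific

    # The generic variable is a base URL; the per-signal path is appended.
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_OTLP_HTTP_ENDPOINT)
    return base.rstrip("/") + "/v1/traces"


# Pointing the application at a (hypothetical) local Collector sidecar:
url = resolve_traces_endpoint({"OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4318"})
```

Because application code only ever targets the Collector, changing a downstream vendor is a Collector configuration change rather than an application change.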

D2 — Ops observability: OTel Collector → platform-native storage + Grafana UI

Storage backend binding deferred to ADR-046 / ADR-048 / ADR-049 (deployment topology). Preferred path: use whatever the hosting platform’s native logging / tracing / metrics provides. Grafana Cloud free-tier LGTM is the deployment-agnostic fallback. Self-hosted LGTM on Hetzner is the post-alpha sovereignty / cost-ceiling escape.

D3 — LLM observability is three streams with content-class-driven routing

Stream A — payload-free operational spans. OTel SDK emits span structure plus gen_ai.* metadata (model, latency, token counts, cost, purpose, bc, content_class, workspace_id as label). For PLATFORM content_class, gen_ai.prompt.* / gen_ai.completion.* / gen_ai.tool.arguments are stripped before export. Exported via OTel Collector to Pydantic Logfire (SaaS free tier) and the D2 ops storage backend.
Stream B — payload-bearing records. Full prompts, completions, tool-call arguments, intermediate reasoning content. Persisted synchronously to business-object-contextual tables inside Spectral’s sovereignty boundary. Schemas owned by downstream spikes (ADR-043 for agent conversations; ADR-060 for agent tool invocation; per-context for scan / world-model contexts). Not landed by this ADR.
Stream C — cost/usage summary. core.llm_usage table — 10 fields (SPEC-285 shape) plus content_class plus PurposeKey. Platform-scoped, no payload. Authoritative for cost attribution and rate-limit accounting (pairs with ADR-035 D6’s daily cap). Landed by this ADR.

All three streams share trace_id via W3C propagation. Debugging flow: Logfire shows the span tree plus metadata → join via trace_id to Stream C for cost, to Stream B for payload inspection (Supabase SQL editor for alpha; Spectral admin UI post-alpha).
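The shared join key can be recovered from a W3C `traceparent` header. The sketch below assumes version `00` headers only and uses a hypothetical helper name; it is not Spectral's propagation code, which would normally rely on the OTel SDK's propagators.

```python
import re

# W3C Trace Context, version 00:
#   00-<32 hex trace-id>-<16 hex parent-id>-<2 hex trace-flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def trace_id_from_traceparent(header: str) -> str:
    """Extract the trace-id that Streams A, B, and C would all carry."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"not a version-00 traceparent header: {header!r}")
    trace_id, parent_id, _flags = match.groups()
    # The spec treats all-zero identifiers as invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace-id or parent-id is invalid")
    return trace_id


# Example header taken from the W3C Trace Context specification.
trace_id = trace_id_from_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

The resulting `trace_id` is the value to filter on when moving from a Logfire span tree to the Stream C row or the Stream B payload tables.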

D4 — Error tracking: Sentry SaaS

sentry-sdk[fastapi] plus structlog-sentry. Consumes OTel span context so trace_id correlates across Sentry / Grafana / Logfire. Escape hatch: GlitchTip self-host (Sentry-SDK-compatible; DSN-only swap).

D5 — Structured logging via structlog with SPEC-285’s seven canonical fields

Context pre-bound at composition root (bc=worlds | bc=platform | bc=core | bc=app:api | bc=app:workers | bc=app:operations). JSON output. Logs ship via the OTel Collector log receiver to the D2 storage backend. Error-level events fork to Sentry via structlog-sentry.
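As a rough illustration of the seven-field contract, the sketch below writes a function with structlog's processor signature `(logger, method_name, event_dict)` using only the standard library, so it runs without structlog installed. Filling unbound fields with `None` is an assumption made here for illustration; SPEC-285 may specify different behavior, and `ensure_canonical_fields` is a hypothetical name.

```python
import json

# The seven canonical fields listed in SPEC-285.
CANONICAL_FIELDS = (
    "workspace_id", "account_id", "scan_id", "world_model_version",
    "bc", "phase", "trace_id",
)


def ensure_canonical_fields(logger, method_name, event_dict):
    """structlog-style processor: every canonical key is present in the output.

    Assumption: fields not bound at this call site are emitted as null.
    """
    for field in CANONICAL_FIELDS:
        event_dict.setdefault(field, None)
    return event_dict


def render_json(event_dict):
    """Stand-in for a JSON renderer at the end of the processor chain."""
    return json.dumps(event_dict, sort_keys=True)


# Context pre-bound at a (hypothetical) workers composition root.
bound_context = {"bc": "app:workers"}

line = render_json(
    ensure_canonical_fields(
        None,
        "info",
        {
            **bound_context,
            "event": "job.started",
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        },
    )
)
```

In a real setup a function of this shape would sit in structlog's `processors` list ahead of the JSON renderer, with `bc` bound once per composition root rather than per call.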

D6 — Content-class-driven stripping with two-layer runtime enforcement

Taxonomy (ContentClass enum after the TA-3 D11 / SPEC-306 rebalance):

PLATFORM — customer content processed or generated in the platform context (conformance-track customer traces, Spectral Agent conversations, customer replay)
OPERATIONS — Spectral-operated reasoning (World Agent, Ops Agent, internal distillation)
SYNTHETIC — test-agent-generated synthetic content (no customer PII)

Purpose-to-class resolver (composition-root contract, never per-call developer discretion):

world_agent → always OPERATIONS
scoring / detection / reasoning → SYNTHETIC if scan track = A, PLATFORM if scan track = B
agent_turn (Spectral Agent) → always PLATFORM
customer_replay → always PLATFORM
agent_tool → inherit from parent span’s content_class
embedding → caller-determined
Ops Agent / platform-internal reasoning → OPERATIONS
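The resolver rules above can be sketched as a single function. This is a minimal illustration, not the shipped resolver: the purpose strings are taken from the list above rather than from the real PurposeKey enum, the lowercase enum values other than `"platform"` are assumed, the function and parameter names are hypothetical, and failing closed on unknown purposes is a design choice made for this sketch. The Ops Agent rule is omitted because this section does not name its purpose key.

```python
from enum import Enum
from typing import Optional


class ContentClass(str, Enum):
    PLATFORM = "platform"      # customer content
    OPERATIONS = "operations"  # Spectral-operated reasoning (value assumed)
    SYNTHETIC = "synthetic"    # test-agent-generated content (value assumed)


# Purposes whose class never varies.
_FIXED = {
    "world_agent": ContentClass.OPERATIONS,
    "agent_turn": ContentClass.PLATFORM,
    "customer_replay": ContentClass.PLATFORM,
}
# Scan purposes resolve from the scan track.
_SCAN_PURPOSES = {"scoring", "detection", "reasoning"}
_TRACK_TO_CLASS = {"A": ContentClass.SYNTHETIC, "B": ContentClass.PLATFORM}


def resolve_content_class(
    purpose: str,
    *,
    scan_track: Optional[str] = None,
    parent_class: Optional[ContentClass] = None,
    caller_class: Optional[ContentClass] = None,
) -> ContentClass:
    if purpose in _FIXED:
        return _FIXED[purpose]
    if purpose in _SCAN_PURPOSES:
        if scan_track not in _TRACK_TO_CLASS:
            raise ValueError(f"{purpose!r} needs scan_track 'A' or 'B'")
        return _TRACK_TO_CLASS[scan_track]
    if purpose == "agent_tool":
        # Inherit from the parent span; refuse to guess if it is unknown.
        if parent_class is None:
            raise ValueError("agent_tool needs the parent span's content class")
        return parent_class
    if purpose == "embedding":
        if caller_class is None:
            raise ValueError("embedding needs a caller-determined content class")
        return caller_class
    # Fail closed: an unmapped purpose is an error, never a silent default.
    raise ValueError(f"no resolver rule for purpose {purpose!r}")
```

For example, `resolve_content_class("scoring", scan_track="B")` yields `ContentClass.PLATFORM`, while the same purpose on track A yields `ContentClass.SYNTHETIC`.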

Enforcement layer 1: content_class is a required field on LLMUsage (no default; mypy and pydantic enforce; every call site must declare).
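The effect of a required field with no default can be shown with a standard-library dataclass standing in for the pydantic model. This is a reduced, hypothetical two-field version of `LLMUsage` for illustration only; the real record carries the full SPEC-285 shape, and pydantic would raise a `ValidationError` where this sketch raises `TypeError`.

```python
from dataclasses import dataclass
from enum import Enum


class ContentClass(str, Enum):
    PLATFORM = "platform"
    OPERATIONS = "operations"  # value assumed
    SYNTHETIC = "synthetic"    # value assumed


@dataclass(frozen=True)
class LLMUsage:
    """Reduced stand-in: only the fields needed to show the rule."""

    purpose: str
    content_class: ContentClass  # no default, so every call site must declare it

    def __post_init__(self) -> None:
        # Runtime check standing in for pydantic validation; a bare string
        # such as "platform" is rejected, only the enum member is accepted.
        if not isinstance(self.content_class, ContentClass):
            raise TypeError("content_class must be a ContentClass member")


ok = LLMUsage(purpose="world_agent", content_class=ContentClass.OPERATIONS)

try:
    LLMUsage(purpose="world_agent")  # content_class omitted
    omitted_rejected = False
except TypeError:
    omitted_rejected = True
```

A static checker such as mypy flags the omitted argument before the code runs; the runtime failure is the second line of defence.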

Enforcement layer 2: spectral.core.telemetry.emit_llm_call() reads content_class; for PLATFORM it applies a span-attribute redaction hook before the OTel exporter sees the span. The OTel Collector transform processor is the backstop — drops gen_ai.prompt.*|gen_ai.completion.*|gen_ai.tool.arguments for any span where spectral.content_class == "platform".
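A redaction hook of this kind might look like the following sketch, which operates on a plain attribute dictionary rather than a real OTel span object. The function name is hypothetical and the concrete attribute keys in the example are illustrative; only the three drop patterns and the `spectral.content_class == "platform"` condition come from the text above.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict

# Attribute patterns dropped for PLATFORM-class spans.
REDACTED_PATTERNS = (
    "gen_ai.prompt.*",
    "gen_ai.completion.*",
    "gen_ai.tool.arguments",
)


def redact_span_attributes(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the attributes that is safe to hand to the exporter."""
    if attributes.get("spectral.content_class") != "platform":
        # Non-PLATFORM spans pass through unchanged.
        return dict(attributes)
    return {
        key: value
        for key, value in attributes.items()
        if not any(fnmatchcase(key, pattern) for pattern in REDACTED_PATTERNS)
    }


# Illustrative span attributes for a customer-facing agent turn.
span_attrs = {
    "spectral.content_class": "platform",
    "gen_ai.request.model": "example-model",
    "gen_ai.prompt.0.content": "customer text",
    "gen_ai.completion.0.content": "model reply",
    "gen_ai.tool.arguments": '{"q": "customer text"}',
}
safe_attrs = redact_span_attributes(span_attrs)
```

After redaction only the metadata keys remain, which matches what Stream A is meant to carry; the Collector transform processor repeats the same drop as a backstop if the in-process hook is ever bypassed.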

Audit query: SELECT count(*), purpose, content_class FROM core.llm_usage GROUP BY purpose, content_class — a queryable record of “what has crossed a third-party boundary.”
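The audit query can be tried locally against an in-memory SQLite database attached under the schema name `core`. SQLite is only a stand-in for the Postgres table, the two-column table is a reduced hypothetical shape, the sample rows are made up for the demonstration, and the `ORDER BY` is added here solely to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database so the name core.llm_usage resolves.
conn.execute("ATTACH DATABASE ':memory:' AS core")
conn.execute(
    "CREATE TABLE core.llm_usage ("
    "purpose TEXT NOT NULL, content_class TEXT NOT NULL)"
)
# Made-up sample rows, not real usage data.
conn.executemany(
    "INSERT INTO core.llm_usage VALUES (?, ?)",
    [
        ("world_agent", "operations"),
        ("world_agent", "operations"),
        ("agent_turn", "platform"),
        ("scoring", "synthetic"),
    ],
)

AUDIT_SQL = """
SELECT count(*), purpose, content_class
FROM core.llm_usage
GROUP BY purpose, content_class
ORDER BY purpose, content_class
"""
rows = conn.execute(AUDIT_SQL).fetchall()
conn.close()
```

Each returned row is a `(count, purpose, content_class)` tuple, so an unexpected pairing such as a customer-facing purpose recorded with a non-PLATFORM class stands out immediately.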

D7 — Alerting: Sentry alert rules consuming warn/error structured-log events

No separate alerting substrate. Alpha rule set: new-release 5xx alerts; daily LLM cost-cap breach (fires error when ADR-035 D6 cap trips); worker queue depth SLO warnings; storage free-tier ingestion >80%. Sentry’s native Slack/email integration is the routing layer.

D8 — Retention, alpha defaults

Logfire 30 days (free-tier default); Sentry Team default; Postgres core.llm_usage 90 days; the D2 storage backend retention inherits from whichever backend ADR-046 selects. Formal per-class policy lives in ADR-042 / TA-4.

D9 — `purpose` taxonomy reconciliation

SPEC-285’s observability-principles Codex page listed a 9-value purpose set; ADR-035 D3 landed the canonical 8-value PurposeKey enum. ADR-035’s enum wins — LLMUsage imports spectral.core.llm.purposes.PurposeKey. Codex page update is close-pass work.

D10 — `ContentClass` taxonomy lives in `spectral.core.llm.content_class`

A closed taxonomy: PLATFORM, OPERATIONS, SYNTHETIC. Resolver contracts live at each context’s composition root — scan composition root resolves scan-purpose calls using scan-track; agent composition roots resolve per their scope. Revisit trigger: if world_distill or any other operations-class purpose starts ingesting customer-supplied content, re-resolve its default classification.

Alternatives considered

Datadog. Rule out. $300–1000+/month at alpha scale; primary-source regret evidence overwhelming (Coinbase $65M; DHH $83k cancellation; AI startup Deductive’s 48-hour emergency migration in January 2026).

Honeycomb / Axiom. OTel-native but proprietary query languages (BubbleUp / APL); migration tax. Grafana’s OTel + PromQL/LogQL/TraceQL stack is more portable at the same or lower cost.

Langfuse Cloud (Core tier). Post-ClickHouse-acquisition US HQ creates a GDPR-residency question; ~$210/month at our LLM volumes; no material advantage over self-host if sovereignty becomes load-bearing.

Langfuse self-host from day one. Architecturally consistent (own-the-substrate matches Supabase, in-process control plane, single-DB-first). Real five-datastore ops tax during alpha not justified when the authoritative record (core.llm_usage plus Stream B tables) already lives in Postgres. Queued as a post-alpha escape.

LangSmith. ~$840/month at three engineers plus deepens LangChain lock-in. Hard rule-out.

Helicone. Maintenance mode post-Mintlify acquisition. Dead.

Arize Phoenix. ELv2 licensing; dominated by Langfuse at our scale.

Self-hosted LGTM on Hetzner from day one. Correct post-alpha choice; premature at alpha given founder-hour opportunity cost. Migration playbook documented in this ADR.

Sentry self-host. 20+ containers. Hard no at alpha. GlitchTip is the self-host fallback when/if we leave Sentry SaaS.

Fold error tracking into ops observability. Saves ~$26/month but loses Sentry’s stacktrace grouping, release-health, and deploy-diff ergonomics. False economy.

Blanket payload stripping across all LLM calls. Over-protection. Operations-owned (World/Ops Agent) and SYNTHETIC traffic carry no customer PII; stripping them would kneecap Logfire’s AI-specific debugging UI for the bulk of alpha traffic. Content-class-driven routing resolves this cleanly.

Dedicated Slack alert channel + Grafana alert rules. Adds an alerting substrate when Sentry already owns issue grouping plus Slack/email routing. Collapsed into D7.

Consequences

spectral.core.llm.content_class ships as the closed taxonomy used by Streams A/B/C.
LLMUsageRecord gains content_class plus account_id. Per the delta noted in the SPEC-319 verification, account_id was added because TA-1 D6 designates (account_id, workspace_id) as non-negotiable tenancy columns and SPEC-285 lists account_id as a canonical log field. It is nullable for OPERATIONS-only calls (Ops / World Agent), consistent with workspace_id.
core.llm_usage is the first resident of the core schema per ADR-032 D2; it lands in migration 20260420231500_core_llm_usage.sql.
Three SaaS vendors in the alpha operational path — Grafana Cloud (if not platform-native), Pydantic Logfire, Sentry. All OTLP-consuming; all with credible self-host escape hatches (LGTM, Langfuse, GlitchTip). Bounded vendor risk.
OTel Collector is a standard sidecar — hosting choices must support OTLP egress and (if platform-native storage is chosen per D2) native logging/tracing/metrics surfaces.
Architecture-validator extension queued for close-pass — external-package allowlist per Clean Architecture layer (vendor SDK imports in composition roots only).
SaaS-to-self-host migration playbook inlined. Triggers documented in the close-pass Codex page (system-design/observability-stack.mdx): SaaS observability bill > $300/month; first enterprise DPA demanding a SOC2-audited subprocessor list that self-hosting satisfies; Grafana storage free-tier ingestion sustained >80% for 2 months; Langfuse Cloud EU-residency clarification; any operations-class purpose ingesting customer-supplied content (re-resolve content class).

References

ADR-065 — spectral.core admission discipline
ADR-031 — single-library structure
ADR-035 — PurposeKey; LLMProvider; cost-tracking contract
ADR-042 — TA-4 retention policy framework (per-class formal policy)
ADR-043 — Stream B for agent conversations
ADR-046 — D2 storage backend binding
ADR-048 — workers tier composition
ADR-060 — Stream B for agent tool invocation
TA-16 disposition — SPEC-319 comment 593257c8
TA-16 verification — SPEC-319 comment 69689e96
TA-16 amendment (ContentClass rebalance) — SPEC-319 comment 1fe81d1d
src/spectral/core/llm/content_class.py — ContentClass enum
src/spectral/core/llm/usage.py — LLMUsageRecord
supabase/migrations/20260420231500_core_llm_usage.sql — core.llm_usage table
Codex system-design/foundations/observability-principles.mdx — close-pass updates
Codex system-design/observability-stack.mdx — close-pass new page
