ADR-036: Observability stack — OTel substrate, three-stream LLM trace architecture, content-class routing
Status: Accepted (2026-04-20)
Context
SPEC-285 fixed Spectral’s observability principles: structured JSON logs via structlog with seven canonical fields (workspace_id, account_id, scan_id, world_model_version, bc, phase, trace_id); LLM-telemetry minimum shape (10 fields plus a closed purpose enum); W3C Trace Context propagation across HTTP / domain events / agent tool calls / worker-job dispatch; the customer-data versus operational-data category boundary. Tool choices were explicitly deferred to TA-16.
Alpha posture is the hard constraint: solo-builder growing to 2–3 engineers, business hours only, no paging, no out-of-hours on-call. Founder time is the scarce resource. TA-10 (ADR-035) flagged Pydantic Logfire and Langfuse as strong LLM-observability candidates; the spike’s research expanded the field.
Several material findings during the spike:
- Pydantic Logfire — 10M spans/month free, same vendor as pydantic-ai, OTel-native (its data model is the OTel data model). Not named in the original decision surface.
- Langfuse acquired by ClickHouse (Jan 2026). MIT core preserved; EE directory proprietary. Self-host path unchanged; Cloud EU-residency posture uncertain.
- Helicone — Mintlify acquisition → maintenance mode. Dead.
- LangSmith — ~$840/month at our engineer count × trace volume, and it deepens LangChain exposure. Rule out.
- Langfuse self-host — heavy (Postgres + ClickHouse + Redis + blob + 2 containers; ~16 GiB RAM recommended for prod). Real five-datastore ops tax during alpha.
- Sentry self-host — 20+ containers, Kafka, ClickHouse, Snuba. Hard no at alpha.
- Class-based content classification — most Spectral LLM traffic is Spectral-operated (World Agent, Ops Agent) or test-agent-generated synthetic content. Only customer scans and Spectral Agent conversations carry customer PII. Blanket payload stripping was over-protection.
The ContentClass taxonomy was retroactively rebalanced by TA-3 D11 / SPEC-306 naming coherence. This ADR uses the post-rebalance taxonomy (PLATFORM / OPERATIONS / SYNTHETIC).
Decision
D1 — OpenTelemetry is the telemetry substrate
All spans, metrics, and logs-with-context emit through the OTel SDK. The OTel Collector is the fan-out point. Every downstream vendor is swappable via OTEL_EXPORTER_OTLP_ENDPOINT — no proprietary agent lock-in in the critical path.
D2 — Ops observability: OTel Collector → platform-native storage + Grafana UI
Storage backend binding deferred to ADR-046 / ADR-048 / ADR-049 (deployment topology). Preferred path: use whatever the hosting platform’s native logging / tracing / metrics provides. Grafana Cloud free-tier LGTM is the deployment-agnostic fallback. Self-hosted LGTM on Hetzner is the post-alpha sovereignty / cost-ceiling escape.
D3 — LLM observability is three streams with content-class-driven routing
- Stream A — payload-free operational spans. OTel SDK emits span structure plus gen_ai.* metadata (model, latency, token counts, cost, purpose, bc, content_class, workspace_id as label). For PLATFORM content_class, gen_ai.prompt.* / gen_ai.completion.* / gen_ai.tool.arguments are stripped before export. Exported via OTel Collector to Pydantic Logfire (SaaS free tier) and the D2 ops storage backend.
- Stream B — payload-bearing records. Full prompts, completions, tool-call arguments, intermediate reasoning content. Persisted synchronously to business-object-contextual tables inside Spectral’s sovereignty boundary. Schemas owned by downstream spikes (ADR-043 for agent conversations; ADR-060 for agent tool invocation; per-context for scan / world-model contexts). Not landed by this ADR.
- Stream C — cost/usage summary. core.llm_usage table — 10 fields (SPEC-285 shape) plus content_class plus PurposeKey. Platform-scoped, no payload. Authoritative for cost attribution and rate-limit accounting (pairs with ADR-035 D6’s daily cap). Landed by this ADR.
All three streams share trace_id via W3C propagation. Debugging flow: Logfire shows the span tree plus metadata → join via trace_id to Stream C for cost, to Stream B for payload inspection (Supabase SQL editor for alpha; Spectral admin UI post-alpha).
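The trace_id that joins the three streams is the 32-hex-character trace-id field of the W3C traceparent header. The sketch below shows how that value could be pulled out of a version-00 header, for example when correlating a Stream B row by hand. The function name is hypothetical; in the running system the OTel SDK propagators are expected to handle this rather than hand-written parsing.

```python
import re
from typing import Optional

# W3C Trace Context, version 00 layout: version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def trace_id_from_traceparent(header: str) -> Optional[str]:
    """Return the 32-hex-char trace_id, or None if the header is not valid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    # The spec marks version "ff" as invalid and forbids all-zero trace/parent IDs.
    if match["version"] == "ff":
        return None
    if set(match["trace_id"]) == {"0"} or set(match["parent_id"]) == {"0"}:
        return None
    return match["trace_id"]
```

The returned value is the key to use when joining Logfire spans to Stream C and Stream B rows.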
D4 — Error tracking: Sentry SaaS
sentry-sdk[fastapi] plus structlog-sentry. Consumes OTel span context so trace_id correlates across Sentry / Grafana / Logfire. Escape hatch: GlitchTip self-host (Sentry-SDK-compatible; DSN-only swap).
D5 — Structured logging via structlog with SPEC-285’s seven canonical fields
Context pre-bound at composition root (bc=worlds | bc=platform | bc=core | bc=app:api | bc=app:workers | bc=app:operations). JSON output. Logs ship via the OTel Collector log receiver to the D2 storage backend. Error-level events fork to Sentry via structlog-sentry.
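One way to keep the seven-field shape stable is a structlog-style processor, that is, a plain callable taking (logger, method_name, event_dict). The sketch below uses only the standard library so it runs without structlog installed. The function name and the choice to null-fill missing fields are assumptions of this example, not something SPEC-285 or this ADR prescribes.

```python
from typing import Any, MutableMapping

# The seven canonical fields from SPEC-285, as listed in the Context section.
CANONICAL_FIELDS = (
    "workspace_id", "account_id", "scan_id",
    "world_model_version", "bc", "phase", "trace_id",
)


def ensure_canonical_fields(
    logger: Any, method_name: str, event_dict: MutableMapping[str, Any]
) -> MutableMapping[str, Any]:
    """Make every canonical field present on the event.

    Fields already bound (for example bc at the composition root) are kept;
    missing ones are set to None. Null-filling is an assumption of this sketch.
    """
    for field in CANONICAL_FIELDS:
        event_dict.setdefault(field, None)
    return event_dict
```

Placed before the JSON renderer in a processor chain, a callable like this would give every log line the same keys regardless of which composition root emitted it.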
D6 — Content-class-driven stripping with two-layer runtime enforcement
Taxonomy (ContentClass enum after the SPEC-306 D11 rebalance):
- PLATFORM — customer content processed or generated in the platform context (conformance-track customer traces, Spectral Agent conversations, customer replay)
- OPERATIONS — Spectral-operated reasoning (World Agent, Ops Agent, internal distillation)
- SYNTHETIC — test-agent-generated synthetic content (no customer PII)
Purpose-to-class resolver (composition-root contract, never per-call developer discretion):
- world_agent → always OPERATIONS
- scoring / detection / reasoning — SYNTHETIC if scan track = A, PLATFORM if track = B
- agent_turn (Spectral Agent) → always PLATFORM
- customer_replay → always PLATFORM
- agent_tool → inherit from parent span’s content_class
- embedding → caller-determined
- Ops Agent / platform-internal reasoning → OPERATIONS
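The resolver rules above can be sketched as a single pure function. This is illustrative only: the function name, its keyword parameters, and the use of plain strings for purposes (rather than the PurposeKey enum) are assumptions. The lowercase value "platform" follows the Collector condition in D6; the other two enum values are assumed by analogy. The Ops Agent rule is not mapped here because the ADR does not name its purpose key.

```python
from enum import Enum
from typing import Optional


class ContentClass(str, Enum):
    PLATFORM = "platform"
    OPERATIONS = "operations"  # value assumed by analogy with "platform"
    SYNTHETIC = "synthetic"    # value assumed by analogy with "platform"


_FIXED = {
    "world_agent": ContentClass.OPERATIONS,
    "agent_turn": ContentClass.PLATFORM,
    "customer_replay": ContentClass.PLATFORM,
}
_SCAN_PURPOSES = {"scoring", "detection", "reasoning"}


def resolve_content_class(
    purpose: str,
    *,
    scan_track: Optional[str] = None,
    parent_class: Optional[ContentClass] = None,
    caller_class: Optional[ContentClass] = None,
) -> ContentClass:
    """Resolve a purpose to a ContentClass; raise instead of guessing."""
    if purpose in _FIXED:
        return _FIXED[purpose]
    if purpose in _SCAN_PURPOSES:
        if scan_track == "A":
            return ContentClass.SYNTHETIC
        if scan_track == "B":
            return ContentClass.PLATFORM
        raise ValueError(f"{purpose} needs scan_track 'A' or 'B'")
    if purpose == "agent_tool":
        if parent_class is None:
            raise ValueError("agent_tool inherits from the parent span's class")
        return parent_class
    if purpose == "embedding":
        if caller_class is None:
            raise ValueError("embedding class must be supplied by the caller")
        return caller_class
    raise ValueError(f"no resolver rule for purpose {purpose!r}")
```

Raising on missing inputs, rather than falling back to a default class, keeps the sketch in line with the "never per-call developer discretion" contract.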
Enforcement layer 1: content_class is a required field on LLMUsage (no default; mypy and pydantic enforce; every call site must declare).
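A minimal stand-in for layer 1, using a standard-library dataclass in place of the pydantic model so the example has no dependencies. The class name and the three fields shown are hypothetical; the real record follows the SPEC-285 shape, which is not reproduced here. The lowercase values other than "platform" are assumed.

```python
from dataclasses import dataclass

# "platform" matches the Collector condition in this ADR; the other two are assumed.
_ALLOWED_CLASSES = {"platform", "operations", "synthetic"}


@dataclass(frozen=True)
class LLMUsageSketch:
    # Hypothetical subset of fields, for illustration only.
    trace_id: str
    purpose: str
    content_class: str  # deliberately no default: every call site must declare it

    def __post_init__(self) -> None:
        # Reject values outside the closed taxonomy at construction time.
        if self.content_class not in _ALLOWED_CLASSES:
            raise ValueError(f"unknown content_class: {self.content_class!r}")
```

Omitting content_class fails at construction with a TypeError, which is the runtime half of the guarantee; mypy would flag the same call site statically.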
Enforcement layer 2: spectral.core.telemetry.emit_llm_call() reads content_class; for PLATFORM it applies a span-attribute redaction hook before the OTel exporter sees the span. The OTel Collector transform processor is the backstop — drops gen_ai.prompt.*|gen_ai.completion.*|gen_ai.tool.arguments for any span where spectral.content_class == "platform".
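The redaction step in layer 2 amounts to filtering span attributes by key pattern when the class is "platform". The sketch below works on a plain attribute mapping rather than a real OTel span object, and the function name is hypothetical; the actual hook inside emit_llm_call() may be wired differently.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, Mapping

# Attribute key patterns named in this ADR as payload-bearing.
PAYLOAD_ATTRIBUTE_PATTERNS = (
    "gen_ai.prompt.*",
    "gen_ai.completion.*",
    "gen_ai.tool.arguments",
)


def redact_span_attributes(attributes: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of the attributes with payload keys dropped for PLATFORM spans."""
    if attributes.get("spectral.content_class") != "platform":
        # OPERATIONS and SYNTHETIC spans keep their payload attributes.
        return dict(attributes)
    return {
        key: value
        for key, value in attributes.items()
        if not any(fnmatchcase(key, pattern) for pattern in PAYLOAD_ATTRIBUTE_PATTERNS)
    }
```

Because the Collector transform processor applies the same three patterns, a span that slips past this hook would still be stripped before leaving the sovereignty boundary.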
Audit query: SELECT count(*), purpose, content_class FROM core.llm_usage GROUP BY purpose, content_class — a queryable record of “what has crossed a third-party boundary.”
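To show the shape of the audit query's result, the sketch below runs it against an in-memory SQLite database. SQLite stands in for Postgres here, the two-column table is a deliberately reduced, hypothetical version of core.llm_usage, and the sample rows are made up for the demonstration.

```python
import sqlite3
from typing import Dict, Iterable, Tuple

AUDIT_QUERY = (
    "SELECT count(*), purpose, content_class "
    "FROM core.llm_usage GROUP BY purpose, content_class"
)


def run_audit(rows: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], int]:
    """Load (purpose, content_class) rows and return counts per pair."""
    conn = sqlite3.connect(":memory:")
    try:
        # Attach a second in-memory database named "core" so the
        # schema-qualified table name in the query resolves under SQLite.
        conn.execute("ATTACH DATABASE ':memory:' AS core")
        conn.execute(
            "CREATE TABLE core.llm_usage ("
            "purpose TEXT NOT NULL, content_class TEXT NOT NULL)"
        )
        conn.executemany("INSERT INTO core.llm_usage VALUES (?, ?)", rows)
        return {(p, c): n for n, p, c in conn.execute(AUDIT_QUERY)}
    finally:
        conn.close()
```

Any pair whose content_class is not "platform" represents calls whose payload attributes were eligible to reach Logfire, which is what makes the grouped counts useful as a boundary audit.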
D7 — Alerting: Sentry alert rules consuming warn/error structured-log events
No separate alerting substrate. Alpha rule set: new-release 5xx alerts; daily LLM cost-cap breach (fires error when ADR-035 D6 cap trips); worker queue depth SLO warnings; storage free-tier ingestion >80%. Sentry’s native Slack/email integration is the routing layer.
D8 — Retention, alpha defaults
Logfire 30 days (free-tier default); Sentry Team default; Postgres core.llm_usage 90 days; the D2 storage backend retention inherits from whichever backend ADR-046 selects. Formal per-class policy lives in ADR-042 / TA-4.
D9 — purpose taxonomy reconciliation
SPEC-285’s observability-principles Codex page listed a 9-value purpose set; ADR-035 D3 landed the canonical 8-value PurposeKey enum. ADR-035’s enum wins — LLMUsage imports spectral.core.llm.purposes.PurposeKey. Codex page update is close-pass work.
D10 — ContentClass taxonomy lives in spectral.core.llm.content_class
A closed taxonomy: PLATFORM, OPERATIONS, SYNTHETIC. Resolver contracts live at each context’s composition root — scan composition root resolves scan-purpose calls using scan-track; agent composition roots resolve per their scope. Revisit trigger: if world_distill or any other operations-class purpose starts ingesting customer-supplied content, re-resolve its default classification.
Alternatives considered
Datadog. Rule out. $300–1000+/month at alpha scale; primary-source regret evidence overwhelming (Coinbase $65M; DHH $83k cancellation; AI startup Deductive’s 48-hour emergency migration in January 2026).
Honeycomb / Axiom. OTel-native but proprietary query languages (BubbleUp / APL); migration tax. Grafana’s OTel + PromQL/LogQL/TraceQL stack is more portable at the same or lower cost.
Langfuse Cloud (Core tier). Post-ClickHouse-acquisition US HQ creates a GDPR-residency question; ~$210/month at our LLM volumes; no material advantage over self-host if sovereignty becomes load-bearing.
Langfuse self-host from day one. Architecturally consistent (own-the-substrate matches Supabase, in-process control plane, single-DB-first). Real five-datastore ops tax during alpha not justified when the authoritative record (core.llm_usage plus Stream B tables) already lives in Postgres. Queued as a post-alpha escape.
LangSmith. ~$840/month at three engineers plus deepens LangChain lock-in. Hard rule-out.
Helicone. Maintenance mode post-Mintlify acquisition. Dead.
Arize Phoenix. ELv2 licensing; dominated by Langfuse at our scale.
Self-hosted LGTM on Hetzner from day one. Correct post-alpha choice; premature at alpha given founder-hour opportunity cost. Migration playbook documented in this ADR.
Sentry self-host. 20+ containers. Hard no at alpha. GlitchTip is the self-host fallback when/if we leave Sentry SaaS.
Fold error tracking into ops obs. Saves ~$26/month, loses Sentry’s stacktrace grouping, release-health, and deploy-diff ergonomics. False economy.
Blanket payload stripping across all LLM calls. Over-protection. Operations-owned (World/Ops Agent) and SYNTHETIC traffic carry no customer PII; stripping them would kneecap Logfire’s AI-specific debugging UI for the bulk of alpha traffic. Content-class-driven routing resolves this cleanly.
Dedicated Slack alert channel + Grafana alert rules. Adds an alerting substrate when Sentry already owns issue grouping plus Slack/email routing. Collapsed into D7.
Consequences
- spectral.core.llm.content_class ships as the closed taxonomy used by Streams A/B/C.
- LLMUsageRecord gains content_class plus account_id. Per the delta noted in the SPEC-319 verification, account_id was added because TA-1 D6 designates (account_id, workspace_id) as non-negotiable tenancy columns and SPEC-285 lists account_id as a canonical log field. Nullable for OPERATIONS-only calls (Ops / World Agent), consistent with workspace_id.
- core.llm_usage is the first resident of the core schema per ADR-032 D2; migration 20260420231500_core_llm_usage.sql.
- Three SaaS vendors in the alpha operational path — Grafana Cloud (if not platform-native), Pydantic Logfire, Sentry. All OTLP-consuming; all with credible self-host escape hatches (LGTM, Langfuse, GlitchTip). Bounded vendor risk.
- OTel Collector is a standard sidecar — hosting choices must support OTLP egress and (if platform-native storage is chosen per D2) native logging/tracing/metrics surfaces.
- Architecture-validator extension queued for close-pass — external-package allowlist per Clean Architecture layer (vendor SDK imports in composition roots only).
- SaaS-to-self-host migration playbook inlined. Triggers documented in the close-pass Codex page (system-design/observability-stack.mdx): monthly SaaS observability bill > $300/month; first enterprise DPA demanding a SOC2-audited subprocessor list that self-hosted satisfies; Grafana storage free-tier ingestion sustained >80% for 2 months; Langfuse Cloud EU-residency clarification; any operations-class purpose ingesting customer-supplied content (re-resolve content class).
References
- ADR-065 — spectral.core admission discipline
- ADR-031 — single-library structure
- ADR-035 — PurposeKey; LLMProvider; cost-tracking contract
- ADR-042 — TA-4 retention policy framework (per-class formal policy)
- ADR-043 — Stream B for agent conversations
- ADR-046 — D2 storage backend binding
- ADR-048 — workers tier composition
- ADR-060 — Stream B for agent tool invocation
- TA-16 disposition — SPEC-319 comment 593257c8
- TA-16 verification — SPEC-319 comment 69689e96
- TA-16 amendment (ContentClass rebalance) — SPEC-319 comment 1fe81d1d
- src/spectral/core/llm/content_class.py — ContentClass enum
- src/spectral/core/llm/usage.py — LLMUsageRecord
- supabase/migrations/20260420231500_core_llm_usage.sql — core.llm_usage table
- Codex system-design/foundations/observability-principles.mdx — close-pass updates
- Codex system-design/observability-stack.mdx — close-pass new page