
Observability Stack

OpenTelemetry is the telemetry substrate. The OTel Collector is the fan-out point; every downstream vendor is swappable via OTEL_EXPORTER_OTLP_ENDPOINT. Decision lineage in ADR-036; principles in observability-principles.


Three SaaS surfaces in the alpha operational path; each has a self-host escape hatch.

| Concern | Alpha | Self-host fallback |
| --- | --- | --- |
| Ops observability (logs, metrics, traces) | Grafana Cloud free-tier LGTM (or platform-native if hosting offers it) | Self-hosted LGTM on Hetzner |
| LLM observability (Stream A) | Pydantic Logfire free tier | Self-hosted Langfuse |
| Error tracking | Sentry SaaS Team tier | GlitchTip self-host |

The migration playbook, with named triggers, lives in ADR-036: a move applies when the monthly bill exceeds $300, when an enterprise DPA requires a SOC2 subprocessor list that only self-hosting can satisfy, or when storage free-tier ingestion stays above 80% for two consecutive months.
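The triggers compose as a simple disjunction. A minimal sketch; `MigrationSignals` and `should_migrate` are hypothetical names, not part of the codebase:

```python
from dataclasses import dataclass, field

@dataclass
class MigrationSignals:
    monthly_bill_usd: float
    needs_soc2_subprocessor_list: bool        # enterprise-DPA requirement
    ingestion_pct_by_month: list[float] = field(default_factory=list)  # most recent last

def should_migrate(s: MigrationSignals) -> bool:
    """True when any named ADR-036 trigger fires."""
    sustained_ingestion = (
        len(s.ingestion_pct_by_month) >= 2
        and all(p > 80.0 for p in s.ingestion_pct_by_month[-2:])
    )
    return (
        s.monthly_bill_usd > 300
        or s.needs_soc2_subprocessor_list
        or sustained_ingestion
    )
```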


Every LLM call emits to all three streams; they share trace_id via W3C Trace Context.
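Cross-stream correlation rests on the 32-hex-character trace_id carried in the W3C `traceparent` header. A minimal sketch of extracting it (the helper name is illustrative):

```python
import re

# version - trace_id (32 hex) - parent_id (16 hex) - flags, per W3C Trace Context
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def trace_id_from_traceparent(header: str) -> str:
    """Extract the trace_id shared by all three streams for one LLM call."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        raise ValueError(f"malformed traceparent: {header!r}")
    return m.group(1)
```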

Stream A — payload-free operational spans


OTel SDK emits span structure plus gen_ai.* metadata (model, latency, token counts, cost, purpose, bc, content_class, workspace_id as label).

For PLATFORM content_class, gen_ai.prompt.* / gen_ai.completion.* / gen_ai.tool.arguments are stripped before export. Exported via the OTel Collector to Pydantic Logfire plus the ops storage backend.
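The stripping step is just a key filter over span attributes. A minimal sketch in plain Python, independent of the OTel SDK hook that actually hosts it (function and constant names are assumptions):

```python
PAYLOAD_KEYS = ("gen_ai.prompt.", "gen_ai.completion.", "gen_ai.tool.arguments")

def redact_platform_attributes(attributes: dict, content_class: str) -> dict:
    """Drop payload-bearing gen_ai.* attributes for PLATFORM spans.

    Metadata (model, latency, token counts, cost, ...) passes through
    untouched; non-PLATFORM spans are returned as-is.
    """
    if content_class != "platform":
        return attributes
    return {
        k: v for k, v in attributes.items()
        if not k.startswith(PAYLOAD_KEYS)
    }
```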

Stream B — full payloads inside the sovereignty boundary

Full prompts, completions, tool-call arguments, and intermediate reasoning content. Persisted synchronously to business-object-contextual tables inside Spectral’s sovereignty boundary. Schemas live with their consumers (Spectral Agent conversation persistence, agent tool invocation, per-context tables for scan and world-model contexts).

Stream C — platform usage ledger

core.llm_usage table: the canonical 10-field shape plus content_class and PurposeKey. Platform-scoped, no payloads. Authoritative for cost attribution and rate-limit accounting (pairs with the LLM platform daily cap).


spectral.core.llm.content_class.ContentClass:

  • PLATFORM — customer content processed or generated in platform (conformance-track customer traces, Spectral Agent conversations, customer replay)
  • OPERATIONS — Spectral-operated reasoning (World Agent, Ops Agent, internal distillation)
  • SYNTHETIC — test-agent-generated synthetic content (no customer PII)
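The three classes above can be sketched as a standard enum; the members come from the spec, but the string values are assumptions here:

```python
from enum import Enum

class ContentClass(str, Enum):
    """Sketch of spectral.core.llm.content_class.ContentClass."""
    PLATFORM = "platform"      # customer content; payloads never cross a third-party boundary
    OPERATIONS = "operations"  # Spectral-operated reasoning
    SYNTHETIC = "synthetic"    # test-agent-generated, no customer PII
```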

Composition-root contract — never per-call developer discretion:

  • world_agent → always OPERATIONS
  • scoring / detection / reasoning → SYNTHETIC on the synthetic track; PLATFORM on the conformance track
  • agent_turn (Spectral Agent) → always PLATFORM
  • customer_replay → always PLATFORM
  • agent_tool → inherits from parent span’s content_class
  • embedding → caller-determined
  • Ops Agent / platform-internal reasoning → OPERATIONS
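The contract lends itself to a single resolver wired at the composition root. A hedged sketch; the purpose strings, enum values, and function name are assumptions, not the real API:

```python
from enum import Enum

class ContentClass(str, Enum):
    PLATFORM = "platform"
    OPERATIONS = "operations"
    SYNTHETIC = "synthetic"

def content_class_for(purpose: str, *, conformance_track: bool = False,
                      parent: "ContentClass | None" = None) -> ContentClass:
    """Resolve content_class at the composition root -- never per-call."""
    if purpose in {"world_agent", "ops_agent"}:
        return ContentClass.OPERATIONS
    if purpose in {"scoring", "detection", "reasoning"}:
        return ContentClass.PLATFORM if conformance_track else ContentClass.SYNTHETIC
    if purpose in {"agent_turn", "customer_replay"}:
        return ContentClass.PLATFORM
    if purpose == "agent_tool":
        if parent is None:
            raise ValueError("agent_tool inherits content_class from its parent span")
        return parent
    # embedding (and anything unmapped) is caller-determined:
    # the fall-through forces an explicit choice instead of a silent default.
    raise ValueError(f"caller must decide content_class for purpose {purpose!r}")
```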
Two enforcement layers back the contract:

  1. Layer 1: content_class is a required field on LLMUsage (no default; mypy and pydantic enforce it).
  2. Layer 2: the LLM-emission seam (the emit_llm_call shape that lands with the observability epic per ADR-036 D6) reads content_class from LLMUsage; for PLATFORM it applies a span-attribute redaction hook before the OTel exporter sees the span. The OTel Collector transform processor is the backstop: it drops gen_ai.prompt.*, gen_ai.completion.*, and gen_ai.tool.arguments for any span where spectral.content_class == "platform". The collector config lands under infra/otel-collector/ with the observability epic; until then the redaction hook in Layer 2 is the load-bearing control.
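The collector backstop could look roughly like the following transform-processor fragment. A sketch only: the processor name, OTTL statements, and attribute key are assumptions about a config that has not landed yet.

```yaml
processors:
  transform/redact_platform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # drop payload attributes on PLATFORM spans; metadata survives
          - delete_matching_keys(attributes, "^gen_ai\\.(prompt|completion)\\.") where attributes["spectral.content_class"] == "platform"
          - delete_key(attributes, "gen_ai.tool.arguments") where attributes["spectral.content_class"] == "platform"
```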

Audit query: SELECT count(*), purpose, content_class FROM core.llm_usage GROUP BY purpose, content_class — a queryable record of “what has crossed a third-party boundary.”


structlog with the seven canonical fields from principles: workspace_id, account_id, scan_id, world_model_version, bc, phase, trace_id. JSON output. The bc field is pre-bound at composition root (bc=worlds | bc=platform | bc=core | bc=app:api | bc=app:workers | bc=app:operations). Logs ship via the OTel Collector log receiver to the ops storage backend.

Error-level events fork to Sentry via structlog-sentry.
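The canonical-field shape can be sketched without structlog itself; a stdlib-only approximation of what one rendered JSON line carries (`render_log` and `CANONICAL_FIELDS` are illustrative names):

```python
import json

CANONICAL_FIELDS = ("workspace_id", "account_id", "scan_id",
                    "world_model_version", "bc", "phase", "trace_id")

def render_log(event: str, bound: dict) -> str:
    """Render one JSON log line; the seven canonical fields always
    appear (null when unbound), mirroring the processor chain."""
    record = {"event": event}
    for name in CANONICAL_FIELDS:
        record[name] = bound.get(name)
    return json.dumps(record)
```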


sentry-sdk[fastapi] plus structlog-sentry. Consumes OTel span context so trace_id correlates across Sentry / Grafana / Logfire.

Alpha alert rules:

  • New-release 5xx alerts
  • Daily LLM cost-cap breach (fires an error event when the LLM platform daily cap trips)
  • Worker queue depth SLO warnings
  • Storage free-tier ingestion > 80%

Sentry’s native Slack/email integration is the routing layer — no separate alerting substrate.


Retention:

  • Logfire: 30 days
  • Sentry: Team-tier default
  • Postgres core.llm_usage: 90 days
  • Ops storage: inherits from the chosen backend
  • Formal per-class policy lives in data-retention