Observability Stack

OpenTelemetry is the telemetry substrate. The OTel Collector is the fan-out point; every downstream vendor is swappable via OTEL_EXPORTER_OTLP_ENDPOINT. Decision lineage in ADR-036; principles in observability-principles.

Vendor inventory

Three SaaS surfaces in the alpha operational path; each has a self-host escape hatch.

Concern	Alpha	Self-host fallback
Ops observability (logs, metrics, traces)	Grafana Cloud free-tier LGTM (or platform-native if hosting offers it)	Self-hosted LGTM on Hetzner
LLM observability (Stream A)	Pydantic Logfire free tier	Self-hosted Langfuse
Error tracking	Sentry SaaS Team tier	GlitchTip self-host

The migration playbook, with named triggers, lives in ADR-036. A move to self-host applies when any one of the following holds: the monthly bill exceeds $300; an enterprise DPA requires a SOC2 subprocessor list that self-hosting can satisfy; or free-tier storage ingestion stays above 80% for two consecutive months.
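The triggers above can be sketched as a simple check. This is a minimal illustration, not the playbook itself: `VendorSnapshot`, its field names, and `migration_triggers` are hypothetical, and the thresholds are copied from this page rather than from ADR-036 directly.

```python
from dataclasses import dataclass
from typing import List, Tuple

BILL_THRESHOLD_USD = 300.0   # monthly bill trigger
INGESTION_THRESHOLD = 0.80   # fraction of the free-tier ingestion quota
SUSTAINED_MONTHS = 2         # consecutive months above the threshold


@dataclass(frozen=True)
class VendorSnapshot:
    """Hypothetical monthly snapshot for one SaaS vendor."""
    monthly_bill_usd: float
    dpa_requires_self_host: bool                 # an enterprise DPA that self-host can satisfy
    ingestion_ratio_by_month: Tuple[float, ...]  # oldest first; 1.0 means the full quota


def migration_triggers(s: VendorSnapshot) -> List[str]:
    """Return the names of the triggers that currently fire (empty list: stay on SaaS)."""
    fired = []
    if s.monthly_bill_usd > BILL_THRESHOLD_USD:
        fired.append("monthly_bill")
    if s.dpa_requires_self_host:
        fired.append("enterprise_dpa")
    recent = s.ingestion_ratio_by_month[-SUSTAINED_MONTHS:]
    if len(recent) == SUSTAINED_MONTHS and all(r > INGESTION_THRESHOLD for r in recent):
        fired.append("ingestion_sustained")
    return fired
```

A single month above 80% does not fire the ingestion trigger in this sketch; it needs two full months of data, both above the threshold.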

Three streams for LLM observability

Every LLM call emits to all three streams; they share trace_id via W3C Trace Context.
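For reference, the shared trace_id is the second field of the W3C `traceparent` header. The sketch below extracts it using only the standard library; the function name is hypothetical, and it handles version 00 headers only. In practice the OTel SDK's propagators do this work.

```python
import re
from typing import Optional

# Version-00 traceparent: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_V00 = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def trace_id_from_traceparent(header: str) -> Optional[str]:
    """Return the 32-hex-char trace_id, or None if the header is not a valid v00 value."""
    match = _TRACEPARENT_V00.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, _flags = match.groups()
    # The spec treats all-zero trace and parent ids as invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id
```

A record written to Stream B or Stream C can store this value next to its payload or usage row so that it joins against the Stream A span.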

Stream A — payload-free operational spans

OTel SDK emits span structure plus gen_ai.* metadata (model, latency, token counts, cost, purpose, bc, content_class, workspace_id as label).

For PLATFORM content_class, gen_ai.prompt.* / gen_ai.completion.* / gen_ai.tool.arguments are stripped before export. Spans are exported via the OTel Collector to Pydantic Logfire and to the ops storage backend.

Stream B — payload-bearing records

Full prompts, completions, tool-call arguments, intermediate reasoning content. Persisted synchronously to business-object-contextual tables inside Spectral’s sovereignty boundary. Schemas live with their consumers (Spectral Agent conversation persistence, agent tool invocation, per-context for scan / world-model contexts).

Stream C — cost / usage summary

core.llm_usage table; the canonical 10-field shape plus content_class plus PurposeKey. Platform-scoped, no payload. Authoritative for cost attribution and rate-limit accounting (pairs with the LLM platform daily cap).

Content-class taxonomy

spectral.core.llm.content_class.ContentClass:

PLATFORM: customer content processed or generated in the platform (conformance-track customer traces, Spectral Agent conversations, customer replay)
OPERATIONS: Spectral-operated reasoning (World Agent, Ops Agent, internal distillation)
SYNTHETIC: test-agent-generated synthetic content (no customer PII)

Purpose-to-class resolver

The resolver is a composition-root contract; the class is never left to per-call developer discretion:

world_agent → always OPERATIONS
scoring / detection / reasoning → SYNTHETIC on the synthetic track; PLATFORM on the conformance track
agent_turn (Spectral Agent) → always PLATFORM
customer_replay → always PLATFORM
agent_tool → inherits from parent span’s content_class
embedding → caller-determined
Ops Agent / platform-internal reasoning → OPERATIONS
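The mapping above can be sketched as one function. This is an illustration under stated assumptions, not the actual resolver: the function name, the `track` / `parent` / `caller` parameters, and the `ops_agent` purpose key are hypothetical; only the "platform" enum value is confirmed elsewhere on this page, and the other two values are assumed by analogy.

```python
from enum import Enum
from typing import Optional


class ContentClass(str, Enum):
    PLATFORM = "platform"      # matches the collector rule on this page
    OPERATIONS = "operations"  # assumed lowercase by analogy
    SYNTHETIC = "synthetic"    # assumed lowercase by analogy


# Purposes whose class never varies.
_FIXED = {
    "world_agent": ContentClass.OPERATIONS,
    "ops_agent": ContentClass.OPERATIONS,  # hypothetical key for Ops Agent / internal reasoning
    "agent_turn": ContentClass.PLATFORM,
    "customer_replay": ContentClass.PLATFORM,
}
_TRACK_DEPENDENT = {"scoring", "detection", "reasoning"}
_BY_TRACK = {"synthetic": ContentClass.SYNTHETIC, "conformance": ContentClass.PLATFORM}


def resolve_content_class(
    purpose: str,
    *,
    track: Optional[str] = None,
    parent: Optional[ContentClass] = None,
    caller: Optional[ContentClass] = None,
) -> ContentClass:
    """Resolve a purpose to its content class; fail loudly instead of guessing."""
    if purpose in _FIXED:
        return _FIXED[purpose]
    if purpose in _TRACK_DEPENDENT:
        if track not in _BY_TRACK:
            raise ValueError(f"{purpose!r} needs track 'synthetic' or 'conformance'")
        return _BY_TRACK[track]
    if purpose == "agent_tool":
        if parent is None:
            raise ValueError("agent_tool inherits from the parent span; parent is required")
        return parent
    if purpose == "embedding":
        if caller is None:
            raise ValueError("embedding is caller-determined; caller is required")
        return caller
    raise ValueError(f"unknown purpose: {purpose!r}")
```

Raising on missing context, rather than falling back to a default class, keeps the sketch consistent with the rule that content_class has no default.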

Two-layer enforcement

Layer 1. content_class is a required field on LLMUsage (no default; enforced by mypy and pydantic).
Layer 2. The LLM-emission seam (the emit_llm_call shape that lands with the observability epic per ADR-036 D6) reads content_class from LLMUsage; for PLATFORM it applies a span-attribute redaction hook before the OTel exporter sees the span.

Backstop. The OTel Collector transform processor drops attributes matching gen_ai.prompt.*|gen_ai.completion.*|gen_ai.tool.arguments from any span where spectral.content_class == "platform". The collector config lands under infra/otel-collector/ with the observability epic; until then, the Layer 2 redaction hook is the load-bearing control.
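The redaction step can be sketched as a pure function over span attributes. This is not the emit_llm_call implementation: the function name and signature are hypothetical, and reading the `*` in the attribute patterns as a glob wildcard is an assumption.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, Mapping

# Payload-bearing attribute patterns named on this page ('*' read as a glob: assumption).
PAYLOAD_PATTERNS = ("gen_ai.prompt.*", "gen_ai.completion.*", "gen_ai.tool.arguments")


def redact_span_attributes(attrs: Mapping[str, Any], content_class: str) -> Dict[str, Any]:
    """Return a copy of attrs; for the 'platform' class, payload attributes are dropped."""
    if content_class != "platform":
        return dict(attrs)
    return {
        key: value
        for key, value in attrs.items()
        if not any(fnmatchcase(key, pattern) for pattern in PAYLOAD_PATTERNS)
    }
```

Metadata such as the model name and token counts passes through untouched; only keys that match a payload pattern are removed, and only for the platform class.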

Audit query: SELECT count(*), purpose, content_class FROM core.llm_usage GROUP BY purpose, content_class — a queryable record of “what has crossed a third-party boundary.”
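To see the shape of the audit result, the query can be run against a throwaway database. The sketch below uses an in-memory SQLite database attached under the name `core` as a stand-in for Postgres; the two-column table and the sample rows are invented for illustration and omit the rest of the canonical shape.

```python
import sqlite3
from typing import List, Tuple

AUDIT_QUERY = (
    "SELECT count(*), purpose, content_class "
    "FROM core.llm_usage GROUP BY purpose, content_class"
)


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory stand-in for core.llm_usage with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    # Attaching a second in-memory database as 'core' makes core.llm_usage resolvable.
    conn.execute("ATTACH DATABASE ':memory:' AS core")
    conn.execute(
        "CREATE TABLE core.llm_usage (purpose TEXT NOT NULL, content_class TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO core.llm_usage VALUES (?, ?)",
        [
            ("agent_turn", "platform"),
            ("agent_turn", "platform"),
            ("world_agent", "operations"),
        ],
    )
    return conn


def run_audit(conn: sqlite3.Connection) -> List[Tuple[int, str, str]]:
    """Run the audit query; rows are sorted in Python because the query has no ORDER BY."""
    rows = conn.execute(AUDIT_QUERY).fetchall()
    return sorted(rows, key=lambda row: (row[1], row[2]))
```

Each row is (count, purpose, content_class), so a non-zero count for a PLATFORM row identifies calls whose payload attributes had to be redacted before export.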

Structured logging

structlog with the seven canonical fields from observability-principles: workspace_id, account_id, scan_id, world_model_version, bc, phase, trace_id. JSON output. The bc field is pre-bound at composition root (bc=worlds | bc=platform | bc=core | bc=app:api | bc=app:workers | bc=app:operations). Logs ship via the OTel Collector log receiver to the ops storage backend.
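The record shape can be illustrated without structlog. The sketch below is a standard-library stand-in, not the real logging setup: `make_renderer` and the `event` key are hypothetical, and the closure only mimics pre-binding bc once at the composition root.

```python
import json
from typing import Any, Callable

CANONICAL_FIELDS = (
    "workspace_id", "account_id", "scan_id",
    "world_model_version", "bc", "phase", "trace_id",
)
ALLOWED_BC = frozenset(
    {"worlds", "platform", "core", "app:api", "app:workers", "app:operations"}
)


def make_renderer(bc: str) -> Callable[..., str]:
    """Bind bc once and return a function that renders one JSON log line."""
    if bc not in ALLOWED_BC:
        raise ValueError(f"unknown bc: {bc!r}")

    def render(event: str, **fields: Any) -> str:
        # Canonical fields are always present (None when not supplied).
        record = {name: fields.get(name) for name in CANONICAL_FIELDS}
        record.update(fields)
        record["bc"] = bc  # the pre-bound value wins over any per-call value
        record["event"] = event
        return json.dumps(record, sort_keys=True)

    return render
```

For example, `make_renderer("app:api")("scan.started", scan_id="s-1", phase="ingest")` yields a single JSON object whose bc is "app:api" and whose unset canonical fields are null.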

Error-level events fork to Sentry via structlog-sentry.

Error tracking

sentry-sdk[fastapi] plus structlog-sentry. Consumes OTel span context so trace_id correlates across Sentry / Grafana / Logfire.

Alpha alert rules:

New-release 5xx alerts
Daily LLM cost-cap breach (fires an error-level event when the LLM platform daily cap trips)
Worker queue depth SLO warnings
Storage free-tier ingestion > 80%

Sentry’s native Slack/email integration is the routing layer — no separate alerting substrate.

Retention (alpha defaults)

Logfire: 30 days
Sentry: Team-tier default
Postgres core.llm_usage: 90 days
Ops storage: inherits from the chosen backend
Formal per-class policy: see data-retention