Observability Stack
OpenTelemetry is the telemetry substrate. The OTel Collector is the fan-out point; every downstream vendor is swappable via OTEL_EXPORTER_OTLP_ENDPOINT. Decision lineage in ADR-036; principles in observability-principles.
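A minimal, stdlib-only sketch of why the endpoint variable makes vendors swappable: the exporter target is resolved from the environment rather than from code. It assumes OTLP over HTTP. The helper name `resolve_traces_endpoint` and the `collector` hostname are hypothetical; the real OTel SDK performs this resolution itself.

```python
import os
from typing import Mapping

# OTLP/HTTP default from the OpenTelemetry specification.
DEFAULT_OTLP_HTTP_ENDPOINT = "http://localhost:4318"


def resolve_traces_endpoint(env: Mapping[str, str] = os.environ) -> str:
    """Return the URL an OTLP/HTTP trace exporter would post to."""
    per_signal = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if per_signal:
        # A signal-specific endpoint is used exactly as given.
        return per_signal
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_OTLP_HTTP_ENDPOINT)
    # The generic endpoint is a base URL; the signal path is appended.
    return base.rstrip("/") + "/v1/traces"


# Hypothetical collector hostname, for illustration only.
target = resolve_traces_endpoint({"OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4318"})
```

Pointing the application at a different Collector (or directly at a vendor) is then a deployment change, not a code change.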
Vendor inventory
Three SaaS surfaces in the alpha operational path; each has a self-host escape hatch.
| Concern | Alpha | Self-host fallback |
|---|---|---|
| Ops observability (logs, metrics, traces) | Grafana Cloud free-tier LGTM (or platform-native if hosting offers it) | Self-hosted LGTM on Hetzner |
| LLM observability (Stream A) | Pydantic Logfire free tier | Self-hosted Langfuse |
| Error tracking | Sentry SaaS Team tier | GlitchTip self-host |
The migration playbook (with named triggers) lives in ADR-036. A move applies when any trigger fires: the monthly bill exceeds $300, an enterprise DPA requires a SOC2 subprocessor list that self-hosting can satisfy, or storage free-tier ingestion stays above 80% for 2 months.
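The three triggers can be sketched as a pure function. This is an illustration, not the ADR-036 playbook: `VendorSnapshot`, its field names, and the trigger labels are hypothetical, and "2 months" is read here as the two most recent consecutive months.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VendorSnapshot:
    monthly_bill_usd: float
    dpa_requires_self_host: bool  # an enterprise DPA that only self-host can satisfy
    ingestion_pct_by_month: tuple[float, ...]  # free-tier usage, most recent month last


def migration_triggers(snapshot: VendorSnapshot) -> list[str]:
    """Return the names of the triggers that currently fire."""
    fired: list[str] = []
    if snapshot.monthly_bill_usd > 300:
        fired.append("monthly_bill")
    if snapshot.dpa_requires_self_host:
        fired.append("enterprise_dpa")
    last_two = snapshot.ingestion_pct_by_month[-2:]
    # Assumption: "sustained" means both of the last two months exceed 80%.
    if len(last_two) == 2 and all(pct > 80 for pct in last_two):
        fired.append("free_tier_ingestion")
    return fired
```

A single month above 80% does not fire the ingestion trigger under this reading; only a sustained breach does.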
Three streams for LLM observability
Every LLM call emits to all three streams; they share trace_id via W3C Trace Context.
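For reference, the shared trace_id is the second field of the W3C `traceparent` header. The sketch below handles version 00 headers only and uses hypothetical helper names; in practice the OTel SDK's propagators do this work.

```python
import re
import secrets

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str) -> tuple[str, str, str]:
    """Split a version 00 traceparent into (trace_id, parent_id, flags)."""
    match = _TRACEPARENT.fullmatch(header)
    if match is None:
        raise ValueError(f"not a version 00 traceparent: {header!r}")
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace or parent ids are invalid")
    return trace_id, parent_id, flags


def child_traceparent(header: str) -> str:
    """Start a child span: keep trace_id and flags, mint a new span id."""
    trace_id, _parent_id, flags = parse_traceparent(header)
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Because only the span id changes from hop to hop, a record written to any of the three streams can be joined back to the others on trace_id.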
Stream A — payload-free operational spans
The OTel SDK emits span structure plus gen_ai.* metadata (model, latency, token counts, cost, purpose, bc, content_class, workspace_id as label).
For PLATFORM content_class, gen_ai.prompt.* / gen_ai.completion.* / gen_ai.tool.arguments are stripped before export. Exported via the OTel Collector to Pydantic Logfire plus the ops storage backend.
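The stripping rule can be sketched as a filter over a span's attribute dict. This is an assumption-laden illustration: `redact_for_export` is a hypothetical name, the patterns are taken literally (prefix match for the two `.*` families, exact match for `gen_ai.tool.arguments`), and the sample attribute keys are illustrative.

```python
from typing import Any, Mapping

# Families written as "gen_ai.prompt.*" / "gen_ai.completion.*" in the rule.
REDACTED_PREFIXES = ("gen_ai.prompt.", "gen_ai.completion.")
# Exact key named in the rule.
REDACTED_KEYS = frozenset({"gen_ai.tool.arguments"})


def redact_for_export(attributes: Mapping[str, Any], content_class: str) -> dict[str, Any]:
    """Return a copy of the attributes that is safe to hand to the exporter."""
    if content_class != "platform":
        return dict(attributes)
    return {
        key: value
        for key, value in attributes.items()
        if key not in REDACTED_KEYS and not key.startswith(REDACTED_PREFIXES)
    }


span_attributes = {
    "gen_ai.request.model": "example-model",  # metadata: kept
    "gen_ai.usage.input_tokens": 42,          # metadata: kept
    "gen_ai.prompt.0.content": "customer text",
    "gen_ai.completion.0.content": "model reply",
    "gen_ai.tool.arguments": '{"q": "customer text"}',
}
exported = redact_for_export(span_attributes, "platform")
```

Returning a copy rather than mutating in place leaves the full attributes available to the caller for Stream B persistence.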
Stream B — payload-bearing records
Full prompts, completions, tool-call arguments, intermediate reasoning content. Persisted synchronously to business-object-contextual tables inside Spectral’s sovereignty boundary. Schemas live with their consumers (Spectral Agent conversation persistence, agent tool invocation, per-context for scan / world-model contexts).
Stream C — cost / usage summary
core.llm_usage table; the canonical 10-field shape plus content_class plus PurposeKey. Platform-scoped, no payload. Authoritative for cost attribution and rate-limit accounting (pairs with the LLM platform daily cap).
Content-class taxonomy
spectral.core.llm.content_class.ContentClass:
- PLATFORM: customer content processed or generated in platform (conformance-track customer traces, Spectral Agent conversations, customer replay)
- OPERATIONS: Spectral-operated reasoning (World Agent, Ops Agent, internal distillation)
- SYNTHETIC: test-agent-generated synthetic content (no customer PII)
Purpose-to-class resolver
Composition-root contract, never per-call developer discretion:
- world_agent → always OPERATIONS
- scoring / detection / reasoning → SYNTHETIC on the synthetic track; PLATFORM on the conformance track
- agent_turn (Spectral Agent) → always PLATFORM
- customer_replay → always PLATFORM
- agent_tool → inherits from parent span’s content_class
- embedding → caller-determined
- Ops Agent / platform-internal reasoning → OPERATIONS
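The resolver rules above can be sketched as a single function. It is a stand-in, not the real resolver: the enum values other than "platform", the `ops_agent` key, and the `track`, `parent`, and `caller` parameters are all assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class ContentClass(str, Enum):
    PLATFORM = "platform"
    OPERATIONS = "operations"  # assumed spelling
    SYNTHETIC = "synthetic"    # assumed spelling


_ALWAYS = {
    "world_agent": ContentClass.OPERATIONS,
    "ops_agent": ContentClass.OPERATIONS,  # hypothetical key for Ops Agent reasoning
    "agent_turn": ContentClass.PLATFORM,
    "customer_replay": ContentClass.PLATFORM,
}
_TRACK_DEPENDENT = frozenset({"scoring", "detection", "reasoning"})
_BY_TRACK = {"synthetic": ContentClass.SYNTHETIC, "conformance": ContentClass.PLATFORM}


def resolve_content_class(
    purpose: str,
    *,
    track: Optional[str] = None,            # "synthetic" or "conformance"
    parent: Optional[ContentClass] = None,  # parent span's class, for agent_tool
    caller: Optional[ContentClass] = None,  # caller's choice, for embedding
) -> ContentClass:
    if purpose in _ALWAYS:
        return _ALWAYS[purpose]
    if purpose in _TRACK_DEPENDENT:
        if track not in _BY_TRACK:
            raise ValueError(f"{purpose} needs a known track, got {track!r}")
        return _BY_TRACK[track]
    if purpose == "agent_tool":
        if parent is None:
            raise ValueError("agent_tool inherits from its parent span; none given")
        return parent
    if purpose == "embedding":
        if caller is None:
            raise ValueError("embedding is caller-determined; none given")
        return caller
    # Unknown purposes fail loudly instead of falling back to a default class.
    raise ValueError(f"unknown purpose: {purpose!r}")
```

Raising on anything unresolved keeps the choice out of per-call discretion: there is no path that silently yields a class.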
Two-layer enforcement
- Layer 1. content_class is a required field on LLMUsage (no default; mypy and pydantic enforce).
- Layer 2. The LLM-emission seam (the emit_llm_call shape that lands with the observability epic per ADR-036 D6) reads content_class from LLMUsage; for PLATFORM it applies a span-attribute redaction hook before the OTel exporter sees the span.

The OTel Collector transform processor is the backstop: it drops gen_ai.prompt.* | gen_ai.completion.* | gen_ai.tool.arguments for any span where spectral.content_class == "platform". The collector config lands under infra/otel-collector/ with the observability epic; until then the redaction hook in Layer 2 is the load-bearing control.
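Layer 1 can be illustrated with a stdlib dataclass standing in for the pydantic model. Only `content_class` and `purpose` come from this page; the rest of the LLMUsage shape is omitted, and the enum values other than "platform" are assumed.

```python
from dataclasses import dataclass
from enum import Enum


class ContentClass(str, Enum):
    PLATFORM = "platform"
    OPERATIONS = "operations"  # assumed spelling
    SYNTHETIC = "synthetic"    # assumed spelling


@dataclass(frozen=True)
class LLMUsage:
    purpose: str
    content_class: ContentClass  # deliberately no default

    def __post_init__(self) -> None:
        # Dataclasses do not validate field types, so check at runtime
        # (a pydantic model would do this validation itself).
        if not isinstance(self.content_class, ContentClass):
            raise TypeError(f"content_class must be a ContentClass, got {self.content_class!r}")


usage = LLMUsage(purpose="agent_turn", content_class=ContentClass.PLATFORM)
```

Omitting `content_class` raises `TypeError` at construction time, so a usage record without a class never reaches the emission seam.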
Audit query: SELECT count(*), purpose, content_class FROM core.llm_usage GROUP BY purpose, content_class — a queryable record of “what has crossed a third-party boundary.”
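To see the shape of the audit result, the query can be run against an in-memory SQLite stand-in (the real table lives in Postgres). The two-column table below is a hypothetical reduction of core.llm_usage, and `audit_counts` is an illustrative helper.

```python
import sqlite3
from typing import Iterable

AUDIT_SQL = (
    "SELECT count(*), purpose, content_class "
    "FROM core.llm_usage GROUP BY purpose, content_class"
)


def audit_counts(rows: Iterable[tuple[str, str]]) -> dict[tuple[str, str], int]:
    """Load (purpose, content_class) rows and run the audit query over them."""
    conn = sqlite3.connect(":memory:")
    try:
        # Attach a second in-memory database named "core" so the
        # schema-qualified table name from the query resolves.
        conn.execute("ATTACH DATABASE ':memory:' AS core")
        conn.execute(
            "CREATE TABLE core.llm_usage ("
            "purpose TEXT NOT NULL, content_class TEXT NOT NULL)"
        )
        conn.executemany("INSERT INTO core.llm_usage VALUES (?, ?)", rows)
        return {(purpose, cc): n for n, purpose, cc in conn.execute(AUDIT_SQL)}
    finally:
        conn.close()
```

Each key in the result is one (purpose, content_class) pair, which is the granularity at which boundary crossings are counted.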
Structured logging
structlog with the seven canonical fields from principles: workspace_id, account_id, scan_id, world_model_version, bc, phase, trace_id. JSON output. The bc field is pre-bound at composition root (bc=worlds | bc=platform | bc=core | bc=app:api | bc=app:workers | bc=app:operations). Logs ship via the OTel Collector log receiver to the ops storage backend.
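The pre-binding of bc can be sketched without structlog as a closure that fixes the value once and renders JSON. This is a dependency-free stand-in, not the structlog configuration: `bind_bc` is hypothetical, and emitting absent canonical fields as null is an assumption made to keep the record shape stable.

```python
import json
from typing import Any, Callable

CANONICAL_FIELDS = (
    "workspace_id", "account_id", "scan_id",
    "world_model_version", "bc", "phase", "trace_id",
)
ALLOWED_BC = frozenset(
    {"worlds", "platform", "core", "app:api", "app:workers", "app:operations"}
)


def bind_bc(bc: str) -> Callable[..., str]:
    """Return a renderer with bc fixed, as a composition root would do once."""
    if bc not in ALLOWED_BC:
        raise ValueError(f"unknown bc: {bc!r}")

    def render(event: str, **fields: Any) -> str:
        record: dict[str, Any] = {name: None for name in CANONICAL_FIELDS}
        record.update(fields)
        record["bc"] = bc  # the pre-bound value always wins over call-site input
        record["event"] = event
        return json.dumps(record, sort_keys=True)

    return render


log = bind_bc("app:api")
line = log("scan.started", workspace_id="ws_1", scan_id="scan_1", phase="ingest")
```

Call sites supply only what they know; bc cannot be overridden per call.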
Error-level events fork to Sentry via structlog-sentry.
Error tracking
sentry-sdk[fastapi] plus structlog-sentry. Consumes OTel span context so trace_id correlates across Sentry / Grafana / Logfire.
Alpha alert rules:
- New-release 5xx alerts
- Daily LLM cost-cap breach (fires error when the LLM platform daily cap trips)
- Worker queue depth SLO warnings
- Storage free-tier ingestion > 80%
Sentry’s native Slack/email integration is the routing layer — no separate alerting substrate.
Retention (alpha defaults)
- Logfire 30 days
- Sentry Team default
- Postgres core.llm_usage 90 days
- Ops storage retention inherits from the chosen backend
- Formal per-class policy in data-retention
See also
- ADR-036: decision lineage
- Observability principles: observability doctrine
- LLM platform: cost contract
- Agent tool invocation: Stream B consumer
- Data retention: formal retention model