Observability Principles
This page establishes the observability principles Spectral upholds regardless of which specific stack we choose. These are load-bearing — they hold the line on data-classification, on the minimum shape of telemetry every LLM call and every decision-execution phase emit, and on traceability across contexts.
The first-class boundary — customer data vs operational data
Spectral handles two categorically different classes of data. Commingling them is an architectural violation, not a policy preference.
Customer data
What it is. Domain-scoped decision records and audit-chain entries (per /decide
invocation), override-pattern signals from customer-flagged decisions, System Card snapshots,
and conversation content between the customer and the post-release World Agent customer-mode
chat affordance per ADR-081 D5.
Where it lives. Customer-owned schemas under Supabase RLS. Every row carries org_id and
domain_id per ADR-086 D6 (see
Architecture — Multi-tenancy isolation).
A customer querying with their JWT never sees another org/domain’s data because Postgres
itself enforces the boundary.
Who can see it. The owning domain’s authenticated users, plus Spectral operators holding
the operations org role (Spectral staff — used for customer support, incident response,
and the operational surfaces in the Operations app).
Retention. Tied to the customer’s agreement. Deletion propagates on domain retirement.
Operational data
What it is. Spectral’s own telemetry: structured logs, internal OTEL spans (API request traces, worker-job traces, decision-execution spans), World Agent reasoning traces (internal LangGraph state), LLM call metadata.
Where it lives. Platform-owned stores: Grafana Cloud LGTM (logs/metrics/traces), Pydantic
Logfire (LLM-call telemetry), and Sentry (error capture), per
ADR-036. The vendors can change; what is fixed is that operational data never lives in
customer-scoped tables. Operational storage does not use domain_id as a primary organizing key.
It may carry the field as a label for correlation, but the storage does not enforce
domain isolation.
Who can see it. Spectral engineering and operations. Customers do not see operational data.
Retention. Platform-owned policy per ADR-042. Shorter than customer-data retention in most cases — operational telemetry does not need to outlive incident-investigation windows.
Mixing the two is a category violation
The decision API surface (POST /decide, API-key authenticated with the decide:domain scope
per ADR-086 D4, plus the Customer Dashboard’s
read routes) receives and produces customer data only. Operational telemetry has its own
backend, distinct from customer-data storage and vendor-swappable via
OTEL_EXPORTER_OTLP_ENDPOINT per ADR-036 (see
Observability stack for the vendor inventory). Specifically:
- Application log statements never include customer PII (decision-context content, audit-chain payloads, private conversation text). Org/domain IDs and decision IDs are fine; their contents are not.
- Operational spans that cross customer-data surfaces carry references (decision IDs, audit-chain entry IDs, override-pattern signal IDs) rather than copying the customer data into the operational record.
- If debugging requires joining operational telemetry with customer data, the join happens at query time against the customer-data store — never by copying customer data into the operational store.
This boundary is currently enforced by convention and code review. A structured-linter rule
(forbidden payload shapes in log.info(...) calls) is a future hardening item.
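The reference-only rule can also be checked mechanically. The following is a minimal sketch of a runtime guard, not the structured-linter rule this page anticipates. The forbidden key names (`decision_context`, `audit_payload`, `conversation_text`) and the function name are hypothetical and do not come from the Spectral codebase.

```python
from collections.abc import Mapping
from typing import Any

# Hypothetical key names standing in for customer-content payloads.
FORBIDDEN_KEYS = frozenset({"decision_context", "audit_payload", "conversation_text"})


def find_forbidden_keys(event: Mapping[str, Any], prefix: str = "") -> list[str]:
    """Return dotted paths of forbidden keys found anywhere in a log event."""
    hits: list[str] = []
    for key, value in event.items():
        path = f"{prefix}{key}"
        if key in FORBIDDEN_KEYS:
            hits.append(path)
        if isinstance(value, Mapping):
            # Nested payloads are searched too, so wrapping content does not hide it.
            hits.extend(find_forbidden_keys(value, prefix=f"{path}."))
    return hits


# References (IDs) pass; copied customer content is flagged.
ok_event = {"decision_id": "d-123", "domain_id": "dom-1", "message": "decision stored"}
bad_event = {"decision_id": "d-123", "extra": {"decision_context": {"income": 52000}}}
```

A check of this kind only catches known key names. It complements code review and does not replace it.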
Structured logging — canonical fields
All application logs flow through structlog configured to emit JSON. Every log record carries the canonical fields below when the context applies; fields are omitted rather than set to null.
| Field | Type | When present | Purpose |
|---|---|---|---|
| domain_id | UUID | Any log in a request / job that has resolved a domain | Multi-tenant correlation. Indexed in the operational store. |
| org_id | UUID | Same as domain_id | Org-level rollup correlation. |
| decision_id | UUID | Any log emitted inside /decide execution | Decision-scoped correlation. |
| world_model_version | string | Any log that cites a specific world-model version (decision response, audit-chain entry) | Authority correlation: lets you slice telemetry by world-model version. |
| bc | enum(worlds, platform, core, app:api, app:workers, app:operations) | Every log | Which package emitted the log. Critical for context-aware filtering. |
| phase | enum(auth, module_load, context_establish, predicate_eval, aggregate, distill, evolve, publish, agent, …) | Logs inside a named pipeline phase | Phase-level slice of decision-execution and world-model activity. |
| trace_id | string | Every log emitted within an OTEL span context | Correlates logs to spans (see Trace context propagation). |
Free-form message fields are allowed, but they are secondary. Dashboards, alerts, and post-incident queries key on the canonical fields above. A log that only carries a free-form message is debuggable by a human but not by the observability pipeline.
Emitter discipline. Each context constructs a pre-bound logger at composition time
(bc=worlds or bc=platform), so downstream call sites inherit the field automatically.
Adding a new context-aware field is a composition-root change, not a per-call-site change.
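Two rules from this section lend themselves to a short illustration: bind bc once at the composition root, and omit fields that do not apply. The sketch below uses a small stdlib stand-in rather than structlog so that it runs without dependencies. `BoundLogger` and `render` are hypothetical names. With structlog itself, the equivalent binding is `structlog.get_logger().bind(bc="worlds")`.

```python
import json
from typing import Any


class BoundLogger:
    """Tiny stand-in for a structlog bound logger: context is bound, then inherited."""

    def __init__(self, **context: Any) -> None:
        self._context = context

    def bind(self, **fields: Any) -> "BoundLogger":
        # Return a new logger; the parent's context is left untouched.
        return BoundLogger(**{**self._context, **fields})

    def render(self, message: str, **fields: Any) -> str:
        record = {**self._context, **fields, "message": message}
        # Canonical fields are omitted when they do not apply, never set to null.
        record = {k: v for k, v in record.items() if v is not None}
        return json.dumps(record, sort_keys=True)


# Composition root: each context binds its bc once.
worlds_log = BoundLogger(bc="worlds")

# Call site: inherits bc and adds request-scoped fields.
request_log = worlds_log.bind(domain_id="dom-1", org_id="org-1")
line = request_log.render("rule candidate proposed", phase="distill", decision_id=None)
```

Because `bind` returns a new logger, a request-scoped field such as domain_id cannot leak into the logger shared at the composition root.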
LLM call telemetry
Every LLM call emits a structured telemetry record. This is the minimum shape: the specific destination is Pydantic Logfire per ADR-036, but the shape itself is invariant across substrate choices.
| Field | Type | Notes |
|---|---|---|
| model | string | Provider + model identifier (anthropic/claude-opus-4-7, openai/gpt-5.2, google/gemini-2.5-flash). Exact strings; never aliased. |
| org_id | UUID \| null | Customer-tenancy field per ADR-033 + ADR-086 D1. Nullable for OPERATIONS-only calls. |
| domain_id | UUID \| null | Present when the call is attributable to a specific customer domain. Null for OPERATIONS-only calls (e.g., World Agent internal reasoning). |
| input_tokens | int | Prompt-token count. |
| output_tokens | int | Completion-token count. |
| latency_ms | int | Wall-clock latency of the call. |
| cost_usd | decimal | Dollar cost of the call, computed from provider pricing at call time (per ADR-035 D6 via genai-prices with fallback registry). |
| purpose | enum | The 8-value PurposeKey taxonomy below. Required. |
| content_class | enum(PLATFORM, OPERATIONS, SYNTHETIC) | The content classification driving Stream A redaction. Required. |
| bc | enum(worlds, platform, core, app:api, app:workers, app:operations) | Which context initiated the call. |
| decision_id | UUID \| null | Present when the call is inside a /decide execution. |
| trace_id | string | OTel trace context for correlation with the enclosing operation. |
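The table maps naturally onto a typed record. A minimal sketch follows, assuming a plain dataclass. The class name `LLMCallRecord` and the validation in `__post_init__` are illustrative assumptions, not Spectral's actual implementation, and the sample values are made up.

```python
from dataclasses import asdict, dataclass
from decimal import Decimal
from typing import Optional
from uuid import UUID


@dataclass(frozen=True)
class LLMCallRecord:
    """Minimum telemetry shape for one LLM call (field names from the table above)."""

    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    cost_usd: Decimal
    purpose: str          # one of the PurposeKey values
    content_class: str    # PLATFORM, OPERATIONS, or SYNTHETIC
    bc: str
    trace_id: str
    org_id: Optional[UUID] = None       # null for OPERATIONS-only calls
    domain_id: Optional[UUID] = None    # null for OPERATIONS-only calls
    decision_id: Optional[UUID] = None  # set only inside a /decide execution

    def __post_init__(self) -> None:
        # Illustrative checks only; the real record may validate differently.
        if self.content_class not in {"PLATFORM", "OPERATIONS", "SYNTHETIC"}:
            raise ValueError(f"unknown content_class: {self.content_class!r}")
        if self.input_tokens < 0 or self.output_tokens < 0 or self.latency_ms < 0:
            raise ValueError("token counts and latency must be non-negative")


# Sample values are invented for illustration.
record = LLMCallRecord(
    model="anthropic/claude-opus-4-7",
    input_tokens=1200,
    output_tokens=300,
    latency_ms=2150,
    cost_usd=Decimal("0.0405"),
    purpose="reasoning",
    content_class="OPERATIONS",
    bc="worlds",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
)
```

Making purpose and content_class required positional fields means a call site cannot construct a record without classifying the call.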
The PurposeKey taxonomy
A closed enum in spectral.core.llm.purposes.PurposeKey (per
ADR-035 D3). If a new purpose is needed, it gets added here, not
invented at the call site.
| Value | Meaning |
|---|---|
| code_generation | World Agent generates predicate code from natural-language rules (highest-capability tier). |
| applies_when_generation | World Agent generates the optional context-only filter alongside a predicate. |
| distillation | Operator-driven distillation runs against source materials. |
| reasoning | Diagnosis, coverage reflection, restatement drafting. |
| agent_turn | World Agent chat surface conversational turn. |
| agent_tool | Agent tool-invocation call (may differ from turn). |
| world_agent | World Agent exploration / hypothesis. |
| embedding | Embedding generation (full policy in embeddings). |
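A closed enum rejects purposes invented at the call site. The sketch below is not the actual definition in spectral.core.llm.purposes. Only the eight value names are taken from the table above, and `parse_purpose` is a hypothetical helper.

```python
from enum import Enum


class PurposeKey(str, Enum):
    CODE_GENERATION = "code_generation"
    APPLIES_WHEN_GENERATION = "applies_when_generation"
    DISTILLATION = "distillation"
    REASONING = "reasoning"
    AGENT_TURN = "agent_turn"
    AGENT_TOOL = "agent_tool"
    WORLD_AGENT = "world_agent"
    EMBEDDING = "embedding"


def parse_purpose(raw: str) -> PurposeKey:
    """Reject any purpose string that is not in the closed set."""
    try:
        return PurposeKey(raw)
    except ValueError:
        allowed = ", ".join(p.value for p in PurposeKey)
        raise ValueError(f"unknown purpose {raw!r}; expected one of: {allowed}") from None
```

With this shape, adding a purpose means adding an enum member in one place, which is the discipline the section describes.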
The ContentClass taxonomy
spectral.core.llm.content_class.ContentClass is a closed 3-value enum
(per ADR-036 D6):
| Value | Meaning |
|---|---|
| PLATFORM | Customer content processed or generated in platform (decision-context inputs, audit-chain entries, post-release World Agent customer-mode chat content). |
| OPERATIONS | Spectral-operated reasoning (World Agent, internal distillation). |
| SYNTHETIC | Test-agent-generated synthetic content (no customer PII). |
Resolver mapping at the composition root (never per-call developer discretion):
- world_agent (operator mode) → always OPERATIONS
- world_agent (customer mode, post-release per ADR-081 D5) → always PLATFORM
- code_generation / applies_when_generation / distillation / reasoning → always OPERATIONS (authoring-time)
- agent_turn (World Agent operator mode) → always OPERATIONS
- agent_turn (World Agent customer-mode chat) → always PLATFORM
- agent_tool → inherits from parent span’s content_class
- embedding → caller-determined
For PLATFORM-class calls, prompt / completion / tool-arg fields are stripped before export to third-party observability (Logfire, Sentry). See observability stack for the three-stream architecture and the two-layer enforcement.
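The resolver mapping and the PLATFORM stripping rule can be sketched together. This is a minimal illustration under stated assumptions: the function names, the `customer_mode` / `parent_class` / `caller_class` parameters, and the content field names (`prompt`, `completion`, `tool_args`) are hypothetical, and the real two-layer enforcement is described on the observability stack page.

```python
from typing import Any, Optional

OPERATIONS, PLATFORM = "OPERATIONS", "PLATFORM"
AUTHORING_PURPOSES = {"code_generation", "applies_when_generation", "distillation", "reasoning"}
# Hypothetical names for the content-bearing fields of a telemetry record.
CONTENT_FIELDS = ("prompt", "completion", "tool_args")


def resolve_content_class(
    purpose: str,
    *,
    customer_mode: bool = False,
    parent_class: Optional[str] = None,
    caller_class: Optional[str] = None,
) -> str:
    """Map a purpose to its content class, following the resolver list above."""
    if purpose in AUTHORING_PURPOSES:
        return OPERATIONS
    if purpose in {"world_agent", "agent_turn"}:
        return PLATFORM if customer_mode else OPERATIONS
    if purpose == "agent_tool":
        if parent_class is None:
            raise ValueError("agent_tool needs the parent span's content_class")
        return parent_class
    if purpose == "embedding":
        if caller_class is None:
            raise ValueError("embedding needs a caller-supplied content_class")
        return caller_class
    raise ValueError(f"unknown purpose: {purpose!r}")


def strip_for_export(record: dict[str, Any]) -> dict[str, Any]:
    """Drop content fields from PLATFORM-class records before third-party export."""
    if record.get("content_class") != PLATFORM:
        return dict(record)
    return {k: v for k, v in record.items() if k not in CONTENT_FIELDS}
```

Raising on a missing parent or caller class, rather than defaulting, keeps an unclassified call from being exported with its content intact.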
Why this matters. Cost attribution, rate-limit investigation, model-choice decisions, and
provider-drift detection all key on purpose. A generic “LLM call count” dashboard is not
actionable; a dashboard sliced by purpose × model × bc × content_class is.
Trace context propagation across contexts
An override-pattern signal aggregation in spectral.platform emits an
override_pattern_signal.aggregated event — a domain event consumed by spectral.worlds. The
World Agent then reasons over the event and potentially proposes rule candidates. The entire
path must be traceable as one logical operation.
The rule
W3C Trace Context (traceparent / tracestate), or an equivalent propagation envelope, is
carried across every event between contexts. The consumer side of an event opens a new span as a
child of the producer’s span. One logical operation can be walked start-to-finish even when it
spans worlds and platform, multiple workers, and multiple LLM calls.
What carries the context
- HTTP requests: standard OTEL HTTP propagation. Incoming traceparent is honoured; if absent, a new root span is started.
- Domain events: events typed in spectral.core carry an envelope field that includes the producer’s trace_id and span_id (the propagation shape itself is a spectral.core contract; changes to it follow spectral.core governance ADR-065).
- Agent tool calls: tool invocations inherit the agent’s span context; LLM calls made by tools inherit further still.
- Worker-job dispatch: the AgentTask row carries the trace context; the worker picks it up and opens child spans.
The guarantee
If an override-pattern signal aggregation causes a world-model rule-candidate proposal, the operator looking at the proposal can walk back through:
- RuleCandidate (worlds, distill phase)
- ← override_pattern_signal.aggregated event (carries trace context)
- ← override-pattern signal record (platform, decision-flagging path)
- ← /decide invocation span (platform)
- ← HTTP /decide request (API)

One trace_id threads the entire path. No guessing, no manual correlation.
What breaks if we lose propagation
- Incident forensics degrades to database archaeology (looking up rows by timestamps and hoping they correlate).
- Cost-attribution across worlds and platform becomes impossible (an LLM call during authoring-time work cannot be attributed to the decision activity that motivated it).
- The “why did this candidate appear?” question has no mechanical answer; operators end up guessing from timing.
The rule is non-optional. An event published without a trace context is a bug.
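One way to make that rule hard to violate is to fail at publish time. A minimal sketch follows, assuming a hypothetical `publish` helper and an envelope that nests trace_id and span_id under a `trace` key. Neither is taken from the spectral.core contract.

```python
from typing import Any, Callable


class MissingTraceContext(ValueError):
    """Raised when an event is published without trace context."""


def publish(event: dict[str, Any], send: Callable[[dict[str, Any]], None]) -> None:
    # trace_id and span_id are the envelope fields named on this page;
    # nesting them under a "trace" key is an assumption of this sketch.
    trace = event.get("trace") or {}
    missing = [k for k in ("trace_id", "span_id") if not trace.get(k)]
    if missing:
        raise MissingTraceContext(
            f"event {event.get('type')!r} lacks {', '.join(missing)}"
        )
    send(event)
```

Failing loudly at the producer surfaces the bug where it is cheapest to fix, before a consumer silently starts an unrelated root trace.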
Tooling realization
The principles above are in force from commit one. The concrete tooling that realizes them (vendor inventory, export destinations, per-stream redaction, retention defaults, alert rules) lives at Observability Stack per ADR-036. This page is the doctrine; that page is the inventory. Principles do not change between the two; only the runtime destination of each signal does.
What this page does NOT cover
- Tool choices. Specific vendors (Grafana Cloud / Pydantic Logfire / Sentry) and the posture matrix live in Observability stack per ADR-036.
- Retention policies. Per-class retention rules live in Data retention per ADR-042.
- Alerting discipline. Who gets paged, for what, and through which channel is captured in the operational runbooks (docs/runbooks/) alongside on-call rotation.
- Customer-visible observability. Dashboards the customer sees are part of the product surface, not the operational observability plane. See Customer Dashboard and System Card for the customer-facing decision and operational-record surfaces.
Related reading
- Architecture: the three-context topology these principles apply across
- Event System: the event shape that carries trace context between contexts
- Access Control: role-and-scope model (who can see what)
- Testing: per-layer strategy, including integration tests that cross between contexts
- Observability stack: vendors, posture, and content-class taxonomy per ADR-036
- Data retention: per-class retention policies per ADR-042