Optimization Engine
How Spectral measures, optimizes, and recommends changes to customer agent systems. This page covers the scan pipeline (seven phases, two tracks, two evaluation authorities), the CompositeScore schema that anchors consistency across phases, the verdict engine, and the governance surface — autonomy modes and integration tier — that controls how scan output reaches the customer.
The page serves three readers. Engineers building or extending the pipeline get the seven-phase sequence, composite-score schema, verdict gate set, and autonomy-mode handling. Strategic readers need the defense the pipeline encodes — two evaluation authorities (world-model + customer-rubric) blended into a composite score so optimization can’t trivially game one signal at the expense of the other (the failure mode Goodhart’s Law names). Reviewers auditing methodology need the holdout strategy, the statistical-uniqueness anchor, and the verdict-gate set.
The strategic claim is the two-authority defense. Optimization rewards what the rubric measures; if the customer authors the rubric, optimization rewards what the customer thought to test for. Spectral runs two scoring authorities in parallel — a world-model authority anchored to a domain standard the customer didn’t author, plus a customer-steerable rubric authority — and blends them into a CompositeScore. Neither authority can crowd the other out, so an agent that “improves” by chasing one signal trips the other. That balance is the page’s strategic center; the seven phases below are the mechanism that delivers it. See Two-authority evaluation for the full treatment.
Two-track architecture
Every scan runs both tracks when data is available, or the synthetic track alone as a valid
fallback. Once preflight admits a scan, there is no blocked state — both readiness modes (Full and
synthetic_only) produce a verdict; the only non-start path is a Worlds-unavailable error
(see Scan preflight below).
| Track | Source | Role |
|---|---|---|
| Synthetic EvalSet | spectral.worlds generates a statistically unique EvalSet per scan | Optimization signal. Agent runs against EvalSet stimuli; this is what drives candidate selection. |
| Real-world conformance | Curated OtelTrace samples with human-validated ground truth, supplied by CurationService | Convergence anchor. Agent’s real-world performance is measured against validated ground truth. Runs when sufficient validated samples exist. |
Scan Readiness is reported as a preflight observation on the scan record: Full (both tracks)
or synthetic_only. It is not a blocking gate — a scan can always run the synthetic track alone
and still produce a verdict.
Scan preflight
Preflight runs in the scan orchestrator — the application-layer component that owns the
scan lifecycle — immediately before the Observe phase begins. It writes a
ScanReadinessObservation record to the Scan row, then unconditionally proceeds to Observe:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass(frozen=True)
class ScanReadinessObservation:
    mode: Literal["Full", "synthetic_only"]
    evalset_available: bool
    curation_samples_count: int
    missing_reasons: list[str]  # empty when mode == "Full"
    observed_at: datetime
```

mode = Full when Worlds can produce an EvalSet and at least the curation-minimum sample count is available; mode = synthetic_only when an EvalSet is producible but conformance samples are below the minimum. When Worlds cannot produce an EvalSet at all, preflight raises an orchestrator-level error and the scan surfaces a scheduled retry — that is a scan-start failure, not a preflight observation.
Preflight observes and emits; it never blocks the scan from running when any valid mode is
possible. The curation service emits its own readiness signals; preflight consumes the latest
or queries synchronously to decide curation_samples_count. Curation readiness is the
source-of-truth for conformance-sample availability.
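Sketched in code, the readiness decision reduces to a small pure function. This is a hypothetical sketch: the curation-minimum constant, function name, and error type are assumptions, not the orchestrator's actual API.

```python
CURATION_MIN = 20  # hypothetical curation-minimum sample count

class WorldsUnavailableError(RuntimeError):
    """Scan-start failure: Worlds cannot produce an EvalSet at all."""

def readiness_mode(evalset_available: bool, curation_samples_count: int) -> tuple[str, list[str]]:
    """Return (mode, missing_reasons) for the ScanReadinessObservation."""
    if not evalset_available:
        # Not an observation: orchestrator-level error, scan surfaces a scheduled retry.
        raise WorldsUnavailableError("EvalSet not producible")
    if curation_samples_count >= CURATION_MIN:
        return "Full", []
    return "synthetic_only", [
        f"conformance samples below minimum ({curation_samples_count} < {CURATION_MIN})"
    ]
```

Note there is no branch that returns a blocked state: once an EvalSet is producible, some valid mode exists.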
Vocabulary
- Scan. An evaluation run against synthetic traces generated by the world model in conformance with world model rules. A scan is evaluation against the EvalSet the world model produces, not against world model rules directly.
- EvalSet. Produced by spectral.worlds. Statistically unique per request (per ADR-028). Each sample carries stimulus text, ground truth co-generated from the originating rule, and a stimulus_weight derived from rule confidence.
- OtelTrace. Permanent customer production record. Never modified after ingestion.
- ScanTrace. Ephemeral scan-execution record — the agent’s response to an EvalSet stimulus. Gains a provenance field recording stimulus source.
Phase sequence
The pipeline runs seven phases in order after preflight completes. Each phase completes fully before the next begins; phase context is serialized after each phase for fault tolerance and resume.
preflight (orchestrator pre-check) → Observe → Calibrate → Diagnose → Evaluate → Optimize → Safety → Verdict
Observe
- Consumes the ScanReadinessObservation written by the preflight step.
- Requests a statistically unique EvalSet from spectral.worlds synchronously at scan start, submitting the workspace’s EvalSetParameterization as the request body. If Worlds cannot produce an EvalSet, the scan errors and retries on the next schedule tick (this path is already surfaced by preflight’s error mode).
- Receives curated conformance samples from CurationService. The readiness state (Full or synthetic_only) was already written by preflight; Observe does not recompute it.
- Runs the customer agent against synthetic EvalSet stimuli and (where available) conformance samples. Produces ScanTrace records with provenance fields identifying stimulus source.
- Partition logic (working vs holdout) is rebuilt against the EvalSet structure.
Calibrate
Adjusts scoring thresholds based on the observed score distribution. No spectral.worlds
interaction.
Diagnose
Clusters failures into FailureCluster records (spectral.platform.domain.clustering).
Quarantines infrastructure failures and parse failures before clustering so only quality
EvalResults feed the LLM clusterer.
Two-authority opacity. The clustering prompt receives only rubric scorer explanations and
scores — world-model authority outputs do not cross the clustering prompt boundary. Opacity is
enforced at the input shape, not by post-hoc filtering: the clusterer’s EvalResult projection
includes scoring_authority = rubric rows and excludes the world-model authority’s view
(per ADR-014).
Cluster lifecycle. Each cluster carries an actioning_status enum (identified, addressed, persistent, resolved) with validated transitions enforced by the repository:
identified -> addressed -> resolved covers the standard remediation path;
identified -> persistent -> resolved covers re-emergence of a previously-addressed cluster.
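The lifecycle above can be sketched as a transition table; the enum values come from this page, while the function shape is a hypothetical stand-in for the repository's validation.

```python
# Validated actioning_status transitions (hypothetical enforcement sketch).
VALID_TRANSITIONS: dict[str, set[str]] = {
    "identified": {"addressed", "persistent"},  # standard path or re-emergence path
    "addressed": {"resolved"},
    "persistent": {"resolved"},
    "resolved": set(),  # terminal
}

def transition(current: str, target: str) -> str:
    """Raise on any transition the repository would reject."""
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid actioning_status transition: {current} -> {target}")
    return target
```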
Detection event. When a cluster crosses the detection threshold, Diagnose emits
platform.failure_cluster.detected with the producer-typed payload at
spectral.platform.contracts.events.failure_cluster_detected. The event carries cluster_id,
severity, failure_count, first_observed_at / last_observed_at, evidence_bundle, a
sanitized summary, and a suggested_rule_stub. It does not carry raw customer output
text — sanitization is verified by a content-contract test.
Consumer paths off the event:
- Operations Agent (intra-platform) — upserts platform.rule_candidates_pending on every detection so operators see the cluster surface immediately.
- World Agent (in spectral.worlds) — applies a consumer-side promotion-threshold filter (frequency_pct >= 10, effect_size >= 15, actionable = true, computed over the event stream) and seeds rule-candidate exploration only when the higher bar is met. The threshold logic is consumer-resident so the wire shape stays single-event.
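The consumer-side filter is simple enough to sketch directly. The threshold values come from this page; the aggregate-dict shape is an assumption.

```python
def meets_promotion_threshold(aggregate: dict) -> bool:
    # Consumer-resident filter in the World Agent: the wire shape stays
    # single-event; frequency_pct / effect_size are computed over the stream.
    return (
        aggregate.get("frequency_pct", 0) >= 10
        and aggregate.get("effect_size", 0) >= 15
        and aggregate.get("actionable", False) is True
    )
```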
Evaluate
Runs two scorers in parallel on all traces from both tracks. See Two-authority evaluation below.
Attribution fields. Each EvalResult carries scoring_authority, track, and
stimulus_source. stimulus_weight on each EvalSet sample (set by worlds, derived from
generating rule confidence) is applied at the EvalResult level when computing the world-model
authority’s contribution to the composite. Spectral treats stimulus_weight as a scalar
attribution input — rule internals never cross the context boundary.
EvalSet sourcing. The scorer consumes EvalSets via the callee-owned EvalSetProvider
Protocol at spectral.worlds.contracts.protocols.eval_set_provider
(per ADR-070 Tier 2 —
multi-consumer eval criteria). No rule structure is reachable from the platform side at scoring
time.
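A minimal sketch of the callee-owned Protocol pattern, assuming hypothetical field names on the sample type; the real surface lives at spectral.worlds.contracts.protocols.eval_set_provider.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence, runtime_checkable

@dataclass(frozen=True)
class EvalSetSample:
    stimulus_text: str
    ground_truth: str       # co-generated with the stimulus at generation time
    stimulus_weight: float  # scalar attribution input; rule internals stay opaque

@runtime_checkable
class EvalSetProvider(Protocol):
    # Callee-owned surface: spectral.worlds defines the Protocol,
    # spectral.platform depends on the shape, never on worlds internals.
    def get_eval_set(self, parameterization: dict) -> Sequence[EvalSetSample]: ...

class StubWorlds:
    """Hypothetical stand-in for the worlds-side implementation."""
    def get_eval_set(self, parameterization: dict) -> Sequence[EvalSetSample]:
        return [EvalSetSample("stimulus text", "expected response", 0.8)]
```

The Protocol carries only the published payload shape, which is exactly why no rule structure is reachable from the platform side.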
Optimize
Generates candidate mutations via the strategy registry and runs a tournament to select the winning candidate.
Candidate types. Tournament evaluates four candidate types, each generated by a distinct mutation strategy:
| Type | Mutation profile |
|---|---|
| surgical | Targeted edits to specific failing clusters (smallest blast radius) |
| conservative-rewrite | Bounded prompt rewrite preserving structural intent |
| general | Broader structural mutations (largest blast radius among non-history-informed) |
| history-informed | Mutations seeded by prior workspace RegressionRecord patterns to avoid known regressions |
Two-pass evaluation. Tournament runs two passes for cost discipline:
- Pre-screen pass — runs candidates against ≤ 5 samples using the rubric scorer only. Cheaply culls obviously-failing candidates before invoking the more expensive world-model scorer. The cap is intentional: pre-screen is a culling gate, not a measurement.
- Full-evaluation pass — survivors run against the full working set with both authorities (world-model + rubric); stimulus_weight is applied at the EvalResult level when computing each candidate’s CompositeScore (see Two-authority evaluation).
Tournament scoring runs concurrently with bounded concurrency (semaphore-limited).
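Semaphore-limited concurrent scoring can be sketched with asyncio; candidate and scorer types here are placeholders, not the tournament's actual interfaces.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

async def score_candidates(
    candidates: Sequence[str],
    score_fn: Callable[[str], Awaitable[float]],
    max_concurrency: int = 4,
) -> list[tuple[str, float]]:
    # Semaphore caps in-flight scoring calls; gather preserves candidate order.
    sem = asyncio.Semaphore(max_concurrency)

    async def scored(candidate: str) -> tuple[str, float]:
        async with sem:
            return candidate, await score_fn(candidate)

    return await asyncio.gather(*(scored(c) for c in candidates))
```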
Regression-avoidance signal. Tournament consumes recent RegressionRecord entries from the
workspace’s regression record store (see Regression records below) and
penalizes replay of regressed mutation patterns within the same workspace via adaptive composite
weighting per ADR-020.
Safety
Content safety checks on the winning candidate’s outputs. No spectral.worlds interaction. The
safety gate runs late — after Optimize selects a candidate — because running safety before Optimize
would prejudge candidates that a safer mutation would render acceptable.
Verdict
Multi-gate GO / NO-GO engine. The eight core verdict gates are pure functions in
spectral.platform.domain.verdict with no infrastructure imports — enforced by the architecture
validator. A convergence gate runs alongside. The delta threshold gate operates on
blended_delta; every other gate operates on rubric scorer data.
| # | Gate | Operates on | Outcome contribution |
|---|---|---|---|
| 1 | Delta threshold | blended_delta (CompositeScore) | NO-GO if improvement below workspace threshold |
| 2 | Agent regression (severe + mild) | Rubric scorer per-agent score deltas | NO-GO on severe regression; CAUTION on mild |
| 3 | Dimension regression | Rubric scorer per-dimension deltas | NO-GO if any rubric dimension floor is violated |
| 4 | Holdout generalization gap | Rubric scorer holdout vs working-set delta | NO-GO if holdout significantly underperforms working-set (synthetic holdout partition only — see Holdout protocol below) |
| 5 | Bootstrap 95 % CI | Rubric scorer score distribution | NO-GO if confidence interval crosses zero |
| 6 | Output similarity | Rubric scorer output embeddings | CAUTION on unexplained semantic drift |
| 7 | Pareto cost / latency penalty | Rubric scorer + cost / latency telemetry | NO-GO on Pareto-dominated outcome (degraded cost OR latency without compensating quality gain) |
| 8 | Sanity downgrade | Rubric scorer distribution shape | CAUTION on suspicious distribution (e.g., all-perfect or all-zero scores). Rubric scorer only — world-model scorer’s discriminative quality is already expressed via stimulus_weight per ADR-014, so applying sanity downgrade there would double-count. |
| + | Convergence gate | convergence_delta (CompositeScore) | CAUTION on conformance-track convergence drift; workspace-configurable hard NO-GO escalation |
Verdict also emits scan.convergence.delta per scan with explicit absence-marker semantics:
- Conformance data available: event carries the convergence delta (real-world vs synthetic EvalSet performance).
- Conformance data not available: event carries an explicit absence marker with reason.
Absence is a signal, not silence. WorldAgent aggregates absence at scale as a world-model-adoption signal.
VerdictResult and CompositeScore are defined in spectral.platform.domain.tournament per
ADR-020 — platform-internal types,
not spectral.core shared kernel.
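As a sketch of the pure-function gate shape (gate names from the table above; the NO-GO-dominates, CAUTION-downgrades precedence is an assumption consistent with the outcomes table):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateResult:
    gate: str
    outcome: str  # "pass" | "nogo" | "caution"

def delta_threshold_gate(blended_delta: float, workspace_threshold: float) -> GateResult:
    # Pure function of its inputs: no infrastructure imports, trivially
    # unit-testable, which is what the architecture validator enforces.
    outcome = "pass" if blended_delta >= workspace_threshold else "nogo"
    return GateResult("delta_threshold", outcome)

def combine(results: list[GateResult]) -> str:
    # Assumed precedence: any NO-GO dominates; otherwise any CAUTION
    # downgrades; otherwise GO.
    outcomes = {r.outcome for r in results}
    if "nogo" in outcomes:
        return "nogo"
    if "caution" in outcomes:
        return "caution"
    return "go"
```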
Holdout protocol
The holdout generalization gap gate consumes the synthetic EvalSet holdout partition exclusively. Conformance samples are scarce and reserved as convergence anchors — they are NOT consumed for the holdout generalization gap gate. The EvalSet carries an explicit two-layer holdout structure (working set + holdout); the verdict engine reads only the synthetic track.
blend_ratio and blended_delta
The delta threshold gate’s input field blended_delta is the stimulus-weight-derived composite
delta. The blend ratio that combines the world-model authority and rubric authority contributions
into blended_delta is computed at scan time from aggregate stimulus_weight, not configured
per workspace. Workspace configuration does not accept blend_ratio as a tunable.
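A hypothetical sketch of scan-time blending. The exact derivation of the ratio from aggregate stimulus_weight is not specified here; the mean and the bounds are illustrative only, chosen to show that neither authority can crowd the other out.

```python
def blend_ratio(stimulus_weights: list[float]) -> float:
    # ILLUSTRATIVE: world-model share derived from mean stimulus_weight,
    # bounded so neither authority dominates. Not the production formula.
    if not stimulus_weights:
        return 0.5
    mean_w = sum(stimulus_weights) / len(stimulus_weights)
    return min(0.7, max(0.3, mean_w))

def blended_delta(world_model_delta: float, rubric_delta: float,
                  stimulus_weights: list[float]) -> float:
    r = blend_ratio(stimulus_weights)
    return r * world_model_delta + (1 - r) * rubric_delta
```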
Verdict outcomes
The go_nogo field on VerdictResult is one of four values. The fourth value reuses the
observe_only label from the autonomy-mode taxonomy because both refer to the same workspace
state — a verdict outcome of observe_only is the natural shape produced when the workspace
itself is in autonomy mode observe_only.
| go_nogo | Meaning |
|---|---|
| go | Promotion recommended — all gates pass |
| caution | Mixed signals — requires human review. Autonomy modes never auto-accept caution regardless of configuration. |
| nogo | Do not promote — gate(s) fired |
| observe_only | The workspace is in observe_only autonomy mode — no changeset created (see Autonomy governance) |
Two-authority evaluation
Evaluation runs two scoring authorities in parallel on every ScanTrace, then combines their outputs into a single composite. Each authority has a different epistemic basis:
| Authority | Scoring basis | Owned by |
|---|---|---|
| World-model scorer | Ground truth co-generated with the EvalSet stimulus at generation time (see ADR-014). Answers: “Did the agent produce the response the rule says it should?” | spectral.worlds produces ground truth; spectral.platform consumes it. |
| Rubric scorer | LLM-as-judge against the workspace’s Evaluation Framework rubric. Answers: “How does this output score on the rubric’s dimensions?” Produces natural-language explanations that the diagnose phase’s clusterer reasons over. | spectral.platform. |
Opacity discipline between contexts
The world-model scorer’s inputs — ground truth, world-model-rule-derived scoring dimensions, and
stimulus_weight — are packaged into each EvalSet sample by spectral.worlds and consumed via
the callee-owned EvalSetProvider Tier 2 Protocol at
spectral.worlds.contracts.protocols.eval_set_provider per
ADR-065 D3 +
ADR-070 Tier 2 (LLM-tool wrapping
inside apps/test-agents is the qualifying multi-framework-consumer condition; Observe / tournament /
Evaluate are intra-spectral.platform phases that consume the same producer-typed payload through
the platform-side caller). Rule internals never cross the context boundary; the scorer reasons over
the producer-typed payload’s published shape, not over rule structure. The architecture validator at
STRICT=True enforces no spectral.worlds imports into spectral.platform as the structural
backstop; the EvalSetProvider Protocol surface is the data-flow assertion that complements it.
Why two authorities
A single authority lets an agent Goodhart the metric. Two authorities that draw from different signal bases keep the evaluation surface grounded:
- The world-model scorer anchors to the authority_version under which the EvalSet was generated. Its verdict is binary-ish: did the response match ground truth or not?
- The rubric scorer is customer-steerable. It encodes what the customer cares about — dimension weights, scoring guidance, hard-constraint floors.
- Neither authority can crowd the other out, because stimulus_weight is bounded and the composite blends both.
The two-authority defense rests on the world-model authority being credible — if the world model is wrong, two-authority evaluation just blends a flawed authority with a customer-steerable rubric. That credibility is built upstream, not inside the evaluation step: the four-tier provenance system grounds rules in authoritative sources, the conformity gate provides mechanical validation independent of any operator’s judgment, and methodology disclosure on the System Card makes every rule’s provenance auditable post-hoc. Two-authority evaluation is the Goodhart defense at evaluation time; the methodology stack the system card discloses is the credibility defense at authority time.
CompositeScore schema
Defined under spectral.platform.domain.tournament.* per
ADR-020 +
ADR-065 D1
(domain types do not live in the kernel — kernel admission discipline rules them out).
Every phase that produces a score (tournament pre-screen, verdict validation, system card
reporting) emits and consumes the same CompositeScore shape:
| Field | Description |
|---|---|
| world_model_score | Aggregated world-model authority score for the scan |
| rubric_score | Aggregated rubric authority score for the scan |
| blended_delta | Champion → patch improvement on the stimulus-weight-adjusted composite |
| convergence_delta | Conformance (real-world) vs synthetic performance delta. Null when conformance samples absent. |
| per_track_breakdown | {synthetic: {world_model, rubric, n_samples}, conformance: {...}} |
| attribution | World model version, authority_version, rule references (reference into worlds per ADR-030 + ADR-065 D2 producer-typed payload) |
Consistency across phases is enforced by type, not by shared compute. Every phase reads and
produces the same CompositeScore; the verdict engine uses the same blending logic as the
tournament pre-screen.
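The schema can be sketched as a frozen dataclass; field names follow the table, types are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TrackBreakdown:
    world_model: float
    rubric: float
    n_samples: int

@dataclass(frozen=True)
class CompositeScore:
    world_model_score: float
    rubric_score: float
    blended_delta: float                # champion -> patch improvement
    convergence_delta: Optional[float]  # None when conformance samples absent
    per_track_breakdown: dict[str, TrackBreakdown]
    attribution: dict[str, str]         # world model version, authority_version, rule refs
```

Because every phase reads and writes this one shape, consistency is a property of the type, not of any shared compute path.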
Rubric provisioning (rubric_gen)
The rubric scorer consumes the workspace’s Evaluation Framework rubric. rubric_gen is the
zero-setup cold-start path that produces a viable rubric for new workspaces from either observed
agent traces or a stated objective — workspaces never start without a scorable rubric.
rubric_gen is workspace-onboarding capability, not per-scan capability — it runs once at
workspace setup (or when an operator triggers a rubric refresh) and writes its output to the
workspace’s EvaluationFramework. The rubric scorer reads the framework on each scan; ongoing
rubric refinement happens through the Rubric audit feedback loop, not through
re-running rubric_gen.
LLM tier: reasoning (per LLM routing below) — discovery and synthesis cost is acceptable at workspace setup, unlike per-scan rubric scoring which uses the scoring tier.
Rubric divergence records
Per-scan rubric scorer outputs are compared against world-model scorer outputs to compute a
rubric divergence delta — a measurement of how much the customer rubric diverges from the
world-model authority on each scan. The delta is persisted as a RubricDivergenceRecord domain
record (workspace-scoped, RLS per
ADR-033, retention per
ADR-042) — not agent memory per ADR-058 D14 non-mirror list,
the same category as RegressionRecord and InterventionLog. Schema documented in
domain-model — RubricDivergenceRecord.
Each scan also emits a rubric.divergence typed event to spectral.worlds (regardless of
conformance-sample availability; payload module planned at
spectral.platform.contracts.events.rubric_divergence per ADR-065 D2).
The World Agent aggregates divergence across workspaces as a world-model-evolution signal;
single-workspace divergence remains a scan observation and does not initiate rule revision —
only cross-workspace aggregation (handled in spectral.worlds) is a rule-evolution signal.
Event emissions
Events with producer-owned typed payload modules in <context>.contracts.events.* per
ADR-065 D2:
| Event | Emitted by | Carries |
|---|---|---|
| platform.failure_cluster.detected (spectral.platform.contracts.events.failure_cluster_detected) | Diagnose (every cluster crossing detection threshold; World Agent applies promotion-threshold filter consumer-side) | Cluster ID, severity, failure count, first/last observed, evidence bundle, sanitized summary, suggested rule stub |
| rubric.divergence | Evaluate (always, per scan, regardless of conformance-sample availability) | Workspace ID, scan ID, evaluation framework ID, divergence delta, observed_at |
| verdict.issued | Verdict (always, per scan) | Workspace ID, scan ID, verdict, composite score, evaluation_authority_ref, issued_at |
| scan.convergence.delta | Verdict (always, per scan) | Convergence delta with presence-or-absence marker |
| scan.completed | Verdict (always, per scan) | Summary + outcome |
| approval.required | on_scan_completed handler when verdict triggers it (always when autonomy mode is manual; kill-switch and bounded-auto fall-through cases land in the second alpha autonomy wave) | Changeset ID + reason |
Autonomy governance
Autonomy modes govern how verdict output reaches changesets. The 0.3.0 alpha ships these in two waves:
- First wave lands the alpha-bound subset: observe_only + manual (default), enforced in the on_scan_completed handler. No gate evaluation, no kill switch, no fall-through arbitration.
- Second wave extends the handler to cover recommend, bounded_auto, the four-gate framework, and the kill switch — completing the alpha autonomy surface. auto_test and guarded_auto are post-launch, deferred outside the 0.3.0 alpha milestone.
Autonomy modes
observe_only and manual (default) ship in the first alpha wave; recommend and
bounded_auto ship in the second.
| Mode | Changeset created | Application path | Notes |
|---|---|---|---|
| observe_only | No | — | Enforced in the on_scan_completed handler before changeset creation; no changeset record exists. See Observe-only data treatment below — measurement is unaffected. |
| manual | Yes | Always approval.required | Default mode at workspace bootstrap; the handler always creates a changeset and emits approval.required. Operator-driven explicit control. |
| recommend | Yes | Always approval.required | Mechanically identical to manual; semantic intent is “Spectral recommends, human curates.” |
| bounded_auto | Yes | Auto-accept within gates; approval.required otherwise | Auto-accepts when composite score clears workspace-configured thresholds. |
caution verdicts are never auto-accepted regardless of mode or gate configuration. This is a
hard rule, not a threshold.
Bounded-auto gates
Workspace-configurable thresholds evaluated against the CompositeScore snapshot attached to the changeset:
- min_blended_delta — minimum score improvement required
- min_world_model_score — floor on the world-model authority score
- max_rules_affected — cap on blast radius per changeset
- require_validated_changeset — only changesets that have passed the validated terminal state are eligible
All gates must pass for auto-acceptance. Any single failure routes to approval.required.
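The all-gates-must-pass rule, sketched with hypothetical names (the caution hard rule from Autonomy modes is folded in):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundedAutoGates:
    min_blended_delta: float
    min_world_model_score: float
    max_rules_affected: int
    require_validated_changeset: bool

def auto_accept(verdict: str, blended_delta: float, world_model_score: float,
                rules_affected: int, changeset_validated: bool,
                gates: BoundedAutoGates) -> bool:
    # caution is never auto-accepted, regardless of gate configuration.
    if verdict != "go":
        return False
    # All gates must pass; any single failure routes to approval.required.
    return (
        blended_delta >= gates.min_blended_delta
        and world_model_score >= gates.min_world_model_score
        and rules_affected <= gates.max_rules_affected
        and (changeset_validated or not gates.require_validated_changeset)
    )
```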
Observe-only data treatment
observe_only mode suppresses actuation, not measurement. Everything the scan pipeline
observes, computes, and emits still happens; the only path that is skipped is ChangeSet creation.
| Data / signal | Behavior in observe_only |
|---|---|
| VerdictResult | Stored — same Scan row, same verdict table, same schema as any other mode. |
| CompositeScore | Stored — attached to the scan record. |
| Dashboard surfacing | Visible — verdicts and scores render on the customer dashboard identically. The UI labels them “observe-only” so there is no confusion about whether a ChangeSet exists. |
| Spectral Agent proactive conversation | Opened — ScanCompletedEvent still fires; the agent summarizes verdicts normally. The agent will not propose applying a ChangeSet because no ChangeSet exists; it can still discuss the findings. |
| Supervisor mode classification (ACTIVE / PLATEAU / FRONTIER / NO_DATA) | Fed — supervisor state is the right place to reason about “is the system improving?” and the answer must not depend on whether actuation is enabled. |
| rubric.divergence event | Emitted — the event carries measurement, not actuation. WorldAgent consumes it regardless of mode. |
| scan.convergence.delta event | Emitted — both presence and absence cases carry meaning per the Verdict phase spec. |
| platform.failure_cluster.detected event | Emitted — clustering is a measurement phase output. |
| T1 (interaction-tier) observation persistence | Persisted — the on_scan_completed handler invokes the spectral_agent_memory gateway directly per scan. T1 writes are independent of changeset creation. |
Why actuation vs measurement is the right cut
The purpose of observe_only is to let a customer watch how the system would behave before
granting any write authority. Muting measurement would turn the mode into a no-op — the customer
would not learn anything from it. The whole point is that, after N weeks in observe_only, the
customer has seen verdicts, trends, and agent reasoning that inform their decision to move to
recommend or further.
The one thing that does not happen in observe_only is ChangeSet creation. That is a
record-of-proposal and requires workspace-level intent to actuate. The on_scan_completed
handler checks the mode before creating the ChangeSet, and returns without creating one.
Downstream event emission and T1 (interaction-tier) memory persistence happen regardless of that
check. The handler itself runs in apps/workers per ADR-060.
Kill switch
A workspace-level kill switch forces approval.required on every changeset regardless of the
configured autonomy mode. It does not suppress changeset creation — scans run normally,
changesets accumulate, and every one requires human approval. Effective behavior is identical to
recommend mode while active.
The kill switch:
- Is persisted and survives service restarts
- Is audit-logged on activation and deactivation
- Aligns with the existing approval.required event path; no mode bypass is introduced
Post-launch modes
- auto_test — auto-accepts non-breaking changesets, defers breaking changes to approval. Requires a trust-baseline mechanism.
- guarded_auto — terminal rung of the autonomy ladder. Auto-accepts within policy guardrails with anomaly-driven rollback. Hard-depends on auto_test.
Behavioral specifications for both modes carry through to this page when the modes return.
Autonomy mode vs integration tier
These are two different axes. Both use tiered framing, which trips readers up. Keep them separate:
| | Integration tier | Autonomy mode |
|---|---|---|
| What it controls | Customer-facing trust progression (“how deeply does Spectral sidecar into the workflow”) | Workspace execution policy (“what happens to accepted changesets”) |
| Where it lives | Product vocabulary, customer onboarding, commercial positioning | Workspace configuration, enforced in on_scan_completed |
| Values | Stage 1 (observe + recommend), Stage 2 (observe + manage), Stage 3 (observe + manage + automate) | observe_only, manual (alpha first wave) + recommend, bounded_auto, kill switch (alpha second wave) |
| Who changes it | Commercial relationship / expansion decision | Workspace admin setting |
Neither axis subsumes the other. A Stage 2 customer can run manual (tight operator control) or
bounded_auto (automate with gates) without touching the tier.
Typical mapping
Not a hard rule — just what typically happens. The first wave covers observe_only + manual;
recommend and bounded_auto ship in the second.
| Integration tier | Typical autonomy mode |
|---|---|
| Stage 1 (observe + recommend) | observe_only or recommend — customer is still building trust in the optimization signal |
| Stage 2 (observe + manage) | recommend or manual — customer curates actively but Spectral owns scanning |
| Stage 3 (observe + manage + automate) | bounded_auto — customer has enough signal history to let gates fire |
For the customer-facing integration tiers see How Spectral Works.
Meta-improvement engine
The scan pipeline doesn’t just optimize customer agent systems — it feeds its own improvement. The meta-improvement engine tracks what mutation strategies work, identifies rubric quality issues, and guides the pipeline’s approach over time. See Memory System for the universal interaction / session / persistent lifecycle (parameterized as cycle / run / workspace for the Spectral Agent) that compounds strategy performance, and World Model System / Evolution Loop for how observed cluster patterns feed rule evolution.
Strategy registry
Tracks effectiveness of optimization strategies across runs:
- ELO ratings — strategies compete head-to-head in tournaments; ratings update on win/loss.
- Usage counts and win rates — how often each strategy is used and improves composites.
- Average improvement — expected blended_delta when a strategy is applied.
Optimize consults the registry when selecting mutation approaches. Higher-rated strategies are preferred for similar failure patterns.
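The head-to-head rating update is standard Elo; this sketch assumes a conventional K-factor of 32, which is not specified on this page.

```python
def elo_update(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    # Standard Elo: expected score from the rating gap, then a zero-sum
    # adjustment of size k * (actual - expected) for the winner.
    expected_winner = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    delta = k * (1.0 - expected_winner)
    return winner + delta, loser - delta
```

An upset (low-rated strategy beating a high-rated one) moves both ratings further than an expected result, which is what lets the registry converge on strategy quality over repeated tournaments.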
Intervention log
Every optimization intervention records outcome: per-cluster pre/post scores, predicted vs actual
improvement (calibrates the engine’s confidence), and reusability tags that feed observations into
the Spectral Agent’s persistent-tier (workspace-scope) memory. InterventionLog is a canonical
record of optimization activity, not agent memory itself — see
domain-model — InterventionLog.
Regression records
RegressionRecord is a spectral.platform domain record that captures interventions which caused
measurable regression on one or more failure clusters. Not agent memory — it is a canonical
record of optimization activity per the records-vs-memory framing
(ADR-058 D14 captures the workshop principle that distinguishes
records from memory). Stored in workspace-scoped platform.regression_records with RLS enforcement
per ADR-033 and retention governed by
ADR-042 D4. Schema documented in
domain-model — RegressionRecord.
A RegressionRecord is a dedicated entity, not a flag on InterventionLog. The two have
different responsibilities and access patterns:
- InterventionLog — record-of-action (every intervention, regardless of outcome). The optimizer queries it chronologically by workspace.
- RegressionRecord — record-of-regression (interventions that caused measurable harm). The verdict + tournament engine queries it by mutation pattern and by cluster; the World Agent queries it (after sanitisation) for rule-coverage signal.
A RegressionRecord references the originating InterventionLog entry; interventions that did
not regress carry no RegressionRecord.
Write path. When a verdict gate fires NO-GO with cluster-level regression detail (gates 2 / 3
— agent regression and dimension regression), the verdict engine writes a RegressionRecord
capturing the mutation pattern, regressed clusters, improved clusters (regressions are rarely
pure), and severity.
Read path. Tournament reads recent RegressionRecord entries during candidate selection and
penalizes replay of matching mutation patterns via adaptive composite weighting (see
Optimize / Regression-avoidance signal above).
Sanitised promotion to World Signal. A RegressionRecord whose mutation pattern is
workspace-agnostic (no PII, no workspace-specific configuration detail, just the domain-relevant
pattern + cluster class) clears the sanitisation gate and routes via the memory-to-Worlds
signal events (memory.observation.promoted / t3_memory.written) to spectral.worlds per
ADR-018. The
World Agent may surface the signal as a candidate for rule revision — the underlying rule may be
underspecified in a way that lets regressing mutations pass local evaluation.
What stays out of the record:
- No customer output text — the mutation pattern is opaque, and cluster references do not unwrap to raw content.
- No cross-workspace pattern matching — that happens in the World Agent after sanitisation, not inside the workspace-scoped record.
- No automatic retirement of strategies — a pattern with many regressions in one workspace is workspace context; strategy retirement is an operator / World Agent concern at the aggregate level.
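The sanitisation gate can be sketched as a predicate over the record. The concrete checks here (PII flag, workspace-specific flag, pattern-opacity check) are illustrative stand-ins for the real gate, which the page documents only by its criteria.

```python
# Hypothetical sanitisation gate: promotable to spectral.worlds only when the
# mutation pattern is workspace-agnostic. Flag names and the "workspace:"
# substring check are illustrative assumptions.
def clears_sanitisation_gate(record: dict) -> bool:
    pattern = record["mutation_pattern"]
    return (
        not record.get("contains_pii", False)            # no PII
        and not record.get("workspace_specific", False)  # no workspace config detail
        and "workspace:" not in pattern                  # pattern + cluster class only
    )
```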
Rubric audit
Section titled “Rubric audit”Calibrate and Evaluate emit signals about rubric quality — high-variance dimensions, ambiguous guidance, score distribution shifts. The rubric audit surface summarizes these for the Spectral Agent, which proposes rewrites the operator can accept or decline.
LLM routing
Section titled “LLM routing”Tier-based model routing balances capability and cost:
| Task tier | Used for | Selection criteria |
|---|---|---|
| Scoring | Evaluation (both authorities), tournament scoring, holdout validation | Cost-optimized, high throughput |
| Detection | Parse checks, anti-deception, safety screening | Lowest cost, fast |
| Reasoning | Diagnosis, optimization, prompt rewrites, mutation generation | Highest capability |
| Customer | Agent execution during Observe | Customer’s own model (passed through) |
Fallback hardening ensures model failures don’t halt the pipeline. Consecutive failures trigger automatic tier disabling for the remainder of the scan.
Supervision & scheduling
Section titled “Supervision & scheduling”For customers on continuous or periodic cadences:
- Scheduled scans run at configured intervals.
- Supervisor integration consults a planning function before each scan to determine priority and budget allocation.
- Frontier detection recognizes when optimization has plateaued and switches to monitoring-only.
- Economic reasoning uses cost-per-failure and revenue-per-success to prioritize optimization where it has the highest business impact.
Bayesian category selection — intervention_memory_adjustment
Section titled “Bayesian category selection — intervention_memory_adjustment”The supervisor’s category-selection step reads from agent memory, not from raw scan history.
The supervisor consults Tier-3 (workspace-scope) observations for an intervention_memory_adjustment
that nudges category priors based on past intervention outcomes — categories whose past
interventions correlated with regression get downweighted; categories whose past interventions
moved the needle get upweighted. The adjustment is a posterior nudge on a Bayesian prior, not
a hard override; recent observations dominate older ones via the standard Tier-3 decay schedule.
The integration is read-only at the supervisor seam — the supervisor never writes to memory; the
adjustment value flows through as part of the planning function’s input.
Supervisor recommendation delivery
Section titled “Supervisor recommendation delivery”The supervisor produces SupervisorRecommendation records — guidance about what to scan next,
what budget to allocate, what to prioritize. These are consumed by the Spectral Agent (and, at
a lesser priority, by operational dashboards).
Delivery is event-driven via a supervisor.recommendation.issued event with producer-typed
payload planned at spectral.platform.contracts.events.supervisor_recommendation_issued per
ADR-065 D2 (lands with
the supervisor epic).
Event shape:
| Field | Type | Notes |
|---|---|---|
| event_id | UUID | |
| workspace_id | UUID | |
| recommendation_id | UUID | SupervisorRecommendation primary key |
| mode_classification | enum (ACTIVE \| PLATEAU \| FRONTIER \| NO_DATA) | |
| priority | enum | Ordered set of priority tags the supervisor has reasoned about |
| budget_hint | optional decimal | Spend cap guidance for the next scan, when the supervisor has an opinion |
| narrative | string | Short natural-language rationale |
| issued_at | timestamp | |
Why event-driven:
- Consistent with the event-substrate doctrine. Heavy/async work dispatches via events; direct query coupling across the supervisor-to-agent boundary would break the pattern.
- Multiple consumers supported. The Spectral Agent is the primary consumer, but operational dashboards and the Operations Agent’s observability tooling may also subscribe without the supervisor needing to know about them.
- Decoupled timing. The supervisor emits when it has reasoned; the agent consumes when it is triggered, which may be a different moment (e.g., when a customer sends a chat message).
- Survives restart. Unlike a direct-call-state-query model, the event record is durable — an agent restart does not lose the recommendation.
Why not attached to scan.completed:
The supervisor can issue recommendations that are not tied to a specific scan (periodic plateau
detection, budget reallocation after a billing event, etc.). Forcing them onto scan.completed
would lose those cases.
Why not polling:
Polling is the wrong direction. The supervisor is the producer; pushing its recommendations onto an event bus keeps authority where the reasoning is.
Consumer hookup: OnSupervisorRecommendationHandler in spectral.platform’s agent
application layer creates a proactive conversation (or updates an existing one) with the
recommendation narrative. See Agent Architecture — Event-driven proactive conversations.
Next steps
Section titled “Next steps”- Domain Model — entities, state machines, relationships
- World Model System — how EvalSets and ground truth are produced
- Memory System — universal interaction / session / persistent lifecycle compounding and the world-signal path
- Agent Architecture — Spectral Agent, Operations Agent, WorldAgent