
Optimization Engine

How Spectral measures, optimizes, and recommends changes to customer agent systems. This page covers the scan pipeline (seven phases, two tracks, two evaluation authorities), the CompositeScore schema that anchors consistency across phases, the verdict engine, and the governance surface — autonomy modes and integration tier — that controls how scan output reaches the customer.

The page serves three readers. Engineers building or extending the pipeline get the seven-phase sequence, composite-score schema, verdict gate set, and autonomy-mode handling. Strategic readers need the defense the pipeline encodes — two evaluation authorities (world-model + customer-rubric) blended into a composite score so optimization can’t trivially game one signal at the expense of the other (the failure mode Goodhart’s Law names). Reviewers auditing methodology need the holdout strategy, the statistical-uniqueness anchor, and the verdict-gate set.

The strategic claim is the two-authority defense. Optimization rewards what the rubric measures; if the customer authors the rubric, optimization rewards what the customer thought to test for. Spectral runs two scoring authorities in parallel — a world-model authority anchored to a domain standard the customer didn’t author, plus a customer-steerable rubric authority — and blends them into a CompositeScore. Neither authority can crowd the other out, so an agent that “improves” by chasing one signal trips the other. That balance is the page’s strategic center; the seven phases below are the mechanism that delivers it. See Two-authority evaluation for the full treatment.


Every scan runs both tracks when data is available, or the synthetic track alone as a valid fallback. Once preflight admits a scan, there is no blocked state — both readiness modes (Full and synthetic_only) produce a verdict; the only non-start path is a Worlds-unavailable error (see Scan preflight below).

| Track | Source | Role |
| --- | --- | --- |
| Synthetic EvalSet | spectral.worlds generates a statistically unique EvalSet per scan | Optimization signal. Agent runs against EvalSet stimuli; this is what drives candidate selection. |
| Real-world conformance | Curated OtelTrace samples with human-validated ground truth, supplied by CurationService | Convergence anchor. The agent’s real-world performance is measured against validated ground truth. Runs when sufficient validated samples exist. |

Scan Readiness is reported as a preflight observation on the scan record: Full (both tracks) or synthetic_only. It is not a blocking gate — a scan can always run the synthetic track alone and still produce a verdict.


Preflight runs in the scan orchestrator — the application-layer component that owns the scan lifecycle — immediately before the Observe phase begins. It writes a ScanReadinessObservation record to the Scan row, then unconditionally proceeds to Observe:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass(frozen=True)
class ScanReadinessObservation:
    mode: Literal["Full", "synthetic_only"]
    evalset_available: bool
    curation_samples_count: int
    missing_reasons: list[str]  # empty when mode == "Full"
    observed_at: datetime
```

mode = Full when Worlds can produce an EvalSet and at least the curation-minimum sample count is available; mode = synthetic_only when an EvalSet is producible but conformance samples are below the minimum. When Worlds cannot produce an EvalSet at all, preflight raises an orchestrator-level error and the scan surfaces a scheduled-retry — that is a scan-start failure, not a preflight observation.

Preflight observes and emits; it never blocks the scan from running when any valid mode is possible. The curation service emits its own readiness signals; preflight consumes the latest or queries synchronously to decide curation_samples_count. Curation readiness is the source-of-truth for conformance-sample availability.
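The mode decision above can be sketched as a small pure function. This is a minimal illustration, not the orchestrator's implementation; `CURATION_MINIMUM` is a made-up placeholder (the real minimum is configuration), and only the field names mirror `ScanReadinessObservation`.

```python
from datetime import datetime, timezone

CURATION_MINIMUM = 20  # hypothetical threshold; the real value is workspace configuration

def observe_readiness(evalset_available: bool, curation_samples_count: int) -> dict:
    """Sketch of the preflight decision: observe and record, never block."""
    if not evalset_available:
        # Worlds cannot produce an EvalSet: scan-start failure, not an observation.
        raise RuntimeError("Worlds unavailable - scan errors and retries on next tick")
    full = curation_samples_count >= CURATION_MINIMUM
    return {
        "mode": "Full" if full else "synthetic_only",
        "evalset_available": True,
        "curation_samples_count": curation_samples_count,
        "missing_reasons": [] if full else ["conformance samples below curation minimum"],
        "observed_at": datetime.now(timezone.utc),
    }
```

Either returned mode proceeds unconditionally to Observe; only the raised error path is a non-start.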

  • Scan. An evaluation run against synthetic traces that the world model generates in conformance with its rules. A scan is evaluation against the EvalSet the world model produces, not against the world-model rules directly.
  • EvalSet. Produced by spectral.worlds. Statistically unique per request (per ADR-028). Each sample carries stimulus text, ground truth co-generated from the originating rule, and a stimulus_weight derived from rule confidence.
  • OtelTrace. Permanent customer production record. Never modified after ingestion.
  • ScanTrace. Ephemeral scan-execution record — agent’s response to an EvalSet stimulus. Gains a provenance field recording stimulus source.

The pipeline runs seven phases in order after preflight completes. Each phase completes fully before the next begins, and its context is serialized at phase end for fault tolerance and resume.

preflight (orchestrator pre-check) → Observe → Calibrate → Diagnose → Evaluate → Optimize → Safety → Verdict

Observe
  • Consumes the ScanReadinessObservation written by the preflight step.
  • Requests a statistically unique EvalSet from spectral.worlds synchronously at scan start, submitting the workspace’s EvalSetParameterization as the request body. If Worlds cannot produce an EvalSet, the scan errors and retries on the next schedule tick (this path is already surfaced by preflight’s error mode).
  • Receives curated conformance samples from CurationService. The readiness state (Full or synthetic_only) was already written by preflight; Observe does not recompute it.
  • Runs the customer agent against synthetic EvalSet stimuli and (where available) conformance samples. Produces ScanTrace records with provenance fields identifying stimulus source.
  • Partition logic (working vs holdout) is rebuilt against the EvalSet structure.

Calibrate

Adjusts scoring thresholds based on the observed score distribution. No spectral.worlds interaction.

Diagnose

Clusters failures into FailureCluster records (spectral.platform.domain.clustering). Quarantines infrastructure failures and parse failures before clustering, so only quality failures feed the LLM clusterer.

Two-authority opacity. The clustering prompt receives only rubric scorer explanations and scores — world-model authority outputs do not cross the clustering prompt boundary. Opacity is enforced at the input shape, not by post-hoc filtering: the clusterer’s EvalResult projection includes scoring_authority = rubric rows and excludes the world-model authority’s view (per ADR-014).

Cluster lifecycle. Each cluster carries an actioning_status enum(identified, addressed, persistent, resolved) with validated transitions enforced by the repository: identified -> addressed -> resolved covers the standard remediation path; identified -> persistent -> resolved covers re-emergence of a previously-addressed cluster.
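The validated transitions reduce to an adjacency map plus a guard, roughly the shape of a repository-enforced check. A sketch under the stated lifecycle; the function name is illustrative:

```python
# actioning_status lifecycle from the page:
# identified -> addressed -> resolved   (standard remediation)
# identified -> persistent -> resolved  (re-emergence path)
VALID_TRANSITIONS = {
    "identified": {"addressed", "persistent"},
    "addressed": {"resolved"},
    "persistent": {"resolved"},
    "resolved": set(),  # terminal state
}

def transition(current: str, target: str) -> str:
    """Reject any edge not in the lifecycle graph, as the repository would."""
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid actioning_status transition: {current} -> {target}")
    return target
```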

Detection event. When a cluster crosses the detection threshold, Diagnose emits platform.failure_cluster.detected with the producer-typed payload at spectral.platform.contracts.events.failure_cluster_detected. The event carries cluster_id, severity, failure_count, first_observed_at / last_observed_at, evidence_bundle, a sanitized summary, and a suggested_rule_stub. It does not carry raw customer output text — sanitization is verified by a content-contract test.

Consumer paths off the event:

  • Operations Agent (intra-platform) — upserts platform.rule_candidates_pending on every detection so operators see the cluster surface immediately.
  • World Agent (in spectral.worlds) — applies a consumer-side promotion-threshold filter (frequency_pct >= 10, effect_size >= 15, actionable = true, computed over the event stream) and seeds rule-candidate exploration only when the higher bar is met. The threshold logic is consumer-resident so the wire shape stays single-event.
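The World Agent's consumer-side filter reduces to a predicate over the event payload, using the thresholds quoted above. A sketch; the payload field names are assumptions about the wire shape:

```python
def passes_promotion_threshold(event: dict) -> bool:
    """Consumer-resident promotion filter (World Agent side); the wire stays single-event."""
    return (
        event.get("frequency_pct", 0) >= 10   # threshold from the page
        and event.get("effect_size", 0) >= 15  # threshold from the page
        and event.get("actionable", False)
    )
```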

Evaluate

Runs two scorers in parallel on all traces from both tracks. See Two-authority evaluation below.

Attribution fields. Each EvalResult carries scoring_authority, track, and stimulus_source. stimulus_weight on each EvalSet sample (set by worlds, derived from generating rule confidence) is applied at the EvalResult level when computing the world-model authority’s contribution to the composite. Spectral treats stimulus_weight as a scalar attribution input — rule internals never cross the context boundary.

EvalSet sourcing. The scorer consumes EvalSets via the callee-owned EvalSetProvider Protocol at spectral.worlds.contracts.protocols.eval_set_provider (per ADR-070 Tier 2 — multi-consumer eval criteria). No rule structure is reachable from the platform side at scoring time.

Optimize

Generates candidate mutations via the strategy registry and runs a tournament to select the winning candidate.

Candidate types. Tournament evaluates four candidate types, each generated by a distinct mutation strategy:

| Type | Mutation profile |
| --- | --- |
| surgical | Targeted edits to specific failing clusters (smallest blast radius) |
| conservative-rewrite | Bounded prompt rewrite preserving structural intent |
| general | Broader structural mutations (largest blast radius among non-history-informed) |
| history-informed | Mutations seeded by prior workspace RegressionRecord patterns to avoid known regressions |

Two-pass evaluation. Tournament runs two passes for cost discipline:

  1. Pre-screen pass — runs candidates against ≤ 5 samples using the rubric scorer only. Cheaply culls obviously-failing candidates before invoking the more expensive world-model scorer. The cap is intentional: pre-screen is a culling gate, not a measurement.
  2. Full-evaluation pass — survivors run against the full working set with both authorities (world-model + rubric); stimulus_weight is applied at the EvalResult level when computing each candidate’s CompositeScore (see Two-authority evaluation).

Tournament scoring runs concurrently with bounded concurrency (semaphore-limited).
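The two-pass shape with semaphore-bounded concurrency might look like the sketch below. The cull-half heuristic, scorer signature, and concurrency limit are illustrative assumptions, not the real tournament logic:

```python
import asyncio

PRESCREEN_SAMPLE_CAP = 5  # intentional cap: pre-screen is a culling gate, not a measurement

async def tournament(candidates, samples, score, max_concurrency=4):
    """Two-pass sketch: cheap rubric-only pre-screen, then full evaluation of survivors."""
    sem = asyncio.Semaphore(max_concurrency)  # bounded concurrency

    async def run(cand, batch):
        async with sem:
            return cand, await score(cand, batch)

    # Pass 1: each candidate against at most 5 samples; cull the bottom half (assumed heuristic).
    pre = await asyncio.gather(*(run(c, samples[:PRESCREEN_SAMPLE_CAP]) for c in candidates))
    pre.sort(key=lambda x: x[1], reverse=True)
    survivors = [c for c, _ in pre[: max(1, len(pre) // 2)]]

    # Pass 2: survivors against the full working set; best composite wins.
    full = await asyncio.gather(*(run(c, samples) for c in survivors))
    return max(full, key=lambda x: x[1])[0]
```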

Regression-avoidance signal. Tournament consumes recent RegressionRecord entries from the workspace’s regression record store (see Regression records below) and penalizes replay of regressed mutation patterns within the same workspace via adaptive composite weighting per ADR-020.

Safety

Content safety checks on the winning candidate’s outputs. No spectral.worlds interaction. The safety gate runs late — after Optimize selects a candidate — because running safety before Optimize would prejudge candidates that a safer mutation would render acceptable.

Verdict

Multi-gate GO / NO-GO engine. The eight core verdict gates are pure functions in spectral.platform.domain.verdict with no infrastructure imports — enforced by the architecture validator. A convergence gate runs alongside. The delta threshold gate operates on blended_delta; every other gate operates on rubric scorer data.

| # | Gate | Operates on | Outcome contribution |
| --- | --- | --- | --- |
| 1 | Delta threshold | blended_delta (CompositeScore) | NO-GO if improvement below workspace threshold |
| 2 | Agent regression (severe + mild) | Rubric scorer per-agent score deltas | NO-GO on severe regression; CAUTION on mild |
| 3 | Dimension regression | Rubric scorer per-dimension deltas | NO-GO if any rubric dimension floor is violated |
| 4 | Holdout generalization gap | Rubric scorer holdout vs working-set delta | NO-GO if holdout significantly underperforms working set (synthetic holdout partition only — see Holdout protocol below) |
| 5 | Bootstrap 95% CI | Rubric scorer score distribution | NO-GO if confidence interval crosses zero |
| 6 | Output similarity | Rubric scorer output embeddings | CAUTION on unexplained semantic drift |
| 7 | Pareto cost / latency penalty | Rubric scorer + cost / latency telemetry | NO-GO on Pareto-dominated outcome (degraded cost or latency without compensating quality gain) |
| 8 | Sanity downgrade | Rubric scorer distribution shape | CAUTION on suspicious distribution (e.g., all-perfect or all-zero scores). Rubric scorer only — the world-model scorer’s discriminative quality is already expressed via stimulus_weight per ADR-014, so applying sanity downgrade there would double-count. |
| + | Convergence gate | convergence_delta (CompositeScore) | CAUTION on conformance-track convergence drift; workspace-configurable hard NO-GO escalation |
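Because the core gates are pure functions with no infrastructure imports, each one reduces to data in, result out. A hedged sketch of gate 1 (delta threshold); the GateResult shape is an assumption, not the platform's actual type:

```python
from dataclasses import dataclass
from typing import Literal

Outcome = Literal["GO", "CAUTION", "NO_GO"]

@dataclass(frozen=True)
class GateResult:  # illustrative result type
    gate: str
    outcome: Outcome
    reason: str

def delta_threshold_gate(blended_delta: float, workspace_threshold: float) -> GateResult:
    """Gate 1 sketch: a pure function of CompositeScore.blended_delta, no I/O."""
    if blended_delta < workspace_threshold:
        return GateResult("delta_threshold", "NO_GO",
                          f"blended_delta {blended_delta:.3f} below threshold {workspace_threshold:.3f}")
    return GateResult("delta_threshold", "GO", "improvement clears workspace threshold")
```

Purity is what lets the architecture validator enforce the no-infrastructure-imports rule structurally.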

Verdict also emits scan.convergence.delta per scan with explicit absence-marker semantics:

  • Conformance data available: event carries the convergence delta (real-world vs synthetic EvalSet performance).
  • Conformance data not available: event carries an explicit absence marker with reason.

Absence is a signal, not silence. WorldAgent aggregates absence at scale as a world-model-adoption signal.

VerdictResult and CompositeScore are defined in spectral.platform.domain.tournament per ADR-020 — platform-internal types, not spectral.core shared kernel.

The holdout generalization gap gate consumes the synthetic EvalSet holdout partition exclusively. Conformance samples are scarce and reserved as convergence anchors — they are NOT consumed for the holdout generalization gap gate. The EvalSet carries an explicit two-layer holdout structure (working set + holdout); the verdict engine reads only the synthetic track.

The delta threshold gate’s input field blended_delta is the stimulus-weight-derived composite delta. The blend ratio that combines the world-model authority and rubric authority contributions into blended_delta is computed at scan time from aggregate stimulus_weight, not configured per workspace. Workspace configuration does not accept blend_ratio as a tunable.
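One way to picture a scan-time blend ratio derived from aggregate stimulus_weight is the sketch below. The averaging-and-capping scheme is an assumption standing in for the real derivation; the page specifies only that the ratio comes from aggregate stimulus_weight, that stimulus_weight is bounded, and that it is not a workspace tunable.

```python
def blended_delta(world_model_delta: float, rubric_delta: float,
                  stimulus_weights: list[float], weight_cap: float = 1.0) -> float:
    """Sketch: blend the two authorities' deltas by a weight-derived ratio.

    weight_cap stands in for 'stimulus_weight is bounded'; the actual bound
    and aggregation are not documented here.
    """
    capped = [min(w, weight_cap) for w in stimulus_weights]
    ratio = sum(capped) / len(capped)  # world-model authority's share, in [0, weight_cap]
    return ratio * world_model_delta + (1 - ratio) * rubric_delta
```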

The go_nogo field on VerdictResult is one of four values. The fourth value reuses the observe_only label from the autonomy-mode taxonomy because both refer to the same workspace state — a verdict outcome of observe_only is the natural shape produced when the workspace itself is in autonomy mode observe_only.

| go_nogo | Meaning |
| --- | --- |
| go | Promotion recommended — all gates pass |
| caution | Mixed signals — requires human review. Autonomy modes never auto-accept caution regardless of configuration. |
| nogo | Do not promote — gate(s) fired |
| observe_only | The workspace is in observe_only autonomy mode — no changeset created (see Autonomy governance) |

Evaluation runs two scoring authorities in parallel on every ScanTrace, then combines their outputs into a single composite. Each authority has a different epistemic basis:

| Authority | Scoring basis | Owned by |
| --- | --- | --- |
| World-model scorer | Ground truth co-generated with the EvalSet stimulus at generation time (see ADR-014). Answers: “Did the agent produce the response the rule says it should?” | spectral.worlds produces ground truth; spectral.platform consumes it. |
| Rubric scorer | LLM-as-judge against the workspace’s Evaluation Framework rubric. Answers: “How does this output score on the rubric’s dimensions?” Produces natural-language explanations that the diagnose phase’s clusterer reasons over. | spectral.platform. |

The world-model scorer’s inputs — ground truth, world-model-rule-derived scoring dimensions, and stimulus_weight — are packaged into each EvalSet sample by spectral.worlds and consumed via the callee-owned EvalSetProvider Tier 2 Protocol at spectral.worlds.contracts.protocols.eval_set_provider per ADR-065 D3 + ADR-070 Tier 2 (LLM-tool wrapping inside apps/test-agents is the qualifying multi-framework-consumer condition; Observe / tournament / Evaluate are intra-spectral.platform phases that consume the same producer-typed payload through the platform-side caller). Rule internals never cross the context boundary; the scorer reasons over the producer-typed payload’s published shape, not over rule structure. The architecture validator at STRICT=True enforces no spectral.worlds imports into spectral.platform as the structural backstop; the EvalSetProvider Protocol surface is the data-flow assertion that complements it.

A single authority lets an agent Goodhart the metric. Two authorities that draw from different signal bases keep the evaluation surface grounded:

  • The world-model scorer anchors to the authority_version under which the EvalSet was generated. Its verdict is binary-ish: did the response match ground truth or not?
  • The rubric scorer is customer-steerable. It encodes what the customer cares about — dimension weights, scoring guidance, hard-constraint floors.
  • Neither authority can crowd the other out, because stimulus_weight is bounded and the composite blends both.

The two-authority defense rests on the world-model authority being credible — if the world model is wrong, two-authority evaluation just blends a flawed authority with a customer-steerable rubric. That credibility is built upstream, not inside the evaluation step: the four-tier provenance system grounds rules in authoritative sources, the conformity gate provides mechanical validation independent of any operator’s judgment, and methodology disclosure on the System Card makes every rule’s provenance auditable post-hoc. Two-authority evaluation is the Goodhart defense at evaluation time; the methodology stack the system card discloses is the credibility defense at authority time.

Defined under spectral.platform.domain.tournament.* per ADR-020 + ADR-065 D1 (domain types do not live in the kernel — kernel admission discipline rules them out). Every phase that produces a score (tournament pre-screen, verdict validation, system card reporting) emits and consumes the same CompositeScore shape:

| Field | Description |
| --- | --- |
| world_model_score | Aggregated world-model authority score for the scan |
| rubric_score | Aggregated rubric authority score for the scan |
| blended_delta | Champion → patch improvement on the stimulus-weight-adjusted composite |
| convergence_delta | Conformance (real-world) vs synthetic performance delta. Null when conformance samples absent. |
| per_track_breakdown | {synthetic: {world_model, rubric, n_samples}, conformance: {...}} |
| attribution | World model version, authority_version, rule references (reference into worlds per ADR-030 + ADR-065 D2 producer-typed payload) |

Consistency across phases is enforced by type, not by shared compute. Every phase reads and produces the same CompositeScore; the verdict engine uses the same blending logic as the tournament pre-screen.
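Read as a type, the schema above might look like the following sketch. Field names come from the table; the Python types and dict shapes are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CompositeScore:
    """Illustrative sketch of the CompositeScore shape every phase emits and consumes."""
    world_model_score: float
    rubric_score: float
    blended_delta: float
    convergence_delta: Optional[float]  # None when conformance samples are absent
    per_track_breakdown: dict           # {"synthetic": {...}, "conformance": {...}}
    attribution: dict                   # world model version, authority_version, rule refs
```

Enforcing consistency by type means every phase shares the shape, not a shared compute path.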

The rubric scorer consumes the workspace’s Evaluation Framework rubric. rubric_gen is the zero-setup cold-start path that produces a viable rubric for new workspaces from either observed agent traces or a stated objective — workspaces never start without a scorable rubric.

rubric_gen is workspace-onboarding capability, not per-scan capability — it runs once at workspace setup (or when an operator triggers a rubric refresh) and writes its output to the workspace’s EvaluationFramework. The rubric scorer reads the framework on each scan; ongoing rubric refinement happens through the Rubric audit feedback loop, not through re-running rubric_gen.

LLM tier: reasoning (per LLM routing below) — discovery and synthesis cost is acceptable at workspace setup, unlike per-scan rubric scoring which uses the scoring tier.

Per-scan rubric scorer outputs are compared against world-model scorer outputs to compute a rubric divergence delta — a measurement of how much the customer rubric diverges from the world-model authority on each scan. The delta is persisted as a RubricDivergenceRecord domain record (workspace-scoped, RLS per ADR-033, retention per ADR-042) — not agent memory per ADR-058 D14 non-mirror list, the same category as RegressionRecord and InterventionLog. Schema documented in domain-model — RubricDivergenceRecord.

Each scan also emits a rubric.divergence typed event to spectral.worlds (regardless of conformance-sample availability; payload module planned at spectral.platform.contracts.events.rubric_divergence per ADR-065 D2). The World Agent aggregates divergence across workspaces as a world-model-evolution signal; single-workspace divergence remains a scan observation and does not initiate rule revision — only cross-workspace aggregation (handled in spectral.worlds) is a rule-evolution signal.


Events with producer-owned typed payload modules in <context>.contracts.events.* per ADR-065 D2:

| Event | Emitted by | Carries |
| --- | --- | --- |
| platform.failure_cluster.detected (spectral.platform.contracts.events.failure_cluster_detected) | Diagnose (every cluster crossing detection threshold; World Agent applies promotion-threshold filter consumer-side) | Cluster ID, severity, failure count, first/last observed, evidence bundle, sanitized summary, suggested rule stub |
| rubric.divergence | Evaluate (always, per scan, regardless of conformance-sample availability) | Workspace ID, scan ID, evaluation framework ID, divergence delta, observed_at |
| verdict.issued | Verdict (always, per scan) | Workspace ID, scan ID, verdict, composite score, evaluation_authority_ref, issued_at |
| scan.convergence.delta | Verdict (always, per scan) | Convergence delta with presence-or-absence marker |
| scan.completed | Verdict (always, per scan) | Summary + outcome |
| approval.required | on_scan_completed handler when verdict triggers it (always when autonomy mode is manual; kill-switch and bounded-auto fall-through cases land in the second alpha autonomy wave) | Changeset ID + reason |

Autonomy modes govern how verdict output reaches changesets. The 0.3.0 alpha ships these in two waves:

  • First wave lands the alpha-bound subset: observe_only + manual (default), enforced in the on_scan_completed handler. No gate evaluation, no kill switch, no fall-through arbitration.
  • Second wave extends the handler to cover recommend, bounded_auto, the four-gate framework, and the kill switch — completing the alpha autonomy surface (second alpha wave).
  • auto_test and guarded_auto are post-launch, deferred outside the 0.3.0 alpha milestone.

observe_only and manual (default) ship in the first alpha wave; recommend and bounded_auto ship in the second.

| Mode | Changeset created | Application path | Notes |
| --- | --- | --- | --- |
| observe_only | No | None — enforced in the on_scan_completed handler before changeset creation; no changeset record exists | See Observe-only data treatment below — measurement is unaffected. |
| manual | Yes | Always approval.required | Default mode at workspace bootstrap; the handler always creates a changeset and emits approval.required. Operator-driven explicit control. |
| recommend | Yes | Always approval.required | Mechanically identical to manual; semantic intent is “Spectral recommends, human curates.” |
| bounded_auto | Yes | Auto-accept within gates; approval.required otherwise | Auto-accepts when composite score clears workspace-configured thresholds. |

caution verdicts are never auto-accepted regardless of mode or gate configuration. This is a hard rule, not a threshold.

Workspace-configurable thresholds evaluated against the CompositeScore snapshot attached to the changeset:

  • min_blended_delta — minimum score improvement required
  • min_world_model_score — floor on the world-model authority score
  • max_rules_affected — cap on blast radius per changeset
  • require_validated_changeset — only changesets that have passed the validated terminal state are eligible

All gates must pass for auto-acceptance. Any single failure routes to approval.required.
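The bounded_auto arbitration reduces to an all-gates-must-pass check, with the caution hard rule applied before any threshold. A sketch; the score and threshold dictionary shapes are assumptions:

```python
def auto_accept(verdict: str, score: dict, thresholds: dict) -> bool:
    """bounded_auto sketch. caution is a hard rule, never a threshold."""
    if verdict != "go":
        return False  # caution / nogo / observe_only never auto-accept
    gates = [
        score["blended_delta"] >= thresholds["min_blended_delta"],
        score["world_model_score"] >= thresholds["min_world_model_score"],
        score["rules_affected"] <= thresholds["max_rules_affected"],
        (not thresholds["require_validated_changeset"]) or score["changeset_validated"],
    ]
    return all(gates)  # any single failure routes to approval.required
```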

observe_only mode suppresses actuation, not measurement. Everything the scan pipeline observes, computes, and emits still happens; the only path that is skipped is ChangeSet creation.

| Data / signal | Behavior in observe_only |
| --- | --- |
| VerdictResult | Stored — same Scan row, same verdict table, same schema as any other mode. |
| CompositeScore | Stored — attached to the scan record. |
| Dashboard surfacing | Visible — verdicts and scores render on the customer dashboard identically. The UI labels them “observe-only” so there is no confusion about whether a ChangeSet exists. |
| Spectral Agent proactive conversation | Opened — ScanCompletedEvent still fires; the agent summarizes verdicts normally. The agent will not propose applying a ChangeSet because no ChangeSet exists; it can still discuss the findings. |
| Supervisor mode classification (ACTIVE / PLATEAU / FRONTIER / NO_DATA) | Fed — supervisor state is the right place to reason about “is the system improving?” and the answer must not depend on whether actuation is enabled. |
| rubric.divergence event | Emitted — the event carries measurement, not actuation. WorldAgent consumes it regardless of mode. |
| scan.convergence.delta event | Emitted — both presence and absence cases carry meaning per the Verdict phase spec. |
| platform.failure_cluster.detected event | Emitted — clustering is a measurement phase output. |
| T1 (interaction-tier) observation persistence | Persisted — the on_scan_completed handler invokes the spectral_agent_memory gateway directly per scan. T1 writes are independent of changeset creation. |

Why actuation vs measurement is the right cut


The purpose of observe_only is to let a customer watch how the system would behave before granting any write authority. Muting measurement would turn the mode into a no-op — the customer would not learn anything from it. The whole point is that, after N weeks in observe_only, the customer has seen verdicts, trends, and agent reasoning that inform their decision to move to recommend or further.

The one thing that does not happen in observe_only is ChangeSet creation. That is a record-of-proposal and requires workspace-level intent to actuate. The on_scan_completed handler checks the mode before creating the ChangeSet, and returns without creating one. Downstream event emission and T1 (interaction-tier) memory persistence happen regardless of that check. The handler itself runs in apps/workers per ADR-060.
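The actuation-vs-measurement cut can be sketched as a first-wave on_scan_completed handler: measurement paths always run, and only ChangeSet creation is mode-gated. The callable parameters are illustrative stand-ins for the real gateways, not the handler's actual signature:

```python
def on_scan_completed(mode, verdict, create_changeset, emit, persist_t1):
    """First-wave sketch (observe_only + manual; recommend shown for shape)."""
    emit("verdict.issued", verdict)  # measurement path: always runs
    persist_t1(verdict)              # T1 write: independent of changeset creation
    if mode == "observe_only":
        return None                  # the one skipped path: no ChangeSet record
    changeset = create_changeset(verdict)
    if mode in ("manual", "recommend"):
        emit("approval.required", changeset)  # always for these modes
    return changeset
```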

A workspace-level kill switch forces approval.required on every changeset regardless of the configured autonomy mode. It does not suppress changeset creation — scans run normally, changesets accumulate, and every one requires human approval. Effective behavior is identical to recommend mode while active.

The kill switch:

  • Is persisted and survives service restarts
  • Is audit-logged on activation and deactivation
  • Aligns with the existing approval.required event path; no mode bypass is introduced
Two further modes are deferred to post-launch:

  • auto_test — auto-accepts non-breaking changesets, defers breaking to approval. Requires a trust-baseline mechanism.
  • guarded_auto — terminal rung of the autonomy ladder. Auto-accepts within policy guardrails with anomaly-driven rollback. Hard-depends on auto_test.

Behavioral specifications for both modes carry through to this page when the modes return.


These are two different axes. Both use tiered framing, which trips readers up. Keep them separate:

|  | Integration tier | Autonomy mode |
| --- | --- | --- |
| What it controls | Customer-facing trust progression (“how deeply does Spectral sidecar into the workflow”) | Workspace execution policy (“what happens to accepted changesets”) |
| Where it lives | Product vocabulary, customer onboarding, commercial positioning | Workspace configuration, enforced in on_scan_completed |
| Values | Stage 1 (observe + recommend), Stage 2 (observe + manage), Stage 3 (observe + manage + automate) | observe_only, manual (alpha first wave) + recommend, bounded_auto, kill switch (alpha second wave) |
| Who changes it | Commercial relationship / expansion decision | Workspace admin setting |

Neither axis subsumes the other. A Stage 2 customer can run manual (tight operator control) or bounded_auto (automate with gates) without touching the tier.

Not a hard rule — just what typically happens. The first wave covers observe_only + manual; recommend and bounded_auto ship in the second.

| Integration tier | Typical autonomy mode |
| --- | --- |
| Stage 1 (observe + recommend) | observe_only or recommend — customer is still building trust in the optimization signal |
| Stage 2 (observe + manage) | recommend or manual — customer curates actively but Spectral owns scanning |
| Stage 3 (observe + manage + automate) | bounded_auto — customer has enough signal history to let gates fire |

For the customer-facing integration tiers see How Spectral Works.


The scan pipeline doesn’t just optimize customer agent systems — it feeds its own improvement. The meta-improvement engine tracks what mutation strategies work, identifies rubric quality issues, and guides the pipeline’s approach over time. See Memory System for the universal interaction / session / persistent lifecycle (parameterized as cycle / run / workspace for the Spectral Agent) that compounds strategy performance, and World Model System / Evolution Loop for how observed cluster patterns feed rule evolution.

Tracks effectiveness of optimization strategies across runs:

  • ELO ratings — strategies compete head-to-head in tournaments; ratings update on win/loss.
  • Usage counts and win rates — how often each strategy is used and improves composites.
  • Average improvement — expected blended_delta when a strategy is applied.

Optimize consults the registry when selecting mutation approaches. Higher-rated strategies are preferred for similar failure patterns.
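Head-to-head strategy ratings follow the standard ELO update. A sketch; the k-factor is an assumption, not a documented value:

```python
def elo_update(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Standard ELO: rating moves by k times the surprise of the result."""
    expected_w = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    delta = k * (1.0 - expected_w)  # small when the favorite wins, large on an upset
    return winner + delta, loser - delta
```

An upset (a low-rated strategy beating a high-rated one) moves ratings more than an expected result, which is what lets the registry converge on genuinely effective strategies.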

Every optimization intervention records outcome: per-cluster pre/post scores, predicted vs actual improvement (calibrates the engine’s confidence), and reusability tags that feed observations into the Spectral Agent’s persistent-tier (workspace-scope) memory. InterventionLog is a canonical record of optimization activity, not agent memory itself — see domain-model — InterventionLog.

RegressionRecord is a spectral.platform domain record that captures interventions which caused measurable regression on one or more failure clusters. Not agent memory — it is a canonical record of optimization activity per the records-vs-memory framing (ADR-058 D14 captures the workshop principle that distinguishes records from memory). Stored in workspace-scoped platform.regression_records with RLS enforcement per ADR-033 and retention governed by ADR-042 D4. Schema documented in domain-model — RegressionRecord.

A RegressionRecord is a dedicated entity, not a flag on InterventionLog. The two have different responsibilities and access patterns:

  • InterventionLog — record-of-action (every intervention, regardless of outcome). The optimizer queries it chronologically by workspace.
  • RegressionRecord — record-of-regression (interventions that caused measurable harm). The verdict + tournament engine queries it by mutation-pattern and by cluster; the World Agent queries it (after sanitization) for rule-coverage signal.

A RegressionRecord references the originating InterventionLog entry; interventions that did not regress carry no RegressionRecord.

Write path. When a verdict gate fires NO-GO with cluster-level regression detail (gates 2 / 3 — agent regression and dimension regression), the verdict engine writes a RegressionRecord capturing the mutation pattern, regressed clusters, improved clusters (regressions are rarely pure), and severity.

Read path. Tournament reads recent RegressionRecord entries during candidate selection and penalizes replay of matching mutation patterns via adaptive composite weighting (see Optimize / Regression-avoidance signal above).

Sanitized promotion to World Signal. A RegressionRecord whose mutation pattern is workspace-agnostic (no PII, no workspace-specific configuration detail, just the domain-relevant pattern + cluster class) clears the sanitization gate and routes via the memory-to-Worlds signal events (memory.observation.promoted / t3_memory.written) to spectral.worlds per ADR-018. The World Agent may surface the signal as a candidate for rule revision — the underlying rule may be underspecified in a way that lets regressing mutations pass local evaluation.

What stays out of the record: no customer output text (mutation pattern is opaque, cluster references do not unwrap to raw content); no cross-workspace pattern matching (that happens in the World Agent after sanitization, not inside the workspace-scoped record); no automatic retirement of strategies (a pattern with many regressions in one workspace is workspace context; strategy retirement is an operator / World Agent concern at the aggregate level).

Calibrate and Evaluate emit signals about rubric quality — high-variance dimensions, ambiguous guidance, score distribution shifts. The rubric audit surface summarizes these for the Spectral Agent, which proposes rewrites the operator can accept or decline.


Tier-based model routing balances capability and cost:

| Task tier | Used for | Selection criteria |
| --- | --- | --- |
| Scoring | Evaluation (both authorities), tournament scoring, holdout validation | Cost-optimized, high throughput |
| Detection | Parse checks, anti-deception, safety screening | Lowest cost, fast |
| Reasoning | Diagnosis, optimization, prompt rewrites, mutation generation | Highest capability |
| Customer | Agent execution during Observe | Customer’s own model (passed through) |

Fallback hardening ensures model failures don’t halt the pipeline. Consecutive failures trigger automatic tier disabling for the remainder of the scan.
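The consecutive-failure rule can be sketched as a small router-side counter. The threshold of three and the reset-on-success behaviour are assumptions; the source specifies only that consecutive failures disable a tier for the rest of the scan.

```python
class TierRouter:
    """Minimal sketch of fallback hardening. Threshold and reset
    behaviour are assumptions, not the real implementation."""

    def __init__(self, max_consecutive_failures: int = 3):
        self.max_failures = max_consecutive_failures
        self.failures: dict[str, int] = {}
        self.disabled: set[str] = set()

    def record_result(self, tier: str, ok: bool) -> None:
        if ok:
            self.failures[tier] = 0          # assumed: success resets the streak
            return
        self.failures[tier] = self.failures.get(tier, 0) + 1
        if self.failures[tier] >= self.max_failures:
            self.disabled.add(tier)          # disabled for the remainder of the scan

    def is_available(self, tier: str) -> bool:
        return tier not in self.disabled
```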


For customers on continuous or periodic cadences:

  • Scheduled scans run at configured intervals.
  • Supervisor integration consults a planning function before each scan to determine priority and budget allocation.
  • Frontier detection recognizes when optimization has plateaued and switches to monitoring-only.
  • Economic reasoning uses cost-per-failure and revenue-per-success to prioritize optimization where it has the highest business impact.
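The economic-reasoning bullet can be made concrete with a toy impact score: expected cost avoided plus revenue unlocked per interaction if the agent's failures were fixed. The formula and parameter names are assumptions about the shape of the planning input, not the actual planning function.

```python
def optimization_priority(failure_rate: float,
                          cost_per_failure: float,
                          revenue_per_success: float) -> float:
    """Hypothetical impact score per interaction: dollars currently lost to
    failures plus dollars a fix would unlock. Higher means optimize first."""
    return failure_rate * (cost_per_failure + revenue_per_success)
```

Under this sketch, an agent failing 20% of the time with $5 cost-per-failure and $10 revenue-per-success scores 3.0, outranking an agent failing 30% of the time on a low-stakes task.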

Bayesian category selection — intervention_memory_adjustment

The supervisor’s category-selection step reads from agent memory, not from raw scan history. The supervisor consults Tier-3 (workspace-scope) observations for an intervention_memory_adjustment that nudges category priors based on past intervention outcomes — categories whose past interventions correlated with regression get downweighted; categories whose past interventions moved the needle get upweighted. The adjustment is a posterior nudge on a Bayesian prior, not a hard override; recent observations dominate older ones via the standard Tier-3 decay schedule. The integration is read-only at the supervisor seam — the supervisor never writes to memory; the adjustment value flows through as part of the planning function’s input.
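The posterior nudge with recency weighting can be sketched as a decay-weighted sum over past outcomes. The half-life, the clamping to [0, 1], and the outcome encoding are illustrative assumptions; the real Tier-3 decay schedule is not reproduced here.

```python
def adjusted_prior(base_prior: float,
                   outcomes: list[tuple[float, float]],
                   half_life_days: float = 30.0) -> float:
    """Sketch of intervention_memory_adjustment. Each outcome is
    (age_days, delta): delta > 0 means a past intervention in this
    category helped, delta < 0 means it regressed. The nudge decays
    with age and never hard-overrides the prior, only shifts it."""
    nudge = sum(delta * 0.5 ** (age / half_life_days) for age, delta in outcomes)
    return min(1.0, max(0.0, base_prior + nudge))
```

A fresh positive outcome moves the prior at full weight; a 30-day-old outcome moves it at half weight, so recent observations dominate, as the decay schedule intends.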

The supervisor produces SupervisorRecommendation records — guidance about what to scan next, what budget to allocate, what to prioritize. These are consumed by the Spectral Agent (and, at lower priority, by operational dashboards).

Delivery is event-driven via a supervisor.recommendation.issued event; its producer-typed payload is planned at spectral.platform.contracts.events.supervisor_recommendation_issued per ADR-065 D2 (lands with the supervisor epic).

Event shape:

| Field | Type | Notes |
| --- | --- | --- |
| event_id | UUID | |
| workspace_id | UUID | |
| recommendation_id | UUID | SupervisorRecommendation primary key |
| mode_classification | ACTIVE \| PLATEAU \| FRONTIER \| NO_DATA | |
| priority | enum | Ordered set of priority tags the supervisor has reasoned about |
| budget_hint | optional decimal | Spend cap guidance for the next scan, when supervisor has an opinion |
| narrative | string | Short natural-language rationale |
| issued_at | timestamp | |
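The event shape translates to a payload type along these lines. This is an illustration of the fields listed above, not the real producer-typed contract (which is planned at spectral.platform.contracts.events.supervisor_recommendation_issued).

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID


class ModeClassification(str, Enum):
    ACTIVE = "ACTIVE"
    PLATEAU = "PLATEAU"
    FRONTIER = "FRONTIER"
    NO_DATA = "NO_DATA"


@dataclass
class SupervisorRecommendationIssued:
    """Illustrative payload for supervisor.recommendation.issued."""
    event_id: UUID
    workspace_id: UUID
    recommendation_id: UUID          # SupervisorRecommendation primary key
    mode_classification: ModeClassification
    priority: list[str]              # ordered priority tags
    narrative: str                   # short natural-language rationale
    issued_at: datetime
    budget_hint: Optional[float] = None  # spend-cap guidance, when present
```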

Why event-driven:

  • Consistent with the event-substrate doctrine. Heavy/async work dispatches via events; direct query coupling across the supervisor-to-agent boundary would break the pattern.
  • Multiple consumers supported. The Spectral Agent is the primary consumer, but operational dashboards and the Operations Agent’s observability tooling may also subscribe without the supervisor needing to know about them.
  • Decoupled timing. The supervisor emits when it has reasoned; the agent consumes when it is triggered, which may be a different moment (e.g., when a customer sends a chat message).
  • Survives restart. Unlike a direct-call-state-query model, the event record is durable — an agent restart does not lose the recommendation.

Why not attached to scan.completed:

The supervisor can issue recommendations that are not tied to a specific scan (periodic plateau detection, budget reallocation after a billing event, etc.). Forcing them onto scan.completed would lose those cases.

Why not polling:

Polling is the wrong direction. The supervisor is the producer; pushing its recommendations onto an event bus keeps authority where the reasoning is.

Consumer hookup: OnSupervisorRecommendationHandler in spectral.platform’s agent application layer creates a proactive conversation (or updates an existing one) with the recommendation narrative. See Agent Architecture — Event-driven proactive conversations.
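A minimal sketch of that handler's behaviour, assuming a dict-shaped payload and an in-memory conversation store; the real handler lives in spectral.platform's agent application layer and uses durable storage, so the event survives restarts.

```python
class ProactiveConversations:
    """Toy stand-in for the proactive-conversation surface; storage and
    keying are assumptions, not the real agent application layer."""

    def __init__(self):
        self.by_workspace: dict[str, list[str]] = {}

    def handle(self, event: dict) -> None:
        """Consume a supervisor.recommendation.issued payload: create the
        workspace's proactive conversation if absent, else append to it."""
        thread = self.by_workspace.setdefault(event["workspace_id"], [])
        thread.append(event["narrative"])
```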