Optimization Engine
How Spectral measures, optimizes, and recommends changes to customer agent systems. This page covers the scan pipeline (seven phases, two tracks, two evaluation authorities), the CompositeScore schema that anchors consistency across phases, the verdict engine, and the governance surface — autonomy modes and integration tier — that controls how scan output reaches the customer.
The page serves three readers. Engineers building or extending the pipeline get the seven-phase sequence, composite-score schema, verdict gate set, and autonomy-mode handling. Strategic readers need the defense the pipeline encodes — two evaluation authorities (world-model + customer-rubric) blended into a composite score so optimization can’t trivially game one signal at the expense of the other (the failure mode Goodhart’s Law names). Reviewers auditing methodology need the holdout strategy, the statistical-uniqueness anchor, and the verdict-gate set.
The strategic claim is the two-authority defense. Optimization rewards what the rubric measures; if the customer authors the rubric, optimization rewards what the customer thought to test for. Spectral runs two scoring authorities in parallel — a world-model authority anchored to a domain standard the customer didn’t author, plus a customer-steerable rubric authority — and blends them into a CompositeScore. Neither authority can crowd the other out, so an agent that “improves” by chasing one signal trips the other. That balance is the page’s strategic center; the seven phases below are the mechanism that delivers it. See Two-authority evaluation for the full treatment.
Two-track architecture
Every scan runs both tracks when data is available, or the synthetic track alone as a valid
fallback. Once preflight admits a scan, there is no blocked state — both readiness modes (Full and
synthetic_only) produce a verdict; the only non-start path is a Worlds-unavailable error
(see Scan preflight below).
| Track | Source | Role |
|---|---|---|
| Synthetic EvalSet | spectral.worlds generates a statistically unique EvalSet per scan | Optimization signal. Agent runs against EvalSet stimuli; this is what drives candidate selection. |
| Real-world conformance | Curated OtelTrace samples with human-validated ground truth, supplied by CurationService | Convergence anchor. Agent’s real-world performance is measured against validated ground truth. Runs when sufficient validated samples exist. |
Scan Readiness is reported as a preflight observation on the scan record: Full (both tracks)
or synthetic_only. It is not a blocking gate — a scan can always run the synthetic track alone
and still produce a verdict.
Scan preflight
Preflight runs in the scan orchestrator — the application-layer component that owns the
scan lifecycle — immediately before the Observe phase begins. It writes a
ScanReadinessObservation record to the Scan row, then unconditionally proceeds to Observe:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass(frozen=True)
class ScanReadinessObservation:
    mode: Literal["Full", "synthetic_only"]
    evalset_available: bool
    curation_samples_count: int
    missing_reasons: list[str]  # empty when mode == "Full"
    observed_at: datetime
```

mode = Full when Worlds can produce an EvalSet and at least the curation-minimum sample count is available; mode = synthetic_only when an EvalSet is producible but conformance samples are below the minimum. When Worlds cannot produce an EvalSet at all, preflight raises an orchestrator-level error and the scan surfaces a scheduled retry — that is a scan-start failure, not a preflight observation.
Preflight observes and emits; it never blocks the scan from running when any valid mode is
possible. The curation service emits its own readiness signals; preflight consumes the latest
or queries synchronously to decide curation_samples_count. Curation readiness is the
source-of-truth for conformance-sample availability.
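Sketched in code, the readiness decision reduces to a small pure function. This is a hypothetical sketch: the curation-minimum constant, function name, and error type are assumptions, not the orchestrator's actual API.

```python
CURATION_MIN = 20  # hypothetical curation-minimum sample count

class WorldsUnavailableError(RuntimeError):
    """Scan-start failure: Worlds cannot produce an EvalSet at all."""

def readiness_mode(evalset_available: bool, curation_samples_count: int) -> tuple[str, list[str]]:
    """Return (mode, missing_reasons) for the ScanReadinessObservation."""
    if not evalset_available:
        # Not an observation: orchestrator-level error, scan surfaces a scheduled retry.
        raise WorldsUnavailableError("EvalSet not producible")
    if curation_samples_count >= CURATION_MIN:
        return "Full", []
    return "synthetic_only", [
        f"conformance samples below minimum ({curation_samples_count} < {CURATION_MIN})"
    ]
```

Note there is no branch that returns a blocked state: once an EvalSet is producible, some valid mode exists.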
Vocabulary
- Scan. An evaluation run against synthetic traces generated by the world model in conformance with world model rules. A scan is evaluation against the EvalSet the world model produces, not against world model rules directly.
- EvalSet. Produced by spectral.worlds. Statistically unique per request (per ADR-028). Each sample carries stimulus text, ground truth co-generated from the originating rule, and a stimulus_weight derived from rule confidence.
- OtelTrace. Permanent customer production record. Never modified after ingestion.
- ScanTrace. Ephemeral scan-execution record — the agent’s response to an EvalSet stimulus. Gains a provenance field recording stimulus source.
Phase sequence
The pipeline runs seven phases in order after preflight completes. Each phase completes fully before the next begins; phase context is serialized after each phase for fault tolerance and resume.
preflight (orchestrator pre-check) → Observe → Calibrate → Diagnose → Evaluate → Optimize → Safety → Verdict
Observe
- Consumes the ScanReadinessObservation written by the preflight step.
- Requests a statistically unique EvalSet from spectral.worlds synchronously at scan start, submitting the workspace’s EvalSetParameterization as the request body. If Worlds cannot produce an EvalSet, the scan errors and retries on the next schedule tick (this path is already surfaced by preflight’s error mode).
- Receives curated conformance samples from CurationService. The readiness state (Full or synthetic_only) was already written by preflight; Observe does not recompute it.
- Runs the customer agent against synthetic EvalSet stimuli and (where available) conformance samples. Produces ScanTrace records with provenance fields identifying stimulus source.
- Partition logic (working vs holdout) is rebuilt against the EvalSet structure.
Calibrate
Adjusts scoring thresholds based on the observed score distribution. No spectral.worlds
interaction.
Diagnose
Clusters failures into FailureCluster records (spectral.platform.domain.clustering).
Quarantines infrastructure failures and parse failures before clustering so only quality
EvalResults feed the LLM clusterer.
Two-authority opacity. The clustering prompt receives only rubric scorer explanations and
scores — world-model authority outputs do not cross the clustering prompt boundary. Opacity is
enforced at the input shape, not by post-hoc filtering: the clusterer’s EvalResult projection
includes scoring_authority = rubric rows and excludes the world-model authority’s view
(per ADR-014).
Cluster lifecycle. Each cluster carries an actioning_status enum (identified, addressed, persistent, resolved) with validated transitions enforced by the repository:
identified -> addressed -> resolved covers the standard remediation path;
identified -> persistent -> resolved covers re-emergence of a previously-addressed cluster.
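The lifecycle above can be sketched as a transition table; the enum values come from this page, while the function shape is a hypothetical stand-in for the repository's validation.

```python
# Validated actioning_status transitions (hypothetical enforcement sketch).
VALID_TRANSITIONS: dict[str, set[str]] = {
    "identified": {"addressed", "persistent"},  # standard path or re-emergence path
    "addressed": {"resolved"},
    "persistent": {"resolved"},
    "resolved": set(),  # terminal
}

def transition(current: str, target: str) -> str:
    """Raise on any transition the repository would reject."""
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid actioning_status transition: {current} -> {target}")
    return target
```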
Detection event. When a cluster crosses the detection threshold, Diagnose emits
platform.failure_cluster.detected with the producer-typed payload at
spectral.platform.contracts.events.failure_cluster_detected. The event carries cluster_id,
severity, failure_count, first_observed_at / last_observed_at, evidence_bundle, a
sanitized summary, and a suggested_rule_stub. It does not carry raw customer output
text — sanitization is verified by a content-contract test.
Consumer paths off the event:
- Operations Agent (intra-platform) — upserts platform.rule_candidates_pending on every detection so operators see the cluster surface immediately.
- World Agent (in spectral.worlds) — applies a consumer-side promotion-threshold filter (frequency_pct >= 10, effect_size >= 15, actionable = true, computed over the event stream) and seeds rule-candidate exploration only when the higher bar is met. The threshold logic is consumer-resident so the wire shape stays single-event.
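The consumer-side filter is simple enough to sketch directly. The threshold values come from this page; the aggregate-dict shape is an assumption.

```python
def meets_promotion_threshold(aggregate: dict) -> bool:
    # Consumer-resident filter in the World Agent: the wire shape stays
    # single-event; frequency_pct / effect_size are computed over the stream.
    return (
        aggregate.get("frequency_pct", 0) >= 10
        and aggregate.get("effect_size", 0) >= 15
        and aggregate.get("actionable", False) is True
    )
```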
Evaluate
Runs two scorers in parallel on all traces from both tracks. See Two-authority evaluation below.
Attribution fields. Each EvalResult carries scoring_authority, track, and
stimulus_source. stimulus_weight on each EvalSet sample (set by worlds, derived from
generating rule confidence) is applied at the EvalResult level when computing the world-model
authority’s contribution to the composite. Spectral treats stimulus_weight as a scalar
attribution input — rule internals never cross the context boundary.
EvalSet sourcing. The scorer consumes EvalSets via the callee-owned EvalSetProvider
Protocol at spectral.worlds.contracts.protocols.eval_set_provider
(per ADR-070 Tier 2 —
multi-consumer eval criteria). No rule structure is reachable from the platform side at scoring
time.
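A minimal sketch of the callee-owned Protocol pattern, assuming hypothetical field names on the sample type; the real surface lives at spectral.worlds.contracts.protocols.eval_set_provider.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence, runtime_checkable

@dataclass(frozen=True)
class EvalSetSample:
    stimulus_text: str
    ground_truth: str       # co-generated with the stimulus at generation time
    stimulus_weight: float  # scalar attribution input; rule internals stay opaque

@runtime_checkable
class EvalSetProvider(Protocol):
    # Callee-owned surface: spectral.worlds defines the Protocol,
    # spectral.platform depends on the shape, never on worlds internals.
    def get_eval_set(self, parameterization: dict) -> Sequence[EvalSetSample]: ...

class StubWorlds:
    """Hypothetical stand-in for the worlds-side implementation."""
    def get_eval_set(self, parameterization: dict) -> Sequence[EvalSetSample]:
        return [EvalSetSample("stimulus text", "expected response", 0.8)]
```

The Protocol carries only the published payload shape, which is exactly why no rule structure is reachable from the platform side.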
Optimize
Generates candidate mutations via the strategy registry and runs a tournament to select the winning candidate.
Candidate types. Tournament evaluates four candidate types, each generated by a distinct mutation strategy:
| Type | Mutation profile |
|---|---|
| surgical | Targeted edits to specific failing clusters (smallest blast radius) |
| conservative-rewrite | Bounded prompt rewrite preserving structural intent |
| general | Broader structural mutations (largest blast radius among non-history-informed) |
| history-informed | Mutations seeded by prior workspace RegressionRecord patterns to avoid known regressions |
Two-pass evaluation. Tournament runs two passes for cost discipline:
- Pre-screen pass — runs candidates against ≤ 5 samples using the rubric scorer only. Cheaply culls obviously-failing candidates before invoking the more expensive world-model scorer. The cap is intentional: pre-screen is a culling gate, not a measurement.
- Full-evaluation pass — survivors run against the full working set with both authorities (world-model + rubric); stimulus_weight is applied at the EvalResult level when computing each candidate’s CompositeScore (see Two-authority evaluation).
Tournament scoring runs concurrently with bounded concurrency (semaphore-limited).
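Semaphore-limited concurrent scoring can be sketched with asyncio; candidate and scorer types here are placeholders, not the tournament's actual interfaces.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

async def score_candidates(
    candidates: Sequence[str],
    score_fn: Callable[[str], Awaitable[float]],
    max_concurrency: int = 4,
) -> list[tuple[str, float]]:
    # Semaphore caps in-flight scoring calls; gather preserves candidate order.
    sem = asyncio.Semaphore(max_concurrency)

    async def scored(candidate: str) -> tuple[str, float]:
        async with sem:
            return candidate, await score_fn(candidate)

    return await asyncio.gather(*(scored(c) for c in candidates))
```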
Regression-avoidance signal. Tournament consumes recent RegressionRecord entries from the
workspace’s regression record store (see Regression records below) and
penalizes replay of regressed mutation patterns within the same workspace via adaptive composite
weighting per ADR-020.
Safety
Content safety checks on the winning candidate’s outputs. No spectral.worlds interaction. The
safety gate runs late — after Optimize selects a candidate — because running safety before Optimize
would prejudge candidates that a safer mutation would render acceptable.
Verdict
Multi-gate GO / NO-GO engine. The eight core verdict gates are pure functions in
spectral.platform.domain.verdict with no infrastructure imports — enforced by the architecture
validator. A convergence gate runs alongside. The delta threshold gate operates on
blended_delta; every other gate operates on rubric scorer data.
| # | Gate | Operates on | Outcome contribution |
|---|---|---|---|
| 1 | Delta threshold | blended_delta (CompositeScore) | NO-GO if improvement below workspace threshold |
| 2 | Agent regression (severe + mild) | Rubric scorer per-agent score deltas | NO-GO on severe regression; CAUTION on mild |
| 3 | Dimension regression | Rubric scorer per-dimension deltas | NO-GO if any rubric dimension floor is violated |
| 4 | Holdout generalization gap | Rubric scorer holdout vs working-set delta | NO-GO if holdout significantly underperforms working-set (synthetic holdout partition only — see Holdout protocol below) |
| 5 | Bootstrap 95 % CI | Rubric scorer score distribution | NO-GO if confidence interval crosses zero |
| 6 | Output similarity | Rubric scorer output embeddings | CAUTION on unexplained semantic drift |
| 7 | Pareto cost / latency penalty | Rubric scorer + cost / latency telemetry | NO-GO on Pareto-dominated outcome (degraded cost OR latency without compensating quality gain) |
| 8 | Sanity downgrade | Rubric scorer distribution shape | CAUTION on suspicious distribution (e.g., all-perfect or all-zero scores). Rubric scorer only — world-model scorer’s discriminative quality is already expressed via stimulus_weight per ADR-014, so applying sanity downgrade there would double-count. |
| + | Convergence gate | convergence_delta (CompositeScore) | CAUTION on conformance-track convergence drift; workspace-configurable hard NO-GO escalation |
Verdict also emits scan.convergence.delta per scan with explicit absence-marker semantics:
- Conformance data available: event carries the convergence delta (real-world vs synthetic EvalSet performance).
- Conformance data not available: event carries an explicit absence marker with reason.
Absence is a signal, not silence. WorldAgent aggregates absence at scale as a world-model-adoption signal.
VerdictResult and CompositeScore are defined in spectral.platform.domain.tournament per
ADR-020 — platform-internal types,
not spectral.core shared kernel.
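As a sketch of the pure-function gate shape (gate names from the table above; the NO-GO-dominates, CAUTION-downgrades precedence is an assumption consistent with the outcomes table):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateResult:
    gate: str
    outcome: str  # "pass" | "nogo" | "caution"

def delta_threshold_gate(blended_delta: float, workspace_threshold: float) -> GateResult:
    # Pure function of its inputs: no infrastructure imports, trivially
    # unit-testable, which is what the architecture validator enforces.
    outcome = "pass" if blended_delta >= workspace_threshold else "nogo"
    return GateResult("delta_threshold", outcome)

def combine(results: list[GateResult]) -> str:
    # Assumed precedence: any NO-GO dominates; otherwise any CAUTION
    # downgrades; otherwise GO.
    outcomes = {r.outcome for r in results}
    if "nogo" in outcomes:
        return "nogo"
    if "caution" in outcomes:
        return "caution"
    return "go"
```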
Holdout protocol
The holdout generalization gap gate consumes the synthetic EvalSet holdout partition exclusively. Conformance samples are scarce and reserved as convergence anchors — they are NOT consumed for the holdout generalization gap gate. The EvalSet carries an explicit two-layer holdout structure (working set + holdout); the verdict engine reads only the synthetic track.
blend_ratio and blended_delta
The delta threshold gate’s input field blended_delta is the stimulus-weight-derived composite
delta. The blend ratio that combines the world-model authority and rubric authority contributions
into blended_delta is computed at scan time from aggregate stimulus_weight, not configured
per workspace. Workspace configuration does not accept blend_ratio as a tunable.
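A hypothetical sketch of scan-time blending. The exact derivation of the ratio from aggregate stimulus_weight is not specified here; the mean and the bounds are illustrative only, chosen to show that neither authority can crowd the other out.

```python
def blend_ratio(stimulus_weights: list[float]) -> float:
    # ILLUSTRATIVE: world-model share derived from mean stimulus_weight,
    # bounded so neither authority dominates. Not the production formula.
    if not stimulus_weights:
        return 0.5
    mean_w = sum(stimulus_weights) / len(stimulus_weights)
    return min(0.7, max(0.3, mean_w))

def blended_delta(world_model_delta: float, rubric_delta: float,
                  stimulus_weights: list[float]) -> float:
    r = blend_ratio(stimulus_weights)
    return r * world_model_delta + (1 - r) * rubric_delta
```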
Verdict outcomes
The go_nogo field on VerdictResult is one of four values. The fourth value reuses the
observe_only label from the autonomy-mode taxonomy because both refer to the same workspace
state — a verdict outcome of observe_only is the natural shape produced when the workspace
itself is in autonomy mode observe_only.
| go_nogo | Meaning |
|---|---|
| go | Promotion recommended — all gates pass |
| caution | Mixed signals — requires human review. Autonomy modes never auto-accept caution regardless of configuration. |
| nogo | Do not promote — gate(s) fired |
| observe_only | The workspace is in observe_only autonomy mode — no changeset created (see Autonomy governance) |
Two-authority evaluation
Evaluation runs two scoring authorities in parallel on every ScanTrace, then combines their outputs into a single composite. Each authority has a different epistemic basis:
| Authority | Scoring basis | Owned by |
|---|---|---|
| World-model scorer | Ground truth co-generated with the EvalSet stimulus at generation time (see ADR-014). Answers: “Did the agent produce the response the rule says it should?” | spectral.worlds produces ground truth; spectral.platform consumes it. |
| Rubric scorer | LLM-as-judge against the workspace’s Evaluation Framework rubric. Answers: “How does this output score on the rubric’s dimensions?” Produces natural-language explanations that the diagnose phase’s clusterer reasons over. | spectral.platform. |
Opacity discipline between contexts
The world-model scorer’s inputs — ground truth, world-model-rule-derived scoring dimensions, and
stimulus_weight — are packaged into each EvalSet sample by spectral.worlds and consumed via
the callee-owned EvalSetProvider Tier 2 Protocol at
spectral.worlds.contracts.protocols.eval_set_provider per
ADR-065 D3 +
ADR-070 Tier 2 (LLM-tool wrapping
inside apps/test-agents is the qualifying multi-framework-consumer condition; Observe / tournament /
Evaluate are intra-spectral.platform phases that consume the same producer-typed payload through
the platform-side caller). Rule internals never cross the context boundary; the scorer reasons over
the producer-typed payload’s published shape, not over rule structure. The architecture validator at
STRICT=True enforces no spectral.worlds imports into spectral.platform as the structural
backstop; the EvalSetProvider Protocol surface is the data-flow assertion that complements it.
Why two authorities
A single authority lets an agent Goodhart the metric. Two authorities that draw from different signal bases keep the evaluation surface grounded:
- The world-model scorer anchors to the authority_version under which the EvalSet was generated. Its verdict is binary-ish: did the response match ground truth or not?
- The rubric scorer is customer-steerable. It encodes what the customer cares about — dimension weights, scoring guidance, hard-constraint floors.
- Neither authority can crowd the other out, because stimulus_weight is bounded and the composite blends both.
The two-authority defense rests on the world-model authority being credible — if the world model is wrong, two-authority evaluation just blends a flawed authority with a customer-steerable rubric. That credibility is built upstream, not inside the evaluation step: the four-tier provenance system grounds rules in authoritative sources, the conformity gate provides mechanical validation independent of any operator’s judgment, and methodology disclosure on the System Card makes every rule’s provenance auditable post-hoc. Two-authority evaluation is the Goodhart defense at evaluation time; the methodology stack the system card discloses is the credibility defense at authority time.
CompositeScore schema
Defined under spectral.platform.domain.tournament.* per
ADR-020 +
ADR-065 D1
(domain types do not live in the kernel — kernel admission discipline rules them out).
Every phase that produces a score (tournament pre-screen, verdict validation, system card
reporting) emits and consumes the same CompositeScore shape:
| Field | Description |
|---|---|
| world_model_score | Aggregated world-model authority score for the scan |
| rubric_score | Aggregated rubric authority score for the scan |
| blended_delta | Champion → patch improvement on the stimulus-weight-adjusted composite |
| convergence_delta | Conformance (real-world) vs synthetic performance delta. Null when conformance samples absent. |
| per_track_breakdown | {synthetic: {world_model, rubric, n_samples}, conformance: {...}} |
| attribution | World model version, authority_version, rule references (reference into worlds per ADR-030 + ADR-065 D2 producer-typed payload) |
Consistency across phases is enforced by type, not by shared compute. Every phase reads and
produces the same CompositeScore; the verdict engine uses the same blending logic as the
tournament pre-screen.
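The schema can be sketched as a frozen dataclass; field names follow the table, types are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TrackBreakdown:
    world_model: float
    rubric: float
    n_samples: int

@dataclass(frozen=True)
class CompositeScore:
    world_model_score: float
    rubric_score: float
    blended_delta: float                # champion -> patch improvement
    convergence_delta: Optional[float]  # None when conformance samples absent
    per_track_breakdown: dict[str, TrackBreakdown]
    attribution: dict[str, str]         # world model version, authority_version, rule refs
```

Because every phase reads and writes this one shape, consistency is a property of the type, not of any shared compute path.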
Rubric provisioning (rubric_gen)
The rubric scorer consumes the workspace’s Evaluation Framework rubric. rubric_gen is the
zero-setup cold-start path that produces a viable rubric for new workspaces from either observed
agent traces or a stated objective — workspaces never start without a scorable rubric.
rubric_gen is workspace-onboarding capability, not per-scan capability — it runs once at
workspace setup (or when an operator triggers a rubric refresh) and writes its output to the
workspace’s EvaluationFramework. The rubric scorer reads the framework on each scan; ongoing
rubric refinement happens through the Rubric audit feedback loop, not through
re-running rubric_gen.
LLM tier: reasoning (per LLM routing below) — discovery and synthesis cost is acceptable at workspace setup, unlike per-scan rubric scoring which uses the scoring tier.
Rubric divergence records
Per-scan rubric scorer outputs are compared against world-model scorer outputs to compute a
rubric divergence delta — a measurement of how much the customer rubric diverges from the
world-model authority on each scan. The delta is persisted as a RubricDivergenceRecord domain
record (workspace-scoped, RLS per
ADR-033, retention per
ADR-042) — not agent memory per ADR-058 D14 non-mirror list,
the same category as RegressionRecord and InterventionLog. Schema documented in
domain-model — RubricDivergenceRecord.
Each scan also emits a rubric.divergence typed event to spectral.worlds (regardless of
conformance-sample availability; payload module planned at
spectral.platform.contracts.events.rubric_divergence per ADR-065 D2).
The World Agent aggregates divergence across workspaces as a world-model-evolution signal;
single-workspace divergence remains a scan observation and does not initiate rule revision —
only cross-workspace aggregation (handled in spectral.worlds) is a rule-evolution signal.
Event emissions
Events with producer-owned typed payload modules in <context>.contracts.events.* per
ADR-065 D2:
| Event | Emitted by | Carries |
|---|---|---|
| platform.failure_cluster.detected (spectral.platform.contracts.events.failure_cluster_detected) | Diagnose (every cluster crossing detection threshold; World Agent applies promotion-threshold filter consumer-side) | Cluster ID, severity, failure count, first/last observed, evidence bundle, sanitized summary, suggested rule stub |
| rubric.divergence | Evaluate (always, per scan, regardless of conformance-sample availability) | Workspace ID, scan ID, evaluation framework ID, divergence delta, observed_at |
| verdict.issued | Verdict (always, per scan) | Workspace ID, scan ID, verdict, composite score, evaluation_authority_ref, issued_at |
| scan.convergence.delta | Verdict (always, per scan) | Convergence delta with presence-or-absence marker |
| scan.completed | Verdict (always, per scan) | Summary + outcome |
| approval.required | on_scan_completed handler when verdict triggers it (always when autonomy mode is manual; kill-switch and bounded-auto fall-through cases land in the second alpha autonomy wave) | Changeset ID + reason |
Autonomy governance
Autonomy modes govern how verdict output reaches changesets. The 0.3.0 alpha ships these in two waves:
- First wave lands the alpha-bound subset: observe_only + manual (default), enforced in the on_scan_completed handler. No gate evaluation, no kill switch, no fall-through arbitration.
- Second wave extends the handler to cover recommend, bounded_auto, the four-gate framework, and the kill switch — completing the alpha autonomy surface. auto_test and guarded_auto are post-launch, deferred outside the 0.3.0 alpha milestone.
Autonomy modes
observe_only and manual (default) ship in the first alpha wave; recommend and
bounded_auto ship in the second.
| Mode | Changeset created | Application path | Notes |
|---|---|---|---|
| observe_only | No | — | Enforced in the on_scan_completed handler before changeset creation; no changeset record exists. See Observe-only data treatment below — measurement is unaffected. |
| manual | Yes | Always approval.required | Default mode at workspace bootstrap; the handler always creates a changeset and emits approval.required. Operator-driven explicit control. |
| recommend | Yes | Always approval.required | Mechanically identical to manual; semantic intent is “Spectral recommends, human curates.” |
| bounded_auto | Yes | Auto-accept within gates; approval.required otherwise | Auto-accepts when composite score clears workspace-configured thresholds. |
caution verdicts are never auto-accepted regardless of mode or gate configuration. This is a
hard rule, not a threshold.
Bounded-auto gates
Workspace-configurable thresholds evaluated against the CompositeScore snapshot attached to the changeset:
- min_blended_delta — minimum score improvement required
- min_world_model_score — floor on the world-model authority score
- max_rules_affected — cap on blast radius per changeset
- require_validated_changeset — only changesets that have passed the validated terminal state are eligible
All gates must pass for auto-acceptance. Any single failure routes to approval.required.
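The all-gates-must-pass rule, sketched with hypothetical names (the caution hard rule from Autonomy modes is folded in):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundedAutoGates:
    min_blended_delta: float
    min_world_model_score: float
    max_rules_affected: int
    require_validated_changeset: bool

def auto_accept(verdict: str, blended_delta: float, world_model_score: float,
                rules_affected: int, changeset_validated: bool,
                gates: BoundedAutoGates) -> bool:
    # caution is never auto-accepted, regardless of gate configuration.
    if verdict != "go":
        return False
    # All gates must pass; any single failure routes to approval.required.
    return (
        blended_delta >= gates.min_blended_delta
        and world_model_score >= gates.min_world_model_score
        and rules_affected <= gates.max_rules_affected
        and (changeset_validated or not gates.require_validated_changeset)
    )
```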
Observe-only data treatment
observe_only mode suppresses actuation, not measurement. Everything the scan pipeline
observes, computes, and emits still happens; the only path that is skipped is ChangeSet creation.
| Data / signal | Behavior in observe_only |
|---|---|
| VerdictResult | Stored — same Scan row, same verdict table, same schema as any other mode. |
| CompositeScore | Stored — attached to the scan record. |
| Dashboard surfacing | Visible — verdicts and scores render on the customer dashboard identically. The UI labels them “observe-only” so there is no confusion about whether a ChangeSet exists. |
| Spectral Agent proactive conversation | Opened — ScanCompletedEvent still fires; the agent summarizes verdicts normally. The agent will not propose applying a ChangeSet because no ChangeSet exists; it can still discuss the findings. |
| Supervisor mode classification (ACTIVE / PLATEAU / FRONTIER / NO_DATA) | Fed — supervisor state is the right place to reason about “is the system improving?” and the answer must not depend on whether actuation is enabled. |
| rubric.divergence event | Emitted — the event carries measurement, not actuation. WorldAgent consumes it regardless of mode. |
| scan.convergence.delta event | Emitted — both presence and absence cases carry meaning per the Verdict phase spec. |
| platform.failure_cluster.detected event | Emitted — clustering is a measurement phase output. |
| T1 (interaction-tier) observation persistence | Persisted — the on_scan_completed handler invokes the spectral_agent_memory gateway directly per scan. T1 writes are independent of changeset creation. |
Why actuation vs measurement is the right cut
The purpose of observe_only is to let a customer watch how the system would behave before
granting any write authority. Muting measurement would turn the mode into a no-op — the customer
would not learn anything from it. The whole point is that, after N weeks in observe_only, the
customer has seen verdicts, trends, and agent reasoning that inform their decision to move to
recommend or further.
The one thing that does not happen in observe_only is ChangeSet creation. That is a
record-of-proposal and requires workspace-level intent to actuate. The on_scan_completed
handler checks the mode before creating the ChangeSet, and returns without creating one.
Downstream event emission and T1 (interaction-tier) memory persistence happen regardless of that
check. The handler itself runs in apps/workers per ADR-060.
Kill switch
A workspace-level kill switch forces approval.required on every changeset regardless of the
configured autonomy mode. It does not suppress changeset creation — scans run normally,
changesets accumulate, and every one requires human approval. Effective behavior is identical to
recommend mode while active.
The kill switch:
- Is persisted and survives service restarts
- Is audit-logged on activation and deactivation
- Aligns with the existing approval.required event path; no mode bypass is introduced
Post-launch modes
- auto_test — auto-accepts non-breaking changesets, defers breaking changes to approval. Requires a trust-baseline mechanism.
- guarded_auto — terminal rung of the autonomy ladder. Auto-accepts within policy guardrails with anomaly-driven rollback. Hard-depends on auto_test.
Behavioral specifications for both modes carry through to this page when the modes return.
Autonomy mode vs integration tier
These are two different axes. Both use tiered framing, which trips readers up. Keep them separate:
| | Integration tier | Autonomy mode |
|---|---|---|
| What it controls | Customer-facing trust progression (“how deeply does Spectral sidecar into the workflow”) | Workspace execution policy (“what happens to accepted changesets”) |
| Where it lives | Product vocabulary, customer onboarding, commercial positioning | Workspace configuration, enforced in on_scan_completed |
| Values | Stage 1 (observe + recommend), Stage 2 (observe + manage), Stage 3 (observe + manage + automate) | observe_only, manual (alpha first wave) + recommend, bounded_auto, kill switch (alpha second wave) |
| Who changes it | Commercial relationship / expansion decision | Workspace admin setting |
Neither axis subsumes the other. A Stage 2 customer can run manual (tight operator control) or
bounded_auto (automate with gates) without touching the tier.
Typical mapping
Not a hard rule — just what typically happens. The first wave covers observe_only + manual;
recommend and bounded_auto ship in the second.
| Integration tier | Typical autonomy mode |
|---|---|
| Stage 1 (observe + recommend) | observe_only or recommend — customer is still building trust in the optimization signal |
| Stage 2 (observe + manage) | recommend or manual — customer curates actively but Spectral owns scanning |
| Stage 3 (observe + manage + automate) | bounded_auto — customer has enough signal history to let gates fire |
For the customer-facing integration tiers see How Spectral Works.
Meta-improvement engine
The scan pipeline doesn’t just optimize customer agent systems — it feeds its own improvement. The meta-improvement engine tracks what mutation strategies work, identifies rubric quality issues, and guides the pipeline’s approach over time. See Memory System for the universal interaction / session / persistent lifecycle (parameterized as cycle / run / workspace for the Spectral Agent) that compounds strategy performance, and World Model System / Evolution Loop for how observed cluster patterns feed rule evolution.
Strategy registry
Tracks effectiveness of optimization strategies across runs:
- ELO ratings — strategies compete head-to-head in tournaments; ratings update on win/loss.
- Usage counts and win rates — how often each strategy is used and improves composites.
- Average improvement — expected blended_delta when a strategy is applied.
Optimize consults the registry when selecting mutation approaches. Higher-rated strategies are preferred for similar failure patterns.
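The head-to-head rating update is standard Elo; this sketch assumes a conventional K-factor of 32, which is not specified on this page.

```python
def elo_update(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    # Standard Elo: expected score from the rating gap, then a zero-sum
    # adjustment of size k * (actual - expected) for the winner.
    expected_winner = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    delta = k * (1.0 - expected_winner)
    return winner + delta, loser - delta
```

An upset (low-rated strategy beating a high-rated one) moves both ratings further than an expected result, which is what lets the registry converge on strategy quality over repeated tournaments.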
Intervention log
Every optimization intervention records outcome: per-cluster pre/post scores, predicted vs actual
improvement (calibrates the engine’s confidence), and reusability tags that feed observations into
the Spectral Agent’s persistent-tier (workspace-scope) memory. InterventionLog is a canonical
record of optimization activity, not agent memory itself — see
domain-model — InterventionLog.
Regression records
RegressionRecord is a spectral.platform domain record that captures interventions which caused
measurable regression on one or more failure clusters. Not agent memory — it is a canonical
record of optimization activity per the records-vs-memory framing
(ADR-058 D14 captures the workshop principle that distinguishes
records from memory). Stored in workspace-scoped platform.regression_records with RLS enforcement
per ADR-033 and retention governed by
ADR-042 D4. Schema documented in
domain-model — RegressionRecord.
A RegressionRecord is a dedicated entity, not a flag on InterventionLog. The two have
different responsibilities and access patterns:
- InterventionLog — record-of-action (every intervention, regardless of outcome). The optimizer queries it chronologically by workspace.
- RegressionRecord — record-of-regression (interventions that caused measurable harm). The verdict + tournament engine queries it by mutation pattern and by cluster; the World Agent queries it (after sanitisation) for rule-coverage signal.
A RegressionRecord references the originating InterventionLog entry; interventions that did
not regress carry no RegressionRecord.
Write path. When a verdict gate fires NO-GO with cluster-level regression detail (gates 2 / 3
— agent regression and dimension regression), the verdict engine writes a RegressionRecord
capturing the mutation pattern, regressed clusters, improved clusters (regressions are rarely
pure), and severity.
Read path. Tournament reads recent RegressionRecord entries during candidate selection and
penalizes replay of matching mutation patterns via adaptive composite weighting (see
Optimize / Regression-avoidance signal above).
Sanitised promotion to World Signal. A RegressionRecord whose mutation pattern is
workspace-agnostic (no PII, no workspace-specific configuration detail, just the domain-relevant
pattern + cluster class) clears the sanitisation gate and routes via the memory-to-Worlds
signal events (memory.observation.promoted / t3_memory.written) to spectral.worlds per
ADR-018. The
World Agent may surface the signal as a candidate for rule revision — the underlying rule may be
underspecified in a way that lets regressing mutations pass local evaluation.
What stays out of the record:
- No customer output text — the mutation pattern is opaque, and cluster references do not unwrap to raw content.
- No cross-workspace pattern matching — that happens in the World Agent after sanitisation, not inside the workspace-scoped record.
- No automatic retirement of strategies — a pattern with many regressions in one workspace is workspace context; strategy retirement is an operator / World Agent concern at the aggregate level.
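The sanitisation gate can be sketched as a predicate over the record. The concrete checks here (PII flag, workspace-specific flag, pattern-opacity check) are illustrative stand-ins for the real gate, which the page documents only by its criteria.

```python
# Hypothetical sanitisation gate: promotable to spectral.worlds only when the
# mutation pattern is workspace-agnostic. Flag names and the "workspace:"
# substring check are illustrative assumptions.
def clears_sanitisation_gate(record: dict) -> bool:
    pattern = record["mutation_pattern"]
    return (
        not record.get("contains_pii", False)            # no PII
        and not record.get("workspace_specific", False)  # no workspace config detail
        and "workspace:" not in pattern                  # pattern + cluster class only
    )
```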
Rubric audit
Section titled “Rubric audit”Calibrate and Evaluate emit signals about rubric quality — high-variance dimensions, ambiguous guidance, score distribution shifts. The rubric audit surface summarizes these for the Spectral Agent, which proposes rewrites the operator can accept or decline.
LLM routing
Section titled “LLM routing”Tier-based model routing balances capability and cost:
| Task tier | Used for | Selection criteria |
|---|---|---|
| Scoring | Evaluation (both authorities), tournament scoring, holdout validation | Cost-optimized, high throughput |
| Detection | Parse checks, anti-deception, safety screening | Lowest cost, fast |
| Reasoning | Diagnosis, optimization, prompt rewrites, mutation generation | Highest capability |
| Customer | Agent execution during Observe | Customer’s own model (passed through) |
Fallback hardening ensures model failures don’t halt the pipeline. Consecutive failures trigger automatic tier disabling for the remainder of the scan.
Supervision & scheduling
Section titled “Supervision & scheduling”For customers on continuous or periodic cadences:
- Scheduled scans run at configured intervals.
- Supervisor integration consults a planning function before each scan to determine priority and budget allocation.
- Frontier detection recognizes when optimization has plateaued and switches to monitoring-only.
- Economic reasoning uses cost-per-failure and revenue-per-success to prioritize optimization where it has the highest business impact.
Bayesian category selection — intervention_memory_adjustment
Section titled “Bayesian category selection — intervention_memory_adjustment”The supervisor’s category-selection step reads from agent memory, not from raw scan history.
The supervisor consults Tier-3 (workspace-scope) observations for an intervention_memory_adjustment
that nudges category priors based on past intervention outcomes — categories whose past
interventions correlated with regression get downweighted; categories whose past interventions
moved the needle get upweighted. The adjustment is a posterior nudge on a Bayesian prior, not
a hard override; recent observations dominate older ones via the standard Tier-3 decay schedule.
The integration is read-only at the supervisor seam — the supervisor never writes to memory; the
adjustment value flows through as part of the planning function’s input.
Supervisor recommendation delivery
Section titled “Supervisor recommendation delivery”The supervisor produces SupervisorRecommendation records — guidance about what to scan next,
what budget to allocate, what to prioritize. These are consumed by the Spectral Agent (and, at
a lesser priority, by operational dashboards).
Delivery is event-driven via a supervisor.recommendation.issued event with producer-typed
payload planned at spectral.platform.contracts.events.supervisor_recommendation_issued per
ADR-065 D2 (lands with
the supervisor epic).
Event shape:
| Field | Type | Notes |
|---|---|---|
| event_id | UUID | |
| workspace_id | UUID | |
| recommendation_id | UUID | SupervisorRecommendation primary key |
| mode_classification | enum (ACTIVE \| PLATEAU \| FRONTIER \| NO_DATA) | |
| priority | enum | Ordered set of priority tags the supervisor has reasoned about |
| budget_hint | optional decimal | Spend cap guidance for the next scan, when the supervisor has an opinion |
| narrative | string | Short natural-language rationale |
| issued_at | timestamp | |
Why event-driven:
- Consistent with the event-substrate doctrine. Heavy/async work dispatches via events; direct query coupling across the supervisor-to-agent boundary would break the pattern.
- Multiple consumers supported. The Spectral Agent is the primary consumer, but operational dashboards and the Operations Agent’s observability tooling may also subscribe without the supervisor needing to know about them.
- Decoupled timing. The supervisor emits when it has reasoned; the agent consumes when it is triggered, which may be a different moment (e.g., when a customer sends a chat message).
- Survives restart. Unlike a direct-call-state-query model, the event record is durable — an agent restart does not lose the recommendation.
Why not attached to scan.completed:
The supervisor can issue recommendations that are not tied to a specific scan (periodic plateau
detection, budget reallocation after a billing event, etc.). Forcing them onto scan.completed
would lose those cases.
Why not polling:
Polling is the wrong direction. The supervisor is the producer; pushing its recommendations onto an event bus keeps authority where the reasoning is.
Consumer hookup: OnSupervisorRecommendationHandler in spectral.platform’s agent
application layer creates a proactive conversation (or updates an existing one) with the
recommendation narrative. See Agent Architecture — Event-driven proactive conversations.
Next steps
Section titled “Next steps”- Domain Model — entities, state machines, relationships
- World Model System — how EvalSets and ground truth are produced
- Memory System — universal interaction / session / persistent lifecycle compounding and the world-signal path
- Agent Architecture — Spectral Agent, Operations Agent, WorldAgent