World Model System

Eval Generation

EvaluationFrameworks are derived from World Models, but the operational unit of evaluation is the EvalSet — a generated artifact created from a World Model on request, shaped by the customer’s directional parameters. EvalSets are ephemeral and purposive: they are created for a specific evaluation context, not stored as world model representations.

This separation matters. The world model is the standard; the EvalSet is a concrete probe of a customer’s system against that standard at a point in time.

EvalSets are also statistically unique per request (per ADR-028) — two scans against the same world model never receive the same set of generated instances. This pins eval results to the world-model version while preventing memorization-style overfitting at the instance level: an agent whose scores improve from one scan’s EvalSet to the next is improving against unseen instances, so it is genuinely conforming better, not pattern-matching a fixed test set. The corollary of statistical uniqueness is that rule-level coverage stays stable across scans (the same rules are exercised; the instances that exercise them differ); see Holdout Strategy below for the second overfitting defense.
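
A minimal sketch of the property, assuming a hypothetical `generate_eval_set` helper (not the shipped generator): the rule set exercised is fixed by the world-model version, while the instances drawn for each rule vary with a fresh per-request seed.

```python
# Hypothetical sketch of per-request statistical uniqueness (not the shipped generator).
import random
import uuid


def generate_eval_set(rules: list[str], instances_per_rule: int) -> dict[str, list[str]]:
    # A fresh seed per request: two scans never receive the same instance set.
    rng = random.Random(uuid.uuid4().int)
    return {
        rule: [f"{rule}-instance-{rng.randrange(10**9)}" for _ in range(instances_per_rule)]
        for rule in rules  # rule-level coverage is identical across scans
    }


a = generate_eval_set(["rule-1", "rule-2"], 3)
b = generate_eval_set(["rule-1", "rule-2"], 3)
assert a.keys() == b.keys()  # same rules exercised
assert a != b                # different instances exercise them
```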

Customers do not author evaluation frameworks. Customers parameterize world-model eval generation — selecting vectors, coverage areas, and focus dimensions that direct the generation process. The output is always a world-model-generated EvalSet with customer-directed parameters.

The customer’s problem space is a subset of the domain the World Model governs. The EvaluationFramework scopes to the customer’s problem space but grounds its criteria in the domain standard. The customer narrows the aperture; the domain standard defines what correctness means within that aperture.

Customer steering parameters sometimes point at territory outside current world model coverage — the Unknown zone. When this happens, the system applies two responses simultaneously.

  1. A coverage gap notification is returned to the customer, honestly reporting that the requested territory is not yet covered by the world model.
  2. The steering parameters are routed internally as a candidate discovery signal to the Evolution Loop.

This connects the customer-facing coverage gap to the discovery mechanism. The customer gets honest feedback rather than a silently-degraded EvalSet; the system gets a signal about where the world model’s Unknown zone may contain valuable rules.
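
A sketch of the dual response, using hypothetical `CoverageGap` and `emit_discovery_signal` stand-ins for the real contract type and Evolution Loop hand-off:

```python
# Illustrative sketch of the dual response to out-of-coverage steering parameters.
from dataclasses import dataclass


@dataclass(frozen=True)
class CoverageGap:
    requested_area: str
    reason: str


def emit_discovery_signal(area: str) -> None:
    # Placeholder for the internal hand-off to the Evolution Loop.
    print(f"discovery signal: {area}")


def handle_steering(requested: list[str], covered: set[str]) -> list[CoverageGap]:
    gaps = []
    for area in requested:
        if area not in covered:
            # Response 1: honest customer-facing gap report.
            gaps.append(CoverageGap(area, "no rule covers this territory"))
            # Response 2: candidate discovery signal routed internally.
            emit_discovery_signal(area)
    return gaps
```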

The eval corpus is an internal worlds asset — owned by spectral.worlds, never exposed outside worlds except through the EvalSet generation surface described below. Per ADR-027, no other context reads or writes the corpus directly; consumers in other contexts receive only the generated EvalSet artifact.

EvalSet generation draws from three source types.

Synthetic — systematically generated scenarios covering rule space. Provides baseline coverage across the rule’s full scope, including regions where organic scenarios are sparse.

Customer-supplied — scenarios from the customer’s operational context. Captures the actual distribution of the target agent’s problem space so evaluation reflects real operating conditions.

Public-sourced — scenarios grounded in real-world instantiations of the domain. Provides external validity and cross-customer grounding. Quality assurance happens operator-side: the source set is curated against the same Authoritative-or-Curated provenance bar as the rule corpus itself, with each public scenario traceable to its source publication or authority. Public-sourced scenarios that can’t carry that provenance footprint are rejected at intake (see the sketch below).
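
The intake gate might look like the following sketch; the `PublicScenario` shape and `source_ref` field are assumptions, not the real schema.

```python
# Hedged sketch of the provenance bar at intake: a public-sourced scenario without
# a traceable source reference is rejected.
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicScenario:
    text: str
    source_ref: str | None  # publication or authority the scenario traces to


def admit_public_scenarios(candidates: list[PublicScenario]) -> list[PublicScenario]:
    # Authoritative-or-Curated bar: only scenarios carrying a provenance footprint pass.
    return [s for s in candidates if s.source_ref]
```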

Scenario perturbation (mutation-style transformations of known-good and known-bad scenarios — distinct from code-level mutation testing) and fuzzing apply across all three sources to probe boundary conditions. Edge-case exploration pushes further, toward the regions where the rule’s intent becomes hard to satisfy trivially. Together these address the shallow surface-probing failure mode of scenario generation: generated evals that look diverse but concentrate on easy cases.
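
A minimal illustration of the perturbation pass, assuming scenarios are dicts with a numeric `amount` field; the shipped mutators are rule-aware rather than field-generic.

```python
# Illustrative scenario-perturbation pass (not the shipped mutator). Each scenario
# spawns variants that nudge values toward boundary conditions.
import random


def perturb(scenario: dict, rng: random.Random) -> dict:
    mutated = dict(scenario)
    if "amount" in mutated:
        # Example boundary mutation: nudge a numeric field around a rule threshold.
        mutated["amount"] = mutated["amount"] + rng.choice([-1, 0, 1])
    return mutated


def expand(scenarios: list[dict], variants: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    out = list(scenarios)
    for s in scenarios:
        out.extend(perturb(s, rng) for _ in range(variants))
    return out
```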

The holdout strategy prevents overfitting to the generation distribution. It applies in two layers, instance-level (universal) and rule-level (selective), a structure decided in ADR-023.

Instance-level holdout — applied universally. A fraction of generated instances per rule is reserved from the optimization loop. Instance-level holdout detects whether performance on seen instances tracks genuine rule conformance or pattern matching on the generation distribution.
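
A sketch of the universal instance-level split, with the holdout fraction left as a parameter:

```python
# Minimal sketch: a fixed fraction of each rule's generated instances is reserved
# from the optimization loop.
import random


def split_holdout(instances: list[str], fraction: float, seed: int) -> tuple[list[str], list[str]]:
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - fraction))
    visible, held_out = shuffled[:cut], shuffled[cut:]
    return visible, held_out  # held_out never enters the optimization loop
```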

Rule-level holdout — applied selectively. A rule is eligible for rule-level holdout only when the visible rule set covers its domain territory with sufficient peer density. An agent genuinely conforming to the domain should handle the held-out rule without having seen it specifically, because its peers constrain the same behavioral region. Peer coverage is assessed by reviewer judgment over topic tags at launch, with embedding-based clustering displacing topic tags as the embedding space matures (per ADR-023).
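
A sketch of the launch-time eligibility check over topic tags; the `min_peers` threshold is an assumption, not a decided value.

```python
# Sketch of rule-level holdout eligibility: a rule can be held out only when enough
# visible peers constrain the same behavioral region (tag overlap as the proxy).
def eligible_for_rule_holdout(rule_tags: set[str],
                              peer_tagsets: list[set[str]],
                              min_peers: int = 3) -> bool:
    peers = sum(1 for tags in peer_tagsets if rule_tags & tags)
    return peers >= min_peers
```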

Generated instances are hashed and compared against the holdout registry before release. Embedding-based semantic similarity with a configurable threshold suppresses near-matches and triggers regeneration. The same embedding infrastructure used for peer coverage assessment serves the near-match check; no separate embedding pipeline is required.
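
A sketch of the release gate, with a toy `embed` function standing in for the shared embedding infrastructure and an assumed 0.9 similarity threshold:

```python
# Illustrative leakage check against the holdout registry: exact matches by content
# hash, near-matches by embedding similarity. Both the embed() body and the default
# threshold are placeholders.
import hashlib
import math


def embed(text: str) -> list[float]:
    # Stand-in embedding; the real system reuses the peer-coverage embedding pipeline.
    return [text.count(c) / (len(text) or 1) for c in "abcdefgh"]


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def releasable(instance: str, registry_hashes: set[str],
               registry_vectors: list[list[float]], threshold: float = 0.9) -> bool:
    if hashlib.sha256(instance.encode()).hexdigest() in registry_hashes:
        return False  # exact collision with a held-out instance: regenerate
    vec = embed(instance)
    return all(cosine(vec, held) < threshold for held in registry_vectors)
```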

The holdout configuration is managed, not static. It is reviewed and updated as world model coverage evolves: rules become eligible for rule-level holdout as their peer density grows, and thresholds tighten as the embedding space matures.

The conformity gate is the quality enforcement boundary at promotion. Any observation that accumulates sufficient evidence to be proposed as a rule candidate must demonstrate conformity with the existing rule set before entering the enshrinement pipeline.

The gate checks against the full proposed rule set including pending candidates — not only enshrined rules. This prevents a race condition where two conforming-but-mutually-inconsistent candidates both pass the gate independently and then conflict at enshrinement. Checking against pending candidates closes that gap.

The gate is self-consistent: the rule set validates itself. It catches both deliberately misaligned observations — for example, from customer-directed world-boundary-probing vectors — and well-intentioned but inconsistent observations. The conformity gate performs the mechanical consistency check. Human sign-off at enshrinement exercises judgment about whether the rule belongs in the world model. The two gates serve different functions and are not interchangeable.
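
A sketch of the gate’s key structural property: candidates are checked against enshrined rules and pending candidates together. The `conflicts` predicate is a hypothetical stand-in for the mechanical consistency check.

```python
# Sketch of the conformity gate's race-closing behavior.
from typing import Callable


def passes_conformity_gate(candidate: str,
                           enshrined: list[str],
                           pending: list[str],
                           conflicts: Callable[[str, str], bool]) -> bool:
    # Check against the full proposed rule set, not only enshrined rules, so two
    # mutually inconsistent candidates cannot both pass the gate independently.
    full_set = enshrined + pending
    return not any(conflicts(candidate, rule) for rule in full_set)
```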

The surface lands with the EvalSet provider epic; the shape below is the locked design.

Customers parameterize EvalSet generation through four fixed dimensions. The EvalSet generation flow is a synchronous call: the customer-facing API in apps/api submits parameters and worlds returns the generated set. Per ADR-065 D3, the mechanism is a callee-owned OHS Protocol in spectral.worlds.contracts.protocols.* (worlds is the callee); per ADR-065 D5, the bridge tool is composed in apps/api. The request type EvalSetParameterization is part of the worlds contract surface (not in spectral.core).

The four dimensions below are the full surface — adding a fifth is a worlds-contract change subject to the ADR-065 admission discipline.

| Dimension | Type | Meaning |
| --- | --- | --- |
| `coverage_areas` | `list[ProblemSpaceRef]` | Subset of problem spaces (from the scoped world model) to generate samples for. Must be a strict subset of the workspace’s scoped problem spaces — submitting an area outside the scope is a validation error, not an implicit Unknown-routing. |
| `focus_vectors` | `dict[ProblemSpaceRef, float]` | Per-area weighting in [0, 1], normalised server-side. Reshapes the sample distribution toward the focus without altering the coverage boundary. Absent areas default to uniform weight. |
| `difficulty_profile` | `enum(basic, edge, adversarial, mixed)` | Controls the mutation-testing mix applied during generation. basic = known-good scenarios only; edge = boundary conditions; adversarial = deliberately-probing mutations; mixed = the default blend. |
| `exploratory_probes` | `list[ExploratoryProbe]` | Explicitly-flagged world-boundary-probing vectors. Each probe carries a short natural-language statement of what the customer is testing. These are permitted into Unknown territory per the Unknown Territory Behavior rules above; sample output is tagged accordingly. |
```python
# spectral.worlds.contracts.protocols.eval_set_provider (illustrative; lands per the EvalSet refinement)
# ProblemSpaceRef, DifficultyProfile, ExploratoryProbe, AttributionSummary, CoverageGap,
# and EvaluationAuthorityRef are sibling types on the worlds contract surface.
from dataclasses import dataclass
from uuid import UUID


@dataclass(frozen=True)
class EvalSetParameterization:
    world_model_version: str  # pinned authority version
    coverage_areas: list[ProblemSpaceRef]
    focus_vectors: dict[ProblemSpaceRef, float]
    difficulty_profile: DifficultyProfile
    exploratory_probes: list[ExploratoryProbe]


@dataclass(frozen=True)
class EvalSetGenerationResponse:
    evalset_id: UUID
    sample_count: int
    attribution_summary: AttributionSummary  # per-sample provenance + source-type breakdown
    # non-empty when a coverage_area or probe fell into Unknown; each CoverageGap
    # records the requested area + the reason no rule covers it
    coverage_gaps: list[CoverageGap]
    authority_ref: EvaluationAuthorityRef  # opaque world-model-version reference
```

EvaluationAuthorityRef is the opaque, version-pinned reference platform receives in the response. Worlds mints it at world-model-version publication; platform stamps it onto the ChangeSet at scan time. Platform cannot parse it for worlds-internal state — see Version Attribution for the full principle.

  • world_model_version must resolve to a published WorldModelVersion. Unpublished or draft versions are rejected.
  • coverage_areas — every element must be in the workspace’s scoped problem-space set. Submitting an unknown area is a 400, not a silent fallback.
  • focus_vectors — keys must be a subset of coverage_areas. Weights outside [0, 1] are rejected.
  • difficulty_profile — closed enum; the default is mixed if omitted.
  • exploratory_probes — no upper limit on count, but each probe must carry a non-empty statement. Probes that land in Unknown territory are not validation errors — they flow through the coverage-gap + Evolution-Loop discovery signal path already defined on this page. The full set of these checks is sketched below.
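
A minimal sketch of these validation rules and the server-side normalisation, using plain `ValueError`s in place of the real API error types:

```python
# Sketch of request validation + focus-vector normalisation (error types are assumptions).
def validate_and_normalise(coverage_areas: list[str],
                           focus_vectors: dict[str, float],
                           scoped_areas: set[str],
                           difficulty: str | None) -> tuple[dict[str, float], str]:
    unknown = [a for a in coverage_areas if a not in scoped_areas]
    if unknown:
        raise ValueError(f"400: areas outside workspace scope: {unknown}")  # no silent fallback
    if not set(focus_vectors) <= set(coverage_areas):
        raise ValueError("400: focus_vectors keys must be a subset of coverage_areas")
    if any(not 0.0 <= w <= 1.0 for w in focus_vectors.values()):
        raise ValueError("400: focus weights must lie in [0, 1]")
    # Absent areas default to uniform weight before normalisation.
    weights = {a: focus_vectors.get(a, 1.0) for a in coverage_areas}
    total = sum(weights.values())
    if total == 0:  # all-zero focus: fall back to uniform
        weights = {a: 1.0 for a in coverage_areas}
        total = float(len(weights))
    normalised = {a: w / total for a, w in weights.items()}
    return normalised, (difficulty or "mixed")  # closed enum; default is mixed
```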

A customer who wanted to inflate scores could submit probes only into well-covered territory; a customer who wanted to influence world-model evolution could probe their pet edge cases. Three structural constraints bound both behaviors:

  • Probes don’t shape composite scores. Sample output from probes carries an explicit source_type = exploratory_probe tag. The composite score in Two-authority evaluation is computed over the world-model-derived sample set; probe-tagged samples are reported separately on the AgentPerformanceCard, not blended into the conformance score (see the sketch below). So inflating the score by probe selection is structurally prevented.
  • Probes that land in Unknown territory route as discovery signals, not as rule changes. The Evolution Loop’s conformity gate + human sign-off still gates promotion. A customer submitting many self-interested probes adds evidence the operator can review or set aside; it doesn’t shortcut the gate.
  • Attribution is per-customer. exploratory_probe samples carry workspace attribution into the evidence stream. Operators reviewing accumulated probe evidence can see whether a single customer is dominating the discovery surface and weight their judgment accordingly.

The probe surface is a coverage-gap + discovery signal mechanism, not a metric-shaping lever.
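
A sketch of the score separation, with `Sample` fields assumed to match the attribution envelope below:

```python
# Sketch of the structural guarantee: the conformance score is computed only over
# world-model-derived samples; probe-tagged samples are reported separately.
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    stimulus_source: str  # world_model_grounded | customer_directed_probe | mutation
    passed: bool


def composite_score(samples: list[Sample]) -> float:
    graded = [s for s in samples if s.stimulus_source != "customer_directed_probe"]
    return sum(s.passed for s in graded) / len(graded) if graded else 0.0


def probe_report(samples: list[Sample]) -> list[Sample]:
    # Reported on the AgentPerformanceCard, never blended into the composite.
    return [s for s in samples if s.stimulus_source == "customer_directed_probe"]
```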

Every generated sample carries an attribution envelope indicating how it was derived:

| Field | Values |
| --- | --- |
| `stimulus_source` | One of: world_model_grounded, customer_directed_probe (exploratory), mutation (mutation-testing perturbation of a grounded sample) |
| `originating_area` | ProblemSpaceRef the sample was generated for |
| `generating_rule_ref` | Opaque rule reference — no rule text |
| `probe_ref` | Present only when stimulus_source = customer_directed_probe; references the submitted ExploratoryProbe |

The attribution envelope is what lets the Evaluate phase route samples through the correct authority (world-model grounding for world_model_grounded, dual authority for probes) and what lets the AgentPerformanceCard report probe outcomes separately from grounded outcomes.
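
A sketch of that routing; the handling of mutation samples (routed with grounded samples, since they perturb grounded stimuli) is an assumption the prose does not spell out.

```python
# Illustrative Evaluate-phase dispatch keyed off the attribution envelope.
def route_sample(stimulus_source: str) -> str:
    if stimulus_source == "world_model_grounded":
        return "world-model authority"
    if stimulus_source == "customer_directed_probe":
        return "dual authority"          # probes are graded under both authorities
    if stimulus_source == "mutation":
        return "world-model authority"   # assumption: perturbed grounded samples stay grounded
    raise ValueError(f"unknown stimulus_source: {stimulus_source}")
```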

  • Not a rubric-authoring surface. Customers never write evaluation rubrics; rubrics are derived from the world model. ADR-014 defines the contract.
  • Not a rule-selection surface. The customer does not pick rules; they pick areas. Inside an area, the world model decides which rules contribute to the EvalSet.
  • Not a difficulty-knob surface. difficulty_profile is coarse (four values) by design — fine-grained difficulty tuning would let customers produce shallow EvalSets and claim conformance against the standard.

Cold-start posture — no default world model


Spectral does not ship a default or general-purpose world model. A customer cannot run a scan until a domain-specific world model exists in the system for that customer’s domain. Workspace creation accepts only published world-model versions; if none is published in the customer’s domain, scan onboarding waits. Rationale + alternatives considered (default model, generic models) live in ADR-025 — the system-card authority model requires Authoritative source grounding, which a generic model cannot provide.

  • Workspace creation requires selecting a published world model version. The selection UI lists only versions that exist in the system. Today, that list has one entry (us-federal-individual-tax v0.1.0 per the operator walkthrough).
  • A customer whose domain is not yet covered is not silently serviced. They are told explicitly: “This domain is not yet available. [Become a design partner to bootstrap it.]” This routing lives in the onboarding flow, not in the EvalSet-generation path.
  • The minimum rubric configuration the customer provides is zero. The EvaluationFramework is derived from the selected world model’s rule corpus; the customer parameterizes it via EvalSetParameterization (above), but does not author rubrics.
  • The Spectral Agent’s onboarding-guide specialist helps the customer select scope and focus — it does not help the customer author a world model. Authoring is an operator-only responsibility in the Operations app.

The first-customer walkthrough (first-customer-walkthrough — Step 1) describes what this looks like at the customer touchpoint.

The world signal events path (failure clusters and promoted observations routed back to worlds as input to rule evolution) compounds across the customer base in a domain. By construction, a single customer’s divergence remains a scan observation and does not initiate rule revision; cross-workspace aggregation is what drives evolution-loop proposals. The first customer in a domain — by definition — has no peers, so their scan observations sit as evidence rather than triggering rule changes until additional customers come online and the same patterns surface across workspaces. This is intentional: the world model is a shared standard, not a per-customer artifact, and rule evolution requires breadth before action. The asymmetry attenuates at ~5–10 customers in a domain, after which cross-workspace patterns surface readily.
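
A sketch of the breadth-before-action rule; the `min_workspaces` threshold is illustrative, chosen from the ~5–10 attenuation point above.

```python
# Sketch: a divergence pattern becomes an evolution-loop proposal only once it
# surfaces across multiple workspaces; a single customer's divergence stays evidence.
from collections import defaultdict


def proposal_ready(observations: list[tuple[str, str]], min_workspaces: int = 5) -> set[str]:
    # observations: (workspace_id, pattern_id) pairs from scan evidence
    workspaces_per_pattern: dict[str, set[str]] = defaultdict(set)
    for workspace, pattern in observations:
        workspaces_per_pattern[pattern].add(workspace)
    return {p for p, ws in workspaces_per_pattern.items() if len(ws) >= min_workspaces}
```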

  • Evolution Loop — how candidate observations move through conformity gate + human sign-off into enshrined rules
  • System Card — what an external auditor sees about an EvalSet’s authority + provenance composition
  • Optimization Engine — how Spectral consumes the generated EvalSet through the scan pipeline