Decisions

ADR-022: Eval generation architecture

Status: Accepted (2026-04-20)

Source: migrated from planning/swms-decisions.md ADR-032 as part of SPEC-270.

Context

Eval generation is the mechanism through which the world model produces the instances that spectral.platform executes against. The generation pipeline must satisfy several requirements that interact: broad coverage of the rule space, faithfulness to how target agents are actually used, grounding in real-world instantiations of the rules, resistance to shallow surface probing, structural protection against self-consistent corruption when observations graduate into rule candidates, and an explicit provenance flag that distinguishes vectors grounded in the world model from vectors arising from customer-directed parameterization. The design interview resolved these requirements into a single coherent generation architecture.

Decision

Eval generation draws from a three-source corpus, applies mutation and fuzzing patterns across all sources, enforces a conformity gate at the promotion boundary, and carries a provenance flag on every EvalSet that distinguishes world-model-grounded vectors from customer-directed exploratory vectors.

Three-source corpus

Synthetic source: systematic coverage of the rule space generated from the world model structure itself. Ensures the generation surface is not biased by what customers happen to submit or by what public material happens to exist.
Customer-supplied source: captures the actual operational distribution of the target agent. Ensures the evaluation surface reflects how the agent is actually used rather than an abstract ideal.
Public-sourced source: grounds scenarios in real-world instantiations of the rules drawn from public material. Connects synthetic coverage to real-world form.
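
As a rough illustration of how the three sources might feed one common corpus, the sketch below tags each scenario with its source and reports which sources cover each rule. The names `Source`, `Scenario`, `merge_corpus`, and `coverage_by_source` are hypothetical and are not taken from the spectral.platform codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Iterator, Set


class Source(Enum):
    """The three corpus sources named in this ADR (identifiers are assumptions)."""
    SYNTHETIC = "synthetic"
    CUSTOMER_SUPPLIED = "customer_supplied"
    PUBLIC_SOURCED = "public_sourced"


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record; the real schema is not specified here."""
    rule_id: str
    text: str
    source: Source


def merge_corpus(*streams: Iterable[Scenario]) -> Iterator[Scenario]:
    """Yield scenarios from every source stream as one common corpus."""
    for stream in streams:
        yield from stream


def coverage_by_source(corpus: Iterable[Scenario]) -> Dict[str, Set[Source]]:
    """Map each rule id to the set of sources that contributed a scenario."""
    coverage: Dict[str, Set[Source]] = {}
    for scenario in corpus:
        coverage.setdefault(scenario.rule_id, set()).add(scenario.source)
    return coverage
```

A rule covered only by the synthetic source has no grounding in real usage or public material, which a coverage map of this kind would make visible.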

Mutation testing and fuzzing

Mutation and fuzzing patterns are applied across all three sources. They probe boundary conditions and structural variations rather than restating rule text in slightly different phrasings. The purpose is to keep the eval set from becoming a shallow, surface-level probe of rule intent that an agent can pattern-match on.
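
To make the difference between rephrasing and structural probing concrete, here is a minimal sketch of two mutation patterns. It assumes a scenario can be represented as a flat dictionary of integer parameters and that a rule has a numeric limit; both assumptions, and the function names, are illustrative only.

```python
from typing import Dict, List


def boundary_mutations(params: Dict[str, int], key: str, limit: int) -> List[Dict[str, int]]:
    """Boundary probing: copies of params with `key` set just below, at, and just above `limit`."""
    return [{**params, key: value} for value in (limit - 1, limit, limit + 1)]


def drop_field_mutations(params: Dict[str, int]) -> List[Dict[str, int]]:
    """Structural variation: copies of params, each with exactly one field removed."""
    return [
        {k: v for k, v in params.items() if k != dropped}
        for dropped in params
    ]


# Hypothetical scenario: a transfer checked against a rule with a limit of 100.
base = {"amount": 50, "recipient_count": 1}
variants = boundary_mutations(base, "amount", 100) + drop_field_mutations(base)
```

Neither pattern changes the wording of the rule; both change the shape or values of the situation the rule is applied to, which is the property the decision relies on.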

Conformity gate at the promotion boundary

Any observation that accumulates sufficient evidence to be proposed as a rule candidate must demonstrate conformity with the existing rule set before entering the enshrinement pipeline.
The gate runs against the full proposed rule set including pending candidates, not only enshrined rules. This prevents a race condition in which two conforming-but-mutually-inconsistent candidates pass the gate independently.
The gate is self-consistent: the rule set validates itself. It catches both misaligned intentional evals and well-intentioned but inconsistent observations.
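
The race condition described above can be sketched as follows. This is a simplified model under stated assumptions: a rule is reduced to a condition and a verdict, and two rules conflict when they share a condition but disagree on the verdict. The real conformity check is not specified in this ADR, and `Rule`, `conflicts`, and `conformity_gate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Rule:
    """Toy rule model (assumption): one condition mapped to one verdict."""
    rule_id: str
    condition: str
    verdict: str


def conflicts(a: Rule, b: Rule) -> bool:
    """Two rules conflict if they cover the same condition with different verdicts."""
    return a.condition == b.condition and a.verdict != b.verdict


def conformity_gate(candidate: Rule, enshrined: Iterable[Rule], pending: Iterable[Rule]) -> List[Rule]:
    """Return every enshrined or pending rule the candidate conflicts with.

    An empty list means the candidate passes the gate.
    """
    return [rule for rule in [*enshrined, *pending] if conflicts(candidate, rule)]


enshrined = [Rule("R1", "export_pii", "deny")]
first = Rule("C1", "share_logs", "allow")
second = Rule("C2", "share_logs", "deny")

pending: List[Rule] = []
if not conformity_gate(first, enshrined, pending):
    pending.append(first)  # first candidate passes and becomes pending

# Checked against enshrined rules alone, the second candidate would also pass.
# Including pending candidates exposes the mutual inconsistency.
blocked_by = conformity_gate(second, enshrined, pending)
```

In this sketch `blocked_by` contains the first candidate, so the second is held back instead of both entering the enshrinement pipeline.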

EvalSet provenance flag

Every generated EvalSet carries a provenance flag distinguishing world-model-grounded vectors from customer-directed exploratory vectors.
The observation path and signal routing handle the two categories differently. Customer-directed vectors that probe the world boundary are valid exploration tools; their observations are routed with their provenance context attached.
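
One way to picture the flag is as a field on the EvalSet that every observation inherits and that routing then branches on. The sketch below assumes a two-valued flag; the field names and the route names returned by `route` are placeholders, since this ADR does not define how each category is handled downstream.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Provenance(Enum):
    """The two categories the flag distinguishes (identifiers are assumptions)."""
    WORLD_MODEL_GROUNDED = "world_model_grounded"
    CUSTOMER_DIRECTED = "customer_directed"


@dataclass(frozen=True)
class EvalSet:
    """Hypothetical shape; only the presence of a provenance flag comes from the ADR."""
    eval_set_id: str
    provenance: Provenance
    vectors: Tuple[str, ...]


@dataclass(frozen=True)
class Observation:
    eval_set_id: str
    provenance: Provenance
    finding: str


def observe(eval_set: EvalSet, finding: str) -> Observation:
    """Create an observation that carries its EvalSet's provenance forward."""
    return Observation(eval_set.eval_set_id, eval_set.provenance, finding)


def route(observation: Observation) -> str:
    """Pick a route by provenance. Route names are placeholders, not real paths."""
    if observation.provenance is Provenance.WORLD_MODEL_GROUNDED:
        return "grounded-observations"
    return "exploratory-observations"
```

Because the observation copies the flag from its EvalSet, the distinction survives even after the observation is separated from the set that produced it.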

Consequences

The generation pipeline has three distinct source ingestion paths. Each path feeds the same downstream mutation and fuzzing stages, preserving corpus diversity without diluting the common output contract.
The conformity gate is a first-class architectural component at the promotion boundary, not a convention applied by reviewers. Its input set includes pending candidates to close the race-condition window.
EvalSet provenance becomes a routing signal downstream. The world signal event path (ADR-017) uses the flag to preserve the distinction between grounded and customer-directed observations as they flow to the enshrinement pipeline.
Mutation and fuzzing are structural generation stages, not optional enhancements. They are part of the reason the eval surface resists shallow overfitting.
Customer-directed parameterization (ADR-014) feeds into the generation pipeline as a first-class input and is reflected in the resulting EvalSet provenance flag.
