Decisions

ADR-075: Retire customer-facing eval generation

Context

Under the in-band decision-support shift, Spectral’s customer surface is POST /api/decide (and the equivalent MCP tool) returning a binding { status, work_frame }. Customers consume decisions from deployed decision modules; they do not request or consume evaluation artifacts. The EvaluationFramework / EvalSet primitives that anchored the prior product framing — customer-directed parameterization of world-model eval generation, two-authority composite scoring of agent traces against generated eval sets, three-layer instantiation (global template → workspace instance → changeset snapshot), tournament + verdict outputs — have no role on the customer surface under the shift.

Under the shift, the EvaluationFramework and EvalSet concepts are reimagined as internal-only validation primitives used by the world agent’s eval framework to validate generated rule code across predicate correctness, test fidelity, determinism, runtime safety, trace integrity, and readability. The customer-facing surface for eval authoring and consumption retires.

The four predecessor ADRs — ADR-014, ADR-022, ADR-027, ADR-028 — are a coherent retirement unit: each anchored a piece of the customer-facing eval-generation surface. This ADR consolidates them so the four threads close together rather than as four independent retirements.

Companion ADR-074 (retire scan pipeline) retires the execution side of the scan-and-evaluate flow; this ADR retires the generation side. Together they remove the observe-and-recommend product framing from the decision record.

Decision

The customer-facing eval-generation surface retires. The four predecessor ADRs — ADR-014, ADR-022, ADR-027, ADR-028 — retire as a single unit. ADR-022’s conformity gate and three-source-corpus mechanics are exceptions that carry forward into the world agent’s internal eval framework; everything else in the four ADRs retires. Principles those primitives embodied are tracked separately to their migration targets so conceptual resemblance to retired primitives does not smuggle stale design assumptions into the internal eval framework’s design space.

What retires

The customer-facing EvaluationFramework typed contract between spectral.worlds and spectral.platform; the EvalSet primitive as a customer-consumable artifact; customer-directed parameterization (customers selecting metrics, measurement vectors, coverage areas to steer generation); the EvalSample first-class primitive on the customer-facing surface; RubricDefinition as a customer-facing typed contract; the EvalSet provenance flag distinguishing world-model-grounded from customer-directed vectors; statistically-unique-per-request EvalSet generation; the eval corpus and holdout registry as world-model-internal assets sanitized for customer consumption; the attribution envelope as the customer-facing pinning mechanism.

Per-ADR disposition

ADR-014 — EvaluationFramework as shared contractual type; customer-directed parameterization. The shared contract dissolves because there is no customer-facing eval surface. Customer-directed parameterization retires wholesale; under the shift, customers invoke decisions, not eval generation. Principle migration: “world model presence in assessment is structural and mandatory” migrates and intensifies — the world model is now the executable artifact governing every decision, not merely the adjudicating standard behind eval generation. Goodhart-resistance migrates into the world agent’s internal eval framework as multi-axis scoring (covered in ADR-074’s principle-migration tracking for ADR-020).

ADR-022 — Eval generation architecture. Partial retirement. The customer-facing portions retire: the EvalSet provenance flag distinguishing world-model-grounded from customer-directed vectors, and the customer-facing three-layer instantiation model. The conformity gate carries forward unchanged as one of the two gates at rule enshrinement, alongside the new implementation-readiness gate. The three-source corpus (synthetic + customer-supplied + public) and the mutation/fuzzing patterns carry into the world agent’s internal eval framework as constructs: same shape, new context (the framework now evaluates generated rule code rather than customer agent traces). These are reimagined-as-internal, not direct ports.

ADR-027 — Eval corpus as internal world asset. Scope was the internality of the eval corpus and holdout registry, the sanitization boundary between spectral.worlds and spectral.platform, and the attribution-envelope-only inter-context reference. The internality concern is moot under the shift — there is no inter-context customer-facing eval surface to sanitize across. Principle migration: the eval-corpus-as-versioned-world-asset principle survives as substrate for the internal eval framework (the framework’s corpus of test cases for validating generated rule code is itself a world-model-version asset). The sanitization-boundary principle is no longer needed because no boundary remains.

ADR-028 — Statistically unique EvalSets per request. Scope was the per-request statistical uniqueness of customer-issued EvalSets, the no-distribution-ledger constraint, and corpus-distillation defense via generation variance. With no customer EvalSet requests under the shift, the entire mechanism retires. Principle migration: the generation-variance-as-distillation-defense principle has no clear analog in the internal eval framework today; if the framework later needs distillation defense (e.g., for shared eval corpora across customers using the same world model version), the principle is in reserve, to be authored fresh against the framework’s actual needs rather than ported from this ADR.

What carries forward into the internal eval framework

The world agent generates executable code (predicates and applies_when filters) for each rule under the new authoring-time path (per ADR-081). An internal eval framework validates that generated code across multiple axes — predicate correctness, test fidelity, determinism, runtime safety, trace integrity, readability. This framework consumes the conformity gate’s structural and authoritative invariants (carried forward from ADR-022 unchanged) and runs alongside the new implementation-readiness gate (a five-check pass/fail covering code-gen success, test pass, multi-axis eval pass, deployment readiness, plus a reserved fifth check).
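
As a rough illustration of how the implementation-readiness gate might compose its checks, the sketch below treats the gate as an all-must-pass conjunction over the five checks named above. The names `ReadinessInput` and `implementation_ready`, the rule that every axis must pass, and the default for the reserved fifth check are assumptions made for illustration; this ADR does not specify the framework's architecture.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Axis(Enum):
    """The six validation axes named in this ADR."""
    PREDICATE_CORRECTNESS = "predicate_correctness"
    TEST_FIDELITY = "test_fidelity"
    DETERMINISM = "determinism"
    RUNTIME_SAFETY = "runtime_safety"
    TRACE_INTEGRITY = "trace_integrity"
    READABILITY = "readability"


@dataclass(frozen=True)
class ReadinessInput:
    """Hypothetical input record; field names are illustrative, not a spec."""
    code_gen_succeeded: bool
    tests_passed: bool
    axis_results: dict[Axis, bool]
    deployment_ready: bool
    # Assumption: the reserved fifth check passes until it is defined.
    reserved_check_passed: bool = True


def multi_axis_eval_passed(axis_results: dict[Axis, bool]) -> bool:
    # Assumption: an axis with no recorded result counts as a failure.
    return all(axis_results.get(axis, False) for axis in Axis)


def implementation_ready(inp: ReadinessInput) -> tuple[bool, list[str]]:
    """Return (passed, names of failed checks) for the five-check gate."""
    checks = {
        "code_gen_success": inp.code_gen_succeeded,
        "test_pass": inp.tests_passed,
        "multi_axis_eval_pass": multi_axis_eval_passed(inp.axis_results),
        "deployment_readiness": inp.deployment_ready,
        "reserved_check": inp.reserved_check_passed,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return (not failed, failed)
```

Returning the list of failed check names alongside the boolean keeps the gate pass/fail while still telling the caller which check blocked readiness.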

Three-source corpus mechanics (synthetic + customer-supplied + public) and mutation/fuzzing patterns carry into this framework as constructs. They are not direct ports — the framework’s subject is generated code rather than customer agent traces, so input shape, output shape, and downstream integration all change. The shared name is conceptual orientation only; structural inheritance is not implied.
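
To make the change of subject concrete, here is a minimal sketch of corpus cases tagged by source, a single-field mutation, and a determinism check run against a generated predicate. Everything in it (`CorpusSource`, `CorpusCase`, `mutate`, `is_deterministic`, and the idea that a predicate takes a dict of facts) is hypothetical and chosen for illustration only; the actual framework design remains open.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, Iterable


class CorpusSource(Enum):
    """The three corpus sources named in this ADR."""
    SYNTHETIC = "synthetic"
    CUSTOMER_SUPPLIED = "customer_supplied"
    PUBLIC = "public"


@dataclass(frozen=True)
class CorpusCase:
    """Hypothetical test case: a source tag plus the facts fed to a predicate."""
    source: CorpusSource
    facts: Dict[str, Any]


def mutate(case: CorpusCase, key: str, value: Any) -> CorpusCase:
    """Return a copy of the case with one fact replaced (a minimal mutation)."""
    return CorpusCase(case.source, {**case.facts, key: value})


def is_deterministic(
    predicate: Callable[[Dict[str, Any]], bool],
    cases: Iterable[CorpusCase],
    runs: int = 3,
) -> bool:
    """Check that repeated calls on identical facts always agree."""
    for case in cases:
        # Pass a fresh copy each run so the predicate cannot mutate shared state.
        results = {bool(predicate(dict(case.facts))) for _ in range(runs)}
        if len(results) > 1:
            return False
    return True
```

A predicate that only reads its input, such as `lambda facts: facts["amount"] > 100`, passes this check; one that depends on hidden state (a counter, a clock, a random draw) can fail it.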

The internal eval framework itself is not specified by this ADR. Its architecture is open design space, to be settled when its build-out begins. This ADR closes the door on the customer-facing surface so that internal-framework design starts from the in-band shift’s actual needs, not from carried-over customer-facing constraints.

Alternatives considered

Piecemeal supersession (four independent ADRs). Rejected. The four predecessors share one rationale (sidecar customer-facing eval product) and one fate (retire under the in-band shift); splitting that into four supersession threads multiplies bookkeeping without adding clarity. A single consolidating ADR is the legible unit.

Wait for the internal eval framework’s design to settle before retiring the customer-facing surface. Rejected. The customer-facing surface retires regardless of what the internal framework looks like — its retirement is settled by the shift, not by the internal framework’s specifics. Coupling the two would block this ADR on work that is not yet scoped.

Preserve EvaluationFramework and EvalSet as internal-only types by amending ADR-014 / ADR-022 rather than retiring them. Rejected. The retired customer-facing surface is most of ADR-014’s content. Amending would leave a hollow ADR pointing at a non-existent contract; the resulting record would mislead a future reader more than a clean supersession plus the internal eval framework’s own future ADR. Under the carry-forward-vs-resemblance rule, new internal constructs with conceptual resemblance to retired customer-facing ones are new constructs, not amendments.

Include ADR-030 (authority version as metadata-only across context boundary) in this consolidation. Rejected. ADR-030’s principle (version is metadata; version pinning lives at the request level, not in the carrier) survives the shift — version pinning becomes the world_model_version field on the /api/decide request body (per ADR-077). The specific carrier (EvalSet attribution envelope) retires here, but the principle holds. ADR-030 gets a separate, surgical touch-up rather than full supersession.
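
The request-level pinning that survives from ADR-030 can be sketched as follows. Only `world_model_version`, `status`, and `work_frame` come from this ADR; the `inputs` field, the helper names, and the idea of validating the response client-side are assumptions for illustration, and the authoritative request and response shapes live in ADR-077.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class DecideRequest:
    # From this ADR: version pinning rides this request-level field.
    world_model_version: str
    # Hypothetical: the rest of the request body is not described here.
    inputs: Dict[str, Any] = field(default_factory=dict)


def encode_request(req: DecideRequest) -> str:
    """Serialize the request body for POST /api/decide."""
    return json.dumps(asdict(req), sort_keys=True)


def parse_response(raw: str) -> Tuple[Any, Any]:
    """Extract (status, work_frame) from a decision response body."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response body must be a JSON object")
    missing = {"status", "work_frame"} - set(data)
    if missing:
        raise ValueError(f"response missing fields: {sorted(missing)}")
    return data["status"], data["work_frame"]
```

The version travels as plain metadata on the request, so no carrier object such as the retired EvalSet attribution envelope is needed. Any concrete version strings or status values used with this sketch are placeholders.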

Consequences

ADR-014, ADR-022, ADR-027, and ADR-028 are retired by this decision; those files were deleted and git history holds them. ADR-022 retires only partially: its conformity gate and three-source-corpus mechanics carry forward (per the Decision above); the customer-facing eval-generation portions retire.
Linear scope changes in Phase 4 follow naturally: ~15 eval-set-cluster issues close with Fork-A-style closure-with-note (SPEC-239, SPEC-355–356, SPEC-358–362, SPEC-394–398, plus the rubric-scoring path issues).
Codex page retirements follow: system-design/world-model-system/eval-generation.mdx (whole page). Rewrites follow for reference/primitives.mdx (EvalSet / EvaluationFramework entries) and reference/domain-model.mdx (ER overview).
The LLM-assisted rubric generation in rubric_gen.py, the EvalSet generation pipeline, the customer-directed parameterization API, the EvalSample primitive on the customer surface, and the customer-facing eval delivery endpoints become candidates for removal in Phase 4 build planning.
ADR-030 (authority version as metadata-only across context boundary) needs a surgical touch-up to reflect that version pinning now rides the world_model_version request field rather than the EvalSet attribution envelope. Tracked as a Tier-3 ADR touch-up.
The world agent’s internal eval framework is now the canonical home for principles migrating from the retired customer-facing eval surface (conformity gate, three-source corpus, mutation/fuzzing, multi-axis scoring). Its build-out is a future scope item; this ADR establishes its terms of existence without specifying its architecture.
