
ADR-014: EvaluationFramework as shared contractual type — customer-directed parameterization

Status: Accepted (2026-04-20)

Source: migrated from planning/swms-decisions.md ADR-023 as part of SPEC-270.

Context

The EvaluationFramework is the primary integration surface between spectral.worlds and spectral.platform: worlds generates evaluation frameworks from world model rules, and platform executes optimization scans against them. In the prior Spectral architecture, EvaluationFramework was a platform-owned type whose rubrics were typed as list[dict[str, Any]], leaving them schemaless, unvalidated, and without provenance metadata. That representation is structurally incompatible with SWMS-generated frameworks, which carry rule attribution, provenance tier, and world model version.

An earlier iteration of this ADR left room for customer-authored frameworks existing alongside world-model-generated frameworks. The design interview clarified that customer framework authorship is not a supported authoring model: customers parameterize world-model eval generation rather than authoring independent frameworks that would require grounding against the world model as a separate step.

Decision

The earlier decision to place EvaluationFramework, RubricDefinition, and EvalSample in spectral.core as rich domain types is superseded by ADR-065. Per ADR-065 D1, no domain types live in the kernel; per D2, contracts between contexts that carry typed payloads relocate to <producer>.contracts.events.*. The mandatory authority / authority_version field convention (per ADR-015) remains the pinning mechanism between contexts. The division of work between contexts also remains authoritative: worlds generates evaluation frameworks as output of the eval generation pipeline, and platform executes them.

RubricDefinition replaces the former list[dict[str, Any]] with a typed Pydantic model carrying dimensions, weights, scoring guidance, hard constraint thresholds, and an opaque attribution envelope. EvalSample is a first-class primitive — the unit that worlds generates and platform executes.
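As an illustration of what moving from list[dict[str, Any]] to a typed rubric buys, here is a minimal sketch. It uses standard-library dataclasses as a stand-in for the Pydantic model so that it runs without dependencies. Every field name, the RubricDimension helper, and the rule that weights sum to 1 are assumptions made for the example, not the actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class RubricDimension:
    """One scored dimension of a rubric (hypothetical field names)."""

    name: str
    weight: float
    scoring_guidance: str
    # Hard constraint threshold; None means the dimension has no hard constraint.
    hard_threshold: Optional[float] = None


@dataclass(frozen=True)
class RubricDefinition:
    """Typed replacement for an untyped rubric dict (illustrative only)."""

    dimensions: tuple[RubricDimension, ...]
    # Opaque attribution envelope: carried through unchanged, never interpreted here.
    attribution: Mapping[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.dimensions:
            raise ValueError("a rubric needs at least one dimension")
        names = [d.name for d in self.dimensions]
        if len(set(names)) != len(names):
            raise ValueError("dimension names must be unique")
        if any(d.weight < 0 for d in self.dimensions):
            raise ValueError("weights must be non-negative")
        # Assumption for this sketch: weights are normalized to sum to 1.
        total = sum(d.weight for d in self.dimensions)
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"weights must sum to 1.0, got {total}")
```

A malformed rubric now fails when it is constructed rather than surfacing later during a scan, which is the practical meaning of validation at the type level.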

World model presence in assessment is structural and mandatory; there is no assessment path that bypasses world model grounding. Customers do not author independent evaluation frameworks. Customers parameterize world-model eval generation by selecting metrics, measurement vectors, and coverage areas that direct the generation process. The output is always a world-model-generated eval set with customer-directed parameters.

Unknown-territory behavior is handled at generation time: when customer steering parameters point at territory outside current world model coverage, the system returns a coverage gap notification to the customer and routes the candidate observation internally as a discovery signal. There is no ingestion-time grounding step because there is no independently authored framework to ground.
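The generation-time handling described above can be sketched roughly as follows. SteeringParameters, GenerationOutcome, plan_generation, and the idea of modelling world model coverage as a set of area names are all hypothetical simplifications. The real behavior is specified in ADR-022 and ADR-017.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SteeringParameters:
    """Customer-selected inputs that direct generation (hypothetical shape)."""

    metrics: tuple[str, ...]
    measurement_vectors: tuple[str, ...]
    coverage_areas: tuple[str, ...]


@dataclass(frozen=True)
class GenerationOutcome:
    covered_areas: tuple[str, ...]      # areas that generation can proceed on
    coverage_gaps: tuple[str, ...]      # reported back to the customer
    discovery_signals: tuple[str, ...]  # routed internally as candidate observations


def plan_generation(
    params: SteeringParameters, world_model_coverage: frozenset[str]
) -> GenerationOutcome:
    """Partition the requested areas against current world model coverage."""
    covered = tuple(a for a in params.coverage_areas if a in world_model_coverage)
    gaps = tuple(a for a in params.coverage_areas if a not in world_model_coverage)
    # Each gap produces both a customer-facing notification and an internal signal.
    return GenerationOutcome(
        covered_areas=covered, coverage_gaps=gaps, discovery_signals=gaps
    )
```

The function has no branch that accepts a pre-authored framework, so the only inputs are steering parameters and world model coverage. This mirrors the statement that there is nothing to ground at ingestion time.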

LLM-assisted rubric generation is removed from packages/spectral; this capability is replaced by worlds eval generation.

Consequences

  • The untyped rubrics: list[dict[str, Any]] field is eliminated. All rubric structures are validated at the type level.
  • The three-layer instantiation model (global template → workspace instance → changeset snapshot) remains valid; the top layer’s data source changes from LLM generation to world model generation.
  • The LLM-assisted generation in rubric_gen.py is retired from spectral.platform. The new design has no authoring path for customer-authored frameworks.
  • EvalSample as a first-class type enables clean holdout split operations, world-model-suggested holdout configuration, and deviation record attribution.
  • spectral’s local EvaluationFramework representation references the core type rather than replacing it; evaluation_framework_id on ChangeSet continues to reference the local representation.
  • Customer-directed parameterization is a first-class input to the generation pipeline. The World Agent interprets customer steering parameters against world model structure at generation time.
  • Goodhart-resistance is reinforced: even when a customer directs eval scenarios through parameterization, the world model remains the adjudicating standard.
  • The generation-time coverage gap notification and internal discovery routing are specified in ADR-022 (eval generation architecture) and integrate with the world signal path in ADR-017.
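
To make the holdout consequence above concrete, here is one way a split over first-class EvalSample values might look. This is a sketch under stated assumptions: the EvalSample fields, the salt parameter, and the hash-bucket technique are illustrative choices, not the split method the ADR prescribes.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EvalSample:
    """Hypothetical minimal shape: an id plus the rule the sample is attributed to."""

    sample_id: str
    rule_id: str
    prompt: str


def is_holdout(sample: EvalSample, holdout_fraction: float, salt: str = "v1") -> bool:
    """Deterministically assign a sample to the holdout set by hashing its id."""
    if not 0.0 <= holdout_fraction <= 1.0:
        raise ValueError("holdout_fraction must be within [0, 1]")
    digest = hashlib.sha256(f"{salt}:{sample.sample_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < holdout_fraction * 2**64


def split_holdout(
    samples: Iterable[EvalSample], holdout_fraction: float, salt: str = "v1"
) -> tuple[list[EvalSample], list[EvalSample]]:
    """Return (scan_set, holdout_set); every sample lands in exactly one list."""
    scan: list[EvalSample] = []
    holdout: list[EvalSample] = []
    for sample in samples:
        target = holdout if is_holdout(sample, holdout_fraction, salt) else scan
        target.append(sample)
    return scan, holdout
```

Because assignment depends only on the sample id and the salt, a sample keeps its side of the split when the eval set is regenerated or extended. The rule_id field stays on each sample, so records derived from either side can still be attributed to a rule.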