ADR-028: Statistically unique EvalSets per request

Status: Accepted (2026-04-20)

Source: migrated from planning/swms-decisions.md ADR-038 as part of SPEC-270.

Context

Reusing EvalSets across workspaces creates a distribution ledger problem: if the world model tracks which workspaces have received which instances, it accumulates a client issuance record. That record adds operational complexity, widens the privacy surface, and opens a potential correlation attack vector. Strict per-workspace uniqueness guarantees are also operationally expensive and may reduce coverage consistency across workspaces in the same domain.

Decision

EvalSets are statistically unique per request rather than tracking-unique. An EvalSet is generated fresh for every request, parameterized by a combination of the customer’s steering inputs, a random seed, and the current corpus state. No workspace-level issuance tracking is maintained: the world model does not record which workspaces have received which instances.

Statistical uniqueness is defined as follows: two EvalSets are statistically unique if no more than T% of their instances fall within the semantic similarity threshold of an instance in the other set, where T is a configurable generation parameter. The similarity threshold is the one already used for holdout sample hashing, so the definition reuses existing infrastructure and gives the system a single, consistent uniqueness standard.
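As an illustration of the definition above, here is a minimal sketch of a pairwise uniqueness check. It rests on several assumptions that this ADR does not specify: instances are represented as embedding vectors, similarity is cosine similarity, an instance counts as overlapping when it has at least one near-duplicate in the other set, and the overlap is measured in both directions with the larger value compared against T. The names `overlap_percent` and `statistically_unique` are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def overlap_percent(
    set_a: Sequence[Vector],
    set_b: Sequence[Vector],
    similarity_threshold: float,
) -> float:
    """Percentage of instances in set_a with a near-duplicate in set_b."""
    if not set_a:
        return 0.0
    matched = sum(
        1
        for a in set_a
        if any(cosine_similarity(a, b) >= similarity_threshold for b in set_b)
    )
    return 100.0 * matched / len(set_a)


def statistically_unique(
    set_a: Sequence[Vector],
    set_b: Sequence[Vector],
    similarity_threshold: float,
    t_percent: float,
) -> bool:
    """True if neither set has more than t_percent of its instances
    within the similarity threshold of the other set (assumed reading)."""
    worst = max(
        overlap_percent(set_a, set_b, similarity_threshold),
        overlap_percent(set_b, set_a, similarity_threshold),
    )
    return worst <= t_percent


# Toy data: one of three instances in each set has a near-duplicate.
evalset_a = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
evalset_b = [(1.0, 0.0), (-1.0, 0.0), (0.0, -1.0)]
print(statistically_unique(evalset_a, evalset_b, 0.95, t_percent=50.0))
```

The check is quadratic in the number of instances, which is acceptable for a spot audit but would likely need an approximate nearest-neighbor index at production scale.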

The generation process carries enough variance that a requester cannot produce systematic overlap between requests, even by submitting identical steering inputs multiple times from the same or different workspaces. Two requests with identical steering inputs produce different EvalSets because the seed varies independently of the inputs.
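The independence of the seed from the steering inputs can be sketched as follows. This is an illustrative sketch only: the function names, the use of a 128-bit seed from `secrets`, the SHA-256 mixing step, and the idea of sampling from a candidate pool as a stand-in for generation are all assumptions, not the system's actual mechanism.

```python
import hashlib
import json
import random
import secrets
from typing import Any, Dict, List, Sequence


def new_request_seed() -> int:
    """Fresh per-request seed, drawn independently of any request input."""
    return secrets.randbits(128)


def generation_rng(
    steering_inputs: Dict[str, Any],
    corpus_version: str,
    seed: int,
) -> random.Random:
    """Combine steering inputs, corpus state, and seed into one RNG.

    The seed is the only component that differs between two otherwise
    identical requests, so it alone drives the divergence.
    """
    payload = json.dumps(
        {"steering": steering_inputs, "corpus": corpus_version, "seed": seed},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest, "big"))


def sample_candidates(
    candidates: Sequence[int], k: int, rng: random.Random
) -> List[int]:
    """Stand-in for generation: draw k items from a candidate pool."""
    return rng.sample(candidates, k)


# Two requests with identical steering inputs but independent seeds.
steering = {"domain": "billing", "difficulty": "hard"}
pool = list(range(10_000))
first = sample_candidates(
    pool, 20, generation_rng(steering, "corpus-v1", new_request_seed())
)
second = sample_candidates(
    pool, 20, generation_rng(steering, "corpus-v1", new_request_seed())
)
print(len(set(first) & set(second)), "shared items out of 20")
```

Because nothing about the seed is recorded per workspace, this approach needs no issuance ledger: divergence comes from the seed alone.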

Consequences

  • EvalSets are ephemeral and purposive: generated for a specific request, consumed by a single tournament run, not persisted after use.
  • No workspace-level issuance tracking. The world model maintains no distribution ledger.
  • Corpus distillation via repeated requests is addressed by generation variance, not by access control on a tracked distribution.
  • The holdout boundary is enforced at generation time: the generation process excludes holdout instances from active generation without needing to communicate holdout identities to the requester.
  • The statistical uniqueness threshold T is an operational parameter, set at world model construction time and reviewed as corpus size and domain breadth evolve.
  • K-folding and other advanced holdback mechanisms are deferred to a future eval sophistication iteration.
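The holdout consequence listed above (exclusion at generation time, with no holdout identities sent to the requester) might look like the following sketch. It simplifies heavily: the ADR describes holdout hashing based on a semantic similarity threshold, whereas this example substitutes an exact hash of whitespace- and case-normalized text. `instance_fingerprint` and `exclude_holdout` are hypothetical names.

```python
import hashlib
from typing import Iterable, Iterator, Set


def instance_fingerprint(text: str) -> str:
    """Exact-match fingerprint of normalized text.

    Simplification: a real system would use the semantic similarity
    hashing this ADR refers to, not an exact SHA-256 digest.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def exclude_holdout(
    candidates: Iterable[str],
    holdout_fingerprints: Set[str],
) -> Iterator[str]:
    """Yield only candidates whose fingerprint is not in the holdout set.

    The filter runs inside the generator, so the requester only ever
    sees the surviving instances and never the holdout set itself.
    """
    for candidate in candidates:
        if instance_fingerprint(candidate) not in holdout_fingerprints:
            yield candidate


holdout = {instance_fingerprint("What is 2 + 2?")}
candidates = ["what is  2 + 2?", "Name a prime number."]
print(list(exclude_holdout(candidates, holdout)))
```

Keeping the filter server-side means the holdout boundary holds without any per-workspace record of what was issued.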