ADR-020: Tournament redesign — consistent scoring metric

Status: Accepted (2026-04-20)

Source: migrated from planning/swms-decisions.md ADR-029 as part of SPEC-270.

Context

The existing tournament implementation pre-screens candidates with a simple mean score across traces, while the Verdict phase validates them with dimension-weighted per-agent composite scores. These are different metrics, so a candidate's pre-screen ranking need not match its ranking under validation. This inconsistency is a correctness issue, and it becomes more consequential when eval validity is the product claim. The tournament is therefore not ported to the rebuild; it is redesigned.
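To make the inconsistency concrete, the sketch below compares an unweighted mean with a dimension-weighted composite on the same scores. The dimension names, weights, and score values are invented for illustration and are not taken from the existing implementation. It also simplifies by averaging over dimensions rather than over traces.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical dimension weights; the real weighting scheme is not given in this ADR.
WEIGHTS: Dict[str, float] = {"accuracy": 0.6, "safety": 0.3, "style": 0.1}

# Hypothetical per-dimension scores for two candidates.
CANDIDATES: Dict[str, Dict[str, float]] = {
    "A": {"accuracy": 0.60, "safety": 0.70, "style": 0.95},
    "B": {"accuracy": 0.80, "safety": 0.75, "style": 0.55},
}


def simple_mean(scores: Dict[str, float]) -> float:
    """Unweighted mean: every score counts equally."""
    return mean(scores.values())


def weighted_composite(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean: each dimension contributes in proportion to its weight."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


def rank_by_mean() -> List[str]:
    return sorted(CANDIDATES, key=lambda c: simple_mean(CANDIDATES[c]), reverse=True)


def rank_by_composite() -> List[str]:
    return sorted(
        CANDIDATES,
        key=lambda c: weighted_composite(CANDIDATES[c], WEIGHTS),
        reverse=True,
    )


mean_rank = rank_by_mean()            # ranking a mean-based pre-screen would see
composite_rank = rank_by_composite()  # ranking a weighted verdict would see
```

With these made-up numbers the two orderings are reversed: the unweighted mean prefers A, while the weighted composite prefers B. A pre-screen using the first metric could therefore discard the candidate the second metric would have chosen.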

An earlier framing of this ADR placed the composite metric computation itself in spectral.core. Design review clarified that spectral.core should own the contractual types that anchor consistency, not the computation, so that spectral.core remains a contract layer rather than an implementation layer.

Decision

The tournament is redesigned with a single, consistent composite metric definition used across all pipeline phases — tournament pre-screening, verdict validation, and system card reporting.

The placement of VerdictResult and CompositeScore in spectral.core is superseded by ADR-065. Per ADR-065 D1, no domain types live in the kernel; VerdictResult and CompositeScore are platform-internal types and belong under spectral.platform.* (domain or application layer, per Clean Architecture). The consistency-via-shared-types principle is preserved: the same types are referenced across the pre-screen, verdict, and system-card-projection phases, but they are platform-owned, not kernel-owned.
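As a rough sketch of what consistency via shared types could look like, the example below defines the two types once and has a pre-screen helper and a system card projection both read them. The ADR names VerdictResult and CompositeScore but does not specify their fields, so every field name here, and the helpers prescreen_keep and system_card_row, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Tuple


@dataclass(frozen=True)
class CompositeScore:
    # Field names are assumptions; the ADR names the type, not its shape.
    agent_id: str
    dimension_scores: Mapping[str, float]
    value: float


@dataclass(frozen=True)
class VerdictResult:
    # Field names are assumptions, as above.
    candidate_id: str
    scores: Tuple[CompositeScore, ...]
    accepted: bool


def prescreen_keep(score: CompositeScore, threshold: float) -> bool:
    """Hypothetical pre-screen check: reads CompositeScore.value only."""
    return score.value >= threshold


def system_card_row(result: VerdictResult) -> Dict[str, object]:
    """Hypothetical system card projection: reports the same CompositeScore values."""
    return {
        "candidate": result.candidate_id,
        "accepted": result.accepted,
        "composite_by_agent": {s.agent_id: s.value for s in result.scores},
    }
```

Neither helper recomputes a score. Each phase only reads or produces the shared types, so the value the pre-screen compares is the same value the system card reports.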

The tournament’s role as a cheap pre-screen is preserved; the statistical rigor of the verdict phase is preserved.

The 5-sample hardcoded ceiling is replaced with a configurable parameter; silent 0.0 failure returns are replaced with explicit failure signaling; the async scoring core is covered by unit tests from day one.
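The following minimal sketch shows one way these three changes could fit together: a configurable sample ceiling, an exception in place of a 0.0 return, and a small async core that is easy to unit test. TournamentConfig, ScoringError, score_samples, and the default of 5 are all hypothetical names and values, not the actual implementation.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Sequence


class ScoringError(Exception):
    """Raised when scoring fails; a failure is never encoded as 0.0."""


@dataclass(frozen=True)
class TournamentConfig:
    # Hypothetical default; the point is that the ceiling is a parameter.
    max_samples: int = 5

    def __post_init__(self) -> None:
        if self.max_samples < 1:
            raise ValueError("max_samples must be at least 1")


async def score_samples(
    samples: Sequence[object],
    scorer: Callable[[object], Awaitable[float]],
    config: TournamentConfig,
) -> List[float]:
    """Score up to config.max_samples samples concurrently.

    Returns raw per-sample scores; composite aggregation happens elsewhere.
    """
    selected = list(samples[: config.max_samples])
    if not selected:
        raise ScoringError("no samples to score")

    # Collect exceptions instead of letting the first one hide the rest.
    results = await asyncio.gather(
        *(scorer(s) for s in selected), return_exceptions=True
    )
    failures = [r for r in results if isinstance(r, BaseException)]
    if failures:
        raise ScoringError(
            f"{len(failures)} of {len(selected)} samples failed to score"
        ) from failures[0]
    return [float(r) for r in results]
```

A caller would run this with something like `asyncio.run(score_samples(samples, scorer, TournamentConfig(max_samples=20)))` and handle ScoringError explicitly, so a failed run cannot be mistaken for a candidate that legitimately scored zero.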

Consequences

  • The tournament is explicitly not a direct port of the existing implementation. Any resemblance to existing tournament code requires deliberate review and acceptance.
  • VerdictResult and CompositeScore are platform-internal types per ADR-065; the verdict engine and tournament execution remain implementation details of spectral.platform. Event payloads (where applicable) live in spectral.platform.contracts.events.* per ADR-065 D2.
  • Consistency across pre-screen, verdict, and system card phases is enforced by all phases reading and producing the same platform-internal types, not by calling a shared compute function.
  • Configurable sample ceiling replaces the hardcoded 5-sample limit.
  • Explicit failure signaling replaces silent 0.0 returns. A scoring failure is an error condition, not a score value.
  • End-to-end integration tests covering the tournament → verdict → system card path are a non-negotiable acceptance criterion for the tournament implementation ticket.