ADR-108: Behavioral-completeness as an enforced publish gate

Context

A rule is authored in natural language; codegen produces the predicate that decides it. At v0 the generated tests were a readiness count — a “12/12 tests pass” signal feeding an implementation-readiness gate (ADR-081 D4), best-effort and not a deploy gate. The tests were generated from the predicate, so they could only confirm the predicate does what it does — they could not catch a predicate that does the wrong thing relative to the rule’s intent.

The dogfood program proved this is not enough. A validate_deduction_choice rule whose NL says the outcome depends on the chosen deduction shipped a predicate that ignored chosen_deduction entirely. Because its tests were co-generated from that same predicate, they never varied the input, so the rule published “complete” and the defect surfaced only in the live eval. Co-blind generation (predicate and its tests sharing one lens) cannot detect an incomplete rule.

The fix is a test-driven inversion with an independent lens: derive what the rule should do from its NL statement alone, materialize that into discriminating tests, and generate the predicate to pass them — then make passing a hard publish gate. This is the SPEC-705 thesis, layered on ADR-107’s input-declaration contract.

Decision

D1 — The behavioral spec: an independent lens on the rule

Before the predicate is generated, a behavioral spec is extracted from the rule’s NL text: {intended_inputs, outcome_partition, case_pairs}, that is, the inputs the outcome should depend on, the partition of outcomes over those inputs, and discriminating case-pairs (including negative cases where the rule must not fire). It is derived from the rule text and the sibling-derived context schema only, never from the predicate, so the lens is structurally independent. The spec is persisted on worlds.rules.behavioral_spec (JSONB, migration 20260620020000). A spec the lens cannot make internally consistent within its retry budget (ambiguous or underspecified NL) raises BehavioralSpecError: a semantic failure, not a silent proceed.
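As a rough illustration of the spec's shape, the sketch below models the three fields as Python dataclasses and adds one possible consistency check. The field names come from the text above; the Case type, the treatment of outcome_partition as a plain set of outcome labels, and the specific consistency rules are assumptions made for illustration, not the platform's actual schema or checker.

```python
from dataclasses import dataclass
from typing import Any


class BehavioralSpecError(Exception):
    """Semantic failure: the spec could not be made internally consistent."""


@dataclass
class Case:
    # Hypothetical shape: one scenario and the outcome the rule text implies.
    context: dict[str, Any]
    outcome: str


@dataclass
class BehavioralSpec:
    intended_inputs: tuple[str, ...]
    outcome_partition: tuple[str, ...]  # assumed here: the set of outcome labels
    case_pairs: tuple[tuple[Case, Case], ...]


def check_internal_consistency(spec: BehavioralSpec) -> None:
    """Assumed consistency rules; raises BehavioralSpecError on the first violation."""
    for left, right in spec.case_pairs:
        for case in (left, right):
            if case.outcome not in spec.outcome_partition:
                raise BehavioralSpecError(f"unknown outcome {case.outcome!r}")
        # A discriminating pair must disagree on the outcome...
        if left.outcome == right.outcome:
            raise BehavioralSpecError("case pair does not discriminate")
        # ...and the difference must be explained by intended inputs only.
        keys = set(left.context) | set(right.context)
        differing = {k for k in keys if left.context.get(k) != right.context.get(k)}
        if not differing or not differing <= set(spec.intended_inputs):
            raise BehavioralSpecError(f"pair varies non-intended inputs: {sorted(differing)}")
```

Under these assumptions, a pair that flips chosen_deduction and flips the outcome passes, while a pair whose two cases share an outcome is rejected before any predicate exists.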

D2 — Materialized discriminating tests, anchored to the rule’s outcome

The spec’s discriminating pairs are deterministically materialized into predicate tests (no LLM, no cassette burden): each case asserts applies(context) is (case_outcome == rule_outcome), where rule_outcome is the rule’s static emitted outcome (ADR-106 D1) — the four-state↔boolean mapping that turns “what outcome should this context get” into “should the predicate fire.” Because the spec is anchored to the rule’s authored outcome, a spec that mislabels the firing case (e.g. labels a YELLOW rule’s firing case RED) is rejected structurally rather than producing an unsatisfiable suite. The spec is the test surface — exact discriminating coverage, deterministically rendered.
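The mapping from outcome labels to predicate expectations can be sketched as follows. Only the assertion form, applies(context) is (case_outcome == rule_outcome), comes from the text; the function names, the flat list of (context, outcome) cases, and the exact structural rejection rule (no case carries the rule's outcome) are illustrative assumptions.

```python
from typing import Any, Callable

Context = dict[str, Any]


def materialize(cases: list[tuple[Context, str]], rule_outcome: str) -> list[tuple[Context, bool]]:
    """Turn (context, case_outcome) cases into (context, should_fire) tests."""
    # Assumed structural check: if no case carries the rule's own outcome,
    # the firing case was mislabeled and the suite would be unsatisfiable.
    if rule_outcome not in {outcome for _, outcome in cases}:
        raise ValueError(f"no case is labeled with rule outcome {rule_outcome!r}")
    return [(ctx, case_outcome == rule_outcome) for ctx, case_outcome in cases]


def run_suite(applies: Callable[[Context], bool], tests: list[tuple[Context, bool]]) -> list[Context]:
    """Return the contexts where the predicate disagrees with the spec."""
    return [ctx for ctx, expected in tests if applies(ctx) is not expected]


# A YELLOW rule that should fire only for the itemized deduction.
cases = [
    ({"chosen_deduction": "standard"}, "GREEN"),
    ({"chosen_deduction": "itemized"}, "YELLOW"),
]
tests = materialize(cases, rule_outcome="YELLOW")

# A predicate that ignores chosen_deduction fails the non-firing case.
failures = run_suite(lambda ctx: True, tests)
```

No model call is involved: the same spec and rule outcome always render the same tests, which is what keeps this step deterministic.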

D3 — TDD codegen inversion: the predicate is generated to pass the spec

The materialized suite is a gate inside the predicate generate→check→repair loop, after AST-safety (ADR-083 D2) and declared == read (ADR-107 D2). A candidate that fails the spec’s tests routes back into the bounded repair loop with spec-correlated feedback naming the offending input (red→green). On exhaustion the loop raises a terminal PredicateCodegenError and nothing is enshrined. This is the literal TDD inversion: the test is written from intent first, and the predicate is generated to satisfy it.
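The control flow of such a loop might look like the sketch below. It is a simplified model under stated assumptions: generate stands in for the LLM-backed codegen step, each gate returns feedback text on failure or None on success, and the max_attempts default and the failure_kind strings are hypothetical.

```python
from typing import Callable, Optional

Gate = Callable[[str], Optional[str]]  # feedback on failure, None on pass


class PredicateCodegenError(Exception):
    """Terminal: the repair budget is exhausted and nothing is enshrined."""

    def __init__(self, failure_kind: str, detail: str) -> None:
        super().__init__(f"{failure_kind}: {detail}")
        self.failure_kind = failure_kind
        self.detail = detail


def generate_until_green(
    generate: Callable[[Optional[str]], str],
    gates: list[tuple[str, Gate]],
    max_attempts: int = 3,  # hypothetical budget
) -> str:
    """Run generate -> check -> repair until every gate passes or the budget ends."""
    feedback: Optional[str] = None
    last_kind = "unknown"
    for _ in range(max_attempts):
        candidate = generate(feedback)
        for kind, gate in gates:  # ordered: cheap structural gates first
            feedback = gate(candidate)
            if feedback is not None:
                last_kind = kind
                break  # red: feed this failure into the next attempt
        else:
            return candidate  # green: every gate passed
    raise PredicateCodegenError(last_kind, feedback or "")
```

In this model the gates list would be ordered as the decisions describe: AST-safety, declared == read, the materialized spec suite, and the dead-input gate last.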

D4 — The dead-input mutation gate: every consumed input must matter

A complementary deterministic mutation gate runs last in the loop. Where D2/D3 catch a missing input (the predicate ignores something the spec says matters), the dead-input gate catches a dead input: a key the predicate reads but whose value never moves applies() across the spec’s case space. It re-runs the predicate over substituted values (case values ∪ enum allowed-values ∪ boolean flip) in the same restricted-exec seam (ADR-083); an input with ≥2 testable values that never changes the result is dead and routes to repair, and exhaustion raises. The gate is outcome-independent and conservative (an input with <2 testable values is indeterminate, not rejected). Server-injected keys are excluded.
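A minimal sketch of the mutation idea follows. It calls the predicate directly instead of through the restricted-exec seam, and the function names, argument shapes, and the way substitution values are collected are assumptions for illustration.

```python
from typing import Any, Callable, Iterable

Context = dict[str, Any]


def candidate_values(key: str, cases: list[Context], enum_values: dict[str, Iterable[Any]]) -> list[Any]:
    """Collect case values, enum allowed-values, and boolean flips for one key."""
    values: list[Any] = []

    def add(value: Any) -> None:
        if value not in values:
            values.append(value)

    for ctx in cases:
        if key in ctx:
            add(ctx[key])
            if isinstance(ctx[key], bool):
                add(not ctx[key])
    for value in enum_values.get(key, ()):
        add(value)
    return values


def dead_inputs(
    applies: Callable[[Context], bool],
    read_keys: Iterable[str],
    cases: list[Context],
    enum_values: dict[str, Iterable[Any]],
    server_injected: frozenset = frozenset(),
) -> list[str]:
    """Return read keys whose value never changes applies() across the cases."""
    dead = []
    for key in read_keys:
        if key in server_injected:
            continue  # excluded from the gate
        values = candidate_values(key, cases, enum_values)
        if len(values) < 2:
            continue  # indeterminate, not rejected
        moved = any(
            applies({**ctx, key: value}) != applies(ctx)
            for ctx in cases
            for value in values
        )
        if not moved:
            dead.append(key)
    return dead
```

The check never looks at which outcome is expected, only at whether the result moves, which is why it can run independently of the spec's outcome labels.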

D5 — Behavioral completeness is a hard publish gate, with two-tier remediation

Publication hard-fails unless every enshrined rule is behaviorally complete: assert_behavioral_completeness deterministically re-checks each rule’s persisted spec for internal consistency, pass == total > 0, and no dead input, and blocks the publish (mapped to 422) listing the failing rules. Remediation is two-tier, and an operator never edits code:

Mechanical — an AST-safety, declared == read, or dead-input failure. The World Agent repairs it within the codegen loop (red→green) automatically.
Semantic — a spec-test failure (the NL rule and its intended behavior disagree) or a BehavioralSpecError (ambiguous NL). Surfaced to the operator for chat-steering: the operator revises the rule’s NL and re-verifies; the re-entry regenerates cleanly.

The terminal codegen error carries a failure_kind mapped to the remediation tier so the surface knows which path applies.
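The gate condition and the tier mapping can be sketched together. assert_behavioral_completeness is the name used above, but the signature, the RuleCheck record, the PublishBlocked exception, and the failure_kind strings are all hypothetical; the sketch also omits the D7 grandfathering of input-free rule bodies.

```python
from dataclasses import dataclass

MECHANICAL = "mechanical"  # auto-repaired inside the codegen loop
SEMANTIC = "semantic"      # surfaced to the operator for chat-steering

# Hypothetical failure_kind values mapped to the two remediation tiers.
REMEDIATION_TIER = {
    "ast_safety": MECHANICAL,
    "declared_read_mismatch": MECHANICAL,
    "dead_input": MECHANICAL,
    "spec_test": SEMANTIC,
    "behavioral_spec": SEMANTIC,
}


def remediation_tier(failure_kind: str) -> str:
    return REMEDIATION_TIER[failure_kind]


@dataclass
class RuleCheck:
    rule_id: str
    spec_consistent: bool
    passed: int
    total: int
    dead_inputs: tuple[str, ...] = ()


class PublishBlocked(Exception):
    status_code = 422  # the publish surface maps the block to HTTP 422

    def __init__(self, failing_rules: list[str]) -> None:
        super().__init__(f"behaviorally incomplete rules: {failing_rules}")
        self.failing_rules = failing_rules


def assert_behavioral_completeness(checks: list[RuleCheck]) -> None:
    """Block the publish unless every rule is consistent, fully passing, and live."""
    failing = [
        c.rule_id
        for c in checks
        # passed == total > 0 chains to: passed == total and total > 0
        if not (c.spec_consistent and c.passed == c.total > 0 and not c.dead_inputs)
    ]
    if failing:
        raise PublishBlocked(failing)
```

Note that a rule with zero tests fails the chained comparison, so an empty suite cannot pass vacuously in this model.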

D6 — The enforced suite ships in the deploy bundle

The materialized test source is persisted (worlds.rules.test_source, migration 20260620030000) and shipped in the content-addressed deploy bundle as tests/<rule_id>_test.py, beside a deploy-time behavioral backstop (assert_behavioral_backstop, sibling to ADR-107 D3’s self-verifying input-contract member + OntologyDriftError drift check) that re-runs the shipped suite against the deployed predicate. The behavioral suite is therefore a carried, self-verifying artifact in the immutable bundle — the enforcement survives to deploy, not only to publish.

D7 — The boundary: internal completeness, not real-world truth

The platform proves an internal property: every intended input measurably affects the outcome (D3/D4) and the predicate matches the spec extracted from the rule’s own text (D2/D3). It does not prove the rule is correct about the world — that the tax threshold is the real IRS figure, that the NL captured the operator’s true intent. The empty-spec boundary follows from this: a rule body that reads no caller inputs is complete by construction (nothing can be incompletely consumed) and is grandfathered; an input-reading body with an empty spec is a violation (it is exactly the chosen_deduction defect class). Real-world truth is the operator’s responsibility, surfaced through authoring and review — not something a deterministic gate can or should assert.
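The empty-spec boundary reduces to a small decision table, sketched here. The function name and the returned labels are hypothetical, and how the platform determines whether a body reads caller inputs is outside this sketch.

```python
def empty_spec_verdict(reads_caller_inputs: bool, spec_is_empty: bool) -> str:
    """Classify a rule at the empty-spec boundary (labels are illustrative)."""
    if not spec_is_empty:
        return "gated"  # goes through the normal completeness checks
    if reads_caller_inputs:
        return "violation"  # the chosen_deduction defect class
    return "grandfathered"  # nothing can be incompletely consumed
```

The only path that bypasses the gate is the one where the body reads no caller inputs at all.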

Consequences

An incomplete rule cannot publish. The validate_deduction_choice defect class — a predicate ignoring an input the rule’s intent depends on — is caught at the publish gate by construction, not in a downstream eval.
ADR-081 D4 is amended in lockstep: the per-rule behavioral suite is an enforced publish-gate artifact, shipped in the deploy bundle — not a throwaway readiness count, and not best-effort. No residual “readiness gate / not a deploy gate” framing remains.
Remediation never asks an operator to write or edit code: mechanical failures self-heal in the codegen loop; semantic failures route to NL chat-steering (D5). This keeps the operator in the authoring vocabulary, consistent with ADR-081 D4 + ADR-090’s conversational cockpit.
The gates are deterministic (D2/D4) — no LLM in the enforcement path, so the publish/deploy decision is reproducible. The publish flow runs one LLM-bearing step ahead of these gates: the cross-rule input-vocabulary reconciliation phase (ADR-107 D5), which precedes the deterministic completeness gate and the mint. That naming call is cassette-pinned and memoized (zero LLM on a steady-state republish), and it produces the realigned artifacts the deterministic gate then re-checks — so the enforcement decision stays LLM-free and reproducible; only the reconciliation phase carries a (bounded) cassette burden.
The boundary (D7) is stated, not implied: the platform’s guarantee is internal consistency, and the corpus does not over-claim correctness about the world.
SPEC-658’s strict /decide payload enforcement stays advisory, not strict — the dynamic-key completeness signal that would unlock a strict posture is not built; this gate proves authoring-time completeness, a different axis.

Alternatives considered

Keep generating tests from the predicate and raise the bar (more tests, higher count). Rejected. Co-blind generation cannot detect an incomplete rule no matter how many tests it emits — the tests inherit the predicate’s blind spot. An independent lens (D1) is the only thing that catches “the predicate does the wrong thing relative to intent.”

Make completeness a warning, not a gate (best-effort + human review). Rejected — this was the v0 posture, and the dogfood proved an incomplete rule published clean under it. Pre-launch, an enforced gate has no bypass for input-reading rules (the empty-spec grandfather is scoped to genuinely input-free bodies, D7).

Let operators fix a failing predicate by editing the generated code. Rejected. Operators author in NL; code is the World Agent’s product. Mechanical failures are auto-repaired; semantic failures are fixed by steering the NL (D5). An operator editing generated code would break the authoring model and the provenance chain.

References

ADR-107 — the declared == read input contract this gate enforces completeness on top of
ADR-081 D4 — the generation responsibilities + the readiness-gate framing this strengthens to an enforced gate (amended in lockstep)
ADR-106 D1 — the static rule outcome the materialized tests map applies against
ADR-083 — the AST-safety + restricted-exec seam the spec-test and dead-input gates reuse
ADR-080 D1 — the content-addressed bundle the shipped suite + behavioral backstop ride
