
ADR-057: Failure cluster → Rule candidate signal — `failure_cluster_detected` event with snapshot semantics, operator triage queue

Status: Accepted (2026-04-25) — placement updated 2026-04-30: service relocated from worlds to platform; service renamed from RuleHealthService to FailureClusterService.

Context

FailureClusterService (formerly RuleHealthService) lives in spectral.platform; it observes rule-evaluation outcomes from customer scans and aggregates failures into clusters. SPEC-266 (Ops tool surface) landed. This ADR specifies the signal protocol — how a failure cluster surfaces from FailureClusterService into the operator triage queue. The signal is the input to operator review where they decide whether the cluster warrants enshrining a new rule or modifying an existing one.

Failure clustering is purely a platform concern — it aggregates customer scan-failure observations into actionable triage signals. Rule-domain reasoning lives in worlds; this event surfaces signals about rule failures derived from platform-side scan outcomes, not rule state itself. Producer + consumer are both in spectral.platform (intra-platform notification flow per ADR-070 Tier 3); the event substrate carries the signal between them.

Decision

D1 — Cluster identity: FailureClusterService-minted UUID, persisted, stable across lifetime

cluster_id is a UUID minted by FailureClusterService at first cluster detection and persisted in its own state. Identity is stable across the cluster’s lifetime — re-detection of an existing cluster after a process restart MUST resolve to the same cluster_id (FailureClusterService’s persistence responsibility, not TA-9 protocol).

The downstream consumer keys its display state on cluster_id; updates from new snapshots merge into the same row. Operator dismissal, resolution tracking, and “cluster A is back” detection all key on cluster_id.
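D1’s restart-stable identity implies a persisted mapping from a cluster’s detection signature to its minted UUID. A minimal sketch of that persistence responsibility, assuming a file-backed store — the class name, signature format, and storage shape here are illustrative, not part of the TA-9 protocol:

```python
import json
import uuid
from pathlib import Path


class ClusterIdentityStore:
    """Hypothetical sketch: persist signature -> cluster_id so re-detection
    after a process restart resolves to the same UUID (D1)."""

    def __init__(self, path: Path):
        self.path = path
        self._ids: dict[str, str] = {}
        if path.exists():
            self._ids = json.loads(path.read_text())

    def get_or_mint(self, signature: str) -> uuid.UUID:
        """Return the stable cluster_id for a signature, minting on first detection."""
        if signature not in self._ids:
            self._ids[signature] = str(uuid.uuid4())
            self.path.write_text(json.dumps(self._ids))
        return uuid.UUID(self._ids[signature])
```

Re-constructing the store from the same path after a restart returns the same cluster_id for a re-detected cluster, which is the property the consumer’s row-merge behavior depends on.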

D2 — Snapshot signals; idempotency_key = cluster_id:snapshot_hash

Signals carry the full current snapshot of cluster state — not differentials. The core.event_handled idempotency key for the consumer is composed as f"{cluster_id}:{snapshot_hash}" where snapshot_hash is a content-derived hash of the snapshot fields.

  • Same content = same hash = consumer dedups (e.g., FailureClusterService re-emits an identical snapshot after restart)
  • Different content = different hash = consumer processes (cluster grew, severity bumped, etc.)
  • No sequence-counter management at FailureClusterService

The EventEnvelope.idempotency_key field carries this composite value at publish time.
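The D2 composite can be sketched as follows, assuming snapshot fields are JSON-serializable; the canonical-serialization choice and hash truncation are illustrative assumptions, not protocol requirements:

```python
import hashlib
import json
from uuid import UUID


def snapshot_hash(snapshot: dict) -> str:
    """Content-derived hash over canonically serialized snapshot fields.

    Canonical form (sorted keys) ensures identical content always yields
    the identical hash, so re-emits dedup and real changes process.
    """
    canonical = json.dumps(snapshot, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def idempotency_key(cluster_id: UUID, snapshot: dict) -> str:
    """The f"{cluster_id}:{snapshot_hash}" composite carried in EventEnvelope."""
    return f"{cluster_id}:{snapshot_hash(snapshot)}"
```

An identical re-emitted snapshot produces the same key (consumer dedups); any changed field — failure_count, severity — produces a new key (consumer processes).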

D3 — Delivery: event with full snapshot; consumer materializes inbox

  • event_type: platform.failure_cluster.detected (the unprefixed form failure_cluster_detected was the M5 carry-over name; the wire type carries the producer prefix per ADR-065 D2 typed-payload module convention)
  • source: platform
  • target: platform (intra-platform notification flow; producer = FailureClusterService, consumer = Operations Agent’s cluster-triage handler)
  • Lives in spectral.platform.contracts.events.failure_cluster_detected per ADR-065 D2 (the original spectral.worlds.contracts.events.failure_cluster_detected placement was a v0.2.0 carry-over; failure clustering is purely a platform concern)

Payload (full snapshot):

from datetime import datetime
from typing import Literal
from uuid import UUID

from pydantic import BaseModel

class FailureClusterDetectedPayload(BaseModel):
    cluster_id: UUID
    snapshot_hash: str
    rule_id: UUID
    workspace_id: UUID
    severity: Literal["low", "medium", "high"]
    failure_count: int
    first_observed_at: datetime
    last_observed_at: datetime
    evidence_bundle: list[FailureRef]  # bounded; e.g., top 10 representative failures
    suggested_rule_stub: str | None

Where FailureRef is a small typed record (failure_id, observed_at, summary).
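A sketch of that record, rendered here as a frozen dataclass for brevity; the module may equally define it as a Pydantic model alongside the payload, and field types beyond the named trio are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime
from uuid import UUID


@dataclass(frozen=True)
class FailureRef:
    """Illustrative shape of the (failure_id, observed_at, summary) record
    carried in the bounded evidence_bundle."""

    failure_id: UUID
    observed_at: datetime
    summary: str
```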

Consumer handler (in platform’s Operations Agent) upserts into platform.rule_candidates_pending:

create table platform.rule_candidates_pending (
  cluster_id uuid primary key,
  snapshot_hash text not null,
  rule_id uuid not null,
  workspace_id uuid not null,
  severity text not null check (severity in ('low', 'medium', 'high')),
  failure_count int not null,
  first_observed_at timestamptz not null,
  last_observed_at timestamptz not null,
  evidence_bundle jsonb not null,
  suggested_rule_stub text null,
  -- Operator-managed columns (preserved across snapshot upserts)
  status text not null default 'pending'
    check (status in ('pending', 'in_review', 'dismissed', 'resolved')),
  assigned_to uuid null,
  notes text null,
  -- Lifecycle
  created_at timestamptz not null default now(),
  last_signaled_at timestamptz not null default now(),
  deleted_at timestamptz null
);

Consumer handler upsert behavior:

  • cluster_id matches existing row → update snapshot fields + last_signaled_at; preserve operator-managed columns (status, assigned_to, notes).
  • cluster_id is new → INSERT with status='pending'.

This separation is critical: signal updates refresh cluster state; operator triage state survives signal updates.
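The snapshot-vs-operator-state split can be sketched in miniature with an in-memory dict standing in for platform.rule_candidates_pending; the field list mirrors the table, but the helper name and dict representation are illustrative only:

```python
# Snapshot fields the signal is allowed to overwrite on upsert.
SNAPSHOT_FIELDS = {
    "snapshot_hash", "rule_id", "workspace_id", "severity", "failure_count",
    "first_observed_at", "last_observed_at", "evidence_bundle", "suggested_rule_stub",
}


def upsert_candidate(table: dict, cluster_id: str, snapshot: dict, now: str) -> None:
    """Illustrative consumer upsert (D3): refresh snapshot state, preserve triage state."""
    row = table.get(cluster_id)
    if row is None:
        # New cluster: INSERT with operator state defaulted to 'pending'.
        table[cluster_id] = {
            **snapshot,
            "status": "pending", "assigned_to": None, "notes": None,
            "last_signaled_at": now,
        }
    else:
        # Existing cluster: update snapshot fields + last_signaled_at only;
        # operator-managed columns (status, assigned_to, notes) survive.
        row.update({k: v for k, v in snapshot.items() if k in SNAPSHOT_FIELDS})
        row["last_signaled_at"] = now
```

In Postgres this would be a single `insert ... on conflict (cluster_id) do update` whose SET list names only the snapshot columns and last_signaled_at, leaving operator columns untouched.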

D4 — Severity model: 3-tier low | medium | high at alpha

FailureClusterService derives severity from cluster size + failure-rate + rule-criticality (specifics are FailureClusterService implementation, not TA-9 protocol). The protocol-level commitment is the 3-tier enum.

Forward triggers to numeric or 5-tier:

  • Operator finds 3 buckets too coarse
  • Automated triage logic needs continuous score for sorting/ranking
  • Schema bump per ADR-044 D11 additive-only versioning when triggered

D5 — Ops Agent tools delivered by ADR-060 D8

The operator UX layer — list_failure_clusters(), get_cluster_detail(), triage_cluster(cluster_id, status, notes) — is delivered by ADR-060 D8 (cluster triage tools added to the Ops Agent surface). Pattern matches ADR-054 D6: TA-9’s contract surface is the substrate (event + materialization table); the operator UX is TA-15’s domain.

At alpha (before the consumer epic delivers the TA-15 tool surface), the operator queries platform.rule_candidates_pending directly via Supabase Studio and updates the status column manually. A runbook documents the workflow.

Failure clusters (TA-9) and T3 observations (TA-8) are distinct evidence streams. They DO NOT both feed the same Worlds-side observation pool — that was an over-broad framing during conversational refinement.

Clarified separation:

  • Worlds observation pool (ADR-056 D4) — Worlds-internal evidence accumulator for T3 memories that don’t immediately promote to rule candidates.
  • platform.rule_candidates_pending (D3 here) — operator triage queue for failure clusters detected by FailureClusterService.

Both signal streams MAY converge in operator triage when an operator decides to merge a “T3-observed pattern” with a “rule-failure cluster,” but that’s a downstream operator workflow, not a substrate-level convergence.

Alternatives considered

Hash of (rule_id, failure_signature) cluster identity. Rejected; cluster grows = same hash = updates dedup’d into oblivion; defeats update flow.

Hash of constituent failure IDs. Rejected; cluster grows → new hash → looks like new cluster; defeats identity stability.

Differential signals. Rejected; consumer would have to reconstruct snapshot; producer-side seq-counter management; fewer self-contained signals.

Producer-side table + poll between contexts. Rejected; inbox state belongs at consumer; polling adds no value over event substrate.

Event-only without consumer materialization. Rejected; operator inbox needs persistent triage state; in-memory cache loses dismissals on restart.

5-tier severity at alpha. Rejected; over-engineering for a single operator (cofounder); 3-tier covers triage discrimination at alpha.

Eager TA-15 tool surface in TA-9. Rejected; pre-anchors TA-15; SQL-via-Studio is sufficient operator path at alpha.

Consequences

  • Cluster identity stable across lifetime; operator dismissal/resolution sticks.
  • Snapshot signals self-contained; no differential reconstruction logic.
  • Operator triage state preserved across snapshot updates.
  • Substrate-aligned (ADR-044 native); discoverable for new consumers.
  • Severity 3-tier discriminates triage adequately at alpha.
  • Operator surface is SQL-driven at alpha until ADR-060 D8 lands tools in consumer epic.
  • Severity is a coarse 3-tier; numeric score deferred.
  • evidence_bundle is bounded (~10 entries); high-cardinality clusters need a follow-up read across contexts for full evidence.
  • Cluster identity is a producer-side persistence requirement that TA-9 protocol must trust FailureClusterService to implement.
  • Open risk: if FailureClusterService’s persistence of cluster identity fails (e.g., its state gets reset), every cluster looks “new” again, dedup breaks, operator sees duplicate inbox rows. Mitigation: alarm on cluster_id churn (Sentry alert if count(distinct cluster_id) over 24h exceeds historical baseline by N×).
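The churn-alarm mitigation reduces to a simple threshold check; the factor, baseline source, and alerting hook are assumptions — the ADR commits only to alarming when distinct-cluster counts exceed the historical baseline by N×:

```python
def cluster_id_churn_exceeded(recent_cluster_ids: set[str],
                              baseline_daily_distinct: float,
                              factor: float = 3.0) -> bool:
    """Illustrative churn check: True when distinct cluster_ids seen in the
    last 24h exceed factor x the historical daily baseline (the signal that
    producer-side identity persistence may have been reset)."""
    return len(recent_cluster_ids) > factor * baseline_daily_distinct
```

A True result would fire the Sentry alert; tuning `factor` against real baseline variance is an operational follow-up.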

References

  • ADR-065 — spectral.core admission discipline
  • ADR-036 — Sentry alert on cluster_id churn
  • ADR-044 — event substrate; D11 versioning
  • ADR-054 — D6 alpha-substrate operator surface pattern
  • ADR-056 — D4 Worlds observation pool (distinct stream)
  • ADR-060 — D8 cluster triage tools
  • ADR-070 — simplest-fit ladder; D3 is intra-platform Tier 3 (notification flow) under the ladder
  • TA-9 disposition — SPEC-312 comment 17c67abe
  • TA-9 verification — SPEC-312 comment f13949ee
  • src/spectral/platform/contracts/events/failure_cluster_detected.py
  • Codex system-design/foundations/contract-surfaces/event-substrate.mdx — close-pass updates
  • Codex system-design/agents/agent-architecture.mdx — operator triage flow