ADR-057: Failure cluster → Rule candidate signal — `failure_cluster_detected` event with snapshot semantics, operator triage queue
Status: Accepted (2026-04-25) — placement and naming updated 2026-04-30: service relocated from worlds to platform; RuleHealthService renamed to FailureClusterService.
Context
FailureClusterService (formerly RuleHealthService) lives in spectral.platform; it observes rule-evaluation outcomes from customer scans and aggregates failures into clusters. SPEC-266 (Ops tool surface) landed. This ADR specifies the signal protocol — how a failure cluster surfaces from FailureClusterService into the operator triage queue. The signal is the input to operator review, where the operator decides whether the cluster warrants enshrining a new rule or modifying an existing one.
Failure clustering is purely a platform concern — it aggregates customer scan-failure observations into actionable triage signals. Rule-domain reasoning lives in worlds; this event surfaces signals about rule failures derived from platform-side scan outcomes, not rule state itself. Producer + consumer are both in spectral.platform (intra-platform notification flow per ADR-070 Tier 3); the event substrate carries the signal between them.
Decision
D1 — Cluster identity: FailureClusterService-minted UUID, persisted, stable across lifetime
cluster_id is a UUID minted by FailureClusterService at first cluster detection and persisted in its own state. Identity is stable across the cluster’s lifetime — re-detection of an existing cluster after a process restart MUST resolve to the same cluster_id (FailureClusterService’s persistence responsibility, not TA-9 protocol).
The downstream consumer keys its display state on cluster_id; updates from new snapshots merge into the same row. Operator dismissal, resolution tracking, and “cluster A is back” detection all key on cluster_id.
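A minimal sketch of what that producer-side responsibility could look like, assuming a fingerprint-keyed lookup; the `ClusterIdentityStore` name, the fingerprint scheme, and the in-memory dict (a stand-in for durable storage) are illustrative, not part of the TA-9 protocol:

```python
from uuid import UUID, uuid4


class ClusterIdentityStore:
    """Resolves a detected cluster to a stable cluster_id; mints one on first detection."""

    def __init__(self) -> None:
        # Stand-in for FailureClusterService's durable state; real storage must survive restarts.
        self._by_fingerprint: dict[str, UUID] = {}

    def resolve(self, fingerprint: str) -> UUID:
        existing = self._by_fingerprint.get(fingerprint)
        if existing is not None:
            # Re-detection of a known cluster resolves to the same cluster_id.
            return existing
        minted = uuid4()
        self._by_fingerprint[fingerprint] = minted
        return minted
```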
D2 — Snapshot signals; idempotency_key = cluster_id:snapshot_hash
Signals carry the full current snapshot of cluster state — not differentials. The core.event_handled idempotency key for the consumer is composed as f"{cluster_id}:{snapshot_hash}" where snapshot_hash is a content-derived hash of the snapshot fields.
- Same content = same hash = consumer dedups (e.g., `FailureClusterService` re-emits an identical snapshot after restart)
- Different content = different hash = consumer processes (cluster grew, severity bumped, etc.)
- No sequence-counter management at `FailureClusterService`
The EventEnvelope.idempotency_key field carries this composite value at publish time.
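A minimal sketch of the composition, assuming the snapshot fields serialize deterministically; the SHA-256 choice and the plain-dict snapshot shape here are illustrative, not mandated by this ADR:

```python
import hashlib
import json
from uuid import UUID


def snapshot_hash(snapshot: dict) -> str:
    # Content-derived hash: identical snapshot content always yields the same hash.
    canonical = json.dumps(snapshot, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def idempotency_key(cluster_id: UUID, snapshot: dict) -> str:
    # Composite value carried on EventEnvelope.idempotency_key at publish time.
    return f"{cluster_id}:{snapshot_hash(snapshot)}"
```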
D3 — Delivery: event with full snapshot; consumer materializes inbox
- event_type: `platform.failure_cluster.detected` (the unprefixed form `failure_cluster_detected` was the M5 carry-over name; the wire type carries the producer prefix per ADR-065 D2 typed-payload module convention)
- source: `platform`
- target: `platform` (intra-platform notification flow; producer = `FailureClusterService`, consumer = Operations Agent’s cluster-triage handler)
- Lives in `spectral.platform.contracts.events.failure_cluster_detected` per ADR-065 D2 (the original `spectral.worlds.contracts.events.failure_cluster_detected` placement was a v0.2.0 carry-over; failure clustering is purely a platform concern)
Payload (full snapshot):
```python
from datetime import datetime
from typing import Literal
from uuid import UUID

from pydantic import BaseModel


class FailureClusterDetectedPayload(BaseModel):
    cluster_id: UUID
    snapshot_hash: str
    rule_id: UUID
    workspace_id: UUID
    severity: Literal["low", "medium", "high"]
    failure_count: int
    first_observed_at: datetime
    last_observed_at: datetime
    evidence_bundle: list[FailureRef]  # bounded; e.g., top 10 representative failures
    suggested_rule_stub: str | None
```

Where FailureRef is a small typed record (failure_id, observed_at, summary).
Consumer handler (in platform’s Operations Agent) upserts into platform.rule_candidates_pending:
```sql
create table platform.rule_candidates_pending (
    cluster_id uuid primary key,
    snapshot_hash text not null,
    rule_id uuid not null,
    workspace_id uuid not null,
    severity text not null check (severity in ('low', 'medium', 'high')),
    failure_count int not null,
    first_observed_at timestamptz not null,
    last_observed_at timestamptz not null,
    evidence_bundle jsonb not null,
    suggested_rule_stub text null,
    -- Operator-managed columns (preserved across snapshot upserts)
    status text not null default 'pending' check (status in ('pending', 'in_review', 'dismissed', 'resolved')),
    assigned_to uuid null,
    notes text null,
    -- Lifecycle
    created_at timestamptz not null default now(),
    last_signaled_at timestamptz not null default now(),
    deleted_at timestamptz null
);
```

Consumer handler upsert behavior:
- `cluster_id` matches existing row → update snapshot fields + `last_signaled_at`; preserve operator-managed columns (`status`, `assigned_to`, `notes`).
- `cluster_id` is new → INSERT with `status='pending'`.
This separation is critical: signal updates refresh cluster state; operator triage state survives signal updates.
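A hedged sketch of that upsert, assuming a Postgres `insert ... on conflict` statement with positional parameters; the constant name and parameter style are illustrative, and the real statement lives in the Operations Agent's cluster-triage handler:

```python
# Snapshot fields refresh on every signal; operator-managed columns (status, assigned_to,
# notes) are deliberately absent from the ON CONFLICT update list, so triage state survives.
RULE_CANDIDATE_UPSERT = """
insert into platform.rule_candidates_pending (
    cluster_id, snapshot_hash, rule_id, workspace_id, severity, failure_count,
    first_observed_at, last_observed_at, evidence_bundle, suggested_rule_stub
) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
on conflict (cluster_id) do update set
    snapshot_hash       = excluded.snapshot_hash,
    severity            = excluded.severity,
    failure_count       = excluded.failure_count,
    first_observed_at   = excluded.first_observed_at,
    last_observed_at    = excluded.last_observed_at,
    evidence_bundle     = excluded.evidence_bundle,
    suggested_rule_stub = excluded.suggested_rule_stub,
    last_signaled_at    = now()
"""
```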
D4 — Severity model: 3-tier low | medium | high at alpha
FailureClusterService derives severity from cluster size + failure-rate + rule-criticality (specifics are FailureClusterService implementation, not TA-9 protocol). The protocol-level commitment is the 3-tier enum.
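For illustration only, a sketch of how those inputs might fold into the 3-tier enum; the thresholds and function name are invented, and only the low/medium/high enum is the protocol-level commitment:

```python
from typing import Literal

Severity = Literal["low", "medium", "high"]


def derive_severity(failure_count: int, failure_rate: float, rule_is_critical: bool) -> Severity:
    # Invented thresholds: rule criticality or a large, fast-failing cluster escalates severity.
    if rule_is_critical or failure_rate > 0.25 or failure_count >= 100:
        return "high"
    if failure_rate > 0.05 or failure_count >= 20:
        return "medium"
    return "low"
```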
Forward triggers for moving to a numeric score or a 5-tier model:
- Operator finds 3 buckets too coarse
- Automated triage logic needs continuous score for sorting/ranking
- When triggered: schema bump per ADR-044 D11 additive-only versioning
D5 — Ops Agent tools delivered by ADR-060 D8
The operator UX layer — list_failure_clusters(), get_cluster_detail(), triage_cluster(cluster_id, status, notes) — is delivered by ADR-060 D8 (cluster triage tools added to the Ops Agent surface). Pattern matches ADR-054 D6: TA-9’s contract surface is the substrate (event + materialization table); the operator UX is TA-15’s domain.
At alpha (before the consumer epic delivers the TA-15 tool surface), the operator queries platform.rule_candidates_pending directly via Supabase Studio and updates the status column manually; a runbook documents the workflow.
D6 — Cross-link to TA-8 observation pool
Failure clusters (TA-9) and T3 observations (TA-8) are distinct evidence streams. They DO NOT both feed the same Worlds-side observation pool — that was an over-broad framing during conversational refinement.
Clarified separation:
- Worlds observation pool (ADR-056 D4) — Worlds-internal evidence accumulator for T3 memories that don’t immediately promote to rule candidates.
- `platform.rule_candidates_pending` (D3 here) — operator triage queue for failure clusters detected by `FailureClusterService`.
Both signal streams MAY converge in operator triage when an operator decides to merge a “T3-observed pattern” with a “rule-failure cluster,” but that’s a downstream operator workflow, not a substrate-level convergence.
Alternatives considered
Hash of (rule_id, failure_signature) cluster identity. Rejected; cluster grows = same hash = updates dedup’d into oblivion; defeats update flow.
Hash of constituent failure IDs. Rejected; cluster grows → new hash → looks like new cluster; defeats identity stability.
Differential signals. Rejected; consumer would have to reconstruct the snapshot; producer-side seq-counter management; signals no longer self-contained.
Producer-side table + poll between contexts. Rejected; inbox state belongs at consumer; polling adds no value over event substrate.
Event-only without consumer materialization. Rejected; operator inbox needs persistent triage state; in-memory cache loses dismissals on restart.
5-tier severity at alpha. Rejected; over-engineering for a single operator (cofounder); 3-tier covers triage discrimination at alpha.
Eager TA-15 tool surface in TA-9. Rejected; pre-anchors TA-15; SQL-via-Studio is sufficient operator path at alpha.
Consequences
- Cluster identity stable across lifetime; operator dismissal/resolution sticks.
- Snapshot signals self-contained; no differential reconstruction logic.
- Operator triage state preserved across snapshot updates.
- Substrate-aligned (ADR-044 native); discoverable for new consumers.
- Severity 3-tier discriminates triage adequately at alpha.
- Operator surface is SQL-driven at alpha until ADR-060 D8 lands tools in consumer epic.
- Severity is a coarse 3-tier; numeric score deferred.
- `evidence_bundle` is bounded (~10 entries); high-cardinality clusters need a follow-up read across contexts for full evidence.
- Cluster identity is a producer-side persistence requirement that TA-9 protocol must trust `FailureClusterService` to implement.
- Open risk: if `FailureClusterService`’s persistence of cluster identity fails (e.g., its state gets reset), every cluster looks “new” again, dedup breaks, and the operator sees duplicate inbox rows. Mitigation: alarm on cluster_id churn (Sentry alert if `count(distinct cluster_id) over 24h` exceeds historical baseline by N×).
References
- ADR-065 — `spectral.core` admission discipline
- ADR-036 — Sentry alert on cluster_id churn
- ADR-044 — event substrate; D11 versioning
- ADR-054 — D6 alpha-substrate operator surface pattern
- ADR-056 — D4 Worlds observation pool (distinct stream)
- ADR-060 — D8 cluster triage tools
- ADR-070 — simplest-fit ladder; D3 is intra-platform Tier 3 (notification flow) under the ladder
- TA-9 disposition — SPEC-312 comment 17c67abe
- TA-9 verification — SPEC-312 comment f13949ee
- `src/spectral/platform/contracts/events/failure_cluster_detected.py`
- Codex `system-design/foundations/contract-surfaces/event-substrate.mdx` — close-pass updates
- Codex `system-design/agents/agent-architecture.mdx` — operator triage flow