
ADR-057: Failure cluster → Rule candidate signal — `failure_cluster_detected` event with snapshot semantics, operator triage queue

Status: Accepted (2026-04-25) — placement updated 2026-04-30: service relocated from worlds to platform; service renamed from RuleHealthService to FailureClusterService.

Context

FailureClusterService (formerly RuleHealthService) lives in spectral.platform; it observes rule-evaluation outcomes from customer scans and aggregates failures into clusters. SPEC-266 (Ops tool surface) landed. This ADR specifies the signal protocol — how a failure cluster surfaces from FailureClusterService into the operator triage queue. The signal is the input to operator review where they decide whether the cluster warrants enshrining a new rule or modifying an existing one.

Failure clustering is purely a platform concern — it aggregates customer scan-failure observations into actionable triage signals. Rule-domain reasoning lives in worlds; this event surfaces signals about rule failures derived from platform-side scan outcomes, not rule state itself. Producer + consumer are both in spectral.platform (intra-platform notification flow per ADR-070 Tier 3); the event substrate carries the signal between them.

Decision

D1 — Cluster identity: FailureClusterService-minted UUID, persisted, stable across lifetime

cluster_id is a UUID minted by FailureClusterService at first cluster detection and persisted in its own state. Identity is stable across the cluster’s lifetime — re-detection of an existing cluster after a process restart MUST resolve to the same cluster_id (FailureClusterService’s persistence responsibility, not TA-9 protocol).

The downstream consumer keys its display state on cluster_id; updates from new snapshots merge into the same row. Operator dismissal, resolution tracking, and “cluster A is back” detection all key on cluster_id.
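D1’s restart-stable identity implies a persisted mapping from a cluster’s detection signature to its minted UUID. A minimal sketch of that persistence responsibility, assuming a file-backed store — the class name, signature format, and storage shape here are illustrative, not part of the TA-9 protocol:

```python
import json
import uuid
from pathlib import Path


class ClusterIdentityStore:
    """Hypothetical sketch: persist signature -> cluster_id so re-detection
    after a process restart resolves to the same UUID (D1)."""

    def __init__(self, path: Path):
        self.path = path
        self._ids: dict[str, str] = {}
        if path.exists():
            self._ids = json.loads(path.read_text())

    def get_or_mint(self, signature: str) -> uuid.UUID:
        """Return the stable cluster_id for a signature, minting on first detection."""
        if signature not in self._ids:
            self._ids[signature] = str(uuid.uuid4())
            self.path.write_text(json.dumps(self._ids))
        return uuid.UUID(self._ids[signature])
```

Re-constructing the store from the same path after a restart returns the same cluster_id for a re-detected cluster, which is the property the consumer’s row-merge behavior depends on.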

D2 — Snapshot signals; idempotency_key = cluster_id:snapshot_hash

Signals carry the full current snapshot of cluster state — not differentials. The core.event_handled idempotency key for the consumer is composed as f"{cluster_id}:{snapshot_hash}" where snapshot_hash is a content-derived hash of the snapshot fields.

  • Same content = same hash = consumer dedups (e.g., FailureClusterService re-emits an identical snapshot after restart)
  • Different content = different hash = consumer processes (cluster grew, severity bumped, etc.)
  • No sequence-counter management at FailureClusterService

The EventEnvelope.idempotency_key field carries this composite value at publish time.
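The D2 composite can be sketched as follows, assuming snapshot fields are JSON-serializable; the canonical-serialization choice and hash truncation are illustrative assumptions, not protocol requirements:

```python
import hashlib
import json
from uuid import UUID


def snapshot_hash(snapshot: dict) -> str:
    """Content-derived hash over canonically serialized snapshot fields.

    Canonical form (sorted keys) ensures identical content always yields
    the identical hash, so re-emits dedup and real changes process.
    """
    canonical = json.dumps(snapshot, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def idempotency_key(cluster_id: UUID, snapshot: dict) -> str:
    """The f"{cluster_id}:{snapshot_hash}" composite carried in EventEnvelope."""
    return f"{cluster_id}:{snapshot_hash(snapshot)}"
```

An identical re-emitted snapshot produces the same key (consumer dedups); any changed field — failure_count, severity — produces a new key (consumer processes).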

D3 — Delivery: event with full snapshot; consumer materializes inbox

  • event_type: platform.failure_cluster.detected (the unprefixed form failure_cluster_detected was the M5 carry-over name; the wire type carries the producer prefix per ADR-065 D2 typed-payload module convention)
  • source: platform
  • target: platform (intra-platform notification flow; producer = FailureClusterService, consumer = Operations Agent’s cluster-triage handler)
  • Lives in spectral.platform.contracts.events.failure_cluster_detected per ADR-065 D2 (the original spectral.worlds.contracts.events.failure_cluster_detected placement was a v0.2.0 carry-over; failure clustering is purely a platform concern)

Payload (full snapshot):

from datetime import datetime
from typing import Literal
from uuid import UUID

from pydantic import BaseModel

class FailureClusterDetectedPayload(BaseModel):
    cluster_id: UUID
    snapshot_hash: str
    rule_id: UUID
    workspace_id: UUID
    severity: Literal["low", "medium", "high"]
    failure_count: int
    first_observed_at: datetime
    last_observed_at: datetime
    evidence_bundle: list[FailureRef]  # bounded; e.g., top 10 representative failures
    suggested_rule_stub: str | None

Where FailureRef is a small typed record (failure_id, observed_at, summary).
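A sketch of that record, rendered here as a frozen dataclass for brevity; the module may equally define it as a Pydantic model alongside the payload, and field types beyond the named trio are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime
from uuid import UUID


@dataclass(frozen=True)
class FailureRef:
    """Illustrative shape of the (failure_id, observed_at, summary) record
    carried in the bounded evidence_bundle."""

    failure_id: UUID
    observed_at: datetime
    summary: str
```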

Consumer handler (in platform’s Operations Agent) upserts into platform.rule_candidates_pending:

create table platform.rule_candidates_pending (
  cluster_id uuid primary key,
  snapshot_hash text not null,
  rule_id uuid not null,
  workspace_id uuid not null,
  severity text not null check (severity in ('low', 'medium', 'high')),
  failure_count int not null,
  first_observed_at timestamptz not null,
  last_observed_at timestamptz not null,
  evidence_bundle jsonb not null,
  suggested_rule_stub text null,
  -- Operator-managed columns (preserved across snapshot upserts)
  status text not null default 'pending'
    check (status in ('pending', 'in_review', 'dismissed', 'resolved')),
  assigned_to uuid null,
  notes text null,
  -- Lifecycle
  created_at timestamptz not null default now(),
  last_signaled_at timestamptz not null default now(),
  deleted_at timestamptz null
);

Consumer handler upsert behavior:

  • cluster_id matches existing row → update snapshot fields + last_signaled_at; preserve operator-managed columns (status, assigned_to, notes).
  • cluster_id is new → INSERT with status='pending'.

This separation is critical: signal updates refresh cluster state; operator triage state survives signal updates.
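The snapshot-vs-operator-state split can be sketched in miniature with an in-memory dict standing in for platform.rule_candidates_pending; the field list mirrors the table, but the helper name and dict representation are illustrative only:

```python
# Snapshot fields the signal is allowed to overwrite on upsert.
SNAPSHOT_FIELDS = {
    "snapshot_hash", "rule_id", "workspace_id", "severity", "failure_count",
    "first_observed_at", "last_observed_at", "evidence_bundle", "suggested_rule_stub",
}


def upsert_candidate(table: dict, cluster_id: str, snapshot: dict, now: str) -> None:
    """Illustrative consumer upsert (D3): refresh snapshot state, preserve triage state."""
    row = table.get(cluster_id)
    if row is None:
        # New cluster: INSERT with operator state defaulted to 'pending'.
        table[cluster_id] = {
            **snapshot,
            "status": "pending", "assigned_to": None, "notes": None,
            "last_signaled_at": now,
        }
    else:
        # Existing cluster: update snapshot fields + last_signaled_at only;
        # operator-managed columns (status, assigned_to, notes) survive.
        row.update({k: v for k, v in snapshot.items() if k in SNAPSHOT_FIELDS})
        row["last_signaled_at"] = now
```

In Postgres this would be a single `insert ... on conflict (cluster_id) do update` whose SET list names only the snapshot columns and last_signaled_at, leaving operator columns untouched.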

D4 — Severity model: 3-tier low | medium | high at alpha

FailureClusterService derives severity from cluster size + failure-rate + rule-criticality (specifics are FailureClusterService implementation, not TA-9 protocol). The protocol-level commitment is the 3-tier enum.

Forward triggers to numeric or 5-tier:

  • Operator finds 3 buckets too coarse
  • Automated triage logic needs continuous score for sorting/ranking
  • Schema bump per ADR-044 D11 additive-only versioning when triggered

D5 — Ops Agent tools delivered by ADR-060 D8

The operator UX layer — list_failure_clusters(), get_cluster_detail(), triage_cluster(cluster_id, status, notes) — is delivered by ADR-060 D8 (cluster triage tools added to the Ops Agent surface). Pattern matches ADR-054 D6: TA-9’s contract surface is the substrate (event + materialization table); the operator UX is TA-15’s domain.

At alpha (before the consumer epic delivers the TA-15 tool surface), the operator queries platform.rule_candidates_pending directly via Supabase Studio and updates the status column manually. A runbook documents the workflow.

Failure clusters (TA-9) and T3 observations (TA-8) are distinct evidence streams. They DO NOT both feed the same Worlds-side observation pool — that was an over-broad framing during conversational refinement.

Clarified separation:

  • Worlds observation pool (ADR-056 D4) — Worlds-internal evidence accumulator for T3 memories that don’t immediately promote to rule candidates.
  • platform.rule_candidates_pending (D3 here) — operator triage queue for failure clusters detected by FailureClusterService.

Both signal streams MAY converge in operator triage when an operator decides to merge a “T3-observed pattern” with a “rule-failure cluster,” but that’s a downstream operator workflow, not a substrate-level convergence.

Alternatives considered

Hash of (rule_id, failure_signature) cluster identity. Rejected; cluster grows = same hash = updates dedup’d into oblivion; defeats update flow.

Hash of constituent failure IDs. Rejected; cluster grows → new hash → looks like new cluster; defeats identity stability.

Differential signals. Rejected; consumer would have to reconstruct snapshot; producer-side seq-counter management; fewer self-contained signals.

Producer-side table + poll between contexts. Rejected; inbox state belongs at consumer; polling adds no value over event substrate.

Event-only without consumer materialization. Rejected; operator inbox needs persistent triage state; in-memory cache loses dismissals on restart.

5-tier severity at alpha. Rejected; over-engineering for a single operator (cofounder); 3-tier covers triage discrimination at alpha.

Eager TA-15 tool surface in TA-9. Rejected; pre-anchors TA-15; SQL-via-Studio is sufficient operator path at alpha.

Consequences

  • Cluster identity stable across lifetime; operator dismissal/resolution sticks.
  • Snapshot signals self-contained; no differential reconstruction logic.
  • Operator triage state preserved across snapshot updates.
  • Substrate-aligned (ADR-044 native); discoverable for new consumers.
  • Severity 3-tier discriminates triage adequately at alpha.
  • Operator surface is SQL-driven at alpha until ADR-060 D8 lands tools in consumer epic.
  • Severity is a coarse 3-tier; numeric score deferred.
  • evidence_bundle is bounded (~10 entries); high-cardinality clusters need a follow-up read across contexts for full evidence.
  • Cluster identity is a producer-side persistence requirement that TA-9 protocol must trust FailureClusterService to implement.
  • Open risk: if FailureClusterService’s persistence of cluster identity fails (e.g., its state gets reset), every cluster looks “new” again, dedup breaks, operator sees duplicate inbox rows. Mitigation: alarm on cluster_id churn (Sentry alert if count(distinct cluster_id) over 24h exceeds historical baseline by N×).
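The churn-alarm mitigation reduces to a simple threshold check; the factor, baseline source, and alerting hook are assumptions — the ADR commits only to alarming when distinct-cluster counts exceed the historical baseline by N×:

```python
def cluster_id_churn_exceeded(recent_cluster_ids: set[str],
                              baseline_daily_distinct: float,
                              factor: float = 3.0) -> bool:
    """Illustrative churn check: True when distinct cluster_ids seen in the
    last 24h exceed factor x the historical daily baseline (the signal that
    producer-side identity persistence may have been reset)."""
    return len(recent_cluster_ids) > factor * baseline_daily_distinct
```

A True result would fire the Sentry alert; tuning `factor` against real baseline variance is an operational follow-up.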

References

  • ADR-065 — spectral.core admission discipline
  • ADR-036 — Sentry alert on cluster_id churn
  • ADR-044 — event substrate; D11 versioning
  • ADR-054 — D6 alpha-substrate operator surface pattern
  • ADR-056 — D4 Worlds observation pool (distinct stream)
  • ADR-060 — D8 cluster triage tools
  • ADR-070 — simplest-fit ladder; D3 is intra-platform Tier 3 (notification flow) under the ladder
  • TA-9 disposition — SPEC-312 comment 17c67abe
  • TA-9 verification — SPEC-312 comment f13949ee
  • src/spectral/platform/contracts/events/failure_cluster_detected.py
  • Codex system-design/foundations/contract-surfaces/event-substrate.mdx — close-pass updates
  • Codex system-design/agents/agent-architecture.mdx — operator triage flow