
Checkpointer encryption runbook

Operational procedures for activating envelope encryption on the LangGraph checkpointer when a forward trigger fires.

System reference: Codex system-design/agent-architecture.mdx · ADR-043 D10.


Trigger conditions

Activation triggers are owned by ADR-043 D10; see the ADR for the authoritative list. Until a trigger fires, the checkpointer relies on disk-level encryption, role-scoped DB access, audit logs, and the retention cascade.


Implementation shape

Activation builds an EncryptedSerializer that wraps the SerializerProtocol used by AsyncPostgresSaver. Each workspace gets its own data encryption key (DEK), which is generated and then wrapped via KMS. Unwrapped DEKs are cached with a TTL. The KeyManagementProvider protocol (per ADR-037 D11) is the provider-swap seam.
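The wrapping described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the wrapped serializer exposes a `dumps_typed` / `loads_typed` pair that works on `(type_tag, bytes)` tuples (verify against the installed LangGraph version), and `JsonSerializer`, `ToyCipher`, and the `+enc` tag suffix are hypothetical names introduced only for the demo. `ToyCipher` is not secure; it stands in for an AEAD such as AES-256-GCM keyed by the workspace DEK so that the sketch runs on the standard library alone.

```python
import hashlib
import json
import secrets
from typing import Any, Protocol, Tuple


class Serializer(Protocol):
    """Local stand-in for the (type_tag, bytes) serializer shape assumed here."""

    def dumps_typed(self, obj: Any) -> Tuple[str, bytes]: ...
    def loads_typed(self, data: Tuple[str, bytes]) -> Any: ...


class Cipher(Protocol):
    def encrypt(self, plaintext: bytes) -> bytes: ...
    def decrypt(self, ciphertext: bytes) -> bytes: ...


class JsonSerializer:
    """Hypothetical inner serializer used only for this demo."""

    def dumps_typed(self, obj: Any) -> Tuple[str, bytes]:
        return "json", json.dumps(obj).encode("utf-8")

    def loads_typed(self, data: Tuple[str, bytes]) -> Any:
        type_tag, payload = data
        if type_tag != "json":
            raise ValueError(f"unsupported type tag: {type_tag}")
        return json.loads(payload.decode("utf-8"))


class ToyCipher:
    """NOT SECURE. Placeholder for a real AEAD keyed by the workspace DEK."""

    def __init__(self, key: bytes) -> None:
        self._key = key

    def _keystream(self, nonce: bytes, length: int) -> bytes:
        out, counter = b"", 0
        while len(out) < length:
            block = self._key + nonce + counter.to_bytes(4, "big")
            out += hashlib.sha256(block).digest()
            counter += 1
        return out[:length]

    def encrypt(self, plaintext: bytes) -> bytes:
        nonce = secrets.token_bytes(12)
        stream = self._keystream(nonce, len(plaintext))
        return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))

    def decrypt(self, ciphertext: bytes) -> bytes:
        nonce, body = ciphertext[:12], ciphertext[12:]
        stream = self._keystream(nonce, len(body))
        return bytes(a ^ b for a, b in zip(body, stream))


ENC_SUFFIX = "+enc"  # hypothetical marker appended to the inner type tag


class EncryptedSerializer:
    """Delegates (de)serialization to `inner` and encrypts only the payload bytes."""

    def __init__(self, inner: Serializer, cipher: Cipher) -> None:
        self._inner = inner
        self._cipher = cipher

    def dumps_typed(self, obj: Any) -> Tuple[str, bytes]:
        type_tag, payload = self._inner.dumps_typed(obj)
        return type_tag + ENC_SUFFIX, self._cipher.encrypt(payload)

    def loads_typed(self, data: Tuple[str, bytes]) -> Any:
        type_tag, payload = data
        if not type_tag.endswith(ENC_SUFFIX):
            # Row written before activation: pass through unchanged.
            return self._inner.loads_typed((type_tag, payload))
        inner_tag = type_tag[: -len(ENC_SUFFIX)]
        return self._inner.loads_typed((inner_tag, self._cipher.decrypt(payload)))


serializer = EncryptedSerializer(JsonSerializer(), ToyCipher(secrets.token_bytes(32)))
state = {"thread_id": "t-1", "messages": ["hello"]}
tag, blob = serializer.dumps_typed(state)
restored = serializer.loads_typed((tag, blob))
legacy = serializer.loads_typed(JsonSerializer().dumps_typed(state))
```

Tagging encrypted payloads in the type field lets the same serializer read rows written before and after activation, which may help while a feature flag and backfill are in flight.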

Estimated effort: ~2 engineer-weeks plus KMS IAM setup plus rotation runbook authoring.

Components

  1. KeyManagementProvider protocol in spectral.core.crypto.protocols (or domain-appropriate location).
  2. GcpKmsProvider impl in spectral_workers infrastructure (per ADR-037 D12 KMS reservation; even if compute lives on Render, KMS is the master-key root of trust).
  3. EncryptedSerializer wrapping the checkpointer’s serializer (the LangGraph default; provider-swap seam preserved).
  4. Per-workspace DEK lifecycle:
    • Generate DEK on workspace creation; wrap with KMS key; store wrapped DEK in platform.workspace_keys (or analogous).
    • Cache unwrapped DEK in process memory with TTL (default 1 h).
    • Rotate KMS key per quarterly cadence; re-wrap DEKs without re-encrypting payloads.
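The provider seam and the TTL cache from the list above might look roughly like this. It is a sketch under assumptions: the source names the KeyManagementProvider protocol but not its methods, so `wrap` and `unwrap` are hypothetical, as are `InMemoryProvider` (a test double that hands back opaque handles and performs no real cryptography) and `DekCache`. The clock is injectable so that expiry can be exercised without sleeping.

```python
import secrets
import time
from typing import Callable, Dict, Protocol, Tuple


class KeyManagementProvider(Protocol):
    """Hypothetical method names; the source only names the protocol."""

    def wrap(self, dek: bytes) -> bytes: ...
    def unwrap(self, wrapped_dek: bytes) -> bytes: ...


class InMemoryProvider:
    """Test double: 'wraps' a DEK by returning an opaque handle. No real crypto."""

    def __init__(self) -> None:
        self._vault: Dict[bytes, bytes] = {}
        self.unwrap_calls = 0

    def wrap(self, dek: bytes) -> bytes:
        handle = secrets.token_bytes(16)
        self._vault[handle] = dek
        return handle

    def unwrap(self, wrapped_dek: bytes) -> bytes:
        self.unwrap_calls += 1
        return self._vault[wrapped_dek]


class DekCache:
    """Caches unwrapped DEKs in process memory until their TTL elapses."""

    def __init__(
        self,
        provider: KeyManagementProvider,
        load_wrapped: Callable[[str], bytes],
        ttl_seconds: float = 3600.0,  # 1 h default, as in the list above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._provider = provider
        self._load_wrapped = load_wrapped
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[bytes, float]] = {}

    def get(self, workspace_id: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(workspace_id)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh: no KMS call
        dek = self._provider.unwrap(self._load_wrapped(workspace_id))
        self._entries[workspace_id] = (dek, now + self._ttl)
        return dek


provider = InMemoryProvider()
wrapped_keys = {"ws-1": provider.wrap(secrets.token_bytes(32))}
fake_now = [0.0]
cache = DekCache(provider, wrapped_keys.__getitem__, clock=lambda: fake_now[0])

first = cache.get("ws-1")   # miss: one unwrap call
second = cache.get("ws-1")  # hit: served from memory
calls_after_hit = provider.unwrap_calls
fake_now[0] = 3601.0        # move past the TTL
third = cache.get("ws-1")   # expired: unwrapped again
```

A shorter TTL narrows the window in which a plaintext DEK sits in worker memory at the cost of more KMS calls; the 1 h default comes from the component list.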

Migration

When activated:

  1. Land platform.workspace_keys migration.
  2. Provision KMS keys per environment (spectral-staging-kms, spectral-production-kms).
  3. Deploy a backfill job that generates per-workspace DEKs for existing workspaces.
  4. Deploy the workers update with EncryptedSerializer enabled via feature flag.
  5. Re-encrypt existing checkpointer rows (one-time backfill; runs in workers).
  6. Remove the feature flag once backfill completes.
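Step 5 is easier to operate if the backfill can be re-run safely after an interruption. The sketch below shows one idempotent shape for it, assuming encrypted rows carry a marker in their type tag. The row layout (`workspace_id`, `type`, `payload`), the `+enc` suffix, and `placeholder_encrypt_for` are all hypothetical; in-memory dicts stand in for checkpointer rows, and the placeholder transform is not encryption.

```python
from typing import Callable, Dict, List

ENC_SUFFIX = "+enc"  # hypothetical marker for already-encrypted rows

Row = Dict[str, object]
Encryptor = Callable[[bytes], bytes]


def reencrypt_rows(rows: List[Row], encrypt_for: Callable[[str], Encryptor]) -> int:
    """Encrypt every cleartext row in place and return how many were converted."""
    converted = 0
    for row in rows:
        type_tag = str(row["type"])
        if type_tag.endswith(ENC_SUFFIX):
            continue  # already encrypted: skipping makes re-runs safe
        encrypt = encrypt_for(str(row["workspace_id"]))
        row["payload"] = encrypt(bytes(row["payload"]))
        row["type"] = type_tag + ENC_SUFFIX
        converted += 1
    return converted


def placeholder_encrypt_for(workspace_id: str) -> Encryptor:
    """Placeholder only: tags and reverses bytes. A real job would use the workspace DEK."""
    prefix = f"ENC:{workspace_id}:".encode("utf-8")
    return lambda data: prefix + data[::-1]


rows: List[Row] = [
    {"workspace_id": "ws-1", "type": "json", "payload": b'{"a": 1}'},
    {"workspace_id": "ws-2", "type": "json", "payload": b'{"b": 2}'},
    {"workspace_id": "ws-1", "type": "json+enc", "payload": b"already-ciphertext"},
]

first_pass = reencrypt_rows(rows, placeholder_encrypt_for)
second_pass = reencrypt_rows(rows, placeholder_encrypt_for)  # nothing left to do
```

A second pass that converts zero rows is one possible completion signal before removing the feature flag in step 6.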

Verification

After activation:

-- Spot-check checkpointer rows: payloads should be base64 ciphertext, not the cleartext serializer output.
-- This query reports only the column type and size; inspect a sample payload to confirm it is unreadable.
SELECT pg_typeof(state), octet_length(state)
FROM langgraph.checkpoints
LIMIT 10;

Then run a roundtrip test to confirm that the workers can decrypt and resume an arbitrary thread.


Rotation

Quarterly cadence (mirrors the ADR-062 D5 secrets rotation).

  1. Rotate the KMS key version.
  2. Re-wrap all workspace DEKs against the new key version (no payload re-encryption needed).
  3. Verify a sample of threads decrypts successfully.

Old KMS key versions are retained per the KMS retention policy for audit and emergency decryption.
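The re-wrap step touches only the wrapped DEKs, which is why payloads need no re-encryption. A minimal sketch of that loop follows, assuming each stored DEK records the key version that wrapped it. `WrappedDek`, `VersionedInMemoryKms`, and `rewrap_all` are hypothetical names, and the in-memory KMS is a test double with no real cryptography.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class WrappedDek:
    key_version: int  # KMS key version that wrapped this DEK
    blob: bytes       # opaque wrapped bytes


class VersionedInMemoryKms:
    """Test double with key versions. Wraps via lookup, not real cryptography."""

    def __init__(self) -> None:
        self.current_version = 1
        self._vault: Dict[Tuple[int, bytes], bytes] = {}

    def rotate(self) -> int:
        self.current_version += 1  # old versions remain usable for unwrap
        return self.current_version

    def wrap(self, dek: bytes) -> WrappedDek:
        handle = secrets.token_bytes(16)
        self._vault[(self.current_version, handle)] = dek
        return WrappedDek(self.current_version, handle)

    def unwrap(self, wrapped: WrappedDek) -> bytes:
        return self._vault[(wrapped.key_version, wrapped.blob)]


def rewrap_all(kms: VersionedInMemoryKms, workspace_keys: Dict[str, WrappedDek]) -> int:
    """Re-wrap every DEK not yet on the current key version; return the count."""
    rewrapped = 0
    for workspace_id, wrapped in list(workspace_keys.items()):
        if wrapped.key_version == kms.current_version:
            continue  # already current: safe to re-run after a partial failure
        dek = kms.unwrap(wrapped)  # relies on the old version being retained
        workspace_keys[workspace_id] = kms.wrap(dek)
        rewrapped += 1
    return rewrapped


kms = VersionedInMemoryKms()
dek_1, dek_2 = secrets.token_bytes(32), secrets.token_bytes(32)
workspace_keys = {"ws-1": kms.wrap(dek_1), "ws-2": kms.wrap(dek_2)}

kms.rotate()
first_run = rewrap_all(kms, workspace_keys)
second_run = rewrap_all(kms, workspace_keys)  # idempotent: nothing to do
```

Because the DEK bytes are unchanged, existing ciphertext still decrypts after rotation; only the wrapper around each DEK moves to the new key version.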


Disaster scenarios

  • DEK unwrap fails (KMS outage): workers fail closed; /health returns 503 (auth check fails on Spectral Agent paths). Wait for KMS recovery; verify with sample roundtrip.
  • Workspace DEK lost: the workspace’s checkpointer history becomes unrecoverable. Mitigation: KMS replication; multi-region key-version retention.
  • KMS key destroyed: no workspace DEK can be unwrapped, so the full checkpointer history is unrecoverable. Mitigation: DR runbook escalation; restore from pg_dump (note that the dump contains the wrapped DEKs but not the destroyed KMS material, so it cannot restore decryptability on its own).
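The fail-closed behaviour in the first scenario could be expressed as a health probe along these lines. This is an assumption-laden sketch: the source states only that /health returns 503 when unwrap fails, so the probe mechanism, `health_status`, and both probe functions are hypothetical.

```python
from typing import Callable


def health_status(unwrap_probe: Callable[[], bytes]) -> int:
    """Return 200 if a DEK unwrap succeeds, otherwise 503 (fail closed)."""
    try:
        unwrap_probe()
    except Exception:
        # Any unwrap failure is treated as unhealthy rather than degrading
        # silently to unencrypted reads or writes.
        return 503
    return 200


def healthy_probe() -> bytes:
    """Stand-in for a successful KMS unwrap call."""
    return b"\x00" * 32


def failing_probe() -> bytes:
    """Stand-in for a KMS outage."""
    raise ConnectionError("KMS unreachable")


ok_status = health_status(healthy_probe)
outage_status = health_status(failing_probe)
```

The same probe can double as the post-recovery check: once it returns 200 again, follow up with the sample roundtrip.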

See also

  • ADR-043 — Spectral Agent conversation persistence (D10 forward trigger)
  • ADR-037 — D12 GCP KMS reservation
  • docs/runbooks/secrets-management.md — quarterly rotation cadence
  • docs/runbooks/disaster-recovery.md — DR scenarios