# Deployment topology runbook
Operational procedures for deployment-generation stamping, worker drain, race-free deploy protocols, and core.deployments / core.workers inspection.
System reference: Codex `system-design/deployment-topology.mdx` · ADR-048 · ADR-049 · ADR-053.
## Inspect current generation
```sql
-- Latest deployment generation per environment
SELECT generation, reference, deployed_at, deployed_by, tag
FROM core.deployments
ORDER BY generation DESC
LIMIT 5;
```
```sql
-- What generation are workers reporting
SELECT generation, count(*), max(last_heartbeat_at) AS most_recent_beat
FROM core.workers
WHERE state = 'running'
GROUP BY generation
ORDER BY generation DESC;
```

Healthy state: the workers' max generation matches the head of `core.deployments.generation`. Lag means a deploy is in flight or workers haven't picked up the new generation yet.
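As a quick sketch, the health rule above can be expressed as a pure function, e.g. for a watchdog script that consumes the two query results. `generation_lag` is an illustrative helper, not part of the codebase:

```python
def generation_lag(head_generation: int, worker_generations: list[int]) -> dict:
    """head_generation: latest core.deployments.generation;
    worker_generations: generation of each state='running' worker."""
    if not worker_generations:
        return {"state": "no-workers", "lagging": 0}
    lagging = sum(1 for g in worker_generations if g < head_generation)
    if lagging == 0 and max(worker_generations) == head_generation:
        return {"state": "healthy", "lagging": 0}
    # Deploy in flight, or workers haven't picked up the new generation yet.
    return {"state": "lagging", "lagging": lagging}

print(generation_lag(42, [42, 42, 41]))  # → {'state': 'lagging', 'lagging': 1}
```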
## Inspect outbox by generation
```sql
-- Pending rows by generation + status
SELECT generation, status, count(*)
FROM core.outbox
WHERE deleted_at IS NULL
GROUP BY generation, status
ORDER BY generation DESC, status;
```

Healthy state: `PENDING` rows only at the latest generation; zero rows at older generations (legacy-drained).
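The same rule as a hedged sketch: given the `(generation, status, count)` rows from the query above and the head generation, flag older generations that still hold rows. `stranded_generations` is an illustrative name, not an existing helper:

```python
def stranded_generations(rows, head):
    """rows: (generation, status, count) tuples from the outbox query.
    Returns generations older than head that still hold undrained rows."""
    return {gen for gen, _status, n in rows if gen < head and n > 0}

rows = [(42, "PENDING", 17), (41, "PENDING", 3), (40, "IN_FLIGHT", 1)]
print(sorted(stranded_generations(rows, head=42)))  # → [40, 41]
```

A non-empty result is the trigger for the legacy-drain workflow below.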
## Race-free generation-stamp protocol
The 12-step cutover sequence is in docs/runbooks/deployment.md. Key correctness invariants:
- `SPECTRAL_GENERATION` is per-service (set at deploy time via the Render API), not in env groups. Env-group changes never bump the generation, which makes the “gen-N events processed by gen-N code” guarantee structural.
- The generation is allocated atomically via `INSERT INTO core.deployments RETURNING generation`; there is no race on sequence allocation.
- A workflow concurrency mutex (`concurrency: deploy-prod`) prevents concurrent prod deploys.
- Schema-version gate: green's `/version/detail` must report the expected `schema_version` before the CNAME-flip step.
- 30 s pre-flip sanity check: verify that no blue service has started a redeploy in the last 30 s (env-group race detection).
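The 30 s pre-flip sanity check can be sketched as a pure predicate; `pre_flip_ok` and its inputs are illustrative, not taken from the deploy workflow:

```python
from datetime import datetime, timedelta, timezone

PRE_FLIP_WINDOW = timedelta(seconds=30)

def pre_flip_ok(blue_redeploy_starts, now):
    """Refuse the CNAME flip if any blue service started a redeploy
    within the last 30 s (env-group race detection)."""
    return all(now - start > PRE_FLIP_WINDOW for start in blue_redeploy_starts)

now = datetime(2026, 4, 25, 21, 0, 0, tzinfo=timezone.utc)
print(pre_flip_ok([now - timedelta(seconds=45)], now))  # → True
print(pre_flip_ok([now - timedelta(seconds=10)], now))  # → False
```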
Race C (Render pod crash-restart env-snapshot semantics during a rolling deploy) is materially mitigated for `SPECTRAL_GENERATION` by the per-service placement above. If Race C does fire for the generation specifically, see ADR-049 D5 and ADR-053 D7 for forward considerations.
## Worker drain parameters
| Parameter | Value | Where |
|---|---|---|
| `HANDLER_MAX` | 60 s (`asyncio.wait_for` bound on each handler) | per-service env var |
| `maxShutdownDelaySeconds` | 90 s (`HANDLER_MAX` + 30 s buffer; under Render's 300 s ceiling) | `render.yaml` worker service |
| Reaper interval | 30 s | per-service env var |
| Claim TTL | 300 s (5× `HANDLER_MAX`) | per-service env var |
| `SPECTRAL_DRAIN_COOLING_SECONDS` | 60 s default | per-service env var (legacy-drain workers only) |
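The dependent values in the table derive from `HANDLER_MAX`; a small derivation sketch (the helper and constant names are illustrative, with values taken from the table):

```python
RENDER_SHUTDOWN_CEILING = 300  # Render's ceiling for maxShutdownDelaySeconds

def drain_params(handler_max: int) -> dict:
    """Derive the dependent drain parameters from HANDLER_MAX."""
    params = {
        "HANDLER_MAX": handler_max,
        "maxShutdownDelaySeconds": handler_max + 30,  # 30 s buffer for in-flight handlers
        "claim_ttl": 5 * handler_max,                 # reaper reclaims a claim after 5x
    }
    assert params["maxShutdownDelaySeconds"] < RENDER_SHUTDOWN_CEILING
    return params

print(drain_params(60))
# → {'HANDLER_MAX': 60, 'maxShutdownDelaySeconds': 90, 'claim_ttl': 300}
```

Raising `HANDLER_MAX` therefore means re-checking the 300 s ceiling before touching `render.yaml`.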
## Legacy-drain workflow
When stranded gen-N rows exist (e.g., after rollback path 3), invoke `drain-legacy-generation.yml`:
1. Run the workflow with `target_generation=<N>`.
2. The workflow reads `core.deployments` for the code reference at `target_generation`.
3. It checks out the repo at that reference and deploys a temporary Render worker `workers-drain-gen-<N>` with `SPECTRAL_GENERATION=<N>` and `SPECTRAL_DRAIN_AND_EXIT=true`.
4. The worker drains gen-N events from `core.outbox` and auto-exits after `SPECTRAL_DRAIN_COOLING_SECONDS` of zero pending + in_flight rows.
5. The workflow deletes the temporary service.
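Step 4's cooling logic might look roughly like this: an illustrative, asyncio-free sketch where `drain_and_exit`, its parameters, and the fake clock are assumptions, not the worker's actual implementation:

```python
import time

def drain_and_exit(pending_count, cooling_seconds=60.0, poll_interval=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll the pending+in_flight gen-N row count; exit only after the
    count has stayed at zero for a full cooling window."""
    zero_since = None
    while True:
        if pending_count() > 0:
            zero_since = None              # work reappeared: reset the window
        elif zero_since is None:
            zero_since = clock()           # first zero observation
        elif clock() - zero_since >= cooling_seconds:
            return                         # cooled: safe to auto-exit
        sleep(poll_interval)

# Simulated run with a fake clock: two non-empty polls, then drained.
counts = iter([2, 1, 0, 0, 0, 0])
t = [0.0]
drain_and_exit(lambda: next(counts, 0), cooling_seconds=3, poll_interval=1.0,
               clock=lambda: t[0], sleep=lambda s: t.__setitem__(0, t[0] + s))
print("drained and exited")
```

The reset on any non-zero count is the important part: a late NOTIFY that re-queues a gen-N row restarts the cooling window rather than racing the exit.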
### Recovery from a mid-drain failure
The temporary service stays up. Manual cleanup:
```bash
# Verify drain status
psql -c "SELECT status, count(*) FROM core.outbox WHERE generation = <N> AND deleted_at IS NULL GROUP BY status;"
```
```bash
# Delete the temporary service via the Render API (or dashboard)
curl -X DELETE -H "Authorization: Bearer $RENDER_API_KEY" \
  https://api.render.com/v1/services/<workers-drain-gen-N-service-id>
```

## /version and /version/detail
`/version` (public; minimal):
```bash
curl https://api.runspectral.com/version
# → {"service":"api","environment":"production","generation":42,"tag":"prod-42","color":"green","commit_sha":"a9ba851","deployed_at":"2026-04-25T21:00:00Z"}
```

`/version/detail` (auth-gated via `Authorization: Bearer sk_deploy_...`):
```bash
curl -H "Authorization: Bearer $DEPLOY_KEY" https://api.runspectral.com/version/detail
# → full version.json + runtime/framework/os + check statuses with latency
```

The deploy-key registry lives in env-group secrets (`sk_deploy_*` keys); rotation is a deploy side-effect.
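A sketch of the schema-version gate that consumes this payload; the `schema_version` field value, the sample data, and the helper name are assumptions based on the invariant list above, not a confirmed response shape:

```python
def schema_gate_ok(detail: dict, expected_schema: str, expected_generation: int) -> bool:
    """Gate sketch: green's /version/detail payload must report the expected
    schema_version (and generation) before the CNAME flip proceeds."""
    return (detail.get("schema_version") == expected_schema
            and detail.get("generation") == expected_generation)

detail = {"service": "api", "generation": 42, "schema_version": "2026.04.25-01"}
print(schema_gate_ok(detail, "2026.04.25-01", 42))  # → True
```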
## Worker heartbeat
`core.workers` rows are written by each worker process on a heartbeat interval. Inspect:
```sql
SELECT worker_id, generation, channel, state, last_heartbeat_at, started_at
FROM core.workers
ORDER BY last_heartbeat_at DESC;
```

Stale heartbeats (no update in > 60 s) indicate a worker crash or disconnect. The reaper handles outbox-row recovery; the worker row itself is left in place for ops visibility.
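The staleness rule can be sketched as (illustrative helper, not shipped code):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(seconds=60)

def stale_workers(rows, now):
    """rows: (worker_id, last_heartbeat_at) pairs from core.workers.
    Flags workers with no heartbeat in > 60 s (crash or disconnect)."""
    return [wid for wid, beat in rows if now - beat > STALE_AFTER]

now = datetime(2026, 4, 25, 21, 0, 0, tzinfo=timezone.utc)
rows = [("w-1", now - timedelta(seconds=10)),
        ("w-2", now - timedelta(seconds=120))]
print(stale_workers(rows, now))  # → ['w-2']
```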
## Container image version
```bash
# Inside any container (api / workers / dashboard / operations)
cat /app/version.json
# → {"sha":"...", "short_sha":"...", "describe":"...", "built_at":"...", "uv_lock_sha":"...", "pnpm_lock_sha":"..."}
```

The build-time script `infra/docker/build-version.sh` produces this file; each Dockerfile `COPY`s it into `/app/version.json`.
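A possible cross-check between the baked file and the `/version` payload; `image_matches_deploy` is illustrative and assumes `/version`'s `commit_sha` is a prefix of the baked full `sha`:

```python
import json

def image_matches_deploy(version_json: str, deployed_commit: str) -> bool:
    """True if the baked image's git sha agrees with the commit reported
    for the running deployment (either side may be a short sha)."""
    baked = json.loads(version_json)
    return (baked["sha"].startswith(deployed_commit)
            or deployed_commit.startswith(baked["short_sha"]))

baked = '{"sha": "a9ba851deadbeefcafef00d", "short_sha": "a9ba851"}'
print(image_matches_deploy(baked, "a9ba851"))  # → True
```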
## Cron service inventory
| Cron | Image | Pattern |
|---|---|---|
| `retention-run` | workers (shared) | Pattern A — posts a `retention.run_scheduled` event into the workers substrate |
| `backup-nightly` | `backup-nightly` (dedicated) | Pattern B — runs `tools/ops/backup/backup-nightly.sh` directly |

Pattern B keeps `pg_dump` + `age` + GCS creds off the workers attack surface.
## See also
- ADR-048 — Deployment topology
- ADR-049 — Container strategy
- ADR-053 — CD pipeline orchestration
- Codex deployment topology
- Codex container strategy
- Codex CD pipeline overview
- `docs/runbooks/deployment.md` — 12-step cutover
- `docs/runbooks/legacy-drain.md` — drain protocol
- `docs/runbooks/rollback.md` — rollback decision tree
- `docs/runbooks/event-substrate.md` — outbox + LISTEN/NOTIFY