
Deployment topology runbook

Operational procedures for deployment-generation stamping, worker drain, race-free deploy protocols, and core.deployments / core.workers inspection.

System reference: Codex system-design/deployment-topology.mdx · ADR-048 · ADR-049 · ADR-053.


Inspect current generation

```sql
-- Latest deployment generation per environment
SELECT generation, reference, deployed_at, deployed_by, tag
FROM core.deployments
ORDER BY generation DESC
LIMIT 5;
```

```sql
-- What generation are workers reporting
SELECT generation, count(*), max(last_heartbeat_at) AS most_recent_beat
FROM core.workers
WHERE state = 'running'
GROUP BY generation
ORDER BY generation DESC;
```

Healthy state: the workers’ max generation matches the head generation in core.deployments. A lag means a deploy is in flight or workers haven’t yet picked up the new generation.
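Under the column names shown above, both checks can be collapsed into a single in-sync probe. A sketch, not a shipped query:

```sql
-- Sketch: compare the deployment head with the highest generation
-- reported by any running worker (column names as in the queries above)
SELECT d.head, w.worker_head, d.head = w.worker_head AS in_sync
FROM (SELECT max(generation) AS head FROM core.deployments) AS d
CROSS JOIN (SELECT max(generation) AS worker_head
            FROM core.workers
            WHERE state = 'running') AS w;
```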


Inspect outbox by generation

-- Pending rows by generation + status
SELECT generation, status, count(*)
FROM core.outbox
WHERE deleted_at IS NULL
GROUP BY generation, status
ORDER BY generation DESC, status;

Healthy state: PENDING rows only at the latest generation; zero rows at older generations (legacy generations fully drained).
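A hypothetical follow-up built from the same columns, isolating only stranded legacy rows (generations below the deployment head):

```sql
-- Sketch: undrained rows stuck at generations older than the head
SELECT o.generation, o.status, count(*)
FROM core.outbox AS o
WHERE o.deleted_at IS NULL
  AND o.generation < (SELECT max(generation) FROM core.deployments)
GROUP BY o.generation, o.status
ORDER BY o.generation DESC, o.status;
```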


Race-free generation-stamp protocol

The 12-step cutover sequence is in docs/runbooks/deployment.md. Key correctness invariants:

  1. SPECTRAL_GENERATION is per-service (set at deploy time via Render API), not in env groups. Env-group changes never bump generation. This makes the “gen-N events processed by gen-N code” guarantee structural.
  2. Generation allocated atomically via INSERT INTO core.deployments RETURNING generation. No race on sequence allocation.
  3. Workflow concurrency mutex (concurrency: deploy-prod) prevents concurrent prod deploys.
  4. Schema-version gate — green’s /version/detail must report the expected schema_version before the CNAME flip step.
  5. 30 s pre-flip sanity check — verify no blue service has started a redeploy in the last 30 s (env-group race detection).
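Invariant 2 can be sketched as follows; column names are taken from the inspection query above, and the values (including the deployed_by identity) are illustrative, not confirmed:

```sql
-- Sketch of invariant 2: the generation number is allocated in the same
-- statement that records the deployment, so there is no separate
-- read-then-write race on the sequence
INSERT INTO core.deployments (reference, deployed_by, tag)
VALUES ('a9ba851', 'deploy-bot', 'prod-43')   -- values illustrative only
RETURNING generation;
```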

Race C (Render’s env-snapshot semantics when a pod crash-restarts during a rolling deploy) is materially mitigated for SPECTRAL_GENERATION by the per-service placement above. If Race C ever fires for the generation variable specifically, see ADR-049 D5 and ADR-053 D7 for forward considerations.


Worker drain parameters

| Parameter | Value | Where |
| --- | --- | --- |
| HANDLER_MAX | 60 s (asyncio.wait_for bound on each handler) | per-service env var |
| maxShutdownDelaySeconds | 90 s (HANDLER_MAX + 30 s buffer; under Render’s 300 s ceiling) | render.yaml worker service |
| Reaper interval | 30 s | per-service env var |
| Claim TTL | 300 s (5× HANDLER_MAX) | per-service env var |
| SPECTRAL_DRAIN_COOLING_SECONDS | 60 s default | per-service env var (legacy-drain workers only) |
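The claim TTL can be spot-checked against the outbox directly. Note the status value and claimed_at column below are assumptions about the outbox schema, not confirmed names:

```sql
-- Sketch: rows whose claim has outlived the 300 s TTL and are due for the
-- reaper (IN_FLIGHT and claimed_at are assumed names; adjust to the schema)
SELECT id, generation, claimed_at
FROM core.outbox
WHERE deleted_at IS NULL
  AND status = 'IN_FLIGHT'
  AND claimed_at < now() - interval '300 seconds';
```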

Legacy-drain workflow

When stranded gen-N rows exist (e.g., after rollback path 3), invoke drain-legacy-generation.yml:

  1. Run the workflow with target_generation=<N>.
  2. Workflow reads core.deployments for the code reference at target_generation.
  3. Checks out repo at that reference; deploys a temporary Render worker workers-drain-gen-<N> with SPECTRAL_GENERATION=<N> and SPECTRAL_DRAIN_AND_EXIT=true.
  4. Worker drains gen-N events from core.outbox; auto-exits after SPECTRAL_DRAIN_COOLING_SECONDS of zero pending+in_flight rows.
  5. Workflow deletes the temporary service.
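The auto-exit condition in step 4 amounts to polling a query like the following until it holds continuously for SPECTRAL_DRAIN_COOLING_SECONDS; the status values are assumptions about the outbox state machine:

```sql
-- Sketch: a drain worker for generation N can exit once this stays true
-- for the whole cooling window (status names assumed; N illustrative)
SELECT count(*) = 0 AS drained
FROM core.outbox
WHERE generation = 41              -- target generation N
  AND deleted_at IS NULL
  AND status IN ('PENDING', 'IN_FLIGHT');
```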

Recovery from a mid-drain failure

The temporary service stays up. Manual cleanup:

```sh
# Verify drain status
psql -c "SELECT status, count(*) FROM core.outbox WHERE generation = <N> AND deleted_at IS NULL GROUP BY status;"

# Delete the temporary service via Render API (or dashboard)
curl -X DELETE -H "Authorization: Bearer $RENDER_API_KEY" \
  https://api.render.com/v1/services/<workers-drain-gen-N-service-id>
```

/version and /version/detail

/version (public; minimal):

```sh
curl https://api.runspectral.com/version
# → {"service":"api","environment":"production","generation":42,"tag":"prod-42","color":"green","commit_sha":"a9ba851","deployed_at":"2026-04-25T21:00:00Z"}
```

/version/detail (auth-gated via Authorization: Bearer sk_deploy_...):

```sh
curl -H "Authorization: Bearer $DEPLOY_KEY" https://api.runspectral.com/version/detail
# → full version.json + runtime/framework/os + check statuses with latency
```

The deploy-key registry lives in env-group secrets (sk_deploy_* keys); key rotation therefore takes effect as a side effect of a deploy.


Worker heartbeat

core.workers rows are written by each worker process on a heartbeat interval. Inspect:

```sql
SELECT worker_id, generation, channel, state, last_heartbeat_at, started_at
FROM core.workers
ORDER BY last_heartbeat_at DESC;
```

Stale heartbeats (no update in > 60 s) indicate a worker crash or disconnect. The reaper handles outbox-row recovery; the worker row itself is left for ops visibility.
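A minimal staleness probe using the same columns as above; the 60 s threshold mirrors the guidance in this section:

```sql
-- Sketch: running workers that have not heartbeated in the last 60 s
SELECT worker_id, generation, channel,
       now() - last_heartbeat_at AS silence
FROM core.workers
WHERE state = 'running'
  AND last_heartbeat_at < now() - interval '60 seconds'
ORDER BY last_heartbeat_at;
```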


Container image version

```sh
# Inside any container (api / workers / dashboard / operations)
cat /app/version.json
# → {"sha":"...", "short_sha":"...", "describe":"...", "built_at":"...", "uv_lock_sha":"...", "pnpm_lock_sha":"..."}
```

The build-time script infra/docker/build-version.sh produces this file; each Dockerfile COPYs it into /app/version.json.


Cron service inventory

| Cron | Image | Pattern |
| --- | --- | --- |
| retention-run | workers (shared) | Pattern A — posts retention.run_scheduled event into the workers substrate |
| backup-nightly | backup-nightly (dedicated) | Pattern B — runs tools/ops/backup/backup-nightly.sh directly |

Pattern B keeps pg_dump + age + GCS creds off the workers attack surface.


See also

  • ADR-048 — Deployment topology
  • ADR-049 — Container strategy
  • ADR-053 — CD pipeline orchestration
  • Codex deployment topology
  • Codex container strategy
  • Codex CD pipeline overview
  • docs/runbooks/deployment.md — 12-step cutover
  • docs/runbooks/legacy-drain.md — drain protocol
  • docs/runbooks/rollback.md — rollback decision tree
  • docs/runbooks/event-substrate.md — outbox + LISTEN/NOTIFY