# Deployment topology runbook
Operational procedures for deployment-generation stamping, worker drain, race-free deploy protocols, and core.deployments / core.workers inspection.
System reference: Codex `system-design/deployment-topology.mdx` · ADR-048 · ADR-049 · ADR-053.
## Inspect current generation
```sql
-- Latest deployment generation per environment
SELECT generation, reference, deployed_at, deployed_by, tag
FROM core.deployments
ORDER BY generation DESC
LIMIT 5;
```
```sql
-- What generation are workers reporting
SELECT generation, count(*), max(last_heartbeat_at) AS most_recent_beat
FROM core.workers
WHERE state = 'running'
GROUP BY generation
ORDER BY generation DESC;
```

Healthy state: the workers' max generation matches the head of `core.deployments.generation`. Lag means a deploy is in flight or workers haven't picked up the new generation yet.
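As a quick sketch, the health rule above can be expressed as a pure function, e.g. for a watchdog script that consumes the two query results. `generation_lag` is an illustrative helper, not part of the codebase:

```python
def generation_lag(head_generation: int, worker_generations: list[int]) -> dict:
    """head_generation: latest core.deployments.generation;
    worker_generations: generation of each state='running' worker."""
    if not worker_generations:
        return {"state": "no-workers", "lagging": 0}
    lagging = sum(1 for g in worker_generations if g < head_generation)
    if lagging == 0 and max(worker_generations) == head_generation:
        return {"state": "healthy", "lagging": 0}
    # Deploy in flight, or workers haven't picked up the new generation yet.
    return {"state": "lagging", "lagging": lagging}

print(generation_lag(42, [42, 42, 41]))  # → {'state': 'lagging', 'lagging': 1}
```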
## Inspect outbox by generation
```sql
-- Pending rows by generation + status
SELECT generation, status, count(*)
FROM core.outbox
WHERE deleted_at IS NULL
GROUP BY generation, status
ORDER BY generation DESC, status;
```

Healthy state: `PENDING` rows only at the latest generation; zero rows at older generations (legacy-drained).
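The same rule as a hedged sketch: given the `(generation, status, count)` rows from the query above and the head generation, flag older generations that still hold rows. `stranded_generations` is an illustrative name, not an existing helper:

```python
def stranded_generations(rows, head):
    """rows: (generation, status, count) tuples from the outbox query.
    Returns generations older than head that still hold undrained rows."""
    return {gen for gen, _status, n in rows if gen < head and n > 0}

rows = [(42, "PENDING", 17), (41, "PENDING", 3), (40, "IN_FLIGHT", 1)]
print(sorted(stranded_generations(rows, head=42)))  # → [40, 41]
```

A non-empty result is the trigger for the legacy-drain workflow below.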
## Race-free generation-stamp protocol
The 12-step cutover sequence is in docs/runbooks/deployment.md. Key correctness invariants:
- `SPECTRAL_GENERATION` is per-service (set at deploy time via the Render API), not in env groups. Env-group changes never bump the generation, which makes the “gen-N events processed by gen-N code” guarantee structural.
- The generation is allocated atomically via `INSERT INTO core.deployments RETURNING generation`; there is no race on sequence allocation.
- A workflow concurrency mutex (`concurrency: deploy-prod`) prevents concurrent prod deploys.
- Schema-version gate: green's `/version/detail` must report the expected `schema_version` before the CNAME-flip step.
- 30 s pre-flip sanity check: verify that no blue service has started a redeploy in the last 30 s (env-group race detection).
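The 30 s pre-flip sanity check can be sketched as a pure predicate; `pre_flip_ok` and its inputs are illustrative, not taken from the deploy workflow:

```python
from datetime import datetime, timedelta, timezone

PRE_FLIP_WINDOW = timedelta(seconds=30)

def pre_flip_ok(blue_redeploy_starts, now):
    """Refuse the CNAME flip if any blue service started a redeploy
    within the last 30 s (env-group race detection)."""
    return all(now - start > PRE_FLIP_WINDOW for start in blue_redeploy_starts)

now = datetime(2026, 4, 25, 21, 0, 0, tzinfo=timezone.utc)
print(pre_flip_ok([now - timedelta(seconds=45)], now))  # → True
print(pre_flip_ok([now - timedelta(seconds=10)], now))  # → False
```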
Race C (Render pod crash-restart env-snapshot semantics during a rolling deploy) is materially mitigated for `SPECTRAL_GENERATION` by the per-service placement above. If Race C does fire for the generation specifically, see ADR-049 D5 and ADR-053 D7 for forward considerations.
## Worker drain parameters
| Parameter | Value | Where |
|---|---|---|
| `HANDLER_MAX` | 60 s (`asyncio.wait_for` bound on each handler) | per-service env var |
| `maxShutdownDelaySeconds` | 90 s (`HANDLER_MAX` + 30 s buffer; under Render's 300 s ceiling) | `render.yaml` worker service |
| Reaper interval | 30 s | per-service env var |
| Claim TTL | 300 s (5× `HANDLER_MAX`) | per-service env var |
| `SPECTRAL_DRAIN_COOLING_SECONDS` | 60 s default | per-service env var (legacy-drain workers only) |
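The dependent values in the table derive from `HANDLER_MAX`; a small derivation sketch (the helper and constant names are illustrative, with values taken from the table):

```python
RENDER_SHUTDOWN_CEILING = 300  # Render's ceiling for maxShutdownDelaySeconds

def drain_params(handler_max: int) -> dict:
    """Derive the dependent drain parameters from HANDLER_MAX."""
    params = {
        "HANDLER_MAX": handler_max,
        "maxShutdownDelaySeconds": handler_max + 30,  # 30 s buffer for in-flight handlers
        "claim_ttl": 5 * handler_max,                 # reaper reclaims a claim after 5x
    }
    assert params["maxShutdownDelaySeconds"] < RENDER_SHUTDOWN_CEILING
    return params

print(drain_params(60))
# → {'HANDLER_MAX': 60, 'maxShutdownDelaySeconds': 90, 'claim_ttl': 300}
```

Raising `HANDLER_MAX` therefore means re-checking the 300 s ceiling before touching `render.yaml`.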
## Legacy-drain workflow
When stranded gen-N rows exist (e.g., after rollback path 3), invoke `drain-legacy-generation.yml`:
1. Run the workflow with `target_generation=<N>`.
2. The workflow reads `core.deployments` for the code reference at `target_generation`.
3. It checks out the repo at that reference and deploys a temporary Render worker `workers-drain-gen-<N>` with `SPECTRAL_GENERATION=<N>` and `SPECTRAL_DRAIN_AND_EXIT=true`.
4. The worker drains gen-N events from `core.outbox` and auto-exits after `SPECTRAL_DRAIN_COOLING_SECONDS` of zero pending + in_flight rows.
5. The workflow deletes the temporary service.
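Step 4's cooling logic might look roughly like this: an illustrative, asyncio-free sketch where `drain_and_exit`, its parameters, and the fake clock are assumptions, not the worker's actual implementation:

```python
import time

def drain_and_exit(pending_count, cooling_seconds=60.0, poll_interval=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll the pending+in_flight gen-N row count; exit only after the
    count has stayed at zero for a full cooling window."""
    zero_since = None
    while True:
        if pending_count() > 0:
            zero_since = None              # work reappeared: reset the window
        elif zero_since is None:
            zero_since = clock()           # first zero observation
        elif clock() - zero_since >= cooling_seconds:
            return                         # cooled: safe to auto-exit
        sleep(poll_interval)

# Simulated run with a fake clock: two non-empty polls, then drained.
counts = iter([2, 1, 0, 0, 0, 0])
t = [0.0]
drain_and_exit(lambda: next(counts, 0), cooling_seconds=3, poll_interval=1.0,
               clock=lambda: t[0], sleep=lambda s: t.__setitem__(0, t[0] + s))
print("drained and exited")
```

The reset on any non-zero count is the important part: a late NOTIFY that re-queues a gen-N row restarts the cooling window rather than racing the exit.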
### Recovery from a mid-drain failure
The temporary service stays up. Manual cleanup:
```bash
# Verify drain status
psql -c "SELECT status, count(*) FROM core.outbox WHERE generation = <N> AND deleted_at IS NULL GROUP BY status;"
```
```bash
# Delete the temporary service via the Render API (or dashboard)
curl -X DELETE -H "Authorization: Bearer $RENDER_API_KEY" \
  https://api.render.com/v1/services/<workers-drain-gen-N-service-id>
```

## /version and /version/detail
`/version` (public; minimal):
```bash
curl https://api.runspectral.com/version
# → {"service":"api","environment":"production","generation":42,"tag":"prod-42","color":"green","commit_sha":"a9ba851","deployed_at":"2026-04-25T21:00:00Z"}
```

`/version/detail` (auth-gated via `Authorization: Bearer sk_deploy_...`):
```bash
curl -H "Authorization: Bearer $DEPLOY_KEY" https://api.runspectral.com/version/detail
# → full version.json + runtime/framework/os + check statuses with latency
```

The deploy-key registry lives in env-group secrets (`sk_deploy_*` keys); rotation is a deploy side-effect.
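A sketch of the schema-version gate that consumes this payload; the `schema_version` field value, the sample data, and the helper name are assumptions based on the invariant list above, not a confirmed response shape:

```python
def schema_gate_ok(detail: dict, expected_schema: str, expected_generation: int) -> bool:
    """Gate sketch: green's /version/detail payload must report the expected
    schema_version (and generation) before the CNAME flip proceeds."""
    return (detail.get("schema_version") == expected_schema
            and detail.get("generation") == expected_generation)

detail = {"service": "api", "generation": 42, "schema_version": "2026.04.25-01"}
print(schema_gate_ok(detail, "2026.04.25-01", 42))  # → True
```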
## Worker heartbeat
`core.workers` rows are written by each worker process on a heartbeat interval. Inspect:
```sql
SELECT worker_id, generation, channel, state, last_heartbeat_at, started_at
FROM core.workers
ORDER BY last_heartbeat_at DESC;
```

Stale heartbeats (no update in > 60 s) indicate a worker crash or disconnect. The reaper handles outbox-row recovery; the worker row itself is left in place for ops visibility.
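The staleness rule can be sketched as (illustrative helper, not shipped code):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(seconds=60)

def stale_workers(rows, now):
    """rows: (worker_id, last_heartbeat_at) pairs from core.workers.
    Flags workers with no heartbeat in > 60 s (crash or disconnect)."""
    return [wid for wid, beat in rows if now - beat > STALE_AFTER]

now = datetime(2026, 4, 25, 21, 0, 0, tzinfo=timezone.utc)
rows = [("w-1", now - timedelta(seconds=10)),
        ("w-2", now - timedelta(seconds=120))]
print(stale_workers(rows, now))  # → ['w-2']
```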
## Container image version
```bash
# Inside any container (api / workers / dashboard / operations)
cat /app/version.json
# → {"sha":"...", "short_sha":"...", "describe":"...", "built_at":"...", "uv_lock_sha":"...", "pnpm_lock_sha":"..."}
```

The build-time script `infra/docker/build-version.sh` produces this file; each Dockerfile `COPY`s it into `/app/version.json`.
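A possible cross-check between the baked file and the `/version` payload; `image_matches_deploy` is illustrative and assumes `/version`'s `commit_sha` is a prefix of the baked full `sha`:

```python
import json

def image_matches_deploy(version_json: str, deployed_commit: str) -> bool:
    """True if the baked image's git sha agrees with the commit reported
    for the running deployment (either side may be a short sha)."""
    baked = json.loads(version_json)
    return (baked["sha"].startswith(deployed_commit)
            or deployed_commit.startswith(baked["short_sha"]))

baked = '{"sha": "a9ba851deadbeefcafef00d", "short_sha": "a9ba851"}'
print(image_matches_deploy(baked, "a9ba851"))  # → True
```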
## Cron service inventory
| Cron | Image | Pattern |
|---|---|---|
| `retention-run` | workers (shared) | Pattern A — posts a `retention.run_scheduled` event into the workers substrate |
| `backup-nightly` | `backup-nightly` (dedicated) | Pattern B — runs `tools/ops/backup/backup-nightly.sh` directly |

Pattern B keeps `pg_dump` + `age` + GCS creds off the workers attack surface.
## See also
- ADR-048 — Deployment topology
- ADR-049 — Container strategy
- ADR-053 — CD pipeline orchestration
- Codex deployment topology
- Codex container strategy
- Codex CD pipeline overview
- `docs/runbooks/deployment.md` — 12-step cutover
- `docs/runbooks/legacy-drain.md` — drain protocol
- `docs/runbooks/rollback.md` — rollback decision tree
- `docs/runbooks/event-substrate.md` — outbox + LISTEN/NOTIFY