Legacy-drain runbook
Operational procedure for draining outbox rows stamped at a prior deployment generation. Used after a rollback to clear gen-(N) events that no longer have a worker pool processing them. See ADR-053 D13 for the workflow contract and ADR-046 D8 for the generation-stamping guarantee that makes this protocol coherent.
When to run
Use legacy-drain in any of the following situations:
- After a Class 2 rollback (rollback.md): green workers stamped events at gen-(N+1) during the brief green-traffic window, and gen-(N+1) workers are no longer running. Drain target_generation=N+1.
- After a Class 3 rollback: prior code shipped at gen-(N+2) is serving traffic; gen-(N+1) is stranded. Drain target_generation=N+1.
- Routine cleanup: under normal operation, the rolling-deploy window leaves outbox rows at the prior generation that drain naturally as the new-gen workers process them. If a worker pool was killed mid-drain (rare; usually a Render incident), legacy-drain finishes the work.
Do not use legacy-drain for:
- Active rollbacks where the prior generation is still running (the active workers will handle their own outbox).
- Outbox rows from the current generation (they are not legacy).
Workflow contract
The drain runs as a manually-dispatched GitHub Actions workflow.
Workflow: .github/workflows/drain-legacy-generation.yml (lands per ADR-053 D13 with the deploy substrate; .github/workflows/ ships only ci.yml, release.yml, generate-sbom.yml today).
Concurrency: shares the deploy-prod concurrency mutex; queues behind in-flight prod deploys and blocks them while running.
Dispatch input: target_generation: integer — the generation
number whose outbox rows must be drained.
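As an illustration of the dispatch contract, the sketch below builds (but does not send) a GitHub REST "create a workflow dispatch event" request for this workflow. The owner, repo, ref, and token values are placeholders, and the non-negative check on target_generation is an assumption rather than something the ADRs state.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"
WORKFLOW_FILE = "drain-legacy-generation.yml"  # workflow file named in this runbook


def build_dispatch_request(owner: str, repo: str, ref: str,
                           target_generation: int, token: str) -> urllib.request.Request:
    """Build a workflow_dispatch request for the drain without sending it."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(target_generation, bool) or not isinstance(target_generation, int):
        raise TypeError("target_generation must be an integer")
    if target_generation < 0:  # assumption: generations are never negative
        raise ValueError("target_generation must be non-negative")

    url = (f"{GITHUB_API}/repos/{owner}/{repo}"
           f"/actions/workflows/{WORKFLOW_FILE}/dispatches")
    # Dispatch inputs are sent as strings; the workflow parses the integer.
    body = {"ref": ref, "inputs": {"target_generation": str(target_generation)}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Passing the result to urllib.request.urlopen would send the request. Because the workflow shares the deploy-prod concurrency mutex, a dispatched run may sit queued behind an in-flight prod deploy before it starts.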
Steps
- Lookup reference for target_generation in core.deployments. Abort with a clear error if no row matches; the workflow input is wrong or core.deployments is corrupted.
- Checkout repo at the resolved reference. This is the commit SHA the prior generation was deployed from. The build context matches what was running at gen-(target).
- Build and deploy a temporary Render worker service named workers-drain-gen-<target_generation>. Configuration:
  - SPECTRAL_GENERATION=<target_generation>
  - SPECTRAL_DRAIN_AND_EXIT=true
  - SPECTRAL_DRAIN_COOLING_SECONDS: optional override; defaults to 60 seconds
  - All other env vars match the production worker service for that generation
- Monitor for drain completion. Two signals, in priority order:
  - Authoritative: the log line "drain complete, exiting". The worker emits this after processing the last legacy event and waiting SPECTRAL_DRAIN_COOLING_SECONDS for in-flight handlers and re-pending the cooler.
  - Fallback: poll core.outbox for zero rows in state IN ('pending', 'in_flight') at the target generation. Used if the log-line signal is missed (log shipping lag, log parsing failure).
- Delete the temporary service. Call Render API DELETE /v1/services/{service_id}. Verify deletion via API.
- Workflow completion. Workflow succeeds when the temporary service is confirmed deleted and outbox is verified clean at the target generation.
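The monitoring step can be sketched as a loop that checks the authoritative log line first and the outbox poll second. This is a minimal illustration, not the workflow's actual implementation: log_lines and count_unfinished_rows are hypothetical callables standing in for Render log retrieval and the core.outbox query, and the 15-second poll interval is an assumption. The 30-minute timeout comes from the failure-modes table.

```python
import time
from typing import Callable, Iterable

DRAIN_COMPLETE_LINE = "drain complete, exiting"  # log line named in the steps


def wait_for_drain(
    log_lines: Callable[[], Iterable[str]],      # hypothetical: recent worker logs
    count_unfinished_rows: Callable[[], int],    # hypothetical: pending + in_flight
    timeout_seconds: float = 30 * 60,
    poll_interval_seconds: float = 15.0,         # assumed value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Return which signal confirmed the drain: 'log_line' or 'outbox_poll'."""
    deadline = clock() + timeout_seconds
    while True:
        # Authoritative signal: the worker's own completion log line.
        if any(DRAIN_COMPLETE_LINE in line for line in log_lines()):
            return "log_line"
        # Fallback signal: nothing left to process at the target generation.
        if count_unfinished_rows() == 0:
            return "outbox_poll"
        if clock() >= deadline:
            raise TimeoutError("drain-complete signal never appeared")
        sleep(poll_interval_seconds)
```

The fallback returns as soon as the poll reaches zero, so it does not by itself account for the cooling period the worker observes before logging completion; a caller that relies on the fallback may want to wait SPECTRAL_DRAIN_COOLING_SECONDS and poll again before deleting the service.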
Failure modes
| Failure | Action |
|---|---|
| core.deployments row not found | Abort; verify target_generation; do not synthesize a row |
| Render deploy of temporary service fails | Abort; investigate via Render dashboard; the legacy outbox rows remain unprocessed and need a follow-up drain |
| Drain-complete signal never appears (timeout 30 min) | Investigate via Render logs; check for handler errors; manually inspect outbox state; if the drain is genuinely complete but the signal was missed, manually delete the temporary service and re-run with the same target |
| Render delete API fails | Retry with exponential backoff; if retries exhaust, delete via Render dashboard manually |
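The retry-with-exponential-backoff action in the last row could look roughly like the sketch below. delete_service is a hypothetical callable wrapping the Render DELETE call and returning True on confirmed deletion; the attempt count and base delay are assumed values, not ones taken from the ADRs.

```python
import time
from typing import Callable


def delete_with_backoff(
    delete_service: Callable[[], bool],   # hypothetical wrapper around the DELETE call
    max_attempts: int = 5,                # assumed value
    base_delay_seconds: float = 1.0,      # assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the delete up to max_attempts times, doubling the delay each time.

    Returns True once a delete succeeds, False if retries are exhausted
    (at which point the runbook falls back to the Render dashboard).
    """
    for attempt in range(max_attempts):
        try:
            if delete_service():
                return True
        except OSError:
            # Network-level failures (urllib.error.URLError is an OSError)
            # are treated as retryable in this sketch.
            pass
        if attempt < max_attempts - 1:
            sleep(base_delay_seconds * (2 ** attempt))  # 1s, 2s, 4s, ...
    return False
```

A False return is the cue for the manual path: delete the service in the Render dashboard and note it in the ops log.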
Manual cleanup if workflow fails mid-drain
If the workflow fails after deploying the temporary service but before deletion, the temporary service stays running. To clean up:
- Verify drain completion via direct query:

      SELECT state, count(*) FROM core.outbox
      WHERE generation = <target_generation>
      GROUP BY state;

  Expect zero rows in pending or in_flight.
- If clean, delete the temporary service via Render dashboard or DELETE /v1/services/{service_id} against the Render API.
- Note the manual intervention in the ops log.
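To make the "is it clean?" check concrete, the sketch below runs the same GROUP BY query and reduces the result to a yes/no answer. It uses an in-memory SQLite database purely as a stand-in so the example is self-contained; the production database, its driver, the column types, and the "done" state name are all assumptions.

```python
import sqlite3

UNFINISHED_STATES = ("pending", "in_flight")  # states named in this runbook


def unfinished_counts(conn: sqlite3.Connection, target_generation: int) -> dict:
    """Run the runbook's verification query and return counts for unfinished states."""
    rows = conn.execute(
        "SELECT state, count(*) FROM core.outbox "
        "WHERE generation = ? GROUP BY state",   # '?' is SQLite's placeholder style
        (target_generation,),
    ).fetchall()
    counts = dict(rows)  # states with no rows are absent from the GROUP BY result
    return {state: counts.get(state, 0) for state in UNFINISHED_STATES}


def is_drained(conn: sqlite3.Connection, target_generation: int) -> bool:
    """True when no pending or in_flight rows remain at the target generation."""
    return all(n == 0 for n in unfinished_counts(conn, target_generation).values())


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway database with a hypothetical, simplified outbox schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS core")  # emulates the core schema
    conn.execute(
        "CREATE TABLE core.outbox "
        "(id INTEGER PRIMARY KEY, generation INTEGER, state TEXT)"
    )
    conn.executemany(
        "INSERT INTO core.outbox (generation, state) VALUES (?, ?)",
        [(7, "done"), (7, "done"), (8, "pending"), (8, "in_flight"), (8, "done")],
    )
    return conn
```

In the demo data, generation 7 is fully drained while generation 8 still has one pending and one in_flight row, so only generation 7 would be safe to clean up.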
Generation-stamping invariant
Legacy-drain relies on the ADR-046 D8 invariant: every outbox
row carries the generation it was stamped at, and workers filter
their claims by their own SPECTRAL_GENERATION env var. A worker
running at gen-N never claims a gen-(N-1) row, and vice versa. This
makes drain coherent — the temporary worker at gen-(target) is the
only worker that can claim and process those rows.
If this invariant is violated (e.g., a worker is misconfigured to claim across generations), legacy-drain produces incorrect behavior. The architecture validator and worker initialization both assert the invariant; mis-configuration should not reach production.
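The invariant can be illustrated with a toy claim filter. This is not the worker's real claim logic (which lives behind ADR-046 D8); OutboxRow, claimable_rows, and worker_generation are hypothetical names, and the only details taken from this runbook are the SPECTRAL_GENERATION variable and the per-row generation stamp.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class OutboxRow:
    """Hypothetical, simplified view of an outbox row."""
    id: int
    generation: int
    state: str


def worker_generation(env: Mapping[str, str]) -> int:
    """Read the worker's own generation from its environment."""
    return int(env["SPECTRAL_GENERATION"])


def claimable_rows(rows: List[OutboxRow], own_generation: int) -> List[OutboxRow]:
    """Keep only pending rows stamped at the worker's own generation."""
    return [
        row for row in rows
        if row.state == "pending" and row.generation == own_generation
    ]


rows = [
    OutboxRow(id=1, generation=7, state="pending"),
    OutboxRow(id=2, generation=8, state="pending"),
    OutboxRow(id=3, generation=8, state="in_flight"),
]
drain_worker_gen = worker_generation({"SPECTRAL_GENERATION": "7"})
drain_claims = claimable_rows(rows, drain_worker_gen)
```

Here the drain worker at generation 7 sees only row 1, and a production worker at generation 8 would see only row 2, which is why the temporary drain service and the live workers never contend for the same rows.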
Cron-managed retention vs operator-triggered drain
core.deployments retention is indefinite (per ADR-049 D7) —
generations remain queryable for legacy-drain runs that occur weeks
or months after the deploy. There is no automatic cleanup of
core.deployments rows. If a row is needed for drain reference, it
will be there.
Outbox rows themselves are subject to retention per ADR-044 D13
(unchanged retention values). The routine retention sweep runs as the
daily retention-run cron and removes rows past their TOMBSTONED window.
Legacy-drain produces no rows past retention because it processes
existing rows; if a drain target is older than the outbox retention
window, those rows have already been swept and there is nothing to
drain.
Related
- ADR-053: D13 legacy-drain workflow contract.
- ADR-046: D8 generation-stamping + drain parameters (HANDLER_MAX, SPECTRAL_DRAIN_COOLING_SECONDS, reaper interval, claim TTL).
- ADR-044: outbox + retention values.
- ADR-049: D7 core.deployments retention indefinite.
- deployment.md: production cutover sequence.
- rollback.md: rollback decision tree (Classes 2 & 3 invoke legacy-drain).