Legacy-drain runbook

Operational procedure for draining outbox rows stamped at a prior deployment generation. Used after a rollback to clear gen-(N) events that no longer have a worker pool processing them. See ADR-053 D13 for the workflow contract and ADR-046 D8 for the generation-stamping guarantee that makes this protocol coherent.

When to run

Use legacy-drain in any of the following situations:

After a Class 2 rollback (rollback.md): green workers stamped events at gen-(N+1) during the brief green-traffic window, and gen-(N+1) workers are no longer running. Drain target_generation=N+1.
After a Class 3 rollback: the prior code ships as gen-(N+2) and serves traffic, leaving gen-(N+1) stranded. Drain target_generation=N+1.
Routine cleanup: under normal operation, the rolling-deploy window leaves outbox rows at the prior generation, and these drain naturally as the prior-generation workers finish processing them. If that worker pool was killed mid-drain (rare; usually a Render incident), legacy-drain finishes the work.

Do not use legacy-drain for:

Active rollbacks where the prior generation is still running (the active workers will handle their own outbox).
Outbox rows from the current generation (they are not legacy).

Workflow contract

The drain runs as a manually-dispatched GitHub Actions workflow.

Workflow: .github/workflows/drain-legacy-generation.yml. This file is not in the repo yet: it lands with the deploy substrate per ADR-053 D13, and .github/workflows/ ships only ci.yml, release.yml, and generate-sbom.yml today.
Concurrency: shares the deploy-prod concurrency mutex. The drain queues behind in-flight prod deploys, and new prod deploys queue behind the drain while it runs.
Dispatch input: target_generation (integer), the generation number whose outbox rows must be drained.
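
As an illustration of the dispatch input, the sketch below builds a manual dispatch request against GitHub's REST endpoint for workflow_dispatch events. It only constructs the request and does not send it. The owner, repo, token, and ref values are hypothetical placeholders, and the workflow file itself is assumed to exist once ADR-053 D13 lands.

```python
import json
import urllib.request

# Workflow file name from the contract above; not present in the repo today.
WORKFLOW_FILE = "drain-legacy-generation.yml"


def build_dispatch_request(owner, repo, token, target_generation, ref="main"):
    """Build (but do not send) a workflow_dispatch request for a drain.

    owner, repo, token and ref are placeholders for illustration only.
    """
    # Reject anything that is not a non-negative integer before dispatching.
    if (
        not isinstance(target_generation, int)
        or isinstance(target_generation, bool)
        or target_generation < 0
    ):
        raise ValueError("target_generation must be a non-negative integer")

    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{WORKFLOW_FILE}/dispatches"
    )
    # Dispatch inputs are passed as strings in the JSON payload.
    body = json.dumps(
        {"ref": ref, "inputs": {"target_generation": str(target_generation)}}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Sending the request would be a call to urllib.request.urlopen on the returned object; most operators will dispatch from the Actions UI instead, where the same single input applies.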

Steps

Look up the deployment reference for target_generation in core.deployments. Abort with a clear error if no row matches: either the workflow input is wrong or core.deployments is corrupted.
Check out the repo at the resolved reference. This is the commit SHA the prior generation was deployed from, so the build context matches what was running at gen-(target).
Build and deploy a temporary Render worker service named workers-drain-gen-<target_generation>. Configuration:
- SPECTRAL_GENERATION=<target_generation>
- SPECTRAL_DRAIN_AND_EXIT=true
- SPECTRAL_DRAIN_COOLING_SECONDS — optional override; defaults to 60 seconds
- All other env vars match the production worker service for that generation
Monitor for drain completion. Two signals, in priority order:
- Authoritative: the log line drain complete, exiting. The worker emits this after processing the last legacy event and then waiting SPECTRAL_DRAIN_COOLING_SECONDS for in-flight handlers and any re-pended rows to settle.
- Fallback: poll core.outbox for zero rows in state IN ('pending', 'in_flight') at the target generation. Used if the log-line signal is missed (log shipping lag, log parsing failure).
Delete the temporary service. Call Render API DELETE /v1/services/{service_id}. Verify deletion via API.
Workflow completion. Workflow succeeds when the temporary service is confirmed deleted and outbox is verified clean at the target generation.
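
The monitoring step above can be sketched as a loop that checks the authoritative log line first and falls back to the outbox count. This is a minimal sketch, not the workflow's actual implementation: fetch_new_log_lines and count_open_rows are hypothetical callables standing in for Render log access and the core.outbox query, and the 15-second poll interval is an assumption. The 30-minute timeout matches the failure-modes table.

```python
import time
from typing import Callable, Dict, Iterable, Optional

DRAIN_COMPLETE_LINE = "drain complete, exiting"


def drain_env(target_generation: int,
              cooling_seconds: Optional[int] = None) -> Dict[str, str]:
    """Env overrides for the temporary workers-drain-gen-<N> service."""
    env = {
        "SPECTRAL_GENERATION": str(target_generation),
        "SPECTRAL_DRAIN_AND_EXIT": "true",
    }
    # Only set the cooling override when the operator supplies one.
    if cooling_seconds is not None:
        env["SPECTRAL_DRAIN_COOLING_SECONDS"] = str(cooling_seconds)
    return env


def wait_for_drain(
    fetch_new_log_lines: Callable[[], Iterable[str]],  # hypothetical log source
    count_open_rows: Callable[[], int],  # hypothetical pending + in_flight count
    timeout_seconds: float = 1800,
    poll_seconds: float = 15,  # assumed interval, not from the runbook
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Return which signal confirmed the drain, or raise TimeoutError."""
    deadline = clock() + timeout_seconds
    while True:
        # Authoritative signal: the worker's own completion log line.
        if any(DRAIN_COMPLETE_LINE in line for line in fetch_new_log_lines()):
            return "log_line"
        # Fallback signal: no pending or in_flight rows at the target generation.
        if count_open_rows() == 0:
            return "outbox_empty"
        if clock() >= deadline:
            raise TimeoutError("drain-complete signal never appeared")
        sleep(poll_seconds)
```

Returning the name of the signal lets the caller record whether the drain was confirmed by the log line or only by the fallback poll.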

Failure modes

Failure	Action
`core.deployments` row not found	Abort; verify `target_generation`; do not synthesize a row
Render deploy of temporary service fails	Abort; investigate via Render dashboard; the legacy outbox rows remain unprocessed and need a follow-up drain
Drain-complete signal never appears (timeout 30 min)	Investigate via Render logs; check for handler errors; manually inspect outbox state; if drain genuinely complete but signal missed, manually delete temporary service and re-run with same target
Render delete API fails	Retry with exponential backoff; if retries exhaust, delete via Render dashboard manually
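
The retry policy in the last row might look like the sketch below. The attempt count and delay bounds are assumptions, since the runbook does not specify them, and delete_service is a hypothetical callable wrapping the Render DELETE call that returns True once deletion is confirmed.

```python
import time
from typing import Callable


def delete_with_backoff(
    delete_service: Callable[[], bool],  # hypothetical wrapper around the delete call
    max_attempts: int = 5,  # assumed; not specified in the runbook
    base_delay: float = 1.0,  # assumed
    max_delay: float = 30.0,  # assumed
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry deletion with exponential backoff; False means retries exhausted."""
    for attempt in range(max_attempts):
        try:
            if delete_service():
                return True
        except Exception:
            # Sketch only: every error is treated as transient and retried.
            pass
        # Sleep between attempts, doubling the delay each time up to max_delay.
        if attempt < max_attempts - 1:
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return False
```

A False return corresponds to the table's last resort: delete the service manually via the Render dashboard.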

Manual cleanup if workflow fails mid-drain

If the workflow fails after deploying the temporary service but before deletion, the temporary service stays running. To clean up:

Verify drain completion via direct query:

SELECT state, count(*) FROM core.outbox
WHERE generation = <target_generation>
GROUP BY state;

Expect zero rows in pending or in_flight.

If clean, delete the temporary service via Render dashboard or DELETE /v1/services/{service_id} against the Render API.
Note the manual intervention in the ops log.
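
To show how the query result above is read, the sketch below runs the same GROUP BY against an in-memory SQLite stand-in for core.outbox. The table layout, the "done" state, and the use of SQLite are assumptions for illustration; only the pending and in_flight states and the generation column come from this runbook. The production database will likely use a different driver and placeholder syntax.

```python
import sqlite3

QUERY = (
    "SELECT state, count(*) FROM core.outbox "
    "WHERE generation = ? GROUP BY state"
)


def open_rows(conn: sqlite3.Connection, target_generation: int) -> int:
    """Count rows still pending or in_flight at the target generation."""
    counts = dict(conn.execute(QUERY, (target_generation,)).fetchall())
    return counts.get("pending", 0) + counts.get("in_flight", 0)


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway stand-in for core.outbox (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # Attach a second in-memory database so the core.outbox name resolves.
    conn.execute("ATTACH DATABASE ':memory:' AS core")
    conn.execute(
        "CREATE TABLE core.outbox "
        "(id INTEGER PRIMARY KEY, generation INTEGER, state TEXT)"
    )
    conn.executemany(
        "INSERT INTO core.outbox (generation, state) VALUES (?, ?)",
        [(7, "pending"), (7, "done"), (8, "in_flight"), (9, "done")],
    )
    conn.commit()
    return conn
```

The temporary service is safe to delete only when open_rows returns 0 for the target generation; rows in other states do not block deletion.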

Generation-stamping invariant

Legacy-drain relies on the ADR-046 D8 invariant: every outbox row carries the generation it was stamped at, and workers filter their claims by their own SPECTRAL_GENERATION env var. A worker running at gen-N never claims a gen-(N-1) row, and vice versa. This makes drain coherent — the temporary worker at gen-(target) is the only worker that can claim and process those rows.

If this invariant is violated (e.g., a worker is misconfigured to claim across generations), the temporary drain worker is no longer the sole claimant of the target rows and legacy-drain behaves incorrectly. The architecture validator and worker initialization both assert the invariant, so a misconfiguration should not reach production.
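
The claim filter behind the invariant can be illustrated with a small in-memory model. This is a sketch under assumptions: the OutboxRow shape and the claimable helper are hypothetical, and the real filter lives in the worker's claim query per ADR-046 D8.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OutboxRow:
    # Hypothetical row shape; only generation and state matter here.
    id: int
    generation: int
    state: str = "pending"


def claimable(rows: List[OutboxRow], worker_generation: int) -> List[OutboxRow]:
    """Rows a worker may claim: pending, and stamped at its own generation."""
    return [
        row
        for row in rows
        if row.state == "pending" and row.generation == worker_generation
    ]


rows = [
    OutboxRow(id=1, generation=6),
    OutboxRow(id=2, generation=7),
    OutboxRow(id=3, generation=7, state="in_flight"),
]
```

In this model a worker started with SPECTRAL_GENERATION=7 sees only row 2, and a worker at generation 8 sees nothing, which is why stranded rows need a temporary worker at their own generation.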

Cron-managed retention vs operator-triggered drain

core.deployments retention is indefinite (per ADR-049 D7) — generations remain queryable for legacy-drain runs that occur weeks or months after the deploy. There is no automatic cleanup of core.deployments rows. If a row is needed for drain reference, it will be there.

Outbox rows themselves are subject to retention per ADR-044 D13 (retention values unchanged). The routine retention sweep runs in the daily retention-run cron and removes rows past their TOMBSTONED window. Legacy-drain creates no new rows; it only processes existing ones. If a drain target is older than the outbox retention window, those rows have already been swept and there is nothing to drain.

ADR-053 — D13 legacy-drain workflow contract.
ADR-046 — D8 generation-stamping + drain parameters (HANDLER_MAX, SPECTRAL_DRAIN_COOLING_SECONDS, reaper interval, claim TTL).
ADR-044 — outbox + retention values.
ADR-049 — D7 core.deployments retention indefinite.
deployment.md — production cutover sequence.
rollback.md — rollback decision tree (Classes 2 & 3 invoke legacy-drain).
