ADR-038: Embedding model — single canonical local model, hybrid retrieval, blue-green re-embedding
Status: Accepted (2026-04-21)
Context
Embeddings feed four non-user-facing retrieval paths in alpha: T3 agent memory retrieval; rule-candidate similarity (worlds); world-model artifact search (worlds); future customer-trace similarity (platform). All four are within-purpose queries today — no cross-purpose vector comparison required for the alpha feature set.
Alpha volume estimate: <100K embeddings/month (not the 1M-100M forward-projection scale points).
Already locked going in: single Supabase project + pgvector as vector store (ADR-032); PurposeKey.EMBEDDING reserved (ADR-035 D3); LLMUsageRecord already carries every field embedding calls need; “own-the-substrate” architectural bias (ADR-035 in-process control plane; ADR-037 native secrets posture).
The initial draft recommended Gemini’s cloud embedding API. Reversed during disposition: cloud-API embeddings set a track toward large spend at scale ($6–15K/month at 100M embeddings/month) — incompatible with bootstrap-plausible funding trajectory and inconsistent with the own-the-substrate pattern. In-process local model reuses worker compute and costs $0 additional at alpha scale; the upgrade ladder (D11) keeps the door open if quality, scale, or contractual demands change.
Decision
D1 — Canonical model: BAAI/bge-small-en-v1.5 at 384-dim, in-process via FastEmbed
- 33M params, ~120 MB footprint, Apache 2.0 licensed
- CPU inference 5–15 ms per embedding via FastEmbed (ONNX-backed, ~3× faster than raw transformers)
- Loaded in workers (and in `apps/api` for query-time embedding)
- MTEB retrieval ~63 — adequate for all four alpha use cases
- Migration to cloud or larger local model is a re-embedding job (<$200 at 1–10M vectors), not a re-architecture
D2 — Single canonical model across every context and every purpose
Enforced by the EmbeddingProfileResolver Protocol in spectral.core.embeddings.protocol — all contexts consume the resolution surface; direct model-ID literals in per-context code are a discouraged pattern. Rationale: vector-space unity costs nothing today (no cross-purpose vector comparisons in alpha) and preserves future optionality (rule↔memory retrieval becomes possible without re-embedding).
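A minimal sketch of the resolution surface, with field names mirroring the D3 columns (the landed contract in `src/spectral/core/embeddings/` is authoritative; the `resolve` signature here is an assumption):

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class EmbeddingProfile:
    """Resolved canonical profile (one active row per account, per D3)."""
    provider: str       # e.g. "in-process"
    model: str          # e.g. "BAAI/bge-small-en-v1.5"
    model_version: str
    dimension: int      # e.g. 384


@runtime_checkable
class EmbeddingProfileResolver(Protocol):
    async def resolve(self, account_id: str) -> EmbeddingProfile:
        """Contexts call this instead of hardcoding model-ID literals."""
        ...
```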
D3 — Embedding profile config in core.embedding_profile
One active row per account (partial unique index on account_id WHERE deactivated_at IS NULL). Columns: id, account_id, provider, model, model_version, dimension, created_at, activated_at, deactivated_at. Append-only — rotation sets deactivated_at on the previous row and inserts a new active one. Audit trail for “which vectors belong to which profile.”
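A DDL sketch of that shape (the real migration is `supabase/migrations/20260421012800_core_embedding_profile.sql`; column types here are assumptions):

```sql
CREATE TABLE core.embedding_profile (
    id              uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    account_id      uuid NOT NULL,
    provider        text NOT NULL,   -- e.g. 'in-process'
    model           text NOT NULL,   -- e.g. 'BAAI/bge-small-en-v1.5'
    model_version   text NOT NULL,
    dimension       int  NOT NULL,   -- e.g. 384
    created_at      timestamptz NOT NULL DEFAULT now(),
    activated_at    timestamptz,
    deactivated_at  timestamptz      -- NULL while active; append-only rotation
);

-- One active profile per account.
CREATE UNIQUE INDEX embedding_profile_one_active
    ON core.embedding_profile (account_id)
    WHERE deactivated_at IS NULL;
```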
D4 — Re-embedding lifecycle is blue-green and event-driven
When the canonical profile rotates: a migration adds embedding_v2 vector(<new_dim>) (pgvector allows multiple vector columns per table); an EmbeddingProfileRotated domain event triggers a backfill worker; the worker re-embeds source content in batches into the new column; a feature-flagged read path flips when backfill hits 100%; a follow-up migration drops the old column and index.
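The column-level choreography, sketched as the two bracketing migrations (`worlds.rule_candidate` and the 1024-dim target are illustrative):

```sql
-- Migration 1: open the green column alongside the live (blue) one.
ALTER TABLE worlds.rule_candidate
    ADD COLUMN embedding_v2 vector(1024);   -- <new_dim> from the rotated profile

CREATE INDEX rule_candidate_embedding_v2_hnsw
    ON worlds.rule_candidate
    USING hnsw (embedding_v2 vector_cosine_ops) WITH (m = 16, ef_construction = 64);

-- ... EmbeddingProfileRotated fires; backfill worker re-embeds in batches;
-- ... feature-flagged read path flips at 100% coverage ...

-- Migration 2 (follow-up): retire the blue column and its index.
DROP INDEX worlds.rule_candidate_embedding_hnsw;
ALTER TABLE worlds.rule_candidate DROP COLUMN embedding;
```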
D5 — EmbeddingProvider protocol in spectral.core.embeddings.protocol
```python
@runtime_checkable
class EmbeddingProvider(Protocol):
    async def embed(
        self, texts: list[str], *, profile: EmbeddingProfile
    ) -> list[Embedding]: ...
```

Concrete implementations live in per-context infrastructure (alpha: `InProcessFastEmbedProvider`; later: `TEIProvider`, `GeminiProvider`, `OpenAIProvider`). A `TenantScopedEmbeddingProvider` wrapper applies the ADR-035 D5/D6 rate-limit + budget envelope.
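A sketch of the alpha provider against the FastEmbed `TextEmbedding` API (`InProcessFastEmbedProvider` is named in this ADR; the model-caching detail and the `Embedding` alias are assumptions):

```python
from fastembed import TextEmbedding

from spectral.core.embeddings.protocol import EmbeddingProfile  # per D2/D5

Embedding = list[float]


class InProcessFastEmbedProvider:
    """In-process, ONNX-backed; weights load once per worker (~120 MB for BGE-small)."""

    def __init__(self) -> None:
        self._models: dict[str, TextEmbedding] = {}

    async def embed(
        self, texts: list[str], *, profile: EmbeddingProfile
    ) -> list[Embedding]:
        model = self._models.setdefault(
            profile.model, TextEmbedding(model_name=profile.model)
        )
        # fastembed yields numpy arrays; inference is synchronous CPU work,
        # so offload to a thread if event-loop latency becomes a concern.
        return [vec.tolist() for vec in model.embed(texts)]
```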
D6 — Rate limit and budget accounting piggyback on ADR-035
Each batch embedding call writes one core.llm_usage row: model=bge-small-en-v1.5, input_tokens=<total>, output_tokens=0, purpose=EMBEDDING, content_class=<from caller>. No new schema. In-process calls still emit the row for consistent cost attribution and audit.
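Illustratively, the per-batch accounting write (field names follow D6; the whitespace token count is a stand-in for whatever tokenizer the caller uses):

```python
def usage_row(texts: list[str], content_class: str) -> dict:
    """One core.llm_usage row per batch embedding call (shape per D6)."""
    # Stand-in token count; the real caller supplies its own tokenizer.
    input_tokens = sum(len(t.split()) for t in texts)
    return {
        "model": "bge-small-en-v1.5",
        "input_tokens": input_tokens,
        "output_tokens": 0,              # embeddings emit vectors, not tokens
        "purpose": "EMBEDDING",          # PurposeKey.EMBEDDING (ADR-035 D3)
        "content_class": content_class,  # PLATFORM / OPERATIONS / SYNTHETIC
    }
```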
D7 — Content-class routing follows ADR-036 D6
The same PLATFORM / OPERATIONS / SYNTHETIC taxonomy applies to embedding calls. In-process embedding means PLATFORM content never leaves the worker — sovereignty posture is strictly stronger than cloud-API. Content-class is still tagged on the core.llm_usage row for audit.
D8 — Hybrid retrieval via RRF is the standard pattern
Every retrievable table carries both a vector(<dim>) column (semantic, HNSW-indexed) and a tsvector column (lexical, GIN-indexed). Retrieval helpers in spectral.core.embeddings.retrieval fuse via Reciprocal Rank Fusion with k=60 (the standard constant). Pure vector similarity misses exact matches on domain vocabulary (rule IDs, form codes, error strings); RRF consistently outperforms single-method retrieval. Built on vanilla Postgres FTS plus pgvector, zero additional extensions.
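A sketch of the fused query, assuming a D9-convention table (`worlds.rule_candidate` is illustrative; `:query_vec` and `:query_text` are bind parameters):

```sql
WITH semantic AS (
    SELECT id, row_number() OVER (ORDER BY embedding <=> :query_vec) AS rank
    FROM worlds.rule_candidate
    ORDER BY embedding <=> :query_vec
    LIMIT 20
),
lexical AS (
    SELECT id, row_number() OVER (
               ORDER BY ts_rank_cd(search_tsv,
                                   websearch_to_tsquery('english', :query_text)) DESC
           ) AS rank
    FROM worlds.rule_candidate
    WHERE search_tsv @@ websearch_to_tsquery('english', :query_text)
    ORDER BY rank
    LIMIT 20
)
-- Reciprocal Rank Fusion with the standard k = 60.
SELECT id,
       coalesce(1.0 / (60 + semantic.rank), 0)
     + coalesce(1.0 / (60 + lexical.rank), 0) AS rrf_score
FROM semantic
FULL OUTER JOIN lexical USING (id)
ORDER BY rrf_score DESC
LIMIT 10;
```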
D9 — Retrievable-table convention
Schema rule shared across contexts, enforced by code review and a post-alpha migration-naming lint extension:
- `embedding <vector|halfvec>(<dim>)` + HNSW index on `vector_cosine_ops`
- `embedding_model TEXT NOT NULL`
- `embedding_model_version TEXT NOT NULL`
- `embedding_dim INT NOT NULL`
- `source_content_hash TEXT NULL` (re-embed skip-if-unchanged)
- `search_tsv tsvector` generated from relevant text columns + GIN index
- `search_lang TEXT DEFAULT 'english'` (multilingual-future)
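Applied to a hypothetical retrievable table, the convention looks like this (all names illustrative):

```sql
CREATE TABLE worlds.world_model_artifact (
    id                      uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    body                    text NOT NULL,
    embedding               vector(384),
    embedding_model         text NOT NULL,
    embedding_model_version text NOT NULL,
    embedding_dim           int  NOT NULL,
    source_content_hash     text,                     -- re-embed skip-if-unchanged
    search_lang             text DEFAULT 'english',   -- multilingual-future
    search_tsv              tsvector GENERATED ALWAYS AS
                              (to_tsvector('english', body)) STORED
);

CREATE INDEX world_model_artifact_embedding_hnsw
    ON worlds.world_model_artifact
    USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

CREATE INDEX world_model_artifact_search_tsv_gin
    ON worlds.world_model_artifact USING gin (search_tsv);
```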
D10 — HNSW defaults: m=16, ef_construction=64 at build; tune ef_search per query
Supabase-standard. Alpha (≤1M rows): 4–8 GB maintenance_work_mem at build. Revisit at 10M+ rows; consider tenant-partitioned indexes if RLS-scoped query planner regressions surface.
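Per-query tuning is a transaction-local GUC in pgvector, e.g.:

```sql
BEGIN;
SET LOCAL hnsw.ef_search = 100;   -- pgvector default is 40; higher trades speed for recall
SELECT id
FROM worlds.world_model_artifact  -- illustrative table from the D9 sketch
ORDER BY embedding <=> :query_vec
LIMIT 10;
COMMIT;
```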
D11 — Fallback upgrade ladder
Each step is a re-embedding job, not a re-architecture:
- Quality insufficient → upgrade to `BAAI/bge-large-en-v1.5` (335M params, 1024-dim, ~1.3 GB; still in-process if worker RAM allows).
- Worker RAM pressure OR embedding volume outpaces in-process → deploy a `huggingface/text-embeddings-inference` sidecar on Cloud Run CPU (~$50–80/month flat; frees worker memory).
- Quality demands frontier OR enterprise DPA demands a named provider → swap to a cloud API (Gemini `gemini-embedding-001` preferred by GCP lean; OpenAI `text-embedding-3-large` if non-GCP).
All three steps: swap provider under the EmbeddingProvider protocol → re-embed via EmbeddingProfileRotated event → cut over.
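For illustration, the cutover trigger as a domain event (field names are assumptions; D4 defines the actual flow):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingProfileRotated:
    """Emitted when a new core.embedding_profile row becomes active (D3/D4)."""
    account_id: str
    old_profile_id: str
    new_profile_id: str
    new_dimension: int   # drives the vector(<new_dim>) backfill column


async def on_embedding_profile_rotated(event: EmbeddingProfileRotated) -> None:
    """Backfill worker entrypoint: re-embed source content in batches into
    embedding_v2, then flip the feature-flagged read path at 100% (D4)."""
    ...
```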
Alternatives considered
Gemini gemini-embedding-001 cloud (the initial recommendation). Reversed: sets a cost-scaling track incompatible with the funding trajectory. $150/month at 1M emb/month is trivial; $15K/month at 100M emb/month is real. In-process reuses existing worker compute at zero additional cost.
OpenAI text-embedding-3-large cloud. Same rejection reasoning; slightly worse GCP alignment.
Voyage-3. MongoDB acquisition trajectory; the API is becoming an Atlas Vector Search feature.
Cohere embed-v4. No GCP availability; cross-cloud friction.
Local TEI sidecar on Cloud Run CPU from day one. Correct upgrade target (D11 step 2) but unnecessary infrastructure at alpha when in-process works. Premature.
Larger in-process models (BGE-large, nomic-embed-text-v1.5, Qwen3-Embedding). Upgrade targets (D11 step 1). Footprint/quality trade rejected for alpha; BGE-small is the right size for current worker sizing.
Per-purpose different models. Fragmentation risk; forecloses cross-purpose retrieval; against the single-canonical discipline elsewhere.
Embeddings-only, no FTS. Loses exact-match recall on domain vocabulary. RRF hybrid is strictly stronger at negligible cost.
External FTS service (Elasticsearch, Meilisearch, Typesense). Overkill; Postgres native FTS handles our scale fine, zero new infra.
Consequences
- Unblocks ADR-056 (TA-8 T3 Memory routing) — embedding-based retrieval has a canonical model.
- Unblocks agent memory ADRs (ADR-058 / ADR-059 / ADR-043) — `EmbeddingProvider` protocol and RRF retrieval helper available.
- `core.embedding_profile` is the second `core` schema table (after `core.llm_usage`).
- pgvector storage at alpha: 1M rows × `vector(384)` ≈ 1.5 GB raw + 2.25 GB HNSW ≈ 3.75 GB total. Comfortable in the alpha Supabase instance.
- Worker footprint: in-process BGE-small adds ~120 MB per worker. Negligible. If a future canonical model upgrade pushes in-process footprint past ~500 MB (e.g., BGE-large at ~1.3 GB), ADR-048 / ADR-049 should re-evaluate whether to subdivide workers by workload profile.
- Sovereignty posture is strictly stronger than the initial cloud-API recommendation. Customer content never leaves the worker process for embedding. No subprocessor added for this capability.
- Cost trajectory is flat, not stepped. No 100M-embedding cliff scenario. Upgrade path (D11) adds cost incrementally when triggered.
- D9 schema convention applies to every future retrievable table — Worlds rule candidates, world-model artifacts, T3 memory items, scan traces. Enforce in code review until a post-alpha migration-naming lint extension can mechanically check.
References
- ADR-065 — `spectral.core` admission discipline
- ADR-031 — single-library structure
- ADR-032 — pgvector store; `core` schema
- ADR-035 — `PurposeKey.EMBEDDING`; `LLMUsageRecord`; rate-limit + budget pattern
- ADR-036 — content-class taxonomy; `core.llm_usage` shape
- ADR-043 — TA-14 memory consumer
- ADR-056 — TA-8 T3 Memory routing consumer
- ADR-058 — TA-12 retrieval consumer
- ADR-059 — TA-13 retrieval consumer
- TA-11 disposition — SPEC-314 comment 568fe106
- TA-11 verification — SPEC-314 comment 993aae10
- `src/spectral/core/embeddings/` (commit df78715) — landed contract surface
- `supabase/migrations/20260421012800_core_embedding_profile.sql` — `core.embedding_profile`
- Codex `system-design/agents/embeddings.mdx` — close-pass new page
- `docs/runbooks/embeddings.md` — upgrade-ladder + rotation playbook