ADR-038: Embedding model — single canonical local model, hybrid retrieval, blue-green re-embedding

Status: Accepted (2026-04-21)

Context

Embeddings feed four non-user-facing retrieval paths in alpha: T3 agent memory retrieval; rule-candidate similarity (worlds); world-model artifact search (worlds); future customer-trace similarity (platform). All four are within-purpose queries today — no cross-purpose vector comparison required for the alpha feature set.

Alpha volume estimate: <100K embeddings/month (not the 1M-100M forward-projection scale points).

Already locked going in: single Supabase project + pgvector as vector store (ADR-032); PurposeKey.EMBEDDING reserved (ADR-035 D3); LLMUsageRecord already carries every field embedding calls need; “own-the-substrate” architectural bias (ADR-035 in-process control plane; ADR-037 native secrets posture).

The initial draft recommended Gemini’s cloud embedding API. Reversed during disposition: cloud-API embeddings set a track toward large spend at scale ($6–15K/month at 100M embeddings/month) — incompatible with bootstrap-plausible funding trajectory and inconsistent with the own-the-substrate pattern. In-process local model reuses worker compute and costs $0 additional at alpha scale; the upgrade ladder (D11) keeps the door open if quality, scale, or contractual demands change.

Decision

D1 — Canonical model: BAAI/bge-small-en-v1.5 at 384-dim, in-process via FastEmbed

  • 33M params, ~120 MB footprint, Apache 2.0 licensed
  • CPU inference 5–15 ms per embedding via FastEmbed (ONNX-backed, ~3× faster than raw transformers)
  • Loaded in workers (and in apps/api for query-time embedding)
  • MTEB retrieval ~63 — adequate for all four alpha use cases
  • Migration to cloud or larger local model is a re-embedding job (<$200 at 1–10M vectors), not a re-architecture

D2 — Single canonical model across every context and every purpose

Enforced by the EmbeddingProfileResolver Protocol in spectral.core.embeddings.protocol — all contexts consume the resolution surface; direct model-ID literals in per-context code are a discouraged pattern. Rationale: vector-space unity costs nothing today (no cross-purpose vector comparisons in alpha) and preserves future optionality (rule↔memory retrieval becomes possible without re-embedding).
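
For illustration, the resolution surface can be as small as a single protocol method; the method name and signature below are assumptions, not the landed contract:

from typing import Protocol, runtime_checkable

@runtime_checkable
class EmbeddingProfileResolver(Protocol):
    # Hypothetical sketch: contexts ask for the account's active profile (the D3 record)
    # instead of hard-coding a model ID in per-context code.
    async def resolve_active_profile(self, account_id: str) -> "EmbeddingProfile": ...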

D3 — Embedding profile config in core.embedding_profile

One active row per account (partial unique index on account_id WHERE deactivated_at IS NULL). Columns: id, account_id, provider, model, model_version, dimension, created_at, activated_at, deactivated_at. Append-only — rotation sets deactivated_at on the previous row and inserts a new active one. Audit trail for “which vectors belong to which profile.”
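
The profile row maps naturally onto a frozen value object such as the sketch below (field names assumed from the column list; the landed EmbeddingProfile type may differ):

from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class EmbeddingProfile:
    # Immutable snapshot of the active core.embedding_profile row (columns per D3).
    id: str
    account_id: str
    provider: str                      # e.g. "fastembed"
    model: str                         # e.g. "BAAI/bge-small-en-v1.5"
    model_version: str
    dimension: int                     # 384 for the D1 canonical model
    created_at: datetime
    activated_at: datetime | None = None
    deactivated_at: datetime | None = None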

D4 — Re-embedding lifecycle is blue-green and event-driven

When the canonical profile rotates:

  1. A migration adds embedding_v2 vector(<new_dim>) (pgvector allows multiple vector columns per table).
  2. An EmbeddingProfileRotated domain event triggers a backfill worker.
  3. The worker re-embeds source content in batches into the new column.
  4. A feature-flagged read path flips when backfill hits 100%.
  5. A follow-up migration drops the old column and index.
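
A minimal sketch of steps 2–3, assuming an asyncpg-style connection and a table following the D9 convention (table, column, and helper names are illustrative, not the landed worker):

# Illustrative only: batch re-embedding into the blue-green column.
BATCH_SIZE = 256

async def backfill_embeddings(db, provider, new_profile, table: str = "memory_items") -> None:
    while True:
        rows = await db.fetch(
            f"SELECT id, source_text FROM {table} "
            f"WHERE embedding_v2 IS NULL ORDER BY id LIMIT {BATCH_SIZE}"
        )
        if not rows:
            break  # backfill at 100%: the feature-flagged read path can flip
        vectors = await provider.embed([r["source_text"] for r in rows], profile=new_profile)
        for row, vector in zip(rows, vectors):
            await db.execute(
                f"UPDATE {table} SET embedding_v2 = $1, embedding_model = $2, "
                "embedding_model_version = $3, embedding_dim = $4 WHERE id = $5",
                vector, new_profile.model, new_profile.model_version,
                new_profile.dimension, row["id"],
            )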

D5 — EmbeddingProvider protocol in spectral.core.embeddings.protocol

from typing import Protocol, runtime_checkable

@runtime_checkable
class EmbeddingProvider(Protocol):
    # EmbeddingProfile is the active D3 profile record; Embedding is the vector value type.
    async def embed(self, texts: list[str], *, profile: EmbeddingProfile) -> list[Embedding]: ...

Concrete implementations live in per-context infrastructure (alpha: InProcessFastEmbedProvider; later: TEIProvider, GeminiProvider, OpenAIProvider). A TenantScopedEmbeddingProvider wrapper applies the ADR-035 D5/D6 rate-limit + budget envelope.
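
A sketch of the alpha provider under this protocol, assuming Embedding is a plain list of floats and using fastembed's TextEmbedding API (illustrative, not the landed InProcessFastEmbedProvider):

import asyncio

from fastembed import TextEmbedding  # ONNX-backed local inference

class InProcessFastEmbedProvider:
    # Structurally satisfies EmbeddingProvider; a sketch, not the landed implementation.

    def __init__(self, profile) -> None:
        # e.g. profile.model == "BAAI/bge-small-en-v1.5" (the D1 canonical model)
        self._model = TextEmbedding(model_name=profile.model)

    async def embed(self, texts: list[str], *, profile) -> list[list[float]]:
        # fastembed inference is synchronous and CPU-bound; keep it off the event loop.
        vectors = await asyncio.to_thread(lambda: list(self._model.embed(texts)))
        return [v.tolist() for v in vectors]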

D6 — Rate limit and budget accounting piggyback on ADR-035

Each batch embedding call writes one core.llm_usage row: model=bge-small-en-v1.5, input_tokens=<total>, output_tokens=0, purpose=EMBEDDING, content_class=<from caller>. No new schema. In-process calls still emit the row for consistent cost attribution and audit.
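
Illustratively, the accounting hook reduces to something like the sketch below; the usage-sink call and the token heuristic are assumptions, while the field names mirror the row described above:

def estimate_tokens(text: str) -> int:
    # Crude heuristic (~4 chars per token); enough for budget accounting. Hypothetical helper.
    return max(1, len(text) // 4)

def record_embedding_usage(usage_sink, profile, texts: list[str], content_class: str) -> None:
    usage_sink.write(
        model=profile.model,                  # bge-small-en-v1.5
        input_tokens=sum(estimate_tokens(t) for t in texts),
        output_tokens=0,                      # embeddings produce no output tokens
        purpose="EMBEDDING",                  # PurposeKey.EMBEDDING (ADR-035 D3)
        content_class=content_class,          # PLATFORM / OPERATIONS / SYNTHETIC (ADR-036 D6)
    )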

D7 — Content-class routing follows ADR-036 D6

The same PLATFORM / OPERATIONS / SYNTHETIC taxonomy applies to embedding calls. In-process embedding means PLATFORM content never leaves the worker — sovereignty posture is strictly stronger than cloud-API. Content-class is still tagged on the core.llm_usage row for audit.

D8 — Hybrid retrieval via RRF is the standard pattern

Every retrievable table carries both a vector(<dim>) column (semantic, HNSW-indexed) and a tsvector column (lexical, GIN-indexed). Retrieval helpers in spectral.core.embeddings.retrieval fuse via Reciprocal Rank Fusion with k=60 (the standard constant). Pure vector similarity misses exact matches on domain vocabulary (rule IDs, form codes, error strings); RRF consistently outperforms single-method retrieval. Built on vanilla Postgres FTS plus pgvector, zero additional extensions.
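
The fusion rule itself is small. A sketch in Python (the landed helpers in spectral.core.embeddings.retrieval may fuse inside SQL instead, but the scoring is the same):

def rrf_fuse(vector_ranked: list[str], lexical_ranked: list[str], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: score(doc) = sum over result lists of 1 / (k + rank), rank from 1.
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, lexical_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)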

D9 — Retrievable-table convention

Schema rule shared across contexts, enforced by code review and a post-alpha migration-naming lint extension (a sketch of that check follows the list):

  • embedding <vector|halfvec>(<dim>) + HNSW index on vector_cosine_ops
  • embedding_model TEXT NOT NULL
  • embedding_model_version TEXT NOT NULL
  • embedding_dim INT NOT NULL
  • source_content_hash TEXT NULL (re-embed skip-if-unchanged)
  • search_tsv tsvector generated from relevant text columns + GIN index
  • search_lang TEXT DEFAULT 'english' (multilingual-future)
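
A sketch of what that lint check might assert, using a crude textual scan (column names from the list above; function and constant names are hypothetical):

REQUIRED_COLUMNS = (
    "embedding", "embedding_model", "embedding_model_version", "embedding_dim",
    "source_content_hash", "search_tsv", "search_lang",
)

def missing_convention_columns(migration_sql: str) -> list[str]:
    # Crude textual scan of a CREATE TABLE migration; a real lint would parse the DDL.
    return [col for col in REQUIRED_COLUMNS if col not in migration_sql]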

D10 — HNSW defaults: m=16, ef_construction=64 at build; tune ef_search per query

Supabase-standard. Alpha (≤1M rows): 4–8 GB maintenance_work_mem at build. Revisit at 10M+ rows; consider tenant-partitioned indexes if RLS-scoped query planner regressions surface.
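
Per-query ef_search tuning is a transaction-local setting; a sketch assuming an asyncpg-style connection and a hypothetical memory_items table following the D9 convention:

async def vector_search(db, query_vec, limit: int = 10, ef_search: int = 80):
    # Higher ef_search improves recall at higher latency; SET LOCAL scopes it to this transaction.
    async with db.transaction():
        await db.execute(f"SET LOCAL hnsw.ef_search = {int(ef_search)}")
        return await db.fetch(
            "SELECT id FROM memory_items ORDER BY embedding <=> $1 LIMIT $2",
            query_vec, limit,
        )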

D11 — Fallback upgrade ladder

Each step is a re-embedding job, not a re-architecture:

  1. Quality insufficient → upgrade to BAAI/bge-large-en-v1.5 (335M params, 1024-dim, ~1.3 GB; still in-process if worker RAM allows).
  2. Worker RAM pressure OR embedding volume outpaces in-process → deploy huggingface/text-embeddings-inference sidecar on Cloud Run CPU (~$50–80/month flat; frees worker memory).
  3. Quality demands frontier OR enterprise DPA demands a named provider → swap to a cloud API (Gemini gemini-embedding-001 preferred by GCP lean; OpenAI text-embedding-3-large if non-GCP).

All three steps: swap provider under the EmbeddingProvider protocol → re-embed via EmbeddingProfileRotated event → cut over.

Alternatives considered

Gemini gemini-embedding-001 cloud (the initial recommendation). Reversed: sets a cost-scaling track incompatible with the funding trajectory. $150/month at 1M emb/month is trivial; $15K/month at 100M emb/month is real. In-process reuses existing worker compute at zero additional cost.

OpenAI text-embedding-3-large cloud. Same rejection reasoning; slightly worse GCP alignment.

Voyage-3. MongoDB acquisition trajectory; the API is becoming an Atlas Vector Search feature.

Cohere embed-v4. No GCP availability; cross-cloud friction.

Local TEI sidecar on Cloud Run CPU from day one. Correct upgrade target (D11 step 2) but unnecessary infrastructure at alpha when in-process works. Premature.

Larger in-process models (BGE-large, nomic-embed-text-v1.5, Qwen3-Embedding). Upgrade targets (D11 step 1). Footprint/quality trade rejected for alpha; BGE-small is the right size for current worker sizing.

Per-purpose different models. Fragmentation risk; forecloses cross-purpose retrieval; against the single-canonical discipline elsewhere.

Embeddings-only, no FTS. Loses exact-match recall on domain vocabulary. RRF hybrid is strictly stronger at negligible cost.

External FTS service (Elasticsearch, Meilisearch, Typesense). Overkill; Postgres native FTS handles our scale fine, zero new infra.

Consequences

  • Unblocks ADR-056 (TA-8 T3 Memory routing) — embedding-based retrieval has a canonical model.
  • Unblocks agent memory ADRs (ADR-058 / ADR-059 / ADR-043) — EmbeddingProvider protocol and RRF retrieval helper available.
  • core.embedding_profile is the second core schema table (after core.llm_usage).
  • pgvector storage at alpha: 1M rows × vector(384) ≈ 1.5 GB raw + 2.25 GB HNSW ≈ 3.75 GB total. Comfortable in alpha Supabase instance.
  • Worker footprint: in-process BGE-small adds ~120 MB per worker. Negligible. If a future canonical model upgrade pushes in-process footprint past ~500 MB (e.g., BGE-large at ~1.3 GB), ADR-048 / ADR-049 should re-evaluate whether to subdivide workers by workload profile.
  • Sovereignty posture is strictly stronger than the initial cloud-API recommendation. Customer content never leaves the worker process for embedding. No subprocessor added for this capability.
  • Cost trajectory is flat, not stepped. No 100M-embedding cliff scenario. Upgrade path (D11) adds cost incrementally when triggered.
  • D9 schema convention applies to every future retrievable table — Worlds rule candidates, world-model artifacts, T3 memory items, scan traces. Enforce in code review until a post-alpha migration-naming lint extension can mechanically check.

References

  • ADR-065 — spectral.core admission discipline
  • ADR-031 — single-library structure
  • ADR-032 — pgvector store; core schema
  • ADR-035 — PurposeKey.EMBEDDING; LLMUsageRecord; rate-limit + budget pattern
  • ADR-036 — content-class taxonomy; core.llm_usage shape
  • ADR-043 — TA-14 memory consumer
  • ADR-056 — TA-8 T3 Memory routing consumer
  • ADR-058 — TA-12 retrieval consumer
  • ADR-059 — TA-13 retrieval consumer
  • TA-11 disposition — SPEC-314 comment 568fe106
  • TA-11 verification — SPEC-314 comment 993aae10
  • src/spectral/core/embeddings/ (commit df78715) — landed contract surface
  • supabase/migrations/20260421012800_core_embedding_profile.sql — core.embedding_profile
  • Codex system-design/agents/embeddings.mdx — close-pass new page
  • docs/runbooks/embeddings.md — upgrade-ladder + rotation playbook