
ADR-038: Embedding model — single canonical local model, hybrid retrieval, blue-green re-embedding

Context

Embeddings feed four non-user-facing retrieval paths in alpha: T3 agent memory retrieval; rule-candidate similarity (worlds); world-model artifact search (worlds); future customer-trace similarity (platform). All four are within-purpose queries today, so the alpha feature set requires no cross-purpose vector comparison.

Alpha volume estimate: <100K embeddings/month (not the 1M-100M forward-projection scale points).

Already locked going in: single Supabase project + pgvector as vector store (ADR-032); PurposeKey.EMBEDDING reserved (ADR-035 D3); LLMUsageRecord already carries every field embedding calls need; “own-the-substrate” architectural bias (ADR-035 in-process control plane; ADR-037 native secrets posture).

Cloud-API embeddings put spend on a steep scaling track ($6–15K/month at 100M embeddings/month), which is incompatible with a bootstrap-plausible funding trajectory and inconsistent with the own-the-substrate pattern. An in-process local model reuses worker compute and costs $0 additional at alpha scale; the upgrade ladder (D11) keeps the door open if quality, scale, or contractual demands change.

Decision

D1 — Canonical model: BAAI/bge-small-en-v1.5 at 384-dim, in-process via FastEmbed

  • 33M params, ~120 MB footprint, Apache 2.0 licensed
  • CPU inference 5–15 ms per embedding via FastEmbed (ONNX-backed, ~3× faster than raw transformers)
  • Loaded in workers (and in apps/api for query-time embedding)
  • MTEB average ~62 (retrieval subset ~52), adequate for all four alpha use cases
  • Migration to cloud or larger local model is a re-embedding job (<$200 at 1–10M vectors), not a re-architecture

D2 — Single canonical model across every context and every purpose

Enforced by the EmbeddingProfileResolver Protocol in spectral.core.embeddings.protocol — all contexts consume the resolution surface; direct model-ID literals in per-context code are a discouraged pattern. Rationale: vector-space unity costs nothing today (no cross-purpose vector comparisons in alpha) and preserves future optionality (rule↔memory retrieval becomes possible without re-embedding).

D3 — Embedding profile config in core.embedding_profile

One active row per org (partial unique index on org_id WHERE deactivated_at IS NULL). Columns: id, org_id, provider, model, model_version, dimension, created_at, activated_at, deactivated_at. Append-only — rotation sets deactivated_at on the previous row and inserts a new active one. Audit trail for “which vectors belong to which profile.”
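
The one-active-row-per-org rule and the append-only rotation can be sketched as follows. This is a minimal illustration, not the actual migration: SQLite (which also supports partial unique indexes) stands in for Postgres, and the table name without a schema, the `rotate_profile` helper, and the `"1.5"` version strings are all assumptions made for the example.

```python
import sqlite3

# Assumption: SQLite stands in for Postgres; columns follow the D3 list.
DDL = """
CREATE TABLE embedding_profile (
    id INTEGER PRIMARY KEY,
    org_id TEXT NOT NULL,
    provider TEXT NOT NULL,
    model TEXT NOT NULL,
    model_version TEXT NOT NULL,
    dimension INTEGER NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    activated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    deactivated_at TEXT
);
-- Partial unique index: at most one row per org with deactivated_at IS NULL.
CREATE UNIQUE INDEX one_active_profile_per_org
    ON embedding_profile (org_id) WHERE deactivated_at IS NULL;
"""


def rotate_profile(conn, org_id, provider, model, model_version, dimension):
    """Hypothetical helper: retire the active row (if any), insert a new one."""
    with conn:  # single transaction: both statements commit or neither does
        conn.execute(
            "UPDATE embedding_profile SET deactivated_at = CURRENT_TIMESTAMP "
            "WHERE org_id = ? AND deactivated_at IS NULL",
            (org_id,),
        )
        cur = conn.execute(
            "INSERT INTO embedding_profile "
            "(org_id, provider, model, model_version, dimension) "
            "VALUES (?, ?, ?, ?, ?)",
            (org_id, provider, model, model_version, dimension),
        )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
first = rotate_profile(conn, "org-1", "fastembed", "BAAI/bge-small-en-v1.5", "1.5", 384)
second = rotate_profile(conn, "org-1", "fastembed", "BAAI/bge-large-en-v1.5", "1.5", 1024)
active = conn.execute(
    "SELECT id, dimension FROM embedding_profile "
    "WHERE org_id = ? AND deactivated_at IS NULL",
    ("org-1",),
).fetchall()
total_rows = conn.execute("SELECT COUNT(*) FROM embedding_profile").fetchone()[0]
```

After two rotations the table holds both rows, but only the second is active. A direct insert of another active row for the same org is rejected by the index, which is the property the audit trail relies on.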

D4 — Re-embedding lifecycle is blue-green and event-driven

When the canonical profile rotates:

  1. A migration adds embedding_v2 vector(<new_dim>) (pgvector allows multiple vector columns per table).
  2. An EmbeddingProfileRotated domain event triggers a backfill worker.
  3. The worker re-embeds source content in batches into the new column.
  4. A feature-flagged read path flips when backfill hits 100%.
  5. A follow-up migration drops the old column and index.
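
The backfill and read-path flip can be illustrated with an in-memory sketch. Everything here is hypothetical scaffolding: `Row`, `embed_v2`, `backfill`, and `read_column` are names invented for the example, and the placeholder embedder produces no meaningful vectors. The point is only the control flow: batches fill the new column, and reads stay on the old column until progress reaches 100%.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Row:
    content: str
    embedding: list[float]                       # old ("blue") column
    embedding_v2: Optional[list[float]] = None   # new ("green") column


def embed_v2(text: str) -> list[float]:
    """Placeholder for the new profile's model; not a real embedding."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def backfill(rows: list[Row], batch_size: int = 2) -> Iterator[float]:
    """Fill embedding_v2 batch by batch; yield progress in [0, 1] per batch."""
    pending = [r for r in rows if r.embedding_v2 is None]
    done = len(rows) - len(pending)
    for start in range(0, len(pending), batch_size):
        for row in pending[start:start + batch_size]:
            row.embedding_v2 = embed_v2(row.content)
            done += 1
        yield done / len(rows)


def read_column(progress: float) -> str:
    """Flag rule from D4: reads flip only once the backfill is complete."""
    return "embedding_v2" if progress >= 1.0 else "embedding"


rows = [Row(c, [0.0]) for c in ("a", "bb", "ccc", "dddd", "eeeee")]
progress_log = list(backfill(rows, batch_size=2))
columns_read = [read_column(p) for p in progress_log]
```

Because `backfill` skips rows that already have `embedding_v2`, rerunning it after an interruption resumes rather than restarts. That restartability is a design choice of this sketch, not something the ADR specifies.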

D5 — EmbeddingProvider protocol in spectral.core.embeddings.protocol

from typing import Protocol, runtime_checkable

@runtime_checkable
class EmbeddingProvider(Protocol):
    async def embed(self, texts: list[str], *, profile: EmbeddingProfile) -> list[Embedding]: ...

The context-agnostic concrete EmbeddingProvider implementations — which embed text and know nothing of any context’s tables — live in core.embeddings.infrastructure per ADR-099 (alpha: the FastEmbed-backed provider at core.embeddings.infrastructure.fastembed, the canonical first inhabitant of the core infrastructure zone; later: TEIProvider, GeminiProvider, OpenAIProvider). The DB-coupled embeddings machinery — the profile-resolver (D2/D3) and the hybrid-retriever (D8) — reads worlds tables and encodes worlds retrieval policy, so it stays in worlds.infrastructure: it fails the ADR-099 killer test (a context-agnostic concretion used by more than one context). A TenantScopedEmbeddingProvider wrapper applies the ADR-035 D5/D6 rate-limit + budget envelope.
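
To make the layering concrete, here is a minimal sketch of a provider that satisfies the protocol, plus a wrapper in the spirit of TenantScopedEmbeddingProvider. The `Embedding` alias, the `EmbeddingProfile` fields, `HashingProvider`, `BudgetExceeded`, and the simple text-count budget are all assumptions for illustration; the real wrapper applies the ADR-035 D5/D6 envelope, which is not reproduced here.

```python
import asyncio
import hashlib
from dataclasses import dataclass
from typing import Protocol, runtime_checkable

Embedding = list[float]  # assumption: the real type may be richer


@dataclass(frozen=True)
class EmbeddingProfile:  # assumption: a subset of the D3 columns
    provider: str
    model: str
    dimension: int


@runtime_checkable
class EmbeddingProvider(Protocol):
    async def embed(self, texts: list[str], *, profile: EmbeddingProfile) -> list[Embedding]: ...


class HashingProvider:
    """Toy stand-in for a real provider: deterministic, not semantic."""

    async def embed(self, texts: list[str], *, profile: EmbeddingProfile) -> list[Embedding]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Cycle through the 32 digest bytes to reach the profile dimension.
            vectors.append([digest[i % len(digest)] / 255.0 for i in range(profile.dimension)])
        return vectors


class BudgetExceeded(RuntimeError):
    pass


class TenantScopedEmbeddingProvider:
    """Wraps any provider; enforces a hypothetical per-tenant text budget."""

    def __init__(self, inner: EmbeddingProvider, max_texts: int) -> None:
        self._inner = inner
        self._remaining = max_texts

    async def embed(self, texts: list[str], *, profile: EmbeddingProfile) -> list[Embedding]:
        if len(texts) > self._remaining:
            raise BudgetExceeded(f"{len(texts)} texts requested, {self._remaining} left")
        self._remaining -= len(texts)
        return await self._inner.embed(texts, profile=profile)


profile = EmbeddingProfile("fastembed", "BAAI/bge-small-en-v1.5", 384)
provider = TenantScopedEmbeddingProvider(HashingProvider(), max_texts=2)
vectors = asyncio.run(provider.embed(["rule R-17", "form 1040"], profile=profile))
```

Because the wrapper itself satisfies the protocol, callers cannot tell whether they hold a bare provider or a scoped one, which is what lets the envelope be applied uniformly.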

D6 — Rate limit and budget accounting piggyback on ADR-035

Each batch embedding call writes one core.llm_usage row: model=bge-small-en-v1.5, input_tokens=<total>, output_tokens=0, purpose=EMBEDDING, content_class=<from caller>. No new schema. In-process calls still emit the row for consistent cost attribution and audit.
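
A sketch of the one-row-per-batch accounting, assuming a simplified record type. `LLMUsageRow` is a hypothetical subset of the core.llm_usage columns named above, and `count_tokens` uses a whitespace split as a cheap stand-in for the model tokenizer, so its counts will differ from real token counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMUsageRow:  # hypothetical subset of core.llm_usage columns
    model: str
    input_tokens: int
    output_tokens: int
    purpose: str
    content_class: str


def count_tokens(text: str) -> int:
    """Assumption: whitespace split approximates the tokenizer."""
    return len(text.split())


def usage_row_for_batch(texts: list[str], content_class: str) -> LLMUsageRow:
    """One row per batch call: summed input tokens, zero output tokens."""
    return LLMUsageRow(
        model="bge-small-en-v1.5",
        input_tokens=sum(count_tokens(t) for t in texts),
        output_tokens=0,
        purpose="EMBEDDING",
        content_class=content_class,
    )


row = usage_row_for_batch(["rule R-17 fired", "form 1040 rejected"], "WORLDS")
```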

D7 — Content-class routing follows ADR-036 D6

The same PLATFORM / WORLDS / OPERATIONS / SYNTHETIC taxonomy applies to embedding calls. In-process embedding means PLATFORM content never leaves the worker — sovereignty posture is strictly stronger than cloud-API. Content-class is still tagged on the core.llm_usage row for audit.

D8 — Hybrid retrieval via RRF is the standard pattern

Every retrievable table carries both a vector(<dim>) column (semantic, HNSW-indexed) and a tsvector column (lexical, GIN-indexed). Retrieval helpers in spectral.core.embeddings.retrieval fuse via Reciprocal Rank Fusion with k=60 (the standard constant). Pure vector similarity misses exact matches on domain vocabulary (rule IDs, form codes, error strings); RRF consistently outperforms single-method retrieval. Built on vanilla Postgres FTS plus pgvector, zero additional extensions.
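
Reciprocal Rank Fusion itself is small enough to show in full. The sketch below is not the helper in spectral.core.embeddings.retrieval; `rrf_fuse` and the document IDs are hypothetical. Each input is a ranked list of IDs (best first), and a document's fused score is the sum of 1 / (k + rank) over every list in which it appears.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion (ranks start at 1)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


semantic = ["doc-a", "doc-b", "doc-c"]   # e.g. from the HNSW vector query
lexical = ["doc-c", "doc-a", "doc-d"]    # e.g. from the tsvector query
fused_ids = [doc_id for doc_id, _ in rrf_fuse([semantic, lexical])]
```

Here `doc-a` and `doc-c` rise to the top because both retrievers return them, while `doc-b` and `doc-d` each appear in only one list. RRF uses ranks rather than raw scores, so cosine distances and ts_rank values never need to be normalized against each other.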

D9 — Retrievable-table convention

Schema rule shared across contexts, enforced by code review and a post-alpha migration-naming lint extension:

  • embedding <vector|halfvec>(<dim>) + HNSW index using vector_cosine_ops (halfvec_cosine_ops for halfvec columns)
  • embedding_model TEXT NOT NULL
  • embedding_model_version TEXT NOT NULL
  • embedding_dim INT NOT NULL
  • source_content_hash TEXT NULL (re-embed skip-if-unchanged)
  • search_tsv tsvector generated from relevant text columns + GIN index
  • search_lang TEXT DEFAULT 'english' (multilingual-future)
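
The skip-if-unchanged role of source_content_hash can be sketched as below. The ADR does not name a hash function, so SHA-256 is an assumption here, as are the `content_hash` and `needs_reembed` helper names and the example version strings.

```python
import hashlib
from typing import Optional


def content_hash(text: str) -> str:
    """Assumption: SHA-256 hex digest of the UTF-8 source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reembed(
    text: str,
    stored_hash: Optional[str],
    stored_model_version: str,
    active_model_version: str,
) -> bool:
    """Re-embed when never hashed, when the model changed, or when text changed."""
    if stored_hash is None:
        return True
    if stored_model_version != active_model_version:
        return True
    return content_hash(text) != stored_hash


original = "Rule R-17 rejects forms without a signature."
stored = content_hash(original)
unchanged = needs_reembed(original, stored, "1.5", "1.5")
edited = needs_reembed(original + " Updated.", stored, "1.5", "1.5")
rotated = needs_reembed(original, stored, "1.5", "2.0")
```

Checking the stored model version alongside the hash matters: an unchanged hash only proves the text is the same, not that the existing vector came from the active profile.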

D10 — HNSW defaults: m=16, ef_construction=64 at build; tune ef_search per query

Supabase-standard. Alpha (≤1M rows): 4–8 GB maintenance_work_mem at build. Revisit at 10M+ rows; consider tenant-partitioned indexes if RLS-scoped query planner regressions surface.
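
For reference, these settings map onto pgvector's index and session syntax roughly as follows. The helpers below only build SQL strings; `hnsw_index_sql`, `ef_search_sql`, the index-naming scheme, and the `worlds.rule_candidate` table name are hypothetical, and the identifiers are interpolated without quoting, so this is suitable for trusted migration code only.

```python
def hnsw_index_sql(
    table: str,
    column: str = "embedding",
    opclass: str = "vector_cosine_ops",
    m: int = 16,
    ef_construction: int = 64,
) -> str:
    """Build a CREATE INDEX statement with the D10 build-time defaults."""
    # Index names cannot be schema-qualified, so drop any schema prefix.
    index_name = f"{table.split('.')[-1]}_{column}_hnsw"
    return (
        f"CREATE INDEX {index_name} ON {table} "
        f"USING hnsw ({column} {opclass}) "
        f"WITH (m = {int(m)}, ef_construction = {int(ef_construction)});"
    )


def ef_search_sql(ef_search: int) -> str:
    """Per-transaction recall/latency knob (pgvector's default is 40)."""
    return f"SET LOCAL hnsw.ef_search = {int(ef_search)};"


build_sql = hnsw_index_sql("worlds.rule_candidate")
query_sql = ef_search_sql(100)
```

`SET LOCAL` scopes the override to the current transaction, which keeps per-query tuning from leaking into other queries that share a pooled connection.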

D11 — Fallback upgrade ladder

Each step is a re-embedding job, not a re-architecture:

  1. Quality insufficient → upgrade to BAAI/bge-large-en-v1.5 (335M params, 1024-dim, ~1.3 GB; still in-process if worker RAM allows).
  2. Worker RAM pressure OR embedding volume outpaces in-process → run huggingface/text-embeddings-inference as a dedicated Cloudflare Container, off the in-process worker (one compute vendor per ADR-109; frees worker memory).
  3. Quality demands frontier OR enterprise DPA demands a named provider → swap to a cloud API (Gemini gemini-embedding-001 or OpenAI text-embedding-3-large).

All three steps: swap provider under the EmbeddingProvider protocol → re-embed via EmbeddingProfileRotated event → cut over. The upgrade-ladder provider implementations are themselves context-agnostic concretions and live alongside the FastEmbed provider in core.embeddings.infrastructure per ADR-099 (the TEI sidecar client, GeminiProvider, OpenAIProvider); the worlds-coupled resolver/retriever that uses whichever provider is active does not move.

Alternatives considered

Gemini gemini-embedding-001 cloud. Rejected: sets a cost-scaling track incompatible with the funding trajectory. $150/month at 1M emb/month is trivial; $15K/month at 100M emb/month is real. In-process reuses existing worker compute at zero additional cost.
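
The scaling arithmetic behind this rejection is linear in volume. As a rough sketch, using the $150 per 1M embeddings implied by the figures above (the function name is hypothetical, and the rate is the ADR's estimate rather than a quoted vendor price):

```python
def monthly_cost_usd(embeddings_per_month: int, usd_per_million: float) -> float:
    """Linear cost model: volume in millions times the per-million rate."""
    return embeddings_per_month / 1_000_000 * usd_per_million


at_1m = monthly_cost_usd(1_000_000, 150.0)
at_100m = monthly_cost_usd(100_000_000, 150.0)
```

The same function with a $60 rate reproduces the lower end of the $6–15K/month range quoted in the Context section.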

OpenAI text-embedding-3-large cloud. Same rejection reasoning.

Voyage-3. MongoDB acquisition trajectory; the API is becoming an Atlas Vector Search feature.

Cohere embed-v4. No GCP availability; cross-cloud friction.

Local TEI sidecar (dedicated Cloudflare Container) from day one. Correct upgrade target (D11 step 2) but unnecessary infrastructure at alpha when in-process works. Premature.

Larger in-process models (BGE-large, nomic-embed-text-v1.5, Qwen3-Embedding). Upgrade targets (D11 step 1). Footprint/quality trade rejected for alpha; BGE-small is the right size for current worker sizing.

Per-purpose different models. Fragmentation risk; forecloses cross-purpose retrieval; against the single-canonical discipline elsewhere.

Embeddings-only, no FTS. Loses exact-match recall on domain vocabulary. RRF hybrid is strictly stronger at negligible cost.

External FTS service (Elasticsearch, Meilisearch, Typesense). Overkill; Postgres native FTS handles our scale fine, zero new infra.

Consequences

  • Embedding-based retrieval has a canonical model — the downstream retrieval consumers (memory, worlds) have one model to target.
  • Unblocks agent memory ADRs (ADR-058 / ADR-043): the EmbeddingProvider protocol and RRF retrieval helper are available.
  • core.embedding_profile is the second core schema table (after core.llm_usage).
  • pgvector storage at alpha: 1M rows × vector(384) ≈ 1.5 GB raw + 2.25 GB HNSW ≈ 3.75 GB total. Comfortable in alpha Supabase instance.
  • Worker footprint: in-process BGE-small adds ~120 MB per worker. Negligible. If a future canonical model upgrade pushes in-process footprint past ~500 MB (e.g., BGE-large at ~1.3 GB), ADR-109 / ADR-049 should re-evaluate whether to subdivide workers by workload profile.
  • Sovereignty posture is strictly stronger than a cloud-API approach. Customer content never leaves the worker process for embedding. No subprocessor added for this capability.
  • Cost trajectory is flat, not stepped. No 100M-embedding cliff scenario. Upgrade path (D11) adds cost incrementally when triggered.
  • D9 schema convention applies to every future retrievable table: Worlds rule candidates, world-model artifacts, T3 memory items, decision audit records. Enforce it in code review until a post-alpha migration-naming lint extension can check it mechanically.
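
The storage figure in the list above can be reproduced with a short calculation. pgvector stores a `vector` value as 4 bytes per dimension plus 8 bytes of header; the 1.5× HNSW-to-raw ratio is an assumption inferred from the ADR's own numbers (2.25 GB over 1.5 GB), and the function names are hypothetical. Row and page overhead are ignored.

```python
def vector_bytes(dim: int) -> int:
    """Size of one pgvector `vector` value: 4 bytes per dimension + 8."""
    return 4 * dim + 8


def storage_estimate_gb(rows: int, dim: int, hnsw_ratio: float = 1.5) -> dict[str, float]:
    """Rough estimate in decimal GB; hnsw_ratio is an assumed multiplier."""
    raw = rows * vector_bytes(dim) / 1e9
    index = raw * hnsw_ratio
    return {"raw_gb": raw, "hnsw_gb": index, "total_gb": raw + index}


alpha = storage_estimate_gb(rows=1_000_000, dim=384)
large = storage_estimate_gb(rows=1_000_000, dim=1024)
```

At 384 dimensions this gives about 1.54 GB raw and about 3.9 GB in total, in line with the rounded estimate above. At 1024 dimensions (BGE-large) the same million rows come to roughly 4.1 GB raw, which is worth keeping in mind when evaluating D11 step 1.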

References

  • ADR-065 — spectral.core admission discipline
  • ADR-099 — core infrastructure zone; D5/D11 provider placement (FastEmbed + ladder concretions in core.embeddings.infrastructure; worlds resolver/retriever in worlds.infrastructure)
  • ADR-031 — single-library structure
  • ADR-032 — pgvector store; core schema
  • ADR-035 — PurposeKey.EMBEDDING; LLMUsageRecord; rate-limit + budget pattern
  • ADR-036 — content-class taxonomy; core.llm_usage shape
  • ADR-043 — TA-14 memory consumer
  • ADR-058 — TA-12 retrieval consumer
  • TA-11 disposition — SPEC-314 comment 568fe106
  • TA-11 verification — SPEC-314 comment 993aae10
  • src/spectral/core/embeddings/ — the embeddings contract surface
  • supabase/migrations/20260421012800_core_embedding_profile.sql — core.embedding_profile
  • Codex system-design/agents/embeddings.mdx — close-pass new page
  • docs/runbooks/embeddings.md — upgrade-ladder + rotation playbook