Skip to main content

Hybrid Retrieval

Reciprocal Rank Fusion (RRF) combines two scorers so the agent handles both exact-name queries and paraphrased queries well.

User message
Token-recall
Tokenize query (stopword-filtered). Score each record by|query record| / |query|. Keep 0.33.
Proper nounsMedication namesShort queries
Embedding cosine
Compute query embedding. Cosine similarity against stored embeddings. Keep 0.5.
StemmingSynonymsParaphrase
RRF fusionscore = Σ 1/(k + rank), k=60
Top-k WorkingMemoryEntry dicts

Why hybrid?

Neither scorer alone is robust enough for therapy content:

ScorerWins onLoses on
Token-recallProper nouns ("Sarah"), medication names ("fluoxetine"), short queries, multilingual text (CJK, Cyrillic, accented Latin)Stemming ("anxiety" vs "anxious"), synonyms ("sibling" vs "sister"), paraphrase
EmbeddingStemming, synonyms, semantic paraphrase ("I feel stuck" vs "things feel hopeless")Proper nouns (name signal diluted), short queries (too little context)
Hybrid RRFBoth — fuses by rank position, not raw scoreNothing significant
Unicode-aware tokenization

The tokenizer uses \b\w+\b (Python 3 Unicode-aware) with a CJK character-splitting post-processor. Chinese, Japanese, Korean, Cyrillic, and accented Latin text all produce meaningful token sets for both dedup and retrieval. CJK characters are emitted individually (standard for CJK IR without a word segmenter), covering BMP and astral-plane Extensions B through H.

RRF's constant k=60 requires no per-dataset tuning (Cormack et al. 2009).


Fallback paths

ScenarioWhat happens
No embedding providerPure token-recall (the pre-embedding behavior)
Embedding API failureLogged, falls back to token-recall for this turn
Record has no embeddingParticipates in token-recall only
Model mismatchRecord skipped in embedding scan

The retrieval_path diagnostic reports which path ran: "hybrid_rrf", "token_recall", or "token_recall_after_embed_error".


Episodic date filter

Query-based episodic retrieval applies a max_age_days=30 filter — session arcs older than 30 days are excluded from the search results. This keeps the agent focused on recent context.

The first-turn catch-up path (alatest) is not date-filtered — the most recent session summary always appears regardless of age. This ensures every new session opens with continuity even if the user hasn't visited in months.


Embedding storage

ColumnTypePurpose
embeddingBLOBfloat32 array via struct.pack
embedding_dimINTEGERDimensionality validation
embedding_modelTEXTModel migration detection

Default provider: Gemini gemini-embedding-001 (3072 dims). Falls back to NullEmbeddingProvider when no API key is set.


Eval harness

# Token-recall baseline (no API key needed)
uv run python eval/runners/retrieval_eval.py --mode token-only

# Compare all three scorers (requires API key)
uv run python eval/runners/retrieval_eval.py --mode hybrid --verbose

17 hand-curated cases across 6 categories. Output is a scorer comparison matrix showing recall@1 / recall@5 per category.