Part 4: Beyond Semantic Search: Implementing Hybrid Retrieval with BM25

Part 4: Beyond Semantic Search: Implementing Hybrid Retrieval with BM25
Imagine you are building a mission-critical RAG (Retrieval-Augmented Generation) system for a global logistics company. A user asks: "What is the status of shipment INC-2023-Q4-011?"
Your vector database, powered by the latest OpenAI or Claude embeddings, starts crunching the numbers. It traverses the high-dimensional manifold, looking for concepts related to "shipment status" and "incident reports." It returns three beautifully relevant chunks about general shipping delays and maritime law. But there is one problem: none of them contain the string INC-2023-Q4-011.
This is the "Semantic Search Trap." While embeddings are magical at understanding that "liquid" and "fluid" are related, they are notoriously terrible at finding specific, rare identifiers—like product SKUs, incident IDs, or medical codes. To build a production-grade system, you need more than just vectors. You need Hybrid Retrieval.
The Geometry of Failure: Why Vectors Miss the Mark
To understand why your multi-million dollar vector store failed to find a simple ID, we have to look at how embeddings work.
An embedding model compresses the "meaning" of a text into a dense vector (e.g., 1,536 dimensions). In this flattened geometry, the distance between words is determined by their semantic context in the pre-training data. "Shipment" and "Delivery" are neighbors. However, a unique alphanumeric ID like INC-2023-Q4-011 is essentially a "token outlier."
Because that specific ID likely never appeared in the model's training set with a consistent semantic meaning, it gets mapped to a relatively generic area of the vector space. The model "smooths over" the exact characters in favor of the overall vibe of the sentence. In high-performance information retrieval, "vibe" is not enough.

Lexical Search: The Return of the King
If semantic search is the "intuition" of your system, lexical search is the "memory." Lexical search (or keyword search) doesn't care about meaning; it cares about characters.
For decades, the industry standard for lexical search has been BM25 (Best Match 25). It is a probabilistic refinement of the classic TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. While it might feel "old school" compared to neural embeddings, BM25 is remarkably effective at finding the exact keys that semantic search forgets.
How BM25 Works: A Step-by-Step Breakdown
BM25 works by assigning a score to a document based on how many terms from the query appear in that document, weighted by the rarity of those terms across the entire collection.
- Tokenization: The query "What happened with INC-2023-Q4-011?" is broken into tokens:
["What", "happened", "with", "INC-2023-Q4-011"]. - Term Frequency (TF): The algorithm counts how many times
INC-2023-Q4-011appears in Document A. If it appears five times, Document A gets a higher score. - Inverse Document Frequency (IDF): This is the secret sauce. The word "What" appears in 99% of your documents, so its weight is nearly zero. The term
INC-2023-Q4-011appears in only 1 document, so its weight is massive. - Length Normalization: BM25 penalizes extremely long documents. If a 500-page book mentions your ID once, it's less relevant than a 1-page report that mentions it once.

Mathematical Foundations: The BM25 Formula
To truly master retrieval, you must look at the math beneath the hood. The score of a document $D$ for a query $Q$ is calculated as:
$$\text{score}(D, Q) = \sum_{q_i \in Q} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$
Where:
- $f(q_i, D)$ is the frequency of the term in the document.
- $|D|$ is the length of the document.
- $\text{avgdl}$ is the average length of all documents in the collection.
- $k_1$ and $b$ are tuning parameters (usually $k_1 \in [1.2, 2.0]$ and $b = 0.75$).
As $k_1$ increases, the importance of term frequency increases. As $b$ increases, the penalty for long documents becomes more severe.
Implementation: Building a Hybrid Searcher
In a production environment, you don't choose between BM25 and Vector Search—you run them in parallel and merge the results. Below is a production-ready implementation using the rank_bm25 library alongside a conceptual vector store.
import numpy as np
from rank_bm25 import BM25Okapi
from typing import List, Dict
class HybridSearchEngine:
def __init__(self, documents: List[Dict]):
self.documents = documents
# 1. Prepare Lexical Index (BM25)
tokenized_corpus = [doc['text'].lower().split() for doc in documents]
self.bm25 = BM25Okapi(tokenized_corpus)
# 2. Assume Vector Store is initialized elsewhere
# self.vector_store = ...
def get_lexical_scores(self, query: str) -> np.ndarray:
tokenized_query = query.lower().split()
return self.bm25.get_scores(tokenized_query)
def search(self, query: str, top_k: int = 5):
# Step A: Get BM25 Scores
bm25_scores = self.get_lexical_scores(query)
# Step B: Get Semantic Scores (Conceptual)
# semantic_scores = self.vector_store.get_cosine_similarities(query)
# For demo, let's assume we have them
semantic_scores = np.random.rand(len(self.documents))
# Step C: Normalize and Merge
# We use Reciprocal Rank Fusion (RRF) for robust merging
return self.reciprocal_rank_fusion(bm25_scores, semantic_scores, k=top_k)
def reciprocal_rank_fusion(self, scores_a, scores_b, k=60):
"""
Combines two sets of rankings using RRF.
RRF is better than simple averaging because it doesn't
require the scores to be on the same scale.
"""
ranks_a = np.argsort(np.argsort(-scores_a))
ranks_b = np.argsort(np.argsort(-scores_b))
rrf_scores = (1.0 / (k + ranks_a)) + (1.0 / (k + ranks_b))
# Get final top indices
top_indices = np.argsort(-rrf_scores)
return [self.documents[i] for i in top_indices[:k]]
The "Merging" Problem: Reciprocal Rank Fusion (RRF)
How do you combine a BM25 score (which ranges from 0 to 50+) with a Cosine Similarity score (which ranges from 0 to 1)? If you just add them together, the BM25 score will always win, effectively deleting the "meaning" you worked so hard to capture with your embeddings.
The industry solution is Reciprocal Rank Fusion (RRF).
RRF doesn't care about the value of the score; it only cares about the rank. If a document is #1 in BM25 and #100 in Vector, it gets a high combined score. If a document is #3 in both, it might actually beat the #1/#100 document because it is consistently relevant across both perspectives.
The formula is elegantly simple: $$\text{RRFscore}(d) = \sum_{r \in \text{Ranks}} \frac{1}{k + r(d)}$$ where $k$ is a constant (usually 60) that prevents a single high ranking from overwhelming the result.
Why This Works Better in Production
- Technical Precision: Your system can now find
ERROR_CODE_404andINC-2023-Q4-011with 100% accuracy. - Multilingual Support: BM25 works purely on characters, making it a great "safety net" for languages where your embedding model might be weak.
- Low Latency: BM25 is computationally "free" compared to large neural network inferences. Running it in parallel with your vector search adds negligible overhead.
- Bias Correction: Embedding models can sometimes have hidden biases or "blind spots" in their high-dimensional maps. BM25 acts as an unbiased, literal observer.
Conclusion: The Architecture of Truth
Building a RAG system is a journey from "it works on my machine" to "it works for my users." Moving from semantic-only to hybrid retrieval is the single most impactful transition a team can make to eliminate hallucinations caused by missing context.
By pairing the conceptual "intuition" of vectors with the literal "memory" of BM25, you create a system that is both smart and precise.
🚀 Next Up: Part 5 — The Multi-Index Architecture
Now that we've mastered the retrieval algorithms, it's time to talk about Architectures. How do you coordinate multiple search engines, handle protocols, and re-rank results at scale?



