Bi-Encoders: The Foundation of Scalable Semantic Search
In the previous chapter, we established the power of embeddings: we can represent the meaning of any text as a vector in a high-dimensional space. The logical next step seems simple: to find relevant documents for a query, we generate a vector for the query, generate vectors for every document in our corpus, and calculate the cosine similarity between the query and each document. We then sort the results and return the top-scoring documents.
This "brute-force" approach is conceptually sound, but from a systems perspective, it fails catastrophically at scale. Consider your corpus of ~25,000 documents. A single query would require ~25,000 embedding generations and ~25,000 similarity calculations, all while the user is waiting. Now imagine a real-world corpus with millions or billions of documents. Comparing a query vector to every single document vector at search time is computationally infeasible. The latency would be measured in minutes, not milliseconds.
To move from theory to a practical, real-time application, we need a more elegant solution. That solution is the bi-encoder architecture.
2. The Bi-Encoder Architecture: The "Two Towers" Model
The bi-encoder is a neural network architecture designed specifically for high-speed semantic retrieval. The "bi" in its name is the key: it uses two functionally separate, independent encoders, often visualized as a "Two Towers" model.
- The Query Tower: This is a Transformer encoder whose sole purpose is to accept an incoming user query and rapidly convert it into a query embedding.
- The Document Tower: This is another Transformer encoder (often using the exact same model weights as the query tower) whose purpose is to process the documents within your corpus and convert them into document embeddings.
The most critical feature of this design, and the secret to its performance, is that the two towers operate in complete isolation. The query encoder produces its embedding without ever seeing a single document. The document encoder produces its embeddings without ever seeing the query. This strict separation allows us to decouple the most intensive computations from the real-time search process.
3. The Two-Phase Process: Decoupling for Speed
The genius of the bi-encoder lies in its ability to split the retrieval workload into two distinct phases: a heavy, one-time offline phase and a lightning-fast online phase.
Phase 1: Offline Indexing (The Upfront Cost)
This is a preparatory, pre-computation step that is performed once, long before any user queries arrive. The process is straightforward:
- You take your entire corpus—in your case, all 24,977 insurance answers.
- You pass every single document through the document encoder tower to generate its corresponding embedding.
- These resulting document embeddings are then stored in a specialized, high-performance vector index: a data structure optimized for fast vector similarity search (e.g., FAISS, HNSW).
This is the computationally expensive part of the process, but because it happens offline, it doesn't impact the user's search experience. You are essentially "pre-baking" the most demanding work. Once the index is built, it can be queried millions of times.
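A minimal sketch of this offline phase, assuming the sentence-transformers and faiss libraries; the checkpoint, the toy corpus, and the index filename are illustrative (in the capstone, the corpus would be the 24,977 answers and the model one of those listed at the end of this chapter):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # illustrative checkpoint

corpus = [
    "Hail damage to the dwelling is covered under Coverage A.",
    "Flood damage requires a separate flood insurance policy.",
    "Water damage from a sudden burst pipe is typically covered.",
]  # in practice: the full list of 24,977 insurance answers

# The one-time, offline cost: pass every document through the document tower.
doc_embeddings = model.encode(corpus, batch_size=64, show_progress_bar=True)
doc_embeddings = np.asarray(doc_embeddings, dtype="float32")
faiss.normalize_L2(doc_embeddings)  # unit-length vectors: inner product == cosine similarity

# A flat (exact) index is fine at this scale; ANN index types such as HNSW
# trade a little accuracy for much faster search on very large corpora.
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(doc_embeddings)
faiss.write_index(index, "insurance_answers.faiss")
```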
Phase 2: Online Querying (The Real-Time Search)
This is what happens in real time when a user submits a search request. The system only needs to perform two very fast operations:
- The (typically short) query text is passed through the query encoder to generate a single query embedding.
- This query embedding is then used to search the pre-built vector index. Using highly efficient Approximate Nearest Neighbor (ANN) algorithms, the index can compare the query vector against millions of document vectors and return the top-K most similar results (those with the highest cosine similarity) almost instantly.
The entire online process, from receiving the query to returning a list of relevant document IDs, is typically completed in milliseconds.
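Continuing the sketch above, the online phase reduces to two cheap operations: encode the query with the same model and search the pre-built index (the filename and top-K value are illustrative):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # must match the model used for indexing
index = faiss.read_index("insurance_answers.faiss")    # built once, offline

def retrieve(query: str, k: int = 50) -> list[tuple[int, float]]:
    """Encode one query and return the top-k (document_id, score) pairs."""
    q = np.asarray(model.encode([query]), dtype="float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)  # the millisecond-scale similarity search
    return list(zip(ids[0].tolist(), scores[0].tolist()))

print(retrieve("Does my homeowners policy cover water damage?", k=3))
```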
4. The Defining Trade-Off: Speed vs. Precision
The bi-encoder architecture is a brilliant piece of engineering, but it comes with a fundamental trade-off that you must understand.
- The Pro (Massive Speed): As we've seen, the bi-encoder is incredibly fast. By handling the expensive document encoding process offline, it minimizes real-time computation and enables semantic search at a massive scale.
- The Con (Limited Precision): The very source of the bi-encoder's speed, the strict separation of the query and document towers, is also the source of its primary weakness. Because the model never evaluates the query and a document together, it cannot model the fine-grained interactions between their specific words and phrases. It produces a holistic "gist" embedding for the query, another for the document, and then compares the two; it cannot see, for instance, that a specific keyword in the query is negated in the document. This lack of direct interaction inherently limits the precision it can achieve compared to architectures that analyze the query and document jointly.
5. Role in Your Capstone: The High-Recall First Stage
This trade-off profile—blazing speed in exchange for good-but-not-perfect precision—makes the bi-encoder the ideal architecture for Stage 1: Candidate Retrieval in a modern, multi-stage RAG pipeline.
The job of the bi-encoder in this first stage is not to find the single, perfect answer. Its job is to achieve high recall. This means its primary goal is to cast a wide but intelligent net, drastically reducing the search space from tens of thousands of documents down to a small, manageable set of the most plausible candidates (e.g., the top 50 or 100). The key is to ensure the correct answer is somewhere in this candidate list.
This highly relevant but imperfectly ranked list is then passed to a much slower, more powerful, and more precise model for a final re-ranking. This second-stage model, which we will discuss next, can afford to be computationally expensive because it only has to analyze a few dozen documents, not the entire corpus.
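One standard way to make this Stage 1 goal measurable is Recall@K: the fraction of queries for which at least one correct answer appears among the top-K retrieved candidates. A minimal sketch, with made-up query and document IDs:

```python
def recall_at_k(ranked: dict[str, list[str]], gold: dict[str, set[str]], k: int) -> float:
    """Fraction of queries whose top-k retrieved document IDs contain at least one gold answer."""
    hits = sum(1 for qid, doc_ids in ranked.items() if gold[qid] & set(doc_ids[:k]))
    return hits / len(ranked)

# Illustrative usage: query q1 is a hit at k=3, q2 is a miss.
ranked = {"q1": ["d3", "d7", "d1"], "q2": ["d9", "d2", "d8"]}
gold = {"q1": {"d1"}, "q2": {"d4"}}
print(recall_at_k(ranked, gold, k=3))  # 0.5
```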
It is crucial to recognize that all the dense retrieval models you will be evaluating in this project—from the zero-shot baselines like bge-small-en-v1.5 and msmarco-MiniLM-L-6-v3 to your own fine-tuned gte-modernbert-base—are being used within this bi-encoder architecture to power this critical first stage of retrieval.