RAG (Retrieval-Augmented Generation)
A language model's knowledge is frozen at its training cutoff. It cannot access your company's internal documentation, yesterday's news, or any data that was not in its training set. Ask it about your product's pricing policy or a recent security patch, and it will either hallucinate a plausible-sounding answer or admit it does not know.
RAG solves this by separating knowledge from the model. Instead of baking facts into model weights, RAG stores knowledge in an external database and retrieves the most relevant pieces at inference time, injecting them into the prompt. The model sees the retrieved context and generates an answer grounded in real documents — not in memory.
This gives you three properties that training alone cannot provide:
- Up-to-date answers: Update the knowledge base without retraining the model.
- Private data: Keep proprietary content out of model weights while still making it queryable.
- Source attribution: Cite the exact document each answer came from — essential for trust and auditability in enterprise applications.
The Two-Phase Architecture
RAG operates in two distinct phases that run at different times and on different infrastructure.
[Figure: The RAG system, ingestion and retrieval]
The ingestion pipeline runs offline, whenever documents are added or changed. The retrieval loop runs online, on every user query. The two phases share one artifact, the vector index, and one constraint: both must use the same embedding model. Everything else is separate infrastructure with separate scaling concerns.
Phase 1: The Ingestion Pipeline
Before any user query can be answered, your documents must be processed and indexed. This offline pipeline has three steps that must work well together: chunking, embedding, and indexing.
Step 1: Chunking
Embedding models convert text to a single fixed-size vector. If you embed an entire 50-page document, that one vector must represent everything — it becomes too coarse to match any specific question. Chunking splits documents into smaller, self-contained passages so each embedding captures a focused piece of knowledge.
The core trade-off is between too small and too large:
| Problem | Effect | Fix |
|---|---|---|
| Chunks too small (< 128 tokens) | Each chunk lacks enough context. The embedding captures a fragment of an idea, not a complete thought. Retrieval returns noisy, incomplete passages | Use at least 256 tokens; prepend the section header to each chunk to provide context |
| Chunks too large (> 1,500 tokens) | Each chunk covers too many topics. The entire section is retrieved when only one sentence was needed. Irrelevant content crowds the prompt and degrades the answer | Split at 512–1,024 tokens for most content; add 10–20% overlap at boundaries |
| Boundary splits (sentence cut in half) | A concept spanning a chunk boundary loses meaning on both sides. Retrieval returns half an explanation | Add overlap: repeat the last 50–100 tokens at the start of the next chunk so no sentence is orphaned |
| Mixed-topic chunks | A chunk covering two unrelated sections produces a blended embedding. Retrieval finds it for both topics but it answers neither well | Use semantic or structure-aware chunking that respects document sections |
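The overlap fix in the table can be sketched in a few lines. This is a minimal illustration, not a production chunker: the function name `chunk_with_overlap` is made up for this example, and a plain list of strings stands in for the output of a real tokenizer.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=100):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Assumption: a list of placeholder strings stands in for real tokens.
words = [f"w{i}" for i in range(10)]
chunks = chunk_with_overlap(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.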
Common chunking strategies:
- Fixed-size chunking: Split every N tokens regardless of sentence structure. Simple to implement. Works well with overlap. Default choice for unstructured text.
- Recursive chunking: Tries separators in order (`\n\n` → `\n` → space → character) until chunks fall under the size limit. Balances structure awareness with size control.
- Structure-aware chunking: Respects the document's native structure (Markdown headers, HTML tags, PDF sections, code blocks). Prepends the section title to each chunk so the embedding carries document context.
- Semantic chunking: Uses cosine similarity between adjacent sentence embeddings to detect topic shifts and split there. Produces variable-length chunks that respect meaning boundaries. More expensive to compute but often better quality.
Practical guidance: Start with recursive chunking at 512 tokens with 100 tokens of overlap. Switch to structure-aware or semantic chunking only when you observe retrieval quality problems: for example, retrieved passages frequently contain irrelevant content (a precision problem), or the correct chunk keeps failing to appear in results (a recall problem). Build a test set and measure retrieval recall before tuning any parameter, not after.
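Recursive chunking, as described in the list above, can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `recursive_split` is hypothetical, length is measured in characters rather than tokens, and overlap is left out to keep the example short.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", "")):
    """Split `text` into pieces of at most `max_len` characters.

    Tries each separator in order and falls back to a finer one only for
    pieces that are still too long.
    """
    if len(text) <= max_len:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    pieces = []
    for part in text.split(sep):
        if len(part) <= max_len:
            pieces.append(part)
        else:
            pieces.extend(recursive_split(part, max_len, finer))
    # Greedily merge neighbouring pieces back together while they still fit.
    merged, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


sample = "alpha beta\n\ngamma delta epsilon\nzeta"
parts = recursive_split(sample, max_len=12)
print(parts)
```

With `max_len=12` the sample is first split at the blank line, then the long second paragraph is split at the newline and finally at spaces, so the coarsest boundary that fits is always preferred.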
| Content Type | Recommended Chunk Size | Notes |
|---|---|---|
| Chat logs, FAQs | 200–400 tokens | Short, self-contained entries benefit from small chunks |
| General web text, blog posts | 256–512 tokens | Paragraph-level granularity works well |
| Technical documentation, code | 512–1,024 tokens | Larger context preserves code examples and their explanations together |
| Legal documents, academic papers | 800–1,500 tokens | Dense argument structure needs larger windows to stay coherent |
Step 2: Embedding
An embedding model converts a text chunk into a dense vector — a list of floating-point numbers (e.g., [0.23, -0.87, 0.41, ...]). The vector encodes the semantic meaning of the text: chunks with similar meaning produce vectors that are geometrically close in the vector space, regardless of whether they share any words.
This is what enables semantic search: "How do I reset my password?" and "steps to recover login credentials" produce similar vectors even though they share no keywords.
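"Geometrically close" is usually measured with cosine similarity. The sketch below shows the arithmetic only. The three-dimensional vectors are hand-made for illustration and do not come from any real embedding model; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumption: toy vectors invented for this example.
reset_password = [0.9, 0.1, 0.2]   # "How do I reset my password?"
recover_login = [0.8, 0.2, 0.3]    # "steps to recover login credentials"
pricing_policy = [0.1, 0.9, 0.1]   # an unrelated chunk about pricing

print(cosine_similarity(reset_password, recover_login))
print(cosine_similarity(reset_password, pricing_policy))
```

The first pair scores far higher than the second, which is the property a vector index exploits when it ranks chunks against a query.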
Critical rule: use the same embedding model for ingestion and querying. Each embedding model defines its own vector space, so dimensions mean different things in different models. Searching a Cohere-embedded index with an OpenAI query vector is like asking for directions in French from someone who only speaks Mandarin: the numbers might look similar but they carry entirely different meaning. If the two models produce vectors of different lengths, the database will usually reject the query. If the lengths happen to match, or you switch between versions of the same model, the search runs, returns meaningless results, and raises no error. This silent failure is catastrophic and hard to detect.
A key consequence: if you upgrade your embedding model, you must re-embed the entire corpus. There is no in-place migration path between embedding models. Plan for this before committing to one.
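One way to turn the silent mismatch into a loud one is to store the embedding model's identifier next to the index and check it on every query. The sketch below is a hypothetical guard, not any vector database's API: `VectorIndex`, `check_query`, and the model names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Toy index that remembers which embedding model produced its vectors."""
    embedding_model: str
    dimensions: int
    vectors: list = field(default_factory=list)

    def check_query(self, model_name, query_vector):
        # Fail hard instead of returning meaningless neighbours.
        if model_name != self.embedding_model:
            raise ValueError(
                f"index was built with {self.embedding_model!r}, "
                f"query was embedded with {model_name!r}"
            )
        if len(query_vector) != self.dimensions:
            raise ValueError(
                f"expected {self.dimensions} dimensions, got {len(query_vector)}"
            )


# Assumption: "model-a-v1" and "model-b-v2" are placeholder model names.
index = VectorIndex(embedding_model="model-a-v1", dimensions=3)
index.check_query("model-a-v1", [0.1, 0.2, 0.3])  # passes silently
```

Checking the name as well as the length matters because two different models can share a dimensionality, in which case a length check alone would let the query through.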
| Model | Dimensions | Notes |
|---|---|---|
| text-embedding-3-small (OpenAI) | 1,536 | Good balance of quality and cost. Supports dimension truncation |
| text-embedding-3-large (OpenAI) | 3,072 | Best OpenAI quality. Can be truncated to 256 dimensions and still outperforms the previous generation |
| all-MiniLM-L6-v2 (open source) | 384 | Lightweight and fast. Good for prototyping or resource-constrained deployments |
| all-mpnet-base-v2 (open source) | 768 | Higher quality than MiniLM. Free and self-hostable |
| Cohere Embed / Voyage AI | 768–1,024 | Strong multilingual support. Competitive on MTEB benchmarks |
Practical note on dimensions: Higher dimensionality generally means better recall, but also more storage and slower index builds. The sweet spot for most production systems is 768–1,536 dimensions — enough quality without prohibitive storage costs.
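The storage side of this trade-off is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats (4 bytes per value) and counts only the raw vectors, ignoring index overhead and metadata; the 10-million-vector corpus is an arbitrary example size.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Bytes needed for the raw vectors alone (float32 by default)."""
    return num_vectors * dimensions * bytes_per_value


# Assumption: a corpus of 10 million chunks, chosen only for illustration.
for dims in (384, 768, 1536, 3072):
    gib = raw_vector_bytes(10_000_000, dims) / 2**30
    print(f"{dims:>5} dims: {gib:6.1f} GiB")
```

At 1,536 dimensions, 10 million float32 vectors already need about 61 GB before any index structure is added, and doubling the dimensionality doubles that figure.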
Step 3: Indexing
Once embedded, vectors are stored in a vector index — a data structure optimized for one query type: given a query vector, find the K most similar stored vectors quickly. This is called Approximate Nearest Neighbor (ANN) search. The word "approximate" is intentional: finding the mathematically exact nearest neighbor across millions of vectors would require comparing the query against every stored vector, which is O(n) and too slow at scale. ANN algorithms trade a small amount of recall — typically missing only 1–2% of the true best matches — for orders-of-magnitude faster queries. This trade-off is what makes vector databases practical. The underlying algorithms (HNSW and IVFFlat) are explained in the Vector Databases section below.
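For comparison, this is what the exact search that ANN avoids looks like: every stored vector is scored against the query. The sketch is illustrative only; it assumes unit-length vectors, so the dot product serves as the similarity score, and `exact_top_k` is a name chosen for this example.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query, vectors, k):
    """Exact nearest-neighbour search: scores every stored vector, O(n)."""
    scored = ((dot(query, vec), idx) for idx, vec in enumerate(vectors))
    return [idx for _, idx in heapq.nlargest(k, scored)]


# Assumption: four toy unit vectors in two dimensions.
stored = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [0.6, 0.8]]
print(exact_top_k([1.0, 0.0], stored, k=2))
```

This is perfectly accurate and fine for a few thousand vectors. The cost grows linearly with corpus size, which is the problem ANN indexes are built to solve.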
Each vector is stored alongside its metadata: the original chunk text, a reference to the source document, page number, a timestamp, or any other fields needed for filtering and citation. Metadata filtering lets you scope queries — for example, "search only within documents tagged team: engineering" — without post-processing a full-corpus result.
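A record in the index can be pictured as a vector plus its metadata, with filters applied before similarity scoring. The sketch below is a toy model, not the API of any particular vector database: `Record`, `filtered_search`, and the field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    vector: list   # the chunk's embedding
    text: str      # original chunk text, returned for prompting and citation
    source: str    # reference to the source document
    team: str      # example metadata field used for filtering


def filtered_search(query, records, k, **filters):
    """Keep only records whose metadata matches, then rank by dot product."""
    candidates = [
        r for r in records
        if all(getattr(r, key) == value for key, value in filters.items())
    ]
    candidates.sort(
        key=lambda r: sum(q * v for q, v in zip(query, r.vector)),
        reverse=True,
    )
    return candidates[:k]


# Assumption: three hand-made records with two-dimensional vectors.
records = [
    Record([1.0, 0.0], "deploy guide", "deploy.md", "engineering"),
    Record([0.9, 0.1], "sales playbook", "sales.md", "sales"),
    Record([0.2, 0.8], "oncall runbook", "oncall.md", "engineering"),
]
hits = filtered_search([1.0, 0.0], records, k=2, team="engineering")
print([r.source for r in hits])
```

Because the filter runs before ranking, the sales playbook never enters the candidate set even though its vector is close to the query.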
Phase 2: The Retrieval Loop
When a user submits a query, the retrieval loop runs online — it must complete fast enough to feel interactive. It has three steps: embed the query using the same model used during ingestion, retrieve candidate chunks using hybrid search, and re-rank the candidates to select the highest-quality ones. The final top chunks are assembled into the prompt alongside the user's question and sent to the LLM.
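The final assembly step can be as simple as string formatting. The sketch below is one possible layout, not a recommended prompt: the instruction wording, the `[n]` citation convention, and the function name `build_prompt` are all assumptions made for illustration.

```python
def build_prompt(question, chunks, max_chunks=5):
    """Assemble a prompt from (text, source) pairs that are sorted best-first."""
    lines = ["Answer using only the context below. Cite sources as [n].", ""]
    for n, (text, source) in enumerate(chunks[:max_chunks], start=1):
        lines.append(f"[{n}] ({source}) {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Assumption: two hand-written chunks standing in for re-ranked results.
top_chunks = [
    ("Go to Settings > Security and choose Reset password.", "faq.md"),
    ("Pricing is per seat, billed monthly.", "pricing.md"),
]
print(build_prompt("How do I reset my password?", top_chunks, max_chunks=1))
```

Numbering each chunk and carrying its source through to the prompt is what makes source attribution possible in the generated answer.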
Hybrid Search: Keyword + Semantic
Neither keyword search nor semantic search alone is sufficient for production systems:
- Keyword search (BM25) excels at exact term matching — product names, error codes, specific identifiers. BM25 (Best Matching 25) is a classical ranking algorithm that scores documents based on how frequently and distinctively query terms appear in them. It returns high scores when query terms appear literally in the document, but it misses synonyms and paraphrases entirely.
- Semantic search excels at meaning-level similarity — questions, paraphrases, conceptual queries. It misses precise keyword matches when the query and the document happen to use different vocabulary.
Production RAG systems run both and fuse the results.
[Figure: Hybrid search with Reciprocal Rank Fusion]
Run keyword and semantic search independently, then combine their ranked lists using Reciprocal Rank Fusion (RRF). RRF ignores raw scores entirely and works only on rank positions, which makes it score-scale-agnostic. No calibration is needed, and the fused ranking typically outperforms either retrieval method alone.
The RRF formula in plain terms: For each document, sum 1 / (rank + 60) across every ranked list it appears in. The constant 60 smooths out the difference between high ranks — without it, rank 1 would be worth twice rank 2, over-rewarding small ranking differences at the top. Documents that appear high in multiple lists accumulate a high RRF score; documents that appear in only one list at a low rank score poorly. No knowledge of score distributions is needed, which makes RRF easy to deploy.
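The formula translates almost directly into code. In this minimal sketch the document IDs and the two input rankings are made up for illustration, and ranks are 1-based as in the description above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs by summing 1 / (rank + k)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank + k)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Assumption: toy result lists from a keyword search and a vector search.
bm25_results = ["doc_a", "doc_b", "doc_c"]
vector_results = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_results, vector_results])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

`doc_b` wins because it ranks near the top of both lists (1/62 + 1/61), narrowly ahead of `doc_a` (1/61 + 1/63), while documents that appear in only one list fall to the bottom.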
Re-ranking: From Candidate Set to Final Context
The vector side of first-stage retrieval uses bi-encoders: models that embed the query and each document independently, with no awareness of each other. (The keyword side, BM25, is not a neural model at all; it scores against a prebuilt inverted index.) This is fast and scalable: you can pre-compute document embeddings once at ingestion time and reuse them for any query. But it sacrifices accuracy, because the query and the document never interact inside the model, so relevance is reduced to the distance between two independently computed vectors.
Cross-encoders (the re-ranker) score query and document jointly — the model reads both at once and produces a direct relevance score. This is far more accurate, but each query-document pair must be scored individually, making it O(n) in the number of candidates and therefore unsuitable for large sets. In practice, it only runs on a small candidate set (typically 20–50). The two-stage pattern is:
- Fast first-stage retrieval → 20–50 candidates (milliseconds)
- Accurate cross-encoder re-ranking → top 5–10 final chunks (adds 200–500 ms)
- Inject top 5–10 chunks into the LLM prompt
Re-ranking consistently and significantly improves answer relevance compared to using raw embedding similarity scores alone. The added latency is almost always worth the quality gain in production.
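The control flow of the two-stage pattern can be sketched without any model at all. In the example below both scoring functions are deliberately crude stand-ins: word overlap plays the role of the fast first stage and Jaccard similarity plays the role of the cross-encoder. The function names and the toy corpus are assumptions made for illustration.

```python
def retrieve_then_rerank(query, corpus, first_stage_score, rerank_score,
                         n_candidates=50, n_final=5):
    """Cheap scoring over the whole corpus, expensive scoring over a shortlist."""
    candidates = sorted(
        corpus, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:n_candidates]
    # The expensive scorer only ever sees n_candidates documents.
    reranked = sorted(
        candidates, key=lambda doc: rerank_score(query, doc), reverse=True
    )
    return reranked[:n_final]


def shared_words(query, doc):
    """Stand-in for the fast first stage."""
    return len(set(query.split()) & set(doc.split()))


def jaccard(query, doc):
    """Stand-in for a cross-encoder: looks at query and document together."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)


corpus = [
    "reset your password in settings",
    "pricing policy for teams",
    "how to reset a forgotten password quickly",
    "password rules and password history",
]
top = retrieve_then_rerank("reset password", corpus, shared_words, jaccard,
                           n_candidates=3, n_final=2)
print(top)
```

The point is the shape of the pipeline: the cost of the second stage depends on `n_candidates`, not on the size of the corpus.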
Vector Databases: Under the Hood
A vector database stores high-dimensional vectors and serves one query type efficiently: find the K vectors most similar to this query vector. Doing this by brute force — comparing the query against every stored vector — is O(n) and too slow at scale. Real systems use Approximate Nearest Neighbor (ANN) algorithms that trade a small amount of recall (missing ~1–2% of the best matches) for orders-of-magnitude speedup.
Two ANN algorithms dominate production deployments: HNSW and IVFFlat.
HNSW: The Default for Production
HNSW (Hierarchical Navigable Small World) is a graph-based index. Think of it as a multi-layer highway system built on top of your vectors. The index organizes vectors into a hierarchy of layers: the top layer is sparse, containing only a few nodes connected by long-range edges — like highways that let you skip across the vector space in large jumps. Each layer below adds more nodes and shorter connections, until the bottom layer contains every vector with fine-grained, local edges — like neighborhood streets. To search, you enter at the top layer and greedily hop toward the region closest to your query vector. At each layer, once you can't get any closer, you drop down to the next denser layer and refine. By the time you reach the bottom, you've narrowed the search space so efficiently that the total number of hops is O(log N) — even across millions of vectors.
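The layer-by-layer descent can be illustrated on a toy graph. This sketch covers the search walk only and is heavily simplified: the layers are written by hand rather than built by the HNSW insertion procedure, and the search keeps a single current node, whereas real implementations track a list of candidates on the bottom layer. All names and the eight points on a line are assumptions made for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_layered_search(query, points, layers, entry):
    """Walk from the sparsest layer to the densest, hopping greedily on each."""
    current, hops = entry, 0
    for graph in layers:
        while True:
            neighbors = graph.get(current, ())
            best = min(neighbors, key=lambda n: sq_dist(points[n], query),
                       default=current)
            if sq_dist(points[best], query) >= sq_dist(points[current], query):
                break  # cannot get closer here: drop to the next layer
            current, hops = best, hops + 1
    return current, hops


# Assumption: eight points on a line, with hand-built layers.
points = {i: (float(i),) for i in range(8)}
top_layer = {0: [4], 4: [0]}                               # long-range "highways"
mid_layer = {0: [2], 2: [0, 4], 4: [2, 6], 6: [4]}
bottom_layer = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}

result = greedy_layered_search((5.2,), points,
                               [top_layer, mid_layer, bottom_layer], entry=0)
print(result)  # (5, 3)
```

The walk goes 0 → 4 on the top layer, 4 → 6 on the middle layer, and 6 → 5 on the bottom layer: three hops instead of the five a walk along the bottom layer alone would need.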
HNSW is the default choice for most RAG applications because it delivers the best balance of query speed and recall for corpora up to tens of millions of vectors. It requires no training step — vectors can be inserted or deleted incrementally without rebuilding the index, which is critical for dynamic knowledge bases where documents are frequently added or updated. Query throughput is roughly 15x higher than IVFFlat at equivalent recall, making it ideal for latency-sensitive pipelines like interactive search and real-time Q&A.
The main trade-off is memory: HNSW stores graph edges alongside vectors, using roughly 3x more RAM than IVFFlat at equivalent recall. Index build time is also slower because each insertion requires constructing graph connections. For very large corpora (100M+ vectors) where RAM is a hard constraint, disk-based indexes such as DiskANN, or quantized indexes that compress the vectors, become more practical.
IVFFlat (Inverted File Flat) takes a different approach: it uses k-means clustering to partition all vectors into groups at build time. Each group has a centroid — an average vector representing its cluster. When a query arrives, the algorithm first compares the query against all centroids to identify the nearest clusters, then searches only those nprobe clusters exhaustively, skipping the rest of the dataset entirely. IVFFlat is faster to build and uses less memory than HNSW, but it requires an upfront training step (the k-means clustering), and its recall degrades if the data distribution shifts significantly after the index is built.
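The effect of `nprobe` is easiest to see on a tiny example. The sketch below skips the k-means training step and assumes the two centroids are already known; the function names and the six two-dimensional vectors are made up for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign each vector ID to its nearest centroid (the inverted lists)."""
    lists = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Search exhaustively, but only inside the nprobe nearest clusters."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in lists[c]]
    return sorted(candidates, key=lambda idx: sq_dist(query, vectors[idx]))[:k]


# Assumption: centroids are given, as if k-means had already run.
centroids = [(1.0, 1.0), (9.0, 9.0)]
vectors = [(0.0, 0.0), (2.0, 2.0), (4.9, 4.9),
           (5.2, 5.2), (8.0, 8.0), (10.0, 10.0)]
lists = build_ivf(vectors, centroids)
query = (5.1, 5.1)

print(ivf_search(query, vectors, centroids, lists, k=2, nprobe=1))  # [3, 4]
print(ivf_search(query, vectors, centroids, lists, k=2, nprobe=2))  # [3, 2]
```

With `nprobe=1` the search misses vector 2, a true nearest neighbour that sits just across the cluster boundary. Raising `nprobe` recovers it at the cost of scanning more vectors, which is the recall versus speed dial IVFFlat exposes.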
| | HNSW | IVFFlat |
|---|---|---|
| Algorithm type | Graph-based | Clustering-based (k-means) |
| Build time | Slow — graph construction per insertion | Fast — one-time k-means clustering |
| Index size (RAM) | Large — ~3× more than IVFFlat | Compact |
| Query throughput | Very high — ~15× faster at equal recall | Lower |
| Training step required | No — supports incremental insertion | Yes — clusters must be built before insertion |
| Handles dynamic data | Yes — add or delete without rebuilding | Degrades — clusters become stale as data shifts |
| Best for | Production, real-time search, dynamic corpora | Large, static corpora with tight memory budgets |
Choosing a Vector Database
The algorithm is one part of the decision; the deployment model and operational concerns matter just as much.
| Database | Index Support | Deployment | Best For |
|---|---|---|---|
| Pinecone | Proprietary ANN index (managed) | Cloud-only SaaS | Teams wanting zero infrastructure ops; simplest path to production |
| Milvus / Zilliz Cloud | HNSW, IVFFlat, GPU indexes | Self-host or managed cloud | GPU-accelerated search, massive scale (billions of vectors), cost control |
| pgvector (Postgres extension) | HNSW, IVFFlat | Self-host (add to existing Postgres) | Teams already on Postgres; SQL-native integration without a separate service |
| Weaviate | HNSW + BM25 hybrid built-in | Self-host or managed cloud | Hybrid search without Elasticsearch; GraphQL API; open-source |
| Qdrant | HNSW with payload filtering | Self-host or managed cloud | Complex metadata filtering; low latency; built in Rust; open-source |
| Chroma | HNSW (embedded) | Local or lightweight server | Local development, prototyping, small-scale applications |
Practical guidance for most teams: Start with pgvector if you are already on Postgres and your corpus is under ~10 million vectors — it adds zero new services to operate. Switch to a dedicated vector database (Pinecone, Qdrant, Weaviate) when you need features pgvector cannot provide: managed scaling, built-in hybrid search, or sub-10 ms latency at tens of millions of vectors.
Fine-Tuning vs. RAG
A common misconception is that fine-tuning and RAG compete for the same use case. They do not — they solve different problems.
- RAG scales knowledge: what facts the model can access.
- Fine-tuning scales behavior: how the model reasons, formats, and communicates.
| | RAG | Fine-Tuning |
|---|---|---|
| Best for | Grounding answers in specific documents; accessing private data; citing sources | Consistent output format; specialized reasoning style; domain-adapted vocabulary |
| Knowledge updates | Add or update documents without retraining — seconds to hours | Full training run required — hours to days, plus GPU cost |
| Source attribution | Built-in — every answer traces to a retrieved chunk | Not available — knowledge is baked into weights, sources are not tracked |
| Hallucination risk | Lower — anchored to retrieved context | Higher for factual claims — the model may confuse training signal with facts |
| Private data handling | Data stays in your vector store and out of model weights; only the retrieved chunks are sent to the model, inside the prompt, at query time | Training data is sent to the training pipeline, which raises compliance concerns |
| Typical cost | Embedding + vector storage + retrieval at inference time | Training compute (can be $10K–$100K+) plus the same inference costs as the base model |
| When to use | Knowledge is dynamic, private, or requires citation | Output format must be consistent; reasoning style needs to be specialized |
Production Failure Modes
RAG systems fail in predictable patterns. Understanding them before deployment lets you design around them.
| Failure Mode | Root Cause | Fix |
|---|---|---|
| Retrieval misses the answer | The chunk containing the answer was not retrieved — vocabulary mismatch, poor chunking, or weak embedding model | Add hybrid search (BM25 catches keyword misses); measure Recall@K on a test set before tuning any other component |
| Model ignores retrieved context | Retrieved chunks buried in the middle of a long prompt — LLMs tend to pay more attention to content at the very beginning and end of their context window, and overlook what is sandwiched in between. This is known as 'lost in the middle' degradation | Put the most relevant chunk first in the prompt; limit context to 5–10 chunks; use re-ranking to surface only the best candidates rather than injecting everything |
| Stale answers | The vector index was not updated when documents changed — model cites outdated content with confidence | Add a document change detection pipeline; trigger re-chunking and re-embedding on every document update; display last-updated timestamps to users |
| Authorization leakage | A user retrieves a chunk they should not have access to — the vector search does not enforce permissions | Filter candidates by access-control metadata before returning results; never rely on the LLM to enforce access control |
| Embedding model mismatch | Query is embedded with a different model version than the index — search returns meaningless results silently | Enforce the same model version at ingestion and query time; fail hard on version mismatch; pin the embedding model in config |
| Aggregation query fails | User asks 'summarize all Q3 reports', but RAG retrieves only the top few passages, not the full corpus | Route aggregation queries to a different pipeline (map-reduce, an agentic loop, or SQL analytics) rather than RAG point-retrieval |
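The change detection suggested for the stale-answers row can be as simple as comparing content hashes. This is a minimal sketch under stated assumptions: documents are available as plain strings, the hash recorded at the last ingestion is stored somewhere retrievable, and the names `content_hash` and `plan_reindex` are invented for this example.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs, indexed_hashes):
    """Compare live documents against the hashes stored at last ingestion.

    current_docs:   {doc_id: text} as it exists now
    indexed_hashes: {doc_id: hash} recorded when the index was last built
    Returns (ids to re-chunk and re-embed, ids to delete from the index).
    """
    changed = [
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    deleted = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return sorted(changed), sorted(deleted)


# Assumption: three toy documents; one edited, one new, one removed.
indexed = {
    "pricing.md": content_hash("Pro plan: $20 per seat"),
    "faq.md": content_hash("Reset your password in Settings"),
    "old.md": content_hash("Deprecated page"),
}
live = {
    "pricing.md": "Pro plan: $25 per seat",
    "faq.md": "Reset your password in Settings",
    "new.md": "Release notes",
}
print(plan_reindex(live, indexed))
```

Only the edited and new documents are re-embedded, and the removed document's chunks are scheduled for deletion so the model cannot cite them.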
Measuring RAG quality: Evaluate retrieval and generation separately, because a drop in final answer quality could come from either layer — and you need to know which one to fix.
- Retrieval metrics: Recall@K (did the correct chunk appear in the top K results?) and MRR (Mean Reciprocal Rank: how high did the correct chunk rank on average across your test queries?).
- Generation metrics: Faithfulness (does the answer stay within what the retrieved context says, or does the model add claims that are not supported?), context precision (how much of the retrieved context was actually used and relevant?), and answer relevance (does the response address what the user actually asked?).
Tools like RAGAS and TruLens provide automated evaluation pipelines that compute all these metrics against a labeled test set. Add them to your CI pipeline alongside your functional tests — a retrieval regression is one of the most common sources of silent quality degradation in production RAG systems, and it is nearly impossible to catch without systematic measurement.
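The two retrieval metrics are small enough to compute by hand on a labeled test set. The sketch below assumes exactly one correct chunk per query. The chunk IDs and result lists are made up, and this is not the RAGAS or TruLens API.

```python
def recall_at_k(results, relevant, k):
    """Fraction of queries whose correct chunk appears in the top k results."""
    hits = sum(
        1 for ranked, target in zip(results, relevant) if target in ranked[:k]
    )
    return hits / len(results)


def mean_reciprocal_rank(results, relevant):
    """Average of 1 / rank of the correct chunk; a miss contributes 0."""
    total = 0.0
    for ranked, target in zip(results, relevant):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(results)


# Assumption: four test queries, each with one labeled correct chunk.
results = [
    ["c1", "c2", "c3"],   # correct chunk at rank 1
    ["c9", "c4", "c5"],   # correct chunk at rank 2
    ["c7", "c8", "c6"],   # correct chunk at rank 3
    ["c2", "c3", "c1"],   # correct chunk missing
]
relevant = ["c1", "c4", "c6", "c0"]

print(recall_at_k(results, relevant, k=1))      # 0.25
print(recall_at_k(results, relevant, k=3))      # 0.75
print(mean_reciprocal_rank(results, relevant))
```

Recall@3 is 0.75 here while MRR is about 0.46, because two of the three hits sit below rank 1. Tracking both shows whether the right chunk is found at all and whether it is found early enough to make it into the prompt.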
Summary
| Concept | Key Point |
|---|---|
| The RAG problem | LLMs have frozen training knowledge. RAG injects external documents at query time, enabling up-to-date answers, private data access, and source attribution without retraining |
| Two-phase architecture | Ingestion (offline: chunk → embed → index) and retrieval (online: embed query → hybrid search → re-rank → generate). They share the vector index and the embedding model; everything else is separate |
| Chunking | Split documents into 256–1,024 token passages with 10–20% overlap. Chunk size is one of the most impactful tuning levers: too small loses context, too large adds noise |
| Embedding | Dense vectors encode semantic meaning. Use the same model for ingestion and querying. Model upgrades require re-embedding the entire corpus — plan for this upfront |
| Hybrid search | Run BM25 (keyword) and vector (semantic) search independently; fuse with RRF. Typically outperforms either method alone and is the standard choice for production-quality retrieval |
| Re-ranking | Cross-encoders score query+document jointly on a small candidate set (top 20–50). Adds 200–500 ms latency; significantly improves answer relevance over raw embedding similarity |
| HNSW | Graph-based ANN index. Fast queries (O(log N)), supports dynamic updates, memory-intensive. Default choice for most production RAG systems |
| IVFFlat | Clustering-based ANN index. Fast to build, memory-efficient, degrades on dynamic data. Use for large static corpora with tight memory budgets |
| Vector DB choice | pgvector for teams already on Postgres at < 10M vectors; Pinecone for managed zero-ops; Weaviate or Qdrant for open-source with advanced features |
| RAG vs. fine-tuning | RAG scales knowledge (dynamic, private facts); fine-tuning scales behavior (tone, format, reasoning style). |
| Common failure modes | Retrieval misses, 'lost in the middle' degradation, stale index, authorization leakage, embedding mismatch. Measure Recall@K and faithfulness separately |
The most common mistake in building RAG systems is treating ingestion and retrieval as a single unit and tuning everything at the end. The ingestion pipeline and the retrieval loop fail independently and require separate evaluation. Build a test set that measures retrieval quality before you evaluate generation quality — a well-tuned re-ranker on top of broken chunking is just a polished failure.