RAG (Retrieval-Augmented Generation)

A language model's knowledge is frozen at its training cutoff. It cannot access your company's internal documentation, yesterday's news, or any data that was not in its training set. Ask it about your product's pricing policy or a recent security patch, and it will either hallucinate a plausible-sounding answer or admit it does not know.

RAG solves this by separating knowledge from the model. Instead of baking facts into model weights, RAG stores knowledge in an external database and retrieves the most relevant pieces at inference time, injecting them into the prompt. The model sees the retrieved context and generates an answer grounded in real documents — not in memory.

This gives you three properties that training alone cannot provide:

Up-to-date answers: Update the knowledge base without retraining the model.
Private data: Keep proprietary content out of model weights while still making it queryable.
Source attribution: Cite the exact document each answer came from — essential for trust and auditability in enterprise applications.

The Two-Phase Architecture#

RAG operates in two distinct phases that run at different times and on different infrastructure.

The RAG System: Ingestion and Retrieval

The ingestion pipeline runs offline, whenever documents are added or changed. The retrieval loop runs online, on every user query. The two phases share one artifact, the vector index, and one constraint: both must use the same embedding model. Everything else is separate infrastructure with separate scaling concerns.


Phase 1: The Ingestion Pipeline#

Before any user query can be answered, your documents must be processed and indexed. This offline pipeline has three steps that must work well together: chunking, embedding, and indexing.

Step 1: Chunking#

Embedding models convert text to a single fixed-size vector. If you embed an entire 50-page document, that one vector must represent everything — it becomes too coarse to match any specific question. Chunking splits documents into smaller, self-contained passages so each embedding captures a focused piece of knowledge.

The core trade-off is between too small and too large:

Problem	Effect	Fix
Chunks too small (< 128 tokens)	Each chunk lacks enough context. The embedding captures a fragment of an idea, not a complete thought. Retrieval returns noisy, incomplete passages	Use roughly 200–256 tokens as a floor, depending on content type; prepend the section header to each chunk to provide context
Chunks too large (> 1,500 tokens)	Each chunk covers too many topics. The entire section is retrieved when only one sentence was needed. Irrelevant content crowds the prompt and degrades the answer	Split at 512–1,024 tokens for most content; add 10–20% overlap at boundaries
Boundary splits (sentence cut in half)	A concept spanning a chunk boundary loses meaning on both sides. Retrieval returns half an explanation	Add overlap: repeat the last 50–100 tokens at the start of the next chunk so no sentence is orphaned
Mixed-topic chunks	A chunk covering two unrelated sections produces a blended embedding. Retrieval finds it for both topics but it answers neither well	Use semantic or structure-aware chunking that respects document sections

Common chunking strategies:

Fixed-size chunking: Split every N tokens regardless of sentence structure. Simple to implement. Works well with overlap. Default choice for unstructured text.
Recursive chunking: Tries separators in order (\n\n → \n → space → character) until chunks fall under the size limit. Balances structure awareness with size control.
Structure-aware chunking: Respects the document's native structure — Markdown headers, HTML tags, PDF sections, code blocks. Prepends the section title to each chunk so the embedding carries document context.
Semantic chunking: Uses cosine similarity between adjacent sentence embeddings to detect topic shifts and split there. Produces variable-length chunks that respect meaning boundaries. More expensive to compute but often better quality.
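As a minimal sketch of fixed-size chunking with overlap, the function below slides a window over a list of tokens. It assumes whitespace-separated words stand in for real tokenizer output; the name `chunk_tokens` and the tiny sizes in the demo are illustrative only.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=100):
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace-style "tokens" stand in for a real tokenizer here.
words = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk when the overlap is long enough.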

Practical guidance: Start with recursive chunking at 512 tokens with 100 tokens of overlap. Switch to structure-aware or semantic chunking only when you observe retrieval precision problems — for example, when retrieved passages frequently contain irrelevant content, or when the correct chunk keeps failing to appear in results. Measure retrieval recall on a test set before tuning any parameter, not after.

Content Type	Recommended Chunk Size	Notes
Chat logs, FAQs	200–400 tokens	Short, self-contained entries benefit from small chunks
General web text, blog posts	256–512 tokens	Paragraph-level granularity works well
Technical documentation, code	512–1,024 tokens	Larger context preserves code examples and their explanations together
Legal documents, academic papers	800–1,500 tokens	Dense argument structure needs larger windows to stay coherent

Step 2: Embedding#

An embedding model converts a text chunk into a dense vector — a list of floating-point numbers (e.g., [0.23, -0.87, 0.41, ...]). The vector encodes the semantic meaning of the text: chunks with similar meaning produce vectors that are geometrically close in the vector space, regardless of whether they share any words.

This is what enables semantic search: "How do I reset my password?" and "steps to recover login credentials" produce similar vectors even though they share no keywords.
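To make "geometrically close" concrete, here is a small sketch of cosine similarity, the usual closeness measure for embeddings. The three-dimensional vectors are made-up stand-ins for real model output, which would have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for illustration only.
reset_password = [0.9, 0.1, 0.0]   # "How do I reset my password?"
recover_login = [0.8, 0.2, 0.1]    # "steps to recover login credentials"
pizza_recipe = [0.0, 0.1, 0.9]     # an unrelated text

close = cosine_similarity(reset_password, recover_login)
far = cosine_similarity(reset_password, pizza_recipe)
print(f"related texts:   {close:.3f}")
print(f"unrelated texts: {far:.3f}")
```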


Critical rule: use the same embedding model for ingestion and querying. Each embedding model defines its own vector space, so dimensions mean different things in different models. Searching a Cohere-embedded index with an OpenAI query vector is like asking for directions in French from someone who only speaks Mandarin: the numbers might look similar but they carry entirely different meaning. If the two models produce vectors of different lengths, most vector stores reject the query outright. If the lengths happen to match, the search runs and returns meaningless results with no error message to warn you, which makes this a silent, catastrophic, and hard-to-detect bug.

A key consequence: if you upgrade your embedding model, you must re-embed the entire corpus. There is no in-place migration path between embedding models. Plan for this before committing to one.
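One way to turn this silent bug into a loud one is to record which model built the index and compare it on every query. The sketch below assumes you can store a small piece of metadata next to the index; `EmbeddingSpec`, `EmbeddingMismatchError`, and `check_compatible` are hypothetical names, not part of any vector database API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Identifies the vector space an index or a query lives in."""
    model: str
    dimensions: int


class EmbeddingMismatchError(RuntimeError):
    """Raised when a query would be embedded in a different vector space."""


def check_compatible(index_spec, query_spec):
    # Compare model name and dimensions; equal dimensions alone are not enough.
    if index_spec != query_spec:
        raise EmbeddingMismatchError(
            f"index built with {index_spec}, query uses {query_spec}"
        )


index_spec = EmbeddingSpec("text-embedding-3-small", 1536)

# Same model on both sides: passes without raising.
check_compatible(index_spec, EmbeddingSpec("text-embedding-3-small", 1536))

try:
    check_compatible(index_spec, EmbeddingSpec("all-mpnet-base-v2", 768))
except EmbeddingMismatchError as err:
    print("refusing to search:", err)
```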

Model	Dimensions	Notes
`text-embedding-3-small` (OpenAI)	1,536	Good balance of quality and cost. Supports dimension truncation
`text-embedding-3-large` (OpenAI)	3,072	Best OpenAI quality. Can be truncated to 256 dimensions and still outperforms the previous generation
`all-MiniLM-L6-v2` (open source)	384	Lightweight and fast. Good for prototyping or resource-constrained deployments
`all-mpnet-base-v2` (open source)	768	Higher quality than MiniLM. Free and self-hostable
Cohere Embed / Voyage AI	768–1,024	Strong multilingual support. Competitive on MTEB benchmarks

Practical note on dimensions: Higher dimensionality generally means better recall, but also more storage and slower index builds. The sweet spot for most production systems is 768–1,536 dimensions — enough quality without prohibitive storage costs.
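The storage side of this trade-off is simple arithmetic. The sketch below assumes each dimension is stored as a 4-byte float32 and counts raw vector data only; index structures and metadata add more on top. The helper name `raw_vector_bytes` is illustrative.

```python
def raw_vector_bytes(n_vectors, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone, assuming float32 values by default."""
    return n_vectors * dimensions * bytes_per_value


for dims in (384, 768, 1536, 3072):
    gb = raw_vector_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>5} dimensions: {gb:.2f} GB per million vectors")
```

At 1,536 dimensions this comes to about 6.1 GB of raw vector data per million chunks, and doubling the dimensionality doubles that figure.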

Step 3: Indexing#

Once embedded, vectors are stored in a vector index: a data structure optimized for one query type, namely "given a query vector, find the K most similar stored vectors quickly." This is called Approximate Nearest Neighbor (ANN) search. The word "approximate" is intentional: finding the mathematically exact nearest neighbors across millions of vectors would require comparing the query against every stored vector, which is O(n) and too slow at scale. ANN algorithms trade a small amount of recall (typically around 1–2% of the true best matches when the index is well tuned) for orders-of-magnitude faster queries. This trade-off is what makes vector databases practical. The underlying algorithms (HNSW and IVFFlat) are explained in the Vector Databases section below.

Each vector is stored alongside its metadata: the original chunk text, a reference to the source document, page number, a timestamp, or any other fields needed for filtering and citation. Metadata filtering lets you scope queries — for example, "search only within documents tagged team: engineering" — without post-processing a full-corpus result.
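For reference, this is what the exact (non-approximate) version of the query looks like, including a metadata filter applied before ranking. It is a brute-force sketch that scans every record, which is the O(n) baseline an ANN index avoids. The record layout and the `where` parameter are assumptions for illustration.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def exact_top_k(query, records, k=3, where=None):
    """Scan every record, keep those matching `where`, return the k most similar."""
    where = where or {}
    allowed = [
        r for r in records
        if all(r["metadata"].get(field) == value for field, value in where.items())
    ]
    return heapq.nlargest(k, allowed, key=lambda r: cosine(query, r["vector"]))


# Hypothetical records: a vector plus the metadata stored alongside it.
records = [
    {"id": "eng-1", "vector": [1.0, 0.0], "metadata": {"team": "engineering"}},
    {"id": "eng-2", "vector": [0.6, 0.8], "metadata": {"team": "engineering"}},
    {"id": "sales-1", "vector": [0.99, 0.1], "metadata": {"team": "sales"}},
]

unfiltered = exact_top_k([1.0, 0.0], records, k=2)
scoped = exact_top_k([1.0, 0.0], records, k=2, where={"team": "engineering"})
print([r["id"] for r in unfiltered])
print([r["id"] for r in scoped])
```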

Phase 2: The Retrieval Loop#

When a user submits a query, the retrieval loop runs online — it must complete fast enough to feel interactive. It has three steps: embed the query using the same model used during ingestion, retrieve candidate chunks using hybrid search, and re-rank the candidates to select the highest-quality ones. The final top chunks are assembled into the prompt alongside the user's question and sent to the LLM.
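The final assembly step can be sketched as plain string building. The prompt wording, the chunk fields (`source`, `text`), and the function name `build_prompt` are all assumptions; the chunks are expected to arrive already ordered from most to least relevant.

```python
def build_prompt(question, chunks):
    """Number each chunk so the model can cite it, then append the question."""
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved chunks.
chunks = [
    {"source": "pricing.md", "text": "The Team plan costs $20 per seat per month."},
    {"source": "faq.md", "text": "Annual billing gives a 10% discount."},
]
prompt = build_prompt("How much is the Team plan?", chunks)
print(prompt)
```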

Hybrid Search: Keyword + Semantic#

Neither keyword search nor semantic search alone is sufficient for production systems:

Keyword search (BM25) excels at exact term matching — product names, error codes, specific identifiers. BM25 (Best Matching 25) is a classical ranking algorithm that scores documents based on how frequently and distinctively query terms appear in them. It returns high scores when query terms appear literally in the document, but it misses synonyms and paraphrases entirely.
Semantic search excels at meaning-level similarity — questions, paraphrases, conceptual queries. It misses precise keyword matches when the query and the document happen to use different vocabulary.

Production RAG systems run both and fuse the results.
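To show why keyword search needs a semantic partner, here is a compact BM25 scorer. It is a sketch under stated assumptions: whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant `ln(1 + (N - n + 0.5) / (n + 0.5))`. Production engines add stemming, stop words, and an inverted index.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term that is absent contributes nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "error code E4012 means the license key expired",
    "steps to recover login credentials",
    "how to renew a license",
]
exact = bm25_scores("E4012 license", docs)
paraphrase = bm25_scores("reset password", docs)
print(exact)
print(paraphrase)
```

The first query matches literal terms and ranks the error-code document highest. The second query shares no words with any document, so every score is zero, even though the second document is about the same topic.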

Hybrid Search with Reciprocal Rank Fusion

Run keyword and semantic search independently, then combine their ranked lists using Reciprocal Rank Fusion (RRF). RRF ignores raw scores entirely and works only on rank positions, so it does not matter that BM25 scores and cosine similarities live on different scales. No calibration is needed, and the fused list usually outperforms either retrieval method alone.


The RRF formula in plain terms: For each document, sum 1 / (rank + 60) across every ranked list it appears in. The constant 60 smooths out the difference between high ranks — without it, rank 1 would be worth twice rank 2, over-rewarding small ranking differences at the top. Documents that appear high in multiple lists accumulate a high RRF score; documents that appear in only one list at a low rank score poorly. No knowledge of score distributions is needed, which makes RRF easy to deploy.
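The formula translates almost directly into code. This sketch assumes each input is a list of document IDs ordered best-first; the IDs themselves are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first rankings into one, using only rank positions."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


keyword_hits = ["doc_a", "doc_b", "doc_c"]    # e.g. from BM25
semantic_hits = ["doc_b", "doc_d", "doc_a"]   # e.g. from vector search

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents that appear in both lists rise above documents that appear in only one, without any reference to the underlying scores.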

Re-ranking: From Candidate Set to Final Context#

The semantic half of first-stage retrieval uses bi-encoders: models that embed the query and each document independently, with no awareness of each other. (The keyword half, BM25, is similarly cheap because it relies on precomputed term statistics.) This is fast and scalable: you can pre-compute document embeddings once at ingestion time and reuse them for any query. But it sacrifices accuracy, because each document vector is computed without knowing which query it will be compared against, so fine-grained interactions between query and document are lost.

Cross-encoders (the re-ranker) score query and document jointly — the model reads both at once and produces a direct relevance score. This is far more accurate, but each query-document pair must be scored individually, making it O(n) in the number of candidates and therefore unsuitable for large sets. In practice, it only runs on a small candidate set (typically 20–50). The two-stage pattern is:

Fast first-stage retrieval (BM25 + bi-encoder) → 20–50 candidates (milliseconds)
Accurate cross-encoder re-ranking → top 5–10 final chunks (adds 200–500 ms)
Inject top 5–10 chunks into the LLM prompt

Re-ranking consistently and significantly improves answer relevance compared to using raw embedding similarity scores alone. The added latency is almost always worth the quality gain in production.
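The two-stage pattern can be sketched independently of any particular model. Below, `word_overlap` and `phrase_aware` are deliberately crude stand-ins for a bi-encoder and a cross-encoder; they only illustrate the control flow, in which the expensive scorer sees a short candidate list rather than the whole corpus. In a real system the first stage would query an index instead of sorting every document.

```python
def retrieve_then_rerank(query, corpus, fast_score, slow_score,
                         n_candidates=50, n_final=5):
    # Stage 1: cheap scoring, keep a short candidate list.
    candidates = sorted(corpus, key=lambda doc: fast_score(query, doc),
                        reverse=True)[:n_candidates]
    # Stage 2: expensive scoring over the candidates only.
    reranked = sorted(candidates, key=lambda doc: slow_score(query, doc),
                      reverse=True)
    return reranked[:n_final]


def word_overlap(query, doc):
    """Stand-in for bi-encoder similarity: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


slow_calls = []  # records how often the expensive scorer runs


def phrase_aware(query, doc):
    """Stand-in for a cross-encoder: looks at query and document together."""
    slow_calls.append(doc)
    bonus = 2 if query.lower() in doc.lower() else 0
    return word_overlap(query, doc) + bonus


corpus = [
    "reset your password from the login page",
    "password policy and reset intervals",
    "to reset password open settings",
    "billing and invoices",
]
top = retrieve_then_rerank("reset password", corpus, word_overlap, phrase_aware,
                           n_candidates=3, n_final=1)
print(top)
print(len(slow_calls), "expensive scoring calls for", len(corpus), "documents")
```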

Vector Databases: Under the Hood#

A vector database stores high-dimensional vectors and serves one query type efficiently: find the K vectors most similar to this query vector. Doing this by brute force (comparing the query against every stored vector) is O(n) and too slow at scale. Real systems use Approximate Nearest Neighbor (ANN) algorithms that trade a small amount of recall (typically around 1–2% of the best matches when the index is well tuned) for orders-of-magnitude speedup.

Two ANN algorithms dominate production deployments: HNSW and IVFFlat.

HNSW: The Default for Production#

HNSW (Hierarchical Navigable Small World) is a graph-based index. Think of it as a multi-layer highway system built on top of your vectors. The index organizes vectors into a hierarchy of layers: the top layer is sparse, containing only a few nodes connected by long-range edges — like highways that let you skip across the vector space in large jumps. Each layer below adds more nodes and shorter connections, until the bottom layer contains every vector with fine-grained, local edges — like neighborhood streets. To search, you enter at the top layer and greedily hop toward the region closest to your query vector. At each layer, once you can't get any closer, you drop down to the next denser layer and refine. By the time you reach the bottom, you've narrowed the search space so efficiently that the total number of hops is O(log N) — even across millions of vectors.
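The "hop until you can't get any closer" step is the core of the search. The sketch below shows that greedy walk on a single layer of a hand-built graph; it is not a full HNSW implementation (no layer hierarchy, no candidate queue, no index construction), and the graph and vectors are invented for illustration.

```python
import math


def greedy_search(graph, vectors, entry, query):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    hops = [current]
    while True:
        best = min(graph[current],
                   key=lambda node: math.dist(vectors[node], query),
                   default=current)
        if math.dist(vectors[best], query) >= math.dist(vectors[current], query):
            return current, hops  # no neighbor is closer: local optimum reached
        current = best
        hops.append(current)


# Five points on a line; the a-d edge plays the role of a long-range "highway".
vectors = {"a": (0, 0), "b": (2, 0), "c": (4, 0), "d": (6, 0), "e": (8, 0)}
graph = {
    "a": ["b", "d"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["a", "c", "e"],
    "e": ["d"],
}

nearest, hops = greedy_search(graph, vectors, entry="a", query=(7.6, 0))
print(nearest, hops)
```

The long edge lets the walk skip the two intermediate nodes, which is the effect the sparse upper layers provide at scale.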

HNSW is the default choice for most RAG applications because it delivers the best balance of query speed and recall for corpora up to tens of millions of vectors. It requires no training step: vectors can be inserted incrementally without rebuilding the index, and deletions are typically handled by marking nodes as removed and cleaning up later. This is critical for dynamic knowledge bases where documents are frequently added or updated. Query throughput is often much higher than IVFFlat at equivalent recall (roughly 15x is a representative figure; the exact ratio depends on the dataset, index parameters, and implementation), making it ideal for latency-sensitive pipelines like interactive search and real-time Q&A.

The main trade-off is memory: HNSW stores graph edges alongside vectors, using roughly 3x more RAM than IVFFlat at equivalent recall (the exact overhead depends on how many edges each node keeps). Index build time is also slower because each insertion requires constructing graph connections. For very large corpora (100M+ vectors) where RAM is a hard constraint, disk-based indexes such as DiskANN, or indexes that compress vectors with quantization, become more practical.

IVFFlat (Inverted File Flat) takes a different approach: it uses k-means clustering to partition all vectors into groups at build time. Each group has a centroid, an average vector representing its cluster. When a query arrives, the algorithm first compares the query against all centroids, then exhaustively searches only the closest few clusters (a tunable count usually called nprobe), skipping the rest of the dataset entirely. Raising nprobe improves recall at the cost of speed. IVFFlat is faster to build and uses less memory than HNSW, but it requires an upfront training step (the k-means clustering), and its recall degrades if the data distribution shifts significantly after the index is built.
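A toy version makes the nprobe trade-off visible. In this sketch the centroids are supplied by hand rather than learned with k-means, and `TinyIVF` is a hypothetical class written for illustration, not the API of any library.

```python
import math


class TinyIVF:
    def __init__(self, centroids):
        self.centroids = centroids                 # would come from k-means
        self.lists = [[] for _ in centroids]       # one inverted list per cluster

    def _nearest_centroids(self, vec, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: math.dist(vec, self.centroids[i]))
        return order[:n]

    def add(self, item_id, vec):
        cluster = self._nearest_centroids(vec, 1)[0]
        self.lists[cluster].append((item_id, vec))

    def search(self, query, k=1, nprobe=1):
        # Only the nprobe closest clusters are scanned; the rest are skipped.
        candidates = [item
                      for cluster in self._nearest_centroids(query, nprobe)
                      for item in self.lists[cluster]]
        candidates.sort(key=lambda item: math.dist(query, item[1]))
        return [item_id for item_id, _ in candidates[:k]]


ivf = TinyIVF(centroids=[(0.0, 0.0), (10.0, 0.0)])
for item_id, vec in [("a", (1.0, 0.0)), ("b", (4.0, 0.0)),
                     ("c", (5.1, 0.0)), ("d", (9.0, 0.0))]:
    ivf.add(item_id, vec)

query = (4.9, 0.0)
print(ivf.search(query, nprobe=1))  # scans one cluster
print(ivf.search(query, nprobe=2))  # scans both clusters
```

With `nprobe=1` the search returns `b`, because the true nearest neighbor `c` sits just across the cluster boundary and is never examined. With `nprobe=2` it is found. That missed neighbor is the recall that ANN search trades for speed.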

	HNSW	IVFFlat
Algorithm type	Graph-based	Clustering-based (k-means)
Build time	Slow — graph construction per insertion	Fast — one-time k-means clustering
Index size (RAM)	Large — ~3× more than IVFFlat	Compact
Query throughput	Very high — ~15× faster at equal recall	Lower
Training step required	No — supports incremental insertion	Yes — clusters must be built before insertion
Handles dynamic data	Yes — add or delete without rebuilding	Degrades — clusters become stale as data shifts
Best for	Production, real-time search, dynamic corpora	Large, static corpora with tight memory budgets

Choosing a Vector Database#

The algorithm is one part of the decision; the deployment model and operational concerns matter just as much.

Database	Index Support	Deployment	Best For
Pinecone	Proprietary ANN index (managed, not user-selectable)	Cloud-only SaaS	Teams wanting zero infrastructure ops; simplest path to production
Milvus / Zilliz Cloud	HNSW, IVFFlat, GPU indexes	Self-host or managed cloud	GPU-accelerated search, massive scale (billions of vectors), cost control
pgvector (Postgres extension)	HNSW, IVFFlat	Self-host (add to existing Postgres)	Teams already on Postgres; SQL-native integration without a separate service
Weaviate	HNSW + BM25 hybrid built-in	Self-host or managed cloud	Hybrid search without Elasticsearch; GraphQL API; open-source
Qdrant	HNSW with payload filtering	Self-host or managed cloud	Complex metadata filtering; low latency; built in Rust; open-source
Chroma	HNSW (embedded)	Local or lightweight server	Local development, prototyping, small-scale applications

Practical guidance for most teams: Start with pgvector if you are already on Postgres and your corpus is under ~10 million vectors — it adds zero new services to operate. Switch to a dedicated vector database (Pinecone, Qdrant, Weaviate) when you need features pgvector cannot provide: managed scaling, built-in hybrid search, or sub-10 ms latency at tens of millions of vectors.

Fine-Tuning vs. RAG#

A common misconception is that fine-tuning and RAG compete for the same use case. They do not — they solve different problems.

RAG scales knowledge: what facts the model can access.
Fine-tuning scales behavior: how the model reasons, formats, and communicates.

	RAG	Fine-Tuning
Best for	Grounding answers in specific documents; accessing private data; citing sources	Consistent output format; specialized reasoning style; domain-adapted vocabulary
Knowledge updates	Add or update documents without retraining — seconds to hours	Full training run required — hours to days, plus GPU cost
Source attribution	Built-in — every answer traces to a retrieved chunk	Not available — knowledge is baked into weights, sources are not tracked
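Hallucination risk	Lower: answers are anchored to retrieved context	Higher for factual claims: facts are recalled from weights, with no retrieved text to check them against
Private data handling	Data stays in your vector store and out of model weights; retrieved chunks are still sent to the model inside the prompt, so a hosted LLM provider sees them at query time	Training data is sent to the training pipeline, which raises compliance concerns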
Typical cost	Embedding + vector storage + retrieval at inference time	Training compute (can be $10K–$100K+) plus the same inference costs as the base model
When to use	Knowledge is dynamic, private, or requires citation	Output format must be consistent; reasoning style needs to be specialized

Production Failure Modes#

RAG systems fail in predictable patterns. Understanding them before deployment lets you design around them.

Failure Mode	Root Cause	Fix
Retrieval misses the answer	The chunk containing the answer was not retrieved — vocabulary mismatch, poor chunking, or weak embedding model	Add hybrid search (BM25 catches keyword misses); measure Recall@K on a test set before tuning any other component
Model ignores retrieved context	Retrieved chunks buried in the middle of a long prompt — LLMs tend to pay more attention to content at the very beginning and end of their context window, and overlook what is sandwiched in between. This is known as 'lost in the middle' degradation	Put the most relevant chunk first in the prompt; limit context to 5–10 chunks; use re-ranking to surface only the best candidates rather than injecting everything
Stale answers	The vector index was not updated when documents changed — model cites outdated content with confidence	Add a document change detection pipeline; trigger re-chunking and re-embedding on every document update; display last-updated timestamps to users
Authorization leakage	A user retrieves a chunk they should not have access to — the vector search does not enforce permissions	Filter candidates by access-control metadata before returning results; never rely on the LLM to enforce access control
Embedding model mismatch	Query is embedded with a different model version than the index — search returns meaningless results silently	Enforce the same model version at ingestion and query time; fail hard on version mismatch; pin the embedding model in config
Aggregation query fails	User asks 'summarize all Q3 reports', but RAG retrieves only a handful of top-K passages, not the full corpus	Route aggregation queries to a different pipeline: map-reduce, an agentic loop, or SQL analytics, not RAG point-retrieval
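The authorization fix in the table can be sketched as a filter that runs on retrieval results before anything reaches the prompt. The `allowed_groups` metadata field and the function name are assumptions; where the vector store supports metadata filtering, pushing the same condition into the query itself avoids returning fewer than K results.

```python
def filter_by_access(candidates, user_groups):
    """Keep only chunks that at least one of the user's groups may read."""
    user_groups = set(user_groups)
    return [c for c in candidates if user_groups & set(c["allowed_groups"])]


# Hypothetical retrieval results with access-control metadata.
candidates = [
    {"id": "handbook-3", "allowed_groups": ["everyone"]},
    {"id": "salaries-1", "allowed_groups": ["hr"]},
    {"id": "runbook-7", "allowed_groups": ["engineering", "sre"]},
]

visible = filter_by_access(candidates, user_groups=["everyone", "engineering"])
print([c["id"] for c in visible])
```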

Measuring RAG quality: Evaluate retrieval and generation separately, because a drop in final answer quality could come from either layer — and you need to know which one to fix.

Retrieval metrics: Recall@K (did the correct chunk appear in the top K results?) and MRR (Mean Reciprocal Rank — how high did the correct chunk rank on average across your test queries?).
Generation metrics: Faithfulness (does the answer stay within what the retrieved context says, or does the model add claims that are not supported?), context precision (how much of the retrieved context was actually used and relevant?), and answer relevance (does the response address what the user actually asked?).
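The two retrieval metrics are short enough to compute by hand. This sketch assumes a labeled test set with exactly one correct chunk per query; the query and chunk IDs are invented.

```python
def recall_at_k(results, relevant, k):
    """Fraction of queries whose correct chunk appears in the top k results."""
    hits = sum(1 for q, ranked in results.items() if relevant[q] in ranked[:k])
    return hits / len(results)


def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the correct chunk; a miss contributes 0."""
    total = 0.0
    for q, ranked in results.items():
        if relevant[q] in ranked:
            total += 1.0 / (ranked.index(relevant[q]) + 1)
    return total / len(results)


# Hypothetical retrieval output (best-first) and ground-truth labels.
results = {
    "q1": ["c3", "c1", "c7"],   # correct chunk at rank 2
    "q2": ["c2", "c9", "c4"],   # correct chunk at rank 1
    "q3": ["c5", "c6", "c8"],   # correct chunk missing
}
relevant = {"q1": "c1", "q2": "c2", "q3": "c0"}

print("Recall@1:", recall_at_k(results, relevant, k=1))
print("Recall@3:", recall_at_k(results, relevant, k=3))
print("MRR:", mean_reciprocal_rank(results, relevant))
```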

Tools like RAGAS and TruLens provide automated evaluation pipelines for metrics like these. Add them to your CI pipeline alongside your functional tests: a retrieval regression is one of the most common sources of silent quality degradation in production RAG systems, and it is nearly impossible to catch without systematic measurement.

Summary#

Concept	Key Point
The RAG problem	LLMs have frozen training knowledge. RAG injects external documents at query time, enabling up-to-date answers, private data access, and source attribution without retraining
Two-phase architecture	Ingestion (offline: chunk → embed → index) and retrieval (online: embed query → hybrid search → re-rank → generate). They share the vector index and the embedding model; everything else is separate infrastructure
Chunking	Split documents into 256–1,024 token passages with 10–20% overlap. Chunk size is one of the most impactful tuning levers: too small loses context, too large adds noise
Embedding	Dense vectors encode semantic meaning. Use the same model for ingestion and querying. Model upgrades require re-embedding the entire corpus — plan for this upfront
Hybrid search	Run BM25 (keyword) and vector (semantic) search independently; fuse with RRF. Usually outperforms either method alone and is the standard choice for production-quality retrieval
Re-ranking	Cross-encoders score query+document jointly on a small candidate set (top 20–50). Adds 200–500 ms latency; significantly improves answer relevance over raw embedding similarity
HNSW	Graph-based ANN index. Fast queries (O(log N)), supports dynamic updates, memory-intensive. Default choice for most production RAG systems
IVFFlat	Clustering-based ANN index. Fast to build, memory-efficient, degrades on dynamic data. Use for large static corpora with tight memory budgets
Vector DB choice	pgvector for teams already on Postgres at < 10M vectors; Pinecone for managed zero-ops; Weaviate or Qdrant for open-source with advanced features
RAG vs. fine-tuning	RAG scales knowledge (dynamic, private facts); fine-tuning scales behavior (tone, format, reasoning style)
Common failure modes	Retrieval misses, 'lost in the middle' degradation, stale index, authorization leakage, embedding mismatch. Measure Recall@K and faithfulness separately

The most common mistake in building RAG systems is treating ingestion and retrieval as a single unit and tuning everything at the end. The ingestion pipeline and the retrieval loop fail independently and require separate evaluation. Build a test set that measures retrieval quality before you evaluate generation quality — a well-tuned re-ranker on top of broken chunking is just a polished failure.
