AI-Powered Customer Support with RAG

A customer support chatbot sounds straightforward — a user asks a question, the bot answers. But when that chatbot must accurately answer hundreds of questions per second about a product catalog that changes daily, using documentation spread across dozens of sources, while remaining factually grounded and never fabricating answers, the engineering challenge becomes one of the most instructive problems in applied AI systems.

This case study introduces Retrieval-Augmented Generation (RAG), the dominant architecture for building LLM-powered systems that need to answer questions grounded in real, up-to-date company knowledge. The core insight is that a large language model alone is not enough: it has a knowledge cutoff, it can hallucinate, and it cannot know your specific product. RAG addresses all three by having the system look up relevant passages and hand them to the LLM before it answers.

The core question this case study answers: how do you build a system where an LLM gives answers that are accurate, grounded in your data, and never fabricated?

This case study follows the same framework as the others in this section:

Clarify constraints (Steps 1–2) — What does the system do, and how much traffic must it handle?
High-level design (Step 3) — What are the major components and how do they connect?
Deep dives (Steps 4–7) — How do the trickiest parts actually work?
Trade-offs (Step 8) — What did we give up, and when would we choose differently?

Step 1: Clarify Requirements

Functional Requirements

These describe what the system does.

Feature	Description	Priority
Question answering	Answer natural-language customer questions using the company's documentation, FAQs, and support articles	Core
Source grounding	Every answer must cite the specific documents it was drawn from; no answer should be fabricated	Core
Multi-turn conversation	Support back-and-forth dialogue: a customer can ask a follow-up like 'does that apply to returns?' and the system understands what 'that' refers to	Core
Human escalation	Detect when the system cannot answer confidently and route to a human agent	Core
Knowledge base ingestion	Ingest new documents (PDF, Markdown, HTML) and make them searchable within minutes of upload	Core
Real-time updates	When an article is edited in the content management system (CMS), the knowledge base reflects the change without a manual re-index	Core
Language support	Answer questions in the customer's language; the knowledge base may be in English only	Optional
Feedback collection	Let customers rate responses as helpful or not; use this signal to improve retrieval	Optional

Non-Functional Requirements

These describe how well the system works.

Property	Requirement	Why It Matters
Latency	First token streamed to the user within 1 second; complete response within 5 seconds	Customer support is synchronous — users expect near-instant responses. Visible delay triggers abandonment and escalation to human agents, defeating the cost-saving purpose of the system
Accuracy	Answers must be factually grounded in retrieved documents; hallucination rate below 2% in production evaluation	A customer who receives a wrong refund policy or incorrect product specification will either escalate or be misinformed — both outcomes are costly and damage trust
Availability	99.9% uptime on the query path — under 9 hours of downtime per year	Customer support operates 24/7; the chat widget is the primary deflection layer before human agents
Knowledge freshness	New or updated articles visible to the retrieval system within 5 minutes of publication	Product pricing, return policies, and feature availability change frequently — stale answers are as harmful as hallucinations
Scalability	Handle 10,000 concurrent conversations; scale to 100,000 during product launches or outages	Support volume is bursty — a site outage or viral product launch can create a 10× spike in support traffic within minutes
Safety	Block harmful requests, off-topic conversations, and prompt injection attacks	Without guardrails, users can jailbreak the chatbot into off-brand responses, reveal internal documents, or generate harmful content

The core tension: retrieval quality and response latency pull against each other. More retrieval steps (more documents fetched, re-ranked, and filtered) produce more accurate answers at the cost of higher latency. The architecture resolves this through a carefully budgeted pipeline: hybrid search retrieves a broad set of candidates efficiently, and a single cross-encoder reranking pass narrows that set to the best three chunks, approaching the accuracy of exhaustive reranking at close to the latency of a fast lookup.

Step 2: Back-of-the-Envelope Estimation

Metric	Calculation	Result
Daily conversations	Assumed mid-scale SaaS support product	500,000 conversations/day
Messages per conversation	~4 customer messages per conversation on average (initial question, a clarifying follow-up, a confirmation, and a closing message)	~2M customer messages/day
Average throughput	2M ÷ 86,400 seconds/day	~23 queries/second
Peak throughput (5× average)	23 × 5 — product launch or outage spikes	~115 queries/second
Knowledge base size	10,000 support articles × ~800 tokens average	~8M tokens of raw content
Chunks at 512 tokens with 50-token overlap	8M tokens ÷ 462 new tokens per chunk (512 − 50 overlap)	~17,000 chunks
Embedding size per chunk	1,536 dimensions × 4 bytes (float32)	~6 KB per chunk
Total vector index size	17,000 chunks × 6 KB	~100 MB — fits in memory on any modern server
Embedding API cost (OpenAI text-embedding-3-small)	$0.02 per 1M tokens; 8M tokens at ingestion	~$0.16 for full re-index; negligible
LLM cost per query (GPT-4o)	~2,300 tokens input (3 retrieved chunks ~1,500 tokens + system prompt ~800 tokens + minimal early-turn history) + ~500 tokens output; $2.50/1M input + $10/1M output (verify current pricing with your provider)	~$0.01 per query (about $0.006 input + $0.005 output); about $10,000/day per 1M queries, so roughly $20,000/day at the ~2M messages/day estimated above. LLM inference is 95%+ of total system cost; all other components are rounding errors by comparison

The number that shapes the cost model: LLM inference dominates. The vector database, embedding API, and retrieval infrastructure together cost less than 5% of what the LLM generation step costs. Cost optimization must therefore focus on the number of tokens sent to the LLM — shorter retrieved contexts, conversation history compression, and caching repeated queries are the primary levers that materially reduce cost at scale.
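The arithmetic in the table can be reproduced with a short script. This is a minimal sketch that uses only the table's own assumptions as defaults; the function name `estimate` and its parameter names are illustrative, and the prices should be checked against the provider's current price list.

```python
SECONDS_PER_DAY = 86_400

def estimate(
    conversations_per_day=500_000,
    messages_per_conversation=4,
    peak_multiplier=5,
    articles=10_000,
    tokens_per_article=800,
    chunk_tokens=512,
    overlap_tokens=50,
    embedding_dims=1_536,
    input_tokens=1_500 + 800,       # retrieved chunks + system prompt (history ignored)
    output_tokens=500,
    input_price_per_million=2.50,   # USD, assumption taken from the table
    output_price_per_million=10.00, # USD, assumption taken from the table
):
    """Recompute the back-of-the-envelope figures from the estimation table."""
    messages_per_day = conversations_per_day * messages_per_conversation
    avg_qps = messages_per_day / SECONDS_PER_DAY
    corpus_tokens = articles * tokens_per_article
    # Each chunk contributes (chunk size - overlap) new tokens.
    chunks = corpus_tokens / (chunk_tokens - overlap_tokens)
    index_mb = chunks * embedding_dims * 4 / 1_000_000  # float32 = 4 bytes
    cost_per_query = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return {
        "messages_per_day": messages_per_day,
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_multiplier,
        "chunks": chunks,
        "index_mb": index_mb,
        "cost_per_query": cost_per_query,
        "llm_cost_per_day": cost_per_query * messages_per_day,
    }

if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.4f}")
```

Changing `input_tokens` is the quickest way to see how much shorter contexts or history compression would save per day.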

Step 3: High-Level Design

The foundational principle: separate knowledge from reasoning. The LLM handles language understanding and generation — it does not need to memorize your product documentation. The retrieval layer handles lookup — it finds the relevant passages and hands them to the LLM as context. This separation makes the system updatable (change the knowledge base, not the model), auditable (every answer traces back to a source document), and accurate (the LLM is constrained to reason over retrieved facts, not invent them).

What each component does:

API Gateway + Rate Limiter — Terminates TLS, authenticates requests (session cookie or JWT), and enforces rate limits (e.g., 60 queries per user per minute via Upstash Redis sliding window). Rejects prompt injection patterns before they reach the LLM.
Query Service — The orchestrator. It embeds the query, runs hybrid search in parallel, fuses and reranks results, assembles the final prompt, streams the LLM response, and persists the conversation turn to PostgreSQL.
Embedding Model — Converts the customer's question text into a dense vector. The same model that was used to embed the knowledge base chunks must be used here — they must live in the same vector space.
Vector DB — Stores pre-computed chunk embeddings. Handles approximate nearest neighbor (ANN) search: given the query vector, return the k most semantically similar chunks in milliseconds.
Search Index (Elasticsearch / OpenSearch) — Stores the raw text of each chunk and indexes it for BM25 (Best Match 25) keyword matching — a standard ranking algorithm that scores documents by how well their words match the query, with a bonus for rare, distinctive terms. Runs in parallel with vector search to catch exact term matches that semantic search misses.
Reciprocal Rank Fusion — A fusion algorithm that merges the keyword and vector result lists into a single ranked list by rewarding documents that rank highly in both. It requires no machine learning or training data: each document scores the sum of 1/(k + rank) over the lists it appears in, and the commonly used default of k=60 works well in most domains without tuning.
Cross-Encoder Reranker — A small transformer model (e.g., ms-marco-MiniLM-L-6-v2) that scores each candidate chunk against the full query together, rather than comparing pre-computed embeddings. Much more accurate than bi-encoder similarity but too slow to run on the full corpus — run it only on the top 20–30 fused candidates.
LLM — Generates the final answer given the system prompt, conversation history, and retrieved chunks. Streamed via Server-Sent Events for fast time-to-first-token.
Ingestion Service — Triggered by webhooks from the content management system (CMS). Fetches the updated document, splits it into chunks, embeds each chunk, upserts into the vector DB, and updates the keyword index. The full pipeline for a single article runs in 3–8 seconds.
PostgreSQL — Stores conversation history per session (for multi-turn context), user feedback (thumbs up/down), and metadata for audit trails.
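The sliding-window rate limit mentioned for the API Gateway can be sketched in a few lines. This is an in-process illustration only: the class name `SlidingWindowLimiter` is hypothetical, and a production gateway would keep the per-user timestamps in a shared store such as Redis rather than in local memory.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per user in any `window_seconds` span."""

    def __init__(self, limit=60, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # user_id -> timestamps of accepted requests

    def allow(self, user_id, now):
        hits = self._hits[user_id]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Passing `now` explicitly (for example from `time.monotonic()`) keeps the limiter easy to test with fixed timestamps.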

API Design

Endpoint	Method	Request / Body	Response
`POST /api/v1/chat`	POST	`{ session_id, message: string }`	`200 OK` — SSE stream of tokens; final event includes `sources[]`
`GET /api/v1/chat/{session_id}/history`	GET	—	`200 OK` — `{ messages[], sources[] }`
`POST /api/v1/feedback`	POST	`{ message_id, rating: 'helpful' | 'not_helpful', comment? }`	`200 OK`
`POST /api/v1/ingest`	POST (internal)	`{ document_id, url, content_type }`	`202 Accepted` — async; webhook from CMS
`DELETE /api/v1/documents/{id}`	DELETE (internal)	—	`200 OK` — removes all chunks for this document from both indexes

POST /api/v1/chat — The core query endpoint. Embeds the user message, runs hybrid retrieval, assembles the prompt, and streams the LLM response back via Server-Sent Events. The final SSE event includes sources[] so the client can render citations alongside the answer.
GET /api/v1/chat/{session_id}/history — Returns the full message history for a session. Used to restore conversation state when a user reopens a chat widget or navigates back to a support thread.
POST /api/v1/feedback — Records a thumbs-up / thumbs-down rating on a specific assistant message. Feedback is stored in PostgreSQL and used offline to identify retrieval gaps — a cluster of "not helpful" ratings on the same topic signals a missing or poorly chunked article.
POST /api/v1/ingest — Internal endpoint triggered by CMS webhooks on document publish or update. Accepts a document_id and content reference, then kicks off the async ingestion pipeline (fetch → chunk → embed → upsert). Returns 202 Accepted immediately; the actual indexing happens in the background.
DELETE /api/v1/documents/{id} — Removes all chunks for a given document from both the vector DB and the keyword index. Called when an article is unpublished or deleted from the CMS — ensures stale content is never returned in retrieval results.
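To make the streaming contract of the chat endpoint concrete, here is a minimal sketch of how tokens and the final `sources[]` payload could be framed as Server-Sent Events. The event names `token` and `done` and the function names are assumptions for illustration; the source only states that the final event carries `sources[]`.

```python
import json
from typing import Iterable, Iterator

def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Event: an event name, a JSON data line, a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def stream_answer(tokens: Iterable[str], sources: list) -> Iterator[str]:
    """Yield one event per generated token, then a final event with citations."""
    for token in tokens:
        yield sse_event("token", {"text": token})
    # The last event carries the citations so the client can render them.
    yield sse_event("done", {"sources": sources})
```

A web framework would write each yielded string to a response with the `text/event-stream` content type.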

Database Schema

-- Conversation sessions — one per browser tab or mobile session
CREATE TABLE sessions (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id     BIGINT REFERENCES users(id),
  created_at  TIMESTAMPTZ DEFAULT now(),
  metadata    JSONB  -- locale, product context, etc.
);

-- Individual conversation turns within a session
CREATE TABLE messages (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  session_id  UUID REFERENCES sessions(id) ON DELETE CASCADE,
  role        VARCHAR(16) NOT NULL,  -- 'user' | 'assistant'
  content     TEXT NOT NULL,
  sources     JSONB,   -- [{doc_id, title, url, chunk_text}] for assistant turns
  tokens_used INT,     -- for cost tracking
  created_at  TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_messages_session ON messages(session_id, created_at DESC);

-- Knowledge base document registry — the source of truth for what is indexed
CREATE TABLE documents (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  external_id VARCHAR(255) UNIQUE,  -- CMS content ID
  title       TEXT,
  url         TEXT,
  checksum    VARCHAR(64),  -- SHA-256 of content; skip re-index if unchanged
  indexed_at  TIMESTAMPTZ,
  chunk_count INT
);

The vector database stores each chunk as a separate record with its embedding vector and metadata:

Vector DB record (Weaviate / Qdrant):
  id              UUID          -- chunk-level unique ID
  document_id     UUID          -- links back to documents table
  content         TEXT          -- the raw chunk text (also sent to LLM as context)
  embedding       float32[1536] -- from text-embedding-3-small
  embedding_model VARCHAR(64)   -- e.g. "text-embedding-3-small" — MUST match the model used at query time;
                                --   if you ever switch models, re-embed the full corpus before querying
  title           TEXT          -- document title (for source citation)
  url             TEXT          -- document URL (for source citation)
  created_at      TIMESTAMP
  chunk_index     INT           -- position of this chunk within the document (for ordering)

Step 4: Deep Dive — Ingestion Pipeline and Chunking

The ingestion pipeline determines the ceiling on retrieval quality. No amount of clever searching or reranking can recover information that was poorly chunked at ingestion time.

The Ingestion Pipeline: From Raw Document to Searchable Chunk

The ingestion pipeline transforms raw documents into searchable vector chunks in five stages. The chunking stage is the most consequential — chunk boundaries determine whether a retrieved passage has enough context to answer a question, and whether related ideas are kept together or split across chunks.
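Fixed-size chunking with overlap, the strategy assumed throughout this case study, can be sketched as follows. The function name `chunk_tokens` is illustrative, and the input is any pre-tokenized list; a real pipeline would use the embedding model's tokenizer rather than a whitespace split.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # new tokens contributed by each chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With the defaults, an 800-token article yields two chunks, and the last 50 tokens of the first chunk reappear at the start of the second, so a sentence that straddles the boundary is still seen whole in at least one chunk.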

Real-Time Knowledge Base Updates

The freshness requirement — new articles visible within 5 minutes of publication — demands an event-driven architecture rather than scheduled polling.

Webhook-Driven Update Flow:
1. Content editor saves article in CMS
2. CMS fires POST /api/v1/ingest with the document_id (event: published)
   within 1 second of save
3. Ingestion Service fetches fresh content via CMS JSON API
4. Deletes all existing chunks for doc_id from both indexes
   (prevents stale duplicates)
5. Re-chunks, re-embeds, and re-upserts the updated content
6. Total re-index time for one article: 3–8 seconds

Nightly Consistency Job (2 AM):
- Fetch all doc IDs and checksums from PostgreSQL documents table
- Fetch all doc IDs currently indexed in vector DB
- For any document in DB but not in vector index: re-ingest
- For any document with mismatched checksum: re-ingest
- For any document in vector index but not in DB: delete from index
- Goal: catch webhooks that failed silently (network errors, CMS bugs)
- Production result: ~2–5 missed updates per night caught and recovered
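The comparison at the heart of the nightly job is a set difference plus a checksum check. The sketch below assumes that a checksum of the indexed version is available per document on the index side (for example stored as chunk metadata), which the record layout above does not show; the function name `plan_reconciliation` is hypothetical.

```python
def plan_reconciliation(db_checksums, index_checksums):
    """Compare the documents table with the vector index and plan repairs.

    Both arguments map document ID -> content checksum.
    """
    db_ids = set(db_checksums)
    index_ids = set(index_checksums)
    missing = db_ids - index_ids          # registered but never indexed
    mismatched = {
        doc_id for doc_id in db_ids & index_ids
        if db_checksums[doc_id] != index_checksums[doc_id]
    }                                     # indexed from an older version
    orphaned = index_ids - db_ids         # indexed but since deleted
    return {
        "reingest": sorted(missing | mismatched),
        "delete": sorted(orphaned),
    }
```

The job would then push the `reingest` IDs through the normal ingestion path and remove the `delete` IDs from both indexes.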

Use an async queue between the webhook receiver and the ingestion worker. During bulk documentation updates, a CMS can fire dozens of webhooks simultaneously. Without a queue, the ingestion service faces concurrent embed-and-upsert operations that can overwhelm it — dropped webhooks result in stale index entries that the nightly job must recover. A simple Redis queue or a managed queue (AWS SQS, Google Pub/Sub) between the webhook endpoint and the ingestion worker lets you accept all webhook events immediately (returning 202 Accepted) and process them with controlled concurrency. The 5-minute freshness requirement is easily satisfied; queue depth becomes your monitoring signal for processing lag.
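The effect of a queue with controlled concurrency can be illustrated in-process with `asyncio`. This is only a stand-in for Redis, SQS, or Pub/Sub: the function name `run_ingestion_workers` and the `ingest` callback are hypothetical, and a real deployment would also need retries and persistence.

```python
import asyncio

async def run_ingestion_workers(doc_ids, ingest, concurrency=4):
    """Process queued document IDs with at most `concurrency` ingestions in flight."""
    queue = asyncio.Queue()
    for doc_id in doc_ids:
        queue.put_nowait(doc_id)  # accept every webhook event immediately

    processed = []

    async def worker():
        while True:
            try:
                doc_id = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to do
            await ingest(doc_id)  # fetch -> chunk -> embed -> upsert
            processed.append(doc_id)

    # A fixed number of workers caps how many ingestions run concurrently.
    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return processed
```

The number of items still waiting in the queue is the lag signal the paragraph above refers to.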

Delete-before-insert is mandatory. A document update that only inserts new chunks without first deleting the old ones results in both versions coexisting in the index. Retrieval will then return chunks from the old version — with outdated pricing, superseded policies, or removed feature descriptions — mixed with chunks from the new version, producing contradictory answers. Always delete all existing chunks for a document before inserting the updated ones.
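A minimal sketch of delete-before-insert, combined with the checksum skip from the `documents` table, is shown below against a toy in-memory index. `InMemoryIndex`, `reindex_document`, and the `chunker` callback are hypothetical names; a real implementation would issue the delete and upserts against the vector DB and the keyword index.

```python
import hashlib

class InMemoryIndex:
    """Toy stand-in for a chunk index, keyed by (doc_id, chunk_index)."""

    def __init__(self):
        self.chunks = {}

    def delete_document(self, doc_id):
        for key in [k for k in self.chunks if k[0] == doc_id]:
            del self.chunks[key]

    def upsert(self, doc_id, chunk_index, text):
        self.chunks[(doc_id, chunk_index)] = text

def reindex_document(index, checksums, doc_id, content, chunker):
    """Replace all chunks of a document; return False if the content is unchanged."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if checksums.get(doc_id) == digest:
        return False  # same content as last time, skip the re-index
    index.delete_document(doc_id)  # remove every old chunk first
    for i, text in enumerate(chunker(content)):
        index.upsert(doc_id, i, text)
    checksums[doc_id] = digest
    return True
```

Without the delete step, a document that shrinks from three chunks to one would leave two stale chunks behind, because upserting chunk 0 never touches chunks 1 and 2.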

Step 5: Deep Dive — Hybrid Search and Reranking

Retrieval quality is the single largest determinant of answer quality. An LLM cannot generate a correct answer from wrong or incomplete retrieved context — retrieval failure causes generation failure.

Hybrid Search: BM25 + Vector + Reranking

Pure semantic (vector) search fails on exact terms — product model numbers, error codes, abbreviations, and proper nouns. Pure keyword (BM25) search fails on paraphrases and conceptual questions. Hybrid search runs both in parallel and fuses the results. A final cross-encoder reranking pass selects the top-3 chunks with high precision. Production benchmarks show full cascading retrieval (BM25 + dense vector search + cross-encoder reranking) achieves 91% accuracy vs. 62% for dense-only and 58% for sparse-only.
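The fusion step can be written in a few lines. This sketch implements Reciprocal Rank Fusion over ranked lists of chunk IDs, with each list contributing 1/(k + rank) for every chunk it contains; the function name and the alphabetical tie-break are illustrative choices.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of IDs; each list adds 1 / (k + rank) to an ID's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_hits = ["a", "b", "c"]    # keyword ranking, best first
vector_hits = ["b", "d", "a"]  # semantic ranking, best first
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Chunks that appear in both lists ("a" and "b") rise above chunks found by only one method, without any need to put the raw BM25 and cosine scores on a common scale.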

Why reranking is necessary: Both BM25 and vector search score query and chunk independently — BM25 computes term overlap, vector search compares pre-computed embeddings. Neither method sees the query and chunk together, so their scores are imprecise proxies for relevance. A cross-encoder reranker processes each (query, chunk) pair jointly through a transformer, giving it full attention over both texts at once. This makes it far more accurate at judging true relevance, but it runs a full forward pass per candidate, so it can only be applied to a small set (20–30 chunks) after the fast bi-encoder retrieval has already narrowed the field. The cascade — broad hybrid retrieval followed by narrow cross-encoder reranking — is how you get cross-encoder precision at production latency.
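The second stage of the cascade can be sketched with a pluggable pair scorer. Here `score_pair` stands in for the cross-encoder, and `toy_overlap_score` is a deliberately crude placeholder so the example runs without a model; with the sentence-transformers library, a `CrossEncoder` model's `predict` method over (query, chunk) pairs would fill the same role.

```python
def rerank(query, candidates, score_pair, top_n=3, max_candidates=30):
    """Score each (query, chunk) pair jointly and keep the best `top_n` chunks."""
    shortlist = list(candidates)[:max_candidates]  # cap the expensive stage
    scored = [
        (score_pair(query, chunk), position, chunk)
        for position, chunk in enumerate(shortlist)
    ]
    # Best score first; ties keep the order produced by the fusion stage.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for _, _, chunk in scored[:top_n]]

def toy_overlap_score(query, chunk):
    """Placeholder scorer: the fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / max(len(query_words), 1)
```

Because the scorer sees the query and the chunk together, swapping in a real cross-encoder changes only the `score_pair` argument, not the cascade itself.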

Step 6: Deep Dive — Multi-Turn Conversation and Context Management

A single-turn question-answering system is significantly simpler than a conversational one. Multi-turn conversations introduce a hard constraint: the LLM has a finite context window, and each new message — along with its retrieved chunks and the full conversation history — consumes tokens from that fixed budget.

Context Window Budget: Fitting Conversation + Retrieved Chunks

Context windows are large: GPT-4o, the model priced in Step 2, accepts 128,000 tokens, and newer models accept more. That sounds enormous, but the goal is not to fill it, because every token you send to the LLM costs money. Token budgeting (allocating fixed slices of the window to each component) serves two purposes: it keeps per-query cost predictable at scale, and it prevents any single component from crowding out the others as conversations grow longer. A short, well-budgeted prompt (system prompt + recent history + 3 chunks + query) typically uses 2,000–4,000 tokens, which is what the cost model in Step 2 assumes. When conversation history grows beyond its budget, summarize rather than truncate.
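One way to enforce the history slice of the budget is to keep the most recent turns that fit and hand the older ones to a summarizer. The sketch below shows only the split; the function name `fit_history` is hypothetical, the summarization call itself is omitted, and `count_tokens` is any token counter (a whitespace split here, where a real system would use the model's tokenizer).

```python
def fit_history(messages, budget_tokens, count_tokens):
    """Split history into (kept, overflow).

    `kept` holds the most recent messages that fit within `budget_tokens`;
    `overflow` holds the older messages that should be summarized.
    """
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    overflow = messages[:len(messages) - len(kept)]
    return kept, overflow

def count_words(text):
    """Crude token counter used for illustration only."""
    return len(text.split())
```

The summary of `overflow` would then be placed ahead of `kept` in the prompt, so early facts such as an order number survive at a fraction of their original token cost.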

Step 7: Deep Dive — Guardrails and Hallucination Prevention

A production customer-support chatbot must do more than answer correctly — it must refuse gracefully, fail safely, and never fabricate. Hallucinations in customer support are not just an accuracy problem; they are a legal and trust problem. A chatbot that confidently states the wrong refund period or a nonexistent product feature can create binding customer expectations and expose the company to liability.

Guardrail Layers: Input, Retrieval, Generation, Output

Guardrails operate at four independent layers. A defect at any layer propagates to the response, so defense must be deep. Input guardrails prevent malicious or off-topic queries from consuming LLM budget. Retrieval guardrails prevent low-confidence or irrelevant chunks from anchoring the LLM to wrong context. Generation guardrails constrain the LLM's output. Output guardrails catch and block harmful content before streaming to the user.
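The input and retrieval layers can be sketched as a small decision function. Everything specific here is an assumption for illustration: the regex patterns, the 0.5 score threshold, and the names `looks_like_injection` and `answer_or_escalate`. A production input guardrail would more likely be a trained classifier, and the output check (for example NLI against the cited chunks) is not shown.

```python
import re

# Illustrative patterns only; real systems use a classifier, not a fixed list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"(reveal|show|print) (me )?(your |the )?system prompt", re.IGNORECASE),
]

def looks_like_injection(message):
    """Input guardrail: flag messages that contain instruction-like patterns."""
    return any(pattern.search(message) for pattern in INJECTION_PATTERNS)

def answer_or_escalate(message, scored_chunks, generate, min_score=0.5):
    """Run the input and retrieval guardrails before calling the LLM.

    `scored_chunks` is a list of (chunk_text, relevance_score) pairs and
    `generate(message, chunks)` is the caller-supplied LLM call.
    """
    if looks_like_injection(message):
        return {"action": "refuse", "text": None, "sources": []}
    grounded = [chunk for chunk, score in scored_chunks if score >= min_score]
    if not grounded:
        # Nothing relevant enough to ground an answer: hand off to a human.
        return {"action": "escalate", "text": None, "sources": []}
    return {"action": "answer", "text": generate(message, grounded), "sources": grounded}
```

Both early returns happen before any LLM call, so rejected and unanswerable queries consume no generation budget.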

Step 8: Trade-offs

Vector Database Options: Managed Cloud vs. Open Source vs. PostgreSQL Extension

The choice of vector database is one of the first production decisions in a RAG system. Four options dominate: Pinecone (managed, simple, expensive at scale), Weaviate (managed or self-hosted, native hybrid search), Qdrant (self-hosted, high performance, open-source), and pgvector (adds vectors to existing PostgreSQL — lowest operational overhead for teams already running Postgres).

Key Architectural Decisions Compared

Decision	Option A	Option B	Recommendation
Chunking strategy	Fixed-size (512 tokens, 50 overlap) — fast, predictable, simple	Semantic chunking — coherent boundaries, higher quality, 10–20× slower ingestion	Fixed-size for real-time webhook updates; semantic for scheduled nightly full re-index
Vector database	Managed cloud (Pinecone, Weaviate Cloud) — zero ops, higher cost at scale	Self-hosted (Qdrant, pgvector) — lower cost, requires ops expertise	pgvector to start; Qdrant or Weaviate when knowledge base exceeds 1M chunks or latency becomes a constraint
Hybrid search	Weaviate native hybrid (single system, alpha parameter) — simpler ops	Elasticsearch (BM25) + separate vector DB — more control, higher ops burden	Weaviate native hybrid for most teams; Elasticsearch + Qdrant if you need richer keyword query features
Reranking	Cross-encoder (ms-marco-MiniLM-L-6-v2) — 91% accuracy, ~150ms overhead	No reranking — 62–75% accuracy, no added latency	Rerank by default in production; for customer support the accuracy gain usually outweighs ~150ms of added latency
Conversation memory	Sliding window (last N messages, drop the rest) — simple, loses early context	Summarization buffer — preserves key facts, adds one LLM call per N turns	Summarization buffer for support conversations; sliding window only for single-session FAQ lookup tools
Hallucination prevention	System prompt only — easy to implement, easy to bypass	Multi-layer: similarity threshold + system prompt + output NLI check	Multi-layer in production; system prompt only acceptable for internal demos

Common Failure Modes in Production RAG

Failure Mode	Symptom	Root Cause	Fix
Retrieval miss	Answer is 'I don't know' for a question that IS in the knowledge base	Query phrasing doesn't match chunk content; wrong chunk boundaries split the relevant sentence	Improve chunking; add query rewriting; lower the abstention threshold; check if the document was actually ingested
Context overflow	Answer degrades or ignores retrieved chunks in long conversations	Total tokens (history + chunks + prompt) exceeds model's effective attention span	Implement conversation summarization; reduce retrieved chunk count from 5 to 3; enforce token budget
Hallucination	Answer contains specific details (prices, dates, policies) not in the retrieved chunks	LLM falls back to parametric memory when retrieved context is ambiguous or missing	Tighten system prompt; add output NLI validation; raise the retrieval confidence threshold
Stale knowledge	Answer cites old policy or discontinued product feature	Webhook failed silently; nightly consistency job didn't catch the drift	Add webhook retry logic with exponential backoff; verify checksums in nightly job; alert on drift
Retrieval pollution	Chunks from unrelated products or internal-only documents appear in answers	No metadata filtering on retrieval; all documents indexed in a single namespace	Add product/category metadata to each chunk; filter vector search by metadata before similarity ranking
Prompt injection	Bot goes off-topic, reveals system prompt, or claims false capabilities	User embeds instructions in their message ('Ignore previous instructions and...')	Add input guardrail classifier for instruction-like patterns; never include secrets in the system prompt
Citation hallucination	Answer cites [Source: Returns Policy] but the claim doesn't appear in that document	LLM attributes claims to real documents it can see but uses parametric knowledge for content	Run NLI check between each claim and its cited chunk; flag and block answers with unverified citations

Summary

Concept	What It Solves	Key Insight
Ingestion pipeline	Raw documents must be transformed into searchable vector chunks before retrieval can work	Chunk at 512 tokens with 50-token overlap; attach metadata to every chunk; use checksums to skip unchanged documents on re-index
Webhook-driven updates	Knowledge base must reflect CMS changes within minutes, not days	Delete-before-insert: always remove old chunks before inserting new ones; pair real-time webhooks with a nightly consistency job that catches silent failures
Hybrid search (BM25 + vector)	Pure semantic search misses exact terms; pure keyword search misses paraphrases — you need both	Run BM25 and vector search in parallel; merge with Reciprocal Rank Fusion; production systems achieve 91% accuracy with full cascade vs. 62% with dense-only
Cross-encoder reranking	Bi-encoder similarity scores are imprecise; reranking selects the truly best chunks from the fused candidate set	Run the cross-encoder on only the top 20–30 fused candidates; this delivers cross-encoder accuracy at about 150ms of latency overhead, not seconds
Multi-turn context management	Conversation history grows without bound; the LLM context window does not	Summarize older turns to preserve key facts while reducing token cost; always rewrite the current query to a standalone retrieval query incorporating context
Multi-layer guardrails	LLMs hallucinate; prompt injection is real; confidently wrong answers damage trust and legal standing	Layer defenses: input classifier → retrieval confidence threshold → generation constraints → output NLI check; abstain gracefully and escalate to humans
Vector database selection	Every vector database makes different trade-offs across cost, ops complexity, hybrid search support, and scale ceiling	Start with pgvector; migrate to Qdrant or Weaviate when knowledge base exceeds 1M chunks; choose Weaviate if native hybrid search simplicity is the priority

The AI customer support RAG system is the canonical introduction to production LLM application design. Every pattern you learn here — chunked ingestion, hybrid retrieval, reranking, context budgeting, guardrails, and event-driven knowledge freshness — reappears in any system that needs to ground an LLM in real, specific, and updatable knowledge: internal knowledge bases, legal document Q&A, medical information systems, and enterprise search. The retrieval stack is reusable across all of these; only the documents change.
