AI-Powered Customer Support with RAG

A customer support chatbot sounds straightforward — a user asks a question, the bot answers. But when that chatbot must accurately answer hundreds of questions per second about a product catalog that changes daily, using documentation spread across dozens of sources, while remaining factually grounded and never fabricating answers, the engineering challenge becomes one of the most instructive problems in applied AI systems.

This case study introduces Retrieval-Augmented Generation (RAG), the dominant architecture for building LLM-powered systems that need to answer questions grounded in real, up-to-date company knowledge. The core insight is that a large language model alone is not enough: it has a knowledge cutoff, it can hallucinate, and it cannot know your specific product. RAG addresses all three by looking things up before the LLM answers: relevant passages are retrieved at query time and placed in the prompt, so the model itself never has to be retrained.

The core question this case study answers: how do you build a system where an LLM gives answers that are accurate, grounded in your data, and never fabricated?

This case study follows the same framework as the others in this section:

  • Clarify constraints (Steps 1–2) — What does the system do, and how much traffic must it handle?
  • High-level design (Step 3) — What are the major components and how do they connect?
  • Deep dives (Steps 4–7) — How do the trickiest parts actually work?
  • Trade-offs (Step 8) — What did we give up, and when would we choose differently?

Step 1: Clarify Requirements

Functional Requirements

These describe what the system does.

| Feature | Description | Priority |
| --- | --- | --- |
| Question answering | Answer natural-language customer questions using the company's documentation, FAQs, and support articles | Core |
| Source grounding | Every answer must cite the specific documents it was drawn from; no answer should be fabricated | Core |
| Multi-turn conversation | Support back-and-forth dialogue: a customer can ask a follow-up like 'how do I return that?' and the system understands what 'that' refers to | Core |
| Human escalation | Detect when the system cannot answer confidently and route to a human agent | Core |
| Knowledge base ingestion | Ingest new documents (PDF, Markdown, HTML) and make them searchable within minutes of upload | Core |
| Real-time updates | When an article is edited in the content management system (CMS), the knowledge base reflects the change without a manual re-index | Core |
| Language support | Answer questions in the customer's language; the knowledge base may be in English only | Optional |
| Feedback collection | Let customers rate responses as helpful or not; use this signal to improve retrieval | Optional |

Non-Functional Requirements

These describe how well the system works.

| Property | Requirement | Why It Matters |
| --- | --- | --- |
| Latency | First token streamed to the user within 1 second; complete response within 5 seconds | Customer support is synchronous: users expect near-instant responses. Visible delay triggers abandonment and escalation to human agents, defeating the cost-saving purpose of the system |
| Accuracy | Answers must be factually grounded in retrieved documents; hallucination rate below 2% in production evaluation | A customer who receives a wrong refund policy or incorrect product specification will either escalate or be misinformed; both outcomes are costly and damage trust |
| Availability | 99.9% uptime on the query path (under 9 hours of downtime per year) | Customer support operates 24/7; the chat widget is the primary deflection layer before human agents |
| Knowledge freshness | New or updated articles visible to the retrieval system within 5 minutes of publication | Product pricing, return policies, and feature availability change frequently; stale answers are as harmful as hallucinations |
| Scalability | Handle 10,000 concurrent conversations; scale to 100,000 during product launches or outages | Support volume is bursty: a site outage or viral product launch can create a 10× spike in support traffic within minutes |
| Safety | Block harmful requests, off-topic conversations, and prompt injection attacks | Without guardrails, users can jailbreak the chatbot into giving off-brand responses, revealing internal documents, or generating harmful content |

The core tension: retrieval quality and response latency pull against each other. More retrieval steps (more documents fetched, re-ranked, and filtered) produce more accurate answers but add latency. The architecture resolves this through a carefully budgeted pipeline: hybrid search retrieves a broad set of candidates efficiently, and a single cross-encoder reranking pass narrows that set to the best three chunks, approaching the accuracy of scoring every chunk with the cross-encoder at a small fraction of the latency.

Step 2: Back-of-the-Envelope Estimation

| Metric | Calculation | Result |
| --- | --- | --- |
| Daily conversations | Assumed mid-scale SaaS support product | 500,000 conversations/day |
| Messages per conversation | ~4 customer messages per conversation on average (initial question, a clarifying follow-up, a confirmation, and a closing message) | ~2M customer messages/day |
| Average throughput | 2M ÷ 86,400 seconds/day | ~23 queries/second |
| Peak throughput (5× average) | 23 × 5 (product launch or outage spikes) | ~115 queries/second |
| Knowledge base size | 10,000 support articles × ~800 tokens average | ~8M tokens of raw content |
| Chunks at 512 tokens with 50-token overlap | 8M tokens ÷ 462 new tokens per chunk (512 − 50 overlap) | ~17,000 chunks |
| Embedding size per chunk | 1,536 dimensions × 4 bytes (float32) | ~6 KB per chunk |
| Total vector index size | 17,000 chunks × 6 KB | ~100 MB; fits in memory on any modern server |
| Embedding API cost (OpenAI text-embedding-3-small) | $0.02 per 1M tokens; 8M tokens at ingestion | ~$0.16 for full re-index; negligible |
| LLM cost per query (GPT-4o) | ~2,000–2,300 tokens input (3 retrieved chunks ~1,500 tokens + system prompt ~800 tokens + minimal early-turn history) + ~500 tokens output; $2.50/1M input + $10/1M output (verify current pricing with the model provider) | ~$0.010–0.011 per query; roughly $10,000 per 1M queries, or about $20,000/day at the ~2M queries/day estimated above. LLM inference is 95%+ of total system cost; all other components are rounding errors by comparison |

The number that shapes the cost model: LLM inference dominates. The vector database, embedding API, and retrieval infrastructure together cost less than 5% of what the LLM generation step costs. Cost optimization must therefore focus on the number of tokens sent to the LLM — shorter retrieved contexts, conversation history compression, and caching repeated queries are the primary levers that materially reduce cost at scale.
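The estimates above can be rechecked in a few lines. This is only a sketch: the variable and function names are illustrative, the per-token prices are the ones quoted in the table (they may be out of date), and the input size uses the table's round figure of about 2,000 tokens.

```python
# Recompute the back-of-the-envelope figures from the table above.
SECONDS_PER_DAY = 86_400

conversations_per_day = 500_000
messages_per_conversation = 4
queries_per_day = conversations_per_day * messages_per_conversation

avg_qps = queries_per_day / SECONDS_PER_DAY
peak_qps = avg_qps * 5  # 5x burst factor from the table

corpus_tokens = 10_000 * 800          # articles x average tokens per article
stride = 512 - 50                     # new tokens contributed by each chunk
chunk_count = corpus_tokens / stride
index_bytes = chunk_count * 1_536 * 4  # float32 embeddings

def llm_cost_per_query(input_tokens, output_tokens,
                       usd_per_m_input=2.50, usd_per_m_output=10.00):
    """Cost in USD for one query at the quoted per-million-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

cost = llm_cost_per_query(2_000, 500)
daily_cost = cost * queries_per_day

print(f"average {avg_qps:.1f} qps, peak {peak_qps:.1f} qps")
print(f"{chunk_count:,.0f} chunks, index ~{index_bytes / 1e6:.0f} MB")
print(f"${cost:.3f} per query, ${daily_cost:,.0f} per day")
```

Because both terms of the cost function scale with token counts, trimming retrieved context or compressing history lowers the bill roughly in proportion, which is why those are the main levers.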

Step 3: High-Level Design

The foundational principle: separate knowledge from reasoning. The LLM handles language understanding and generation — it does not need to memorize your product documentation. The retrieval layer handles lookup — it finds the relevant passages and hands them to the LLM as context. This separation makes the system updatable (change the knowledge base, not the model), auditable (every answer traces back to a source document), and accurate (the LLM is constrained to reason over retrieved facts, not invent them).

What each component does:

  • API Gateway + Rate Limiter — Terminates TLS, authenticates requests (session cookie or JWT), and enforces rate limits (e.g., 60 queries per user per minute via Upstash Redis sliding window). Rejects prompt injection patterns before they reach the LLM.
  • Query Service — The orchestrator. It embeds the query, runs hybrid search in parallel, fuses and reranks results, assembles the final prompt, streams the LLM response, and persists the conversation turn to PostgreSQL.
  • Embedding Model — Converts the customer's question text into a dense vector. The same model that was used to embed the knowledge base chunks must be used here — they must live in the same vector space.
  • Vector DB — Stores pre-computed chunk embeddings. Handles approximate nearest neighbor (ANN) search: given the query vector, return the k most semantically similar chunks in milliseconds.
  • Search Index (Elasticsearch / OpenSearch) — Stores the raw text of each chunk and indexes it for BM25 (Best Match 25) keyword matching — a standard ranking algorithm that scores documents by how well their words match the query, with a bonus for rare, distinctive terms. Runs in parallel with vector search to catch exact term matches that semantic search misses.
  • Reciprocal Rank Fusion — A fusion algorithm that merges the keyword and vector result lists into a single ranked list by rewarding documents that rank highly in both. It requires no machine learning or training data — a simple formula with a standard k=60 default handles the merging automatically and works well across virtually all domains.
  • Cross-Encoder Reranker — A small transformer model (e.g., ms-marco-MiniLM-L-6-v2) that scores each candidate chunk against the full query together, rather than comparing pre-computed embeddings. Much more accurate than bi-encoder similarity but too slow to run on the full corpus — run it only on the top 20–30 fused candidates.
  • LLM — Generates the final answer given the system prompt, conversation history, and retrieved chunks. Streamed via Server-Sent Events for fast time-to-first-token.
  • Ingestion Service — Triggered by webhooks from the content management system (CMS). Fetches the updated document, splits it into chunks, embeds each chunk, upserts into the vector DB, and updates the keyword index. The full pipeline for a single article runs in 3–8 seconds.
  • PostgreSQL — Stores conversation history per session (for multi-turn context), user feedback (thumbs up/down), and metadata for audit trails.
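Reciprocal Rank Fusion is small enough to show in full. The sketch below assumes each retriever returns an ordered list of chunk IDs; the function name and the sample IDs are illustrative. Each chunk earns 1 / (k + rank) from every list it appears in, and the contributions are summed.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of chunk IDs into one fused ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

bm25_hits = ["c7", "c2", "c9"]    # keyword results, best first
vector_hits = ["c2", "c4", "c7"]  # semantic results, best first
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['c2', 'c7', 'c4', 'c9']
```

Chunks c2 and c7 appear in both lists, so they outrank c4 and c9, which each appear in only one. Only ranks are used, so the incompatible score scales of BM25 and cosine similarity never need to be normalized.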

API Design

| Endpoint | Method | Request / Body | Response |
| --- | --- | --- | --- |
| POST /api/v1/chat | POST | { session_id, message: string } | 200 OK: SSE stream of tokens; final event includes sources[] |
| GET /api/v1/chat/{session_id}/history | GET | (none) | 200 OK { messages[], sources[] } |
| POST /api/v1/feedback | POST | { message_id, rating: 'helpful' or 'not_helpful', comment? } | 200 OK |
| POST /api/v1/ingest | POST (internal) | { document_id, url, content_type } | 202 Accepted: async; webhook from CMS |
| DELETE /api/v1/documents/{id} | DELETE (internal) | (none) | 200 OK: removes all chunks for this document from both indexes |

  • POST /api/v1/chat — The core query endpoint. Embeds the user message, runs hybrid retrieval, assembles the prompt, and streams the LLM response back via Server-Sent Events. The final SSE event includes sources[] so the client can render citations alongside the answer.
  • GET /api/v1/chat/{session_id}/history — Returns the full message history for a session. Used to restore conversation state when a user reopens a chat widget or navigates back to a support thread.
  • POST /api/v1/feedback — Records a thumbs-up / thumbs-down rating on a specific assistant message. Feedback is stored in PostgreSQL and used offline to identify retrieval gaps — a cluster of "not helpful" ratings on the same topic signals a missing or poorly chunked article.
  • POST /api/v1/ingest — Internal endpoint triggered by CMS webhooks on document publish or update. Accepts a document_id and content reference, then kicks off the async ingestion pipeline (fetch → chunk → embed → upsert). Returns 202 Accepted immediately; the actual indexing happens in the background.
  • DELETE /api/v1/documents/{id} — Removes all chunks for a given document from both the vector DB and the keyword index. Called when an article is unpublished or deleted from the CMS — ensures stale content is never returned in retrieval results.
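To make the streaming contract of POST /api/v1/chat concrete, here is a minimal sketch of how the response body could be framed as Server-Sent Events. The event names `token` and `done` and the payload shapes are assumptions made for illustration; the endpoint description only specifies that tokens are streamed and that the final event carries sources[].

```python
import json

def sse_event(event, data):
    """Format one Server-Sent Event frame: event name, JSON data, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def stream_answer(tokens, sources):
    """Yield one frame per generated token, then a final frame with citations."""
    for token in tokens:
        yield sse_event("token", {"text": token})
    # Hypothetical final event: lets the client render citations after the text.
    yield sse_event("done", {"sources": sources})

frames = list(stream_answer(
    ["Refunds ", "take ", "5 days."],
    [{"doc_id": "d1", "title": "Returns Policy", "url": "https://example.com/returns"}],
))
print("".join(frames))
```

Sending citations only in the last frame keeps time-to-first-token low, since the client does not wait for source metadata before rendering text.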

Database Schema

-- Conversation sessions — one per browser tab or mobile session
CREATE TABLE sessions (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id     BIGINT REFERENCES users(id),  -- assumes an existing users table (not shown)
  created_at  TIMESTAMPTZ DEFAULT now(),
  metadata    JSONB  -- locale, product context, etc.
);

-- Individual conversation turns within a session
CREATE TABLE messages (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  session_id  UUID NOT NULL REFERENCES sessions(id) ON DELETE CASCADE,
  role        VARCHAR(16) NOT NULL CHECK (role IN ('user', 'assistant')),
  content     TEXT NOT NULL,
  sources     JSONB,   -- [{doc_id, title, url, chunk_text}] for assistant turns
  tokens_used INT,     -- for cost tracking
  created_at  TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_messages_session ON messages(session_id, created_at DESC);

-- Knowledge base document registry — the source of truth for what is indexed
CREATE TABLE documents (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  external_id VARCHAR(255) UNIQUE,  -- CMS content ID
  title       TEXT,
  url         TEXT,
  checksum    VARCHAR(64),  -- SHA-256 of content; skip re-index if unchanged
  indexed_at  TIMESTAMPTZ,
  chunk_count INT
);

The vector database stores each chunk as a separate record with its embedding vector and metadata:

Vector DB record (Weaviate / Qdrant):
  id              UUID          -- chunk-level unique ID
  document_id     UUID          -- links back to documents table
  content         TEXT          -- the raw chunk text (also sent to LLM as context)
  embedding       float32[1536] -- from text-embedding-3-small
  embedding_model VARCHAR(64)   -- e.g. "text-embedding-3-small" — MUST match the model used at query time;
                                --   if you ever switch models, re-embed the full corpus before querying
  title           TEXT          -- document title (for source citation)
  url             TEXT          -- document URL (for source citation)
  created_at      TIMESTAMP
  chunk_index     INT           -- position of this chunk within the document (for ordering)

Step 4: Deep Dive - Ingestion Pipeline and Chunking

The ingestion pipeline determines the ceiling on retrieval quality. No amount of clever searching or reranking can recover information that was poorly chunked at ingestion time.

The Ingestion Pipeline: From Raw Document to Searchable Chunk

The ingestion pipeline transforms raw documents into searchable vector chunks in five stages. The chunking stage is the most consequential — chunk boundaries determine whether a retrieved passage has enough context to answer a question, and whether related ideas are kept together or split across chunks.
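The fixed-size strategy used throughout this case study (512-token chunks with a 50-token overlap) can be sketched as a sliding window. The function name is illustrative, and the input is assumed to be an already tokenized list; a real pipeline would tokenize with the embedding model's own tokenizer.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # each chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

tokens = list(range(1000))  # stand-in for a 1,000-token article
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 76]
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, provided it is shorter than the overlap; longer spans can still be split, which is the main weakness of fixed-size chunking.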

Real-Time Knowledge Base Updates

The freshness requirement — new articles visible within 5 minutes of publication — demands an event-driven architecture rather than scheduled polling.

Webhook-Driven Update Flow:
1. Content editor saves article in CMS
2. CMS fires POST /api/v1/ingest with { document_id, url, content_type }
   (event: published) within 1 second of save
3. Ingestion Service fetches fresh content via CMS JSON API
4. Deletes all existing chunks for doc_id from both indexes
   (prevents stale duplicates)
5. Re-chunks, re-embeds, and re-upserts the updated content
6. Total re-index time for one article: 3–8 seconds

Nightly Consistency Job (2 AM):
- Fetch all doc IDs and checksums from PostgreSQL documents table
- Fetch all doc IDs currently indexed in vector DB
- For any document in the documents table but not in vector index: re-ingest
- For any document whose stored checksum differs from a fresh SHA-256
  of the current CMS content: re-ingest
- For any document in vector index but not in the documents table: delete from index
- Goal: catch webhooks that failed silently (network errors, CMS bugs)
- Production result: ~2–5 missed updates per night caught and recovered
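The nightly job's decision logic can be sketched as a pure function over three inputs. The names are hypothetical, and one assumption goes beyond the listing: the checksum comparison is taken to be between the checksum stored in the documents table and a fresh SHA-256 of the current CMS content.

```python
import hashlib

def sha256_hex(content):
    """Checksum in the same form as the documents.checksum column."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def plan_reconciliation(registry, indexed_ids, current_checksums):
    """Decide which documents to re-ingest and which to delete.

    registry:          doc_id -> checksum stored in the documents table
    indexed_ids:       doc IDs that currently have chunks in the vector index
    current_checksums: doc_id -> checksum of the live CMS content (assumed input)
    """
    reingest = set()
    for doc_id, stored in registry.items():
        if doc_id not in indexed_ids:
            reingest.add(doc_id)  # registered but never indexed
        elif current_checksums.get(doc_id, stored) != stored:
            reingest.add(doc_id)  # content changed, webhook was missed
    delete = set(indexed_ids) - set(registry)  # orphaned chunks
    return sorted(reingest), sorted(delete)

v1, v2 = sha256_hex("returns within 30 days"), sha256_hex("returns within 14 days")
registry = {"a": v1, "b": v1, "c": v1}
current = {"a": v1, "b": v2, "c": v1}
to_reingest, to_delete = plan_reconciliation(registry, {"a", "b", "z"}, current)
print(to_reingest, to_delete)  # ['b', 'c'] ['z']
```

Keeping the planning step free of I/O makes it easy to unit test and to run in a dry-run mode that only reports drift.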

Use an async queue between the webhook receiver and the ingestion worker. During bulk documentation updates, a CMS can fire dozens of webhooks simultaneously. Without a queue, the ingestion service faces concurrent embed-and-upsert operations that can overwhelm it — dropped webhooks result in stale index entries that the nightly job must recover. A simple Redis queue or a managed queue (AWS SQS, Google Pub/Sub) between the webhook endpoint and the ingestion worker lets you accept all webhook events immediately (returning 202 Accepted) and process them with controlled concurrency. The 5-minute freshness requirement is easily satisfied; queue depth becomes your monitoring signal for processing lag.
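A minimal in-process sketch of this pattern uses asyncio.Queue with a fixed number of workers. It is a stand-in for Redis, SQS, or Pub/Sub, not a replacement: the function names are illustrative, and `ingest` is a hypothetical coroutine that performs fetch, chunk, embed, and upsert for one document.

```python
import asyncio

async def ingestion_worker(queue, ingest, results):
    """Pull document IDs off the queue one at a time and ingest them."""
    while True:
        doc_id = await queue.get()
        try:
            await ingest(doc_id)
            results.append((doc_id, "ok"))
        except Exception as exc:  # keep the worker alive; a real system would retry
            results.append((doc_id, f"failed: {exc}"))
        finally:
            queue.task_done()

async def drain(doc_ids, ingest, concurrency=4):
    """Accept every event immediately, then process with bounded concurrency."""
    queue = asyncio.Queue()
    results = []
    workers = [asyncio.create_task(ingestion_worker(queue, ingest, results))
               for _ in range(concurrency)]
    for doc_id in doc_ids:
        queue.put_nowait(doc_id)  # the webhook handler would return 202 here
    await queue.join()            # wait until every queued item is processed
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results

async def demo_ingest(doc_id):
    await asyncio.sleep(0.01)  # placeholder for embed-and-upsert work

print(len(asyncio.run(drain([f"doc-{i}" for i in range(10)], demo_ingest))))
```

The worker count caps how many embed-and-upsert operations run at once, regardless of how many webhooks arrive in a burst.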

Delete-before-insert is mandatory. A document update that only inserts new chunks without first deleting the old ones results in both versions coexisting in the index. Retrieval will then return chunks from the old version — with outdated pricing, superseded policies, or removed feature descriptions — mixed with chunks from the new version, producing contradictory answers. Always delete all existing chunks for a document before inserting the updated ones.
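The failure is easy to reproduce with a toy in-memory index. The class below is purely illustrative and does not model any particular vector database's API; chunk IDs are assumed to be derived from the document ID and chunk position.

```python
class ChunkIndex:
    """Toy stand-in for a chunk store keyed by chunk ID."""

    def __init__(self):
        self._chunks = {}  # chunk_id -> (doc_id, text)

    def upsert_only(self, doc_id, chunk_texts):
        """Unsafe update: overwrites by position but never removes old chunks."""
        for i, text in enumerate(chunk_texts):
            self._chunks[f"{doc_id}:{i}"] = (doc_id, text)

    def delete_document(self, doc_id):
        stale = [cid for cid, (d, _) in self._chunks.items() if d == doc_id]
        for cid in stale:
            del self._chunks[cid]

    def replace_document(self, doc_id, chunk_texts):
        """Safe update: delete every old chunk, then insert the new ones."""
        self.delete_document(doc_id)
        self.upsert_only(doc_id, chunk_texts)

    def texts_for(self, doc_id):
        return [text for d, text in self._chunks.values() if d == doc_id]

unsafe, safe = ChunkIndex(), ChunkIndex()
old = ["30-day returns", "Free shipping", "Legacy plan pricing"]
new = ["14-day returns", "Free shipping"]
for index in (unsafe, safe):
    index.upsert_only("policy", old)
unsafe.upsert_only("policy", new)
safe.replace_document("policy", new)
print(unsafe.texts_for("policy"))  # still contains 'Legacy plan pricing'
print(safe.texts_for("policy"))    # ['14-day returns', 'Free shipping']
```

Even with deterministic chunk IDs, an upsert-only update leaves trailing chunks behind whenever the new version is shorter than the old one.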

Step 5: Deep Dive - Hybrid Search and Reranking

Retrieval quality is the single largest determinant of answer quality. An LLM cannot generate a correct answer from wrong or incomplete retrieved context — retrieval failure causes generation failure.

Hybrid Search: BM25 + Vector + Reranking

Pure semantic (vector) search fails on exact terms — product model numbers, error codes, abbreviations, and proper nouns. Pure keyword (BM25) search fails on paraphrases and conceptual questions. Hybrid search runs both in parallel and fuses the results. A final cross-encoder reranking pass selects the top-3 chunks with high precision. Production benchmarks show full cascading retrieval (BM25 + dense vector search + cross-encoder reranking) achieves 91% accuracy vs. 62% for dense-only and 58% for sparse-only.

Why reranking is necessary: Both BM25 and vector search score query and chunk independently — BM25 computes term overlap, vector search compares pre-computed embeddings. Neither method sees the query and chunk together, so their scores are imprecise proxies for relevance. A cross-encoder reranker processes each (query, chunk) pair jointly through a transformer, giving it full attention over both texts at once. This makes it far more accurate at judging true relevance, but it runs a full forward pass per candidate, so it can only be applied to a small set (20–30 chunks) after the fast bi-encoder retrieval has already narrowed the field. The cascade — broad hybrid retrieval followed by narrow cross-encoder reranking — is how you get cross-encoder precision at production latency.
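The second stage of the cascade reduces to scoring each (query, chunk) pair jointly and keeping the best few. In the sketch below the scorer is pluggable, and `token_overlap` is a deliberately crude stand-in so the example runs without a model download; in practice `score_pair` would wrap a cross-encoder such as ms-marco-MiniLM-L-6-v2. All names here are illustrative.

```python
def rerank(query, candidates, score_pair, top_k=3):
    """Score every (query, chunk) pair jointly and keep the top_k chunks.

    candidates: list of (chunk_id, text) from the fused hybrid results,
    typically the top 20-30. score_pair(query, text) returns a relevance score.
    """
    scored = [(score_pair(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:top_k]]

def token_overlap(query, text):
    """Placeholder scorer (NOT a cross-encoder): fraction of query words in text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

candidates = [
    ("c1", "Shipping takes five days"),
    ("c2", "Refunds are issued within 30 days of purchase"),
    ("c3", "Refund requests need an order number"),
    ("c4", "Our office hours"),
]
top = rerank("refund within 30 days", candidates, token_overlap, top_k=2)
print(top[0])  # ('c2', 0.75)
```

Because the scorer runs once per candidate, reranking cost grows linearly with the candidate count, which is why it is applied only after retrieval has narrowed the field.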

Step 6: Deep Dive - Multi-Turn Conversation and Context Management

A single-turn question-answering system is significantly simpler than a conversational one. Multi-turn conversations introduce a hard constraint: the LLM has a finite context window, and each new message — along with its retrieved chunks and the full conversation history — consumes tokens from that fixed budget.

Context Window Budget: Fitting Conversation + Retrieved Chunks

A GPT-5.2 context window is 400,000 tokens, and even GPT-4o, the model priced in Step 2, accepts 128,000. That sounds enormous, but the goal is not to fill it: every token you send to the LLM costs money. Token budgeting (allocating fixed slices of the window to each component) serves two purposes: it keeps per-query cost predictable at scale, and it prevents any single component from crowding out the others as conversations grow longer. A short, well-budgeted prompt (system prompt + recent history + 3 chunks + query) typically uses 2,000–4,000 tokens; the low end of that range is what the cost model in Step 2 assumes. When conversation history grows beyond its budget, summarize the older turns rather than truncating them, so that facts established early in the conversation survive.
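A history budget can be enforced with a few lines of bookkeeping. The sketch below keeps the newest turns that fit and returns the older ones so they can be folded into a running summary by a separate LLM call (not shown). The function names, the prompt layout, and the whitespace token counter are all assumptions; a real system would count with the model's tokenizer.

```python
def count_tokens(text):
    """Crude approximation: one token per whitespace-separated word."""
    return len(text.split())

def build_prompt(system_prompt, summary, history, chunks, query,
                 history_budget=1_000):
    """Assemble a prompt, keeping only the newest turns that fit the budget.

    Returns (prompt, overflow), where overflow holds the older turns that
    should be summarized before the next request.
    """
    kept, used = [], 0
    for turn in reversed(history):  # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > history_budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    overflow = history[:len(history) - len(kept)]

    parts = [system_prompt]
    if summary:
        parts.append("Summary of earlier conversation: " + summary)
    parts.extend(kept)
    parts.extend("Context: " + chunk for chunk in chunks)
    parts.append("Question: " + query)
    return "\n\n".join(parts), overflow

history = [f"turn {i} about returns" for i in range(5)]  # 4 tokens each
prompt, overflow = build_prompt("You are a support assistant.", "", history,
                                ["Returns are accepted for 30 days."],
                                "What about exchanges?", history_budget=10)
print(len(overflow))  # 3
```

Fixing the history slice in this way keeps the per-query token count, and therefore the cost, bounded no matter how long the conversation runs.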

Step 7: Deep Dive - Guardrails and Hallucination Prevention

A production customer-support chatbot must do more than answer correctly — it must refuse gracefully, fail safely, and never fabricate. Hallucinations in customer support are not just an accuracy problem; they are a legal and trust problem. A chatbot that confidently states the wrong refund period or a nonexistent product feature can create binding customer expectations and expose the company to liability.

Guardrail Layers: Input, Retrieval, Generation, Output

Guardrails operate at four independent layers. A defect at any layer propagates to the response, so defense must be deep. Input guardrails prevent malicious or off-topic queries from consuming LLM budget. Retrieval guardrails prevent low-confidence or irrelevant chunks from anchoring the LLM to wrong context. Generation guardrails constrain the LLM's output. Output guardrails catch and block harmful content before streaming to the user.
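The first two layers can be sketched as cheap checks that run before any LLM call. Everything specific below is an assumption for illustration: the regular expressions, the 0.35 score threshold, and the action names. A production input guardrail would more likely be a trained classifier, and the generation and output layers (constrained prompting, NLI validation) are not shown.

```python
import re

# Hypothetical patterns; real deployments need far broader coverage.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal .*system prompt", re.IGNORECASE),
]
MIN_RETRIEVAL_SCORE = 0.35  # assumed threshold; tune on labeled data

def passes_input_guardrail(message):
    """Layer 1: reject messages that look like instruction injection."""
    return not any(p.search(message) for p in INJECTION_PATTERNS)

def confident_chunks(scored_chunks, min_score=MIN_RETRIEVAL_SCORE):
    """Layer 2: drop retrieved chunks whose relevance score is too low."""
    return [text for text, score in scored_chunks if score >= min_score]

def answer_or_escalate(message, scored_chunks, generate):
    """Run the cheap guardrails first; call the LLM only when both pass."""
    if not passes_input_guardrail(message):
        return {"action": "refuse", "answer": None}
    context = confident_chunks(scored_chunks)
    if not context:
        return {"action": "escalate", "answer": None}  # route to a human agent
    return {"action": "answer", "answer": generate(message, context)}

def echo_generate(message, context):
    return " ".join(context)  # placeholder for the LLM call

result = answer_or_escalate("How long do refunds take?",
                            [("Refunds take 5 days.", 0.8), ("Office hours", 0.1)],
                            echo_generate)
print(result["action"])  # answer
```

Ordering the checks from cheapest to most expensive means a rejected or unanswerable query never consumes LLM budget.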

Step 8: Trade-offs

Vector Database Options: Managed Cloud vs. Open Source vs. PostgreSQL Extension

The choice of vector database is one of the first production decisions in a RAG system. Four options dominate: Pinecone (managed, simple, expensive at scale), Weaviate (managed or self-hosted, native hybrid search), Qdrant (self-hosted, high performance, open-source), and pgvector (adds vectors to existing PostgreSQL — lowest operational overhead for teams already running Postgres).

Key Architectural Decisions Compared

| Decision | Option A | Option B | Recommendation |
| --- | --- | --- | --- |
| Chunking strategy | Fixed-size (512 tokens, 50 overlap): fast, predictable, simple | Semantic chunking: coherent boundaries, higher quality, 10–20× slower ingestion | Fixed-size for real-time webhook updates; semantic for scheduled nightly full re-index |
| Vector database | Managed cloud (Pinecone, Weaviate Cloud): zero ops, higher cost at scale | Self-hosted (Qdrant, pgvector): lower cost, requires ops expertise | pgvector to start; Qdrant or Weaviate when knowledge base exceeds 1M chunks or latency becomes a constraint |
| Hybrid search | Weaviate native hybrid (single system, alpha parameter): simpler ops | Elasticsearch (BM25) + separate vector DB: more control, higher ops burden | Weaviate native hybrid for most teams; Elasticsearch + Qdrant if you need richer keyword query features |
| Reranking | Cross-encoder (ms-marco-MiniLM-L-6-v2): 91% accuracy, ~150ms overhead | No reranking: 62–75% accuracy, no added latency | Always rerank in production; the accuracy gain outweighs the latency cost in all customer support scenarios |
| Conversation memory | Sliding window (last N messages, drop the rest): simple, loses early context | Summarization buffer: preserves key facts, adds one LLM call per N turns | Summarization buffer for support conversations; sliding window only for single-session FAQ lookup tools |
| Hallucination prevention | System prompt only: easy to implement, easy to bypass | Multi-layer: similarity threshold + system prompt + output NLI check | Multi-layer in production; system prompt only acceptable for internal demos |

Common Failure Modes in Production RAG

| Failure Mode | Symptom | Root Cause | Fix |
| --- | --- | --- | --- |
| Retrieval miss | Answer is 'I don't know' for a question that IS in the knowledge base | Query phrasing doesn't match chunk content; wrong chunk boundaries split the relevant sentence | Improve chunking; add query rewriting; lower the abstention threshold; check if the document was actually ingested |
| Context overflow | Answer degrades or ignores retrieved chunks in long conversations | Total tokens (history + chunks + prompt) exceeds model's effective attention span | Implement conversation summarization; reduce retrieved chunk count from 5 to 3; enforce token budget |
| Hallucination | Answer contains specific details (prices, dates, policies) not in the retrieved chunks | LLM falls back to parametric memory when retrieved context is ambiguous or missing | Tighten system prompt; add output NLI validation; raise the retrieval confidence threshold |
| Stale knowledge | Answer cites old policy or discontinued product feature | Webhook failed silently; nightly consistency job didn't catch the drift | Add webhook retry logic with exponential backoff; verify checksums in nightly job; alert on drift |
| Retrieval pollution | Chunks from unrelated products or internal-only documents appear in answers | No metadata filtering on retrieval; all documents indexed in a single namespace | Add product/category metadata to each chunk; filter vector search by metadata before similarity ranking |
| Prompt injection | Bot goes off-topic, reveals system prompt, or claims false capabilities | User embeds instructions in their message ('Ignore previous instructions and...') | Add input guardrail classifier for instruction-like patterns; never include secrets in the system prompt |
| Citation hallucination | Answer cites [Source: Returns Policy] but the claim doesn't appear in that document | LLM attributes claims to real documents it can see but uses parametric knowledge for content | Run NLI check between each claim and its cited chunk; flag and block answers with unverified citations |

Summary

| Concept | What It Solves | Key Insight |
| --- | --- | --- |
| Ingestion pipeline | Raw documents must be transformed into searchable vector chunks before retrieval can work | Chunk at 512 tokens with 50-token overlap; attach metadata to every chunk; use checksums to skip unchanged documents on re-index |
| Webhook-driven updates | Knowledge base must reflect CMS changes within minutes, not days | Delete-before-insert: always remove old chunks before inserting new ones; pair real-time webhooks with a nightly consistency job that catches silent failures |
| Hybrid search (BM25 + vector) | Pure semantic search misses exact terms; pure keyword search misses paraphrases; you need both | Run BM25 and vector search in parallel; merge with Reciprocal Rank Fusion; production systems achieve 91% accuracy with full cascade vs. 62% with dense-only |
| Cross-encoder reranking | Bi-encoder similarity scores are imprecise; reranking selects the truly best chunks from the fused candidate set | Run the cross-encoder on only the top 20 to 30 fused candidates; this delivers cross-encoder accuracy at 150ms of latency overhead, not seconds |
| Multi-turn context management | Conversation history grows without bound; the LLM context window does not | Summarize older turns to preserve key facts while reducing token cost; always rewrite the current query to a standalone retrieval query incorporating context |
| Multi-layer guardrails | LLMs hallucinate; prompt injection is real; confidently wrong answers damage trust and legal standing | Layer defenses: input classifier → retrieval confidence threshold → generation constraints → output NLI check; abstain gracefully and escalate to humans |
| Vector database selection | Every vector database makes different trade-offs across cost, ops complexity, hybrid search support, and scale ceiling | Start with pgvector; migrate to Qdrant or Weaviate when knowledge base exceeds 1M chunks; choose Weaviate if native hybrid search simplicity is the priority |

The AI customer support RAG system is the canonical introduction to production LLM application design. Every pattern you learn here — chunked ingestion, hybrid retrieval, reranking, context budgeting, guardrails, and event-driven knowledge freshness — reappears in any system that needs to ground an LLM in real, specific, and updatable knowledge: internal knowledge bases, legal document Q&A, medical information systems, and enterprise search. The retrieval stack is reusable across all of these; only the documents change.
