
Retrieval-Augmented Generation (RAG) is an AI architecture that combines information retrieval with large language models (LLMs). Instead of relying solely on the LLM's training data for answers, RAG retrieves relevant information from a knowledge base and provides it as context to the LLM.
This approach addresses three critical LLM limitations: outdated knowledge (training data cutoff), hallucination (making up facts), and lack of private data access (not knowing your internal documents).

| Problem | LLM Without RAG | LLM With RAG |
|---|---|---|
| Knowledge cutoff | Only knows data up to training date | Retrieves fresh information from knowledge base |
| Hallucination | May invent facts when uncertain | Grounds answers in retrieved context |
| Private data | Has not seen your internal documents | Searches enterprise knowledge base |
| Source attribution | Cannot cite sources | Returns exact chunks with citations |
| Cost | Adding knowledge requires expensive fine-tuning | No training needed; just index documents |

RAG compared with fine-tuning:

| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Instant (update vector DB) | Weeks (retrain model) |
| Training cost | None | $100-$10,000+ |
| New concepts | Yes — add to knowledge base | No — requires retraining |
| Hallucination prevention | Good (context grounding) | Limited (memorization) |
| Source attribution | Yes | No (black box) |
| Latency | Higher (retrieval + generation) | Lower (just generation) |
| When to use | Need up-to-date info, citations, private data | Need to change model behavior, tone, output format |
```text
                      ┌───────────────────────┐
                      │  Document Processing  │  (Ingestion Pipeline)
                      │  (Chunking → Embed)   │
                      └──────────┬────────────┘
                                 │
                                 ▼
                      ┌───────────────────────┐
                      │    Vector Database    │
                      │   (Knowledge Base)    │
                      └──────────┬────────────┘
                                 │
                      ┌──────────▼────────────┐
User Query ──────────►│  Retrieval Pipeline   │
                      │  (embed → search →    │
                      │   fetch chunks)       │
                      └──────────┬────────────┘
                                 │ (relevant chunks)
                                 ▼
                      ┌───────────────────────┐
                      │      Generation       │
                      │    (LLM + Context)    │
                      └──────────┬────────────┘
                                 │ (answer with citations)
                                 ▼
                            Final Answer
```
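
To make the flow in the diagram concrete, here is a minimal sketch of the ingestion, retrieval, and prompt-assembly stages using only the standard library. The bag-of-words `embed` function, the in-memory `index`, and the sample sentences are stand-ins invented for illustration; a real system would use an embedding model and a vector database, and would send the assembled prompt to an LLM.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Ingestion: one chunk per string, embedded into an in-memory index
chunks = [
    "Employees accrue 15 vacation days per year.",
    "Expense reports are due by the fifth of each month.",
    "Remote work requires manager approval.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Retrieval: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query: str) -> str:
    """Generation input: retrieved chunks become the context for the LLM."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

top = retrieve("how many vacation days per year", k=1)
print(top)
print(build_prompt("how many vacation days per year"))
```

Each function maps to one box in the diagram, which makes it easy to swap a toy stage for a production component later.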
```python
from langchain_community.document_loaders import (
    ConfluenceLoader,  # available for wiki sources (not used below)
    CSVLoader,
    PyPDFLoader,
    TextLoader,
    UnstructuredHTMLLoader,
)

# Load various document types, one loader per file
loaders = {
    ".pdf": PyPDFLoader("policy.pdf"),
    ".txt": TextLoader("readme.txt"),
    ".html": UnstructuredHTMLLoader("page.html"),
    ".csv": CSVLoader("data.csv"),
}

# Each loader returns a list of Document objects with text and metadata
documents = []
for loader in loaders.values():
    documents.extend(loader.load())
```
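Chunking is critical. Chunks that are too small lose context; chunks that are too large pull in noise:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,  # token-based alternative (not used below)
)

# Recursive character splitting (most common)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # measured by length_function, so characters here, not tokens
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""],
    length_function=len,
)

# Semantic chunking: splits where embedding similarity drops (topic boundaries).
# SemanticChunker ships in the separate langchain_experimental package.
openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_splitter = SemanticChunker(
    embeddings=openai_embeddings,
    breakpoint_threshold_type="percentile",
)

chunks = text_splitter.split_documents(documents)
```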
| Strategy | Chunk Size | Recall | Precision | Best For |
|---|---|---|---|---|
| Small chunks | 200-500 tokens | High | Low (fragmented) | FAQ, definitions |
| Medium chunks | 500-1500 tokens | Medium | Medium | General RAG |
| Large chunks | 1500-3000 tokens | Low | High | Summarization |
| Semantic chunks | Variable | High | High | Complex documents |
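
To show what chunk size and overlap actually do, here is a minimal sketch of fixed-window chunking in plain Python. It is a simplification for illustration: `chunk_text` is a hypothetical helper that slices by character position and ignores separators, so it is not how `RecursiveCharacterTextSplitter` works internally.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

pieces = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(pieces)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window repeats the last two characters of the previous one, which is how overlap keeps a sentence that straddles a boundary visible in both chunks.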
```python
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536,
)

vector_store = QdrantVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="knowledge_base",
)
```
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

def rewrite_query(original_query: str) -> str:
    """Rewrite a user query so it retrieves better from the knowledge base."""
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a query rewriting assistant. "
                   "Rewrite the user's question to be more specific "
                   "and searchable in a knowledge base."),
        # Pass the query as a variable so braces in user text are not
        # parsed as template fields
        ("user", "{question}"),
    ])
    llm = ChatOpenAI(model="gpt-4o-mini")
    # Pipe the prompt into the model and fill in the variable at call time
    return (prompt | llm).invoke({"question": original_query}).content

# Example:
# Original: "What about vacation?"
# Rewritten: "What is the company policy on paid vacation days,
#             including accrual rate and carry-over limits?"
```
Simple similarity search:
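```python
rewritten_query = rewrite_query("What about vacation?")

# Returns (document, score) pairs; score_threshold drops weak matches
results = vector_store.similarity_search_with_score(
    query=rewritten_query,
    k=5,
    score_threshold=0.75,
)
```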
Hybrid search (vector + keyword):
```python
# BM25 keyword scores combined with vector similarity
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # needs the rank_bm25 package

bm25_retriever = BM25Retriever.from_documents(chunks, k=5)
vector_retriever = vector_store.as_retriever(search_kwargs={"k": 5})

# Weighted ensemble: vector results count for more than keyword results
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.7, 0.3],
)
```
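
One common way to merge a vector ranking with a keyword ranking is weighted Reciprocal Rank Fusion (RRF), where each list contributes `weight / (c + rank)` to a document's score. The sketch below is a standalone illustration and is not claimed to match any library's internals; the document IDs are made up, and `c = 60` is the constant commonly used for RRF.

```python
from collections import defaultdict

def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; each list adds weight / (c + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search ranking
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical BM25 ranking
fused = weighted_rrf([vector_hits, keyword_hits], weights=[0.7, 0.3])
print(fused)
```

Because RRF works on ranks rather than raw scores, it avoids having to normalize BM25 scores and cosine similarities onto the same scale.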
Multi-query retrieval:
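```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def generate_sub_queries(query: str, n: int = 3) -> list[str]:
    """Generate multiple perspectives on the same query."""
    prompt = (
        f"Generate {n} different versions of this query for search, "
        f"one per line: {query}"
    )
    response = llm.invoke(prompt)
    # Drop blank lines and leading list markers the model may add
    lines = (line.strip().lstrip("- ") for line in response.content.splitlines())
    return [line for line in lines if line]

sub_queries = generate_sub_queries("remote work policy")
# Returns (for example):
# - "How many days per week can employees work from home?"
# - "Remote work eligibility and approval process"
# - "Company work-from-home policy and guidelines"

# Retrieve for every sub-query; the combined list can contain duplicates
all_results = []
for q in sub_queries:
    all_results.extend(vector_store.similarity_search(q, k=3))
```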
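
Different sub-queries often retrieve the same chunk, so the combined results usually need deduplicating before they go into the prompt. A minimal sketch, with plain strings standing in for each document's `page_content` (the `merge_unique` helper and sample texts are hypothetical):

```python
def merge_unique(result_lists: list[list[str]]) -> list[str]:
    """Flatten per-query results, keeping the first occurrence of each chunk."""
    seen: set[str] = set()
    merged = []
    for results in result_lists:
        for text in results:
            if text not in seen:
                seen.add(text)
                merged.append(text)
    return merged

per_query_results = [
    ["Remote work needs approval.", "Two office days are required."],
    ["Two office days are required.", "Equipment is provided."],
]
unique_results = merge_unique(per_query_results)
print(unique_results)
```

Keeping first-seen order preserves the ranking from the earliest sub-query that found each chunk.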
Improve retrieval quality by re-ranking results with a cross-encoder:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# `query` is the user question; `initial_results` are the Documents
# returned by first-stage retrieval
pairs = [(query, result.page_content) for result in initial_results]
scores = reranker.predict(pairs)

# Sort documents by cross-encoder score, highest first
reranked_results = [
    result for _, result in
    sorted(zip(scores, initial_results), key=lambda x: x[0], reverse=True)
]

# Take top 3 after reranking
final_context = reranked_results[:3]
```
```python
rag_prompt = """You are a helpful assistant that answers questions based
on the provided context. Follow these rules:

1. Answer ONLY using information from the provided context.
2. If the context does not contain the answer, say
   "I don't have enough information to answer this question."
3. Do NOT make up or infer information not in the context.
4. Include citations in [Source: filename.pdf] format.
5. Be concise but thorough.
6. Format lists and tables using markdown when appropriate.

Context:
{context}

Question: {question}

Answer (with citations):"""
```
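
The citation rule in the prompt only works if each chunk in the context carries its source name. Here is a minimal sketch of assembling such a context by hand; the `format_context` helper, the shortened `PROMPT` template, and the file names are assumptions for illustration.

```python
def format_context(chunks: list[dict]) -> str:
    """Join chunks into one context string, labelling each with its source file."""
    parts = [f"[Source: {c['source']}]\n{c['text']}" for c in chunks]
    return "\n\n".join(parts)

# Shortened stand-in for the full prompt template
PROMPT = "Context:\n{context}\n\nQuestion: {question}\n\nAnswer (with citations):"

retrieved = [
    {"source": "handbook.pdf", "text": "Vacation accrues monthly."},
    {"source": "faq.txt", "text": "Unused days may carry over."},
]
prompt = PROMPT.format(
    context=format_context(retrieved),
    question="How does vacation accrue?",
)
print(prompt)
```

With the label placed directly above each chunk, the model can copy it into the `[Source: ...]` citation the rules ask for.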
```python
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    chain_type="stuff",  # "stuff" = put all context in one prompt
    retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
    chain_type_kwargs={
        "prompt": PromptTemplate.from_template(rag_prompt),
        # Prefix each chunk with its source so the model can cite it
        "document_prompt": PromptTemplate.from_template(
            "[Source: {source}]\n{page_content}"
        ),
        "verbose": True,
    },
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "What is the vacation policy?"})
print(result["result"])
# "Employees accrue 15 days of paid vacation per year (accrued monthly
# at 1.25 days/month). Unused days may carry over up to 5 days to the
# next calendar year. [Source: HR_Handbook_2026.pdf]"
print(result["source_documents"])
# [Document(page_content="...", metadata={"source": "HR_Handbook_2026.pdf", ...})]
```
For questions requiring information from multiple documents:
```text
Q: "Which products are affected by the new regulation and who needs training?"
Step 1: Retrieve regulation document → identifies affected product categories
Step 2: Use product list to query training documents
Step 3: Compose answer from both sources
```
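
The three steps can be sketched as two chained lookups, where the output of the first hop becomes the query for the second. The `REGULATIONS` and `TRAINING` dictionaries below are hypothetical stand-ins for two document collections; in practice each hop would be a retrieval call.

```python
# Hypothetical in-memory data standing in for two document collections
REGULATIONS = {"new regulation": ["batteries", "chargers"]}
TRAINING = {
    "batteries": ["assembly team"],
    "chargers": ["assembly team", "QA team"],
}

def affected_products(regulation: str) -> list[str]:
    """Hop 1: look up which product categories a regulation covers."""
    return REGULATIONS.get(regulation, [])

def teams_needing_training(products: list[str]) -> list[str]:
    """Hop 2: use the hop-1 result as the query for the training documents."""
    teams = []
    for product in products:
        for team in TRAINING.get(product, []):
            if team not in teams:
                teams.append(team)
    return teams

def multi_hop(regulation: str) -> dict:
    """Compose the final answer from both hops."""
    products = affected_products(regulation)
    return {"products": products, "teams": teams_needing_training(products)}

answer = multi_hop("new regulation")
print(answer)
```

The second hop cannot be planned up front because its query depends on what the first hop returns, which is what separates multi-hop retrieval from running several independent searches.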
Use an LLM agent to plan retrieval, decide which tools to use, and iterate:
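```python
import numexpr
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain.tools import Tool
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

retrieval_tool = Tool(
    name="search_knowledge_base",
    func=lambda q: vector_store.similarity_search(q, k=3),
    description="Search company knowledge base for policies and procedures",
)
calculator_tool = Tool(
    name="calculator",
    # eval() on model-written text can run arbitrary code, so use a
    # numeric expression evaluator instead (numexpr is an extra dependency)
    func=lambda expr: str(numexpr.evaluate(expr.strip()).item()),
    description="Perform mathematical calculations",
)
tools = [retrieval_tool, calculator_tool]

# The agent prompt needs an agent_scratchpad placeholder for tool calls
agent_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using the available tools."),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])
agent = create_openai_functions_agent(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    tools=tools,
    prompt=agent_prompt,
)
agent_executor = AgentExecutor(agent=agent, tools=tools)
```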
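
A calculator tool evaluates text written by the model, so it is worth restricting what that text is allowed to do. Below is a minimal standard-library sketch that accepts only numeric literals with `+`, `-`, `*`, and `/` and rejects everything else. `safe_calculate` is a hypothetical helper, not part of LangChain; it could be passed as the `func` of a calculator tool.

```python
import ast
import operator

# Binary operators the calculator accepts
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_calculate(expression: str) -> float:
    """Evaluate +, -, *, / over numeric literals; reject everything else."""
    def evaluate(node: ast.AST) -> float:
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError(f"Unsupported expression: {expression!r}")

    # Parse to a syntax tree first; nothing is executed by ast.parse
    return evaluate(ast.parse(expression, mode="eval").body)

print(safe_calculate("15 * 1.25"))  # 18.75
```

Function calls, attribute access, and names all fall through to the `ValueError`, so input such as `__import__('os')` is rejected instead of executed.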
In Self-RAG, the model checks its own retrieval and generation quality:

```python
# retrieve, check_relevance, generate, check_faithfulness,
# generate_with_constraints and check_usefulness are placeholders
# for your own retrieval, grading, and generation functions.
def self_rag(query: str) -> str:
    # 1. Retrieve
    docs = retrieve(query, k=5)

    # 2. Check: is retrieved info relevant?
    relevance_score = check_relevance(query, docs)
    if relevance_score < 0.7:
        return "I cannot find relevant information to answer this question."

    # 3. Generate answer
    answer = generate(query, docs)

    # 4. Check: does answer match retrieved info?
    faithfulness_score = check_faithfulness(answer, docs)
    if faithfulness_score < 0.8:
        return generate_with_constraints(query, docs)  # Regenerate

    # 5. Check: is answer useful? The score is computed but not acted on
    #    here; log it or add a threshold as needed.
    usefulness_score = check_usefulness(query, answer)
    return answer
```
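
As a rough idea of what a faithfulness check could compute, the sketch below scores an answer by the fraction of its words that also appear in the retrieved documents. This lexical overlap is a crude proxy chosen only because it runs without a model; `overlap_faithfulness` is a hypothetical helper, and production systems more commonly use an LLM judge or an entailment model for this step.

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_faithfulness(answer: str, docs: list[str]) -> float:
    """Fraction of distinct answer tokens that also occur in the documents."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens: set[str] = set()
    for doc in docs:
        context_tokens |= _tokens(doc)
    return len(answer_tokens & context_tokens) / len(answer_tokens)

docs = ["Employees accrue 15 vacation days per year."]
grounded = overlap_faithfulness("Employees accrue 15 days", docs)
invented = overlap_faithfulness("Employees get 30 days", docs)
print(grounded, invented)  # 1.0 0.5
```

The second answer scores lower because "get" and "30" never appear in the context, which is the kind of signal a faithfulness threshold acts on.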
Corrective RAG retrieves, evaluates the retrieved documents, and corrects course before generating:

```python
# retrieve, evaluate_retrieval, rerank_and_retry, web_search, generate,
# generate_without_context and the web_search_available flag are
# placeholders for your own components.
def corrective_rag(query: str) -> str:
    # Initial retrieval
    docs = retrieve(query)

    # Evaluate retrieval quality
    quality = evaluate_retrieval(query, docs)

    if quality == "excellent":
        # Direct generation
        return generate(query, docs)
    if quality == "partial":
        # Re-rank and try again
        docs = rerank_and_retry(query, docs)
        return generate(query, docs)

    # Poor retrieval: try web search, or generate without context
    if web_search_available:
        docs = web_search(query)
        return generate(query, docs)
    return generate_without_context(query)
```
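
The quality label that drives the branches has to come from somewhere. One simple option, sketched here under assumed thresholds, is to grade retrieval from the best similarity score. `grade_retrieval` and the cut-offs `0.8` and `0.5` are illustrative choices, not values from a library or from this article; an LLM-based grader is another option.

```python
def grade_retrieval(scores: list[float], high: float = 0.8, low: float = 0.5) -> str:
    """Grade retrieval from similarity scores: excellent, partial, or poor."""
    if not scores:
        return "poor"  # nothing was retrieved at all
    best = max(scores)
    if best >= high:
        return "excellent"
    if best >= low:
        return "partial"
    return "poor"

print(grade_retrieval([0.91, 0.40]))  # excellent
print(grade_retrieval([0.62, 0.55]))  # partial
print(grade_retrieval([0.30]))        # poor
```

The thresholds would need tuning against labelled queries, since similarity scores are not comparable across embedding models.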
| Metric | What It Measures | Target |
|---|---|---|
| Retrieval recall | Did we find all relevant chunks? | > 0.90 |
| Context precision | Are retrieved chunks actually relevant? | > 0.85 |
| Faithfulness | Does the answer stay true to context? | > 0.95 |
| Answer relevance | Does the answer address the query? | > 0.90 |
| Hallucination rate | Percentage of fabricated facts | < 1% |
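
The first two metrics can be computed from a labelled test set that lists which chunks are relevant to each query. The sketch below uses simple set-based definitions with made-up chunk IDs; evaluation frameworks may define these metrics differently (for example, with rank-aware precision), so treat the function names and formulas as assumptions.

```python
def retrieval_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant chunk IDs that appear in the retrieved list."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant.intersection(retrieved)) / len(relevant)

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunk IDs that are relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)

retrieved = ["c1", "c2", "c3", "c4"]  # what the retriever returned
relevant = {"c1", "c3", "c9"}         # ground-truth labels for the query

recall = retrieval_recall(retrieved, relevant)
precision = context_precision(retrieved, relevant)
print(round(recall, 2), precision)  # 0.67 0.5
```

Here the retriever found two of the three relevant chunks and two of its four results were relevant, so both numbers fall short of the targets in the table.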
```python
from statistics import mean

# count_tokens, log_to_db and alert_pagerduty are placeholders for your
# tokenizer, storage, and alerting integrations.
class RAGMonitor:
    def __init__(self):
        self.metrics = {
            "retrieval_latency": [],
            "generation_latency": [],
            "context_token_count": [],
            "completion_token_count": [],
        }

    def log_query(self, query: str, retrieval_time: float,
                  gen_time: float, context: str, response: str):
        self.metrics["retrieval_latency"].append(retrieval_time)
        self.metrics["generation_latency"].append(gen_time)
        # Track context size: too large is expensive, too small hurts quality
        self.metrics["context_token_count"].append(count_tokens(context))
        self.metrics["completion_token_count"].append(count_tokens(response))
        # Log for later analysis
        log_to_db(query, response, context)

    def alert_if_degraded(self):
        # Average over the most recent 100 queries
        recent = self.metrics["generation_latency"][-100:]
        if not recent:
            return  # mean() raises on an empty list
        avg_latency = mean(recent)
        if avg_latency > 5.0:  # seconds
            alert_pagerduty("RAG generation latency degraded")
```
```text
Total RAG latency: 1.5-4 seconds
├── Query embedding:    100-300ms
├── Vector search:      50-200ms
├── Reranking:          100-300ms
├── Context formatting: 10-50ms
└── LLM generation:     1000-3000ms (depends on output length)
```
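
Adding up the per-stage ranges is a quick way to sanity-check a latency budget and to see which stage dominates. The sketch below copies the ranges from the breakdown; the dictionary keys are names chosen for illustration.

```python
# Per-stage latency ranges in milliseconds, copied from the breakdown
STAGES_MS = {
    "query_embedding": (100, 300),
    "vector_search": (50, 200),
    "reranking": (100, 300),
    "context_formatting": (10, 50),
    "llm_generation": (1000, 3000),
}

best_case_ms = sum(low for low, _ in STAGES_MS.values())
worst_case_ms = sum(high for _, high in STAGES_MS.values())

# Share of the worst-case budget spent in LLM generation
generation_share = STAGES_MS["llm_generation"][1] / worst_case_ms

print(best_case_ms, worst_case_ms, round(generation_share, 2))
```

Generation accounts for most of the budget, so shortening outputs or streaming tokens tends to matter more for perceived latency than tuning the vector search.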
| Practice | Why | Implementation |
|---|---|---|
| Metadata storage | Filtering, provenance, citations | Store filename, page number, section title, date |
| Parent-child chunking | Retrieve small chunks, return big context | Store small chunks for search, link to parent for generation |
| Document summary | Global context for broad questions | Store one summary per document, use for initial filtering |
| Content deduplication | Avoid redundant context | Hash chunks, remove near-duplicates |
| Versioning | Track document updates | Add version field to metadata, support rollback |
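
Two of these practices are easy to sketch in plain Python: hash-based deduplication and parent-child lookup. The helper names, the `parent_id` field, and the sample texts are assumptions for illustration. The hash only catches chunks that are identical after case and whitespace normalization; true near-duplicate detection would need something like MinHash or embedding similarity.

```python
import hashlib

def chunk_key(text: str) -> str:
    """Hash of case- and whitespace-normalized text, used to spot repeats."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(chunks: list[str]) -> list[str]:
    """Keep the first chunk seen for each normalized hash."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        key = chunk_key(chunk)
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

# Parent-child chunking: search over small chunks, return the larger parent
parents = {"sec-1": "Full text of the vacation policy section ..."}
children = [
    {"text": "15 vacation days per year", "parent_id": "sec-1"},
    {"text": "carry over up to 5 days", "parent_id": "sec-1"},
]

def parents_for(matched_children: list[dict]) -> list[str]:
    """Map matched child chunks to their distinct parent texts, in order."""
    parent_ids = dict.fromkeys(c["parent_id"] for c in matched_children)
    return [parents[pid] for pid in parent_ids]

print(deduplicate(["Hello  World", "hello world", "Other"]))
print(parents_for(children))
```

Both child chunks point to the same section, so the generator receives that parent text once instead of two overlapping fragments.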
RAG is one of the most practical ways to deploy LLMs with accurate, up-to-date, and attributable knowledge. It is also evolving rapidly, with agentic, self-reflective, and corrective variants pushing the boundaries of what retrieval-augmented systems can achieve.