Retrieval-Augmented Generation (RAG): Architecture, Patterns, and Production

Retrieval-Augmented Generation (RAG) is an AI architecture that combines information retrieval with large language models (LLMs). Instead of relying solely on the LLM's training data for answers, RAG retrieves relevant information from a knowledge base and provides it as context to the LLM.

Introduction

RAG addresses three critical LLM limitations: outdated knowledge (the training data cutoff), hallucination (making up facts), and lack of access to private data (the model has never seen your internal documents).

Why RAG?

LLM Limitations and RAG Solutions

Problem	LLM Without RAG	LLM With RAG
Knowledge cutoff	Only knows data up to training date	Retrieves fresh information from knowledge base
Hallucination	May invent facts when uncertain	Grounds answers in retrieved context
Private data	Has not seen your internal documents	Searches enterprise knowledge base
Source attribution	Cannot cite sources	Returns exact chunks with citations
Cost	Fine-tuning is expensive	No training needed — just index documents

RAG vs. Fine-Tuning

Aspect	RAG	Fine-Tuning
Knowledge updates	Minutes (re-index documents in the vector DB)	Hours to weeks (retrain model)
Training cost	None (only embedding and indexing costs)	$100-$10,000+
New concepts	Yes — add to knowledge base	No — requires retraining
Hallucination prevention	Good (context grounding)	Limited (memorization)
Source attribution	Yes	No (black box)
Latency	Higher (retrieval + generation)	Lower (just generation)
When to use	Need up-to-date info, citations, private data	Need to change model behavior, tone, output format

RAG Architecture

High-Level Flow

                      ┌───────────────────────┐
                      │   Document Processing  │ (Ingestion Pipeline)
                      │   (Chunking → Embed)   │
                      └──────────┬────────────┘
                                 │
                                 ▼
                      ┌───────────────────────┐
                      │   Vector Database     │
                      │   (Knowledge Base)    │
                      └──────────┬────────────┘
                                 │
                      ┌──────────▼────────────┐
User Query ──────────►│   Retrieval Pipeline   │
                      │  (embed → search →     │
                      │   fetch chunks)        │
                      └──────────┬────────────┘
                                 │ (relevant chunks)
                                 ▼
                      ┌───────────────────────┐
                      │   Generation          │
                      │  (LLM + Context)      │
                      └──────────┬────────────┘
                                 │ (answer with citations)
                                 ▼
                           Final Answer
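The flow in the diagram can be sketched end to end without any external service. The example below is a toy illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, the list returned by `ingest` stands in for a vector database, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(docs: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Ingestion pipeline: embed each document and keep (source, text, vector)."""
    return [(source, text, embed(text)) for source, text in docs.items()]


def retrieve(index, query: str, k: int = 2):
    """Retrieval pipeline: embed the query and return the k most similar rows."""
    q = embed(query)
    return sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)[:k]


def build_prompt(query: str, hits) -> str:
    """Generation input: retrieved chunks plus the question, ready for an LLM."""
    context = "\n".join(f"[Source: {src}] {text}" for src, text, _ in hits)
    return f"Context:\n{context}\n\nQuestion: {query}"


index = ingest({
    "hr.txt": "Employees accrue paid vacation days every month.",
    "it.txt": "Laptops are replaced every three years.",
})
hits = retrieve(index, "How do vacation days accrue?", k=1)
prompt = build_prompt("How do vacation days accrue?", hits)
```

The resulting `prompt` is what would be sent to the LLM in the generation step; swapping `embed` and the in-memory list for a real model and vector store leaves the overall shape unchanged.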

The Ingestion Pipeline

1. Document Loading

from langchain_community.document_loaders import (
    PyPDFLoader, TextLoader, CSVLoader,
    UnstructuredHTMLLoader,
)

# Map each file extension to a loader for one example file
loaders = {
    '.pdf': PyPDFLoader("policy.pdf"),
    '.txt': TextLoader("readme.txt"),
    '.html': UnstructuredHTMLLoader("page.html"),
    '.csv': CSVLoader("data.csv"),
}

# Each loader returns a list of Document objects (text plus metadata)
documents = []
for ext, loader in loaders.items():
    documents.extend(loader.load())
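In practice the loader is usually chosen per file rather than hard-coded. The following is a dependency-free sketch of that dispatch, assuming only plain-text and CSV inputs; `load_txt`, `load_csv`, `LOADERS`, and `load_directory` are hypothetical names, and the dictionaries stand in for LangChain `Document` objects.

```python
import csv
import tempfile
from pathlib import Path


def load_txt(path: Path) -> list[dict]:
    """One document per text file, tagged with its source file name."""
    return [{"page_content": path.read_text(encoding="utf-8"),
             "metadata": {"source": path.name}}]


def load_csv(path: Path) -> list[dict]:
    """One document per CSV row, rendered as 'column: value' lines."""
    with path.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    return [{"page_content": "\n".join(f"{k}: {v}" for k, v in row.items()),
             "metadata": {"source": path.name, "row": i}}
            for i, row in enumerate(rows)]


# Dispatch table: file suffix -> loader function
LOADERS = {".txt": load_txt, ".csv": load_csv}


def load_directory(root: Path) -> list[dict]:
    """Walk a directory and load every file that has a registered loader."""
    docs = []
    for path in sorted(root.rglob("*")):
        loader = LOADERS.get(path.suffix.lower())
        if path.is_file() and loader:
            docs.extend(loader(path))
    return docs


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "readme.txt").write_text("Hello RAG", encoding="utf-8")
    (root / "data.csv").write_text("name,team\nAda,Search\n", encoding="utf-8")
    (root / "image.png").write_bytes(b"\x89PNG")  # no loader registered: skipped
    loaded = load_directory(root)
```

Unsupported files are skipped silently here; a real pipeline would more likely log them so that gaps in the knowledge base are visible.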

2. Chunking Strategy

Chunking is critical — too small loses context, too large includes noise:

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Recursive character splitting (most common)
# chunk_size and chunk_overlap are counted in characters here because length_function=len
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""],
    length_function=len,
)

# Semantic chunking (AI-aware: splits at topic boundaries)
openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_splitter = SemanticChunker(
    embeddings=openai_embeddings,
    breakpoint_threshold_type="percentile",
)

chunks = text_splitter.split_documents(documents)
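To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window chunker in plain Python. It is a simplification: unlike the recursive splitter above it ignores separators and cuts purely by character position, and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
pieces = chunk_text(sample, chunk_size=12, chunk_overlap=4)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.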

Chunking Trade-offs

Strategy	Chunk Size	Recall	Precision	Best For
Small chunks	200-500 tokens	Low (fragmented)	High	FAQ, definitions
Medium chunks	500-1500 tokens	Medium	Medium	General RAG
Large chunks	1500-3000 tokens	High	Low (noisy)	Summarization
Semantic chunks	Variable	High	High	Complex documents
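The idea behind semantic chunks can be illustrated with a small sketch: start a new chunk wherever two neighbouring sentences stop resembling each other. This is an assumption-laden toy, using word-overlap cosine similarity in place of real embeddings and a fixed `threshold` in place of a percentile-based breakpoint; `semantic_chunks` is a hypothetical name.

```python
import math
import re
from collections import Counter


def _vec(sentence: str) -> Counter:
    """Toy sentence vector: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Group consecutive sentences; break where adjacent similarity drops."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if _cosine(_vec(prev), _vec(cur)) < threshold:
            groups.append([cur])      # topic shift: start a new chunk
        else:
            groups[-1].append(cur)    # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


topic_chunks = semantic_chunks(
    "Vacation days accrue monthly. Unused vacation days carry over. "
    "Laptops are replaced every three years."
)
```

The two vacation sentences stay together and the laptop sentence becomes its own chunk, which is why the chunk size in the table is listed as variable.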

3. Embedding

from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536,
)

vector_store = QdrantVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="knowledge_base",
)
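Conceptually, the vector store keeps one vector per chunk and answers queries by nearest-neighbour search. The sketch below shows that idea with brute-force cosine similarity; it is an illustration only (real stores such as Qdrant use approximate indexes), and `InMemoryVectorStore` with its three-dimensional demo vectors is hypothetical.

```python
import math


class InMemoryVectorStore:
    """Brute-force cosine-similarity search over unit-normalised vectors."""

    def __init__(self):
        self._rows = []  # list of (unit_vector, payload)

    @staticmethod
    def _normalise(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("zero vector cannot be normalised")
        return [x / norm for x in vector]

    def add(self, vector, payload):
        self._rows.append((self._normalise(vector), payload))

    def search(self, query, k=5, score_threshold=None):
        """Return up to k (score, payload) pairs, best first."""
        q = self._normalise(query)
        scored = [(sum(a * b for a, b in zip(q, v)), p) for v, p in self._rows]
        if score_threshold is not None:
            scored = [(s, p) for s, p in scored if s >= score_threshold]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore()
store.add([1.0, 0.0, 0.0], "vacation policy")
store.add([0.0, 1.0, 0.0], "laptop policy")
store.add([0.9, 0.1, 0.0], "leave accrual")
results = store.search([1.0, 0.0, 0.0], k=5, score_threshold=0.75)
```

Because every stored vector is normalised on insert, the dot product in `search` equals cosine similarity, and the threshold drops the unrelated "laptop policy" entry.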

The Retrieval Pipeline

1. Query Understanding

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

def rewrite_query(original_query: str) -> str:
    """Rewrite a user query so it is more specific and easier to retrieve against."""
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a query rewriting assistant. "
         "Rewrite the user's question to be more specific "
         "and searchable in a knowledge base."),
        # Pass the query as a template variable so braces in user input are not parsed
        ("user", "{query}"),
    ])
    llm = ChatOpenAI(model="gpt-4o-mini")
    # Pipe the prompt into the model, then run the chain with the query filled in
    return (prompt | llm).invoke({"query": original_query}).content

# Example:
# Original: "What about vacation?"
# Rewritten: "What is the company policy on paid vacation days, 
#             including accrual rate and carry-over limits?"

2. Retrieval Strategies

Simple similarity search:

results = vector_store.similarity_search_with_score(
    query=rewritten_query,
    k=5,
    score_threshold=0.75,
)

Hybrid search (vector + keyword):

# BM25 keyword scores combined with vector similarity
# (BM25Retriever requires the rank_bm25 package)
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

bm25_retriever = BM25Retriever.from_documents(chunks, k=5)
vector_retriever = vector_store.as_retriever(search_kwargs={"k": 5})

# Weighted ensemble: 70% weight on vector results, 30% on keyword results
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.7, 0.3],
)
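One common way to merge ranked lists from different retrievers is weighted Reciprocal Rank Fusion (RRF), which works on ranks and so avoids comparing BM25 scores with cosine scores directly. The sketch below is a standalone illustration; it assumes the conventional constant `c=60`, and whether a given library fuses results exactly this way should be checked against its documentation.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more from a list the higher it is ranked there
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]  # ranked by vector similarity
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # ranked by keyword match
fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.7, 0.3])
```

Documents that appear in both lists (`doc_a`, `doc_c`) rise above those found by only one retriever, which is the recall benefit hybrid search is after.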

Multi-query retrieval:

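# Shared chat model (ChatOpenAI from langchain_openai)
llm = ChatOpenAI(model="gpt-4o-mini")

def generate_sub_queries(query: str, n: int = 3) -> list[str]:
    """Generate n alternative phrasings of the same query."""
    prompt = (f"Generate {n} different versions of this query for search, "
              f"one per line: {query}")
    response = llm.invoke(prompt)
    # Drop blank lines and any leading list markers the model adds
    return [line.strip("-• ").strip() for line in response.content.split('\n') if line.strip()]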

sub_queries = generate_sub_queries("remote work policy")
# Returns:
# - "How many days per week can employees work from home?"
# - "Remote work eligibility and approval process"
# - "Company work-from-home policy and guidelines"

all_results = []
for q in sub_queries:
    all_results.extend(vector_store.similarity_search(q, k=3))
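Because several sub-queries often hit the same chunk, the merged list usually contains duplicates that waste context tokens. A minimal sketch of deduplication by normalised content hash follows; it treats chunks as plain strings for simplicity, and `dedupe_by_content` is a hypothetical helper.

```python
import hashlib


def dedupe_by_content(results: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for text in results:
        normalised = " ".join(text.split()).lower()
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


merged = [
    "Remote work is allowed three days per week.",
    "Approval from a manager is required.",
    "Remote  work is allowed three days per week.",  # same chunk, extra space
    "approval from a manager is required.",          # same chunk, different case
]
unique_results = dedupe_by_content(merged)
```

This only catches exact matches after normalisation; near-duplicates with different wording would need a similarity comparison instead of a hash.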

3. Reranking

Improve retrieval quality by re-ranking results with a cross-encoder:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

pairs = [(query, result.page_content) for result in initial_results]
scores = reranker.predict(pairs)

reranked_results = [
    result for _, result in
    sorted(zip(scores, initial_results), key=lambda x: x[0], reverse=True)
]

# Take top 3 after reranking
final_context = reranked_results[:3]
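The retrieve-then-rerank pattern is independent of the scoring model. The sketch below shows the same two-stage shape with a pluggable scoring function; `overlap_score` is a deliberately crude Jaccard word-overlap stand-in for the cross-encoder, used only so the example runs without downloading a model.

```python
import re
from typing import Callable


def overlap_score(query: str, passage: str) -> float:
    """Jaccard overlap of word sets (toy stand-in for a cross-encoder score)."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    p = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(q & p) / len(q | p) if q | p else 0.0


def rerank(query: str, passages: list[str],
           score_fn: Callable[[str, str], float] = overlap_score,
           top_n: int = 3) -> list[str]:
    """Score every (query, passage) pair and keep the top_n passages."""
    scored = [(score_fn(query, passage), passage) for passage in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]


candidates = [
    "Laptops are replaced every three years.",
    "Unused vacation days carry over up to a limit of five.",
    "Vacation requests need manager approval.",
]
top_passages = rerank("vacation carry over limit", candidates, top_n=2)
```

Passing a function that wraps a real cross-encoder as `score_fn` would keep the rest of the code unchanged.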

The Generation Pipeline

System Prompt Engineering

rag_prompt = """You are a helpful assistant that answers questions based 
on the provided context. Follow these rules:

1. Answer ONLY using information from the provided context.
2. If the context does not contain the answer, say 
   "I don't have enough information to answer this question."
3. Do NOT make up or infer information not in the context.
4. Include citations in [Source: filename.pdf] format.
5. Be concise but thorough.
6. Format lists and tables using markdown when appropriate.

Context:
{context}

Question: {question}
Answer (with citations):"""
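For the citation rule to work, each chunk placed in the context has to carry its source label. The following sketch shows one way to assemble the context under a size budget; the dictionary chunk shape, the `max_chars` budget, and the shortened template are assumptions made for the example.

```python
RAG_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer (with citations):"
)


def format_context(chunks: list[dict], max_chars: int = 4000) -> str:
    """Prefix each chunk with its source tag; stop before exceeding the budget."""
    parts = []
    used = 0
    for chunk in chunks:
        block = f"[Source: {chunk['source']}]\n{chunk['text']}"
        if used + len(block) > max_chars:
            break  # chunks are assumed to be ordered best-first
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


def build_rag_prompt(question: str, chunks: list[dict], max_chars: int = 4000) -> str:
    """Fill the template with the formatted context and the user question."""
    return RAG_TEMPLATE.format(context=format_context(chunks, max_chars),
                               question=question)


demo_chunks = [
    {"source": "a.pdf", "text": "Alpha"},
    {"source": "b.pdf", "text": "Beta"},
]
full_prompt = build_rag_prompt("What is alpha?", demo_chunks)
trimmed = format_context(demo_chunks, max_chars=25)
```

Counting characters is a rough proxy; a production system would more likely budget in tokens using the target model's tokenizer.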

Generation with Citations

from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    chain_type="stuff",  # "stuff" = put all context in one prompt
    retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
    chain_type_kwargs={
        "prompt": PromptTemplate.from_template(rag_prompt),
        "verbose": True,
    },
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "What is the vacation policy?"})

print(result['result'])
# "Employees accrue 15 days of paid vacation per year (accrued monthly
# at 1.25 days/month). Unused days may carry over up to 5 days to the
# next calendar year. [Source: HR_Handbook_2026.pdf]"

print(result['source_documents'])
# [Document(page_content="...", metadata={"source": "HR_Handbook_2026.pdf", ...})]
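A model can still cite a file that was never retrieved, so it is worth checking the citations in the answer against the returned sources. The sketch below assumes the `[Source: filename]` format from the prompt and uses plain dictionaries in place of `Document` objects; both helper names are hypothetical.

```python
import re

# Matches "[Source: something]" and captures the name without surrounding spaces
CITATION_RE = re.compile(r"\[Source:\s*([^\]]+?)\s*\]")


def extract_citations(answer: str) -> list[str]:
    """Return every cited source name, in order of appearance."""
    return CITATION_RE.findall(answer)


def unsupported_citations(answer: str, source_documents: list[dict]) -> list[str]:
    """Return cited names that do not match any retrieved document."""
    known = {doc["metadata"]["source"] for doc in source_documents}
    return [name for name in extract_citations(answer) if name not in known]


answer = (
    "Employees accrue 15 days of paid vacation per year. "
    "[Source: HR_Handbook_2026.pdf] "
    "Bonuses are paid in March. [Source: Finance.pdf]"
)
sources = [{"metadata": {"source": "HR_Handbook_2026.pdf"}}]
cited = extract_citations(answer)
missing = unsupported_citations(answer, sources)
```

A non-empty `missing` list is a cheap signal that part of the answer may not be grounded in the retrieved context.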

Advanced RAG Patterns

1. Multi-Hop RAG

For questions requiring information from multiple documents:

Q: "Which products are affected by the new regulation and who needs training?"

Step 1: Retrieve regulation document → identifies affected product categories
Step 2: Use product list to query training documents
Step 3: Compose answer from both sources
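The three steps can be sketched as two chained retrievals. Everything here is a toy under stated assumptions: the four-document `corpus` is invented, `retrieve` ranks by word overlap rather than embeddings, and `extract_entities` uses string splitting where a real system would ask an LLM to extract the product list.

```python
import re

corpus = {
    "regulation.txt": "New regulation affects: batteries, chargers",
    "training_batteries.txt": "Batteries training required for warehouse staff",
    "training_chargers.txt": "Chargers training required for retail staff",
    "cafeteria.txt": "Lunch menu for the week",
}


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the names of the k documents sharing the most words with the query."""
    q = _tokens(query)
    ranked = sorted(corpus, key=lambda name: len(q & _tokens(corpus[name])), reverse=True)
    return ranked[:k]


def extract_entities(text: str) -> list[str]:
    """Stand-in for LLM extraction: read the list after 'affects:'."""
    if "affects:" not in text:
        return []
    return [item.strip() for item in text.split("affects:", 1)[1].split(",") if item.strip()]


def multi_hop(question: str) -> list[str]:
    hop1 = retrieve(question, k=1)                        # Step 1: find the regulation
    entities = extract_entities(corpus[hop1[0]])          # affected product categories
    hop2 = [retrieve(f"{entity} training", k=1)[0]        # Step 2: one query per product
            for entity in entities]
    return hop1 + hop2                                    # Step 3: sources to compose from


hop_sources = multi_hop(
    "Which products are affected by the new regulation and who needs training?"
)
```

The second hop's queries do not exist until the first hop's document has been read, which is what distinguishes multi-hop retrieval from simply raising `k`.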

2. Agentic RAG

Use an LLM agent to plan retrieval, decide which tools to use, and iterate:

from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain.tools import Tool

retrieval_tool = Tool(
    name="search_knowledge_base",
    func=lambda q: vector_store.similarity_search(q, k=3),
    description="Search company knowledge base for policies and procedures"
)

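calculator_tool = Tool(
    name="calculator",
    # WARNING: eval() executes arbitrary Python. The model controls this input,
    # so swap in a restricted arithmetic parser before any production use.
    func=lambda expr: eval(expr),
    description="Perform mathematical calculations"
)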

agent = create_openai_functions_agent(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    tools=[retrieval_tool, calculator_tool],
    prompt=agent_prompt,
)
agent_executor = AgentExecutor(agent=agent, tools=[retrieval_tool, calculator_tool])
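Since tool inputs are written by the model, a calculator tool should accept arithmetic only. One possible approach, sketched here with the standard `ast` module, is to parse the expression and evaluate just a whitelist of numeric operations; `safe_eval` is a hypothetical helper, and exponentiation is left out on purpose to avoid very expensive inputs.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attributes, etc. all end up here
        raise ValueError(f"Unsupported expression: {expr!r}")

    return _eval(ast.parse(expr, mode="eval"))


monthly_accrual = safe_eval("15 / 12")
try:
    safe_eval("__import__('os').system('ls')")
    blocked = False
except ValueError:
    blocked = True
```

Malformed input raises `SyntaxError` from `ast.parse`, so a tool wrapper would typically catch both exception types and return an error message to the agent.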

3. Self-Reflective RAG (Self-RAG)

The model checks its own retrieval and generation quality:

def self_rag(query: str) -> str:
    # 1. Retrieve
    docs = retrieve(query, k=5)

    # 2. Check: is retrieved info relevant?
    relevance_score = check_relevance(query, docs)

    if relevance_score < 0.7:
        return "I cannot find relevant information to answer this question."

    # 3. Generate answer
    answer = generate(query, docs)

    # 4. Check: does answer match retrieved info?
    faithfulness_score = check_faithfulness(answer, docs)

    if faithfulness_score < 0.8:
        return generate_with_constraints(query, docs)  # Regenerate

    # 5. Check: is answer useful?
    usefulness_score = check_usefulness(query, answer)

    return answer
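The helper functions in the outline above are left undefined. As one illustration of what a faithfulness check could look like, the sketch below scores the share of answer sentences whose words mostly appear in the retrieved text. This is a crude lexical proxy chosen so the example runs offline; real systems more often use an LLM judge or an entailment model, and the `min_overlap` cutoff is an arbitrary assumption.

```python
import re


def _tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def check_faithfulness(answer: str, docs: list[str], min_overlap: float = 0.6) -> float:
    """Fraction of answer sentences whose words are mostly found in the docs."""
    doc_vocab = set(_tokens(" ".join(docs)))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = _tokens(sentence)
        if words and sum(w in doc_vocab for w in words) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)


docs = ["Employees accrue 15 days of paid vacation per year."]
score = check_faithfulness(
    "Employees accrue 15 days of paid vacation. Employees also receive a free car.",
    docs,
)
```

Here the first sentence is fully covered by the document and the second is not, so the score falls below the 0.8 threshold used in the outline and would trigger regeneration.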

4. Corrective RAG (CRAG)

Retrieves, evaluates, and corrects before generating:

def corrective_rag(query):
    # Initial retrieval
    docs = retrieve(query)

    # Evaluate retrieval quality
    quality = evaluate_retrieval(query, docs)

    if quality == "excellent":
        # Direct generation
        return generate(query, docs)

    elif quality == "partial":
        # Re-rank and try again
        docs = rerank_and_retry(query, docs)
        return generate(query, docs)

    else:  # poor
        # Try web search or generate without context
        if web_search_available:
            docs = web_search(query)
            return generate(query, docs)
        else:
            return generate_without_context(query)
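The branching above depends on how `evaluate_retrieval` assigns its three labels, which the outline does not specify. A minimal sketch, assuming the evaluator sees the similarity scores of the retrieved chunks and that the cutoffs 0.8 and 0.5 are placeholders to be tuned:

```python
def evaluate_retrieval(scores: list[float], high: float = 0.8, low: float = 0.5) -> str:
    """Label retrieval quality from the best similarity score (hypothetical cutoffs)."""
    best = max(scores, default=0.0)
    if best >= high:
        return "excellent"
    if best >= low:
        return "partial"
    return "poor"


def route(scores: list[float], web_search_available: bool) -> str:
    """Name the branch the corrective flow would take for these scores."""
    quality = evaluate_retrieval(scores)
    if quality == "excellent":
        return "generate"
    if quality == "partial":
        return "rerank_and_retry"
    return "web_search" if web_search_available else "generate_without_context"


branch = route([0.42, 0.31], web_search_available=True)
```

Keeping the routing decision separate from generation makes it easy to log which branch each query took and to tune the cutoffs against labelled examples.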

Production RAG Considerations

Evaluation Metrics

Metric	What It Measures	Target
Retrieval recall	Did we find all relevant chunks?	> 0.90
Context precision	Are retrieved chunks actually relevant?	> 0.85
Faithfulness	Does the answer stay true to context?	> 0.95
Answer relevance	Does the answer address the query?	> 0.90
Hallucination rate	Percentage of fabricated facts	< 1%
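The two retrieval metrics in the table can be computed directly once a set of relevant chunk IDs has been labelled for each test query. A small sketch, with made-up chunk IDs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k results that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(chunk_id in relevant for chunk_id in top) / len(top)


retrieved_ids = ["c1", "c7", "c3", "c9", "c4"]  # ranked retrieval output
relevant_ids = {"c1", "c3", "c5"}                # hand-labelled ground truth
recall = recall_at_k(retrieved_ids, relevant_ids, k=5)
precision = precision_at_k(retrieved_ids, relevant_ids, k=5)
```

In this example two of the three relevant chunks were found (recall about 0.67) and two of the five returned chunks were relevant (precision 0.4), so both would miss the targets in the table.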

Monitoring

from statistics import mean

class RAGMonitor:
    """Collects per-query latency and token metrics.

    count_tokens, log_to_db, and alert_pagerduty are application-specific
    helpers assumed to be defined elsewhere.
    """

    def __init__(self):
        self.metrics = {
            'retrieval_latency': [],
            'generation_latency': [],
            'context_token_count': [],
            'completion_token_count': [],
        }

    def log_query(self, query: str, retrieval_time: float,
                  gen_time: float, context: str, response: str):
        self.metrics['retrieval_latency'].append(retrieval_time)
        self.metrics['generation_latency'].append(gen_time)
        # Track context size: too large is expensive, too small hurts quality
        self.metrics['context_token_count'].append(count_tokens(context))
        self.metrics['completion_token_count'].append(count_tokens(response))
        # Log for later analysis
        log_to_db(query, response, context)

    def alert_if_degraded(self):
        recent = self.metrics['generation_latency'][-100:]
        if not recent:
            return  # nothing logged yet, so there is nothing to average
        if mean(recent) > 5.0:  # seconds
            alert_pagerduty("RAG generation latency degraded")
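An average can hide slow outliers, so latency alerts are often based on a high percentile as well. The sketch below uses the nearest-rank method on invented sample latencies; `percentile` is a hypothetical helper, not part of the monitor above.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values to summarise")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Generation latencies in seconds for ten queries; one query was very slow
latencies = [0.8, 1.1, 0.9, 1.0, 6.5, 1.2, 0.7, 1.3, 0.95, 1.05]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
average = mean(latencies)
```

The mean stays well under a 5-second threshold while the 95th percentile is 6.5 seconds, so an alert keyed only to the mean would miss the slow query.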

Latency Budget

Total RAG latency: roughly 1.3-4 seconds (sum of the stages below)
├── Query embedding:      100-300ms
├── Vector search:         50-200ms
├── Reranking:            100-300ms
├── Context formatting:    10-50ms
└── LLM generation:      1000-3000ms (depends on output length)
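Adding up the per-stage ranges is a quick way to check the budget and to see which stage dominates. The figures below are copied from the breakdown above; the dictionary layout is just one convenient way to hold them.

```python
# (min_ms, max_ms) per stage, taken from the latency breakdown
budget_ms = {
    "query_embedding": (100, 300),
    "vector_search": (50, 200),
    "reranking": (100, 300),
    "context_formatting": (10, 50),
    "llm_generation": (1000, 3000),
}

total_low = sum(low for low, _ in budget_ms.values())
total_high = sum(high for _, high in budget_ms.values())

# Stage with the largest worst-case latency
dominant = max(budget_ms, key=lambda stage: budget_ms[stage][1])
generation_share = budget_ms["llm_generation"][1] / total_high
```

The stages sum to about 1.26 to 3.85 seconds, and LLM generation accounts for more than three quarters of the worst case, so output length and model choice are the first levers to pull when latency matters.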

Chunking and Indexing Best Practices

Practice	Why	Implementation
Metadata storage	Filtering, provenance, citations	Store filename, page number, section title, date
Parent-child chunking	Retrieve small chunks, return big context	Store small chunks for search, link to parent for generation
Document summary	Global context for broad questions	Store one summary per document, use for initial filtering
Content deduplication	Avoid redundant context	Hash chunks, remove near-duplicates
Versioning	Track document updates	Add version field to metadata, support rollback
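Parent-child chunking from the table can be sketched in a few lines: search over small child chunks, then hand the larger parent section to the LLM. The example assumes sentence-level children and word-overlap scoring purely for illustration; the parent IDs and both function names are hypothetical.

```python
import re


def build_child_index(parents: dict[str, str]) -> list[dict]:
    """Split each parent section into sentence-level children that remember their parent."""
    children = []
    for parent_id, text in parents.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if sentence.strip():
                children.append({"parent_id": parent_id, "text": sentence.strip()})
    return children


def retrieve_parents(children: list[dict], parents: dict[str, str],
                     query: str, k: int = 1) -> list[str]:
    """Rank children by word overlap, then return the distinct parents of the best ones."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))

    def overlap(child: dict) -> int:
        return len(q & set(re.findall(r"[a-z0-9]+", child["text"].lower())))

    parent_ids = []
    for child in sorted(children, key=overlap, reverse=True):
        if child["parent_id"] not in parent_ids:
            parent_ids.append(child["parent_id"])
        if len(parent_ids) == k:
            break
    return [parents[parent_id] for parent_id in parent_ids]


parents = {
    "hr#vacation": ("Vacation policy. Employees accrue 15 days per year. "
                    "Unused days carry over up to 5 days."),
    "it#laptops": "Laptop policy. Devices are replaced every three years.",
}
children = build_child_index(parents)
context_sections = retrieve_parents(children, parents, "carry over unused days", k=1)
```

The query matches a single sentence, but the model receives the whole vacation section, including the accrual rule it would need to answer a follow-up.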

Conclusion

RAG is the most practical way to deploy LLMs with accurate, up-to-date, and attributable knowledge. Key takeaways:

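RAG beats fine-tuning for knowledge-centric use cases: it is cheaper, more flexible, and easier to update. Fine-tuning remains the better fit for changing model behavior, tone, or output format.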
Chunking quality determines RAG quality — invest time in finding the right strategy.
Hybrid search beats pure vector — combine with keyword for better recall.
Reranking is worth the extra step — significantly improves result quality.
Monitor for degradation — retrieval quality drifts as documents are added.
Start simple, add complexity as needed — basic RAG works well; advanced patterns solve specific problems.

RAG is evolving rapidly — with agentic, self-reflective, and corrective variants pushing the boundaries of what retrieval-augmented systems can achieve.
