RAG (Retrieval-Augmented Generation)
Combining retrieval from knowledge bases with language model generation for accurate, context-aware responses.
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that enhances language model outputs by retrieving relevant information from a knowledge base before generation. It combines the power of semantic search with generative AI to produce more accurate and contextual responses.
Why RAG?
Problems RAG Solves
- Hallucination: LLMs can generate false information
- Outdated Knowledge: Training data has a cutoff date
- Domain-Specific Knowledge: General models lack specialized information
- Verifiability: Hard to trace where information comes from
Benefits
- Grounded in actual data
- Up-to-date information
- Source attribution
- Domain expertise without fine-tuning
- Cost-effective compared to training
How RAG Works
1. Indexing Phase (Offline)
Documents → Chunks → Embeddings → Vector Database
2. Query Phase (Runtime)
User Query → Embedding → Vector Search → Relevant Chunks → LLM → Response
Architecture
class RAGSystem:
    def __init__(self, embedding_model, vector_db, llm):
        self.embedding_model = embedding_model
        self.vector_db = vector_db
        self.llm = llm

    def index_documents(self, documents):
        """Index documents into vector database"""
        for doc in documents:
            chunks = self.chunk_document(doc)
            for chunk in chunks:
                embedding = self.embedding_model.encode(chunk.text)
                self.vector_db.store(chunk.id, embedding, chunk.metadata)

    def retrieve(self, query, top_k=5):
        """Retrieve relevant chunks for query"""
        query_embedding = self.embedding_model.encode(query)
        results = self.vector_db.search(query_embedding, k=top_k)
        return results

    def generate(self, query, context):
        """Generate response using retrieved context"""
        prompt = f"""Answer the question based on the context below.
Context:
{context}
Question: {query}
Answer:"""
        return self.llm.generate(prompt)

    def query(self, user_query):
        """Full RAG pipeline"""
        # Retrieve relevant chunks
        chunks = self.retrieve(user_query, top_k=5)
        # Format context
        context = "\n\n".join([c.text for c in chunks])
        # Generate answer
        answer = self.generate(user_query, context)
        return {
            "answer": answer,
            "sources": chunks
        }
Key Components
1. Document Processing
- Chunking: Split documents into manageable pieces
- Overlap: Maintain context between chunks
- Metadata: Store source, date, author, etc.
def chunk_document(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks"""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoid appending a redundant tail
        start = end - overlap
    return chunks
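A quick usage check for this helper (the sample text is just repeated filler for illustration):

text = "RAG systems split long documents before indexing. " * 40  # about 2,000 characters
chunks = chunk_document(text, chunk_size=500, overlap=50)
print(len(chunks), len(chunks[0]))  # -> 5 500: five chunks, the first 500 characters long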
2. Embedding (See: embedding)
Convert text to vectors for semantic search
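For intuition, a small sketch using the same sentence-transformers model as the implementation example further down this page; semantically related sentences end up with nearby vectors (the example sentences are made up):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
vectors = model.encode([
    "How do I reset my password?",       # query
    "Steps to recover account access",   # related
    "Today's weather forecast",          # unrelated
])
print(util.cos_sim(vectors[0], vectors[1]))  # relatively high similarity
print(util.cos_sim(vectors[0], vectors[2]))  # relatively low similarity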
3. Vector Database
- Options: Pinecone, Weaviate, Qdrant, ChromaDB
- Operations: Store, search, update, delete
- Indexing: HNSW, IVF for fast similarity search
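As a concrete illustration of these operations, here is a minimal ChromaDB sketch; the 3-dimensional vectors, ids, and metadata are toy values, and a real collection would use the embedding model's dimensionality:

import chromadb

client = chromadb.Client()
collection = client.create_collection("kb")

# Store: ids, vectors, raw text, and metadata together
collection.add(
    ids=["doc_1"],
    embeddings=[[0.1, 0.2, 0.3]],
    documents=["RAG combines retrieval with generation."],
    metadatas=[{"source": "notes.md"}],
)

# Search: nearest neighbours to a query vector
hits = collection.query(query_embeddings=[[0.1, 0.2, 0.25]], n_results=1)

# Update and delete by id
collection.update(ids=["doc_1"], metadatas=[{"source": "notes-v2.md"}])
collection.delete(ids=["doc_1"])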
4. Retrieval
- Semantic Search: find content by meaning rather than exact keywords
- Hybrid Search: combine keyword and semantic scores (see the sketch after this list)
- Re-ranking: re-score retrieved candidates to improve relevance
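A minimal sketch of the hybrid variant using reciprocal rank fusion (RRF); `keyword_search` and `vector_search` are hypothetical callables standing in for a keyword index (e.g. BM25) and the vector store, each returning ranked document ids:

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of document ids using RRF scores."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query, keyword_search, vector_search, top_k=5):
    """Combine keyword and semantic rankings into a single list."""
    keyword_ids = keyword_search(query, top_k=20)   # e.g. BM25 over an inverted index
    semantic_ids = vector_search(query, top_k=20)   # e.g. nearest neighbours in the vector DB
    fused = reciprocal_rank_fusion([keyword_ids, semantic_ids])
    return fused[:top_k]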
5. Generation (See: generate-function)
Use LLM to synthesize answer from retrieved context
Implementation Example
Complete RAG Pipeline
from sentence_transformers import SentenceTransformer
import chromadb
from openai import OpenAI

class SimpleRAG:
    def __init__(self):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.db = chromadb.Client()
        self.collection = self.db.create_collection("docs")
        self.llm = OpenAI()

    def add_documents(self, documents):
        """Add documents to knowledge base"""
        for i, doc in enumerate(documents):
            embedding = self.embedder.encode(doc).tolist()
            self.collection.add(
                ids=[f"doc_{i}"],
                embeddings=[embedding],
                documents=[doc]
            )

    def query(self, question, top_k=3):
        """Answer question using RAG"""
        # 1. Embed query
        query_embedding = self.embedder.encode(question).tolist()
        # 2. Retrieve relevant docs
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )
        # 3. Build context
        context = "\n\n".join(results['documents'][0])
        # 4. Generate answer
        response = self.llm.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "Answer based on the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ]
        )
        return response.choices[0].message.content

# Usage
rag = SimpleRAG()
rag.add_documents([
    "Python is a high-level programming language.",
    "Machine learning is a subset of AI.",
    "RAG combines retrieval with generation."
])
answer = rag.query("What is RAG?")
print(answer)
Advanced Techniques
Hypothetical Document Embeddings (HyDE)
Generate a hypothetical answer, embed it, and retrieve with that embedding instead of the raw query:
def hyde_retrieval(query, llm, embedding_model, vector_db, top_k=5):
    """HyDE: search with the embedding of a hypothetical answer."""
    # Generate hypothetical answer
    hypothetical = llm.generate(f"Answer this question: {query}")
    # Embed and search with the hypothetical answer
    embedding = embedding_model.encode(hypothetical)
    results = vector_db.search(embedding, k=top_k)
    return results
Query Expansion
Expand the query with synonyms or related phrasings, retrieve for each variant, and merge the results:
def expand_query(query, llm, retrieve):
    """Retrieve with the original query plus LLM-generated variants."""
    expansion = llm.generate(f"Generate 3 similar questions to: {query}")
    # parse_questions and deduplicate are small helpers: split the LLM output
    # into individual questions, and drop chunks retrieved more than once
    all_queries = [query] + parse_questions(expansion)
    all_results = []
    for q in all_queries:
        all_results.extend(retrieve(q, top_k=2))
    return deduplicate(all_results)
Re-ranking
Use a cross-encoder to re-score query-document pairs for better relevance:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, results, top_k=5):
    pairs = [[query, r.text] for r in results]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)
    return [r for r, s in ranked[:top_k]]
Evaluation Metrics
Retrieval Quality
- Recall@K: fraction of the relevant documents that appear in the top K results
- MRR (Mean Reciprocal Rank): average of 1/rank of the first relevant document across queries
- NDCG (Normalized Discounted Cumulative Gain): ranking quality that rewards placing relevant documents earlier
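Simple reference implementations of the first two, assuming each query comes with a labelled set of relevant document ids:

def recall_at_k(relevant_ids, retrieved_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_reciprocal_rank(relevant_per_query, retrieved_per_query):
    """Average of 1/rank of the first relevant document over all queries."""
    total = 0.0
    for relevant, retrieved in zip(relevant_per_query, retrieved_per_query):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(retrieved_per_query)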
Generation Quality
- Faithfulness: Answer grounded in context
- Relevance: Answer addresses question
- RAGAS: RAG Assessment metrics
Common Patterns
Multi-hop RAG
Answer questions requiring multiple retrieval steps
Conversational RAG
Maintain conversation history for context
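One common way to implement this is to rewrite the latest user turn into a standalone query before retrieval; the sketch below reuses the abstract llm and retrieve interfaces from the Architecture section and assumes history is a list of dicts with "role" and "content" keys:

def conversational_retrieve(history, follow_up, llm, retrieve, top_k=5):
    """Rewrite a follow-up question into a standalone query, then retrieve."""
    transcript = "\n".join(f"{turn['role']}: {turn['content']}" for turn in history)
    standalone = llm.generate(
        "Rewrite the last user message as a standalone question.\n\n"
        f"Conversation:\n{transcript}\n\nLast message: {follow_up}\n\nStandalone question:"
    )
    return retrieve(standalone, top_k=top_k)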
Agentic RAG
LLM decides when and what to retrieve
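A minimal sketch of the idea, again assuming the abstract llm and retrieve interfaces from above; production systems usually express this with tool/function calling rather than a text protocol:

def agentic_answer(question, llm, retrieve, max_steps=3):
    """Let the model choose between searching and answering at each step."""
    notes = []
    for _ in range(max_steps):
        decision = llm.generate(
            "Reply with either 'SEARCH: <query>' to look something up "
            "or 'ANSWER: <final answer>'.\n"
            f"Question: {question}\nNotes so far:\n" + "\n".join(notes)
        )
        if decision.startswith("SEARCH:"):
            search_query = decision[len("SEARCH:"):].strip()
            notes.extend(chunk.text for chunk in retrieve(search_query, top_k=3))
        else:
            return decision.removeprefix("ANSWER:").strip()
    # Step budget exhausted: answer with whatever has been gathered
    return llm.generate(f"Question: {question}\nNotes:\n" + "\n".join(notes) + "\nAnswer:")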
Challenges
- Chunking Strategy: Optimal size and overlap
- Embedding Quality: Better embeddings = better retrieval
- Context Window: Limited by LLM max tokens
- Latency: Retrieval + generation time
- Cost: Vector DB + LLM API calls
Key Takeaways
- RAG grounds LLM outputs in real data
- Combines embedding-based retrieval with generation
- More cost-effective than fine-tuning
- Quality depends on both retrieval and generation working well
- Essential pattern for production AI applications
- Enables up-to-date, domain-specific AI systems
Dependencies
This concept builds on:
- Embedding: For semantic search
- Generate Function: For answer synthesis