RAG (Retrieval-Augmented Generation)

Combining retrieval from knowledge bases with language model generation for accurate, context-aware responses.

architecture intermediate Jan 28, 2025

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that enhances language model outputs by retrieving relevant information from a knowledge base before generation. It combines the power of semantic search with generative AI to produce more accurate and contextual responses.

Why RAG?

Problems RAG Solves

  • Hallucination: LLMs can generate false information
  • Outdated Knowledge: Training data has a cutoff date
  • Domain-Specific Knowledge: General models lack specialized information
  • Verifiability: Hard to trace where information comes from

Benefits

  • Grounded in actual data
  • Up-to-date information
  • Source attribution
  • Domain expertise without fine-tuning
  • Cost-effective compared to training

How RAG Works

1. Indexing Phase (Offline)

Documents → Chunks → Embeddings → Vector Database

2. Query Phase (Runtime)

User Query → Embedding → Vector Search → Relevant Chunks → LLM → Response

Architecture

class RAGSystem:
    def __init__(self, embedding_model, vector_db, llm):
        self.embedding_model = embedding_model
        self.vector_db = vector_db
        self.llm = llm

    def index_documents(self, documents):
        """Index documents into vector database"""
        for doc in documents:
            chunks = self.chunk_document(doc)
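            # chunk_document (see "Document Processing" below) is assumed to return
            # chunk objects exposing .id, .text, and .metadata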
            for chunk in chunks:
                embedding = self.embedding_model.encode(chunk.text)
                self.vector_db.store(chunk.id, embedding, chunk.metadata)

    def retrieve(self, query, top_k=5):
        """Retrieve relevant chunks for query"""
        query_embedding = self.embedding_model.encode(query)
        results = self.vector_db.search(query_embedding, k=top_k)
        return results

    def generate(self, query, context):
        """Generate response using retrieved context"""
        prompt = f"""Answer the question based on the context below.

Context:
{context}

Question: {query}

Answer:"""
        return self.llm.generate(prompt)

    def query(self, user_query):
        """Full RAG pipeline"""
        # Retrieve relevant chunks
        chunks = self.retrieve(user_query, top_k=5)

        # Format context
        context = "\n\n".join([c.text for c in chunks])

        # Generate answer
        answer = self.generate(user_query, context)

        return {
            "answer": answer,
            "sources": chunks
        }

Key Components

1. Document Processing

  • Chunking: Split documents into manageable pieces
  • Overlap: Maintain context between chunks
  • Metadata: Store source, date, author, etc.

A simple character-based chunker with overlap:

def chunk_document(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks"""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        if end >= len(text):
            break  # last chunk emitted; avoids a duplicate tail chunk
        start = end - overlap  # step back by `overlap` to keep context across chunk boundaries
    return chunks

2. Embedding (See: embedding)

Convert text to vectors for semantic search
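
A minimal sketch (assuming the sentence-transformers package and the same all-MiniLM-L6-v2 model used in the implementation below); semantically related texts get a higher cosine similarity than unrelated ones:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode a query and two candidate passages into dense vectors
vectors = model.encode([
    "How do I reset my password?",
    "Steps to recover account access and choose a new password.",
    "Quarterly revenue grew by 12%."
])

# The first passage should score noticeably higher than the second
print(util.cos_sim(vectors[0], vectors[1]))
print(util.cos_sim(vectors[0], vectors[2]))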

3. Vector Database

  • Options: Pinecone, Weaviate, Qdrant, ChromaDB
  • Operations: Store, search, update, delete
  • Indexing: HNSW, IVF for fast similarity search
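
Managed vector databases handle indexing internally; if you roll your own, libraries such as FAISS expose HNSW and IVF directly. A minimal sketch (the vectors here are random placeholders, and 384 dimensions is assumed to match all-MiniLM-L6-v2):

import numpy as np
import faiss

dim = 384
vectors = np.random.rand(10_000, dim).astype('float32')  # stand-in for document embeddings

# HNSW: graph-based index, strong recall/latency trade-off, no training step
hnsw_index = faiss.IndexHNSWFlat(dim, 32)  # 32 = neighbors per graph node
hnsw_index.add(vectors)

# IVF: clusters the vectors first, then searches only the nearest clusters
quantizer = faiss.IndexFlatL2(dim)
ivf_index = faiss.IndexIVFFlat(quantizer, dim, 100)  # 100 = number of clusters
ivf_index.train(vectors)
ivf_index.add(vectors)

query = np.random.rand(1, dim).astype('float32')
distances, ids = hnsw_index.search(query, 5)  # top-5 approximate nearest neighbors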

4. Retrieval

  • Semantic Search: Find similar content by meaning
  • Hybrid Search: Combine keyword (e.g., BM25) and semantic scores (see the sketch after this list)
  • Re-ranking: Improve relevance of results
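
A minimal sketch of hybrid search using reciprocal rank fusion (RRF) to merge BM25 and vector rankings (assumes the rank_bm25 package; `vector_search` is a hypothetical function returning document indices, best first):

from rank_bm25 import BM25Okapi

corpus = [
    "Python is a high-level programming language.",
    "RAG combines retrieval with generation.",
    "Machine learning is a subset of AI.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def hybrid_search(query, vector_search, top_k=2, k=60):
    """Merge keyword and semantic rankings with reciprocal rank fusion."""
    bm25_scores = bm25.get_scores(query.lower().split())
    keyword_ranked = sorted(range(len(corpus)), key=lambda i: -bm25_scores[i])
    semantic_ranked = vector_search(query)  # hypothetical: indices ordered by similarity

    fused = {}
    for ranking in (keyword_ranked, semantic_ranked):
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)

    best = sorted(fused, key=fused.get, reverse=True)[:top_k]
    return [corpus[i] for i in best]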

5. Generation (See: generate-function)

Use LLM to synthesize answer from retrieved context

Implementation Example

Complete RAG Pipeline

from sentence_transformers import SentenceTransformer
import chromadb
from openai import OpenAI

class SimpleRAG:
    def __init__(self):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.db = chromadb.Client()
        self.collection = self.db.get_or_create_collection("docs")  # idempotent if re-run
        self.llm = OpenAI()

    def add_documents(self, documents):
        """Add documents to knowledge base"""
        for i, doc in enumerate(documents):
            embedding = self.embedder.encode(doc).tolist()
            self.collection.add(
                ids=[f"doc_{i}"],
                embeddings=[embedding],
                documents=[doc]
            )

    def query(self, question, top_k=3):
        """Answer question using RAG"""
        # 1. Embed query
        query_embedding = self.embedder.encode(question).tolist()

        # 2. Retrieve relevant docs
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )

        # 3. Build context
        context = "\n\n".join(results['documents'][0])

        # 4. Generate answer
        response = self.llm.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "Answer based on the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ]
        )

        return response.choices[0].message.content

# Usage
rag = SimpleRAG()
rag.add_documents([
    "Python is a high-level programming language.",
    "Machine learning is a subset of AI.",
    "RAG combines retrieval with generation."
])

answer = rag.query("What is RAG?")
print(answer)

Advanced Techniques

Hypothetical Document Embeddings (HyDE)

Generate hypothetical answer, embed it, use for retrieval:

def hyde_retrieval(query):
    # Generate hypothetical answer
    hypothetical = llm.generate(f"Answer this question: {query}")

    # Embed and search with hypothetical answer
    embedding = embed(hypothetical)
    results = vector_db.search(embedding)
    return results

Query Expansion

Expand query with synonyms or related terms:

def expand_query(query):
    expansion = llm.generate(f"Generate 3 similar questions to: {query}")
    all_queries = [query] + parse_questions(expansion)

    all_results = []
    for q in all_queries:
        all_results.extend(retrieve(q, top_k=2))

    return deduplicate(all_results)

Re-ranking

Use cross-encoder for better relevance:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, results, top_k=5):
    pairs = [[query, r.text] for r in results]
    scores = reranker.predict(pairs)

    ranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)
    return [r for r, s in ranked[:top_k]]

Evaluation Metrics

Retrieval Quality

  • Recall@K: Fraction of the relevant documents that appear in the top K results (see the sketch after this list)
  • MRR (Mean Reciprocal Rank): Average of 1/rank of the first relevant document across queries
  • NDCG (Normalized Discounted Cumulative Gain): Rewards placing highly relevant documents near the top of the ranking
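
Recall@K and MRR are easy to compute once you have labeled relevant documents per query. A minimal sketch (the document IDs are hypothetical):

def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant documents that appear in the top-k retrieved IDs."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Average of 1/rank of the first relevant document across queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(all_retrieved)

# Example: the only relevant doc "d2" appears at rank 2
print(recall_at_k(["d7", "d2", "d9"], ["d2", "d4"], k=3))    # 0.5
print(mean_reciprocal_rank([["d7", "d2", "d9"]], [["d2"]]))  # 0.5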

Generation Quality

  • Faithfulness: Is every claim in the answer supported by the retrieved context?
  • Relevance: Does the answer actually address the question asked?
  • RAGAS: An open-source framework that automates these RAG assessment metrics
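
In practice these are usually scored with an LLM judge or a framework like RAGAS. A minimal, illustrative faithfulness check (the prompt and YES/NO scoring are assumptions, not the RAGAS implementation; `llm` follows the interface used in the architecture sketch above):

def judge_faithfulness(llm, question, context, answer):
    """Ask an LLM whether every claim in the answer is supported by the context."""
    prompt = f"""Context:
{context}

Question: {question}
Answer: {answer}

Does the answer contain only claims supported by the context? Reply YES or NO."""
    verdict = llm.generate(prompt).strip().upper()
    return verdict.startswith("YES")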

Common Patterns

Multi-hop RAG

Answer questions requiring multiple retrieval steps
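
A minimal sketch of the idea, reusing `retrieve`, `generate`, and `llm` from the architecture sketch above (the prompts are illustrative):

def multi_hop_query(question, hops=2):
    """Iteratively retrieve, letting the LLM propose the next lookup."""
    context = []
    query = question
    for _ in range(hops):
        context.extend(chunk.text for chunk in retrieve(query, top_k=3))
        # Ask the LLM what is still missing; its reply becomes the next retrieval query
        query = llm.generate(
            f"Question: {question}\nKnown so far:\n" + "\n".join(context) +
            "\nWhat single follow-up question should be looked up next?"
        )
    return generate(question, "\n\n".join(context))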

Conversational RAG

Maintain conversation history for context
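
One common approach is to rewrite the latest user turn into a standalone question before retrieval. A minimal sketch (`history` holds prior (user, assistant) turns, `rag` is the RAGSystem defined above, and the rewrite prompt is illustrative):

def conversational_query(history, user_message):
    """Condense chat history + new message into a standalone retrieval query."""
    transcript = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in history)
    standalone = llm.generate(
        f"Conversation so far:\n{transcript}\n\n"
        f"Rewrite the user's latest message as a standalone question:\n{user_message}"
    )
    result = rag.query(standalone)  # full retrieve-then-generate pipeline
    history.append((user_message, result["answer"]))
    return result["answer"]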

Agentic RAG

LLM decides when and what to retrieve
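
A minimal sketch of the control loop: the model first decides whether a lookup is needed at all (the decision prompt is illustrative):

def agentic_query(question):
    """Let the LLM choose between answering directly and retrieving first."""
    decision = llm.generate(
        "Can you answer the question below from general knowledge alone, "
        "or is a document lookup needed? Reply ANSWER or SEARCH.\n"
        f"Question: {question}"
    ).strip().upper()

    if decision.startswith("SEARCH"):
        chunks = retrieve(question, top_k=5)
        return generate(question, "\n\n".join(chunk.text for chunk in chunks))
    return llm.generate(question)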

Challenges

  • Chunking Strategy: Optimal size and overlap
  • Embedding Quality: Better embeddings = better retrieval
  • Context Window: Limited by LLM max tokens
  • Latency: Retrieval + generation time
  • Cost: Vector DB + LLM API calls

Key Takeaways

  • RAG grounds LLM outputs in real data
  • Combines embedding-based retrieval with generation
  • More cost-effective than fine-tuning
  • Requires both retrieval and generation to work well
  • Essential pattern for production AI applications
  • Enables up-to-date, domain-specific AI systems

Dependencies

This concept builds on: