RAG (Retrieval-Augmented Generation)

Retrieval-Augmented Generation techniques for enhancing LLM responses with external knowledge.

Core Idea

RAG combines retrieval from external knowledge bases with LLM generation to produce accurate, up-to-date responses without retraining the model.

Mathematical Foundation

The core retrieval mechanism uses cosine similarity in embedding space:

$$\text{similarity}(q, d) = \frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}|| \cdot ||\mathbf{d}||} = \cos(\theta)$$

where:

  • $\mathbf{q}$ is the query embedding vector
  • $\mathbf{d}$ is the document chunk embedding vector
  • $\theta$ is the angle between vectors

Key Process:

  1. Embedding: Convert query and documents to dense vectors using an embedding model: $\mathbf{q} = E(q)$, $\mathbf{d}_i = E(d_i)$
  2. Retrieval: Find top-$k$ most similar chunks: $\text{TopK} = \arg\max_{i \in [1,N]} \text{similarity}(\mathbf{q}, \mathbf{d}_i)$
  3. Augmentation: Inject retrieved context into the prompt: $\text{prompt} = f(q, \text{TopK})$
  4. Generation: LLM generates response conditioned on augmented prompt: $r = \text{LLM}(\text{prompt})$

This approach enables:

  • Knowledge injection without model fine-tuning
  • Reduced hallucination by grounding in retrieved facts
  • Dynamic updates by refreshing the document store
  • Source attribution by referencing retrieved chunks
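
A minimal sketch of these four steps, assuming generic embed and llm_generate callables (both hypothetical placeholders for your embedding model and LLM client):

import numpy as np

def cosine_similarity(q, d):
    # similarity(q, d) = (q · d) / (||q|| · ||d||)
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

def rag_answer(query, documents, embed, llm_generate, k=4):
    # 1. Embedding: q = E(q), d_i = E(d_i)
    q_vec = embed(query)
    doc_vecs = [embed(d) for d in documents]

    # 2. Retrieval: top-k chunks by cosine similarity
    scores = [cosine_similarity(q_vec, d_vec) for d_vec in doc_vecs]
    top_k = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:k]

    # 3. Augmentation: inject retrieved context into the prompt
    context = "\n\n".join(documents[i] for i in top_k)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

    # 4. Generation: response conditioned on the augmented prompt
    return llm_generate(prompt)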

RAG Architecture


Basic RAG Pipeline

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# 1. Load and split documents
documents = load_documents()  # placeholder: use any LangChain document loader
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
texts = text_splitter.split_documents(documents)

# 2. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(texts, embeddings)

# 3. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

# 4. Query
response = qa_chain.run("What is the main topic?")

Chunking Strategies

Fixed-Size Chunking

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separator="\n"
)
chunks = splitter.split_text(text)

Recursive Chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(text)

Semantic Chunking

from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
chunks = splitter.split_text(text)

Embedding Models

OpenAI Embeddings

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector = embeddings.embed_query("Hello world")

Sentence Transformers

from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
vector = embeddings.embed_query("Hello world")

Vector Stores

Chroma

from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Search
results = vectorstore.similarity_search("query", k=4)

FAISS

from langchain.vectorstores import FAISS

vectorstore = FAISS.from_documents(texts, embeddings)
vectorstore.save_local("faiss_index")

# Load
vectorstore = FAISS.load_local("faiss_index", embeddings)

Pinecone

import pinecone
from langchain.vectorstores import Pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
vectorstore = Pinecone.from_documents(texts, embeddings, index_name="my-index")

Retrieval Strategies

Similarity Search

# Basic similarity search
results = vectorstore.similarity_search("query", k=4)

# With relevance scores (score semantics depend on the vector store)
results = vectorstore.similarity_search_with_score("query", k=4)

MMR (Maximal Marginal Relevance)

Core Idea: Selects documents that are both relevant to the query and diverse from already-selected documents, preventing redundant information.

Mathematical Formulation: $$\text{MMR} = \arg\max_{d_i \in R \setminus S} \left[ \lambda \cdot \text{sim}(q, d_i) - (1-\lambda) \cdot \max_{d_j \in S} \text{sim}(d_i, d_j) \right]$$

where:

  • $R$ is the candidate set of retrieved documents
  • $S$ is the set of already-selected documents
  • $\lambda \in [0,1]$ controls the trade-off (0 = diversity, 1 = relevance)
  • $\text{sim}(q, d_i)$ is query-document similarity
  • $\text{sim}(d_i, d_j)$ is inter-document similarity

Key Insight: The second term penalizes documents similar to already-selected ones, ensuring coverage of different aspects.
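
To make the selection concrete, here is a small numpy sketch of greedy MMR over precomputed embeddings (function and variable names are illustrative, not a library API):

import numpy as np

def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    # Greedily pick the candidate maximizing
    # lambda * sim(q, d_i) - (1 - lambda) * max_{d_j in S} sim(d_i, d_j)
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of the selected documents, in pick order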

In LangChain, max_marginal_relevance_search balances relevance and diversity:

results = vectorstore.max_marginal_relevance_search(
    "query",
    k=4,
    fetch_k=20,  # number of candidates fetched before MMR selection
    lambda_mult=0.5  # 0 = diversity, 1 = relevance
)

Hybrid Search

Core Idea: Combines dense (semantic) and sparse (keyword-based) retrieval to leverage both semantic understanding and exact term matching.

Mathematical Formulation: $$\text{score}(q, d) = \alpha \cdot \text{sim}_{\text{dense}}(\mathbf{q}, \mathbf{d}) + (1-\alpha) \cdot \text{BM25}(q, d)$$

where:

  • $\text{sim}_{\text{dense}}$ is cosine similarity in embedding space
  • $\text{BM25}(q, d)$ is the BM25 ranking function: $\sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$
  • $\alpha \in [0,1]$ controls the weighting

Key Insight: Dense vectors capture semantic meaning while sparse retrieval handles exact matches and rare terms effectively.
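
A rough sketch of the weighted combination, using the rank_bm25 package for the sparse term (assumed dependency; the BM25 rescaling here is a crude simplification for illustration):

import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_scores(query, docs, query_vec, doc_vecs, alpha=0.5):
    # Dense component: cosine similarity in embedding space
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    dense = np.array([cos(query_vec, d) for d in doc_vecs])

    # Sparse component: BM25 over whitespace-tokenized text
    bm25 = BM25Okapi([d.split() for d in docs])
    sparse = bm25.get_scores(query.split())
    sparse = sparse / (sparse.max() + 1e-9)  # crude rescaling so the two terms are comparable

    # score(q, d) = alpha * dense + (1 - alpha) * BM25
    return alpha * dense + (1 - alpha) * sparse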

In LangChain, EnsembleRetriever combines dense and sparse retrievers with configurable weights:

from langchain.retrievers import EnsembleRetriever
from langchain.retrievers import BM25Retriever

# Dense retriever
dense_retriever = vectorstore.as_retriever()

# Sparse retriever
bm25_retriever = BM25Retriever.from_documents(texts)

# Ensemble
ensemble_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.5, 0.5]
)

Reranking

Core Idea: Uses a more powerful (but slower) model to re-rank initial retrieval results, improving precision by considering query-document interactions more deeply.

Mathematical Formulation: $$\text{rerank}(q, D_k) = \text{argsort}_{d \in D_k} \left[ f_{\text{reranker}}(q, d) \right]$$

where:

  • $D_k$ are the top-$k$ documents from initial retrieval
  • $f_{\text{reranker}}$ is a cross-encoder that jointly encodes query and document
  • Cross-encoders jointly attend over query and document tokens, $\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}(\mathbf{Q}\mathbf{K}^T / \sqrt{d_k})\mathbf{V}$, and output a single relevance score

Key Insight: Cross-encoders see query-document pairs together, enabling fine-grained relevance scoring, but are too slow for initial retrieval over large corpora.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Base retriever
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Reranker
compressor = CohereRerank(model="rerank-english-v2.0", top_n=4)

# Compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

results = compression_retriever.get_relevant_documents("query")
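
For comparison, a direct cross-encoder reranking sketch with sentence-transformers (the checkpoint name is one common example, not the only option):

from sentence_transformers import CrossEncoder

# Jointly encodes (query, document) pairs and outputs a relevance score per pair
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, docs, top_n=4):
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [d for d, _ in ranked[:top_n]]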

Query Transformation

Multi-Query

Core Idea: Generates multiple query variations from the original query, then retrieves documents for each variation and merges results, improving recall.

Mathematical Formulation: $$\text{retrieve}(q) = \bigcup_{i=1}^{n} \text{TopK}(E(q_i), D)$$

where:

  • $q_i = \text{LLM}(q, \text{"Generate alternative query"})$ for $i \in [1, n]$
  • Each $q_i$ retrieves top-$k$ documents
  • Results are deduplicated and merged

Key Insight: Different phrasings of the same intent may match different document formulations, expanding coverage.

from langchain.retrievers.multi_query import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=OpenAI()
)
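
The same idea written out by hand, assuming a generic llm callable that returns text and a LangChain-style retriever (deduplication here is by page content):

def multi_query_retrieve(question, llm, retriever, n_variants=3):
    # Generate alternative phrasings of the question with the LLM
    prompt = (
        f"Generate {n_variants} alternative phrasings of this question, "
        f"one per line:\n{question}"
    )
    variants = [question] + [q.strip() for q in llm(prompt).splitlines() if q.strip()]

    # Retrieve for each variant, then merge and deduplicate the results
    seen, merged = set(), []
    for q in variants:
        for doc in retriever.get_relevant_documents(q):
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                merged.append(doc)
    return merged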

HyDE (Hypothetical Document Embeddings)

Core Idea: Generates a hypothetical answer document using the LLM, then uses its embedding for retrieval instead of the query embedding, bridging the vocabulary gap.

Mathematical Formulation:

  1. Generate hypothetical document: $d_h = \text{LLM}(q, \text{"Generate answer document"})$
  2. Embed hypothetical document: $\mathbf{d}_h = E(d_h)$
  3. Retrieve using hypothetical embedding: $\text{TopK} = \arg\max_{d \in D} \text{sim}(\mathbf{d}_h, \mathbf{d})$

Key Insight: The hypothetical document uses domain-specific vocabulary that better matches the corpus, improving retrieval quality for technical queries.

from langchain.chains import HypotheticalDocumentEmbedder

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=OpenAI(),
    base_embeddings=embeddings,
    prompt_key="web_search"
)

vectorstore = FAISS.from_documents(texts, hyde_embeddings)
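
The three steps written out by hand, assuming an llm callable, an embeddings object exposing embed_query, and an existing vector store:

def hyde_retrieve(question, llm, embeddings, vectorstore, k=4):
    # 1. Generate a hypothetical answer document
    hypothetical = llm(f"Write a short passage that answers this question:\n{question}")

    # 2. Embed the hypothetical document instead of the query
    hyde_vector = embeddings.embed_query(hypothetical)

    # 3. Retrieve real documents closest to the hypothetical embedding
    return vectorstore.similarity_search_by_vector(hyde_vector, k=k)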

Advanced RAG Patterns

Self-Query

Core Idea: Uses LLM to parse natural language queries into structured queries with metadata filters, enabling semantic search combined with structured filtering.

Mathematical Formulation: $$\text{parse}(q) \rightarrow (q_{\text{semantic}}, \mathcal{F})$$

where:

  • $q_{\text{semantic}}$ is the semantic query component
  • $\mathcal{F} = \{f_1, f_2, \ldots\}$ are metadata filters (e.g., date, source, type)
  • Retrieval: $\text{retrieve}(q) = \text{TopK}(\text{sim}(q_{\text{semantic}}, D_{\mathcal{F}}))$ where $D_{\mathcal{F}}$ are documents matching filters

Key Insight: Enables queries like "papers about RAG from 2024" by combining semantic search with structured metadata constraints.

from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The source of the document",
        type="string"
    ),
    AttributeInfo(
        name="date",
        description="The date the document was created",
        type="string"
    )
]

retriever = SelfQueryRetriever.from_llm(
    llm=OpenAI(),
    vectorstore=vectorstore,
    document_contents="Research papers",
    metadata_field_info=metadata_field_info
)
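
A query like the one in the key insight above is then split by the retriever into a semantic part and a metadata filter (the exact parse depends on the LLM):

# Semantic part: "papers about retrieval-augmented generation"
# Filter part:   constraint on the "date" metadata field
results = retriever.get_relevant_documents(
    "papers about retrieval-augmented generation from 2024"
)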

Parent Document Retriever

Core Idea: Uses small chunks for precise retrieval, but returns larger parent documents for context, balancing retrieval precision with generation context.

Mathematical Formulation:

  1. Split into small child chunks: $C = \{c_1, c_2, \ldots\}$ where $|c_i| < \text{chunk\_size}$
  2. Store parent documents: $P = \{p_1, p_2, \ldots\}$ where each $c_i \subseteq p_j$
  3. Retrieve child chunks: $C_{\text{retrieved}} = \text{TopK}(\text{sim}(q, C))$
  4. Return parent documents: $P_{\text{returned}} = \{p_j : c_i \in C_{\text{retrieved}} \land c_i \subseteq p_j\}$

Key Insight: Small chunks improve retrieval precision (better semantic matching), while parent documents provide full context for generation (avoiding truncation).

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

# Small chunks for retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# Store for parent documents
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter
)
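
Documents must be indexed through the retriever so that child chunks land in the vector store and parents in the docstore; querying then returns full parent documents (a sketch continuing the setup above):

# Index: split into child chunks, embed them, and keep parents in the docstore
retriever.add_documents(documents)

# The vector store matches small child chunks...
child_hits = vectorstore.similarity_search("query")

# ...while the retriever returns the larger parent documents for generation context
parent_docs = retriever.get_relevant_documents("query")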

Evaluation Metrics

Retrieval and Generation Metrics (RAGAS)

from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy
)

# Evaluate (dataset: evaluation examples; see the sketch below)
results = evaluate(
    dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy
    ]
)
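
A sketch of the kind of dataset these metrics expect, one row per question (column names follow the ragas documentation; verify against your installed version):

from datasets import Dataset

dataset = Dataset.from_dict({
    "question": ["What is the main topic?"],
    "answer": ["The documents cover retrieval-augmented generation."],
    "contexts": [["RAG combines retrieval from external knowledge bases with LLM generation..."]],
    "ground_truth": ["The main topic is retrieval-augmented generation."],
})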
