RAG (Retrieval-Augmented Generation)
Retrieval-Augmented Generation techniques for enhancing LLM responses with external knowledge.
Core Idea
RAG combines retrieval from external knowledge bases with LLM generation to produce accurate, up-to-date responses without retraining the model.
Mathematical Foundation
The core retrieval mechanism uses cosine similarity in embedding space:
$$\text{similarity}(q, d) = \frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}|| \cdot ||\mathbf{d}||} = \cos(\theta)$$
where:
- $\mathbf{q}$ is the query embedding vector
- $\mathbf{d}$ is the document chunk embedding vector
- $\theta$ is the angle between vectors
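To make the formula concrete, here is a minimal NumPy sketch of the similarity computation (the vectors are toy stand-ins for real embeddings):

import numpy as np

def cosine_similarity(q: np.ndarray, d: np.ndarray) -> float:
    # Dot product scaled by both vector norms
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

q = np.array([0.2, 0.7, 0.1])    # toy query embedding
d = np.array([0.1, 0.8, 0.05])   # toy document-chunk embedding
print(cosine_similarity(q, d))   # near 1.0 for nearly parallel vectors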
Key Process:
- Embedding: Convert query and documents to dense vectors using an embedding model: $\mathbf{q} = E(q)$, $\mathbf{d}_i = E(d_i)$
- Retrieval: Find the top-$k$ most similar chunks: $\text{TopK} = \{\, i \in [1, N] : \text{similarity}(\mathbf{q}, \mathbf{d}_i) \text{ is among the } k \text{ largest} \,\}$
- Augmentation: Inject retrieved context into the prompt: $\text{prompt} = f(q, \text{TopK})$
- Generation: LLM generates response conditioned on augmented prompt: $r = \text{LLM}(\text{prompt})$
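These four steps can be sketched end to end in plain Python; embed_fn and llm_fn are hypothetical stand-ins for an embedding model and an LLM client:

import numpy as np

def retrieve_top_k(q_vec, doc_vecs, k=4):
    # Rank chunks by cosine similarity and keep the k best indices
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    return np.argsort(sims)[::-1][:k]

def rag_answer(query, chunks, embed_fn, llm_fn, k=4):
    q_vec = embed_fn(query)                             # 1. embed the query
    doc_vecs = np.stack([embed_fn(c) for c in chunks])  #    ...and the chunks
    top = retrieve_top_k(q_vec, doc_vecs, k)            # 2. retrieve
    context = "\n\n".join(chunks[i] for i in top)       # 3. augment
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm_fn(prompt)                               # 4. generate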
This approach enables:
- Knowledge injection without model fine-tuning
- Reduced hallucination by grounding in retrieved facts
- Dynamic updates by refreshing the document store
- Source attribution by referencing retrieved chunks
RAG Architecture
Basic RAG Pipeline
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# 1. Load and split documents
documents = load_documents()  # placeholder for your document loader
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
texts = text_splitter.split_documents(documents)

# 2. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(texts, embeddings)

# 3. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

# 4. Query
response = qa_chain.run("What is the main topic?")
Chunking Strategies
Fixed-Size Chunking
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separator="\n"
)
chunks = splitter.split_text(text)
Recursive Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(text)
Semantic Chunking
from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
chunks = splitter.split_text(text)
Embedding Models
OpenAI Embeddings
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector = embeddings.embed_query("Hello world")
Sentence Transformers
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
vector = embeddings.embed_query("Hello world")
Vector Stores
Chroma
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Search
results = vectorstore.similarity_search("query", k=4)
FAISS
from langchain.vectorstores import FAISS

vectorstore = FAISS.from_documents(texts, embeddings)
vectorstore.save_local("faiss_index")

# Load
vectorstore = FAISS.load_local("faiss_index", embeddings)
Pinecone
import pinecone
from langchain.vectorstores import Pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
vectorstore = Pinecone.from_documents(texts, embeddings, index_name="my-index")
Retrieval Strategies
Similarity Search
# Basic similarity
results = vectorstore.similarity_search("query", k=4)

# With scores
results = vectorstore.similarity_search_with_score("query", k=4)
MMR (Maximal Marginal Relevance)
Core Idea: Selects documents that are both relevant to the query and diverse from already-selected documents, preventing redundant information.
Mathematical Formulation: $$\text{MMR} = \arg\max_{d_i \in R \setminus S} \left[ \lambda \cdot \text{sim}(q, d_i) - (1-\lambda) \cdot \max_{d_j \in S} \text{sim}(d_i, d_j) \right]$$
where:
- $R$ is the candidate set of retrieved documents
- $S$ is the set of already-selected documents
- $\lambda \in [0,1]$ controls the trade-off (0 = diversity, 1 = relevance)
- $\text{sim}(q, d_i)$ is query-document similarity
- $\text{sim}(d_i, d_j)$ is inter-document similarity
Key Insight: The second term penalizes documents similar to already-selected ones, ensuring coverage of different aspects.
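To ground the formula, here is a from-scratch sketch of the greedy MMR selection loop (assuming L2-normalized embeddings, so dot products are cosine similarities):

import numpy as np

def mmr_select(q_vec, doc_vecs, k=4, lam=0.5):
    # Greedily pick documents that are relevant yet non-redundant
    selected, candidates = [], list(range(len(doc_vecs)))
    relevance = doc_vecs @ q_vec  # sim(q, d_i) for every candidate
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Penalty: highest similarity to anything already selected
            redundancy = max((doc_vecs[i] @ doc_vecs[j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected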
LangChain's MMR search balances relevance and diversity:
results = vectorstore.max_marginal_relevance_search(
    "query",
    k=4,
    fetch_k=20,
    lambda_mult=0.5  # 0 = diversity, 1 = relevance
)
Hybrid Search
Core Idea: Combines dense (semantic) and sparse (keyword-based) retrieval to leverage both semantic understanding and exact term matching.
Mathematical Formulation: $$\text{score}(q, d) = \alpha \cdot \text{sim}_{\text{dense}}(\mathbf{q}, \mathbf{d}) + (1-\alpha) \cdot \text{BM25}(q, d)$$
where:
- $\text{sim}_{\text{dense}}$ is cosine similarity in embedding space
- $\text{BM25}(q, d)$ is the BM25 ranking function: $\sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$
- $\alpha \in [0,1]$ controls the weighting
Key Insight: Dense vectors capture semantic meaning while sparse retrieval handles exact matches and rare terms effectively.
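A minimal sketch of the score fusion, with min-max normalization to put the dense and BM25 scales on comparable footing (the scores shown are made up):

def hybrid_scores(dense, bm25, alpha=0.5):
    # Weighted fusion of per-document dense and sparse scores
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {d: (s - lo) / ((hi - lo) or 1.0) for d, s in scores.items()}
    dense, bm25 = normalize(dense), normalize(bm25)
    ids = set(dense) | set(bm25)
    return {d: alpha * dense.get(d, 0.0) + (1 - alpha) * bm25.get(d, 0.0)
            for d in ids}

# Usage: rank document ids by fused score
fused = hybrid_scores({"d1": 0.9, "d2": 0.4}, {"d1": 3.1, "d2": 7.8})
ranking = sorted(fused, key=fused.get, reverse=True)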
Combine dense and sparse retrieval:
from langchain.retrievers import EnsembleRetriever
from langchain.retrievers import BM25Retriever

# Dense retriever
dense_retriever = vectorstore.as_retriever()

# Sparse retriever
bm25_retriever = BM25Retriever.from_documents(texts)

# Ensemble
ensemble_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.5, 0.5]
)
Reranking
Core Idea: Uses a more powerful (but slower) model to re-rank initial retrieval results, improving precision by considering query-document interactions more deeply.
Mathematical Formulation: $$\text{rerank}(q, D_k) = \operatorname{argsort}_{d \in D_k} \left[ f_{\text{reranker}}(q, d) \right]$$
where:
- $D_k$ are the top-$k$ documents from initial retrieval
- $f_{\text{reranker}}$ is a cross-encoder that jointly encodes query and document
- Cross-encoders attend jointly over query and document tokens, $\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}(\mathbf{Q}\mathbf{K}^T / \sqrt{d_k})\mathbf{V}$, with the relevance score read from the final pooled representation
Key Insight: Cross-encoders see query-document pairs together, enabling fine-grained relevance scoring, but are too slow for initial retrieval over large corpora.
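Outside LangChain, a cross-encoder can be run directly with the sentence-transformers library; the checkpoint below is a commonly used MS MARCO reranker:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, docs, top_n=4):
    # Score each (query, document) pair jointly, then sort descending
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [d for d, _ in ranked[:top_n]]

LangChain wires the same idea into retrieval through a compression retriever: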
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Base retriever
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Reranker
compressor = CohereRerank(model="rerank-english-v2.0", top_n=4)

# Compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

results = compression_retriever.get_relevant_documents("query")
Query Transformation
Multi-Query
Core Idea: Generates multiple query variations from the original query, then retrieves documents for each variation and merges results, improving recall.
Mathematical Formulation: $$\text{retrieve}(q) = \bigcup_{i=1}^{n} \text{TopK}(E(q_i), D)$$
where:
- $q_i = \text{LLM}(q, \text{"Generate alternative query"})$ for $i \in [1, n]$
- Each $q_i$ retrieves top-$k$ documents
- Results are deduplicated and merged
Key Insight: Different phrasings of the same intent may match different document formulations, expanding coverage.
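A minimal sketch of the retrieve-and-merge step; generate_variants is a hypothetical LLM call returning paraphrased queries:

def multi_query_retrieve(query, retriever, generate_variants, k=4):
    # Retrieve for the original query plus LLM-generated rephrasings
    queries = [query] + generate_variants(query)
    seen, merged = set(), []
    for q in queries:
        for doc in retriever.get_relevant_documents(q)[:k]:
            if doc.page_content not in seen:  # deduplicate by content
                seen.add(doc.page_content)
                merged.append(doc)
    return merged

LangChain provides this pattern out of the box: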
from langchain.retrievers.multi_query import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=OpenAI()
)
HyDE (Hypothetical Document Embeddings)
Core Idea: Generates a hypothetical answer document using the LLM, then uses its embedding for retrieval instead of the query embedding, bridging the vocabulary gap.
Mathematical Formulation:
- Generate hypothetical document: $d_h = \text{LLM}(q, \text{"Generate answer document"})$
- Embed hypothetical document: $\mathbf{d}_h = E(d_h)$
- Retrieve using hypothetical embedding: $\text{TopK} = \arg\max_{d \in D} \text{sim}(\mathbf{d}_h, \mathbf{d})$
Key Insight: The hypothetical document uses domain-specific vocabulary that better matches the corpus, improving retrieval quality for technical queries.
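Spelled out directly, HyDE only changes which text gets embedded; llm and embeddings below are assumed to follow the LangChain interfaces used elsewhere in this section:

def hyde_search(query, llm, embeddings, vectorstore, k=4):
    # 1. Generate a hypothetical answer document
    hypothetical = llm(f"Write a short passage answering: {query}")
    # 2. Embed the hypothetical document instead of the raw query
    vec = embeddings.embed_query(hypothetical)
    # 3. Retrieve by similarity to the hypothetical embedding
    return vectorstore.similarity_search_by_vector(vec, k=k)

LangChain packages the same idea as a wrapper around the base embeddings: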
from langchain.chains import HypotheticalDocumentEmbedder

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=OpenAI(),
    base_embeddings=embeddings,
    prompt_key="web_search"
)

vectorstore = FAISS.from_documents(texts, hyde_embeddings)
Advanced RAG Patterns
Self-Query
Core Idea: Uses LLM to parse natural language queries into structured queries with metadata filters, enabling semantic search combined with structured filtering.
Mathematical Formulation: $$\text{parse}(q) \rightarrow (q_{\text{semantic}}, \mathcal{F})$$
where:
- $q_{\text{semantic}}$ is the semantic query component
- $\mathcal{F} = \{f_1, f_2, \ldots\}$ are metadata filters (e.g., date, source, type)
- Retrieval: $\text{retrieve}(q) = \text{TopK}(\text{sim}(q_{\text{semantic}}, D_{\mathcal{F}}))$ where $D_{\mathcal{F}}$ are documents matching filters
Key Insight: Enables queries like "papers about RAG from 2024" by combining semantic search with structured metadata constraints.
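To illustrate the parse step, here is a hypothetical decomposition of such a query, applied with a Chroma-style store that accepts a metadata filter alongside the semantic query:

# Hypothetical LLM parse of "papers about RAG from 2024"
semantic_query = "retrieval-augmented generation"
metadata_filter = {"source": "arxiv", "date": "2024"}

# Semantic search restricted to documents matching the filter
results = vectorstore.similarity_search(
    semantic_query,
    k=4,
    filter=metadata_filter
)

SelfQueryRetriever automates exactly this parse: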
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The source of the document",
        type="string"
    ),
    AttributeInfo(
        name="date",
        description="The date the document was created",
        type="string"
    )
]

retriever = SelfQueryRetriever.from_llm(
    llm=OpenAI(),
    vectorstore=vectorstore,
    document_contents="Research papers",
    metadata_field_info=metadata_field_info
)
Parent Document Retriever
Core Idea: Uses small chunks for precise retrieval, but returns larger parent documents for context, balancing retrieval precision with generation context.
Mathematical Formulation:
- Split into small child chunks: $C = \{c_1, c_2, \ldots\}$ where $|c_i| < \text{chunk\_size}$
- Store parent documents: $P = \{p_1, p_2, \ldots\}$ where $c_i \subseteq p_j$
- Retrieve child chunks: $C_{\text{retrieved}} = \text{TopK}(\text{sim}(q, C))$
- Return parent documents: $P_{\text{returned}} = \{p_j : c_i \in C_{\text{retrieved}} \land c_i \subseteq p_j\}$
Key Insight: Small chunks improve retrieval precision (better semantic matching), while parent documents provide full context for generation (avoiding truncation).
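The child-to-parent bookkeeping can be sketched with plain Python; split and top_k_fn are hypothetical helpers for chunking and similarity search:

def build_index(parents, split):
    # Index small child chunks, remembering each chunk's parent
    child_texts, child_to_parent = [], []
    for pid, parent in enumerate(parents):
        for chunk in split(parent):       # small chunks for matching
            child_texts.append(chunk)
            child_to_parent.append(pid)   # chunk i belongs to parent pid
    return child_texts, child_to_parent

def retrieve_parents(query, child_texts, child_to_parent, parents, top_k_fn):
    hits = top_k_fn(query, child_texts)        # match against small chunks
    pids = {child_to_parent[i] for i in hits}  # map hits back to parents
    return [parents[pid] for pid in pids]      # return full parent docs

LangChain implements this pattern directly: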
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

# Small chunks for retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# Store for parent documents
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter
)

# Index: splits each document into child chunks and stores the parents
retriever.add_documents(documents)
Evaluation Metrics
Retrieval Metrics
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy
)

# Evaluate
results = evaluate(
    dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy
    ]
)
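ragas expects the evaluation data as a Hugging Face datasets.Dataset with specific columns; a sketch of the expected shape (column names can differ between ragas versions, so treat this as an assumption to verify):

from datasets import Dataset

dataset = Dataset.from_dict({
    "question": ["What is the main topic?"],
    "contexts": [["Retrieved chunk 1...", "Retrieved chunk 2..."]],
    "answer": ["Generated answer..."],
    "ground_truth": ["Reference answer used by recall metrics"],  # name varies by version
})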