RAG (Retrieval-Augmented Generation)

Retrieval-Augmented Generation techniques for enhancing LLM responses with external knowledge.

Core Idea

RAG combines retrieval from external knowledge bases with LLM generation to produce accurate, up-to-date responses without retraining the model.

Mathematical Foundation

The core retrieval mechanism uses cosine similarity in embedding space:

$$\text{similarity}(q, d) = \frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}|| \cdot ||\mathbf{d}||} = \cos(\theta)$$

where:

  • $\mathbf{q}$ is the query embedding vector
  • $\mathbf{d}$ is the document chunk embedding vector
  • $\theta$ is the angle between vectors

Key Process:

  1. Embedding: Convert query and documents to dense vectors using an embedding model: $\mathbf{q} = E(q)$, $\mathbf{d}_i = E(d_i)$
  2. Retrieval: Find top-$k$ most similar chunks: $\text{TopK} = \arg\max_{i \in [1,N]} \text{similarity}(\mathbf{q}, \mathbf{d}_i)$
  3. Augmentation: Inject retrieved context into the prompt: $\text{prompt} = f(q, \text{TopK})$
  4. Generation: LLM generates response conditioned on augmented prompt: $r = \text{LLM}(\text{prompt})$

This approach enables:

  • Knowledge injection without model fine-tuning
  • Reduced hallucination by grounding in retrieved facts
  • Dynamic updates by refreshing the document store
  • Source attribution by referencing retrieved chunks
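
A minimal sketch of these four steps, assuming generic embed and llm_generate callables (both hypothetical placeholders for your embedding model and LLM client):

import numpy as np

def cosine_similarity(q, d):
    # similarity(q, d) = (q · d) / (||q|| · ||d||)
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

def rag_answer(query, documents, embed, llm_generate, k=4):
    # 1. Embedding: q = E(q), d_i = E(d_i)
    q_vec = embed(query)
    doc_vecs = [embed(d) for d in documents]

    # 2. Retrieval: top-k chunks by cosine similarity
    scores = [cosine_similarity(q_vec, d_vec) for d_vec in doc_vecs]
    top_k = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:k]

    # 3. Augmentation: inject retrieved context into the prompt
    context = "\n\n".join(documents[i] for i in top_k)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

    # 4. Generation: response conditioned on the augmented prompt
    return llm_generate(prompt)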

RAG Architecture


Basic RAG Pipeline

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# 1. Load and split documents
documents = load_documents()  # placeholder: use any LangChain document loader
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
texts = text_splitter.split_documents(documents)

# 2. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(texts, embeddings)

# 3. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

# 4. Query
response = qa_chain.run("What is the main topic?")

Chunking Strategies

Fixed-Size Chunking

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separator="\n"
)
chunks = splitter.split_text(text)

Recursive Chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(text)

Semantic Chunking

from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
chunks = splitter.split_text(text)

Embedding Models

OpenAI Embeddings

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector = embeddings.embed_query("Hello world")

Sentence Transformers

from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
vector = embeddings.embed_query("Hello world")

Vector Stores

Chroma

from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Search
results = vectorstore.similarity_search("query", k=4)

FAISS

from langchain.vectorstores import FAISS

vectorstore = FAISS.from_documents(texts, embeddings)
vectorstore.save_local("faiss_index")

# Load
vectorstore = FAISS.load_local("faiss_index", embeddings)

Pinecone

import pinecone
from langchain.vectorstores import Pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
vectorstore = Pinecone.from_documents(texts, embeddings, index_name="my-index")

Retrieval Strategies

Similarity Search

# Basic similarity search
results = vectorstore.similarity_search("query", k=4)

# With relevance scores (score semantics depend on the vector store)
results = vectorstore.similarity_search_with_score("query", k=4)

MMR (Maximal Marginal Relevance)

Core Idea: Selects documents that are both relevant to the query and diverse from already-selected documents, preventing redundant information.

Mathematical Formulation: $$\text{MMR} = \arg\max_{d_i \in R \setminus S} \left[ \lambda \cdot \text{sim}(q, d_i) - (1-\lambda) \cdot \max_{d_j \in S} \text{sim}(d_i, d_j) \right]$$

where:

  • $R$ is the candidate set of retrieved documents
  • $S$ is the set of already-selected documents
  • $\lambda \in [0,1]$ controls the trade-off (0 = diversity, 1 = relevance)
  • $\text{sim}(q, d_i)$ is query-document similarity
  • $\text{sim}(d_i, d_j)$ is inter-document similarity

Key Insight: The second term penalizes documents similar to already-selected ones, ensuring coverage of different aspects.
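
To make the selection concrete, here is a small numpy sketch of greedy MMR over precomputed embeddings (function and variable names are illustrative, not a library API):

import numpy as np

def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    # Greedily pick the candidate maximizing
    # lambda * sim(q, d_i) - (1 - lambda) * max_{d_j in S} sim(d_i, d_j)
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of the selected documents, in pick order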

In LangChain, max_marginal_relevance_search balances relevance and diversity:

results = vectorstore.max_marginal_relevance_search(
    "query",
    k=4,
    fetch_k=20,  # number of candidates fetched before MMR selection
    lambda_mult=0.5  # 0 = diversity, 1 = relevance
)

Hybrid Search

Core Idea: Combines dense (semantic) and sparse (keyword-based) retrieval to leverage both semantic understanding and exact term matching.

Mathematical Formulation: $$\text{score}(q, d) = \alpha \cdot \text{sim}_{\text{dense}}(\mathbf{q}, \mathbf{d}) + (1-\alpha) \cdot \text{BM25}(q, d)$$

where:

  • $\text{sim}_{\text{dense}}$ is cosine similarity in embedding space
  • $\text{BM25}(q, d)$ is the BM25 ranking function: $\sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$
  • $\alpha \in [0,1]$ controls the weighting

Key Insight: Dense vectors capture semantic meaning while sparse retrieval handles exact matches and rare terms effectively.
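
A rough sketch of the weighted combination, using the rank_bm25 package for the sparse term (assumed dependency; the BM25 rescaling here is a crude simplification for illustration):

import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_scores(query, docs, query_vec, doc_vecs, alpha=0.5):
    # Dense component: cosine similarity in embedding space
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    dense = np.array([cos(query_vec, d) for d in doc_vecs])

    # Sparse component: BM25 over whitespace-tokenized text
    bm25 = BM25Okapi([d.split() for d in docs])
    sparse = bm25.get_scores(query.split())
    sparse = sparse / (sparse.max() + 1e-9)  # crude rescaling so the two terms are comparable

    # score(q, d) = alpha * dense + (1 - alpha) * BM25
    return alpha * dense + (1 - alpha) * sparse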

In LangChain, EnsembleRetriever combines dense and sparse retrievers with configurable weights:

from langchain.retrievers import EnsembleRetriever
from langchain.retrievers import BM25Retriever

# Dense retriever
dense_retriever = vectorstore.as_retriever()

# Sparse retriever
bm25_retriever = BM25Retriever.from_documents(texts)

# Ensemble
ensemble_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.5, 0.5]
)

Reranking

Core Idea: Uses a more powerful (but slower) model to re-rank initial retrieval results, improving precision by considering query-document interactions more deeply.

Mathematical Formulation: $$\text{rerank}(q, D_k) = \text{argsort}_{d \in D_k} \left[ f_{\text{reranker}}(q, d) \right]$$

where:

  • $D_k$ are the top-$k$ documents from initial retrieval
  • $f_{\text{reranker}}$ is a cross-encoder that jointly encodes query and document
  • Cross-encoders jointly attend over query and document tokens, $\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}(\mathbf{Q}\mathbf{K}^T / \sqrt{d_k})\mathbf{V}$, and output a single relevance score

Key Insight: Cross-encoders see query-document pairs together, enabling fine-grained relevance scoring, but are too slow for initial retrieval over large corpora.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Base retriever
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Reranker
compressor = CohereRerank(model="rerank-english-v2.0", top_n=4)

# Compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

results = compression_retriever.get_relevant_documents("query")
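
For comparison, a direct cross-encoder reranking sketch with sentence-transformers (the checkpoint name is one common example, not the only option):

from sentence_transformers import CrossEncoder

# Jointly encodes (query, document) pairs and outputs a relevance score per pair
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, docs, top_n=4):
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [d for d, _ in ranked[:top_n]]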

Query Transformation

Multi-Query

Core Idea: Generates multiple query variations from the original query, then retrieves documents for each variation and merges results, improving recall.

Mathematical Formulation: $$\text{retrieve}(q) = \bigcup_{i=1}^{n} \text{TopK}(E(q_i), D)$$

where:

  • $q_i = \text{LLM}(q, \text{"Generate alternative query"})$ for $i \in [1, n]$
  • Each $q_i$ retrieves top-$k$ documents
  • Results are deduplicated and merged

Key Insight: Different phrasings of the same intent may match different document formulations, expanding coverage.

from langchain.retrievers.multi_query import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=OpenAI()
)
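
The same idea written out by hand, assuming a generic llm callable that returns text and a LangChain-style retriever (deduplication here is by page content):

def multi_query_retrieve(question, llm, retriever, n_variants=3):
    # Generate alternative phrasings of the question with the LLM
    prompt = (
        f"Generate {n_variants} alternative phrasings of this question, "
        f"one per line:\n{question}"
    )
    variants = [question] + [q.strip() for q in llm(prompt).splitlines() if q.strip()]

    # Retrieve for each variant, then merge and deduplicate the results
    seen, merged = set(), []
    for q in variants:
        for doc in retriever.get_relevant_documents(q):
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                merged.append(doc)
    return merged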

HyDE (Hypothetical Document Embeddings)

Core Idea: Generates a hypothetical answer document using the LLM, then uses its embedding for retrieval instead of the query embedding, bridging the vocabulary gap.

Mathematical Formulation:

  1. Generate hypothetical document: $d_h = \text{LLM}(q, \text{"Generate answer document"})$
  2. Embed hypothetical document: $\mathbf{d}_h = E(d_h)$
  3. Retrieve using hypothetical embedding: $\text{TopK} = \arg\max_{d \in D} \text{sim}(\mathbf{d}_h, \mathbf{d})$

Key Insight: The hypothetical document uses domain-specific vocabulary that better matches the corpus, improving retrieval quality for technical queries.

from langchain.chains import HypotheticalDocumentEmbedder

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=OpenAI(),
    base_embeddings=embeddings,
    prompt_key="web_search"
)

vectorstore = FAISS.from_documents(texts, hyde_embeddings)
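
The three steps written out by hand, assuming an llm callable, an embeddings object exposing embed_query, and an existing vector store:

def hyde_retrieve(question, llm, embeddings, vectorstore, k=4):
    # 1. Generate a hypothetical answer document
    hypothetical = llm(f"Write a short passage that answers this question:\n{question}")

    # 2. Embed the hypothetical document instead of the query
    hyde_vector = embeddings.embed_query(hypothetical)

    # 3. Retrieve real documents closest to the hypothetical embedding
    return vectorstore.similarity_search_by_vector(hyde_vector, k=k)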

Advanced RAG Patterns

Self-Query

Core Idea: Uses LLM to parse natural language queries into structured queries with metadata filters, enabling semantic search combined with structured filtering.

Mathematical Formulation: $$\text{parse}(q) \rightarrow (q_{\text{semantic}}, \mathcal{F})$$

where:

  • $q_{\text{semantic}}$ is the semantic query component
  • $\mathcal{F} = \{f_1, f_2, \ldots\}$ are metadata filters (e.g., date, source, type)
  • Retrieval: $\text{retrieve}(q) = \text{TopK}(\text{sim}(q_{\text{semantic}}, D_{\mathcal{F}}))$ where $D_{\mathcal{F}}$ are documents matching filters

Key Insight: Enables queries like "papers about RAG from 2024" by combining semantic search with structured metadata constraints.

from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The source of the document",
        type="string"
    ),
    AttributeInfo(
        name="date",
        description="The date the document was created",
        type="string"
    )
]

retriever = SelfQueryRetriever.from_llm(
    llm=OpenAI(),
    vectorstore=vectorstore,
    document_contents="Research papers",
    metadata_field_info=metadata_field_info
)
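
A query like the one in the key insight above is then split by the retriever into a semantic part and a metadata filter (the exact parse depends on the LLM):

# Semantic part: "papers about retrieval-augmented generation"
# Filter part:   constraint on the "date" metadata field
results = retriever.get_relevant_documents(
    "papers about retrieval-augmented generation from 2024"
)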

Parent Document Retriever

Core Idea: Uses small chunks for precise retrieval, but returns larger parent documents for context, balancing retrieval precision with generation context.

Mathematical Formulation:

  1. Split into small child chunks: $C = \{c_1, c_2, \ldots\}$ where $|c_i| < \text{chunk\_size}$
  2. Store parent documents: $P = \{p_1, p_2, \ldots\}$ where each $c_i \subseteq p_j$
  3. Retrieve child chunks: $C_{\text{retrieved}} = \text{TopK}(\text{sim}(q, C))$
  4. Return parent documents: $P_{\text{returned}} = \{p_j : c_i \in C_{\text{retrieved}} \land c_i \subseteq p_j\}$

Key Insight: Small chunks improve retrieval precision (better semantic matching), while parent documents provide full context for generation (avoiding truncation).

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

# Small chunks for retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# Store for parent documents
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter
)
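
Documents must be indexed through the retriever so that child chunks land in the vector store and parents in the docstore; querying then returns full parent documents (a sketch continuing the setup above):

# Index: split into child chunks, embed them, and keep parents in the docstore
retriever.add_documents(documents)

# The vector store matches small child chunks...
child_hits = vectorstore.similarity_search("query")

# ...while the retriever returns the larger parent documents for generation context
parent_docs = retriever.get_relevant_documents("query")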

Evaluation Metrics

Retrieval and Generation Metrics (RAGAS)

from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy
)

# Evaluate (dataset: evaluation examples; see the sketch below)
results = evaluate(
    dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy
    ]
)
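
A sketch of the kind of dataset these metrics expect, one row per question (column names follow the ragas documentation; verify against your installed version):

from datasets import Dataset

dataset = Dataset.from_dict({
    "question": ["What is the main topic?"],
    "answer": ["The documents cover retrieval-augmented generation."],
    "contexts": [["RAG combines retrieval from external knowledge bases with LLM generation..."]],
    "ground_truth": ["The main topic is retrieval-augmented generation."],
})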
