Text to Vector Embeddings
Text embeddings convert textual content into dense vector representations that capture semantic meaning, enabling similarity search, classification, and retrieval in natural language processing.
Core Idea
Text embeddings map text sequences (words, sentences, documents) to fixed-size vectors in a high-dimensional space where semantically similar texts are close together. The embedding function $E: \mathcal{T} \rightarrow \mathbb{R}^d$ transforms text $T \in \mathcal{T}$ into a vector $\mathbf{v} \in \mathbb{R}^d$.
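A minimal sketch of this idea, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (any encoder with the properties above would do): semantically related sentences end up with a higher cosine similarity than unrelated ones.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # d = 384

# Normalized embeddings, so the dot product is cosine similarity
vectors = model.encode(
    ["A cat sits on the mat.", "A kitten rests on a rug.", "Stock prices fell sharply today."],
    normalize_embeddings=True,
)

print(vectors.shape)            # (3, 384)
print(vectors[0] @ vectors[1])  # related pair: higher similarity
print(vectors[0] @ vectors[2])  # unrelated pair: lower similarity
```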
Mathematical Foundation
Tokenization and Encoding:
Text is first tokenized into subword units: $$T \rightarrow [t_1, t_2, \ldots, t_n]$$
Each token is mapped to an embedding: $$\mathbf{e}_i = \text{Embedding}(t_i) \in \mathbb{R}^d$$
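For concreteness, a small sketch of the tokenize-then-look-up step, assuming the bert-base-uncased tokenizer (exact splits depend on the vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# T -> [t_1, ..., t_n]: subword tokenization
tokens = tokenizer.tokenize("Text embeddings capture meaning.")
print(tokens)

# Each token id indexes a row of the model's embedding matrix (the e_i vectors)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)
```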
Inference (Forward Pass):
For Transformer-based encoders (BERT, Sentence-BERT): $$\mathbf{v} = E(T) = \text{Pool}(\text{Transformer}([\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n]))$$
For mean pooling: $$\mathbf{v} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{h}_i$$
where $\mathbf{h}_i$ are hidden states from the transformer encoder.
For CLS token pooling: $$\mathbf{v} = \mathbf{h}_{\text{CLS}}$$
Training Objective:
Contrastive learning with the InfoNCE loss: $$\mathcal{L} = -\log \frac{\exp(\text{sim}(\mathbf{v}_i, \mathbf{v}_i^+) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(\mathbf{v}_i, \mathbf{v}_j) / \tau)}$$
where:
- $\mathbf{v}_i$ is the anchor embedding
- $\mathbf{v}_i^+$ is the positive (similar) embedding
- $\mathbf{v}_j$ are all embeddings in the batch (including negatives)
- $\text{sim}(\cdot, \cdot)$ is cosine similarity
- $\tau$ is the temperature parameter
Supervised Fine-tuning (Sentence-BERT):
For sentence pairs $(s_i, s_j)$ with label $y_{ij}$: $$\mathcal{L} = -\sum_{(i,j)} \left[ y_{ij} \log\big(\sigma(\text{sim}(\mathbf{v}_i, \mathbf{v}_j))\big) + (1-y_{ij}) \log\big(1-\sigma(\text{sim}(\mathbf{v}_i, \mathbf{v}_j))\big) \right]$$
where $\sigma$ is the sigmoid function.
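A minimal PyTorch sketch of this pairwise objective (the function name pairwise_bce_loss is illustrative; labels are assumed to be binary similarity annotations):

```python
import torch
import torch.nn.functional as F

def pairwise_bce_loss(v_i, v_j, labels):
    # v_i, v_j: [B, d] sentence embeddings; labels: [B] in {0, 1}
    v_i = F.normalize(v_i, p=2, dim=1)
    v_j = F.normalize(v_j, p=2, dim=1)
    sim = (v_i * v_j).sum(dim=1)  # cosine similarity per pair
    return F.binary_cross_entropy(torch.sigmoid(sim), labels.float())
```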
Architecture Overview
A typical text-embedding pipeline chains the steps above: tokenizer → transformer encoder → pooling (CLS or mean) → L2 normalization → fixed-size embedding vector.
PyTorch Implementation
BERT-based Text Encoder
```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BERTTextEncoder(nn.Module):
    def __init__(self, model_name="bert-base-uncased", pooling="mean"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.embedding_dim = self.bert.config.hidden_size
        self.pooling = pooling

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)

        if self.pooling == "cls":
            # Use the CLS token embedding
            embedding = outputs.last_hidden_state[:, 0]
        elif self.pooling == "mean":
            # Mean pooling over non-padding tokens via the attention mask
            hidden_states = outputs.last_hidden_state
            if attention_mask is not None:
                mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
                sum_embeddings = torch.sum(hidden_states * mask_expanded, dim=1)
                sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
                embedding = sum_embeddings / sum_mask
            else:
                embedding = hidden_states.mean(dim=1)
        else:
            raise ValueError(f"Unknown pooling: {self.pooling}")

        return nn.functional.normalize(embedding, p=2, dim=1)

# Usage
encoder = BERTTextEncoder(pooling="mean")
encoder.eval()

text = "This is a sample sentence."
inputs = encoder.tokenizer(
    text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
)

with torch.no_grad():
    embedding = encoder(**inputs)
    # embedding shape: [1, 768]
```
Sentence-BERT Style Encoder
```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SentenceBERTEncoder(nn.Module):
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        super().__init__()
        self.model = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.embedding_dim = self.model.config.hidden_size

    def forward(self, input_ids, attention_mask=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)

        # Mean pooling over non-padding tokens
        hidden_states = outputs.last_hidden_state
        if attention_mask is not None:
            mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
            sum_embeddings = torch.sum(hidden_states * mask_expanded, dim=1)
            sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
            embedding = sum_embeddings / sum_mask
        else:
            embedding = hidden_states.mean(dim=1)

        return nn.functional.normalize(embedding, p=2, dim=1)

# Usage
encoder = SentenceBERTEncoder()
encoder.eval()

texts = ["First sentence.", "Second sentence."]
inputs = encoder.tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
)

with torch.no_grad():
    embeddings = encoder(**inputs)
    # embeddings shape: [2, 384]
```
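Because the encoder L2-normalizes its output, pairwise cosine similarities for the batch reduce to a matrix product:

```python
# Pairwise cosine similarities for the batch above (embeddings are L2-normalized)
similarity = embeddings @ embeddings.T
print(similarity)  # [2, 2]; diagonal entries are 1.0
```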
Training with Contrastive Loss
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCELoss(nn.Module):
    def __init__(self, temperature=0.05):
        super().__init__()
        self.temperature = temperature

    def forward(self, anchor, positive):
        # anchor: [B, d], positive: [B, d]
        batch_size = anchor.size(0)

        # Normalize so the dot product equals cosine similarity
        anchor = F.normalize(anchor, p=2, dim=1)
        positive = F.normalize(positive, p=2, dim=1)

        # Similarity matrix: [B, B]; off-diagonal entries serve as in-batch negatives
        similarity_matrix = torch.matmul(anchor, positive.t()) / self.temperature

        # Positive pairs are on the diagonal
        labels = torch.arange(batch_size, device=anchor.device)

        # Cross-entropy over each row implements the InfoNCE objective
        loss = F.cross_entropy(similarity_matrix, labels)
        return loss

# Training loop (dataloader is assumed to yield batches of anchor/positive text pairs)
encoder = SentenceBERTEncoder()
criterion = InfoNCELoss(temperature=0.05)
optimizer = torch.optim.Adam(encoder.parameters(), lr=2e-5)

for anchor_texts, positive_texts in dataloader:
    optimizer.zero_grad()

    # Encode anchor texts
    anchor_inputs = encoder.tokenizer(
        anchor_texts,
        return_tensors="pt",
        padding=True,
        truncation=True
    )
    anchor_emb = encoder(**anchor_inputs)

    # Encode positive texts
    positive_inputs = encoder.tokenizer(
        positive_texts,
        return_tensors="pt",
        padding=True,
        truncation=True
    )
    positive_emb = encoder(**positive_inputs)

    loss = criterion(anchor_emb, positive_emb)
    loss.backward()
    optimizer.step()
```
LangChain Integration
Basic Text Embeddings
```python
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceEmbeddings

# Using OpenAI embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Using HuggingFace embeddings (overrides the OpenAI instance above; pick one)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Create embeddings for texts
texts = ["First document.", "Second document.", "Third document."]
text_embeddings = embeddings.embed_documents(texts)

# Single query embedding
query_embedding = embeddings.embed_query("What is the first document?")
```
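The returned embeddings are plain Python lists of floats, so they can be compared directly; for example, ranking the documents against the query by cosine similarity:

```python
import numpy as np

doc_matrix = np.array(text_embeddings)  # [3, d]
query_vec = np.array(query_embedding)   # [d]

# Cosine similarity between the query and each document
scores = doc_matrix @ query_vec / (
    np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
)
print(scores.argsort()[::-1])  # document indices, most similar first
```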
Vector Store with Text Embeddings
```python
from langchain.vectorstores import FAISS
from langchain.schema import Document

# Create documents
documents = [
    Document(page_content="Machine learning is a subset of AI."),
    Document(page_content="Deep learning uses neural networks."),
    Document(page_content="Natural language processing enables text understanding.")
]

# Create vector store
vectorstore = FAISS.from_documents(documents, embeddings)

# Similarity search
results = vectorstore.similarity_search("What is AI?", k=2)

# Similarity search with scores
results_with_scores = vectorstore.similarity_search_with_score(
    "What is AI?",
    k=2
)
```
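The index can also be persisted and exposed as a retriever; a sketch against the classic LangChain FAISS API (the folder name faiss_index is illustrative):

```python
# Persist the index and reload it later with the same embedding function
vectorstore.save_local("faiss_index")
vectorstore = FAISS.load_local("faiss_index", embeddings)

# Expose the store as a retriever (used by retrieval chains below)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
docs = retriever.get_relevant_documents("What is AI?")
```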
Advanced: Custom Embedding Function
```python
from langchain.embeddings.base import Embeddings
from transformers import AutoModel, AutoTokenizer
import torch

class CustomHuggingFaceEmbeddings(Embeddings):
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.model = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model.eval()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def embed_documents(self, texts):
        inputs = self.tokenizer(
            texts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            # Mean pooling over non-padding tokens
            hidden_states = outputs.last_hidden_state
            mask = inputs["attention_mask"].unsqueeze(-1).float()
            embeddings = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

        return embeddings.cpu().numpy().tolist()

    def embed_query(self, text):
        return self.embed_documents([text])[0]

# Usage
custom_embeddings = CustomHuggingFaceEmbeddings()
vectorstore = FAISS.from_documents(documents, custom_embeddings)
```
RAG Pipeline with Text Embeddings
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Load and split documents (load_documents() is a placeholder for your own loader)
documents = load_documents()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
texts = text_splitter.split_documents(documents)

# Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(texts, embeddings)

# Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

# Query
response = qa_chain.run("What is the main topic?")
```
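To inspect which chunks grounded the answer, the same chain can be built with return_source_documents=True and called with a query dict (classic RetrievalQA interface):

```python
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True
)

result = qa_chain({"query": "What is the main topic?"})
print(result["result"])                 # generated answer
for doc in result["source_documents"]:  # retrieved chunks used as context
    print(doc.page_content[:80])
```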
Key Concepts
Tokenization: Converts text to subword units:
- Word-level: "machine learning" → ["machine", "learning"]
- Subword-level (WordPiece/BPE): "embeddings" → ["em", "##bed", "##ding", "##s"] (handles out-of-vocabulary words)
- SentencePiece: Handles multilingual text and special characters
Position Embeddings: Inject positional information: $$\mathbf{e}_i = \mathbf{e}_i^{\text{token}} + \mathbf{e}_i^{\text{position}}$$
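A toy sketch of this sum, assuming hypothetical sizes (vocabulary 30522, maximum length 512, hidden size 768):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d = 30522, 512, 768
token_embedding = nn.Embedding(vocab_size, d)
position_embedding = nn.Embedding(max_len, d)

input_ids = torch.tensor([[101, 2023, 2003, 1037, 7099, 102]])  # [1, n]
positions = torch.arange(input_ids.size(1)).unsqueeze(0)        # [1, n]

# e_i = e_i^token + e_i^position
embeddings = token_embedding(input_ids) + position_embedding(positions)
print(embeddings.shape)  # [1, 6, 768]
```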
Pooling Strategies:
- CLS Token: Use special classification token embedding
- Mean Pooling: Average all token embeddings: $\mathbf{v} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{h}_i$
- Max Pooling: Element-wise maximum: $\mathbf{v} = \max_i \mathbf{h}_i$
- Attention Pooling: Weighted sum: $\mathbf{v} = \sum_{i} \alpha_i \mathbf{h}_i$
Normalization: L2 normalization places embeddings on the unit hypersphere: $$\mathbf{v}_{\text{norm}} = \frac{\mathbf{v}}{||\mathbf{v}||_2}$$
Similarity Metrics (a numeric sketch follows this list):
- Cosine Similarity: $\text{sim}(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{||\mathbf{v}_1|| \cdot ||\mathbf{v}_2||}$
- Dot Product: $\text{sim}(\mathbf{v}_1, \mathbf{v}_2) = \mathbf{v}_1 \cdot \mathbf{v}_2$ (after normalization, equals cosine)
- Euclidean Distance: $d(\mathbf{v}_1, \mathbf{v}_2) = ||\mathbf{v}_1 - \mathbf{v}_2||_2$
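A small numeric sketch of these metrics, illustrating that the dot product of L2-normalized vectors equals their cosine similarity:

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 1.0, 0.5])

cosine = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
dot = v1 @ v2
euclidean = np.linalg.norm(v1 - v2)

# After L2 normalization the dot product reproduces the cosine similarity
u1, u2 = v1 / np.linalg.norm(v1), v2 / np.linalg.norm(v2)
print(cosine, dot, euclidean, u1 @ u2)
```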
Chunking for Long Documents:
- Fixed-size chunks with overlap to preserve context
- Semantic chunking based on embedding similarity (see the sketch after this list)
- Hierarchical chunking (parent-child relationships)
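A minimal sketch of the semantic-chunking idea, assuming sentence-transformers; the function semantic_chunks and the threshold value are illustrative, not a standard API:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences, model_name="all-MiniLM-L6-v2", threshold=0.5):
    # Start a new chunk whenever adjacent sentences drift apart semantically
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Normalized embeddings: dot product equals cosine similarity
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "Machine learning is a subset of AI.",
    "Deep learning uses neural networks.",
    "The weather today is sunny and warm."
]
print(semantic_chunks(sentences, threshold=0.4))
```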
Related Snippets
- Data Augmentation
- DNN Policy Learning Theory
- Graph RAG Techniques
- Image to Vector Embeddings
- Keras Essentials
- LangChain Recipes
- ONNX Model Conversion
- PyTorch Essentials
- Q-Learning Theory
- RAG (Retrieval-Augmented Generation)
- Sound to Vector Embeddings
- Tensor Mathematics & Backpropagation
- TensorFlow Essentials
- TensorFlow Lite