Text to Vector Embeddings

Text embeddings convert textual content into dense vector representations that capture semantic meaning, enabling similarity search, classification, and retrieval in natural language processing.

Core Idea

Text embeddings map text sequences (words, sentences, documents) to fixed-size vectors in a high-dimensional space where semantically similar texts are close together. The embedding function $E: \mathcal{T} \rightarrow \mathbb{R}^d$ transforms text $T \in \mathcal{T}$ into a vector $\mathbf{v} \in \mathbb{R}^d$.

Mathematical Foundation

Tokenization and Encoding:

Text is first tokenized into subword units: $$T \rightarrow [t_1, t_2, \ldots, t_n]$$

Each token is mapped to an embedding: $$\mathbf{e}_i = \text{Embedding}(t_i) \in \mathbb{R}^d$$

Inference (Forward Pass):

For Transformer-based encoders (BERT, Sentence-BERT): $$\mathbf{v} = E(T) = \text{Pool}(\text{Transformer}([\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n]))$$

For mean pooling: $$\mathbf{v} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{h}_i$$

where $\mathbf{h}_i$ are hidden states from the transformer encoder.

For CLS token pooling: $$\mathbf{v} = \mathbf{h}_{\text{CLS}}$$
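
A minimal sketch of the two pooling formulas on a dummy batch of hidden states (shapes are illustrative; in a real encoder `h` would be the transformer's last hidden state):

import torch

# Dummy hidden states: batch of 2 sequences, 5 tokens, hidden size 8
h = torch.randn(2, 5, 8)

v_mean = h.mean(dim=1)   # mean pooling: average over the token axis -> [2, 8]
v_cls = h[:, 0]          # CLS pooling: take the first token's hidden state -> [2, 8]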

Training Objective:

Contrastive learning with InfoNCE loss: $$\mathcal{L} = -\log \frac{\exp(\text{sim}(\mathbf{v}_i, \mathbf{v}_i^+) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(\mathbf{v}_i, \mathbf{v}_j) / \tau)}$$

where:

  • $\mathbf{v}_i$ is the anchor embedding
  • $\mathbf{v}_i^+$ is the positive (similar) embedding
  • $\mathbf{v}_j$ are all embeddings in the batch (including negatives)
  • $\text{sim}(\cdot, \cdot)$ is cosine similarity
  • $\tau$ is the temperature parameter

Supervised Fine-tuning (Sentence-BERT):

For sentence pairs $(s_i, s_j)$ with label $y_{ij}$: $$\mathcal{L} = -\sum_{(i,j)} \left[ y_{ij} \log(\sigma(\text{sim}(\mathbf{v}_i, \mathbf{v}_j))) + (1-y_{ij}) \log(1-\sigma(\text{sim}(\mathbf{v}_i, \mathbf{v}_j))) \right]$$

where $\sigma$ is the sigmoid function.
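
A minimal PyTorch sketch of this pairwise objective, assuming precomputed sentence embeddings `v_i`, `v_j` and binary labels `y` (names are illustrative):

import torch
import torch.nn.functional as F

def pairwise_similarity_bce(v_i, v_j, y):
    # v_i, v_j: [B, d] sentence embeddings; y: [B] labels in {0, 1}
    sim = F.cosine_similarity(v_i, v_j, dim=1)                 # [B], values in [-1, 1]
    return F.binary_cross_entropy_with_logits(sim, y.float())  # applies sigmoid internally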


Architecture Overview

The overall pipeline: tokenize the input text, map tokens to embeddings, encode them with a Transformer, pool the hidden states into a single vector, and L2-normalize the result for similarity comparison.

PyTorch Implementation

BERT-based Text Encoder

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BERTTextEncoder(nn.Module):
    def __init__(self, model_name="bert-base-uncased", pooling="mean"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.embedding_dim = self.bert.config.hidden_size
        self.pooling = pooling

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )

        if self.pooling == "cls":
            # Use the [CLS] token's hidden state
            embedding = outputs.last_hidden_state[:, 0]
        elif self.pooling == "mean":
            # Mean pooling over non-padding tokens using the attention mask
            hidden_states = outputs.last_hidden_state
            if attention_mask is not None:
                mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
                sum_embeddings = torch.sum(hidden_states * mask_expanded, dim=1)
                sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
                embedding = sum_embeddings / sum_mask
            else:
                embedding = hidden_states.mean(dim=1)
        else:
            raise ValueError(f"Unknown pooling: {self.pooling}")

        # L2-normalize so that dot product equals cosine similarity
        return nn.functional.normalize(embedding, p=2, dim=1)

# Usage
encoder = BERTTextEncoder(pooling="mean")
encoder.eval()

text = "This is a sample sentence."
inputs = encoder.tokenizer(
    text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
)

with torch.no_grad():
    embedding = encoder(**inputs)
    # embedding shape: [1, 768]

Sentence-BERT Style Encoder

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SentenceBERTEncoder(nn.Module):
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        super().__init__()
        self.model = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.embedding_dim = self.model.config.hidden_size

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )

        # Mean pooling over non-padding tokens
        hidden_states = outputs.last_hidden_state
        if attention_mask is not None:
            mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
            sum_embeddings = torch.sum(hidden_states * mask_expanded, dim=1)
            sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
            embedding = sum_embeddings / sum_mask
        else:
            embedding = hidden_states.mean(dim=1)

        return nn.functional.normalize(embedding, p=2, dim=1)

# Usage
encoder = SentenceBERTEncoder()
encoder.eval()

texts = ["First sentence.", "Second sentence."]
inputs = encoder.tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
)

with torch.no_grad():
    embeddings = encoder(**inputs)
    # embeddings shape: [2, 384]
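
Because the encoder L2-normalizes its outputs, cosine similarity between the two sentences is just a dot product (a quick check reusing the `embeddings` tensor from the usage snippet above):

# Cosine similarity between the two encoded sentences (values in [-1, 1])
similarity = embeddings @ embeddings.t()
print(similarity[0, 1].item())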

Training with Contrastive Loss

import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCELoss(nn.Module):
    def __init__(self, temperature=0.05):
        super().__init__()
        self.temperature = temperature

    def forward(self, anchor, positive):
        # anchor: [B, d], positive: [B, d]
        batch_size = anchor.size(0)

        # Normalize so the dot product below is cosine similarity
        anchor = F.normalize(anchor, p=2, dim=1)
        positive = F.normalize(positive, p=2, dim=1)

        # Similarity matrix: [B, B]; off-diagonal entries act as in-batch negatives
        similarity_matrix = torch.matmul(anchor, positive.t()) / self.temperature

        # Positive pairs lie on the diagonal
        labels = torch.arange(batch_size, device=anchor.device)

        # Cross-entropy over each row recovers the InfoNCE objective
        loss = F.cross_entropy(similarity_matrix, labels)
        return loss

# Training loop (dataloader is assumed to yield batches of paired anchor/positive strings)
encoder = SentenceBERTEncoder()
encoder.train()
criterion = InfoNCELoss(temperature=0.05)
optimizer = torch.optim.Adam(encoder.parameters(), lr=2e-5)

for anchor_texts, positive_texts in dataloader:
    optimizer.zero_grad()

    # Encode anchor texts
    anchor_inputs = encoder.tokenizer(
        anchor_texts,
        return_tensors="pt",
        padding=True,
        truncation=True
    )
    anchor_emb = encoder(**anchor_inputs)

    # Encode positive texts
    positive_inputs = encoder.tokenizer(
        positive_texts,
        return_tensors="pt",
        padding=True,
        truncation=True
    )
    positive_emb = encoder(**positive_inputs)

    loss = criterion(anchor_emb, positive_emb)
    loss.backward()
    optimizer.step()

LangChain Integration

Basic Text Embeddings

from langchain.embeddings import OpenAIEmbeddings, HuggingFaceEmbeddings

# Option 1: OpenAI embeddings (requires an OpenAI API key)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Option 2: local HuggingFace embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Create embeddings for texts
texts = ["First document.", "Second document.", "Third document."]
text_embeddings = embeddings.embed_documents(texts)

# Single query embedding
query_embedding = embeddings.embed_query("What is the first document?")
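
embed_documents returns one vector per text and embed_query returns a single vector, so retrieval reduces to a similarity computation. A quick check with NumPy, reusing the variables above (a sketch, not part of the LangChain API):

import numpy as np

doc_vecs = np.array(text_embeddings)     # [3, d]
query_vec = np.array(query_embedding)    # [d]

# Cosine similarity between the query and each document
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(scores.argmax())  # index of the most similar document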

Vector Store with Text Embeddings

from langchain.vectorstores import FAISS
from langchain.schema import Document

# Create documents
documents = [
    Document(page_content="Machine learning is a subset of AI."),
    Document(page_content="Deep learning uses neural networks."),
    Document(page_content="Natural language processing enables text understanding.")
]

# Create vector store (reuses the `embeddings` object from above)
vectorstore = FAISS.from_documents(documents, embeddings)

# Similarity search
results = vectorstore.similarity_search("What is AI?", k=2)

# Similarity search with scores
results_with_scores = vectorstore.similarity_search_with_score(
    "What is AI?",
    k=2
)
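
similarity_search returns Document objects, while similarity_search_with_score also returns a relevance score; for FAISS this is typically a distance, so lower values indicate closer matches.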

Advanced: Custom Embedding Function

from langchain.embeddings.base import Embeddings
from transformers import AutoModel, AutoTokenizer
import torch

class CustomHuggingFaceEmbeddings(Embeddings):
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.model = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model.eval()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def embed_documents(self, texts):
        inputs = self.tokenizer(
            texts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            # Mean pooling over non-padding tokens using the attention mask
            hidden_states = outputs.last_hidden_state
            mask = inputs["attention_mask"].unsqueeze(-1).expand(hidden_states.size()).float()
            embeddings = (hidden_states * mask).sum(dim=1) / torch.clamp(mask.sum(dim=1), min=1e-9)
            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

        return embeddings.cpu().numpy().tolist()

    def embed_query(self, text):
        return self.embed_documents([text])[0]

# Usage
custom_embeddings = CustomHuggingFaceEmbeddings()
vectorstore = FAISS.from_documents(documents, custom_embeddings)

RAG Pipeline with Text Embeddings

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Load and split documents (load_documents is a placeholder for your own document loader)
documents = load_documents()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
texts = text_splitter.split_documents(documents)

# Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(texts, embeddings)

# Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

# Query
response = qa_chain.run("What is the main topic?")

Key Concepts

Tokenization: Converts text into tokens (word, subword, or character units):

  • Word-level: "machine learning" → ["machine", "learning"]
  • Subword-level (BPE/WordPiece): "embeddings" → ["em", "##bed", "##ding", "##s"] (handles out-of-vocabulary words)
  • SentencePiece: Handles multilingual text and special characters
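
A quick way to inspect subword splits with the tokenizer used earlier (exact pieces depend on the model's vocabulary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("embeddings"))
# ['em', '##bed', '##ding', '##s'] with the bert-base-uncased vocabulary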

Position Embeddings: Inject positional information: $$\mathbf{e}_i = \mathbf{e}_i^{\text{token}} + \mathbf{e}_i^{\text{position}}$$

Pooling Strategies:

  • CLS Token: Use special classification token embedding
  • Mean Pooling: Average all token embeddings: $\mathbf{v} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{h}_i$
  • Max Pooling: Element-wise maximum: $\mathbf{v} = \max_i \mathbf{h}_i$
  • Attention Pooling: Weighted sum: $\mathbf{v} = \sum_{i} \alpha_i \mathbf{h}_i$
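
Max and attention pooling are not shown in the encoders above; a minimal sketch of both on a batch of hidden states `h` (the attention weights here come from a hypothetical learned scoring layer):

import torch
import torch.nn as nn

h = torch.randn(2, 5, 8)                  # [batch, tokens, hidden]

# Max pooling: element-wise maximum over the token axis
v_max = h.max(dim=1).values               # [2, 8]

# Attention pooling: softmax-normalized weights from a learned scorer
scorer = nn.Linear(8, 1)                  # hypothetical scoring layer
alpha = torch.softmax(scorer(h), dim=1)   # [2, 5, 1]
v_attn = (alpha * h).sum(dim=1)           # [2, 8]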

Normalization: L2 normalization ensures embeddings on unit hypersphere: $$\mathbf{v}_{\text{norm}} = \frac{\mathbf{v}}{||\mathbf{v}||_2}$$

Similarity Metrics:

  • Cosine Similarity: $\text{sim}(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{||\mathbf{v}_1|| \cdot ||\mathbf{v}_2||}$
  • Dot Product: $\text{sim}(\mathbf{v}_1, \mathbf{v}_2) = \mathbf{v}_1 \cdot \mathbf{v}_2$ (after normalization, equals cosine)
  • Euclidean Distance: $d(\mathbf{v}_1, \mathbf{v}_2) = ||\mathbf{v}_1 - \mathbf{v}_2||_2$
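
A quick numeric check of the three metrics on two small vectors:

import torch
import torch.nn.functional as F

v1 = torch.tensor([1.0, 2.0, 3.0])
v2 = torch.tensor([2.0, 4.0, 6.0])

cosine = F.cosine_similarity(v1, v2, dim=0)   # 1.0 (same direction)
dot = torch.dot(v1, v2)                       # 28.0
euclidean = torch.dist(v1, v2, p=2)           # sqrt(1 + 4 + 9) ≈ 3.74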

Chunking for Long Documents:

  • Fixed-size chunks with overlap to preserve context
  • Semantic chunking based on embedding similarity
  • Hierarchical chunking (parent-child relationships)
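
A minimal sketch of fixed-size chunking with overlap (character-based for simplicity; splitters such as RecursiveCharacterTextSplitter in the RAG example above try to break on natural boundaries instead):

def chunk_text(text, chunk_size=1000, overlap=200):
    # Slide a window of chunk_size characters, stepping by chunk_size - overlap
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "some placeholder text " * 200   # stand-in for a long document
chunks = chunk_text(sample, chunk_size=1000, overlap=200)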
