Image to Vector Embeddings
Image embeddings convert visual content into dense vector representations that capture semantic and visual features, enabling similarity search, classification, and retrieval.
Core Idea
Image embeddings map images to fixed-size vectors in a high-dimensional space where semantically similar images are close together. The embedding function $E: \mathcal{I} \rightarrow \mathbb{R}^d$ transforms an image $I \in \mathcal{I}$ into a vector $\mathbf{v} \in \mathbb{R}^d$.
Mathematical Foundation
Inference (Forward Pass):
For a CNN-based encoder: $$\mathbf{v} = E(I) = \text{GlobalPool}(\text{CNN}(I))$$
For a Vision Transformer (ViT): $$\mathbf{v} = E(I) = \text{CLS}_{\text{pool}}(\text{ViT}(\text{PatchEmbed}(I)))$$
where:
- $I \in \mathbb{R}^{H \times W \times C}$ is the input image
- $\mathbf{v} \in \mathbb{R}^d$ is the output embedding vector
- $d$ is the embedding dimension (typically 512, 768, or 1024)
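As a concrete shape walkthrough of $\mathbf{v} = \text{GlobalPool}(\text{CNN}(I))$, here is a minimal sketch assuming a ResNet-50 backbone and a 224×224 input:

```python
import torch
import torch.nn as nn
from torchvision import models

# Backbone without its avgpool/fc head, so it returns the spatial feature map
cnn = nn.Sequential(*list(models.resnet50(weights=None).children())[:-2])
I = torch.randn(1, 3, 224, 224)       # input image batch [B, C, H, W]
feature_map = cnn(I)                  # [1, 2048, 7, 7]
v = feature_map.mean(dim=(2, 3))      # global average pool -> [1, 2048]
```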
Training Objective:
Contrastive learning with triplet loss: $$\mathcal{L}_{\text{triplet}} = \max(0, d(\mathbf{v}_a, \mathbf{v}_p) - d(\mathbf{v}_a, \mathbf{v}_n) + \alpha)$$
where:
- $\mathbf{v}_a$ is the anchor embedding
- $\mathbf{v}_p$ is the positive (similar) embedding
- $\mathbf{v}_n$ is the negative (dissimilar) embedding
- $d(\cdot, \cdot)$ is the distance metric (e.g., Euclidean or cosine)
- $\alpha$ is the margin hyperparameter
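For intuition: with $d(\mathbf{v}_a, \mathbf{v}_p) = 0.3$, $d(\mathbf{v}_a, \mathbf{v}_n) = 0.9$, and $\alpha = 0.5$, the loss is $\max(0, 0.3 - 0.9 + 0.5) = 0$: the negative already sits more than the margin farther from the anchor than the positive, so this triplet contributes no gradient.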
Alternative: InfoNCE Loss (used in CLIP): $$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_j) / \tau)}$$
where:
- $\mathbf{v}_i$ is the image embedding
- $\mathbf{t}_i$ is the paired text embedding
- $\text{sim}(\cdot, \cdot)$ is cosine similarity
- $\tau$ is the temperature parameter
- $N$ is the batch size
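A minimal PyTorch sketch of this objective, assuming both embeddings are L2-normalized and using the symmetric image-to-text / text-to-image form from CLIP:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    # Unit-norm embeddings, so the matrix product gives cosine similarities [N, N]
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(image_emb.size(0))  # matched pairs lie on the diagonal
    # Symmetric cross-entropy over both retrieval directions, as in CLIP
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```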
Architecture Overview
A typical encoder pipeline: input image → backbone (CNN or ViT) → pooling (global pooling or the CLS token) → projection head → L2 normalization → fixed-size embedding vector.
PyTorch Implementation
ResNet-based Image Encoder
```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

class ImageEncoder(nn.Module):
    def __init__(self, embedding_dim=512):
        super().__init__()
        # Use pretrained ResNet as backbone
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Remove final classification layer
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # Add projection head
        self.projection = nn.Sequential(
            nn.Linear(2048, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim)
        )

    def forward(self, x):
        # Extract features: [B, 2048, 1, 1]
        features = self.backbone(x)
        # Flatten: [B, 2048]
        features = features.view(features.size(0), -1)
        # Project to embedding space: [B, embedding_dim]
        embedding = self.projection(features)
        # L2 normalize
        return nn.functional.normalize(embedding, p=2, dim=1)

# Usage
encoder = ImageEncoder(embedding_dim=512)
encoder.eval()

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

image = Image.open("image.jpg").convert("RGB")  # ensure 3 channels
image_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    embedding = encoder(image_tensor)
    # embedding shape: [1, 512]
```
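Once two images are encoded, the L2-normalized embeddings can be compared with a plain dot product. A small usage sketch (the file names are placeholders):

```python
# Compare two images; unit-norm embeddings make the dot product a cosine similarity
with torch.no_grad():
    emb_a = encoder(transform(Image.open("a.jpg").convert("RGB")).unsqueeze(0))
    emb_b = encoder(transform(Image.open("b.jpg").convert("RGB")).unsqueeze(0))
similarity = (emb_a * emb_b).sum(dim=1).item()  # in [-1, 1]
```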
Vision Transformer (ViT) Encoder
```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import ViTModel, ViTImageProcessor

class ViTImageEncoder(nn.Module):
    def __init__(self, model_name="google/vit-base-patch16-224"):
        super().__init__()
        self.vit = ViTModel.from_pretrained(model_name)
        self.embedding_dim = self.vit.config.hidden_size

    def forward(self, pixel_values):
        outputs = self.vit(pixel_values=pixel_values)
        # Use CLS token embedding
        embedding = outputs.last_hidden_state[:, 0]
        return nn.functional.normalize(embedding, p=2, dim=1)

# Usage
encoder = ViTImageEncoder()
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

image = Image.open("image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    embedding = encoder(**inputs)
    # embedding shape: [1, 768]
```
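CLS pooling is only one choice; mean-pooling over the patch tokens is a common alternative. A minimal sketch of the swap inside `forward`:

```python
def forward(self, pixel_values):
    outputs = self.vit(pixel_values=pixel_values)
    # Mean-pool over patch tokens, skipping the CLS token at index 0
    embedding = outputs.last_hidden_state[:, 1:].mean(dim=1)
    return nn.functional.normalize(embedding, p=2, dim=1)
```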
Training with Contrastive Loss
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripletLoss(nn.Module):
    def __init__(self, margin=0.5):
        super().__init__()
        self.margin = margin

    def forward(self, anchor, positive, negative):
        # Compute distances
        pos_dist = F.pairwise_distance(anchor, positive)
        neg_dist = F.pairwise_distance(anchor, negative)
        # Triplet loss
        loss = F.relu(pos_dist - neg_dist + self.margin)
        return loss.mean()

# Training loop (assumes a dataloader yielding (anchor, positive, negative) batches)
encoder = ImageEncoder()
encoder.train()
criterion = TripletLoss(margin=0.5)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

for anchor_img, positive_img, negative_img in dataloader:
    optimizer.zero_grad()

    anchor_emb = encoder(anchor_img)
    positive_emb = encoder(positive_img)
    negative_emb = encoder(negative_img)

    loss = criterion(anchor_emb, positive_emb, negative_emb)
    loss.backward()
    optimizer.step()
```
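Triplet quality drives how much the loss can teach: easy negatives push the hinge to zero almost immediately. One common refinement is in-batch hard negative mining; a minimal sketch, assuming every other item in the batch is a valid negative for each anchor:

```python
def hardest_in_batch_negatives(anchor_emb, batch_emb):
    # Pairwise distances between anchors and all batch items: [B, B]
    dists = torch.cdist(anchor_emb, batch_emb)
    # Exclude self-matches by pushing the diagonal to +inf
    dists.fill_diagonal_(float("inf"))
    # Hardest negative = the closest non-self item
    return batch_emb[dists.argmin(dim=1)]
```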
LangChain Integration
```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from PIL import Image
from sentence_transformers import SentenceTransformer

# CLIP embeds images and text into a shared space; sentence-transformers
# CLIP checkpoints accept PIL images directly in encode()
clip_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32")

def encode_image(image_path):
    return clip_model.encode(Image.open(image_path)).tolist()

# Embed a collection of images; the stored text is a caption/identifier
captions = [f"Image {i}" for i in range(100)]
image_embeddings = [encode_image(f"image_{i}.jpg") for i in range(100)]

# Text-side embedder over the same CLIP model, used to embed text queries
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/clip-ViT-B-32")

# Create vector store from the precomputed image embeddings
vectorstore = FAISS.from_embeddings(
    text_embeddings=list(zip(captions, image_embeddings)),
    embedding=embeddings,
)

# Similarity search with a query image
query_image_emb = encode_image("query.jpg")
results = vectorstore.similarity_search_by_vector(query_image_emb, k=5)
```
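Because CLIP places text and images in the same space, the same index also answers text queries:

```python
# Cross-modal retrieval: a natural-language query against the image index
results = vectorstore.similarity_search("a photo of a dog", k=5)
```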
Key Concepts
Global Pooling: Aggregates spatial features into a fixed-size vector:
- Average Pooling: $\mathbf{v} = \frac{1}{HW}\sum_{i,j} \mathbf{F}_{i,j}$
- Max Pooling: $\mathbf{v} = \max_{i,j} \mathbf{F}_{i,j}$
- Attention Pooling: $\mathbf{v} = \sum_{i,j} \alpha_{i,j} \mathbf{F}_{i,j}$ where $\alpha = \text{softmax}(\mathbf{W}\mathbf{F})$
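The attention-pooling variant can be realized with a 1×1 convolution that produces the spatial scores; a minimal sketch (the module name is illustrative):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Attention pooling over a feature map F of shape [B, C, H, W]."""
    def __init__(self, channels):
        super().__init__()
        # Plays the role of W in the softmax(WF) formula above
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat):
        alpha = self.score(feat).flatten(2).softmax(dim=-1)  # [B, 1, H*W]
        return (feat.flatten(2) * alpha).sum(dim=-1)         # [B, C]
```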
Normalization: L2 normalization ensures embeddings lie on the unit hypersphere, making cosine similarity equivalent to dot product: $$\mathbf{v}_{\text{norm}} = \frac{\mathbf{v}}{||\mathbf{v}||_2}$$
Similarity Metrics:
- Cosine Similarity: $\text{sim}(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{||\mathbf{v}_1|| \cdot ||\mathbf{v}_2||}$
- Euclidean Distance: $d(\mathbf{v}_1, \mathbf{v}_2) = ||\mathbf{v}_1 - \mathbf{v}_2||_2$
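A quick numeric check of the claim that L2 normalization reduces cosine similarity to a dot product:

```python
import torch
import torch.nn.functional as F

v1, v2 = torch.randn(512), torch.randn(512)
cos = F.cosine_similarity(v1, v2, dim=0)
dot = torch.dot(F.normalize(v1, dim=0), F.normalize(v2, dim=0))
assert torch.allclose(cos, dot)  # cosine similarity == dot product of unit vectors
```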