Image to Vector Embeddings

Image embeddings convert visual content into dense vector representations that capture semantic and visual features, enabling similarity search, classification, and retrieval.

Core Idea

Image embeddings map images to fixed-size vectors in a high-dimensional space where semantically similar images are close together. The embedding function $E: \mathcal{I} \rightarrow \mathbb{R}^d$ transforms an image $I \in \mathcal{I}$ into a vector $\mathbf{v} \in \mathbb{R}^d$.

Mathematical Foundation

Inference (Forward Pass):

For a CNN-based encoder (a shape-level sketch of the pooling step follows the definitions below): $$\mathbf{v} = E(I) = \text{GlobalPool}(\text{CNN}(I))$$

For a Vision Transformer (ViT): $$\mathbf{v} = E(I) = \text{CLS}_{\text{pool}}(\text{ViT}(\text{PatchEmbed}(I)))$$

where:

  • $I \in \mathbb{R}^{H \times W \times C}$ is the input image
  • $\mathbf{v} \in \mathbb{R}^d$ is the output embedding vector
  • $d$ is the embedding dimension (typically 512, 768, or 1024)
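
The following shape-level sketch illustrates the GlobalPool step in isolation; it uses a dummy feature map rather than a real backbone, so the tensor sizes are purely illustrative:

import torch
import torch.nn as nn

# Dummy CNN feature map: batch of 2, 2048 channels on a 7x7 spatial grid
feature_map = torch.randn(2, 2048, 7, 7)

# Global average pooling collapses the spatial grid to 1x1
global_pool = nn.AdaptiveAvgPool2d(1)
pooled = global_pool(feature_map)          # [2, 2048, 1, 1]
embedding = pooled.flatten(start_dim=1)    # [2, 2048] -- v = GlobalPool(CNN(I))

print(embedding.shape)  # torch.Size([2, 2048])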

Training Objective:

Contrastive learning with triplet loss: $$\mathcal{L}_{\text{triplet}} = \max(0, d(\mathbf{v}_a, \mathbf{v}_p) - d(\mathbf{v}_a, \mathbf{v}_n) + \alpha)$$

where:

  • $\mathbf{v}_a$ is the anchor embedding
  • $\mathbf{v}_p$ is the positive (similar) embedding
  • $\mathbf{v}_n$ is the negative (dissimilar) embedding
  • $d(\cdot, \cdot)$ is the distance metric (e.g., Euclidean or cosine)
  • $\alpha$ is the margin hyperparameter

Alternative: InfoNCE Loss (used in CLIP; a PyTorch sketch follows the definitions below): $$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_j) / \tau)}$$

where:

  • $\mathbf{v}_i$ is the image embedding
  • $\mathbf{t}_i$ is the paired text embedding
  • $\text{sim}(\cdot, \cdot)$ is cosine similarity
  • $\tau$ is the temperature parameter
  • $N$ is the batch size
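
A minimal PyTorch sketch of this objective, assuming already-computed image and text embeddings of shape [N, d]; the function name and temperature value are illustrative:

import torch
import torch.nn.functional as F

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    # Cosine similarity reduces to a matrix product once both sides are L2-normalized
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.t() / temperature   # [N, N], entries sim(v_i, t_j) / tau
    # The matching text for image i sits at index i, so the targets are the diagonal
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings (N = 8, d = 512)
loss = info_nce_loss(torch.randn(8, 512), torch.randn(8, 512))

CLIP itself trains with the symmetric version of this objective, averaging the image-to-text loss above with the analogous text-to-image term (cross-entropy over the transposed logits).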

Architecture Overview

Regardless of backbone, the pipeline is the same: run the image through a CNN or ViT, pool the resulting features into a single vector, optionally apply a projection head, and L2-normalize the output. The implementations below follow this pattern.

PyTorch Implementation

ResNet-based Image Encoder

import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

class ImageEncoder(nn.Module):
    def __init__(self, embedding_dim=512):
        super().__init__()
        # Use a pretrained ResNet-50 as the backbone
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Remove the final classification layer, keeping everything up to global pooling
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # Projection head maps backbone features into the embedding space
        self.projection = nn.Sequential(
            nn.Linear(2048, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim)
        )

    def forward(self, x):
        # Extract features: [B, 2048, 1, 1]
        features = self.backbone(x)
        # Flatten: [B, 2048]
        features = features.view(features.size(0), -1)
        # Project to embedding space: [B, embedding_dim]
        embedding = self.projection(features)
        # L2-normalize so cosine similarity equals the dot product
        return nn.functional.normalize(embedding, p=2, dim=1)

# Usage
encoder = ImageEncoder(embedding_dim=512)
encoder.eval()

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

image = Image.open("image.jpg").convert("RGB")
image_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    embedding = encoder(image_tensor)
    # embedding shape: [1, 512]

Vision Transformer (ViT) Encoder

import torch
import torch.nn as nn
from PIL import Image
from transformers import ViTModel, ViTImageProcessor

class ViTImageEncoder(nn.Module):
    def __init__(self, model_name="google/vit-base-patch16-224"):
        super().__init__()
        self.vit = ViTModel.from_pretrained(model_name)
        self.embedding_dim = self.vit.config.hidden_size

    def forward(self, pixel_values):
        outputs = self.vit(pixel_values=pixel_values)
        # Use the [CLS] token embedding as the image representation
        embedding = outputs.last_hidden_state[:, 0]
        return nn.functional.normalize(embedding, p=2, dim=1)

# Usage
encoder = ViTImageEncoder()
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

image = Image.open("image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    embedding = encoder(**inputs)
    # embedding shape: [1, 768]

Training with Contrastive Loss

import torch
import torch.nn as nn
import torch.nn.functional as F

class TripletLoss(nn.Module):
    def __init__(self, margin=0.5):
        super().__init__()
        self.margin = margin

    def forward(self, anchor, positive, negative):
        # Distances between anchor-positive and anchor-negative pairs
        pos_dist = F.pairwise_distance(anchor, positive)
        neg_dist = F.pairwise_distance(anchor, negative)
        # Hinge on the margin: push negatives at least `margin` farther away than positives
        loss = F.relu(pos_dist - neg_dist + self.margin)
        return loss.mean()

# Training loop (dataloader is assumed to yield batches of
# (anchor, positive, negative) image tensors)
encoder = ImageEncoder()
encoder.train()
criterion = TripletLoss(margin=0.5)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

for anchor_img, positive_img, negative_img in dataloader:
    optimizer.zero_grad()

    anchor_emb = encoder(anchor_img)
    positive_emb = encoder(positive_img)
    negative_emb = encoder(negative_img)

    loss = criterion(anchor_emb, positive_emb, negative_emb)
    loss.backward()
    optimizer.step()

LangChain Integration

One way to embed images in LangChain is the OpenCLIPEmbeddings class from langchain_experimental (backed by the open_clip_torch package), which exposes an embed_image method over image file paths; FAISS.from_embeddings then indexes precomputed (text, vector) pairs. The model and checkpoint names below are illustrative.

from langchain_community.vectorstores import FAISS
from langchain.schema import Document
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# CLIP-based multimodal embeddings (requires open_clip_torch and pillow)
clip_embeddings = OpenCLIPEmbeddings(
    model_name="ViT-B-32",
    checkpoint="laion2b_s34b_b79k",
)

# Embed images directly from file paths
image_paths = [f"image_{i}.jpg" for i in range(100)]
image_vectors = clip_embeddings.embed_image(image_paths)

# Create a vector store from precomputed (text, embedding) pairs
documents = [Document(page_content=f"Image {i}") for i in range(100)]
vectorstore = FAISS.from_embeddings(
    text_embeddings=list(zip([doc.page_content for doc in documents], image_vectors)),
    embedding=clip_embeddings,
)

# Similarity search with a query image
query_vector = clip_embeddings.embed_image(["query.jpg"])[0]
results = vectorstore.similarity_search_by_vector(query_vector, k=5)

Key Concepts

Global Pooling: Aggregates spatial features into a fixed-size vector (a minimal sketch follows this list):

  • Average Pooling: $\mathbf{v} = \frac{1}{HW}\sum_{i,j} \mathbf{F}_{i,j}$
  • Max Pooling: $\mathbf{v} = \max_{i,j} \mathbf{F}_{i,j}$
  • Attention Pooling: $\mathbf{v} = \sum_{i,j} \alpha_{i,j} \mathbf{F}_{i,j}$ where $\alpha = \text{softmax}(\mathbf{W}\mathbf{F})$
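
A minimal sketch of the three pooling variants on a feature map of shape [B, C, H, W]; the attention weights here come from a single learned linear scoring layer, which is one simple choice among many:

import torch
import torch.nn as nn

B, C, H, W = 2, 256, 7, 7
feature_map = torch.randn(B, C, H, W)
tokens = feature_map.flatten(2).transpose(1, 2)   # [B, H*W, C]

# Average pooling: mean over all spatial positions
v_avg = tokens.mean(dim=1)                        # [B, C]

# Max pooling: elementwise max over spatial positions
v_max = tokens.max(dim=1).values                  # [B, C]

# Attention pooling: softmax-weighted sum with learned per-position scores
score = nn.Linear(C, 1)                           # one logit per spatial position
alpha = torch.softmax(score(tokens), dim=1)       # [B, H*W, 1], weights sum to 1
v_attn = (alpha * tokens).sum(dim=1)              # [B, C]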

Normalization: L2 normalization ensures embeddings lie on the unit hypersphere, making cosine similarity equivalent to dot product: $$\mathbf{v}_{\text{norm}} = \frac{\mathbf{v}}{||\mathbf{v}||_2}$$

Similarity Metrics:

  • Cosine Similarity: $\text{sim}(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{||\mathbf{v}_1|| \cdot ||\mathbf{v}_2||}$
  • Euclidean Distance: $d(\mathbf{v}_1, \mathbf{v}_2) = ||\mathbf{v}_1 - \mathbf{v}_2||_2$
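
A small numeric check of how these fit together: after L2 normalization, cosine similarity equals the dot product, and squared Euclidean distance reduces to $2 - 2\,\text{sim}$ for unit vectors:

import torch
import torch.nn.functional as F

v1 = F.normalize(torch.randn(512), dim=0)
v2 = F.normalize(torch.randn(512), dim=0)

cosine = F.cosine_similarity(v1, v2, dim=0)   # cosine similarity
dot = v1 @ v2                                 # equals cosine for unit vectors
euclidean = torch.dist(v1, v2)                # Euclidean (L2) distance

# For unit vectors: ||v1 - v2||^2 = 2 - 2 * cos(v1, v2)
assert torch.allclose(dot, cosine, atol=1e-6)
assert torch.allclose(euclidean ** 2, 2 - 2 * cosine, atol=1e-5)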

Related Snippets