Image to Vector Embeddings

Image embeddings convert visual content into dense vector representations that capture semantic and visual features, enabling similarity search, classification, and retrieval.

Core Idea

Image embeddings map images to fixed-size vectors in a high-dimensional space where semantically similar images are close together. The embedding function $E: \mathcal{I} \rightarrow \mathbb{R}^d$ transforms an image $I \in \mathcal{I}$ into a vector $\mathbf{v} \in \mathbb{R}^d$.

Mathematical Foundation

Inference (Forward Pass):

For a CNN-based encoder (a shape-level sketch of the pooling step follows the definitions below): $$\mathbf{v} = E(I) = \text{GlobalPool}(\text{CNN}(I))$$

For a Vision Transformer (ViT): $$\mathbf{v} = E(I) = \text{CLS}_{\text{pool}}(\text{ViT}(\text{PatchEmbed}(I)))$$

where:

  • $I \in \mathbb{R}^{H \times W \times C}$ is the input image
  • $\mathbf{v} \in \mathbb{R}^d$ is the output embedding vector
  • $d$ is the embedding dimension (typically 512, 768, or 1024)
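
The following shape-level sketch illustrates the GlobalPool step in isolation; it uses a dummy feature map rather than a real backbone, so the tensor sizes are purely illustrative:

import torch
import torch.nn as nn

# Dummy CNN feature map: batch of 2, 2048 channels on a 7x7 spatial grid
feature_map = torch.randn(2, 2048, 7, 7)

# Global average pooling collapses the spatial grid to 1x1
global_pool = nn.AdaptiveAvgPool2d(1)
pooled = global_pool(feature_map)          # [2, 2048, 1, 1]
embedding = pooled.flatten(start_dim=1)    # [2, 2048] -- v = GlobalPool(CNN(I))

print(embedding.shape)  # torch.Size([2, 2048])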

Training Objective:

Contrastive learning with triplet loss: $$\mathcal{L}_{\text{triplet}} = \max(0, d(\mathbf{v}_a, \mathbf{v}_p) - d(\mathbf{v}_a, \mathbf{v}_n) + \alpha)$$

where:

  • $\mathbf{v}_a$ is the anchor embedding
  • $\mathbf{v}_p$ is the positive (similar) embedding
  • $\mathbf{v}_n$ is the negative (dissimilar) embedding
  • $d(\cdot, \cdot)$ is the distance metric (e.g., Euclidean or cosine)
  • $\alpha$ is the margin hyperparameter

Alternative: InfoNCE Loss (used in CLIP; a PyTorch sketch follows the definitions below): $$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_j) / \tau)}$$

where:

  • $\mathbf{v}_i$ is the image embedding
  • $\mathbf{t}_i$ is the paired text embedding
  • $\text{sim}(\cdot, \cdot)$ is cosine similarity
  • $\tau$ is the temperature parameter
  • $N$ is the batch size
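
A minimal PyTorch sketch of this objective, assuming already-computed image and text embeddings of shape [N, d]; the function name and temperature value are illustrative:

import torch
import torch.nn.functional as F

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    # Cosine similarity reduces to a matrix product once both sides are L2-normalized
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.t() / temperature   # [N, N], entries sim(v_i, t_j) / tau
    # The matching text for image i sits at index i, so the targets are the diagonal
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings (N = 8, d = 512)
loss = info_nce_loss(torch.randn(8, 512), torch.randn(8, 512))

CLIP itself trains with the symmetric version of this objective, averaging the image-to-text loss above with the analogous text-to-image term (cross-entropy over the transposed logits).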

Architecture Overview

Regardless of backbone, the pipeline is the same: run the image through a CNN or ViT, pool the resulting features into a single vector, optionally apply a projection head, and L2-normalize the output. The implementations below follow this pattern.

PyTorch Implementation

ResNet-based Image Encoder

import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

class ImageEncoder(nn.Module):
    def __init__(self, embedding_dim=512):
        super().__init__()
        # Use a pretrained ResNet-50 as the backbone
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Remove the final classification layer, keeping everything up to global pooling
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # Projection head maps backbone features into the embedding space
        self.projection = nn.Sequential(
            nn.Linear(2048, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim)
        )

    def forward(self, x):
        # Extract features: [B, 2048, 1, 1]
        features = self.backbone(x)
        # Flatten: [B, 2048]
        features = features.view(features.size(0), -1)
        # Project to embedding space: [B, embedding_dim]
        embedding = self.projection(features)
        # L2-normalize so cosine similarity equals the dot product
        return nn.functional.normalize(embedding, p=2, dim=1)

# Usage
encoder = ImageEncoder(embedding_dim=512)
encoder.eval()

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

image = Image.open("image.jpg").convert("RGB")
image_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    embedding = encoder(image_tensor)
    # embedding shape: [1, 512]

Vision Transformer (ViT) Encoder

import torch
import torch.nn as nn
from PIL import Image
from transformers import ViTModel, ViTImageProcessor

class ViTImageEncoder(nn.Module):
    def __init__(self, model_name="google/vit-base-patch16-224"):
        super().__init__()
        self.vit = ViTModel.from_pretrained(model_name)
        self.embedding_dim = self.vit.config.hidden_size

    def forward(self, pixel_values):
        outputs = self.vit(pixel_values=pixel_values)
        # Use the [CLS] token embedding as the image representation
        embedding = outputs.last_hidden_state[:, 0]
        return nn.functional.normalize(embedding, p=2, dim=1)

# Usage
encoder = ViTImageEncoder()
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

image = Image.open("image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    embedding = encoder(**inputs)
    # embedding shape: [1, 768]

Training with Contrastive Loss

import torch
import torch.nn as nn
import torch.nn.functional as F

class TripletLoss(nn.Module):
    def __init__(self, margin=0.5):
        super().__init__()
        self.margin = margin

    def forward(self, anchor, positive, negative):
        # Distances between anchor-positive and anchor-negative pairs
        pos_dist = F.pairwise_distance(anchor, positive)
        neg_dist = F.pairwise_distance(anchor, negative)
        # Hinge on the margin: push negatives at least `margin` farther away than positives
        loss = F.relu(pos_dist - neg_dist + self.margin)
        return loss.mean()

# Training loop (dataloader is assumed to yield batches of
# (anchor, positive, negative) image tensors)
encoder = ImageEncoder()
encoder.train()
criterion = TripletLoss(margin=0.5)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

for anchor_img, positive_img, negative_img in dataloader:
    optimizer.zero_grad()

    anchor_emb = encoder(anchor_img)
    positive_emb = encoder(positive_img)
    negative_emb = encoder(negative_img)

    loss = criterion(anchor_emb, positive_emb, negative_emb)
    loss.backward()
    optimizer.step()

LangChain Integration

One way to embed images in LangChain is the OpenCLIPEmbeddings class from langchain_experimental (backed by the open_clip_torch package), which exposes an embed_image method over image file paths; FAISS.from_embeddings then indexes precomputed (text, vector) pairs. The model and checkpoint names below are illustrative.

from langchain_community.vectorstores import FAISS
from langchain.schema import Document
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# CLIP-based multimodal embeddings (requires open_clip_torch and pillow)
clip_embeddings = OpenCLIPEmbeddings(
    model_name="ViT-B-32",
    checkpoint="laion2b_s34b_b79k",
)

# Embed images directly from file paths
image_paths = [f"image_{i}.jpg" for i in range(100)]
image_vectors = clip_embeddings.embed_image(image_paths)

# Create a vector store from precomputed (text, embedding) pairs
documents = [Document(page_content=f"Image {i}") for i in range(100)]
vectorstore = FAISS.from_embeddings(
    text_embeddings=list(zip([doc.page_content for doc in documents], image_vectors)),
    embedding=clip_embeddings,
)

# Similarity search with a query image
query_vector = clip_embeddings.embed_image(["query.jpg"])[0]
results = vectorstore.similarity_search_by_vector(query_vector, k=5)

Key Concepts

Global Pooling: Aggregates spatial features into a fixed-size vector (a minimal sketch follows this list):

  • Average Pooling: $\mathbf{v} = \frac{1}{HW}\sum_{i,j} \mathbf{F}_{i,j}$
  • Max Pooling: $\mathbf{v} = \max_{i,j} \mathbf{F}_{i,j}$
  • Attention Pooling: $\mathbf{v} = \sum_{i,j} \alpha_{i,j} \mathbf{F}_{i,j}$ where $\alpha = \text{softmax}(\mathbf{W}\mathbf{F})$
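
A minimal sketch of the three pooling variants on a feature map of shape [B, C, H, W]; the attention weights here come from a single learned linear scoring layer, which is one simple choice among many:

import torch
import torch.nn as nn

B, C, H, W = 2, 256, 7, 7
feature_map = torch.randn(B, C, H, W)
tokens = feature_map.flatten(2).transpose(1, 2)   # [B, H*W, C]

# Average pooling: mean over all spatial positions
v_avg = tokens.mean(dim=1)                        # [B, C]

# Max pooling: elementwise max over spatial positions
v_max = tokens.max(dim=1).values                  # [B, C]

# Attention pooling: softmax-weighted sum with learned per-position scores
score = nn.Linear(C, 1)                           # one logit per spatial position
alpha = torch.softmax(score(tokens), dim=1)       # [B, H*W, 1], weights sum to 1
v_attn = (alpha * tokens).sum(dim=1)              # [B, C]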

Normalization: L2 normalization ensures embeddings lie on the unit hypersphere, making cosine similarity equivalent to dot product: $$\mathbf{v}_{\text{norm}} = \frac{\mathbf{v}}{||\mathbf{v}||_2}$$

Similarity Metrics:

  • Cosine Similarity: $\text{sim}(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{||\mathbf{v}_1|| \cdot ||\mathbf{v}_2||}$
  • Euclidean Distance: $d(\mathbf{v}_1, \mathbf{v}_2) = ||\mathbf{v}_1 - \mathbf{v}_2||_2$
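
A small numeric check of how these fit together: after L2 normalization, cosine similarity equals the dot product, and squared Euclidean distance reduces to $2 - 2\,\text{sim}$ for unit vectors:

import torch
import torch.nn.functional as F

v1 = F.normalize(torch.randn(512), dim=0)
v2 = F.normalize(torch.randn(512), dim=0)

cosine = F.cosine_similarity(v1, v2, dim=0)   # cosine similarity
dot = v1 @ v2                                 # equals cosine for unit vectors
euclidean = torch.dist(v1, v2)                # Euclidean (L2) distance

# For unit vectors: ||v1 - v2||^2 = 2 - 2 * cos(v1, v2)
assert torch.allclose(dot, cosine, atol=1e-6)
assert torch.allclose(euclidean ** 2, 2 - 2 * cosine, atol=1e-5)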

Related Snippets