AI/ML Interview Questions - Hard

Hard-level AI/ML interview questions covering advanced architectures, optimization, and theoretical concepts.

Q1: Implement an attention mechanism from scratch.

Answer:

How It Works:

Attention allows the model to focus on the relevant parts of the input when producing each part of the output.

Core Idea: Compute weighted sum of values, where weights are determined by query-key similarity.

Formula: $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

where:

  • $Q$ = Queries (what we're looking for)
  • $K$ = Keys (what's available)
  • $V$ = Values (actual content)
  • $d_k$ = dimension of keys (for scaling)

PyTorch Implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Args:
        Q: Queries (batch_size, num_heads, seq_len, d_k)
        K: Keys (batch_size, num_heads, seq_len, d_k)
        V: Values (batch_size, num_heads, seq_len, d_v)
        mask: Optional mask (batch_size, 1, 1, seq_len)

    Returns:
        output: (batch_size, num_heads, seq_len, d_v)
        attention_weights: (batch_size, num_heads, seq_len, seq_len)
    """
    d_k = Q.size(-1)

    # Calculate attention scores: Q @ K^T / sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    # Apply mask if provided (for padding or future tokens)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)

    # Apply attention to values
    output = torch.matmul(attention_weights, V)

    return output, attention_weights

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        """Split last dimension into (num_heads, d_k)"""
        batch_size, seq_len, d_model = x.size()
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)

        # Linear projections
        Q = self.W_q(Q)
        K = self.W_k(K)
        V = self.W_v(V)

        # Split into multiple heads
        Q = self.split_heads(Q)  # (batch, heads, seq_len, d_k)
        K = self.split_heads(K)
        V = self.split_heads(V)

        # Apply attention for each head
        output, attention_weights = scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads
        output = output.transpose(1, 2).contiguous()
        output = output.view(batch_size, -1, self.d_model)

        # Final linear projection
        output = self.W_o(output)

        return output, attention_weights

# Usage example
d_model = 512
num_heads = 8
seq_len = 10
batch_size = 2

mha = MultiHeadAttention(d_model, num_heads)

# Create sample input
x = torch.randn(batch_size, seq_len, d_model)

# Self-attention: Q, K, V are all the same
output, attn_weights = mha(x, x, x)

print(f"Output shape: {output.shape}")  # (2, 10, 512)
print(f"Attention weights shape: {attn_weights.shape}")  # (2, 8, 10, 10)

Why It Works: Every step of the attention computation (linear projections, matrix multiplications, scaling, softmax) is differentiable, so PyTorch's autograd handles backpropagation through it automatically, and the projection weights learn useful attention patterns during training.
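
For autoregressive (decoder-style) models, a causal mask keeps each position from attending to future tokens. Below is a minimal sketch using the MultiHeadAttention module and the tensors from the usage example above; the 0/1 lower-triangular layout simply follows the mask == 0 convention in scaled_dot_product_attention:

# Causal (look-ahead) mask: 1 where attention is allowed, 0 where it is blocked.
# Broadcasts against the (batch, heads, seq_len, seq_len) score tensor.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Masked self-attention: each position attends only to itself and earlier positions
masked_output, masked_weights = mha(x, x, x, mask=causal_mask)
print(masked_weights[0, 0])  # entries above the diagonal are ~0, rows still sum to 1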


Q2: Explain and implement the Transformer architecture.

Answer:

How It Works:

The Transformer uses self-attention to process all positions of a sequence in parallel, unlike RNNs, which process tokens one step at a time.

Key Components:

  1. Multi-Head Attention: Multiple attention mechanisms in parallel
  2. Position Encoding: Add positional information (no recurrence)
  3. Feed-Forward Networks: Process each position independently
  4. Layer Normalization: Stabilize training
  5. Residual Connections: Help gradient flow

Architecture:

Input → Embedding + Positional Encoding

[Encoder Block] × N:
  - Multi-Head Self-Attention
  - Add & Norm
  - Feed-Forward Network
  - Add & Norm

[Decoder Block] × N:
  - Masked Multi-Head Self-Attention
  - Add & Norm
  - Multi-Head Cross-Attention (with encoder output)
  - Add & Norm
  - Feed-Forward Network
  - Add & Norm

Output Linear + Softmax

Positional Encoding: $$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) $$ $$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) $$

PyTorch Implementation:

import torch
import torch.nn as nn
import math

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(TransformerBlock, self).__init__()

        # Multi-head attention
        self.attention = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)

        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output, _ = self.attention(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))

        return x

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()

        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)  # Add batch dimension
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        return x + self.pe[:, :x.size(1), :]

class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, num_layers=6, d_ff=2048, dropout=0.1):
        super(Transformer, self).__init__()

        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)

        self.encoder_layers = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        self.fc_out = nn.Linear(d_model, vocab_size)
        self.dropout = nn.Dropout(dropout)
        self.d_model = d_model

    def forward(self, x, mask=None):
        # Embedding + positional encoding
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)
        x = self.dropout(x)

        # Pass through encoder layers
        for layer in self.encoder_layers:
            x = layer(x, mask)

        # Output projection
        output = self.fc_out(x)
        return output

# Usage example
vocab_size = 10000
model = Transformer(vocab_size, d_model=512, num_heads=8, num_layers=6)

# Sample input (batch_size=2, seq_len=10)
input_ids = torch.randint(0, vocab_size, (2, 10))
output = model(input_ids)

print(f"Input shape: {input_ids.shape}")   # (2, 10)
print(f"Output shape: {output.shape}")     # (2, 10, 10000)

# Training example
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

# Training loop (random targets are used purely as placeholder data)
model.train()
for epoch in range(10):
    optimizer.zero_grad()

    # Forward pass
    predictions = model(input_ids)

    # Reshape for loss calculation
    predictions = predictions.view(-1, vocab_size)
    targets = torch.randint(0, vocab_size, (2, 10)).view(-1)

    loss = criterion(predictions, targets)

    # Backward pass
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

Why Transformers Work:

  • Parallel processing: Unlike RNNs, all positions processed simultaneously
  • Long-range dependencies: Attention mechanism sees all positions at once
  • Scalable: Can scale to billions of parameters efficiently
  • PyTorch autograd: Handles complex backpropagation automatically

Q3: Explain variational autoencoders (VAE) and implement one.

Answer:

How It Works:

A VAE learns a probabilistic latent representation: the encoder maps the input to the parameters of a latent distribution, a latent vector is sampled from that distribution, and the decoder reconstructs the input from the sample.

Key Idea:

  • Encoder outputs $\mu$ and $\sigma$ (mean and std of latent distribution)
  • Sample from $\mathcal{N}(\mu, \sigma^2)$ using reparameterization trick
  • Decoder reconstructs from sample

Loss Function: $$ \mathcal{L} = \mathcal{L}_{\text{reconstruction}} + \beta \cdot \mathcal{L}_{\text{KL}} $$

where:

  • Reconstruction: How well we reconstruct input
  • KL divergence: How close latent distribution is to $\mathcal{N}(0, 1)$

Reparameterization Trick: Instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ (not differentiable), do: $z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$
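
A tiny, self-contained sketch (illustrative values only) showing that the reparameterized sample keeps gradients flowing to the distribution parameters:

import torch

mu = torch.zeros(3, requires_grad=True)
logvar = torch.zeros(3, requires_grad=True)

# The randomness lives in eps, so z is a differentiable function of mu and logvar
eps = torch.randn(3)
z = mu + torch.exp(0.5 * logvar) * eps

z.sum().backward()
print(mu.grad)      # tensor([1., 1., 1.]) - gradient reaches mu directly
print(logvar.grad)  # proportional to eps - gradient reaches logvar via the std term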

Implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()

        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        # Latent space
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        recon_x = self.decode(z)
        return recon_x, mu, logvar

def vae_loss(recon_x, x, mu, logvar, beta=1.0):
    # Reconstruction loss (binary cross-entropy)
    recon_loss = F.binary_cross_entropy(recon_x, x, reduction='sum')

    # KL divergence
    # KL(N(mu, sigma^2) || N(0, 1)) = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon_loss + beta * kl_loss

# Training (assumes `dataloader` yields batches of flattened 28x28 images scaled to [0, 1])
model = VAE(input_dim=784, hidden_dim=400, latent_dim=20)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
num_epochs = 10

for epoch in range(num_epochs):
    for batch in dataloader:
        x = batch.view(-1, 784)

        recon_x, mu, logvar = model(x)
        loss = vae_loss(recon_x, x, mu, logvar)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Why VAE vs. Regular Autoencoder:

  • VAE: Smooth latent space, can generate new samples (see the sampling sketch below)
  • AE: Just compression, latent space may have gaps
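
Because the KL term pulls the latent distribution toward $\mathcal{N}(0, 1)$, new samples can be generated by decoding random noise. A minimal sketch, assuming model is the trained VAE from the training snippet above (latent_dim=20, input_dim=784):

# Sample new data points from the prior and decode them
with torch.no_grad():
    z = torch.randn(16, 20)                   # 16 random latent vectors
    generated = model.decode(z)               # (16, 784), values in [0, 1]
    images = generated.view(-1, 1, 28, 28)    # reshape for viewing as images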

Q4: Explain Generative Adversarial Networks (GANs).

Answer:

How It Works:

Two networks compete:

  • Generator: Creates fake samples
  • Discriminator: Distinguishes real from fake

Training Process:

  1. Generator creates fake samples
  2. Discriminator tries to classify real vs. fake
  3. Generator tries to fool discriminator
  4. Both improve through adversarial training

Loss Functions:

Discriminator: $$ \max_D \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] $$

Generator: $$ \min_G \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] $$

Or equivalently (non-saturating loss): $$ \max_G \mathbb{E}_{z \sim p_z}[\log D(G(z))] $$

Implementation:

import torch
import torch.nn as nn
import numpy as np

class Generator(nn.Module):
    def __init__(self, latent_dim, img_shape):
        super().__init__()
        self.img_shape = img_shape

        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 256),
            nn.BatchNorm1d(256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 512),
            nn.BatchNorm1d(512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, int(np.prod(img_shape))),
            nn.Tanh()
        )

    def forward(self, z):
        img = self.model(z)
        return img.view(img.size(0), *self.img_shape)

class Discriminator(nn.Module):
    def __init__(self, img_shape):
        super().__init__()

        self.model = nn.Sequential(
            nn.Linear(int(np.prod(img_shape)), 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, img):
        img_flat = img.view(img.size(0), -1)
        validity = self.model(img_flat)
        return validity

# Training (assumes `dataloader` yields batches of images scaled to [-1, 1])
generator = Generator(latent_dim=100, img_shape=(1, 28, 28))
discriminator = Discriminator(img_shape=(1, 28, 28))

optimizer_G = torch.optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))
optimizer_D = torch.optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.5, 0.999))

adversarial_loss = nn.BCELoss()
num_epochs = 50

for epoch in range(num_epochs):
    for real_imgs in dataloader:
        batch_size = real_imgs.size(0)

        # Labels
        real_labels = torch.ones(batch_size, 1)
        fake_labels = torch.zeros(batch_size, 1)

        # Train Discriminator
        optimizer_D.zero_grad()

        real_loss = adversarial_loss(discriminator(real_imgs), real_labels)

        z = torch.randn(batch_size, 100)
        fake_imgs = generator(z)
        fake_loss = adversarial_loss(discriminator(fake_imgs.detach()), fake_labels)

        d_loss = (real_loss + fake_loss) / 2
        d_loss.backward()
        optimizer_D.step()

        # Train Generator (non-saturating loss: try to make D predict "real")
        optimizer_G.zero_grad()

        z = torch.randn(batch_size, 100)
        gen_imgs = generator(z)
        g_loss = adversarial_loss(discriminator(gen_imgs), real_labels)

        g_loss.backward()
        optimizer_G.step()

Common Problems:

  • Mode collapse: Generator produces limited variety
  • Training instability: Hard to balance G and D
  • Vanishing gradients: When D is too good

Solutions:

  • Wasserstein GAN (WGAN)
  • Spectral normalization (see the sketch after this list)
  • Progressive growing
  • StyleGAN architectures
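
As one concrete example, spectral normalization constrains each discriminator layer's largest singular value and can be added with PyTorch's built-in utility. A minimal sketch mirroring the Discriminator above (28×28 images flattened to 784):

import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectrally normalized discriminator: each linear layer's weight matrix is
# rescaled by its largest singular value, which helps keep gradients well-behaved.
sn_discriminator = nn.Sequential(
    nn.Flatten(),
    spectral_norm(nn.Linear(784, 512)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(512, 256)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 1)),
    nn.Sigmoid()
)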

Q5: Implement beam search for sequence generation.

Answer:

How It Works:

Beam search keeps the top-k most likely partial sequences (beams) at each step, exploring multiple decoding paths instead of committing to a single one.

Algorithm:

  1. Start with k beams (initially just start token)
  2. For each beam, generate all possible next tokens
  3. Score each candidate (beam score + token log probability)
  4. Keep top k candidates as new beams
  5. Repeat until all beams end or max length reached

Implementation:

import torch
import torch.nn.functional as F

def beam_search(model, start_token, end_token, max_length, beam_width=5, device='cpu'):
    """
    Args:
        model: Sequence generation model
        start_token: Starting token ID
        end_token: End token ID
        max_length: Maximum sequence length
        beam_width: Number of beams to keep

    Returns:
        best_sequence: Most likely sequence
        best_score: Log probability of best sequence
    """
    # Initialize beams: (score, sequence, is_complete)
    beams = [(0.0, [start_token], False)]
    completed_beams = []

    for step in range(max_length):
        candidates = []

        for score, sequence, is_complete in beams:
            if is_complete:
                completed_beams.append((score, sequence, True))
                continue

            # Get model predictions for current sequence
            input_ids = torch.tensor([sequence]).to(device)
            with torch.no_grad():
                logits = model(input_ids)
                log_probs = F.log_softmax(logits[:, -1, :], dim=-1)

            # Get top k tokens
            top_log_probs, top_indices = torch.topk(log_probs, beam_width)

            # Create candidates
            for log_prob, token_id in zip(top_log_probs[0], top_indices[0]):
                new_score = score + log_prob.item()
                new_sequence = sequence + [token_id.item()]
                is_end = (token_id.item() == end_token)

                candidates.append((new_score, new_sequence, is_end))

        # Keep top beam_width candidates (scores normalized by length)
        candidates.sort(key=lambda x: x[0] / len(x[1]), reverse=True)
        beams = candidates[:beam_width]

        # Stop early once every surviving beam has produced the end token
        if all(is_complete for _, _, is_complete in beams):
            break

    # Add remaining beams (finished or not)
    completed_beams.extend(beams)

    # Return the best sequence by length-normalized score
    best_score, best_sequence, _ = max(completed_beams, key=lambda x: x[0] / len(x[1]))
    return best_sequence, best_score

# Alternative: Batch beam search (more efficient; end-token handling omitted for brevity)
def batch_beam_search(model, start_tokens, end_token, max_length, beam_width=5):
    batch_size = start_tokens.size(0)

    # Initialize: (batch_size * beam_width, seq_len)
    sequences = start_tokens.unsqueeze(1).repeat(1, beam_width, 1)
    sequences = sequences.view(batch_size * beam_width, -1)

    scores = torch.zeros(batch_size, beam_width)
    scores[:, 1:] = float('-inf')  # Only first beam is active initially

    for step in range(max_length):
        # Get predictions
        logits = model(sequences)
        log_probs = F.log_softmax(logits[:, -1, :], dim=-1)

        # Add to beam scores
        log_probs = log_probs.view(batch_size, beam_width, -1)
        scores = scores.unsqueeze(-1) + log_probs

        # Flatten and get top k
        scores_flat = scores.view(batch_size, -1)
        top_scores, top_indices = torch.topk(scores_flat, beam_width, dim=-1)

        # Convert flat indices to (beam_idx, token_idx)
        beam_indices = top_indices // log_probs.size(-1)
        token_indices = top_indices % log_probs.size(-1)

        # Update sequences
        sequences = sequences.view(batch_size, beam_width, -1)
        sequences = torch.gather(
            sequences,
            1,
            beam_indices.unsqueeze(-1).expand(-1, -1, sequences.size(-1))
        )
        sequences = torch.cat([sequences, token_indices.unsqueeze(-1)], dim=-1)
        sequences = sequences.view(batch_size * beam_width, -1)

        scores = top_scores

    # Return best sequences
    best_scores, best_indices = scores.max(dim=-1)
    best_sequences = sequences.view(batch_size, beam_width, -1)
    best_sequences = torch.gather(
        best_sequences,
        1,
        best_indices.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, best_sequences.size(-1))
    ).squeeze(1)

    return best_sequences, best_scores

Beam Search vs. Greedy:

  • Greedy: Always pick most likely token (fast, but suboptimal)
  • Beam: Explore multiple paths (better quality, slower)

Typical beam widths: 5-10 for translation, 1-3 for chatbots
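
For comparison, a minimal greedy decoding loop (beam width 1), assuming the model returns logits of shape (batch, seq_len, vocab_size) as in the beam search code above:

import torch

def greedy_decode(model, start_token, end_token, max_length, device='cpu'):
    """Always append the single most likely next token."""
    sequence = [start_token]
    model.eval()
    with torch.no_grad():
        for _ in range(max_length):
            input_ids = torch.tensor([sequence]).to(device)
            logits = model(input_ids)                   # (1, seq_len, vocab_size)
            next_token = logits[0, -1].argmax().item()  # pick only the top token
            sequence.append(next_token)
            if next_token == end_token:
                break
    return sequence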


Summary

Hard AI/ML topics require deep understanding of:

  • Attention mechanisms: Core of modern NLP
  • Transformers: Architecture powering GPT, BERT
  • Generative models: VAE, GAN for creating new data
  • Search algorithms: Beam search for sequence generation
  • Optimization: Advanced training techniques

Key Skills:

  • Implement from scratch (not just use libraries)
  • Understand mathematical foundations
  • Debug training issues
  • Optimize for production
