AI/ML Interview Questions - Hard
Hard-level AI/ML interview questions covering advanced architectures, optimization, and theoretical concepts.
Q1: Implement attention mechanism from scratch.
Answer:
How It Works:
Attention lets the model focus on the relevant parts of the input when producing each output element.
Core Idea: Compute a weighted sum of the values, where the weights are determined by query-key similarity.
Formula: $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
where:
- $Q$ = Queries (what we're looking for)
- $K$ = Keys (what's available)
- $V$ = Values (actual content)
- $d_k$ = dimension of keys (for scaling)
PyTorch Implementation:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Args:
        Q: Queries (batch_size, num_heads, seq_len, d_k)
        K: Keys (batch_size, num_heads, seq_len, d_k)
        V: Values (batch_size, num_heads, seq_len, d_v)
        mask: Optional mask (batch_size, 1, 1, seq_len)

    Returns:
        output: (batch_size, num_heads, seq_len, d_v)
        attention_weights: (batch_size, num_heads, seq_len, seq_len)
    """
    d_k = Q.size(-1)

    # Calculate attention scores: Q @ K^T / sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    # Apply mask if provided (for padding or future tokens)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)

    # Apply attention to values
    output = torch.matmul(attention_weights, V)

    return output, attention_weights

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        """Split last dimension into (num_heads, d_k)"""
        batch_size, seq_len, d_model = x.size()
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)

        # Linear projections
        Q = self.W_q(Q)
        K = self.W_k(K)
        V = self.W_v(V)

        # Split into multiple heads
        Q = self.split_heads(Q)  # (batch, heads, seq_len, d_k)
        K = self.split_heads(K)
        V = self.split_heads(V)

        # Apply attention for each head
        output, attention_weights = scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads
        output = output.transpose(1, 2).contiguous()
        output = output.view(batch_size, -1, self.d_model)

        # Final linear projection
        output = self.W_o(output)

        return output, attention_weights

# Usage example
d_model = 512
num_heads = 8
seq_len = 10
batch_size = 2

mha = MultiHeadAttention(d_model, num_heads)

# Create sample input
x = torch.randn(batch_size, seq_len, d_model)

# Self-attention: Q, K, V are all the same
output, attn_weights = mha(x, x, x)

print(f"Output shape: {output.shape}")                    # (2, 10, 512)
print(f"Attention weights shape: {attn_weights.shape}")   # (2, 8, 10, 10)
```
Why It Works: Scaling by $\sqrt{d_k}$ keeps the dot-product scores from growing with the key dimension, which would otherwise push the softmax into regions with vanishing gradients. PyTorch's autograd backpropagates through every operation above, so the projection matrices learn useful attention patterns during training.
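For reference, PyTorch 2.0 and later also ship a fused built-in, `torch.nn.functional.scaled_dot_product_attention`. A minimal sanity check against the manual implementation, assuming the `scaled_dot_product_attention` function defined above is in scope:

```python
import torch
import torch.nn.functional as F

# Per-head shaped inputs: (batch, heads, seq_len, d_k)
Q = torch.randn(2, 8, 10, 64)
K = torch.randn(2, 8, 10, 64)
V = torch.randn(2, 8, 10, 64)

manual, _ = scaled_dot_product_attention(Q, K, V)   # manual version from above
fused = F.scaled_dot_product_attention(Q, K, V)     # built-in fused op (PyTorch >= 2.0)

# Should agree up to numerical tolerance
print(torch.allclose(manual, fused, atol=1e-5))
```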
Q2: Explain and implement Transformer architecture.
Answer:
How It Works:
The Transformer uses self-attention to process all positions of a sequence in parallel, unlike RNNs, which must process tokens one at a time.
Key Components:
- Multi-Head Attention: Multiple attention mechanisms in parallel
- Position Encoding: Add positional information (no recurrence)
- Feed-Forward Networks: Process each position independently
- Layer Normalization: Stabilize training
- Residual Connections: Help gradient flow
Architecture:
```text
Input → Embedding + Positional Encoding
        ↓
[Encoder Block] × N:
    - Multi-Head Self-Attention
    - Add & Norm
    - Feed-Forward Network
    - Add & Norm
        ↓
[Decoder Block] × N:
    - Masked Multi-Head Self-Attention
    - Add & Norm
    - Multi-Head Cross-Attention (with encoder output)
    - Add & Norm
    - Feed-Forward Network
    - Add & Norm
        ↓
Output Linear + Softmax
```
Positional Encoding: $$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) $$ $$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) $$
PyTorch Implementation:
```python
import torch
import torch.nn as nn
import math

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(TransformerBlock, self).__init__()

        # Multi-head attention
        self.attention = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)

        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output, _ = self.attention(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))

        return x

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()

        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)  # Add batch dimension
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        return x + self.pe[:, :x.size(1), :]

class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, num_layers=6, d_ff=2048, dropout=0.1):
        super(Transformer, self).__init__()

        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)

        self.encoder_layers = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        self.fc_out = nn.Linear(d_model, vocab_size)
        self.dropout = nn.Dropout(dropout)
        self.d_model = d_model

    def forward(self, x, mask=None):
        # Embedding + positional encoding
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)
        x = self.dropout(x)

        # Pass through encoder layers
        for layer in self.encoder_layers:
            x = layer(x, mask)

        # Output projection
        output = self.fc_out(x)
        return output

# Usage example
vocab_size = 10000
model = Transformer(vocab_size, d_model=512, num_heads=8, num_layers=6)

# Sample input (batch_size=2, seq_len=10)
input_ids = torch.randint(0, vocab_size, (2, 10))
output = model(input_ids)

print(f"Input shape: {input_ids.shape}")    # (2, 10)
print(f"Output shape: {output.shape}")      # (2, 10, 10000)

# Training example
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

# Training loop
model.train()
for epoch in range(10):
    optimizer.zero_grad()

    # Forward pass
    predictions = model(input_ids)

    # Reshape for loss calculation
    predictions = predictions.view(-1, vocab_size)
    targets = torch.randint(0, vocab_size, (2, 10)).view(-1)

    loss = criterion(predictions, targets)

    # Backward pass
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
```
Why Transformers Work:
- Parallel processing: Unlike RNNs, all positions processed simultaneously
- Long-range dependencies: Attention mechanism sees all positions at once
- Scalable: Can scale to billions of parameters efficiently
- PyTorch autograd: Handles complex backpropagation automatically
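The `Transformer` class above is encoder-style: it accepts an optional mask but never builds one. For autoregressive (decoder-style) training you would pass a causal mask so each position can only attend to earlier positions. A minimal sketch, assuming the `model` and `vocab_size` from the usage example above; `nn.MultiheadAttention` adds a float `attn_mask` of shape `(seq_len, seq_len)` to the attention scores before the softmax:

```python
import torch

seq_len = 10
# Upper-triangular -inf entries block attention to future positions
causal_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

input_ids = torch.randint(0, vocab_size, (2, seq_len))
logits = model(input_ids, mask=causal_mask)   # model from the example above
print(logits.shape)                            # (2, 10, vocab_size)
```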
Q3: Explain variational autoencoders (VAE) and implement one.
Answer:
How It Works:
A VAE learns a probabilistic latent representation: the encoder outputs the parameters of a latent distribution, a latent vector is sampled from it, and the decoder reconstructs the input from that sample.
Key Idea:
- Encoder outputs $\mu$ and $\sigma$ (mean and std of latent distribution)
- Sample from $\mathcal{N}(\mu, \sigma^2)$ using reparameterization trick
- Decoder reconstructs from sample
Loss Function: $$ \mathcal{L} = \mathcal{L}_{\text{reconstruction}} + \beta \cdot \mathcal{L}_{\text{KL}} $$
where:
- Reconstruction: How well we reconstruct input
- KL divergence: How close latent distribution is to $\mathcal{N}(0, 1)$
Reparameterization Trick: Instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ (not differentiable), do: $z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$
Implementation:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()

        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        # Latent space
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        recon_x = self.decode(z)
        return recon_x, mu, logvar

def vae_loss(recon_x, x, mu, logvar, beta=1.0):
    # Reconstruction loss (binary cross-entropy)
    recon_loss = F.binary_cross_entropy(recon_x, x, reduction='sum')

    # KL divergence
    # KL(N(mu, sigma^2) || N(0, 1)) = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon_loss + beta * kl_loss

# Training (assumes num_epochs is set and dataloader yields batches of
# flattened, [0, 1]-scaled 28x28 images, e.g. MNIST)
model = VAE(input_dim=784, hidden_dim=400, latent_dim=20)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for batch in dataloader:
        x = batch.view(-1, 784)

        recon_x, mu, logvar = model(x)
        loss = vae_loss(recon_x, x, mu, logvar)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
Why VAE vs. Regular Autoencoder:
- VAE: Smooth latent space; can generate new samples by decoding noise drawn from the prior (see the sketch below)
- AE: Just compression, latent space may have gaps
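Because the KL term pulls the latent distribution toward $\mathcal{N}(0, 1)$, new samples can be generated by decoding noise drawn from the prior. A minimal sketch, assuming the trained `model` from the training loop above (`latent_dim=20`, `input_dim=784`):

```python
import torch

model.eval()
with torch.no_grad():
    z = torch.randn(16, 20)       # 16 latent vectors sampled from the prior N(0, I)
    samples = model.decode(z)     # (16, 784), values in [0, 1] thanks to the Sigmoid
print(samples.shape)
```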
Q4: Explain Generative Adversarial Networks (GANs).
Answer:
How It Works:
Two networks compete:
- Generator: Creates fake samples
- Discriminator: Distinguishes real from fake
Training Process:
- Generator creates fake samples
- Discriminator tries to classify real vs. fake
- Generator tries to fool discriminator
- Both improve through adversarial training
Loss Functions:
Discriminator: $$ \max_D \, \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] $$
Generator: $$ \min_G \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] $$
Or equivalently (non-saturating loss): $$ \max_G \mathbb{E}_{z \sim p_z}[\log D(G(z))] $$
Implementation:
```python
import numpy as np
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim, img_shape):
        super().__init__()
        self.img_shape = img_shape

        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 256),
            nn.BatchNorm1d(256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 512),
            nn.BatchNorm1d(512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, int(np.prod(img_shape))),
            nn.Tanh()
        )

    def forward(self, z):
        img = self.model(z)
        return img.view(img.size(0), *self.img_shape)

class Discriminator(nn.Module):
    def __init__(self, img_shape):
        super().__init__()

        self.model = nn.Sequential(
            nn.Linear(int(np.prod(img_shape)), 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, img):
        img_flat = img.view(img.size(0), -1)
        validity = self.model(img_flat)
        return validity

# Training (assumes num_epochs is set and dataloader yields batches of real images)
generator = Generator(latent_dim=100, img_shape=(1, 28, 28))
discriminator = Discriminator(img_shape=(1, 28, 28))

optimizer_G = torch.optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))
optimizer_D = torch.optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.5, 0.999))

adversarial_loss = nn.BCELoss()

for epoch in range(num_epochs):
    for real_imgs in dataloader:
        batch_size = real_imgs.size(0)

        # Labels
        real_labels = torch.ones(batch_size, 1)
        fake_labels = torch.zeros(batch_size, 1)

        # Train Discriminator
        optimizer_D.zero_grad()

        real_loss = adversarial_loss(discriminator(real_imgs), real_labels)

        z = torch.randn(batch_size, 100)
        fake_imgs = generator(z)
        fake_loss = adversarial_loss(discriminator(fake_imgs.detach()), fake_labels)

        d_loss = (real_loss + fake_loss) / 2
        d_loss.backward()
        optimizer_D.step()

        # Train Generator
        optimizer_G.zero_grad()

        z = torch.randn(batch_size, 100)
        gen_imgs = generator(z)
        # Non-saturating loss: generator tries to make D label fakes as real
        g_loss = adversarial_loss(discriminator(gen_imgs), real_labels)

        g_loss.backward()
        optimizer_G.step()
```
Common Problems:
- Mode collapse: Generator produces limited variety
- Training instability: Hard to balance G and D
- Vanishing gradients: When D is too good
Solutions:
- Wasserstein GAN (WGAN)
- Spectral normalization (see the sketch below)
- Progressive growing
- StyleGAN architectures
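As a concrete example of one of these fixes, spectral normalization rescales each weight matrix by its largest singular value at every forward pass, constraining the discriminator's Lipschitz constant. A minimal sketch using PyTorch's built-in `nn.utils.spectral_norm`, written here as a variant discriminator rather than part of the implementation above:

```python
import torch.nn as nn

def sn_linear(in_features, out_features):
    # Wrap a linear layer so its weight is divided by its largest singular value
    return nn.utils.spectral_norm(nn.Linear(in_features, out_features))

sn_discriminator = nn.Sequential(
    nn.Flatten(),
    sn_linear(28 * 28, 512),
    nn.LeakyReLU(0.2),
    sn_linear(512, 256),
    nn.LeakyReLU(0.2),
    sn_linear(256, 1),  # drop the Sigmoid if pairing with a hinge or Wasserstein-style loss
)
```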
Q5: Implement beam search for sequence generation.
Answer:
How It Works:
Beam search keeps top-k most likely sequences at each step, exploring multiple paths.
Algorithm:
- Start with k beams (initially just start token)
- For each beam, generate all possible next tokens
- Score each candidate (beam score + token log probability)
- Keep top k candidates as new beams
- Repeat until all beams end or max length reached
Implementation:
```python
import torch
import torch.nn.functional as F

def beam_search(model, start_token, end_token, max_length, beam_width=5, device='cpu'):
    """
    Args:
        model: Sequence generation model returning logits (batch, seq_len, vocab)
        start_token: Starting token ID
        end_token: End token ID
        max_length: Maximum sequence length
        beam_width: Number of beams to keep

    Returns:
        best_sequence: Most likely sequence
        best_score: Log probability of best sequence
    """
    # Active beams: (score, sequence, is_complete); finished beams: (score, sequence)
    beams = [(0.0, [start_token], False)]
    completed_beams = []

    for step in range(max_length):
        candidates = []

        for score, sequence, is_complete in beams:
            if is_complete:
                completed_beams.append((score, sequence))
                continue

            # Get model predictions for current sequence
            input_ids = torch.tensor([sequence]).to(device)
            with torch.no_grad():
                logits = model(input_ids)
                log_probs = F.log_softmax(logits[:, -1, :], dim=-1)

            # Get top k tokens
            top_log_probs, top_indices = torch.topk(log_probs, beam_width)

            # Create candidates
            for log_prob, token_id in zip(top_log_probs[0], top_indices[0]):
                new_score = score + log_prob.item()
                new_sequence = sequence + [token_id.item()]
                is_end = (token_id.item() == end_token)

                candidates.append((new_score, new_sequence, is_end))

        # Stop if every beam has already finished
        if not candidates:
            break

        # Keep top beam_width candidates (normalize score by length)
        candidates.sort(key=lambda x: x[0] / len(x[1]), reverse=True)
        beams = candidates[:beam_width]

        # Stop early if all surviving beams are complete
        if all(is_complete for _, _, is_complete in beams):
            break

    # Add any remaining beams
    completed_beams.extend((score, sequence) for score, sequence, _ in beams)

    # Return best sequence by length-normalized score
    best_score, best_sequence = max(completed_beams, key=lambda x: x[0] / len(x[1]))
    return best_sequence, best_score

# Alternative: Batch beam search (more efficient)
def batch_beam_search(model, start_tokens, end_token, max_length, beam_width=5):
    # Note: end_token is kept for API symmetry; early stopping on end-of-sequence
    # is omitted here for brevity.
    batch_size = start_tokens.size(0)

    # Initialize: (batch_size * beam_width, seq_len)
    sequences = start_tokens.unsqueeze(1).repeat(1, beam_width, 1)
    sequences = sequences.view(batch_size * beam_width, -1)

    scores = torch.zeros(batch_size, beam_width)
    scores[:, 1:] = float('-inf')  # Only first beam is active initially

    for step in range(max_length):
        # Get predictions
        logits = model(sequences)
        log_probs = F.log_softmax(logits[:, -1, :], dim=-1)

        # Add to beam scores
        log_probs = log_probs.view(batch_size, beam_width, -1)
        scores = scores.unsqueeze(-1) + log_probs

        # Flatten and get top k
        scores_flat = scores.view(batch_size, -1)
        top_scores, top_indices = torch.topk(scores_flat, beam_width, dim=-1)

        # Convert flat indices to (beam_idx, token_idx)
        beam_indices = torch.div(top_indices, log_probs.size(-1), rounding_mode='floor')
        token_indices = top_indices % log_probs.size(-1)

        # Update sequences: select parent beams, then append the new tokens
        sequences = sequences.view(batch_size, beam_width, -1)
        sequences = torch.gather(
            sequences,
            1,
            beam_indices.unsqueeze(-1).expand(-1, -1, sequences.size(-1))
        )
        sequences = torch.cat([sequences, token_indices.unsqueeze(-1)], dim=-1)
        sequences = sequences.view(batch_size * beam_width, -1)

        scores = top_scores

    # Return best sequences
    best_scores, best_indices = scores.max(dim=-1)
    best_sequences = sequences.view(batch_size, beam_width, -1)
    best_sequences = torch.gather(
        best_sequences,
        1,
        best_indices.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, best_sequences.size(-1))
    ).squeeze(1)

    return best_sequences, best_scores
```
Beam Search vs. Greedy:
- Greedy: Always pick most likely token (fast, but suboptimal)
- Beam: Explore multiple paths (better quality, slower)
Typical beam widths: 5-10 for translation, 1-3 for chatbots
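For comparison, a minimal greedy decoder under the same assumptions as `beam_search` above (the model returns logits of shape `(batch, seq_len, vocab)`):

```python
import torch

def greedy_search(model, start_token, end_token, max_length, device='cpu'):
    """Always take the single most likely next token (equivalent to beam_width=1)."""
    sequence = [start_token]
    for _ in range(max_length):
        input_ids = torch.tensor([sequence]).to(device)
        with torch.no_grad():
            logits = model(input_ids)
        next_token = logits[0, -1, :].argmax().item()
        sequence.append(next_token)
        if next_token == end_token:
            break
    return sequence
```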
Summary
Hard AI/ML topics require deep understanding of:
- Attention mechanisms: Core of modern NLP
- Transformers: Architecture powering GPT, BERT
- Generative models: VAE, GAN for creating new data
- Search algorithms: Beam search for sequence generation
- Optimization: Advanced training techniques
Key Skills:
- Implement from scratch (not just use libraries)
- Understand mathematical foundations
- Debug training issues
- Optimize for production
Related Snippets
- AI/ML Interview Questions - Easy: Easy-level AI/ML interview questions with LangChain examples and Mermaid …
- AI/ML Interview Questions - Medium: Medium-level AI/ML interview questions covering neural networks, ensemble …
- LLM/Agentic AI Interview Questions - Easy: Easy-level LLM and Agentic AI interview questions covering fundamentals, …
- LLM/Agentic AI Interview Questions - Hard: Hard-level LLM and Agentic AI interview questions covering multi-agent systems, …
- LLM/Agentic AI Interview Questions - Medium: Medium-level LLM and Agentic AI interview questions covering agent …