AI/ML Interview Questions - Medium
Medium-level AI/ML interview questions covering neural networks, ensemble methods, and advanced concepts.
Q1: Explain backpropagation in neural networks.
Answer:
How It Works:
Backpropagation is the algorithm for training neural networks by computing gradients of the loss with respect to weights.
Forward Pass:
- Input flows through network
- Each layer applies: $z = Wx + b$, then activation $a = \sigma(z)$
- Final layer produces prediction
- Calculate loss: $L = \text{loss}(y_{\text{pred}}, y_{\text{true}})$
Backward Pass (Chain Rule):
- Start from output layer
- Calculate gradient of loss w.r.t. output: $\frac{\partial L}{\partial a^{(L)}}$
- Propagate backwards using chain rule: $$ \frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial z^{(l)}} \cdot \frac{\partial z^{(l)}}{\partial W^{(l)}} $$
Update Weights: $$ W^{(l)} = W^{(l)} - \alpha \frac{\partial L}{\partial W^{(l)}} $$
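Before handing this to a framework, a minimal NumPy sketch of the same forward/backward/update loop for one hidden layer (sigmoid hidden activation, linear output, MSE loss) makes the chain rule concrete. The shapes, learning rate, and variable names are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                  # 8 samples, 4 features (illustrative)
y = rng.normal(size=(8, 1))
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)
alpha = 0.1

for _ in range(100):
    # Forward pass
    z1 = X @ W1 + b1; a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2; y_pred = z2           # linear output layer
    loss = np.mean((y_pred - y) ** 2)

    # Backward pass (chain rule, layer by layer)
    dL_dy = 2 * (y_pred - y) / len(X)        # dL/d(output)
    dW2 = a1.T @ dL_dy; db2 = dL_dy.sum(0)
    da1 = dL_dy @ W2.T                       # propagate to the hidden layer
    dz1 = da1 * a1 * (1 - a1)                # sigmoid derivative
    dW1 = X.T @ dz1; db1 = dz1.sum(0)

    # Gradient descent update
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
```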
PyTorch Implementation:
```python
import torch
import torch.nn as nn
import torch.optim as optim

class NeuralNetwork(nn.Module):
    def __init__(self, layers):
        super(NeuralNetwork, self).__init__()

        # Create layers dynamically
        self.layers = nn.ModuleList()
        for i in range(len(layers) - 1):
            self.layers.append(nn.Linear(layers[i], layers[i + 1]))

    def forward(self, x):
        # Forward pass through all layers
        for i, layer in enumerate(self.layers):
            x = layer(x)
            # Apply sigmoid to hidden layers; the output layer stays linear
            if i < len(self.layers) - 1:
                x = torch.sigmoid(x)
        return x

# Training example
def train_network(model, X_train, y_train, epochs=100, lr=0.01):
    # Define loss and optimizer
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)

    # Convert inputs to float tensors (accepts NumPy arrays or tensors)
    X = torch.as_tensor(X_train, dtype=torch.float32)
    y = torch.as_tensor(y_train, dtype=torch.float32)

    for epoch in range(epochs):
        # Forward pass
        outputs = model(X)
        loss = criterion(outputs, y)

        # Backward pass (automatic differentiation)
        optimizer.zero_grad()   # Clear gradients
        loss.backward()         # Compute gradients
        optimizer.step()        # Update weights

        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')

# Usage
model = NeuralNetwork([784, 256, 128, 10])  # Input, hidden, output layer sizes
X_train = torch.randn(100, 784)             # 100 samples, 784 features
y_train = torch.randn(100, 10)              # 100 samples, 10 outputs (random targets for demo)

train_network(model, X_train, y_train)

# PyTorch automatically handles backpropagation!
# You can also inspect gradients:
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f'{name} gradient: {param.grad.norm().item()}')
```
Why It Works: PyTorch's autograd automatically computes gradients via the chain rule, making backpropagation seamless.
Q2: What are ensemble methods? Explain bagging vs. boosting.
Answer:
Ensemble Methods: Combine multiple models to improve performance.
Bagging (Bootstrap Aggregating)
How It Works:
- Create multiple bootstrap samples (random sampling with replacement)
- Train independent model on each sample
- Aggregate predictions (voting for classification, averaging for regression)
Example: Random Forest
- Each tree trained on different bootstrap sample
- Each split considers random subset of features
- Final prediction: majority vote
Benefits:
- Reduces variance (overfitting)
- Parallel training
- Works well with high-variance models (deep trees)
Implementation:
```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging with decision trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named base_estimator in scikit-learn < 1.2
    n_estimators=100,
    max_samples=0.8,  # Use 80% of the data for each bootstrap sample
    bootstrap=True
)
bagging.fit(X_train, y_train)
```
Boosting
How It Works:
- Train models sequentially
- Each model focuses on mistakes of previous models
- Weight samples based on difficulty
- Combine with weighted voting
Example: AdaBoost
- Start with equal weights for all samples
- Train weak learner
- Increase weights for misclassified samples
- Repeat, giving more weight to harder examples
Benefits:
- Reduces bias (underfitting)
- Often better accuracy than bagging
- Works well with weak learners
Implementation:
```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# AdaBoost
adaboost = AdaBoostClassifier(n_estimators=100, learning_rate=1.0)
adaboost.fit(X_train, y_train)

# Gradient Boosting
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3
)
gb.fit(X_train, y_train)
```
Comparison
| Aspect | Bagging | Boosting |
|---|---|---|
| Training | Parallel | Sequential |
| Focus | Reduce variance | Reduce bias |
| Weighting | Equal | Adaptive |
| Example | Random Forest | AdaBoost, XGBoost |
| Overfitting | Less prone | More prone |
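To see the variance/bias trade-off in practice, the hedged sketch below cross-validates one representative of each family on the same synthetic dataset; the dataset, estimators, and hyperparameters are arbitrary choices, so treat the scores as illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = [
    ("Random Forest (bagging)", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("Gradient Boosting", GradientBoostingClassifier(n_estimators=200, random_state=0)),
]
for name, clf in models:
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```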
Q3: Explain gradient descent variants (SGD, Mini-batch, Adam).
Answer:
Batch Gradient Descent
How It Works: Use entire dataset to compute gradient.
```python
for epoch in range(n_epochs):
    gradient = compute_gradient(X_train, y_train, weights)
    weights -= learning_rate * gradient
```
Pros: Stable convergence, accurate gradient
Cons: Slow for large datasets, memory intensive
Stochastic Gradient Descent (SGD)
How It Works: Use one sample at a time.
```python
for epoch in range(n_epochs):
    for i in range(n_samples):
        gradient = compute_gradient(X_train[i], y_train[i], weights)
        weights -= learning_rate * gradient
```
Pros: Fast, can escape local minima
Cons: Noisy updates, unstable convergence
Mini-batch Gradient Descent
How It Works: Use small batches (e.g., 32, 64, 128 samples).
```python
for epoch in range(n_epochs):
    for batch in get_batches(X_train, y_train, batch_size=32):
        X_batch, y_batch = batch
        gradient = compute_gradient(X_batch, y_batch, weights)
        weights -= learning_rate * gradient
```
Pros: Balance between speed and stability
Cons: Requires tuning batch size
Adam (Adaptive Moment Estimation)
How It Works: Combines momentum and adaptive learning rates.
Algorithm:
```python
# Initialize
m = 0  # First moment (momentum)
v = 0  # Second moment (RMSprop-style)
t = 0  # Time step

for epoch in range(n_epochs):
    for batch in get_batches(X_train, y_train, batch_size):
        t += 1
        gradient = compute_gradient(batch, weights)

        # Update biased first moment estimate
        m = beta1 * m + (1 - beta1) * gradient

        # Update biased second moment estimate
        v = beta2 * v + (1 - beta2) * (gradient ** 2)

        # Bias correction
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)

        # Update weights
        weights -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
```
Typical values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
Why Adam is popular: Adaptive learning rates per parameter, works well with default hyperparameters.
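In PyTorch, these variants differ mainly in how the data is batched and which optimizer object you construct. In the illustrative sketch below (toy model and random mini-batches, not from the question), swapping SGD with momentum for Adam is a one-line change:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)                                               # toy model
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(5)]   # random mini-batches

# Swap this single line to change the optimization variant:
# optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

for epoch in range(10):
    for X_batch, y_batch in data:            # mini-batch loop
        loss = nn.functional.mse_loss(model(X_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```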
Q4: What is batch normalization and why does it help?
Answer:
How It Works:
Normalize activations within each mini-batch to have mean 0 and variance 1.
Algorithm:
- For each mini-batch: $$ \mu_B = \frac{1}{m}\sum_{i=1}^m x_i $$ $$ \sigma_B^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \mu_B)^2 $$
- Normalize: $$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$
- Scale and shift (learnable parameters): $$ y_i = \gamma \hat{x}_i + \beta $$
Why It Helps:
- Reduces internal covariate shift: Stabilizes distribution of activations
- Allows higher learning rates: Less sensitive to initialization
- Regularization effect: Adds noise through batch statistics
- Faster convergence: Smoother optimization landscape
Implementation:
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),  # Batch norm after linear layer
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10)
)
```
During Inference: Use running statistics (exponential moving average) instead of batch statistics.
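A small sketch of that training/inference distinction using PyTorch's nn.BatchNorm1d (shapes are arbitrary): in train() mode the layer normalizes with batch statistics and updates its running averages, while eval() mode uses the stored running statistics:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)

bn.train()
_ = bn(x)               # uses batch statistics, updates running_mean / running_var
print(bn.running_mean)  # exponential moving average being tracked

bn.eval()
_ = bn(x)               # uses the stored running statistics instead
```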
Q5: Explain the vanishing/exploding gradient problem.
Answer:
Vanishing Gradients
Problem: Gradients become extremely small in early layers, preventing learning.
Why It Happens:
- Deep networks multiply many gradients via chain rule
- If gradients < 1, repeated multiplication → very small values
- Common with sigmoid/tanh activations (the sigmoid derivative is at most 0.25, and tanh saturates for large inputs)
Example: $$ \frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial a^{(L-1)}} \cdot ... \cdot \frac{\partial a^{(2)}}{\partial W^{(1)}} $$
If each factor $\frac{\partial a^{(l)}}{\partial a^{(l-1)}}$ is less than 1, the product shrinks exponentially with depth.
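The effect is easy to observe empirically. The sketch below (depth, width, and loss are arbitrary choices for illustration) compares the gradient norm reaching the first layer of a deep sigmoid MLP with that of an otherwise identical ReLU MLP; the sigmoid network's first-layer gradient is typically orders of magnitude smaller:

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(activation):
    # Build a 21-layer MLP with the given activation between layers
    dims = [32] + [32] * 20 + [1]
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(activation())
    model = nn.Sequential(*layers)

    x = torch.randn(16, 32)
    loss = model(x).pow(2).mean()
    loss.backward()
    return model[0].weight.grad.norm().item()   # gradient norm at the first layer

print("sigmoid:", first_layer_grad_norm(nn.Sigmoid))  # typically tiny
print("relu:   ", first_layer_grad_norm(nn.ReLU))     # typically much larger
```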
Solutions:
- ReLU activation: Gradient is 1 for positive inputs
- Batch normalization: Stabilizes gradients
- Residual connections (ResNet): Skip connections allow gradient flow
- Better initialization: Xavier/He initialization
- LSTM/GRU: For RNNs, use gating mechanisms
Exploding Gradients
Problem: Gradients become extremely large, causing unstable training.
Why It Happens: Repeated multiplication of gradients > 1
Solutions:
- Gradient clipping: Cap gradient magnitude

  ```python
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
  ```
- Lower learning rate
- Batch normalization
Q6: Implement a confusion matrix and calculate metrics.
Answer:
How It Works: A 2x2 matrix for binary classification that tabulates true positives, true negatives, false positives, and false negatives.
Implementation:
```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    """Calculate confusion matrix"""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    return np.array([[tn, fp], [fn, tp]])

def calculate_metrics(cm):
    """Calculate all metrics from confusion matrix"""
    tn, fp, fn, tp = cm.ravel()

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'specificity': specificity
    }

# Example
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print("\nMetrics:")
print(calculate_metrics(cm))
```
Output:
```text
Confusion Matrix:
[[4 1]
 [1 4]]

Metrics:
{'accuracy': 0.8, 'precision': 0.8, 'recall': 0.8, 'f1_score': 0.8, 'specificity': 0.8}
```
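As a cross-check, scikit-learn provides the same computations out of the box. This snippet reuses the y_true/y_pred arrays above and imports under an alias so it does not shadow the hand-rolled confusion_matrix; with the default label ordering, rows are true labels and columns are predictions, matching the layout above:

```python
from sklearn.metrics import confusion_matrix as sk_confusion_matrix, classification_report

print(sk_confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=2))
```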
Q7: What is transfer learning and when to use it?
Answer:
How It Works: Use a pre-trained model as the starting point and fine-tune it for your task.
Typical Approach:
- Take model trained on large dataset (e.g., ImageNet)
- Remove final layer(s)
- Add new layers for your task
- Fine-tune:
- Option A: Freeze early layers, train only new layers
- Option B: Train all layers with small learning rate
Implementation:
```python
import torch
import torchvision.models as models

# Load pre-trained ResNet
model = models.resnet50(pretrained=True)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer (num_classes = number of classes in your task)
num_classes = 10
num_features = model.fc.in_features
model.fc = torch.nn.Linear(num_features, num_classes)

# Only the final layer will be trained
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)
```
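The snippet above implements Option A (frozen backbone). A sketch of Option B is below: unfreeze everything and fine-tune with a small learning rate, optionally giving the new head a larger rate via optimizer parameter groups. The specific rates are illustrative assumptions:

```python
# Option B (sketch): fine-tune all layers; the new head gets a larger learning rate
for param in model.parameters():
    param.requires_grad = True

optimizer = torch.optim.Adam([
    {"params": [p for n, p in model.named_parameters() if not n.startswith("fc")], "lr": 1e-5},
    {"params": model.fc.parameters(), "lr": 1e-3},
])
```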
When to Use:
✅ Small dataset (< 10k samples)
✅ Similar domain (e.g., both are images)
✅ Limited compute resources
✅ Want faster convergence
When NOT to Use:
❌ Very different domains (text → images)
❌ Huge dataset available
❌ Very specific task with no similar pre-trained models
Q8: Explain dropout and how it prevents overfitting.
Answer:
How It Works: Randomly "drop" (set to 0) neurons during training with probability $p$.
Algorithm:
```python
import numpy as np

def dropout(x, p=0.5, training=True):
    if not training:
        return x

    # Create mask: 1 with probability (1 - p), 0 with probability p
    mask = (np.random.rand(*x.shape) > p).astype(float)

    # Scale to maintain the expected value (inverted dropout)
    return x * mask / (1 - p)
```
Why It Works:
- Prevents co-adaptation: Neurons can't rely on specific other neurons
- Ensemble effect: Training many "thinned" networks, averaging at test time
- Regularization: Adds noise, prevents overfitting
During Training: Randomly drop neurons
During Inference: Use all neurons (no dropout)
Implementation:
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(0.5),  # Drop 50% of neurons
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.3),  # Drop 30% of neurons
    nn.Linear(256, 10)
)
```
Typical values: 0.2-0.5 for hidden layers; input layers, when dropout is applied to them at all, usually use a lower rate (around 0.2)
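Because nn.Dropout is only active in training mode, remember to toggle the module state at inference time. A quick check, reusing the model defined above, might look like this:

```python
import torch

x = torch.ones(1, 784)

model.train()
out1, out2 = model(x), model(x)    # dropout active: two passes generally differ
print(torch.allclose(out1, out2))  # usually False

model.eval()
out3, out4 = model(x), model(x)    # dropout disabled: deterministic
print(torch.allclose(out3, out4))  # True
```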
Q9: What is the difference between L1 and L2 regularization?
Answer:
L2 Regularization (Ridge)
Formula: Add squared magnitude of weights to loss $$ L_{\text{total}} = L_{\text{data}} + \lambda \sum_{i} w_i^2 $$
Gradient: $\frac{\partial}{\partial w} = \frac{\partial L}{\partial w} + 2\lambda w$
Effect: Weights decay towards zero but rarely become exactly zero
Implementation:
```python
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)  # alpha is λ
model.fit(X_train, y_train)
```
L1 Regularization (Lasso)
Formula: Add absolute magnitude of weights $$ L_{\text{total}} = L_{\text{data}} + \lambda \sum_{i} |w_i| $$
Gradient: $\frac{\partial}{\partial w} = \frac{\partial L}{\partial w} + \lambda \cdot \text{sign}(w)$
Effect: Drives some weights to exactly zero (feature selection)
Implementation:
```python
from sklearn.linear_model import Lasso

model = Lasso(alpha=1.0)
model.fit(X_train, y_train)
```
Comparison
| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | $\sum \lvert w_i \rvert$ | $\sum w_i^2$ |
| Feature Selection | Yes (sparse) | No |
| Solution | Non-differentiable at 0 | Differentiable everywhere |
| Use When | Many irrelevant features | All features contribute |
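The feature-selection row in the table is easy to verify empirically: on a synthetic regression problem (sizes and alpha below are arbitrary), Lasso typically drives most coefficients to exactly zero while Ridge leaves them all nonzero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # typically most of the 50
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0
```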
Elastic Net
Combines both: $$ L = L_{\text{data}} + \lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2 $$
```python
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=1.0, l1_ratio=0.5)  # 50% L1, 50% L2
model.fit(X_train, y_train)
```
Q10: How do you handle missing data in ML?
Answer:
1. Deletion
Listwise deletion: Remove entire row if any value missing
```python
df_clean = df.dropna()
```
✅ Simple
❌ Loses data, biased if not MCAR (Missing Completely At Random)
Pairwise deletion: Use available data for each calculation
✅ Retains more data
❌ Different sample sizes for different calculations
2. Imputation
Mean/Median/Mode:
```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'
X_imputed = imputer.fit_transform(X)
```
✅ Simple, fast
❌ Reduces variance, ignores relationships
K-NN Imputation:
```python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)
```
✅ Uses feature relationships
❌ Computationally expensive
Iterative Imputation (MICE):
```python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
```
✅ Models relationships between features
❌ Slow, can be unstable
3. Model-Based
Use models that handle missing values natively (a short sketch follows this list):
- XGBoost, LightGBM (built-in handling)
- Decision trees (can split on "missing" as category)
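For a concrete example without bringing in XGBoost or LightGBM, scikit-learn's HistGradientBoostingClassifier also accepts NaNs directly; the synthetic data below is purely illustrative:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan  # knock out ~20% of the values

clf = HistGradientBoostingClassifier().fit(X, y)  # NaNs handled internally, no imputer required
print(clf.score(X, y))
```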
4. Create Indicator
Add binary feature indicating missingness:
```python
X['feature_missing'] = X['feature'].isnull().astype(int)
X['feature'] = X['feature'].fillna(X['feature'].mean())  # reassign rather than fillna(..., inplace=True)
```
✅ Preserves information about missingness
❌ Increases dimensionality
Choose based on:
- Amount of missing data
- Missing mechanism (MCAR, MAR, MNAR)
- Computational resources
- Domain knowledge