AI/ML Interview Questions - Medium

Medium-level AI/ML interview questions covering neural networks, ensemble methods, and advanced concepts.

Q1: Explain backpropagation in neural networks.

Answer:

How It Works:

Backpropagation is the algorithm for training neural networks by computing gradients of the loss with respect to weights.

Forward Pass:

  1. Input flows through network
  2. Each layer applies: $z = Wx + b$, then activation $a = \sigma(z)$
  3. Final layer produces prediction
  4. Calculate loss: $L = \text{loss}(y_{\text{pred}}, y_{\text{true}})$

Backward Pass (Chain Rule):

  1. Start from output layer
  2. Calculate gradient of loss w.r.t. output: $\frac{\partial L}{\partial a^{(L)}}$
  3. Propagate backwards using chain rule: $$ \frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial z^{(l)}} \cdot \frac{\partial z^{(l)}}{\partial W^{(l)}} $$

Update Weights: $$ W^{(l)} = W^{(l)} - \alpha \frac{\partial L}{\partial W^{(l)}} $$
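
Before turning to PyTorch, here is a minimal NumPy sketch of one forward and backward pass for a single hidden layer with a sigmoid activation and a squared-error loss (shapes, data, and learning rate are illustrative; bias gradients are omitted for brevity):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                 # 4 samples, 3 features
y = rng.normal(size=(4, 1))                 # regression targets
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Forward pass
z1 = x @ W1 + b1
a1 = sigmoid(z1)
z2 = a1 @ W2 + b2                           # linear output layer
loss = np.mean((z2 - y) ** 2)

# Backward pass (chain rule)
dL_dz2 = 2 * (z2 - y) / len(y)              # dL/dz2 for mean squared error
dL_dW2 = a1.T @ dL_dz2                      # dL/dW2
dL_da1 = dL_dz2 @ W2.T
dL_dz1 = dL_da1 * a1 * (1 - a1)             # sigmoid'(z1) = a1 * (1 - a1)
dL_dW1 = x.T @ dL_dz1                       # dL/dW1

# Gradient descent update
lr = 0.1
W1 -= lr * dL_dW1
W2 -= lr * dL_dW2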

PyTorch Implementation:

import torch
import torch.nn as nn
import torch.optim as optim

class NeuralNetwork(nn.Module):
    def __init__(self, layers):
        super().__init__()

        # Create layers dynamically from a list of layer sizes
        self.layers = nn.ModuleList()
        for i in range(len(layers) - 1):
            self.layers.append(nn.Linear(layers[i], layers[i + 1]))

    def forward(self, x):
        # Forward pass through all layers
        for i, layer in enumerate(self.layers):
            x = layer(x)
            # Apply sigmoid activation on hidden layers; leave the last layer linear
            if i < len(self.layers) - 1:
                x = torch.sigmoid(x)
        return x

# Training example
def train_network(model, X_train, y_train, epochs=100, lr=0.01):
    # Define loss and optimizer
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)

    # Convert inputs to float tensors (accepts NumPy arrays or tensors)
    X = torch.as_tensor(X_train, dtype=torch.float32)
    y = torch.as_tensor(y_train, dtype=torch.float32)

    for epoch in range(epochs):
        # Forward pass
        outputs = model(X)
        loss = criterion(outputs, y)

        # Backward pass (automatic differentiation)
        optimizer.zero_grad()  # Clear gradients
        loss.backward()        # Compute gradients
        optimizer.step()       # Update weights

        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')

# Usage
model = NeuralNetwork([784, 256, 128, 10])  # Input, hidden, and output layer sizes
X_train = torch.randn(100, 784)  # 100 samples, 784 features
y_train = torch.randn(100, 10)   # 100 samples, 10 outputs (random targets for this demo)

train_network(model, X_train, y_train)

# PyTorch handles backpropagation automatically.
# You can also inspect gradients after a backward pass:
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f'{name} gradient norm: {param.grad.norm().item():.6f}')

Why It Works: PyTorch's autograd automatically computes gradients via the chain rule, making backpropagation seamless.


Q2: What are ensemble methods? Explain bagging vs. boosting.

Answer:

Ensemble Methods: Combine multiple models to improve performance.

Bagging (Bootstrap Aggregating)

How It Works:

  1. Create multiple bootstrap samples (random sampling with replacement)
  2. Train independent model on each sample
  3. Aggregate predictions (voting for classification, averaging for regression)

Example: Random Forest

  • Each tree trained on different bootstrap sample
  • Each split considers random subset of features
  • Final prediction: majority vote

Benefits:

  • Reduces variance (overfitting)
  • Parallel training
  • Works well with high-variance models (deep trees)

Implementation:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging with decision trees
# (the parameter is `estimator` in scikit-learn >= 1.2; older releases call it `base_estimator`)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,  # Use 80% of the data for each bootstrap sample
    bootstrap=True
)
bagging.fit(X_train, y_train)

Boosting

How It Works:

  1. Train models sequentially
  2. Each model focuses on mistakes of previous models
  3. Weight samples based on difficulty
  4. Combine with weighted voting

Example: AdaBoost

  1. Start with equal weights for all samples
  2. Train weak learner
  3. Increase weights for misclassified samples
  4. Repeat, giving more weight to harder examples

Benefits:

  • Reduces bias (underfitting)
  • Often better accuracy than bagging
  • Works well with weak learners

Implementation:

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# AdaBoost
adaboost = AdaBoostClassifier(n_estimators=100, learning_rate=1.0)
adaboost.fit(X_train, y_train)

# Gradient Boosting
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3
)
gb.fit(X_train, y_train)

Comparison

| Aspect      | Bagging         | Boosting          |
|-------------|-----------------|-------------------|
| Training    | Parallel        | Sequential        |
| Focus       | Reduce variance | Reduce bias       |
| Weighting   | Equal           | Adaptive          |
| Example     | Random Forest   | AdaBoost, XGBoost |
| Overfitting | Less prone      | More prone        |

Q3: Explain gradient descent variants (SGD, Mini-batch, Adam).

Answer:

Batch Gradient Descent

How It Works: Use entire dataset to compute gradient.

for epoch in range(n_epochs):
    gradient = compute_gradient(X_train, y_train, weights)
    weights -= learning_rate * gradient

Pros: Stable convergence, accurate gradient
Cons: Slow for large datasets, memory intensive

Stochastic Gradient Descent (SGD)

How It Works: Use one sample at a time.

for epoch in range(n_epochs):
    for i in range(n_samples):
        gradient = compute_gradient(X_train[i], y_train[i], weights)
        weights -= learning_rate * gradient

Pros: Fast, can escape local minima
Cons: Noisy updates, unstable convergence

Mini-batch Gradient Descent

How It Works: Use small batches (e.g., 32, 64, 128 samples).

for epoch in range(n_epochs):
    for batch in get_batches(X_train, y_train, batch_size=32):
        X_batch, y_batch = batch
        gradient = compute_gradient(X_batch, y_batch, weights)
        weights -= learning_rate * gradient

Pros: Balance between speed and stability
Cons: Requires tuning batch size
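
The loops above are pseudocode. As a concrete reference, here is a minimal NumPy sketch of mini-batch gradient descent for linear regression with a squared-error loss; compute_gradient and get_batches (names reused from the pseudocode) are defined explicitly and the data is synthetic:

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_train = X_train @ true_w + rng.normal(scale=0.1, size=1000)

def compute_gradient(X, y, weights):
    # Gradient of mean squared error for a linear model
    residual = X @ weights - y
    return 2 * X.T @ residual / len(y)

def get_batches(X, y, batch_size):
    # Shuffle once per epoch, then yield consecutive slices
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        rows = idx[start:start + batch_size]
        yield X[rows], y[rows]

weights = np.zeros(5)
learning_rate = 0.05
for epoch in range(20):
    for X_batch, y_batch in get_batches(X_train, y_train, batch_size=32):
        weights -= learning_rate * compute_gradient(X_batch, y_batch, weights)

print(weights.round(2))   # approaches true_w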

Adam (Adaptive Moment Estimation)

How It Works: Combines momentum and adaptive learning rates.

Algorithm:

import numpy as np

# Initialize
m = 0  # First moment (momentum)
v = 0  # Second moment (uncentered variance, as in RMSprop)
t = 0  # Time step

for epoch in range(n_epochs):
    for X_batch, y_batch in get_batches(X_train, y_train, batch_size):
        t += 1
        gradient = compute_gradient(X_batch, y_batch, weights)

        # Update biased first moment estimate
        m = beta1 * m + (1 - beta1) * gradient

        # Update biased second moment estimate
        v = beta2 * v + (1 - beta2) * (gradient ** 2)

        # Bias correction
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)

        # Update weights
        weights -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)

Typical values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$

Why Adam is popular: Adaptive learning rates per parameter, works well with default hyperparameters.
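
In practice these variants are chosen through an optimizer object rather than hand-written loops. A brief PyTorch sketch (assuming model is any nn.Module; the framework's "SGD" already operates on whatever mini-batches you feed it):

import torch.optim as optim

# Each line shows one option; pick a single optimizer in real code
optimizer = optim.SGD(model.parameters(), lr=0.01)                 # plain (mini-batch) SGD
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # SGD with momentum
optimizer = optim.Adam(model.parameters(), lr=1e-3,
                       betas=(0.9, 0.999), eps=1e-8)               # Adam with its usual defaults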


Q4: What is batch normalization and why does it help?

Answer:

How It Works:

Normalize activations within each mini-batch to have mean 0 and variance 1.

Algorithm:

  1. For each mini-batch: $$ \mu_B = \frac{1}{m}\sum_{i=1}^m x_i $$ $$ \sigma_B^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \mu_B)^2 $$
  2. Normalize: $$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$
  3. Scale and shift (learnable parameters): $$ y_i = \gamma \hat{x}_i + \beta $$

Why It Helps:

  1. Reduces internal covariate shift: Stabilizes distribution of activations
  2. Allows higher learning rates: Less sensitive to initialization
  3. Regularization effect: Adds noise through batch statistics
  4. Faster convergence: Smoother optimization landscape

Implementation:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),  # Batch norm after linear layer
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

During Inference: Use running statistics (exponential moving average) instead of batch statistics.
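
A small PyTorch sketch of that train/eval distinction (layer width and data are arbitrary):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(32, 4) * 3 + 5    # batch with mean ~5, std ~3

bn.train()
out_train = bn(x)                 # normalized with this batch's mean/variance
print(bn.running_mean)            # exponential moving average of batch means

bn.eval()
out_eval = bn(x)                  # normalized with the stored running statistics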


Q5: Explain the vanishing/exploding gradient problem.

Answer:

Vanishing Gradients

Problem: Gradients become extremely small in early layers, preventing learning.

Why It Happens:

  • Deep networks multiply many gradients via chain rule
  • If gradients < 1, repeated multiplication → very small values
  • Common with sigmoid/tanh activations (the sigmoid's derivative is at most 0.25; tanh's is at most 1)

Example: $$ \frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial a^{(L-1)}} \cdot ... \cdot \frac{\partial a^{(2)}}{\partial W^{(1)}} $$

If each factor $\frac{\partial a^{(l)}}{\partial a^{(l-1)}} < 1$, the product vanishes, as the sketch below illustrates.
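
A quick way to see this empirically is to compare the first-layer gradient norm of a deep sigmoid network with that of an otherwise identical ReLU network; the architecture and data below are made up for illustration:

import torch
import torch.nn as nn

def first_layer_grad_norm(activation):
    # 20 hidden layers of width 64, all using the given activation
    hidden = []
    for _ in range(20):
        hidden += [nn.Linear(64, 64), activation()]
    net = nn.Sequential(nn.Linear(32, 64), activation(), *hidden, nn.Linear(64, 1))

    x, y = torch.randn(16, 32), torch.randn(16, 1)
    loss = nn.MSELoss()(net(x), y)
    loss.backward()
    return net[0].weight.grad.norm().item()  # gradient norm of the first layer

print("sigmoid:", first_layer_grad_norm(nn.Sigmoid))  # typically vanishingly small
print("relu:   ", first_layer_grad_norm(nn.ReLU))     # typically many orders of magnitude larger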

Solutions:

  1. ReLU activation: Gradient is 1 for positive inputs
  2. Batch normalization: Stabilizes gradients
  3. Residual connections (ResNet): Skip connections allow gradient flow
  4. Better initialization: Xavier/He initialization
  5. LSTM/GRU: For RNNs, use gating mechanisms

Exploding Gradients

Problem: Gradients become extremely large, causing unstable training.

Why It Happens: Repeated multiplication of gradients > 1

Solutions:

  1. Gradient clipping: Cap the gradient magnitude

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

  2. Lower learning rate
  3. Batch normalization

Q6: Implement a confusion matrix and calculate metrics.

Answer:

How It Works: 2x2 matrix for binary classification showing true/false positives/negatives.

Implementation:

import numpy as np

def confusion_matrix(y_true, y_pred):
    """Calculate confusion matrix"""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    return np.array([[tn, fp], [fn, tp]])

def calculate_metrics(cm):
    """Calculate all metrics from confusion matrix"""
    tn, fp, fn, tp = cm.ravel()

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'specificity': specificity
    }

# Example
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print("\nMetrics:")
print(calculate_metrics(cm))

Output:

Confusion Matrix:
[[4 1]
 [1 4]]

Metrics:
{'accuracy': 0.8, 'precision': 0.8, 'recall': 0.8, 'f1_score': 0.8, 'specificity': 0.8}
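
As an optional cross-check, scikit-learn's implementations give the same numbers on this example:

from sklearn.metrics import confusion_matrix as sk_confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score

print(sk_confusion_matrix(y_true, y_pred))   # [[4 1]
                                             #  [1 4]]
print(precision_score(y_true, y_pred))       # 0.8
print(recall_score(y_true, y_pred))          # 0.8
print(f1_score(y_true, y_pred))              # ≈ 0.8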

Q7: What is transfer learning and when to use it?

Answer:

How It Works: Use pre-trained model as starting point, fine-tune for your task.

Typical Approach:

  1. Take model trained on large dataset (e.g., ImageNet)
  2. Remove final layer(s)
  3. Add new layers for your task
  4. Fine-tune:
    • Option A: Freeze early layers, train only new layers
    • Option B: Train all layers with small learning rate

Implementation:

import torch
import torchvision.models as models

# Load a pre-trained ResNet
# (torchvision >= 0.13 uses the `weights` argument; older versions use `pretrained=True`)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for the new task
num_classes = 10  # number of classes in your dataset
num_features = model.fc.in_features
model.fc = torch.nn.Linear(num_features, num_classes)

# Only the final layer will be trained
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)
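
A sketch of Option B from above: unfreeze everything and fine-tune with a small learning rate, giving the freshly added head a larger rate (splitting the parameter groups by the 'fc' name prefix is an illustrative choice for ResNet):

# Unfreeze all layers
for param in model.parameters():
    param.requires_grad = True

# Discriminative learning rates: small for the pre-trained backbone, larger for the new head
optimizer = torch.optim.Adam([
    {'params': model.fc.parameters(), 'lr': 1e-3},
    {'params': [p for n, p in model.named_parameters()
                if not n.startswith('fc')], 'lr': 1e-5},
])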

When to Use: ✅ Small dataset (< 10k samples)
✅ Similar domain (e.g., both are images)
✅ Limited compute resources
✅ Want faster convergence

When NOT to Use: ❌ Very different domains (text → images)
❌ Huge dataset available
❌ Very specific task with no similar pre-trained models


Q8: Explain dropout and how it prevents overfitting.

Answer:

How It Works: Randomly "drop" (set to 0) neurons during training with probability $p$.

Algorithm:

import numpy as np

def dropout(x, p=0.5, training=True):
    if not training:
        return x

    # Create mask: 1 with probability (1 - p), 0 with probability p
    mask = (np.random.rand(*x.shape) > p).astype(float)

    # Scale to maintain the expected value (inverted dropout)
    return x * mask / (1 - p)

Why It Works:

  1. Prevents co-adaptation: Neurons can't rely on specific other neurons
  2. Ensemble effect: Training many "thinned" networks, averaging at test time
  3. Regularization: Adds noise, prevents overfitting

During Training: Randomly drop neurons
During Inference: Use all neurons (no dropout)

Implementation:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(0.5),  # Drop 50% of neurons
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.3),  # Drop 30% of neurons
    nn.Linear(256, 10)
)

Typical values: 0.2-0.5 for hidden layers (0.5 is a common default); use a lower rate such as 0.2 for the input layer
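
A quick sketch showing that PyTorch's Dropout is only active in training mode (it uses inverted dropout, so surviving activations are scaled by $1/(1-p)$ during training and the layer is an identity at inference):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # roughly half the entries zeroed, survivors scaled to 2.0

drop.eval()
print(drop(x))   # identity: all ones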


Q9: What is the difference between L1 and L2 regularization?

Answer:

L2 Regularization (Ridge)

Formula: Add squared magnitude of weights to loss $$ L_{\text{total}} = L_{\text{data}} + \lambda \sum_{i} w_i^2 $$

Gradient: $\frac{\partial L_{\text{total}}}{\partial w_i} = \frac{\partial L_{\text{data}}}{\partial w_i} + 2\lambda w_i$

Effect: Weights decay towards zero but rarely become exactly zero

Implementation:

from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)  # alpha is λ
model.fit(X_train, y_train)

L1 Regularization (Lasso)

Formula: Add absolute magnitude of weights $$ L_{\text{total}} = L_{\text{data}} + \lambda \sum_{i} |w_i| $$

Gradient: $\frac{\partial L_{\text{total}}}{\partial w_i} = \frac{\partial L_{\text{data}}}{\partial w_i} + \lambda \cdot \text{sign}(w_i)$

Effect: Drives some weights to exactly zero (feature selection)

Implementation:

from sklearn.linear_model import Lasso

model = Lasso(alpha=1.0)
model.fit(X_train, y_train)

Comparison

| Aspect            | L1 (Lasso)                     | L2 (Ridge)                |
|-------------------|--------------------------------|---------------------------|
| Penalty           | $\sum \lvert w_i \rvert$       | $\sum w_i^2$              |
| Feature Selection | Yes (sparse)                   | No                        |
| Solution          | Non-differentiable at 0        | Differentiable everywhere |
| Use When          | Many irrelevant features       | All features contribute   |
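
A small sketch of the feature-selection effect on synthetic data where most features are irrelevant (the dataset and alpha values are arbitrary):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # typically most of the 45 irrelevant ones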

Elastic Net

Combines both: $$ L = L_{\text{data}} + \lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2 $$

from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=1.0, l1_ratio=0.5)  # 50% L1, 50% L2
model.fit(X_train, y_train)

Q10: How do you handle missing data in ML?

Answer:

1. Deletion

Listwise deletion: Remove entire row if any value missing

df_clean = df.dropna()

✅ Simple
❌ Loses data, biased if not MCAR (Missing Completely At Random)

Pairwise deletion: Use available data for each calculation

✅ Retains more data
❌ Different sample sizes for different calculations

2. Imputation

Mean/Median/Mode:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'
X_imputed = imputer.fit_transform(X)

✅ Simple, fast
❌ Reduces variance, ignores relationships

K-NN Imputation:

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)

✅ Uses feature relationships
❌ Computationally expensive

Iterative Imputation (MICE):

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)

✅ Models relationships between features
❌ Slow, can be unstable

3. Model-Based

Use models that handle missing values:

  • XGBoost, LightGBM (built-in handling; see the sketch below)
  • Decision trees (can split on "missing" as category)
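
A minimal sketch, assuming the xgboost package is installed; XGBoost learns a default split direction for missing values, so NaNs can be passed in directly:

import numpy as np
import xgboost as xgb

# Tiny toy dataset with missing entries (values are made up)
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 1.0], [4.0, 2.0],
              [1.5, np.nan], [3.0, 0.5]])
y = np.array([0, 1, 0, 1, 0, 1])

model = xgb.XGBClassifier(n_estimators=10)
model.fit(X, y)            # no imputation step needed
print(model.predict(X))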

4. Create Indicator

Add binary feature indicating missingness:

X['feature_missing'] = X['feature'].isnull().astype(int)
X['feature'] = X['feature'].fillna(X['feature'].mean())

✅ Preserves information about missingness
❌ Increases dimensionality

Choose based on:

  • Amount of missing data
  • Missing mechanism (MCAR, MAR, MNAR)
  • Computational resources
  • Domain knowledge
