Statistics Interview Questions - Easy
Easy-level statistics interview questions covering fundamental concepts, descriptive statistics, and basic probability.
Q1: Explain the difference between mean, median, and mode.
Answer:
Definitions
- Mean ($\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$): Arithmetic average of all values
- Median: Middle value when data is sorted (50th percentile)
- Mode: Most frequently occurring value(s)
When to Use Each
- Mean: When data is normally distributed, no extreme outliers
- Median: When data has outliers, skewed distributions
- Mode: Categorical data, finding most common category
Python Example
```python
import numpy as np
from scipy import stats

# Sample data
data = [10, 20, 20, 30, 40, 50, 60, 70, 1000]  # 1000 is an outlier

# Mean (sensitive to outliers)
mean = np.mean(data)
print(f"Mean: {mean:.2f}")  # 144.44 - heavily influenced by 1000

# Median (robust to outliers)
median = np.median(data)
print(f"Median: {median:.2f}")  # 40.00 - not affected by the outlier

# Mode
mode = stats.mode(data, keepdims=True)
print(f"Mode: {mode.mode[0]}")  # 20 - most frequent value

# For skewed data, the median is preferred
skewed_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(f"Skewed data - Mean: {np.mean(skewed_data):.2f}, Median: {np.median(skewed_data):.2f}")
```
Thinking Process: Mean is most intuitive but sensitive to outliers. Median is better for skewed data. Mode is useful for categorical or discrete data.
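As the Thinking Process notes, the mode also applies to categorical data, where mean and median are undefined. A minimal sketch using `collections.Counter` (the survey categories are made up for illustration):

```python
from collections import Counter

# Hypothetical categorical data: most common browser among respondents
browsers = ["Chrome", "Safari", "Chrome", "Firefox", "Chrome", "Safari"]
mode_category, count = Counter(browsers).most_common(1)[0]
print(f"Mode: {mode_category} ({count} occurrences)")  # Chrome (3 occurrences)
```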
Q2: What is variance and standard deviation? How do you calculate them?
Answer:
Definitions
Variance ($\sigma^2$): Average squared deviation from the mean
- Measures spread/dispersion of data
- Units are squared (hard to interpret)
Standard Deviation ($\sigma$): Square root of variance
- Same units as original data
- More interpretable measure of spread
Formulas
- Population Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
- Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (Bessel's correction)
- Standard Deviation: $\sigma = \sqrt{\sigma^2}$ or $s = \sqrt{s^2}$
Python Example
```python
import numpy as np

data = [10, 20, 30, 40, 50]

# Calculate mean
mean = np.mean(data)
print(f"Mean: {mean}")

# Variance (ddof=1 applies Bessel's correction for a sample)
variance = np.var(data, ddof=1)      # Sample variance
variance_pop = np.var(data, ddof=0)  # Population variance
print(f"Sample variance: {variance:.2f}")
print(f"Population variance: {variance_pop:.2f}")

# Standard deviation
std_dev = np.std(data, ddof=1)  # Sample std dev
print(f"Standard deviation: {std_dev:.2f}")

# Manual calculation
deviations = [(x - mean) ** 2 for x in data]
manual_variance = sum(deviations) / (len(data) - 1)
manual_std = np.sqrt(manual_variance)
print(f"Manual calculation - Variance: {manual_variance:.2f}, Std Dev: {manual_std:.2f}")

# Interpretation
print(f"\nData points within 1 std dev: {mean - std_dev:.2f} to {mean + std_dev:.2f}")
```
Thinking Process: Variance squares deviations to avoid cancellation, but units are squared. Standard deviation fixes units and is more interpretable. Use $n-1$ for samples (Bessel's correction) to get unbiased estimate.
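The unbiasedness claim can be checked empirically. A short simulation sketch (the N(0, 2) population, sample size n = 5, and repetition count are arbitrary choices) averaging both estimators over many samples:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # population is N(0, 2), so variance = 2**2

# Draw many small samples and average the two variance estimators
samples = rng.normal(0, 2, size=(100_000, 5))  # 100,000 samples of size n=5
biased = samples.var(axis=1, ddof=0).mean()    # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n-1 (Bessel's correction)

print(f"True variance:     {true_var:.3f}")
print(f"ddof=0 (biased):   {biased:.3f}")    # ~3.2, systematically too low
print(f"ddof=1 (unbiased): {unbiased:.3f}")  # ~4.0
```

The ddof=0 average lands near $\frac{n-1}{n}\sigma^2 = 3.2$, which is exactly the bias Bessel's correction removes.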
Q3: What is the difference between correlation and causation?
Answer:
Key Differences
Correlation:
- Measures association between two variables
- Range: -1 to +1
- Does NOT imply one causes the other
- Can be spurious (third variable)
Causation:
- One variable directly causes change in another
- Requires controlled experiments
- Correlation is evidence of association, but never sufficient on its own (a causal effect can even exist with zero linear correlation, e.g. a symmetric nonlinear relationship)
Common Fallacies
- Post hoc ergo propter hoc: Assuming that because B happened after A, A caused B
- Spurious correlation: Third variable explains both
- Reverse causation: Y causes X, not X causes Y
Python Example
```python
import numpy as np

# Spurious correlation example
np.random.seed(42)
days = np.arange(1, 101)
temperature = 20 + 10 * np.sin(days * 2 * np.pi / 365) + np.random.normal(0, 2, 100)
ice_cream_sales = 100 + 50 * (temperature - 20) / 10 + np.random.normal(0, 10, 100)
drowning_deaths = 5 + 0.3 * (temperature - 20) + np.random.normal(0, 1, 100)

# Calculate correlation
correlation = np.corrcoef(ice_cream_sales, drowning_deaths)[0, 1]
print(f"Correlation between ice cream sales and drownings: {correlation:.3f}")
print("This is spurious - both are driven by temperature (season)")

# True correlations with the confounder, temperature
corr_temp_ice = np.corrcoef(temperature, ice_cream_sales)[0, 1]
corr_temp_drown = np.corrcoef(temperature, drowning_deaths)[0, 1]
print(f"Temperature vs Ice cream: {corr_temp_ice:.3f}")
print(f"Temperature vs Drownings: {corr_temp_drown:.3f}")

# Establishing causation requires more than correlation
print("\nTo prove causation:")
print("1. Controlled experiment with randomization")
print("2. Control for confounding variables")
print("3. Temporal precedence (cause before effect)")
print("4. Dose-response relationship")
```
Thinking Process: Correlation is easy to measure but causation requires careful experimental design. Always consider confounding variables and alternative explanations.
Q4: Explain the normal distribution and its properties.
Answer:
Properties
- Symmetry: Mean = Median = Mode
- 68-95-99.7 Rule: Empirical rule for standard deviations
- Infinite support: Extends from -∞ to +∞
- Defined by two parameters: Mean (μ) and standard deviation (σ)
Probability Density Function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$$
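Before the fuller example below, a quick sanity check of this formula against `scipy.stats.norm.pdf` (the standard normal and test point x = 1 are arbitrary choices):

```python
import numpy as np
from scipy import stats

mu, sigma, x = 0.0, 1.0, 1.0

# Evaluate the PDF formula directly
manual = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
library = stats.norm.pdf(x, loc=mu, scale=sigma)

print(f"Manual: {manual:.6f}")  # 0.241971
print(f"scipy:  {library:.6f}")  # matches
```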
Python Example
```python
import numpy as np
from scipy import stats

# Generate normal distribution
mu, sigma = 100, 15  # Mean and standard deviation
data = np.random.normal(mu, sigma, 10000)

# Calculate properties
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)

print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Mean ≈ Median (symmetric): {abs(mean - median) < 0.5}")  # loose tolerance for sampling noise

# 68-95-99.7 rule verification
within_1sigma = np.sum(np.abs(data - mean) < std_dev) / len(data)
within_2sigma = np.sum(np.abs(data - mean) < 2 * std_dev) / len(data)
within_3sigma = np.sum(np.abs(data - mean) < 3 * std_dev) / len(data)

print("\n68-95-99.7 Rule:")
print(f"Within 1σ: {within_1sigma*100:.1f}% (expected 68%)")
print(f"Within 2σ: {within_2sigma*100:.1f}% (expected 95%)")
print(f"Within 3σ: {within_3sigma*100:.1f}% (expected 99.7%)")

# Calculate probabilities using scipy
prob_less_than_85 = stats.norm.cdf(85, loc=mu, scale=sigma)
prob_between_85_115 = stats.norm.cdf(115, loc=mu, scale=sigma) - stats.norm.cdf(85, loc=mu, scale=sigma)
print(f"\nP(X < 85) = {prob_less_than_85:.3f}")
print(f"P(85 < X < 115) = {prob_between_85_115:.3f}")

# Z-score calculation
z_score = (85 - mu) / sigma
print(f"Z-score for 85: {z_score:.2f}")
```
Thinking Process: Normal distribution is fundamental in statistics due to Central Limit Theorem. Many statistical tests assume normality. Always verify assumptions before applying tests.
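Since the Thinking Process leans on the Central Limit Theorem, a minimal sketch (the exponential population and sample sizes are arbitrary choices) showing that means of even heavily skewed samples approach normality:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Heavily skewed population: exponential with mean 1 (skewness = 2)
for n in [2, 10, 50]:
    sample_means = rng.exponential(1.0, size=(20_000, n)).mean(axis=1)
    print(f"n={n:3d}: skewness of sample means = {stats.skew(sample_means):.3f}")
# Skewness shrinks toward 0 (normality) as n grows, roughly like 2/sqrt(n)
```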
Q5: What is a confidence interval? How do you interpret it?
Answer:
Definition
A confidence interval is a range of values that likely contains the true population parameter, based on sample data.
Common Misinterpretation
❌ Wrong: "There's a 95% probability the true mean is in this interval."
✅ Correct: "95% of such intervals (if we repeated the study) would contain the true mean."
Calculation
For a mean with known population standard deviation: $$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
For a mean with unknown population standard deviation (t-distribution): $$\bar{x} \pm t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}$$
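A minimal sketch of the first (known-σ) formula, assuming σ = 15 is known and a hypothetical sample mean of 100; the fuller example below uses the t-based version:

```python
import numpy as np
from scipy import stats

sigma, n = 15, 30  # assumed known population std and sample size
sample_mean = 100  # hypothetical sample mean
z_critical = stats.norm.ppf(0.975)  # ~1.96 for a 95% interval

margin = z_critical * sigma / np.sqrt(n)
print(f"95% z-interval: [{sample_mean - margin:.2f}, {sample_mean + margin:.2f}]")
```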
Python Example
```python
import numpy as np
from scipy import stats

# Sample data
np.random.seed(42)
sample = np.random.normal(100, 15, 30)  # Sample size 30
sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)
n = len(sample)

# 95% confidence interval using the t-distribution (unknown population std)
confidence_level = 0.95
alpha = 1 - confidence_level
t_critical = stats.t.ppf(1 - alpha / 2, df=n - 1)
margin_error = t_critical * (sample_std / np.sqrt(n))

ci_lower = sample_mean - margin_error
ci_upper = sample_mean + margin_error

print(f"Sample mean: {sample_mean:.2f}")
print(f"Sample std: {sample_std:.2f}")
print(f"95% Confidence Interval: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"Margin of error: ±{margin_error:.2f}")

# Interpretation
print("\nInterpretation:")
print("We are 95% confident that the true population mean")
print(f"lies between {ci_lower:.2f} and {ci_upper:.2f}")

# Using scipy directly
ci = stats.t.interval(confidence_level, df=n - 1, loc=sample_mean, scale=stats.sem(sample))
print(f"\nUsing scipy: [{ci[0]:.2f}, {ci[1]:.2f}]")

# Effect of sample size on CI width (assuming std of 15)
for n_size in [10, 30, 100]:
    t_crit = stats.t.ppf(0.975, df=n_size - 1)
    margin = t_crit * 15 / np.sqrt(n_size)
    print(f"n={n_size}: CI width = {2 * margin:.2f}")
```
Thinking Process: Confidence intervals provide uncertainty estimates. Wider intervals indicate more uncertainty. Sample size, variability, and confidence level all affect interval width. Remember it's about the method, not a single interval.
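The "method, not a single interval" point is directly checkable. A short simulation sketch (1,000 repetitions and the N(100, 15) population are arbitrary choices) counting how often the 95% interval covers the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mu, sigma, n = 100, 15, 30
covered = 0

for _ in range(1000):
    sample = rng.normal(true_mu, sigma, n)
    ci_low, ci_high = stats.t.interval(
        0.95, df=n - 1, loc=sample.mean(), scale=stats.sem(sample)
    )
    covered += ci_low <= true_mu <= ci_high

print(f"Coverage: {covered / 1000:.1%}")  # close to 95%
```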
Q6: Explain probability vs conditional probability.
Answer:
Definitions
- Probability $P(A)$: Chance of event A occurring (unconditional)
- Conditional Probability $P(A|B)$: Probability of A given that B has occurred
Formula
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
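A quick sketch verifying the formula by brute-force counting over two dice (the choice of events A and B is arbitrary):

```python
from itertools import product

# All 36 equally likely outcomes of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))

A = {o for o in outcomes if sum(o) >= 10}  # event A: sum is at least 10
B = {o for o in outcomes if o[0] == 6}     # event B: first die shows 6

P_B = len(B) / 36
P_A_and_B = len(A & B) / 36
print(f"P(A|B) = P(A∩B)/P(B) = {P_A_and_B / P_B:.3f}")  # 3/6 = 0.500
```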
Python Example
```python
# Example: Disease testing
# Prevalence: 1% of the population has the disease
# Test sensitivity: 95% (P(positive | disease))
# Test specificity: 90% (P(negative | no disease))

P_disease = 0.01
P_no_disease = 1 - P_disease
P_positive_given_disease = 0.95
P_negative_given_no_disease = 0.90
P_positive_given_no_disease = 1 - P_negative_given_no_disease

# Unconditional probability of a positive test (law of total probability)
P_positive = (P_positive_given_disease * P_disease +
              P_positive_given_no_disease * P_no_disease)

print(f"Unconditional P(positive test): {P_positive:.3f}")

# Conditional probability: P(disease | positive test) via Bayes' theorem
P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive

print(f"\nConditional P(disease|positive test): {P_disease_given_positive:.3f}")
print("This is surprisingly low despite 95% sensitivity!")

# Bayes' theorem, step by step
print("\nBayes' Theorem breakdown:")
print("P(disease|positive) = P(positive|disease) × P(disease) / P(positive)")
print(f"                    = {P_positive_given_disease} × {P_disease} / {P_positive:.3f}")
print(f"                    = {P_disease_given_positive:.3f}")

# Why conditional probability matters
print("\nKey insight: low prevalence makes false positives more common")
print("than true positives, even with good test sensitivity!")
```
Thinking Process: Conditional probability updates beliefs based on new information. Bayes' theorem shows how prior knowledge combines with evidence. Always consider base rates (prevalence) when interpreting test results.
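The Bayes result above can also be verified by simulation. A minimal sketch (one million simulated patients is an arbitrary choice) using the same prevalence, sensitivity, and specificity:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000

# Simulate the disease-testing example above
has_disease = rng.random(N) < 0.01
positive = np.where(has_disease,
                    rng.random(N) < 0.95,  # sensitivity
                    rng.random(N) < 0.10)  # false-positive rate (1 - specificity)

empirical = has_disease[positive].mean()
print(f"Empirical P(disease|positive): {empirical:.3f}")  # ~0.087, matching Bayes
```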
Q7: What is sampling bias and how can you avoid it?
Answer:
Types of Sampling Bias
- Selection Bias: Non-random selection from population
- Voluntary Response Bias: Self-selected participants
- Survivorship Bias: Only considering successful/visible cases
- Convenience Sampling: Easy-to-reach participants
How to Avoid
- Random Sampling: Each member has equal chance
- Stratified Sampling: Ensure subgroups are represented
- Proper Sampling Frame: Complete list of population
- Adequate Sample Size: Reduce random error
Python Example
```python
import numpy as np
import pandas as pd

# Example: Survey bias
np.random.seed(42)

# True population: support for the policy increases with age,
# so samples that over-represent older people will be biased
population_size = 10000
age = np.random.beta(2, 5, population_size)  # 0..1, skewed toward younger
support_prob = np.clip(0.45 + 0.8 * (age - age.mean()), 0, 1)
population = np.random.binomial(1, support_prob)
true_support = population.mean()  # close to 45%

# 1. Random sample (unbiased)
random_sample = np.random.choice(population, size=100, replace=False)
print(f"True support: {true_support:.2%}")
print(f"Random sample estimate: {np.mean(random_sample):.2%}")

# 2. Convenience sample (biased - older people are more likely to respond)
response_probability = age * 0.3 + 0.05
responded = np.random.binomial(1, response_probability).astype(bool)
convenience_support = np.mean(population[responded])
print(f"Convenience sample estimate: {convenience_support:.2%} (BIASED)")

# 3. Stratified sampling (proportional allocation across age groups)
age_groups = pd.cut(age, bins=3, labels=['Young', 'Middle', 'Old'])
stratified_sample = []
for group in ['Young', 'Middle', 'Old']:
    group_values = population[np.asarray(age_groups == group)]
    n_stratum = max(1, round(100 * len(group_values) / population_size))
    stratified_sample.extend(np.random.choice(group_values, size=n_stratum, replace=False))
stratified_support = np.mean(stratified_sample)
print(f"Stratified sample estimate: {stratified_support:.2%}")

print(f"\nBias of convenience sampling: {abs(convenience_support - true_support):.2%}")
print(f"Bias of stratified sampling: {abs(stratified_support - true_support):.2%}")
```
Thinking Process: Bias creates systematic error that doesn't decrease with sample size. Random sampling is key, but may need stratification if subgroups differ. Always consider who is included/excluded from sample.
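The first sentence of the Thinking Process can be demonstrated directly. A short sketch (the population setup mirrors the example above; sample sizes are arbitrary) showing that random-sampling error shrinks with n while convenience-sample bias does not:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100_000
age = rng.beta(2, 5, N)
population = rng.binomial(1, np.clip(0.45 + 0.8 * (age - age.mean()), 0, 1))
true_support = population.mean()

# Response weights skewed toward older (more supportive) people
weights = age * 0.3 + 0.05
weights /= weights.sum()

for n in [100, 1000, 10000]:
    random_idx = rng.choice(N, size=n, replace=False)             # unbiased
    biased_idx = rng.choice(N, size=n, replace=False, p=weights)  # convenience-like
    print(f"n={n:5d}  random error: {abs(population[random_idx].mean() - true_support):.3f}  "
          f"convenience bias: {abs(population[biased_idx].mean() - true_support):.3f}")
```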
These fundamental concepts form the basis for more advanced statistical analysis and inference.