Statistics Interview Questions - Easy
Easy-level statistics interview questions covering fundamental concepts, descriptive statistics, and basic probability.
Q1: Explain the difference between mean, median, and mode.
Answer:
Definitions
- Mean ($\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$): Arithmetic average of all values
- Median: Middle value when data is sorted (50th percentile)
- Mode: Most frequently occurring value(s)
When to Use Each
- Mean: When data is normally distributed, no extreme outliers
- Median: When data has outliers, skewed distributions
- Mode: Categorical data, finding most common category
Python Example
```python
import numpy as np
from scipy import stats

# Sample data
data = [10, 20, 20, 30, 40, 50, 60, 70, 1000]  # 1000 is an outlier

# Mean (sensitive to outliers)
mean = np.mean(data)
print(f"Mean: {mean:.2f}")  # 144.44 - heavily influenced by 1000

# Median (robust to outliers)
median = np.median(data)
print(f"Median: {median:.2f}")  # 40.00 - not affected by the outlier

# Mode
mode = stats.mode(data, keepdims=True)
print(f"Mode: {mode.mode[0]}")  # 20 - most frequent value

# For skewed data, the median is preferred
skewed_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(f"Skewed data - Mean: {np.mean(skewed_data):.2f}, Median: {np.median(skewed_data):.2f}")
```
Thinking Process: Mean is most intuitive but sensitive to outliers. Median is better for skewed data. Mode is useful for categorical or discrete data.
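As the Thinking Process notes, the mode also applies to categorical data, where mean and median are undefined. A minimal sketch using `collections.Counter` (the survey categories are made up for illustration):

```python
from collections import Counter

# Hypothetical categorical data: most common browser among respondents
browsers = ["Chrome", "Safari", "Chrome", "Firefox", "Chrome", "Safari"]
mode_category, count = Counter(browsers).most_common(1)[0]
print(f"Mode: {mode_category} ({count} occurrences)")  # Chrome (3 occurrences)
```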
Q2: What is variance and standard deviation? How do you calculate them?
Answer:
Definitions
Variance ($\sigma^2$): Average squared deviation from the mean
- Measures spread/dispersion of data
- Units are squared (hard to interpret)
Standard Deviation ($\sigma$): Square root of variance
- Same units as original data
- More interpretable measure of spread
Formulas
- Population Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
- Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (Bessel's correction)
- Standard Deviation: $\sigma = \sqrt{\sigma^2}$ or $s = \sqrt{s^2}$
Python Example
```python
import numpy as np

data = [10, 20, 30, 40, 50]

# Calculate mean
mean = np.mean(data)
print(f"Mean: {mean}")

# Variance (ddof=1 applies Bessel's correction for a sample)
variance = np.var(data, ddof=1)      # Sample variance
variance_pop = np.var(data, ddof=0)  # Population variance
print(f"Sample variance: {variance:.2f}")
print(f"Population variance: {variance_pop:.2f}")

# Standard deviation
std_dev = np.std(data, ddof=1)  # Sample std dev
print(f"Standard deviation: {std_dev:.2f}")

# Manual calculation
deviations = [(x - mean) ** 2 for x in data]
manual_variance = sum(deviations) / (len(data) - 1)
manual_std = np.sqrt(manual_variance)
print(f"Manual calculation - Variance: {manual_variance:.2f}, Std Dev: {manual_std:.2f}")

# Interpretation
print(f"\nData points within 1 std dev: {mean - std_dev:.2f} to {mean + std_dev:.2f}")
```
Thinking Process: Variance squares deviations to avoid cancellation, but units are squared. Standard deviation fixes units and is more interpretable. Use $n-1$ for samples (Bessel's correction) to get unbiased estimate.
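The unbiasedness claim can be checked empirically. A short simulation sketch (the N(0, 2) population, sample size n = 5, and repetition count are arbitrary choices) averaging both estimators over many samples:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # population is N(0, 2), so variance = 2**2

# Draw many small samples and average the two variance estimators
samples = rng.normal(0, 2, size=(100_000, 5))  # 100,000 samples of size n=5
biased = samples.var(axis=1, ddof=0).mean()    # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n-1 (Bessel's correction)

print(f"True variance:     {true_var:.3f}")
print(f"ddof=0 (biased):   {biased:.3f}")    # ~3.2, systematically too low
print(f"ddof=1 (unbiased): {unbiased:.3f}")  # ~4.0
```

The ddof=0 average lands near $\frac{n-1}{n}\sigma^2 = 3.2$, which is exactly the bias Bessel's correction removes.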
Q3: What is the difference between correlation and causation?
Answer:
Key Differences
Correlation:
- Measures association between two variables
- Range: -1 to +1
- Does NOT imply one causes the other
- Can be spurious (third variable)
Causation:
- One variable directly causes change in another
- Requires controlled experiments
- Correlation is evidence of association, but never sufficient on its own (a causal effect can even exist with zero linear correlation, e.g. a symmetric nonlinear relationship)
Common Fallacies
- Post hoc ergo propter hoc: Assuming that because B happened after A, A caused B
- Spurious correlation: Third variable explains both
- Reverse causation: Y causes X, not X causes Y
Python Example
```python
import numpy as np

# Spurious correlation example
np.random.seed(42)
days = np.arange(1, 101)
temperature = 20 + 10 * np.sin(days * 2 * np.pi / 365) + np.random.normal(0, 2, 100)
ice_cream_sales = 100 + 50 * (temperature - 20) / 10 + np.random.normal(0, 10, 100)
drowning_deaths = 5 + 0.3 * (temperature - 20) + np.random.normal(0, 1, 100)

# Calculate correlation
correlation = np.corrcoef(ice_cream_sales, drowning_deaths)[0, 1]
print(f"Correlation between ice cream sales and drownings: {correlation:.3f}")
print("This is spurious - both are driven by temperature (season)")

# True correlations with the confounder, temperature
corr_temp_ice = np.corrcoef(temperature, ice_cream_sales)[0, 1]
corr_temp_drown = np.corrcoef(temperature, drowning_deaths)[0, 1]
print(f"Temperature vs Ice cream: {corr_temp_ice:.3f}")
print(f"Temperature vs Drownings: {corr_temp_drown:.3f}")

# Establishing causation requires more than correlation
print("\nTo prove causation:")
print("1. Controlled experiment with randomization")
print("2. Control for confounding variables")
print("3. Temporal precedence (cause before effect)")
print("4. Dose-response relationship")
```
Thinking Process: Correlation is easy to measure but causation requires careful experimental design. Always consider confounding variables and alternative explanations.
Q4: Explain the normal distribution and its properties.
Answer:
Properties
- Symmetry: Mean = Median = Mode
- 68-95-99.7 Rule: Empirical rule for standard deviations
- Infinite support: Extends from -∞ to +∞
- Defined by two parameters: Mean (μ) and standard deviation (σ)
Probability Density Function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$$
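Before the fuller example below, a quick sanity check of this formula against `scipy.stats.norm.pdf` (the standard normal and test point x = 1 are arbitrary choices):

```python
import numpy as np
from scipy import stats

mu, sigma, x = 0.0, 1.0, 1.0

# Evaluate the PDF formula directly
manual = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
library = stats.norm.pdf(x, loc=mu, scale=sigma)

print(f"Manual: {manual:.6f}")  # 0.241971
print(f"scipy:  {library:.6f}")  # matches
```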
Python Example
```python
import numpy as np
from scipy import stats

# Generate normal distribution
mu, sigma = 100, 15  # Mean and standard deviation
data = np.random.normal(mu, sigma, 10000)

# Calculate properties
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)

print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Mean ≈ Median (symmetric): {abs(mean - median) < 0.5}")  # loose tolerance for sampling noise

# 68-95-99.7 rule verification
within_1sigma = np.sum(np.abs(data - mean) < std_dev) / len(data)
within_2sigma = np.sum(np.abs(data - mean) < 2 * std_dev) / len(data)
within_3sigma = np.sum(np.abs(data - mean) < 3 * std_dev) / len(data)

print("\n68-95-99.7 Rule:")
print(f"Within 1σ: {within_1sigma*100:.1f}% (expected 68%)")
print(f"Within 2σ: {within_2sigma*100:.1f}% (expected 95%)")
print(f"Within 3σ: {within_3sigma*100:.1f}% (expected 99.7%)")

# Calculate probabilities using scipy
prob_less_than_85 = stats.norm.cdf(85, loc=mu, scale=sigma)
prob_between_85_115 = stats.norm.cdf(115, loc=mu, scale=sigma) - stats.norm.cdf(85, loc=mu, scale=sigma)
print(f"\nP(X < 85) = {prob_less_than_85:.3f}")
print(f"P(85 < X < 115) = {prob_between_85_115:.3f}")

# Z-score calculation
z_score = (85 - mu) / sigma
print(f"Z-score for 85: {z_score:.2f}")
```
Thinking Process: Normal distribution is fundamental in statistics due to Central Limit Theorem. Many statistical tests assume normality. Always verify assumptions before applying tests.
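Since the Thinking Process leans on the Central Limit Theorem, a minimal sketch (the exponential population and sample sizes are arbitrary choices) showing that means of even heavily skewed samples approach normality:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Heavily skewed population: exponential with mean 1 (skewness = 2)
for n in [2, 10, 50]:
    sample_means = rng.exponential(1.0, size=(20_000, n)).mean(axis=1)
    print(f"n={n:3d}: skewness of sample means = {stats.skew(sample_means):.3f}")
# Skewness shrinks toward 0 (normality) as n grows, roughly like 2/sqrt(n)
```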
Q5: What is a confidence interval? How do you interpret it?
Answer:
Definition
A confidence interval is a range of values that likely contains the true population parameter, based on sample data.
Common Misinterpretation
❌ Wrong: "There's a 95% probability the true mean is in this interval."
✅ Correct: "95% of such intervals (if we repeated the study) would contain the true mean."
Calculation
For a mean with known population standard deviation: $$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
For a mean with unknown population standard deviation (t-distribution): $$\bar{x} \pm t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}$$
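A minimal sketch of the first (known-σ) formula, assuming σ = 15 is known and a hypothetical sample mean of 100; the fuller example below uses the t-based version:

```python
import numpy as np
from scipy import stats

sigma, n = 15, 30  # assumed known population std and sample size
sample_mean = 100  # hypothetical sample mean
z_critical = stats.norm.ppf(0.975)  # ~1.96 for a 95% interval

margin = z_critical * sigma / np.sqrt(n)
print(f"95% z-interval: [{sample_mean - margin:.2f}, {sample_mean + margin:.2f}]")
```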
Python Example
```python
import numpy as np
from scipy import stats

# Sample data
np.random.seed(42)
sample = np.random.normal(100, 15, 30)  # Sample size 30
sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)
n = len(sample)

# 95% confidence interval using the t-distribution (unknown population std)
confidence_level = 0.95
alpha = 1 - confidence_level
t_critical = stats.t.ppf(1 - alpha / 2, df=n - 1)
margin_error = t_critical * (sample_std / np.sqrt(n))

ci_lower = sample_mean - margin_error
ci_upper = sample_mean + margin_error

print(f"Sample mean: {sample_mean:.2f}")
print(f"Sample std: {sample_std:.2f}")
print(f"95% Confidence Interval: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"Margin of error: ±{margin_error:.2f}")

# Interpretation
print("\nInterpretation:")
print("We are 95% confident that the true population mean")
print(f"lies between {ci_lower:.2f} and {ci_upper:.2f}")

# Using scipy directly
ci = stats.t.interval(confidence_level, df=n - 1, loc=sample_mean, scale=stats.sem(sample))
print(f"\nUsing scipy: [{ci[0]:.2f}, {ci[1]:.2f}]")

# Effect of sample size on CI width (assuming std of 15)
for n_size in [10, 30, 100]:
    t_crit = stats.t.ppf(0.975, df=n_size - 1)
    margin = t_crit * 15 / np.sqrt(n_size)
    print(f"n={n_size}: CI width = {2 * margin:.2f}")
```
Thinking Process: Confidence intervals provide uncertainty estimates. Wider intervals indicate more uncertainty. Sample size, variability, and confidence level all affect interval width. Remember it's about the method, not a single interval.
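The "method, not a single interval" point is directly checkable. A short simulation sketch (1,000 repetitions and the N(100, 15) population are arbitrary choices) counting how often the 95% interval covers the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mu, sigma, n = 100, 15, 30
covered = 0

for _ in range(1000):
    sample = rng.normal(true_mu, sigma, n)
    ci_low, ci_high = stats.t.interval(
        0.95, df=n - 1, loc=sample.mean(), scale=stats.sem(sample)
    )
    covered += ci_low <= true_mu <= ci_high

print(f"Coverage: {covered / 1000:.1%}")  # close to 95%
```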
Q6: Explain probability vs conditional probability.
Answer:
Definitions
- Probability $P(A)$: Chance of event A occurring (unconditional)
- Conditional Probability $P(A|B)$: Probability of A given that B has occurred
Formula
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
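A quick sketch verifying the formula by brute-force counting over two dice (the choice of events A and B is arbitrary):

```python
from itertools import product

# All 36 equally likely outcomes of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))

A = {o for o in outcomes if sum(o) >= 10}  # event A: sum is at least 10
B = {o for o in outcomes if o[0] == 6}     # event B: first die shows 6

P_B = len(B) / 36
P_A_and_B = len(A & B) / 36
print(f"P(A|B) = P(A∩B)/P(B) = {P_A_and_B / P_B:.3f}")  # 3/6 = 0.500
```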
Python Example
```python
# Example: Disease testing
# Prevalence: 1% of the population has the disease
# Test sensitivity: 95% (P(positive | disease))
# Test specificity: 90% (P(negative | no disease))

P_disease = 0.01
P_no_disease = 1 - P_disease
P_positive_given_disease = 0.95
P_negative_given_no_disease = 0.90
P_positive_given_no_disease = 1 - P_negative_given_no_disease

# Unconditional probability of a positive test (law of total probability)
P_positive = (P_positive_given_disease * P_disease +
              P_positive_given_no_disease * P_no_disease)

print(f"Unconditional P(positive test): {P_positive:.3f}")

# Conditional probability: P(disease | positive test) via Bayes' theorem
P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive

print(f"\nConditional P(disease|positive test): {P_disease_given_positive:.3f}")
print("This is surprisingly low despite 95% sensitivity!")

# Bayes' theorem, step by step
print("\nBayes' Theorem breakdown:")
print("P(disease|positive) = P(positive|disease) × P(disease) / P(positive)")
print(f"                    = {P_positive_given_disease} × {P_disease} / {P_positive:.3f}")
print(f"                    = {P_disease_given_positive:.3f}")

# Why conditional probability matters
print("\nKey insight: low prevalence makes false positives more common")
print("than true positives, even with good test sensitivity!")
```
Thinking Process: Conditional probability updates beliefs based on new information. Bayes' theorem shows how prior knowledge combines with evidence. Always consider base rates (prevalence) when interpreting test results.
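The Bayes result above can also be verified by simulation. A minimal sketch (one million simulated patients is an arbitrary choice) using the same prevalence, sensitivity, and specificity:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000

# Simulate the disease-testing example above
has_disease = rng.random(N) < 0.01
positive = np.where(has_disease,
                    rng.random(N) < 0.95,  # sensitivity
                    rng.random(N) < 0.10)  # false-positive rate (1 - specificity)

empirical = has_disease[positive].mean()
print(f"Empirical P(disease|positive): {empirical:.3f}")  # ~0.087, matching Bayes
```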
Q7: What is sampling bias and how can you avoid it?
Answer:
Types of Sampling Bias
- Selection Bias: Non-random selection from population
- Voluntary Response Bias: Self-selected participants
- Survivorship Bias: Only considering successful/visible cases
- Convenience Sampling: Easy-to-reach participants
How to Avoid
- Random Sampling: Each member has equal chance
- Stratified Sampling: Ensure subgroups are represented
- Proper Sampling Frame: Complete list of population
- Adequate Sample Size: Reduce random error
Python Example
```python
import numpy as np
import pandas as pd

# Example: Survey bias
np.random.seed(42)

# True population: support for the policy increases with age,
# so samples that over-represent older people will be biased
population_size = 10000
age = np.random.beta(2, 5, population_size)  # 0..1, skewed toward younger
support_prob = np.clip(0.45 + 0.8 * (age - age.mean()), 0, 1)
population = np.random.binomial(1, support_prob)
true_support = population.mean()  # close to 45%

# 1. Random sample (unbiased)
random_sample = np.random.choice(population, size=100, replace=False)
print(f"True support: {true_support:.2%}")
print(f"Random sample estimate: {np.mean(random_sample):.2%}")

# 2. Convenience sample (biased - older people are more likely to respond)
response_probability = age * 0.3 + 0.05
responded = np.random.binomial(1, response_probability).astype(bool)
convenience_support = np.mean(population[responded])
print(f"Convenience sample estimate: {convenience_support:.2%} (BIASED)")

# 3. Stratified sampling (proportional allocation across age groups)
age_groups = pd.cut(age, bins=3, labels=['Young', 'Middle', 'Old'])
stratified_sample = []
for group in ['Young', 'Middle', 'Old']:
    group_values = population[np.asarray(age_groups == group)]
    n_stratum = max(1, round(100 * len(group_values) / population_size))
    stratified_sample.extend(np.random.choice(group_values, size=n_stratum, replace=False))
stratified_support = np.mean(stratified_sample)
print(f"Stratified sample estimate: {stratified_support:.2%}")

print(f"\nBias of convenience sampling: {abs(convenience_support - true_support):.2%}")
print(f"Bias of stratified sampling: {abs(stratified_support - true_support):.2%}")
```
Thinking Process: Bias creates systematic error that doesn't decrease with sample size. Random sampling is key, but may need stratification if subgroups differ. Always consider who is included/excluded from sample.
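The first sentence of the Thinking Process can be demonstrated directly. A short sketch (the population setup mirrors the example above; sample sizes are arbitrary) showing that random-sampling error shrinks with n while convenience-sample bias does not:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100_000
age = rng.beta(2, 5, N)
population = rng.binomial(1, np.clip(0.45 + 0.8 * (age - age.mean()), 0, 1))
true_support = population.mean()

# Response weights skewed toward older (more supportive) people
weights = age * 0.3 + 0.05
weights /= weights.sum()

for n in [100, 1000, 10000]:
    random_idx = rng.choice(N, size=n, replace=False)             # unbiased
    biased_idx = rng.choice(N, size=n, replace=False, p=weights)  # convenience-like
    print(f"n={n:5d}  random error: {abs(population[random_idx].mean() - true_support):.3f}  "
          f"convenience bias: {abs(population[biased_idx].mean() - true_support):.3f}")
```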
These fundamental concepts form the basis for more advanced statistical analysis and inference.