Statistics for Data Science
Statistics is the language of data. Every data science task — from summarising a dataset to evaluating a model — relies on statistical concepts. This topic covers descriptive statistics, probability, distributions, hypothesis testing, and correlation in the context of data science using Python.
Why Statistics Matters in Data Science
Statistics provides the mathematical tools to make sense of data. Without statistics, a data scientist cannot determine whether a pattern in data is real or just random noise. Statistics also helps make predictions with measurable confidence levels — not just guesses.
Descriptive Statistics
Descriptive statistics summarise a dataset by describing its centre, spread, and shape. They answer: "What does the data look like?"
Measures of Central Tendency
| Measure | Definition | Best Used When |
|---|---|---|
| Mean | Sum of all values divided by count | Data has no extreme outliers |
| Median | Middle value when data is sorted | Data has outliers or is skewed |
| Mode | Most frequently occurring value | Categorical or discrete data |
import numpy as np
import pandas as pd
from scipy import stats
# Monthly salaries of 10 employees (in thousands)
salaries = [32, 35, 38, 40, 42, 45, 47, 50, 55, 200]
mean = np.mean(salaries)
median = np.median(salaries)
mode = stats.mode(salaries, keepdims=True).mode[0]
print(f"Mean: ₹{mean:.1f}K") # Pulled up by 200K outlier
print(f"Median: ₹{median:.1f}K") # Not affected by outlier
print(f"Mode: ₹{mode}K")
Output:
Mean: ₹58.4K    ← Distorted by the ₹200K outlier
Median: ₹43.5K  ← More representative of the typical salary
Mode: ₹32K
Diagram – Mean vs Median Under Outlier Influence
Salary Data:
[32, 35, 38, 40, 42, 45, 47, 50, 55, 200]
Number line:
  32       40  44  50                  200
  ├─────────┼───┼───┼────────────────────┤
                ↑       ↑
             Median    Mean
             (43.5)    (58.4)
The ₹200K outlier pulls the mean far right.
The median stays in the centre of the actual cluster.
→ Use median when data has extreme values.
Measures of Spread (Variability)
Spread measures how widely values are scattered around the centre. Two datasets can have the same mean but very different spreads.
# Two exam result groups with same mean but different spreads
class_A = np.array([60, 62, 65, 70, 68, 66, 69]) # Consistent
class_B = np.array([40, 50, 90, 95, 30, 85, 65]) # Unpredictable
print("Class A – Mean:", np.mean(class_A).round(1), "| Std:", np.std(class_A).round(1))
print("Class B – Mean:", np.mean(class_B).round(1), "| Std:", np.std(class_B).round(1))
Output:
Class A – Mean: 65.7 | Std: 3.4   ← Tight, predictable scores
Class B – Mean: 65.0 | Std: 23.9  ← Wildly spread, unpredictable
Key Spread Measures
data = np.array([10, 12, 15, 18, 20, 22, 25, 30])
print("Range :", data.max() - data.min()) # Max - Min
print("Variance :", np.var(data).round(2)) # Avg squared deviation
print("Std Dev :", np.std(data).round(2)) # Square root of variance
print("IQR :", np.percentile(data, 75) -
np.percentile(data, 25)) # Q3 - Q1
print("CV (%) :", (np.std(data)/np.mean(data)*100).round(1)) # Relative spread
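One subtlety worth flagging here: `np.std()` and `np.var()` default to the population formula (`ddof=0`), while pandas' `.std()` defaults to the sample formula (`ddof=1`), so the two libraries can report different values for the same data. A small sketch on the same array:

```python
import numpy as np
import pandas as pd

data = np.array([10, 12, 15, 18, 20, 22, 25, 30])

pop_std = np.std(data)            # NumPy default: ddof=0 (population formula)
samp_std = np.std(data, ddof=1)   # sample formula: divides by n - 1
pd_std = pd.Series(data).std()    # pandas defaults to ddof=1

print("Population std (ddof=0):", pop_std.round(2))
print("Sample std     (ddof=1):", samp_std.round(2))
print("pandas .std()          :", round(pd_std, 2))
```

For small samples the difference is noticeable; pass `ddof=1` to NumPy (or `ddof=0` to pandas) when you need the two to agree.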
Percentiles and the Five-Number Summary
scores = np.array([45, 52, 58, 61, 65, 68, 72, 75, 80, 85, 90, 95])
P25 = np.percentile(scores, 25) # Q1
P50 = np.percentile(scores, 50) # Median (Q2)
P75 = np.percentile(scores, 75) # Q3
print("Min :", scores.min())
print("Q1 :", P25)
print("Median :", P50)
print("Q3 :", P75)
print("Max :", scores.max())
print("IQR :", P75 - P25)
Diagram – Five-Number Summary (Box Plot View)
        ├────────[━━━━━━|━━━━━━━]────────┤
        ↑        ↑      ↑       ↑        ↑
       Min      Q1    Median   Q3       Max

Between Q1 and Q3 = 50% of all data (IQR)
Whiskers outside the box = rest of the data (not necessarily outliers)
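Box plots are also the usual entry point for flagging outliers. A common convention (not stated above, but standard practice) marks any value more than 1.5 × IQR beyond the quartiles as an outlier. Applied to the earlier salary data:

```python
import numpy as np

salaries = np.array([32, 35, 38, 40, 42, 45, 47, 50, 55, 200])

q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # values below this are flagged
upper_fence = q3 + 1.5 * iqr   # values above this are flagged

outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]
print("Q1:", q1, "| Q3:", q3, "| IQR:", iqr)
print("Fences:", lower_fence, "to", upper_fence)
print("Outliers:", outliers)   # the ₹200K salary is flagged
```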
Probability Basics
Probability measures how likely an event is to occur. It ranges from 0 (impossible) to 1 (certain).
# Probability of rolling a 6 on a fair die
P_six = 1 / 6
print(f"P(rolling 6) = {P_six:.4f}") # 0.1667
# Probability of drawing a heart from a deck of cards
P_heart = 13 / 52
print(f"P(heart) = {P_heart:.4f}") # 0.25
# Probability of two independent events both happening
P_A = 0.6 # Rain today
P_B = 0.4 # Train delay
P_AB = P_A * P_B
print(f"P(rain AND delay) = {P_AB:.2f}") # 0.24
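These theoretical values can be sanity-checked by simulation: with enough trials, observed frequencies converge to the underlying probabilities. A sketch using NumPy's random generator (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=100_000)   # 100,000 fair-die rolls (values 1–6)
p_six_est = np.mean(rolls == 6)            # fraction of rolls that came up 6

print(f"Simulated P(six) ≈ {p_six_est:.4f}  (theory: {1/6:.4f})")
```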
Normal Distribution
The normal distribution (bell curve) is the most important distribution in statistics. Many real-world measurements — heights, exam scores, measurement errors — follow this pattern. The area under the curve represents probability.
import matplotlib.pyplot as plt
from scipy.stats import norm
# Plot a normal distribution
x = np.linspace(-4, 4, 200)
pdf = norm.pdf(x, loc=0, scale=1) # Mean=0, Std=1
plt.figure(figsize=(8, 4))
plt.plot(x, pdf, color="steelblue", linewidth=2)
plt.fill_between(x, pdf, where=(x >= -1) & (x <= 1), alpha=0.3, color="steelblue", label="68%")
plt.fill_between(x, pdf, where=(x >= -2) & (x <= 2), alpha=0.2, color="orange", label="95%")
plt.fill_between(x, pdf, where=(x >= -3) & (x <= 3), alpha=0.1, color="red", label="99.7%")
plt.title("Standard Normal Distribution – Empirical Rule")
plt.legend()
plt.tight_layout()
plt.savefig("normal_dist.png")
plt.show()
The Empirical Rule (68-95-99.7)
68% of data
┌─────────────┐
│ │
95% of data │
┌────┴─────────────┴────┐
│ │
99.7% of data │
┌──┴───────────────────┴──┐
│ │
──┼────┼────┼────┼────┼────┼────┼──
 -3σ  -2σ  -1σ   μ   +1σ  +2σ  +3σ
μ = mean, σ = standard deviation
• 68% of values fall within 1σ of the mean
• 95% of values fall within 2σ of the mean
• 99.7% of values fall within 3σ of the mean
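The three percentages are not arbitrary; they fall straight out of the cumulative distribution function of the standard normal:

```python
from scipy.stats import norm

for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)   # P(μ - kσ < X < μ + kσ)
    print(f"Within ±{k}σ: {p:.4f}")
```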
Skewness and Kurtosis
import pandas as pd
data = pd.Series([15, 18, 20, 22, 25, 28, 30, 35, 50, 80, 150])
print("Mean :", data.mean().round(2))
print("Median :", data.median())
print("Skewness:", data.skew().round(3))
print("Kurtosis:", data.kurt().round(3))
# Positive skew → tail on the right (income, house prices)
# Negative skew → tail on the left (exam scores near 100)
# High kurtosis → heavier tails than a normal distribution (more extreme values)
Hypothesis Testing
Hypothesis testing determines whether an observation in data is statistically significant or just due to random chance. It answers: "Is the difference real, or did it happen by luck?"
The Hypothesis Testing Framework
Step 1: State the Null Hypothesis (H₀)
"There is NO significant difference or effect."
Step 2: State the Alternative Hypothesis (H₁)
"There IS a significant difference or effect."
Step 3: Choose a significance level (α)
α = 0.05 (accepting a 5% risk of rejecting a true H₀ – the standard choice)
Step 4: Calculate the test statistic and p-value
Step 5: Make a decision
p < α → Reject H₀ → Difference is significant
p ≥ α → Fail to reject H₀ → No significant difference
T-Test – Comparing Two Groups
from scipy import stats
# Did a new training programme improve scores?
before_training = [72, 68, 74, 70, 65, 71, 69, 73]
after_training = [80, 78, 82, 79, 75, 83, 77, 81]
t_stat, p_value = stats.ttest_rel(before_training, after_training)
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.5f}")
alpha = 0.05
if p_value < alpha:
print("Result: Reject H₀ → Training significantly improved scores")
else:
print("Result: Fail to reject H₀ → No significant improvement")
Output:
T-statistic: -17.705
P-value: 0.00000
Result: Reject H₀ → Training significantly improved scores
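`ttest_rel` is for paired measurements on the same subjects (before/after). When the two groups are unrelated, SciPy's independent-samples test applies instead; a sketch on hypothetical class data (`equal_var=False` gives Welch's test, which does not assume equal variances):

```python
from scipy import stats

# Hypothetical scores from two unrelated classes (illustrative data)
class_x = [72, 68, 74, 70, 65, 71, 69, 73]
class_y = [80, 78, 82, 79, 75, 83, 77, 81]

# Welch's independent-samples t-test
t_stat, p_value = stats.ttest_ind(class_x, class_y, equal_var=False)
print(f"T-statistic: {t_stat:.3f} | P-value: {p_value:.5f}")

if p_value < 0.05:
    print("Result: Reject H₀ → The two classes differ significantly")
```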
Chi-Square Test – Categorical Variables
# Does customer preference depend on age group?
# Observed counts
observed = np.array([
[30, 20, 10], # Young customers: Product A, B, C
[15, 25, 20], # Middle-aged customers
[10, 15, 30] # Older customers
])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-Square: {chi2:.3f}")
print(f"P-value: {p:.4f}")
print(f"Degrees of Freedom: {dof}")
if p < 0.05:
print("Result: Product preference DOES depend on age group")
else:
print("Result: No significant relationship between age and preference")
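Under the hood, `chi2_contingency` compares the observed table against the counts expected if the two variables were independent (row total × column total ÷ grand total). Inspecting those expected counts shows where the observed data deviates most:

```python
import numpy as np
from scipy import stats

observed = np.array([
    [30, 20, 10],
    [15, 25, 20],
    [10, 15, 30]
])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print("Expected counts under independence:")
print(np.round(expected, 1))
# e.g. top-left cell: 60 * 55 / 175 ≈ 18.9 expected, versus 30 observed
```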
Correlation vs Causation
Correlation measures how strongly two variables move together. Causation means one variable directly causes the other. Correlation does not imply causation — two variables can move together for completely unrelated reasons.
# Calculate Pearson correlation coefficient
from scipy.stats import pearsonr
study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
exam_scores = np.array([55, 58, 65, 68, 72, 75, 80, 83, 87, 90])
r, p_value = pearsonr(study_hours, exam_scores)
print(f"Pearson r : {r:.3f}")
print(f"P-value : {p_value:.5f}")
print(f"r² (R²) : {r**2:.3f}") # % of variance explained
Output:
Pearson r : 0.997
P-value   : 0.00000
r² (R²)   : 0.994
→ Very strong positive correlation
→ Study hours explain 99.4% of score variation
Classic Correlation vs Causation Example
CORRELATION:
  Ice cream sales and drowning rates are both high in summer.
      ↓
  Does ice cream cause drowning? NO.
  Both are driven by a THIRD variable: hot weather (a confounding variable).

CAUSATION:
  More study hours → Higher exam scores
  (Experimentally confirmed, with controls)
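One caveat on Pearson's r: it measures only *linear* association. For relationships that are monotonic but curved, Spearman's rank correlation is the usual complement; a small sketch:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = x ** 3   # perfectly monotonic, but strongly curved

pearson_r = pearsonr(x, y)[0]     # below 1 – penalised by the curvature
spearman_r = spearmanr(x, y)[0]   # 1.0 – the ranks agree perfectly

print(f"Pearson r  : {pearson_r:.3f}")
print(f"Spearman ρ : {spearman_r:.3f}")
```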
Confidence Intervals
A confidence interval gives a range of values that likely contains the true population value. A 95% confidence interval means: if the study were repeated many times, about 95% of the intervals constructed this way would contain the true value.
# 95% Confidence Interval for average delivery time
delivery_times = np.array([23, 25, 28, 22, 27, 24, 26, 29, 21, 25])
mean = np.mean(delivery_times)
se = stats.sem(delivery_times) # Standard error of mean
ci = stats.t.interval(0.95, df=len(delivery_times)-1, loc=mean, scale=se)
print(f"Sample Mean : {mean:.2f} hours")
print(f"95% Confidence Interval: {ci[0]:.2f} – {ci[1]:.2f} hours")
Output:
Sample Mean : 25.00 hours
95% Confidence Interval: 23.15 – 26.85 hours
→ The true average delivery time is between 23.15 and 26.85 hours with 95% confidence.
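The interval can also be reproduced by hand from the formula mean ± t₍crit₎ × SE, which makes the role of each ingredient explicit:

```python
import numpy as np
from scipy import stats

delivery_times = np.array([23, 25, 28, 22, 27, 24, 26, 29, 21, 25])

mean = np.mean(delivery_times)
se = stats.sem(delivery_times)                           # s / √n (sample std, ddof=1)
t_crit = stats.t.ppf(0.975, df=len(delivery_times) - 1)  # two-sided 95% → 0.975 quantile

lower = mean - t_crit * se
upper = mean + t_crit * se
print(f"t critical : {t_crit:.3f}")
print(f"95% CI     : {lower:.2f} – {upper:.2f} hours")
```

A wider interval means more uncertainty: it grows with the sample's spread and shrinks as n increases.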
Key Statistics Summary Table
| Concept | Python Function | What It Tells |
|---|---|---|
| Mean | np.mean() | Average value |
| Median | np.median() | Middle value (robust to outliers) |
| Mode | stats.mode() | Most frequent value |
| Standard Deviation | np.std() | Average distance from mean |
| Variance | np.var() | Squared average distance from mean |
| Percentile | np.percentile() | Value below which N% of data falls |
| Skewness | data.skew() | Left or right tail of distribution |
| Pearson r | pearsonr() | Linear correlation strength |
| T-test | stats.ttest_rel() | Significant difference between two groups |
| Chi-square | chi2_contingency() | Association between categorical variables |
| Confidence Interval | stats.t.interval() | Plausible range for true population value |
Summary
- Mean, median, and mode describe the centre of a dataset — choose based on skewness and outliers
- Standard deviation and IQR measure how spread out values are from the centre
- The normal distribution describes many real-world datasets; 68-95-99.7% rule defines its spread
- Hypothesis testing uses p-values to determine whether observed differences are statistically significant
- A p-value below 0.05 means the observed result would be unlikely if the null hypothesis were true
- Correlation measures the strength of relationship between two variables
- Correlation does not prove causation — always look for confounding variables
- Confidence intervals give a range for the true population parameter
