Statistics for Data Science
Statistics is the language of data. Every data science task — from summarising a dataset to evaluating a model — relies on statistical concepts. This topic covers descriptive statistics, probability, distributions, hypothesis testing, and correlation in the context of data science using Python.
Why Statistics Matters in Data Science
Statistics provides the mathematical tools to make sense of data. Without statistics, a data scientist cannot determine whether a pattern in data is real or just random noise. Statistics also helps make predictions with measurable confidence levels — not just guesses.
Descriptive Statistics
Descriptive statistics summarise a dataset by describing its centre, spread, and shape. They answer: "What does the data look like?"
Measures of Central Tendency
| Measure | Definition | Best Used When |
|---|---|---|
| Mean | Sum of all values divided by count | Data has no extreme outliers |
| Median | Middle value when data is sorted | Data has outliers or is skewed |
| Mode | Most frequently occurring value | Categorical or discrete data |
import numpy as np
import pandas as pd
from scipy import stats
# Monthly salaries of 10 employees (in thousands)
salaries = [32, 35, 38, 40, 42, 45, 47, 50, 55, 200]
mean = np.mean(salaries)
median = np.median(salaries)
mode = stats.mode(salaries, keepdims=True).mode[0]
print(f"Mean: ₹{mean:.1f}K") # Pulled up by 200K outlier
print(f"Median: ₹{median:.1f}K") # Not affected by outlier
print(f"Mode: ₹{mode}K")
Output:
Mean: ₹58.4K    ← Distorted by the ₹200K outlier
Median: ₹43.5K  ← More representative of the typical salary
Mode: ₹32K
Diagram – Mean vs Median Under Outlier Influence
Salary Data:
[32, 35, 38, 40, 42, 45, 47, 50, 55, 200]
Number line:
  32       40  44  50                  200
  ├─────────┼───┼───┼────────────────────┤
                ↑       ↑
             Median    Mean
             (43.5)    (58.4)
The ₹200K outlier pulls the mean far right.
The median stays in the centre of the actual cluster.
→ Use median when data has extreme values.
Measures of Spread (Variability)
Spread measures how widely values are scattered around the centre. Two datasets can have the same mean but very different spreads.
# Two exam result groups with same mean but different spreads
class_A = np.array([60, 62, 65, 70, 68, 66, 69]) # Consistent
class_B = np.array([40, 50, 90, 95, 30, 85, 65]) # Unpredictable
print("Class A – Mean:", np.mean(class_A).round(1), "| Std:", np.std(class_A).round(1))
print("Class B – Mean:", np.mean(class_B).round(1), "| Std:", np.std(class_B).round(1))
Output:
Class A – Mean: 65.7 | Std: 3.4   ← Tight, predictable scores
Class B – Mean: 65.0 | Std: 23.9  ← Wildly spread, unpredictable
Key Spread Measures
data = np.array([10, 12, 15, 18, 20, 22, 25, 30])
print("Range :", data.max() - data.min()) # Max - Min
print("Variance :", np.var(data).round(2)) # Avg squared deviation
print("Std Dev :", np.std(data).round(2)) # Square root of variance
print("IQR :", np.percentile(data, 75) -
np.percentile(data, 25)) # Q3 - Q1
print("CV (%) :", (np.std(data)/np.mean(data)*100).round(1)) # Relative spread
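One subtlety worth flagging here: `np.std()` and `np.var()` default to the population formula (`ddof=0`), while pandas' `.std()` defaults to the sample formula (`ddof=1`), so the two libraries can report different values for the same data. A small sketch on the same array:

```python
import numpy as np
import pandas as pd

data = np.array([10, 12, 15, 18, 20, 22, 25, 30])

pop_std = np.std(data)            # NumPy default: ddof=0 (population formula)
samp_std = np.std(data, ddof=1)   # sample formula: divides by n - 1
pd_std = pd.Series(data).std()    # pandas defaults to ddof=1

print("Population std (ddof=0):", pop_std.round(2))
print("Sample std     (ddof=1):", samp_std.round(2))
print("pandas .std()          :", round(pd_std, 2))
```

For small samples the difference is noticeable; pass `ddof=1` to NumPy (or `ddof=0` to pandas) when you need the two to agree.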
Percentiles and the Five-Number Summary
scores = np.array([45, 52, 58, 61, 65, 68, 72, 75, 80, 85, 90, 95])
P25 = np.percentile(scores, 25) # Q1
P50 = np.percentile(scores, 50) # Median (Q2)
P75 = np.percentile(scores, 75) # Q3
print("Min :", scores.min())
print("Q1 :", P25)
print("Median :", P50)
print("Q3 :", P75)
print("Max :", scores.max())
print("IQR :", P75 - P25)
Diagram – Five-Number Summary (Box Plot View)
        ├────────[━━━━━━|━━━━━━━]────────┤
        ↑        ↑      ↑       ↑        ↑
       Min      Q1    Median   Q3       Max

Between Q1 and Q3 = 50% of all data (IQR)
Whiskers outside the box = rest of the data (not necessarily outliers)
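Box plots are also the usual entry point for flagging outliers. A common convention (not stated above, but standard practice) marks any value more than 1.5 × IQR beyond the quartiles as an outlier. Applied to the earlier salary data:

```python
import numpy as np

salaries = np.array([32, 35, 38, 40, 42, 45, 47, 50, 55, 200])

q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # values below this are flagged
upper_fence = q3 + 1.5 * iqr   # values above this are flagged

outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]
print("Q1:", q1, "| Q3:", q3, "| IQR:", iqr)
print("Fences:", lower_fence, "to", upper_fence)
print("Outliers:", outliers)   # the ₹200K salary is flagged
```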
Probability Basics
Probability measures how likely an event is to occur. It ranges from 0 (impossible) to 1 (certain).
# Probability of rolling a 6 on a fair die
P_six = 1 / 6
print(f"P(rolling 6) = {P_six:.4f}") # 0.1667
# Probability of drawing a heart from a deck of cards
P_heart = 13 / 52
print(f"P(heart) = {P_heart:.4f}") # 0.25
# Probability of two independent events both happening
P_A = 0.6 # Rain today
P_B = 0.4 # Train delay
P_AB = P_A * P_B
print(f"P(rain AND delay) = {P_AB:.2f}") # 0.24
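These theoretical values can be sanity-checked by simulation: with enough trials, observed frequencies converge to the underlying probabilities. A sketch using NumPy's random generator (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=100_000)   # 100,000 fair-die rolls (values 1–6)
p_six_est = np.mean(rolls == 6)            # fraction of rolls that came up 6

print(f"Simulated P(six) ≈ {p_six_est:.4f}  (theory: {1/6:.4f})")
```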
Normal Distribution
The normal distribution (bell curve) is the most important distribution in statistics. Many real-world measurements — heights, exam scores, measurement errors — follow this pattern. The area under the curve represents probability.
import matplotlib.pyplot as plt
from scipy.stats import norm
# Plot a normal distribution
x = np.linspace(-4, 4, 200)
pdf = norm.pdf(x, loc=0, scale=1) # Mean=0, Std=1
plt.figure(figsize=(8, 4))
plt.plot(x, pdf, color="steelblue", linewidth=2)
plt.fill_between(x, pdf, where=(x >= -1) & (x <= 1), alpha=0.3, color="steelblue", label="68%")
plt.fill_between(x, pdf, where=(x >= -2) & (x <= 2), alpha=0.2, color="orange", label="95%")
plt.fill_between(x, pdf, where=(x >= -3) & (x <= 3), alpha=0.1, color="red", label="99.7%")
plt.title("Standard Normal Distribution – Empirical Rule")
plt.legend()
plt.tight_layout()
plt.savefig("normal_dist.png")
plt.show()
The Empirical Rule (68-95-99.7)
68% of data
┌─────────────┐
│ │
95% of data │
┌────┴─────────────┴────┐
│ │
99.7% of data │
┌──┴───────────────────┴──┐
│ │
──┼────┼────┼────┼────┼────┼────┼──
 -3σ  -2σ  -1σ   μ   +1σ  +2σ  +3σ
μ = mean, σ = standard deviation
• 68% of values fall within 1σ of the mean
• 95% of values fall within 2σ of the mean
• 99.7% of values fall within 3σ of the mean
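The three percentages are not arbitrary; they fall straight out of the cumulative distribution function of the standard normal:

```python
from scipy.stats import norm

for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)   # P(μ - kσ < X < μ + kσ)
    print(f"Within ±{k}σ: {p:.4f}")
```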
Skewness and Kurtosis
import pandas as pd
data = pd.Series([15, 18, 20, 22, 25, 28, 30, 35, 50, 80, 150])
print("Mean :", data.mean().round(2))
print("Median :", data.median())
print("Skewness:", data.skew().round(3))
print("Kurtosis:", data.kurt().round(3))
# Positive skew → tail on the right (income, house prices)
# Negative skew → tail on the left (exam scores near 100)
# High kurtosis → heavier tails than a normal distribution (more extreme values)
Hypothesis Testing
Hypothesis testing determines whether an observation in data is statistically significant or just due to random chance. It answers: "Is the difference real, or did it happen by luck?"
The Hypothesis Testing Framework
Step 1: State the Null Hypothesis (H₀)
"There is NO significant difference or effect."
Step 2: State the Alternative Hypothesis (H₁)
"There IS a significant difference or effect."
Step 3: Choose a significance level (α)
α = 0.05 (accepting a 5% risk of rejecting a true H₀ – the standard choice)
Step 4: Calculate the test statistic and p-value
Step 5: Make a decision
p < α → Reject H₀ → Difference is significant
p ≥ α → Fail to reject H₀ → No significant difference
T-Test – Comparing Two Groups
from scipy import stats
# Did a new training programme improve scores?
before_training = [72, 68, 74, 70, 65, 71, 69, 73]
after_training = [80, 78, 82, 79, 75, 83, 77, 81]
t_stat, p_value = stats.ttest_rel(before_training, after_training)
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.5f}")
alpha = 0.05
if p_value < alpha:
print("Result: Reject H₀ → Training significantly improved scores")
else:
print("Result: Fail to reject H₀ → No significant improvement")
Output:
T-statistic: -17.705
P-value: 0.00000
Result: Reject H₀ → Training significantly improved scores
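`ttest_rel` is for paired measurements on the same subjects (before/after). When the two groups are unrelated, SciPy's independent-samples test applies instead; a sketch on hypothetical class data (`equal_var=False` gives Welch's test, which does not assume equal variances):

```python
from scipy import stats

# Hypothetical scores from two unrelated classes (illustrative data)
class_x = [72, 68, 74, 70, 65, 71, 69, 73]
class_y = [80, 78, 82, 79, 75, 83, 77, 81]

# Welch's independent-samples t-test
t_stat, p_value = stats.ttest_ind(class_x, class_y, equal_var=False)
print(f"T-statistic: {t_stat:.3f} | P-value: {p_value:.5f}")

if p_value < 0.05:
    print("Result: Reject H₀ → The two classes differ significantly")
```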
Chi-Square Test – Categorical Variables
# Does customer preference depend on age group?
# Observed counts
observed = np.array([
[30, 20, 10], # Young customers: Product A, B, C
[15, 25, 20], # Middle-aged customers
[10, 15, 30] # Older customers
])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-Square: {chi2:.3f}")
print(f"P-value: {p:.4f}")
print(f"Degrees of Freedom: {dof}")
if p < 0.05:
print("Result: Product preference DOES depend on age group")
else:
print("Result: No significant relationship between age and preference")
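Under the hood, `chi2_contingency` compares the observed table against the counts expected if the two variables were independent (row total × column total ÷ grand total). Inspecting those expected counts shows where the observed data deviates most:

```python
import numpy as np
from scipy import stats

observed = np.array([
    [30, 20, 10],
    [15, 25, 20],
    [10, 15, 30]
])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print("Expected counts under independence:")
print(np.round(expected, 1))
# e.g. top-left cell: 60 * 55 / 175 ≈ 18.9 expected, versus 30 observed
```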
Correlation vs Causation
Correlation measures how strongly two variables move together. Causation means one variable directly causes the other. Correlation does not imply causation — two variables can move together for completely unrelated reasons.
# Calculate Pearson correlation coefficient
from scipy.stats import pearsonr
study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
exam_scores = np.array([55, 58, 65, 68, 72, 75, 80, 83, 87, 90])
r, p_value = pearsonr(study_hours, exam_scores)
print(f"Pearson r : {r:.3f}")
print(f"P-value : {p_value:.5f}")
print(f"r² (R²) : {r**2:.3f}") # % of variance explained
Output:
Pearson r : 0.997
P-value   : 0.00000
r² (R²)   : 0.994
→ Very strong positive correlation
→ Study hours explain 99.4% of score variation
Classic Correlation vs Causation Example
CORRELATION:
  Ice cream sales and drowning rates are both high in summer.
      ↓
  Does ice cream cause drowning? NO.
  Both are driven by a THIRD variable: hot weather (a confounding variable).

CAUSATION:
  More study hours → Higher exam scores
  (Experimentally confirmed, with controls)
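One caveat on Pearson's r: it measures only *linear* association. For relationships that are monotonic but curved, Spearman's rank correlation is the usual complement; a small sketch:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = x ** 3   # perfectly monotonic, but strongly curved

pearson_r = pearsonr(x, y)[0]     # below 1 – penalised by the curvature
spearman_r = spearmanr(x, y)[0]   # 1.0 – the ranks agree perfectly

print(f"Pearson r  : {pearson_r:.3f}")
print(f"Spearman ρ : {spearman_r:.3f}")
```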
Confidence Intervals
A confidence interval gives a range of values that likely contains the true population value. A 95% confidence interval means: if the study were repeated many times, about 95% of the intervals constructed this way would contain the true value.
# 95% Confidence Interval for average delivery time
delivery_times = np.array([23, 25, 28, 22, 27, 24, 26, 29, 21, 25])
mean = np.mean(delivery_times)
se = stats.sem(delivery_times) # Standard error of mean
ci = stats.t.interval(0.95, df=len(delivery_times)-1, loc=mean, scale=se)
print(f"Sample Mean : {mean:.2f} hours")
print(f"95% Confidence Interval: {ci[0]:.2f} – {ci[1]:.2f} hours")
Output:
Sample Mean : 25.00 hours
95% Confidence Interval: 23.15 – 26.85 hours
→ The true average delivery time is between 23.15 and 26.85 hours with 95% confidence.
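The interval can also be reproduced by hand from the formula mean ± t₍crit₎ × SE, which makes the role of each ingredient explicit:

```python
import numpy as np
from scipy import stats

delivery_times = np.array([23, 25, 28, 22, 27, 24, 26, 29, 21, 25])

mean = np.mean(delivery_times)
se = stats.sem(delivery_times)                           # s / √n (sample std, ddof=1)
t_crit = stats.t.ppf(0.975, df=len(delivery_times) - 1)  # two-sided 95% → 0.975 quantile

lower = mean - t_crit * se
upper = mean + t_crit * se
print(f"t critical : {t_crit:.3f}")
print(f"95% CI     : {lower:.2f} – {upper:.2f} hours")
```

A wider interval means more uncertainty: it grows with the sample's spread and shrinks as n increases.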
Key Statistics Summary Table
| Concept | Python Function | What It Tells |
|---|---|---|
| Mean | np.mean() | Average value |
| Median | np.median() | Middle value (robust to outliers) |
| Mode | stats.mode() | Most frequent value |
| Standard Deviation | np.std() | Average distance from mean |
| Variance | np.var() | Squared average distance from mean |
| Percentile | np.percentile() | Value below which N% of data falls |
| Skewness | data.skew() | Left or right tail of distribution |
| Pearson r | pearsonr() | Linear correlation strength |
| T-test | stats.ttest_rel() | Significant difference between two groups |
| Chi-square | chi2_contingency() | Association between categorical variables |
| Confidence Interval | stats.t.interval() | Plausible range for true population value |
Summary
- Mean, median, and mode describe the centre of a dataset — choose based on skewness and outliers
- Standard deviation and IQR measure how spread out values are from the centre
- The normal distribution describes many real-world datasets; 68-95-99.7% rule defines its spread
- Hypothesis testing uses p-values to determine whether observed differences are statistically significant
- A p-value below 0.05 means the observed result would be unlikely if the null hypothesis were true
- Correlation measures the strength of relationship between two variables
- Correlation does not prove causation — always look for confounding variables
- Confidence intervals give a range for the true population parameter
