Statistics for Data Science

Statistics is the language of data. Every data science task — from summarising a dataset to evaluating a model — relies on statistical concepts. This topic covers descriptive statistics, probability, distributions, hypothesis testing, and correlation in the context of data science using Python.

Why Statistics Matters in Data Science

Statistics provides the mathematical tools to make sense of data. Without statistics, a data scientist cannot determine whether a pattern in data is real or just random noise. Statistics also helps make predictions with measurable confidence levels — not just guesses.

Descriptive Statistics

Descriptive statistics summarise a dataset by describing its centre, spread, and shape. They answer: "What does the data look like?"

Measures of Central Tendency

Measure   Definition                           Best Used When
Mean      Sum of all values divided by count   Data has no extreme outliers
Median    Middle value when data is sorted     Data has outliers or is skewed
Mode      Most frequently occurring value      Categorical or discrete data
import numpy as np
import pandas as pd
from scipy import stats

# Monthly salaries of 10 employees (in thousands)
salaries = [32, 35, 38, 40, 42, 45, 47, 50, 55, 200]

mean   = np.mean(salaries)
median = np.median(salaries)
mode   = stats.mode(salaries, keepdims=True).mode[0]

print(f"Mean:   ₹{mean:.1f}K")     # Pulled up by 200K outlier
print(f"Median: ₹{median:.1f}K")   # Not affected by outlier
print(f"Mode:   ₹{mode}K")

Output:

Mean:   ₹58.4K   ← Distorted by the ₹200K outlier
Median: ₹43.5K   ← More representative of the typical salary
Mode:   ₹32K

Diagram – Mean vs Median Under Outlier Influence

Salary Data:
[32, 35, 38, 40, 42, 45, 47, 50, 55, 200]

Number line:
  32    40    44   50       200
  ├─────┼──────┼────┼─────────┤
             ↑                  ↑
          Median              Mean
          (43.5)              (58.4)

The ₹200K outlier pulls the mean far right.
The median stays in the centre of the actual cluster.
→ Use median when data has extreme values.

Measures of Spread (Variability)

Spread measures how widely values are scattered around the centre. Two datasets can have the same mean but very different spreads.

# Two exam result groups with same mean but different spreads
class_A = np.array([60, 62, 65, 70, 68, 66, 69])   # Consistent
class_B = np.array([40, 50, 90, 95, 30, 85, 65])   # Unpredictable

print("Class A – Mean:", np.mean(class_A).round(1), "| Std:", np.std(class_A).round(1))
print("Class B – Mean:", np.mean(class_B).round(1), "| Std:", np.std(class_B).round(1))

Output:

Class A – Mean: 65.7 | Std:  3.4   ← Tight, predictable scores
Class B – Mean: 65.0 | Std: 23.9   ← Widely spread, unpredictable

Key Spread Measures

data = np.array([10, 12, 15, 18, 20, 22, 25, 30])

print("Range     :", data.max() - data.min())          # Max - Min
print("Variance  :", np.var(data).round(2))            # Avg squared deviation
print("Std Dev   :", np.std(data).round(2))            # Square root of variance
print("IQR       :", np.percentile(data, 75) -
                    np.percentile(data, 25))            # Q3 - Q1
print("CV (%)    :", (np.std(data)/np.mean(data)*100).round(1))  # Relative spread

Percentiles and the Five-Number Summary

scores = np.array([45, 52, 58, 61, 65, 68, 72, 75, 80, 85, 90, 95])

P25 = np.percentile(scores, 25)   # Q1
P50 = np.percentile(scores, 50)   # Median (Q2)
P75 = np.percentile(scores, 75)   # Q3

print("Min    :", scores.min())
print("Q1     :", P25)
print("Median :", P50)
print("Q3     :", P75)
print("Max    :", scores.max())
print("IQR    :", P75 - P25)

Diagram – Five-Number Summary (Box Plot View)

|────────[━━━━━|━━━━━━━]────────|
↑        ↑     ↑       ↑        ↑
Min      Q1  Median   Q3       Max

Between Q1 and Q3 = 50% of all data (IQR)
Lines outside the box = whiskers reaching the rest of the data
(standard box plots stop the whiskers at 1.5×IQR from the
quartiles and draw points beyond them as individual outliers)
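Standard box plots flag outliers using Tukey's 1.5×IQR fences. The sketch below (illustrative scores with one planted outlier) computes them directly:

```python
import numpy as np

data = np.array([45, 52, 58, 61, 65, 68, 72, 75, 80, 85, 90, 180])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Tukey's fences: points beyond 1.5×IQR from the quartiles are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("IQR     :", iqr)                        # 21.0
print("Fences  :", lower_fence, "to", upper_fence)
print("Outliers:", outliers)                   # [180]
```

Only the planted 180 falls outside the fences; all other points would be covered by the whiskers.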

Probability Basics

Probability measures how likely an event is to occur. It ranges from 0 (impossible) to 1 (certain).

# Probability of rolling a 6 on a fair die
P_six = 1 / 6
print(f"P(rolling 6) = {P_six:.4f}")      # 0.1667

# Probability of drawing a heart from a deck of cards
P_heart = 13 / 52
print(f"P(heart) = {P_heart:.4f}")         # 0.25

# Probability of two independent events both happening
P_A = 0.6   # Rain today
P_B = 0.4   # Train delay
P_AB = P_A * P_B
print(f"P(rain AND delay) = {P_AB:.2f}")   # 0.24
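These theoretical values can be checked empirically with a quick simulation — a sketch using NumPy's random generator (seed chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Simulate 100,000 rolls of a fair six-sided die
rolls = rng.integers(1, 7, size=100_000)   # upper bound is exclusive
P_six_est = np.mean(rolls == 6)

print(f"Estimated P(rolling 6): {P_six_est:.4f}")   # ≈ 0.1667
```

With enough trials the estimated frequency converges to the theoretical 1/6 — the law of large numbers in action.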

Normal Distribution

The normal distribution (bell curve) is the most important distribution in statistics. Many real-world measurements — heights, exam scores, measurement errors — follow this pattern. The area under the curve represents probability.

import matplotlib.pyplot as plt
from scipy.stats import norm

# Plot a normal distribution
x   = np.linspace(-4, 4, 200)
pdf = norm.pdf(x, loc=0, scale=1)    # Mean=0, Std=1

plt.figure(figsize=(8, 4))
plt.plot(x, pdf, color="steelblue", linewidth=2)
plt.fill_between(x, pdf, where=(x >= -1) & (x <= 1), alpha=0.3, color="steelblue", label="68%")
plt.fill_between(x, pdf, where=(x >= -2) & (x <= 2), alpha=0.2, color="orange",    label="95%")
plt.fill_between(x, pdf, where=(x >= -3) & (x <= 3), alpha=0.1, color="red",       label="99.7%")
plt.title("Standard Normal Distribution – Empirical Rule")
plt.legend()
plt.tight_layout()
plt.savefig("normal_dist.png")
plt.show()

The Empirical Rule (68-95-99.7)

        68% of data
       ┌─────────────┐
       │             │
  95% of data        │
  ┌────┴─────────────┴────┐
  │                       │
  99.7% of data           │
  ┌──┴───────────────────┴──┐
  │                         │
──┼─────┼─────┼─────┼─────┼─────┼─────┼──
 -3σ   -2σ   -1σ    μ    +1σ   +2σ   +3σ

μ = mean, σ = standard deviation

• 68% of values fall within 1σ of the mean
• 95% of values fall within 2σ of the mean
• 99.7% of values fall within 3σ of the mean
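The three percentages can be verified directly from the cumulative distribution function of the standard normal:

```python
from scipy.stats import norm

# Area under the standard normal curve between -k·σ and +k·σ
for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)
    print(f"Within {k}σ: {area:.4f}")
# → 0.6827, 0.9545, 0.9973
```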

Skewness and Kurtosis

import pandas as pd

data = pd.Series([15, 18, 20, 22, 25, 28, 30, 35, 50, 80, 150])

print("Mean    :", data.mean().round(2))
print("Median  :", data.median())
print("Skewness:", data.skew().round(3))
print("Kurtosis:", data.kurt().round(3))
# Positive skew → tail on the right (income, house prices)
# Negative skew → tail on the left (exam scores clustered near 100)
# High kurtosis → heavier tails / more extreme values than a normal curve

Hypothesis Testing

Hypothesis testing determines whether an observation in data is statistically significant or just due to random chance. It answers: "Is the difference real, or did it happen by luck?"

The Hypothesis Testing Framework

Step 1: State the Null Hypothesis (H₀)
        "There is NO significant difference or effect."

Step 2: State the Alternative Hypothesis (H₁)
        "There IS a significant difference or effect."

Step 3: Choose a significance level (α)
        α = 0.05 (5% risk of rejecting a true H₀ – the standard choice)

Step 4: Calculate the test statistic and p-value

Step 5: Make a decision
        p < α  → Reject H₀ → Difference is significant
        p ≥ α  → Fail to reject H₀ → No significant difference
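The five steps map directly onto a one-sample t-test. The example below is a sketch with made-up order values, testing a claimed average of ₹50:

```python
from scipy import stats

# Step 1–2: H0: mean order value = 50, H1: mean order value ≠ 50
orders = [52, 55, 48, 60, 53, 57, 51, 58, 54, 56]

alpha = 0.05                                             # Step 3
t_stat, p_value = stats.ttest_1samp(orders, popmean=50)  # Step 4

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:                                      # Step 5
    print("Reject H₀ → mean differs from ₹50")
else:
    print("Fail to reject H₀")
```

Here the sample mean (54.4) sits well above 50 relative to its spread, so the test rejects H₀.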

T-Test – Comparing Two Groups

from scipy import stats

# Did a new training programme improve scores?
before_training = [72, 68, 74, 70, 65, 71, 69, 73]
after_training  = [80, 78, 82, 79, 75, 83, 77, 81]

t_stat, p_value = stats.ttest_rel(before_training, after_training)

print(f"T-statistic: {t_stat:.3f}")
print(f"P-value:     {p_value:.5f}")

alpha = 0.05
if p_value < alpha:
    print("Result: Reject H₀ → Training significantly improved scores")
else:
    print("Result: Fail to reject H₀ → No significant improvement")

Output:

T-statistic: -17.705
P-value:     0.00000
Result: Reject H₀ → Training significantly improved scores

Chi-Square Test – Categorical Variables

# Does customer preference depend on age group?
# Observed counts
observed = np.array([
    [30, 20, 10],   # Young customers: Product A, B, C
    [15, 25, 20],   # Middle-aged customers
    [10, 15, 30]    # Older customers
])

chi2, p, dof, expected = stats.chi2_contingency(observed)

print(f"Chi-Square: {chi2:.3f}")
print(f"P-value:    {p:.4f}")
print(f"Degrees of Freedom: {dof}")

if p < 0.05:
    print("Result: Product preference DOES depend on age group")
else:
    print("Result: No significant relationship between age and preference")

Correlation vs Causation

Correlation measures how strongly two variables move together. Causation means one variable directly causes the other. Correlation does not imply causation — two variables can move together for completely unrelated reasons.

# Calculate Pearson correlation coefficient
from scipy.stats import pearsonr

study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
exam_scores = np.array([55, 58, 65, 68, 72, 75, 80, 83, 87, 90])

r, p_value = pearsonr(study_hours, exam_scores)

print(f"Pearson r : {r:.3f}")
print(f"P-value   : {p_value:.5f}")
print(f"r² (R²)   : {r**2:.3f}")  # % of variance explained

Output:

Pearson r : 0.997
P-value   : 0.00000
r² (R²)   : 0.994

→ Very strong positive correlation
→ Study hours explain 99.4% of score variation

Classic Correlation vs Causation Example

CORRELATION:
Ice cream sales and drowning rates are both high in summer.
↓
Does ice cream cause drowning? NO.
Both are caused by a THIRD variable: hot weather (confounding variable).

CAUSATION:
More study hours → Higher exam scores
(Experimentally confirmed, with controls)
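The ice-cream example can be reproduced with synthetic data — a sketch in which temperature (the confounder) drives both series, so they correlate strongly despite having no causal link:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=0)

# Hot weather (the confounder) drives both quantities
temperature = rng.uniform(15, 40, size=200)                     # °C
ice_cream   = 2.0 * temperature + rng.normal(0, 5, size=200)    # sales
drownings   = 0.5 * temperature + rng.normal(0, 2, size=200)    # incidents

r, _ = pearsonr(ice_cream, drownings)
print(f"r(ice cream, drownings) = {r:.2f}")   # strong positive correlation
```

Controlling for temperature (e.g. via partial correlation) would make this apparent relationship largely disappear.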

Confidence Intervals

A confidence interval gives a range of values that likely contains the true population value. A 95% confidence interval means: if the study were repeated 100 times, about 95 of the 100 computed intervals would contain the true (fixed) population value.

# 95% Confidence Interval for average delivery time
delivery_times = np.array([23, 25, 28, 22, 27, 24, 26, 29, 21, 25])

mean = np.mean(delivery_times)
se   = stats.sem(delivery_times)   # Standard error of mean
ci   = stats.t.interval(0.95, df=len(delivery_times)-1, loc=mean, scale=se)

print(f"Sample Mean          : {mean:.2f} hours")
print(f"95% Confidence Interval: {ci[0]:.2f} – {ci[1]:.2f} hours")

Output:

Sample Mean          : 25.00 hours
95% Confidence Interval: 23.15 – 26.85 hours

→ The true average delivery time is between 23.2 and 26.8 hours
   with 95% confidence.
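The "95 runs out of 100" interpretation can be checked by simulation — a sketch with an assumed population (mean 25, std 2.5): draw many samples, compute a 95% CI from each, and count how often the interval captures the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
true_mean, n_runs, hits = 25.0, 1000, 0

for _ in range(n_runs):
    sample = rng.normal(loc=true_mean, scale=2.5, size=10)
    se = stats.sem(sample)                       # Standard error of mean
    lo, hi = stats.t.interval(0.95, df=len(sample) - 1,
                              loc=sample.mean(), scale=se)
    if lo <= true_mean <= hi:
        hits += 1

print(f"Coverage: {hits / n_runs:.1%}")   # close to 95%
```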

Key Statistics Summary Table

Concept              Python Function        What It Tells
Mean                 np.mean()              Average value
Median               np.median()            Middle value (robust to outliers)
Mode                 stats.mode()           Most frequent value
Standard Deviation   np.std()               Typical distance from the mean
Variance             np.var()               Average squared distance from the mean
Percentile           np.percentile()        Value below which N% of data falls
Skewness             data.skew()            Left or right tail of distribution
Pearson r            pearsonr()             Linear correlation strength
T-test               stats.ttest_rel()      Significant difference between two groups
Chi-square           chi2_contingency()     Association between categorical variables
Confidence Interval  stats.t.interval()     Plausible range for true population value

Summary

  • Mean, median, and mode describe the centre of a dataset — choose based on skewness and outliers
  • Standard deviation and IQR measure how spread out values are from the centre
  • The normal distribution describes many real-world datasets; 68-95-99.7% rule defines its spread
  • Hypothesis testing uses p-values to determine whether observed differences are statistically significant
  • A p-value below 0.05 means a result this extreme would be unlikely if the null hypothesis were true
  • Correlation measures the strength of relationship between two variables
  • Correlation does not prove causation — always look for confounding variables
  • Confidence intervals give a range for the true population parameter
