ML Model Evaluation Metrics

Building a model is only half the work. Measuring how well that model actually performs is equally important. Different problems need different metrics. Using the wrong metric gives a false picture of model quality and leads to poor decisions in production.

Two Categories of Metrics

Regression Metrics     → When output is a continuous number
Classification Metrics → When output is a category (class label)

Regression Evaluation Metrics

Mean Absolute Error (MAE)

MAE = Average of |Actual - Predicted| for all records

Example (house price prediction):
┌──────────┬───────────┬──────────┬─────────────────┐
│ Record   │ Actual    │ Predicted│ Absolute Error  │
├──────────┼───────────┼──────────┼─────────────────┤
│ House 1  │ ₹2,50,000 │ ₹2,40,000│ ₹10,000         │
│ House 2  │ ₹3,00,000 │ ₹3,10,000│ ₹10,000         │
│ House 3  │ ₹1,80,000 │ ₹1,95,000│ ₹15,000         │
│ House 4  │ ₹4,20,000 │ ₹4,00,000│ ₹20,000         │
└──────────┴───────────┴──────────┴─────────────────┘

MAE = (10,000 + 10,000 + 15,000 + 20,000) / 4 = ₹13,750

Interpretation: On average, predictions are off by ₹13,750.
MAE is easy to understand. It treats all errors equally.
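The MAE calculation above can be reproduced with a minimal pure-Python sketch, using the same four house-price figures from the table:

```python
# MAE sketch using the house-price figures from the table above.
actual    = [250_000, 300_000, 180_000, 420_000]
predicted = [240_000, 310_000, 195_000, 400_000]

def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print(mae(actual, predicted))  # → 13750.0
```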

Mean Squared Error (MSE)

MSE = Average of (Actual - Predicted)² for all records

Using same example:
  Errors: 10,000 / 10,000 / 15,000 / 20,000
  Squared: 100M / 100M / 225M / 400M
  MSE = (100M + 100M + 225M + 400M) / 4 = 206,250,000

MSE penalizes large errors much more than small ones.
A single huge error pushes MSE up dramatically.
Use MSE when large errors are especially costly and must be avoided.

Root Mean Squared Error (RMSE)

RMSE = √MSE = √206,250,000 ≈ ₹14,361

RMSE is in the same unit as the target (rupees).
Easier to interpret than MSE.
Still penalizes large errors more than MAE does.
Most commonly reported metric for regression problems.
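Both MSE and RMSE follow directly from the four errors above. A quick sketch to verify the arithmetic:

```python
import math

# MSE and RMSE on the same four absolute errors as the example above.
errors = [10_000, 10_000, 15_000, 20_000]

mse = sum(e ** 2 for e in errors) / len(errors)
rmse = math.sqrt(mse)

print(mse)   # → 206250000.0
print(rmse)  # ≈ 14361.4
```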

R² Score (Coefficient of Determination)

R² = 1 - (Sum of Squared Residuals / Total Sum of Squares)

R² measures: "How much better is the model than just
              predicting the mean every time?"

R² = 1.0  → Perfect predictions
R² = 0.0  → Model is no better than guessing the mean
R² < 0.0  → Model is worse than guessing the mean

Example:
  R² = 0.87 → Model explains 87% of the variation in house prices.
  The remaining 13% is unexplained (noise, missing features, etc.)

Visual:
  R²=0.95  ●●●●● closely hugging prediction line ← great
  R²=0.50  ●● ● ●● ●  scattered somewhat ← moderate
  R²=0.10  ● ●  ●    ●   ●●  random scatter ← poor
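The R² formula can be sketched in a few lines. The toy data below is made up purely to show the two boundary cases: perfect predictions give 1.0, and always predicting the mean gives 0.0.

```python
def r2_score(actual, predicted):
    """R² = 1 - (sum of squared residuals / total sum of squares)."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [2.0, 4.0, 6.0, 8.0]          # mean is 5.0
print(r2_score(actual, actual))        # → 1.0 (perfect predictions)
print(r2_score(actual, [5.0] * 4))     # → 0.0 (always predict the mean)
```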

Classification Evaluation Metrics

The Confusion Matrix

Foundation of all classification metrics.
Compares actual vs predicted class labels.

Example: Disease Detection (Positive = Has Disease)

                  │ Predicted: Positive  │ Predicted: Negative  │
Actual: Positive  │ True Positive (TP)   │ False Negative (FN)  │
Actual: Negative  │ False Positive (FP)  │ True Negative (TN)   │

TP = 90  (has disease, correctly detected)
TN = 850 (no disease, correctly cleared)
FP = 30  (no disease, wrongly flagged — "false alarm")
FN = 30  (has disease, missed — "dangerous miss")

Total records = 90+850+30+30 = 1000
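Given paired lists of actual and predicted labels, the four cells of the matrix are simple tallies. A sketch with a tiny made-up example (1 = Positive, 0 = Negative):

```python
# Tally TP/TN/FP/FN from paired actual/predicted labels (1 = Positive).
def confusion_counts(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

# Two positives (one caught, one missed), two negatives (one false alarm).
actual    = [1, 1, 0, 0]
predicted = [1, 0, 1, 0]
print(confusion_counts(actual, predicted))  # → (1, 1, 1, 1)
```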

Accuracy

Accuracy = (TP + TN) / Total

= (90 + 850) / 1000 = 940 / 1000 = 94%

Warning: Accuracy is misleading on imbalanced datasets.

Counter-example (Fraud Detection):
  999 genuine transactions, 1 fraud
  Model always predicts "genuine" → 99.9% accuracy
  But it catches ZERO fraud cases — completely useless.

Precision

Precision = TP / (TP + FP)

"Of all records predicted Positive, how many truly are Positive?"

Disease example:
  Precision = 90 / (90 + 30) = 90/120 = 75%

25% of flagged patients do NOT have the disease (false alarms).

High Precision matters when: False alarms are costly.
Example: Spam filter — a legitimate email flagged as spam is bad.

Recall (Sensitivity)

Recall = TP / (TP + FN)

"Of all actual Positive records, how many did the model catch?"

Disease example:
  Recall = 90 / (90 + 30) = 90/120 = 75%

25% of actual disease patients were MISSED by the model.

High Recall matters when: Missing a positive is dangerous.
Example: Cancer screening — missing a cancer patient is far worse
         than a false alarm that extra tests will clear up.

F1 Score

F1 = 2 × (Precision × Recall) / (Precision + Recall)

F1 is the harmonic mean of Precision and Recall.
It balances both when neither alone tells the full story.

Disease example:
  F1 = 2 × (0.75 × 0.75) / (0.75 + 0.75) = 0.75 = 75%

When to use F1:
  ✓ Imbalanced classes
  ✓ Both precision and recall are important
  ✓ Cannot sacrifice one for the other
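All three metrics fall straight out of the disease-detection counts. A sketch verifying the 75% figures above:

```python
# Precision, Recall, and F1 from the disease-detection counts above.
TP, TN, FP, FN = 90, 850, 30, 30

precision = TP / (TP + FP)       # 90 / 120 = 0.75
recall    = TP / (TP + FN)       # 90 / 120 = 0.75
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)     # → 0.75 0.75 0.75
```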

Precision-Recall Tradeoff

Adjusting the classification threshold changes both metrics:

Low Threshold (e.g., 0.3):
  Model predicts Positive very easily
  → Higher Recall (catches more true positives)
  → Lower Precision (more false alarms)

High Threshold (e.g., 0.8):
  Model predicts Positive only when very confident
  → Higher Precision (fewer false alarms)
  → Lower Recall (misses more true positives)

You cannot maximize both simultaneously.
The right balance depends on the problem context.
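The tradeoff is easy to see by sweeping a threshold over probability scores. The scores and labels below are made up purely for illustration:

```python
# Sweep the decision threshold over toy probability scores and watch
# precision rise while recall falls. Scores and labels are made up.
scores = [0.95, 0.85, 0.75, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]    # 1 = Positive

def precision_recall_at(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, a in zip(preds, labels) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(preds, labels) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(preds, labels) if p == 0 and a == 1)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec  = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

print(precision_recall_at(0.3))  # low threshold:  high recall, lower precision
print(precision_recall_at(0.8))  # high threshold: high precision, lower recall
```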

┌────────────────────┬──────────────────┬────────────────────────┐
│ Use Case           │ Priority         │ Reason                 │
├────────────────────┼──────────────────┼────────────────────────┤
│ Cancer Screening   │ High Recall      │ Missing cancer is      │
│                    │                  │ catastrophic           │
│ Spam Filter        │ High Precision   │ Losing real emails is  │
│                    │                  │ unacceptable           │
│ Fraud Detection    │ Balance (F1)     │ Both FP and FN have    │
│                    │                  │ real costs             │
└────────────────────┴──────────────────┴────────────────────────┘

ROC Curve and AUC

ROC Curve: Plots True Positive Rate vs False Positive Rate
           at every possible threshold value.

TPR (True Positive Rate) = Recall = TP / (TP + FN)
FPR (False Positive Rate) = FP / (FP + TN)

TPR
1.0 │            ╭──────────────────
    │         ╭──╯
0.7 │       ╭─╯
    │     ╭─╯
0.5 │   ╭─╯         ← Model's ROC curve
    │  ╭╯
    │ ╭╯
0.0 │─╯─────────────────────────────►
   0.0   0.3    0.7   1.0      FPR
   
   Diagonal line = Random guessing

AUC (Area Under the Curve):
  AUC = 1.0 → Perfect classifier
  AUC = 0.5 → Random guessing (no better than coin flip)
  AUC = 0.9 → Excellent
  AUC = 0.7 → Acceptable
  AUC < 0.5 → Worse than random (something is wrong)

AUC summarizes the entire ROC curve in a single number.
Higher AUC = better classifier across all thresholds.
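One useful equivalence: AUC equals the probability that a randomly chosen positive record gets a higher score than a randomly chosen negative one (ties count half). That gives a simple pairwise sketch, shown here on made-up scores:

```python
# AUC as the probability that a random positive outscores a random
# negative (ties count 0.5). Toy scores for illustration only.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
print(auc(scores, labels))  # 8 winning pairs out of 9 ≈ 0.889
```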

Multi-Class Metrics

For problems with 3+ classes, Precision/Recall/F1 extend to:

Macro Average:
  Compute metric for each class separately.
  Take the simple average.
  Every class weighted equally.

Weighted Average:
  Compute metric for each class separately.
  Take average weighted by class size.
  Larger classes have more influence.

Example (3-class problem: Cat, Dog, Rabbit):
┌──────────┬───────────┬──────────┬──────────┬────────────┐
│ Class    │ Precision │ Recall   │ F1       │ Support    │
├──────────┼───────────┼──────────┼──────────┼────────────┤
│ Cat      │ 0.90      │ 0.85     │ 0.87     │ 200 records│
│ Dog      │ 0.82      │ 0.88     │ 0.85     │ 150 records│
│ Rabbit   │ 0.75      │ 0.70     │ 0.72     │ 50 records │
├──────────┼───────────┼──────────┼──────────┼────────────┤
│ Macro    │ 0.82      │ 0.81     │ 0.81     │            │
│ Weighted │ 0.85      │ 0.84     │ 0.84     │            │
└──────────┴───────────┴──────────┴──────────┴────────────┘
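The macro and weighted rows follow from the per-class F1 scores and supports in the table. A sketch verifying the F1 averages:

```python
# Macro vs weighted averaging of the per-class F1 scores from the table.
f1      = {"Cat": 0.87, "Dog": 0.85, "Rabbit": 0.72}
support = {"Cat": 200,  "Dog": 150,  "Rabbit": 50}

macro    = sum(f1.values()) / len(f1)                     # equal weight per class
weighted = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(round(macro, 2))     # → 0.81
print(round(weighted, 2))  # → 0.84
```

Note how the small Rabbit class (low F1, low support) drags the macro average down more than the weighted one; that gap is a quick signal that minority classes are underperforming.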

Metrics Quick Reference

┌──────────────────┬──────────────────────────────────────────────┐
│ Metric           │ Use When                                     │
├──────────────────┼──────────────────────────────────────────────┤
│ MAE              │ Regression, all errors equally important     │
│ RMSE             │ Regression, large errors especially costly   │
│ R²               │ Regression, need to explain variance         │
│ Accuracy         │ Classification, balanced classes only        │
│ Precision        │ Classification, false alarms are costly      │
│ Recall           │ Classification, missing positives is costly  │
│ F1               │ Classification, imbalanced classes           │
│ AUC-ROC          │ Classification, comparing model quality      │
│                  │ across all thresholds                        │
└──────────────────┴──────────────────────────────────────────────┘
