ML Model Evaluation Metrics
Building a model is only half the work. Measuring how well that model actually performs is equally important. Different problems need different metrics. Using the wrong metric gives a false picture of model quality and leads to poor decisions in production.
Two Categories of Metrics
Regression Metrics     → When the output is a continuous number
Classification Metrics → When the output is a category (class label)
Regression Evaluation Metrics
Mean Absolute Error (MAE)
MAE = Average of |Actual - Predicted| for all records

Example (house price prediction):

┌──────────┬───────────┬───────────┬────────────────┐
│ Record   │ Actual    │ Predicted │ Absolute Error │
├──────────┼───────────┼───────────┼────────────────┤
│ House 1  │ ₹2,50,000 │ ₹2,40,000 │ ₹10,000        │
│ House 2  │ ₹3,00,000 │ ₹3,10,000 │ ₹10,000        │
│ House 3  │ ₹1,80,000 │ ₹1,95,000 │ ₹15,000        │
│ House 4  │ ₹4,20,000 │ ₹4,00,000 │ ₹20,000        │
└──────────┴───────────┴───────────┴────────────────┘

MAE = (10,000 + 10,000 + 15,000 + 20,000) / 4 = ₹13,750

Interpretation: On average, predictions are off by ₹13,750.
MAE is easy to understand, and it treats all errors equally.
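The table above can be reproduced in a few lines of plain Python; the lists mirror the example figures (in rupees), and no external library is assumed:

```python
# House-price figures from the example table above (values in rupees).
actual    = [250_000, 300_000, 180_000, 420_000]
predicted = [240_000, 310_000, 195_000, 400_000]

# MAE: mean of absolute differences; every error counts the same.
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(mae)  # 13750.0
```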
Mean Squared Error (MSE)
MSE = Average of (Actual - Predicted)² for all records

Using the same example:
Errors:  10,000 / 10,000 / 15,000 / 20,000
Squared: 100M / 100M / 225M / 400M
MSE = (100M + 100M + 225M + 400M) / 4 = 206,250,000

MSE penalizes large errors much more than small ones;
a single huge error pushes MSE up dramatically.
Best when large errors are especially harmful.
Root Mean Squared Error (RMSE)
RMSE = √MSE = √206,250,000 ≈ ₹14,361

RMSE is in the same unit as the target (rupees), so it is easier to
interpret than MSE, while still penalizing large errors more than MAE does.
It is the most commonly reported metric for regression problems.
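A quick sketch of MSE and RMSE on the same house-price figures, using only the standard library:

```python
import math

# Same hypothetical house-price example as above (values in rupees).
actual    = [250_000, 300_000, 180_000, 420_000]
predicted = [240_000, 310_000, 195_000, 400_000]

# MSE: squaring amplifies large errors before averaging.
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# RMSE: square root brings the value back into rupees.
rmse = math.sqrt(mse)
print(mse, round(rmse, 2))  # 206250000.0 14361.41
```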
R² Score (Coefficient of Determination)
R² = 1 - (Sum of Squared Residuals / Total Sum of Squares)
R² measures: "How much better is the model than just
predicting the mean every time?"
R² = 1.0 → Perfect predictions
R² = 0.0 → Model is no better than guessing the mean
R² < 0.0 → Model is worse than guessing the mean
Example:
R² = 0.87 → Model explains 87% of the variation in house prices.
The remaining 13% is unexplained (noise, missing features, etc.)
Visual:
R²=0.95 ●●●●● closely hugging prediction line ← great
R²=0.50 ●● ● ●● ● scattered somewhat ← moderate
R²=0.10 ● ● ● ● ●● random scatter ← poor
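The R² formula can be sketched directly from its definition; on the tiny four-house sample above it comes out very high, which just illustrates the mechanics rather than a realistic model score:

```python
# R² = 1 - (sum of squared residuals / total sum of squares),
# computed on the same hypothetical house-price example.
actual    = [250_000, 300_000, 180_000, 420_000]
predicted = [240_000, 310_000, 195_000, 400_000]

mean_y = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # model's errors
ss_tot = sum((a - mean_y) ** 2 for a in actual)                 # "predict the mean" errors
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # 0.973
```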
Classification Evaluation Metrics
The Confusion Matrix
Foundation of all classification metrics.
Compares actual vs predicted class labels.
Example: Disease Detection (Positive = Has Disease)
                    │ Predicted: Positive │ Predicted: Negative │
  Actual: Positive  │ True Positive (TP)  │ False Negative (FN) │
  Actual: Negative  │ False Positive (FP) │ True Negative (TN)  │
TP = 90 (has disease, correctly detected)
TN = 850 (no disease, correctly cleared)
FP = 30 (no disease, wrongly flagged — "false alarm")
FN = 30 (has disease, missed — "dangerous miss")
Total records = 90+850+30+30 = 1000
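The four cells are just tallies of actual/predicted pairs. A minimal sketch, using short hypothetical label lists rather than the full 1000-record example (1 = Positive):

```python
# Hypothetical labels, stand-ins for the disease-detection dataset.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
pred   = [1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, pred))  # correctly detected
tn = sum(a == 0 and p == 0 for a, p in zip(actual, pred))  # correctly cleared
fp = sum(a == 0 and p == 1 for a, p in zip(actual, pred))  # false alarms
fn = sum(a == 1 and p == 0 for a, p in zip(actual, pred))  # dangerous misses
print(tp, tn, fp, fn)  # 3 3 1 1
```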
Accuracy
Accuracy = (TP + TN) / Total = (90 + 850) / 1000 = 940 / 1000 = 94%

Warning: Accuracy is misleading on imbalanced datasets.

Counter-example (Fraud Detection):
  999 genuine transactions, 1 fraud
  Model always predicts "genuine" → 99.9% accuracy
  But it catches ZERO fraud cases — completely useless.
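Both the disease-example accuracy and the fraud counter-example are easy to verify numerically:

```python
# Accuracy from the confusion-matrix counts above.
tp, tn, fp, fn = 90, 850, 30, 30
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.94

# Fraud counter-example: always predicting "genuine" still scores 99.9%.
labels = ["genuine"] * 999 + ["fraud"]
preds  = ["genuine"] * 1000
acc = sum(l == p for l, p in zip(labels, preds)) / len(labels)
frauds_caught = sum(p == "fraud" for p in preds)
print(acc, frauds_caught)  # 0.999 0
```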
Precision
Precision = TP / (TP + FP)

"Of all records predicted Positive, how many truly are Positive?"

Disease example:
Precision = 90 / (90 + 30) = 90/120 = 75%
25% of flagged patients do NOT have the disease (false alarms).

High Precision matters when: False alarms are costly.
Example: Spam filter — a legitimate email flagged as spam is bad.
Recall (Sensitivity)
Recall = TP / (TP + FN)
"Of all actual Positive records, how many did the model catch?"
Disease example:
Recall = 90 / (90 + 30) = 90/120 = 75%
25% of actual disease patients were MISSED by the model.
High Recall matters when: Missing a positive is dangerous.
Example: Cancer screening — missing a cancer patient is far worse
than a false alarm that extra tests will clear up.
F1 Score
F1 = 2 × (Precision × Recall) / (Precision + Recall)

F1 is the harmonic mean of Precision and Recall.
It balances both when neither alone tells the full story.

Disease example:
F1 = 2 × (0.75 × 0.75) / (0.75 + 0.75) = 0.75 = 75%

When to use F1:
✓ Imbalanced classes
✓ Both precision and recall are important
✓ Cannot sacrifice one for the other
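All three metrics follow directly from the confusion-matrix counts in the disease example:

```python
# Counts from the disease-detection confusion matrix above.
tp, fp, fn = 90, 30, 30

precision = tp / (tp + fp)  # of predicted positives, how many are right
recall    = tp / (tp + fn)  # of actual positives, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(precision, recall, f1)  # 0.75 0.75 0.75
```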
Precision-Recall Tradeoff
Adjusting the classification threshold changes both metrics:

Low Threshold (e.g., 0.3):
  Model predicts Positive very easily
  → Higher Recall (catches more true positives)
  → Lower Precision (more false alarms)

High Threshold (e.g., 0.8):
  Model predicts Positive only when very confident
  → Higher Precision (fewer false alarms)
  → Lower Recall (misses more true positives)

You cannot maximize both simultaneously.
The right balance depends on the problem context.

┌────────────────────┬──────────────────┬────────────────────┐
│ Use Case           │ Priority         │ Reason             │
├────────────────────┼──────────────────┼────────────────────┤
│ Cancer Screening   │ High Recall      │ Missing cancer     │
│                    │                  │ is catastrophic    │
│ Spam Filter        │ High Precision   │ Losing real emails │
│                    │                  │ is unacceptable    │
│ Fraud Detection    │ Balance (F1)     │ Both FP and FN     │
│                    │                  │ have real costs    │
└────────────────────┴──────────────────┴────────────────────┘
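The tradeoff can be demonstrated with toy numbers; the scores and labels below are made up purely to illustrate how one threshold trades precision for recall:

```python
# Hypothetical model scores (probability of Positive) and true labels.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.85, 0.6, 0.4, 0.7, 0.35, 0.2, 0.1]

def precision_recall(threshold):
    """Classify as Positive when score >= threshold, then score the result."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and l == 1 for p, l in zip(preds, labels))
    fp = sum(p and l == 0 for p, l in zip(preds, labels))
    fn = sum(not p and l == 1 for p, l in zip(preds, labels))
    return round(tp / (tp + fp), 2), round(tp / (tp + fn), 2)

print(precision_recall(0.3))  # (0.67, 1.0)  low bar: full recall, more false alarms
print(precision_recall(0.8))  # (1.0, 0.5)   high bar: no false alarms, misses half
```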
ROC Curve and AUC
ROC Curve: Plots True Positive Rate vs False Positive Rate
at every possible threshold value.
TPR (True Positive Rate) = Recall = TP / (TP + FN)
FPR (False Positive Rate) = FP / (FP + TN)
TPR
1.0 │ ╭──────────────────
│ ╭──╯
0.7 │ ╭─╯
│ ╭─╯
0.5 │ ╭─╯ ← Model's ROC curve
│ ╭╯
│ ╭╯
0.0 │─╯─────────────────────────────►
0.0 0.3 0.7 1.0 FPR
Diagonal line = Random guessing
AUC (Area Under the Curve):
AUC = 1.0 → Perfect classifier
AUC = 0.5 → Random guessing (no better than coin flip)
AUC = 0.9 → Excellent
AUC = 0.7 → Acceptable
AUC < 0.5 → Worse than random (something is wrong)
AUC summarizes the entire ROC curve in a single number.
Higher AUC = better classifier across all thresholds.
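AUC also has a rank-based interpretation: it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. A sketch of that computation, on the same hypothetical scores as the tradeoff example:

```python
# AUC as the probability that a random positive outranks a random negative
# (equivalent to the area under the ROC curve). Toy data, for illustration.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.85, 0.6, 0.4, 0.7, 0.35, 0.2, 0.1]

pos = [s for s, l in zip(scores, labels) if l == 1]
neg = [s for s, l in zip(scores, labels) if l == 0]

# Count positive-vs-negative pairs the model ranks correctly (ties count half).
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(auc)  # 0.875
```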
Multi-Class Metrics
For problems with 3+ classes, Precision/Recall/F1 extend to:

Macro Average:
  Compute the metric for each class separately, then take the simple
  average. Every class is weighted equally.

Weighted Average:
  Compute the metric for each class separately, then take the average
  weighted by class size (support). Larger classes have more influence.

Example (3-class problem: Cat, Dog, Rabbit):

┌──────────┬───────────┬──────────┬──────────┬────────────┐
│ Class    │ Precision │ Recall   │ F1       │ Support    │
├──────────┼───────────┼──────────┼──────────┼────────────┤
│ Cat      │ 0.90      │ 0.85     │ 0.87     │ 200 records│
│ Dog      │ 0.82      │ 0.88     │ 0.85     │ 150 records│
│ Rabbit   │ 0.75      │ 0.70     │ 0.72     │ 50 records │
├──────────┼───────────┼──────────┼──────────┼────────────┤
│ Macro    │ 0.82      │ 0.81     │ 0.81     │            │
│ Weighted │ 0.85      │ 0.84     │ 0.84     │            │
└──────────┴───────────┴──────────┴──────────┴────────────┘
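Taking the per-class recall column as an example, the two averaging schemes work out like this:

```python
# Per-class recall and support from the 3-class example above.
recall  = {"cat": 0.85, "dog": 0.88, "rabbit": 0.70}
support = {"cat": 200,  "dog": 150,  "rabbit": 50}

# Macro: simple mean, every class counts equally (small Rabbit class included).
macro = sum(recall.values()) / len(recall)

# Weighted: mean weighted by class size, so Cat and Dog dominate.
weighted = sum(recall[c] * support[c] for c in recall) / sum(support.values())
print(round(macro, 2), round(weighted, 2))  # 0.81 0.84
```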
Metrics Quick Reference
┌──────────────────┬──────────────────────────────────────────────┐
│ Metric           │ Use When                                     │
├──────────────────┼──────────────────────────────────────────────┤
│ MAE              │ Regression, all errors equally important     │
│ RMSE             │ Regression, large errors especially costly   │
│ R²               │ Regression, need to explain variance         │
│ Accuracy         │ Classification, balanced classes only        │
│ Precision        │ Classification, false alarms are costly      │
│ Recall           │ Classification, missing positives is costly  │
│ F1               │ Classification, imbalanced classes           │
│ AUC-ROC          │ Classification, comparing model quality      │
│                  │ across all thresholds                        │
└──────────────────┴──────────────────────────────────────────────┘
