Ensemble Learning in Machine Learning

Ensemble Learning combines multiple Machine Learning models to produce a prediction that is more accurate and reliable than any single model on its own. The core idea is simple: when independent models each make different types of errors, combining their outputs tends to cancel out individual mistakes and leads to better overall performance.

The Core Intuition

Analogy — Medical Diagnosis:
  One doctor: 78% accurate
  Five independent doctors, each voting: ~93% accurate

Why? Each doctor makes different mistakes.
Their errors do not all fall on the same patients.
The majority vote is right far more often than any single doctor.

Machine Learning works the same way.
A group of diverse, imperfect models can beat one strong model.
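The doctor analogy can be checked with a quick simulation (a sketch using NumPy; the 78% figure comes from the analogy, and error independence is an idealized assumption, since real models make correlated errors):

```python
import numpy as np

# Simulate the analogy: 5 independent voters, each 78% accurate.
rng = np.random.default_rng(0)
n_trials, n_doctors, acc = 100_000, 5, 0.78

correct = rng.random((n_trials, n_doctors)) < acc   # True = correct diagnosis
majority_correct = correct.sum(axis=1) >= 3         # at least 3 of 5 are right

print(f"single doctor: {acc:.1%}")
print(f"majority of 5: {majority_correct.mean():.1%}")   # ~93%
```

The exact binomial probability is about 92.6%, which the simulation approaches as the number of trials grows.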

Three Main Ensemble Strategies

Ensemble Methods
      │
      ├──► Bagging    (parallel independent models → average/vote)
      │
      ├──► Boosting   (sequential models → each fixes previous errors)
      │
      └──► Stacking   (combine different model types → meta-learner)

Bagging (Bootstrap Aggregating)

How Bagging Works:
  Step 1: Create N random bootstrap samples from training data
          (sample with replacement — some records repeat)
  Step 2: Train one independent model on each sample
  Step 3: Combine predictions:
          Classification → Majority vote
          Regression     → Average
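Step 1 can be sketched in a few lines (NumPy assumed; the record IDs are made up for illustration):

```python
import numpy as np

# Bootstrap sampling: draw N records WITH replacement from N records.
rng = np.random.default_rng(42)
records = np.arange(10)                        # hypothetical record IDs 0..9
sample = rng.choice(records, size=10, replace=True)

print(sample)                  # duplicates are expected
print(np.unique(sample).size)  # on average only ~63% of records appear
```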

Key Properties:
  ✓ Models train in PARALLEL (can run simultaneously)
  ✓ Reduces Variance (overfitting) dramatically
  ✓ Does not reduce Bias much

Diagram:

Original Data (1000 records)
        │
  ┌─────┼─────┐
  │     │     │
Sample1 Sample2 Sample3 ... SampleN
  │     │     │             │
Model1  Model2 Model3 ... ModelN
  │     │     │             │
  └─────┴─────┴─────────────┘
                 │
         Majority Vote / Average
                 │
           Final Prediction

Best Known Bagging Algorithm: Random Forest
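A minimal sketch with scikit-learn (assumed installed), comparing a single decision tree against bagged trees and a Random Forest on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data stands in for real records
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        random_state=0, n_jobs=-1).fit(X_tr, y_tr)  # parallel
forest = RandomForestClassifier(n_estimators=100,
                                random_state=0).fit(X_tr, y_tr)

print("single tree  :", tree.score(X_te, y_te))
print("bagged trees :", bag.score(X_te, y_te))
print("random forest:", forest.score(X_te, y_te))
```

`n_jobs=-1` uses all CPU cores, which is possible precisely because the bagged models are independent.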

Boosting

How Boosting Works:
  Step 1: Train a weak model (slightly better than random) on data
  Step 2: Find records where Model 1 made mistakes
  Step 3: Give those misclassified records MORE weight in next round
  Step 4: Train Model 2 on this weighted data
  Step 5: Model 2 focuses on what Model 1 got wrong
  Step 6: Repeat for N rounds
  Step 7: Combine all models with weighted voting

Key Properties:
  ✓ Models train SEQUENTIALLY (each depends on previous)
  ✓ Reduces both Bias and Variance
  ✓ Often produces the most accurate ensembles on structured/tabular data
  ✗ Slower (rounds train sequentially; only within-round work can run in parallel)
  ✗ More prone to overfitting on noisy data

Diagram:

                Training Data
                      │
                 Model 1 (weak)
                      │
              Errors get higher weight
                      │
                 Model 2 (weak, focuses on Model 1 errors)
                      │
              Errors get higher weight
                      │
                 Model 3 ...
                      │
                 Model N
                      │
        Weighted combination of all N models
                      │
                Final Prediction (strong)

Best Known Boosting Algorithms: AdaBoost, Gradient Boosting, XGBoost
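A hedged sketch using scikit-learn's built-in gradient boosting (XGBoost and AdaBoost are covered by other estimators/libraries; GradientBoostingClassifier is the closest built-in analogue of the sequential procedure above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each of the 100 rounds fits a small tree to the remaining errors
# of the ensemble built so far
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0).fit(X_tr, y_tr)

print("gradient boosting accuracy:", gb.score(X_te, y_te))
```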

AdaBoost (Adaptive Boosting)

AdaBoost uses very shallow Decision Trees (depth=1, called "stumps")
as its weak learners.

Each stump only splits on one feature — barely better than random.
But combining 100+ stumps produces a powerful ensemble.

Record Weighting in AdaBoost:

Round 1: All records have equal weight (1/N)
  Model 1 misclassifies records: 3, 7, 12
  → Records 3, 7, 12 get higher weight

Round 2: Model 2 trains on weighted data
  It focuses harder on records 3, 7, 12
  Model 2 classifies those correctly but misclassifies: 5, 9
  → Records 5, 9 get higher weight

...and so on.

Final Prediction:
  Each model's vote is weighted by its accuracy.
  More accurate models have stronger votes.

Stacking (Stacked Generalization)

Stacking uses multiple DIFFERENT algorithms (base models)
and trains a second-level model (meta-learner) to combine their predictions.

Why different algorithms?
  Diverse model types make different types of mistakes.
  A meta-learner learns which base model to trust in which situation.

Example:
  Base Models (Level 0):
    Model A: Logistic Regression → Prediction A
    Model B: Random Forest       → Prediction B
    Model C: KNN                 → Prediction C
    Model D: SVM                 → Prediction D

  Meta-Learner (Level 1):
    Input: [Prediction A, Prediction B, Prediction C, Prediction D]
    Output: Final Prediction

Diagram:

  Training Data
       │
  ┌────┼────┬────┐
  │    │    │    │
 LR   RF   KNN  SVM
  │    │    │    │
 Pred Pred Pred Pred
  └────┴────┴────┘
             │
       Meta-Learner
             │
       Final Prediction

Key Rule: Base models make predictions on VALIDATION data
(data not used in their own training) to prevent data leakage.
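The two-level example above can be sketched with scikit-learn's StackingClassifier, which enforces the validation rule internally by generating base-model predictions via cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_models = [                                # Level 0: diverse algorithms
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
    ("svm", SVC()),
]
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),      # Level 1: meta-learner
    cv=5,                                      # out-of-fold predictions avoid leakage
).fit(X_tr, y_tr)

print("stacking accuracy:", stack.score(X_te, y_te))
```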

Voting Ensemble

The simplest ensemble: combine predictions from multiple models by voting.

Hard Voting (Classification):
  Model A: Cat
  Model B: Dog
  Model C: Cat
  Model D: Cat
  Final: Cat (3 votes vs 1)

Soft Voting (Classification):
  Uses predicted probabilities instead of class labels.
  Model A: Cat=0.80, Dog=0.20
  Model B: Cat=0.40, Dog=0.60
  Model C: Cat=0.75, Dog=0.25
  Model D: Cat=0.70, Dog=0.30
  Average: Cat=0.6625, Dog=0.3375
  Final: Cat ← higher average probability

Soft voting is usually more accurate when models output
well-calibrated probabilities.
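Both voting modes are available in scikit-learn's VotingClassifier (a sketch; every base model must support predict_proba for soft voting):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [("lr", LogisticRegression(max_iter=1000)),
          ("rf", RandomForestClassifier(random_state=0)),
          ("nb", GaussianNB())]

hard = VotingClassifier(models, voting="hard").fit(X_tr, y_tr)  # majority of labels
soft = VotingClassifier(models, voting="soft").fit(X_tr, y_tr)  # average of probabilities

print("hard voting:", hard.score(X_te, y_te))
print("soft voting:", soft.score(X_te, y_te))
```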

Bagging vs Boosting vs Stacking Comparison

┌────────────────────────┬───────────────┬───────────────┬─────────────┐
│ Feature                │ Bagging       │ Boosting      │ Stacking    │
├────────────────────────┼───────────────┼───────────────┼─────────────┤
│ Model training order   │ Parallel      │ Sequential    │ Parallel    │
│                        │               │               │ then 1 more │
│ Error type reduced     │ Variance      │ Bias+Variance │ Both        │
│ Same or diff models    │ Same          │ Same          │ Different   │
│ Overfitting risk       │ Low           │ Moderate      │ Low-Moderate│
│ Computation time       │ Moderate      │ Higher        │ High        │
│ Accuracy level         │ High          │ Highest       │ High        │
│ Popular examples       │ Random Forest │ XGBoost,      │ Custom      │
│                        │               │ AdaBoost      │ combinations│
└────────────────────────┴───────────────┴───────────────┴─────────────┘

When to Use Ensemble Methods

Use Ensembles When:
  ✓ Single models are not accurate enough
  ✓ Competing in Machine Learning competitions (Kaggle)
  ✓ Dataset is noisy with lots of variance
  ✓ Prediction accuracy is more important than speed

Consider Single Models When:
  ✗ Interpretability is required (decision must be explained)
  ✗ Prediction latency must be very low (real-time systems)
  ✗ Computational resources are limited
  ✗ Dataset is small (ensembles need enough data to benefit)

Industry Note:
  Most winning solutions on Kaggle use ensemble methods.
  XGBoost and LightGBM (boosting-based) dominate structured data.
  Stacking is common in the final submission stage of competitions.
