ML Gradient Boosting and XGBoost
Gradient Boosting is one of the most powerful and widely used Machine Learning algorithms. It builds an ensemble of Decision Trees sequentially, where each new tree corrects the errors of all previous trees. XGBoost (Extreme Gradient Boosting) is an optimized, faster, and more flexible implementation of this technique that dominates structured data competitions and industry applications.
The Gradient Boosting Idea
AdaBoost reweights misclassified records. Gradient Boosting takes a different approach: each new tree is trained to predict the ERRORS (residuals) of the current ensemble, not the original target.

Analogy: building a sculpture:
  Round 1: Rough shape (large errors in detail)
  Round 2: Chisel the rough parts (fix largest errors)
  Round 3: Sand smooth (fix remaining rough spots)
  Round 4: Polish (fix tiny imperfections)
  ...
  Final: A precise sculpture (accurate model)
Gradient Boosting Step by Step
Goal: Predict house prices.

Step 1: Start with a simple prediction, the mean of all prices.
  Mean price = ₹2,50,000
  Record 1: Actual=₹2,80,000  Predicted=₹2,50,000  Error=+₹30,000
  Record 2: Actual=₹2,20,000  Predicted=₹2,50,000  Error=-₹30,000
  Record 3: Actual=₹3,00,000  Predicted=₹2,50,000  Error=+₹50,000
  Record 4: Actual=₹2,00,000  Predicted=₹2,50,000  Error=-₹50,000

Step 2: Train Tree 1 on the ERRORS (residuals):
  Tree 1 learns to predict: +30,000 / -30,000 / +50,000 / -50,000
  New Prediction = ₹2,50,000 + (learning_rate × Tree 1 prediction)
  With learning_rate = 0.1:
  Record 1: ₹2,50,000 + 0.1×30,000 = ₹2,53,000
  Record 3: ₹2,50,000 + 0.1×50,000 = ₹2,55,000

Step 3: Calculate new residuals on the updated predictions.
  Record 1: Actual=₹2,80,000  New Predicted=₹2,53,000  New Error=+₹27,000

Step 4: Train Tree 2 on the new residuals, then update predictions again
with Tree 2's output.

Step 5: Repeat for N rounds (n_estimators trees).

Final Prediction = Mean + lr×Tree1 + lr×Tree2 + ... + lr×TreeN
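The arithmetic above can be reproduced in a few lines of pure Python. Two simplifications are assumed for illustration: Record 4 is taken as ₹2,00,000 so that the mean is exactly ₹2,50,000, and each "tree" predicts the residuals perfectly (a real implementation fits a decision tree to them).

```python
# Worked sketch of the boosting arithmetic above, in pure Python.
# Assumption: each "tree" predicts the residuals exactly.
actual = [280_000, 220_000, 300_000, 200_000]

# Step 1: start with the mean of all prices (₹2,50,000 each).
pred = [sum(actual) / len(actual)] * len(actual)

lr = 0.1
for round_no in range(3):                     # 3 boosting rounds
    residuals = [a - p for a, p in zip(actual, pred)]
    tree_output = residuals                   # stand-in for a fitted tree
    # Shrink each tree's correction by the learning rate.
    pred = [p + lr * t for p, t in zip(pred, tree_output)]

print([round(p) for p in pred])
```

After round 1, Record 1's prediction is ₹2,53,000 exactly as in Step 2 above; each further round closes 10% of the remaining gap.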
Role of the Learning Rate in Gradient Boosting
The learning rate (also called shrinkage) scales each tree's contribution.

High learning rate (e.g., 0.5):
  Each tree makes big corrections → fewer trees needed
  Risk: jumps past the optimum, overfits
Low learning rate (e.g., 0.01):
  Each tree makes tiny corrections → many trees needed
  Benefit: more precise, better generalization

Rule: lower learning rate + more trees = better result (but slower)

Typical practice:
  Start: learning_rate=0.1, n_estimators=100
  Tune:  learning_rate=0.01–0.05, n_estimators=500–2000
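The speed side of the trade-off can be made concrete with a simplified model: if each tree predicted the current residual exactly, the remaining error after n trees would shrink by (1 - lr)^n. This is only a sketch of convergence speed; the generalization benefit of a low learning rate, as noted above, is a separate (regularization) effect.

```python
# Simplified model: remaining error after n trees, assuming each tree
# predicts the current residual exactly (illustration only).
def remaining_error(initial_error, lr, n_trees):
    return initial_error * (1 - lr) ** n_trees

# Big steps with few trees vs. tiny steps with many trees:
print(remaining_error(30_000, 0.5, 10))    # high lr, 10 trees
print(remaining_error(30_000, 0.01, 500))  # low lr, 500 trees
```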
XGBoost: Why It Became the Industry Standard
XGBoost improves Gradient Boosting in several key ways:
┌──────────────────────────────┬────────────────────────────────────┐
│ Improvement │ How It Helps │
├──────────────────────────────┼────────────────────────────────────┤
│ Regularization (L1 and L2) │ Prevents overfitting automatically │
│ built in │ │
│ Tree pruning                 │ Uses gamma (min split loss) to     │
│                              │ prune splits that add little gain  │
│ Parallel processing │ Splits computed in parallel │
│ │ despite being sequential in logic │
│ Handles missing values │ Learns the best direction for │
│ natively │ missing data automatically │
│ Cache-aware computation │ Uses CPU memory more efficiently │
│ Out-of-core processing │ Can handle data larger than RAM │
│ Built-in cross validation │ Finds optimal n_estimators auto │
└──────────────────────────────┴────────────────────────────────────┘
Result: XGBoost is often around 10x faster than standard Gradient
Boosting and typically more accurate thanks to its regularization.
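The regularization and pruning rows can be made concrete with the regularized split-gain formula from the XGBoost paper: lambda (L2) shrinks leaf scores, and gamma is subtracted from every candidate split, so low-gain splits are pruned. A small sketch:

```python
# Sketch of XGBoost's regularized split gain (from the XGBoost paper).
# G_l/G_r are sums of gradients and H_l/H_r sums of Hessians in the
# left/right child. lambda (L2) shrinks leaf scores; gamma is subtracted
# from every candidate split, pruning splits whose raw gain is below it.
def split_gain(G_l, H_l, G_r, H_r, lam=1.0, gamma=0.0):
    def leaf_score(G, H):
        return G * G / (H + lam)
    return 0.5 * (leaf_score(G_l, H_l) + leaf_score(G_r, H_r)
                  - leaf_score(G_l + G_r, H_l + H_r)) - gamma

print(split_gain(10, 5, -8, 4, lam=1.0, gamma=0.0))   # positive: keep split
print(split_gain(10, 5, -8, 4, lam=1.0, gamma=20.0))  # negative: pruned
```

The gradient/Hessian values here are made up for illustration; in training they come from the loss function evaluated on the current predictions.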
XGBoost Key Hyperparameters
┌───────────────────────┬───────────────────────────────────────────┐
│ Hyperparameter        │ What It Controls                          │
├───────────────────────┼───────────────────────────────────────────┤
│ n_estimators          │ Number of trees                           │
│                       │ More trees → better, but slower + overfit │
│ learning_rate (eta)   │ Step size per tree (default=0.3)          │
│                       │ Lower = better generalization             │
│ max_depth             │ Max depth per tree (default=6)            │
│                       │ Deeper = more complex = more overfit risk │
│ subsample             │ Fraction of training rows per tree        │
│                       │ e.g., 0.8 = use 80% of rows each round    │
│                       │ Adds randomness, reduces overfitting      │
│ colsample_bytree      │ Fraction of features per tree             │
│                       │ Like Random Forest's feature randomness   │
│ min_child_weight      │ Min sum of sample weights in a leaf       │
│                       │ Higher = more conservative splits         │
│ gamma                 │ Min loss reduction to make a split        │
│                       │ Higher = more conservative tree growth    │
│ lambda (reg_lambda)   │ L2 regularization on weights              │
│ alpha (reg_alpha)     │ L1 regularization on weights              │
└───────────────────────┴───────────────────────────────────────────┘
Practical XGBoost Tuning Guide
Phase 1: Find the right number of trees
  Set learning_rate=0.1, max_depth=6
  Use early stopping (stop when validation score stops improving)
  → This gives optimal n_estimators
Phase 2: Tune tree-specific parameters
  Try max_depth: [3, 4, 5, 6, 7, 8]
  Try min_child_weight: [1, 3, 5, 7]
Phase 3: Add randomness to reduce overfitting
  Try subsample: [0.6, 0.7, 0.8, 0.9, 1.0]
  Try colsample_bytree: [0.6, 0.7, 0.8, 0.9, 1.0]
Phase 4: Tune regularization
  Try lambda (L2): [0, 0.1, 1, 5, 10]
  Try alpha (L1): [0, 0.1, 0.5, 1]
Phase 5: Lower learning rate, add more trees
  Reduce learning_rate from 0.1 to 0.01–0.05
  Increase n_estimators proportionally
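The phases boil down to a staged grid search. The sketch below is hypothetical: `evaluate` is a dummy stand-in for cross-validated training (in practice you would score each candidate with something like xgboost.cv), and all the names are illustrative rather than a real API.

```python
# Hypothetical staged-search sketch. `evaluate` is a dummy placeholder
# for cross-validated training; swap in a real scoring function.
import itertools

params = {                       # Phase 1 starting point
    "learning_rate": 0.1,
    "max_depth": 6,
    "n_estimators": 100,
}

def evaluate(p):                 # dummy scorer standing in for CV
    return -abs(p["max_depth"] - 5) - abs(p["min_child_weight"] - 3)

# Phase 2: grid over tree-specific parameters, keep the best candidate.
grid = itertools.product([3, 4, 5, 6, 7, 8], [1, 3, 5, 7])
best = max(
    ({**params, "max_depth": d, "min_child_weight": w} for d, w in grid),
    key=evaluate,
)
print(best["max_depth"], best["min_child_weight"])
# Phases 3-5 repeat the same pattern over subsample/colsample_bytree,
# lambda/alpha, then lower learning_rate while raising n_estimators.
```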
Early Stopping
Training XGBoost for too many rounds causes overfitting. Early stopping
automatically halts training when the validation score stops improving.

Example:
  Round 50:  Validation Accuracy = 84.2%
  Round 100: Validation Accuracy = 87.1%
  Round 150: Validation Accuracy = 88.3%
  Round 200: Validation Accuracy = 88.4%
  Round 250: Validation Accuracy = 88.3%  ← no improvement
  Round 300: Validation Accuracy = 88.2%  ← getting worse

With early_stopping_rounds=50:
  Training halts once 50 rounds pass with no improvement over Round 200,
  and the best model (from Round 200) is kept.

This saves computation time and prevents overfitting automatically.
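The stopping rule itself is simple to state in code: keep the best validation score seen so far, and stop once a set number of consecutive checks pass without improvement. A minimal sketch over the example's scores (one entry per 50-round checkpoint, so patience is measured in checkpoints here):

```python
# Minimal sketch of the early-stopping rule. Keeps the best score seen
# and stops after `patience` consecutive checks without improvement.
def early_stop(scores, patience):
    best_score, best_round = float("-inf"), 0
    for rnd, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            break                 # no improvement for `patience` checks
    return best_round, best_score

scores = [84.2, 87.1, 88.3, 88.4, 88.3, 88.2]  # rounds 50, 100, ..., 300
best_round, best_score = early_stop(scores, patience=1)
print(best_round * 50, best_score)             # best checkpoint and score
```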
XGBoost vs Random Forest
┌──────────────────────────┬──────────────────┬─────────────────────┐
│ Feature                  │ Random Forest    │ XGBoost             │
├──────────────────────────┼──────────────────┼─────────────────────┤
│ Training strategy        │ Parallel bagging │ Sequential boosting │
│ Best for                 │ Robust baseline  │ Maximum accuracy    │
│ Overfitting risk         │ Low              │ Moderate (needs     │
│                          │                  │ regularization)     │
│ Hyperparameter tuning    │ Easier           │ More involved       │
│ Handles missing values   │ Needs imputation │ Natively handles    │
│ Training time            │ Faster           │ Slower per tree     │
│ Accuracy (typical)       │ 85–90%           │ 88–95%              │
│ Kaggle competitions      │ Good baseline    │ Often wins          │
└──────────────────────────┴──────────────────┴─────────────────────┘
Other Gradient Boosting Variants
LightGBM (by Microsoft):
  Leaf-wise tree growth (vs level-wise in XGBoost)
  Faster on large datasets (millions of records)
  Lower memory usage
  Better for categorical features

CatBoost (by Yandex):
  Best native handling of categorical features
  No need to encode categories manually
  Reduces overfitting on small datasets
  Slower to train than LightGBM

┌──────────────────┬────────────────────────────────────────────────┐
│ Algorithm        │ Best Use Case                                  │
├──────────────────┼────────────────────────────────────────────────┤
│ XGBoost          │ General purpose, well-understood, widely used  │
│ LightGBM         │ Very large datasets, speed is critical         │
│ CatBoost         │ Many categorical features, minimal preprocessing│
└──────────────────┴────────────────────────────────────────────────┘
Gradient Boosting Full Flow
Training Data
│
▼
Start with mean prediction → Compute residuals
│
▼
Tree 1: Trained to predict residuals
│
▼
Update prediction (add lr × Tree 1 output)
│
▼
Compute new residuals
│
▼
Tree 2: Trained on new residuals
│
▼
Update prediction again
│
▼
...repeat N times (n_estimators)...
│
▼
Final Model = Mean + learning_rate × (sum of all tree outputs)
│
▼
New data → run through all trees → sum outputs → Final Prediction ✓
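The whole flow above can be condensed into a pure-Python sketch, using regression stumps on a single feature as the weak learners. This is illustrative only; real libraries fit full multi-feature trees with the optimizations described earlier.

```python
# Pure-Python condensation of the flow above: boosted regression stumps.
def fit_stump(x, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def fit_gbm(x, y, n_estimators=50, lr=0.1):
    base = sum(y) / len(y)                    # start with mean prediction
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # new residuals
        stump = fit_stump(x, residuals)       # tree trained on residuals
        stumps.append(stump)
        pred = [p + lr * stump(xi) for p, xi in zip(pred, x)]
    # Final model = mean + lr × (sum of all stump outputs)
    return lambda xi: base + lr * sum(s(xi) for s in stumps)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.1, 0.9, 3.0, 3.2, 2.9]            # step in the target near x=3.5
model = fit_gbm(x, y)
print([round(model(xi), 2) for xi in x])
```

Running a new value through the model sums the shrunken outputs of all fifty stumps on top of the mean, exactly mirroring the final step of the diagram.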
