Random Forest in Machine Learning

Random Forest is an ensemble algorithm that builds many Decision Trees and combines their results to make a final prediction. Instead of relying on one tree's opinion, it asks many trees and takes a vote. This approach produces a much more accurate and stable model than any single Decision Tree.

The Problem Random Forest Solves

A single Decision Tree is unstable. Change a few records in the training data and the entire tree structure changes dramatically. This instability leads to high variance — the model fits noise rather than real patterns. Random Forest fixes this by averaging out the noise across hundreds of trees.

Single Decision Tree:
  Training Data Slightly Changes → Tree Completely Changes
  Result: High Variance, Unreliable Predictions

Random Forest (100 trees):
  Each tree trained on slightly different data
  Final answer = majority vote of all 100 trees
  Result: Stable and Accurate Predictions
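This variance difference can be seen directly. The sketch below (assuming scikit-learn and NumPy are installed; the dataset is synthetic stand-in data) retrains each model on two slightly different resamples of the same data and measures how often the two retrained copies disagree:

```python
# Sketch: single trees are unstable under small data changes; a forest is not.
# make_classification is a synthetic stand-in for a real dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

def two_retrained_predictions(model_factory):
    # Train the same model twice on "slightly changed" data (two resamples)
    preds = []
    for _ in range(2):
        idx = rng.integers(0, len(X), len(X))
        preds.append(model_factory().fit(X[idx], y[idx]).predict(X))
    return preds

tree_a, tree_b = two_retrained_predictions(
    lambda: DecisionTreeClassifier(random_state=0))
rf_a, rf_b = two_retrained_predictions(
    lambda: RandomForestClassifier(n_estimators=100, random_state=0))

# Fraction of points where the two retrained copies disagree with each other
print("single-tree disagreement:", np.mean(tree_a != tree_b))
print("forest disagreement:     ", np.mean(rf_a != rf_b))
```

The forest's disagreement rate should come out noticeably lower than the single tree's, which is exactly the variance reduction described above.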

The Wisdom of the Crowd Analogy

Imagine asking 100 doctors to diagnose a patient independently.
  80 doctors say: "Condition A"
  15 doctors say: "Condition B"
   5 doctors say: "Condition C"

Final diagnosis = "Condition A" (majority vote)

One expert might be wrong. But 100 independent experts
collectively are almost always closer to the truth.
Random Forest applies this idea to Decision Trees.

How Random Forest Works

Random Forest Training Process:

Step 1: Bootstrap Sampling
  Original dataset: 1000 records

  Tree 1 trains on: 1000 records (randomly sampled WITH replacement)
    Some records appear twice. Some records are left out.
  Tree 2 trains on: 1000 records (different random sample)
  Tree 3 trains on: 1000 records (different random sample)
  ...
  Tree N trains on: 1000 records (different random sample)

  This technique = "Bagging" (Bootstrap Aggregating)
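Step 1 can be sketched in a few lines of NumPy (assuming it is installed): draw 1000 indices with replacement and count how many distinct records the tree actually sees. Roughly 63% of records appear at least once; the rest are left out.

```python
# Sketch of bootstrap sampling: draw n record indices WITH replacement.
import numpy as np

rng = np.random.default_rng(42)
n = 1000
sample = rng.integers(0, n, size=n)   # indices drawn with replacement

unique = np.unique(sample)
print("records seen by this tree:", len(unique))      # roughly 632
print("records left out:         ", n - len(unique))  # roughly 368
```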

Step 2: Random Feature Selection
  Each time a tree considers a split, it only looks at
  a RANDOM SUBSET of features — not all features.

  If dataset has 20 features:
    Each split considers only √20 ≈ 4–5 random features
    Different features get emphasized in different trees

  Why? This ensures trees are diverse from each other.
  If all trees use the same dominant feature, they all
  make the same mistake together.

Step 3: Grow Each Tree Fully
  Each tree grows deep without pruning (individual trees
  may overfit, but the average cancels this out)

Step 4: Prediction by Majority Vote (Classification)
         or Averaging (Regression)
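The four steps above map directly onto scikit-learn's parameters (whose names the hyperparameter table below already uses). A minimal sketch with synthetic stand-in data:

```python
# The four training steps, expressed as scikit-learn parameters.
# make_classification is a stand-in for a real dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    bootstrap=True,        # Step 1: bootstrap sampling (bagging)
    max_features="sqrt",   # Step 2: √20 ≈ 4–5 random features per split
    max_depth=None,        # Step 3: grow each tree fully, no pruning
    random_state=0,
).fit(X, y)                # Step 4 happens at predict time

print(forest.predict(X[:3]))
```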

Prediction: Voting Mechanism

Classification Example (Spam or Not Spam):

  Training dataset: 1000 emails
  Random Forest: 10 trees (simplified example)

  New Email arrives for prediction:

  Tree  1: NOT SPAM
  Tree  2: SPAM ← 
  Tree  3: NOT SPAM
  Tree  4: NOT SPAM
  Tree  5: SPAM ←
  Tree  6: NOT SPAM
  Tree  7: NOT SPAM
  Tree  8: SPAM ←
  Tree  9: NOT SPAM
  Tree 10: NOT SPAM

  NOT SPAM votes: 7
  SPAM votes:     3

  Final Prediction: NOT SPAM ✓
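The per-tree votes behind a decision like this can be inspected through scikit-learn's `estimators_` list of fitted trees (sketch below, with synthetic data standing in for the emails). One caveat: scikit-learn's forests actually average class probabilities across trees (soft voting) rather than counting hard votes, though with fully grown trees the two usually agree.

```python
# Sketch: recover the individual tree votes behind a forest's decision.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
forest = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)

email = X[:1]   # one "new email" to classify
votes = [int(tree.predict(email)[0]) for tree in forest.estimators_]

print("votes:   ", votes)
print("majority:", max(set(votes), key=votes.count))
print("forest:  ", forest.predict(email)[0])
```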

Regression Example (House Price):
  Tree  1: ₹2,40,000
  Tree  2: ₹2,55,000
  Tree  3: ₹2,48,000
  Tree  4: ₹2,52,000
  Tree  5: ₹2,45,000

  Final Prediction = Average = ₹2,48,000
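The regression case is plain arithmetic over the tree outputs, which `RandomForestRegressor` performs internally when you call `predict`:

```python
# Averaging the five tree outputs from the example above.
from statistics import mean

tree_predictions = [240_000, 255_000, 248_000, 252_000, 245_000]
print(mean(tree_predictions))   # → 248000
```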

Out-of-Bag (OOB) Evaluation

Because each tree trains on a bootstrap sample, approximately 37% of records are not seen by that individual tree: the chance that a given record is missed in all n draws with replacement is (1 − 1/n)^n, which approaches 1/e ≈ 0.368 for large n. These "left-out" records are that tree's Out-of-Bag samples. They can be used to evaluate the tree's performance without a separate validation set.

Tree 1 trains on: Records 1,1,3,5,6,8,9,10,10,... (with repetition)
OOB records for Tree 1: Records 2, 4, 7  (never seen by Tree 1)

Evaluate Tree 1 on its OOB records → OOB Score for Tree 1

Average OOB Score across all trees = OOB Accuracy
(A free validation score — no separate validation set needed)
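In scikit-learn this "free" score is one flag away (sketch with synthetic stand-in data):

```python
# Sketch: get the free OOB validation score from scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,    # evaluate records using only trees that never saw them
    bootstrap=True,    # OOB samples only exist when bootstrap sampling is on
    random_state=0,
).fit(X, y)

print("OOB accuracy:", forest.oob_score_)
```

Note that scikit-learn aggregates per record (each record is predicted only by the trees that never saw it) rather than averaging per-tree scores, but the idea is the same.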

Feature Importance in Random Forest

Random Forest measures how much each feature contributed to accurate splits across all trees. This gives a ranked list of the most influential features — a valuable insight for feature selection.

Loan Approval Dataset — Feature Importance:

┌──────────────────────┬───────────────────────────────────────┐
│ Feature              │ Importance Score                      │
├──────────────────────┼───────────────────────────────────────┤
│ Credit Score         │ 0.38 ████████████████████             │
│ Annual Income        │ 0.25 █████████████                    │
│ Loan Amount          │ 0.18 █████████                        │
│ Employment Type      │ 0.12 ██████                           │
│ Age                  │ 0.05 ██                               │
│ Number of Dependents │ 0.02 █                                │
└──────────────────────┴───────────────────────────────────────┘

Total always sums to 1.0
Credit Score is the most important feature for loan approval.
Number of Dependents barely matters.
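A ranking like the table above comes straight out of `feature_importances_`. The sketch below uses synthetic data with illustrative feature names; a real loan dataset would be loaded in its place:

```python
# Sketch: rank features by impurity-based importance.
# Feature names are illustrative; the data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

names = ["credit_score", "annual_income", "loan_amount",
         "employment_type", "age", "num_dependents"]
X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

for i in np.argsort(forest.feature_importances_)[::-1]:
    print(f"{names[i]:<16} {forest.feature_importances_[i]:.3f}")

print("sum:", forest.feature_importances_.sum())  # sums to 1 (up to rounding)
```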

Key Hyperparameters

┌──────────────────────────┬───────────────────────────────────────┐
│ Hyperparameter           │ Effect                                │
├──────────────────────────┼───────────────────────────────────────┤
│ n_estimators             │ Number of trees in the forest         │
│                          │ More trees → more stable, slower      │
│                          │ Typical range: 100 – 500              │
│ max_depth                │ Maximum depth per tree                │
│                          │ None = fully grown trees              │
│ max_features             │ Features considered per split         │
│                          │ "sqrt" or "log2" → adds randomness    │
│ min_samples_leaf         │ Min records required in each leaf     │
│ bootstrap                │ True = use bootstrap sampling         │
│ oob_score                │ True = compute OOB accuracy           │
└──────────────────────────┴───────────────────────────────────────┘

Random Forest vs Single Decision Tree

┌────────────────────────┬────────────────┬─────────────────────┐
│ Feature                │ Decision Tree  │ Random Forest       │
├────────────────────────┼────────────────┼─────────────────────┤
│ Number of models       │ 1              │ Hundreds            │
│ Prone to overfitting   │ Yes (severe)   │ No (much better)    │
│ Stability              │ Low            │ High                │
│ Interpretability       │ High (visual)  │ Low (black box)     │
│ Training speed         │ Fast           │ Slower              │
│ Prediction accuracy    │ Moderate       │ High                │
│ Feature importance     │ Basic          │ Reliable ranking    │
│ Handles missing values │ Moderately     │ Better              │
└────────────────────────┴────────────────┴─────────────────────┘

When to Use Random Forest

Use Random Forest When:
  ✓ High accuracy is the priority
  ✓ Dataset has a mix of numerical and categorical features
  ✓ Data has some noise or outliers
  ✓ Feature importance insights are needed
  ✓ No time for deep hyperparameter tuning (good default behavior)

Consider Other Options When:
  ✗ Model must be fully interpretable (use single Decision Tree)
  ✗ Dataset is extremely large and speed is critical
  ✗ Very high-dimensional sparse data (text — use other methods)

Random Forest Architecture Diagram

Full Dataset
     │
     ├── Bootstrap Sample 1 → Tree 1 → Prediction 1
     │
     ├── Bootstrap Sample 2 → Tree 2 → Prediction 2
     │
     ├── Bootstrap Sample 3 → Tree 3 → Prediction 3
     │      (each tree uses random feature subsets)
     │
     └── Bootstrap Sample N → Tree N → Prediction N
                                           │
                                           ▼
                                 Majority Vote (Classification)
                                 or Average (Regression)
                                           │
                                           ▼
                                   Final Prediction ✓
