Machine Learning Decision Trees
A Decision Tree is a flowchart-like model that makes predictions by asking a series of yes/no questions about the input features. Each question splits the data into smaller groups until a final prediction is reached. Decision Trees are intuitive, visual, and easy to explain to non-technical audiences.
The Core Idea
Think of a Decision Tree as a game of "20 Questions." Each question narrows down the possibilities. The model learns which questions to ask and in what order by analyzing the training data.
Example: Should a bank approve a loan?
                  Is Credit Score > 700?
                   /                  \
                 Yes                   No
                 /                      \
    Is Income > ₹5L/year?         Is Debt < ₹1L?
         /        \                 /       \
       Yes         No             Yes        No
        |           |              |          |
     APPROVE     REJECT         APPROVE    REJECT
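The loan tree above reads as nested if/else checks. A minimal sketch (the function name, thresholds, and lakh-unit arguments are illustrative, taken straight from the diagram):

```python
def approve_loan(credit_score, income_lakh, debt_lakh):
    """Walk the hypothetical loan tree: credit score first, then income or debt."""
    if credit_score > 700:
        # Left branch: check yearly income (in lakhs of rupees)
        return "APPROVE" if income_lakh > 5 else "REJECT"
    else:
        # Right branch: check outstanding debt (in lakhs of rupees)
        return "APPROVE" if debt_lakh < 1 else "REJECT"

print(approve_loan(750, 8, 0.5))  # high credit, high income -> APPROVE
print(approve_loan(650, 8, 2.0))  # low credit, high debt    -> REJECT
```

Every path from the root to a leaf becomes one chain of conditions, which is why a fitted tree is so easy to explain.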
Key Terminology
┌──────────────────┬───────────────────────────────────────────────┐
│ Term             │ Meaning                                       │
├──────────────────┼───────────────────────────────────────────────┤
│ Root Node        │ The very first question at the top of the tree│
│ Internal Node    │ Any question node in the middle of the tree   │
│ Branch           │ A path from one node to the next              │
│ Leaf Node        │ The final answer at the bottom (no more       │
│                  │ questions — just a prediction)                │
│ Depth            │ Number of levels from root to deepest leaf    │
│ Split            │ The act of dividing data using a question     │
└──────────────────┴───────────────────────────────────────────────┘
How the Tree Decides Which Question to Ask First
The algorithm picks the feature and the split value that best separates the classes. Two common measures determine "best separation":
Gini Impurity
Gini Impurity measures how mixed a group is.
0.0 = Perfectly pure (all one class) — best
0.5 = Completely mixed (equal split) — worst
Formula:
Gini = 1 - (p_class1² + p_class2²)
Example Node: 40 Approve, 10 Reject (total 50)
p_approve = 40/50 = 0.80
p_reject = 10/50 = 0.20
Gini = 1 - (0.80² + 0.20²)
= 1 - (0.64 + 0.04)
= 1 - 0.68
= 0.32
A pure node (50 Approve, 0 Reject):
Gini = 1 - (1.0² + 0.0²) = 0.0 ← Perfect
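The two calculations above generalize to any class counts. A small sketch (the `gini` helper name is illustrative):

```python
def gini(counts):
    """Gini impurity from per-class record counts at a node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(round(gini([40, 10]), 2))  # mixed node from the example: 0.32
print(gini([50, 0]))             # pure node: 0.0
```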
Information Gain (Entropy)
Entropy measures uncertainty or disorder in a group.
0.0 = No uncertainty (all one class) — best
1.0 = Maximum uncertainty (perfectly balanced) — worst
Formula:
Entropy = -(p_class1 · log₂(p_class1) + p_class2 · log₂(p_class2))
Information Gain = Entropy before split - weighted average Entropy of the child nodes after split
The algorithm picks the feature that gives the HIGHEST information gain — meaning the split that reduces uncertainty the most.
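Entropy and information gain can be computed the same way as Gini. A sketch for a two-way split (the function names and the 10/10 example counts are invented for illustration):

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) from per-class record counts at a node."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    after = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - after

# Hypothetical split: a 10-vs-10 parent divided into (8,2) and (2,8) children
print(entropy([10, 10]))                                       # 1.0 (maximum uncertainty)
print(round(information_gain([10, 10], [[8, 2], [2, 8]]), 3))  # 0.278
```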
Building a Decision Tree Step by Step
Step 1: Start with all training records at the root node.
Step 2: Try every possible feature and every possible split value.
Calculate Gini or Information Gain for each.
Choose the split with the best score.
Step 3: Split the data into two child nodes based on the chosen question.
Step 4: Repeat Steps 2–3 for each child node.
Step 5: Stop when:
- A leaf node has only one class (pure)
- Maximum tree depth is reached
- Minimum number of samples per node is reached
- Splitting no longer improves the score
Step 6: Assign each leaf the most common class among its records.
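Steps 1-6 can be sketched as a small recursive builder using Gini impurity. This is a toy illustration only (the function names, the tuple-based tree representation, and the tiny one-feature dataset are all assumptions), not a production CART implementation:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Step 2: try every feature and threshold; return (score, feature, threshold)."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # this threshold does not actually split the data
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Steps 1-5: split recursively; stop when pure, at max depth, or unsplittable."""
    if len(set(labels)) == 1 or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]  # Step 6: majority-class leaf
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    """Follow branches until a leaf (a plain label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

rows = [[1], [2], [3], [4]]   # one numeric feature
labels = ["A", "A", "B", "B"]
tree = build_tree(rows, labels)
print(predict(tree, [1.5]))   # prints A
```

Internal nodes are stored as `(feature, threshold, left_subtree, right_subtree)` tuples and leaves as bare labels, which keeps the recursion in `predict` trivial.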
Full Example: Predicting Weather Play Decision
Training Data (10 records):
┌──────────┬──────────┬──────────┬────────┬────────────┐
│ Outlook │ Temp │ Humidity │ Wind │ Play? │
├──────────┼──────────┼──────────┼────────┼────────────┤
│ Sunny │ Hot │ High │ Weak │ No │
│ Sunny │ Hot │ High │ Strong │ No │
│ Overcast │ Hot │ High │ Weak │ Yes │
│ Rainy │ Mild │ High │ Weak │ Yes │
│ Rainy │ Cool │ Normal │ Weak │ Yes │
│ Rainy │ Cool │ Normal │ Strong │ No │
│ Overcast │ Cool │ Normal │ Strong │ Yes │
│ Sunny │ Mild │ High │ Weak │ No │
│ Sunny │ Cool │ Normal │ Weak │ Yes │
│ Rainy │ Mild │ Normal │ Weak │ Yes │
└──────────┴──────────┴──────────┴────────┴────────────┘
Resulting Decision Tree:
                  Outlook?
                /    |     \
           Sunny  Overcast  Rainy
             |       |        |
        Humidity?   YES     Wind?
         /     \           /     \
      High   Normal    Strong   Weak
       NO      YES       NO      YES
Using the Tree for Prediction
New day: Outlook=Sunny, Humidity=Normal, Wind=Weak
Step 1: Root → Outlook = Sunny → go left
Step 2: Humidity = Normal → go right
Step 3: Reach leaf → Prediction: YES (Play)

New day: Outlook=Rainy, Wind=Strong
Step 1: Root → Outlook = Rainy → go right
Step 2: Wind = Strong → go left
Step 3: Reach leaf → Prediction: NO (Don't Play)
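The two walkthroughs above can be encoded directly as a function that follows the learned weather tree (the function name and string labels are illustrative):

```python
def play_tennis(outlook, humidity=None, wind=None):
    """Walk the learned weather tree: Outlook first, then Humidity or Wind."""
    if outlook == "Overcast":
        return "Yes"                                  # overcast branch is a pure leaf
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"  # sunny branch asks Humidity
    return "No" if wind == "Strong" else "Yes"        # rainy branch asks Wind

print(play_tennis("Sunny", humidity="Normal"))  # prints Yes
print(play_tennis("Rainy", wind="Strong"))      # prints No
```

Note that each prediction only touches the features along its path; the sunny branch never looks at Wind at all.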
Overfitting in Decision Trees
An unconstrained Decision Tree grows until every training record is perfectly classified. This creates an extremely deep tree that memorizes the training data instead of learning general patterns. On new data, it performs poorly.
Overfit Tree (too deep):
  Training Accuracy = 100% ← memorized every detail
  Test Accuracy     =  62% ← fails on new data

Balanced Tree (with depth limit):
  Training Accuracy = 88%
  Test Accuracy     = 85% ← generalizes well
Controlling Tree Complexity: Pruning
Hyperparameters to prevent overfitting:
┌─────────────────────────────┬───────────────────────────────────┐
│ Parameter                   │ Effect                            │
├─────────────────────────────┼───────────────────────────────────┤
│ max_depth                   │ Limits how deep the tree can grow │
│ min_samples_split           │ Minimum records needed to split   │
│ min_samples_leaf            │ Minimum records allowed in a leaf │
│ max_features                │ Max features considered per split │
└─────────────────────────────┴───────────────────────────────────┘
Example: max_depth = 4 means no more than 4 levels below the root.
A shallow tree generalizes better but may underfit.
A deep tree fits training data better but may overfit.
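These four parameter names match scikit-learn's `DecisionTreeClassifier`. A sketch assuming scikit-learn is installed (the toy loan-style data is invented for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: [credit_score, income_lakh] -> approve (1) / reject (0)
X = [[720, 8], [680, 3], [750, 6], [600, 2], [710, 4], [640, 7]]
y = [1, 0, 1, 0, 1, 0]

# A constrained tree using the four pruning hyperparameters from the table
clf = DecisionTreeClassifier(
    max_depth=4,          # no more than 4 levels below the root
    min_samples_split=2,  # need at least 2 records to consider a split
    min_samples_leaf=1,   # every leaf keeps at least 1 record
    max_features=None,    # consider all features at each split
    random_state=0,       # make tie-breaking reproducible
)
clf.fit(X, y)
print(clf.get_depth())    # depth actually used, guaranteed <= 4
```

Tightening `max_depth` or raising the `min_samples_*` values trades training accuracy for better generalization.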
Decision Trees for Regression
Decision Trees work for regression tasks too. Instead of predicting a class label at the leaf node, they predict the average value of all training records that land in that leaf.
Regression Tree Leaf:
  Leaf contains: House prices 2,50,000 / 2,60,000 / 2,70,000
  Prediction for new house = Average = 2,60,000
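The leaf-averaging rule is one line of arithmetic. A sketch with the example prices (plain integers stand in for the formatted amounts above):

```python
# Hypothetical regression-tree leaf: three training prices landed here
prices = [250_000, 260_000, 270_000]

# A regression leaf predicts the mean of its training targets
prediction = sum(prices) / len(prices)
print(prediction)  # 260000.0
```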
Advantages and Limitations
Advantages:
✓ Easy to visualize and explain
✓ No feature scaling needed
✓ Handles both numerical and categorical features
✓ Captures non-linear relationships
✓ Fast prediction (just follow the branches)

Limitations:
✗ Easily overfits without proper constraints
✗ Sensitive to small changes in data (unstable)
✗ Single trees have lower accuracy than ensemble methods
✗ Can create biased trees if one class dominates
Decision Trees lay the groundwork for some of the most powerful algorithms in Machine Learning. Random Forest and Gradient Boosting — both built from many trees working together — address the instability and overfitting problems that a single tree faces.
