Machine Learning Decision Trees
A Decision Tree is a flowchart-like model that makes predictions by asking a series of yes/no questions about the input features. Each question splits the data into smaller groups until a final prediction is reached. Decision Trees are intuitive, visual, and easy to explain to non-technical audiences.
The Core Idea
Think of a Decision Tree as a game of "20 Questions." Each question narrows down the possibilities. The model learns which questions to ask and in what order by analyzing the training data.
Example: Should a bank approve a loan?
                  Is Credit Score > 700?
                   /                  \
                 Yes                   No
                 /                      \
    Is Income > ₹5L/year?         Is Debt < ₹1L?
         /        \                 /       \
       Yes         No             Yes        No
        |           |              |          |
     APPROVE     REJECT         APPROVE    REJECT
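The loan tree above reads as nested if/else checks. A minimal sketch (the function name, thresholds, and lakh-unit arguments are illustrative, taken straight from the diagram):

```python
def approve_loan(credit_score, income_lakh, debt_lakh):
    """Walk the hypothetical loan tree: credit score first, then income or debt."""
    if credit_score > 700:
        # Left branch: check yearly income (in lakhs of rupees)
        return "APPROVE" if income_lakh > 5 else "REJECT"
    else:
        # Right branch: check outstanding debt (in lakhs of rupees)
        return "APPROVE" if debt_lakh < 1 else "REJECT"

print(approve_loan(750, 8, 0.5))  # high credit, high income -> APPROVE
print(approve_loan(650, 8, 2.0))  # low credit, high debt    -> REJECT
```

Every path from the root to a leaf becomes one chain of conditions, which is why a fitted tree is so easy to explain.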
Key Terminology
┌──────────────────┬───────────────────────────────────────────────┐
│ Term             │ Meaning                                       │
├──────────────────┼───────────────────────────────────────────────┤
│ Root Node        │ The very first question at the top of the tree│
│ Internal Node    │ Any question node in the middle of the tree   │
│ Branch           │ A path from one node to the next              │
│ Leaf Node        │ The final answer at the bottom (no more       │
│                  │ questions — just a prediction)                │
│ Depth            │ Number of levels from root to deepest leaf    │
│ Split            │ The act of dividing data using a question     │
└──────────────────┴───────────────────────────────────────────────┘
How the Tree Decides Which Question to Ask First
The algorithm picks the feature and the split value that best separates the classes. Two common measures determine "best separation":
Gini Impurity
Gini Impurity measures how mixed a group is.
0.0 = Perfectly pure (all one class) — best
0.5 = Completely mixed (equal split) — worst
Formula:
Gini = 1 - (p_class1² + p_class2²)
Example Node: 40 Approve, 10 Reject (total 50)
p_approve = 40/50 = 0.80
p_reject = 10/50 = 0.20
Gini = 1 - (0.80² + 0.20²)
= 1 - (0.64 + 0.04)
= 1 - 0.68
= 0.32
A pure node (50 Approve, 0 Reject):
Gini = 1 - (1.0² + 0.0²) = 0.0 ← Perfect
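The two calculations above generalize to any class counts. A small sketch (the `gini` helper name is illustrative):

```python
def gini(counts):
    """Gini impurity from per-class record counts at a node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(round(gini([40, 10]), 2))  # mixed node from the example: 0.32
print(gini([50, 0]))             # pure node: 0.0
```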
Information Gain (Entropy)
Entropy measures uncertainty or disorder in a group.
0.0 = No uncertainty (all one class) — best
1.0 = Maximum uncertainty (perfectly balanced) — worst
Formula:
Entropy = -(p_class1 · log₂(p_class1) + p_class2 · log₂(p_class2))
Information Gain = Entropy before split - weighted average Entropy of the child nodes after split
The algorithm picks the feature that gives the HIGHEST information gain — meaning the split that reduces uncertainty the most.
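Entropy and information gain can be computed the same way as Gini. A sketch for a two-way split (the function names and the 10/10 example counts are invented for illustration):

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) from per-class record counts at a node."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    after = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - after

# Hypothetical split: a 10-vs-10 parent divided into (8,2) and (2,8) children
print(entropy([10, 10]))                                       # 1.0 (maximum uncertainty)
print(round(information_gain([10, 10], [[8, 2], [2, 8]]), 3))  # 0.278
```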
Building a Decision Tree Step by Step
Step 1: Start with all training records at the root node.
Step 2: Try every possible feature and every possible split value.
Calculate Gini or Information Gain for each.
Choose the split with the best score.
Step 3: Split the data into two child nodes based on the chosen question.
Step 4: Repeat Steps 2–3 for each child node.
Step 5: Stop when:
- A leaf node has only one class (pure)
- Maximum tree depth is reached
- Minimum number of samples per node is reached
- Splitting no longer improves the score
Step 6: Assign each leaf the most common class among its records.
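Steps 1-6 can be sketched as a small recursive builder using Gini impurity. This is a toy illustration only (the function names, the tuple-based tree representation, and the tiny one-feature dataset are all assumptions), not a production CART implementation:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Step 2: try every feature and threshold; return (score, feature, threshold)."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # this threshold does not actually split the data
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Steps 1-5: split recursively; stop when pure, at max depth, or unsplittable."""
    if len(set(labels)) == 1 or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]  # Step 6: majority-class leaf
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    """Follow branches until a leaf (a plain label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

rows = [[1], [2], [3], [4]]   # one numeric feature
labels = ["A", "A", "B", "B"]
tree = build_tree(rows, labels)
print(predict(tree, [1.5]))   # prints A
```

Internal nodes are stored as `(feature, threshold, left_subtree, right_subtree)` tuples and leaves as bare labels, which keeps the recursion in `predict` trivial.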
Full Example: Predicting Weather Play Decision
Training Data (10 records):
┌──────────┬──────────┬──────────┬────────┬────────────┐
│ Outlook │ Temp │ Humidity │ Wind │ Play? │
├──────────┼──────────┼──────────┼────────┼────────────┤
│ Sunny │ Hot │ High │ Weak │ No │
│ Sunny │ Hot │ High │ Strong │ No │
│ Overcast │ Hot │ High │ Weak │ Yes │
│ Rainy │ Mild │ High │ Weak │ Yes │
│ Rainy │ Cool │ Normal │ Weak │ Yes │
│ Rainy │ Cool │ Normal │ Strong │ No │
│ Overcast │ Cool │ Normal │ Strong │ Yes │
│ Sunny │ Mild │ High │ Weak │ No │
│ Sunny │ Cool │ Normal │ Weak │ Yes │
│ Rainy │ Mild │ Normal │ Weak │ Yes │
└──────────┴──────────┴──────────┴────────┴────────────┘
Resulting Decision Tree:
                  Outlook?
                /    |     \
           Sunny  Overcast  Rainy
             |       |        |
        Humidity?   YES     Wind?
         /     \           /     \
      High   Normal    Strong   Weak
       NO      YES       NO      YES
Using the Tree for Prediction
New day: Outlook=Sunny, Humidity=Normal, Wind=Weak
Step 1: Root → Outlook = Sunny → go left
Step 2: Humidity = Normal → go right
Step 3: Reach leaf → Prediction: YES (Play)

New day: Outlook=Rainy, Wind=Strong
Step 1: Root → Outlook = Rainy → go right
Step 2: Wind = Strong → go left
Step 3: Reach leaf → Prediction: NO (Don't Play)
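The two walkthroughs above can be encoded directly as a function that follows the learned weather tree (the function name and string labels are illustrative):

```python
def play_tennis(outlook, humidity=None, wind=None):
    """Walk the learned weather tree: Outlook first, then Humidity or Wind."""
    if outlook == "Overcast":
        return "Yes"                                  # overcast branch is a pure leaf
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"  # sunny branch asks Humidity
    return "No" if wind == "Strong" else "Yes"        # rainy branch asks Wind

print(play_tennis("Sunny", humidity="Normal"))  # prints Yes
print(play_tennis("Rainy", wind="Strong"))      # prints No
```

Note that each prediction only touches the features along its path; the sunny branch never looks at Wind at all.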
Overfitting in Decision Trees
An unconstrained Decision Tree grows until every training record is perfectly classified. This creates an extremely deep tree that memorizes the training data instead of learning general patterns. On new data, it performs poorly.
Overfit Tree (too deep):
  Training Accuracy = 100% ← memorized every detail
  Test Accuracy     =  62% ← fails on new data

Balanced Tree (with depth limit):
  Training Accuracy = 88%
  Test Accuracy     = 85% ← generalizes well
Controlling Tree Complexity: Pruning
Hyperparameters to prevent overfitting:
┌─────────────────────────────┬───────────────────────────────────┐
│ Parameter                   │ Effect                            │
├─────────────────────────────┼───────────────────────────────────┤
│ max_depth                   │ Limits how deep the tree can grow │
│ min_samples_split           │ Minimum records needed to split   │
│ min_samples_leaf            │ Minimum records allowed in a leaf │
│ max_features                │ Max features considered per split │
└─────────────────────────────┴───────────────────────────────────┘
Example: max_depth = 4 means no more than 4 levels below the root.
A shallow tree generalizes better but may underfit.
A deep tree fits training data better but may overfit.
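These four parameter names match scikit-learn's `DecisionTreeClassifier`. A sketch assuming scikit-learn is installed (the toy loan-style data is invented for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: [credit_score, income_lakh] -> approve (1) / reject (0)
X = [[720, 8], [680, 3], [750, 6], [600, 2], [710, 4], [640, 7]]
y = [1, 0, 1, 0, 1, 0]

# A constrained tree using the four pruning hyperparameters from the table
clf = DecisionTreeClassifier(
    max_depth=4,          # no more than 4 levels below the root
    min_samples_split=2,  # need at least 2 records to consider a split
    min_samples_leaf=1,   # every leaf keeps at least 1 record
    max_features=None,    # consider all features at each split
    random_state=0,       # make tie-breaking reproducible
)
clf.fit(X, y)
print(clf.get_depth())    # depth actually used, guaranteed <= 4
```

Tightening `max_depth` or raising the `min_samples_*` values trades training accuracy for better generalization.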
Decision Trees for Regression
Decision Trees work for regression tasks too. Instead of predicting a class label at the leaf node, they predict the average value of all training records that land in that leaf.
Regression Tree Leaf:
  Leaf contains: House prices 2,50,000 / 2,60,000 / 2,70,000
  Prediction for new house = Average = 2,60,000
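The leaf-averaging rule is one line of arithmetic. A sketch with the example prices (plain integers stand in for the formatted amounts above):

```python
# Hypothetical regression-tree leaf: three training prices landed here
prices = [250_000, 260_000, 270_000]

# A regression leaf predicts the mean of its training targets
prediction = sum(prices) / len(prices)
print(prediction)  # 260000.0
```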
Advantages and Limitations
Advantages:
✓ Easy to visualize and explain
✓ No feature scaling needed
✓ Handles both numerical and categorical features
✓ Captures non-linear relationships
✓ Fast prediction (just follow the branches)

Limitations:
✗ Easily overfits without proper constraints
✗ Sensitive to small changes in data (unstable)
✗ Single trees have lower accuracy than ensemble methods
✗ Can create biased trees if one class dominates
Decision Trees lay the groundwork for some of the most powerful algorithms in Machine Learning. Random Forest and Gradient Boosting — both built from many trees working together — address the instability and overfitting problems that a single tree faces.
