Supervised Learning with Classification

Classification is a type of supervised learning that predicts which category or class a new input belongs to. Unlike regression which predicts a number, classification predicts a label — such as spam or not spam, disease or healthy, cat or dog. This topic covers Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), and K-Nearest Neighbours (KNN).

Classification vs Regression

+------------------+---------------------------+---------------------------+
|                  |   Regression              |   Classification          |
+------------------+---------------------------+---------------------------+
| Output type      | Continuous number         | Discrete category/label   |
| Example output   | ₹65.5 lakhs               | "Spam" or "Not Spam"      |
| Evaluation       | MAE, RMSE, R²             | Accuracy, F1, Precision   |
| Common models    | Linear Reg, Ridge, Lasso  | Logistic Reg, Tree, SVM   |
+------------------+---------------------------+---------------------------+

Setting Up the Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Use Scikit-learn's built-in Breast Cancer dataset
from sklearn.datasets import load_breast_cancer

data   = load_breast_cancer()
X      = pd.DataFrame(data.data, columns=data.feature_names)
y      = pd.Series(data.target)   # 0 = malignant, 1 = benign

print("Dataset shape:", X.shape)
print("Classes:", data.target_names)
print("Class distribution:\n", y.value_counts())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features for algorithms that are distance-based
scaler    = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

Logistic Regression

Despite the name, Logistic Regression is a classification algorithm. It calculates the probability that an input belongs to a class (0 or 1). Internally, it applies the sigmoid function to a linear combination of features to convert a number into a probability between 0 and 1.

Diagram – Sigmoid Function (S-Curve)

Probability
  1.0 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ╭─────────────
        │                 ╭╯
  0.5 ──┼────────────────╳─────────────
        │           ╭╯
  0.0 ──╰───────────╯
        ← Negative   ↑   Positive →
                  z = 0
                  Threshold

If P(y=1) ≥ 0.5 → Predict class 1 (e.g., benign)
If P(y=1) < 0.5 → Predict class 0 (e.g., malignant)
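The sigmoid and the 0.5 threshold can be sketched in a few lines of NumPy — a minimal standalone example, not the scikit-learn internals:

```python
import numpy as np

def sigmoid(z):
    # Map any real number z to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# At z = 0 the sigmoid returns exactly 0.5 -- the decision threshold
print(sigmoid(0.0))           # 0.5
print(sigmoid(3.0) >= 0.5)    # True  -> predict class 1
print(sigmoid(-3.0) >= 0.5)   # False -> predict class 0
```

Scikit-learn applies the same rule internally when you call `predict`; `predict_proba` exposes the raw probabilities.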
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_s, y_train)
y_pred_lr = lr.predict(X_test_s)

print("Logistic Regression")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr, target_names=data.target_names))

Decision Tree

A Decision Tree makes predictions by asking a series of yes/no questions about the input features. It splits the data at each node based on the feature that best separates the classes. The tree structure makes predictions easy to understand and explain.

Diagram – Decision Tree Structure

                [Root Node]
           Is tumour radius < 15?
               /            \
             Yes              No
             /                 \
    [radius < 12?]         → Malignant (class 0)
       /       \
     Yes         No
      |            |
   Benign      [Check concavity]
  (class 1)      /        \
               < 0.1      ≥ 0.1
                |             |
            Benign        Malignant
           (class 1)      (class 0)

Each leaf node gives a final class prediction.
The path from root to leaf = the decision logic.
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

print("Decision Tree")
print("Accuracy:", accuracy_score(y_test, y_pred_dt))

# Feature importance
importance = pd.Series(dt.feature_importances_, index=X.columns)
print("\nTop 5 Important Features:")
print(importance.nlargest(5))
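The yes/no questions the tree actually learned can be printed as text with scikit-learn's `export_text`. A small self-contained sketch (it refits a shallow tree on the full dataset for brevity, so the exact rules may differ from the tree above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
dt = DecisionTreeClassifier(max_depth=2, random_state=42)
dt.fit(data.data, data.target)

# Print the learned decision rules as an indented text tree
rules = export_text(dt, feature_names=list(data.feature_names))
print(rules)
```

Each `|---` line is one split; the leaf lines show the predicted class, mirroring the diagram above.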

Random Forest

Random Forest builds many decision trees, each on a random bootstrap sample of the data and a random subset of the features, then combines their predictions by majority vote. This combination of bootstrap aggregation (bagging) and random feature selection dramatically reduces overfitting compared to a single decision tree.

Diagram – Random Forest Ensemble

Training Data (with random sampling)
      │
   ┌──┴──────────────┐
   │                  │
  Random             Random
 Sample 1           Sample 2        ...  Sample N
   │                  │
 Tree 1             Tree 2              Tree N
   │                  │                   │
 "Benign"         "Malignant"          "Benign"

           Majority Vote
                 │
              "Benign" ← Final Prediction
(2 out of 3 trees voted "Benign")
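The majority vote in the diagram can be sketched directly with the standard library — a toy illustration of the voting step, not the Random Forest internals:

```python
from collections import Counter

# One prediction per tree, as in the diagram above
tree_votes = ["Benign", "Malignant", "Benign"]

# The class with the most votes wins
final_prediction, count = Counter(tree_votes).most_common(1)[0]
print(final_prediction, count)  # Benign 2
```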
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))

# Feature importance from Random Forest
importance_rf = pd.Series(rf.feature_importances_, index=X.columns)
print("\nTop 5 Important Features:")
print(importance_rf.nlargest(5))

Support Vector Machine (SVM)

SVM finds the best boundary (called a hyperplane) that separates two classes with the maximum margin between them. Points closest to the boundary — called support vectors — define the margin. SVM performs well on high-dimensional data and small datasets.

Diagram – SVM Maximum Margin

Class A (●) and Class B (○):

  ○ ○ ○
  ○   ○
  ○ ○   ║ ← Best Hyperplane
        ║
    ● ● ║ ← Margin (as wide as possible)
    ● ●
  ● ●

Support Vectors: points closest to the boundary ─ they define where the line goes.
The SVM tries to push this margin as wide as possible.
from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=1.0, random_state=42)
svm.fit(X_train_s, y_train)
y_pred_svm = svm.predict(X_test_s)

print("SVM (RBF Kernel)")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))

K-Nearest Neighbours (KNN)

KNN classifies a new point by looking at its K nearest neighbours in the training data and taking a majority vote. It requires no training — it simply memorises the data and computes distances at prediction time.

Diagram – KNN with K=3

Training data:
  ● ● ○
  ●   ○ ○
  ●     ○
  ● ●

New point: ★

  ● ● ○
  ●   ○ ○
  ●  ★   ○      ← New point to classify
  ● ●

Find K=3 nearest neighbours:
  Neighbour 1: ○ (distance = 1.2)
  Neighbour 2: ○ (distance = 1.5)
  Neighbour 3: ● (distance = 1.8)

Vote: 2× ○ vs 1× ●
Decision: Classify ★ as ○
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)
y_pred_knn = knn.predict(X_test_s)

print("KNN (K=5)")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
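K is a hyperparameter worth tuning: small K is noisy, large K over-smooths. A self-contained sketch that compares a few odd values of K on the same split (odd K avoids tied votes in binary classification):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# KNN is distance-based, so scaling matters
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

accs = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    accs[k] = knn.score(X_test_s, y_test)
    print(f"K={k}: accuracy = {accs[k]:.4f}")
```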

Confusion Matrix – Detailed Evaluation

Accuracy alone can be misleading — especially when classes are imbalanced. The confusion matrix shows exactly which classes the model confuses with each other.

Diagram – Confusion Matrix Structure

                      Predicted
                 Neg (0)           Pos (1)
Actual  ┌─────────────────┬─────────────────┐
Neg (0) │  TN (True Neg)  │  FP (False Pos) │
Pos (1) │  FN (False Neg) │  TP (True Pos)  │
        └─────────────────┴─────────────────┘

TN = Correctly predicted Negative (Healthy → predicted Healthy)
TP = Correctly predicted Positive (Sick → predicted Sick)
FP = False Alarm (Healthy → predicted Sick) – Type I Error
FN = Missed Case (Sick → predicted Healthy) – Type II Error

In medical diagnosis, FN is dangerous: predicting healthy when sick.
# Confusion matrix for Random Forest
cm = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix:")
print(cm)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=data.target_names))

Classification Metrics Explained

+----------------------+-----------------------+--------------------------------------------------+---------------------------------------------------+
| Metric               | Formula               | Meaning                                          | Use When                                          |
+----------------------+-----------------------+--------------------------------------------------+---------------------------------------------------+
| Accuracy             | (TP + TN) / Total     | % of all predictions correct                     | Classes are balanced                              |
| Precision            | TP / (TP + FP)        | Of all "positive" predictions, how many correct  | False positives are costly (spam filter)          |
| Recall (Sensitivity) | TP / (TP + FN)        | Of all actual positives, how many were caught    | False negatives are dangerous (disease detection) |
| F1-Score             | 2 × (P × R) / (P + R) | Harmonic mean of Precision and Recall            | Imbalanced classes, both errors matter            |
+----------------------+-----------------------+--------------------------------------------------+---------------------------------------------------+
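The four formulas can be checked by hand from the confusion-matrix counts. A toy example with made-up counts (TN=50, FP=5, FN=10, TP=35 — illustrative numbers, not from the breast cancer model):

```python
# Counts from a hypothetical confusion matrix
TN, FP, FN, TP = 50, 5, 10, 35

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy  = {accuracy:.4f}")   # 0.8500
print(f"Precision = {precision:.4f}")  # 0.8750
print(f"Recall    = {recall:.4f}")     # 0.7778
print(f"F1-Score  = {f1:.4f}")         # 0.8235
```

Note that accuracy looks decent even though the model misses 10 of the 45 actual positives — exactly the situation where recall tells the fuller story.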

Model Comparison – All Classifiers

from sklearn.metrics import accuracy_score, f1_score

results = {
    "Logistic Regression": y_pred_lr,
    "Decision Tree":       y_pred_dt,
    "Random Forest":       y_pred_rf,
    "SVM":                 y_pred_svm,
    "KNN (K=5)":           y_pred_knn,
}

print(f"{'Model':<25} {'Accuracy':>10} {'F1-Score':>10}")
print("-" * 47)

for name, yp in results.items():
    acc = accuracy_score(y_test, yp)
    f1  = f1_score(y_test, yp, average="weighted")
    print(f"{name:<25} {acc:>10.4f} {f1:>10.4f}")
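A single train/test split can flatter or penalise a model by chance; k-fold cross-validation gives a more robust comparison. A self-contained sketch for one model (a pipeline keeps the scaling inside each fold, so the held-out fold never leaks into the scaler):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Scaler + model fitted fresh inside every fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, data.data, data.target, cv=5)

print("Fold accuracies:", scores.round(4))
print(f"Mean = {scores.mean():.4f}, Std = {scores.std():.4f}")
```

The same loop idea extends to all five classifiers: wrap each in a pipeline and compare mean CV accuracy rather than a single split.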

Choosing the Right Classifier

+---------------------+-------------------------------------+--------------------------------------+----------------------------------------+
| Algorithm           | Strengths                           | Weaknesses                           | Best For                               |
+---------------------+-------------------------------------+--------------------------------------+----------------------------------------+
| Logistic Regression | Fast, interpretable, probabilistic  | Linear boundary only                 | Baseline model, interpretable results  |
| Decision Tree       | Easy to explain, no scaling needed  | Overfits easily                      | Rule extraction, explainability        |
| Random Forest       | High accuracy, handles outliers     | Slow to predict, less interpretable  | General purpose, tabular data          |
| SVM                 | Effective in high dimensions        | Slow on large datasets, needs scaling| Text classification, small datasets    |
| KNN                 | Simple, no training needed          | Slow at prediction, affected by scale| Small datasets, prototype testing      |
+---------------------+-------------------------------------+--------------------------------------+----------------------------------------+

Summary

  • Classification predicts a discrete category, not a continuous number
  • Logistic Regression calculates class probabilities using the sigmoid function
  • Decision Trees make predictions through a series of yes/no questions — highly interpretable
  • Random Forest combines many decision trees to reduce overfitting and improve accuracy
  • SVM finds the widest possible boundary between classes using support vectors
  • KNN classifies by majority vote among the K nearest training points
  • The confusion matrix reveals exactly which classes the model misclassifies
  • Use Precision when false positives are costly; use Recall when false negatives are dangerous
