Supervised Learning with Classification

Classification is a type of supervised learning that predicts which category or class a new input belongs to. Unlike regression which predicts a number, classification predicts a label — such as spam or not spam, disease or healthy, cat or dog. This topic covers Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), and K-Nearest Neighbours (KNN).

Classification vs Regression

+------------------+---------------------------+---------------------------+
|                  |   Regression              |   Classification          |
+------------------+---------------------------+---------------------------+
| Output type      | Continuous number         | Discrete category/label   |
| Example output   | ₹65.5 lakhs               | "Spam" or "Not Spam"      |
| Evaluation       | MAE, RMSE, R²             | Accuracy, F1, Precision   |
| Common models    | Linear Reg, Ridge, Lasso  | Logistic Reg, Tree, SVM   |
+------------------+---------------------------+---------------------------+

Setting Up the Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Use Scikit-learn's built-in Breast Cancer dataset
from sklearn.datasets import load_breast_cancer

data   = load_breast_cancer()
X      = pd.DataFrame(data.data, columns=data.feature_names)
y      = pd.Series(data.target)   # 0 = malignant, 1 = benign

print("Dataset shape:", X.shape)
print("Classes:", data.target_names)
print("Class distribution:\n", y.value_counts())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features for algorithms that are distance-based
scaler    = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

Logistic Regression

Despite the name, Logistic Regression is a classification algorithm. It calculates the probability that an input belongs to a class (0 or 1). Internally, it applies the sigmoid function to a linear combination of features to convert a number into a probability between 0 and 1.

Diagram – Sigmoid Function (S-Curve)

Probability
  1.0 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ╭─────────────
        │                 ╭╯
  0.5 ──┼────────────────╳─────────────
        │           ╭╯
  0.0 ──╰───────────╯
        ← Negative   ↑   Positive →
                  z = 0
                  Threshold

If P(y=1) ≥ 0.5 → Predict class 1 (e.g., benign)
If P(y=1) < 0.5 → Predict class 0 (e.g., malignant)
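The sigmoid and the 0.5 threshold can be sketched in a few lines of NumPy — a minimal standalone example, not the scikit-learn internals:

```python
import numpy as np

def sigmoid(z):
    # Map any real number z to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# At z = 0 the sigmoid returns exactly 0.5 -- the decision threshold
print(sigmoid(0.0))           # 0.5
print(sigmoid(3.0) >= 0.5)    # True  -> predict class 1
print(sigmoid(-3.0) >= 0.5)   # False -> predict class 0
```

Scikit-learn applies the same rule internally when you call `predict`; `predict_proba` exposes the raw probabilities.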
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_s, y_train)
y_pred_lr = lr.predict(X_test_s)

print("Logistic Regression")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr, target_names=data.target_names))

Decision Tree

A Decision Tree makes predictions by asking a series of yes/no questions about the input features. It splits the data at each node based on the feature that best separates the classes. The tree structure makes predictions easy to understand and explain.

Diagram – Decision Tree Structure

                [Root Node]
           Is tumour radius < 15?
               /            \
             Yes              No
             /                 \
    [radius < 12?]         → Malignant (class 0)
       /       \
     Yes         No
      |            |
   Benign      [Check concavity]
  (class 1)      /        \
               < 0.1      ≥ 0.1
                |             |
            Benign        Malignant
           (class 1)      (class 0)

Each leaf node gives a final class prediction.
The path from root to leaf = the decision logic.
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

print("Decision Tree")
print("Accuracy:", accuracy_score(y_test, y_pred_dt))

# Feature importance
importance = pd.Series(dt.feature_importances_, index=X.columns)
print("\nTop 5 Important Features:")
print(importance.nlargest(5))
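The yes/no questions the tree actually learned can be printed as text with scikit-learn's `export_text`. A small self-contained sketch (it refits a shallow tree on the full dataset for brevity, so the exact rules may differ from the tree above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
dt = DecisionTreeClassifier(max_depth=2, random_state=42)
dt.fit(data.data, data.target)

# Print the learned decision rules as an indented text tree
rules = export_text(dt, feature_names=list(data.feature_names))
print(rules)
```

Each `|---` line is one split; the leaf lines show the predicted class, mirroring the diagram above.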

Random Forest

Random Forest builds many decision trees, each on a random bootstrap sample of the data and a random subset of the features, then combines their predictions by majority vote. This combination of bootstrap aggregation (bagging) and random feature selection dramatically reduces overfitting compared to a single decision tree.

Diagram – Random Forest Ensemble

Training Data (with random sampling)
      │
   ┌──┴──────────────┐
   │                  │
  Random             Random
 Sample 1           Sample 2        ...  Sample N
   │                  │
 Tree 1             Tree 2              Tree N
   │                  │                   │
 "Benign"         "Malignant"          "Benign"

           Majority Vote
                 │
              "Benign" ← Final Prediction
(2 out of 3 trees voted "Benign")
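The majority vote in the diagram can be sketched directly with the standard library — a toy illustration of the voting step, not the Random Forest internals:

```python
from collections import Counter

# One prediction per tree, as in the diagram above
tree_votes = ["Benign", "Malignant", "Benign"]

# The class with the most votes wins
final_prediction, count = Counter(tree_votes).most_common(1)[0]
print(final_prediction, count)  # Benign 2
```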
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))

# Feature importance from Random Forest
importance_rf = pd.Series(rf.feature_importances_, index=X.columns)
print("\nTop 5 Important Features:")
print(importance_rf.nlargest(5))

Support Vector Machine (SVM)

SVM finds the best boundary (called a hyperplane) that separates two classes with the maximum margin between them. Points closest to the boundary — called support vectors — define the margin. SVM performs well on high-dimensional data and small datasets.

Diagram – SVM Maximum Margin

Class A (●) and Class B (○):

  ○ ○ ○
  ○   ○
  ○ ○   ║ ← Best Hyperplane
        ║
    ● ● ║ ← Margin (as wide as possible)
    ● ●
  ● ●

Support Vectors: points closest to the boundary ─ they define where the line goes.
The SVM tries to push this margin as wide as possible.
from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=1.0, random_state=42)
svm.fit(X_train_s, y_train)
y_pred_svm = svm.predict(X_test_s)

print("SVM (RBF Kernel)")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))

K-Nearest Neighbours (KNN)

KNN classifies a new point by looking at its K nearest neighbours in the training data and taking a majority vote. It requires no training — it simply memorises the data and computes distances at prediction time.

Diagram – KNN with K=3

Training data:
  ● ● ○
  ●   ○ ○
  ●     ○
  ● ●

New point: ★

  ● ● ○
  ●   ○ ○
  ●  ★   ○      ← New point to classify
  ● ●

Find K=3 nearest neighbours:
  Neighbour 1: ○ (distance = 1.2)
  Neighbour 2: ○ (distance = 1.5)
  Neighbour 3: ● (distance = 1.8)

Vote: 2× ○ vs 1× ●
Decision: Classify ★ as ○
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)
y_pred_knn = knn.predict(X_test_s)

print("KNN (K=5)")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
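K is a hyperparameter worth tuning: small K is noisy, large K over-smooths. A self-contained sketch that compares a few odd values of K on the same split (odd K avoids tied votes in binary classification):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# KNN is distance-based, so scaling matters
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

accs = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    accs[k] = knn.score(X_test_s, y_test)
    print(f"K={k}: accuracy = {accs[k]:.4f}")
```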

Confusion Matrix – Detailed Evaluation

Accuracy alone can be misleading — especially when classes are imbalanced. The confusion matrix shows exactly which classes the model confuses with each other.

Diagram – Confusion Matrix Structure

                      Predicted
                 Neg (0)           Pos (1)
Actual  ┌─────────────────┬─────────────────┐
Neg (0) │  TN (True Neg)  │  FP (False Pos) │
Pos (1) │  FN (False Neg) │  TP (True Pos)  │
        └─────────────────┴─────────────────┘

TN = Correctly predicted Negative (Healthy → predicted Healthy)
TP = Correctly predicted Positive (Sick → predicted Sick)
FP = False Alarm (Healthy → predicted Sick) – Type I Error
FN = Missed Case (Sick → predicted Healthy) – Type II Error

In medical diagnosis, FN is dangerous: predicting healthy when sick.
# Confusion matrix for Random Forest
cm = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix:")
print(cm)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=data.target_names))

Classification Metrics Explained

+----------------------+-----------------------+--------------------------------------------------+---------------------------------------------------+
| Metric               | Formula               | Meaning                                          | Use When                                          |
+----------------------+-----------------------+--------------------------------------------------+---------------------------------------------------+
| Accuracy             | (TP + TN) / Total     | % of all predictions correct                     | Classes are balanced                              |
| Precision            | TP / (TP + FP)        | Of all "positive" predictions, how many correct  | False positives are costly (spam filter)          |
| Recall (Sensitivity) | TP / (TP + FN)        | Of all actual positives, how many were caught    | False negatives are dangerous (disease detection) |
| F1-Score             | 2 × (P × R) / (P + R) | Harmonic mean of Precision and Recall            | Imbalanced classes, both errors matter            |
+----------------------+-----------------------+--------------------------------------------------+---------------------------------------------------+
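The four formulas can be checked by hand from the confusion-matrix counts. A toy example with made-up counts (TN=50, FP=5, FN=10, TP=35 — illustrative numbers, not from the breast cancer model):

```python
# Counts from a hypothetical confusion matrix
TN, FP, FN, TP = 50, 5, 10, 35

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy  = {accuracy:.4f}")   # 0.8500
print(f"Precision = {precision:.4f}")  # 0.8750
print(f"Recall    = {recall:.4f}")     # 0.7778
print(f"F1-Score  = {f1:.4f}")         # 0.8235
```

Note that accuracy looks decent even though the model misses 10 of the 45 actual positives — exactly the situation where recall tells the fuller story.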

Model Comparison – All Classifiers

from sklearn.metrics import accuracy_score, f1_score

results = {
    "Logistic Regression": y_pred_lr,
    "Decision Tree":       y_pred_dt,
    "Random Forest":       y_pred_rf,
    "SVM":                 y_pred_svm,
    "KNN (K=5)":           y_pred_knn,
}

print(f"{'Model':<25} {'Accuracy':>10} {'F1-Score':>10}")
print("-" * 47)

for name, yp in results.items():
    acc = accuracy_score(y_test, yp)
    f1  = f1_score(y_test, yp, average="weighted")
    print(f"{name:<25} {acc:>10.4f} {f1:>10.4f}")
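A single train/test split can flatter or penalise a model by chance; k-fold cross-validation gives a more robust comparison. A self-contained sketch for one model (a pipeline keeps the scaling inside each fold, so the held-out fold never leaks into the scaler):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Scaler + model fitted fresh inside every fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, data.data, data.target, cv=5)

print("Fold accuracies:", scores.round(4))
print(f"Mean = {scores.mean():.4f}, Std = {scores.std():.4f}")
```

The same loop idea extends to all five classifiers: wrap each in a pipeline and compare mean CV accuracy rather than a single split.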

Choosing the Right Classifier

+---------------------+-------------------------------------+--------------------------------------+----------------------------------------+
| Algorithm           | Strengths                           | Weaknesses                           | Best For                               |
+---------------------+-------------------------------------+--------------------------------------+----------------------------------------+
| Logistic Regression | Fast, interpretable, probabilistic  | Linear boundary only                 | Baseline model, interpretable results  |
| Decision Tree       | Easy to explain, no scaling needed  | Overfits easily                      | Rule extraction, explainability        |
| Random Forest       | High accuracy, handles outliers     | Slow to predict, less interpretable  | General purpose, tabular data          |
| SVM                 | Effective in high dimensions        | Slow on large datasets, needs scaling| Text classification, small datasets    |
| KNN                 | Simple, no training needed          | Slow at prediction, affected by scale| Small datasets, prototype testing      |
+---------------------+-------------------------------------+--------------------------------------+----------------------------------------+

Summary

  • Classification predicts a discrete category, not a continuous number
  • Logistic Regression calculates class probabilities using the sigmoid function
  • Decision Trees make predictions through a series of yes/no questions — highly interpretable
  • Random Forest combines many decision trees to reduce overfitting and improve accuracy
  • SVM finds the widest possible boundary between classes using support vectors
  • KNN classifies by majority vote among the K nearest training points
  • The confusion matrix reveals exactly which classes the model misclassifies
  • Use Precision when false positives are costly; use Recall when false negatives are dangerous
