Supervised Learning with Classification
Classification is a type of supervised learning that predicts which category or class a new input belongs to. Unlike regression, which predicts a number, classification predicts a label — such as spam or not spam, disease or healthy, cat or dog. This topic covers Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), and K-Nearest Neighbours (KNN).
Classification vs Regression
|                | Regression                | Classification            |
|----------------|---------------------------|---------------------------|
| Output type    | Continuous number         | Discrete category/label   |
| Example output | ₹65.5 lakhs               | "Spam" or "Not Spam"      |
| Evaluation     | MAE, RMSE, R²             | Accuracy, F1, Precision   |
| Common models  | Linear Reg, Ridge, Lasso  | Logistic Reg, Tree, SVM   |
Setting Up the Dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
# Use Scikit-learn's built-in Breast Cancer dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target) # 0 = malignant, 1 = benign
print("Dataset shape:", X.shape)
print("Classes:", data.target_names)
print("Class distribution:\n", y.value_counts())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features for algorithms that are distance-based
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
Logistic Regression
Despite its name, Logistic Regression is a classification algorithm. It estimates the probability that an input belongs to a class (0 or 1). Internally, it applies the sigmoid function, sigmoid(z) = 1 / (1 + e^(−z)), to a linear combination of the features z = w·x + b, converting any real number into a probability between 0 and 1.
Diagram – Sigmoid Function (S-Curve)
Probability
1.0 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─╭────────────
 │                     ╭──╯
0.5 ───────────────────╳──────────────
 │                  ╭──╯
0.0 ────────────────╯
        ← Negative     ↑     Positive →
                     z = 0
Threshold
If P(y=1) ≥ 0.5 → Predict class 1 (e.g., benign)
If P(y=1) < 0.5 → Predict class 0 (e.g., malignant)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_s, y_train)
y_pred_lr = lr.predict(X_test_s)
print("Logistic Regression")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr, target_names=data.target_names))
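To see the 0.5 threshold in action, inspect the predicted probabilities directly: predict_proba returns one column per class, and thresholding the class-1 column reproduces predict. A minimal sketch:
# Probability of class 1 (benign) for the first five test samples
proba = lr.predict_proba(X_test_s)[:, 1]
print("P(benign), first 5 samples:", proba[:5].round(3))
# Applying the 0.5 threshold manually matches lr.predict()
manual_pred = (proba >= 0.5).astype(int)
print("Matches predict():", (manual_pred == y_pred_lr).all())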
Decision Tree
A Decision Tree makes predictions by asking a series of yes/no questions about the input features. It splits the data at each node based on the feature that best separates the classes. The tree structure makes predictions easy to understand and explain.
Diagram – Decision Tree Structure
                [Root Node]
          Is tumour radius < 15?
               /         \
             Yes          No
             /             \
    [radius < 12?]    Malignant (class 0)
       /      \
     Yes       No
      |         |
   Benign   [Check concavity]
  (class 1)    /       \
            < 0.1     ≥ 0.1
              |          |
           Benign    Malignant
          (class 1)  (class 0)
Each leaf node gives a final class prediction.
The path from root to leaf = the decision logic.
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("Decision Tree")
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
# Feature importance
importance = pd.Series(dt.feature_importances_, index=X.columns)
print("\nTop 5 Important Features:")
print(importance.nlargest(5))
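The learned yes/no questions can also be printed as plain-text rules, which makes the decision logic above concrete. A short sketch using scikit-learn's export_text (the exact splits depend on the fitted tree):
from sklearn.tree import export_text
# Show the top levels of the fitted tree as if/else rules, one indent per depth
print(export_text(dt, feature_names=list(X.columns), max_depth=2))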
Random Forest
Random Forest builds many decision trees, each on a random bootstrap sample of the rows and a random subset of the features at each split, then combines their predictions by majority vote. This combination of bagging (bootstrap aggregating) and random feature selection dramatically reduces overfitting compared to a single decision tree.
Diagram – Random Forest Ensemble
       Training Data (with random sampling)
                      │
       ┌──────────────┼──────────────┐
       │              │              │
    Random         Random         Random
   Sample 1       Sample 2  ...  Sample N
       │              │              │
    Tree 1         Tree 2         Tree N
       │              │              │
   "Benign"      "Malignant"     "Benign"
       └──────────────┼──────────────┘
                Majority Vote
                      │
         "Benign" ← Final Prediction
    (2 out of 3 trees voted "Benign")
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
# Feature importance from Random Forest
importance_rf = pd.Series(rf.feature_importances_, index=X.columns)
print("\nTop 5 Important Features:")
print(importance_rf.nlargest(5))
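The majority vote can be observed directly by asking each fitted tree for its own prediction. A small sketch (note: scikit-learn's forest actually averages class probabilities rather than counting hard votes, but the two almost always agree):
# Collect every tree's vote for the first test sample
sample = X_test.iloc[[0]].values  # individual trees were fitted on plain arrays
votes = [int(tree.predict(sample)[0]) for tree in rf.estimators_]
print("Votes for benign (1):", sum(votes), "out of", len(votes))
print("Forest prediction:", rf.predict(X_test.iloc[[0]])[0])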
Support Vector Machine (SVM)
SVM finds the best boundary (called a hyperplane) that separates two classes with the maximum margin between them. Points closest to the boundary — called support vectors — define the margin. SVM performs well on high-dimensional data and small datasets.
Diagram – SVM Maximum Margin
Class A (●) and Class B (○):

   ●  ●            ║            ○  ○  ○
 ●    ●   ●        ║ ← Best Hyperplane
   ●  ●            ║            ○   ○
              ←────║────→
            Margin (as wide as possible)
Support Vectors: the points closest to the boundary; they define where the line goes.
The SVM tries to push this margin as wide as possible.
from sklearn.svm import SVC
svm = SVC(kernel="rbf", C=1.0, random_state=42)
svm.fit(X_train_s, y_train)
y_pred_svm = svm.predict(X_test_s)
print("SVM (RBF Kernel)")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
K-Nearest Neighbours (KNN)
KNN classifies a new point by looking at its K nearest neighbours in the training data and taking a majority vote. It has no real training phase — it simply memorises the training data and computes distances at prediction time (a "lazy learner").
Diagram – KNN with K=3
Training data: ● ● ○ ● ○ ○ ● ○ ● ●     New point: ★

     ●   ●       ○
  ●      ○    ○
     ●     ★     ○   ← New point to classify
  ●   ●

Find K=3 nearest neighbours:
  Neighbour 1: ○ (distance = 1.2)
  Neighbour 2: ○ (distance = 1.5)
  Neighbour 3: ● (distance = 1.8)

Vote: 2× ○ vs 1× ●
Decision: Classify ★ as ○
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)
y_pred_knn = knn.predict(X_test_s)
print("KNN (K=5)")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
Confusion Matrix – Detailed Evaluation
Accuracy alone can be misleading — especially when classes are imbalanced. The confusion matrix shows exactly which classes the model confuses with each other.
Diagram – Confusion Matrix Structure
                      Predicted
                 Neg (0)          Pos (1)
 Actual ┌────────────────┬────────────────┐
Neg (0) │ TN (True Neg)  │ FP (False Pos) │
Pos (1) │ FN (False Neg) │ TP (True Pos)  │
        └────────────────┴────────────────┘
TN = Correctly predicted Negative (Healthy → predicted Healthy)
TP = Correctly predicted Positive (Sick → predicted Sick)
FP = False Alarm (Healthy → predicted Sick) – Type I Error
FN = Missed Case (Sick → predicted Healthy) – Type II Error
In medical diagnosis, FN is dangerous: predicting healthy when sick.
# Confusion matrix for Random Forest
cm = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix:")
print(cm)
# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=data.target_names))
Classification Metrics Explained
| Metric | Formula | Meaning | Use When |
|---|---|---|---|
| Accuracy | (TP + TN) / Total | % of all predictions correct | Classes are balanced |
| Precision | TP / (TP + FP) | Of all "positive" predictions, how many were correct | False positives are costly (spam filter) |
| Recall (Sensitivity) | TP / (TP + FN) | Of all actual positives, how many were caught | False negatives are dangerous (disease detection) |
| F1-Score | 2 × (P × R) / (P + R) | Harmonic mean of Precision and Recall | Imbalanced classes, both errors matter |
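These formulas can be checked against the Random Forest confusion matrix above. For binary labels 0/1, ravel() flattens the 2×2 matrix in the order TN, FP, FN, TP; a minimal sketch using the confusion_matrix imported earlier:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_rf).ravel()
precision = tp / (tp + fp)           # TP / (TP + FP)
recall = tp / (tp + fn)              # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(f"Precision = {precision:.4f}, Recall = {recall:.4f}, F1 = {f1:.4f}")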
Model Comparison – All Classifiers
from sklearn.metrics import f1_score
predictions = {
    "Logistic Regression": y_pred_lr,
    "Decision Tree": y_pred_dt,
    "Random Forest": y_pred_rf,
    "SVM": y_pred_svm,
    "KNN (K=5)": y_pred_knn
}
print(f"{'Model':<25} {'Accuracy':>10} {'F1-Score':>10}")
print("-" * 47)
for name, yp in predictions.items():
    acc = accuracy_score(y_test, yp)
    f1 = f1_score(y_test, yp, average="weighted")
    print(f"{name:<25} {acc:>10.4f} {f1:>10.4f}")
Choosing the Right Classifier
| Algorithm | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Logistic Regression | Fast, interpretable, probabilistic | Linear boundary only | Baseline model, interpretable results |
| Decision Tree | Easy to explain, no scaling needed | Overfits easily | Rule extraction, explainability |
| Random Forest | High accuracy, handles outliers | Slow to predict, less interpretable | General purpose, tabular data |
| SVM | Effective in high dimensions | Slow on large datasets, needs scaling | Text classification, small datasets |
| KNN | Simple, no training needed | Slow at prediction, affected by scale | Small datasets, prototype testing |
Summary
- Classification predicts a discrete category, not a continuous number
- Logistic Regression calculates class probabilities using the sigmoid function
- Decision Trees make predictions through a series of yes/no questions — highly interpretable
- Random Forest combines many decision trees to reduce overfitting and improve accuracy
- SVM finds the widest possible boundary between classes using support vectors
- KNN classifies by majority vote among the K nearest training points
- The confusion matrix reveals exactly which classes the model misclassifies
- Use Precision when false positives are costly; use Recall when false negatives are dangerous
