ML Support Vector Machine

Support Vector Machine (SVM) is a classification algorithm that finds the best possible boundary line (or plane) to separate two classes of data. It focuses specifically on the data points closest to the boundary — these critical points are called support vectors — and maximizes the distance (margin) between the boundary and each class.

The Core Idea: Maximum Margin

Many boundary lines can separate two classes. SVM does not just find any separating line — it finds the one that sits as far as possible from both classes. This widest possible separation makes the model more confident and better at handling new data.

Data: Two Classes — Circles (○) and Crosses (×)

Poor Boundary (close to one class):
  ○ ○ ○ | × × ×
       ↑ line sits too close to circles — unsafe

SVM Boundary (maximum margin):
  ○ ○ ○  |  × × ×
         ↑ centered between both classes
     ← margin →

Support Vectors: The closest ○ and × to the boundary line.
These are the only points that define where the line sits.
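
The idea above can be sketched in code. The snippet below assumes scikit-learn (a library choice not named in the text): it fits a linear SVM on six hand-placed 2D points and inspects which of them become support vectors.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [1, 2], [2, 1],    # class 0 (circles)
              [5, 5], [5, 6], [6, 5]])   # class 1 (crosses)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the points nearest the boundary are kept as support vectors;
# the rest of the training set does not influence where the line sits.
print(clf.support_vectors_)
print(clf.predict([[1.5, 1.5], [5.5, 5.5]]))  # → [0 1]
```

Deleting any non-support-vector point and refitting would leave the boundary unchanged, which is exactly the property described above.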

Key Terminology

┌─────────────────────┬─────────────────────────────────────────────┐
│ Term                │ Meaning                                     │
├─────────────────────┼─────────────────────────────────────────────┤
│ Hyperplane          │ The decision boundary (a line in 2D,       │
│                     │ a plane in 3D, a hyperplane in higher dims) │
│ Support Vectors     │ Training points closest to the hyperplane   │
│                     │ — only these points shape the boundary      │
│ Margin              │ Total distance between the two class        │
│                     │ boundaries on either side of the hyperplane │
│ Maximum Margin      │ SVM's goal — maximize this distance         │
└─────────────────────┴─────────────────────────────────────────────┘

Hard Margin vs Soft Margin

Hard Margin SVM

Assumes data is perfectly separable — no overlaps allowed.
Every training point must sit on the correct side of the boundary.

Works well when:
  Classes are clearly separated with no noise.

Problem:
  Real-world data almost always has some overlap or outliers.
  Hard Margin has no solution at all if even one point cannot be
  placed on the correct side.

○ ○ ○    ×   ← This × is on wrong side → Hard Margin fails
       |
       Boundary

Soft Margin SVM (C Parameter)

Allows some points to be on the wrong side of the boundary
(margin violations) but penalizes them.

C = Regularization Parameter:

  High C:
    Strict — penalizes every violation heavily
    Narrow margin → tries hard to classify all training points
    Risk: overfits

  Low C:
    Lenient — allows more violations
    Wide margin → ignores some misclassified points
    Risk: may underfit, but often generalizes better

  ┌─────────────────┬──────────┬────────────────┐
  │ C Value         │ Margin   │ Tolerance      │
  ├─────────────────┼──────────┼────────────────┤
  │ High (e.g. 100) │ Narrow   │ Low (strict)   │
  │ Low (e.g. 0.01) │ Wide     │ High (lenient) │
  └─────────────────┴──────────┴────────────────┘
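
A quick sketch of this trade-off, assuming scikit-learn (not named in the text): two soft-margin SVMs fit the same overlapping clusters, one strict (high C) and one lenient (low C).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clusters — no hard-margin solution exists.
X = np.vstack([rng.normal(0, 1.0, (40, 2)),
               rng.normal(3, 1.0, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

strict  = SVC(kernel="linear", C=100).fit(X, y)   # narrow margin
lenient = SVC(kernel="linear", C=0.01).fit(X, y)  # wide margin

# A wider margin pulls more points into the margin region,
# so the lenient model typically keeps more support vectors.
print(strict.n_support_.sum(), lenient.n_support_.sum())
```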

The Kernel Trick: Handling Non-Linear Data

Many real-world datasets cannot be separated by a straight line. SVM uses kernels to handle this. A kernel function transforms the original data into a higher-dimensional space where a straight boundary becomes possible.

Non-Linear Data in 2D (not separable by a line):

  ○ ○  ×  ○ ○
    × ○○ × × 
  No straight line separates these.

Kernel Trick — Map to 3D:
  Original features: (x, y)
  After kernel: (x, y, x²+y²)

  In 3D, a flat plane CAN separate the classes.
  SVM finds that plane.
  Project back to 2D → curved boundary appears.

Result in 2D after kernel:
  ○ ○   |   × ×      (curved boundary separates them)

Common Kernels

┌──────────────────────┬────────────────────────────────────────────┐
│ Kernel               │ Best Used When                             │
├──────────────────────┼────────────────────────────────────────────┤
│ Linear               │ Data is linearly separable (straight line) │
│ RBF (Radial Basis    │ Most common — handles complex boundaries   │
│ Function / Gaussian) │ Default kernel in most libraries           │
│ Polynomial           │ Curved but structured boundaries           │
│ Sigmoid              │ Similar to neural networks                 │
└──────────────────────┴────────────────────────────────────────────┘
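
In scikit-learn (assumed here), the table's kernels map to SVC's kernel parameter — note the polynomial kernel is spelled "poly" — and RBF is indeed the default:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [4, 0], [5, 1]])
y = np.array([0, 0, 1, 1])

print(SVC().kernel)  # → rbf  (the library default, as the table notes)

# Each kernel from the table is a valid argument:
for k in ("linear", "rbf", "poly", "sigmoid"):
    SVC(kernel=k).fit(X, y)
```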

RBF Kernel has a second parameter: Gamma (γ)

  High Gamma:
    Each training point has a very local influence
    → Tightly fitting boundary → Risk of overfitting

  Low Gamma:
    Each point influences a wide area
    → Smoother boundary → Better generalization
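
A sketch of the gamma effect, assuming scikit-learn: the same noisy data is fit with a very high and a low gamma, and the high-gamma model's tight, local boundary shows up as higher training accuracy (the overfitting risk described above).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
# Roughly linear labels with deliberate label noise.
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 60) > 0).astype(int)

tight  = SVC(kernel="rbf", gamma=100).fit(X, y)   # very local influence
smooth = SVC(kernel="rbf", gamma=0.1).fit(X, y)   # broad influence

# High gamma memorizes the training set, noise included.
print(tight.score(X, y), smooth.score(X, y))
```

High training accuracy here is the warning sign, not the goal — on held-out data the smoother model usually wins.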

SVM for Multi-Class Classification

SVM is naturally binary (two classes only).
For multi-class, two strategies apply:

One-vs-One (OvO):
  Train one classifier for every pair of classes.
  3 classes (A, B, C) → 3 classifiers: A vs B, A vs C, B vs C
  Predict by majority vote across all classifiers.

One-vs-Rest (OvR):
  Train one classifier per class.
  3 classes → 3 classifiers: A vs rest, B vs rest, C vs rest
  Predict the class whose classifier is most confident.
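
Both strategies can be counted directly. This sketch assumes scikit-learn, whose multiclass wrappers expose the underlying binary classifiers:

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_blobs(n_samples=90, centers=3, random_state=0)  # 3 classes

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# 3 classes → 3 pairwise classifiers (OvO) and 3 per-class (OvR),
# matching the counts in the text. (OvO grows as k·(k-1)/2.)
print(len(ovo.estimators_), len(ovr.estimators_))  # → 3 3
```

Note that plain SVC already applies One-vs-One internally when given more than two classes; the explicit wrappers are shown here to make the strategy visible.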

SVM for Regression: SVR

Support Vector Regression (SVR) applies SVM concepts to continuous output prediction. Instead of maximizing the margin between classes, SVR fits a tube around the prediction line: points inside the tube incur no penalty, while points outside are penalized in proportion to how far they deviate.

SVR tube concept:

  Actual values:   22, 25, 28, 32, 35
  SVR prediction line: a smooth curve

  ε (epsilon) = half-width of the tube (deviation tolerated
  on each side of the line)

  Points inside the tube → no penalty
  Points outside the tube → penalized

  ←── ε ──→
  ─────────── Upper tube boundary
  - - - - - - Predicted line
  ___________ Lower tube boundary
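
A sketch of the tube with scikit-learn's SVR (assumed), using the small series from the text: a wide tube leaves more points penalty-free, so fewer points end up as support vectors.

```python
import numpy as np
from sklearn.svm import SVR

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([22, 25, 28, 32, 35])

wide   = SVR(kernel="linear", epsilon=5.0, C=10).fit(X, y)  # wide tube
narrow = SVR(kernel="linear", epsilon=0.1, C=10).fit(X, y)  # narrow tube

# Points strictly inside the tube contribute nothing to the fit,
# so the wide tube retains fewer support vectors.
print(len(wide.support_), len(narrow.support_))
```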

Advantages and Limitations of SVM

Advantages:
  ✓ Effective in high-dimensional spaces (many features)
  ✓ Works well when number of features > number of records
  ✓ Memory efficient — only support vectors are stored
  ✓ Flexible with different kernels for different data shapes
  ✓ Strong performance on text classification and image tasks

Limitations:
  ✗ Very slow to train on large datasets (>100,000 records)
  ✗ Sensitive to feature scaling (must normalize data first)
  ✗ Hard to interpret — no clear "if-then" explanation
  ✗ C and Gamma tuning requires careful effort
  ✗ Does not naturally provide probability estimates

SVM Workflow Diagram

Input Data
    │
    ▼
Scale Features (Normalization / Standardization)
    │
    ▼
Choose Kernel (Linear / RBF / Polynomial)
    │
    ▼
Find Hyperplane that Maximizes Margin
    │
    ▼
Only Support Vectors define the boundary
    │
    ▼
New Data Point:
    Which side of the hyperplane does it land on?
    │
    ├── Side A → Class 1
    └── Side B → Class 2
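
The whole workflow above fits in a few lines as a scikit-learn pipeline sketch (library assumed; dataset is synthetic): scale the features, choose a kernel, fit, then classify a new point by which side of the hyperplane it lands on.

```python
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Input Data (synthetic, two classes)
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

# Scale Features → Choose Kernel → Find max-margin hyperplane
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X, y)

# New Data Point: which side of the hyperplane does it land on?
new_point = [[0.0, 0.0]]
print(model.predict(new_point))
```

Putting the scaler inside the pipeline matters: it guarantees new points are scaled with the same statistics as the training data.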

When to Use SVM

SVM Works Best For:
  ✓ Text classification (spam detection, news categorization)
  ✓ Image recognition (with RBF kernel)
  ✓ Medical diagnosis with clear class boundaries
  ✓ Small to medium datasets (under 50,000 records)
  ✓ High-dimensional data with few irrelevant features

Consider Alternatives When:
  ✗ Dataset has millions of records → Random Forest or XGBoost
  ✗ Classes heavily overlap with unclear boundaries
  ✗ Probability output is required with good calibration
