ML Support Vector Machine
Support Vector Machine (SVM) is a classification algorithm that finds the best possible boundary line (or plane) to separate two classes of data. It focuses specifically on the data points closest to the boundary — these critical points are called support vectors — and maximizes the distance (margin) between the boundary and the nearest points of each class.
The Core Idea: Maximum Margin
Many boundary lines can separate two classes. SVM does not just find any separating line — it finds the one that sits as far as possible from both classes. This widest possible separation makes the model more confident and better at handling new data.
Data: Two Classes — Circles (○) and Crosses (×)
Poor Boundary (close to one class):
○ ○ ○ | × × ×
↑ line sits too close to circles — unsafe
SVM Boundary (maximum margin):
○ ○ ○ | × × ×
↑ centered between both classes
← margin →
Support Vectors: The closest ○ and × to the boundary line.
These are the only points that define where the line sits.
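As a minimal sketch of this idea, the example below uses scikit-learn (an assumed library choice; the article itself is library-agnostic) to fit a linear SVM on a tiny synthetic two-class dataset and inspect which points become support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: circles (class 0) on the left, crosses (class 1) on the right
X = np.array([[1.0, 2.0], [2.0, 1.5], [2.5, 2.5],
              [6.0, 2.0], [7.0, 1.5], [6.5, 2.8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Only the points nearest the boundary define it; the rest could be
# removed without changing the fitted hyperplane.
print(clf.support_vectors_)        # coordinates of the support vectors
print(len(clf.support_vectors_))   # typically far fewer than len(X)
```

Removing any non-support point and refitting would produce the same boundary, which is why SVM is described as memory efficient.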
Key Terminology
┌─────────────────────┬─────────────────────────────────────────────┐
│ Term                │ Meaning                                     │
├─────────────────────┼─────────────────────────────────────────────┤
│ Hyperplane          │ The decision boundary (a line in 2D,        │
│                     │ a plane in 3D, a hyperplane in higher dims) │
│ Support Vectors     │ Training points closest to the hyperplane   │
│                     │ — only these points shape the boundary      │
│ Margin              │ Total distance between the two class        │
│                     │ boundaries on either side of the hyperplane │
│ Maximum Margin      │ SVM's goal — maximize this distance         │
└─────────────────────┴─────────────────────────────────────────────┘
Hard Margin vs Soft Margin
Hard Margin SVM
Assumes data is perfectly separable — no overlaps allowed.
Every training point must sit on the correct side of the boundary.
Works well when:
Classes are clearly separated with no noise.
Problem:
Real-world data almost always contains some overlap or outliers.
Hard Margin fails completely with even one misplaced point.
○ ○ ○ × ← This × is on wrong side → Hard Margin fails
|
Boundary
Soft Margin SVM (C Parameter)
Allows some points to be on the wrong side of the boundary
(margin violations) but penalizes them.
C = Regularization Parameter:
High C:
Strict — penalizes every violation heavily
Narrow margin → tries hard to classify all training points
Risk: overfits
Low C:
Lenient — allows more violations
Wide margin → ignores some misclassified points
Risk: may underfit, but often generalizes better
┌──────────────────┬──────────┬────────────────┐
│ C Value          │ Margin   │ Tolerance      │
├──────────────────┼──────────┼────────────────┤
│ High (e.g. 100)  │ Narrow   │ Low (strict)   │
│ Low (e.g. 0.01)  │ Wide     │ High (lenient) │
└──────────────────┴──────────┴────────────────┘
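A quick way to see this trade-off is to count support vectors at both ends of the C range. The sketch below assumes scikit-learn and a synthetic noisy dataset; a low C tolerates more margin violations, which widens the margin and pulls in more support vectors:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two overlapping blobs: not perfectly separable, so C matters
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Lenient (low) C -> wide margin -> more points sit on or inside it
    print(f"C={C}: {len(clf.support_vectors_)} support vectors")
```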
The Kernel Trick: Handling Non-Linear Data
Many real-world datasets cannot be separated by a straight line. SVM uses kernels to handle this. A kernel function transforms the original data into a higher-dimensional space where a straight boundary becomes possible.
Non-Linear Data in 2D (not separable by a line):
○ ○ × ○ ○
× ○○ × ×
No straight line separates these.
Kernel Trick — Map to 3D:
Original features: (x, y)
After kernel: (x, y, x²+y²)
In 3D, a flat plane CAN separate the classes.
SVM finds that plane.
Project back to 2D → curved boundary appears.
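The (x, y, x²+y²) mapping described above can be tried by hand. This sketch (assuming scikit-learn; the ring-shaped dataset is synthetic) shows that a linear SVM struggles in 2D but separates the same data almost perfectly once the third feature is added:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# A ring of one class around a cluster of the other: no straight line works
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Lift (x, y) -> (x, y, x^2 + y^2): distance from the origin becomes a feature
X3 = np.column_stack([X, (X ** 2).sum(axis=1)])

acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)    # struggles in 2D
acc_3d = SVC(kernel="linear").fit(X3, y).score(X3, y)  # flat plane works in 3D
print(acc_2d, acc_3d)
```

In practice a kernel (e.g. RBF) computes the effect of such mappings implicitly, without ever materializing the extra feature.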
Result in 2D after kernel:
○ ○ | × × (curved boundary separates them)
Common Kernels
┌──────────────────────┬────────────────────────────────────────────┐
│ Kernel │ Best Used When │
├──────────────────────┼────────────────────────────────────────────┤
│ Linear │ Data is linearly separable (straight line) │
│ RBF (Radial Basis │ Most common — handles complex boundaries │
│ Function / Gaussian) │ Default kernel in most libraries │
│ Polynomial │ Curved but structured boundaries │
│ Sigmoid │ Similar to neural networks │
└──────────────────────┴────────────────────────────────────────────┘
The RBF kernel has a second parameter: Gamma (γ)
High Gamma:
Each training point has a very local influence
→ Tightly fitting boundary → Risk of overfitting
Low Gamma:
Each point influences a wide area
→ Smoother boundary → Better generalization
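The overfitting risk of a high gamma shows up as a gap between training and test accuracy. A sketch, assuming scikit-learn and a synthetic noisy dataset:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    # High gamma memorizes training points; watch the train/test gap grow
    print(f"gamma={gamma}: train={clf.score(X_tr, y_tr):.2f} "
          f"test={clf.score(X_te, y_te):.2f}")
```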
SVM for Multi-Class Classification
SVM is naturally binary (two classes only). For multi-class, two strategies apply:

One-vs-One (OvO):
Train one classifier for every pair of classes.
3 classes (A, B, C) → 3 classifiers: A vs B, A vs C, B vs C
Predict by majority vote across all classifiers.

One-vs-Rest (OvR):
Train one classifier per class.
3 classes → 3 classifiers: A vs rest, B vs rest, C vs rest
Predict the class whose classifier is most confident.
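In scikit-learn (an assumed library choice), SVC applies the one-vs-one strategy internally when given more than two classes, and OneVsRestClassifier wraps any binary classifier for one-vs-rest. A sketch on the three-class Iris dataset:

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)   # 3 classes

ovo = SVC(kernel="linear").fit(X, y)              # one-vs-one internally
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# OvR trains one classifier per class: 3 classes -> 3 fitted estimators
print(len(ovr.estimators_))
print(ovo.predict(X[:3]), ovr.predict(X[:3]))
```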
SVM for Regression: SVR
Support Vector Regression (SVR) applies SVM concepts to continuous output prediction. Instead of maximizing the margin between classes, SVR finds a tube around the prediction line and minimizes how many training points fall outside that tube.
SVR tube concept:

Actual values: 22, 25, 28, 32, 35
SVR prediction line: a smooth curve
ε (epsilon) = width of the tube around the line
Points inside the tube → no penalty
Points outside the tube → penalized

            ←── ε ──→
─────────── Upper tube boundary
- - - - - - Predicted line
___________ Lower tube boundary
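A minimal sketch of the tube idea, assuming scikit-learn and reusing the illustrative values above: with epsilon = 1.0, training points within one unit of the fitted line cost nothing, so the model focuses on the overall trend rather than every individual value.

```python
import numpy as np
from sklearn.svm import SVR

X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([22, 25, 28, 32, 35], dtype=float)

# epsilon sets the half-width of the penalty-free tube around the line
svr = SVR(kernel="linear", C=10.0, epsilon=1.0).fit(X, y)
pred = svr.predict([[3.5]])
print(pred)  # an interpolated value between the neighboring targets
```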
Advantages and Limitations of SVM
Advantages:
✓ Effective in high-dimensional spaces (many features)
✓ Works well when the number of features exceeds the number of records
✓ Memory efficient — only support vectors are stored
✓ Flexible — different kernels suit different data shapes
✓ Strong performance on text classification and image tasks

Limitations:
✗ Very slow to train on large datasets (>100,000 records)
✗ Sensitive to feature scaling — data must be normalized first
✗ Hard to interpret — no clear "if-then" explanation
✗ C and Gamma tuning requires careful effort
✗ Does not naturally provide probability estimates
SVM Workflow Diagram
Input Data
│
▼
Scale Features (Normalization / Standardization)
│
▼
Choose Kernel (Linear / RBF / Polynomial)
│
▼
Find Hyperplane that Maximizes Margin
│
▼
Only Support Vectors define the boundary
│
▼
New Data Point:
Which side of the hyperplane does it land on?
│
├── Side A → Class 1
└── Side B → Class 2
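The workflow above can be expressed end to end as a scikit-learn pipeline (an assumed library choice; the dataset below is a stand-in example): scale the features first, choose a kernel, fit, then classify new points by which side of the hyperplane they land on.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scaling inside the pipeline ensures test data is transformed with
# statistics learned from the training set only
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_tr, y_tr)

# predict() reports the side of the hyperplane for each new point
print(f"test accuracy: {model.score(X_te, y_te):.3f}")
```

Bundling the scaler with the classifier also guards against the scaling-sensitivity limitation noted earlier: the model can never be applied to unscaled inputs by accident.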
When to Use SVM
SVM Works Best For:
✓ Text classification (spam detection, news categorization)
✓ Image recognition (with the RBF kernel)
✓ Medical diagnosis with clear class boundaries
✓ Small to medium datasets (under 50,000 records)
✓ High-dimensional data with few irrelevant features

Consider Alternatives When:
✗ The dataset has millions of records → Random Forest or XGBoost
✗ Classes heavily overlap with unclear boundaries
✗ Well-calibrated probability output is required
