Machine Learning: K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is one of the simplest Machine Learning algorithms. It makes predictions by looking at the K closest training records to a new data point and taking a majority vote (for classification) or an average (for regression). There is no separate training phase — the algorithm stores all training data and computes at prediction time.
The Core Idea: Neighbors Vote
Analogy: Moving to a new neighborhood. To know whether a street is safe, ask the 5 nearest neighbors. If 4 say "safe" and 1 says "unsafe" → it is probably safe. KNN does exactly this with data points.
How KNN Works Step by Step
Dataset: Customer churn prediction
Each customer has: (Age, Monthly Spend, Churned: Yes/No)
New Customer: Age=35, Monthly Spend=₹2000
Goal: Will this customer churn?

Step 1: Calculate the distance from the new point to every training record
Step 2: Sort records by distance (closest first)
Step 3: Pick the top K records (say K=5)
Step 4: Count class votes among those K neighbors
Step 5: The majority class is the prediction
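The five steps above can be sketched directly in Python. The customer records below are illustrative values, not a real dataset:

```python
import math
from collections import Counter

def knn_classify(new_point, training_data, k=5):
    """Steps 1-5: distance, sort, take top K, vote, majority wins."""
    # Step 1: distance from the new point to every training record
    scored = [(math.dist(new_point, features), label)
              for features, label in training_data]
    # Step 2: sort by distance (closest first)
    scored.sort(key=lambda pair: pair[0])
    # Step 3: pick the top K records
    top_k = scored[:k]
    # Steps 4-5: count class votes and return the majority class
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]

# (Age, Monthly Spend) -> Churned?  (made-up records for illustration)
customers = [
    ((33, 1950), "No"), ((36, 2100), "No"), ((34, 1800), "Yes"),
    ((37, 2300), "No"), ((32, 2200), "No"), ((55, 4800), "Yes"),
    ((60, 5200), "Yes"),
]
print(knn_classify((35, 2000), customers, k=5))  # prints: No
```

Note there is no fit step: all the work happens inside the prediction call, exactly as the text describes.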
Distance Calculation
Euclidean Distance (most common):
Distance = √((X2-X1)² + (Y2-Y1)²)
Example:
New Customer: Age=35, Spend=2000
Training Point A: Age=32, Spend=1900
Training Point B: Age=50, Spend=4500
Distance to A = √((35-32)² + (2000-1900)²)
= √(9 + 10000) = √10009 ≈ 100.04
Distance to B = √((35-50)² + (2000-4500)²)
= √(225 + 6250000) = √6250225 ≈ 2500.04
Point A is much closer → more likely to vote.
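The arithmetic above can be checked with Python's standard-library `math.dist`, which computes exactly this Euclidean distance:

```python
import math

new = (35, 2000)   # the new customer from the example
a = (32, 1900)     # training point A
b = (50, 4500)     # training point B

dist_a = math.dist(new, a)  # sqrt((35-32)^2 + (2000-1900)^2)
dist_b = math.dist(new, b)  # sqrt((35-50)^2 + (2000-4500)^2)
print(round(dist_a, 2))  # 100.04
print(round(dist_b, 2))  # 2500.04
```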
Complete Prediction with K=5
5 Nearest Neighbors of the New Customer (Age=35, Spend=2000):

┌──────────────┬─────┬───────┬──────────┬──────────┐
│ Neighbor     │ Age │ Spend │ Distance │ Churned? │
├──────────────┼─────┼───────┼──────────┼──────────┤
│ Customer 12  │ 33  │ 1950  │ 50.04    │ No       │
│ Customer 7   │ 36  │ 2100  │ 100.00   │ No       │
│ Customer 45  │ 34  │ 1800  │ 200.00   │ Yes      │
│ Customer 18  │ 32  │ 2200  │ 200.02   │ No       │
│ Customer 31  │ 37  │ 2300  │ 300.01   │ No       │
└──────────────┴─────┴───────┴──────────┴──────────┘

Votes: No=4, Yes=1
Prediction: No (will NOT churn) ✓
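Steps 4 and 5 for this table reduce to counting labels, which `collections.Counter` does directly:

```python
from collections import Counter

# Churned? labels of the 5 nearest neighbors from the table above
neighbor_labels = ["No", "No", "Yes", "No", "No"]

votes = Counter(neighbor_labels)
print(votes)                       # Counter({'No': 4, 'Yes': 1})
print(votes.most_common(1)[0][0])  # No -> will NOT churn
```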
Choosing K
┌────────┬──────────────────────────────────────────────────────┐
│ K      │ Effect                                               │
├────────┼──────────────────────────────────────────────────────┤
│ K = 1  │ Very sensitive — single nearest neighbor decides;    │
│        │ overfits — memorizes training data                   │
│ K = 3  │ Slightly more stable                                 │
│ K = 5  │ Common default — good balance                        │
│ K large│ Very smooth boundary — may underfit                  │
└────────┴──────────────────────────────────────────────────────┘

Rule of Thumb: K ≈ √(number of training records)
For 100 records: K ≈ 10
For 10,000 records: K ≈ 100
Always use odd K for binary classification to avoid ties.
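The rule of thumb, with the odd-K adjustment for binary classification, is a one-line heuristic (a starting point only; in practice K should be validated on held-out data):

```python
import math

def suggest_k(n_records):
    """Rule of thumb: K ~ sqrt(n), bumped to the next odd number
    so a binary vote can never tie. A heuristic, not a guarantee."""
    k = round(math.sqrt(n_records))
    return k if k % 2 == 1 else k + 1

print(suggest_k(100))    # 11 (sqrt gives 10, made odd)
print(suggest_k(10000))  # 101
```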
Importance of Feature Scaling in KNN
Without Scaling:
Age range: 18 – 65 (difference of 47)
Salary range: 20,000 – 5,00,000 (difference of 4,80,000)
Distance calculation heavily dominated by Salary.
Age effectively has no influence at all.

With Scaling (normalize both to 0–1):
Age and Salary both contribute equally to distance.
More accurate neighbors.

ALWAYS scale features before applying KNN.
KNN for Regression
Instead of a majority vote, take the average of the K neighbors' values.

New house: Size=1500 sqft
5 Nearest Neighbors:
House A: 1480 sqft → Price ₹2,40,000
House B: 1510 sqft → Price ₹2,55,000
House C: 1490 sqft → Price ₹2,48,000
House D: 1520 sqft → Price ₹2,60,000
House E: 1470 sqft → Price ₹2,38,000

Predicted Price = Average = ₹2,48,200
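The regression variant swaps the vote for a mean, as this check of the house-price example shows:

```python
# Prices (in ₹) of the 5 nearest houses from the example above
neighbor_prices = [240_000, 255_000, 248_000, 260_000, 238_000]

# KNN regression: the prediction is the mean of the neighbors' values
predicted = sum(neighbor_prices) / len(neighbor_prices)
print(predicted)  # 248200.0
```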
Advantages and Limitations of KNN
Advantages:
✓ No training phase — easy to add new data
✓ Simple and intuitive
✓ Naturally handles multi-class problems
✓ Good for non-linear boundaries

Limitations:
✗ Very slow at prediction time for large datasets
✗ Memory-heavy — stores all training data
✗ Sensitive to irrelevant features
✗ Must scale features before use
✗ Struggles with high-dimensional data (curse of dimensionality)
