Machine Learning Data Preprocessing

Raw data collected from the real world is rarely clean or ready for a Machine Learning model. It often contains missing values, inconsistent formats, outliers, and irrelevant columns. Data preprocessing transforms this raw data into a clean, structured form that algorithms can understand and learn from effectively.

Why Preprocessing Matters

Algorithms work with numbers and patterns. When data contains errors, gaps, or wildly inconsistent scales, the model learns the wrong patterns. A model trained on messy data produces unreliable predictions regardless of how powerful the algorithm is.

Raw Data Problems → Model Problems

Missing values      → Algorithm crashes or learns wrong patterns
Wrong data types    → Calculations fail
Huge scale gaps     → One feature dominates all others
Duplicate records   → Model learns same example twice

Step 1: Handling Missing Values

Missing values appear as empty cells, "NA", "NULL", or similar placeholders. The approach for handling them depends on how many values are missing and the importance of that column.

Options for Missing Values

┌─────────────────────────┬────────────────────────────────────────────┐
│ Strategy                │ When to Use                                │
├─────────────────────────┼────────────────────────────────────────────┤
│ Remove the row          │ Very few rows are missing (less than 5%)   │
│ Remove the column       │ More than 60–70% of values are missing     │
│ Fill with Mean          │ Numerical data, no extreme outliers        │
│ Fill with Median        │ Numerical data with outliers               │
│ Fill with Mode          │ Categorical data (most common value)       │
│ Use a predictive model  │ Missing pattern is complex and important   │
└─────────────────────────┴────────────────────────────────────────────┘

Example

Original Data:
┌────────┬────────┬────────┐
│ Name   │ Age    │ Salary │
├────────┼────────┼────────┤
│ Amit   │ 28     │ 50000  │
│ Priya  │ NaN    │ 62000  │
│ Rohan  │ 35     │ NaN    │
│ Sneha  │ 30     │ 48000  │
└────────┴────────┴────────┘

After Filling:
  Priya Age → Mean Age = (28+35+30)/3 = 31 → Fill with 31
  Rohan Salary → Median Salary = 50000 → Fill with 50000
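The fills above can be sketched in pandas. The small DataFrame mirrors the table (names and columns are taken from the example):

```python
import pandas as pd

# Dataset from the example table above
df = pd.DataFrame({
    "Name": ["Amit", "Priya", "Rohan", "Sneha"],
    "Age": [28, None, 35, 30],
    "Salary": [50000, 62000, None, 48000],
})

# Fill the missing Age with the mean of the known ages: (28+35+30)/3 = 31
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Fill the missing Salary with the median of the known salaries: 50000
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

print(df)
```

Note that `.mean()` and `.median()` skip NaN values automatically, so the statistics are computed only from the rows that actually have data.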

Step 2: Handling Duplicate Records

Duplicate rows appear when data is collected from multiple sources or when a form gets submitted twice. These rows teach the model the same example more than once, which skews its learning.

Before Removing Duplicates:
  Row 1: Amit, 28, 50000
  Row 2: Priya, 31, 62000
  Row 3: Amit, 28, 50000  ← Duplicate of Row 1

After Removing Duplicates:
  Row 1: Amit, 28, 50000
  Row 2: Priya, 31, 62000
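In pandas, exact duplicate rows can be dropped in one call. A minimal sketch using the rows above:

```python
import pandas as pd

rows = [
    ("Amit", 28, 50000),
    ("Priya", 31, 62000),
    ("Amit", 28, 50000),   # duplicate of the first row
]
df = pd.DataFrame(rows, columns=["Name", "Age", "Salary"])

# Keep the first occurrence of each row, discard exact repeats
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)   # two rows remain
```

By default `drop_duplicates` compares all columns; pass `subset=["Name"]` (for example) to deduplicate on specific columns only.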

Step 3: Handling Outliers

An outlier is a value that sits far outside the normal range of other values. Outliers can come from genuine rare events or from data entry errors. Both types affect model accuracy in different ways.

Age column values: 22, 25, 28, 30, 27, 150

150 is an outlier. A person cannot be 150 years old.
This is a data entry error and must be removed or corrected.

Salary column values: 30000, 35000, 32000, 500000

500000 might be a CEO's salary — a genuine rare value.
Depending on the model goal, this may stay or be handled separately.

Common Methods to Detect Outliers

Method 1 – IQR (Interquartile Range):
  Q1 = 25th percentile value
  Q3 = 75th percentile value
  IQR = Q3 - Q1

  Lower Bound = Q1 - 1.5 × IQR
  Upper Bound = Q3 + 1.5 × IQR

  Values outside these bounds = Outliers

Method 2 – Z-Score:
  If a value is more than 3 standard deviations from the mean
  → treat it as an outlier
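Both detection methods can be sketched on the Age values from the example. One caveat worth seeing in practice: with only six values, the 150 inflates the mean and standard deviation so much that its own z-score stays below 3 — the IQR method is more robust on small samples:

```python
import pandas as pd

ages = pd.Series([22, 25, 28, 30, 27, 150])

# Method 1 – IQR
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = ages[(ages < lower) | (ages > upper)]
print(iqr_outliers.tolist())   # [150]

# Method 2 – Z-Score (threshold |z| > 3)
z = (ages - ages.mean()) / ages.std()
# Here 150 drags the mean to 47 and the std above 50,
# so even 150 scores only about z ≈ 2 — nothing is flagged
z_outliers = ages[z.abs() > 3]
```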

Step 4: Encoding Categorical Data

Machine Learning algorithms work with numbers, not text labels. Categorical columns like "Color" (Red, Blue, Green) or "City" (Mumbai, Delhi) must be converted into numbers.

Label Encoding

Assign a number to each category. Best for ordinal data where order matters.

Size column: Small, Medium, Large

After Label Encoding:
  Small  → 0
  Medium → 1
  Large  → 2

This works because Small < Medium < Large (order is meaningful).
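Because the order is known in advance, a plain dictionary mapping is the simplest way to label-encode an ordinal column (the sample values here are illustrative):

```python
# Explicit mapping preserves the meaningful order Small < Medium < Large
size_order = {"Small": 0, "Medium": 1, "Large": 2}

sizes = ["Medium", "Small", "Large", "Medium"]
encoded = [size_order[s] for s in sizes]
print(encoded)   # [1, 0, 2, 1]
```

An explicit mapping is safer than an automatic encoder here, since automatic encoders typically assign numbers alphabetically, which would break the intended order.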

One-Hot Encoding

Create a new binary column for each category. Best for nominal data where order does not matter.

City column: Mumbai, Delhi, Chennai

After One-Hot Encoding:
┌─────────┬──────────┬───────┬─────────┐
│ City    │ Mumbai   │ Delhi │ Chennai │
├─────────┼──────────┼───────┼─────────┤
│ Mumbai  │ 1        │ 0     │ 0       │
│ Delhi   │ 0        │ 1     │ 0       │
│ Chennai │ 0        │ 0     │ 1       │
└─────────┴──────────┴───────┴─────────┘
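pandas provides one-hot encoding directly via `get_dummies`. A sketch using the City column from the table (note that the generated columns come out in alphabetical order, not the table's order):

```python
import pandas as pd

df = pd.DataFrame({"City": ["Mumbai", "Delhi", "Chennai"]})

# One binary column per category: City_Chennai, City_Delhi, City_Mumbai
onehot = pd.get_dummies(df, columns=["City"], dtype=int)
print(onehot)
```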

Step 5: Feature Scaling

Features with large numerical ranges dominate features with small ranges. For example, a "Salary" column (range: 30,000–500,000) overpowers an "Age" column (range: 18–65) because the numbers are so much larger. Feature scaling brings all columns to a similar range.

Normalization (Min-Max Scaling)

Scales all values to a range between 0 and 1.

Formula:
  Scaled Value = (Value - Min) / (Max - Min)

Example (Age column: Min=18, Max=65):
  Age 28 → (28 - 18) / (65 - 18) = 10 / 47 = 0.21
  Age 45 → (45 - 18) / (65 - 18) = 27 / 47 = 0.57

Result: All ages now between 0 and 1
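The formula is simple enough to apply directly. A sketch over a small Age list (18 and 65 included so the min and max match the example):

```python
ages = [18, 28, 45, 65]
lo, hi = min(ages), max(ages)

# (value - min) / (max - min) maps every age into [0, 1]
scaled = [(a - lo) / (hi - lo) for a in ages]
print([round(s, 2) for s in scaled])   # [0.0, 0.21, 0.57, 1.0]
```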

Standardization (Z-Score Scaling)

Scales values so the column has a mean of 0 and a standard deviation of 1. Unlike min-max scaling, it does not squeeze all values into a fixed range, so outliers distort the result less.

Formula:
  Scaled Value = (Value - Mean) / Standard Deviation

Example (Salary: Mean=50000, SD=10000):
  Salary 60000 → (60000 - 50000) / 10000 = 1.0
  Salary 40000 → (40000 - 50000) / 10000 = -1.0
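As a quick sketch, using the assumed statistics from the example (mean 50000, SD 10000):

```python
# Assumed column statistics from the example above
mu, sigma = 50000, 10000

def standardize(value, mu, sigma):
    """Z-score scaling: how many standard deviations from the mean."""
    return (value - mu) / sigma

print(standardize(60000, mu, sigma))   # 1.0
print(standardize(40000, mu, sigma))   # -1.0
```

In practice `mu` and `sigma` are computed from the training data and then reused unchanged when scaling test data, so the model never sees statistics from data it will be evaluated on.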

Step 6: Removing Irrelevant Features

Not every column helps the model. Columns like "Customer ID" or "Timestamp" carry no predictive value and add noise. Removing them reduces computation time and often improves accuracy.

Dataset Before Cleaning:
┌──────────────┬──────────┬─────────┬───────────────┬────────┐
│ Customer ID  │ Age      │ Salary  │ Join Timestamp│ Bought │
├──────────────┼──────────┼─────────┼───────────────┼────────┤
│ CUS-00123    │ 34       │ 55000   │ 2023-01-15    │ Yes    │
└──────────────┴──────────┴─────────┴───────────────┴────────┘

Remove: Customer ID (unique identifier, no pattern)
Remove: Join Timestamp (not related to purchase behavior here)

Dataset After Cleaning:
┌──────────┬─────────┬────────┐
│ Age      │ Salary  │ Bought │
├──────────┼─────────┼────────┤
│ 34       │ 55000   │ Yes    │
└──────────┴─────────┴────────┘
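Dropping the two columns is a one-liner in pandas (the single-row DataFrame mirrors the table above):

```python
import pandas as pd

df = pd.DataFrame({
    "Customer ID": ["CUS-00123"],
    "Age": [34],
    "Salary": [55000],
    "Join Timestamp": ["2023-01-15"],
    "Bought": ["Yes"],
})

# Unique identifiers and unrelated timestamps carry no predictive pattern here
cleaned = df.drop(columns=["Customer ID", "Join Timestamp"])
print(list(cleaned.columns))   # ['Age', 'Salary', 'Bought']
```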

Complete Preprocessing Pipeline

Raw Data
   │
   ▼
Step 1: Handle Missing Values (fill or remove)
   │
   ▼
Step 2: Remove Duplicate Rows
   │
   ▼
Step 3: Handle Outliers (cap or remove)
   │
   ▼
Step 4: Encode Categorical Columns (Label / One-Hot)
   │
   ▼
Step 5: Scale Numerical Features (Normalize or Standardize)
   │
   ▼
Step 6: Drop Irrelevant Columns
   │
   ▼
Clean, Model-Ready Dataset ✓
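The whole pipeline can be sketched as one function. This is a minimal illustration, not a production pipeline: it assumes the columns to drop are text columns, fills numeric gaps with the median, caps outliers at the IQR bounds, and one-hot encodes whatever text columns remain. The irrelevant columns are dropped slightly early so the encoder does not one-hot a unique ID column:

```python
import pandas as pd

def preprocess(df, drop_cols=()):
    """Sketch of the six-step pipeline above (drop_cols assumed to be text columns)."""
    df = df.copy()
    num_cols = df.select_dtypes(include="number").columns

    # Step 1: fill numeric gaps with the median (robust to outliers)
    for col in num_cols:
        df[col] = df[col].fillna(df[col].median())

    # Step 2: remove exact duplicate rows
    df = df.drop_duplicates()

    # Step 3: cap numeric outliers at the IQR bounds
    for col in num_cols:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Step 6 (done early so Step 4 skips ID-like text columns)
    df = df.drop(columns=list(drop_cols))

    # Step 4: one-hot encode the remaining text columns
    df = pd.get_dummies(df, dtype=int)

    # Step 5: min-max scale the original numeric columns
    for col in num_cols:
        rng = df[col].max() - df[col].min()
        if rng:  # skip constant columns to avoid dividing by zero
            df[col] = (df[col] - df[col].min()) / rng
    return df

# Illustrative raw data (names and values are made up for the demo)
raw = pd.DataFrame({
    "Customer ID": ["c1", "c2", "c3", "c4"],
    "Age": [28, None, 35, 30],
    "City": ["Mumbai", "Delhi", "Delhi", "Chennai"],
})
ready = preprocess(raw, drop_cols=("Customer ID",))
print(ready)
```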

Quick Reference: When to Use Which Technique

┌────────────────────────────┬───────────────────────────────────────┐
│ Situation                  │ Best Approach                         │
├────────────────────────────┼───────────────────────────────────────┤
│ Few missing values         │ Remove those rows                     │
│ Moderate missing values    │ Fill with mean / median / mode        │
│ Text categories, no order  │ One-Hot Encoding                      │
│ Text categories, with order│ Label Encoding                        │
│ Algorithm uses distance    │ Normalize or Standardize              │
│   (KNN, SVM)               │                                       │
│ Tree-based algorithms      │ Scaling is optional                   │
│ Extreme outliers present   │ Standardization or IQR capping        │
└────────────────────────────┴───────────────────────────────────────┘
