Data Cleaning and Preprocessing

Raw data is almost never ready for analysis. It contains missing values, duplicates, wrong formats, outliers, and inconsistencies. Data cleaning fixes these issues before modelling. Studies consistently show that data scientists spend 60–80% of their time cleaning data — making this one of the most critical skills in the field.

What Is Dirty Data

Dirty data refers to any dataset that contains errors, inconsistencies, or gaps that make it unreliable for analysis. Using dirty data produces incorrect insights — a concept known as "garbage in, garbage out."
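To see "garbage in, garbage out" in action, compare a summary statistic computed with and without a single entry error. This small sketch reuses the impossible Age = 200 value from the example dataset below:

```python
import pandas as pd

# One bad entry (Age = 200) is enough to distort the whole statistic
ages_dirty = pd.Series([25, 30, 25, 200, 28, 30, 22])
ages_clean = ages_dirty[ages_dirty <= 100]

print("Mean with the error:   ", round(ages_dirty.mean(), 1))  # 51.4
print("Mean without the error:", round(ages_clean.mean(), 1))  # 26.7
```

One row nearly doubles the average age — exactly the kind of silent distortion cleaning is meant to prevent.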

Types of Data Problems

+--------------------+------------------------------------------+
| Problem Type       | Example                                  |
+--------------------+------------------------------------------+
| Missing Values     | Age column has blank cells               |
| Duplicate Rows     | Same order recorded twice                |
| Wrong Data Types   | Date stored as text "25-Jan-2024"        |
| Inconsistent Text  | "Delhi", "delhi", "DELHI" as one city    |
| Outliers           | Salary of ₹0 or ₹99,999,999             |
| Wrong Range        | Age = 200, Percentage = 150%             |
| Extra Whitespace   | " Alice " instead of "Alice"             |
+--------------------+------------------------------------------+
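Most of the problems in this table can be detected with one-line Pandas checks. A minimal scan over a small sample (values chosen to mirror the table) might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "  Dave  "],
    "Age":  [25, 30, 25, 200],
    "City": ["Delhi", "delhi", "Delhi", "DELHI"],
})

print("Missing values:   ", df.isnull().sum().sum())      # blank cells / NaN
print("Duplicate rows:   ", df.duplicated().sum())        # exact repeats
print("Out-of-range ages:", (df["Age"] > 100).sum())      # impossible values
# Three raw spellings of "Delhi" collapse to one after normalisation
print("Cities:", df["City"].nunique(), "raw spellings,",
      df["City"].str.strip().str.lower().nunique(), "actual city")
```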

Step 1 – Load and Inspect the Data

The cleaning process always starts with a full inspection of the raw dataset to understand what problems exist.

import pandas as pd
import numpy as np

# Simulated messy dataset
data = {
    "Name":   ["Alice", "Bob", "Alice", "  Dave  ", "Eve", "Bob", None],
    "Age":    [25, 30, 25, 200, 28, 30, 22],
    "Salary": [50000, 60000, 50000, 75000, None, 60000, 45000],
    "City":   ["Delhi", "MUMBAI", "Delhi", "Mumbai", "Pune", "mumbai", "Delhi"],
    "Joined": ["2021-03-01", "2020-07-15", "2021-03-01",
               "15-Jan-2019", "2022-11-30", "2020-07-15", "2023-05-10"]
}

df = pd.DataFrame(data)
print(df)
print("\nShape:", df.shape)
print("\nMissing values:\n", df.isnull().sum())
print("\nData types:\n", df.dtypes)

Output – Missing values:

Name      1
Age       0
Salary    1
City      0
Joined    0
dtype: int64

Step 2 – Handle Missing Values

Missing values appear as NaN (Not a Number) in Pandas. There are three main strategies for handling them: drop the row, fill with a calculated value, or fill with a fixed value.

Diagram – Missing Value Strategies

Row has missing value
         |
+--------v----------+
|  Is the column    |
|  critical?        |
+--------+----------+
         |
   Yes   |   No
   |     |     |
   v     |     v
Drop the |   Fill with:
  row    |   - Mean (numeric, no outliers)
         |   - Median (numeric, with outliers)
         |   - Mode (categorical)
         |   - Forward fill (time series)
         |   - Fixed value ("Unknown", 0)

# Strategy 1: Drop rows where ALL values are missing
df.dropna(how="all", inplace=True)

# Strategy 2: Fill numeric missing values with the median
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

# Strategy 3: Fill categorical missing values with the mode
df["Name"] = df["Name"].fillna(df["Name"].mode()[0])

# Strategy 4: Forward fill (good for time series) – shown for illustration;
# Salary was already filled in Strategy 2. Note: fillna(method="ffill")
# is deprecated in Pandas 2.x in favour of .ffill()
df["Salary"] = df["Salary"].ffill()

print("Missing after fix:\n", df.isnull().sum())

Step 3 – Remove Duplicate Rows

Duplicate rows inflate counts, distort averages, and produce wrong model results. Removing them is a mandatory step.

print("Rows before:", len(df))
print("Duplicates found:", df.duplicated().sum())

# View the duplicate rows
print(df[df.duplicated()])

# Remove duplicates – keep the first occurrence
df.drop_duplicates(inplace=True)
print("Rows after:", len(df))

# Check duplicates based on specific columns only
df.drop_duplicates(subset=["Name", "Salary"], keep="first", inplace=True)

Step 4 – Fix Incorrect Data Types

Data loaded from CSV files often stores numbers as text and dates as plain strings. Fixing data types enables proper sorting, calculation, and filtering.

# Check types
print(df.dtypes)

# Convert text to number
df["Age"]    = pd.to_numeric(df["Age"], errors="coerce")
df["Salary"] = pd.to_numeric(df["Salary"], errors="coerce")

# Convert text to date – format="mixed" (Pandas 2.0+) handles the two
# different date formats present in this column
df["Joined"] = pd.to_datetime(df["Joined"], format="mixed",
                              dayfirst=True, errors="coerce")

# Convert text to category (saves memory for repeated values)
df["City"] = df["City"].astype("category")

print("\nFixed types:\n", df.dtypes)

Step 5 – Standardise Text Data

Inconsistent text values (different cases, extra spaces) create silent errors in groupby operations and filters. "Delhi", "delhi", and "DELHI" look different to Python but represent the same city.

# Convert city names to title case and remove extra whitespace
df["City"] = df["City"].str.strip().str.title()
df["Name"] = df["Name"].str.strip().str.title()

# Replace specific wrong values
df["City"] = df["City"].replace({
    "Bombay": "Mumbai",
    "Calcutta": "Kolkata",
    "Madras": "Chennai"
})

print(df["City"].value_counts())

Output – Before vs After:

Before:
"Delhi", "MUMBAI", "mumbai", "Mumbai", "  Dave  "

After:
"Delhi", "Mumbai", "Mumbai", "Mumbai", "Dave"

Step 6 – Detect and Handle Outliers

An outlier is a value that sits far outside the normal range of a dataset. Some outliers are genuine (a billionaire in a salary dataset), while others are data entry errors (age = 200).

Method 1 – IQR (Interquartile Range)

Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Valid Salary range: ₹{lower_bound:.0f} – ₹{upper_bound:.0f}")

# Flag outliers
outliers = df[(df["Salary"] < lower_bound) | (df["Salary"] > upper_bound)]
print("Outlier rows:\n", outliers)

# Remove outliers
df_clean = df[(df["Salary"] >= lower_bound) & (df["Salary"] <= upper_bound)]

Diagram – IQR Outlier Detection

Data spread:
|---|---|---|---|---|---|---|---|---|
     Q1       Median      Q3
          <-- IQR -->

Lower Fence = Q1 - 1.5 × IQR
Upper Fence = Q3 + 1.5 × IQR

Values outside fences → Outliers

[===|============================|===]
 ↑                               ↑
Lower                          Upper
Fence                          Fence
(Flag / Remove)               (Flag / Remove)

Method 2 – Z-Score

from scipy import stats

z_scores = stats.zscore(df["Salary"].dropna())

# Values with |z| > 3 are outliers (more than 3 std from mean)
outlier_mask = (abs(z_scores) > 3)
print("Z-score outliers:", outlier_mask.sum())
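The same check works without SciPy; a plain-Pandas sketch (using ddof=0 so the result matches the population standard deviation that scipy.stats.zscore uses by default, whereas Series.std() alone would use the sample one):

```python
import pandas as pd

salary = pd.Series([50000, 60000, 50000, 75000, 60000, 45000])

# z = (value - mean) / population standard deviation
z = (salary - salary.mean()) / salary.std(ddof=0)

# No value in this small sample sits more than 3 std from the mean
print("Z-score outliers:", (z.abs() > 3).sum())
```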

Step 7 – Handle Impossible Values

# Age cannot be above 100 or below 0
df.loc[df["Age"] > 100, "Age"] = np.nan
df.loc[df["Age"] < 0, "Age"]   = np.nan

# Fill the now-missing ages with the median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Salary cannot be negative
df.loc[df["Salary"] < 0, "Salary"] = df["Salary"].median()

Step 8 – Feature Engineering from Dates

Date columns contain rich information that standard models cannot use directly. Extracting individual components unlocks that information.

# Extract useful features from the Joined date
df["JoinYear"]  = df["Joined"].dt.year
df["JoinMonth"] = df["Joined"].dt.month
df["JoinDay"]   = df["Joined"].dt.dayofweek  # 0=Monday, 6=Sunday
df["Tenure"]    = (pd.Timestamp.today() - df["Joined"]).dt.days // 365

print(df[["Name", "Joined", "JoinYear", "JoinMonth", "Tenure"]])

Step 9 – Encode Categorical Variables

Machine learning models work with numbers, not text. Categorical columns like "City" or "Grade" must convert to numeric format before modelling.

Label Encoding – for Ordinal Categories

# Grade has a natural order: A > B > C > D > F
# (illustrative – assumes a "Grade" column, which the demo dataset lacks)
grade_map = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
df["GradeEncoded"] = df["Grade"].map(grade_map)

One-Hot Encoding – for Nominal Categories

# City has no natural order – use one-hot encoding
df_encoded = pd.get_dummies(df, columns=["City"], drop_first=True)
print(df_encoded.columns.tolist())

Diagram – One-Hot Encoding

Original:                   After One-Hot Encoding:
City                        City_Mumbai  City_Pune
-------                     -----------  ---------
Delhi     →                     0            0
Mumbai    →                     1            0
Pune      →                     0            1
Delhi     →                     0            0

(Delhi becomes the reference: 0, 0 for both columns)
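The diagram can be reproduced directly with `pd.get_dummies` (passing `dtype=int` gives 0/1 columns instead of the boolean default in recent Pandas versions):

```python
import pandas as pd

cities = pd.DataFrame({"City": ["Delhi", "Mumbai", "Pune", "Delhi"]})

# drop_first=True drops the alphabetically first category (Delhi),
# which becomes the reference level encoded as all zeros
encoded = pd.get_dummies(cities, columns=["City"], drop_first=True, dtype=int)
print(encoded)
#    City_Mumbai  City_Pune
# 0            0          0
# 1            1          0
# 2            0          1
# 3            0          0
```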

Step 10 – Feature Scaling

Many machine learning algorithms perform poorly when features have very different scales. A salary in thousands and an age in tens need rescaling before feeding into models like KNN or SVM.

Min-Max Scaling (Normalisation)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[["Age_scaled", "Salary_scaled"]] = scaler.fit_transform(
    df[["Age", "Salary"]]
)

# All values now between 0 and 1
print(df[["Age", "Age_scaled", "Salary", "Salary_scaled"]].head())

Standard Scaling (Standardisation)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[["Age_std", "Salary_std"]] = scaler.fit_transform(
    df[["Age", "Salary"]]
)
# Mean becomes 0, standard deviation becomes 1

Diagram – Scaling Comparison

Original Salary: [30000, 45000, 60000, 90000, 120000]

Min-Max (0 to 1):  [0.0,  0.17, 0.33,  0.67,  1.0]
Standard Scaled:   [-1.2, -0.7, -0.3,  0.6,   1.6]
(Mean=0, Std=1)

Both methods bring the values onto a comparable scale: Min-Max keeps them
between 0 and 1, while standard scaling centres them on 0 and allows
negative values.
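The rounded figures above can be checked with scikit-learn (StandardScaler uses the population standard deviation, ddof=0):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

salaries = np.array([[30000], [45000], [60000], [90000], [120000]])

minmax   = MinMaxScaler().fit_transform(salaries).round(2).ravel()
standard = StandardScaler().fit_transform(salaries).round(1).ravel()

print("Min-Max :", minmax.tolist())    # [0.0, 0.17, 0.33, 0.67, 1.0]
print("Standard:", standard.tolist())  # [-1.2, -0.7, -0.3, 0.6, 1.6]
```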

Complete Cleaning Pipeline

def clean_dataframe(df):
    # Step 1: Drop completely empty rows
    df = df.dropna(how="all")

    # Step 2: Remove duplicates
    df = df.drop_duplicates()

    # Step 3: Fix types
    df["Salary"] = pd.to_numeric(df["Salary"], errors="coerce")
    df["Age"]    = pd.to_numeric(df["Age"], errors="coerce")

    # Step 4: Standardise text
    df["City"] = df["City"].str.strip().str.title()

    # Step 5: Fill missing values with the median
    df["Salary"] = df["Salary"].fillna(df["Salary"].median())
    df["Age"]    = df["Age"].fillna(df["Age"].median())

    # Step 6: Remove outliers by IQR
    Q1, Q3 = df["Salary"].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    df = df[(df["Salary"] >= Q1 - 1.5*IQR) & (df["Salary"] <= Q3 + 1.5*IQR)]

    return df

df_cleaned = clean_dataframe(df.copy())
print("Clean dataset shape:", df_cleaned.shape)

Summary

  • Data cleaning is the most time-consuming part of any data science project
  • Missing values require a strategy: drop, fill with median/mean/mode, or forward fill
  • Duplicate rows silently inflate results — always check and remove them
  • Data type conversion enables correct arithmetic, sorting, and date operations
  • Text standardisation prevents duplicate categories like "Delhi", "delhi", "DELHI"
  • Outlier detection with IQR or Z-score identifies values that distort analysis
  • Categorical encoding converts text to numbers for machine learning
  • Feature scaling brings all numeric features to a comparable range
