Data Cleaning and Preprocessing
Raw data is almost never ready for analysis. It contains missing values, duplicates, wrong formats, outliers, and inconsistencies. Data cleaning fixes these issues before modelling. Surveys of practitioners consistently report that data scientists spend 60–80% of their time cleaning and preparing data, making this one of the most critical skills in the field.
What Is Dirty Data
Dirty data refers to any dataset that contains errors, inconsistencies, or gaps that make it unreliable for analysis. Using dirty data produces incorrect insights — a concept known as "garbage in, garbage out."
Types of Data Problems
+--------------------+------------------------------------------+
| Problem Type       | Example                                  |
+--------------------+------------------------------------------+
| Missing Values     | Age column has blank cells               |
| Duplicate Rows     | Same order recorded twice                |
| Wrong Data Types   | Date stored as text "25-Jan-2024"        |
| Inconsistent Text  | "Delhi", "delhi", "DELHI" as one city    |
| Outliers           | Salary of ₹0 or ₹99,999,999              |
| Wrong Range        | Age = 200, Percentage = 150%             |
| Extra Whitespace   | " Alice " instead of "Alice"             |
+--------------------+------------------------------------------+
Step 1 – Load and Inspect the Data
The cleaning process always starts with a full inspection of the raw dataset to understand what problems exist.
import pandas as pd
import numpy as np
# Simulated messy dataset
data = {
"Name": ["Alice", "Bob", "Alice", " Dave ", "Eve", "Bob", None],
"Age": [25, 30, 25, 200, 28, 30, 22],
"Salary": [50000, 60000, 50000, 75000, None, 60000, 45000],
"City": ["Delhi", "MUMBAI", "Delhi", "Mumbai", "Pune", "mumbai", "Delhi"],
"Joined": ["2021-03-01", "2020-07-15", "2021-03-01",
"15-Jan-2019", "2022-11-30", "2020-07-15", "2023-05-10"]
}
df = pd.DataFrame(data)
print(df)
print("\nShape:", df.shape)
print("\nMissing values:\n", df.isnull().sum())
print("\nData types:\n", df.dtypes)
Output – Missing values:
Name      1
Age       0
Salary    1
City      0
Joined    0
dtype: int64
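Two further inspection calls are worth running before touching anything: a statistical summary and the distinct values of each text column. A minimal sketch on the same DataFrame:
# Summary statistics for every column (include="all" covers text columns)
print(df.describe(include="all"))
# Distinct values expose case and whitespace inconsistencies at a glance
print(df["City"].unique())   # ['Delhi' 'MUMBAI' 'Mumbai' 'Pune' 'mumbai']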
Step 2 – Handle Missing Values
Missing values appear as NaN (Not a Number) in Pandas. There are three main strategies for handling them: drop the row, fill with a calculated value, or fill with a fixed value.
Diagram – Missing Value Strategies
              Row has a missing value
                        |
              +---------v---------+
              |  Is the column    |
              |  critical?        |
              +---------+---------+
                        |
                Yes <---+---> No
                 |             |
                 v             v
           Drop the row    Fill with:
                           - Mean   (numeric, no outliers)
                           - Median (numeric, with outliers)
                           - Mode   (categorical)
                           - Forward fill (time series)
                           - Fixed value ("Unknown", 0)
# Strategy 1: Drop rows where ALL values are missing
df.dropna(how="all", inplace=True)
# Strategy 2: Fill numeric missing values with the median
df["Salary"] = df["Salary"].fillna(df["Salary"].median())
# Strategy 3: Fill categorical missing values with the mode
df["Name"] = df["Name"].fillna(df["Name"].mode()[0])
# Strategy 4: Forward fill (good for time series); .ffill() replaces
# the deprecated fillna(method="ffill")
df["Salary"] = df["Salary"].ffill()
print("Missing after fix:\n", df.isnull().sum())
Step 3 – Remove Duplicate Rows
Duplicate rows inflate counts, distort averages, and produce wrong model results. Removing them is a mandatory step.
print("Rows before:", len(df))
print("Duplicates found:", df.duplicated().sum())
# View the duplicate rows
print(df[df.duplicated()])
# Remove duplicates – keep the first occurrence
df.drop_duplicates(inplace=True)
print("Rows after:", len(df))
# Check duplicates based on specific columns only
df.drop_duplicates(subset=["Name", "Salary"], keep="first", inplace=True)
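To see how duplicates distort an average, compare the mean of a small illustrative series before and after deduplication:
s = pd.Series([50000, 60000, 60000, 60000])   # one real 60000, recorded three times
print(s.mean())                     # 57500.0 – pulled toward the duplicated value
print(s.drop_duplicates().mean())   # 55000.0 – each distinct value counted once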
Step 4 – Fix Incorrect Data Types
Data loaded from CSV files often stores numbers as text and dates as plain strings. Fixing data types enables proper sorting, calculation, and filtering.
# Check types
print(df.dtypes)
# Convert text to number
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
df["Salary"] = pd.to_numeric(df["Salary"], errors="coerce")
# Convert text to date; format="mixed" (pandas >= 2.0) parses rows with
# different date layouts, and unparseable entries become NaT
df["Joined"] = pd.to_datetime(df["Joined"], format="mixed",
                              dayfirst=True, errors="coerce")
# Convert text to category (saves memory for repeated values)
df["City"] = df["City"].astype("category")
print("\nFixed types:\n", df.dtypes)
Step 5 – Standardise Text Data
Inconsistent text values (different cases, extra spaces) create silent errors in groupby operations and filters. "Delhi", "delhi", and "DELHI" look different to Python but represent the same city.
# Convert city names to title case and remove extra whitespace
df["City"] = df["City"].str.strip().str.title()
df["Name"] = df["Name"].str.strip().str.title()
# Replace specific wrong values
df["City"] = df["City"].replace({
"Bombay": "Mumbai",
"Calcutta": "Kolkata",
"Madras": "Chennai"
})
print(df["City"].value_counts())
Output – Before vs After:
Before: "Delhi", "MUMBAI", "mumbai", "Mumbai", " Dave " After: "Delhi", "Mumbai", "Mumbai", "Mumbai", "Dave"
Step 6 – Detect and Handle Outliers
An outlier is a value that sits far outside the normal range of a dataset. Some outliers are genuine (a billionaire in a salary dataset), while others are data entry errors (age = 200).
Method 1 – IQR (Interquartile Range)
Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Valid Salary range: ₹{lower_bound:.0f} – ₹{upper_bound:.0f}")
# Flag outliers
outliers = df[(df["Salary"] < lower_bound) | (df["Salary"] > upper_bound)]
print("Outlier rows:\n", outliers)
# Remove outliers
df_clean = df[(df["Salary"] >= lower_bound) & (df["Salary"] <= upper_bound)]
Diagram – IQR Outlier Detection
Data spread:

    |---|---|---|---|---|---|---|---|---|
            Q1     Median     Q3
            <------- IQR ------>

    Lower Fence = Q1 - 1.5 × IQR
    Upper Fence = Q3 + 1.5 × IQR
    Values outside the fences → Outliers

    [===|============================|===]
        ↑                            ↑
      Lower                        Upper
      Fence                        Fence
   (Flag / Remove)            (Flag / Remove)
Method 2 – Z-Score
from scipy import stats
z_scores = stats.zscore(df["Salary"].dropna())
# Values with |z| > 3 are outliers (more than 3 std from the mean);
# note that on a sample this small no value can exceed the threshold,
# so the rule is shown for reference
outlier_mask = (abs(z_scores) > 3)
print("Z-score outliers:", outlier_mask.sum())
Step 7 – Handle Impossible Values
# Age cannot be above 100 or below 0
df.loc[df["Age"] > 100, "Age"] = np.nan
df.loc[df["Age"] < 0, "Age"] = np.nan
# Fill the now-missing ages with the median
df["Age"] = df["Age"].fillna(df["Age"].median())
# Salary cannot be negative
df.loc[df["Salary"] < 0, "Salary"] = df["Salary"].median()
Step 8 – Feature Engineering from Dates
Date columns contain rich information that standard models cannot use directly. Extracting individual components unlocks that information.
# Extract useful features from the Joined date
df["JoinYear"] = df["Joined"].dt.year
df["JoinMonth"] = df["Joined"].dt.month
df["JoinDay"] = df["Joined"].dt.dayofweek   # day of week: 0=Monday, 6=Sunday
df["Tenure"] = (pd.Timestamp.today() - df["Joined"]).dt.days // 365
print(df[["Name", "Joined", "JoinYear", "JoinMonth", "Tenure"]])
Step 9 – Encode Categorical Variables
Machine learning models work with numbers, not text. Categorical columns like "City" or "Grade" must convert to numeric format before modelling.
Label Encoding – for Ordinal Categories
# Grade has a natural order: A > B > C > D (assumes the dataset has a
# Grade column; the sample df above does not)
grade_map = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
df["GradeEncoded"] = df["Grade"].map(grade_map)
One-Hot Encoding – for Nominal Categories
# City has no natural order – use one-hot encoding
df_encoded = pd.get_dummies(df, columns=["City"], drop_first=True)
print(df_encoded.columns.tolist())
Diagram – One-Hot Encoding
Original:          After One-Hot Encoding:

City               City_Mumbai   City_Pune
-------            -----------   ---------
Delhi      →            0             0
Mumbai     →            1             0
Pune       →            0             1
Delhi      →            0             0

(Delhi becomes the reference: 0, 0 in both columns)
Step 10 – Feature Scaling
Many machine learning algorithms perform poorly when features have very different scales. A salary in thousands and an age in tens need rescaling before feeding into models like KNN or SVM.
Min-Max Scaling (Normalisation)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[["Age_scaled", "Salary_scaled"]] = scaler.fit_transform(
df[["Age", "Salary"]]
)
# All values now between 0 and 1
print(df[["Age", "Age_scaled", "Salary", "Salary_scaled"]].head())
Standard Scaling (Standardisation)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[["Age_std", "Salary_std"]] = scaler.fit_transform(
df[["Age", "Salary"]]
)
# Mean becomes 0, standard deviation becomes 1
Diagram – Scaling Comparison
Original Salary:   [30000, 45000, 60000, 90000, 120000]
Min-Max (0 to 1):  [0.0, 0.17, 0.33, 0.67, 1.0]
Standard Scaled:   [-1.2, -0.7, -0.3, 0.6, 1.6]   (Mean=0, Std=1)

Min-Max keeps every value between 0 and 1, while standard scaling
centres values on 0 and allows negatives.
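One caveat worth demonstrating: in a real modelling workflow the scaler must be fitted on the training split only and then applied to the test split, otherwise information leaks from test to train. A minimal sketch with scikit-learn (the split variables are illustrative):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test = train_test_split(df[["Age", "Salary"]],
                                   test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learns mean/std from train only
X_test_scaled = scaler.transform(X_test)         # applies the same parameters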
Complete Cleaning Pipeline
def clean_dataframe(df):
    # Step 1: Drop completely empty rows
    df.dropna(how="all", inplace=True)
    # Step 2: Remove duplicates
    df.drop_duplicates(inplace=True)
    # Step 3: Fix types
    df["Salary"] = pd.to_numeric(df["Salary"], errors="coerce")
    df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
    # Step 4: Standardise text
    df["City"] = df["City"].str.strip().str.title()
    # Step 5: Fill missing values
    df["Salary"] = df["Salary"].fillna(df["Salary"].median())
    df["Age"] = df["Age"].fillna(df["Age"].median())
    # Step 6: Remove outliers by IQR
    Q1, Q3 = df["Salary"].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    df = df[(df["Salary"] >= Q1 - 1.5*IQR) & (df["Salary"] <= Q3 + 1.5*IQR)]
    return df
df_cleaned = clean_dataframe(df.copy())
print("Clean dataset shape:", df_cleaned.shape)
Summary
- Data cleaning is the most time-consuming part of any data science project
- Missing values require a strategy: drop, fill with median/mean/mode, or forward fill
- Duplicate rows silently inflate results — always check and remove them
- Data type conversion enables correct arithmetic, sorting, and date operations
- Text standardisation prevents duplicate categories like "Delhi", "delhi", "DELHI"
- Outlier detection with IQR or Z-score identifies values that distort analysis
- Categorical encoding converts text to numbers for machine learning
- Feature scaling brings all numeric features to a comparable range
