Machine Learning Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the process of examining a dataset to understand its structure, spot patterns, find anomalies, and form hypotheses before building any model. It is the detective work phase of Machine Learning — done with curiosity and without assumptions.
Why EDA Comes Before Modeling
Jumping straight into model building without understanding the data leads to poor feature choices, missed data problems, and confusing model behavior. EDA prevents this by answering basic questions about the data first.
EDA answers these questions before model training:
- How many records and features does the dataset have?
- Are any values missing?
- What does the distribution of each column look like?
- Are any features strongly related to the target label?
- Do any columns have outliers or unexpected patterns?
- Are any two features highly correlated with each other?
Step 1: Understand the Dataset Shape
The first step is getting a quick view of what the data looks like — how many rows, how many columns, and what data types each column holds.
Example Dataset: Loan Applications
Shape: 5000 rows × 8 columns

┌───────────────────┬───────────────────────────────────────┐
│ Column Name       │ Data Type                             │
├───────────────────┼───────────────────────────────────────┤
│ Applicant Age     │ Integer (numerical)                   │
│ Annual Income     │ Float (numerical)                     │
│ Loan Amount       │ Float (numerical)                     │
│ Employment Status │ String (categorical)                  │
│ Credit Score      │ Integer (numerical)                   │
│ Loan Purpose      │ String (categorical)                  │
│ Previous Default  │ Integer (0 or 1 — binary)             │
│ Loan Approved     │ String (Yes / No — target label)      │
└───────────────────┴───────────────────────────────────────┘
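In pandas, the shape and column types can be read off directly once the data is loaded. The tiny frame below is a hypothetical stand-in using a few of the columns from the table above, not the real 5000-row dataset:

```python
import pandas as pd

# Toy stand-in for the loan-applications dataset (hypothetical values)
df = pd.DataFrame({
    "Applicant Age":     [34, 45, 29],
    "Annual Income":     [480000.0, 1200000.0, 320000.0],
    "Employment Status": ["Salaried", "Self-Employed", "Salaried"],
    "Loan Approved":     ["Yes", "No", "Yes"],
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each column
```

On the real dataset, `df.shape` would report `(5000, 8)` and `df.dtypes` would reveal which columns are numerical versus categorical.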
Step 2: Check for Missing Values
EDA identifies missing values early so the team can decide the best preprocessing strategy before training begins.
Missing Value Summary:

┌───────────────────┬────────────────┬──────────────────┐
│ Column            │ Missing Count  │ Missing Percent  │
├───────────────────┼────────────────┼──────────────────┤
│ Applicant Age     │ 0              │ 0.0%             │
│ Annual Income     │ 120            │ 2.4%             │
│ Credit Score      │ 45             │ 0.9%             │
│ Employment Status │ 300            │ 6.0%             │
└───────────────────┴────────────────┴──────────────────┘

Action:
Income and Credit Score → fill with median (small % missing)
Employment Status → fill with mode or investigate further
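A sketch of this check in pandas, on a small made-up frame (the counts and fill strategy mirror the summary above, not real data):

```python
import pandas as pd
import numpy as np

# Hypothetical frame with one missing income and one missing status
df = pd.DataFrame({
    "Annual Income":     [480000.0, np.nan, 320000.0, 610000.0],
    "Employment Status": ["Salaried", "Salaried", None, "Self-Employed"],
})

# Count and percentage of missing values per column
missing = df.isna().sum()
missing_pct = df.isna().mean() * 100

# Fill the numerical gap with the median, the categorical gap with the mode
df["Annual Income"] = df["Annual Income"].fillna(df["Annual Income"].median())
df["Employment Status"] = df["Employment Status"].fillna(
    df["Employment Status"].mode()[0])
```

Median imputation is preferred over the mean here precisely because of the skew and outliers found in the next steps.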
Step 3: Univariate Analysis
Univariate analysis examines one column at a time. The goal is to understand the distribution and range of each feature independently.
For Numerical Features — Summary Statistics
Annual Income Summary:

Count   : 4880
Mean    : ₹4,85,000
Median  : ₹4,10,000
Min     : ₹80,000
Max     : ₹42,00,000
Std Dev : ₹3,20,000

Observation:
Mean > Median by a large gap
→ A few very high earners are pulling the mean up
→ Distribution is right-skewed (more people earn less)

Visual (ASCII Histogram — Annual Income):

₹0–2L   ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ (many people)
₹2–4L   ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ (most people)
₹4–6L   ▓▓▓▓▓▓▓▓▓▓▓▓
₹6–8L   ▓▓▓▓▓▓▓
₹8L+    ▓▓▓ (very few)
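The mean-versus-median skew check is a one-liner in pandas. The values below are a hypothetical sample with one extreme earner, chosen to reproduce the right-skew pattern described above:

```python
import pandas as pd

# Hypothetical income sample; the last value is an extreme high earner
income = pd.Series([80_000, 150_000, 300_000, 410_000,
                    450_000, 500_000, 700_000, 4_200_000])

print(income.describe())  # count, mean, std, min, quartiles, max

# Mean well above median signals a right-skewed distribution
right_skewed = income.mean() > income.median()
```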
For Categorical Features — Value Counts
Employment Status Distribution:

┌──────────────────┬────────┬───────────┐
│ Status           │ Count  │ Percent   │
├──────────────────┼────────┼───────────┤
│ Salaried         │ 3200   │ 64%       │
│ Self-Employed    │ 1100   │ 22%       │
│ Business Owner   │ 400    │ 8%        │
│ Unemployed       │ 300    │ 6%        │
└──────────────────┴────────┴───────────┘
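In pandas, `value_counts` produces exactly this kind of table. A scaled-down sketch (50 rows with the same proportions as above):

```python
import pandas as pd

# Hypothetical sample with the same 64/22/8/6 split as the table above
status = pd.Series(["Salaried"] * 32 + ["Self-Employed"] * 11 +
                   ["Business Owner"] * 4 + ["Unemployed"] * 3)

counts = status.value_counts()                     # raw counts
percents = status.value_counts(normalize=True) * 100  # percentages
```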
Step 4: Bivariate Analysis
Bivariate analysis looks at how two columns relate to each other. The most important relationship to examine is each feature against the target label.
Numerical Feature vs Target Label
Credit Score vs Loan Approved:

Approved Loans (Yes): Average Credit Score = 720
Rejected Loans (No):  Average Credit Score = 580

Observation:
Higher credit scores strongly associate with approval.
Credit Score is likely an important feature for the model.
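This comparison is a group-by aggregation. A minimal sketch on made-up rows chosen so the group means match the numbers above:

```python
import pandas as pd

# Hypothetical rows; group means match the 720 / 580 figures above
df = pd.DataFrame({
    "Credit Score":  [720, 750, 690, 580, 560, 600],
    "Loan Approved": ["Yes", "Yes", "Yes", "No", "No", "No"],
})

# Average credit score for each outcome class
avg_by_outcome = df.groupby("Loan Approved")["Credit Score"].mean()
```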
Categorical Feature vs Target Label
Employment Status vs Loan Approved:

┌──────────────────┬───────────┬──────────┐
│ Status           │ Approved  │ Rejected │
├──────────────────┼───────────┼──────────┤
│ Salaried         │ 78%       │ 22%      │
│ Self-Employed    │ 55%       │ 45%      │
│ Business Owner   │ 60%       │ 40%      │
│ Unemployed       │ 18%       │ 82%      │
└──────────────────┴───────────┴──────────┘

Observation:
Employment type clearly influences approval outcome.
This is a valuable feature for model training.
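A row-normalised cross-tabulation produces this kind of table directly. A sketch on a few hypothetical rows (the resulting rates are from this toy sample, not the table above):

```python
import pandas as pd

# Hypothetical rows: 3 of 4 salaried approved, 0 of 2 unemployed approved
df = pd.DataFrame({
    "Employment Status": ["Salaried", "Salaried", "Salaried",
                          "Salaried", "Unemployed", "Unemployed"],
    "Loan Approved":     ["Yes", "Yes", "Yes", "No", "No", "No"],
})

# normalize="index" turns each row into proportions (approval rate per status)
rates = pd.crosstab(df["Employment Status"], df["Loan Approved"],
                    normalize="index")
```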
Step 5: Correlation Analysis
Correlation measures how strongly two numerical features move together. A value close to +1 means they rise together. A value close to -1 means one rises when the other falls. A value close to 0 means little or no linear relationship.
Correlation Matrix (selected columns):
Age Income LoanAmt CreditScore
Age 1.00 0.45 0.30 0.38
Income 0.45 1.00 0.72 0.50
LoanAmt 0.30 0.72 1.00 0.33
CreditScore 0.38 0.50 0.33 1.00
Key Observations:
Income ↔ LoanAmt : 0.72 (strong) — higher earners request larger loans
Age ↔ Income : 0.45 (moderate) — older applicants tend to earn more
Warning: If two features are very highly correlated (above 0.90),
keeping both adds redundancy without extra information.
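The matrix and the redundancy check above can be sketched with `DataFrame.corr()`. The data here is synthetic: `LoanAmt` is generated from `Income` with small noise so the pair deliberately crosses the 0.90 threshold, while `Age` is independent:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

# Synthetic columns: LoanAmt is built from Income, so they correlate strongly
income = rng.normal(500_000, 100_000, 200)
loan_amt = income * 0.5 + rng.normal(0, 10_000, 200)
age = rng.integers(21, 65, 200).astype(float)

df = pd.DataFrame({"Income": income, "LoanAmt": loan_amt, "Age": age})
corr = df.corr()

# Flag feature pairs above the 0.90 redundancy threshold
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and abs(corr.loc[a, b]) > 0.90]
```

In practice one of each flagged pair is dropped or the pair is combined, since keeping both adds little information.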
Step 6: Target Label Distribution
Checking how the target label distributes across its classes reveals whether the dataset is balanced or imbalanced. This matters for algorithm selection and evaluation strategy.
Loan Approved Column:

Yes (Approved) : 3500 records → 70%
No  (Rejected) : 1500 records → 30%

This is mildly imbalanced. The model might learn to predict "Yes" most of the time and still appear 70% accurate. Special handling may be needed.

Extreme Imbalance Example (Fraud Detection):

Not Fraud : 99,700 records → 99.7%
Fraud     :    300 records →  0.3%

A model that always predicts "Not Fraud" gets 99.7% accuracy but catches zero fraud cases. Imbalance must be addressed.
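The balance check, and the accuracy of a majority-class baseline, can be sketched like this using the 3500/1500 split from above:

```python
import pandas as pd

# Target column with the 70/30 split described above
target = pd.Series(["Yes"] * 3500 + ["No"] * 1500)

# Class proportions
balance = target.value_counts(normalize=True)

# A model that always predicts the majority class scores this "accuracy"
baseline_accuracy = balance.max()
```

Any candidate model must beat this baseline to be worth anything, which is why accuracy alone is a poor metric on imbalanced data.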
Step 7: Outlier Visualization
Box Plot Concept for Annual Income:
          ┌─────────────────┐
   ├──────┤ Q1           Q3 ├──────┤          ● (outlier)
          └─────────────────┘
   │                               │
  Min                             Max
          (normal range)

Points beyond the whiskers = Outliers
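The whisker boundaries in a box plot follow the 1.5 × IQR rule, which can be computed directly. A sketch on a hypothetical income sample with one extreme value:

```python
import pandas as pd

# Hypothetical income sample; the last value is far outside the rest
income = pd.Series([300_000, 350_000, 400_000,
                    450_000, 500_000, 4_200_000])

# Interquartile range and the 1.5 * IQR whisker boundaries
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values beyond the whiskers are flagged as outliers
outliers = income[(income < lower) | (income > upper)]
```

Whether to remove, cap, or keep the flagged values is a modeling decision; EDA's job is only to find and document them.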
EDA Checklist
┌─────────────────────────────────────────────┬──────────┐
│ EDA Task                                    │ Done?    │
├─────────────────────────────────────────────┼──────────┤
│ Check dataset shape (rows × columns)        │ ✓        │
│ Identify data types of each column          │ ✓        │
│ Count and handle missing values             │ ✓        │
│ Summary statistics for numerical features   │ ✓        │
│ Value counts for categorical features       │ ✓        │
│ Visualize distributions (histogram)         │ ✓        │
│ Check feature vs target relationships       │ ✓        │
│ Compute correlation matrix                  │ ✓        │
│ Check target label balance                  │ ✓        │
│ Identify and document outliers              │ ✓        │
└─────────────────────────────────────────────┴──────────┘
EDA is not a one-time step. As modeling progresses, going back to the data with new questions is normal and valuable. The insights gained during EDA directly guide which features to use, which preprocessing steps to apply, and which algorithm to try first.
