Machine Learning Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of examining a dataset to understand its structure, spot patterns, find anomalies, and form hypotheses before building any model. It is the detective work phase of Machine Learning — done with curiosity and without assumptions.

Why EDA Comes Before Modeling

Jumping straight into model building without understanding the data leads to poor feature choices, missed data problems, and confusing model behavior. EDA prevents this by answering basic questions about the data first.

EDA answers these questions before model training:
  - How many records and features does the dataset have?
  - Are any values missing?
  - What does the distribution of each column look like?
  - Are any features strongly related to the target label?
  - Do any columns have outliers or unexpected patterns?
  - Are any two features highly correlated with each other?

Step 1: Understand the Dataset Shape

The first step is getting a quick view of what the data looks like — how many rows, how many columns, and what data types each column holds.

Example Dataset: Loan Applications

Shape: 5000 rows × 8 columns

┌───────────────────┬───────────────────────────────────────┐
│ Column Name       │ Data Type                             │
├───────────────────┼───────────────────────────────────────┤
│ Applicant Age     │ Integer (numerical)                   │
│ Annual Income     │ Float (numerical)                     │
│ Loan Amount       │ Float (numerical)                     │
│ Employment Status │ String (categorical)                  │
│ Credit Score      │ Integer (numerical)                   │
│ Loan Purpose      │ String (categorical)                  │
│ Previous Default  │ Integer (0 or 1 — binary)             │
│ Loan Approved     │ String (Yes / No — target label)      │
└───────────────────┴───────────────────────────────────────┘
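In pandas, this first look takes only a few calls. The frame below is a tiny illustrative stand-in for the hypothetical 5000-row loan dataset (column names are my own snake_case versions of the ones above):

```python
import pandas as pd

# Small illustrative frame standing in for the loan dataset
df = pd.DataFrame({
    "applicant_age": [29, 41, 35],
    "annual_income": [480000.0, 950000.0, 320000.0],
    "employment_status": ["Salaried", "Self-Employed", "Salaried"],
    "loan_approved": ["Yes", "Yes", "No"],
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each column
print(df.head())  # first few records
```

`shape`, `dtypes`, and `head()` together answer the "how many rows, how many columns, what types" question in seconds.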

Step 2: Check for Missing Values

EDA identifies missing values early so the team can decide the best preprocessing strategy before training begins.

Missing Value Summary:
┌───────────────────┬────────────────┬──────────────────┐
│ Column            │ Missing Count  │ Missing Percent  │
├───────────────────┼────────────────┼──────────────────┤
│ Applicant Age     │ 0              │ 0.0%             │
│ Annual Income     │ 120            │ 2.4%             │
│ Credit Score      │ 45             │ 0.9%             │
│ Employment Status │ 300            │ 6.0%             │
└───────────────────┴────────────────┴──────────────────┘

Action:
  Annual Income and Credit Score → Fill with median (small % missing)
  Employment Status       → Fill with mode or investigate further
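A sketch of this check and fill strategy in pandas, using a tiny synthetic frame (the real dataset is hypothetical, so the values here are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "annual_income": [480000.0, np.nan, 320000.0, 610000.0],
    "employment_status": ["Salaried", "Self-Employed", None, "Salaried"],
})

# Count and percentage of missing values per column
missing = df.isna().sum()
missing_pct = df.isna().mean() * 100

# Numerical gap → median; categorical gap → mode
df["annual_income"] = df["annual_income"].fillna(df["annual_income"].median())
df["employment_status"] = df["employment_status"].fillna(
    df["employment_status"].mode()[0]
)
```

`isna().mean()` gives the missing fraction directly, which is what the percent column of the summary table reports.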

Step 3: Univariate Analysis

Univariate analysis examines one column at a time. The goal is to understand the distribution and range of each feature independently.

For Numerical Features — Summary Statistics

Annual Income Summary:
  Count   : 4880
  Mean    : ₹4,85,000
  Median  : ₹4,10,000
  Min     : ₹80,000
  Max     : ₹42,00,000
  Std Dev : ₹3,20,000

Observation: Mean > Median by a large gap
→ A few very high earners are pulling the mean up
→ Distribution is right-skewed (more people earn less)
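The mean-versus-median comparison and a skewness check can be done in one pass with pandas. The incomes below are invented to mimic the right-skewed shape described above:

```python
import pandas as pd

income = pd.Series([80_000, 150_000, 300_000, 410_000, 450_000,
                    600_000, 900_000, 4_200_000], name="annual_income")

print(income.describe())                # count, mean, std, min, quartiles, max
print(income.mean() > income.median()) # True here → long right tail
print(income.skew())                   # positive value confirms right skew
```

A positive `skew()` backs up the visual impression from the histogram without needing a plot.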

Visual (ASCII Histogram — Annual Income):

 ₹0–2L  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  (many people)
 ₹2–4L  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  (most people)
 ₹4–6L  ▓▓▓▓▓▓▓▓▓▓▓▓
 ₹6–8L  ▓▓▓▓▓▓▓
 ₹8L+   ▓▓▓  (very few)

For Categorical Features — Value Counts

Employment Status Distribution:
┌──────────────────┬────────┬───────────┐
│ Status           │ Count  │ Percent   │
├──────────────────┼────────┼───────────┤
│ Salaried         │ 3200   │ 64%       │
│ Self-Employed    │ 1100   │ 22%       │
│ Business Owner   │ 400    │ 8%        │
│ Unemployed       │ 300    │ 6%        │
└──────────────────┴────────┴───────────┘
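Value counts and percentages come straight from `value_counts`. The synthetic series below reproduces the 64 / 22 / 8 / 6 percent split at one-hundredth scale:

```python
import pandas as pd

status = pd.Series(["Salaried"] * 32 + ["Self-Employed"] * 11 +
                   ["Business Owner"] * 4 + ["Unemployed"] * 3)

counts = status.value_counts()                      # absolute counts
percents = status.value_counts(normalize=True) * 100  # percentages
print(counts)
print(percents.round(1))
```

`normalize=True` is the one flag worth remembering here: it turns raw counts into the share of each category.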

Step 4: Bivariate Analysis

Bivariate analysis looks at how two columns relate to each other. The most important relationship to examine is each feature against the target label.

Numerical Feature vs Target Label

Credit Score vs Loan Approved:

Approved Loans (Yes):
  Average Credit Score = 720

Rejected Loans (No):
  Average Credit Score = 580

Observation: Higher credit scores strongly associate with approval.
Credit Score is likely an important feature for the model.
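A per-class average like this is a one-line `groupby`. The six rows below are fabricated so the group means land on the same 720 / 580 figures used above:

```python
import pandas as pd

df = pd.DataFrame({
    "credit_score": [750, 710, 700, 560, 600, 580],
    "loan_approved": ["Yes", "Yes", "Yes", "No", "No", "No"],
})

# Average credit score within each target class
avg_by_class = df.groupby("loan_approved")["credit_score"].mean()
print(avg_by_class)
```

A large gap between the class means, as here, is a quick signal that the feature separates the classes well.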

Categorical Feature vs Target Label

Employment Status vs Loan Approved:
┌──────────────────┬───────────┬──────────┐
│ Status           │ Approved  │ Rejected │
├──────────────────┼───────────┼──────────┤
│ Salaried         │ 78%       │ 22%      │
│ Self-Employed    │ 55%       │ 45%      │
│ Business Owner   │ 60%       │ 40%      │
│ Unemployed       │ 18%       │ 82%      │
└──────────────────┴───────────┴──────────┘

Observation: Employment type clearly influences approval outcome.
This is a valuable feature for model training.
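Tables like the one above are exactly what `pd.crosstab` with row normalization produces. The eight fabricated rows below give a simple 75 / 25 vs 25 / 75 split:

```python
import pandas as pd

df = pd.DataFrame({
    "employment_status": ["Salaried", "Salaried", "Salaried", "Salaried",
                          "Unemployed", "Unemployed", "Unemployed", "Unemployed"],
    "loan_approved":     ["Yes", "Yes", "Yes", "No",
                          "No", "No", "No", "Yes"],
})

# Approval rate per employment status, as row percentages
rates = pd.crosstab(df["employment_status"], df["loan_approved"],
                    normalize="index") * 100
print(rates)
```

`normalize="index"` makes each row sum to 100%, so the numbers read as approval and rejection rates per category.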

Step 5: Correlation Analysis

Correlation measures how strongly two numerical features move together. A value close to +1 means they rise together. A value close to -1 means one rises when the other falls. A value close to 0 means little or no linear relationship.

Correlation Matrix (selected columns):

               Age    Income  LoanAmt  CreditScore
Age            1.00    0.45    0.30     0.38
Income         0.45    1.00    0.72     0.50
LoanAmt        0.30    0.72    1.00     0.33
CreditScore    0.38    0.50    0.33     1.00

Key Observations:
  Income ↔ LoanAmt : 0.72 (strong) — higher earners request larger loans
  Age    ↔ Income  : 0.45 (moderate) — older applicants tend to earn more

Warning: If two features are very highly correlated (above 0.90),
keeping both adds redundancy without extra information.
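Computing the matrix and flagging redundant pairs is straightforward in pandas. The numbers below are invented so that income and loan amount are near-perfectly correlated while age is only moderately related:

```python
import pandas as pd

df = pd.DataFrame({
    "age":         [25, 52, 31, 60, 44],
    "income":      [300_000, 450_000, 500_000, 700_000, 900_000],
    "loan_amount": [200_000, 350_000, 380_000, 600_000, 750_000],
})

corr = df.corr()
print(corr.round(2))

# Flag pairs correlated above 0.90 as redundancy candidates
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.90]
print(high)
```

In practice one of each flagged pair is often dropped, or the pair is combined into a single derived feature.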

Step 6: Target Label Distribution

Checking how the target label is distributed across its classes reveals whether the dataset is balanced or imbalanced. This matters for algorithm selection and evaluation strategy.

Loan Approved Column:
  Yes (Approved) : 3500 records  → 70%
  No (Rejected)  : 1500 records  → 30%

This is mildly imbalanced. The model might learn to predict
"Yes" most of the time and still appear 70% accurate.
Special handling may be needed.

Extreme Imbalance Example (Fraud Detection):
  Not Fraud : 99,700 records → 99.7%
  Fraud     :    300 records →  0.3%

A model that always predicts "Not Fraud" gets 99.7% accuracy
but catches zero fraud cases. Imbalance must be addressed.
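The balance check itself is a single `value_counts` call. The series below reproduces the 70 / 30 split at one-hundredth scale, and the threshold of 10% for "severe" is an illustrative choice, not a fixed rule:

```python
import pandas as pd

target = pd.Series(["Yes"] * 35 + ["No"] * 15, name="loan_approved")

balance = target.value_counts(normalize=True)
print(balance)

# Simple imbalance check before choosing metrics and algorithms
minority_share = balance.min()
if minority_share < 0.10:   # illustrative threshold
    print("Severe imbalance - consider resampling or class weights")
```

When the minority share is small, accuracy stops being a useful metric; precision, recall, or F1 on the minority class tell the real story.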

Step 7: Outlier Visualization

Box Plot Concept for Annual Income:

                  Q1    Median    Q3
    ├─────────────┬───────┬───────┬─────────────┤        ● (outlier)
   Min            └───────┴───────┘            Max
 (lower whisker)       IQR box          (upper whisker)

Points beyond the whiskers = Outliers
A common rule: anything below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
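The 1.5 × IQR rule that box plots are built on takes a few lines in pandas. The incomes below are the same invented right-skewed sample used earlier, where the ₹42L value is the lone extreme:

```python
import pandas as pd

income = pd.Series([80_000, 150_000, 300_000, 410_000,
                    450_000, 600_000, 900_000, 4_200_000])

# Quartiles and interquartile range
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values beyond the whiskers
outliers = income[(income < lower) | (income > upper)]
print(outliers)
```

Whether to cap, remove, or keep such values is a modeling decision; EDA's job is only to find and document them.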

EDA Checklist

┌─────────────────────────────────────────────┬──────────┐
│ EDA Task                                    │ Done?    │
├─────────────────────────────────────────────┼──────────┤
│ Check dataset shape (rows × columns)        │ ✓        │
│ Identify data types of each column          │ ✓        │
│ Count and handle missing values             │ ✓        │
│ Summary statistics for numerical features   │ ✓        │
│ Value counts for categorical features       │ ✓        │
│ Visualize distributions (histogram)         │ ✓        │
│ Check feature vs target relationships       │ ✓        │
│ Compute correlation matrix                  │ ✓        │
│ Check target label balance                  │ ✓        │
│ Identify and document outliers              │ ✓        │
└─────────────────────────────────────────────┴──────────┘

EDA is not a one-time step. As modeling progresses, going back to the data with new questions is normal and valuable. The insights gained during EDA directly guide which features to use, which preprocessing steps to apply, and which algorithm to try first.
