DS Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of examining a dataset to understand its structure, spot patterns, find relationships, and identify anomalies before building any model. EDA answers the question: "What does this data actually tell us?" Skipping it often leads to poor model choices and incorrect conclusions.

What Is EDA and Why It Matters

EDA is like reading a book's table of contents before diving into the chapters. It gives a high-level map of the data before any detailed work begins. A thorough EDA prevents wasted effort by revealing whether the data can actually answer the original business question.

EDA Workflow

+---------------------------+
| 1. Understand the Data    |  → Shape, columns, data types
+---------------------------+
           |
+---------------------------+
| 2. Summary Statistics     |  → Mean, median, std, min, max
+---------------------------+
           |
+---------------------------+
| 3. Check Data Quality     |  → Missing values, duplicates, outliers
+---------------------------+
           |
+---------------------------+
| 4. Univariate Analysis    |  → One column at a time
+---------------------------+
           |
+---------------------------+
| 5. Bivariate Analysis     |  → Two columns at a time
+---------------------------+
           |
+---------------------------+
| 6. Multivariate Analysis  |  → Multiple columns together
+---------------------------+
           |
+---------------------------+
| 7. Record Findings        |  → Document insights for the team
+---------------------------+

Setting Up the Dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Sample employee dataset
np.random.seed(42)
n = 200

df = pd.DataFrame({
    # Stored as float so that NaN can be assigned below without a dtype change
    "Age":        np.random.randint(22, 60, n).astype(float),
    "Salary":     np.random.normal(60000, 15000, n).round(0),
    "Experience": np.random.randint(0, 35, n),
    "Department": np.random.choice(["IT", "HR", "Finance", "Marketing"], n),
    "Gender":     np.random.choice(["Male", "Female"], n),
    "Rating":     np.random.choice([1, 2, 3, 4, 5], n)
})

# Introduce some missing values
df.loc[5:10, "Salary"] = np.nan
df.loc[15:18, "Age"]   = np.nan

Step 1 – Understand the Data Structure

# Basic structure
print("Shape:", df.shape)               # (200, 6)
print("Columns:", df.columns.tolist())
print("\nData Types:\n", df.dtypes)
print("\nFirst 5 rows:\n", df.head())
print("\nInfo:\n")
df.info()

Key questions at this stage:

How many rows and columns does the dataset have?
What does each column represent?
Are the data types correct for each column?
Which columns are numeric and which are categorical?
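The last question can be answered directly from the dtypes. The sketch below uses a tiny hypothetical frame (`sample` is an assumption, not the lesson's dataset) and splits columns with `select_dtypes`.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the employee dataset
sample = pd.DataFrame({
    "Age": [25.0, 40.0, 31.0],
    "Salary": [52000.0, 61000.0, 58000.0],
    "Department": ["IT", "HR", "IT"],
    "Rating": [3, 5, 4],
})

# Numeric columns: any integer or float dtype
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical in this sketch
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

A dtype is only a starting point. Rating is stored as an integer here, but as a 1 to 5 score it may be better analysed as an ordered category.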

Step 2 – Summary Statistics

# Numeric columns
print(df.describe())

# Categorical columns
print(df.describe(include="object"))

What to look for in describe():

Statistic	What It Tells	Warning Sign
count	Number of non-missing values	count < total rows → missing values exist
mean	Average value	Far from the median → skew or outliers likely
std	Spread of values around the mean	Large relative to the mean → values vary widely
min / max	Smallest and largest values	Implausible extremes → possible data entry errors
25% / 50% / 75%	Quartiles (50% is the median)	Median much closer to one quartile than the other → skewed distribution
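The mean-versus-median warning sign is easiest to see on a small example. The sketch below uses made-up salaries (the `salaries` values are hypothetical) with one extreme entry, and compares both statistics with and without it.

```python
import pandas as pd

# Hypothetical salaries; the last value is an extreme entry
salaries = pd.Series([40000, 42000, 45000, 47000, 50000, 250000])

mean_all = salaries.mean()
median_all = salaries.median()

# Same statistics after dropping the extreme entry
trimmed = salaries[salaries < 100000]
mean_trimmed = trimmed.mean()
median_trimmed = trimmed.median()

print(f"With extreme value:    mean={mean_all:.0f}, median={median_all:.0f}")
print(f"Without extreme value: mean={mean_trimmed:.0f}, median={median_trimmed:.0f}")
```

One extreme value pulls the mean far above the median, while the median barely moves. A large gap between the two in describe() is therefore worth investigating.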

Step 3 – Check Data Quality

# Missing values summary
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)

missing_report = pd.DataFrame({
    "Missing Count": missing,
    "Missing %": missing_pct
})
print(missing_report[missing_report["Missing Count"] > 0])

# Duplicates
print("\nDuplicate rows:", df.duplicated().sum())

# Unique values per categorical column
for col in df.select_dtypes(include="object").columns:
    print(f"\n{col} – {df[col].nunique()} unique values:")
    print(df[col].value_counts())

Step 4 – Univariate Analysis

Univariate analysis examines one column at a time. The goal is to understand the distribution and frequency of each individual feature.

For Numeric Columns – Distribution Analysis

# Histogram of Salary distribution
plt.figure(figsize=(8, 4))
plt.hist(df["Salary"].dropna(), bins=20, edgecolor="black", color="steelblue")
plt.title("Salary Distribution")
plt.xlabel("Salary (₹)")
plt.ylabel("Number of Employees")
plt.tight_layout()
plt.savefig("salary_distribution.png")
plt.show()

Distribution Shape Guide

Normal (Bell Curve):          Right-Skewed:           Left-Skewed:
       ▐█▌                         █                         █
      ▐███▌                        ██                       ██
     ▐█████▌                      ████                    ████
    ▐███████▌                    ██████                ██████
  ──────────────              ──────────────        ──────────────
  Mean≈Median≈Mode             Mean > Median          Mean < Median
  (Adult heights)              (House prices)         (Age at death)

# Measure skewness and kurtosis
print("Salary Skewness:", round(df["Salary"].skew(), 3))
# Positive → right-skewed, Negative → left-skewed, Near 0 → roughly symmetric

print("Salary Kurtosis:", round(df["Salary"].kurt(), 3))
# pandas reports excess kurtosis (a normal distribution is close to 0)
# Well above 0 → heavy tails (more extreme values)
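To make the sign of the skewness concrete, the following sketch compares two small hand-made series (both hypothetical): one symmetric and one with a long right tail.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])  # long tail towards large values

for name, s in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    print(f"{name}: mean={s.mean():.2f}, median={s.median():.2f}, skew={s.skew():.3f}")
```

The symmetric series has zero skewness and equal mean and median. The right-skewed series has positive skewness and a mean above its median, matching the shape guide.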

For Categorical Columns – Frequency Analysis

# Count plot for Department
dept_counts = df["Department"].value_counts()
print(dept_counts)

plt.figure(figsize=(7, 4))
dept_counts.plot(kind="bar", color="coral", edgecolor="black")
plt.title("Employee Count by Department")
plt.xlabel("Department")
plt.ylabel("Count")
plt.xticks(rotation=0)
plt.tight_layout()
plt.savefig("dept_distribution.png")
plt.show()

Step 5 – Bivariate Analysis

Bivariate analysis examines the relationship between two columns. It answers: "Do these two variables move together?" A relationship found here shows association, not proof that one variable causes the other.

Numeric vs Numeric – Scatter Plot and Correlation

# Scatter plot: Age vs Salary
plt.figure(figsize=(7, 5))
plt.scatter(df["Age"], df["Salary"], alpha=0.5, color="teal")
plt.title("Age vs Salary")
plt.xlabel("Age")
plt.ylabel("Salary (₹)")
plt.tight_layout()
plt.savefig("age_vs_salary.png")
plt.show()

# Correlation coefficient
correlation = df["Age"].corr(df["Salary"])
print(f"Correlation (Age vs Salary): {correlation:.3f}")
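For intuition about what corr() computes by default, the sketch below reproduces the Pearson coefficient by hand on two short hypothetical series (`x` and `y` are illustrative) and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from each mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = x.corr(y)  # Pearson is the default method

print(f"Manual: {r_manual:.4f}")
print(f"pandas: {r_pandas:.4f}")
```

Both values agree. The coefficient only measures linear association, so a curved relationship can still produce a value near 0.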

Correlation Interpretation Guide

+-------------------+------------------------------+
| Correlation Value | Meaning                      |
+-------------------+------------------------------+
|  0.9 to  1.0      | Very strong positive         |
|  0.7 to  0.9      | Strong positive              |
|  0.4 to  0.7      | Moderate positive            |
|  0.1 to  0.4      | Weak positive                |
| -0.1 to  0.1      | No meaningful relationship   |
| -0.4 to -0.1      | Weak negative                |
| -0.7 to -0.4      | Moderate negative            |
| -0.9 to -0.7      | Strong negative              |
| -1.0 to -0.9      | Very strong negative         |
+-------------------+------------------------------+

These bands are a common rule of thumb rather than a formal standard; acceptable thresholds vary by field.
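The guide can be turned into a small helper for labelling coefficients in a report. This is a minimal sketch: `describe_correlation` is a hypothetical name, and it assumes the negative bands mirror the positive ones, so values below -0.9 are labelled "Very strong negative".

```python
def describe_correlation(r: float) -> str:
    """Map a correlation coefficient to a rule-of-thumb label."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie between -1 and 1")

    strength = abs(r)
    if strength < 0.1:
        return "No meaningful relationship"
    if strength < 0.4:
        label = "Weak"
    elif strength < 0.7:
        label = "Moderate"
    elif strength < 0.9:
        label = "Strong"
    else:
        label = "Very strong"

    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction}"


for value in (0.95, 0.82, 0.05, -0.5):
    print(value, "→", describe_correlation(value))
```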

Numeric vs Categorical – Box Plot

# Salary distribution by Department
# df.boxplot(by=...) creates its own figure, so a separate plt.figure() call
# would only leave an extra empty figure behind
df.boxplot(column="Salary", by="Department", figsize=(8, 5))
plt.title("Salary by Department")
plt.suptitle("")
plt.xlabel("Department")
plt.ylabel("Salary (₹)")
plt.tight_layout()
plt.savefig("salary_by_dept.png")
plt.show()

Box Plot Anatomy

         ○          ← Outlier (above Q3 + 1.5×IQR)

       ──┬──        ← Upper whisker (largest value within Q3 + 1.5×IQR)
         │
    ┌────┴────┐     ← Q3 (75th percentile)
    │         │
    ├─────────┤     ← Median (50th percentile)
    │         │
    └────┬────┘     ← Q1 (25th percentile)
         │
       ──┴──        ← Lower whisker (smallest value within Q1 - 1.5×IQR)

         ○          ← Outlier (below Q1 - 1.5×IQR)
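The parts of a box plot can be computed directly. The sketch below works through a small hypothetical series (`values` is made up) and derives the quartiles, the 1.5×IQR fences, the whisker ends, and the outliers.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences mark the limit beyond which a point counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
lower_whisker = inside.min()
upper_whisker = inside.max()

outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Whiskers: [{lower_whisker}, {upper_whisker}]")
print("Outliers:", outliers.tolist())
```

The whiskers stop at 1 and 9, the last real data points inside the fences, while 100 falls outside the upper fence and would be drawn as a separate point.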

Categorical vs Categorical – Cross Tabulation

# How is gender distributed across departments?
crosstab = pd.crosstab(df["Department"], df["Gender"])
print(crosstab)

# With percentages
crosstab_pct = pd.crosstab(df["Department"], df["Gender"], normalize="index") * 100
print(crosstab_pct.round(1))
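The `normalize` argument changes which question the percentages answer. The sketch below uses a six-row hypothetical frame (`staff` is illustrative) to contrast row-wise and column-wise percentages.

```python
import pandas as pd

# Small hypothetical frame so the percentages can be checked by hand
staff = pd.DataFrame({
    "Department": ["IT", "IT", "IT", "HR", "HR", "Finance"],
    "Gender": ["Male", "Female", "Male", "Female", "Female", "Male"],
})

# Raw counts with row and column totals
counts = pd.crosstab(staff["Department"], staff["Gender"], margins=True)

# Each row sums to 100: gender mix within a department
row_pct = pd.crosstab(staff["Department"], staff["Gender"], normalize="index") * 100

# Each column sums to 100: department mix within a gender
col_pct = pd.crosstab(staff["Department"], staff["Gender"], normalize="columns") * 100

print(counts)
print(row_pct.round(1))
print(col_pct.round(1))
```

Row percentages answer "what share of IT is female?", while column percentages answer "what share of female employees work in IT?". The two are easy to confuse when reading a table.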

Step 6 – Multivariate Analysis

Multivariate analysis examines three or more variables simultaneously to find hidden patterns.
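A simple way to combine three variables is a pivot table: two categorical columns define the grid and a numeric column fills it. The sketch below uses a small hypothetical frame (`staff` and its values are assumptions for illustration).

```python
import pandas as pd

staff = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "IT", "HR"],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Salary": [60000, 64000, 50000, 52000, 70000, 56000],
})

# Mean Salary for every Department × Gender combination
mean_salary = staff.pivot_table(
    values="Salary",
    index="Department",
    columns="Gender",
    aggfunc="mean",
)

print(mean_salary)
```

Reading across a row compares genders within a department. Reading down a column compares departments within a gender.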

Correlation Heatmap

# Correlation matrix for all numeric columns
corr_matrix = df[["Age", "Salary", "Experience", "Rating"]].corr()

plt.figure(figsize=(6, 5))
sns.heatmap(
    corr_matrix,
    annot=True,
    cmap="coolwarm",
    fmt=".2f",
    center=0
)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.savefig("correlation_heatmap.png")
plt.show()

Reading the Heatmap

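Illustrative heatmap grid (hypothetical values; the sample dataset above is generated with independent random columns, so its own correlations will be close to 0):
             Age    Salary  Exp   Rating
Age        [ 1.00   0.45   0.82   0.12 ]
Salary     [ 0.45   1.00   0.61   0.08 ]
Experience [ 0.82   0.61   1.00   0.05 ]
Rating     [ 0.12   0.08   0.05   1.00 ]

Colour Guide (coolwarm, centred at 0):
Dark Red   (1.0)  → Perfect positive correlation
Light Grey (0.0)  → No linear correlation
Dark Blue  (-1.0) → Perfect negative correlation

→ In this grid, Age and Experience are strongly correlated (0.82)
→ Older employees tend to have more experience
→ Rating is almost unrelated to the other columns (all values below 0.15)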
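With many columns, scanning a heatmap by eye gets slow. The sketch below ranks column pairs by absolute correlation, copying the example grid above into a DataFrame. The masking approach with `np.triu` is one possible way to do this, not the only one.

```python
import numpy as np
import pandas as pd

cols = ["Age", "Salary", "Experience", "Rating"]

# Values copied from the example grid
corr_matrix = pd.DataFrame(
    [
        [1.00, 0.45, 0.82, 0.12],
        [0.45, 1.00, 0.61, 0.08],
        [0.82, 0.61, 1.00, 0.05],
        [0.12, 0.08, 0.05, 1.00],
    ],
    index=cols,
    columns=cols,
)

# Keep only the upper triangle; k=1 also drops the diagonal of 1.00 values
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.where(mask)

# One row per column pair, strongest relationship first
pairs = upper.stack().dropna().sort_values(key=lambda s: s.abs(), ascending=False)

print(pairs)
```

Each pair appears once, and the strongest relationship (Age and Experience) is listed first.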

Pair Plot – All Numeric Columns at Once

# Pair plot shows scatter plots and distributions together
sns.pairplot(
    df[["Age", "Salary", "Experience"]],
    diag_kind="kde",
    plot_kws={"alpha": 0.5}
)
plt.suptitle("Pair Plot – Employee Data", y=1.02)
plt.savefig("pair_plot.png", bbox_inches="tight")  # keeps the raised title inside the saved image
plt.show()

Outlier Visualisation (Revisiting Step 3)

# Box plot to spot outliers visually
plt.figure(figsize=(10, 4))
df[["Age", "Experience", "Rating"]].boxplot()
plt.title("Box Plots – Outlier Detection")
plt.ylabel("Value")
plt.tight_layout()
plt.savefig("boxplot_outliers.png")
plt.show()

# Count Salary outliers with the 1.5×IQR rule
Q1  = df["Salary"].quantile(0.25)
Q3  = df["Salary"].quantile(0.75)
IQR = Q3 - Q1
is_outlier = (df["Salary"] < Q1 - 1.5 * IQR) | (df["Salary"] > Q3 + 1.5 * IQR)
print(f"Outliers in Salary: {is_outlier.sum()}")
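The same rule can be applied to every numeric column at once. The helper below is a minimal sketch: `iqr_outlier_counts` and the `demo` frame are hypothetical names introduced for illustration.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column limits with the DataFrame columns;
    # missing values compare as False, so they are never counted
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return mask.sum()


demo = pd.DataFrame({
    "Salary": [50000, 52000, 54000, 56000, 58000, 200000],
    "Rating": [1, 2, 3, 4, 5, 3],
    "Department": ["IT", "HR", "IT", "HR", "IT", "HR"],
})

outlier_counts = iqr_outlier_counts(demo)
print(outlier_counts)
```

A flagged value is not automatically an error. The count only shows which columns deserve a closer look.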

EDA Questions Checklist

Question	Tool
How many rows and columns?	df.shape
Any missing values?	df.isnull().sum()
Any duplicate rows?	df.duplicated().sum()
What is the distribution of numeric columns?	Histogram, describe()
What is the frequency of categories?	value_counts(), bar chart
Are there outliers?	Box plot, IQR method
Which numeric columns correlate with each other?	corr(), heatmap
Does salary differ across departments?	Box plot by category
Is gender distributed evenly across departments?	Crosstab, stacked bar chart
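The first three checklist items are cheap enough to bundle into one call. The function below is a minimal sketch under that assumption; `quick_checks` and the `demo` frame are hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def quick_checks(frame: pd.DataFrame) -> dict:
    """Collect basic structure and quality checks in one dictionary."""
    return {
        "shape": frame.shape,
        "missing_per_column": frame.isnull().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


demo = pd.DataFrame({
    "Age": [30.0, np.nan, 45.0, 30.0],
    "Department": ["IT", "HR", "IT", "IT"],
})

report = quick_checks(demo)
for key, value in report.items():
    print(f"{key}: {value}")
```

Running such a helper at the start of every analysis makes it harder to skip the basic checks.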

Summary

EDA is the first analytical step — it reveals what the data contains before any modelling begins
Summary statistics describe central tendency, spread, and shape of each column
Univariate analysis focuses on one column using histograms and frequency counts
Bivariate analysis examines relationships between pairs of columns using scatter plots, box plots, and cross-tabulations
Correlation measures the strength and direction of linear relationships between numeric columns
Heatmaps and pair plots reveal patterns across multiple columns simultaneously
Box plots identify outliers visually and confirm statistical outlier detection
