DS Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the process of examining a dataset to understand its structure, spot patterns, find relationships, and identify anomalies before building any model. EDA answers the question: "What does this data actually tell us?" Skipping it often leads to poor model choices and incorrect conclusions.
What Is EDA and Why It Matters
EDA is like reading a book's table of contents before diving into the chapters. It gives a high-level map of the data before any detailed work begins. A thorough EDA prevents wasted effort by revealing whether the data can actually answer the original business question.
EDA Workflow
+---------------------------+
| 1. Understand the Data | → Shape, columns, data types
+---------------------------+
|
+---------------------------+
| 2. Summary Statistics | → Mean, median, std, min, max
+---------------------------+
|
+---------------------------+
| 3. Check Data Quality | → Missing values, duplicates, outliers
+---------------------------+
|
+---------------------------+
| 4. Univariate Analysis | → One column at a time
+---------------------------+
|
+---------------------------+
| 5. Bivariate Analysis | → Two columns at a time
+---------------------------+
|
+---------------------------+
| 6. Multivariate Analysis | → Multiple columns together
+---------------------------+
|
+---------------------------+
| 7. Record Findings | → Document insights for the team
+---------------------------+
Setting Up the Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Sample employee dataset
np.random.seed(42)
n = 200
df = pd.DataFrame({
"Age": np.random.randint(22, 60, n).astype(float),  # float so NaN can be assigned below
"Salary": np.random.normal(60000, 15000, n).round(0),
"Experience": np.random.randint(0, 35, n),
"Department": np.random.choice(["IT", "HR", "Finance", "Marketing"], n),
"Gender": np.random.choice(["Male", "Female"], n),
"Rating": np.random.choice([1, 2, 3, 4, 5], n)
})
# Introduce some missing values
df.loc[5:10, "Salary"] = np.nan
df.loc[15:18, "Age"] = np.nan
Step 1 – Understand the Data Structure
# Basic structure
print("Shape:", df.shape) # (200, 6)
print("Columns:", df.columns.tolist())
print("\nData Types:\n", df.dtypes)
print("\nFirst 5 rows:\n", df.head())
print("\nInfo:\n")
df.info()
Key questions at this stage:
- How many rows and columns does the dataset have?
- What does each column represent?
- Are the data types correct for each column?
- Which columns are numeric and which are categorical?
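The last two questions can also be checked programmatically. The following is a minimal sketch, assuming a small stand-in DataFrame (`demo` is hypothetical and only mirrors a few column names from the employee dataset):

```python
import pandas as pd

# Hypothetical stand-in frame; column names mirror the employee dataset.
demo = pd.DataFrame({
    "Age": [25.0, 31.0, 47.0],
    "Salary": [52000.0, 61000.0, 75000.0],
    "Department": ["IT", "HR", "IT"],
})

# Split columns by kind so later steps can treat them differently.
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

Columns that land in the wrong list (for example, a numeric column read as text) are a sign that a data type needs correcting before the analysis continues.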
Step 2 – Summary Statistics
# Numeric columns
print(df.describe())
# Categorical columns
print(df.describe(include="object"))
What to look for in describe():
| Statistic | What It Tells | Warning Sign |
|---|---|---|
| count | Number of non-missing values | count < total rows → missing values exist |
| mean | Average value | Very different from median → skew or outliers likely |
| std | Spread of values | Very high std → data is widely spread |
| min / max | Smallest and largest values | Unexpected extremes → data entry errors |
| 25% / 75% | First and third quartiles | Median much closer to one quartile than the other → skewed distribution |
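Two of these warning signs can be turned into quick automated checks. The sketch below is illustrative only: `describe_flags` is a hypothetical helper, the data is made up, and the 10% threshold for the mean/median gap is an arbitrary assumption rather than a standard cut-off.

```python
import pandas as pd

def describe_flags(frame: pd.DataFrame, gap_threshold: float = 0.10) -> pd.DataFrame:
    """Flag numeric columns with missing values or a large mean/median gap.

    gap_threshold is an arbitrary illustrative cut-off, not a standard value.
    """
    stats = frame.describe().T  # one row per numeric column
    stats["has_missing"] = stats["count"] < len(frame)
    # Relative gap between the mean and the median (the 50% row of describe()).
    gap = (stats["mean"] - stats["50%"]).abs() / stats["50%"].abs()
    stats["mean_median_gap"] = gap > gap_threshold
    return stats[["count", "mean", "50%", "has_missing", "mean_median_gap"]]

# Hypothetical data: one missing age and one extreme salary.
demo = pd.DataFrame({
    "Age": [25, 30, 35, None, 45],
    "Salary": [50000, 52000, 54000, 56000, 500000],
})
flags = describe_flags(demo)
print(flags)
```

A flagged column is a prompt to look closer, for instance with a histogram, not proof of a problem. The relative gap is undefined when the median is zero, so such columns need a different check.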
Step 3 – Check Data Quality
# Missing values summary
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_report = pd.DataFrame({
"Missing Count": missing,
"Missing %": missing_pct
})
print(missing_report[missing_report["Missing Count"] > 0])
# Duplicates
print("\nDuplicate rows:", df.duplicated().sum())
# Unique values per categorical column
for col in df.select_dtypes(include="object").columns:
    print(f"\n{col} – {df[col].nunique()} unique values:")
    print(df[col].value_counts())
Step 4 – Univariate Analysis
Univariate analysis examines one column at a time. The goal is to understand the distribution and frequency of each individual feature.
For Numeric Columns – Distribution Analysis
# Histogram of Salary distribution
plt.figure(figsize=(8, 4))
plt.hist(df["Salary"].dropna(), bins=20, edgecolor="black", color="steelblue")
plt.title("Salary Distribution")
plt.xlabel("Salary (₹)")
plt.ylabel("Number of Employees")
plt.tight_layout()
plt.savefig("salary_distribution.png")
plt.show()
Distribution Shape Guide
Normal (Bell Curve):        Right-Skewed:           Left-Skewed:
     ▐█▌                    █                                    █
    ▐███▌                   ██                                  ██
   ▐█████▌                  ████                              ████
  ▐███████▌                 ██████                          ██████
──────────────              ──────────────          ──────────────
Mean≈Median≈Mode            Mean > Median           Mean < Median
(Adult heights)             (House prices)          (Age at death)
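The mean/median relationships in the guide can be checked on synthetic data. This is a small sketch using made-up samples (the distributions and their parameters are assumptions chosen only to produce each shape):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic samples, for illustration only.
symmetric = rng.normal(loc=50, scale=10, size=10_000)
right_skewed = rng.lognormal(mean=3.0, sigma=0.8, size=10_000)
left_skewed = -right_skewed  # mirroring a sample flips the direction of its skew

samples = [
    ("symmetric", symmetric),
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
]
for name, sample in samples:
    print(f"{name:>12}: mean={sample.mean():.2f}  median={np.median(sample):.2f}")
```

In the right-skewed sample the long tail pulls the mean above the median. In the mirrored sample the tail pulls it below.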
# Measure skewness and kurtosis
print("Salary Skewness:", round(df["Salary"].skew(), 3))
# Positive → right-skewed, Negative → left-skewed, Near 0 → roughly symmetric
print("Salary Kurtosis:", round(df["Salary"].kurt(), 3))
# pandas reports excess kurtosis (a normal distribution scores about 0);
# high values → heavy tails (more extreme values)
For Categorical Columns – Frequency Analysis
# Count plot for Department
dept_counts = df["Department"].value_counts()
print(dept_counts)
plt.figure(figsize=(7, 4))
dept_counts.plot(kind="bar", color="coral", edgecolor="black")
plt.title("Employee Count by Department")
plt.xlabel("Department")
plt.ylabel("Count")
plt.xticks(rotation=0)
plt.tight_layout()
plt.savefig("dept_distribution.png")
plt.show()
Step 5 – Bivariate Analysis
Bivariate analysis examines the relationship between two columns. It answers: "Do these two variables move together?" An association found here does not by itself show that one variable causes the other.
Numeric vs Numeric – Scatter Plot and Correlation
# Scatter plot: Age vs Salary
plt.figure(figsize=(7, 5))
plt.scatter(df["Age"], df["Salary"], alpha=0.5, color="teal")
plt.title("Age vs Salary")
plt.xlabel("Age")
plt.ylabel("Salary (₹)")
plt.tight_layout()
plt.savefig("age_vs_salary.png")
plt.show()
# Correlation coefficient
correlation = df["Age"].corr(df["Salary"])
print(f"Correlation (Age vs Salary): {correlation:.3f}")
Correlation Interpretation Guide
+-------------------+------------------------------+
| Correlation Value | Meaning                      |
+-------------------+------------------------------+
| 0.9 to 1.0        | Very strong positive         |
| 0.7 to 0.9        | Strong positive              |
| 0.4 to 0.7        | Moderate positive            |
| 0.1 to 0.4        | Weak positive                |
| -0.1 to 0.1       | No meaningful relationship   |
| -0.4 to -0.1      | Weak negative                |
| -0.7 to -0.4      | Moderate negative            |
| -1.0 to -0.7      | Strong negative              |
+-------------------+------------------------------+
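The guide can be applied in code with a small lookup function. This is a sketch, not a standard: `describe_correlation` is a hypothetical helper, and because neighbouring bands in the guide share their endpoints, assigning each boundary value to the stronger band is an assumption made here.

```python
def describe_correlation(r: float) -> str:
    """Map a correlation coefficient to the verbal label used in the guide.

    Boundary values (e.g. exactly 0.7) are assigned to the stronger band;
    that choice is an assumption, since the guide's bands share endpoints.
    """
    if not -1.0 <= r <= 1.0:  # also rejects NaN
        raise ValueError("correlation must lie between -1 and 1")
    size = abs(r)
    if size < 0.1:
        return "No meaningful relationship"
    direction = "positive" if r > 0 else "negative"
    if size >= 0.9 and r > 0:
        strength = "Very strong"  # the guide defines this band on the positive side only
    elif size >= 0.7:
        strength = "Strong"
    elif size >= 0.4:
        strength = "Moderate"
    else:
        strength = "Weak"
    return f"{strength} {direction}"

for value in (0.95, 0.82, -0.5, 0.05):
    print(value, "→", describe_correlation(value))
```

These bands are rules of thumb. What counts as a "strong" correlation varies by field, and the coefficient only captures linear relationships.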
Numeric vs Categorical – Box Plot
# Salary distribution by Department
# df.boxplot(by=...) creates its own figure, so a separate plt.figure() call would only leave an empty window
df.boxplot(column="Salary", by="Department", figsize=(8, 5))
plt.title("Salary by Department")
plt.suptitle("")
plt.xlabel("Department")
plt.ylabel("Salary (₹)")
plt.tight_layout()
plt.savefig("salary_by_dept.png")
plt.show()
Box Plot Anatomy
         ○           ← Outlier (above Q3 + 1.5×IQR)
       ──┬──         ← Upper whisker (largest value within Q3 + 1.5×IQR)
         │
    ┌────┴────┐      ← Q3 (75th percentile)
    │         │
    ├─────────┤      ← Median (50th percentile)
    │         │
    └────┬────┘      ← Q1 (25th percentile)
         │
       ──┴──         ← Lower whisker (smallest value within Q1 - 1.5×IQR)
         ○           ← Outlier (below Q1 - 1.5×IQR)
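The quantities in the diagram can be computed by hand. The sketch below uses a hypothetical helper, `box_plot_stats`, and a made-up list of values. It assumes the common 1.5×IQR rule, which is also matplotlib's default whisker setting.

```python
import numpy as np

def box_plot_stats(values):
    """Return the numbers a box plot draws, using the 1.5×IQR rule."""
    data = np.asarray(values, dtype=float)
    data = data[~np.isnan(data)]  # ignore missing values
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = data[(data >= lower_fence) & (data <= upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        # Whiskers end at the most extreme data points that are still inside the fences.
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": data[(data < lower_fence) | (data > upper_fence)].tolist(),
    }

stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
for key, value in stats.items():
    print(f"{key}: {value}")
```

The fences are thresholds, not drawn lines. The whiskers stop at actual data values, so here the upper whisker ends at 9 even though the upper fence sits at 14.5.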
Categorical vs Categorical – Cross Tabulation
# How is gender distributed across departments?
crosstab = pd.crosstab(df["Department"], df["Gender"])
print(crosstab)
# With percentages
crosstab_pct = pd.crosstab(df["Department"], df["Gender"], normalize="index") * 100
print(crosstab_pct.round(1))
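The `normalize` argument changes which question the percentages answer. A toy example, small enough to check by hand, makes the difference visible (`toy` is hypothetical data, not the employee dataset):

```python
import pandas as pd

# Hypothetical toy data: 4 IT employees (3 male, 1 female), 2 HR employees (1 each).
toy = pd.DataFrame({
    "Department": ["IT", "IT", "IT", "IT", "HR", "HR"],
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Male"],
})

counts = pd.crosstab(toy["Department"], toy["Gender"])
# normalize="index": share of each gender WITHIN a department (rows sum to 100).
row_pct = pd.crosstab(toy["Department"], toy["Gender"], normalize="index") * 100
# normalize="columns": share of each department WITHIN a gender (columns sum to 100).
col_pct = pd.crosstab(toy["Department"], toy["Gender"], normalize="columns") * 100

print(counts)
print(row_pct.round(1))
print(col_pct.round(1))
```

Row percentages suit questions such as "what fraction of IT is female?", while column percentages suit "what fraction of women work in IT?".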
Step 6 – Multivariate Analysis
Multivariate analysis examines three or more variables simultaneously to find hidden patterns.
Correlation Heatmap
# Correlation matrix for all numeric columns
corr_matrix = df[["Age", "Salary", "Experience", "Rating"]].corr()
plt.figure(figsize=(6, 5))
sns.heatmap(
corr_matrix,
annot=True,
cmap="coolwarm",
fmt=".2f",
center=0
)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.savefig("correlation_heatmap.png")
plt.show()
Reading the Heatmap
Illustrative heatmap grid (made-up values; the sample dataset's columns are generated independently, so its real correlations will be close to 0):
             Age    Salary  Exp    Rating
Age        [ 1.00   0.45    0.82   0.12 ]
Salary     [ 0.45   1.00    0.61   0.08 ]
Experience [ 0.82   0.61    1.00   0.05 ]
Rating     [ 0.12   0.08    0.05   1.00 ]
Colour Guide:
Dark Red (1.0)   → Perfect positive correlation
White (0.0)      → No correlation
Dark Blue (-1.0) → Perfect negative correlation
→ Age and Experience have a strong positive correlation (0.82)
→ In this illustration, older employees tend to have more experience
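With many columns, reading every cell of a heatmap becomes tedious, so it can help to list only the strongest pairs. The following is a sketch under stated assumptions: `top_correlations` is a hypothetical helper, the 0.4 threshold is arbitrary, and the demo data is synthetic, with Experience deliberately derived from Age so that one strong pair exists.

```python
import numpy as np
import pandas as pd

def top_correlations(frame: pd.DataFrame, min_abs: float = 0.4) -> pd.Series:
    """List column pairs whose |correlation| is at least min_abs, strongest first."""
    corr = frame.corr(numeric_only=True)
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= min_abs]  # also discards any masked (NaN) entries
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Synthetic demo data (an assumption for illustration).
rng = np.random.default_rng(1)
age = rng.integers(22, 60, size=300)
demo = pd.DataFrame({
    "Age": age,
    "Experience": age - 22 + rng.integers(0, 3, size=300),  # tied to Age by construction
    "Rating": rng.integers(1, 6, size=300),                 # generated independently of Age
})

strong_pairs = top_correlations(demo)
print(strong_pairs)
```

The result is indexed by column pair, which makes it easy to paste into a findings document.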
Pair Plot – All Numeric Columns at Once
# Pair plot shows scatter plots and distributions together
sns.pairplot(
df[["Age", "Salary", "Experience"]],
diag_kind="kde",
plot_kws={"alpha": 0.5}
)
plt.suptitle("Pair Plot – Employee Data", y=1.02)
plt.savefig("pair_plot.png", bbox_inches="tight")  # keeps the raised title inside the saved image
plt.show()
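Plots are not the only way to look at three variables together. A pivot table summarises one numeric column across two categorical ones. A minimal sketch with hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data, small enough to verify by hand.
toy = pd.DataFrame({
    "Department": ["IT", "IT", "IT", "HR", "HR", "HR"],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Salary": [70000, 72000, 68000, 50000, 54000, 61000],
})

# Mean salary for every Department × Gender combination.
mean_salary = toy.pivot_table(
    values="Salary",
    index="Department",
    columns="Gender",
    aggfunc="mean",
)
print(mean_salary)
```

Each cell rests on only the rows in that group, so it is worth checking group sizes (for example with `pd.crosstab`) before reading much into a difference.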
Step 3 Revisited – Outlier Visualisation
# Box plot to spot outliers visually
plt.figure(figsize=(10, 4))
df[["Age", "Experience", "Rating"]].boxplot()
plt.title("Box Plots – Outlier Detection")
plt.ylabel("Value")
plt.tight_layout()
plt.savefig("boxplot_outliers.png")
plt.show()
# Count Salary outliers with the 1.5×IQR rule
Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
is_outlier = (df["Salary"] < lower_fence) | (df["Salary"] > upper_fence)
print(f"Outliers in Salary: {is_outlier.sum()}")
EDA Questions Checklist
| Question | Tool |
|---|---|
| How many rows and columns? | df.shape |
| Any missing values? | df.isnull().sum() |
| Any duplicate rows? | df.duplicated().sum() |
| What is the distribution of numeric columns? | Histogram, describe() |
| What is the frequency of categories? | value_counts(), bar chart |
| Are there outliers? | Box plot, IQR method |
| Which numeric columns correlate with each other? | corr(), heatmap |
| Does salary differ across departments? | Box plot by category |
| Is gender distributed evenly across departments? | Crosstab, stacked bar chart |
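Several checklist answers can be gathered in one pass, which also gives a starting point for recording findings. The sketch below is illustrative: `eda_summary` is a hypothetical helper, the demo data is made up, and outliers are counted with the 1.5×IQR rule.

```python
import pandas as pd

def eda_summary(frame: pd.DataFrame) -> dict:
    """Collect a few checklist answers in one dictionary (hypothetical helper)."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Compare every numeric column against its own fences; NaN never counts as an outlier.
    below = numeric.lt(q1 - 1.5 * iqr, axis="columns")
    above = numeric.gt(q3 + 1.5 * iqr, axis="columns")
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "missing_per_column": frame.isnull().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "outliers_per_column": (below | above).sum().to_dict(),
    }

# Made-up data: one missing age, one repeated row, one extreme salary.
demo = pd.DataFrame({
    "Age": [24, 30, 36, 42, 42, None],
    "Salary": [50000, 52000, 53000, 54000, 54000, 400000],
    "Department": ["IT", "HR", "HR", "IT", "IT", "IT"],
})
summary = eda_summary(demo)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The dictionary can be written to a file or pasted into a report so that the team sees the same numbers the analyst saw.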
Summary
- EDA is the first analytical step — it reveals what the data contains before any modelling begins
- Summary statistics describe central tendency, spread, and shape of each column
- Univariate analysis focuses on one column using histograms and frequency counts
- Bivariate analysis examines relationships between pairs of columns using scatter plots, box plots, and cross-tabulations
- Correlation measures the strength and direction of linear relationships between numeric columns
- Heatmaps and pair plots reveal patterns across multiple columns simultaneously
- Box plots identify outliers visually and confirm statistical outlier detection
