Pandas for Data Manipulation
Pandas is the most important data manipulation library in Python. It provides two powerful data structures — Series and DataFrame — that make loading, exploring, filtering, and transforming tabular data fast and intuitive. Almost every real-world data science project relies on Pandas for handling raw data before any analysis or modelling begins.
The Two Core Pandas Structures
Series – A Single Column of Data
A Series is a one-dimensional array with labels (called an index). Think of it as a single column from a spreadsheet.
import pandas as pd
# Create a Series of city populations (in millions)
population = pd.Series(
[21.0, 20.7, 13.2, 8.5, 7.1],
index=["Delhi", "Mumbai", "Kolkata", "Chennai", "Bengaluru"]
)
print(population)
print("\nDelhi population:", population["Delhi"])
print("Cities over 10M:", population[population > 10])
Output:
Delhi        21.0
Mumbai       20.7
Kolkata      13.2
Chennai       8.5
Bengaluru     7.1
dtype: float64

Delhi population: 21.0
Cities over 10M: Delhi      21.0
Mumbai     20.7
Kolkata    13.2
dtype: float64
DataFrame – A Full Table of Data
A DataFrame is a two-dimensional table with rows and columns — just like an Excel spreadsheet or a SQL table. Each column is a Series.
# Create a DataFrame from a dictionary
students = pd.DataFrame({
"Name": ["Amit", "Priya", "Rohan", "Sneha", "Karan"],
"Score": [85, 92, 73, 88, 61],
"Grade": ["B", "A", "C", "B", "D"],
"Passed": [True, True, True, True, False]
})
print(students)
Output:
Name Score Grade Passed
0 Amit 85 B True
1 Priya 92 A True
2 Rohan 73 C True
3 Sneha 88 B True
4 Karan 61 D False
Diagram – DataFrame Structure
Name Score Grade Passed
Index 0: Amit 85 B True ← Row 0
Index 1: Priya 92 A True ← Row 1
Index 2: Rohan 73 C True ← Row 2
↑ ↑ ↑ ↑
Column Column Column Column
0 1 2 3
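Because each DataFrame column is itself a Series, pulling a column out of the `students` table above hands back a labelled Series with all the Series operations available. A quick sketch:

```python
import pandas as pd

students = pd.DataFrame({
    "Name": ["Amit", "Priya", "Rohan", "Sneha", "Karan"],
    "Score": [85, 92, 73, 88, 61],
})

# Extracting one column returns a Series that shares the DataFrame's index
scores = students["Score"]
print(type(scores).__name__)   # Series
print(scores.max())            # 92
```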
Loading Data from Files
Real datasets come from CSV files, Excel sheets, databases, or JSON APIs. Pandas reads all of these in one line.
# Load a CSV file
df = pd.read_csv("sales_data.csv")
# Load an Excel file
df = pd.read_excel("sales_data.xlsx", sheet_name="Sheet1")
# Load from a URL directly
url = "https://raw.githubusercontent.com/dataset/sample.csv"
df = pd.read_csv(url)
# First look at the data
print(df.head(5)) # First 5 rows
print(df.tail(3)) # Last 3 rows
print(df.shape) # (rows, columns)
print(df.columns) # Column names
print(df.dtypes) # Data type of each column
Exploring a DataFrame
After loading data, the first step is always to understand its structure, size, and quality.
# Using a sample sales dataset
import pandas as pd
df = pd.DataFrame({
"Month": ["Jan","Feb","Mar","Apr","May","Jun"],
"Sales": [45000, 52000, 48000, 61000, 57000, 63000],
"Units": [450, 520, 480, 610, 570, 630],
"Region": ["North","South","North","West","South","West"]
})
df.info()                # Prints column names, types, non-null counts (returns None)
print("\n")
print(df.describe()) # Summary statistics for numeric columns
Output (describe):
              Sales   Units
count          6.00    6.00
mean       54333.33  543.33
std         7201.85   72.02
min        45000.00  450.00
25%        49000.00  490.00
50%        54500.00  545.00
75%        60000.00  600.00
max        63000.00  630.00
Selecting Columns and Rows
Selecting Columns
# Select a single column (returns a Series)
print(df["Sales"])

# Select multiple columns (returns a DataFrame)
print(df[["Month", "Sales", "Region"]])
Selecting Rows with iloc and loc
# iloc – select by position (like array indexing)
print(df.iloc[0])        # First row
print(df.iloc[1:4])      # Rows 1, 2, 3
print(df.iloc[0, 1])     # Row 0, Column 1 (Sales = 45000)

# loc – select by label
print(df.loc[0, "Month"])               # "Jan"
print(df.loc[2:4, ["Month", "Sales"]])  # Rows 2-4 (loc is end-inclusive), two columns
Diagram – iloc vs loc
DataFrame:

Index | Month | Sales | Units | Region
  0   | Jan   | 45000 |  450  | North   ← iloc[0] selects this row
  1   | Feb   | 52000 |  520  | South
  2   | Mar   | 48000 |  480  | North
  3   | Apr   | 61000 |  610  | West
  4   | May   | 57000 |  570  | South
  5   | Jun   | 63000 |  630  | West

iloc[0, 1]      → row position 0, column position 1 → 45000
loc[0, "Sales"] → row label 0, column name "Sales"  → 45000
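The distinction matters most when the index is not the default 0..n-1. A small sketch with string labels (hypothetical month labels, not the dataset above) shows where the two diverge:

```python
import pandas as pd

df2 = pd.DataFrame(
    {"Sales": [45000, 52000, 48000]},
    index=["jan", "feb", "mar"]
)

# iloc always counts positions; loc looks up labels
print(df2.iloc[0]["Sales"])      # 45000 – first row by position
print(df2.loc["feb", "Sales"])   # 52000 – row found by its label
```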
Filtering Data with Conditions
# Months where Sales exceeded 55,000
high_sales = df[df["Sales"] > 55000]
print(high_sales)

# Months in the South region with Sales above 50,000
south_high = df[(df["Region"] == "South") & (df["Sales"] > 50000)]
print(south_high)

# Months NOT in the North region
not_north = df[df["Region"] != "North"]
print(not_north)

# Filter using isin – multiple values
selected = df[df["Region"].isin(["North", "West"])]
print(selected)
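The same boolean filters can also be written with `DataFrame.query`, which some find more readable for compound conditions. A sketch, rebuilding the sales DataFrame so the block stands alone:

```python
import pandas as pd

df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "Sales": [45000, 52000, 48000, 61000, 57000, 63000],
    "Region": ["North", "South", "North", "West", "South", "West"],
})

# Equivalent to df[(df["Region"] == "South") & (df["Sales"] > 50000)]
south_high = df.query("Region == 'South' and Sales > 50000")
print(south_high["Month"].tolist())   # ['Feb', 'May']
```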
Adding and Modifying Columns
# Add a new column: Revenue per Unit
df["RevenuePerUnit"] = df["Sales"] / df["Units"]
# Add a category column using conditions
df["Performance"] = df["Sales"].apply(
lambda x: "High" if x >= 57000 else "Low"
)
print(df[["Month", "Sales", "RevenuePerUnit", "Performance"]])
Output:
  Month  Sales  RevenuePerUnit Performance
0   Jan  45000           100.0         Low
1   Feb  52000           100.0         Low
2   Mar  48000           100.0         Low
3   Apr  61000           100.0        High
4   May  57000           100.0        High
5   Jun  63000           100.0        High
Groupby – Aggregate by Category
Groupby splits a DataFrame into groups based on a column value, applies a function to each group, and combines the results.
# Total and average sales by region
region_summary = df.groupby("Region")["Sales"].agg(
Total="sum",
Average="mean",
Count="count"
).reset_index()
print(region_summary)
Output:
  Region   Total  Average  Count
0  North   93000  46500.0      2
1  South  109000  54500.0      2
2   West  124000  62000.0      2
Diagram – Groupby Process
Original DataFrame:

Month | Sales | Region
Jan   | 45000 | North ──┐
Feb   | 52000 | South ──┤
Mar   | 48000 | North ──┼──→ Split by Region
Apr   | 61000 | West  ──┤
May   | 57000 | South ──┤
Jun   | 63000 | West  ──┘

Groups:
North: [45000, 48000] → sum=93000,  mean=46500
South: [52000, 57000] → sum=109000, mean=54500
West:  [61000, 63000] → sum=124000, mean=62000

Combine: one row per group in the result
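`agg` is not limited to a single column: passing a dictionary applies a different aggregate to each column in one call. A sketch on the same sales data, rebuilt so the block is self-contained:

```python
import pandas as pd

df = pd.DataFrame({
    "Sales": [45000, 52000, 48000, 61000, 57000, 63000],
    "Units": [450, 520, 480, 610, 570, 630],
    "Region": ["North", "South", "North", "West", "South", "West"],
})

# One aggregate per column: total Sales, average Units per region
summary = df.groupby("Region").agg({"Sales": "sum", "Units": "mean"})
print(summary.loc["North", "Sales"])   # 93000
print(summary.loc["West", "Units"])    # 620.0
```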
Sorting Data
# Sort by Sales descending (highest first)
sorted_df = df.sort_values("Sales", ascending=False)
print(sorted_df[["Month", "Sales"]])
# Sort by multiple columns
multi_sorted = df.sort_values(
["Region", "Sales"],
ascending=[True, False]
)
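For the common "top N rows" case, `nlargest` combines the sort and the slice in one call. A sketch, rebuilding the sales columns it needs:

```python
import pandas as pd

df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "Sales": [45000, 52000, 48000, 61000, 57000, 63000],
})

# Top 2 months by Sales – equivalent to sort_values("Sales", ascending=False).head(2)
top2 = df.nlargest(2, "Sales")
print(top2["Month"].tolist())   # ['Jun', 'Apr']
```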
Handling Missing Values in Pandas
# Create a DataFrame with missing values
data = pd.DataFrame({
"Name": ["Alice", "Bob", None, "Dave"],
"Score": [88, None, 72, 90],
"City": ["Delhi", "Mumbai", "Pune", None]
})
# Detect missing values
print(data.isnull()) # Boolean table: True where missing
print(data.isnull().sum()) # Count of missing per column
# Drop rows with any missing value
clean = data.dropna()
# Fill missing values with a default (plain assignment avoids
# chained-indexing warnings that inplace=True on a column can trigger)
data["Score"] = data["Score"].fillna(data["Score"].mean())
data["City"] = data["City"].fillna("Unknown")
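Per-column defaults can also be supplied in a single `fillna` call by passing a dictionary, which replaces the column-by-column assignments. A sketch on the same small table:

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dave"],
    "Score": [88, None, 72, 90],
    "City": ["Delhi", "Mumbai", "Pune", None],
})

# One call: column mean for Score, a placeholder string for City
filled = data.fillna({"Score": data["Score"].mean(), "City": "Unknown"})
print(filled.isnull().sum().sum())   # 1 – only the Name column still has a gap
```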
Merging DataFrames
Merging combines two DataFrames based on a shared column — similar to SQL JOIN.
# Customers table
customers = pd.DataFrame({
"CustomerID": [1, 2, 3],
"Name": ["Amit", "Priya", "Rohan"]
})
# Orders table
orders = pd.DataFrame({
"OrderID": [101, 102, 103, 104],
"CustomerID": [1, 2, 1, 3],
"Amount": [500, 750, 300, 900]
})
# Merge on CustomerID
merged = pd.merge(orders, customers, on="CustomerID", how="left")
print(merged)
Output:
   OrderID  CustomerID  Amount   Name
0      101           1     500   Amit
1      102           2     750  Priya
2      103           1     300   Amit
3      104           3     900  Rohan
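As in SQL, the `how` parameter decides which rows survive when a key has no match. A sketch with an order whose customer is missing (CustomerID 4 is a made-up value for illustration):

```python
import pandas as pd

customers = pd.DataFrame({"CustomerID": [1, 2, 3],
                          "Name": ["Amit", "Priya", "Rohan"]})
orders = pd.DataFrame({"OrderID": [101, 105],
                       "CustomerID": [1, 4],   # 4 has no customer record
                       "Amount": [500, 250]})

inner = pd.merge(orders, customers, on="CustomerID", how="inner")
left = pd.merge(orders, customers, on="CustomerID", how="left")
print(len(inner))   # 1 – the unmatched order is dropped
print(len(left))    # 2 – the unmatched order is kept, its Name becomes NaN
```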
Saving Data to Files
# Save to CSV (no row index)
df.to_csv("cleaned_sales.csv", index=False)
# Save to Excel
df.to_excel("cleaned_sales.xlsx", index=False, sheet_name="Sales")
# Save to JSON
df.to_json("sales.json", orient="records")
Key Pandas Functions – Quick Reference
| Function | Purpose |
|---|---|
| pd.read_csv() | Load a CSV file into a DataFrame |
| df.head() / df.tail() | Preview first or last rows |
| df.info() | Column names, types, and null counts |
| df.describe() | Summary statistics for numeric columns |
| df["col"] | Select a single column |
| df[condition] | Filter rows by a condition |
| df.loc[] / df.iloc[] | Select rows/cols by label or position |
| df.groupby() | Aggregate data by group |
| df.sort_values() | Sort rows by one or more columns |
| df.merge() | Combine two DataFrames on a key column |
| df.isnull() | Detect missing values |
| df.dropna() | Remove rows with missing values |
| df.fillna() | Replace missing values with a default |
| df.to_csv() | Export DataFrame to a CSV file |
Summary
- A Series stores a single column of labelled data; a DataFrame stores a full table
- Pandas reads CSV, Excel, JSON, and SQL data in one line of code
- iloc selects by numeric position; loc selects by label name
- Boolean filtering selects only rows that meet a condition
- groupby splits data by category and applies aggregate functions
- merge combines two DataFrames like an SQL JOIN
- Missing value handling with isnull, dropna, and fillna keeps data clean
