Pandas for Data Manipulation
Pandas is the most important data manipulation library in Python. It provides two powerful data structures — Series and DataFrame — that make loading, exploring, filtering, and transforming tabular data fast and intuitive. Almost every real-world data science project relies on Pandas for handling raw data before any analysis or modelling begins.
The Two Core Pandas Structures
Series – A Single Column of Data
A Series is a one-dimensional array with labels (called an index). Think of it as a single column from a spreadsheet.
import pandas as pd
# Create a Series of city populations (in millions)
population = pd.Series(
[21.0, 20.7, 13.2, 8.5, 7.1],
index=["Delhi", "Mumbai", "Kolkata", "Chennai", "Bengaluru"]
)
print(population)
print("\nDelhi population:", population["Delhi"])
print("Cities over 10M:", population[population > 10])
Output:
Delhi        21.0
Mumbai       20.7
Kolkata      13.2
Chennai       8.5
Bengaluru     7.1
dtype: float64

Delhi population: 21.0
Cities over 10M: Delhi      21.0
Mumbai     20.7
Kolkata    13.2
dtype: float64
DataFrame – A Full Table of Data
A DataFrame is a two-dimensional table with rows and columns — just like an Excel spreadsheet or a SQL table. Each column is a Series.
# Create a DataFrame from a dictionary
students = pd.DataFrame({
"Name": ["Amit", "Priya", "Rohan", "Sneha", "Karan"],
"Score": [85, 92, 73, 88, 61],
"Grade": ["B", "A", "C", "B", "D"],
"Passed": [True, True, True, True, False]
})
print(students)
Output:
Name Score Grade Passed
0 Amit 85 B True
1 Priya 92 A True
2 Rohan 73 C True
3 Sneha 88 B True
4 Karan 61 D False
Diagram – DataFrame Structure
Name Score Grade Passed
Index 0: Amit 85 B True ← Row 0
Index 1: Priya 92 A True ← Row 1
Index 2: Rohan 73 C True ← Row 2
↑ ↑ ↑ ↑
Column Column Column Column
0 1 2 3
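Because each DataFrame column is itself a Series, pulling a column out of the `students` table above hands back a labelled Series with all the Series operations available. A quick sketch:

```python
import pandas as pd

students = pd.DataFrame({
    "Name": ["Amit", "Priya", "Rohan", "Sneha", "Karan"],
    "Score": [85, 92, 73, 88, 61],
})

# Extracting one column returns a Series that shares the DataFrame's index
scores = students["Score"]
print(type(scores).__name__)   # Series
print(scores.max())            # 92
```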
Loading Data from Files
Real datasets come from CSV files, Excel sheets, databases, or JSON APIs. Pandas reads all of these in one line.
# Load a CSV file
df = pd.read_csv("sales_data.csv")
# Load an Excel file
df = pd.read_excel("sales_data.xlsx", sheet_name="Sheet1")
# Load from a URL directly
url = "https://raw.githubusercontent.com/dataset/sample.csv"
df = pd.read_csv(url)
# First look at the data
print(df.head(5)) # First 5 rows
print(df.tail(3)) # Last 3 rows
print(df.shape) # (rows, columns)
print(df.columns) # Column names
print(df.dtypes) # Data type of each column
Exploring a DataFrame
After loading data, the first step is always to understand its structure, size, and quality.
# Using a sample sales dataset
import pandas as pd
df = pd.DataFrame({
"Month": ["Jan","Feb","Mar","Apr","May","Jun"],
"Sales": [45000, 52000, 48000, 61000, 57000, 63000],
"Units": [450, 520, 480, 610, 570, 630],
"Region": ["North","South","North","West","South","West"]
})
df.info()                # Prints column names, types, non-null counts (returns None)
print("\n")
print(df.describe()) # Summary statistics for numeric columns
Output (describe):
              Sales   Units
count          6.00    6.00
mean       54333.33  543.33
std         7201.85   72.02
min        45000.00  450.00
25%        49000.00  490.00
50%        54500.00  545.00
75%        60000.00  600.00
max        63000.00  630.00
Selecting Columns and Rows
Selecting Columns
# Select a single column (returns a Series)
print(df["Sales"])

# Select multiple columns (returns a DataFrame)
print(df[["Month", "Sales", "Region"]])
Selecting Rows with iloc and loc
# iloc – select by position (like array indexing)
print(df.iloc[0])        # First row
print(df.iloc[1:4])      # Rows 1, 2, 3
print(df.iloc[0, 1])     # Row 0, Column 1 (Sales = 45000)

# loc – select by label
print(df.loc[0, "Month"])               # "Jan"
print(df.loc[2:4, ["Month", "Sales"]])  # Rows 2-4 (loc is end-inclusive), two columns
Diagram – iloc vs loc
DataFrame:

Index | Month | Sales | Units | Region
  0   | Jan   | 45000 |  450  | North   ← iloc[0] selects this row
  1   | Feb   | 52000 |  520  | South
  2   | Mar   | 48000 |  480  | North
  3   | Apr   | 61000 |  610  | West
  4   | May   | 57000 |  570  | South
  5   | Jun   | 63000 |  630  | West

iloc[0, 1]      → row position 0, column position 1 → 45000
loc[0, "Sales"] → row label 0, column name "Sales"  → 45000
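The distinction matters most when the index is not the default 0..n-1. A small sketch with string labels (hypothetical month labels, not the dataset above) shows where the two diverge:

```python
import pandas as pd

df2 = pd.DataFrame(
    {"Sales": [45000, 52000, 48000]},
    index=["jan", "feb", "mar"]
)

# iloc always counts positions; loc looks up labels
print(df2.iloc[0]["Sales"])      # 45000 – first row by position
print(df2.loc["feb", "Sales"])   # 52000 – row found by its label
```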
Filtering Data with Conditions
# Months where Sales exceeded 55,000
high_sales = df[df["Sales"] > 55000]
print(high_sales)

# Months in the South region with Sales above 50,000
south_high = df[(df["Region"] == "South") & (df["Sales"] > 50000)]
print(south_high)

# Months NOT in the North region
not_north = df[df["Region"] != "North"]
print(not_north)

# Filter using isin – multiple values
selected = df[df["Region"].isin(["North", "West"])]
print(selected)
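The same boolean filters can also be written with `DataFrame.query`, which some find more readable for compound conditions. A sketch, rebuilding the sales DataFrame so the block stands alone:

```python
import pandas as pd

df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "Sales": [45000, 52000, 48000, 61000, 57000, 63000],
    "Region": ["North", "South", "North", "West", "South", "West"],
})

# Equivalent to df[(df["Region"] == "South") & (df["Sales"] > 50000)]
south_high = df.query("Region == 'South' and Sales > 50000")
print(south_high["Month"].tolist())   # ['Feb', 'May']
```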
Adding and Modifying Columns
# Add a new column: Revenue per Unit
df["RevenuePerUnit"] = df["Sales"] / df["Units"]
# Add a category column using conditions
df["Performance"] = df["Sales"].apply(
lambda x: "High" if x >= 57000 else "Low"
)
print(df[["Month", "Sales", "RevenuePerUnit", "Performance"]])
Output:
  Month  Sales  RevenuePerUnit Performance
0   Jan  45000           100.0         Low
1   Feb  52000           100.0         Low
2   Mar  48000           100.0         Low
3   Apr  61000           100.0        High
4   May  57000           100.0        High
5   Jun  63000           100.0        High
Groupby – Aggregate by Category
Groupby splits a DataFrame into groups based on a column value, applies a function to each group, and combines the results.
# Total and average sales by region
region_summary = df.groupby("Region")["Sales"].agg(
Total="sum",
Average="mean",
Count="count"
).reset_index()
print(region_summary)
Output:
  Region   Total  Average  Count
0  North   93000  46500.0      2
1  South  109000  54500.0      2
2   West  124000  62000.0      2
Diagram – Groupby Process
Original DataFrame:

Month | Sales | Region
Jan   | 45000 | North ──┐
Feb   | 52000 | South ──┤
Mar   | 48000 | North ──┼──→ Split by Region
Apr   | 61000 | West  ──┤
May   | 57000 | South ──┤
Jun   | 63000 | West  ──┘

Groups:
North: [45000, 48000] → sum=93000,  mean=46500
South: [52000, 57000] → sum=109000, mean=54500
West:  [61000, 63000] → sum=124000, mean=62000

Combine: one row per group in the result
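`agg` is not limited to a single column: passing a dictionary applies a different aggregate to each column in one call. A sketch on the same sales data, rebuilt so the block is self-contained:

```python
import pandas as pd

df = pd.DataFrame({
    "Sales": [45000, 52000, 48000, 61000, 57000, 63000],
    "Units": [450, 520, 480, 610, 570, 630],
    "Region": ["North", "South", "North", "West", "South", "West"],
})

# One aggregate per column: total Sales, average Units per region
summary = df.groupby("Region").agg({"Sales": "sum", "Units": "mean"})
print(summary.loc["North", "Sales"])   # 93000
print(summary.loc["West", "Units"])    # 620.0
```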
Sorting Data
# Sort by Sales descending (highest first)
sorted_df = df.sort_values("Sales", ascending=False)
print(sorted_df[["Month", "Sales"]])
# Sort by multiple columns
multi_sorted = df.sort_values(
["Region", "Sales"],
ascending=[True, False]
)
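For the common "top N rows" case, `nlargest` combines the sort and the slice in one call. A sketch, rebuilding the sales columns it needs:

```python
import pandas as pd

df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "Sales": [45000, 52000, 48000, 61000, 57000, 63000],
})

# Top 2 months by Sales – equivalent to sort_values("Sales", ascending=False).head(2)
top2 = df.nlargest(2, "Sales")
print(top2["Month"].tolist())   # ['Jun', 'Apr']
```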
Handling Missing Values in Pandas
# Create a DataFrame with missing values
data = pd.DataFrame({
"Name": ["Alice", "Bob", None, "Dave"],
"Score": [88, None, 72, 90],
"City": ["Delhi", "Mumbai", "Pune", None]
})
# Detect missing values
print(data.isnull()) # Boolean table: True where missing
print(data.isnull().sum()) # Count of missing per column
# Drop rows with any missing value
clean = data.dropna()
# Fill missing values with a default (plain assignment avoids
# chained-indexing warnings that inplace=True on a column can trigger)
data["Score"] = data["Score"].fillna(data["Score"].mean())
data["City"] = data["City"].fillna("Unknown")
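Per-column defaults can also be supplied in a single `fillna` call by passing a dictionary, which replaces the column-by-column assignments. A sketch on the same small table:

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dave"],
    "Score": [88, None, 72, 90],
    "City": ["Delhi", "Mumbai", "Pune", None],
})

# One call: column mean for Score, a placeholder string for City
filled = data.fillna({"Score": data["Score"].mean(), "City": "Unknown"})
print(filled.isnull().sum().sum())   # 1 – only the Name column still has a gap
```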
Merging DataFrames
Merging combines two DataFrames based on a shared column — similar to SQL JOIN.
# Customers table
customers = pd.DataFrame({
"CustomerID": [1, 2, 3],
"Name": ["Amit", "Priya", "Rohan"]
})
# Orders table
orders = pd.DataFrame({
"OrderID": [101, 102, 103, 104],
"CustomerID": [1, 2, 1, 3],
"Amount": [500, 750, 300, 900]
})
# Merge on CustomerID
merged = pd.merge(orders, customers, on="CustomerID", how="left")
print(merged)
Output:
   OrderID  CustomerID  Amount   Name
0      101           1     500   Amit
1      102           2     750  Priya
2      103           1     300   Amit
3      104           3     900  Rohan
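As in SQL, the `how` parameter decides which rows survive when a key has no match. A sketch with an order whose customer is missing (CustomerID 4 is a made-up value for illustration):

```python
import pandas as pd

customers = pd.DataFrame({"CustomerID": [1, 2, 3],
                          "Name": ["Amit", "Priya", "Rohan"]})
orders = pd.DataFrame({"OrderID": [101, 105],
                       "CustomerID": [1, 4],   # 4 has no customer record
                       "Amount": [500, 250]})

inner = pd.merge(orders, customers, on="CustomerID", how="inner")
left = pd.merge(orders, customers, on="CustomerID", how="left")
print(len(inner))   # 1 – the unmatched order is dropped
print(len(left))    # 2 – the unmatched order is kept, its Name becomes NaN
```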
Saving Data to Files
# Save to CSV (no row index)
df.to_csv("cleaned_sales.csv", index=False)
# Save to Excel
df.to_excel("cleaned_sales.xlsx", index=False, sheet_name="Sales")
# Save to JSON
df.to_json("sales.json", orient="records")
Key Pandas Functions – Quick Reference
| Function | Purpose |
|---|---|
| pd.read_csv() | Load a CSV file into a DataFrame |
| df.head() / df.tail() | Preview first or last rows |
| df.info() | Column names, types, and null counts |
| df.describe() | Summary statistics for numeric columns |
| df["col"] | Select a single column |
| df[condition] | Filter rows by a condition |
| df.loc[] / df.iloc[] | Select rows/cols by label or position |
| df.groupby() | Aggregate data by group |
| df.sort_values() | Sort rows by one or more columns |
| df.merge() | Combine two DataFrames on a key column |
| df.isnull() | Detect missing values |
| df.dropna() | Remove rows with missing values |
| df.fillna() | Replace missing values with a default |
| df.to_csv() | Export DataFrame to a CSV file |
Summary
- A Series stores a single column of labelled data; a DataFrame stores a full table
- Pandas reads CSV, Excel, JSON, and SQL data in one line of code
- iloc selects by numeric position; loc selects by label name
- Boolean filtering selects only rows that meet a condition
- groupby splits data by category and applies aggregate functions
- merge combines two DataFrames like an SQL JOIN
- Missing value handling with isnull, dropna, and fillna keeps data clean
