Pandas for Data Manipulation

Pandas is the most important data manipulation library in Python. It provides two powerful data structures — Series and DataFrame — that make loading, exploring, filtering, and transforming tabular data fast and intuitive. Almost every real-world data science project relies on Pandas for handling raw data before any analysis or modelling begins.

The Two Core Pandas Structures

Series – A Single Column of Data

A Series is a one-dimensional array with labels (called an index). Think of it as a single column from a spreadsheet.

import pandas as pd

# Create a Series of city populations (in millions)
population = pd.Series(
    [21.0, 20.7, 13.2, 8.5, 7.1],
    index=["Delhi", "Mumbai", "Kolkata", "Chennai", "Bengaluru"]
)

print(population)
print("\nDelhi population:", population["Delhi"])
print("Cities over 10M:")
print(population[population > 10])

Output:

Delhi        21.0
Mumbai       20.7
Kolkata      13.2
Chennai       8.5
Bengaluru     7.1
dtype: float64

Delhi population: 21.0
Cities over 10M:
Delhi      21.0
Mumbai     20.7
Kolkata    13.2

DataFrame – A Full Table of Data

A DataFrame is a two-dimensional table with rows and columns — just like an Excel spreadsheet or a SQL table. Each column is a Series.

# Create a DataFrame from a dictionary
students = pd.DataFrame({
    "Name":    ["Amit", "Priya", "Rohan", "Sneha", "Karan"],
    "Score":   [85, 92, 73, 88, 61],
    "Grade":   ["B", "A", "C", "B", "D"],
    "Passed":  [True, True, True, True, False]
})

print(students)

Output:

    Name  Score Grade  Passed
0   Amit     85     B    True
1  Priya     92     A    True
2  Rohan     73     C    True
3  Sneha     88     B    True
4  Karan     61     D   False
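Since each column is itself a Series, the usual Series operations work on it directly. A quick sketch using the same students table:

```python
import pandas as pd

students = pd.DataFrame({
    "Name":    ["Amit", "Priya", "Rohan", "Sneha", "Karan"],
    "Score":   [85, 92, 73, 88, 61],
    "Grade":   ["B", "A", "C", "B", "D"],
    "Passed":  [True, True, True, True, False]
})

# Numeric columns support aggregations directly
print(students["Score"].mean())               # 79.8
print(students["Score"].max())                # 92

# Categorical columns can be tallied
print(students["Grade"].value_counts()["B"])  # 2 students got a B
```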

Diagram – DataFrame Structure

         Name   Score  Grade  Passed
Index 0: Amit     85     B     True    ← Row 0
Index 1: Priya    92     A     True    ← Row 1
Index 2: Rohan    73     C     True    ← Row 2
         ↑        ↑      ↑       ↑
       Column   Column Column  Column
         0       1       2       3

Loading Data from Files

Real datasets come from CSV files, Excel sheets, databases, or JSON APIs. Pandas reads all of these in one line.

# Load a CSV file
df = pd.read_csv("sales_data.csv")

# Load an Excel file
df = pd.read_excel("sales_data.xlsx", sheet_name="Sheet1")

# Load from a URL directly
url = "https://raw.githubusercontent.com/dataset/sample.csv"
df = pd.read_csv(url)

# First look at the data
print(df.head(5))      # First 5 rows
print(df.tail(3))      # Last 3 rows
print(df.shape)        # (rows, columns)
print(df.columns)      # Column names
print(df.dtypes)       # Data type of each column

Exploring a DataFrame

After loading data, the first step is always to understand its structure, size, and quality.

# A small sales dataset to explore
import pandas as pd

df = pd.DataFrame({
    "Month":    ["Jan","Feb","Mar","Apr","May","Jun"],
    "Sales":    [45000, 52000, 48000, 61000, 57000, 63000],
    "Units":    [450, 520, 480, 610, 570, 630],
    "Region":   ["North","South","North","West","South","West"]
})

df.info()               # Column names, types, non-null counts (prints directly)
print("\n")
print(df.describe())    # Summary statistics for numeric columns

Output (describe):

          Sales    Units
count      6.00     6.00
mean   54333.33   543.33
std     7201.85    72.02
min    45000.00   450.00
25%    49000.00   490.00
50%    54500.00   545.00
75%    60000.00   600.00
max    63000.00   630.00
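describe() only covers numeric columns; for text columns like Region, a few other one-liners round out the picture:

```python
import pandas as pd

df = pd.DataFrame({
    "Month":  ["Jan","Feb","Mar","Apr","May","Jun"],
    "Sales":  [45000, 52000, 48000, 61000, 57000, 63000],
    "Units":  [450, 520, 480, 610, 570, 630],
    "Region": ["North","South","North","West","South","West"]
})

print(df["Region"].unique())         # ['North' 'South' 'West']
print(df["Region"].nunique())        # 3 distinct regions
print(df["Region"].value_counts())   # number of rows per region
```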

Selecting Columns and Rows

Selecting Columns

# Select a single column (returns a Series)
print(df["Sales"])

# Select multiple columns (returns a DataFrame)
print(df[["Month", "Sales", "Region"]])

Selecting Rows with iloc and loc

# iloc – select by position (like array indexing)
print(df.iloc[0])         # First row
print(df.iloc[1:4])       # Rows 1, 2, 3
print(df.iloc[0, 1])      # Row 0, Column 1 (Sales = 45000)

# loc – select by label
print(df.loc[0, "Month"])           # "Jan"
print(df.loc[2:4, ["Month","Sales"]]) # Rows 2-4 (loc slices are inclusive)

Diagram – iloc vs loc

DataFrame:
Index | Month | Sales | Units | Region
  0   |  Jan  | 45000 |  450  | North   ← iloc[0] selects this row
  1   |  Feb  | 52000 |  520  | South
  2   |  Mar  | 48000 |  480  | North
  3   |  Apr  | 61000 |  610  | West
  4   |  May  | 57000 |  570  | South
  5   |  Jun  | 63000 |  630  | West

iloc[0, 1]  → row position 0, column position 1 → 45000
loc[0, "Sales"] → row label 0, column name "Sales" → 45000

Filtering Data with Conditions

# Months where Sales exceeded 55,000
high_sales = df[df["Sales"] > 55000]
print(high_sales)

# Months in the South region with Sales above 50,000
south_high = df[(df["Region"] == "South") & (df["Sales"] > 50000)]
print(south_high)

# Months NOT in the North region
not_north = df[df["Region"] != "North"]
print(not_north)

# Filter using isin – multiple values
selected = df[df["Region"].isin(["North", "West"])]
print(selected)
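The same filters can also be written with df.query(), which takes the condition as a string and can be easier to read when several conditions are combined. A sketch, equivalent to the boolean-mask version above:

```python
import pandas as pd

df = pd.DataFrame({
    "Month":  ["Jan","Feb","Mar","Apr","May","Jun"],
    "Sales":  [45000, 52000, 48000, 61000, 57000, 63000],
    "Region": ["North","South","North","West","South","West"]
})

# Same as df[(df["Region"] == "South") & (df["Sales"] > 50000)]
south_high = df.query('Region == "South" and Sales > 50000')
print(south_high["Month"].tolist())   # ['Feb', 'May']
```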

Adding and Modifying Columns

# Add a new column: Revenue per Unit
df["RevenuePerUnit"] = df["Sales"] / df["Units"]

# Add a category column using conditions
df["Performance"] = df["Sales"].apply(
    lambda x: "High" if x >= 57000 else "Low"
)

print(df[["Month", "Sales", "RevenuePerUnit", "Performance"]])

Output:

  Month  Sales  RevenuePerUnit Performance
0   Jan  45000      100.0         Low
1   Feb  52000      100.0         Low
2   Mar  48000      100.0         Low
3   Apr  61000      100.0        High
4   May  57000      100.0        High
5   Jun  63000      100.0        High
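For a simple two-way split like this, NumPy's where does the same job without a Python-level lambda, which is noticeably faster on large frames. A vectorized sketch of the same categorisation:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Month": ["Jan","Feb","Mar","Apr","May","Jun"],
    "Sales": [45000, 52000, 48000, 61000, 57000, 63000],
})

# Vectorized equivalent of the apply/lambda version above
df["Performance"] = np.where(df["Sales"] >= 57000, "High", "Low")
print(df["Performance"].tolist())
# ['Low', 'Low', 'Low', 'High', 'High', 'High']
```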

Groupby – Aggregate by Category

Groupby splits a DataFrame into groups based on a column value, applies a function to each group, and combines the results.

# Total and average sales by region
region_summary = df.groupby("Region")["Sales"].agg(
    Total="sum",
    Average="mean",
    Count="count"
).reset_index()

print(region_summary)

Output:

  Region    Total      Average  Count
0  North    93000     46500.0      2
1  South   109000     54500.0      2
2   West   124000     62000.0      2

Diagram – Groupby Process

Original DataFrame:
Month | Sales  | Region
Jan   | 45000  | North    ──┐
Feb   | 52000  | South    ──┼──→ Split by Region
Mar   | 48000  | North    ──┘
Apr   | 61000  | West     ──┐
May   | 57000  | South    ──┼──→ Split by Region
Jun   | 63000  | West     ──┘

Groups:
North: [45000, 48000] → sum=93000, mean=46500
South: [52000, 57000] → sum=109000, mean=54500
West:  [61000, 63000] → sum=124000, mean=62000

Combine: One row per group in the result
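The same split-apply-combine idea extends to several value columns at once. A sketch that totals both Sales and Units per region:

```python
import pandas as pd

df = pd.DataFrame({
    "Month":  ["Jan","Feb","Mar","Apr","May","Jun"],
    "Sales":  [45000, 52000, 48000, 61000, 57000, 63000],
    "Units":  [450, 520, 480, 610, 570, 630],
    "Region": ["North","South","North","West","South","West"]
})

# Sum every listed column within each region
totals = df.groupby("Region")[["Sales", "Units"]].sum()
print(totals.loc["North", "Sales"])   # 93000
print(totals.loc["West", "Units"])    # 1240
```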

Sorting Data

# Sort by Sales descending (highest first)
sorted_df = df.sort_values("Sales", ascending=False)
print(sorted_df[["Month", "Sales"]])

# Sort by multiple columns
multi_sorted = df.sort_values(
    ["Region", "Sales"],
    ascending=[True, False]
)
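When the sort is only a means to grab the top few rows, nlargest (and its twin nsmallest) is a handy shortcut:

```python
import pandas as pd

df = pd.DataFrame({
    "Month": ["Jan","Feb","Mar","Apr","May","Jun"],
    "Sales": [45000, 52000, 48000, 61000, 57000, 63000],
})

# Top 3 months by Sales, without sorting the whole frame
top3 = df.nlargest(3, "Sales")
print(top3["Month"].tolist())   # ['Jun', 'Apr', 'May']
```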

Handling Missing Values in Pandas

# Create a DataFrame with missing values
data = pd.DataFrame({
    "Name":  ["Alice", "Bob", None, "Dave"],
    "Score": [88, None, 72, 90],
    "City":  ["Delhi", "Mumbai", "Pune", None]
})

# Detect missing values
print(data.isnull())          # Boolean table: True where missing
print(data.isnull().sum())    # Count of missing per column

# Drop rows with any missing value
clean = data.dropna()

# Fill missing values with a default (plain assignment avoids the
# chained-assignment problems that inplace=True on a column can cause)
data["Score"] = data["Score"].fillna(data["Score"].mean())
data["City"] = data["City"].fillna("Unknown")
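Different columns usually need different defaults, and fillna accepts a dict mapping column name to fill value, so the whole table can be repaired in one call. A sketch on the same data:

```python
import pandas as pd

data = pd.DataFrame({
    "Name":  ["Alice", "Bob", None, "Dave"],
    "Score": [88, None, 72, 90],
    "City":  ["Delhi", "Mumbai", "Pune", None]
})

# One call, one default per column
filled = data.fillna({
    "Name":  "Unknown",
    "Score": data["Score"].mean(),   # mean of 88, 72, 90
    "City":  "Unknown",
})
print(filled.isnull().sum().sum())   # 0 – no missing values left
```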

Merging DataFrames

Merging combines two DataFrames based on a shared column — similar to SQL JOIN.

# Customers table
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Name": ["Amit", "Priya", "Rohan"]
})

# Orders table
orders = pd.DataFrame({
    "OrderID":    [101, 102, 103, 104],
    "CustomerID": [1,   2,   1,   3],
    "Amount":     [500, 750, 300, 900]
})

# Merge on CustomerID
merged = pd.merge(orders, customers, on="CustomerID", how="left")
print(merged)

Output:

   OrderID  CustomerID  Amount   Name
0      101           1     500   Amit
1      102           2     750  Priya
2      103           1     300   Amit
3      104           3     900  Rohan
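The how parameter controls the join type, just as in SQL. A sketch of inner versus left joins, using an extra customer (CustomerID 4, made up for illustration) who has placed no orders:

```python
import pandas as pd

customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Name": ["Amit", "Priya", "Rohan", "Sneha"]
})
orders = pd.DataFrame({
    "OrderID":    [101, 102, 103, 104],
    "CustomerID": [1, 2, 1, 3],
    "Amount":     [500, 750, 300, 900]
})

# inner: only keys present in BOTH tables survive
inner = pd.merge(customers, orders, on="CustomerID", how="inner")
print(len(inner))                                       # 4 – Sneha dropped

# left: keep every customer; unmatched order fields become NaN
left = pd.merge(customers, orders, on="CustomerID", how="left")
print(left[left["OrderID"].isnull()]["Name"].tolist())  # ['Sneha']
```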

Saving Data to Files

# Save to CSV (no row index)
df.to_csv("cleaned_sales.csv", index=False)

# Save to Excel
df.to_excel("cleaned_sales.xlsx", index=False, sheet_name="Sales")

# Save to JSON
df.to_json("sales.json", orient="records")
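A quick way to check that an export is lossless, without touching the filesystem, is to round-trip through an in-memory buffer:

```python
import io
import pandas as pd

df = pd.DataFrame({
    "Month": ["Jan", "Feb"],
    "Sales": [45000, 52000],
})

# Write CSV into an in-memory buffer instead of a file
buf = io.StringIO()
df.to_csv(buf, index=False)

# Read it back and confirm nothing was lost
buf.seek(0)
restored = pd.read_csv(buf)
print(restored.equals(df))   # True
```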

Key Pandas Functions – Quick Reference

Function                  Purpose
pd.read_csv()             Load a CSV file into a DataFrame
df.head() / df.tail()     Preview first or last rows
df.info()                 Column names, types, and null counts
df.describe()             Summary statistics for numeric columns
df["col"]                 Select a single column
df[condition]             Filter rows by a condition
df.loc[] / df.iloc[]      Select rows/cols by label or position
df.groupby()              Aggregate data by group
df.sort_values()          Sort rows by one or more columns
df.merge()                Combine two DataFrames on a key column
df.isnull()               Detect missing values
df.dropna()               Remove rows with missing values
df.fillna()               Replace missing values with a default
df.to_csv()               Export DataFrame to a CSV file

Summary

  • A Series stores a single column of labelled data; a DataFrame stores a full table
  • Pandas reads CSV, Excel, JSON, and SQL data in one line of code
  • iloc selects by numeric position; loc selects by label name
  • Boolean filtering selects only rows that meet a condition
  • groupby splits data by category and applies aggregate functions
  • merge combines two DataFrames like an SQL JOIN
  • Missing value handling with isnull, dropna, and fillna keeps data clean
