Python Pandas
Pandas is one of the most powerful and widely used Python libraries for working with data. Whether the data comes from a spreadsheet, a database, or a CSV file, Pandas provides the tools to read, clean, explore, and transform it quickly. This guide covers every important concept step by step — from installing Pandas to advanced techniques used in real-world data analysis.
Think of Pandas like a supercharged version of Microsoft Excel that works inside Python. Instead of clicking and dragging cells, everything is done with simple lines of code.
Table of Contents
- What is Pandas?
- Installation & Setup
- Importing Pandas
- Pandas Series
- Pandas DataFrame
- Creating DataFrames
- Reading Data from Files
- Writing Data to Files
- Inspecting a DataFrame
- Selecting Columns and Rows
- loc and iloc — Label vs Position
- Filtering Data (Conditions)
- Handling Missing Data
- Adding and Dropping Columns/Rows
- Renaming Columns and Index
- Sorting Data
- Data Types and Type Conversion
- String Operations
- Date and Time Operations
- GroupBy and Aggregation
- apply(), map() and Lambda Functions
- Merging and Joining DataFrames
- Concatenating DataFrames
- Pivot Tables
- Cross Tabulation (crosstab)
- MultiIndex (Hierarchical Index)
- Window Functions — Rolling & Expanding
- Handling Duplicate Data
- Reshaping Data — melt and stack
- Performance Tips
- Quick Reference Cheat Sheet
1. What is Pandas?
Pandas is an open-source Python library built for data manipulation and analysis. The name "Pandas" comes from "Panel Data" — a term used in statistics and economics. It was created by Wes McKinney in 2008 and has since become the standard tool for working with structured (tabular) data in Python.
Why Use Pandas?
- Read data from CSV, Excel, JSON, SQL databases, and more.
- Clean messy data — handle missing values, fix data types, remove duplicates.
- Filter, sort, and reshape data with just a few lines of code.
- Perform calculations and statistical analysis easily.
- Works seamlessly with other Python libraries like NumPy, Matplotlib, and Scikit-learn.
Where is Pandas Used?
- Data Science & Machine Learning — preparing datasets before building models.
- Finance — analyzing stock prices, portfolio data, and financial reports.
- Business Analytics — processing sales data, customer records, and KPI reports.
- Research — working with experimental data, survey results, and scientific measurements.
2. Installation & Setup
Before using Pandas, it must be installed. Python's package manager pip handles this.
Install Using pip
pip install pandas
Install Along with NumPy and OpenPyXL (Recommended)
pip install pandas numpy openpyxl
numpy is the numerical computing library that Pandas depends on. openpyxl is needed to read and write Excel files.
Verify the Installation
import pandas as pd
print(pd.__version__)
Output:
2.2.1
If a version number appears, Pandas is installed correctly.
Using Pandas in Jupyter Notebook or Google Colab
If working in Jupyter Notebook or Google Colab, Pandas is already available. Just import it at the top of the notebook.
3. Importing Pandas
Every Pandas program starts by importing the library. The standard convention is to import Pandas with the alias pd. This is a widely accepted practice so that Pandas functions are accessed as pd.something instead of pandas.something.
import pandas as pd
This single line is always written at the top of any Python script that uses Pandas.
4. Pandas Series
A Series is the simplest data structure in Pandas. Think of it as a single column of data — like one column in a spreadsheet. It is a one-dimensional array that can hold any type of data: numbers, text, dates, and so on.
Every value in a Series has a label called an index. By default, the index is a number starting from 0.
Creating a Series from a List
import pandas as pd
marks = pd.Series([85, 90, 78, 95, 88])
print(marks)
Output:
0 85
1 90
2 78
3 95
4 88
dtype: int64
The left column (0, 1, 2, 3, 4) is the index. The right column contains the actual values.
Creating a Series with a Custom Index
marks = pd.Series([85, 90, 78], index=["Math", "Science", "English"])
print(marks)
Output:
Math 85
Science 90
English 78
dtype: int64
Creating a Series from a Dictionary
data = {"Apples": 50, "Bananas": 30, "Mangoes": 70}
fruits = pd.Series(data)
print(fruits)
Output:
Apples 50
Bananas 30
Mangoes 70
dtype: int64
Accessing Values in a Series
print(marks["Math"]) # Access by label → 85
print(marks.iloc[0]) # Access by position → 85 (plain marks[0] is deprecated when the index uses labels)
Series Attributes
| Attribute | Description | Example |
|---|---|---|
| series.values | Returns the data as a NumPy array | marks.values |
| series.index | Returns the index labels | marks.index |
| series.dtype | Returns data type | marks.dtype |
| series.name | Name of the Series | marks.name |
| series.size | Number of elements | marks.size |
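As a quick check, the attributes above can be inspected on the marks Series from earlier (re-created here, with an assumed name "Marks", so the snippet runs on its own):

```python
import pandas as pd

marks = pd.Series([85, 90, 78], index=["Math", "Science", "English"], name="Marks")

print(marks.size)          # 3 elements
print(marks.dtype)         # int64
print(list(marks.index))   # ['Math', 'Science', 'English']
print(marks.name)          # Marks
print(type(marks.values))  # a NumPy ndarray
```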
5. Pandas DataFrame
A DataFrame is the most important and frequently used data structure in Pandas. It is a two-dimensional table — just like a spreadsheet with rows and columns. Each column in a DataFrame is actually a Series.
Imagine a school report card: students are the rows, subjects are the columns, and marks are the values. That is exactly what a DataFrame looks like.
Structure of a DataFrame
- Rows — each row represents one record (e.g., one student, one transaction).
- Columns — each column represents one attribute (e.g., name, age, salary).
- Index — the row label (default: 0, 1, 2 ...).
Visual Representation
Name Age City
0 Alice 25 Delhi
1 Bob 30 Mumbai
2 Carol 22 Chennai
Here, Name, Age, City are columns. The numbers 0, 1, 2 are the index (row labels).
6. Creating DataFrames
Method 1 — From a Dictionary
This is the most common way. Each key becomes a column name, and each list becomes the column values.
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Carol"],
"Age": [25, 30, 22],
"City": ["Delhi", "Mumbai", "Chennai"]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 Delhi
1 Bob 30 Mumbai
2 Carol 22 Chennai
Method 2 — From a List of Dictionaries
records = [
{"Product": "Pen", "Price": 10},
{"Product": "Book", "Price": 50},
{"Product": "Pencil", "Price": 5}
]
df = pd.DataFrame(records)
print(df)
Output:
Product Price
0 Pen 10
1 Book 50
2 Pencil 5
Method 3 — From a List of Lists
rows = [["Alice", 25], ["Bob", 30], ["Carol", 22]]
df = pd.DataFrame(rows, columns=["Name", "Age"])
print(df)
Method 4 — From a NumPy Array
import numpy as np
arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=["Col_A", "Col_B"])
print(df)
Setting a Custom Index
df = pd.DataFrame(data, index=["S1", "S2", "S3"])
print(df)
Output:
Name Age City
S1 Alice 25 Delhi
S2 Bob 30 Mumbai
S3 Carol 22 Chennai
7. Reading Data from Files
In real projects, data is rarely typed manually. It comes from CSV files, Excel sheets, JSON files, or databases. Pandas makes reading these files very straightforward.
Reading a CSV File
CSV (Comma Separated Values) is the most common file format for storing tabular data.
df = pd.read_csv("students.csv")
print(df.head())
Common Parameters for read_csv()
| Parameter | Purpose | Example |
|---|---|---|
| sep | Separator character | sep=";" |
| header | Row number to use as column names | header=0 |
| index_col | Column to use as row index | index_col="ID" |
| nrows | Read only first N rows | nrows=100 |
| usecols | Read only specific columns | usecols=["Name","Age"] |
| skiprows | Skip rows from the top | skiprows=2 |
| na_values | Values to treat as NaN | na_values=["NA","N/A"] |
| encoding | File encoding | encoding="utf-8" |
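These parameters can be tried without a real file by reading from an in-memory string (io.StringIO stands in for an open file; the column names and values here are made up for illustration):

```python
import pandas as pd
from io import StringIO

# A small semicolon-separated "file" with an NA placeholder
csv_text = "ID;Name;Age\n1;Alice;25\n2;Bob;NA\n3;Carol;22\n"

df = pd.read_csv(
    StringIO(csv_text),
    sep=";",           # non-comma separator
    index_col="ID",    # use the ID column as the row index
    na_values=["NA"],  # treat the string "NA" as missing
)
print(df)
```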
Reading an Excel File
df = pd.read_excel("sales_data.xlsx", sheet_name="Sheet1")
print(df.head())
Reading a JSON File
df = pd.read_json("data.json")
print(df.head())
Reading from a SQL Database
import sqlite3
conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM employees", conn)
print(df.head())
Reading from a URL (Online CSV)
url = "https://example.com/data/sales.csv"
df = pd.read_csv(url)
print(df.head())
8. Writing Data to Files
After processing data, the results need to be saved back to a file. Pandas supports writing to CSV, Excel, JSON, and SQL.
Write to CSV
df.to_csv("output.csv", index=False)
Setting index=False prevents the default row numbers from being saved as a separate column in the file.
Write to Excel
df.to_excel("output.xlsx", sheet_name="Results", index=False)
Write to JSON
df.to_json("output.json", orient="records", indent=2)
Write to SQL
import sqlite3
conn = sqlite3.connect("company.db")
df.to_sql("employees", conn, if_exists="replace", index=False)
Write Multiple Sheets to One Excel File
with pd.ExcelWriter("report.xlsx") as writer:
df1.to_excel(writer, sheet_name="Sales", index=False)
df2.to_excel(writer, sheet_name="Expenses", index=False)
9. Inspecting a DataFrame
Before doing any analysis, it is important to understand the shape and quality of the data. Pandas provides several built-in methods for this.
Sample Data Used in This Section
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Carol", "David", "Eva"],
"Age": [25, 30, 22, 35, 28],
"Score": [85, 90, None, 78, 92],
"City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]
}
df = pd.DataFrame(data)
head() — First N Rows
print(df.head()) # First 5 rows (default)
print(df.head(3)) # First 3 rows
tail() — Last N Rows
print(df.tail()) # Last 5 rows
print(df.tail(2)) # Last 2 rows
shape — Rows and Columns Count
print(df.shape) # (5, 4) → 5 rows, 4 columns
info() — Column Names, Data Types, Non-Null Counts
df.info() # info() prints its report directly, so no print() is needed
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Age 5 non-null int64
2 Score 4 non-null float64
3 City 5 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 288.0+ bytes
describe() — Statistical Summary
print(df.describe())
Output:
Age Score
count 5.0 4.000000
mean 28.0 86.250000
std 4.949747 6.238322
min 22.0 78.000000
25% 25.0 83.250000
50% 28.0 87.500000
75% 30.0 90.500000
max 35.0 92.000000
Other Useful Inspection Methods
| Method / Attribute | What It Returns |
|---|---|
| df.columns | List of column names |
| df.index | Row index labels |
| df.dtypes | Data type of each column |
| df.shape | Tuple of (rows, columns) |
| df.size | Total number of cells (rows × columns) |
| df.ndim | Number of dimensions (always 2 for DataFrame) |
| df.sample(3) | Random 3 rows |
| df.count() | Count of non-null values per column |
| df.nunique() | Number of unique values per column |
| df.isnull().sum() | Count of missing values per column |
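Two of these combine into a quick data-quality overview. A minimal sketch on the section's sample data (re-created so the snippet is self-contained):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "David", "Eva"],
    "Age": [25, 30, 22, 35, 28],
    "Score": [85, 90, None, 78, 92],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

print(df.nunique())       # unique values per column (City has 3)
print(df.isnull().sum())  # missing values per column (Score has 1)
print(df.shape, df.size)  # (5, 4) and 20 cells
```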
10. Selecting Columns and Rows
Selecting a Single Column
A single column is selected using the column name in square brackets. The result is a Series.
print(df["Name"])
Output:
0 Alice
1 Bob
2 Carol
3 David
4 Eva
Name: Name, dtype: object
Selecting Multiple Columns
Pass a list of column names. The result is a DataFrame.
print(df[["Name", "Age"]])
Selecting Rows by Index Number
print(df[1:3]) # Rows at index 1 and 2 (not 3)
Accessing a Column as an Attribute (Dot Notation)
print(df.Name) # Same as df["Name"]
Note: Dot notation does not work if the column name has spaces or matches a built-in method name. Using square brackets is always safe.
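A minimal sketch of where dot notation breaks down (the column names here are invented for illustration): a column called count collides with the built-in DataFrame.count() method, and a name containing a space cannot be an attribute at all.

```python
import pandas as pd

df = pd.DataFrame({"count": [5, 7], "Total Score": [85, 90]})

# df.count refers to the DataFrame.count() method, not the column:
print(callable(df.count))       # True, this is the method

# Square brackets always mean "the column":
print(df["count"].tolist())     # [5, 7]
print(df["Total Score"].sum())  # 175 (df.Total Score would be a SyntaxError)
```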
11. loc and iloc — Label vs Position
Pandas provides two powerful ways to select specific rows and columns:
- loc — selects by label (column name, index label)
- iloc — selects by integer position (row number, column number)
Using loc
# Select row with index label 0
print(df.loc[0])
# Select rows 0 to 2, all columns
print(df.loc[0:2])
# Select rows 0 to 2, specific columns
print(df.loc[0:2, ["Name", "Age"]])
# Select a single cell
print(df.loc[1, "City"]) # Output: Mumbai
Using iloc
# Select first row (position 0)
print(df.iloc[0])
# Select rows 0, 1, 2 and first two columns
print(df.iloc[0:3, 0:2])
# Select a single cell by position
print(df.iloc[1, 3]) # Row 1, Column 3
Key Difference — loc vs iloc
| Feature | loc | iloc |
|---|---|---|
| Selection method | By label (name) | By number (position) |
| End of range | Inclusive | Exclusive |
| Works with | Custom index, column names | Always numeric 0-based |
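The inclusive-vs-exclusive difference in the table is easy to verify directly. A short sketch (DataFrame rebuilt here so it runs standalone):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "David", "Eva"],
    "Age": [25, 30, 22, 35, 28],
})

# loc slices by label and INCLUDES the end label → rows 0, 1 and 2
print(len(df.loc[0:2]))   # 3

# iloc slices by position and EXCLUDES the end position → rows 0 and 1
print(len(df.iloc[0:2]))  # 2
```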
at and iat — Faster Single Cell Access
print(df.at[2, "Name"]) # Access by label → Carol
print(df.iat[2, 0]) # Access by position → Carol
at and iat are faster than loc and iloc when accessing just one cell.
12. Filtering Data (Conditions)
Filtering means selecting only those rows that satisfy a given condition — for example, showing only students who scored above 80, or listing employees from a specific city.
Basic Filtering
# Students with Score greater than 85
result = df[df["Score"] > 85]
print(result)
Filtering with Text Match
# Rows where City is Mumbai
result = df[df["City"] == "Mumbai"]
print(result)
Multiple Conditions — AND (&)
# Age greater than 25 AND City is Delhi
result = df[(df["Age"] > 25) & (df["City"] == "Delhi")]
print(result)
Multiple Conditions — OR (|)
# City is Delhi OR City is Mumbai
result = df[(df["City"] == "Delhi") | (df["City"] == "Mumbai")]
print(result)
NOT Condition (~)
# Rows where City is NOT Delhi
result = df[~(df["City"] == "Delhi")]
print(result)
Filtering with isin()
isin() checks if a value is present in a given list. Useful for filtering multiple values at once.
# Rows where City is Delhi or Chennai
result = df[df["City"].isin(["Delhi", "Chennai"])]
print(result)
Filtering with between()
# Age between 25 and 30 (inclusive)
result = df[df["Age"].between(25, 30)]
print(result)
query() Method — SQL-like Filtering
query() allows filtering using a readable string expression — similar to writing SQL WHERE conditions.
result = df.query("Age > 25 and City == 'Mumbai'")
print(result)
13. Handling Missing Data
Real-world datasets almost always have missing values — cells that are empty or were not recorded. In Pandas, missing values are represented as NaN (Not a Number). Handling them correctly is a critical step in data cleaning.
Sample Data with Missing Values
import pandas as pd
import numpy as np
data = {
"Name": ["Alice", "Bob", "Carol", "David"],
"Age": [25, None, 22, 35],
"Score": [85, 90, None, 78]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Score
0 Alice 25.0 85.0
1 Bob NaN 90.0
2 Carol 22.0 NaN
3 David 35.0 78.0
Detecting Missing Values
print(df.isnull()) # True where value is missing
print(df.isnull().sum()) # Count of missing values per column
print(df.notnull()) # True where value is present
Dropping Missing Values
# Drop any row that has at least one missing value
df_clean = df.dropna()
# Drop only rows where ALL values are missing
df_clean = df.dropna(how="all")
# Drop rows where specific columns have missing values
df_clean = df.dropna(subset=["Age"])
# Drop columns with missing values
df_clean = df.dropna(axis=1)
Filling Missing Values
# Fill all missing values with 0
df_filled = df.fillna(0)
# Fill missing values in a specific column
df["Age"] = df["Age"].fillna(df["Age"].mean())
# Fill using forward fill (use the previous row's value)
df_ffill = df.ffill() # fillna(method="ffill") is deprecated in pandas 2.x
# Fill using backward fill (use the next row's value)
df_bfill = df.bfill()
Filling with Different Values per Column
df = df.fillna({"Age": df["Age"].median(), "Score": df["Score"].mean()})
Replace Specific Values with NaN
df.replace(-1, np.nan, inplace=True)
df.replace(["N/A", "na", "unknown"], np.nan, inplace=True)
inplace Parameter
By default, Pandas operations return a new DataFrame and do not modify the original. Setting inplace=True modifies the original DataFrame directly without needing to reassign it.
df.dropna(inplace=True) # Modifies df directly
14. Adding and Dropping Columns/Rows
Adding a New Column
df["Grade"] = ["A", "A+", "B", "B+", "A"]
print(df)
Adding a Column Based on Calculation
df["Age_in_5_Years"] = df["Age"] + 5
Adding a Column Using assign()
assign() is a cleaner way to add new columns, especially in method chains.
df = df.assign(Tax = df["Score"] * 0.1)
Dropping a Column
df = df.drop(columns=["Age_in_5_Years"])
# Or using axis
df = df.drop("Grade", axis=1)
Dropping Multiple Columns
df = df.drop(columns=["Grade", "Tax"])
Dropping a Row by Index
df = df.drop(index=2) # Drop row at index 2
df = df.drop(index=[0, 3]) # Drop rows at index 0 and 3
Inserting a Column at a Specific Position
# Insert "Grade" column at position 1
df.insert(1, "Grade", ["A", "B", "A", "C", "A"])
Resetting the Index after Drop
df = df.reset_index(drop=True) # drop=True prevents old index from becoming a column
15. Renaming Columns and Index
Rename Specific Columns
df = df.rename(columns={"Name": "Student_Name", "Age": "Student_Age"})
print(df.columns)
Rename All Columns Using a List
df.columns = ["student", "age", "marks", "location"]
Rename Columns to Lowercase
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(" ", "_")
Rename the Index
df = df.rename(index={0: "Row1", 1: "Row2"})
Set a Column as Index
df = df.set_index("Name")
print(df)
Reset Index Back to Default
df = df.reset_index()
16. Sorting Data
Sort by a Single Column
# Sort by Age in ascending order (default)
df_sorted = df.sort_values("Age")
print(df_sorted)
# Sort in descending order
df_sorted = df.sort_values("Age", ascending=False)
Sort by Multiple Columns
# Sort by City ascending, then by Age descending
df_sorted = df.sort_values(["City", "Age"], ascending=[True, False])
print(df_sorted)
Sort by Index
df_sorted = df.sort_index() # Ascending
df_sorted = df.sort_index(ascending=False) # Descending
nlargest() and nsmallest()
These methods quickly return the top or bottom N rows based on a column.
# Top 3 highest scores
print(df.nlargest(3, "Score"))
# Bottom 2 youngest students
print(df.nsmallest(2, "Age"))
17. Data Types and Type Conversion
Every column in a DataFrame has a data type (dtype). Common types include int64, float64, object (for text), bool, and datetime64.
Check Data Types
print(df.dtypes)
Convert a Column's Data Type — astype()
# Convert Age from float to int
df["Age"] = df["Age"].astype(int)
# Convert Score to float
df["Score"] = df["Score"].astype(float)
# Convert a column to string
df["ID"] = df["ID"].astype(str)
Convert to Category Type (Memory Efficient)
If a column has a limited number of unique repeated values (like "City" or "Gender"), converting it to category saves memory.
df["City"] = df["City"].astype("category")
print(df["City"].cat.categories)
Convert String Column to Numeric
# If column has text like "45.5" that should be a number
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")
errors="coerce" converts invalid values to NaN instead of raising an error.
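The effect of errors="coerce" is easiest to see on a small Series that mixes valid and invalid numbers:

```python
import pandas as pd

raw = pd.Series(["45.5", "abc", "12"])

# "abc" cannot be parsed, so it becomes NaN instead of raising a ValueError
nums = pd.to_numeric(raw, errors="coerce")
print(nums.tolist())  # [45.5, nan, 12.0]
```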
Convert to Datetime
df["DOB"] = pd.to_datetime(df["DOB"])
df["JoinDate"] = pd.to_datetime(df["JoinDate"], format="%d-%m-%Y")
18. String Operations
Pandas provides a .str accessor to perform string operations on text columns. These work on the entire column at once — no need for a loop.
Setup
data = {"Name": [" alice ", "BOB", "carol smith"], "City": ["Delhi", "mumbai", "CHENNAI"]}
df = pd.DataFrame(data)
Common String Methods
| Method | Description | Example |
|---|---|---|
| .str.upper() | Convert to uppercase | df["Name"].str.upper() |
| .str.lower() | Convert to lowercase | df["City"].str.lower() |
| .str.title() | Title case (first letter caps) | df["Name"].str.title() |
| .str.strip() | Remove leading/trailing spaces | df["Name"].str.strip() |
| .str.len() | Length of each string | df["Name"].str.len() |
| .str.replace() | Replace a substring | df["Name"].str.replace("Bob","Robert") |
| .str.contains() | True if substring found | df["Name"].str.contains("carol") |
| .str.startswith() | True if starts with | df["Name"].str.startswith("A") |
| .str.endswith() | True if ends with | df["Name"].str.endswith("h") |
| .str.split() | Split into list | df["Name"].str.split(" ") |
| .str.slice() | Extract part of string | df["Name"].str.slice(0, 3) |
| .str.count() | Count occurrences | df["Name"].str.count("l") |
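Several of these chain naturally. A typical cleaning pass over the messy names from the setup above (re-created here as a standalone Series):

```python
import pandas as pd

names = pd.Series([" alice ", "BOB", "carol smith"])

# Strip whitespace, then normalize capitalization, all vectorized
cleaned = names.str.strip().str.title()
print(cleaned.tolist())                        # ['Alice', 'Bob', 'Carol Smith']
print(cleaned.str.len().tolist())              # [5, 3, 11]
print(cleaned.str.contains("Smith").tolist())  # [False, False, True]
```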
Splitting a Column into Multiple Columns
df["Name"] = ["Alice Smith", "Bob Jones", "Carol Brown"]
df[["First", "Last"]] = df["Name"].str.split(" ", expand=True)
print(df)
Filtering with str.contains()
result = df[df["City"].str.contains("Delhi", case=False)]
print(result)
19. Date and Time Operations
Dates and times are common in real-world data — transaction dates, employee joining dates, event timestamps. Pandas has built-in tools to work with them easily.
Creating and Converting to Datetime
import pandas as pd
df = pd.DataFrame({"Date": ["2024-01-15", "2024-03-22", "2024-07-05"]})
df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes)
Extracting Date Components
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["Weekday"] = df["Date"].dt.day_name()
df["Quarter"] = df["Date"].dt.quarter
print(df)
Output:
Date Year Month Day Weekday Quarter
0 2024-01-15 2024 1 15 Monday 1
1 2024-03-22 2024 3 22 Friday 1
2 2024-07-05 2024 7 5 Friday 3
Date Arithmetic
# Add 30 days to a date
df["Due_Date"] = df["Date"] + pd.Timedelta(days=30)
# Difference between two dates
df["Days_Passed"] = pd.Timestamp("2024-12-31") - df["Date"]
# Extract number of days from timedelta
df["Days_Count"] = df["Days_Passed"].dt.days
Filtering by Date
# Rows after March 2024
result = df[df["Date"] > "2024-03-01"]
print(result)
Resampling Time Series Data
Resampling aggregates time-series data over a time period — like grouping daily data into monthly totals.
df.set_index("Date", inplace=True)
# Monthly sum
monthly = df.resample("ME").sum() # "ME" = month end; plain "M" is deprecated since pandas 2.2
print(monthly)
Date Range Generation
# Generate 10 consecutive business days from Jan 1, 2024
date_range = pd.date_range(start="2024-01-01", periods=10, freq="B")
print(date_range)
20. GroupBy and Aggregation
GroupBy is one of the most powerful features in Pandas. It splits the data into groups based on one or more columns, then applies a function (like sum, mean, count) to each group. This is similar to GROUP BY in SQL.
Think of it like sorting test papers by class and then calculating the average marks for each class.
Sample Data
import pandas as pd
data = {
"City": ["Delhi", "Mumbai", "Delhi", "Mumbai", "Chennai", "Delhi"],
"Product": ["Pen", "Book", "Pen", "Pencil", "Pen", "Book"],
"Sales": [200, 500, 150, 300, 400, 250]
}
df = pd.DataFrame(data)
Basic GroupBy — Total Sales per City
result = df.groupby("City")["Sales"].sum()
print(result)
Output:
City
Chennai 400
Delhi 600
Mumbai 800
Name: Sales, dtype: int64
Multiple Aggregations at Once — agg()
result = df.groupby("City")["Sales"].agg(["sum", "mean", "max", "count"])
print(result)
GroupBy on Multiple Columns
result = df.groupby(["City", "Product"])["Sales"].sum()
print(result)
Named Aggregations (Pandas 0.25+)
result = df.groupby("City").agg(
Total_Sales = ("Sales", "sum"),
Avg_Sales = ("Sales", "mean"),
Max_Sales = ("Sales", "max")
)
print(result)
Applying Custom Aggregation Functions
result = df.groupby("City")["Sales"].agg(lambda x: x.max() - x.min())
print(result)
transform() — Keep Original Shape
transform() is like agg(), but it returns a Series of the same length as the original DataFrame — useful for adding a group-level calculation back to the original data.
# Add city-level average sales back to every row
df["City_Avg"] = df.groupby("City")["Sales"].transform("mean")
print(df)
filter() — Keep Groups Meeting a Condition
# Keep only cities with total sales greater than 500
result = df.groupby("City").filter(lambda x: x["Sales"].sum() > 500)
print(result)
21. apply(), map() and Lambda Functions
These are tools for applying custom transformations to columns or rows.
map() — Element-wise on a Series
map() applies a function or dictionary mapping to each element of a Series.
df["Grade"] = df["Score"].map({85: "A", 90: "A+", 78: "B+"})
# Or with a function
df["Level"] = df["Age"].map(lambda x: "Senior" if x > 28 else "Junior")
apply() on a Series
def classify(score):
if score >= 90:
return "Excellent"
elif score >= 75:
return "Good"
else:
return "Average"
df["Result"] = df["Score"].apply(classify)
apply() on a DataFrame Row
By setting axis=1, the function is applied row-by-row, giving access to multiple columns at once.
df["Summary"] = df.apply(
lambda row: f"{row['Name']} from {row['City']} scored {row['Score']}",
axis=1
)
print(df["Summary"])
apply() on Entire Columns (axis=0)
# Normalize all numeric columns (subtract min, divide by range)
def normalize(col):
return (col - col.min()) / (col.max() - col.min())
df[["Age", "Score"]] = df[["Age", "Score"]].apply(normalize)
applymap() / map() on Entire DataFrame (Element-wise)
# In Pandas 2.x use .map() on DataFrame instead of applymap()
df_numeric = df[["Age", "Score"]].map(lambda x: round(x, 2))
Using Lambda with Conditions
df["Tax"] = df["Sales"].apply(lambda x: x * 0.18 if x > 300 else x * 0.05)
22. Merging and Joining DataFrames
Merging combines two DataFrames based on a common column — just like SQL JOIN. This is essential when data is spread across multiple tables and needs to be combined.
Sample DataFrames
students = pd.DataFrame({
"ID": [1, 2, 3, 4],
"Name": ["Alice", "Bob", "Carol", "David"]
})
marks = pd.DataFrame({
"ID": [1, 2, 3, 5],
"Score": [85, 90, 78, 88]
})
Inner Join — Keep Only Matching Rows
result = pd.merge(students, marks, on="ID", how="inner")
print(result)
Output:
ID Name Score
0 1 Alice 85
1 2 Bob 90
2 3 Carol 78
David (ID=4) and the record with ID=5 are excluded because they don't match in both tables.
Left Join — Keep All Rows from Left Table
result = pd.merge(students, marks, on="ID", how="left")
print(result)
Output: David appears with NaN for Score since no matching row exists in marks.
Right Join — Keep All Rows from Right Table
result = pd.merge(students, marks, on="ID", how="right")
Outer Join — Keep All Rows from Both Tables
result = pd.merge(students, marks, on="ID", how="outer")
print(result)
Merging on Different Column Names
result = pd.merge(students, marks, left_on="ID", right_on="StudentID")
Merging on Index
result = pd.merge(students, marks, left_index=True, right_index=True)
Join Types Summary
| Join Type | Result |
|---|---|
| inner | Only rows that match in BOTH tables |
| left | All rows from LEFT table + matching from right |
| right | All rows from RIGHT table + matching from left |
| outer | All rows from BOTH tables |
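The row counts implied by this table can be verified directly on the students/marks frames from above. merge() also accepts indicator=True, which adds a _merge column recording each row's origin — a sketch:

```python
import pandas as pd

students = pd.DataFrame({"ID": [1, 2, 3, 4],
                         "Name": ["Alice", "Bob", "Carol", "David"]})
marks = pd.DataFrame({"ID": [1, 2, 3, 5],
                      "Score": [85, 90, 78, 88]})

# Row counts per join type: inner=3, left=4, right=4, outer=5
counts = {how: len(pd.merge(students, marks, on="ID", how=how))
          for how in ["inner", "left", "right", "outer"]}
print(counts)

# indicator=True tags each row as left_only, right_only, or both
outer = pd.merge(students, marks, on="ID", how="outer", indicator=True)
print(outer[["ID", "_merge"]])
```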
23. Concatenating DataFrames
While merge() combines DataFrames based on a key column, concat() stacks DataFrames either vertically (one on top of another) or horizontally (side by side).
Concatenate Vertically (Stacking Rows)
df1 = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [85, 90]})
df2 = pd.DataFrame({"Name": ["Carol", "David"], "Score": [78, 95]})
result = pd.concat([df1, df2], ignore_index=True)
print(result)
Output:
Name Score
0 Alice 85
1 Bob 90
2 Carol 78
3 David 95
ignore_index=True resets the index in the combined result.
Concatenate Horizontally (Stacking Columns)
df_names = pd.DataFrame({"Name": ["Alice", "Bob"]})
df_scores = pd.DataFrame({"Score": [85, 90]})
result = pd.concat([df_names, df_scores], axis=1)
print(result)
Concatenate with Keys (Track Source)
result = pd.concat([df1, df2], keys=["Group_A", "Group_B"])
print(result)
Difference between concat and merge
| Feature | concat | merge |
|---|---|---|
| Purpose | Stack DataFrames together | Combine based on common column |
| Requires common column | No | Yes |
| Direction | Rows (axis=0) or Columns (axis=1) | Always on columns |
24. Pivot Tables
A pivot table reorganizes and summarizes data by converting unique values of one column into new columns. This is the same concept as Excel's pivot table feature.
Sample Data
data = {
"City": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Chennai"],
"Product": ["Pen", "Book", "Pen", "Book", "Pen"],
"Sales": [200, 300, 400, 500, 150]
}
df = pd.DataFrame(data)
Basic Pivot Table
pivot = pd.pivot_table(df, values="Sales", index="City", columns="Product", aggfunc="sum")
print(pivot)
Output:
Product Book Pen
City
Chennai NaN 150
Delhi 300 200
Mumbai 500 400
With Fill Value and Margins
pivot = pd.pivot_table(
df,
values="Sales",
index="City",
columns="Product",
aggfunc="sum",
fill_value=0,
margins=True # Adds row and column totals
)
print(pivot)
Multiple Aggregation Functions
pivot = pd.pivot_table(df, values="Sales", index="City", aggfunc=["sum", "mean"])
print(pivot)
pivot() — Simple Reshape (No Aggregation)
pivot() is used when there is no aggregation needed — just reshaping the data.
result = df.pivot(index="City", columns="Product", values="Sales")
print(result)
25. Cross Tabulation (crosstab)
Cross tabulation (or crosstab) computes a frequency table of two or more variables. It is commonly used in survey analysis and categorical data exploration.
import pandas as pd
data = {
"Gender": ["M", "F", "M", "F", "M", "F"],
"City": ["Delhi", "Mumbai", "Delhi", "Delhi", "Mumbai", "Chennai"]
}
df = pd.DataFrame(data)
result = pd.crosstab(df["City"], df["Gender"])
print(result)
Output:
Gender F M
City
Chennai 1 0
Delhi 1 2
Mumbai 1 1
Crosstab with Percentages
result = pd.crosstab(df["City"], df["Gender"], normalize="index")
print(result)
Crosstab with Margins
result = pd.crosstab(df["City"], df["Gender"], margins=True, margins_name="Total")
print(result)
26. MultiIndex (Hierarchical Index)
A MultiIndex (also called hierarchical index) allows a DataFrame to have multiple levels of row or column labels. This is useful when data is naturally hierarchical — like sales by year and month, or students grouped by class and section.
Creating a MultiIndex DataFrame
import pandas as pd
data = {
"Sales": [200, 150, 300, 250, 180, 320]
}
index = pd.MultiIndex.from_tuples(
[("2023", "Q1"), ("2023", "Q2"), ("2023", "Q3"),
("2024", "Q1"), ("2024", "Q2"), ("2024", "Q3")],
names=["Year", "Quarter"]
)
df = pd.DataFrame(data, index=index)
print(df)
Output:
Sales
Year Quarter
2023 Q1 200
Q2 150
Q3 300
2024 Q1 250
Q2 180
Q3 320
Accessing Data in MultiIndex
# All rows for year 2023
print(df.loc["2023"])
# Specific quarter in a year
print(df.loc[("2023", "Q2")])
Unstacking — Convert Row Index to Columns
print(df.unstack())
Stacking — Convert Column Headers to Row Index
print(df.unstack().stack())
Reset MultiIndex
df_flat = df.reset_index()
print(df_flat)
Creating MultiIndex from GroupBy
sales_data = pd.DataFrame({
"Year": [2023, 2023, 2024, 2024],
"Product": ["Pen", "Book", "Pen", "Book"],
"Sales": [500, 700, 600, 800]
})
result = sales_data.groupby(["Year", "Product"])["Sales"].sum()
print(result)
27. Window Functions — Rolling & Expanding
Window functions perform calculations over a sliding or expanding window of rows. They are widely used in time-series analysis — for example, calculating a 7-day moving average of sales or stock prices.
Rolling — Fixed-Size Sliding Window
import pandas as pd
df = pd.DataFrame({
"Day": range(1, 8),
"Sales": [100, 150, 130, 200, 180, 210, 170]
})
# 3-day rolling average
df["Rolling_Avg"] = df["Sales"].rolling(window=3).mean()
print(df)
Output:
Day Sales Rolling_Avg
0 1 100 NaN
1 2 150 NaN
2 3 130 126.666667
3 4 200 160.000000
4 5 180 170.000000
5 6 210 196.666667
6 7 170 186.666667
The first two rows show NaN because there aren't enough previous rows to fill the window of 3.
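If those leading NaN values are unwanted, rolling() accepts a min_periods parameter that computes a partial-window result instead — a minimal sketch:

```python
import pandas as pd

sales = pd.Series([100, 150, 130, 200])

# min_periods=1 → average over however many rows are available so far
avg = sales.rolling(window=3, min_periods=1).mean()
print(avg.tolist())  # [100.0, 125.0, 126.66..., 160.0]
```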
Rolling Sum and Min
df["Rolling_Sum"] = df["Sales"].rolling(window=3).sum()
df["Rolling_Min"] = df["Sales"].rolling(window=3).min()
Expanding — Growing Window
An expanding window grows from the first row up to the current row — useful for running totals or cumulative averages.
df["Cumulative_Avg"] = df["Sales"].expanding().mean()
df["Cumulative_Sum"] = df["Sales"].expanding().sum()
print(df[["Day", "Sales", "Cumulative_Avg", "Cumulative_Sum"]])
Exponential Weighted Moving Average (EWM)
EWM gives more weight to recent values — useful in finance and trend analysis.
df["EWM_Avg"] = df["Sales"].ewm(span=3).mean()
28. Handling Duplicate Data
Duplicate rows occur when the same record appears more than once. They can skew analysis results and must be identified and handled properly.
Sample Data
data = {
"Name": ["Alice", "Bob", "Alice", "Carol", "Bob"],
"Score": [85, 90, 85, 78, 90]
}
df = pd.DataFrame(data)
Detect Duplicates
print(df.duplicated()) # True for duplicate rows
print(df.duplicated().sum()) # Total count of duplicates
print(df[df.duplicated()]) # Show duplicate rows
Detect Duplicates Based on Specific Columns
print(df.duplicated(subset=["Name"])) # Duplicates only in Name column
Remove Duplicates
# Keep first occurrence, remove later duplicates
df_clean = df.drop_duplicates()
# Keep last occurrence
df_clean = df.drop_duplicates(keep="last")
# Remove all rows that are duplicated (keep none)
df_clean = df.drop_duplicates(keep=False)
Remove Based on Specific Columns
df_clean = df.drop_duplicates(subset=["Name"], keep="first")
29. Reshaping Data — melt and stack
Sometimes data is in a wide format (many columns) when it needs to be in a long format (more rows), or vice versa. Pandas provides melt(), stack(), and unstack() for this.
melt() — Wide to Long Format
melt() "unpivots" a DataFrame — turning column headers into row values. This is the opposite of a pivot table.
df = pd.DataFrame({
"Student": ["Alice", "Bob"],
"Math": [85, 90],
"Science": [78, 88],
"English": [92, 75]
})
print(df)
Wide Format:
Student Math Science English
0 Alice 85 78 92
1 Bob 90 88 75
df_melted = df.melt(id_vars="Student", var_name="Subject", value_name="Marks")
print(df_melted)
Long Format (after melt):
Student Subject Marks
0 Alice Math 85
1 Bob Math 90
2 Alice Science 78
3 Bob Science 88
4 Alice English 92
5 Bob English 75
stack() — Column Labels Become Row Index
df_stacked = df.set_index("Student").stack()
print(df_stacked)
unstack() — Row Index Becomes Column Labels
df_unstacked = df_stacked.unstack()
print(df_unstacked)
30. Performance Tips
As datasets grow larger, writing efficient Pandas code becomes important. Here are practical techniques to improve performance.
Tip 1 — Use Appropriate Data Types
Storing text columns with few unique values as category type instead of object reduces memory significantly.
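The saving is easy to verify with memory_usage(deep=True). A minimal sketch with made-up data, where three unique city names repeat across many rows:

```python
import pandas as pd

# A column with many rows but only a few unique values
cities = pd.Series(["Delhi", "Mumbai", "Chennai"] * 10_000)

as_object = cities.memory_usage(deep=True)
as_category = cities.astype("category").memory_usage(deep=True)

# category stores each unique string once plus small integer codes per row
print(f"object:   {as_object} bytes")
print(f"category: {as_category} bytes")
```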
df["City"] = df["City"].astype("category")
df["Gender"] = df["Gender"].astype("category")
Tip 2 — Use Vectorized Operations Instead of Loops
Pandas operations on entire columns are much faster than Python loops.
# Slow way — using a loop
for i in range(len(df)):
    df.loc[i, "Tax"] = df.loc[i, "Sales"] * 0.18
# Fast way — vectorized
df["Tax"] = df["Sales"] * 0.18
Tip 3 — Read Only Necessary Columns
df = pd.read_csv("large_file.csv", usecols=["Name", "Sales", "City"])
Tip 4 — Check Memory Usage
print(df.memory_usage(deep=True))
print(df.memory_usage(deep=True).sum() / 1024 ** 2, "MB")
Tip 5 — Use query() for Faster Filtering
# query() can be faster than boolean indexing on large DataFrames
# (it uses the numexpr engine when available)
df.query("Age > 25 and City == 'Delhi'")
Tip 6 — Avoid iterrows() — Use apply() or Vectorization
# Avoid — very slow
for index, row in df.iterrows():
    print(row["Name"])
# Better — use a vectorized operation or apply()
df["Name"].apply(print)
Tip 7 — Read Large CSV Files in Chunks
chunk_size = 10000
for chunk in pd.read_csv("huge_file.csv", chunksize=chunk_size):
    process(chunk)  # Process each chunk separately
Tip 8 — Downcast Numeric Types
df["Age"] = pd.to_numeric(df["Age"], downcast="integer")
df["Score"] = pd.to_numeric(df["Score"], downcast="float")
Tip 9 — Use eval() for Complex Expressions
# Can be faster than standard column arithmetic on large DataFrames
df.eval("Total = Sales + Tax", inplace=True)
31. Quick Reference Cheat Sheet
Creating Data
| Task | Code |
|---|---|
| Create Series | pd.Series([1, 2, 3]) |
| Create DataFrame | pd.DataFrame({"A": [1,2], "B": [3,4]}) |
| Read CSV | pd.read_csv("file.csv") |
| Read Excel | pd.read_excel("file.xlsx") |
| Write CSV | df.to_csv("out.csv", index=False) |
Inspection
| Task | Code |
|---|---|
| First 5 rows | df.head() |
| Shape | df.shape |
| Info | df.info() |
| Statistics | df.describe() |
| Data types | df.dtypes |
| Missing counts | df.isnull().sum() |
Selection
| Task | Code |
|---|---|
| Select column | df["Name"] |
| Select multiple columns | df[["Name", "Age"]] |
| Select by label | df.loc[0, "Name"] |
| Select by position | df.iloc[0, 1] |
| Filter rows | df[df["Age"] > 25] |
Cleaning
| Task | Code |
|---|---|
| Drop NA rows | df.dropna() |
| Fill NA | df.fillna(0) |
| Drop duplicates | df.drop_duplicates() |
| Rename column | df.rename(columns={"old":"new"}) |
| Change type | df["A"].astype(int) |
Transformation
| Task | Code |
|---|---|
| Sort rows | df.sort_values("Col") |
| Group and sum | df.groupby("City")["Sales"].sum() |
| Apply function | df["Col"].apply(func) |
| Pivot table | pd.pivot_table(df, values=..., index=..., columns=...) |
| Merge | pd.merge(df1, df2, on="ID") |
| Concatenate | pd.concat([df1, df2]) |
| Melt (wide to long) | df.melt(id_vars="Name") |
| Rolling average | df["Col"].rolling(3).mean() |
String Operations
| Task | Code |
|---|---|
| Lowercase | df["Name"].str.lower() |
| Strip spaces | df["Name"].str.strip() |
| Contains | df["Name"].str.contains("Al") |
| Replace | df["Name"].str.replace("Bob","Robert") |
| Split | df["Name"].str.split(" ", expand=True) |
DateTime Operations
| Task | Code |
|---|---|
| Convert to datetime | pd.to_datetime(df["Date"]) |
| Extract year | df["Date"].dt.year |
| Extract month | df["Date"].dt.month |
| Day name | df["Date"].dt.day_name() |
| Date range | pd.date_range("2024-01-01", periods=10) |
Summary
This guide has covered the full journey of working with Pandas — starting from installing the library and understanding its two core data structures (Series and DataFrame), all the way to advanced techniques like MultiIndex, window functions, reshaping with melt/stack, and performance optimization.
Here is a recap of the key stages in any typical data analysis workflow using Pandas:
- Load — Read data from CSV, Excel, JSON, or a database.
- Inspect — Use head(), info(), describe() to understand the dataset.
- Clean — Handle missing values, fix data types, remove duplicates.
- Transform — Filter, sort, group, merge, pivot, and reshape as needed.
- Analyze — Aggregate, apply functions, compute statistics.
- Export — Save results to CSV, Excel, or a database.
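As a closing illustration, the six stages above can be sketched in a few lines. The data here is entirely made up, and the "file" is an in-memory buffer (io.StringIO) so the sketch runs without touching disk:

```python
import io
import pandas as pd

# Load — here from an in-memory CSV string instead of a real file
raw = io.StringIO("Name,City,Sales\nAlice,Delhi,100\nBob,Mumbai,\nAlice,Delhi,100\n")
df = pd.read_csv(raw)

# Inspect
print(df.shape)            # (3, 3)
print(df.isnull().sum())   # one missing Sales value

# Clean — fill the missing value, drop the exact duplicate row
df["Sales"] = df["Sales"].fillna(0)
df = df.drop_duplicates()

# Transform / Analyze — total sales per city
totals = df.groupby("City")["Sales"].sum()

# Export — write to the in-memory buffer rather than disk
out = io.StringIO()
totals.to_csv(out)
print(out.getvalue())
```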
Mastering Pandas opens the door to advanced areas like data visualization (Matplotlib, Seaborn), machine learning (Scikit-learn), and big data processing (Dask, PySpark). The consistent practice of applying these concepts on real datasets is the fastest path to proficiency.
