Python Pandas

Pandas is one of the most powerful and widely used Python libraries for working with data. Whether the data comes from a spreadsheet, a database, or a CSV file, Pandas gives you the tools to read, clean, explore, and transform it quickly. This guide covers every important concept step by step — from installing Pandas to advanced techniques used in real-world data analysis.

Think of Pandas like a supercharged version of Microsoft Excel that works inside Python. Instead of clicking and dragging cells, everything is done with simple lines of code.


Table of Contents

  1. What is Pandas?
  2. Installation & Setup
  3. Importing Pandas
  4. Pandas Series
  5. Pandas DataFrame
  6. Creating DataFrames
  7. Reading Data from Files
  8. Writing Data to Files
  9. Inspecting a DataFrame
  10. Selecting Columns and Rows
  11. loc and iloc — Label vs Position
  12. Filtering Data (Conditions)
  13. Handling Missing Data
  14. Adding and Dropping Columns/Rows
  15. Renaming Columns and Index
  16. Sorting Data
  17. Data Types and Type Conversion
  18. String Operations
  19. Date and Time Operations
  20. GroupBy and Aggregation
  21. apply(), map() and Lambda Functions
  22. Merging and Joining DataFrames
  23. Concatenating DataFrames
  24. Pivot Tables
  25. Cross Tabulation (crosstab)
  26. MultiIndex (Hierarchical Index)
  27. Window Functions — Rolling & Expanding
  28. Handling Duplicate Data
  29. Reshaping Data — melt and stack
  30. Performance Tips
  31. Quick Reference Cheat Sheet

1. What is Pandas?

Pandas is an open-source Python library built for data manipulation and analysis. The name "Pandas" comes from "Panel Data" — a term used in statistics and economics. It was created by Wes McKinney in 2008 and has since become the standard tool for working with structured (tabular) data in Python.

Why Use Pandas?

  • Read data from CSV, Excel, JSON, SQL databases, and more.
  • Clean messy data — handle missing values, fix data types, remove duplicates.
  • Filter, sort, and reshape data with just a few lines of code.
  • Perform calculations and statistical analysis easily.
  • Works seamlessly with other Python libraries like NumPy, Matplotlib, and Scikit-learn.

Where is Pandas Used?

  • Data Science & Machine Learning — preparing datasets before building models.
  • Finance — analyzing stock prices, portfolio data, and financial reports.
  • Business Analytics — processing sales data, customer records, and KPI reports.
  • Research — working with experimental data, survey results, and scientific measurements.

2. Installation & Setup

Before using Pandas, it must be installed. Python's package manager pip handles this.

Install Using pip

pip install pandas

Install Along with NumPy and OpenPyXL (Recommended)

pip install pandas numpy openpyxl

numpy is the numerical computing library that Pandas depends on. openpyxl is needed to read and write Excel files.

Verify the Installation

import pandas as pd
print(pd.__version__)

Output:

2.2.1

If a version number appears, Pandas is installed correctly.

Using Pandas in Jupyter Notebook or Google Colab

If working in Jupyter Notebook or Google Colab, Pandas is already available. Just import it at the top of the notebook.


3. Importing Pandas

Every Pandas program starts by importing the library. The standard convention is to import Pandas with the alias pd. This is a widely accepted practice so that Pandas functions are accessed as pd.something instead of pandas.something.

import pandas as pd

This single line is always written at the top of any Python script that uses Pandas.


4. Pandas Series

A Series is the simplest data structure in Pandas. Think of it as a single column of data — like one column in a spreadsheet. It is a one-dimensional array that can hold any type of data: numbers, text, dates, and so on.

Every value in a Series has a label called an index. By default, the index is a number starting from 0.

Creating a Series from a List

import pandas as pd

marks = pd.Series([85, 90, 78, 95, 88])
print(marks)

Output:

0    85
1    90
2    78
3    95
4    88
dtype: int64

The left column (0, 1, 2, 3, 4) is the index. The right column contains the actual values.

Creating a Series with a Custom Index

marks = pd.Series([85, 90, 78], index=["Math", "Science", "English"])
print(marks)

Output:

Math       85
Science    90
English    78
dtype: int64

Creating a Series from a Dictionary

data = {"Apples": 50, "Bananas": 30, "Mangoes": 70}
fruits = pd.Series(data)
print(fruits)

Output:

Apples     50
Bananas    30
Mangoes    70
dtype: int64

Accessing Values in a Series

print(marks["Math"])       # Access by label → 85
print(marks.iloc[0])       # Access by position → 85

Note: marks[0] also works in older versions, but positional access with [] on a labeled Series is deprecated in Pandas 2.x — prefer .iloc for positions.

Series Attributes

Attribute       Description                         Example
series.values   Returns the data as a NumPy array   marks.values
series.index    Returns the index labels            marks.index
series.dtype    Returns the data type               marks.dtype
series.name     Name of the Series                  marks.name
series.size     Number of elements                  marks.size
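These attributes can be checked directly on the marks Series from the earlier example:

```python
import pandas as pd

# Recreate the Series with a custom index from the example above
marks = pd.Series([85, 90, 78], index=["Math", "Science", "English"])

print(marks.values)        # the data as a NumPy array
print(list(marks.index))   # the index labels
print(marks.dtype)         # int64
print(marks.size)          # 3
```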

5. Pandas DataFrame

A DataFrame is the most important and frequently used data structure in Pandas. It is a two-dimensional table — just like a spreadsheet with rows and columns. Each column in a DataFrame is actually a Series.

Imagine a school report card: students are the rows, subjects are the columns, and marks are the values. That is exactly what a DataFrame looks like.

Structure of a DataFrame

  • Rows — each row represents one record (e.g., one student, one transaction).
  • Columns — each column represents one attribute (e.g., name, age, salary).
  • Index — the row label (default: 0, 1, 2 ...).

Visual Representation

    Name    Age   City
0   Alice    25   Delhi
1   Bob      30   Mumbai
2   Carol    22   Chennai

Here, Name, Age, City are columns. The numbers 0, 1, 2 are the index (row labels).


6. Creating DataFrames

Method 1 — From a Dictionary

This is the most common way. Each key becomes a column name, and each list becomes the column values.

import pandas as pd

data = {
    "Name":  ["Alice", "Bob", "Carol"],
    "Age":   [25, 30, 22],
    "City":  ["Delhi", "Mumbai", "Chennai"]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age     City
0  Alice   25    Delhi
1    Bob   30   Mumbai
2  Carol   22  Chennai

Method 2 — From a List of Dictionaries

records = [
    {"Product": "Pen",    "Price": 10},
    {"Product": "Book",   "Price": 50},
    {"Product": "Pencil", "Price": 5}
]

df = pd.DataFrame(records)
print(df)

Output:

  Product  Price
0     Pen     10
1    Book     50
2  Pencil      5

Method 3 — From a List of Lists

rows = [["Alice", 25], ["Bob", 30], ["Carol", 22]]
df = pd.DataFrame(rows, columns=["Name", "Age"])
print(df)

Method 4 — From a NumPy Array

import numpy as np

arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=["Col_A", "Col_B"])
print(df)

Setting a Custom Index

df = pd.DataFrame(data, index=["S1", "S2", "S3"])
print(df)

Output:

      Name  Age     City
S1   Alice   25    Delhi
S2     Bob   30   Mumbai
S3   Carol   22  Chennai

7. Reading Data from Files

In real projects, data is rarely typed manually. It comes from CSV files, Excel sheets, JSON files, or databases. Pandas makes reading these files very straightforward.

Reading a CSV File

CSV (Comma Separated Values) is the most common file format for storing tabular data.

df = pd.read_csv("students.csv")
print(df.head())

Common Parameters for read_csv()

Parameter   Purpose                             Example
sep         Separator character                 sep=";"
header      Row number to use as column names   header=0
index_col   Column to use as row index          index_col="ID"
nrows       Read only first N rows              nrows=100
usecols     Read only specific columns          usecols=["Name","Age"]
skiprows    Skip rows from the top              skiprows=2
na_values   Values to treat as NaN              na_values=["NA","N/A"]
encoding    File encoding                       encoding="utf-8"
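Several of these parameters can be combined in one call. The sketch below reads from an in-memory string via io.StringIO so it runs without a file on disk; with a real file you would pass the filename instead:

```python
import io
import pandas as pd

# A small semicolon-separated CSV held in memory (stand-in for a file on disk)
csv_text = "ID;Name;Age\n1;Alice;25\n2;Bob;N/A\n3;Carol;22\n"

# Combine parameters: custom separator, index column, and custom NaN markers
df = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="ID", na_values=["N/A"])
print(df)
```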

Reading an Excel File

df = pd.read_excel("sales_data.xlsx", sheet_name="Sheet1")
print(df.head())

Reading a JSON File

df = pd.read_json("data.json")
print(df.head())

Reading from a SQL Database

import sqlite3

conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM employees", conn)
print(df.head())

Reading from a URL (Online CSV)

url = "https://example.com/data/sales.csv"
df = pd.read_csv(url)
print(df.head())

8. Writing Data to Files

After processing data, the results need to be saved back to a file. Pandas supports writing to CSV, Excel, JSON, and SQL.

Write to CSV

df.to_csv("output.csv", index=False)

Setting index=False prevents the default row numbers from being saved as a separate column in the file.

Write to Excel

df.to_excel("output.xlsx", sheet_name="Results", index=False)

Write to JSON

df.to_json("output.json", orient="records", indent=2)

Write to SQL

import sqlite3
conn = sqlite3.connect("company.db")
df.to_sql("employees", conn, if_exists="replace", index=False)

Write Multiple Sheets to One Excel File

with pd.ExcelWriter("report.xlsx") as writer:
    df1.to_excel(writer, sheet_name="Sales", index=False)
    df2.to_excel(writer, sheet_name="Expenses", index=False)

9. Inspecting a DataFrame

Before doing any analysis, it is important to understand the shape and quality of the data. Pandas provides several built-in methods for this.

Sample Data Used in This Section

import pandas as pd

data = {
    "Name":   ["Alice", "Bob", "Carol", "David", "Eva"],
    "Age":    [25, 30, 22, 35, 28],
    "Score":  [85, 90, None, 78, 92],
    "City":   ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]
}
df = pd.DataFrame(data)

head() — First N Rows

print(df.head())      # First 5 rows (default)
print(df.head(3))     # First 3 rows

tail() — Last N Rows

print(df.tail())      # Last 5 rows
print(df.tail(2))     # Last 2 rows

shape — Rows and Columns Count

print(df.shape)       # (5, 4) → 5 rows, 4 columns

info() — Column Names, Data Types, Non-Null Counts

print(df.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Age     5 non-null      int64
 2   Score   4 non-null      float64
 3   City    5 non-null      object
dtypes: float64(1), int64(1), object(2)
memory usage: 288.0+ bytes

describe() — Statistical Summary

print(df.describe())

Output:

             Age      Score
count   5.000000   4.000000
mean   28.000000  86.250000
std     4.949747   6.238323
min    22.000000  78.000000
25%    25.000000  83.250000
50%    28.000000  87.500000
75%    30.000000  90.500000
max    35.000000  92.000000

Other Useful Inspection Methods

Method / Attribute   What It Returns
df.columns           List of column names
df.index             Row index labels
df.dtypes            Data type of each column
df.shape             Tuple of (rows, columns)
df.size              Total number of cells (rows × columns)
df.ndim              Number of dimensions (always 2 for a DataFrame)
df.sample(3)         3 random rows
df.count()           Count of non-null values per column
df.nunique()         Number of unique values per column
df.isnull().sum()    Count of missing values per column
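A quick demonstration of nunique() and isnull().sum() on the sample data from this section:

```python
import pandas as pd

df = pd.DataFrame({
    "Name":  ["Alice", "Bob", "Carol", "David", "Eva"],
    "Age":   [25, 30, 22, 35, 28],
    "Score": [85, 90, None, 78, 92],
    "City":  ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]
})

print(df.nunique())        # unique values per column (City has 3)
print(df.isnull().sum())   # missing values per column (Score has 1)
```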

10. Selecting Columns and Rows

Selecting a Single Column

A single column is selected using the column name in square brackets. The result is a Series.

print(df["Name"])

Output:

0    Alice
1      Bob
2    Carol
3    David
4      Eva
Name: Name, dtype: object

Selecting Multiple Columns

Pass a list of column names. The result is a DataFrame.

print(df[["Name", "Age"]])

Selecting Rows by Index Number

print(df[1:3])    # Rows at index 1 and 2 (not 3)

Accessing a Column as an Attribute (Dot Notation)

print(df.Name)    # Same as df["Name"]

Note: Dot notation does not work if the column name has spaces or matches a built-in method name. Using square brackets is always safe.


11. loc and iloc — Label vs Position

Pandas provides two powerful ways to select specific rows and columns:

  • loc — selects by label (column name, index label)
  • iloc — selects by integer position (row number, column number)

Using loc

# Select row with index label 0
print(df.loc[0])

# Select rows 0 to 2, all columns
print(df.loc[0:2])

# Select rows 0 to 2, specific columns
print(df.loc[0:2, ["Name", "Age"]])

# Select a single cell
print(df.loc[1, "City"])     # Output: Mumbai

Using iloc

# Select first row (position 0)
print(df.iloc[0])

# Select rows 0, 1, 2 and first two columns
print(df.iloc[0:3, 0:2])

# Select a single cell by position
print(df.iloc[1, 3])         # Row 1, Column 3

Key Difference — loc vs iloc

Feature            loc                          iloc
Selection method   By label (name)              By integer position
End of range       Inclusive                    Exclusive
Works with         Custom index, column names   Always numeric, 0-based
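The inclusive/exclusive difference is easy to verify on a small DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Carol", "David", "Eva"],
                   "Age":  [25, 30, 22, 35, 28]})

# loc includes BOTH endpoints of the range
print(len(df.loc[0:2]))    # 3 rows (index labels 0, 1, 2)

# iloc excludes the end of the range, like normal Python slicing
print(len(df.iloc[0:2]))   # 2 rows (positions 0, 1)
```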

at and iat — Faster Single Cell Access

print(df.at[2, "Name"])     # Access by label → Carol
print(df.iat[2, 0])         # Access by position → Carol

at and iat are faster than loc and iloc when accessing just one cell.


12. Filtering Data (Conditions)

Filtering means selecting only those rows that satisfy a given condition — for example, showing only students who scored above 80, or listing employees from a specific city.

Basic Filtering

# Students with Score greater than 85
result = df[df["Score"] > 85]
print(result)

Filtering with Text Match

# Rows where City is Mumbai
result = df[df["City"] == "Mumbai"]
print(result)

Multiple Conditions — AND (&)

# Age greater than 25 AND City is Delhi
result = df[(df["Age"] > 25) & (df["City"] == "Delhi")]
print(result)

Multiple Conditions — OR (|)

# City is Delhi OR City is Mumbai
result = df[(df["City"] == "Delhi") | (df["City"] == "Mumbai")]
print(result)

NOT Condition (~)

# Rows where City is NOT Delhi
result = df[~(df["City"] == "Delhi")]
print(result)

Filtering with isin()

isin() checks if a value is present in a given list. Useful for filtering multiple values at once.

# Rows where City is Delhi or Chennai
result = df[df["City"].isin(["Delhi", "Chennai"])]
print(result)

Filtering with between()

# Age between 25 and 30 (inclusive)
result = df[df["Age"].between(25, 30)]
print(result)

query() Method — SQL-like Filtering

query() allows filtering using a readable string expression — similar to writing SQL WHERE conditions.

result = df.query("Age > 25 and City == 'Mumbai'")
print(result)

13. Handling Missing Data

Real-world datasets almost always have missing values — cells that are empty or were not recorded. In Pandas, missing values are represented as NaN (Not a Number). Handling them correctly is a critical step in data cleaning.

Sample Data with Missing Values

import pandas as pd
import numpy as np

data = {
    "Name":   ["Alice", "Bob", "Carol", "David"],
    "Age":    [25, None, 22, 35],
    "Score":  [85, 90, None, 78]
}
df = pd.DataFrame(data)
print(df)

Output:

    Name   Age  Score
0  Alice  25.0   85.0
1    Bob   NaN   90.0
2  Carol  22.0    NaN
3  David  35.0   78.0

Detecting Missing Values

print(df.isnull())          # True where value is missing
print(df.isnull().sum())    # Count of missing values per column
print(df.notnull())         # True where value is present

Dropping Missing Values

# Drop any row that has at least one missing value
df_clean = df.dropna()

# Drop only rows where ALL values are missing
df_clean = df.dropna(how="all")

# Drop rows where specific columns have missing values
df_clean = df.dropna(subset=["Age"])

# Drop columns with missing values
df_clean = df.dropna(axis=1)

Filling Missing Values

# Fill all missing values with 0
df_filled = df.fillna(0)

# Fill missing values in a specific column
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Fill using forward fill (use the previous row's value)
df_ffill = df.ffill()

# Fill using backward fill (use the next row's value)
df_bfill = df.bfill()

Older code often uses df.fillna(method="ffill"); that form is deprecated in Pandas 2.x in favor of .ffill() and .bfill().

Filling with Different Values per Column

df = df.fillna({"Age": df["Age"].median(), "Score": df["Score"].mean()})

Replace Specific Values with NaN

df.replace(-1, np.nan, inplace=True)
df.replace(["N/A", "na", "unknown"], np.nan, inplace=True)

inplace Parameter

By default, Pandas operations return a new DataFrame and do not modify the original. Setting inplace=True modifies the original DataFrame directly without needing to reassign it. Note that modern Pandas style generally favors reassignment (df = df.dropna()) over inplace=True.

df.dropna(inplace=True)    # Modifies df directly

14. Adding and Dropping Columns/Rows

Adding a New Column

df["Grade"] = ["A", "A+", "B", "B+", "A"]
print(df)

Adding a Column Based on Calculation

df["Age_in_5_Years"] = df["Age"] + 5

Adding a Column Using assign()

assign() is a cleaner way to add new columns, especially in method chains.

df = df.assign(Tax = df["Score"] * 0.1)
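Because assign() returns a new DataFrame, several derived columns can be built in one readable chain. A minimal sketch with hypothetical Tax and Net columns:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [85, 90]})

# Each assign() returns a new DataFrame, so calls chain naturally.
# Lambdas receive the intermediate DataFrame, so Net can use Tax.
result = (
    df.assign(Tax=lambda d: d["Score"] * 0.1)
      .assign(Net=lambda d: d["Score"] - d["Tax"])
)
print(result)
```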

Dropping a Column

df = df.drop(columns=["Age_in_5_Years"])

# Or using axis
df = df.drop("Grade", axis=1)

Dropping Multiple Columns

df = df.drop(columns=["Grade", "Tax"])

Dropping a Row by Index

df = df.drop(index=2)         # Drop row at index 2
df = df.drop(index=[0, 3])    # Drop rows at index 0 and 3

Inserting a Column at a Specific Position

# Insert "Grade" column at position 1
df.insert(1, "Grade", ["A", "B", "A", "C", "A"])

Resetting the Index after Drop

df = df.reset_index(drop=True)    # drop=True prevents old index from becoming a column

15. Renaming Columns and Index

Rename Specific Columns

df = df.rename(columns={"Name": "Student_Name", "Age": "Student_Age"})
print(df.columns)

Rename All Columns Using a List

df.columns = ["student", "age", "marks", "location"]

Rename Columns to Lowercase

df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(" ", "_")

Rename the Index

df = df.rename(index={0: "Row1", 1: "Row2"})

Set a Column as Index

df = df.set_index("Name")
print(df)

Reset Index Back to Default

df = df.reset_index()

16. Sorting Data

Sort by a Single Column

# Sort by Age in ascending order (default)
df_sorted = df.sort_values("Age")
print(df_sorted)

# Sort in descending order
df_sorted = df.sort_values("Age", ascending=False)

Sort by Multiple Columns

# Sort by City ascending, then by Age descending
df_sorted = df.sort_values(["City", "Age"], ascending=[True, False])
print(df_sorted)

Sort by Index

df_sorted = df.sort_index()                  # Ascending
df_sorted = df.sort_index(ascending=False)   # Descending

nlargest() and nsmallest()

These methods quickly return the top or bottom N rows based on a column.

# Top 3 highest scores
print(df.nlargest(3, "Score"))

# Bottom 2 youngest students
print(df.nsmallest(2, "Age"))

17. Data Types and Type Conversion

Every column in a DataFrame has a data type (dtype). Common types include int64, float64, object (for text), bool, and datetime64.

Check Data Types

print(df.dtypes)

Convert a Column's Data Type — astype()

# Convert Age from float to int (fails if the column contains NaN —
# fill or drop missing values first)
df["Age"] = df["Age"].astype(int)

# Convert Score to float
df["Score"] = df["Score"].astype(float)

# Convert a column to string
df["ID"] = df["ID"].astype(str)

Convert to Category Type (Memory Efficient)

If a column has a limited number of unique repeated values (like "City" or "Gender"), converting it to category saves memory.

df["City"] = df["City"].astype("category")
print(df["City"].cat.categories)
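The memory saving is easy to measure with memory_usage(deep=True); a sketch using synthetic repeated data:

```python
import pandas as pd

# A column with many repeated values — a good candidate for category dtype
cities = pd.Series(["Delhi", "Mumbai", "Chennai"] * 10_000)

as_object   = cities.memory_usage(deep=True)
as_category = cities.astype("category").memory_usage(deep=True)

# The category version stores each unique string once plus small integer codes
print(as_object, as_category)
```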

Convert String Column to Numeric

# If column has text like "45.5" that should be a number
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")

errors="coerce" converts invalid values to NaN instead of raising an error.
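A quick demonstration of errors="coerce" on a column containing an invalid value:

```python
import pandas as pd

raw = pd.Series(["45.5", "60", "not a number"])

# The invalid third value becomes NaN instead of raising an error
nums = pd.to_numeric(raw, errors="coerce")
print(nums)
```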

Convert to Datetime

df["DOB"] = pd.to_datetime(df["DOB"])
df["JoinDate"] = pd.to_datetime(df["JoinDate"], format="%d-%m-%Y")

18. String Operations

Pandas provides a .str accessor to perform string operations on text columns. These work on the entire column at once — no need for a loop.

Setup

data = {"Name": ["  alice  ", "BOB", "carol smith"], "City": ["Delhi", "mumbai", "CHENNAI"]}
df = pd.DataFrame(data)

Common String Methods

Method             Description                      Example
.str.upper()       Convert to uppercase             df["Name"].str.upper()
.str.lower()       Convert to lowercase             df["City"].str.lower()
.str.title()       Title case (first letter caps)   df["Name"].str.title()
.str.strip()       Remove leading/trailing spaces   df["Name"].str.strip()
.str.len()         Length of each string            df["Name"].str.len()
.str.replace()     Replace a substring              df["Name"].str.replace("Bob", "Robert")
.str.contains()    True if substring found          df["Name"].str.contains("carol")
.str.startswith()  True if starts with              df["Name"].str.startswith("A")
.str.endswith()    True if ends with                df["Name"].str.endswith("h")
.str.split()       Split into list                  df["Name"].str.split(" ")
.str.slice()       Extract part of string           df["Name"].str.slice(0, 3)
.str.count()       Count occurrences                df["Name"].str.count("l")
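Several .str methods can be chained to clean the messy sample data from the setup above:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["  alice  ", "BOB", "carol smith"],
                   "City": ["Delhi", "mumbai", "CHENNAI"]})

# Chain .str methods: strip stray whitespace, then normalize the case
df["Name"] = df["Name"].str.strip().str.title()
df["City"] = df["City"].str.title()
print(df)
```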

Splitting a Column into Multiple Columns

df["Name"] = ["Alice Smith", "Bob Jones", "Carol Brown"]
df[["First", "Last"]] = df["Name"].str.split(" ", expand=True)
print(df)

Filtering with str.contains()

result = df[df["City"].str.contains("Delhi", case=False)]
print(result)

19. Date and Time Operations

Dates and times are common in real-world data — transaction dates, employee joining dates, event timestamps. Pandas has built-in tools to work with them easily.

Creating and Converting to Datetime

import pandas as pd

df = pd.DataFrame({"Date": ["2024-01-15", "2024-03-22", "2024-07-05"]})
df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes)

Extracting Date Components

df["Year"]    = df["Date"].dt.year
df["Month"]   = df["Date"].dt.month
df["Day"]     = df["Date"].dt.day
df["Weekday"] = df["Date"].dt.day_name()
df["Quarter"] = df["Date"].dt.quarter
print(df)

Output:

        Date  Year  Month  Day   Weekday  Quarter
0 2024-01-15  2024      1   15    Monday        1
1 2024-03-22  2024      3   22    Friday        1
2 2024-07-05  2024      7    5    Friday        3

Date Arithmetic

# Add 30 days to a date
df["Due_Date"] = df["Date"] + pd.Timedelta(days=30)

# Difference between two dates
df["Days_Passed"] = pd.Timestamp("2024-12-31") - df["Date"]

# Extract number of days from timedelta
df["Days_Count"] = df["Days_Passed"].dt.days

Filtering by Date

# Rows after March 2024
result = df[df["Date"] > "2024-03-01"]
print(result)

Resampling Time Series Data

Resampling aggregates time-series data over a time period — like grouping daily data into monthly totals.

df.set_index("Date", inplace=True)

# Monthly sum ("ME" = month-end; versions before Pandas 2.2 use "M")
monthly = df.resample("ME").sum(numeric_only=True)
print(monthly)

Date Range Generation

# Generate 10 consecutive business days from Jan 1, 2024
date_range = pd.date_range(start="2024-01-01", periods=10, freq="B")
print(date_range)

20. GroupBy and Aggregation

GroupBy is one of the most powerful features in Pandas. It splits the data into groups based on one or more columns, then applies a function (like sum, mean, count) to each group. This is similar to GROUP BY in SQL.

Think of it like sorting test papers by class and then calculating the average marks for each class.

Sample Data

import pandas as pd

data = {
    "City":    ["Delhi", "Mumbai", "Delhi", "Mumbai", "Chennai", "Delhi"],
    "Product": ["Pen",   "Book",   "Pen",   "Pencil", "Pen",    "Book"],
    "Sales":   [200,      500,      150,     300,      400,       250]
}
df = pd.DataFrame(data)

Basic GroupBy — Total Sales per City

result = df.groupby("City")["Sales"].sum()
print(result)

Output:

City
Chennai    400
Delhi      600
Mumbai     800
Name: Sales, dtype: int64

Multiple Aggregations at Once — agg()

result = df.groupby("City")["Sales"].agg(["sum", "mean", "max", "count"])
print(result)

GroupBy on Multiple Columns

result = df.groupby(["City", "Product"])["Sales"].sum()
print(result)

Named Aggregations (Pandas 0.25+)

result = df.groupby("City").agg(
    Total_Sales = ("Sales", "sum"),
    Avg_Sales   = ("Sales", "mean"),
    Max_Sales   = ("Sales", "max")
)
print(result)

Applying Custom Aggregation Functions

result = df.groupby("City")["Sales"].agg(lambda x: x.max() - x.min())
print(result)

transform() — Keep Original Shape

transform() is like agg(), but it returns a Series of the same length as the original DataFrame — useful for adding a group-level calculation back to the original data.

# Add city-level average sales back to every row
df["City_Avg"] = df.groupby("City")["Sales"].transform("mean")
print(df)

filter() — Keep Groups Meeting a Condition

# Keep only cities with total sales greater than 500
result = df.groupby("City").filter(lambda x: x["Sales"].sum() > 500)
print(result)

21. apply(), map() and Lambda Functions

These are tools for applying custom transformations to columns or rows.

map() — Element-wise on a Series

map() applies a function or dictionary mapping to each element of a Series.

df["Grade"] = df["Score"].map({85: "A", 90: "A+", 78: "B+"})   # values not in the dict become NaN

# Or with a function
df["Level"] = df["Age"].map(lambda x: "Senior" if x > 28 else "Junior")

apply() on a Series

def classify(score):
    if score >= 90:
        return "Excellent"
    elif score >= 75:
        return "Good"
    else:
        return "Average"

df["Result"] = df["Score"].apply(classify)

apply() on a DataFrame Row

By setting axis=1, the function is applied row-by-row, giving access to multiple columns at once.

df["Summary"] = df.apply(
    lambda row: f"{row['Name']} from {row['City']} scored {row['Score']}",
    axis=1
)
print(df["Summary"])

apply() on Entire Columns (axis=0)

# Normalize all numeric columns (subtract min, divide by range)
def normalize(col):
    return (col - col.min()) / (col.max() - col.min())

df[["Age", "Score"]] = df[["Age", "Score"]].apply(normalize)

applymap() / map() on Entire DataFrame (Element-wise)

# In Pandas 2.x use .map() on DataFrame instead of applymap()
df_numeric = df[["Age", "Score"]].map(lambda x: round(x, 2))

Using Lambda with Conditions

df["Tax"] = df["Sales"].apply(lambda x: x * 0.18 if x > 300 else x * 0.05)

22. Merging and Joining DataFrames

Merging combines two DataFrames based on a common column — just like SQL JOIN. This is essential when data is spread across multiple tables and needs to be combined.

Sample DataFrames

students = pd.DataFrame({
    "ID":   [1, 2, 3, 4],
    "Name": ["Alice", "Bob", "Carol", "David"]
})

marks = pd.DataFrame({
    "ID":    [1, 2, 3, 5],
    "Score": [85, 90, 78, 88]
})

Inner Join — Keep Only Matching Rows

result = pd.merge(students, marks, on="ID", how="inner")
print(result)

Output:

   ID   Name  Score
0   1  Alice     85
1   2    Bob     90
2   3  Carol     78

David (ID=4) and the record with ID=5 are excluded because they don't match in both tables.

Left Join — Keep All Rows from Left Table

result = pd.merge(students, marks, on="ID", how="left")
print(result)

Output: David appears with NaN for Score since no matching row exists in marks.

Right Join — Keep All Rows from Right Table

result = pd.merge(students, marks, on="ID", how="right")

Outer Join — Keep All Rows from Both Tables

result = pd.merge(students, marks, on="ID", how="outer")
print(result)

Merging on Different Column Names

result = pd.merge(students, marks, left_on="ID", right_on="StudentID")   # illustrative — assumes the right table names its key column StudentID

Merging on Index

result = pd.merge(students, marks, left_index=True, right_index=True)

Join Types Summary

Join Type   Result
inner       Only rows that match in BOTH tables
left        All rows from LEFT table + matching rows from right
right       All rows from RIGHT table + matching rows from left
outer       All rows from BOTH tables
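For auditing a join, merge() accepts indicator=True, which adds a _merge column recording where each row came from. A sketch using the students and marks tables above:

```python
import pandas as pd

students = pd.DataFrame({"ID": [1, 2, 3, 4],
                         "Name": ["Alice", "Bob", "Carol", "David"]})
marks = pd.DataFrame({"ID": [1, 2, 3, 5],
                      "Score": [85, 90, 78, 88]})

# _merge is "both", "left_only", or "right_only" for each row
result = pd.merge(students, marks, on="ID", how="outer", indicator=True)
print(result)
```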

23. Concatenating DataFrames

While merge() combines DataFrames based on a key column, concat() stacks DataFrames either vertically (one on top of another) or horizontally (side by side).

Concatenate Vertically (Stacking Rows)

df1 = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [85, 90]})
df2 = pd.DataFrame({"Name": ["Carol", "David"], "Score": [78, 95]})

result = pd.concat([df1, df2], ignore_index=True)
print(result)

Output:

    Name  Score
0  Alice     85
1    Bob     90
2  Carol     78
3  David     95

ignore_index=True resets the index in the combined result.

Concatenate Horizontally (Stacking Columns)

df_names  = pd.DataFrame({"Name": ["Alice", "Bob"]})
df_scores = pd.DataFrame({"Score": [85, 90]})

result = pd.concat([df_names, df_scores], axis=1)
print(result)

Concatenate with Keys (Track Source)

result = pd.concat([df1, df2], keys=["Group_A", "Group_B"])
print(result)

Difference between concat and merge

Feature                  concat                              merge
Purpose                  Stack DataFrames together           Combine based on a common column
Requires common column   No                                  Yes
Direction                Rows (axis=0) or columns (axis=1)   Always on columns

24. Pivot Tables

A pivot table reorganizes and summarizes data by converting unique values of one column into new columns. This is the same concept as Excel's pivot table feature.

Sample Data

data = {
    "City":    ["Delhi", "Delhi", "Mumbai", "Mumbai", "Chennai"],
    "Product": ["Pen",   "Book",  "Pen",   "Book",   "Pen"],
    "Sales":   [200,      300,     400,     500,      150]
}
df = pd.DataFrame(data)

Basic Pivot Table

pivot = pd.pivot_table(df, values="Sales", index="City", columns="Product", aggfunc="sum")
print(pivot)

Output:

Product   Book  Pen
City
Chennai    NaN  150
Delhi    300.0  200
Mumbai   500.0  400

(The Book column prints as floats because it contains a NaN.)

With Fill Value and Margins

pivot = pd.pivot_table(
    df,
    values="Sales",
    index="City",
    columns="Product",
    aggfunc="sum",
    fill_value=0,
    margins=True         # Adds row and column totals
)
print(pivot)

Multiple Aggregation Functions

pivot = pd.pivot_table(df, values="Sales", index="City", aggfunc=["sum", "mean"])
print(pivot)

pivot() — Simple Reshape (No Aggregation)

pivot() is used when there is no aggregation needed — just reshaping the data.

result = df.pivot(index="City", columns="Product", values="Sales")
print(result)

25. Cross Tabulation (crosstab)

Cross tabulation (or crosstab) computes a frequency table of two or more variables. It is commonly used in survey analysis and categorical data exploration.

import pandas as pd

data = {
    "Gender":  ["M", "F", "M", "F", "M", "F"],
    "City":    ["Delhi", "Mumbai", "Delhi", "Delhi", "Mumbai", "Chennai"]
}
df = pd.DataFrame(data)

result = pd.crosstab(df["City"], df["Gender"])
print(result)

Output:

Gender   F  M
City
Chennai  1  0
Delhi    1  2
Mumbai   1  1

Crosstab with Percentages

result = pd.crosstab(df["City"], df["Gender"], normalize="index")
print(result)

Crosstab with Margins

result = pd.crosstab(df["City"], df["Gender"], margins=True, margins_name="Total")
print(result)

26. MultiIndex (Hierarchical Index)

A MultiIndex (also called hierarchical index) allows a DataFrame to have multiple levels of row or column labels. This is useful when data is naturally hierarchical — like sales by year and month, or students grouped by class and section.

Creating a MultiIndex DataFrame

import pandas as pd

data = {
    "Sales": [200, 150, 300, 250, 180, 320]
}
index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2023", "Q3"),
     ("2024", "Q1"), ("2024", "Q2"), ("2024", "Q3")],
    names=["Year", "Quarter"]
)
df = pd.DataFrame(data, index=index)
print(df)

Output:

             Sales
Year Quarter
2023 Q1        200
     Q2        150
     Q3        300
2024 Q1        250
     Q2        180
     Q3        320

Accessing Data in MultiIndex

# All rows for year 2023
print(df.loc["2023"])

# Specific quarter in a year
print(df.loc[("2023", "Q2")])

Unstacking — Convert Row Index to Columns

print(df.unstack())

Stacking — Convert Column Headers to Row Index

print(df.unstack().stack())

Reset MultiIndex

df_flat = df.reset_index()
print(df_flat)

Creating MultiIndex from GroupBy

sales_data = pd.DataFrame({
    "Year":    [2023, 2023, 2024, 2024],
    "Product": ["Pen", "Book", "Pen", "Book"],
    "Sales":   [500, 700, 600, 800]
})

result = sales_data.groupby(["Year", "Product"])["Sales"].sum()
print(result)

27. Window Functions — Rolling & Expanding

Window functions perform calculations over a sliding or expanding window of rows. They are widely used in time-series analysis — for example, calculating a 7-day moving average of sales or stock prices.

Rolling — Fixed-Size Sliding Window

import pandas as pd

df = pd.DataFrame({
    "Day":   range(1, 8),
    "Sales": [100, 150, 130, 200, 180, 210, 170]
})

# 3-day rolling average
df["Rolling_Avg"] = df["Sales"].rolling(window=3).mean()
print(df)

Output:

   Day  Sales  Rolling_Avg
0    1    100          NaN
1    2    150          NaN
2    3    130   126.666667
3    4    200   160.000000
4    5    180   170.000000
5    6    210   196.666667
6    7    170   186.666667

The first two rows show NaN because a full window of 3 values is not yet available at those positions.
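
Those leading NaN values can be avoided with the min_periods parameter, which lets the window produce a result as soon as that many rows are available. A sketch using the same Sales values:

```python
import pandas as pd

sales = pd.Series([100, 150, 130, 200, 180, 210, 170])

# Average over up to 3 rows, but start as soon as 1 row is available
avg = sales.rolling(window=3, min_periods=1).mean()
print(avg.tolist())
# first value is 100.0 (one row), second is 125.0 (two rows), then full windows
```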

Rolling Sum and Min

df["Rolling_Sum"] = df["Sales"].rolling(window=3).sum()
df["Rolling_Min"] = df["Sales"].rolling(window=3).min()

Expanding — Growing Window

An expanding window grows from the first row up to the current row — useful for running totals or cumulative averages.

df["Cumulative_Avg"] = df["Sales"].expanding().mean()
df["Cumulative_Sum"] = df["Sales"].expanding().sum()
print(df[["Day", "Sales", "Cumulative_Avg", "Cumulative_Sum"]])

Exponential Weighted Moving Average (EWM)

EWM gives more weight to recent values — useful in finance and trend analysis.

df["EWM_Avg"] = df["Sales"].ewm(span=3).mean()
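
As a side-by-side sketch, the three window types can be compared on the same Sales values — rolling uses a fixed window, expanding uses all rows so far, and EWM weights recent rows more heavily:

```python
import pandas as pd

sales = pd.Series([100, 150, 130, 200, 180, 210, 170])

summary = pd.DataFrame({
    "Sales":     sales,
    "Rolling":   sales.rolling(window=3).mean(),  # fixed 3-row window
    "Expanding": sales.expanding().mean(),        # all rows so far
    "EWM":       sales.ewm(span=3).mean(),        # recent rows weighted more
})
print(summary)
```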

28. Handling Duplicate Data

Duplicate rows occur when the same record appears more than once. They can skew analysis results and must be identified and handled properly.

Sample Data

data = {
    "Name":  ["Alice", "Bob", "Alice", "Carol", "Bob"],
    "Score": [85, 90, 85, 78, 90]
}
df = pd.DataFrame(data)

Detect Duplicates

print(df.duplicated())               # True for duplicate rows
print(df.duplicated().sum())         # Total count of duplicates
print(df[df.duplicated()])           # Show duplicate rows

Detect Duplicates Based on Specific Columns

print(df.duplicated(subset=["Name"]))     # Duplicates only in Name column

Remove Duplicates

# Keep first occurrence, remove later duplicates
df_clean = df.drop_duplicates()

# Keep last occurrence
df_clean = df.drop_duplicates(keep="last")

# Remove all rows that are duplicated (keep none)
df_clean = df.drop_duplicates(keep=False)

Remove Based on Specific Columns

df_clean = df.drop_duplicates(subset=["Name"], keep="first")
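
A common pattern is keeping the "best" row per key: sort first, then drop duplicates. This sketch keeps each student's highest score (column names follow the sample data above):

```python
import pandas as pd

df = pd.DataFrame({
    "Name":  ["Alice", "Bob", "Alice", "Carol"],
    "Score": [85, 90, 92, 78]
})

# Sort so the highest Score comes first, then keep the first row per Name
best = (df.sort_values("Score", ascending=False)
          .drop_duplicates(subset=["Name"], keep="first")
          .sort_values("Name")
          .reset_index(drop=True))
print(best)   # Alice keeps 92, Bob 90, Carol 78
```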

29. Reshaping Data — melt and stack

Sometimes data is in a wide format (many columns) when it needs to be in a long format (more rows), or vice versa. Pandas provides melt(), stack(), and unstack() for this.

melt() — Wide to Long Format

melt() "unpivots" a DataFrame — turning column headers into row values. This is the opposite of a pivot table.

df = pd.DataFrame({
    "Student": ["Alice", "Bob"],
    "Math":    [85, 90],
    "Science": [78, 88],
    "English": [92, 75]
})
print(df)

Wide Format:

  Student  Math  Science  English
0   Alice    85       78       92
1     Bob    90       88       75

df_melted = df.melt(id_vars="Student", var_name="Subject", value_name="Marks")
print(df_melted)

Long Format (after melt):

  Student  Subject  Marks
0   Alice     Math     85
1     Bob     Math     90
2   Alice  Science     78
3     Bob  Science     88
4   Alice  English     92
5     Bob  English     75
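
The melt can be reversed with pivot(), which turns the Subject values back into column headers. A sketch that rebuilds the melted frame above and restores the wide layout:

```python
import pandas as pd

df_melted = pd.DataFrame({
    "Student": ["Alice", "Bob"] * 3,
    "Subject": ["Math", "Math", "Science", "Science", "English", "English"],
    "Marks":   [85, 90, 78, 88, 92, 75]
})

# pivot() is the inverse of melt(): Subject values become columns again
df_wide = df_melted.pivot(index="Student", columns="Subject", values="Marks")
print(df_wide)
```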

stack() — Column Labels Become Row Index

df_stacked = df.set_index("Student").stack()
print(df_stacked)

unstack() — Row Index Becomes Column Labels

df_unstacked = df_stacked.unstack()
print(df_unstacked)

30. Performance Tips

As datasets grow larger, writing efficient Pandas code becomes important. Here are practical techniques to improve performance.

Tip 1 — Use Appropriate Data Types

Storing text columns with few unique values as the category type instead of object can reduce memory use significantly.

df["City"] = df["City"].astype("category")
df["Gender"] = df["Gender"].astype("category")
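
To see the effect, compare memory usage before and after the conversion. A sketch with synthetic data — a long column holding only three unique city names:

```python
import pandas as pd

# A column with many rows but only 3 unique values
cities = pd.Series(["Delhi", "Mumbai", "Chennai"] * 10_000)

as_object   = cities.memory_usage(deep=True)
as_category = cities.astype("category").memory_usage(deep=True)

# category stores each unique string once plus small integer codes,
# so it should use far less memory here
print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```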

Tip 2 — Use vectorized operations instead of loops

Pandas operations on entire columns are much faster than Python loops.

# Slow way — using a loop
for i in range(len(df)):
    df.loc[i, "Tax"] = df.loc[i, "Sales"] * 0.18

# Fast way — vectorized
df["Tax"] = df["Sales"] * 0.18
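
Conditional logic vectorizes too — numpy.where replaces a row-by-row if/else in one call (a sketch; the Discount column is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Sales": [100, 250, 80, 400]})

# 10% discount on sales above 200, none otherwise — no loop needed
df["Discount"] = np.where(df["Sales"] > 200, df["Sales"] * 0.10, 0.0)
print(df["Discount"].tolist())   # [0.0, 25.0, 0.0, 40.0]
```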

Tip 3 — Read Only Necessary Columns

df = pd.read_csv("large_file.csv", usecols=["Name", "Sales", "City"])

Tip 4 — Check Memory Usage

print(df.memory_usage(deep=True))
print(df.memory_usage(deep=True).sum() / 1024 ** 2, "MB")

Tip 5 — Use query() for Faster Filtering

# query() can be faster than boolean indexing on large DataFrames
# (it uses the numexpr engine when available)
df.query("Age > 25 and City == 'Delhi'")

Tip 6 — Avoid iterrows() — Use apply() or Vectorization

# Avoid — very slow
for index, row in df.iterrows():
    print(row["Name"])

# Better — apply() avoids the per-row overhead of iterrows()
df["Name"].apply(print)

Tip 7 — Read Large CSV Files in Chunks

chunk_size = 10000
for chunk in pd.read_csv("huge_file.csv", chunksize=chunk_size):
    process(chunk)   # Process each chunk separately
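
Chunks can be aggregated incrementally without ever holding the full file in memory. A self-contained sketch using an in-memory CSV (io.StringIO) as a stand-in for a real huge file:

```python
import io
import pandas as pd

# Stand-in for a huge file on disk
csv_data = io.StringIO(
    "City,Sales\nDelhi,100\nMumbai,200\nDelhi,300\nMumbai,400\nDelhi,500\n"
)

totals = pd.Series(dtype="int64")
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Combine each chunk's partial sums into the running totals
    totals = totals.add(chunk.groupby("City")["Sales"].sum(), fill_value=0)

print(totals)   # Delhi 900, Mumbai 600
```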

Tip 8 — Downcast Numeric Types

df["Age"]   = pd.to_numeric(df["Age"],   downcast="integer")
df["Score"] = pd.to_numeric(df["Score"], downcast="float")

Tip 9 — Use eval() for Complex Expressions

# Can be faster than standard column operations on large DataFrames
df.eval("Total = Sales + Tax", inplace=True)

31. Quick Reference Cheat Sheet

Creating Data

Task                    | Code
Create Series           | pd.Series([1, 2, 3])
Create DataFrame        | pd.DataFrame({"A": [1, 2], "B": [3, 4]})
Read CSV                | pd.read_csv("file.csv")
Read Excel              | pd.read_excel("file.xlsx")
Write CSV               | df.to_csv("out.csv", index=False)

Inspection

Task                    | Code
First 5 rows            | df.head()
Shape                   | df.shape
Info                    | df.info()
Statistics              | df.describe()
Data types              | df.dtypes
Missing counts          | df.isnull().sum()

Selection

Task                    | Code
Select column           | df["Name"]
Select multiple columns | df[["Name", "Age"]]
Select by label         | df.loc[0, "Name"]
Select by position      | df.iloc[0, 1]
Filter rows             | df[df["Age"] > 25]

Cleaning

Task                    | Code
Drop NA rows            | df.dropna()
Fill NA                 | df.fillna(0)
Drop duplicates         | df.drop_duplicates()
Rename column           | df.rename(columns={"old": "new"})
Change type             | df["A"].astype(int)

Transformation

Task                    | Code
Sort rows               | df.sort_values("Col")
Group and sum           | df.groupby("City")["Sales"].sum()
Apply function          | df["Col"].apply(func)
Pivot table             | pd.pivot_table(df, values=..., index=..., columns=...)
Merge                   | pd.merge(df1, df2, on="ID")
Concatenate             | pd.concat([df1, df2])
Melt (wide to long)     | df.melt(id_vars="Name")
Rolling average         | df["Col"].rolling(3).mean()

String Operations

Task                    | Code
Lowercase               | df["Name"].str.lower()
Strip spaces            | df["Name"].str.strip()
Contains                | df["Name"].str.contains("Al")
Replace                 | df["Name"].str.replace("Bob", "Robert")
Split                   | df["Name"].str.split(" ", expand=True)

DateTime Operations

Task                    | Code
Convert to datetime     | pd.to_datetime(df["Date"])
Extract year            | df["Date"].dt.year
Extract month           | df["Date"].dt.month
Day name                | df["Date"].dt.day_name()
Date range              | pd.date_range("2024-01-01", periods=10)

Summary

This guide has covered the full journey of working with Pandas — starting from installing the library and understanding its two core data structures (Series and DataFrame), all the way to advanced techniques like MultiIndex, window functions, reshaping with melt/stack, and performance optimization.

Here is a recap of the key stages in any typical data analysis workflow using Pandas:

  1. Load — Read data from CSV, Excel, JSON, or a database.
  2. Inspect — Use head(), info(), describe() to understand the dataset.
  3. Clean — Handle missing values, fix data types, remove duplicates.
  4. Transform — Filter, sort, group, merge, pivot, and reshape as needed.
  5. Analyze — Aggregate, apply functions, compute statistics.
  6. Export — Save results to CSV, Excel, or a database.
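
The six stages above can be sketched end-to-end in a few lines (the column names are hypothetical, and an in-memory CSV stands in for a real file in the load step):

```python
import io
import pandas as pd

# 1. Load — an in-memory CSV stands in for read_csv("sales.csv")
raw = io.StringIO("Name,City,Sales\nAlice,Delhi,100\nBob,Mumbai,\nAlice,Delhi,100\n")
df = pd.read_csv(raw)

# 2. Inspect
df.info()

# 3. Clean — fill missing Sales, drop exact duplicate rows
df["Sales"] = df["Sales"].fillna(0)
df = df.drop_duplicates()

# 4–5. Transform and analyze — total sales per city
summary = df.groupby("City")["Sales"].sum().reset_index()

# 6. Export — to_csv() without a path returns the CSV as a string
csv_out = summary.to_csv(index=False)
print(csv_out)
```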

Mastering Pandas opens the door to advanced areas like data visualization (Matplotlib, Seaborn), machine learning (Scikit-learn), and big data processing (Dask, PySpark). The consistent practice of applying these concepts on real datasets is the fastest path to proficiency.
