Python Pandas
Pandas is one of the most powerful and widely used Python libraries for working with data. Whether the data comes from a spreadsheet, a database, or a CSV file, Pandas provides the tools to read, clean, explore, and transform it quickly. This guide covers every important concept step by step — from installing Pandas to advanced techniques used in real-world data analysis.
Think of Pandas like a supercharged version of Microsoft Excel that works inside Python. Instead of clicking and dragging cells, everything is done with simple lines of code.
Table of Contents
- What is Pandas?
- Installation & Setup
- Importing Pandas
- Pandas Series
- Pandas DataFrame
- Creating DataFrames
- Reading Data from Files
- Writing Data to Files
- Inspecting a DataFrame
- Selecting Columns and Rows
- loc and iloc — Label vs Position
- Filtering Data (Conditions)
- Handling Missing Data
- Adding and Dropping Columns/Rows
- Renaming Columns and Index
- Sorting Data
- Data Types and Type Conversion
- String Operations
- Date and Time Operations
- GroupBy and Aggregation
- apply(), map() and Lambda Functions
- Merging and Joining DataFrames
- Concatenating DataFrames
- Pivot Tables
- Cross Tabulation (crosstab)
- MultiIndex (Hierarchical Index)
- Window Functions — Rolling & Expanding
- Handling Duplicate Data
- Reshaping Data — melt and stack
- Performance Tips
- Quick Reference Cheat Sheet
1. What is Pandas?
Pandas is an open-source Python library built for data manipulation and analysis. The name "Pandas" comes from "Panel Data" — a term used in statistics and economics. It was created by Wes McKinney in 2008 and has since become the standard tool for working with structured (tabular) data in Python.
Why Use Pandas?
- Read data from CSV, Excel, JSON, SQL databases, and more.
- Clean messy data — handle missing values, fix data types, remove duplicates.
- Filter, sort, and reshape data with just a few lines of code.
- Perform calculations and statistical analysis easily.
- Works seamlessly with other Python libraries like NumPy, Matplotlib, and Scikit-learn.
Where is Pandas Used?
- Data Science & Machine Learning — preparing datasets before building models.
- Finance — analyzing stock prices, portfolio data, and financial reports.
- Business Analytics — processing sales data, customer records, and KPI reports.
- Research — working with experimental data, survey results, and scientific measurements.
2. Installation & Setup
Before using Pandas, it must be installed. Python's package manager pip handles this.
Install Using pip
pip install pandas
Install Along with NumPy and OpenPyXL (Recommended)
pip install pandas numpy openpyxl
numpy is the numerical computing library that Pandas depends on. openpyxl is needed to read and write Excel files.
Verify the Installation
import pandas as pd
print(pd.__version__)
Output:
2.2.1
If a version number appears, Pandas is installed correctly.
Using Pandas in Jupyter Notebook or Google Colab
If working in Jupyter Notebook or Google Colab, Pandas is already available. Just import it at the top of the notebook.
3. Importing Pandas
Every Pandas program starts by importing the library. The standard convention is to import Pandas with the alias pd. This is a widely accepted practice so that Pandas functions are accessed as pd.something instead of pandas.something.
import pandas as pd
This single line is always written at the top of any Python script that uses Pandas.
4. Pandas Series
A Series is the simplest data structure in Pandas. Think of it as a single column of data — like one column in a spreadsheet. It is a one-dimensional array that can hold any type of data: numbers, text, dates, and so on.
Every value in a Series has a label called an index. By default, the index is a number starting from 0.
Creating a Series from a List
import pandas as pd
marks = pd.Series([85, 90, 78, 95, 88])
print(marks)
Output:
0 85
1 90
2 78
3 95
4 88
dtype: int64
The left column (0, 1, 2, 3, 4) is the index. The right column contains the actual values.
Creating a Series with a Custom Index
marks = pd.Series([85, 90, 78], index=["Math", "Science", "English"])
print(marks)
Output:
Math 85
Science 90
English 78
dtype: int64
Creating a Series from a Dictionary
data = {"Apples": 50, "Bananas": 30, "Mangoes": 70}
fruits = pd.Series(data)
print(fruits)
Output:
Apples 50
Bananas 30
Mangoes 70
dtype: int64
Accessing Values in a Series
print(marks["Math"]) # Access by label → 85
print(marks.iloc[0]) # Access by position → 85 (plain marks[0] is deprecated when the index uses labels)
Series Attributes
| Attribute | Description | Example |
|---|---|---|
| series.values | Returns the data as a NumPy array | marks.values |
| series.index | Returns the index labels | marks.index |
| series.dtype | Returns data type | marks.dtype |
| series.name | Name of the Series | marks.name |
| series.size | Number of elements | marks.size |
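As a quick check, the attributes above can be inspected on the marks Series from earlier (re-created here, with an assumed name "Marks", so the snippet runs on its own):

```python
import pandas as pd

marks = pd.Series([85, 90, 78], index=["Math", "Science", "English"], name="Marks")

print(marks.size)          # 3 elements
print(marks.dtype)         # int64
print(list(marks.index))   # ['Math', 'Science', 'English']
print(marks.name)          # Marks
print(type(marks.values))  # a NumPy ndarray
```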
5. Pandas DataFrame
A DataFrame is the most important and frequently used data structure in Pandas. It is a two-dimensional table — just like a spreadsheet with rows and columns. Each column in a DataFrame is actually a Series.
Imagine a school report card: students are the rows, subjects are the columns, and marks are the values. That is exactly what a DataFrame looks like.
Structure of a DataFrame
- Rows — each row represents one record (e.g., one student, one transaction).
- Columns — each column represents one attribute (e.g., name, age, salary).
- Index — the row label (default: 0, 1, 2 ...).
Visual Representation
Name Age City
0 Alice 25 Delhi
1 Bob 30 Mumbai
2 Carol 22 Chennai
Here, Name, Age, City are columns. The numbers 0, 1, 2 are the index (row labels).
6. Creating DataFrames
Method 1 — From a Dictionary
This is the most common way. Each key becomes a column name, and each list becomes the column values.
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Carol"],
"Age": [25, 30, 22],
"City": ["Delhi", "Mumbai", "Chennai"]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 Delhi
1 Bob 30 Mumbai
2 Carol 22 Chennai
Method 2 — From a List of Dictionaries
records = [
{"Product": "Pen", "Price": 10},
{"Product": "Book", "Price": 50},
{"Product": "Pencil", "Price": 5}
]
df = pd.DataFrame(records)
print(df)
Output:
Product Price
0 Pen 10
1 Book 50
2 Pencil 5
Method 3 — From a List of Lists
rows = [["Alice", 25], ["Bob", 30], ["Carol", 22]]
df = pd.DataFrame(rows, columns=["Name", "Age"])
print(df)
Method 4 — From a NumPy Array
import numpy as np
arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=["Col_A", "Col_B"])
print(df)
Setting a Custom Index
df = pd.DataFrame(data, index=["S1", "S2", "S3"])
print(df)
Output:
Name Age City
S1 Alice 25 Delhi
S2 Bob 30 Mumbai
S3 Carol 22 Chennai
7. Reading Data from Files
In real projects, data is rarely typed manually. It comes from CSV files, Excel sheets, JSON files, or databases. Pandas makes reading these files very straightforward.
Reading a CSV File
CSV (Comma Separated Values) is the most common file format for storing tabular data.
df = pd.read_csv("students.csv")
print(df.head())
Common Parameters for read_csv()
| Parameter | Purpose | Example |
|---|---|---|
| sep | Separator character | sep=";" |
| header | Row number to use as column names | header=0 |
| index_col | Column to use as row index | index_col="ID" |
| nrows | Read only first N rows | nrows=100 |
| usecols | Read only specific columns | usecols=["Name","Age"] |
| skiprows | Skip rows from the top | skiprows=2 |
| na_values | Values to treat as NaN | na_values=["NA","N/A"] |
| encoding | File encoding | encoding="utf-8" |
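These parameters can be tried without a real file by reading from an in-memory string (io.StringIO stands in for an open file; the column names and values here are made up for illustration):

```python
import pandas as pd
from io import StringIO

# A small semicolon-separated "file" with an NA placeholder
csv_text = "ID;Name;Age\n1;Alice;25\n2;Bob;NA\n3;Carol;22\n"

df = pd.read_csv(
    StringIO(csv_text),
    sep=";",           # non-comma separator
    index_col="ID",    # use the ID column as the row index
    na_values=["NA"],  # treat the string "NA" as missing
)
print(df)
```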
Reading an Excel File
df = pd.read_excel("sales_data.xlsx", sheet_name="Sheet1")
print(df.head())
Reading a JSON File
df = pd.read_json("data.json")
print(df.head())
Reading from a SQL Database
import sqlite3
conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM employees", conn)
print(df.head())
Reading from a URL (Online CSV)
url = "https://example.com/data/sales.csv"
df = pd.read_csv(url)
print(df.head())
8. Writing Data to Files
After processing data, the results need to be saved back to a file. Pandas supports writing to CSV, Excel, JSON, and SQL.
Write to CSV
df.to_csv("output.csv", index=False)
Setting index=False prevents the default row numbers from being saved as a separate column in the file.
Write to Excel
df.to_excel("output.xlsx", sheet_name="Results", index=False)
Write to JSON
df.to_json("output.json", orient="records", indent=2)
Write to SQL
import sqlite3
conn = sqlite3.connect("company.db")
df.to_sql("employees", conn, if_exists="replace", index=False)
Write Multiple Sheets to One Excel File
with pd.ExcelWriter("report.xlsx") as writer:
df1.to_excel(writer, sheet_name="Sales", index=False)
df2.to_excel(writer, sheet_name="Expenses", index=False)
9. Inspecting a DataFrame
Before doing any analysis, it is important to understand the shape and quality of the data. Pandas provides several built-in methods for this.
Sample Data Used in This Section
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Carol", "David", "Eva"],
"Age": [25, 30, 22, 35, 28],
"Score": [85, 90, None, 78, 92],
"City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]
}
df = pd.DataFrame(data)
head() — First N Rows
print(df.head()) # First 5 rows (default)
print(df.head(3)) # First 3 rows
tail() — Last N Rows
print(df.tail()) # Last 5 rows
print(df.tail(2)) # Last 2 rows
shape — Rows and Columns Count
print(df.shape) # (5, 4) → 5 rows, 4 columns
info() — Column Names, Data Types, Non-Null Counts
df.info() # info() prints its report directly, so no print() is needed
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Age 5 non-null int64
2 Score 4 non-null float64
3 City 5 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 288.0+ bytes
describe() — Statistical Summary
print(df.describe())
Output:
Age Score
count 5.0 4.000000
mean 28.0 86.250000
std 4.949747 6.238322
min 22.0 78.000000
25% 25.0 83.250000
50% 28.0 87.500000
75% 30.0 90.500000
max 35.0 92.000000
Other Useful Inspection Methods
| Method / Attribute | What It Returns |
|---|---|
| df.columns | List of column names |
| df.index | Row index labels |
| df.dtypes | Data type of each column |
| df.shape | Tuple of (rows, columns) |
| df.size | Total number of cells (rows × columns) |
| df.ndim | Number of dimensions (always 2 for DataFrame) |
| df.sample(3) | Random 3 rows |
| df.count() | Count of non-null values per column |
| df.nunique() | Number of unique values per column |
| df.isnull().sum() | Count of missing values per column |
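Two of these combine into a quick data-quality overview. A minimal sketch on the section's sample data (re-created so the snippet is self-contained):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "David", "Eva"],
    "Age": [25, 30, 22, 35, 28],
    "Score": [85, 90, None, 78, 92],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

print(df.nunique())       # unique values per column (City has 3)
print(df.isnull().sum())  # missing values per column (Score has 1)
print(df.shape, df.size)  # (5, 4) and 20 cells
```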
10. Selecting Columns and Rows
Selecting a Single Column
A single column is selected using the column name in square brackets. The result is a Series.
print(df["Name"])
Output:
0 Alice
1 Bob
2 Carol
3 David
4 Eva
Name: Name, dtype: object
Selecting Multiple Columns
Pass a list of column names. The result is a DataFrame.
print(df[["Name", "Age"]])
Selecting Rows by Index Number
print(df[1:3]) # Rows at index 1 and 2 (not 3)
Accessing a Column as an Attribute (Dot Notation)
print(df.Name) # Same as df["Name"]
Note: Dot notation does not work if the column name has spaces or matches a built-in method name. Using square brackets is always safe.
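A minimal sketch of where dot notation breaks down (the column names here are invented for illustration): a column called count collides with the built-in DataFrame.count() method, and a name containing a space cannot be an attribute at all.

```python
import pandas as pd

df = pd.DataFrame({"count": [5, 7], "Total Score": [85, 90]})

# df.count refers to the DataFrame.count() method, not the column:
print(callable(df.count))       # True, this is the method

# Square brackets always mean "the column":
print(df["count"].tolist())     # [5, 7]
print(df["Total Score"].sum())  # 175 (df.Total Score would be a SyntaxError)
```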
11. loc and iloc — Label vs Position
Pandas provides two powerful ways to select specific rows and columns:
- loc — selects by label (column name, index label)
- iloc — selects by integer position (row number, column number)
Using loc
# Select row with index label 0
print(df.loc[0])
# Select rows 0 to 2, all columns
print(df.loc[0:2])
# Select rows 0 to 2, specific columns
print(df.loc[0:2, ["Name", "Age"]])
# Select a single cell
print(df.loc[1, "City"]) # Output: Mumbai
Using iloc
# Select first row (position 0)
print(df.iloc[0])
# Select rows 0, 1, 2 and first two columns
print(df.iloc[0:3, 0:2])
# Select a single cell by position
print(df.iloc[1, 3]) # Row 1, Column 3
Key Difference — loc vs iloc
| Feature | loc | iloc |
|---|---|---|
| Selection method | By label (name) | By number (position) |
| End of range | Inclusive | Exclusive |
| Works with | Custom index, column names | Always numeric 0-based |
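The inclusive-vs-exclusive difference in the table is easy to verify directly. A short sketch (DataFrame rebuilt here so it runs standalone):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "David", "Eva"],
    "Age": [25, 30, 22, 35, 28],
})

# loc slices by label and INCLUDES the end label → rows 0, 1 and 2
print(len(df.loc[0:2]))   # 3

# iloc slices by position and EXCLUDES the end position → rows 0 and 1
print(len(df.iloc[0:2]))  # 2
```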
at and iat — Faster Single Cell Access
print(df.at[2, "Name"]) # Access by label → Carol
print(df.iat[2, 0]) # Access by position → Carol
at and iat are faster than loc and iloc when accessing just one cell.
12. Filtering Data (Conditions)
Filtering means selecting only those rows that satisfy a given condition — for example, showing only students who scored above 80, or listing employees from a specific city.
Basic Filtering
# Students with Score greater than 85
result = df[df["Score"] > 85]
print(result)
Filtering with Text Match
# Rows where City is Mumbai
result = df[df["City"] == "Mumbai"]
print(result)
Multiple Conditions — AND (&)
# Age greater than 25 AND City is Delhi
result = df[(df["Age"] > 25) & (df["City"] == "Delhi")]
print(result)
Multiple Conditions — OR (|)
# City is Delhi OR City is Mumbai
result = df[(df["City"] == "Delhi") | (df["City"] == "Mumbai")]
print(result)
NOT Condition (~)
# Rows where City is NOT Delhi
result = df[~(df["City"] == "Delhi")]
print(result)
Filtering with isin()
isin() checks if a value is present in a given list. Useful for filtering multiple values at once.
# Rows where City is Delhi or Chennai
result = df[df["City"].isin(["Delhi", "Chennai"])]
print(result)
Filtering with between()
# Age between 25 and 30 (inclusive)
result = df[df["Age"].between(25, 30)]
print(result)
query() Method — SQL-like Filtering
query() allows filtering using a readable string expression — similar to writing SQL WHERE conditions.
result = df.query("Age > 25 and City == 'Mumbai'")
print(result)
13. Handling Missing Data
Real-world datasets almost always have missing values — cells that are empty or were not recorded. In Pandas, missing values are represented as NaN (Not a Number). Handling them correctly is a critical step in data cleaning.
Sample Data with Missing Values
import pandas as pd
import numpy as np
data = {
"Name": ["Alice", "Bob", "Carol", "David"],
"Age": [25, None, 22, 35],
"Score": [85, 90, None, 78]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Score
0 Alice 25.0 85.0
1 Bob NaN 90.0
2 Carol 22.0 NaN
3 David 35.0 78.0
Detecting Missing Values
print(df.isnull()) # True where value is missing
print(df.isnull().sum()) # Count of missing values per column
print(df.notnull()) # True where value is present
Dropping Missing Values
# Drop any row that has at least one missing value
df_clean = df.dropna()
# Drop only rows where ALL values are missing
df_clean = df.dropna(how="all")
# Drop rows where specific columns have missing values
df_clean = df.dropna(subset=["Age"])
# Drop columns with missing values
df_clean = df.dropna(axis=1)
Filling Missing Values
# Fill all missing values with 0
df_filled = df.fillna(0)
# Fill missing values in a specific column
df["Age"] = df["Age"].fillna(df["Age"].mean())
# Fill using forward fill (use the previous row's value)
df_ffill = df.ffill() # fillna(method="ffill") is deprecated in pandas 2.x
# Fill using backward fill (use the next row's value)
df_bfill = df.bfill()
Filling with Different Values per Column
df = df.fillna({"Age": df["Age"].median(), "Score": df["Score"].mean()})
Replace Specific Values with NaN
df.replace(-1, np.nan, inplace=True)
df.replace(["N/A", "na", "unknown"], np.nan, inplace=True)
inplace Parameter
By default, Pandas operations return a new DataFrame and do not modify the original. Setting inplace=True modifies the original DataFrame directly without needing to reassign it.
df.dropna(inplace=True) # Modifies df directly
14. Adding and Dropping Columns/Rows
Adding a New Column
df["Grade"] = ["A", "A+", "B", "B+", "A"]
print(df)
Adding a Column Based on Calculation
df["Age_in_5_Years"] = df["Age"] + 5
Adding a Column Using assign()
assign() is a cleaner way to add new columns, especially in method chains.
df = df.assign(Tax = df["Score"] * 0.1)
Dropping a Column
df = df.drop(columns=["Age_in_5_Years"])
# Or using axis
df = df.drop("Grade", axis=1)
Dropping Multiple Columns
df = df.drop(columns=["Grade", "Tax"])
Dropping a Row by Index
df = df.drop(index=2) # Drop row at index 2
df = df.drop(index=[0, 3]) # Drop rows at index 0 and 3
Inserting a Column at a Specific Position
# Insert "Grade" column at position 1
df.insert(1, "Grade", ["A", "B", "A", "C", "A"])
Resetting the Index after Drop
df = df.reset_index(drop=True) # drop=True prevents old index from becoming a column
15. Renaming Columns and Index
Rename Specific Columns
df = df.rename(columns={"Name": "Student_Name", "Age": "Student_Age"})
print(df.columns)
Rename All Columns Using a List
df.columns = ["student", "age", "marks", "location"]
Rename Columns to Lowercase
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(" ", "_")
Rename the Index
df = df.rename(index={0: "Row1", 1: "Row2"})
Set a Column as Index
df = df.set_index("Name")
print(df)
Reset Index Back to Default
df = df.reset_index()
16. Sorting Data
Sort by a Single Column
# Sort by Age in ascending order (default)
df_sorted = df.sort_values("Age")
print(df_sorted)
# Sort in descending order
df_sorted = df.sort_values("Age", ascending=False)
Sort by Multiple Columns
# Sort by City ascending, then by Age descending
df_sorted = df.sort_values(["City", "Age"], ascending=[True, False])
print(df_sorted)
Sort by Index
df_sorted = df.sort_index() # Ascending
df_sorted = df.sort_index(ascending=False) # Descending
nlargest() and nsmallest()
These methods quickly return the top or bottom N rows based on a column.
# Top 3 highest scores
print(df.nlargest(3, "Score"))
# Bottom 2 youngest students
print(df.nsmallest(2, "Age"))
17. Data Types and Type Conversion
Every column in a DataFrame has a data type (dtype). Common types include int64, float64, object (for text), bool, and datetime64.
Check Data Types
print(df.dtypes)
Convert a Column's Data Type — astype()
# Convert Age from float to int
df["Age"] = df["Age"].astype(int)
# Convert Score to float
df["Score"] = df["Score"].astype(float)
# Convert a column to string
df["ID"] = df["ID"].astype(str)
Convert to Category Type (Memory Efficient)
If a column has a limited number of unique repeated values (like "City" or "Gender"), converting it to category saves memory.
df["City"] = df["City"].astype("category")
print(df["City"].cat.categories)
Convert String Column to Numeric
# If column has text like "45.5" that should be a number
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")
errors="coerce" converts invalid values to NaN instead of raising an error.
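The effect of errors="coerce" is easiest to see on a small Series that mixes valid and invalid numbers:

```python
import pandas as pd

raw = pd.Series(["45.5", "abc", "12"])

# "abc" cannot be parsed, so it becomes NaN instead of raising a ValueError
nums = pd.to_numeric(raw, errors="coerce")
print(nums.tolist())  # [45.5, nan, 12.0]
```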
Convert to Datetime
df["DOB"] = pd.to_datetime(df["DOB"])
df["JoinDate"] = pd.to_datetime(df["JoinDate"], format="%d-%m-%Y")
18. String Operations
Pandas provides a .str accessor to perform string operations on text columns. These work on the entire column at once — no need for a loop.
Setup
data = {"Name": [" alice ", "BOB", "carol smith"], "City": ["Delhi", "mumbai", "CHENNAI"]}
df = pd.DataFrame(data)
Common String Methods
| Method | Description | Example |
|---|---|---|
| .str.upper() | Convert to uppercase | df["Name"].str.upper() |
| .str.lower() | Convert to lowercase | df["City"].str.lower() |
| .str.title() | Title case (first letter caps) | df["Name"].str.title() |
| .str.strip() | Remove leading/trailing spaces | df["Name"].str.strip() |
| .str.len() | Length of each string | df["Name"].str.len() |
| .str.replace() | Replace a substring | df["Name"].str.replace("Bob","Robert") |
| .str.contains() | True if substring found | df["Name"].str.contains("carol") |
| .str.startswith() | True if starts with | df["Name"].str.startswith("A") |
| .str.endswith() | True if ends with | df["Name"].str.endswith("h") |
| .str.split() | Split into list | df["Name"].str.split(" ") |
| .str.slice() | Extract part of string | df["Name"].str.slice(0, 3) |
| .str.count() | Count occurrences | df["Name"].str.count("l") |
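Several of these chain naturally. A typical cleaning pass over the messy names from the setup above (re-created here as a standalone Series):

```python
import pandas as pd

names = pd.Series([" alice ", "BOB", "carol smith"])

# Strip whitespace, then normalize capitalization, all vectorized
cleaned = names.str.strip().str.title()
print(cleaned.tolist())                        # ['Alice', 'Bob', 'Carol Smith']
print(cleaned.str.len().tolist())              # [5, 3, 11]
print(cleaned.str.contains("Smith").tolist())  # [False, False, True]
```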
Splitting a Column into Multiple Columns
df["Name"] = ["Alice Smith", "Bob Jones", "Carol Brown"]
df[["First", "Last"]] = df["Name"].str.split(" ", expand=True)
print(df)
Filtering with str.contains()
result = df[df["City"].str.contains("Delhi", case=False)]
print(result)
19. Date and Time Operations
Dates and times are common in real-world data — transaction dates, employee joining dates, event timestamps. Pandas has built-in tools to work with them easily.
Creating and Converting to Datetime
import pandas as pd
df = pd.DataFrame({"Date": ["2024-01-15", "2024-03-22", "2024-07-05"]})
df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes)
Extracting Date Components
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["Weekday"] = df["Date"].dt.day_name()
df["Quarter"] = df["Date"].dt.quarter
print(df)
Output:
Date Year Month Day Weekday Quarter
0 2024-01-15 2024 1 15 Monday 1
1 2024-03-22 2024 3 22 Friday 1
2 2024-07-05 2024 7 5 Friday 3
Date Arithmetic
# Add 30 days to a date
df["Due_Date"] = df["Date"] + pd.Timedelta(days=30)
# Difference between two dates
df["Days_Passed"] = pd.Timestamp("2024-12-31") - df["Date"]
# Extract number of days from timedelta
df["Days_Count"] = df["Days_Passed"].dt.days
Filtering by Date
# Rows after March 2024
result = df[df["Date"] > "2024-03-01"]
print(result)
Resampling Time Series Data
Resampling aggregates time-series data over a time period — like grouping daily data into monthly totals.
df.set_index("Date", inplace=True)
# Monthly sum
monthly = df.resample("ME").sum() # "ME" = month end; plain "M" is deprecated since pandas 2.2
print(monthly)
Date Range Generation
# Generate 10 consecutive business days from Jan 1, 2024
date_range = pd.date_range(start="2024-01-01", periods=10, freq="B")
print(date_range)
20. GroupBy and Aggregation
GroupBy is one of the most powerful features in Pandas. It splits the data into groups based on one or more columns, then applies a function (like sum, mean, count) to each group. This is similar to GROUP BY in SQL.
Think of it like sorting test papers by class and then calculating the average marks for each class.
Sample Data
import pandas as pd
data = {
"City": ["Delhi", "Mumbai", "Delhi", "Mumbai", "Chennai", "Delhi"],
"Product": ["Pen", "Book", "Pen", "Pencil", "Pen", "Book"],
"Sales": [200, 500, 150, 300, 400, 250]
}
df = pd.DataFrame(data)
Basic GroupBy — Total Sales per City
result = df.groupby("City")["Sales"].sum()
print(result)
Output:
City
Chennai 400
Delhi 600
Mumbai 800
Name: Sales, dtype: int64
Multiple Aggregations at Once — agg()
result = df.groupby("City")["Sales"].agg(["sum", "mean", "max", "count"])
print(result)
GroupBy on Multiple Columns
result = df.groupby(["City", "Product"])["Sales"].sum()
print(result)
Named Aggregations (Pandas 0.25+)
result = df.groupby("City").agg(
Total_Sales = ("Sales", "sum"),
Avg_Sales = ("Sales", "mean"),
Max_Sales = ("Sales", "max")
)
print(result)
Applying Custom Aggregation Functions
result = df.groupby("City")["Sales"].agg(lambda x: x.max() - x.min())
print(result)
transform() — Keep Original Shape
transform() is like agg(), but it returns a Series of the same length as the original DataFrame — useful for adding a group-level calculation back to the original data.
# Add city-level average sales back to every row
df["City_Avg"] = df.groupby("City")["Sales"].transform("mean")
print(df)
filter() — Keep Groups Meeting a Condition
# Keep only cities with total sales greater than 500
result = df.groupby("City").filter(lambda x: x["Sales"].sum() > 500)
print(result)
21. apply(), map() and Lambda Functions
These are tools for applying custom transformations to columns or rows.
map() — Element-wise on a Series
map() applies a function or dictionary mapping to each element of a Series.
df["Grade"] = df["Score"].map({85: "A", 90: "A+", 78: "B+"})
# Or with a function
df["Level"] = df["Age"].map(lambda x: "Senior" if x > 28 else "Junior")
apply() on a Series
def classify(score):
if score >= 90:
return "Excellent"
elif score >= 75:
return "Good"
else:
return "Average"
df["Result"] = df["Score"].apply(classify)
apply() on a DataFrame Row
By setting axis=1, the function is applied row-by-row, giving access to multiple columns at once.
df["Summary"] = df.apply(
lambda row: f"{row['Name']} from {row['City']} scored {row['Score']}",
axis=1
)
print(df["Summary"])
apply() on Entire Columns (axis=0)
# Normalize all numeric columns (subtract min, divide by range)
def normalize(col):
return (col - col.min()) / (col.max() - col.min())
df[["Age", "Score"]] = df[["Age", "Score"]].apply(normalize)
applymap() / map() on Entire DataFrame (Element-wise)
# In Pandas 2.x use .map() on DataFrame instead of applymap()
df_numeric = df[["Age", "Score"]].map(lambda x: round(x, 2))
Using Lambda with Conditions
df["Tax"] = df["Sales"].apply(lambda x: x * 0.18 if x > 300 else x * 0.05)
22. Merging and Joining DataFrames
Merging combines two DataFrames based on a common column — just like SQL JOIN. This is essential when data is spread across multiple tables and needs to be combined.
Sample DataFrames
students = pd.DataFrame({
"ID": [1, 2, 3, 4],
"Name": ["Alice", "Bob", "Carol", "David"]
})
marks = pd.DataFrame({
"ID": [1, 2, 3, 5],
"Score": [85, 90, 78, 88]
})
Inner Join — Keep Only Matching Rows
result = pd.merge(students, marks, on="ID", how="inner")
print(result)
Output:
ID Name Score
0 1 Alice 85
1 2 Bob 90
2 3 Carol 78
David (ID=4) and the record with ID=5 are excluded because they don't match in both tables.
Left Join — Keep All Rows from Left Table
result = pd.merge(students, marks, on="ID", how="left")
print(result)
Output: David appears with NaN for Score since no matching row exists in marks.
Right Join — Keep All Rows from Right Table
result = pd.merge(students, marks, on="ID", how="right")
Outer Join — Keep All Rows from Both Tables
result = pd.merge(students, marks, on="ID", how="outer")
print(result)
Merging on Different Column Names
result = pd.merge(students, marks, left_on="ID", right_on="StudentID")
Merging on Index
result = pd.merge(students, marks, left_index=True, right_index=True)
Join Types Summary
| Join Type | Result |
|---|---|
| inner | Only rows that match in BOTH tables |
| left | All rows from LEFT table + matching from right |
| right | All rows from RIGHT table + matching from left |
| outer | All rows from BOTH tables |
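The row counts implied by this table can be verified directly on the students/marks frames from above. merge() also accepts indicator=True, which adds a _merge column recording each row's origin — a sketch:

```python
import pandas as pd

students = pd.DataFrame({"ID": [1, 2, 3, 4],
                         "Name": ["Alice", "Bob", "Carol", "David"]})
marks = pd.DataFrame({"ID": [1, 2, 3, 5],
                      "Score": [85, 90, 78, 88]})

# Row counts per join type: inner=3, left=4, right=4, outer=5
counts = {how: len(pd.merge(students, marks, on="ID", how=how))
          for how in ["inner", "left", "right", "outer"]}
print(counts)

# indicator=True tags each row as left_only, right_only, or both
outer = pd.merge(students, marks, on="ID", how="outer", indicator=True)
print(outer[["ID", "_merge"]])
```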
23. Concatenating DataFrames
While merge() combines DataFrames based on a key column, concat() stacks DataFrames either vertically (one on top of another) or horizontally (side by side).
Concatenate Vertically (Stacking Rows)
df1 = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [85, 90]})
df2 = pd.DataFrame({"Name": ["Carol", "David"], "Score": [78, 95]})
result = pd.concat([df1, df2], ignore_index=True)
print(result)
Output:
Name Score
0 Alice 85
1 Bob 90
2 Carol 78
3 David 95
ignore_index=True resets the index in the combined result.
Concatenate Horizontally (Stacking Columns)
df_names = pd.DataFrame({"Name": ["Alice", "Bob"]})
df_scores = pd.DataFrame({"Score": [85, 90]})
result = pd.concat([df_names, df_scores], axis=1)
print(result)
Concatenate with Keys (Track Source)
result = pd.concat([df1, df2], keys=["Group_A", "Group_B"])
print(result)
Difference between concat and merge
| Feature | concat | merge |
|---|---|---|
| Purpose | Stack DataFrames together | Combine based on common column |
| Requires common column | No | Yes |
| Direction | Rows (axis=0) or Columns (axis=1) | Always on columns |
24. Pivot Tables
A pivot table reorganizes and summarizes data by converting unique values of one column into new columns. This is the same concept as Excel's pivot table feature.
Sample Data
data = {
"City": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Chennai"],
"Product": ["Pen", "Book", "Pen", "Book", "Pen"],
"Sales": [200, 300, 400, 500, 150]
}
df = pd.DataFrame(data)
Basic Pivot Table
pivot = pd.pivot_table(df, values="Sales", index="City", columns="Product", aggfunc="sum")
print(pivot)
Output:
Product Book Pen
City
Chennai NaN 150
Delhi 300 200
Mumbai 500 400
With Fill Value and Margins
pivot = pd.pivot_table(
df,
values="Sales",
index="City",
columns="Product",
aggfunc="sum",
fill_value=0,
margins=True # Adds row and column totals
)
print(pivot)
Multiple Aggregation Functions
pivot = pd.pivot_table(df, values="Sales", index="City", aggfunc=["sum", "mean"])
print(pivot)
pivot() — Simple Reshape (No Aggregation)
pivot() is used when there is no aggregation needed — just reshaping the data.
result = df.pivot(index="City", columns="Product", values="Sales")
print(result)
25. Cross Tabulation (crosstab)
Cross tabulation (or crosstab) computes a frequency table of two or more variables. It is commonly used in survey analysis and categorical data exploration.
import pandas as pd
data = {
"Gender": ["M", "F", "M", "F", "M", "F"],
"City": ["Delhi", "Mumbai", "Delhi", "Delhi", "Mumbai", "Chennai"]
}
df = pd.DataFrame(data)
result = pd.crosstab(df["City"], df["Gender"])
print(result)
Output:
Gender F M
City
Chennai 1 0
Delhi 1 2
Mumbai 1 1
Crosstab with Percentages
result = pd.crosstab(df["City"], df["Gender"], normalize="index")
print(result)
Crosstab with Margins
result = pd.crosstab(df["City"], df["Gender"], margins=True, margins_name="Total")
print(result)
26. MultiIndex (Hierarchical Index)
A MultiIndex (also called hierarchical index) allows a DataFrame to have multiple levels of row or column labels. This is useful when data is naturally hierarchical — like sales by year and month, or students grouped by class and section.
Creating a MultiIndex DataFrame
import pandas as pd
data = {
"Sales": [200, 150, 300, 250, 180, 320]
}
index = pd.MultiIndex.from_tuples(
[("2023", "Q1"), ("2023", "Q2"), ("2023", "Q3"),
("2024", "Q1"), ("2024", "Q2"), ("2024", "Q3")],
names=["Year", "Quarter"]
)
df = pd.DataFrame(data, index=index)
print(df)
Output:
Sales
Year Quarter
2023 Q1 200
Q2 150
Q3 300
2024 Q1 250
Q2 180
Q3 320
Accessing Data in MultiIndex
# All rows for year 2023
print(df.loc["2023"])
# Specific quarter in a year
print(df.loc[("2023", "Q2")])
Unstacking — Convert Row Index to Columns
print(df.unstack())
Stacking — Convert Column Headers to Row Index
print(df.unstack().stack())
Reset MultiIndex
df_flat = df.reset_index()
print(df_flat)
Creating MultiIndex from GroupBy
sales_data = pd.DataFrame({
"Year": [2023, 2023, 2024, 2024],
"Product": ["Pen", "Book", "Pen", "Book"],
"Sales": [500, 700, 600, 800]
})
result = sales_data.groupby(["Year", "Product"])["Sales"].sum()
print(result)
27. Window Functions — Rolling & Expanding
Window functions perform calculations over a sliding or expanding window of rows. They are widely used in time-series analysis — for example, calculating a 7-day moving average of sales or stock prices.
Rolling — Fixed-Size Sliding Window
import pandas as pd
df = pd.DataFrame({
"Day": range(1, 8),
"Sales": [100, 150, 130, 200, 180, 210, 170]
})
# 3-day rolling average
df["Rolling_Avg"] = df["Sales"].rolling(window=3).mean()
print(df)
Output:
Day Sales Rolling_Avg
0 1 100 NaN
1 2 150 NaN
2 3 130 126.666667
3 4 200 160.000000
4 5 180 170.000000
5 6 210 196.666667
6 7 170 186.666667
The first two rows show NaN because there aren't enough previous rows to fill the window of 3.
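If those leading NaN values are unwanted, rolling() accepts a min_periods parameter that computes a partial-window result instead — a minimal sketch:

```python
import pandas as pd

sales = pd.Series([100, 150, 130, 200])

# min_periods=1 → average over however many rows are available so far
avg = sales.rolling(window=3, min_periods=1).mean()
print(avg.tolist())  # [100.0, 125.0, 126.66..., 160.0]
```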
Rolling Sum and Min
df["Rolling_Sum"] = df["Sales"].rolling(window=3).sum()
df["Rolling_Min"] = df["Sales"].rolling(window=3).min()
Expanding — Growing Window
An expanding window grows from the first row up to the current row — useful for running totals or cumulative averages.
df["Cumulative_Avg"] = df["Sales"].expanding().mean()
df["Cumulative_Sum"] = df["Sales"].expanding().sum()
print(df[["Day", "Sales", "Cumulative_Avg", "Cumulative_Sum"]])
Exponential Weighted Moving Average (EWM)
EWM gives more weight to recent values — useful in finance and trend analysis.
df["EWM_Avg"] = df["Sales"].ewm(span=3).mean()
28. Handling Duplicate Data
Duplicate rows occur when the same record appears more than once. They can skew analysis results and must be identified and handled properly.
Sample Data
data = {
"Name": ["Alice", "Bob", "Alice", "Carol", "Bob"],
"Score": [85, 90, 85, 78, 90]
}
df = pd.DataFrame(data)
Detect Duplicates
print(df.duplicated()) # True for duplicate rows
print(df.duplicated().sum()) # Total count of duplicates
print(df[df.duplicated()]) # Show duplicate rows
Detect Duplicates Based on Specific Columns
print(df.duplicated(subset=["Name"])) # Duplicates only in Name column
Remove Duplicates
# Keep first occurrence, remove later duplicates
df_clean = df.drop_duplicates()
# Keep last occurrence
df_clean = df.drop_duplicates(keep="last")
# Remove all rows that are duplicated (keep none)
df_clean = df.drop_duplicates(keep=False)
Remove Based on Specific Columns
df_clean = df.drop_duplicates(subset=["Name"], keep="first")
29. Reshaping Data — melt and stack
Sometimes data is in a wide format (many columns) when it needs to be in a long format (more rows), or vice versa. Pandas provides melt(), stack(), and unstack() for this.
melt() — Wide to Long Format
melt() "unpivots" a DataFrame — turning column headers into row values. This is the opposite of a pivot table.
df = pd.DataFrame({
"Student": ["Alice", "Bob"],
"Math": [85, 90],
"Science": [78, 88],
"English": [92, 75]
})
print(df)
Wide Format:
Student Math Science English
0 Alice 85 78 92
1 Bob 90 88 75
df_melted = df.melt(id_vars="Student", var_name="Subject", value_name="Marks")
print(df_melted)
Long Format (after melt):
Student Subject Marks
0 Alice Math 85
1 Bob Math 90
2 Alice Science 78
3 Bob Science 88
4 Alice English 92
5 Bob English 75
stack() — Column Labels Become Row Index
df_stacked = df.set_index("Student").stack()
print(df_stacked)
unstack() — Row Index Becomes Column Labels
df_unstacked = df_stacked.unstack()
print(df_unstacked)
30. Performance Tips
As datasets grow larger, writing efficient Pandas code becomes important. Here are practical techniques to improve performance.
Tip 1 — Use Appropriate Data Types
Storing text columns with few unique values as category type instead of object reduces memory significantly.
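The saving is easy to verify with memory_usage(deep=True). A minimal sketch with made-up data, where three unique city names repeat across many rows:

```python
import pandas as pd

# A column with many rows but only a few unique values
cities = pd.Series(["Delhi", "Mumbai", "Chennai"] * 10_000)

as_object = cities.memory_usage(deep=True)
as_category = cities.astype("category").memory_usage(deep=True)

# category stores each unique string once plus small integer codes per row
print(f"object:   {as_object} bytes")
print(f"category: {as_category} bytes")
```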
df["City"] = df["City"].astype("category")
df["Gender"] = df["Gender"].astype("category")
Tip 2 — Use Vectorized Operations Instead of Loops
Pandas operations on entire columns are much faster than Python loops.
# Slow way — using a loop
for i in range(len(df)):
    df.loc[i, "Tax"] = df.loc[i, "Sales"] * 0.18
# Fast way — vectorized
df["Tax"] = df["Sales"] * 0.18
Tip 3 — Read Only Necessary Columns
df = pd.read_csv("large_file.csv", usecols=["Name", "Sales", "City"])
Tip 4 — Check Memory Usage
print(df.memory_usage(deep=True))
print(df.memory_usage(deep=True).sum() / 1024 ** 2, "MB")
Tip 5 — Use query() for Faster Filtering
# query() can be faster than boolean indexing on large DataFrames
# (it uses the numexpr engine when available)
df.query("Age > 25 and City == 'Delhi'")
Tip 6 — Avoid iterrows() — Use apply() or Vectorization
# Avoid — very slow
for index, row in df.iterrows():
    print(row["Name"])
# Better — use a vectorized operation or apply()
df["Name"].apply(print)
Tip 7 — Read Large CSV Files in Chunks
chunk_size = 10000
for chunk in pd.read_csv("huge_file.csv", chunksize=chunk_size):
    process(chunk)  # Process each chunk separately
Tip 8 — Downcast Numeric Types
df["Age"] = pd.to_numeric(df["Age"], downcast="integer")
df["Score"] = pd.to_numeric(df["Score"], downcast="float")
Tip 9 — Use eval() for Complex Expressions
# Can be faster than standard column arithmetic on large DataFrames
df.eval("Total = Sales + Tax", inplace=True)
31. Quick Reference Cheat Sheet
Creating Data
| Task | Code |
|---|---|
| Create Series | pd.Series([1, 2, 3]) |
| Create DataFrame | pd.DataFrame({"A": [1,2], "B": [3,4]}) |
| Read CSV | pd.read_csv("file.csv") |
| Read Excel | pd.read_excel("file.xlsx") |
| Write CSV | df.to_csv("out.csv", index=False) |
Inspection
| Task | Code |
|---|---|
| First 5 rows | df.head() |
| Shape | df.shape |
| Info | df.info() |
| Statistics | df.describe() |
| Data types | df.dtypes |
| Missing counts | df.isnull().sum() |
Selection
| Task | Code |
|---|---|
| Select column | df["Name"] |
| Select multiple columns | df[["Name", "Age"]] |
| Select by label | df.loc[0, "Name"] |
| Select by position | df.iloc[0, 1] |
| Filter rows | df[df["Age"] > 25] |
Cleaning
| Task | Code |
|---|---|
| Drop NA rows | df.dropna() |
| Fill NA | df.fillna(0) |
| Drop duplicates | df.drop_duplicates() |
| Rename column | df.rename(columns={"old":"new"}) |
| Change type | df["A"].astype(int) |
Transformation
| Task | Code |
|---|---|
| Sort rows | df.sort_values("Col") |
| Group and sum | df.groupby("City")["Sales"].sum() |
| Apply function | df["Col"].apply(func) |
| Pivot table | pd.pivot_table(df, values=..., index=..., columns=...) |
| Merge | pd.merge(df1, df2, on="ID") |
| Concatenate | pd.concat([df1, df2]) |
| Melt (wide to long) | df.melt(id_vars="Name") |
| Rolling average | df["Col"].rolling(3).mean() |
String Operations
| Task | Code |
|---|---|
| Lowercase | df["Name"].str.lower() |
| Strip spaces | df["Name"].str.strip() |
| Contains | df["Name"].str.contains("Al") |
| Replace | df["Name"].str.replace("Bob","Robert") |
| Split | df["Name"].str.split(" ", expand=True) |
DateTime Operations
| Task | Code |
|---|---|
| Convert to datetime | pd.to_datetime(df["Date"]) |
| Extract year | df["Date"].dt.year |
| Extract month | df["Date"].dt.month |
| Day name | df["Date"].dt.day_name() |
| Date range | pd.date_range("2024-01-01", periods=10) |
Summary
This guide has covered the full journey of working with Pandas — starting from installing the library and understanding its two core data structures (Series and DataFrame), all the way to advanced techniques like MultiIndex, window functions, reshaping with melt/stack, and performance optimization.
Here is a recap of the key stages in any typical data analysis workflow using Pandas:
- Load — Read data from CSV, Excel, JSON, or a database.
- Inspect — Use head(), info(), describe() to understand the dataset.
- Clean — Handle missing values, fix data types, remove duplicates.
- Transform — Filter, sort, group, merge, pivot, and reshape as needed.
- Analyze — Aggregate, apply functions, compute statistics.
- Export — Save results to CSV, Excel, or a database.
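As a closing illustration, the six stages above can be sketched in a few lines. The data here is entirely made up, and the "file" is an in-memory buffer (io.StringIO) so the sketch runs without touching disk:

```python
import io
import pandas as pd

# Load — here from an in-memory CSV string instead of a real file
raw = io.StringIO("Name,City,Sales\nAlice,Delhi,100\nBob,Mumbai,\nAlice,Delhi,100\n")
df = pd.read_csv(raw)

# Inspect
print(df.shape)            # (3, 3)
print(df.isnull().sum())   # one missing Sales value

# Clean — fill the missing value, drop the exact duplicate row
df["Sales"] = df["Sales"].fillna(0)
df = df.drop_duplicates()

# Transform / Analyze — total sales per city
totals = df.groupby("City")["Sales"].sum()

# Export — write to the in-memory buffer rather than disk
out = io.StringIO()
totals.to_csv(out)
print(out.getvalue())
```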
Mastering Pandas opens the door to advanced areas like data visualization (Matplotlib, Seaborn), machine learning (Scikit-learn), and big data processing (Dask, PySpark). The consistent practice of applying these concepts on real datasets is the fastest path to proficiency.
