Regular Expressions (RegEx) in Python

Regular Expressions (often shortened to RegEx) are like a super-powered “Find and Replace” tool for programmers. While a standard search looks for exact text (like finding “apple”), RegEx lets you define a pattern or blueprint (like finding “any word that starts with ‘a’ and ends with ‘e’”).

The re Module (Prerequisite)

Before you can use any RegEx functions in Python, you must import the standard library module named re. This module contains all the necessary tools to compile patterns and search through strings.
import re

# Now you can use re.search(), re.match(), etc.
print("Module imported successfully!")

Special Characters

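These are symbols (often called metacharacters) that have a “magic” meaning in RegEx and don’t just match themselves. Instead of representing a literal character, they give instructions to the regex engine, such as “look for the start of the string” or “match any character here”. To match one of them literally, escape it with a backslash (for example, \. matches an actual dot).
  • . (Dot): Matches any single character except a newline (unless the re.DOTALL flag is set).
  • ^ (Caret): Matches the start of the string (and, with the re.MULTILINE flag, the start of each line).
  • $ (Dollar): Matches the end of the string, or just before a newline at the very end of the string.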
import re

pattern = "^Hello"
text = "Hello World"
# This checks if the text literally starts with "Hello"
result = re.search(pattern, text)
print(bool(result))  # Output: True
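
The dot and the dollar sign can be tried out the same way. The snippet below is a minimal sketch; the word list and the variable names are made up for illustration.

```python
import re

# '.' stands in for exactly one character, so "^c.t$" accepts
# three-letter strings that start with 'c' and end with 't'
words = ["cat", "cut", "ct", "coat"]
dot_matches = [w for w in words if re.search(r"^c.t$", w)]
print(dot_matches)  # ['cat', 'cut']

# '$' anchors the pattern to the end of the string
ends_with_world = bool(re.search(r"World$", "Hello World"))
print(ends_with_world)  # True
```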

Sequence Characters

Sequence characters are shorthand codes that represent a specific category of characters, saving you from typing out long lists. They are always preceded by a backslash (\). For instance, instead of asking Python to look for “0, 1, 2, 3…9”, you can simply use a sequence character that stands for “any digit”.
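  • \d: Matches any digit (0-9; for text patterns, Python also accepts other Unicode decimal digits unless the re.ASCII flag is set).
  • \w: Matches any word character: letters, digits, and the underscore (a-z, A-Z, 0-9, and _, plus Unicode letters and digits by default).
  • \s: Matches any whitespace character (spaces, tabs, newlines).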
import re

text = "Order ID: 4590"
# \d+ looks for one or more digits in a row
result = re.search(r"\d+", text)
print(result.group()) # Output: 4590
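
To see \w and \s in action as well, here is a small sketch. The sample text is invented, and it uses findall() (covered further down) simply because it shows every match at once.

```python
import re

# Note the tab character (\t) between "in" and "at"
text = "user_42 logged in\tat 09:15"

# \w+ grabs runs of letters, digits, and underscores
word_runs = re.findall(r"\w+", text)
print(word_runs)  # ['user_42', 'logged', 'in', 'at', '09', '15']

# \s matches each individual whitespace character (spaces and the tab)
whitespace = re.findall(r"\s", text)
print(whitespace)  # [' ', ' ', '\t', ' ']
```

The colon in "09:15" is neither a word character nor whitespace, so it splits the time into two \w+ matches and never shows up in either list.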

Quantifiers

Quantifiers tell the RegEx engine how many times a character or group should appear in your pattern. They allow your search to be flexible, handling cases where a user might type a letter once, ten times, or not at all. Without quantifiers, your patterns would be extremely rigid and brittle.
  • *: Matches the preceding item 0 or more times.
  • +: Matches the preceding item 1 or more times.
  • ?: Matches the preceding item 0 or 1 time (makes it optional).
  • {n}: Matches the preceding item exactly n times.
import re

text = "The colouuur is red."
# 'u+' means match the letter 'u' one or more times
result = re.search(r"colou+r", text)
print(result.group()) # Output: colouuur
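
The other quantifiers from the list can be checked in the same way. This is only a sketch with made-up sample strings:

```python
import re

# '?' makes the preceding 'u' optional, so both spellings match
optional = [bool(re.search(r"colou?r", w)) for w in ["color", "colour"]]
print(optional)  # [True, True]

# '*' allows zero or more 'b' characters between 'a' and 'c'
star = [bool(re.search(r"^ab*c$", w)) for w in ["ac", "abc", "abbbc", "adc"]]
print(star)  # [True, True, True, False]

# '{3}' requires exactly three digits in a row
exact = re.search(r"\d{3}", "Room 1234")
print(exact.group())  # 123
```

Notice that \d{3} happily matches the first three digits of a longer number; the quantifier limits the match, not the surrounding text.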

match() Function

The match() function checks for a pattern only at the very beginning of the string. If the pattern appears only in the middle or at the end, match() returns None. It is best used to validate how a string begins, such as checking whether a URL starts with “http”.
import re

text = "Welcome to Python"
# Checks if "Welcome" is at the very start
result = re.match(r"Welcome", text)

if result:
    print("Found match!")
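
To make the “only at the beginning” rule concrete, the sketch below tries a word from the start and a word from the end of the same string, then runs the URL check mentioned above. The example URL is just a placeholder.

```python
import re

text = "Welcome to Python"

# "Welcome" sits at position 0, so match() succeeds
at_start = re.match(r"Welcome", text)
print(at_start is not None)  # True

# "Python" is in the string, but not at the start, so match() gives None
in_middle = re.match(r"Python", text)
print(in_middle)  # None

# Checking how a URL begins
is_http = re.match(r"http", "https://example.com") is not None
print(is_http)  # True
```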

search() Function

Unlike match(), the search() function scans through the string and returns a match object for the first location where the pattern matches; it stops at that first success. If nothing matches, it returns None, so check the result before calling .group(). This is your go-to function when you expect the data to be somewhere in the text but aren’t sure exactly where.
import re

text = "Error code: 404 not found"
# Scans the whole string for the first sequence of digits
result = re.search(r"\d+", text)
print(result.group()) # Output: 404
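
Because search() returns None when nothing is found, calling .group() directly can raise an AttributeError. One way to guard against that is sketched below; first_number is a hypothetical helper name, not part of the re module.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    result = re.search(r"\d+", text)
    return result.group() if result else None

print(first_number("Error code: 404 not found"))  # 404
print(first_number("No digits here"))             # None

# The match object also records where the match was found
span = re.search(r"\d+", "Error code: 404 not found").span()
print(span)  # (12, 15)
```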

findall() Function

While search() stops after the first discovery, findall() continues scanning until the end of the string and returns a list of all non-overlapping matches. If no matches are found, it returns an empty list. This is perfect for extracting all instances of a specific data type, like scraping all email addresses from a document.
import re

text = "Apples cost $5, Bananas cost $3"
# \$\d+ looks for a dollar sign followed by digits
prices = re.findall(r"\$\d+", text)
print(prices) # Output: ['$5', '$3']
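
One behaviour worth knowing: if the pattern contains a capturing group (parentheses, explained in the Groups section), findall() returns the text of the group instead of the whole match. The sketch below uses that to drop the dollar sign and add up the prices; the variable names are illustrative.

```python
import re

text = "Apples cost $5, Bananas cost $3"

# With one capturing group, findall() returns only the group's text
amounts = re.findall(r"\$(\d+)", text)
print(amounts)  # ['5', '3']

total = sum(int(a) for a in amounts)
print(total)  # 8

# No match gives an empty list rather than None
nothing = re.findall(r"\$\d+", "Everything is free")
print(nothing)  # []
```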

split() Function

The split() function breaks a string apart into a list of substrings based on a specific RegEx pattern. It works similarly to the standard Python .split() method, but it is much more powerful because you can split by complex patterns (like “any number”) rather than just a fixed character (like a comma).
import re

text = "apple,orange; banana|grape"
# Split by comma, semicolon, or pipe
fruits = re.split(r"[,;|]", text)  # the r prefix marks a raw string
print(fruits) # Output: ['apple', 'orange', ' banana', 'grape']
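
The output above still has a stray space in ' banana'. A possible refinement, shown here as a sketch, is to let the pattern swallow optional whitespace around each delimiter. The second call uses the maxsplit argument of re.split() to limit the number of splits.

```python
import re

text = "apple,orange; banana|grape"

# \s* on both sides absorbs any spaces around the delimiter
fruits = re.split(r"\s*[,;|]\s*", text)
print(fruits)  # ['apple', 'orange', 'banana', 'grape']

# maxsplit=1 splits only at the first delimiter
first_split = re.split(r"[,;|]", text, maxsplit=1)
print(first_split)  # ['apple', 'orange; banana|grape']
```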

sub() Function (Substitute)

The sub() function is used to replace occurrences of a pattern with a new string. It takes three main arguments: the pattern to look for, the text to replace it with, and the original string. This is incredibly useful for cleaning data, such as removing bad characters or formatting phone numbers uniformly.
import re

text = "Contact: 123-456-7890"
# Replace all hyphens with spaces
cleaned = re.sub(r"-", " ", text)
print(cleaned) # Output: Contact: 123 456 7890
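
For the phone-number case mentioned above, a rough sketch might look like this. It assumes a 10-digit number, and it uses \D (any non-digit) and backreferences such as \1, which refer to the groups described in the next section.

```python
import re

phone = "(123) 456.7890"

# Step 1: strip everything that is not a digit
digits = re.sub(r"\D", "", phone)
print(digits)  # 1234567890

# Step 2: rebuild the number in one uniform format using the captured groups
formatted = re.sub(r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", digits)
print(formatted)  # 123-456-7890

# The optional count argument limits how many replacements are made
first_only = re.sub(r"-", " ", "123-456-7890", count=1)
print(first_only)  # 123 456-7890
```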

Groups

Grouping is done using parentheses (...) to bundle parts of a pattern together. This allows you to apply quantifiers to entire phrases rather than single letters, or to extract specific parts of a match separately. It is essential for parsing complex strings where you need to isolate specific data points, like separating an area code from a phone number.
import re

text = "Agent 007 reporting."
# Group 1 captures the word, Group 2 captures the digits
match = re.search(r"(\w+) (\d+)", text)
print(match.group(1)) # Output: Agent
print(match.group(2)) # Output: 007
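
Here is a sketch of the two other uses mentioned above: pulling an area code out of a phone number, and applying a quantifier to a whole group. The phone number and its format are made up for the example.

```python
import re

# Literal parentheses are escaped as \( and \); the unescaped ones create groups
phone_match = re.search(r"\((\d{3})\) (\d{3}-\d{4})", "Call (555) 123-4567 today")
print(phone_match.group(1))  # 555
print(phone_match.group(2))  # 123-4567
print(phone_match.groups())  # ('555', '123-4567')

# A quantifier after a group repeats the whole group, not just one letter
laugh = re.search(r"(ha)+", "hahaha!")
print(laugh.group())  # hahaha
```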

Practical Example: Matching Dates

Combining the concepts above, we can create a pattern to identify dates within a text. We use \d for digits and {n} to specify the exact number of digits (2 each for the day and month, 4 for the year). This example demonstrates how to pick out a specific format (DD-MM-YYYY) from a messy string. Note that the pattern only checks the shape of the text: it would also accept an impossible date such as 99-99-9999.
import re

text = "The invoice is due on 15-08-2025."
# \d{2} = 2 digits, \d{4} = 4 digits, separated by hyphens
date_pattern = r"\d{2}-\d{2}-\d{4}"

match = re.search(date_pattern, text)
print(f"Date found: {match.group()}") # Output: Date found: 15-08-2025
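
In messier text, the plain pattern can also latch onto part of a longer number. One possible tightening, sketched below, adds \b (a word boundary, not covered above) on both sides and groups the day, month, and year. The sample text, including the reference number, is invented.

```python
import re

text = "Due 15-08-2025, paid 03-09-2025, ref 123-45-67890."

# Without boundaries, part of the reference number looks like a date
loose = re.findall(r"\d{2}-\d{2}-\d{4}", text)
print(loose)  # ['15-08-2025', '03-09-2025', '23-45-6789']

# \b stops the match from starting or ending in the middle of a digit run;
# the three groups make findall() return (day, month, year) tuples
strict = re.findall(r"\b(\d{2})-(\d{2})-(\d{4})\b", text)
print(strict)  # [('15', '08', '2025'), ('03', '09', '2025')]
```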

Why use r"" (Raw Strings)?

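Note: You may notice the r before patterns (e.g., r"\d+"). This tells Python to keep each backslash as a literal character instead of treating it as the start of an escape sequence (such as \n for a newline), so the backslash reaches the regex engine intact. Use raw strings for RegEx patterns to avoid subtle bugs, such as "\b" being read as a backspace character instead of a word boundary.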
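
The difference is easy to see with \b, which means “backspace” to Python’s string parser but “word boundary” to the regex engine. A small sketch with an invented sample sentence:

```python
import re

plain = "\bcat\b"   # Python turns each \b into a single backspace character
raw = r"\bcat\b"    # the backslashes survive and reach the regex engine

print(len(plain))  # 5
print(len(raw))    # 7

# The plain version looks for literal backspace characters and finds nothing
plain_result = re.search(plain, "a cat sat")
print(plain_result)  # None

# The raw version matches "cat" as a whole word
raw_result = re.search(raw, "a cat sat")
print(raw_result.group())  # cat
```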