Regular Expressions (often shortened to RegEx) are like a super-powered “Find and Replace” tool for programmers. While standard search looks for exact text (like finding “apple”), RegEx allows you to define a pattern or a blueprint (like finding “any word that starts with ‘a’ and ends with ‘e’”).
The re Module (Prerequisite)
Before you can use any RegEx functions in Python, you must import the standard library module named re. This module contains all the necessary tools to compile patterns and search through strings.
import re
# Now you can use re.search(), re.match(), etc.
print("Module imported successfully!")
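Since the module can also compile patterns, here is a minimal sketch of that idea using re.compile(), which is part of the standard re module. The sample text and the variable names (digits, first, every) are made up for illustration, and the \d+ pattern (one or more digits) is explained in the sections that follow.

```python
import re

# Compile the pattern once so it can be reused across several calls
digits = re.compile(r"\d+")

sample = "Room 12, Floor 3"
first = digits.search(sample)    # first run of digits
every = digits.findall(sample)   # every run of digits

print(first.group())  # Output: 12
print(every)          # Output: ['12', '3']
```

Compiling is optional for one-off searches; it mostly pays off when the same pattern is applied many times.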
Special Characters
These are symbols that have a “magic” meaning in RegEx and don’t just match themselves. Instead of representing a literal character, they give instructions to the RegEx engine, such as “look for the start of the string” or “match any character here”.
. (Dot): Matches any single character (except a newline).
^ (Caret): Matches the start of the string.
$ (Dollar): Matches the end of the string.
import re
pattern = r"^Hello"
text = "Hello World"
# This checks if the text literally starts with "Hello"
result = re.search(pattern, text)
print(bool(result))  # Output: True
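The dot and the dollar sign from the list above can be tried out the same way. This is a small sketch with made-up sample words, not an exhaustive test of either character.

```python
import re

# '.' stands in for exactly one character, so "ct" (nothing in between) fails
dot_results = [bool(re.search(r"c.t", word)) for word in ["cat", "cut", "ct"]]
print(dot_results)  # Output: [True, True, False]

# '$' anchors the pattern to the end of the string
ends_with_world = bool(re.search(r"World$", "Hello World"))
ends_with_hello = bool(re.search(r"Hello$", "Hello World"))
print(ends_with_world)  # Output: True
print(ends_with_hello)  # Output: False
```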
Sequence Characters
Sequence characters are shorthand codes that represent a specific category of characters, saving you from typing out long lists. They are always preceded by a backslash (\). For instance, instead of asking Python to look for “0, 1, 2, 3…9”, you can simply use a sequence character that stands for “any digit”.
\d: Matches any digit (0-9).
\w: Matches any word character (a-z, A-Z, 0-9, and the underscore; for Python 3 text strings, letters and digits from other alphabets match too).
\s: Matches any whitespace (spaces, tabs, newlines).
import re
text = "Order ID: 4590"
# \d+ looks for one or more digits in a row
result = re.search(r"\d+", text)
print(result.group())  # Output: 4590
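To see \w and \s in action as well, here is a short sketch. The sample text is invented, and it uses re.findall(), which is covered in its own section further down.

```python
import re

text = "user_42 logged in\tat 09:15"  # \t here is a real tab character

# \w+ grabs runs of letters, digits, and underscores
words = re.findall(r"\w+", text)
print(words)  # Output: ['user_42', 'logged', 'in', 'at', '09', '15']

# \s matches each individual whitespace character (three spaces and one tab)
gaps = re.findall(r"\s", text)
print(len(gaps))  # Output: 4
```

Notice that the colon in 09:15 is neither a word character nor whitespace, so \w+ splits the time into two pieces.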
Quantifiers
Quantifiers tell the RegEx engine how many times a character or group should appear in your pattern. They allow your search to be flexible, handling cases where a user might type a letter once, ten times, or not at all. Without quantifiers, your patterns would be extremely rigid and brittle.
*: Matches 0 or more times.
+: Matches 1 or more times.
?: Matches 0 or 1 time (optional).
{n}: Matches exactly n times.
import re
text = "The colouuur is red."
# 'u+' means match the letter 'u' one or more times
result = re.search(r"colou+r", text)
print(result.group())  # Output: colouuur
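The other three quantifiers behave as follows. This is a quick sketch with invented sample strings; the variable names are only for illustration.

```python
import re

# '?' makes the 'u' optional: zero or one 'u' is fine, two is too many
optional = [bool(re.search(r"colou?r", w)) for w in ["color", "colour", "colouur"]]
print(optional)  # Output: [True, True, False]

# '*' allows zero or more, so the pattern still matches with no 'o' at all
star = re.search(r"go*gle", "ggle").group()
print(star)  # Output: ggle

# '{3}' demands exactly three digits; the search stops after the first three
exact = re.search(r"\d{3}", "12345").group()
print(exact)  # Output: 123
```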
match() Function
The match() function checks for a pattern only at the very beginning of the string. If the pattern appears only somewhere in the middle or at the end, match() will ignore it and return None. It is best used when you want to validate that a string starts with the expected text, such as checking if a URL starts with “http”.
import re
text = "Welcome to Python"
# Checks if "Welcome" is at the very start
result = re.match(r"Welcome", text)
if result:
    print("Found match!")
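The None case is worth seeing side by side with a successful match. In this sketch the URL is a hypothetical example value, and the https?:// pattern is just one possible way to write the “starts with http” check.

```python
import re

text = "Welcome to Python"

at_start = re.match(r"Welcome", text)   # pattern sits at position 0
in_middle = re.match(r"Python", text)   # pattern exists, but not at the start

print(at_start is not None)  # Output: True
print(in_middle)             # Output: None

# Hypothetical URL check: does the string begin with http:// or https://?
url = "https://example.com/docs"
is_web_url = re.match(r"https?://", url) is not None
print(is_web_url)  # Output: True
```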
search() Function
Unlike match(), the search() function scans through the entire string to find the first location where the pattern produces a match. It stops at the first match and returns None if there is none. This is your go-to function when you expect the data to be somewhere in the text but aren’t sure exactly where.
import re
text = "Error code: 404 not found"
# Scans the whole string for the first sequence of digits
result = re.search(r"\d+", text)
print(result.group())  # Output: 404
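Because search() returns None when nothing matches, calling .group() directly on the result can crash. One way to guard against that is sketched below; first_number is a hypothetical helper name, not part of the re module.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    found = re.search(r"\d+", text)
    return found.group() if found else None

print(first_number("Error code: 404 not found"))  # Output: 404
print(first_number("No digits here"))             # Output: None

# The match object also records where the match was found
location = re.search(r"\d+", "Error code: 404 not found").span()
print(location)  # Output: (12, 15)
```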
findall() Function
While search() stops after the first discovery, findall() continues scanning until the end of the string and returns a list of all non-overlapping matches. If no matches are found, it returns an empty list. This is perfect for extracting all instances of a specific data type, like scraping all email addresses from a document.
import re
text = "Apples cost $5, Bananas cost $3"
# \$\d+ looks for a dollar sign followed by digits
prices = re.findall(r"\$\d+", text)
print(prices)  # Output: ['$5', '$3']
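Two details of findall() are easy to miss: the empty list when nothing matches, and the fact that a pattern with one pair of capturing parentheses returns only the captured part. The sketch below shows both with invented sample text.

```python
import re

text = "Apples cost $5, Bananas cost $3"

# With one capturing group, findall() returns just the captured digits
amounts = re.findall(r"\$(\d+)", text)
print(amounts)  # Output: ['5', '3']

# The results are strings, so convert them before doing arithmetic
total = sum(int(amount) for amount in amounts)
print(total)  # Output: 8

# No match at all gives an empty list rather than None
nothing = re.findall(r"\$\d+", "Everything is free")
print(nothing)  # Output: []
```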
split() Function
The split() function breaks a string apart into a list of substrings based on a specific RegEx pattern. It works similarly to the standard Python .split() method, but it is much more powerful because you can split by complex patterns (like “any number”) rather than just a fixed character (like a comma).
import re
text = "apple,orange; banana|grape"
# Split by comma, semicolon, or pipe (the r prefix marks a raw string)
fruits = re.split(r"[,;|]", text)
print(fruits)  # Output: ['apple', 'orange', ' banana', 'grape']
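Splitting by a pattern rather than a fixed character can also absorb stray spaces or split on “any number”. A small sketch, with sample strings chosen only for illustration:

```python
import re

text = "apple,orange; banana|grape"

# \s* on both sides swallows optional whitespace around each separator
fruits = re.split(r"\s*[,;|]\s*", text)
print(fruits)  # Output: ['apple', 'orange', 'banana', 'grape']

# Split wherever one or more digits appear
steps = re.split(r"\d+", "wash1rinse22dry")
print(steps)  # Output: ['wash', 'rinse', 'dry']
```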
sub() Function (Substitute)
The sub() function is used to replace occurrences of a pattern with a new string. It takes three main arguments: the pattern to look for, the text to replace it with, and the original string. This is incredibly useful for cleaning data, such as removing bad characters or formatting phone numbers uniformly.
import re
text = "Contact: 123-456-7890"
# Replace all hyphens with spaces
cleaned = re.sub(r"-", " ", text)
print(cleaned)  # Output: Contact: 123 456 7890
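The data-cleaning uses mentioned above might look something like this. It is a sketch under the assumption that phone numbers arrive with mixed separators; count is a real optional argument of re.sub() that limits how many replacements are made.

```python
import re

# Remove everything that is not a digit
digits_only = re.sub(r"[^\d]", "", "Contact: 123-456-7890")
print(digits_only)  # Output: 1234567890

# Bring mixed separators (dot, whitespace, hyphen) to a single hyphen
uniform = re.sub(r"[.\s-]", "-", "123.456 7890")
print(uniform)  # Output: 123-456-7890

# count=1 limits the replacement to the first occurrence
first_only = re.sub(r"-", " ", "123-456-7890", count=1)
print(first_only)  # Output: 123 456-7890
```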
Groups
Grouping is done using parentheses (...) to bundle parts of a pattern together. This allows you to apply quantifiers to entire phrases rather than single letters, or to extract specific parts of a match separately. It is essential for parsing complex strings where you need to isolate specific data points, like separating an area code from a phone number.
import re
text = "Agent 007 reporting."
# Group 1 captures the word, Group 2 captures the digits; \s matches the space between them
match = re.search(r"(\w+)\s(\d+)", text)
print(match.group(1))  # Output: Agent
print(match.group(2))  # Output: 007
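Both uses of grouping, repeating a whole phrase and pulling out individual parts, can be sketched as follows. The phone number and the sentence around it are invented sample data.

```python
import re

# A quantifier after a group repeats the whole group, not just one letter
laugh = re.search(r"(ha)+", "hahaha!").group()
print(laugh)  # Output: hahaha

# Separate the area code from the rest of a phone number
phone = re.search(r"(\d{3})-(\d{3}-\d{4})", "Call 123-456-7890 today")
area_code = phone.group(1)
local_part = phone.group(2)
print(area_code)       # Output: 123
print(local_part)      # Output: 456-7890
print(phone.groups())  # Output: ('123', '456-7890')
```

Note that group() with no argument (or group(0)) always returns the full match, while numbered groups start at 1.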
Practical Example: Matching Dates
Combining the concepts above, we can create a pattern to identify dates within a text. We use \d for digits and {n} to specify the exact number of digits (2 for the day/month, 4 for the year). This example demonstrates how to pick a date in a specific format (DD-MM-YYYY) out of a messy string. Keep in mind that the pattern only checks the number of digits, not whether the date actually exists on a calendar.
import re
text = "The invoice is due on 15-08-2025."
# \d{2} = 2 digits, \d{4} = 4 digits, separated by hyphens
date_pattern = r"\d{2}-\d{2}-\d{4}"
match = re.search(date_pattern, text)
print(f"Date found: {match.group()}")  # Output: Date found: 15-08-2025
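To go one step further, the same pattern can be combined with groups and findall() to collect every date and then discard impossible ones. This is only a sketch: the sample text is invented, and the range check on day and month is a rough plausibility filter, not real calendar validation.

```python
import re

text = "Issued 01-07-2025, due 15-08-2025, ref 99-99-9999."

# Three groups: day, month, year
date_pattern = r"(\d{2})-(\d{2})-(\d{4})"

# With several groups, findall() returns one tuple per match
dates = re.findall(date_pattern, text)
print(dates)
# Output: [('01', '07', '2025'), ('15', '08', '2025'), ('99', '99', '9999')]

# Keep only dates whose day and month fall in a plausible range
plausible = [
    (day, month, year)
    for day, month, year in dates
    if 1 <= int(day) <= 31 and 1 <= int(month) <= 12
]
print(plausible)
# Output: [('01', '07', '2025'), ('15', '08', '2025')]
```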
Why use r"" (Raw Strings)?
Note: You may notice the r before patterns (e.g., r"\d+"). This tells Python to treat backslashes as literal characters rather than as the start of an escape sequence (such as \n for a new line). Always use raw strings for RegEx to avoid weird bugs!
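One concrete example of such a bug involves \b, which is not covered above: in RegEx it means “word boundary”, but in a normal Python string it is the backspace character. The sketch below uses an invented sample sentence to show the difference.

```python
import re

# In a normal string "\n" is one character (a newline); in a raw string it is two
print(len("\n"))   # Output: 1
print(len(r"\n"))  # Output: 2

text = "cat concat cat"

# Without r, "\b" becomes a backspace character, so nothing matches
without_raw = re.findall("\bcat\b", text)
print(without_raw)  # Output: []

# With r, the RegEx engine sees \b and treats it as a word boundary
with_raw = re.findall(r"\bcat\b", text)
print(with_raw)  # Output: ['cat', 'cat']
```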