REGEX
Regex Tutorial

Fantastic! We'll go step-by-step, starting with the basics and progressing to advanced topics, ensuring you master Python regex. To make the course hands-on, I’ll create a comprehensive dataset that we’ll parse together throughout the course. We’ll gradually increase the complexity as we move forward.


Course Outline:

  1. Introduction to Regex & Python’s re Module
  2. Basic Patterns: Matching Single Characters
  3. Quantifiers: Repeating Patterns
  4. Character Classes & Sets
  5. Anchors: Beginning, End, Word Boundaries
  6. Grouping and Capturing
  7. Lookaheads and Lookbehinds
  8. Greedy vs. Non-Greedy Matching
  9. Flags for Regex Customization
  10. Advanced Data Extraction
  11. Real-World Parsing and Case Studies
  12. Performance Optimization in Regex

Step 1: Introduction to Regex & Python’s re Module

Before diving into regex, we need to understand its building blocks. In Python, regular expressions are handled using the re module. This module provides multiple functions to search and manipulate strings:

import re

Here are the most common re functions:

  1. re.search(pattern, string): Searches the entire string for the first match.
  2. re.match(pattern, string): Matches the pattern only at the beginning of the string.
  3. re.findall(pattern, string): Returns all occurrences of the pattern as a list.
  4. re.sub(pattern, replacement, string): Replaces each match with a replacement string, or with the return value of a function called on the match.
  5. re.split(pattern, string): Splits the string by the occurrences of the pattern.
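
As a quick illustration of the five functions above, here is a minimal sketch. The sample string and variable names are made up for this example and are not part of the course dataset.

```python
import re

sample = "cat bat rat"  # made-up sample string for illustration

first = re.search(r'[br]at', sample)        # first match anywhere in the string
at_start = re.match(r'[br]at', sample)      # None, because the string starts with "cat"
all_hits = re.findall(r'[a-z]at', sample)   # every non-overlapping match, as a list
replaced = re.sub(r'[a-z]at', 'dog', sample)
pieces = re.split(r'\s+', sample)           # split on runs of whitespace

print(first.group())  # bat
print(at_start)       # None
print(all_hits)       # ['cat', 'bat', 'rat']
print(replaced)       # dog dog dog
print(pieces)         # ['cat', 'bat', 'rat']
```

Note that re.search and re.match return a match object (or None), while re.findall returns plain strings.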

Step 2: Basic Patterns — Matching Single Characters

Patterns are the heart of regex. They are used to describe the text you want to match.

  • .: Matches any character (except newline).
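  • \d: Matches any digit (0-9; for str patterns in Python 3 it also matches other Unicode decimal digits unless re.ASCII is set).
  • \w: Matches any word character (letters, digits, and underscore).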
  • \s: Matches any whitespace (spaces, tabs, etc.).

Example:

text = "Python3 is fun!"
pattern = r'\w+'  # Matches any word
 
result = re.findall(pattern, text)
print(result)  # Output: ['Python3', 'is', 'fun']

Assignment:

  • Try using \d to find all the digits in the text.
  • Experiment with . to match any single character.

Step 3: Quantifiers — Repeating Patterns

Quantifiers allow you to specify how many times a pattern can occur.

  • *: Zero or more times.
  • +: One or more times.
  • ?: Zero or one time.
  • {n}: Exactly n times.
  • {n,m}: Between n and m times.

Example:

text = "I have 10 apples and 200 oranges."
pattern = r'\d+'  # Matches one or more digits
 
result = re.findall(pattern, text)
print(result)  # Output: ['10', '200']
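
The example above only exercises +. The following sketch, using a made-up sample string, shows how ?, * and {n,m} behave; the variable names are illustrative only.

```python
import re

text = "color colour colouur ab abbb a"  # made-up sample string

# ? makes the preceding 'u' optional (zero or one occurrence)
optional_u = re.findall(r'colou?r', text)

# * allows zero or more 'b' after 'a'; \b keeps each match to a whole word
zero_or_more = re.findall(r'\bab*\b', text)

# {2,3} requires two or three 'b' after 'a'
two_to_three = re.findall(r'ab{2,3}', text)

print(optional_u)    # ['color', 'colour']
print(zero_or_more)  # ['ab', 'abbb', 'a']
print(two_to_three)  # ['abbb']
```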

Assignment:

  • Use {2} to match exactly two-digit numbers.
  • Try {1,3} to match numbers with 1 to 3 digits.

Step 4: Character Classes & Sets

Character classes are used to define a set of characters you want to match.

  • [abc]: Matches either a, b, or c.
  • [^abc]: Matches anything except a, b, or c.

Example:

text = "cat cot cut"
pattern = r'c[aeiou]t'  # Matches cat, cot, cut
 
result = re.findall(pattern, text)
print(result)  # Output: ['cat', 'cot', 'cut']
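
The negated form [^abc] is not shown in the example above. A small sketch with a made-up sample string contrasts a negated class with a range:

```python
import re

text = "cat cot cut c9t c-t"  # made-up sample string

# [^aeiou] matches any single character that is NOT a lowercase vowel
non_vowel_middle = re.findall(r'c[^aeiou]t', text)

# [a-z] is a range: any lowercase letter
letter_middle = re.findall(r'c[a-z]t', text)

print(non_vowel_middle)  # ['c9t', 'c-t']
print(letter_middle)     # ['cat', 'cot', 'cut']
```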

Assignment:

  • Find all words that start with a consonant and end with a vowel.
  • Create a pattern that matches a string starting with a digit and ending with a letter.

Step 5: Anchors — Beginning, End, Word Boundaries

Anchors are used to match patterns at specific positions in the string.

  • ^: Matches the start of a string.
  • $: Matches the end of a string.
  • \b: Matches a word boundary.

Example:

text = "Start here and end there."
pattern = r'\b\w{3}\b'  # Matches any 3-letter word
 
result = re.findall(pattern, text)
print(result)  # Output: ['and', 'end']
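
The ^ and $ anchors can be sketched on the same sentence. The variable names are illustrative only; the last two lines show what \b adds compared with a bare pattern.

```python
import re

text = "Start here and end there."

first_word = re.findall(r'^\w+', text)      # word at the very start of the string
last_word = re.findall(r'\w+\.$', text)     # word plus the final period at the end
unanchored = re.findall(r'here', text)      # also finds "here" inside "there"
whole_word = re.findall(r'\bhere\b', text)  # only the standalone word

print(first_word)  # ['Start']
print(last_word)   # ['there.']
print(unanchored)  # ['here', 'here']
print(whole_word)  # ['here']
```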

Assignment:

  • Find all words that begin with a capital letter.
  • Find all strings that start with "Python" and end with a number.

Step 6: Grouping and Capturing

Grouping allows you to extract specific parts of a match.

  • (): Captures the part of the match inside the parentheses.
  • \1, \2, ...: Backreferences to the captured groups.

Example:

text = "My phone number is 123-456-7890."
pattern = r'(\d{3})-(\d{3})-(\d{4})'  # Groups for phone number parts
 
match = re.search(pattern, text)
if match:
    print(match.groups())  # Output: ('123', '456', '7890')
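
Backreferences are listed above but not demonstrated. One common use, sketched here on a made-up sentence, is finding an accidentally doubled word: \1 must match the same text that group 1 captured.

```python
import re

text = "This is is a test of of backreferences."  # made-up sample string

# (\w+) captures a word; \1 requires the same word again after one space
doubled_pattern = r'\b(\w+) \1\b'

doubled = re.findall(doubled_pattern, text)        # returns the captured group
cleaned = re.sub(doubled_pattern, r'\1', text)     # keep a single copy of the word

print(doubled)  # ['is', 'of']
print(cleaned)  # This is a test of backreferences.
```

In the replacement string of re.sub, \1 refers to the captured group in the same way.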

Assignment:

  • Write a regex to capture dates in the format DD-MM-YYYY and extract day, month, and year.
  • Use backreferences to reformat a phone number from 123-456-7890 to (123) 456-7890.

Step 7: Lookaheads and Lookbehinds

These allow you to match a pattern only if it is (or isn’t) followed or preceded by another pattern. The lookaround itself is not included in the match.

  • Positive Lookahead (?=): Ensures the pattern is followed by another.
  • Negative Lookahead (?!): Ensures the pattern is not followed by another.
  • Positive Lookbehind (?<=): Ensures the pattern is preceded by another.
  • Negative Lookbehind (?<!): Ensures the pattern is not preceded by another.

Example:

text = "apple pie and apple cake"
pattern = r'apple(?= pie)'  # Matches 'apple' only if followed by 'pie'
 
result = re.findall(pattern, text)
print(result)  # Output: ['apple']
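
The other three lookaround forms can be sketched the same way. The sample string below is made up; the \b anchors in the negative patterns are there so the engine cannot "escape" the check by matching only part of a number.

```python
import re

text = "Price: $40, discount 15%, total $34"  # made-up sample string

# Positive lookbehind: digits that come right after a dollar sign
dollars = re.findall(r'(?<=\$)\d+', text)

# Negative lookahead: whole numbers NOT followed by a percent sign
not_percent = re.findall(r'\b\d+\b(?!%)', text)

# Negative lookbehind: whole numbers NOT preceded by a dollar sign
not_dollar = re.findall(r'(?<!\$)\b\d+\b', text)

print(dollars)      # ['40', '34']
print(not_percent)  # ['40', '34']
print(not_dollar)   # ['15']
```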

Assignment:

  • Write a regex to find all numbers not followed by a letter.
  • Use lookbehind to match words preceded by a specific keyword.

Step 8: Greedy vs. Non-Greedy Matching

By default, quantifiers are greedy (they match as much text as possible). You can make a quantifier non-greedy by adding a ? directly after it.

  • Greedy: .* matches as much as possible.
  • Non-greedy: .*? matches as little as possible.

Example:

text = "<tag>content</tag><tag>more content</tag>"
pattern = r'<tag>.*?</tag>'  # Non-greedy match
 
result = re.findall(pattern, text)
print(result)  # Output: ['<tag>content</tag>', '<tag>more content</tag>']
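
For contrast, here is a sketch of what the greedy version returns on the same text, plus a non-greedy variant with a capturing group that keeps only the inner content. The variable names are illustrative.

```python
import re

text = "<tag>content</tag><tag>more content</tag>"

# Greedy: .* runs to the LAST </tag>, so everything is one match
greedy = re.findall(r'<tag>.*</tag>', text)

# Non-greedy with a group: findall returns only what the group captured
inner = re.findall(r'<tag>(.*?)</tag>', text)

print(greedy)  # ['<tag>content</tag><tag>more content</tag>']
print(inner)   # ['content', 'more content']
```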

Assignment:

  • Use greedy and non-greedy matching to extract the first and last <tag> contents.

Step 9: Flags for Regex Customization

Flags modify the behavior of regex. Some common flags include:

  • re.IGNORECASE (or re.I): Makes the pattern case-insensitive.
  • re.MULTILINE (or re.M): Allows ^ and $ to match at the start and end of each line.
  • re.DOTALL (or re.S): Allows . to match newline characters.

Example:

text = "This is python.\nPYTHON is fun!"
pattern = r'python'
 
result = re.findall(pattern, text, re.IGNORECASE)
print(result)  # Output: ['python', 'PYTHON']
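
The effect of re.MULTILINE and re.DOTALL is easiest to see by running the same pattern with and without the flag. This is a small sketch on a made-up three-line string.

```python
import re

text = "alpha one\nbeta two\ngamma three"  # made-up sample with two newlines

# Without re.MULTILINE, $ matches only at the end of the whole string
last_only = re.findall(r'\w+$', text)
# With re.MULTILINE, $ also matches at the end of every line
each_line = re.findall(r'\w+$', text, re.MULTILINE)

# Without re.DOTALL, . stops at a newline, so this cannot span lines
no_span = re.search(r'alpha.*three', text)
# With re.DOTALL, . also matches newlines
span = re.search(r'alpha.*three', text, re.DOTALL)

print(last_only)         # ['three']
print(each_line)         # ['one', 'two', 'three']
print(no_span)           # None
print(span is not None)  # True
```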

Assignment:

  • Use re.MULTILINE to match all words that begin a line.
  • Use re.DOTALL to match text across multiple lines.

Step 10: Advanced Data Extraction

Now that you've mastered the essentials, let's work on complex data extraction tasks involving dates, email addresses, URLs, and more.


Step 11: Real-World Parsing and Case Studies

We'll work with a dataset (I will provide this file) to demonstrate how to apply regex for extracting and manipulating data from logs, documents, and reports.


Step 12: Performance Optimization in Regex

Regex can be computationally expensive. In this section, we will cover best practices for optimizing regex performance.


Data for Parsing:

Let's create a dataset that includes various types of data for parsing throughout the course.

This file will include dates, phone numbers, email addresses, web URLs, and mixed content.

data = """ John Doe, born on 12-08-1985, is a software engineer working at Example Corp. His email is john.doe@example.com. He can be reached at (123) 456-7890. John has a blog at http://johndoe.dev, and his personal website is https://johns-portfolio.com.

Jane Smith, a project manager, was born on 03/14/1990. Her contact number is 987-654-3210, and her email is j.smith@sample.org. She maintains a project blog at https://pm-jane.com.

An event titled "Tech Conference 2023" is scheduled for 15-09-2023. For more details, visit https://techconf2023.org.

Internal server logs: [ERROR] 2024-01-05 13:55:23 - Connection failed for user 'admin' from IP 192.168.1.15. [INFO] 2024-01-05 13:57:45 - Reconnection attempt successful. [WARNING] 2024-01-05 14:05:10 - High memory usage detected: 85%.

Random data: foo@bar@baz.com, 09/31/2023, (555) 987-6543, https://example_website.com/contact-us

"""

Great! Let’s start using the dataset you’ve downloaded. We’ll begin with basic regex operations and gradually work through more advanced patterns.


Step 2: Basic Patterns — Matching Single Characters

In this step, you’ll learn how to match individual characters and simple patterns.

Goal:

We’ll start by extracting individual components such as dates, email addresses, and phone numbers from the dataset.


1. Matching Dates

Dates in the prose part of the dataset follow two formats: DD-MM-YYYY and MM/DD/YYYY. (The log timestamps use YYYY-MM-DD, which this pattern does not target.) Let's create a regex to match both.

Regex Explanation:

  • \d{2}: Matches exactly two digits (for day or month).
  • \d{4}: Matches exactly four digits (for year).
  • [/-]: Matches either a slash / or a dash -.

Example Code:

import re
 
# Load the dataset (assumes the text from "Data for Parsing" was saved as large_dataset.txt)
with open('large_dataset.txt', 'r', encoding='utf-8') as file:
    data = file.read()
 
# Regex for matching dates in both DD-MM-YYYY and MM/DD/YYYY formats
date_pattern = r'\b\d{2}[/-]\d{2}[/-]\d{4}\b'
 
# Find all dates in the dataset
dates = re.findall(date_pattern, data)
 
print("Extracted Dates:", dates)

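This pattern will match dates like 12-08-1985 and 03/14/1990. It checks only the shape of the text, so it also accepts mixed separators (12-08/1985) and impossible dates such as 09/31/2023 from the "Random data" line.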
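
One way to tighten this, sketched below, is to capture the separator and reuse it with a backreference, then let datetime decide whether the date exists. The helper name is_real_date is hypothetical, and the sketch assumes that dashes always mean DD-MM-YYYY and slashes always mean MM/DD/YYYY, as in the dataset.

```python
import re
from datetime import datetime

sample = "12-08-1985, 03/14/1990, 09/31/2023"  # dates taken from the dataset

# ([/-]) captures the first separator; \2 forces the second one to be the same
date_pattern = r'\b(\d{2})([/-])(\d{2})\2(\d{4})\b'

def is_real_date(first, sep, second, year):
    """Hypothetical helper: True if the parts form a real calendar date."""
    # Assumption: '-' means DD-MM-YYYY and '/' means MM/DD/YYYY
    fmt = "%d-%m-%Y" if sep == "-" else "%m/%d/%Y"
    try:
        datetime.strptime(f"{first}{sep}{second}{sep}{year}", fmt)
        return True
    except ValueError:
        return False

# With several groups, findall returns one tuple of groups per match
results = [is_real_date(*parts) for parts in re.findall(date_pattern, sample)]
print(results)  # [True, True, False]
```

The last value is False because September has only 30 days.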


2. Matching Email Addresses

Email addresses follow a common pattern: username@domain.com. Let’s write a regex to match them.

Regex Explanation:

  • \b: Matches a word boundary at each end of the address.
  • [\w.-]+: Matches one or more word characters, dots, or dashes (the username, like john.doe).
  • @: Matches the "@" symbol.
  • [\w.-]+: Matches the domain name (like example), including any subdomains.
  • \.[a-z]{2,}: Matches the domain suffix (like .com or .org). It is lowercase-only unless you pass re.IGNORECASE.

Example Code:

# Regex for matching email addresses
email_pattern = r'\b[\w.-]+@[\w.-]+\.[a-z]{2,}\b'
 
# Find all email addresses in the dataset
emails = re.findall(email_pattern, data)
 
print("Extracted Emails:", emails)

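This will extract email addresses such as john.doe@example.com and j.smith@sample.org. It also returns bar@baz.com from the malformed foo@bar@baz.com in the "Random data" line, because the pattern does not check what comes before the match.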
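
The sketch below shows that behaviour and one possible tightening: a negative lookbehind that refuses to start a match right after another address character or "@". The name strict_pattern and the mixed-case address are made up for this example, and neither pattern is a full email validator.

```python
import re

email_pattern = r'\b[\w.-]+@[\w.-]+\.[a-z]{2,}\b'
# Assumed variant: do not start a match directly after a word char, '.', '@' or '-'
strict_pattern = r'(?<![\w.@-])[\w.-]+@[\w.-]+\.[a-z]{2,}\b'

# First two addresses come from the dataset; the last one is made up
sample = "john.doe@example.com, foo@bar@baz.com, Jane@Sample.ORG"

loose = re.findall(email_pattern, sample)
strict = re.findall(strict_pattern, sample)
any_case = re.findall(strict_pattern, sample, re.IGNORECASE)

print(loose)     # ['john.doe@example.com', 'bar@baz.com']
print(strict)    # ['john.doe@example.com']
print(any_case)  # ['john.doe@example.com', 'Jane@Sample.ORG']
```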


3. Matching Phone Numbers

Phone numbers in the dataset follow formats like (123) 456-7890 and 987-654-3210. Let’s write a regex to capture them.

Regex Explanation:

  • \(?\d{3}\)?: Matches a three-digit area code, with an optional opening and closing parenthesis.
  • [- ]: Matches a dash or space separator.
  • \d{3}[- ]\d{4}: Matches the rest of the phone number with either a dash or space separator.

Example Code:

# Regex for matching phone numbers
phone_pattern = r'\(?\d{3}\)?[- ]\d{3}[- ]\d{4}'
 
# Find all phone numbers in the dataset
phones = re.findall(phone_pattern, data)
 
print("Extracted Phone Numbers:", phones)

This pattern will match both formats of phone numbers like (123) 456-7890 and 987-654-3210.
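
Once both formats match, capture groups make it easy to rewrite them into a single format. This is a minimal sketch; the target format 123-456-7890 is just one arbitrary choice.

```python
import re

# Same pattern as above, with each digit block wrapped in a capturing group
phone_pattern = r'\(?(\d{3})\)?[- ](\d{3})[- ](\d{4})'

sample = "(123) 456-7890, 987-654-3210, (555) 987-6543"  # numbers from the dataset

# \1, \2 and \3 in the replacement refer to the three captured digit blocks
normalized = re.sub(phone_pattern, r'\1-\2-\3', sample)

print(normalized)  # 123-456-7890, 987-654-3210, 555-987-6543
```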


Assignment:

  1. Extract Dates: Use the provided regex to extract all dates in the dataset.
  2. Extract Emails: Modify the email regex if necessary and extract all email addresses.
  3. Extract Phone Numbers: Write a regex to capture the phone numbers, ensuring that both formats (with or without parentheses) are matched.

Awesome! Let's continue by adding more complexity to your regex skills. Now that you've practiced matching basic patterns like dates, emails, and phone numbers, we’ll move into quantifiers and character classes, giving us the ability to match more sophisticated patterns.


Step 3: Quantifiers — Repeating Patterns

Quantifiers allow us to specify how many times a character or pattern should be repeated. You’ve already used some quantifiers, like {2} and {4}, in the previous exercises. Now we’ll cover them more comprehensively.

Common Quantifiers:

  • *: Matches zero or more occurrences.
  • +: Matches one or more occurrences.
  • ?: Matches zero or one occurrence.
  • {n}: Matches exactly n occurrences.
  • {n,m}: Matches between n and m occurrences.

1. Matching URLs with Quantifiers

Now, let’s write a regex to extract URLs from the dataset. URLs can be complex, so quantifiers will help match their varying lengths.

Regex Breakdown:

  • https?: Matches either http or https.
  • ://: Matches the protocol separator.
  • [\w.-]+: Matches the domain name.
  • \.[a-z]{2,}: Matches the domain suffix (like .com, .org).
  • [\w./?&=-]*: Matches any additional URL path or query parameters.

Example Code:

# Regex for matching URLs
url_pattern = r'https?://[\w.-]+\.[a-z]{2,}[\w./?&=-]*'
 
# Find all URLs in the dataset
urls = re.findall(url_pattern, data)
 
print("Extracted URLs:", urls)

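This will capture URLs like http://johndoe.dev and https://techconf2023.org. Because the final character class includes the dot, a URL that is immediately followed by a sentence-ending period is captured with that period attached.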
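
A simple way to handle trailing punctuation, sketched here on a made-up sentence, is to strip it after matching rather than complicate the pattern. The choice of characters to strip is an assumption and may need adjusting for other data.

```python
import re

url_pattern = r'https?://[\w.-]+\.[a-z]{2,}[\w./?&=-]*'

# Made-up sentence using two URLs from the dataset
sample = "Visit https://techconf2023.org. Docs: https://example_website.com/contact-us"

raw = re.findall(url_pattern, sample)
# Drop trailing '.' or '?' that belongs to the sentence, not the URL
cleaned = [url.rstrip('.?') for url in raw]

print(raw)      # ['https://techconf2023.org.', 'https://example_website.com/contact-us']
print(cleaned)  # ['https://techconf2023.org', 'https://example_website.com/contact-us']
```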


2. Matching IP Addresses

IP addresses typically look like 192.168.1.15, which consists of four groups of digits separated by periods. We’ll write a regex to match this pattern, using quantifiers.

Regex Breakdown:

  • \d{1,3}: Matches 1 to 3 digits. Each part of a valid IPv4 address is between 0 and 255, but this pattern does not check the range, so a part like 999 also matches.
  • \.: Matches the literal dot . (escape it with a backslash).

Example Code:

# Regex for matching IP addresses
ip_pattern = r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
 
# Find all IP addresses in the dataset
ips = re.findall(ip_pattern, data)
 
print("Extracted IP Addresses:", ips)

This will extract IP addresses such as 192.168.1.15.
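
Since the pattern accepts any one-to-three-digit group, a range check is usually done in Python after matching. The sketch below uses a hypothetical helper, is_valid_ipv4, and a made-up invalid address.

```python
import re

ip_pattern = r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'

# First address is from the dataset; the second is made up and out of range
sample = "Valid: 192.168.1.15. Not valid: 999.300.1.2"

def is_valid_ipv4(candidate):
    """Hypothetical helper: every dotted part must be between 0 and 255."""
    return all(0 <= int(part) <= 255 for part in candidate.split('.'))

candidates = re.findall(ip_pattern, sample)
valid = [ip for ip in candidates if is_valid_ipv4(ip)]

print(candidates)  # ['192.168.1.15', '999.300.1.2']
print(valid)       # ['192.168.1.15']
```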


3. Matching Error Logs with Quantifiers

Let’s extract specific types of log entries (like ERROR, INFO, and WARNING) from the log section of the dataset.

Regex Breakdown:

  • \[\w+\]: Matches log levels like [ERROR] or [INFO].
  • \d{4}-\d{2}-\d{2}: Matches the date in YYYY-MM-DD format.
  • .*: Matches everything after the date (the time and the message), up to the end of the line.

Example Code:

# Regex for matching log entries
log_pattern = r'\[\w+\] \d{4}-\d{2}-\d{2} .*'
 
# Find all log entries in the dataset
logs = re.findall(log_pattern, data)
 
print("Extracted Log Entries:", logs)

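Because . does not match a newline, .* runs to the end of the line. If each entry sits on its own line, this extracts entries like [ERROR] 2024-01-05 13:55:23 - Connection failed for user 'admin' from IP 192.168.1.15. In the dataset above, the three entries share one line, so the greedy .* returns them as a single match that starts at [ERROR].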
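
When several entries share a line, one possible approach is a non-greedy .*? combined with a lookahead that stops at the next "[LEVEL] YYYY-" marker. The sketch below also groups entries by level; the names entry_pattern and by_level are illustrative, and the pattern assumes the exact "[LEVEL] date time - message" layout used in the dataset.

```python
import re
from collections import defaultdict

# The log line from the dataset, split across literals for readability
log_line = (
    "[ERROR] 2024-01-05 13:55:23 - Connection failed for user 'admin' from IP 192.168.1.15. "
    "[INFO] 2024-01-05 13:57:45 - Reconnection attempt successful. "
    "[WARNING] 2024-01-05 14:05:10 - High memory usage detected: 85%."
)

# Groups: level, timestamp, message. The lazy .*? stops as soon as the lookahead
# sees the start of the next entry, or the end of the line.
entry_pattern = (
    r'\[(\w+)\] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (.*?)'
    r'(?= \[\w+\] \d{4}-|$)'
)

by_level = defaultdict(list)
for level, timestamp, message in re.findall(entry_pattern, log_line, re.MULTILINE):
    by_level[level].append((timestamp, message))

for level, entries in by_level.items():
    print(level, entries)
```

re.MULTILINE is passed so that $ also works when the input contains several lines.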


Assignment:

  1. Extract URLs: Use the regex provided to extract all URLs in the dataset.
  2. Extract IP Addresses: Write a regex to capture all IP addresses in the dataset.
  3. Extract Log Entries: Write a regex to capture the log entries and categorize them by type (e.g., ERROR, INFO).

Step 4: Character Classes & Sets

Character classes allow us to define a set of characters that we want to match. We’ve touched on this in Step 2, but now we’ll cover more advanced uses.

Character Classes:

  • [abc]: Matches either a, b, or c.
  • [^abc]: Matches anything except a, b, or c.
  • [a-z]: Matches any lowercase letter.
  • [A-Z]: Matches any uppercase letter.
  • [0-9]: Matches any digit.

Example 1: Matching Hexadecimal Colors

Hex colors, like #FF5733, are a common pattern that uses character classes. Let’s write a regex to capture them.

Regex Breakdown:

  • #: Matches the literal # symbol.
  • [0-9A-Fa-f]{6}: Matches exactly six characters that are digits or the letters A-F in either upper or lower case.

Example Code:

# Regex for matching hexadecimal color codes
color_pattern = r'#[0-9A-Fa-f]{6}'
 
# Example dataset with hex colors
color_data = "Here are some colors: #FF5733, #00FF00, and #0000FF."
 
# Find all hexadecimal colors in the dataset
colors = re.findall(color_pattern, color_data)
 
print("Extracted Colors:", colors)
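
As an optional extension, the sketch below also accepts the three-digit shorthand form (such as #0f0) by combining two character-class alternatives in a non-capturing group. The sample string and the name flexible_pattern are made up for this example.

```python
import re

# Six hex digits, or three; \b rejects codes that run on into more word characters
flexible_pattern = r'#(?:[0-9A-Fa-f]{6}|[0-9A-Fa-f]{3})\b'

color_sample = "Colors: #FF5733, #0f0, and #12345G."  # made-up sample string

flexible_colors = re.findall(flexible_pattern, color_sample)

print(flexible_colors)  # ['#FF5733', '#0f0']
```

#12345G is skipped because G is not a hexadecimal digit and the three-digit alternative fails the \b check.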

Example 2: Matching Alphanumeric Codes

Sometimes we need to extract alphanumeric codes like product IDs (ABC123). Let’s write a regex for that.

Regex Breakdown:

  • [A-Z]{3}: Matches exactly three uppercase letters.
  • \d{3}: Matches exactly three digits.

Example Code:

# Regex for matching alphanumeric codes (e.g., product IDs)
code_pattern = r'[A-Z]{3}\d{3}'
 
# Example dataset with product IDs
code_data = "Product IDs: ABC123, DEF456, and GHI789."
 
# Find all product IDs in the dataset
codes = re.findall(code_pattern, code_data)
 
print("Extracted Product IDs:", codes)
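
Without anchors, this pattern will also match inside longer strings. The sketch below, on a made-up sample, shows the difference that word boundaries make; strict_code_pattern is an illustrative name.

```python
import re

code_pattern = r'[A-Z]{3}\d{3}'
strict_code_pattern = r'\b[A-Z]{3}\d{3}\b'  # same pattern, limited to whole words

code_sample = "ABC123, XABC1234, DEF456"  # made-up sample; the middle item is not a valid ID

loose_codes = re.findall(code_pattern, code_sample)
strict_codes = re.findall(strict_code_pattern, code_sample)

print(loose_codes)   # ['ABC123', 'ABC123', 'DEF456']
print(strict_codes)  # ['ABC123', 'DEF456']
```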

Assignment:

  1. Extract Hexadecimal Colors: Try writing a regex to find any hexadecimal colors in a text.
  2. Extract Product IDs: Write a regex to capture alphanumeric product IDs (e.g., ABC123).