How to Learn Regex Step by Step with Examples
Regular expressions (regex) are one of the most powerful tools in a programmer's toolkit, yet they often intimidate beginners with their cryptic syntax and complex patterns. This comprehensive guide will take you from regex novice to confident practitioner through step-by-step explanations, practical examples, and hands-on exercises across multiple platforms including Python, JavaScript, and Linux CLI.
What is Regular Expression (Regex)?
Regular expressions are sequences of characters that define search patterns, primarily used for string matching, validation, and text manipulation. Think of regex as a sophisticated "find and replace" tool that can identify complex patterns in text data with surgical precision.
Why Learn Regex?
- Data Validation: Verify email addresses, phone numbers, and user input
- Text Processing: Extract specific information from large datasets
- Log Analysis: Parse server logs and system files efficiently
- Web Scraping: Extract structured data from HTML content
- Code Refactoring: Perform complex search-and-replace operations across codebases
Basic Regex Syntax and Metacharacters
Literal Characters
The simplest regex patterns are literal characters that match themselves exactly:
Python Example:
`python
import re
text = "Hello World"
pattern = "Hello"
match = re.search(pattern, text)
print(match.group()) # Output: Hello
`
JavaScript Example:
`javascript
const text = "Hello World";
const pattern = /Hello/;
const match = text.match(pattern);
console.log(match[0]); // Output: Hello
`
Linux CLI Example:
`bash
echo "Hello World" | grep "Hello"
# Output: Hello World
`
Essential Metacharacters
#### The Dot (.) - Any Character
The dot matches any single character except newline:
Python:
`python
import re
text = "cat bat rat"
pattern = r".at"
matches = re.findall(pattern, text)
print(matches) # Output: ['cat', 'bat', 'rat']
`
JavaScript:
`javascript
const text = "cat bat rat";
const pattern = /.at/g;
const matches = text.match(pattern);
console.log(matches); // Output: ['cat', 'bat', 'rat']
`
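The "except newline" caveat is easy to overlook. A minimal Python sketch, using an illustrative string that is not one of the examples above, shows the default behaviour and how the standard `re.DOTALL` flag changes it:

```python
import re

# Illustrative input: the second "c" and "t" are separated by a newline
text = "cat c\nt"

# By default the dot does not match "\n", so only "cat" is found
default_matches = re.findall(r"c.t", text)

# With re.DOTALL the dot matches any character, including "\n"
dotall_matches = re.findall(r"c.t", text, re.DOTALL)

print(default_matches)  # ['cat']
print(dotall_matches)   # ['cat', 'c\nt']
```

JavaScript offers a comparable `s` flag for the same purpose.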
#### Anchors (^ and $)
- ^ matches the beginning of a string
- $ matches the end of a string
Python:
`python
import re
emails = ["user@email.com", "invalid-email", "test@domain.org"]
pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
for email in emails:
    if re.match(pattern, email):
        print(f"{email} is valid")
    else:
        print(f"{email} is invalid")
`
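In Python, what an anchor matches also depends on the flags and the function used. The following sketch, with an illustrative log string, contrasts the default behaviour of `^` with `re.MULTILINE`, and shows `re.fullmatch` as an alternative to writing `^...$` by hand:

```python
import re

# Illustrative multi-line input
log = "ok: started\nerror: disk full\nok: finished"

# Without flags, ^ only matches at the very start of the string
first_only = re.findall(r"^\w+", log)

# With re.MULTILINE, ^ also matches right after each newline
each_line = re.findall(r"^\w+", log, re.MULTILINE)

# re.fullmatch succeeds only if the entire string matches the pattern
is_number = bool(re.fullmatch(r"\d+", "12345"))
not_number = bool(re.fullmatch(r"\d+", "12345abc"))

print(first_only)             # ['ok']
print(each_line)              # ['ok', 'error', 'ok']
print(is_number, not_number)  # True False
```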
Character Classes and Ranges
Character classes allow you to match any character from a specific set.
Basic Character Classes
Square Brackets [ ]:
`python
import re
text = "The year 2023 was great"
pattern = r"[0-9]" # Match any digit
matches = re.findall(pattern, text)
print(matches) # Output: ['2', '0', '2', '3']
`
Character Ranges:
`python
import re
text = "Hello World 123"
pattern = r"[a-z]" # Match lowercase letters
matches = re.findall(pattern, text)
print(matches) # Output: ['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']
`
Predefined Character Classes
| Class | Description | Equivalent |
|-------|-------------|------------|
| \d | Digits | [0-9] |
| \w | Word characters | [a-zA-Z0-9_] |
| \s | Whitespace | [ \t\n\r\f\v] |
| \D | Non-digits | [^0-9] |
| \W | Non-word characters | [^a-zA-Z0-9_] |
| \S | Non-whitespace | [^ \t\n\r\f\v] |
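As a rough Python illustration of these classes (the sample sentence is made up), the sketch below applies a few of them with `re.findall` and `re.split`. It also shows one caveat: for Python 3 string patterns, `\d` matches any Unicode decimal digit unless `re.ASCII` is set, so the "Equivalent" column is exact only in ASCII mode.

```python
import re

text = "Order 42 shipped to Unit 7B"

digits = re.findall(r"\d+", text)     # runs of digits
words = re.findall(r"\w+", text)      # runs of word characters
pieces = re.split(r"\s+", text)       # split on runs of whitespace

# "\u0663" is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit
unicode_digit = bool(re.fullmatch(r"\d", "\u0663"))
ascii_digit = bool(re.fullmatch(r"\d", "\u0663", re.ASCII))

print(digits)                      # ['42', '7']
print(words)                       # ['Order', '42', 'shipped', 'to', 'Unit', '7B']
print(pieces)                      # ['Order', '42', 'shipped', 'to', 'Unit', '7B']
print(unicode_digit, ascii_digit)  # True False
```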
JavaScript Example:
`javascript
const text = "Phone: 123-456-7890";
const phonePattern = /\d{3}-\d{3}-\d{4}/;
const match = text.match(phonePattern);
console.log(match[0]); // Output: 123-456-7890
`
Quantifiers: Controlling Match Frequency
Quantifiers specify how many times a character or group should be matched.
Basic Quantifiers
| Quantifier | Description | Example |
|------------|-------------|---------|
| * | 0 or more | a* matches "", "a", "aa", "aaa" |
| + | 1 or more | a+ matches "a", "aa", "aaa" |
| ? | 0 or 1 | a? matches "", "a" |
| {n} | Exactly n | a{3} matches "aaa" |
| {n,} | n or more | a{2,} matches "aa", "aaa", "aaaa" |
| {n,m} | Between n and m | a{2,4} matches "aa", "aaa", "aaaa" |
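One way to check the table is to run each quantifier against a handful of candidate strings. The sketch below uses `re.fullmatch`, so a candidate counts only if the whole string matches; the candidate list is an arbitrary choice for illustration.

```python
import re

candidates = ["", "a", "aa", "aaa", "aaaaa"]
patterns = [r"a*", r"a+", r"a?", r"a{3}", r"a{2,}", r"a{2,4}"]

results = {}
for pattern in patterns:
    # Keep the candidates that the pattern matches in their entirety
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]

for pattern, matched in results.items():
    print(f"{pattern:8} {matched}")
```

For example, `a{2,4}` accepts `"aa"` and `"aaa"` from this list but rejects `"aaaaa"`, because five repetitions exceed the upper bound.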
Python Example - Password Validation:
`python
import re
def validate_password(password):
    # At least 8 characters, contains uppercase, lowercase, digit, special char
    pattern = r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%?&])[A-Za-z\d@$!%?&]{8,}$"
    return bool(re.match(pattern, password))
passwords = ["Password123!", "weak", "NoSpecial123", "SHORT1!"]
for pwd in passwords:
    result = "Valid" if validate_password(pwd) else "Invalid"
    print(f"{pwd}: {result}")
`
Greedy vs. Non-Greedy Matching
By default, quantifiers are greedy (match as much as possible). Add ? to make them non-greedy:
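A small Python sketch makes the difference concrete. The quoted-text sample is an illustrative choice: the greedy `.*` runs to the last closing quote, while the non-greedy `.*?` stops at the first one.

```python
import re

text = 'say "hello" and "goodbye"'

# Greedy: .* extends as far as possible, up to the last closing quote
greedy = re.findall(r'".*"', text)

# Non-greedy: .*? stops as early as possible, at the first closing quote
lazy = re.findall(r'".*?"', text)

print(greedy)  # ['"hello" and "goodbye"']
print(lazy)    # ['"hello"', '"goodbye"']
```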
JavaScript Example:
`javascript
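const html = "<b>one</b> and <b>two</b>";
// Greedy matching: .* runs to the last closing tag
const greedyPattern = /<b>.*<\/b>/;
// Non-greedy matching: .*? stops at the first closing tag
const nonGreedyPattern = /<b>.*?<\/b>/;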
`
Groups and Capturing
Groups allow you to treat multiple characters as a single unit and capture matched content for later use.
Basic Grouping with Parentheses
Python Example - Extracting Date Components:
`python
import re
text = "Today is 2023-12-25"
pattern = r"(\d{4})-(\d{2})-(\d{2})"
match = re.search(pattern, text)
if match:
    year, month, day = match.groups()
    print(f"Year: {year}, Month: {month}, Day: {day}")
    # Output: Year: 2023, Month: 12, Day: 25
`
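Captured groups can also be reused in a replacement. As a minimal sketch using the same date pattern, `re.sub` accepts `\1`, `\2`, and `\3` in the replacement string to refer back to each group; the day/month/year output format is just an example.

```python
import re

text = "Today is 2023-12-25"
pattern = r"(\d{4})-(\d{2})-(\d{2})"

# \3, \2 and \1 refer to the third, second and first captured groups
reordered = re.sub(pattern, r"\3/\2/\1", text)

print(reordered)  # Today is 25/12/2023
```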
Named Groups
Named groups make your regex more readable and maintainable:
Python:
`python
import re
text = "John Doe, Age: 30, Email: john@email.com"
pattern = r"(?P<name>\w+ \w+), Age: (?P<age>\d+), Email: (?P<email>\S+)"
match = re.search(pattern, text)
if match:
    print(f"Name: {match.group('name')}")
    print(f"Age: {match.group('age')}")
    print(f"Email: {match.group('email')}")
`
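When every field of interest is named, `match.groupdict()` returns all of them at once as a dictionary. The sketch below uses a hypothetical `key=value` line and group names chosen only for illustration.

```python
import re

# Hypothetical input line and group names, for illustration only
line = "user=alice id=42"
pattern = r"user=(?P<user>\w+) id=(?P<id>\d+)"

match = re.search(pattern, line)

# groupdict() maps each group name to the text it captured
record = match.groupdict() if match else {}

print(record)  # {'user': 'alice', 'id': '42'}
```

Captured values are always strings, so `record["id"]` would still need `int(...)` before numeric use.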
Non-Capturing Groups
Use (?:...) when you need grouping but don't want to capture the content:
JavaScript:
`javascript
const text = "The colors are red, blue, and green";
const pattern = /(?:red|blue|green)/g;
const matches = text.match(pattern);
console.log(matches); // Output: ['red', 'blue', 'green']
`
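In Python the choice between the two group types visibly changes what `re.findall` returns: when the pattern has one capturing group, it returns that group's text rather than the whole match. A short sketch with made-up file names:

```python
import re

text = "file1.jpg file2.png notes.txt"

# One capturing group: findall returns only the captured extension
capturing = re.findall(r"\w+\.(jpg|png)", text)

# Non-capturing group: findall returns the whole match
non_capturing = re.findall(r"\w+\.(?:jpg|png)", text)

print(capturing)      # ['jpg', 'png']
print(non_capturing)  # ['file1.jpg', 'file2.png']
```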
Alternation and Choice
The pipe symbol | allows you to match one of several alternatives:
Python Example - File Extension Validation:
`python
import re
def get_file_type(filename):
    pattern = r"\.(?:jpg|jpeg|png|gif|bmp)$"
    if re.search(pattern, filename, re.IGNORECASE):
        return "Image"
    pattern = r"\.(?:pdf|doc|docx|txt)$"
    if re.search(pattern, filename, re.IGNORECASE):
        return "Document"
    return "Unknown"
files = ["photo.jpg", "document.pdf", "script.py", "image.PNG"]
for file in files:
    print(f"{file}: {get_file_type(file)}")
`
Lookahead and Lookbehind Assertions
Assertions allow you to match based on what comes before or after without including it in the match.
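The examples in this section use lookahead. Lookbehind, written `(?<=...)` and `(?<!...)`, applies the same idea to the text before the current position. A minimal Python sketch with an illustrative price string (note that Python's `re` module requires lookbehind patterns to have a fixed width):

```python
import re

text = "Price: $45, discount: $5, items: 3"

# Positive lookbehind: digits preceded by "$"; the "$" is not part of the match
dollar_amounts = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: digits not preceded by "$" or by another digit
plain_numbers = re.findall(r"(?<![$\d])\d+", text)

print(dollar_amounts)  # ['45', '5']
print(plain_numbers)   # ['3']
```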
Positive Lookahead (?=...)
JavaScript Example - Password with Requirements:
`javascript
function validatePassword(password) {
  // Must contain at least one uppercase letter
  const hasUpper = /(?=.*[A-Z])/.test(password);
  // Must contain at least one digit
  const hasDigit = /(?=.*\d)/.test(password);
  // Must be at least 8 characters
  const hasLength = /(?=.{8,})/.test(password);
  return hasUpper && hasDigit && hasLength;
}
const passwords = ["Password123", "password", "PASSWORD123", "Pass1"];
passwords.forEach(pwd => {
  console.log(`${pwd}: ${validatePassword(pwd) ? 'Valid' : 'Invalid'}`);
});
`
Negative Lookahead (?!...)
Python Example - Exclude Certain Patterns:
`python
import re
text = "apple application apply appreciate"
# Match words starting with "app" but not "apple"
pattern = r"\bapp(?!le)\w*"
matches = re.findall(pattern, text)
print(matches)  # Output: ['application', 'apply', 'appreciate']
`
Real-World Regex Examples
Email Validation
Comprehensive Email Regex:
`python
import re
def validate_email(email):
    pattern = r"""
        ^                   # Start of string
        [a-zA-Z0-9._%+-]+   # Username part
        @                   # @ symbol
        [a-zA-Z0-9.-]+      # Domain name
        \.                  # Dot
        [a-zA-Z]{2,}        # Top-level domain
        $                   # End of string
    """
    return bool(re.match(pattern, email, re.VERBOSE))
emails = [
    "user@example.com",
    "test.email+tag@domain.co.uk",
    "invalid@",
    "@invalid.com",
    "user@domain",
]
for email in emails:
    result = "✓" if validate_email(email) else "✗"
    print(f"{result} {email}")
`
Phone Number Extraction
JavaScript Example:
`javascript
function extractPhoneNumbers(text) {
  // Matches various phone number formats
  const patterns = [
    /\b\d{3}-\d{3}-\d{4}\b/g,      // 123-456-7890
    // No leading \b here: "(" is not a word character, so \b before it
    // would only match when a letter, digit, or underscore precedes it
    /\(\d{3}\)\s?\d{3}-\d{4}\b/g,  // (123) 456-7890 or (123)456-7890
    /\b\d{3}\.\d{3}\.\d{4}\b/g,    // 123.456.7890
    /\b\d{10}\b/g                  // 1234567890
  ];
  let allMatches = [];
  patterns.forEach(pattern => {
    const matches = text.match(pattern);
    if (matches) {
      allMatches = allMatches.concat(matches);
    }
  });
  return allMatches;
}
const text = `
  Contact us at 123-456-7890 or (555) 123-4567.
  You can also reach us at 555.987.6543 or 9876543210.
`;
console.log(extractPhoneNumbers(text));
`
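The same idea can be sketched in Python with a single compiled pattern that joins the formats by alternation. This is one possible arrangement, not the only one; the name `PHONE_PATTERN` is chosen here for illustration. Because one pattern scans the text once, matches come back in the order they appear rather than grouped by format.

```python
import re

# Hypothetical combined pattern; re.VERBOSE allows whitespace and comments
PHONE_PATTERN = re.compile(r"""
    \(\d{3}\)\s?\d{3}-\d{4}    # (123) 456-7890 or (123)456-7890
  | \b\d{3}-\d{3}-\d{4}\b      # 123-456-7890
  | \b\d{3}\.\d{3}\.\d{4}\b    # 123.456.7890
  | \b\d{10}\b                 # 1234567890
""", re.VERBOSE)

text = (
    "Contact us at 123-456-7890 or (555) 123-4567. "
    "You can also reach us at 555.987.6543 or 9876543210."
)

numbers = PHONE_PATTERN.findall(text)
print(numbers)
# ['123-456-7890', '(555) 123-4567', '555.987.6543', '9876543210']
```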
URL Extraction and Validation
Python Example:
`python
import re
def extract_urls(text):
    pattern = r"""
        https?://                     # http:// or https://
        (?:[-\w.])+                   # Domain name
        (?:\.[a-zA-Z]{2,})?           # Optional TLD
        (?:/                          # Optional path
            (?:[\w.,@?^=%&:/~+#-]*    # Path characters
            [\w@?^=%&/~+#-])?         # Path must end with these chars
        )?
    """
    return re.findall(pattern, text, re.VERBOSE)
text = """
Visit our website at https://www.example.com
or check out http://subdomain.site.org/path/to/page?param=value
"""
urls = extract_urls(text)
for url in urls:
    print(f"Found URL: {url}")
`
Regex in Different Programming Languages
Python Regex Module
Key Functions:
`python
import re
text = "The quick brown fox jumps over the lazy dog"
# re.search() - Find first match
match = re.search(r"brown \w+", text)
print(match.group() if match else "Not found")
# re.findall() - Find all matches
words = re.findall(r"\b\w{4}\b", text)  # 4-letter words
print(words)
# re.sub() - Replace matches
result = re.sub(r"\b\w{4}\b", "", text)
print(result)
# re.split() - Split by pattern
parts = re.split(r"\s+", text)
print(parts)
`
Compiled Patterns for Performance:
`python
import re
# Compile pattern once for multiple uses
email_pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
emails = ["user1@test.com", "user2@example.org", "invalid-email"]
for email in emails:
    if email_pattern.match(email):
        print(f"Valid: {email}")
`
JavaScript Regex
Regex Methods in JavaScript:
`javascript
const text = "JavaScript is awesome, and JavaScript is powerful";
const pattern = /JavaScript/gi; // Global, case-insensitive
// String methods
console.log(text.match(pattern)); // Find all matches
console.log(text.search(pattern)); // Find first match index
console.log(text.replace(pattern, "JS")); // Replace matches
// RegExp methods (with the g flag, test() and exec() resume from pattern.lastIndex)
console.log(pattern.test(text)); // Test if pattern exists
console.log(pattern.exec(text)); // Get match details
`
Advanced JavaScript Example:
`javascript
class TextProcessor {
  constructor() {
    this.patterns = {
      // [A-Za-z] rather than [A-Z|a-z]: inside a character class, | is a literal pipe
      email: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g,
      phone: /\b\d{3}-\d{3}-\d{4}\b/g,
      url: /https?:\/\/(?:[-\w.])+(?:\.[a-zA-Z]{2,})?(?:\/[^\s]*)?/g
    };
  }
  extractData(text) {
    const result = {};
    for (const [key, pattern] of Object.entries(this.patterns)) {
      result[key] = text.match(pattern) || [];
    }
    return result;
  }
}
const processor = new TextProcessor();
const sampleText = `
  Contact John at john@email.com or 123-456-7890.
  Visit https://www.example.com for more info.
`;
console.log(processor.extractData(sampleText));
`
Linux Command Line Regex Tools
grep - Pattern Searching
Basic grep Usage:
`bash
# Search for pattern in file
grep "error" /var/log/syslog
# Case-insensitive search
grep -i "warning" /var/log/syslog
# Show line numbers
grep -n "pattern" file.txt
# Recursive search in directories
grep -r "TODO" /path/to/project
# Extended regex with -E
grep -E "^(error|warning|critical)" /var/log/syslog
`
Advanced grep Examples:
`bash
# Find IP addresses in log files
grep -E "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" access.log
# Find email addresses
grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" contacts.txt
# Find lines NOT matching pattern
grep -v "^#" config.file  # Exclude comments
# Count matching lines
grep -c "error" log.file
`
sed - Stream Editor
Find and Replace:
`bash
# Basic substitution
sed 's/old/new/' file.txt
# Global replacement (all occurrences)
sed 's/old/new/g' file.txt
# Case-insensitive replacement
sed 's/old/new/gi' file.txt
# Replace only on specific lines
sed '1,5s/old/new/g' file.txt
# Use different delimiter
sed 's|/old/path|/new/path|g' file.txt
`
Advanced sed Examples:
`bash
# Remove empty lines
sed '/^$/d' file.txt
# Remove lines matching pattern
sed '/pattern/d' file.txt
# Extract specific groups
echo "Date: 2023-12-25" | sed 's/Date: \([0-9-]*\)/\1/'
# Multiple operations
sed -e 's/old1/new1/g' -e 's/old2/new2/g' file.txt
`
awk - Pattern Processing
Basic awk with Regex:
`bash
# Print lines matching pattern
awk '/pattern/ {print}' file.txt
# Print specific fields from matching lines
awk '/error/ {print $1, $3}' log.file
# Case-insensitive matching
awk 'tolower($0) ~ /error/ {print}' file.txt
# Field matching with regex
awk '$3 ~ /^[0-9]+$/ {print "Line " NR ": " $0}' data.txt
`
Common Regex Patterns and Use Cases
Data Validation Patterns
Credit Card Numbers:
`python
import re
def validate_credit_card(number): # Remove spaces and dashes clean_number = re.sub(r'[\s-]', '', number) patterns = { 'Visa': r'^4[0-9]{12}(?:[0-9]{3})?