How to Use Regular Expressions (Regex): Complete Developer Guide
Learn regular expressions from scratch. Basic syntax, character classes, quantifiers, groups, lookahead, lookbehind and common patterns for email, phone, URL and IP. With practical examples.
What are regular expressions and what are they for
Regular expressions (regex or regexp) are search patterns that allow you to find, validate, and manipulate text with extreme precision. They are a fundamental tool in programming, system administration, and data processing.
A regular expression is essentially a sequence of characters that defines a search pattern. With a single line of regex, you can accomplish what would otherwise require dozens of lines of conditional code.
Common uses for regular expressions:
- Data validation: Verifying that an email, phone number, URL, or postal code has the correct format
- Search and replace: Finding patterns in long texts and replacing them (as in text editors or IDEs)
- Data extraction: Pulling specific information from unstructured text (web scraping, logs)
- Log parsing: Analyzing server and application log files
- Linting and formatting: Verifying that code follows certain conventions
- Routing in web frameworks: Defining URL patterns in Express, Django, Rails, etc.
Regular expressions are available in virtually every programming language: JavaScript, Python, Java, C#, PHP, Ruby, Go, Rust, and many more. They are also used in command-line tools like grep, sed, and awk.
If you want to practice while reading this guide, open our regex testing tool in another tab.
Basic syntax: literal characters and metacharacters
Regex syntax is divided into two types of characters: literals (which are matched as-is) and metacharacters (which have special meaning).
Literal characters:
Letters, numbers, and most symbols are matched literally. The pattern cat finds the word "cat" in the text.
The fundamental metacharacters:
| Metacharacter | Meaning | Example | Matches |
|---|---|---|---|
| `.` | Any character (except newline) | `c.t` | "cat", "cot", "c3t" |
| `^` | Start of line/string | `^Hello` | "Hello world" (only at start) |
| `$` | End of line/string | `world$` | "Hello world" (only at end) |
| `*` | 0 or more repetitions | `ab*c` | "ac", "abc", "abbc", "abbbc" |
| `+` | 1 or more repetitions | `ab+c` | "abc", "abbc" (not "ac") |
| `?` | 0 or 1 repetition (optional) | `colou?r` | "color", "colour" |
| `\|` | Alternative (OR) | `cat\|dog` | "cat" or "dog" |
| `\` | Escape (literal next character) | `\.` | A literal period |
Escaping metacharacters:
If you need to search for a metacharacter as a literal character, you must escape it with \. For example:
- `\.` searches for a literal period (not "any character")
- `\*` searches for a literal asterisk
- `\?` searches for a literal question mark
- `\(` and `\)` search for literal parentheses
- `\\` searches for a literal backslash
Practical example: To search for the string "price: $9.99" you need: price: \$9\.99
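As a small sketch of the escaping rules above, the following Python snippet compares the hand-escaped pattern with one built by `re.escape()`, which escapes metacharacters in a literal string for you. The sample text and variable names are illustrative assumptions.

```python
import re

literal = "price: $9.99"

# Hand-escaped pattern from the practical example above.
manual = re.compile(r"price: \$9\.99")

# re.escape() escapes the metacharacters of a literal string automatically.
escaped = re.compile(re.escape(literal))

text = "Sale price: $9.99 today"  # illustrative sample text
print(manual.search(text).group())   # price: $9.99
print(escaped.search(text).group())  # price: $9.99

# Without escaping, "." matches any character, so "9x99" slips through.
unescaped = re.compile(r"9.99")
print(bool(unescaped.search("9x99")))  # True
```

Building the pattern with `re.escape()` is usually safer than escaping by hand when the literal comes from user input.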
Character classes and predefined classes
Character classes allow you to define a set of characters that are valid at a specific position in the pattern.
Custom classes with brackets [ ]:
| Pattern | Meaning | Match example |
|---|---|---|
| `[abc]` | Any of: a, b, or c | "a", "b", "c" |
| `[a-z]` | Any lowercase letter | "a", "m", "z" |
| `[A-Z]` | Any uppercase letter | "A", "M", "Z" |
| `[0-9]` | Any digit | "0", "5", "9" |
| `[a-zA-Z]` | Any letter | "a", "Z", "m" |
| `[a-zA-Z0-9]` | Any alphanumeric | "a", "3", "Z" |
| `[^abc]` | Any EXCEPT a, b, c | "d", "1", "Z" |
| `[^0-9]` | Any that is NOT a digit | "a", "!", " " |
Predefined classes (shorthand):
Predefined classes are shortcuts for common combinations:
| Shorthand | Equivalent | Meaning |
|---|---|---|
| `\d` | `[0-9]` | Any digit |
| `\D` | `[^0-9]` | Any non-digit |
| `\w` | `[a-zA-Z0-9_]` | Any word character (alphanumeric + underscore) |
| `\W` | `[^a-zA-Z0-9_]` | Any non-word character |
| `\s` | `[\t\n\r\f\v ]` | Any whitespace |
| `\S` | `[^\t\n\r\f\v ]` | Any non-whitespace |
| `\b` | (no direct equivalent) | Word boundary |
Word boundary (\b):
\b is especially useful for matching whole words. \bcat\b finds "cat" but NOT "caterpillar" or "scat". It is a position anchor that does not consume characters.
Practical example: To validate that a string contains only letters, numbers, and hyphens: ^[a-zA-Z0-9-]+$
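The word-boundary and character-class examples above can be checked with a short Python sketch. The sample strings and variable names are assumptions chosen for illustration.

```python
import re

whole_word = re.compile(r"\bcat\b")
slug = re.compile(r"^[a-zA-Z0-9-]+$")

# Only the standalone occurrences of "cat" are found.
print(whole_word.findall("cat caterpillar scat cat."))  # ['cat', 'cat']

print(bool(slug.match("my-post-42")))  # True
print(bool(slug.match("my post")))     # False (space is not in the class)

# In Python, "$" also matches just before a trailing newline.
# fullmatch() is stricter and rejects the newline.
print(bool(slug.match("my-post-42\n")))      # True
print(bool(slug.fullmatch("my-post-42\n")))  # False
```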
Quantifiers and repetition modifiers
Quantifiers specify how many times the preceding element must appear.
Basic quantifiers:
| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| `*` | 0 or more times | `\d*` | "", "5", "123", "99999" |
| `+` | 1 or more times | `\d+` | "5", "123", "99999" (not "") |
| `?` | 0 or 1 time | `-?\d+` | "42", "-42" |
| `{n}` | Exactly n times | `\d{4}` | "2026", "1234" (exactly 4 digits) |
| `{n,}` | n or more times | `\d{2,}` | "12", "123", "1234" (2+ digits) |
| `{n,m}` | Between n and m times | `\d{2,4}` | "12", "123", "1234" (2 to 4 digits) |
Greedy vs lazy quantifiers:
By default, quantifiers are greedy: they try to match as much text as possible. By adding ? after the quantifier, they become lazy: they match as little as possible.
Example of the difference:
Text: <b>Hello</b> and <b>World</b>
- `<b>.*</b>` (greedy): matches `<b>Hello</b> and <b>World</b>` (everything)
- `<b>.*?</b>` (lazy): matches `<b>Hello</b>` and `<b>World</b>` (separately)
The difference is crucial when working with HTML, XML, or any text with repeated delimiters. Lazy mode is almost always what you want when searching for pairs of tags or delimiters.
Practical example: Validating a US ZIP code with optional dash and 4-digit extension: ^\d{5}(-\d{4})?$
This matches "12345" and "12345-6789" but not "1234" or "123456".
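To see greedy and lazy matching side by side, and to try the ZIP code pattern, here is a minimal Python sketch. It uses the same sample text as above; the variable names are illustrative.

```python
import re

text = "<b>Hello</b> and <b>World</b>"

greedy = re.findall(r"<b>.*</b>", text)
lazy = re.findall(r"<b>.*?</b>", text)
print(greedy)  # ['<b>Hello</b> and <b>World</b>']
print(lazy)    # ['<b>Hello</b>', '<b>World</b>']

# ZIP code: 5 digits, optionally followed by a dash and 4 more digits.
zip_code = re.compile(r"^\d{5}(-\d{4})?$")
results = {c: bool(zip_code.match(c)) for c in ["12345", "12345-6789", "1234", "123456"]}
print(results)
# {'12345': True, '12345-6789': True, '1234': False, '123456': False}
```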
Capture groups and references
Groups allow you to group parts of a pattern, capture matching text for later use, and apply quantifiers to complete subexpressions.
Types of groups:
| Syntax | Type | Description |
|---|---|---|
| `(pattern)` | Capture group | Groups and captures the matched text |
| `(?:pattern)` | Non-capture group | Groups but does NOT capture (more efficient) |
| `(?<name>pattern)` | Named group | Captures with an identifiable name |
| `\1`, `\2` | Backreference | Reference to text captured by group 1, 2, etc. |
Basic capture group:
To extract the year, month, and day from a date in YYYY-MM-DD format:
(\d{4})-(\d{2})-(\d{2})
- Group 1: Year (e.g., "2026")
- Group 2: Month (e.g., "03")
- Group 3: Day (e.g., "16")
In JavaScript: "2026-03-16".match(/(\d{4})-(\d{2})-(\d{2})/) returns an array where [1] is "2026", [2] is "03", and [3] is "16".
Named groups:
(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})
In JavaScript: match.groups.year, match.groups.month, match.groups.day
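The same extraction can be sketched in Python. Note one flavor difference: Python's built-in `re` module writes named groups as `(?P<name>...)` rather than the `(?<name>...)` form shown above. The variable names here are illustrative.

```python
import re

date_text = "2026-03-16"

# Numbered capture groups.
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", date_text)
print(m.group(1), m.group(2), m.group(3))  # 2026 03 16

# Named capture groups, using Python's (?P<name>...) syntax.
named = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", date_text)
print(named.group("year"))  # 2026
print(named.groupdict())    # {'year': '2026', 'month': '03', 'day': '16'}
```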
Backreferences:
Allow you to refer to the exact text captured by a previous group:
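- `\b(\w+)\s+\1\b` finds duplicate words ("the the", "is is"). The word boundaries prevent partial matches such as the "is is" hidden inside "this is"
- `(['"])(.*?)\1` finds text between quotes, ensuring opening and closing quotes match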
Alternation within groups:
(https?|ftp):// matches "http://", "https://", or "ftp://"
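A brief Python sketch of backreferences and alternation follows. As an assumption beyond the basic form, the duplicate-word pattern is wrapped in `\b` so that it does not match across word fragments (for example inside "this is"). Sample strings are made up for illustration.

```python
import re

# Duplicate words: \1 must repeat exactly what group 1 captured.
duplicate_word = re.compile(r"\b(\w+)\s+\1\b")
print(duplicate_word.findall("it is is the the end"))  # ['is', 'the']
print(duplicate_word.findall("this is fine"))          # []

# Quoted text: \1 forces the closing quote to equal the opening one.
quoted = re.compile(r"""(['"])(.*?)\1""")
q = quoted.search("""say "it's fine" now""")
print(q.group(2))  # it's fine

# Alternation inside a group.
scheme = re.compile(r"(https?|ftp)://")
urls = ["http://a", "https://a", "ftp://a", "ssh://a"]
print([u for u in urls if scheme.match(u)])  # ['http://a', 'https://a', 'ftp://a']
```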
Try these patterns in real-time with our regex tool.
Lookahead and lookbehind: position assertions
Lookahead and lookbehind are assertions that check whether a pattern exists before or after the current position, without consuming characters. They are extremely powerful for complex validations.
The 4 types of assertions:
| Syntax | Name | Meaning |
|---|---|---|
| `(?=pattern)` | Positive lookahead | What follows MUST match the pattern |
| `(?!pattern)` | Negative lookahead | What follows must NOT match the pattern |
| `(?<=pattern)` | Positive lookbehind | What precedes MUST match the pattern |
| `(?<!pattern)` | Negative lookbehind | What precedes must NOT match the pattern |
Example 1: Validate a strong password
A password requiring at least one uppercase, one lowercase, one digit, and 8+ characters:
^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$
- `(?=.*[A-Z])`: Positive lookahead - must contain at least one uppercase letter
- `(?=.*[a-z])`: Must contain at least one lowercase letter
- `(?=.*\d)`: Must contain at least one digit
- `.{8,}`: Must be 8 or more characters
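The password pattern can be exercised with a few candidates, as in this Python sketch. The sample passwords are invented for illustration; each lookahead scans from the start of the string independently, so the order of the required characters does not matter.

```python
import re

strong = re.compile(r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$")

candidates = ["Passw0rdX", "password1", "SHORT1a", "NoDigitsHere"]
checks = {pw: bool(strong.match(pw)) for pw in candidates}
print(checks)
# {'Passw0rdX': True, 'password1': False, 'SHORT1a': False, 'NoDigitsHere': False}
```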
Example 2: Find prices without the currency symbol
(?<=\$)\d+\.\d{2}
In the text "The price is $29.99 and shipping is $5.00", it matches "29.99" and "5.00" but not the "$".
Example 3: Find words NOT followed by a certain pattern
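\b\w+\b(?!\s*:)
Finds whole words NOT followed by a colon. The word boundaries matter: without them, \w+ can backtrack and match part of a word (for example "ke" in "key: value"). Useful for distinguishing between keys and values in "key: value" text.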
Example 4: Numbers NOT preceded by a minus sign
(?<!-)\b\d+\b
Finds positive numbers while ignoring negative ones. In "5 -3 8 -12", it finds "5" and "8" but not "3" or "12".
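Examples 2 to 4 can be tried in Python with the sketch below. For Example 3 it shows both a plain `\w+(?!\s*:)` and a variant anchored with `\b`, since without word boundaries the engine can backtrack and return a word fragment. The sample strings are illustrative.

```python
import re

# Example 2: positive lookbehind keeps "$" out of the match.
prices = re.findall(r"(?<=\$)\d+\.\d{2}", "The price is $29.99 and shipping is $5.00")
print(prices)  # ['29.99', '5.00']

# Example 4: negative lookbehind skips numbers preceded by "-".
positives = re.findall(r"(?<!-)\b\d+\b", "5 -3 8 -12")
print(positives)  # ['5', '8']

# Example 3: negative lookahead, with and without word boundaries.
unanchored = re.findall(r"\w+(?!\s*:)", "name: Ada")
anchored = re.findall(r"\b\w+\b(?!\s*:)", "name: Ada")
print(unanchored)  # ['nam', 'Ada']  (backtracking yields a fragment of "name")
print(anchored)    # ['Ada']
```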
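Compatibility note: Lookbehind is not supported in all regex flavors. JavaScript has supported it since ES2018, and .NET (C#) accepts lookbehind patterns of any length. Other flavors are more restrictive: Python's built-in re module requires fixed-width lookbehind, and Java requires the lookbehind to have a bounded length.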
Common patterns: email, phone, URL and IP
Here are tested regular expressions for the most common validation patterns. Remember that no regex is 100% perfect for complex formats like email; for production, complement with server-side validation.
1. Email (practical validation):
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
- Local part: letters, numbers, periods, hyphens, underscores, %, +
- @: mandatory separator
- Domain: letters, numbers, periods, hyphens
- TLD: at least 2 letters
- Valid: user@example.com, first.last@company.co.uk
2. International phone (E.164 format):
^\+?[1-9]\d{1,14}$
- Optional + at the start
- First digit: 1-9 (cannot start with 0)
- Up to 15 digits total
- Valid: +12025551234, 447911123456
3. URL (HTTP/HTTPS):
^https?:\/\/[\w.-]+(?:\.[a-zA-Z]{2,})(?:\/[\w.~:/?#\[\]@!$&'()*+,;=-]*)?$
- Protocol: http:// or https://
- Domain: alphanumeric with periods and hyphens
- TLD: at least 2 letters
- Path: any valid URL character (optional)
4. IPv4 address:
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$
- 4 octets separated by periods
- Each octet: 0-255
- Valid: 192.168.1.1, 10.0.0.1, 255.255.255.0
- Rejects: 256.1.1.1, 192.168.1.999
5. ISO date format (YYYY-MM-DD):
^\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])$
- Year: 4 digits
- Month: 01-12
- Day: 01-31
- Does not validate impossible days (like 02-30); for that you need additional logic
6. CSS color hex:
^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$
- Formats: #RGB, #RRGGBB, #RRGGBBAA
- Valid: #fff, #FF5733, #FF573380
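As a rough sketch of how these patterns might be wired into validators, the Python snippet below collects them in a dictionary and adds the "additional logic" for impossible dates using `datetime.date`. The names `PATTERNS`, `is_valid`, and `is_real_date` are hypothetical, the date pattern uses capturing groups (instead of the non-capturing ones above) so the parts can be extracted, and `re.ASCII` is an added assumption that keeps `\d` and `\w` limited to ASCII.

```python
import re
from datetime import date

# Patterns copied from the list above; re.ASCII keeps \d equal to [0-9].
PATTERNS = {
    "email": re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$", re.ASCII),
    "phone": re.compile(r"^\+?[1-9]\d{1,14}$", re.ASCII),
    "url": re.compile(
        r"^https?:\/\/[\w.-]+(?:\.[a-zA-Z]{2,})(?:\/[\w.~:/?#\[\]@!$&'()*+,;=-]*)?$",
        re.ASCII,
    ),
    "ipv4": re.compile(
        r"^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$",
        re.ASCII,
    ),
    "iso_date": re.compile(r"^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$", re.ASCII),
    "hex_color": re.compile(r"^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$"),
}


def is_valid(kind: str, value: str) -> bool:
    """Format check only. fullmatch() also rejects a trailing newline."""
    return PATTERNS[kind].fullmatch(value) is not None


def is_real_date(value: str) -> bool:
    """Format check plus a calendar check for impossible days such as 02-30."""
    m = PATTERNS["iso_date"].fullmatch(value)
    if m is None:
        return False
    year, month, day = map(int, m.groups())
    try:
        date(year, month, day)  # raises ValueError for dates that do not exist
    except ValueError:
        return False
    return True


print(is_valid("email", "first.last@company.co.uk"))  # True
print(is_valid("ipv4", "256.1.1.1"))                  # False
print(is_valid("iso_date", "2026-02-30"))             # True (format only)
print(is_real_date("2026-02-30"))                     # False
```

Keeping the regex for the format and a separate function for the semantics keeps each pattern short enough to read.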
Test and refine all these patterns in our regex testing tool. To validate JSON data containing these patterns, use our JSON validator.
Flags, performance and best practices
To master regex, you need to know about flags (modifiers) and follow best practices that prevent performance and maintainability issues.
Common flags:
| Flag | Name | Effect |
|---|---|---|
| `g` | Global | Find ALL matches, not just the first |
| `i` | Case insensitive | Does not distinguish uppercase from lowercase |
| `m` | Multiline | `^` and `$` match start/end of each LINE, not just the string |
| `s` | Dotall | The dot (`.`) also matches newlines |
| `u` | Unicode | Full Unicode character support |
In JavaScript: /pattern/flags - example: /hello world/gi
In Python: re.compile(r'pattern', re.IGNORECASE | re.MULTILINE)
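The effect of each flag is easiest to see on a two-line string. The Python sketch below uses `re.IGNORECASE`, `re.MULTILINE`, and `re.DOTALL`; Python has no `g` flag because `re.findall()` and `re.finditer()` already return every match. The sample text is illustrative.

```python
import re

text = "Hello world\nhello again"

no_flags = re.findall(r"hello", text)
ignore_case = re.findall(r"hello", text, re.IGNORECASE)
start_only = re.findall(r"^hello", text, re.IGNORECASE)
each_line = re.findall(r"^hello", text, re.IGNORECASE | re.MULTILINE)

print(no_flags)     # ['hello']
print(ignore_case)  # ['Hello', 'hello']
print(start_only)   # ['Hello']
print(each_line)    # ['Hello', 'hello']

# By default "." stops at a newline; re.DOTALL lets it cross lines.
print(re.search(r"world.hello", text))                     # None
print(re.search(r"world.hello", text, re.DOTALL).group())  # prints world, newline, hello
```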
Performance best practices:
- Avoid catastrophic backtracking: Patterns like `(a+)+$` can cause the regex engine to try an exponential number of combinations. This is known as ReDoS (Regular Expression Denial of Service). Use possessive quantifiers (`++`) or atomic groups when available
- Be specific: `[a-z]+` is better than `.+` when you only expect letters. The more specific your pattern, the faster it runs
- Use anchors: `^` and `$` tell the engine where to start and stop, avoiding unnecessary searching
- Prefer non-capturing groups: `(?:...)` is more efficient than `(...)` when you do not need to capture
- Compile the regex if used repeatedly: In Python use `re.compile()`, in Java use `Pattern.compile()`
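One way to observe the backtracking problem is to time a failing search at a few small input sizes. The Python sketch below compares the nested pattern with a flat pattern that accepts the same strings. The helper `time_failed_search` and the chosen sizes are assumptions for illustration, and the sizes are kept small on purpose because the nested pattern's cost grows very quickly. Timings vary by machine, so no expected output is shown.

```python
import re
import time

# Both patterns are compiled once and reused.
risky = re.compile(r"(a+)+$")  # nested quantifiers: many ways to split the a's
flat = re.compile(r"a+$")      # accepts the same strings with one quantifier


def time_failed_search(pattern: re.Pattern, n: int) -> float:
    """Seconds spent failing to match n 'a' characters followed by 'b'."""
    subject = "a" * n + "b"
    start = time.perf_counter()
    result = pattern.search(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing 'b' prevents any match
    return elapsed


for n in (10, 14, 18):
    print(n, f"risky={time_failed_search(risky, n):.4f}s",
          f"flat={time_failed_search(flat, n):.6f}s")
```

Rewriting the pattern so that each character can be matched in only one way is often the simplest fix when possessive quantifiers or atomic groups are not available.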
Maintainability best practices:
- Comment your regex: In Python you can use the `re.VERBOSE` flag to add readable comments and whitespace
- Split complex patterns: Instead of one monolithic regex, break it into parts and combine them programmatically
- Write tests: Always write test cases for both positive and negative matches
- Do not use regex for everything: For parsing HTML or XML, use a dedicated parser. For JSON, use `JSON.parse()`. Regex is not suitable for languages with nesting
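The first three practices can be sketched together in Python: a commented pattern using `re.VERBOSE`, a pattern assembled from a reusable part, and a few positive and negative test cases. The names `ISO_DATE`, `OCTET`, and `IPV4` are illustrative, and the patterns are the date and IPv4 expressions from this guide (with Python's `(?P<name>...)` named-group syntax).

```python
import re

# Comment your regex: re.VERBOSE ignores whitespace and allows # comments.
ISO_DATE = re.compile(
    r"""
    ^(?P<year>\d{4})                 # year: 4 digits
    -(?P<month>0[1-9]|1[0-2])        # month: 01-12
    -(?P<day>0[1-9]|[12]\d|3[01])$   # day: 01-31
    """,
    re.VERBOSE,
)

# Split complex patterns: define one octet, then combine programmatically.
OCTET = r"(?:25[0-5]|2[0-4]\d|[01]?\d\d?)"
IPV4 = re.compile(rf"^{OCTET}(?:\.{OCTET}){{3}}$")

# Write tests: cover both positive and negative cases.
assert ISO_DATE.match("2026-03-16").group("month") == "03"
assert ISO_DATE.match("2026-13-01") is None
assert IPV4.match("10.0.0.1") is not None
assert IPV4.match("256.1.1.1") is None
print("all pattern checks passed")
```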
Practice and refine your patterns with our regex testing tool, which shows matches in real-time and explains each part of the pattern.
Frequently asked questions
What is the difference between * + and ? in regex?
The asterisk (*) means 0 or more repetitions of the preceding element: 'ab*c' matches 'ac', 'abc', 'abbc'. The plus sign (+) means 1 or more repetitions: 'ab+c' matches 'abc', 'abbc' but NOT 'ac'. The question mark (?) means 0 or 1 repetition (optional): 'colou?r' matches 'color' and 'colour'.
How do I validate an email with a regular expression?
A practical pattern is: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$. This validates the basic structure: local part with allowed characters, at sign, domain, and TLD of at least 2 letters. However, the full email specification (RFC 5322) is extremely complex. For production, complement regex validation with server-side verification.
What do \d \w and \s mean in regex?
They are predefined character classes (shorthand). \d equals [0-9] (any digit). \w equals [a-zA-Z0-9_] (any alphanumeric character plus underscore). \s matches any whitespace (space, tab, newline). Their uppercase versions (\D, \W, \S) are the negation: any character that is NOT a digit, word character, or whitespace, respectively.
What is a lookahead and what is it for?
A lookahead is an assertion that checks whether a pattern exists after the current position without consuming characters. Positive lookahead (?=pattern) verifies the pattern DOES exist. Negative (?!pattern) verifies it does NOT exist. They are useful for complex validations like passwords: ^(?=.*[A-Z])(?=.*\d).{8,}$ verifies at least one uppercase and one digit regardless of position.
What is catastrophic backtracking and how do I avoid it?
Catastrophic backtracking occurs when an ambiguous pattern causes the regex engine to try an exponential number of combinations. Example: (a+)+$ with the input 'aaaaaaaaaaab'. The engine tries every way to split the 'a's between the two quantifiers before failing. To avoid it: be specific in your patterns, avoid nested quantifiers like (a+)+, and use possessive quantifiers (a++) or atomic groups when available.
Does regex work the same in all programming languages?
Not exactly. While the basic syntax is similar, each language has its own regex 'flavor'. JavaScript supports lookbehind since ES2018, while Python's built-in re module only accepts fixed-width lookbehind; the more advanced third-party regex module lifts that limit. Python also writes named groups as (?P<name>...). In Java, backslashes must be doubled inside string literals (\\d instead of \d). PCRE (used by PHP) and Perl support recursion and conditionals. The most common differences are in lookahead/lookbehind, Unicode support, and advanced features like recursion.