What are regular expressions?
Regular expressions (regex or regexp) are a pattern-matching language used across virtually every programming language, text editor, and command-line tool. They describe search patterns using a concise, symbolic syntax — letting you find, extract, validate, and replace text that matches complex rules in a single expression.
The core regex syntax is mostly universal, but implementations differ in what's called a "dialect" or "flavour." PCRE (Perl Compatible Regular Expressions) is among the most feature-rich flavours and is used by PHP, many text editors, and tools like grep -P. JavaScript's ECMAScript regex engine (used by this tool) supports most common features, including lookahead, lookbehind (since ES2018), named groups, and Unicode property escapes. POSIX regex, used in Unix utilities like sed and awk, has a more limited feature set: it has no lookarounds or non-greedy quantifiers, and shorthand classes such as \d are not part of the standard.
Despite these differences, the fundamentals (character classes, quantifiers, anchors, and grouping) work in much the same way everywhere, even where the exact spelling varies between flavours. Learning regex once gives you a skill that transfers across languages and tools, from database queries to CI pipeline configurations.
Regex syntax quick reference
| Pattern | Meaning | Example |
|---|---|---|
| . | Any character (except newline by default) | a.c → "abc", "a1c" |
| \d | Digit [0-9] | \d{3} → "123", "456" |
| \w | Word character [a-zA-Z0-9_] | \w+ → "hello", "var_1" |
| \s | Whitespace (space, tab, newline) | a\sb → "a b", "a\tb" |
| [abc] | Any character in the set | [aeiou] → vowels |
| [a-z] | Character range | [A-Z] → uppercase letters |
| [^abc] | Any character NOT in the set | [^0-9] → non-digits |
| ^ | Start of string (or line with m flag) | ^Hello → starts with "Hello" |
| $ | End of string (or line with m flag) | end$ → ends with "end" |
| * | Zero or more (greedy) | ab*c → "ac", "abc", "abbc" |
| + | One or more (greedy) | ab+c → "abc", "abbc" |
| ? | Zero or one (optional) | colou?r → "color", "colour" |
| {n} | Exactly n times | \d{4} → "2026" |
| {n,m} | Between n and m times | \d{2,4} → "12", "123", "1234" |
| () | Capture group | (\d+)px → captures "12" from "12px" |
| (?:) | Non-capturing group | (?:ab)+ → groups without capturing |
| \| | Alternation (OR) | cat\|dog → "cat" or "dog" |
| \b | Word boundary | \bword\b → whole word match |
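A few rows of the table can be tried directly in JavaScript or TypeScript. The snippet below is a small sketch rather than a complete tour; the sample strings and variable names are made up for illustration.

```typescript
// Capture group: (\d+)px captures the digits in front of "px".
const pxMatch = /(\d+)px/.exec("width: 12px");
const capturedPx = pxMatch ? pxMatch[1] : null; // "12"

// Greedy quantifiers: b* allows zero or more "b", b+ requires at least one.
const samples = ["ac", "abc", "abbc"];
const starMatches = samples.filter((s) => /^ab*c$/.test(s)); // all three
const plusMatches = samples.filter((s) => /^ab+c$/.test(s)); // "abc", "abbc"

// Anchors: without the m flag, ^ matches only at the start of the string;
// with it, ^ also matches at the start of every line.
const greeting = "Hello world\nHello again";
const startOnly = greeting.match(/^Hello/g);  // one match
const everyLine = greeting.match(/^Hello/gm); // two matches

// Word boundary: \bword\b ignores "word" inside longer words.
const insideWord = /\bword\b/.test("a swordfish keyword"); // false
const standalone = /\bword\b/.test("one word only");       // true

console.log(capturedPx, starMatches, plusMatches, startOnly, everyLine);
console.log(insideWord, standalone);
```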
Common regex patterns
Email validation is one of the most frequently searched regex tasks, and one of the most misunderstood. A simple pattern like [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} handles most real-world emails (anchor it with ^ and $ when validating a whole input), but a fully RFC 5322-compliant regex is notoriously complex and can run to thousands of characters. In practice, validate the format loosely with regex and confirm the address exists by sending a verification email.
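As a rough illustration, the loose email check could be wrapped in a helper like the one below. The names LOOSE_EMAIL and looksLikeEmail are hypothetical, and the ^ and $ anchors are added here so the pattern has to match the whole input rather than a substring of it.

```typescript
// The loose pattern from the text, anchored with ^ and $ so that the
// entire input must look like an address (an addition for validation use).
const LOOSE_EMAIL = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;

// Returns true when the trimmed input has a plausible email shape.
// This says nothing about whether the mailbox actually exists.
function looksLikeEmail(input: string): boolean {
  return LOOSE_EMAIL.test(input.trim());
}

console.log(looksLikeEmail("jane.doe+news@example.co.uk")); // true
console.log(looksLikeEmail("not-an-email"));                // false
console.log(looksLikeEmail("user@example"));                // false (no dot in the domain)
```

A check like this only filters out obvious typos; the verification email remains the real test.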
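URL matching can be straightforward for http:// and https:// links, but gets complicated with internationalised domain names, ports, query strings, and fragments. For most use cases, https?://[\w.-]+(?:\.[a-z]{2,})(?:/[^\s]*)? works well enough when combined with the i flag (in a JavaScript regex literal, write the forward slashes as \/). It has known limits: it requires a dotted hostname, so it skips hosts like localhost; it cuts a URL off at an explicit port number; and it only picks up query strings and fragments that follow a path.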
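One way to use the URL pattern is to pull links out of free text. The sketch below assumes a hypothetical helper named extractUrls; the forward slashes are escaped because the pattern is written as a JavaScript regex literal, and the i flag is added so uppercase hostnames also match.

```typescript
// Collects every http:// or https:// link that the pattern recognises.
// Returns an empty array when the text contains no match.
function extractUrls(text: string): string[] {
  const pattern = /https?:\/\/[\w.-]+(?:\.[a-z]{2,})(?:\/[^\s]*)?/gi;
  return text.match(pattern) || [];
}

const links = extractUrls(
  "Docs: https://example.com/guide?x=1#top and also http://sub.example.org."
);
console.log(links);
// [ "https://example.com/guide?x=1#top", "http://sub.example.org" ]
```

Hosts without a dot, such as localhost, are not matched at all, which is one reason a dedicated URL parser is usually the better choice when strict correctness matters.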
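Extracting numbers from text is a common ETL task. Use \d+ for integers, or -?\d+(?:\.\d+)? to include negative numbers and decimals; the simpler -?\d+\.?\d* also works but picks up a trailing period, matching "42." at the end of a sentence. For comma-formatted numbers like "1,234,567", try \d{1,3}(?:,\d{3})*(?:\.\d+)?.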
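The number patterns might be applied as in the sketch below. The helper names extractNumbers and parseFormattedNumber are hypothetical, and the first helper uses the form -?\d+(?:\.\d+)?, which requires digits after the decimal point so that a sentence-ending period is left alone.

```typescript
// Finds integers and decimals, with an optional leading minus sign,
// and converts each match to a number.
function extractNumbers(text: string): number[] {
  const matches = text.match(/-?\d+(?:\.\d+)?/g) || [];
  return matches.map((m) => Number(m));
}

// Finds the first comma-formatted number and strips the commas before
// converting it. Returns null when the text contains no digits.
function parseFormattedNumber(text: string): number | null {
  const m = /\d{1,3}(?:,\d{3})*(?:\.\d+)?/.exec(text);
  return m ? Number(m[0].replace(/,/g, "")) : null;
}

console.log(extractNumbers("Temp fell from 3.5 to -12 in 2 hours.")); // [ 3.5, -12, 2 ]
console.log(parseFormattedNumber("Population: 1,234,567"));           // 1234567
console.log(parseFormattedNumber("Total due: 1,234.50"));             // 1234.5
```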
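Password strength validation typically uses lookaheads: ^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%]).{8,}$ requires at least one lowercase letter, one uppercase letter, one digit, one special character from the set !@#$%, and a minimum of 8 characters, all in a single pattern. Each (?=...) lookahead checks its condition from the start of the string without consuming any characters, so the conditions can be satisfied in any order.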
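The lookahead pattern gives a single yes-or-no answer. The sketch below shows it next to a hypothetical helper, unmetPasswordRules, that applies the same five conditions one at a time; that split is an illustrative design choice for producing specific feedback, not something the pattern itself requires.

```typescript
// The single-pattern check from the text.
const STRONG_PASSWORD = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%]).{8,}$/;

// The same conditions as separate rules, so the caller can report
// exactly which ones a password fails. Labels are illustrative.
function unmetPasswordRules(password: string): string[] {
  const rules: Array<[RegExp, string]> = [
    [/[a-z]/, "lowercase letter"],
    [/[A-Z]/, "uppercase letter"],
    [/\d/, "digit"],
    [/[!@#$%]/, "special character"],
    [/^.{8,}$/, "at least 8 characters"],
  ];
  return rules
    .filter((rule) => !rule[0].test(password))
    .map((rule) => rule[1]);
}

console.log(STRONG_PASSWORD.test("Passw0rd!")); // true
console.log(STRONG_PASSWORD.test("password"));  // false
console.log(unmetPasswordRules("password"));
// [ "uppercase letter", "digit", "special character" ]
```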
Finding duplicate words is a clever use of backreferences: \b(\w+)\s+\1\b matches repeated words like "the the" or "is is". The match is case-sensitive by default, so add the i flag to also catch "The the". This is a favourite in proofreading and text cleanup workflows.
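A proofreading pass built on this backreference could look like the sketch below. The helper names are hypothetical, and the i flag is an addition so that a capitalised repeat such as "The the" is also caught.

```typescript
// Lists each word that is immediately repeated in the text.
function findDuplicateWords(text: string): string[] {
  const pattern = /\b(\w+)\s+\1\b/gi;
  const found: string[] = [];
  let m: RegExpExecArray | null;
  // exec with the g flag resumes after the previous match on each call.
  while ((m = pattern.exec(text)) !== null) {
    found.push(m[1]);
  }
  return found;
}

// Replaces each repeated pair with a single copy of the first word.
function collapseDuplicateWords(text: string): string {
  return text.replace(/\b(\w+)\s+\1\b/gi, "$1");
}

const draft = "This is is the the final draft";
console.log(findDuplicateWords(draft));     // [ "is", "the" ]
console.log(collapseDuplicateWords(draft)); // "This is the final draft"
```

The leading \b is what keeps the "is" inside "This" from pairing with the "is" that follows it.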