How to Use Regular Expressions (Regex): A Complete Guide for Developers
Learn regular expressions from scratch: basic syntax, character classes, quantifiers, groups, lookahead, lookbehind, and common patterns for email addresses, phone numbers, URLs, and IP addresses, with practical examples.
What regular expressions are and what they are used for
Regular expressions (regex or regexp) are search patterns that let you find, validate, and process text precisely. They are a core tool in programming, system administration, and data processing.
At its core, a regular expression is a string of characters that describes a search pattern. A single line of regex can do work that would otherwise take dozens of lines of ordinary conditional code.
Common use cases:
- Data validation: Check whether an email address, phone number, URL, or postal code has the right format
- Find and replace: Locate patterns in long text and replace them, as in a text editor or IDE
- Data extraction: Pull specific information out of unstructured text such as logs or web content
- Log analysis: Process server and application log files
- Linting and formatting: Check whether code follows conventions
- Routing in web frameworks: Build URL patterns in Express, Django, Rails, and many other frameworks
Regex is available in almost every programming language: JavaScript, Python, Java, C#, PHP, Ruby, Go, Rust, and many others. It is also used in command-line tools such as grep, sed, and awk.
If you want to practice while you read, open our regex tester in another tab.
Basic syntax: literal characters and metacharacters
Regex syntax falls into two groups: literal characters (matched exactly as written) and metacharacters (which have a special meaning).
Literal characters:
Letters, digits, and most symbols match themselves. The pattern `cat` finds the text "cat".
The most important metacharacters:
| Metacharacter | Meaning | Example | Matches |
|---|---|---|---|
| `.` | Any character (except a newline) | `c.t` | "cat", "cot", "c3t" |
| `^` | Start of line/string | `^Hello` | "Hello world" (only at the start) |
| `$` | End of line/string | `world$` | "Hello world" (only at the end) |
| `*` | 0 or more repetitions | `ab*c` | "ac", "abc", "abbc", "abbbc" |
| `+` | 1 or more repetitions | `ab+c` | "abc", "abbc" (not "ac") |
| `?` | 0 or 1 occurrence (optional) | `colou?r` | "color", "colour" |
| `\|` | Alternation (OR) | `cat\|dog` | "cat" or "dog" |
| `\` | Escape (makes the next character literal) | `\.` | A literal dot |
Escaping metacharacters:
To match a metacharacter as an ordinary character, put `\` in front of it. For example:
- `\.` matches a real dot, not "any character"
- `\*` matches an asterisk
- `\?` matches a question mark
- `\(` and `\)` match parentheses
- `\\` matches a backslash
Practical example: To find the string "price: $9.99", use `price: \$9\.99`.
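The escaping rules above can be tried out with a minimal Python sketch. The sample text and variable names are illustrative assumptions; `re.escape` is the standard-library helper that escapes a literal string for you.

```python
import re

text = "Sale price: $9.99 (was $19.99)"

# Hand-escaped pattern from the text: \$ and \. match a literal "$" and ".".
manual = re.compile(r"price: \$9\.99")

# re.escape() builds the escaped form of an arbitrary literal string.
generated = re.compile(re.escape("price: $9.99"))

manual_hit = manual.search(text)
generated_hit = generated.search(text)

# Without escaping, "." matches any character, so "9x99" is accepted too.
unescaped_dot = re.search(r"9.99", "9x99")
escaped_dot = re.search(r"9\.99", "9x99")
```

Generating the escaped form is usually safer than escaping by hand when the literal comes from user input.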
Character classes and predefined classes
A character class defines the set of characters that are allowed at one position in the pattern.
Custom classes with square brackets [ ]:
| Pattern | Meaning | Example matches |
|---|---|---|
| `[abc]` | One of the three characters a, b, or c | "a", "b", "c" |
| `[a-z]` | Any lowercase letter | "a", "m", "z" |
| `[A-Z]` | Any uppercase letter | "A", "M", "Z" |
| `[0-9]` | Any digit | "0", "5", "9" |
| `[a-zA-Z]` | Any letter | "a", "Z", "m" |
| `[a-zA-Z0-9]` | Any letter or digit | "a", "3", "Z" |
| `[^abc]` | Any character other than a, b, c | "d", "1", "Z" |
| `[^0-9]` | Any character that is not a digit | "a", "!", " " |
Built-in shorthand classes:
These are short forms for common combinations:
| Shorthand | Equivalent (ASCII) | Meaning |
|---|---|---|
| `\d` | `[0-9]` | Any digit |
| `\D` | `[^0-9]` | Any character that is not a digit |
| `\w` | `[a-zA-Z0-9_]` | Word character (letter, digit, or underscore) |
| `\W` | `[^a-zA-Z0-9_]` | Any character that is not a word character |
| `\s` | `[\t\n\r\f\v ]` | Any whitespace |
| `\S` | `[^\t\n\r\f\v ]` | Any character that is not whitespace |
| `\b` | (no direct equivalent) | Word boundary |
Word boundaries (\b):
`\b` is especially useful when you need to match a whole word. `\bcat\b` finds "cat" but not the "cat" inside "caterpillar" or "scat". It is a position anchor and consumes no characters.
Practical example: To check that a string contains only letters, digits, and hyphens, you can use `^[a-zA-Z0-9-]+$`.
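As a rough illustration of character classes, shorthand classes, and word boundaries, here is a small Python sketch. The sample strings and variable names are made up for the example.

```python
import re

# \bcat\b matches "cat" only as a whole word.
whole_word = re.compile(r"\bcat\b")
words = ["cat", "caterpillar", "scat", "a cat sat"]
matches_whole_word = [w for w in words if whole_word.search(w)]

# Letters, digits, and hyphens only, anchored at both ends.
slug = re.compile(r"^[a-zA-Z0-9-]+$")
slug_results = {s: bool(slug.match(s)) for s in ["my-post-42", "my post", ""]}

# Shorthand class \d: pull every run of digits out of a sentence.
digits = re.findall(r"\d+", "Order 66 shipped in 3 days")
```

Note that the empty string fails the slug check because `+` requires at least one character.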
Quantifiers and controlling repetition
A quantifier specifies how many times the element before it must appear.
Basic quantifiers:
| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| `*` | 0 or more times | `\d*` | "", "5", "123", "99999" |
| `+` | 1 or more times | `\d+` | "5", "123", "99999" (not the empty string) |
| `?` | 0 or 1 time | `-?\d+` | "42", "-42" |
| `{n}` | Exactly n times | `\d{4}` | "2026", "1234" |
| `{n,}` | n or more times | `\d{2,}` | "12", "123", "1234" |
| `{n,m}` | Between n and m times | `\d{2,4}` | "12", "123", "1234" |
Greedy and lazy:
By default, quantifiers are greedy: they match as much text as they can. Adding `?` after a quantifier makes it lazy, so it matches as little as it needs to.
- `<.*>` can match from the first `<` to the last `>` in the string
- `<.*?>` is the usual way to match each HTML tag separately
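The difference between greedy and lazy matching is easiest to see side by side. This is a minimal sketch; the HTML snippet is an invented example.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* expands as far as it can, so one match spans the whole string.
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the first ">" that lets the match succeed.
lazy = re.findall(r"<.*?>", html)
```

Here `greedy` holds a single long match, while `lazy` holds the four individual tags.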
Practical examples:
To check for a phone number with 10 or 11 digits: ^\d{10,11}$
To check for a username between 3 and 16 characters long: ^[a-zA-Z0-9_]{3,16}$
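These two patterns could be wrapped in small validation helpers along the following lines. The function names `is_phone` and `is_username` are hypothetical, chosen only for this sketch.

```python
import re

PHONE = re.compile(r"^\d{10,11}$")
USERNAME = re.compile(r"^[a-zA-Z0-9_]{3,16}$")

def is_phone(value: str) -> bool:
    """True when the value is exactly 10 or 11 digits."""
    return PHONE.match(value) is not None

def is_username(value: str) -> bool:
    """True for 3 to 16 letters, digits, or underscores."""
    return USERNAME.match(value) is not None

phone_checks = [is_phone("0901234567"), is_phone("090123456"), is_phone("090123456789")]
name_checks = [is_username("dev_42"), is_username("ab"), is_username("has space")]
```

The `{n,m}` bounds do the length checking, so no separate `len()` test is needed.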
Groups, captures, and backreferences
Groups let you treat several parts of a regex as one logical unit. With them you can capture a specific part of the match, collect alternatives into one block, or apply a quantifier to several characters at once.
Group types:
| Syntax | Type | Purpose |
|---|---|---|
| `(...)` | Capturing group | Stores the matched text |
| `(?:...)` | Non-capturing group | Groups logically without storing the result |
| `(?<name>...)` | Named group | Gives the group a name for clearer access |
Capturing group example:
To split the year, month, and day out of the YYYY-MM-DD format:
(\d{4})-(\d{2})-(\d{2})
- Group 1: Year (for example "2026")
- Group 2: Month (for example "03")
- Group 3: Day (for example "16")
In JavaScript: "2026-03-16".match(/(\d{4})-(\d{2})-(\d{2})/) returns an array in which [1] is "2026", [2] is "03", and [3] is "16".
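For comparison, the same capture groups can be read in Python roughly as follows. The surrounding sentence in the sample string is an assumption added for illustration.

```python
import re

DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

m = DATE.search("Released on 2026-03-16.")
year, month, day = m.groups()   # capture groups 1 to 3 as a tuple
whole = m.group(0)              # group 0 is the entire match
```

Group numbering follows the order of the opening parentheses, just as in the JavaScript array.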
Named groups:
(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})
In JavaScript: match.groups.year, match.groups.month, match.groups.day
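Named groups look slightly different in Python, whose `re` module spells them `(?P<name>...)`. A minimal sketch, with illustrative variable names:

```python
import re

# Python spells named groups (?P<name>...); JavaScript uses (?<name>...).
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

m = DATE.search("2026-03-16")
parts = m.groupdict()     # all named groups as a dictionary
year = m.group("year")    # a single group by name

# Named groups can be referenced in replacements with \g<name>.
day_first = DATE.sub(r"\g<day>/\g<month>/\g<year>", "2026-03-16")
```

Names make the replacement string readable without counting parentheses.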
Backreferences:
A backreference refers back to the exact text that an earlier group captured:
- `(\w+)\s+\1` finds words repeated back to back ("the the", "is is")
- `(['"])(.*?)\1` finds text inside quotes and makes sure the opening and closing quotes are the same kind
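Both backreference patterns can be exercised with a short Python sketch. One assumption to flag: the repeated-word pattern below adds `\b` on each side, a small variation that stops it from matching the "is is" that starts inside "This is".

```python
import re

# \1 must repeat exactly what group 1 captured; \b keeps it to whole words.
REPEATED = re.compile(r"\b(\w+)\s+\1\b")
doubled = REPEATED.findall("This is is a test of the the parser")

# Group 1 remembers which quote opened the string; \1 requires the same one to close it.
QUOTED = re.compile(r"""(['"])(.*?)\1""")
quoted = [m.group(2) for m in QUOTED.finditer("""say "it's fine" or 'ok'""")]
```

The apostrophe in "it's" does not end the first quoted string because the opening quote was a double quote.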
Alternation inside a group:
`(https?|ftp)://` matches "http://", "https://", or "ftp://".
You can try these patterns right away in our regex tool.
Lookahead and lookbehind: positional assertions
Lookahead and lookbehind are assertions that check whether a pattern appears after or before the current position, without consuming any characters. They are especially useful for complex validation.
The four common assertions:
| Syntax | Name | Meaning |
|---|---|---|
| `(?=pattern)` | Positive lookahead | What follows must match the pattern |
| `(?!pattern)` | Negative lookahead | What follows must not match the pattern |
| `(?<=pattern)` | Positive lookbehind | What precedes must match the pattern |
| `(?<!pattern)` | Negative lookbehind | What precedes must not match the pattern |
Example 1: Checking password strength
The password must contain at least one uppercase letter, one lowercase letter, and one digit, and be at least 8 characters long:
^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$
- `(?=.*[A-Z])`: must contain at least one uppercase letter
- `(?=.*[a-z])`: must contain at least one lowercase letter
- `(?=.*\d)`: must contain at least one digit
- `.{8,}`: minimum length of 8 characters
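A quick way to see the three lookaheads working together is to run the pattern over a few candidates. The helper name `is_strong` and the sample passwords are assumptions for this sketch.

```python
import re

STRONG = re.compile(r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$")

def is_strong(password: str) -> bool:
    """True when every lookahead and the length rule are satisfied."""
    return STRONG.match(password) is not None

candidates = ["Passw0rdX", "password1", "PASSWORD1", "Ab1", "NoDigitsHere"]
checks = {pw: is_strong(pw) for pw in candidates}
```

Each lookahead starts from the same position (the start of the string), which is why the order of the three conditions does not matter.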
Example 2: Matching a price without the currency symbol
(?<=\$)\d+\.\d{2}
In the string "The price is $29.99 and shipping is $5.00", this pattern matches "29.99" and "5.00" but leaves out the "$".
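In Python, the lookbehind from this example might be used like this. The `total` calculation is an extra step added only to show why excluding the "$" is convenient.

```python
import re

text = "The price is $29.99 and shipping is $5.00"

# The lookbehind requires a "$" just before the digits but keeps it out of the match.
amounts = re.findall(r"(?<=\$)\d+\.\d{2}", text)

# Because the symbol is excluded, the matches convert straight to numbers.
total = sum(float(a) for a in amounts)
```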
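Example 3: Matching words NOT followed by a given pattern
\b\w+\b(?!\s*:)
This pattern finds whole words that are not followed by a colon. It is useful when separating keys from values in strings such as "key: value". The `\b` after `\w+` matters: without it, the engine can give back one character and return a partial word such as "ke" from "key:".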
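The effect of the negative lookahead, and of anchoring it with word boundaries, can be checked with a short sketch. The config-style line is an invented sample; the comparison uses the unanchored form `\w+(?!\s*:)` against the boundary-anchored form `\b\w+\b(?!\s*:)`.

```python
import re

line = "host: localhost port: 8080"

# Unanchored: \w+ can give back a character, so fragments of the keys leak through.
naive = re.findall(r"\w+(?!\s*:)", line)

# With word boundaries: only complete words not followed by ":" match.
values = re.findall(r"\b\w+\b(?!\s*:)", line)
```

The unanchored form returns "hos" and "por" because backtracking moves the lookahead to a position where no colon follows.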
Example 4: Numbers NOT preceded by a minus sign
(?<!-)\b\d+\b
This finds positive numbers and skips negative ones. In the string "5 -3 8 -12", the result is "5" and "8".
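A small Python check of this example, including what happens if the word boundaries are dropped (the second pattern is a deliberately weakened variant, shown only for contrast):

```python
import re

text = "5 -3 8 -12"

# Negative lookbehind plus word boundaries: only numbers with no "-" in front.
positives = re.findall(r"(?<!-)\b\d+\b", text)

# Without \b the engine can start inside "-12" and return the trailing "2".
without_boundary = re.findall(r"(?<!-)\d+", text)
```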
A note on compatibility: Lookbehind is not supported equally by every regex engine. JavaScript has supported it since ES2018. Python, Java, and .NET (C#) also support it, but some engines, including Python's re module, accept only fixed-length lookbehind.
Common patterns: email, phone number, URL, and IP
Below are tested regexes for the most common validation needs. Keep in mind that complex formats such as email cannot be described perfectly by a single regex; in production, combine them with server-side validation.
1. Email (pragmatic validation):
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
- Local part: letters, digits, dots, hyphens, underscores, %, +
- @: required separator
- Domain: letters, digits, dots, and hyphens
- TLD: at least 2 letters
- Valid examples: user@example.com, first.last@company.co.uk
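As a sketch of how this pattern behaves, the following Python snippet runs it over a few addresses. The helper name `is_email` and the rejected samples are assumptions for illustration.

```python
import re

EMAIL = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def is_email(value: str) -> bool:
    """Format check only; it does not prove the mailbox exists."""
    return EMAIL.match(value) is not None

samples = [
    "user@example.com",
    "first.last@company.co.uk",
    "no-at-sign.example.com",   # missing "@"
    "user@localhost",           # no dot-separated TLD
    "user@@example.com",        # "@" is not allowed in the domain part
]
email_results = {s: is_email(s) for s in samples}
```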
2. International phone number (E.164 format):
^\+?[1-9]\d{1,14}$
- The leading + is optional
- First digit: 1-9 (must not start with 0)
- At most 15 digits
- Valid examples: +84901234567, 447911123456
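The length and leading-digit rules can be confirmed with a few inputs. This is a minimal sketch; `is_e164` is a hypothetical helper name and the invalid samples are invented.

```python
import re

E164 = re.compile(r"^\+?[1-9]\d{1,14}$")

def is_e164(value: str) -> bool:
    return E164.match(value) is not None

samples = [
    "+84901234567",
    "447911123456",
    "+0123456789",          # starts with 0
    "+1234567890123456",    # 16 digits, one too many
    "+84 90 123 4567",      # spaces are not allowed
]
phone_results = {s: is_e164(s) for s in samples}
```

Formatted numbers with spaces or dashes need to be normalized before they reach this check.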
3. URL (HTTP/HTTPS):
^https?:\/\/[\w.-]+(?:\.[a-zA-Z]{2,})(?:\/[\w.~:/?#\[\]@!$&'()*+,;=-]*)?$
- Scheme: http:// or https://
- Domain: letters and digits, optionally with dots and hyphens
- TLD: at least 2 letters
- Path (optional): most characters allowed in a URL; as written, the class omits %, so percent-encoded paths such as /a%20b do not match
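Running the URL pattern over a handful of inputs shows both what it accepts and where it is stricter than real-world URLs. The helper name `is_http_url` and the sample URLs are assumptions for this sketch.

```python
import re

URL = re.compile(
    r"^https?:\/\/[\w.-]+(?:\.[a-zA-Z]{2,})(?:\/[\w.~:/?#\[\]@!$&'()*+,;=-]*)?$"
)

def is_http_url(value: str) -> bool:
    return URL.match(value) is not None

samples = [
    "https://example.com",
    "http://sub.example.co.uk/path?q=1#top",
    "ftp://example.com",            # scheme not allowed
    "https://example",              # no TLD
    "https://example.com:8080/",    # the pattern has no place for a port
    "https://example.com/a%20b",    # "%" is not in the path character class
]
url_results = {s: is_http_url(s) for s in samples}
```

If ports or percent-encoded paths matter for your data, a URL parser from the standard library is likely a better fit than extending the regex.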
4. IPv4 address:
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$
- Four octets separated by dots
- Each octet is in the range 0-255
- Valid: 192.168.1.1, 10.0.0.1, 255.255.255.0
- Invalid: 256.1.1.1, 192.168.1.999
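The octet alternation can be verified against the valid and invalid examples listed above. The helper name `is_ipv4` and the last two samples are additions made for this sketch.

```python
import re

IPV4 = re.compile(
    r"^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$"
)

def is_ipv4(value: str) -> bool:
    return IPV4.match(value) is not None

samples = [
    "192.168.1.1", "10.0.0.1", "255.255.255.0",   # valid
    "256.1.1.1", "192.168.1.999",                 # octet out of range
    "1.2.3",                                      # only three octets
    "010.0.0.1",                                  # leading zero: accepted by this pattern
]
ipv4_results = {s: is_ipv4(s) for s in samples}
```

Whether leading zeros should count as valid depends on the application, so treat the last case as something to decide explicitly.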
5. ISO date (YYYY-MM-DD):
^\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])$
- Year: 4 digits
- Month: 01-12
- Day: 01-31
- Does not by itself reject nonexistent dates such as 02-30; that needs additional logic
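One way to supply the additional logic is to let the regex check the shape and then hand the numbers to a calendar-aware type. This sketch uses capturing groups instead of the non-capturing ones above so the parts can be extracted; `is_real_date` is a hypothetical helper name.

```python
import re
from datetime import date

# Same ranges as the pattern above, but with capturing groups.
ISO_DATE = re.compile(r"^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

def is_real_date(text: str) -> bool:
    """Regex checks the format; datetime.date checks the calendar."""
    m = ISO_DATE.match(text)
    if m is None:
        return False
    year, month, day = (int(part) for part in m.groups())
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True

format_only = ISO_DATE.match("2026-02-30") is not None
date_results = {s: is_real_date(s) for s in
                ["2026-03-16", "2026-02-30", "2024-02-29", "2026-02-29", "2026-13-01"]}
```

February 30 passes the regex on its own but is rejected once the calendar check runs.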
6. CSS hex color code:
^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$
- Supports #RGB, #RRGGBB, and #RRGGBBAA
- Valid examples: #fff, #FF5733, #FF573380
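A short check of the three supported lengths, with a few rejected inputs added for contrast. The helper name `is_hex_color` is an assumption.

```python
import re

HEX_COLOR = re.compile(r"^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$")

def is_hex_color(value: str) -> bool:
    return HEX_COLOR.match(value) is not None

samples = ["#fff", "#FF5733", "#FF573380", "#ffff", "#ggg", "fff"]
color_results = {s: is_hex_color(s) for s in samples}
```

The four-digit form is rejected because the alternation lists only 3, 6, and 8 hex digits.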
Test and refine all of these patterns with our regex tool. If you need to check JSON that contains them, you can use the JSON validator.
Flags, performance, and good practices
To use regex well, you need to understand flags (modifiers) and follow practices that avoid performance problems and keep patterns maintainable.
Common flags:
| Flag | Name | Effect |
|---|---|---|
| `g` | Global | Finds ALL matches, not only the first |
| `i` | Case insensitive | Ignores the difference between uppercase and lowercase |
| `m` | Multiline | `^` and `$` match at the start and end of each line |
| `s` | Dotall | The dot `.` also matches newline characters |
| `u` | Unicode | Enables full Unicode support |
In JavaScript: /pattern/flags, for example /hello world/gi
In Python: re.compile(r'pattern', re.IGNORECASE | re.MULTILINE)
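The following Python sketch shows how the flags in the table map onto the `re` module. Python has no `g` flag; `findall()` and `finditer()` play that role. The sample text is invented.

```python
import re

text = "Hello world\nhello again\nHELLO there"

# findall() returns every match, which is what "g" does in JavaScript.
# re.IGNORECASE corresponds to "i", re.MULTILINE to "m".
line_starts = re.findall(r"^hello", text, re.IGNORECASE | re.MULTILINE)

# Without MULTILINE, ^ matches only at the very start of the string.
string_start = re.findall(r"^hello", text, re.IGNORECASE)

# re.DOTALL corresponds to "s": "." also matches the newline.
without_dotall = re.search(r"world.hello", text)
with_dotall = re.search(r"world.hello", text, re.DOTALL)
```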
Good practices for performance:
- Avoid catastrophic backtracking: Patterns such as `(a+)+$` can make the regex engine try a number of combinations that grows exponentially with the input length. When an attacker triggers this on purpose, it is called ReDoS (regular expression denial of service). Where the engine supports them, use possessive quantifiers (`++`) or atomic groups
- Be specific: If you only expect letters, `[a-z]+` is better than `.+`. The more specific the pattern, the faster it generally runs
- Use anchors: `^` and `$` tell the regex engine where a match must start and end, which cuts unnecessary scanning
- Prefer non-capturing groups: `(?:...)` is more efficient than `(...)` when you do not need to keep the result
- Test with edge cases: Always try the pattern against empty strings, unusual characters, and malformed data before shipping it
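To get a feel for catastrophic backtracking, the sketch below times the nested pattern against a flattened rewrite that accepts the same strings. The helper `time_match` is hypothetical, the input sizes are kept small on purpose, and actual timings depend on the machine and the engine, so none are shown.

```python
import re
import time

NESTED = re.compile(r"(a+)+$")   # nested quantifiers: prone to backtracking
FLAT = re.compile(r"a+$")        # accepts the same strings with one quantifier

def time_match(pattern, text):
    """Return the match result and the elapsed time in seconds."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

# A run of "a" followed by "b" can never match, so the nested pattern
# tries many ways of splitting the run before giving up.
for n in (10, 14, 18):
    text = "a" * n + "b"
    _, nested_time = time_match(NESTED, text)
    _, flat_time = time_match(FLAT, text)
    print(f"n={n}: nested {nested_time:.5f}s, flat {flat_time:.5f}s")
```

In a backtracking engine the nested timing should climb much faster than the flat one as n grows. Rewriting the pattern so that only one quantifier can claim each character is often the simplest fix.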
Good practices for maintainability:
- Comment long regexes or split them into parts so that other people can read them
- For very complex problems, consider combining regex with ordinary code logic
- Build test cases that cover both valid and invalid data
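In Python, one way to follow these practices is the `re.VERBOSE` flag, which allows whitespace and comments inside the pattern, paired with small lists of valid and invalid inputs. The sample lists below are illustrative assumptions.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows "#" comments.
ISO_DATE = re.compile(
    r"""
    ^
    (?P<year>\d{4})                  # four-digit year
    -
    (?P<month>0[1-9]|1[0-2])         # month 01-12
    -
    (?P<day>0[1-9]|[12]\d|3[01])     # day 01-31
    $
    """,
    re.VERBOSE,
)

VALID = ["2026-03-16", "1999-12-31"]
INVALID = ["2026-13-01", "2026-3-16", "16-03-2026", ""]

# Collect every sample that the pattern classifies incorrectly.
failures = [s for s in VALID if not ISO_DATE.match(s)]
failures += [s for s in INVALID if ISO_DATE.match(s)]
```

An empty `failures` list means the pattern agrees with every sample; extending the two lists is cheaper than debugging the pattern in production.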
You can check patterns safely with our regex tester and cross-check data with our JSON validator.
Frequently asked questions
Where should someone new to regex start?
Start with literal characters, character classes, anchors (`^` and `$`), and the basic quantifiers such as `*`, `+`, `?`, and `{n}`. Once you are comfortable with these building blocks, groups, backreferences, and lookahead/lookbehind come much more quickly.
Does a regex behave the same in every programming language?
Not entirely. The core syntax is usually the same, but regex engines differ in lookbehind support, Unicode handling, and some extended features. Check the documentation for the language or library you are using.
Is a single regex enough to validate an email address?
No. Regex is good at checking the basic format, but a real application should also verify on the server side, for example by checking the domain or its MX record, or by sending a confirmation email.
Why is regex sometimes very slow?
The usual cause is excessive backtracking, especially with nested and ambiguous patterns. The fix is to write more specific patterns, use anchors, and limit constructs that force the engine to retry many times.
Should you use regex to parse HTML?
Only for very small, controlled tasks. Real HTML has nested structure and edge cases, so a dedicated parser is safer than regex in most situations.
What is the fastest way to get better at regex?
Learn one small concept at a time and apply it right away to a real problem: email addresses, URLs, dates, logs, and form data. Combining theory with a regex tester speeds up progress considerably.