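The Art of Regular Expressions and Text Processing

Introduction

Text processing is a core skill in front-end development. Regular expressions are a powerful pattern-matching tool that lets us handle complex text manipulation tasks efficiently.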

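Regular Expression Basics

What Is a Regular Expression?

A regular expression is a pattern for matching combinations of characters in a string. It is built from literal characters and special symbols that together define a search pattern.

// Basic example: match every run of digits
const numberPattern = /\d+/g;
const text = "I have 23 apples and 45 oranges";
const numbers = text.match(numberPattern); // Result: ["23", "45"]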

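Basic Syntax Elements

Element	Description	Example
`.`	Matches any single character except a line terminator (unless the `s` flag is set)	`/a.c/` matches “abc”, “axc”, etc.
`[...]`	Character class: matches any one character listed inside the brackets	`/[abc]/` matches “a”, “b”, or “c”
`[^...]`	Negated character class: matches any one character not listed inside the brackets	`/[^abc]/` matches any character other than “a”, “b”, or “c”
`\d`	Matches any digit; equivalent to `[0-9]`	`/\d{3}/` matches three consecutive digits
`\w`	Matches any letter, digit, or underscore; equivalent to `[A-Za-z0-9_]`	`/\w+/` matches one or more word characters
`\s`	Matches any whitespace character	`/\s/` matches a space, a tab, a line break, etc.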
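
The elements in the table can be tried out directly. The following is a minimal sketch; the sample inputs are made up for illustration and are not from the original examples.

```javascript
// Each entry is [pattern, input]; every input is an illustrative assumption.
const samples = [
  [/a.c/, "axc"],        // . matches any one character (except a line terminator)
  [/[abc]/, "b"],        // character class
  [/[^abc]/, "d"],       // negated character class
  [/\d{3}/, "tel: 555"], // three consecutive digits
  [/\w+/, "foo_1"],      // one or more word characters
  [/\s/, "a b"],         // whitespace
];
for (const [pattern, input] of samples) {
  console.log(String(pattern), JSON.stringify(input), pattern.test(input));
}

// Without the s flag, . does not match a line break.
const dotDefault = /a.c/.test("a\nc"); // false
const dotAll = /a.c/s.test("a\nc");    // true
```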

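Quantifiers

Quantifiers determine how many times the preceding element may match.

Quantifier	Description	Example
`*`	Matches the preceding element zero or more times	`/a*/` matches “”, “a”, “aa”, …
`+`	Matches the preceding element one or more times	`/a+/` matches “a”, “aa”, … but not “”
`?`	Matches the preceding element zero or one time	`/a?/` matches “” or “a”
`{n}`	Matches the preceding element exactly n times	`/a{3}/` matches “aaa”
`{n,}`	Matches the preceding element at least n times	`/a{2,}/` matches “aa”, “aaa”, …
`{n,m}`	Matches the preceding element between n and m times	`/a{1,3}/` matches “a”, “aa”, or “aaa”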
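
Because an unanchored pattern such as `/a*/` succeeds on any input (it can match the empty string), quantifiers are easiest to check when the pattern must cover the whole input. The sketch below does that with a hypothetical helper, `matchesWhole`, which is introduced here purely for illustration.

```javascript
// Hypothetical helper: true when the entire input is "a" repeated
// according to the given quantifier.
function matchesWhole(quantifier, input) {
  return new RegExp("^a" + quantifier + "$").test(input);
}

const results = {
  star: ["", "a", "aaa"].map((s) => matchesWhole("*", s)),            // [true, true, true]
  plus: ["", "a", "aaa"].map((s) => matchesWhole("+", s)),            // [false, true, true]
  optional: ["", "a", "aa"].map((s) => matchesWhole("?", s)),         // [true, true, false]
  exactly3: ["aa", "aaa", "aaaa"].map((s) => matchesWhole("{3}", s)), // [false, true, false]
  atLeast2: ["a", "aa", "aaaa"].map((s) => matchesWhole("{2,}", s)),  // [false, true, true]
  from1to3: ["", "aa", "aaaa"].map((s) => matchesWhole("{1,3}", s)),  // [false, true, false]
};
console.log(results);
```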

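Anchors

Anchors specify where a match may occur rather than which characters it consumes.

// Use anchors to match at the start and end of the input (of each line when the m flag is set)
const pattern = /^start.*end$/;
console.log(pattern.test("start of the content, then the end")); // true
console.log(pattern.test("not the start, but still the end")); // false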
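
By default `^` and `$` anchor to the whole input; the `m` flag makes them match at every line boundary as well. A small sketch of the difference, using a made-up two-line string:

```javascript
// Illustrative two-line input (an assumption, not from the examples above).
const log = "start one end\nstart two end";

// Without m: $ only matches at the very end, and . cannot cross the line break.
const wholeInput = /^start.*end$/.test(log); // false

// With m (and g to collect every match): each line is anchored separately.
const perLine = log.match(/^start.*end$/gm); // ["start one end", "start two end"]

console.log(wholeInput, perLine);
```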

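Greedy vs. Lazy Matching

Quantifiers are greedy by default: they match as many characters as possible. Lazy quantifiers, by contrast, match as few characters as possible.

Greedy Matching

// Greedy matching example
const htmlText = "<div>Content 1</div><div>Content 2</div>";
const greedyPattern = /<div>.*<\/div>/;
const greedyMatch = htmlText.match(greedyPattern);
console.log(greedyMatch[0]); // Result: "<div>Content 1</div><div>Content 2</div>"

In greedy mode, .* consumes as many characters as it can, so the match runs from the first <div> to the last </div> and here covers the entire string.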

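Lazy Matching

// Lazy matching example
const htmlText = "<div>Content 1</div><div>Content 2</div>";
const lazyPattern = /<div>.*?<\/div>/g;
const lazyMatches = htmlText.match(lazyPattern);
console.log(lazyMatches); // Result: ["<div>Content 1</div>", "<div>Content 2</div>"]

Adding a question mark ? after a quantifier turns it from greedy into lazy. In lazy mode the engine consumes as few characters as possible and stops extending the match as soon as the rest of the pattern can match.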

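Performance Comparison

// Greedy timing test
const longText = "<div>".repeat(1000) + "</div>".repeat(1000);
console.time('greedy');
const greedyResult = /<div>.*<\/div>/.test(longText);
console.timeEnd('greedy'); // .* runs to the end of the input, then backtracks to the last </div>
// Lazy timing test
console.time('lazy');
const lazyResult = /<div>.*?<\/div>/.test(longText);
console.timeEnd('lazy'); // .*? advances one character at a time until the first </div>

On long text, lazy matching is often faster than greedy matching when the closing delimiter sits close to the opening one, because it avoids overshooting and backtracking. When the closing delimiter is far away, as in this input, greedy matching can be the faster of the two. Timings depend on the engine and the input, so measure against representative data.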
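
When the content between two delimiters cannot contain the delimiter's first character, a negated character class is a common alternative to both `.*` and `.*?`. The sketch below compares the three on a made-up input; whether `[^<]*` is acceptable depends on the assumption that the wrapped text never contains `<`.

```javascript
// Illustrative input (an assumption, not from the examples above).
const html = "<b>one</b> and <b>two</b>";

// Greedy: runs to the last </b>.
const greedy = html.match(/<b>.*<\/b>/g);     // ["<b>one</b> and <b>two</b>"]

// Lazy: stops at the nearest </b>.
const lazy = html.match(/<b>.*?<\/b>/g);      // ["<b>one</b>", "<b>two</b>"]

// Negated class: [^<]* can never run past a "<", so it cannot overshoot.
const negated = html.match(/<b>[^<]*<\/b>/g); // ["<b>one</b>", "<b>two</b>"]

console.log(greedy, lazy, negated);
```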

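Capture Groups

Capture groups let us extract specific parts of a match, which is especially useful when processing complex text.

Basic Capture Groups

// Basic capture groups
const dateString = "Today is 2023-05-15";
const datePattern = /(\d{4})-(\d{2})-(\d{2})/;
const match = dateString.match(datePattern);
console.log(match[0]); // "2023-05-15" (the full match)
console.log(match[1]); // "2023" (first capture group)
console.log(match[2]); // "05" (second capture group)
console.log(match[3]); // "15" (third capture group)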

Named Capture Groups

Named capture groups make code easier to read, especially in complex patterns.

// Named capture groups
const dateString = "Today is 2023-05-15";
const datePattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const match = dateString.match(datePattern);
console.log(match.groups.year);  // "2023"
console.log(match.groups.month); // "05"
console.log(match.groups.day);   // "15"
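
Named groups can also be referenced in replacement strings with `$<name>` and read from each result of `matchAll`. The following sketch assumes an ES2020-capable runtime; the variable names and the second sample sentence are illustrative.

```javascript
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

// Destructure the named groups directly.
const { year, month, day } = "Today is 2023-05-15".match(isoDate).groups;

// Reorder the parts in a replacement string using $<name>.
const reordered = "2023-05-15".replace(isoDate, "$<day>/$<month>/$<year>"); // "15/05/2023"

// matchAll requires the g flag, so build a global copy of the pattern.
const history = "Released 2023-05-15, patched 2024-01-02";
const years = [...history.matchAll(new RegExp(isoDate, "g"))].map((m) => m.groups.year);

console.log(year, month, day, reordered, years); // years is ["2023", "2024"]
```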

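Non-Capturing Groups

When we need grouping but do not need to capture the matched text, we can use a non-capturing group.

// Non-capturing groups
const text = "HTML and CSS are both essential front-end skills";
const pattern = /(?:HTML|CSS) and (?:HTML|CSS)/;
console.log(pattern.test(text)); // true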
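
The practical difference shows up in the match result: capturing groups add entries to the array, non-capturing groups do not. A short sketch with an illustrative input:

```javascript
const input = "HTML and CSS";

// Two capturing groups: the result holds the full match plus two captures.
const capturing = input.match(/(HTML|CSS) and (HTML|CSS)/);

// Same alternation with non-capturing groups: only the full match is stored.
const nonCapturing = input.match(/(?:HTML|CSS) and (?:HTML|CSS)/);

// Mixing both keeps only the group we care about, and it becomes group 1.
const mixed = input.match(/(?:HTML|CSS) and (HTML|CSS)/);

console.log(capturing.length, nonCapturing.length, mixed[1]); // 3 1 "CSS"
```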

Backreferences

A backreference lets a pattern refer to the text matched by an earlier capture group.

// Backreference
const htmlWithAttrs = '<div class="container">Content</div>';
const pattern = /<(\w+)([^>]*)>(.*?)<\/\1>/;
const match = htmlWithAttrs.match(pattern);
console.log(match[1]); // "div" (tag name)
console.log(match[2]); // ' class="container"' (attributes)
console.log(match[3]); // "Content" (inner content)
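
Two further uses of backreferences, sketched with made-up inputs: finding accidentally doubled words with `\1`, and matching a quoted string whose closing quote must equal its opening quote with a named backreference `\k<name>`.

```javascript
// \1 must repeat exactly what group 1 matched (case-insensitively, because of i).
const doubledWord = /\b(\w+)\s+\1\b/gi;
const found = "This is is a test of the the parser".match(doubledWord);
// found is ["is is", "the the"]

// \k<quote> refers back to the named group, so "..." and '...' both work,
// while a different quote character inside the string is left alone.
const quoted = /(?<quote>['"])(?<body>.*?)\k<quote>/;
const quotedMatch = `say "it's fine" now`.match(quoted);

console.log(found, quotedMatch.groups.body); // body is "it's fine"
```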

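Performance Optimization Tips

Avoid Overusing Greedy Mode

A greedy quantifier can overshoot the intended match and force the engine to backtrack, which hurts performance. Where the match should end at the nearest delimiter, a lazy quantifier can be significantly more efficient.

// Not recommended when the match should stop at the nearest </div> (can be slow on large inputs)
const slowPattern = /<div>.*<\/div>/;
// Recommended in that case
const fastPattern = /<div>.*?<\/div>/;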

Prefer More Specific Patterns

// Not recommended (too broad)
const emailCheck1 = /.*@.*/;
// Recommended (more specific)
const emailCheck2 = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/;

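Avoid Nested Quantifiers

Nested quantifiers such as (a+)+ can make matching time grow exponentially, a problem known as "catastrophic backtracking".

// Dangerous pattern: can trigger a backtracking explosion
const badPattern = /^(a+)*$/;
const input = "aaaaaaaaaaaaaaa!"; // ends with an exclamation mark, so the match must fail
console.time('test');
badPattern.test(input); // work roughly doubles with each extra "a"; longer inputs can hang the browser
console.timeEnd('test');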
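
In many cases the nesting is unnecessary and can simply be removed. As a sketch: `/^(a+)*$/` accepts exactly the strings made only of the letter a (including the empty string), which is what `/^a*$/` accepts too, but the flat version has only one way to split the input and so fails quickly on a near miss. The sample inputs are illustrative.

```javascript
const nested = /^(a+)*$/; // nested quantifiers: many ways to split a run of a's
const flat = /^a*$/;      // same accepted strings, one way to match

// Both patterns agree on short inputs (kept short so the nested one stays fast).
const inputs = ["", "a", "aaaa", "aaab", "b"];
const agree = inputs.every((s) => nested.test(s) === flat.test(s)); // true

// A long near miss is rejected promptly by the flat pattern.
const nearMiss = "a".repeat(10000) + "!";
const flatResult = flat.test(nearMiss); // false

console.log(agree, flatResult);
```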

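Optimizing with Atomic Groups

In environments that support them, atomic groups (?>...) can be used to limit backtracking.

// Some regex engines support atomic groups (the JavaScript standard does not yet)
// const atomicGroup = /(?>a+)b/;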
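
In JavaScript, a widely used workaround is to emulate an atomic group with a lookahead plus a backreference: `(?=(a+))\1`. Lookaheads are not re-entered once they succeed, so the captured run cannot be shortened later. The sketch below is an emulation under that assumption, not a native atomic group.

```javascript
// Ordinary greedy quantifier: a+ can give back one "a" so the literal "a" still matches.
const plain = /^a+ab$/.test("aaab"); // true

// Atomic-style emulation: the lookahead captures the whole run of a's, \1 consumes
// it, and the engine cannot go back into the lookahead to try a shorter run.
const atomicLike = /^(?=(a+))\1ab$/.test("aaab"); // false

// The same trick applied to the nested-quantifier shape /^(a+)*$/.
const guarded = /^(?:(?=(a+))\1)*$/;
const nearMissResult = guarded.test("a".repeat(30) + "!"); // false
const allAs = guarded.test("aaaa");                        // true

console.log(plain, atomicLike, nearMissResult, allAs);
```

The second result shows the trade-off: an atomic group can reject input that the backtracking version would accept, so it is only a safe optimization when giving characters back could never help the match.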

Practical Use Cases

Form Validation

// Email validation
function validateEmail(email) {
  const pattern = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;
  return pattern.test(email);
}

// Password complexity check (at least 8 characters, with lowercase and uppercase letters, a digit, and a special character)
function validatePassword(password) {
  const pattern = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*()_+])[A-Za-z\d!@#$%^&*()_+]{8,}$/;
  return pattern.test(password);
}

// Phone number validation (mainland China)
function validatePhone(phone) {
  const pattern = /^1[3-9]\d{9}$/;
  return pattern.test(phone);
}

Highlighting Matched Text

// Highlight a search keyword
function highlightKeywords(text, keyword) {
  // Escape regex metacharacters so the keyword is matched literally
  const escapedKeyword = keyword.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(`(${escapedKeyword})`, 'gi');
  return text.replace(pattern, '<span class="highlight">$1</span>');
}

// Usage example
const searchResult = highlightKeywords(
  "JavaScript is a programming language for web page interaction",
  "javascript"
);
console.log(searchResult); // '<span class="highlight">JavaScript</span> is a programming language for web page interaction'

URL Parsing

// Extract URL query parameters
function getUrlParams(url) {
  const params = {};
  const pattern = /[?&]([^=&#]+)=([^&#]*)/g;
  let match;
  // With the g flag, exec() resumes from lastIndex, so each call returns the next pair
  while ((match = pattern.exec(url)) !== null) {
    params[decodeURIComponent(match[1])] = decodeURIComponent(match[2]);
  }
  return params;
}

// Usage example
const url = "https://example.com/search?q=正則表達式&page=1&sort=desc";
const params = getUrlParams(url);
console.log(params); // {q: "正則表達式", page: "1", sort: "desc"}
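
For comparison, the built-in `URL` and `URLSearchParams` APIs can do the same job without a hand-written pattern. The sketch below uses a hypothetical function name, `getUrlParamsViaApi`, and an illustrative URL; it assumes an absolute URL (the `URL` constructor throws on a relative one) and, like the object-based version above, keeps only the last value for a repeated key.

```javascript
// Hypothetical alternative to a regex-based parser.
function getUrlParamsViaApi(url) {
  // searchParams iterates as [key, value] pairs, already percent-decoded.
  return Object.fromEntries(new URL(url).searchParams);
}

const parsed = getUrlParamsViaApi(
  "https://example.com/search?q=regular+expressions&page=1&sort=desc#top"
);
console.log(parsed); // { q: "regular expressions", page: "1", sort: "desc" }
```

Note that `URLSearchParams` treats `+` as an encoded space, which `decodeURIComponent` does not.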

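Number Formatting

// Format a number with thousands separators
function formatNumber(num) {
  // Insert a comma wherever the digits that follow form complete groups of three
  return num.toString().replace(/\B(?=(\d{3})+(?!\d))/g, ",");
}

// Usage example
console.log(formatNumber(1234567)); // "1,234,567"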
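
The lookahead pattern groups any run of digits, so it also inserts commas after a decimal point. When that matters, the built-in `Intl.NumberFormat` is one alternative. The sketch below repeats the regex in a hypothetical helper, `groupWithRegex`, for comparison, and assumes the "en-US" locale.

```javascript
// Same pattern as the regex-based formatter, repeated here for comparison.
function groupWithRegex(num) {
  return num.toString().replace(/\B(?=(\d{3})+(?!\d))/g, ",");
}

const formatter = new Intl.NumberFormat("en-US");

const grouped = formatter.format(1234567);        // "1,234,567"
const decimal = formatter.format(1234.5678);      // "1,234.568" (rounded to 3 fraction digits by default)
const regexDecimal = groupWithRegex(1234.5678);   // "1,234.5,678" (fraction digits get grouped too)

console.log(grouped, decimal, regexDecimal);
```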

Edge Cases and Limitations

Limitations of Regular Expressions

Regular expressions are a poor fit for certain text structures, such as HTML or other nested formats.

// The wrong approach: parsing HTML with a regular expression
const htmlContent = '<div><p>Text 1</p><p>Text 2 <a href="#">link</a></p></div>';
const badPattern = /<p>(.*?)<\/p>/g; // cannot handle nested tags correctly

// A better approach: parse the DOM
function extractParagraphText(html) {
  const parser = new DOMParser();
  const doc = parser.parseFromString(html, 'text/html');
  const paragraphs = doc.querySelectorAll('p');
  return Array.from(paragraphs).map(p => p.textContent);
}

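Handling Unicode Characters

By default, JavaScript regular expressions operate on UTF-16 code units, so characters outside the Basic Multilingual Plane need the u flag to be treated as single characters.

// Without the u flag, Unicode is not handled correctly
console.log(/^.$/.test('😊')); // false (the emoji is treated as two code units)
// With the u flag, Unicode is handled correctly
console.log(/^.$/u.test('😊')); // true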
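
The u flag also enables Unicode property escapes (`\p{...}`, ES2018), which match by script or general category rather than by a hand-written range. A minimal sketch with illustrative inputs; the accented letters are written as `\u` escapes to avoid any ambiguity about their encoding.

```javascript
// Script property: every character must belong to the Han script.
const hanOnly = /^\p{Script=Han}+$/u;
const allHan = hanOnly.test("正則表達式");   // true
const mixedScript = hanOnly.test("regex正則"); // false

// General category L (letters) versus ASCII-only \w.
const phrase = "na\u00EFve caf\u00E9";          // "naïve café"
const letterRuns = phrase.match(/\p{L}+/gu);    // ["naïve", "café"]
const asciiRuns = phrase.match(/\w+/g);         // ["na", "ve", "caf"]

console.log(allHan, mixedScript, letterRuns, asciiRuns);
```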

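Avoid Over-Relying on Regular Expressions

Sometimes string methods or a dedicated parsing library are the better choice.

// For simple string operations, built-in methods are often clearer
// Not recommended
const csv = "a,b,c";
const values1 = csv.match(/([^,]+),([^,]+),([^,]+)/);
// Recommended
const values2 = csv.split(',');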

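Comparative Analysis

Regular Expressions vs. String Methods

Approach	Strengths	Weaknesses
Regular expressions	Powerful pattern matching, concise code	Steep learning curve, hard to debug, potential performance problems
String methods	Intuitive and readable, predictable performance	Complex patterns require more code

// Extract the domain: regular expression approach
function getDomainRegex(url) {
  const match = url.match(/^https?:\/\/([^/]+)/);
  return match ? match[1] : null;
}

// Extract the domain: string method approach
function getDomainString(url) {
  if (!url.startsWith('http://') && !url.startsWith('https://')) {
    return null;
  }
  // Drop the protocol using string methods only
  const withoutProtocol = url.slice(url.indexOf('://') + 3);
  const firstSlash = withoutProtocol.indexOf('/');
  return firstSlash === -1 ? withoutProtocol : withoutProtocol.substring(0, firstSlash);
}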

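Browser Compatibility

Most modern browsers support the regular expression features introduced in ES2018 (such as named capture groups), but projects that must support older browsers need to take care.

// Named capture groups (not supported in older browsers)
const datePattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
// Backward-compatible alternative
const oldDatePattern = /(\d{4})-(\d{2})-(\d{2})/;
const match = "2023-05-15".match(oldDatePattern);
const [_, year, month, day] = match;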
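
One way to choose between the two at runtime is feature detection. A regex literal with unsupported syntax fails when the script is parsed, so the sketch below builds the pattern from a string inside try/catch. The function names `supportsNamedGroups` and `parseDate` are hypothetical, and this is one possible approach rather than a prescribed one.

```javascript
// Hypothetical feature check: older engines throw a SyntaxError here.
function supportsNamedGroups() {
  try {
    new RegExp("(?<year>\\d{4})");
    return true;
  } catch (e) {
    return false;
  }
}

// Hypothetical parser that returns the same shape on either code path.
function parseDate(input) {
  if (supportsNamedGroups()) {
    const named = new RegExp("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})").exec(input);
    if (!named) return null;
    return { year: named.groups.year, month: named.groups.month, day: named.groups.day };
  }
  // Fallback: numbered groups work in every engine.
  const numbered = /(\d{4})-(\d{2})-(\d{2})/.exec(input);
  return numbered ? { year: numbered[1], month: numbered[2], day: numbered[3] } : null;
}

const parsedDate = parseDate("2023-05-15");
console.log(parsedDate); // { year: "2023", month: "05", day: "15" }
```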