The Python Regular Expression Module `re`

flyfish

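I. Regular Expression Basics

1. What is a regular expression?

A regular expression (RE, or regex) is a pattern language for matching, searching, and replacing text. A pattern is built from ordinary characters (such as letters and digits) and special characters (metacharacters).

2. Common metacharacters

Metacharacter	Description	Example
`.`	Matches any single character except a newline	`a.c` → `abc`, `adc`
`\w`	Matches a word character: a letter, digit, or underscore	`\w+` → `hello123`
`\d`	Matches a decimal digit	`\d{3}` → `123`
`\s`	Matches a whitespace character (space, tab, newline, etc.)	`\s+` → a run of spaces
`*`	Matches the preceding element zero or more times	`ab*` → `a`, `ab`, `abb`
`+`	Matches the preceding element one or more times	`ab+` → `ab`, `abb`
`?`	Matches the preceding element zero or one time	`ab?` → `a` or `ab`
`^`	Matches at the start of the string	`^abc` → starts with `abc`
`$`	Matches at the end of the string (or just before a trailing newline)	`abc$` → ends with `abc`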
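The table entries can be checked directly with `re.findall()`. The snippet below is a minimal sketch; the sample strings are made up for illustration and are not part of the original table.

```python
import re

# (pattern, sample text) pairs; the sample texts are illustrative only.
samples = [
    (r"a.c", "abc adc a\nc"),   # "." does not match the newline in "a\nc"
    (r"\w+", "hello123 !"),
    (r"\d{3}", "tel 123456"),
    (r"\s+", "a   b\tc"),
    (r"ab*", "a ab abb"),
    (r"ab+", "a ab abb"),
    (r"ab?", "a ab abb"),
    (r"^abc", "abcdef"),
    (r"abc$", "xyzabc"),
]

# Map each pattern to the list of non-overlapping matches in its sample.
results = {pattern: re.findall(pattern, text) for pattern, text in samples}

for pattern, found in results.items():
    print(f"{pattern!r:10} -> {found}")
```

Note that `ab?` finds `ab` inside `abb` and leaves the final `b` unmatched, which is the difference from `ab*` and `ab+`.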

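II. The Python `re` Module

1. Importing the module

import re

2. Common functions

Function	Description
`re.compile()`	Compiles a pattern into a pattern object, which is more efficient when the pattern is reused
`re.match()`	Matches the pattern at the start of the string
`re.search()`	Searches the string for the first position where the pattern matches
`re.findall()`	Finds all non-overlapping matches and returns them as a list
`re.finditer()`	Finds all non-overlapping matches and returns an iterator of match objects
`re.sub()`	Replaces matches with a replacement
`re.subn()`	Replaces matches and also returns the number of replacements
`re.split()`	Splits the string at each match of the pattern
`re.fullmatch()`	Requires the entire string to match the pattern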
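Three functions from the table, `re.finditer()`, `re.subn()`, and `re.fullmatch()`, behave as follows. This is a small sketch, and the input string is invented for the example.

```python
import re

text = "order 12, order 345, order 6"  # illustrative input

# re.finditer yields match objects, so the position of each match is available.
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

# re.subn returns a (new_string, number_of_replacements) tuple.
masked, count = re.subn(r"\d+", "#", text)

# re.fullmatch succeeds only when the whole string matches the pattern.
whole = re.fullmatch(r"\d+", "12345")
partial = re.fullmatch(r"\d+", "12345abc")

print(spans)
print(masked, count)
print(whole is not None, partial is not None)
```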

III. Core Features in Detail

1. Matching

re.match() (match from the start)

match = re.match(r'hello', 'hello world')
print(match.group())  # Output: hello

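The following call extracts a fenced JSON code block from a string named content. It is used below to explain re.search() in detail.

match = re.search(r'```json(.*?)```', content, re.DOTALL)

The `re.search()` function

re.search(pattern, string, flags=0) searches string for the first substring that matches pattern. It returns a match object if a match is found and None otherwise.

pattern: the regular expression to search for.
string: the string to search; here it is content.
flags: optional flags that modify matching behavior; here re.DOTALL is used.

The pattern r'```json(.*?)```'

r: the r prefix marks a raw string. In a raw string Python does not treat the backslash \ as an escape character, which avoids doubled backslashes in patterns and keeps them readable.
The opening ```json: literal text that matches the start of the code block.
(.*?): a capturing group that matches any characters (excluding newlines unless the re.DOTALL flag is set).
- .: matches any single character except a newline.
- *: allows the preceding element (here .) to repeat zero or more times.
- ?: placed after *, it makes the quantifier non-greedy. A greedy quantifier matches as much as possible, while a non-greedy one matches as little as possible. For example, if the string contains several ```json ... ``` blocks, the non-greedy match stops at the first closing ```.
The closing ```: literal text that matches the end of the code block.

The `re.DOTALL` flag

re.DOTALL is a flag in the re module that changes how . matches. By default . does not match a newline; with re.DOTALL it matches any character, including a newline. The captured text can therefore span several lines, so a multi-line JSON code block is matched correctly.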
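To see the effect of the non-greedy quantifier and of `re.DOTALL` together, the sketch below runs the pattern on a made-up `content` string containing two fenced JSON blocks. The `content` value and the `FENCE` helper are assumptions introduced for this example; `FENCE` only builds the three-backtick delimiter so that the example stays readable.

```python
import json
import re

FENCE = "`" * 3  # the three-backtick delimiter, built programmatically

# Hypothetical input containing two fenced JSON blocks.
content = (
    "Here is the result:\n"
    f"{FENCE}json\n"
    '{"name": "example", "id": 1}\n'
    f"{FENCE}\n"
    "And another one:\n"
    f"{FENCE}json\n"
    '{"id": 2}\n'
    f"{FENCE}\n"
)

lazy = re.compile(FENCE + r"json(.*?)" + FENCE, re.DOTALL)
greedy = re.compile(FENCE + r"json(.*)" + FENCE, re.DOTALL)

# Non-greedy: the capture stops at the first closing fence.
first_block = lazy.search(content).group(1).strip()
data = json.loads(first_block)

# Greedy: the capture runs on to the last closing fence.
greedy_block = greedy.search(content).group(1)

# Without re.DOTALL, "." cannot cross the newline after "json", so nothing matches.
without_dotall = re.search(FENCE + r"json(.*?)" + FENCE, content)

print(data)
print("And another one" in greedy_block)
print(without_dotall)
```

Checking the result of `search()` against `None` before calling `group()` is advisable whenever the input may not contain a block at all.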

re.search() (search anywhere in the string)

search = re.search(r'world', 'hello world')
print(search.group())  # Output: world

2. Finding all matches

re.findall()

numbers = re.findall(r'\d+', 'a123b456c')
print(numbers)  # Output: ['123', '456']

3. Replacing

re.sub()

text = re.sub(r'\d+', 'X', 'a123b456c')
print(text)  # Output: aXbXc

4. Splitting strings

re.split()

parts = re.split(r'\s+', 'hello   world')
print(parts)  # Output: ['hello', 'world']

IV. Capturing Groups and the `group()` Method

1. Basic usage

pattern = r'(\d{4})-(\d{2})-(\d{2})'
date_str = '2025-03-11'
match = re.search(pattern, date_str)
print(match.group(0))  # Full match → '2025-03-11'
print(match.group(1))  # First capturing group → '2025'
print(match.group(2))  # Second capturing group → '03'
print(match.group(3))  # Third capturing group → '11'

2. Counting capturing groups

Using groups()

groups = match.groups()
print(len(groups))  # Output: 3

Named capturing groups (using groupdict())

pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.search(pattern, date_str)
print(match.groupdict())  # Output: {'year': '2025', 'month': '03', 'day': '11'}
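Named groups can also be read one at a time and referenced in a replacement string. The sketch below reuses the date pattern from this section; the sentence being searched is invented for the example.

```python
import re

DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

sentence = "released on 2025-03-11"  # illustrative input
match = DATE.search(sentence)

year = match.group("year")               # access a group by name
month_day = match.group("month", "day")  # several groups at once give a tuple
day_span = match.span("day")             # (start, end) indices of the group

# \g<name> refers to a named group inside a replacement string.
reordered = DATE.sub(r"\g<day>/\g<month>/\g<year>", sentence)

print(year, month_day, day_span)
print(reordered)
```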

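V. `re.match` vs `re.search`

Basic comparison

re.match: tries to match the pattern starting at the beginning of the string. If the beginning of the string does not fit the pattern, re.match returns None even when a match exists elsewhere in the string. In other words, the match must start at the first character of the string.
re.search: scans the string for the first position where the pattern matches. As long as some part of the string fits the pattern, re.search returns a match object.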

Detailed examples

Example 1: the pattern matches at the start of the string

import re
# Define the string and the pattern
pattern = r'hello'
string = 'hello world'
# Using re.match
match_result = re.match(pattern, string)
if match_result:
    print("re.match succeeded, matched:", match_result.group())
else:
    print("re.match failed")
# Using re.search
search_result = re.search(pattern, string)
if search_result:
    print("re.search succeeded, matched:", search_result.group())
else:
    print("re.search failed")

Analysis: the pattern 'hello' is at the start of the string 'hello world', so both re.match and re.search succeed, and both return 'hello'.

Example 2: the pattern is not at the start of the string

import re
# Define the string and the pattern
pattern = r'world'
string = 'hello world'
# Using re.match
match_result = re.match(pattern, string)
if match_result:
    print("re.match succeeded, matched:", match_result.group())
else:
    print("re.match failed")
# Using re.search
search_result = re.search(pattern, string)
if search_result:
    print("re.search succeeded, matched:", search_result.group())
else:
    print("re.search failed")

Analysis: the pattern 'world' is not at the start of the string 'hello world', so re.match fails and returns None. re.search scans the whole string, finds 'world', and returns a match object whose matched text is 'world'.

Example 3: the pattern begins to match at the start but does not match completely

import re
# Define the string and the pattern
pattern = r'hello world!'
string = 'hello world'
# Using re.match
match_result = re.match(pattern, string)
if match_result:
    print("re.match succeeded, matched:", match_result.group())
else:
    print("re.match failed")
# Using re.search
search_result = re.search(pattern, string)
if search_result:
    print("re.search succeeded, matched:", search_result.group())
else:
    print("re.search failed")

Analysis: the leading part of the pattern 'hello world!', namely 'hello world', equals the string, but the string has no trailing '!', so the pattern as a whole cannot match and re.match fails. re.search finds no position in the string where the full pattern matches either, so it fails as well.

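Performance considerations

re.match: it attempts a match only at the start of the string and never retries at later positions, so it can be faster in some cases, especially when you know the content must appear at the beginning of the string.
re.search: it tries successive positions until the first match and scans the whole string only when no match is found, so it may be slower on long strings. It is more flexible, though, and suits cases where the position of the match is unknown.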
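The anchoring difference also matters for multi-line input. The sketch below, on an invented two-line string, shows that `re.match()` stays anchored at index 0 even with `re.MULTILINE`, while `re.search()` with `^` can match at the start of any line under that flag. It also contrasts `re.match()` with `re.fullmatch()`.

```python
import re

text = "first line\nsecond line"  # illustrative input

# re.match is anchored at index 0, even when re.MULTILINE is given.
m1 = re.match(r"second", text, re.MULTILINE)

# With re.MULTILINE, ^ matches at the start of every line, so search succeeds.
m2 = re.search(r"^second", text, re.MULTILINE)

# Without re.MULTILINE, ^ matches only at the very start of the string.
m3 = re.search(r"^second", text)

# re.match allows trailing text after the match; re.fullmatch does not.
m4 = re.match(r"first", text)
m5 = re.fullmatch(r"first", text)

print(m1, m2.start(), m3, m4.group(), m5)
```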

VI. Common Examples Using the `re` Module

1. Matching a string that starts with specific text

import re
text = "apple banana cherry"
pattern = r'^apple'
result = re.search(pattern, text)
if result:
    print("Match succeeded:", result.group())
else:
    print("Match failed")

2. Matching a string that ends with specific text

import re
text = "apple banana cherry"
pattern = r'cherry$'
result = re.search(pattern, text)
if result:
    print("Match succeeded:", result.group())
else:
    print("Match failed")

3. Matching a string that contains a specific word

import re
text = "The quick brown fox jumps over the lazy dog"
pattern = r'fox'
result = re.search(pattern, text)
if result:
    print("Match succeeded:", result.group())
else:
    print("Match failed")

4. Matching consecutive digits

import re
text = "abc123def"
pattern = r'\d+'
result = re.findall(pattern, text)
print("Matches:", result)  # Matches: ['123']

5. Matching runs of letters and digits

import re
text = "abc123def"
pattern = r'[a-zA-Z0-9]+'
result = re.findall(pattern, text)
print("Matches:", result)  # Matches: ['abc123def']

6. Matching an email address

import re
text = "example@example.com"
pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
result = re.fullmatch(pattern, text)
if result:
    print("Match succeeded:", result.group())
else:
    print("Match failed")

7. Matching a mobile phone number

import re
text = "13800138000"
pattern = r'^1[3-9]\d{9}$'
result = re.fullmatch(pattern, text)
if result:
    print("Match succeeded:", result.group())
else:
    print("Match failed")

8. Matching a date format (YYYY-MM-DD)

import re
text = "2025-03-11"
pattern = r'^\d{4}-\d{2}-\d{2}$'
result = re.fullmatch(pattern, text)
if result:
    print("Match succeeded:", result.group())
else:
    print("Match failed")
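The date pattern checks only the shape of the text, so a string such as `2025-13-45` also matches. One possible way to reject impossible dates is to combine the regex with `datetime.strptime` from the standard library. This is a sketch of that idea rather than part of the original example, and `is_valid_date` is a hypothetical helper name.

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_date(text):
    """Return True if text has the form YYYY-MM-DD and is a real calendar date."""
    # Step 1: the regex enforces the exact shape (4-2-2 digits).
    if DATE_SHAPE.fullmatch(text) is None:
        return False
    # Step 2: strptime rejects impossible months and days.
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True


for candidate in ["2025-03-11", "2025-13-45", "2025-02-30", "2025-3-11"]:
    print(candidate, is_valid_date(candidate))
```

The regex step is still useful here because `strptime` on its own accepts unpadded values such as `2025-3-11`.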

9. Replacing all digits with a given character

import re
text = "abc123def456"
pattern = r'\d+'
replacement = 'X'
result = re.sub(pattern, replacement, text)
print("Result:", result)  # Result: abcXdefX

10. Splitting a string

import re
text = "apple,banana,cherry"
pattern = r','
result = re.split(pattern, text)
print("Parts:", result)  # Parts: ['apple', 'banana', 'cherry']

11. Extracting the content of an HTML tag

import re
html = '<p>Hello, World!</p>'
pattern = r'<p>(.*?)</p>'
result = re.findall(pattern, html)
print("Extracted:", result)  # Extracted: ['Hello, World!']

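12. Matching Chinese characters

import re
text = "你好，世界！"
pattern = r'[\u4e00-\u9fa5]+'
result = re.findall(pattern, text)
print("Matches:", result)  # Matches: ['你好', '世界']

13. Matching any one of several words

import re
text = "cat dog elephant"
pattern = r'cat|dog'
result = re.findall(pattern, text)
print("Matches:", result)  # Matches: ['cat', 'dog']

14. Matching repeated characters

import re
text = "aaaaabbbccc"
pattern = r'(.)\1+'
# With one capturing group, findall returns the group, not the whole run
result = re.findall(pattern, text)
print("Matches:", result)  # Matches: ['a', 'b', 'c']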
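Because the pattern contains a capturing group, `re.findall()` returns only the captured character for each run. If the full run is needed, `re.finditer()` gives access to `group(0)`. A minimal sketch using the same input:

```python
import re

text = "aaaaabbbccc"

# findall returns only the captured group: one character per run.
chars = re.findall(r"(.)\1+", text)

# finditer exposes group(0), the whole run, alongside the captured character.
runs = [(m.group(1), len(m.group(0))) for m in re.finditer(r"(.)\1+", text)]

print(chars)
print(runs)
```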

15. Matching text that excludes specific characters

import re
text = "abcde"
pattern = r'[^abc]+'
result = re.findall(pattern, text)
print("Matches:", result)  # Matches: ['de']

16. Matching word boundaries

import re
text = "The quick brown fox jumps"
pattern = r'\bfox\b'
result = re.search(pattern, text)
if result:
    print("Match succeeded:", result.group())
else:
    print("Match failed")

17. Matching an IP address

import re
text = "192.168.1.1"
pattern = r'^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$'
result = re.fullmatch(pattern, text)
if result:
    print("Match succeeded:", result.group())
else:
    print("Match failed")
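As an alternative to the long regex, the standard-library `ipaddress` module can perform the validation. This swaps in a different technique and is not the approach of the example above; `is_ipv4` is a hypothetical helper name.

```python
import ipaddress


def is_ipv4(text):
    """Return True if text parses as an IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True


for candidate in ["192.168.1.1", "256.1.1.1", "1.2.3"]:
    print(candidate, is_ipv4(candidate))
```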

18. Matching a URL

import re
text = "https://www.example.com"
# Covers the scheme and host only; a URL with a path will not pass fullmatch
pattern = r'^https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
result = re.fullmatch(pattern, text)
if result:
    print("Match succeeded:", result.group())
else:
    print("Match failed")

19. Counting matches

import re
text = "apple apple banana cherry apple"
pattern = r'apple'
matches = re.findall(pattern, text)
count = len(matches)
print("Number of matches:", count)  # Number of matches: 3

20. Matching with a compiled regular expression

import re
text = "abc123def"
pattern = re.compile(r'\d+')
result = pattern.findall(text)
print("Matches:", result)  # Matches: ['123']
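A compiled pattern pays off mainly when the same pattern is applied many times. The sketch below reuses one pattern object across several invented input lines, and also shows the optional `pos` argument that the pattern object's `search()` method accepts.

```python
import re

NUMBER = re.compile(r"\d+")

lines = ["abc123def", "no digits here", "7 and 42"]  # illustrative input

# The same pattern object is reused for every line.
found = [NUMBER.findall(line) for line in lines]
total = sum(int(n) for numbers in found for n in numbers)

# Pattern.search(string, pos) starts scanning at index pos.
later = NUMBER.search("abc123def", 4)

print(found)
print(total)
print(later.group())
```

The module-level functions such as `re.findall()` keep an internal cache of recently compiled patterns, so for a handful of patterns the speed difference tends to be small. Compiling mostly helps readability and gives access to method parameters such as `pos`.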