Python: Regular Expressions

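Regular expressions are a powerful tool for processing text, and Python provides full regular expression support through the re module. This article covers how to use regular expressions in Python, including basic syntax, advanced techniques, and a detailed walk through the re module API.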

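1. Regular Expression Basics

1.1 What Is a Regular Expression

A regular expression (regex) is a pattern that describes combinations of characters in a string. It can be used to search, replace, and validate text.

1.2 The re Module in Python

Python supports regular expressions through the built-in re module: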

import re

2. Basic Regular Expression Syntax

2.1 Ordinary Characters

Most letters and characters simply match themselves:

pattern = r"hello"
text = "hello world"
match = re.search(pattern, text)
if match:
    print("Match found:", match.group())  # Output: Match found: hello

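2.2 Metacharacters

Characters that have a special meaning in a regular expression:

.       matches any single character except a newline
^       matches the start of the string
$       matches the end of the string (or just before a trailing newline)
*       matches the preceding subexpression zero or more times
+       matches the preceding subexpression one or more times
?       matches the preceding subexpression zero or one time
{m,n}   matches the preceding subexpression from m to n times
[]      character set, matches any one of the characters inside
|       alternation, matches the expression on the left or on the right
()      grouping, marks the start and end of a subexpression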
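The metacharacters above can be tried one at a time. The following is a small sketch; the sample strings are made up for illustration, and each entry uses re.findall so the effect of a single metacharacter is easy to see.

```python
import re

# Each key names the metacharacter being illustrated.
results = {
    ".": re.findall(r"c.t", "cat cot c\nt"),        # "." does not match the newline
    "^": re.findall(r"^\w+", "first second"),        # anchored at the start
    "$": re.findall(r"\w+$", "first second"),        # anchored at the end
    "*": re.findall(r"ab*", "a ab abbb"),            # zero or more "b"
    "+": re.findall(r"ab+", "a ab abbb"),            # one or more "b"
    "?": re.findall(r"colou?r", "color colour"),     # optional "u"
    "{m,n}": re.findall(r"\d{2,3}", "1 22 333 4444"),  # two or three digits
    "[]": re.findall(r"[aeiou]", "regex"),           # any one vowel
    "|": re.findall(r"cat|dog", "cat dog bird"),     # either alternative
    "()": re.findall(r"(ab)+", "ababab"),            # findall returns the group
}

for symbol, found in results.items():
    print(symbol, found)
```

Note that when a pattern contains a group, re.findall returns the text captured by the group rather than the whole match, which is why the last entry yields a single "ab".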

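2.3 Character Classes

\d      matches any digit; equivalent to [0-9] when re.ASCII is set (for str patterns it also matches other Unicode decimal digits)
\D      matches any non-digit character
\s      matches any whitespace character (space, tab, newline, and so on)
\S      matches any non-whitespace character
\w      matches any word character; equivalent to [a-zA-Z0-9_] when re.ASCII is set (for str patterns it also matches Unicode letters and digits)
\W      matches any non-word character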
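For str patterns in Python 3, \d and \w are Unicode-aware by default. The sketch below shows the difference that the re.ASCII flag makes; the sample text is an assumption chosen to contain an accented letter and a full-width digit.

```python
import re

# "caf\u00e9" is "café"; "\uff13" is the full-width digit three.
text = "caf\u00e9 \uff13 x_1"

unicode_words = re.findall(r"\w+", text)             # Unicode-aware (default for str)
ascii_words = re.findall(r"\w+", text, re.ASCII)     # restricted to [a-zA-Z0-9_]

unicode_digits = re.findall(r"\d", text)             # includes the full-width digit
ascii_digits = re.findall(r"\d", text, re.ASCII)     # restricted to [0-9]

print(unicode_words, ascii_words)
print(unicode_digits, ascii_digits)
```

If input must consist of ASCII digits only (for example, when validating an ID), passing re.ASCII or writing [0-9] explicitly avoids surprises.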

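3. The re Module API in Detail

3.1 re.compile(pattern, flags=0)

Compiles a regular expression pattern and returns a regular expression object.

Parameters:

pattern: the regular expression string to compile
flags: optional flags that modify how the pattern matches

Commonly used flags:

re.IGNORECASE or re.I: case-insensitive matching
re.MULTILINE or re.M: multi-line mode, changes the behavior of ^ and $ so that they also match at the start and end of each line
re.DOTALL or re.S: makes . match every character, including a newline

Example:

# Compile a regular expression object
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
# Use the compiled object for matching
text = "My phone number is 123-456-7890"
match = pattern.search(text)
if match:
    print("Phone number found:", match.group())  # Output: Phone number found: 123-456-7890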
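To make the flags listed above concrete, here is a short sketch that applies each of them to the same text. The log-like sample text is invented for illustration; flags can be combined with the | operator.

```python
import re

text = "Error: disk full\nerror: retry\nDone"

# re.IGNORECASE: "error" matches regardless of case.
ignore_case = re.findall(r"error", text, re.IGNORECASE)

# re.MULTILINE: ^ matches at the start of every line, not only the string.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Without re.DOTALL, "." stops at the newline, so this finds nothing.
without_dotall = re.search(r"disk.*retry", text)

# With re.DOTALL, "." also matches the newline.
with_dotall = re.search(r"disk.*retry", text, re.DOTALL)

# Flags can be combined with "|".
combined = re.findall(r"^error", text, re.I | re.M)

print(ignore_case, line_starts, without_dotall, with_dotall.group(), combined)
```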

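3.2 re.search(pattern, string, flags=0)

Scans the whole string and returns the first successful match, or None if nothing matches.

Parameters:

pattern: the regular expression to match
string: the string to search
flags: optional flags

Example:

text = "Python is a popular programming language, and Python is easy to learn"
match = re.search(r'Python', text)
if match:
    print("Match found:", match.group())    # Output: Match found: Python
    print("Match position:", match.span())  # Output: Match position: (0, 6)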

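3.3 re.match(pattern, string, flags=0)

Tries to match the pattern at the start of the string. If the pattern does not match at the start, it returns None.

Difference from search:

match only matches at the beginning of the string
search looks for the first match anywhere in the string

Example:

text1 = "Python is great"
text2 = "Learning Python is great"
print(re.match(r'Python', text1))  # Returns a match object
print(re.match(r'Python', text2))  # Returns None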
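A related function that the text above does not cover is re.fullmatch, which requires the pattern to match the entire string. The sketch below contrasts the three functions on made-up input; it is meant as an illustration of the anchoring behavior, not a complete comparison.

```python
import re

pattern = r"\d{4}"

searched = re.search(pattern, "year 2023")     # found anywhere in the string
matched = re.match(pattern, "2023-05-15")      # only the start has to match
not_matched = re.match(pattern, "year 2023")   # start is "year", so no match
full = re.fullmatch(pattern, "2023-05-15")     # whole string must match: None
full_ok = re.fullmatch(pattern, "2023")        # whole string is four digits

print(searched.group(), matched.group(), not_matched, full, full_ok.group())
```

For validation tasks, re.fullmatch is usually the clearest choice because it needs no explicit ^ and $ anchors.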

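3.4 re.findall(pattern, string, flags=0)

Returns all non-overlapping matches of the pattern in the string as a list. The items are strings, or tuples of strings when the pattern contains more than one group.

Example:

text = "apples 10 yuan, bananas 5 yuan, oranges 8 yuan"
prices = re.findall(r'\d+ yuan', text)
print(prices)  # Output: ['10 yuan', '5 yuan', '8 yuan']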
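The shape of the list returned by re.findall depends on how many groups the pattern has. The following sketch, using invented sample text, shows the three cases side by side.

```python
import re

text = "apples 10 yuan, bananas 5 yuan"

# No groups: each item is the whole match.
whole = re.findall(r"\w+ \d+ yuan", text)

# One group: each item is the text captured by that group.
one_group = re.findall(r"\w+ (\d+) yuan", text)

# Several groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(\w+) (\d+) yuan", text)

print(whole)
print(one_group)
print(two_groups)
```

To group part of a pattern without changing the result shape, a non-capturing group (?:...) can be used instead.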

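3.5 re.finditer(pattern, string, flags=0)

Returns an iterator that yields a match object for every non-overlapping match.

Difference from findall:

findall returns a list of strings (or tuples when the pattern has groups)
finditer returns an iterator of match objects

Example:

text = "Python 3.8, Python 3.9, Python 3.10"
matches = re.finditer(r'Python \d+\.\d+', text)
for match in matches:
    print(f"Found: {match.group()} at position {match.span()}")
# Output:
# Found: Python 3.8 at position (0, 10)
# Found: Python 3.9 at position (12, 22)
# Found: Python 3.10 at position (24, 35)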

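3.6 re.sub(pattern, repl, string, count=0, flags=0)

Replaces matches of the pattern in the string.

Parameters:

pattern: the regular expression pattern
repl: the replacement, either a string or a function
string: the original string
count: maximum number of replacements; 0 means replace all
flags: optional flags

Example:

text = "Today is 2023-05-15 and tomorrow is 2023-05-16"
# Change the date format
new_text = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\3/\1', text)
print(new_text)  # Output: Today is 05/15/2023 and tomorrow is 05/16/2023

# Use a function as the replacement
def to_upper(match):
    return match.group().upper()

text = "hello world"
new_text = re.sub(r'\w+', to_upper, text)
print(new_text)  # Output: HELLO WORLD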
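Two details of re.sub are easy to miss: the count parameter and the \g<n> form of a group reference. The sketch below also uses re.subn, a sibling function that returns the number of replacements made; the sample strings are assumptions for illustration.

```python
import re

text = "a-b-c-d"

# count=1 replaces only the first match.
first_only = re.sub(r"-", "+", text, count=1)

# re.subn returns a (new_string, number_of_replacements) tuple.
replaced, n = re.subn(r"-", "+", text)

# \g<1> avoids ambiguity: r"\10" would be read as a reference to group 10.
appended_zero = re.sub(r"(\d+)", r"\g<1>0", "7 items")

print(first_only, replaced, n, appended_zero)
```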

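3.7 re.split(pattern, string, maxsplit=0, flags=0)

Splits the string at every match of the pattern and returns the pieces as a list.

Parameters:

pattern: the regular expression for the separator
string: the string to split
maxsplit: maximum number of splits; 0 means no limit
flags: optional flags

Example:

text = "apple,banana,,orange, watermelon"
# Split on commas, ignoring whitespace around them (empty items are kept)
items = re.split(r'\s*,\s*', text.strip())
print(items)  # Output: ['apple', 'banana', '', 'orange', 'watermelon']

# Use several separators
text = "apple banana,orange;watermelon"
items = re.split(r'[ ,;]', text)
print(items)  # Output: ['apple', 'banana', 'orange', 'watermelon']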
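The maxsplit parameter and capturing groups in the separator both change what re.split returns. Here is a brief sketch with made-up input; the list comprehension at the end is one simple way to drop empty items, not something re.split does by itself.

```python
import re

text = "a, b; c, d"

# maxsplit=2: at most two splits, the rest of the string stays intact.
limited = re.split(r"[,;]\s*", text, maxsplit=2)

# A capturing group in the pattern keeps the separators in the result.
with_seps = re.split(r"\s*([,;])\s*", text)

# Filter out empty strings produced by adjacent separators.
non_empty = [item for item in re.split(r"\s*,\s*", "apple,banana,,orange") if item]

print(limited)
print(with_seps)
print(non_empty)
```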

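4. Match Object Methods

When search() or match() succeeds, it returns a match object, which provides the following methods:

4.1 group([group1, ...])

Returns one or more subgroups of the match. With no argument, or with 0, it returns the whole match.

Example:

text = "John Doe, 30 years old"
match = re.search(r'(\w+) (\w+), (\d+) years old', text)
if match:
    print("Full match:", match.group(0))   # Output: Full match: John Doe, 30 years old
    print("First name:", match.group(1))   # Output: First name: John
    print("Last name:", match.group(2))    # Output: Last name: Doe
    print("Age:", match.group(3))          # Output: Age: 30
    print("All groups:", match.groups())   # Output: All groups: ('John', 'Doe', '30')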

4.2 groups(default=None)

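Returns a tuple containing all subgroups. Groups that did not take part in the match are set to default.

4.3 groupdict(default=None)

Returns a dictionary of all named subgroups, keyed by group name.

4.4 start([group]) and end([group])

Return the start and end index of the substring matched by the group (the whole match if no group is given).

4.5 span([group])

Returns a tuple (start, end) with the position of the substring matched by the group.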
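The methods in sections 4.2 to 4.5 can be seen together in one short sketch. The pattern and sample text below are assumptions for illustration; the optional tld group is left unmatched on purpose to show how default is used.

```python
import re

text = "contact: alice@example"
m = re.search(r"(?P<user>\w+)@(?P<host>\w+)(?:\.(?P<tld>\w+))?", text)

# groups(): the optional "tld" group did not participate, so it is None.
all_groups = m.groups()
with_default = m.groups(default="")

# groupdict(): named groups as a dictionary, with a custom default.
named = m.groupdict(default="n/a")

# start()/end()/span(): positions in the original string.
user_start, user_end = m.start("user"), m.end("user")
host_span = m.span("host")
whole_span = m.span()

print(all_groups, with_default, named)
print(text[user_start:user_end], host_span, whole_span)
```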

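5. Advanced Regular Expression Techniques

5.1 Non-Greedy Matching

By default, quantifiers such as * and + are greedy and match as many characters as possible. Appending ? makes them non-greedy, so they match as few characters as possible:

text = "<h1>Title</h1><p>Paragraph</p>"
# Greedy match
greedy = re.search(r'<.*>', text)
print(greedy.group())  # Output: <h1>Title</h1><p>Paragraph</p>

# Non-greedy match
non_greedy = re.search(r'<.*?>', text)
print(non_greedy.group())  # Output: <h1>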
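A non-greedy quantifier is not the only way to stop at the first closing delimiter. The sketch below compares it with a negated character class, and shows that {m,n} can be made non-greedy as well; the HTML snippet is sample data for illustration.

```python
import re

html = "<h1>Title</h1><p>Paragraph</p>"

# Non-greedy: stop at the first ">" after each "<".
lazy_tags = re.findall(r"<.*?>", html)

# Negated class: "anything except >" gives the same result here.
negated_tags = re.findall(r"<[^>]*>", html)

# {m,n} is greedy by default; {m,n}? takes as few repetitions as possible.
greedy_count = re.search(r"\d{2,4}", "123456").group()
lazy_count = re.search(r"\d{2,4}?", "123456").group()

print(lazy_tags, negated_tags, greedy_count, lazy_count)
```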

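5.2 Lookahead and Lookbehind Assertions

(?=...)     positive lookahead assertion
(?!...)     negative lookahead assertion
(?<=...)    positive lookbehind assertion
(?<!...)    negative lookbehind assertion

Example:

# Match numbers that are followed by " yuan"
text = "apples 10 yuan, bananas 5 yuan, oranges 8 pieces"
prices = re.findall(r'\d+(?= yuan)', text)
print(prices)  # Output: ['10', '5']

# Match numbers that are preceded by "price: "
text = "price: 100, quantity: 5"
numbers = re.findall(r'(?<=price: )\d+', text)
print(numbers)  # Output: ['100']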
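The negative forms work the same way but require that the surrounding text is absent. A minimal sketch with invented sample strings follows; the \b word boundaries keep the engine from matching only part of a number in order to satisfy the assertion.

```python
import re

text = "apples 10 yuan, bananas 5 yuan, oranges 8 pieces"

# Negative lookahead: whole numbers NOT followed by " yuan".
not_prices = re.findall(r"\b\d+\b(?! yuan)", text)

# Negative lookbehind: whole numbers NOT preceded by "$".
not_dollars = re.findall(r"(?<!\$)\b\d+\b", "cost $30, qty 4")

print(not_prices, not_dollars)
```

Lookbehind patterns in the re module must match strings of a fixed length, so something like (?<=a+) is rejected when the pattern is compiled.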

5.3 Named Groups

Use the (?P<name>...) syntax to give a group a name:

text = "2023-05-15"
match = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', text)
if match:
    print(match.groupdict())  # Output: {'year': '2023', 'month': '05', 'day': '15'}
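Named groups can also be referenced later: with \g<name> in a replacement string and with (?P=name) inside the pattern itself. The sketch below illustrates both; the sample sentence is made up.

```python
import re

date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# \g<name> refers to a named group in the replacement string.
reordered = date_re.sub(r"\g<day>/\g<month>/\g<year>", "2023-05-15")

# A match object can be indexed by group name (Python 3.6+).
m = date_re.search("2023-05-15")
year = m["year"]

# (?P=name) is a backreference inside the pattern: find a repeated word.
doubled = re.search(r"\b(?P<word>\w+) (?P=word)\b", "this is is a test")

print(reordered, year, doubled.group("word"))
```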

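5.4 Conditional Matching

The syntax (?(id/name)yes-pattern|no-pattern) tries yes-pattern if the group with the given id or name took part in the match, and no-pattern otherwise:

# If the first group matched "Mr ", match "Smith"; otherwise match "Smithson"
text1 = "Mr Smith"
text2 = "Smithson"
pattern = r'(Mr )?(?(1)Smith|Smithson)'
print(re.match(pattern, text1))  # Matches "Mr Smith"
print(re.match(pattern, text2))  # Matches "Smithson"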
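A typical use of a conditional pattern is to require a closing delimiter only when the opening one is present. The sketch below follows the well-known e-mail example from the Python re documentation; the addresses are placeholders.

```python
import re

# If group 1 matched "<", a ">" must follow; otherwise the string must end.
pattern = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

bracketed = pattern.match("<user@host.com>")   # both brackets: match
plain = pattern.match("user@host.com")         # no brackets: match
missing_close = pattern.match("<user@host.com")  # "<" without ">": no match
stray_close = pattern.match("user@host.com>")    # ">" without "<": no match

print(bracketed.group(2), plain.group(2), missing_close, stray_close)
```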

6. Practical Examples

6.1 Validating an Email Address

def validate_email(email):
    # Simplified pattern; it does not cover every valid address form
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

print(validate_email("test@example.com"))  # True
print(validate_email("invalid.email@"))    # False

6.2 Extracting URL Information

def extract_url_info(url):
    pattern = r'(https?)://([^/]+)(/.*)?'
    match = re.match(pattern, url)
    if match:
        return {
            'protocol': match.group(1),
            'domain': match.group(2),
            'path': match.group(3) or '/',
        }
    return None

url_info = extract_url_info("https://www.example.com/path/to/page")
print(url_info)
# Output: {'protocol': 'https', 'domain': 'www.example.com', 'path': '/path/to/page'}

6.3 Log Analysis

log_line = '127.0.0.1 - - [10/May/2023:15:32:45 +0800] "GET /index.html HTTP/1.1" 200 1234'
pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d+) (\d+)'
match = re.match(pattern, log_line)
if match:
    log_data = {
        'ip': match.group(1),
        'time': match.group(2),
        'method': match.group(3),
        'path': match.group(4),
        'protocol': match.group(5),
        'status': int(match.group(6)),
        'size': int(match.group(7)),
    }
    print(log_data)
# Output: {'ip': '127.0.0.1', 'time': '10/May/2023:15:32:45 +0800',
#          'method': 'GET', 'path': '/index.html', 'protocol': 'HTTP/1.1',
#          'status': 200, 'size': 1234}
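The same log format can be parsed with named groups, which removes the need to count group numbers. This is a sketch of one possible variant; LOG_RE and parse_log_line are hypothetical names introduced here, and lines that do not match simply yield None.

```python
import re

# Hypothetical precompiled pattern with one named group per field.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d+) (?P<size>\d+)'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line has a different format."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    data = m.groupdict()
    data["status"] = int(data["status"])  # convert numeric fields
    data["size"] = int(data["size"])
    return data

sample = '127.0.0.1 - - [10/May/2023:15:32:45 +0800] "GET /index.html HTTP/1.1" 200 1234'
parsed = parse_log_line(sample)
unparsed = parse_log_line("not a log line")
print(parsed, unparsed)
```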

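7. Performance Tips

Precompile regular expressions: for a pattern that is used repeatedly, compile it once with re.compile(). The re module also caches recently used patterns, so the gain comes mainly from skipping the lookup and from clearer code.
Use non-greedy quantifiers where they fit: *?, +? and similar can stop scanning earlier, but they also change what is matched, so check the result rather than treating them as a drop-in speedup.
Avoid catastrophic backtracking: nested quantifiers such as (a+)+ can take exponential time on input that does not match, so keep patterns as simple as possible.
Use atomic groups: (?>...) prevents backtracking into the group. The re module supports this syntax from Python 3.11 onward.
Use character classes sensibly: [abc] is generally more efficient than (a|b|c).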
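Two of the tips above can be sketched in a few lines: compiling a pattern once at module level, and replacing a nested quantifier with a flat one. WORD_RE and count_words are hypothetical names, and the inputs are kept short on purpose, because the nested form can become very slow on long non-matching strings.

```python
import re

# Compiled once and reused for every line.
WORD_RE = re.compile(r"[A-Za-z]+")

def count_words(lines):
    """Count ASCII-letter words across an iterable of strings."""
    return sum(len(WORD_RE.findall(line)) for line in lines)

total = count_words(["hello world", "foo bar baz"])

# Nested quantifier (prone to backtracking) versus an equivalent flat pattern.
risky = re.compile(r"^(a+)+$")
safe = re.compile(r"^a+$")

inputs = ["a", "aaaa", "aaab", ""]
same_answers = all(bool(risky.match(s)) == bool(safe.match(s)) for s in inputs)

print(total, same_answers)
```

Both patterns accept exactly the same strings, so the flat version is a safe substitute in this case.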

8. Common Problems and Solutions

8.1 Matching Multi-Line Text

Use the re.MULTILINE flag:

text = """line one
line two
line three"""
matches = re.findall(r'^line \w+', text, re.MULTILINE)
print(matches)  # Output: ['line one', 'line two', 'line three']

8.2 Case-Insensitive Matching

Use the re.IGNORECASE flag:

text = "Python python PYTHON"
matches = re.findall(r'python', text, re.IGNORECASE)
print(matches)  # Output: ['Python', 'python', 'PYTHON']

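8.3 Matching Unicode Characters

Use \u or \x escapes, or include the Unicode characters directly in the pattern:

text = "中文Chinese にほんご"
matches = re.findall(r'[\u4e00-\u9fa5]+', text)  # Match Chinese characters (U+4E00 to U+9FA5)
print(matches)  # Output: ['中文']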
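The three ways of writing a Unicode character in a pattern can be compared directly. In this sketch the hiragana range U+3040 to U+309F and the sample strings are assumptions chosen for illustration.

```python
import re

text = "中文Chinese にほんご"

# \u escape: a range covering the hiragana block.
hiragana = re.findall(r"[\u3040-\u309f]+", text)

# Literal characters placed directly in the pattern.
direct = re.findall("[中文]+", text)

# \x escape: U+00E9 is "é".
accented = re.findall(r"\xe9", "caf\u00e9")

print(hiragana, direct, accented)
```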

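9. Summary

Python's regular expressions are powerful and flexible, and the re module offers a rich API for a wide range of text-matching tasks. Mastering regular expressions makes text processing considerably more efficient. Keep in mind:

A complex regular expression can first be broken down into several simple parts
The re.VERBOSE flag makes complex regular expressions easier to read
Online tools such as regex101.com are useful for testing regular expressions
Very complex text processing may need to be combined with other methods (such as a parser)

I hope this post helps you get comfortable with regular expressions in Python!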
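As an illustration of the re.VERBOSE tip, the phone number pattern can be written across several lines with comments. This is a minimal sketch; PHONE_RE is a name introduced here. In verbose mode, whitespace in the pattern is ignored and # starts a comment, so a literal space or # has to be escaped.

```python
import re

PHONE_RE = re.compile(
    r"""
    (\d{3})   # area code
    -         # separator
    (\d{3})   # prefix
    -         # separator
    (\d{4})   # line number
    """,
    re.VERBOSE,
)

m = PHONE_RE.search("My phone number is 123-456-7890")
parts = m.groups()
print(parts)
```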