Mastering Regular Expressions in Python in One Article

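Contents

- 1. Regular Expression Basics
  - 1.1 Common Metacharacters
  - 1.2 Basic Usage
- 2. Advanced Regular Expression Features
  - 2.1 Capturing Groups
  - 2.2 Named Groups
  - 2.3 Non-Greedy Matching
  - 2.4 Zero-Width Assertions
  - 2.5 Compiling Regular Expressions
  - 2.6 Escaping Characters
- 3. Common Use Cases
  - 3.1 Validating Email Addresses
  - 3.2 Extracting URLs
  - 3.3 Extracting Dates
  - 3.4 Extracting Links from HTML
  - 3.5 Extracting Image Links from HTML
  - 3.6 Extracting a Field from JSON
- 4. Summary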

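In Python web scraping, regular expressions (regex for short) are a powerful tool for extracting, matching, and replacing specific string patterns in text. They help us pull the data we need out of HTML, JSON, and other text formats.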

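1. Regular Expression Basics

1.1 Common Metacharacters

A regular expression is a sequence of ordinary characters and special symbols that together define a matching pattern. The most commonly used metacharacters are: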

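. matches any character except a newline (\n), unless the re.DOTALL flag is set
^ matches the start of the string
$ matches the end of the string
* matches the preceding element 0 or more times
+ matches the preceding element 1 or more times
? matches the preceding element 0 or 1 time
{n} matches the preceding element exactly n times
{n,} matches the preceding element at least n times
{n,m} matches the preceding element at least n and at most m times
\d matches a digit (equivalent to [0-9] in ASCII mode; for str patterns it also matches other Unicode decimal digits by default)
\D matches any non-digit character
\w matches a word character: a letter, digit, or underscore (equivalent to [a-zA-Z0-9_] in ASCII mode; for str patterns it also matches Unicode letters and digits by default)
\W matches any non-word character
\s matches a whitespace character (space, tab, newline, and so on)
\S matches any non-whitespace character
[...] matches any one character listed inside the brackets
[^...] matches any one character not listed inside the brackets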
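To make the table concrete, here is a small sketch that exercises a few of these metacharacters. The sample strings and variable names are made up for illustration.

```python
import re

# {n,m} bounds how many times the preceding element may repeat.
bounded = re.findall(r'\d{2,3}', '7 42 123 4567')
print(bounded)  # ['42', '123', '456']

# ^ and $ pin the match to the start and end of the string.
digits_only = bool(re.search(r'^\d+$', '2024'))
digits_mixed = bool(re.search(r'^\d+$', '2024 AD'))
print(digits_only, digits_mixed)  # True False

# [...] lists allowed characters, [^...] excludes them.
vowels = re.findall(r'[aeiou]', 'regex')
non_vowels = re.findall(r'[^aeiou]', 'regex')
print(vowels)      # ['e', 'e']
print(non_vowels)  # ['r', 'g', 'x']

# ? makes the preceding element optional.
spellings = re.findall(r'colou?r', 'color colour')
print(spellings)  # ['color', 'colour']
```

Note that `\d{2,3}` takes only the first three digits of `4567`; the leftover `7` is too short to match on its own.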

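1.2 Basic Usage

Python supports regular expressions through the built-in re module. Its most commonly used functions are:

re.search(pattern, string): scans the string for the first location where the pattern matches and returns a match object, or None if nothing matches.
re.match(pattern, string): tries to match the pattern at the beginning of the string only and returns a match object, or None if the beginning does not match.
re.findall(pattern, string): returns a list of all non-overlapping substrings that match the pattern.
re.finditer(pattern, string): like re.findall, but returns an iterator of match objects.
re.sub(pattern, repl, string): replaces every substring that matches the pattern with repl and returns the resulting string.
re.split(pattern, string): splits the string wherever the pattern matches and returns a list of the pieces.
re.compile(pattern): compiles the pattern into a reusable pattern object.

The following example shows these functions in use: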

import re

# re.match(): match only at the start of the string
result = re.match(r'hello', 'hello world')
if result:
    print("Match found:", result.group())  # Output: Match found: hello
else:
    print("No match")

# re.search(): find the first match anywhere in the string
result = re.search(r'world', 'hello world')
if result:
    print("Match found:", result.group())  # Output: Match found: world
else:
    print("No match")

# re.findall(): return all matches as a list of strings
result = re.findall(r'\d+', '3 apples, 5 bananas, 10 cherries')
print(result)  # Output: ['3', '5', '10']

# re.finditer(): iterate over match objects
matches = re.finditer(r'\d+', '3 apples, 5 bananas, 10 cherries')
for match in matches:
    print(match.group())  # Output: 3, 5, 10 (one per line)

# re.sub(): replace every match
text = '3 apples, 5 bananas, 10 cherries'
result = re.sub(r'\d+', 'X', text)
print(result)  # Output: X apples, X bananas, X cherries

# re.split(): split on every match
result = re.split(r'\s+', 'hello   world')
print(result)  # Output: ['hello', 'world']
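The difference between re.match and re.search is easy to overlook, so here is a short side-by-side sketch on the same input. It also uses re.fullmatch, which is not in the list above but is part of the standard re module (Python 3.4+); it succeeds only when the pattern covers the entire string.

```python
import re

text = 'hello world'

# re.match only tries position 0, so 'world' is not found.
match_result = re.match(r'world', text)
print(match_result)  # None

# re.search scans forward until it finds a match.
search_span = re.search(r'world', text).span()
print(search_span)  # (6, 11)

# re.fullmatch requires the pattern to cover the whole string.
partial = re.fullmatch(r'hello', text)
whole = re.fullmatch(r'hello world', text)
print(partial)        # None
print(whole.group())  # hello world
```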

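2. Advanced Regular Expression Features

2.1 Capturing Groups

Wrapping part of a pattern in () creates a group. The text each group matched can be retrieved with the match object's group() method.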

import re

text = 'John: 30, Jane: 25'
result = re.search(r'(\w+): (\d+)', text)
if result:
    print("Name:", result.group(1))  # Output: Name: John
    print("Age:", result.group(2))   # Output: Age: 30
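The search above returns only the first name/age pair. As a small extension of the same example, the sketch below shows what re.findall returns when the pattern contains groups, along with group(0) and groups() on a match object. The `ages` dictionary is only an illustration of how the tuples might be used.

```python
import re

text = 'John: 30, Jane: 25'

# With two groups in the pattern, findall returns a list of tuples.
pairs = re.findall(r'(\w+): (\d+)', text)
print(pairs)  # [('John', '30'), ('Jane', '25')]

# group(0) is the whole match; groups() returns every captured group.
m = re.search(r'(\w+): (\d+)', text)
print(m.group(0))  # John: 30
print(m.groups())  # ('John', '30')

# Captured text is always a string, so convert explicitly when needed.
ages = {name: int(age) for name, age in pairs}
print(ages)  # {'John': 30, 'Jane': 25}
```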

2.2 Named Groups

Groups can be given names with the (?P<name>...) syntax, which makes them easier to refer to later.

import re

text = 'John: 30'
result = re.search(r'(?P<name>\w+): (?P<age>\d+)', text)
if result:
    print("Name:", result.group('name'))  # Output: Name: John
    print("Age:", result.group('age'))    # Output: Age: 30
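Named groups can also be read all at once or reused in a replacement. The following sketch uses groupdict() and the \g<name> replacement syntax, both standard re features; the output format chosen for `swapped` is arbitrary.

```python
import re

pattern = r'(?P<name>\w+): (?P<age>\d+)'

# groupdict() returns all named groups as a dictionary.
m = re.search(pattern, 'John: 30')
fields = m.groupdict()
print(fields)  # {'name': 'John', 'age': '30'}

# \g<name> refers to a named group inside a replacement string.
swapped = re.sub(pattern, r'\g<age> (\g<name>)', 'John: 30, Jane: 25')
print(swapped)  # 30 (John), 25 (Jane)
```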

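2.3 Non-Greedy Matching

Quantifiers are greedy by default: they match as many characters as possible. For example, .* consumes as much text as it can while still allowing the rest of the pattern to match. Appending ? to a quantifier, as in .*?, makes it non-greedy, so it matches as few characters as possible.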

import re

text = '<div>content1</div><div>content2</div>'
result = re.findall(r'<div>(.*?)</div>', text)
print(result)  # Output: ['content1', 'content2']
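For contrast, this sketch runs the greedy and non-greedy versions of the same pattern on the same text. The greedy .* runs to the last closing tag and swallows everything in between.

```python
import re

text = '<div>content1</div><div>content2</div>'

# Greedy: .* extends to the LAST </div> in the string.
greedy = re.findall(r'<div>(.*)</div>', text)
print(greedy)  # ['content1</div><div>content2']

# Non-greedy: .*? stops at the FIRST </div> it can.
lazy = re.findall(r'<div>(.*?)</div>', text)
print(lazy)  # ['content1', 'content2']
```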

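2.4 Zero-Width Assertions

A zero-width assertion (lookaround) constrains where a match may occur without consuming any characters.

Positive lookahead: (?=...) matches a position followed by ...
Negative lookahead: (?!...) matches a position not followed by ...
Positive lookbehind: (?<=...) matches a position preceded by ...
Negative lookbehind: (?<!...) matches a position not preceded by ...

In Python's re module, the pattern inside a lookbehind must match strings of a fixed length.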

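import re

# Match letters that are followed by a digit.
# [a-zA-Z]+ is used instead of \w+ because \w also matches digits,
# which would yield 'cherry1' for the last word.
result = re.findall(r'[a-zA-Z]+(?=\d)', 'apple3 banana5 cherry10')
print(result)  # Output: ['apple', 'banana', 'cherry']

# Match letters that are preceded by a digit.
# Again [a-zA-Z]+ avoids capturing the trailing digit of '10' ('0cherry').
result = re.findall(r'(?<=\d)[a-zA-Z]+', '3apple 5banana 10cherry')
print(result)  # Output: ['apple', 'banana', 'cherry']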
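The example above covers only the two positive forms. Here is a minimal sketch of the negative forms; the sample strings are invented for illustration.

```python
import re

# Negative lookahead: whole words that are NOT followed by a colon.
# The \b anchors stop the engine from backtracking to a partial word.
values = re.findall(r'\b\w+\b(?!:)', 'name: John age: 30')
print(values)  # ['John', '30']

# Negative lookbehind: numbers that are NOT preceded by '$'.
# Including \d in the lookbehind stops a match from starting mid-number.
quantities = re.findall(r'(?<![\d$])\d+', 'cost $10, qty 20')
print(quantities)  # ['20']
```

Without the extra guards (`\b` and `\d` inside the lookbehind), the engine would happily return fragments such as `nam` or `0`, a common pitfall with negative lookarounds.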

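2.5 Compiling Regular Expressions

If the same regular expression is used many times, it can be compiled once into a re.Pattern object and reused. This avoids looking the pattern up on every call and keeps the code tidier. The re module also caches recently used patterns internally, so the speed gain is usually modest.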

import re

pattern = re.compile(r'\d+')
result = pattern.findall('3 apples, 5 bananas, 10 cherries')
print(result)  # Output: ['3', '5', '10']
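A compiled pattern exposes the same operations as the module-level functions, as methods. The sketch below reuses one pattern object for several of them; the variable names are illustrative.

```python
import re

number = re.compile(r'\d+')
text = '3 apples, 5 bananas, 10 cherries'

first = number.search(text).group()
masked = number.sub('X', text)
total = sum(int(n) for n in number.findall(text))

print(first)           # 3
print(masked)          # X apples, X bananas, X cherries
print(total)           # 18
print(number.pattern)  # \d+
```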

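2.6 Escaping Characters

Some characters, such as ., *, and ?, have a special meaning in regular expressions. To match one of them literally, escape it with a backslash. For example, \. matches an actual . character.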
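A short sketch of why escaping matters: an unescaped dot matches any character, while an escaped dot matches only a literal dot. It also shows re.escape, a standard re function that escapes every special character in a string; the sample strings are made up.

```python
import re

text = 'version 3.11 or 3x11'

# Unescaped: . matches any character, so '3x11' also matches.
loose = re.findall(r'3.11', text)
print(loose)  # ['3.11', '3x11']

# Escaped: \. matches only a literal dot.
strict = re.findall(r'3\.11', text)
print(strict)  # ['3.11']

# re.escape builds a safe pattern from arbitrary literal text.
literal = 'a+b (c)?'
found = re.search(re.escape(literal), 'is a+b (c)? ok') is not None
print(found)  # True
```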

3. Common Use Cases

3.1 Validating Email Addresses

import re

def validate_email(email):
    # Simplified check: local part, "@", domain name, ".", domain suffix
    pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
    return re.match(pattern, email) is not None

print(validate_email('test@example.com'))  # Output: True
print(validate_email('invalid-email'))     # Output: False

3.2 Extracting URLs

import re

text = 'Visit https://www.example.com or http://example.org'
urls = re.findall(r'https?://[^\s]+', text)
print(urls)  # Output: ['https://www.example.com', 'http://example.org']

3.3 Extracting Dates

import re

text = 'Today is 2023-10-05, and tomorrow is 2023-10-06.'
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(dates)  # Output: ['2023-10-05', '2023-10-06']
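The pattern above checks only the shape of a date, so it would also accept impossible values such as 2023-13-40. One possible way to filter those out, sketched here as an assumption rather than as part of the original example, is to pass each candidate to datetime.strptime from the standard library.

```python
import re
from datetime import datetime

text = 'Valid: 2023-10-05. Invalid: 2023-13-40.'

# Step 1: the regex finds everything shaped like YYYY-MM-DD.
candidates = re.findall(r'\b\d{4}-\d{2}-\d{2}\b', text)
print(candidates)  # ['2023-10-05', '2023-13-40']

# Step 2: strptime rejects candidates that are not real calendar dates.
valid = []
for candidate in candidates:
    try:
        valid.append(datetime.strptime(candidate, '%Y-%m-%d').date())
    except ValueError:
        pass  # shaped like a date, but not a real one

print([d.isoformat() for d in valid])  # ['2023-10-05']
```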

3.4 Extracting Links from HTML

To extract every link from an HTML document, we can match the href attribute of its <a> tags with a regular expression.

import re

html = """
<a href="https://www.example.com">Example</a>
<a href="https://www.google.com">Google</a>
<a href="https://www.python.org">Python</a>
"""

# Pattern that captures the href attribute of an <a> tag
pattern = r'<a href="(.*?)">'

# Use re.findall to extract every matching link
links = re.findall(pattern, html)
print(links)
# Output: ['https://www.example.com', 'https://www.google.com', 'https://www.python.org']
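The pattern above assumes that href is the first and only attribute and that it uses double quotes. The sketch below is a somewhat more tolerant variant, offered as an illustration under those stated assumptions: it allows other attributes before href and accepts either quote style. The sample HTML is invented, and for real-world pages a proper HTML parser is generally more robust than any regex.

```python
import re

html = """
<a class="nav" href="https://www.example.com">Example</a>
<a href='https://www.python.org' target="_blank">Python</a>
<abbr title="not a link">n/a</abbr>
"""

# <a\s       : the tag name followed by whitespace (so <abbr> is skipped)
# [^>]*?     : any other attributes, staying inside the same tag
# href=["\'] : the attribute with either quote style
pattern = re.compile(r'<a\s[^>]*?href=["\']([^"\']*)["\']', re.IGNORECASE)

links = pattern.findall(html)
print(links)  # ['https://www.example.com', 'https://www.python.org']
```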

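3.5 Extracting Image Links from HTML

To extract every image link from an HTML document, we can match the src attribute of its <img> tags with a regular expression.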

import re

html = """
<img src="https://www.example.com/image1.jpg" alt="Image 1">
<img src="https://www.google.com/logo.png" alt="Google Logo">
<img src="https://www.python.org/python.png" alt="Python Logo">
"""

# Pattern that captures the src attribute of an <img> tag
pattern = r'<img src="(.*?)"'

# Use re.findall to extract every matching image link
image_links = re.findall(pattern, html)
print(image_links)
# Output: ['https://www.example.com/image1.jpg', 'https://www.google.com/logo.png', 'https://www.python.org/python.png']

3.6 Extracting a Field from JSON

Given a JSON string, a regular expression can be used to pull out the value of one particular field.

import re

json_data = '{"name": "Alice", "age": 25, "city": "New York"}'

# Pattern that captures the value of the "name" field
pattern = r'"name": "(.*?)"'

# Use re.search to extract the field value
match = re.search(pattern, json_data)
if match:
    print(match.group(1))  # Output: Alice
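The regex approach depends on the exact spacing after the colon and does not handle escaped quotes inside values. When the input is known to be valid JSON, the standard json module is a more robust alternative. The sketch below is shown for comparison only and is not the regex technique this section describes.

```python
import json

json_data = '{"name": "Alice", "age": 25, "city": "New York"}'

# json.loads parses the whole document into Python objects.
record = json.loads(json_data)
print(record['name'])  # Alice
print(record['age'])   # 25
```

A regex remains useful when the text is only JSON-like, for example a fragment embedded in a larger page that cannot be parsed as a whole.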

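4. Summary

Regular expressions are a powerful tool for text processing, and Python's re module provides a rich set of functions for working with them. With the basic syntax and the advanced features covered here, most everyday text extraction, validation, and replacement tasks become straightforward.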
