[Python Cookbook] S02E04 Matching and Searching for Text Patterns: match(), search(), findall(), Capture Groups, and the Meaning of +

Problem

This article discusses how to search for and match text according to a specific pattern.

Solution

If the text to match is a simple literal, the built-in string methods are usually enough, such as str.find(), str.startswith(), str.endswith(), or similar methods.

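text = "hello world"

# Exact comparison, prefix test, and suffix test each return a bool.
match_str1 = text == 'hello world'
match_str2 = text.startswith("hello")
match_str3 = text.endswith("world")

# find() returns the index of the first occurrence of the substring.
match_str4 = text.find("w")
match_str5 = text.find("wo")

print(match_str1, match_str2, match_str3, match_str4, match_str5)  # True True True 6 6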
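One detail of str.find() is easy to miss: when the substring is absent it returns -1 rather than False, so its result should not be used directly as a truth value. A minimal sketch follows; the substring "xyz" is only an illustrative value that does not occur in the text.

```python
text = "hello world"

# find() returns the index of the first occurrence, or -1 if absent.
found_at = text.find("wo")
missing_at = text.find("xyz")

# -1 is truthy, so use the `in` operator when only a yes/no answer is needed.
contains_wo = "wo" in text
contains_xyz = "xyz" in text

print(found_at, missing_at, contains_wo, contains_xyz)  # 6 -1 True False
```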

For more complex matching, we need regular expressions and the re module. This article focuses on three functions of the re module: match(), search(), and findall().

match()
Consider this: why does the same regular expression give different results on the following two strings?

import re

text_1 = "11/10/2023"
text_2 = "I just found my heart beat quickly from 11/10/2023, but I don't think that is love."

if re.match(r'\d+/\d+/\d+', text_1):
    print(True)
else:
    print(False)

if re.match(r'\d+/\d+/\d+', text_2):
    print(True)
else:
    print(False)

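The reason is that re.match() only tries to match at the beginning of the string. In text_1 the date appears at the very start, so the first check prints True; in text_2 the date is in the middle of the string, so the second check prints False.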
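A related point: match() anchors only the start of the string, not the end, so trailing text after the date is still accepted. The sketch below contrasts it with re.fullmatch(), which requires the whole string to fit the pattern. The sample strings are illustrative and not taken from the examples above.

```python
import re

date_pattern = r'\d+/\d+/\d+'

# match() anchors only at the start; text after the date is allowed.
prefix_match = re.match(date_pattern, "11/10/2023 and more text")

# fullmatch() succeeds only if the entire string fits the pattern.
whole_match = re.fullmatch(date_pattern, "11/10/2023 and more text")
exact_match = re.fullmatch(date_pattern, "11/10/2023")

print(prefix_match is not None, whole_match is not None, exact_match is not None)
# True False True
```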

To match a date at any position in the string, use re.search() instead.

search()

import re

message = "I just found my heart beat quickly from 11/10/2023, but I don't think that is love."
match = re.search(r'\d+/\d+/\d+', message)
if match:
    print("The message contains the value of date. And, the date is", match.group())
else:
    print("The message does not contain the value of date.")

Result:

The message contains the value of date. And, the date is 11/10/2023

In the code above, match.group() extracts the matched text from the match object returned by re.search().
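Besides group(), a match object also records where the match was found. The following sketch, reusing the same message, uses span() to recover the start and end offsets and checks that slicing the original string with them yields the same text.

```python
import re

message = "I just found my heart beat quickly from 11/10/2023, but I don't think that is love."
match = re.search(r'\d+/\d+/\d+', message)

if match:
    # span() returns (start, end); end is exclusive, as in slicing.
    start, end = match.span()
    matched_text = match.group()
    sliced_text = message[start:end]
    print(matched_text, sliced_text == matched_text)
```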

But what if a passage contains several dates? Can search() find all of them? If not, which function can?

findall()

import re

but = "I just found my heart beat quickly from 11/10/2023, but I don't think that is love. And now, 06/06/2024, I think it is time to put all down."
match_1 = re.search(r'\d+/\d+/\d+', but)
print("match_1:", match_1.group())
match_2 = re.findall(r'\d+/\d+/\d+', but)
print("match_2:", match_2)

Result:

match_1: 11/10/2023
match_2: ['11/10/2023', '06/06/2024']

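As the name suggests, findall() means "find all": it finds every substring that matches the regular expression and returns them as a list. search(), by contrast, stops at the first match.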

print(type(match_1))
print(type(match_2))

Result:

<class 're.Match'>
<class 'list'>

Clearly, search() returns a match object (re.Match), whereas findall() returns a list.
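When every match is needed as a match object rather than a plain string, re.finditer() is one option: it yields one re.Match per non-overlapping match. This function is not covered elsewhere in this article; the sketch below uses an illustrative sample string.

```python
import re

text = "Dates: 11/10/2023 and 06/06/2024."

# finditer() yields one re.Match object per non-overlapping match,
# so both the matched text and its position are available.
matches = list(re.finditer(r'\d+/\d+/\d+', text))
found = [(m.group(), m.start()) for m in matches]

print(found)  # [('11/10/2023', 7), ('06/06/2024', 22)]
```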

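Discussion

If we intend to match the same pattern many times, that is, to apply one regular expression to many strings, we can precompile the pattern into a pattern object with re.compile().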

import re

message_1 = "yesterday is 05/06/2024."
message_2 = "today is 06/06/2024."
message_3 = "tomorrow is 07/06/2024"

datepat = re.compile(r'\d+/\d+/\d+')
print(datepat.search(message_1).group())
print(datepat.search(message_2).group())
print(datepat.search(message_3).group())
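Calling .group() directly on the result of search() raises AttributeError when nothing matches, because search() returns None in that case. A minimal sketch of a guarded version follows; first_date is a hypothetical helper name introduced only for this illustration.

```python
import re

datepat = re.compile(r'\d+/\d+/\d+')

def first_date(text):
    """Return the first date-like substring in text, or None if there is none."""
    match = datepat.search(text)
    return match.group() if match else None

messages = ["yesterday is 05/06/2024.", "no date in this one"]
results = [first_date(m) for m in messages]

print(results)  # ['05/06/2024', None]
```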

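Next, have you ever wondered whether group() accepts arguments?

When defining a regular expression, we often introduce capture groups by enclosing parts of the pattern in parentheses. For example: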

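import re

message = "yesterday is 05/06/2024."
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# group() with no argument returns the entire match.
print(datepat.search(message).group())
# group(2) returns the text captured by the second pair of parentheses.
print(datepat.search(message).group(2))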

Result:

05/06/2024
06

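In the pattern, the three pairs of "()" define three capture groups. In this example, group(1) -> dd, group(2) -> mm, and group(3) -> yyyy, while group() with no argument returns the entire match.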
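Two related behaviors are worth seeing side by side: groups() returns all captured groups at once as a tuple, and findall() returns a list of tuples when the pattern contains capture groups. The sketch below assumes the same dd/mm/yyyy reading as above; the second date in the message is added only for illustration.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
message = "yesterday is 05/06/2024, today is 06/06/2024."

# groups() returns every captured group of one match as a tuple.
match = datepat.search(message)
day, month, year = match.groups()
print(day, month, year)  # 05 06 2024

# With capture groups in the pattern, findall() returns one tuple per match
# instead of the full matched string.
all_dates = datepat.findall(message)
print(all_dates)  # [('05', '06', '2024'), ('06', '06', '2024')]
```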

Finally, what does + mean in \d+?

import re

pattern_1 = re.compile(r'\d')
pattern_2 = re.compile(r'\d+')

message = "I am 25 years old this year"
print(re.search(pattern_1, message).group())  # 2
print(re.search(pattern_2, message).group())  # 25

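Comparing the results of pattern_1 and pattern_2 shows that in a regular expression, + means neither numeric addition nor string concatenation. It is a quantifier meaning "one or more" of the preceding element. Here \d+ matches one or more consecutive digits, so it matches 25, whereas pattern_1, without the +, matches only a single digit (2).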
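The effect of the quantifier is easiest to see with findall(), which lists every match. The sketch below compares \d, \d+, and the fixed-count form \d{2} on an illustrative sentence; the sentence and the \d{2} variant are additions for comparison only.

```python
import re

message = "I am 25 years old and my sister is 7"

# \d matches exactly one digit per match.
single_digits = re.findall(r'\d', message)

# \d+ matches a run of one or more consecutive digits.
digit_runs = re.findall(r'\d+', message)

# \d{2} matches exactly two consecutive digits.
two_digit_runs = re.findall(r'\d{2}', message)

print(single_digits)   # ['2', '5', '7']
print(digit_runs)      # ['25', '7']
print(two_digit_runs)  # ['25']
```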
