【362】Python Regular Expressions

Reference: Regular Expressions - 廖雪峰 (Liao Xuefeng)

Reference: Python3 Regular Expressions - 菜鳥教程 (Runoob)

Reference: Regular Expressions - Tutorial

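re.match tries to match a pattern at the start of the string; if the pattern does not match at the start, match() returns None.

re.search scans the whole string and returns the first successful match.

span(): returns the (start, end) index range of the match
group(): returns the matched text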
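As a small illustration of span() and group(), the sketch below searches an illustrative sample string for a run of digits and reads both values off the match object. The variable name m is only for this example.

```python
import re

# Illustrative sample string: one run of digits between two letters
m = re.search(r'\d+', 'a123456b')

print(m.group())           # 123456
print(m.span())            # (1, 7)
print(m.start(), m.end())  # 1 7
```

The span is a half-open interval, so 'a123456b'[1:7] gives back the matched text.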

re.sub replaces the matches of a pattern in a string.
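A minimal sketch of re.sub, assuming a made-up phone string with a trailing comment: the first call strips the comment, the second keeps only the digits.

```python
import re

phone = '2004-959-559 # a sample number'   # made-up input

# Drop the trailing comment, including the spaces in front of it
no_comment = re.sub(r'\s*#.*$', '', phone)
# Drop every character that is not a digit
digits = re.sub(r'\D', '', no_comment)

print(no_comment)  # 2004-959-559
print(digits)      # 2004959559
```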

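re.match only matches at the beginning of the string: if the beginning of the string does not fit the regular expression, the match fails and the function returns None. re.search, by contrast, looks through the whole string until it finds a match.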
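The difference can be seen on one short string. This is only a sketch; the names at_start and anywhere are chosen for the example.

```python
import re

text = 'abc123'

# re.match only tries at index 0, where 'a' is not a digit
at_start = re.match(r'\d+', text)
# re.search moves forward and finds the digits at index 3
anywhere = re.search(r'\d+', text)

print(at_start)          # None
print(anywhere.group())  # 123
```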

The compile function compiles a regular expression into a pattern object (re.Pattern), whose methods such as match() and search() can then be used.
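A short sketch of compiling once and reusing the pattern object; the name number and the sample strings are assumptions made for the example.

```python
import re

number = re.compile(r'\d+')   # compiled once, reused below

first = number.match('123abc').group()    # match at the start
later = number.search('abc456').group()   # first match anywhere
every = number.findall('1 22 333')        # all matches

print(first, later, every)  # 123 456 ['1', '22', '333']
```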

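findall finds all substrings of the string that match the regular expression and returns them as a list; if there is no match, it returns an empty list.

Note: match and search stop at one match, while findall returns all matches.

finditer is similar to findall: it finds all substrings that match the regular expression, but returns them as an iterator of match objects.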
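The following sketch puts findall and finditer side by side on an illustrative string. It also shows that findall returns the group contents, not the whole match, when the pattern contains groups.

```python
import re

text = 'a1b22c333'

# findall gives a list of strings
all_numbers = re.findall(r'\d+', text)
# finditer gives match objects, so positions are available too
with_spans = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]
# With groups in the pattern, findall returns tuples of group contents
pairs = re.findall(r'(\w)=(\d)', 'a=1, b=2')

print(all_numbers)  # ['1', '22', '333']
print(with_spans)   # [('1', (1, 2)), ('22', (3, 5)), ('333', (6, 9))]
print(pairs)        # [('a', '1'), ('b', '2')]
```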

The split method splits the string at every substring that matches the pattern and returns the pieces as a list.
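A minimal sketch of re.split with a made-up input: the separator is any run of whitespace, commas or semicolons. The second call shows that a capturing group keeps the separators in the result.

```python
import re

# Split on runs of whitespace, commas or semicolons
parts = re.split(r'[\s,;]+', 'a,b;; c  d')
# A capturing group around the separator keeps it in the list
kept = re.split(r'(,)', 'a,b')

print(parts)  # ['a', 'b', 'c', 'd']
print(kept)   # ['a', ',', 'b']
```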

\d matches any single digit, while \D matches any nondigit.

\w matches any character that can be part of a word (Python identifier), that is, a letter, the underscore or a digit, while \W matches any other character.

\s matches any whitespace character (a space, a tab, a newline and so on), while \S matches any non-whitespace character.

. matches any single character.
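The character classes above can be tried out on one illustrative string. This is only a sketch; the sample text is made up so that each class has something to match.

```python
import re

text = 'ab_1 2\tc-d'   # letters, underscore, digits, a space, a tab, a hyphen

digits = re.findall(r'\d', text)      # single digits
words = re.findall(r'\w+', text)      # runs of letters, digits, underscores
spaces = re.findall(r'\s', text)      # whitespace characters
others = re.findall(r'\W', text)      # everything that is not \w
dotted = re.findall(r'py.', 'pyc pyo py!')   # . stands for any one character

print(digits)  # ['1', '2']
print(words)   # ['ab_1', '2', 'c', 'd']
print(spaces)  # [' ', '\t']
print(others)  # [' ', '\t', '-']
print(dotted)  # ['pyc', 'pyo', 'py!']
```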

* means zero or more repetitions (>= 0) of the preceding character, or of the preceding parenthesized group.

# The leftmost match wins, here an empty match of 0 characters
>>> re.search(r'\d*', 'a123456b')
<re.Match object; span=(0, 0), match=''>
# Matches as many digits as possible
>>> re.search(r'\d\d\d*', 'a123456b')
<re.Match object; span=(1, 7), match='123456'>
>>> re.search(r'\d\d*', 'a123456b')
<re.Match object; span=(1, 7), match='123456'>
# One digit followed by whole pairs of digits
>>> re.search(r'\d(\d\d)*', 'a123456b')
<re.Match object; span=(1, 6), match='12345'>

+ means one or more repetitions (>= 1) of the preceding character, or of the preceding parenthesized group.

>>> re.search(r'.\d+', 'a123456b')
<re.Match object; span=(0, 7), match='a123456'>
>>> re.search(r'(.\d)+', 'a123456b')
<re.Match object; span=(0, 6), match='a12345'>

? means zero or one occurrence of the preceding character, or of the preceding parenthesized group.

>>> re.search(r'\s(\d\d)?\s', 'a 12 b')
<re.Match object; span=(1, 5), match=' 12 '>
>>> re.search(r'\s(\d\d)?\s', 'a  b')
<re.Match object; span=(1, 3), match='  '>
>>> re.search(r'\s(\d\d)?\s', 'a 1 b')
# No output: the match failed and None was returned

[] matches any one of the enclosed characters. Characters that normally have to be escaped do not need escaping inside the brackets; for example, [.] matches a literal dot.

>>> re.search('[.]', 'abcabc.123456.defdef')
<re.Match object; span=(6, 7), match='.'>
>>> # Each repetition matches any one of the characters inside the brackets
>>> re.search('[cba]+', 'abcabc.123456.defdef')
<re.Match object; span=(0, 6), match='abcabc'>
>>> re.search(r'.[\d]*', 'abcabc.123456.defdef')
<re.Match object; span=(0, 1), match='a'>
>>> re.search(r'\.[\d]*', 'abcabc.123456.defdef')
<re.Match object; span=(6, 13), match='.123456'>
>>> re.search(r'[.\d]+', 'abcabc.123456.defdef')
<re.Match object; span=(6, 14), match='.123456.'>

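{n} means exactly n repetitions of the preceding character or group;

{n,m} means from n to m repetitions of the preceding character or group;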
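A quick sketch of the two counted quantifiers on an illustrative string of six digits. Note that {n,m} is greedy and takes as many repetitions as it can.

```python
import re

text = 'a123456b'

exactly_three = re.search(r'\d{3}', text).group()    # first three digits
two_to_four = re.search(r'\d{2,4}', text).group()    # greedy: takes four
chunks = re.findall(r'\d{2,4}', text)                # four digits, then the rest
too_many = re.search(r'\d{7}', text)                 # only six digits available

print(exactly_three)  # 123
print(two_to_four)    # 1234
print(chunks)         # ['1234', '56']
print(too_many)       # None
```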

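[0-9a-zA-Z_] matches one digit, letter or underscore;

[0-9a-zA-Z_]+ matches a string made of at least one digit, letter or underscore, such as 'a100', '0_Z', 'Py3000' and so on;

[a-zA-Z_][0-9a-zA-Z_]* matches a letter or underscore followed by any number of digits, letters or underscores, that is, a legal Python variable name;

[a-zA-Z_][0-9a-zA-Z_]{0,19} limits the length of the variable name more precisely to 1-20 characters (1 leading character + at most 19 following characters). There must be no space inside the braces: Python treats {0, 19} as literal text, not as a quantifier.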
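The length-limited variable-name pattern can be checked with fullmatch, which requires the whole string to fit. The helper name is_short_identifier is hypothetical and only serves this sketch.

```python
import re

# 1 leading letter or underscore + at most 19 further characters
identifier = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]{0,19}')

def is_short_identifier(name):
    """Return True if name is a variable name of 1 to 20 characters."""
    return identifier.fullmatch(name) is not None

print(is_short_identifier('Py3000'))   # True
print(is_short_identifier('_x'))       # True
print(is_short_identifier('0_Z'))      # False: starts with a digit
print(is_short_identifier('a' * 21))   # False: 21 characters
```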

Inside [], - denotes a range; a hyphen right next to one of the brackets is treated as a literal hyphen.
Ranges of letters or digits can be provided within square brackets, letting a hyphen separate the first and last characters in the range. A hyphen placed after the opening square bracket or before the closing square bracket is interpreted as a literal character:

>>> re.search('[e-h]+', 'ahgfea')
<re.Match object; span=(1, 5), match='hgfe'>
>>> re.search('[B-D]+', 'ABCBDA')
<re.Match object; span=(1, 5), match='BCBD'>
>>> re.search('[4-7]+', '154465571')
<re.Match object; span=(1, 8), match='5446557'>
>>> re.search('[-e-gb]+', 'a--bg--fbe--z')
<re.Match object; span=(1, 12), match='--bg--fbe--'>
>>> re.search('[73-5-]+', '14-34-576')
<re.Match object; span=(1, 8), match='4-34-57'>

Inside [], a leading ^ means any character other than the ones that follow.

Within square brackets, a caret placed right after the opening square bracket excludes the characters that follow within the brackets:

>>> re.search('[^4-60]+', '0172853')
<re.Match object; span=(1, 5), match='1728'>
>>> re.search('[^-u-w]+', '-stv')
<re.Match object; span=(1, 3), match='st'>

A|B matches A or B, so (P|p)ython matches 'Python' or 'python'.

Whereas square brackets surround alternative characters, a vertical bar separates alternative patterns:

>>> re.search('two|three|four', 'one three two')
<re.Match object; span=(4, 9), match='three'>
>>> re.search('|two|three|four', 'one three two')
<re.Match object; span=(0, 0), match=''>
>>> re.search('[1-3]+|[4-6]+', '01234567')
<re.Match object; span=(1, 4), match='123'>
>>> re.search('([1-3]|[4-6])+', '01234567')
<re.Match object; span=(1, 7), match='123456'>
>>> re.search(r'_\d+|[a-z]+_', '_abc_def_234_')
<re.Match object; span=(1, 5), match='abc_'>
>>> re.search(r'_(\d+|[a-z]+)_', '_abc_def_234_')
<re.Match object; span=(0, 5), match='_abc_'>

^ matches the start of the string (the start of each line with re.MULTILINE); ^\d means the string must start with a digit.

$ matches the end of the string (the end of each line with re.MULTILINE); \d$ means the string must end with a digit.

A caret at the beginning of the pattern string matches the beginning of the data string; a dollar at the end of the pattern string matches the end of the data string:

>>> re.search(r'\d*', 'abc')
<re.Match object; span=(0, 0), match=''>
>>> re.search(r'^\d*', 'abc')
<re.Match object; span=(0, 0), match=''>
>>> re.search(r'\d*$', 'abc')
<re.Match object; span=(3, 3), match=''>
>>> re.search(r'^\d*$', 'abc')
>>> re.search(r'^\s*\d*\s*$', ' 345 ')
<re.Match object; span=(0, 5), match=' 345 '>

Inside a character class, a caret that does not come right after the opening bracket is an ordinary character, and a dollar is always ordinary there. Elsewhere in a pattern, ^ and $ keep their special meaning and need a backslash in front of them to be matched literally.

Escaping a dollar at the end of the pattern string, or escaping a caret at the beginning of the pattern string or after the opening square bracket of a character class, makes dollar and caret lose the special meaning they have in those contexts and lets them be treated as literal characters:

>>> re.search(r'\$', '$*')
<re.Match object; span=(0, 1), match='$'>
>>> re.search(r'\^', '*^')
<re.Match object; span=(1, 2), match='^'>
>>> re.search(r'[\^]', '^*')
<re.Match object; span=(0, 1), match='^'>
>>> re.search('[^^]', '^*')
<re.Match object; span=(1, 2), match='*'>

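^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be extracted directly from the matched string:

group(0): always the whole matched text;
group(1): the first group;
group(2): the second group, and so on.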
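A minimal sketch of the phone-number pattern in use; the number '010-12345' is a made-up sample, and the names area and local are chosen for the example.

```python
import re

m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')

whole = m.group(0)   # the whole matched text
area = m.group(1)    # first group: area code
local = m.group(2)   # second group: local number

print(whole)       # 010-12345
print(area)        # 010
print(local)       # 12345
print(m.groups())  # ('010', '12345')
```

A string such as '010 12345' does not fit the pattern, so re.match returns None for it and there are no groups to read.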

Group order: groups are numbered by the position of their opening parenthesis, from left to right.

Parentheses allow matched parts to be saved. The object returned by re.search() has a group() method that, without argument, returns the whole match and, with arguments, returns partial matches; it also has a groups() method that returns all partial matches:

>>> R = re.search(r'((\d+) ((\d+) \d+)) (\d+ (\d+))', '  1 23 456 78 9 0 ')
>>> R
<re.Match object; span=(2, 15), match='1 23 456 78 9'>
>>> R.group()
'1 23 456 78 9'
>>> R.groups()
('1 23 456', '1', '23 456', '23', '78 9', '9')
>>> [R.group(i) for i in range(len(R.groups()) + 1)]
['1 23 456 78 9', '1 23 456', '1', '23 456', '23', '78 9', '9']

(?:...) groups alternatives without capturing: these parentheses are not counted as a group.

>>> R = re.search(r'([+-]?(?:0|[1-9]\d*)).*([+-]?(?:0|[1-9]\d*))', ' a = -3014, b = 0 ')
>>> R
<re.Match object; span=(5, 17), match='-3014, b = 0'>
>>> R.groups()
('-3014', '0')

.* matches any sequence of zero or more characters other than the newline character (\n).
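The sketch below shows where .* stops on a made-up two-line string, and how the re.DOTALL flag lets the dot cross the newline.

```python
import re

text = 'first line\nsecond line'

# By default . does not match '\n', so .* stops at the end of the first line
one_line = re.search(r'.*', text).group()
# With re.DOTALL the dot also matches '\n'
everything = re.search(r'.*', text, re.DOTALL).group()

print(repr(one_line))    # 'first line'
print(repr(everything))  # 'first line\nsecond line'
```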

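Pattern	Description
^	Matches the start of the string.
$	Matches the end of the string (or just before a newline at the end of the string).
.	Matches any character except a newline; when the re.DOTALL flag is given, it matches any character including a newline.
[...]	Denotes a set of characters, listed individually: [amk] matches 'a', 'm' or 'k'.
[^...]	Matches any character not in the set: [^abc] matches any character other than a, b and c.
re*	Matches 0 or more repetitions of the preceding expression.
re+	Matches 1 or more repetitions of the preceding expression.
re?	Matches 0 or 1 occurrence of the preceding expression (greedy; re?? is the non-greedy form).
re{n}	Matches exactly n repetitions of the preceding expression. For example, "o{2}" does not match the "o" in "Bob", but matches the two o's in "food".
re{n,}	Matches n or more repetitions of the preceding expression. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
re{n,m}	Matches from n to m repetitions of the preceding expression (greedy).
a|b	Matches a or b.
(re)	Matches the expression inside the parentheses and also defines a group.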
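(?imx)	Turns on the optional i, m or x flag. In Python's re module the flag applies to the whole pattern and should appear at the start of the pattern (required since Python 3.11).
(?-imx)	Turns off the optional i, m or x flag. Python's re module accepts only the scoped form (?-imx:re).
(?:re)	Like (...), but does not define a group.
(?imx:re)	Turns on the i, m or x flag inside the parentheses only (Python 3.6+).
(?-imx:re)	Turns off the i, m or x flag inside the parentheses only (Python 3.6+).
(?#...)	A comment.
(?=re)	Positive lookahead assertion. Succeeds if the contained expression matches at the current position, without consuming any of the string; the rest of the pattern is then tried from that same position.
(?!re)	Negative lookahead assertion. The opposite of the positive assertion: succeeds if the contained expression does not match at the current position.
(?>re)	Atomic group: matches re as an independent pattern, without backtracking into it (Python 3.11+).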
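\w	Matches a letter, digit or underscore.
\W	Matches any character that is not a letter, digit or underscore.
\s	Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v].
\S	Matches any non-whitespace character.
\d	Matches any digit; for ASCII text, equivalent to [0-9].
\D	Matches any nondigit.
\A	Matches the start of the string.
\Z	Matches the end of the string. In Python's re module it matches only at the very end; in Perl-style engines it also matches before a final newline.
\z	Matches the very end of the string in Perl-style engines. Older versions of Python's re module do not accept it; use \Z there.
\G	Matches the position where the previous match ended. Not supported by Python's re module.
\b	Matches a word boundary, that is, a position between a \w character and a \W character (or the start or end of the string). For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B	Matches a position that is not a word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\n, \t, etc.	Match a newline, a tab, etc.
\1...\9	Matches the contents of the nth group.
\10	Matches the contents of the 10th group. In Python's re module, a numeric escape is read as an octal character code only if it starts with 0 or has three octal digits.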
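A few rows of the table are easier to follow with concrete inputs. The sketch below tries word boundaries, the two lookahead assertions, a backreference and an inline flag. All sample strings and variable names are made up for illustration.

```python
import re

# \b and \B: 'er' at the end of a word versus inside a word
word_end = re.findall(r'\w*er\b', 'never verb')
inside_word = re.search(r'er\B', 'verb').span()

# (?=...): names that are followed by '.py', without including '.py'
py_names = re.findall(r'\w+(?=\.py\b)', 'a.py b.txt c.py')
# (?!...): numbers that are not followed by another digit or a percent sign
bare_numbers = re.findall(r'\d+(?!\d|%)', '10% 20 30%')

# \1: the same word twice in a row
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test').group()

# (?i) at the start of the pattern: case-insensitive matching
any_case = re.findall(r'(?i)python', 'Python PYTHON python')

print(word_end)      # ['never']
print(inside_word)   # (1, 3)
print(py_names)      # ['a', 'c']
print(bare_numbers)  # ['20']
print(doubled)       # is is
print(any_case)      # ['Python', 'PYTHON', 'python']
```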

Examples:

\d{3}: matches 3 digits

\s+: at least one whitespace character

\d{3,8}: 3 to 8 digits

>>> mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
>>> mySent.split(' ')
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon.']
>>> import re
>>> listOfTokens = re.split(r'\W+', mySent)
>>> listOfTokens
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']
>>> [tok for tok in listOfTokens if len(tok) > 0]
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
>>> [tok.lower() for tok in listOfTokens if len(tok) > 0]
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']
>>> [tok.lower() for tok in listOfTokens if len(tok) > 2]
['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']

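Reference: Python Crawler (5) -- Regular Expressions - 小學森也要學編程 - 博客園 (cnblogs)

Delete the content inside double quotes, quotes included; note that .* is used to match arbitrary text.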

import re

a = 'Sir Nina said: \"I am a Knight,\" but I am not sure'
b = "Sir Nina said: \"I am a Knight,\" but I am not sure"
print(re.sub(r'"(.*)"', '', a),
      re.sub(r'"(.*)"', '', b), sep='\n')

Output:
Sir Nina said:  but I am not sure
Sir Nina said:  but I am not sure
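One caveat worth a sketch: .* is greedy, so when a string holds several quoted parts, it runs from the first quote to the last one. The non-greedy form .*? stops at the nearest closing quote. The sentence below is a made-up example.

```python
import re

s = 'He said "yes" and then "no" again'

# Greedy: removes everything from the first quote to the last quote
greedy = re.sub(r'".*"', '', s)
# Non-greedy: removes each quoted part separately
lazy = re.sub(r'".*?"', '', s)

print(greedy)  # He said  again
print(lazy)    # He said  and then  again
```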

Example from Eric Martin's learning materials of COMP9021

The following function checks that its argument is a string:

1. that from the beginning: ^
2. consists of possibly some spaces: ' *' (a space followed by *)
3. followed by an opening parenthesis: \(
4. possibly followed by spaces: ' *'
5. possibly followed by either + or -: [+-]?
6. followed by either 0, or a nonzero digit followed by any sequence of digits: 0|[1-9]\d*
7. possibly followed by spaces: ' *'
8. followed by a comma: ,
9. followed by characters matching the pattern described by 4-7
10. followed by a closing parenthesis: \)
11. possibly followed by some spaces: ' *'
12. all the way to the end: $

Pairs of parentheses surround both numbers to capture them. For point 6, a surrounding pair of parentheses is needed; ?: makes it non-capturing:

>>> def validate_and_extract_payoffs(provided_input):
...     pattern = r'^ *\( *([+-]?(?:0|[1-9]\d*)) *,'\
...               r' *([+-]?(?:0|[1-9]\d*)) *\) *$'
...     match = re.search(pattern, provided_input)
...     if match:
...         return match.groups()
...
>>> validate_and_extract_payoffs('(+0, -7 )')
('+0', '-7')
>>> validate_and_extract_payoffs('  (-3014,0)  ')
('-3014', '0')
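As a possible variation, not part of the original course material, the same pattern can be compiled once and used with fullmatch, which makes the ^ and $ anchors unnecessary. The names PAIR and parse_payoffs are hypothetical, and converting the captured strings to int is an added step for this sketch.

```python
import re

# Same two-number pattern as above, without the ^ and $ anchors
PAIR = re.compile(r' *\( *([+-]?(?:0|[1-9]\d*)) *,'
                  r' *([+-]?(?:0|[1-9]\d*)) *\) *')

def parse_payoffs(provided_input):
    """Return the two numbers as a tuple of ints, or None if the input is malformed."""
    match = PAIR.fullmatch(provided_input)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())

print(parse_payoffs('(+0, -7 )'))      # (0, -7)
print(parse_payoffs('  (-3014,0)  '))  # (-3014, 0)
print(parse_payoffs('(01, 2)'))        # None: a leading zero is rejected
print(parse_payoffs('(1; 2)'))         # None: wrong separator
```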

Reposted from: https://www.cnblogs.com/alex-bn-lee/p/10325559.html