[Python] Extracting Numbers, Phone Numbers, Dates, and URLs from Text Strings

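This article covers ways to extract numbers, phone numbers, dates, and URLs from text strings.

Extracting numbers:

In Python, the regular expression \d matches a single digit character. For str patterns that means 0-9 and, unless the re.ASCII flag is set, any other Unicode decimal digit as well. To match a run of consecutive digits, use \d+.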

import re

def extract_digits(text):
    digit_list = re.findall(r'\d+', text)
    return [int(digit) for digit in digit_list]  # Convert each matched digit string to an integer

text = "I have 15 apples and 20 oranges. The price is $30."
print(extract_digits(text))  # [15, 20, 30]
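The \d+ pattern only captures unsigned integers, so "-2" comes back as 2 and "3.5" as two separate numbers. The following is a minimal sketch of one way to pick up signs and decimal fractions as well. The names NUMBER_PATTERN and extract_numbers are hypothetical and do not come from the original example, and the pattern deliberately ignores thousands separators and exponents.

```python
import re

# Hypothetical helper for illustration only.
# The lookbehind keeps the hyphen in "3-5" from being read as a minus sign
# and stops a match from starting in the middle of a word or after a dot.
NUMBER_PATTERN = re.compile(r'(?<![\w.])[-+]?\d+(?:\.\d+)?')

def extract_numbers(text):
    """Return every signed integer or decimal in text as a float."""
    return [float(match) for match in NUMBER_PATTERN.findall(text)]

sample = "Temperature fell from 3.5 to -2 degrees, a change of 5.5."
print(extract_numbers(sample))  # [3.5, -2.0, 5.5]
```

Returning floats keeps the result type uniform; if integers must stay integers, the conversion step is the place to change.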

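Extracting phone numbers:

Phone numbers come in many formats. The common ones are:

1. A leading "+" followed by the country code and the phone number, for example +8613812345678
2. No "+", just the country code and the phone number, for example 8613812345678
3. A mobile number in mainland China, usually 11 digits starting with 1, for example 13812345678
4. A landline number, possibly with an area code, for example 010-12345678 or 021 12345678

Below is a somewhat more involved example of a phone number extraction function: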

import re

def extract_phone_numbers(text):
    patterns = [
        r'\+\d{11,13}',        # "+", then a 1-3 digit country code and a 10-digit number
        r'86\d{11}',           # Country code 86 without "+", then an 11-digit number
        r'1\d{10}',            # 11-digit mobile number starting with 1
        r'\d{3}[-\s]\d{7,8}',  # 3-digit area code, hyphen or space, 7- or 8-digit number
        r'\d{4}[-\s]\d{7}',    # 4-digit area code, hyphen or space, 7-digit number
    ]
    # Search with one combined pattern so overlapping matches are not reported twice.
    # The lookarounds stop a match from starting or ending inside a longer digit run,
    # which previously produced fragments such as 86138123456.
    combined = re.compile(r'(?<![\d+])(?:' + '|'.join(patterns) + r')(?!\d)')
    return combined.findall(text)

text = "My phone number is +8613812345678. Another one is 010-12345678 and 15912345678"
print(extract_phone_numbers(text))  # ['+8613812345678', '010-12345678', '15912345678']
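Once a candidate string has been extracted, it can be useful to know which of the four formats listed earlier it belongs to. The sketch below is one possible way to do that with re.fullmatch, which only succeeds when the whole string fits the pattern. The labels, PHONE_FORMATS, and classify_phone_number are hypothetical names, and the second pattern assumes the country code is 86 as in the examples.

```python
import re

# Hypothetical labels and patterns mirroring the four formats in the list.
PHONE_FORMATS = [
    ("international with +", r'\+\d{11,13}'),
    ("international without +", r'86\d{11}'),  # assumes country code 86
    ("mobile", r'1\d{10}'),
    ("landline", r'\d{3,4}[-\s]\d{7,8}'),
]

def classify_phone_number(candidate):
    """Return the label of the first format matching the whole string, else None."""
    for label, pattern in PHONE_FORMATS:
        if re.fullmatch(pattern, candidate):
            return label
    return None

for number in ["+8613812345678", "8613812345678", "13812345678", "010-12345678", "12345"]:
    print(number, "->", classify_phone_number(number))
```

The order of the list matters: the first matching format wins, so more specific patterns should come before more general ones.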

Extracting dates:

Dates are written in many formats; common ones include YYYY-MM-DD, MM/DD/YYYY, and DD-MM-YYYY. Below is an example that handles several of these common formats:

from datetime import datetime
import re

def extract_dates(text):
    # Pair each regex with the strptime format that parses its matches.
    # Passing the regex itself to strptime always raises ValueError.
    date_patterns = [
        (r'(?<!\d)\d{4}-\d{2}-\d{2}(?!\d)', '%Y-%m-%d'),  # YYYY-MM-DD
        (r'(?<!\d)\d{2}/\d{2}/\d{4}(?!\d)', '%m/%d/%Y'),  # MM/DD/YYYY
        (r'(?<!\d)\d{2}-\d{2}-\d{4}(?!\d)', '%d-%m-%Y'),  # DD-MM-YYYY
    ]
    dates = []
    for pattern, date_format in date_patterns:
        for date_str in re.findall(pattern, text):
            try:
                dates.append(datetime.strptime(date_str, date_format))
            except ValueError:
                pass  # Right shape but not a real date, e.g. 2024-13-45
    return dates

text = "The event is on 2024-07-07 and another one on 07/07/2024 and 07-07-2024"
print(extract_dates(text))  # Three datetime objects, each for 7 July 2024
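A regex only checks the shape of a date, while datetime.strptime needs its own format string (such as %Y-%m-%d) and is the step that rejects impossible dates. The sketch below illustrates that split for a single format and also records where each date was found. ISO_DATE, ISO_FORMAT, and find_iso_dates are hypothetical names introduced for illustration.

```python
import re
from datetime import datetime

# Hypothetical pairing of one regex with its matching strptime format.
ISO_DATE = re.compile(r'(?<!\d)\d{4}-\d{2}-\d{2}(?!\d)')
ISO_FORMAT = '%Y-%m-%d'

def find_iso_dates(text):
    """Return (matched_text, start_index, date) for each valid YYYY-MM-DD date."""
    results = []
    for match in ISO_DATE.finditer(text):
        try:
            parsed = datetime.strptime(match.group(), ISO_FORMAT).date()
        except ValueError:
            continue  # Right shape, impossible date (for example 29 Feb in a non-leap year)
        results.append((match.group(), match.start(), parsed))
    return results

print(find_iso_dates("Valid: 2024-02-29. Invalid: 2023-02-29."))
```

Only the first date is returned here, because 2023 is not a leap year and strptime raises ValueError for 2023-02-29.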

Extracting URLs:

A URL usually starts with http or https, followed by the domain name, the path, and so on. Below is an example that extracts URLs:

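import re

def extract_urls(text):
    # Scheme and host, then an optional path, query string, or fragment.
    # Without the optional part the match stops at the first "/" and drops the path.
    url_pattern = r'https?://(?:[-\w.]|%[\da-fA-F]{2})+(?:[/?#][^\s<>"]*)?'
    # Drop trailing punctuation that usually belongs to the sentence, not the URL.
    return [url.rstrip('.,;:!?)') for url in re.findall(url_pattern, text)]

text = "Check out these websites: https://www.example.com/page?param=value and http://another-site.org"
print(extract_urls(text))  # ['https://www.example.com/page?param=value', 'http://another-site.org']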
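After a URL has been extracted, the standard library can split it into the parts mentioned above (scheme, domain, path, query) without further regex work. This is a minimal sketch using urllib.parse; the function name describe_url and the dictionary keys are hypothetical choices made for illustration.

```python
from urllib.parse import urlsplit, parse_qs

def describe_url(url):
    """Split a URL into scheme, domain, path, and parsed query parameters."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "domain": parts.hostname,
        "path": parts.path,
        "query": parse_qs(parts.query),  # maps each parameter name to a list of values
    }

print(describe_url("https://www.example.com/page?param=value"))
# {'scheme': 'https', 'domain': 'www.example.com', 'path': '/page', 'query': {'param': ['value']}}
```

A URL with no path or query, such as http://another-site.org, yields an empty path string and an empty query dictionary.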
