Regular Expressions
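A regular expression is a set of syntax rules for matching strings by means of an expression.
Regex syntax: metacharacters are arranged and combined to describe the strings to match.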
Online regular expression tester (OSCHINA.NET online tools): https://tool.oschina.net/regex/
Metacharacters: special symbols with a fixed meaning.
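A few common metacharacters in action. This is a minimal sketch using Python's built-in re module; the sample string and variable names are made up for illustration.

```python
import re

# Made-up sample text, used only for illustration
sample = "Order 66 shipped on 2023-05-04 to user_42."

# \d matches one digit; + repeats the previous item one or more times
numbers = re.findall(r"\d+", sample)

# \w matches a letter, digit, or underscore
words = re.findall(r"\w+", sample)

# . matches any character except a newline; {n} repeats exactly n times
date = re.search(r"\d{4}.\d{2}.\d{2}", sample).group()

print(numbers)  # ['66', '2023', '05', '04', '42']
print(words)
print(date)     # 2023-05-04
```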
.*? matches as little as possible (lazy matching); .* matches as much as possible (greedy matching).
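The difference is easiest to see on a string that contains two tags of the same kind. A minimal sketch, with a made-up input string:

```python
import re

# Made-up input with two <b> elements
html = "<b>first</b> and <b>second</b>"

# Greedy: .* runs to the end and backtracks to the LAST </b>
greedy = re.findall(r"<b>(.*)</b>", html)

# Lazy: .*? stops at the FIRST </b> it can reach
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)  # ['first</b> and <b>second']
print(lazy)    # ['first', 'second']
```

When extracting fields from HTML, the lazy form is usually the one wanted, which is why both cases below rely on .*?.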
The re module
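The cases below use re.compile, finditer, search, and findall, together with named groups and the re.S flag. Here is a minimal sketch of how these fit together; the HTML fragment is hypothetical and is not taken from any real page.

```python
import re

# Hypothetical HTML fragment, made up for illustration only
html = """
<li><span class="title">Movie A</span>
<span class="year">1994</span></li>
<li><span class="title">Movie B</span>
<span class="year">1993</span></li>
"""

# re.compile builds a reusable pattern object.
# re.S (alias re.DOTALL) lets "." match newlines, so ".*?" can span lines.
# (?P<name>...) defines a named group.
pattern = re.compile(
    r'<span class="title">(?P<name>.*?)</span>.*?'
    r'<span class="year">(?P<year>\d{4})</span>',
    re.S,
)

# findall returns a list of tuples when the pattern has several groups
pairs = pattern.findall(html)

# search returns the first match object, or None if nothing matches
first = pattern.search(html)

# finditer yields match objects one by one; named groups are read by name
records = [(m.group("name"), m.group("year")) for m in pattern.finditer(html)]

print(pairs)                # [('Movie A', '1994'), ('Movie B', '1993')]
print(first.group("name"))  # Movie A
print(records)
```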
Case 1: Scraping the Douban Top 250 movie list by hand
import requests
import re
import csv
import time
import random

# Request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/114.0.0.0 Safari/537.36",
    "Referer": "https://movie.douban.com/top250",
}

# Regular expression with named groups; re.S lets "." match newlines.
# The trailing literal must match the page text exactly; Douban's pages
# are in Simplified Chinese (人评价, "people rated").
pattern = re.compile(
    r'<li>.*?<span class="title">(?P<name>.*?)</span>.*?'
    r'<br>\s*(?P<year>\d{4}).*?'
    r'<span class="rating_num"[^>]*>(?P<score>\d+\.\d+)</span>.*?'
    r'<span>(?P<num>[\d,]+)人评价</span>',
    re.S,
)

# Create the CSV file and write the header row
with open("douban_top250.csv", mode="w", newline="", encoding="utf-8-sig") as f:
    csvwriter = csv.writer(f)
    csvwriter.writerow(["Title", "Year", "Rating", "Votes"])

    # Fetch 10 pages (25 movies per page)
    for start in range(0, 250, 25):
        url = f"https://movie.douban.com/top250?start={start}"
        print(f"Fetching page {start//25 + 1}: {url}")
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.encoding = resp.apparent_encoding
            html = resp.text

            # Save the page for debugging (optional)
            with open(f"page_debug_{start}.html", "w", encoding="utf-8") as f_debug:
                f_debug.write(html)

            matches = list(pattern.finditer(html))
            print(f"Page {start//25 + 1} succeeded, matched {len(matches)} entries")

            for m in matches:
                name = m.group("name").strip()
                year = m.group("year").strip()
                score = m.group("score").strip()
                num = m.group("num").replace(",", "").strip()
                csvwriter.writerow([name, year, score, num])

            # Random delay to avoid triggering anti-scraping measures
            time.sleep(random.uniform(1, 2))
        except Exception as e:
            print(f"Fetch failed: {e}")

print("🎉 All pages fetched and saved to douban_top250.csv")
Result:
Case 2: Scraping the Quotes to Scrape site
import re
import requests


def scrape_quotes():
    url = "http://quotes.toscrape.com/"
    response = requests.get(url)
    if response.status_code != 200:
        print("Failed to fetch page")
        return
    html = response.text

    # Match every quote block
    quotes_pattern = re.compile(r'<div class="quote".*?>(.*?)</div>', re.DOTALL)
    quotes = quotes_pattern.findall(html)

    results = []
    for quote in quotes:
        # Extract the quote text
        text_match = re.search(r'<span class="text".*?>(.*?)</span>', quote)
        text = text_match.group(1).strip() if text_match else "N/A"

        # Extract the author
        author_match = re.search(r'<small class="author".*?>(.*?)</small>', quote)
        author = author_match.group(1).strip() if author_match else "N/A"

        # Extract the tags
        tags = re.findall(r'<a class="tag".*?>(.*?)</a>', quote)

        results.append({"text": text, "author": author, "tags": tags})

    # Print the results
    for i, result in enumerate(results, 1):
        print(f"Quote {i}:")
        print(f" Text: {result['text']}")
        print(f" Author: {result['author']}")
        print(f" Tags: {', '.join(result['tags'])}\n")


if __name__ == "__main__":
    scrape_quotes()
Output: