Datawhale Introduction to Web Scraping: Notes No. 2

Regular Expressions

A regular expression (regex) is a syntax for matching strings against a pattern written as an expression.

Regex syntax: metacharacters are arranged and combined to match strings.

Online regex tester (OSCHINA.NET online tools): https://tool.oschina.net/regex/

Metacharacters: special symbols with a fixed meaning.
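A minimal sketch of a few common metacharacters in Python's re module. The sample strings and the `examples` and `results` names are made up for illustration and do not come from the original notes.

```python
import re

# Each entry pairs a pattern with a made-up sample string.
examples = [
    (r"\d+", "order 66 shipped"),   # \d : any digit, + : one or more
    (r"\w+", "hello, world"),       # \w : letter, digit or underscore
    (r"a.c", "abc a-c ac"),         # .  : any character except a newline
    (r"^py", "python"),             # ^  : start of the string
    (r"on$", "python"),             # $  : end of the string
    (r"[aeiou]", "regex"),          # [] : any one character from the set
    (r"cat|dog", "hot dog"),        # |  : alternation
]

# Collect every non-overlapping match for each pattern.
results = {pattern: re.findall(pattern, text) for pattern, text in examples}

for pattern, found in results.items():
    print(pattern, "->", found)
```

Combining these symbols, for example `\d{4}` for a four-digit year, is how the patterns in the cases below are built.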

.*? matches as few characters as possible (lazy matching), while .* matches as many as possible (greedy matching).
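The difference is easiest to see on a string that contains the closing delimiter more than once. The following sketch uses a made-up HTML fragment; the variable names are illustrative only.

```python
import re

html = "<b>first</b> and <b>second</b>"  # made-up sample string

# Greedy: .* runs to the LAST </b>, swallowing everything in between.
greedy = re.findall(r"<b>(.*)</b>", html)

# Lazy: .*? stops at the FIRST </b> it can, so each tag is matched separately.
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)  # ['first</b> and <b>second']
print(lazy)    # ['first', 'second']
```

This is why scraping patterns usually use .*? between HTML tags.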

The re Module
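The notes give no overview of the module here, so the following is only a brief sketch of the standard-library re functions that the two cases rely on. The sample text and variable names are assumptions made for illustration.

```python
import re

text = "Alice: 10086, Bob: 10010"  # made-up sample text

# findall returns every non-overlapping match as a list of strings.
numbers = re.findall(r"\d+", text)

# finditer yields match objects, which also carry positions.
spans = [m.span() for m in re.finditer(r"\d+", text)]

# search scans the whole string and returns the first match (or None).
first = re.search(r"\d+", text).group()

# match only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r"\d+", text)

# compile builds a reusable pattern; (?P<name>...) creates a named group.
pattern = re.compile(r"(?P<name>\w+): (?P<phone>\d+)")
pairs = [(m.group("name"), m.group("phone")) for m in pattern.finditer(text)]

# re.S (alias re.DOTALL) lets "." match newlines as well.
without_flag = re.search(r"a.b", "a\nb")
with_flag = re.search(r"a.b", "a\nb", re.S)

print(numbers, first, at_start, pairs)
```

The re.S flag matters for HTML, where one record usually spans several lines.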

Case 1: Scraping the Douban Top 250 movie list by hand

import csv
import random
import re
import time

import requests

# Request headers
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/114.0.0.0 Safari/537.36"
    ),
    "Referer": "https://movie.douban.com/top250",
}

# Regular expression with named groups for title, year, rating and vote count.
# The literal 人评价 is written in Simplified Chinese, as served by Douban.
pattern = re.compile(
    r'<li>.*?<span class="title">(?P<name>.*?)</span>.*?'
    r'<br>\s*(?P<year>\d{4}).*?'
    r'<span class="rating_num"[^>]*>(?P<score>\d+\.\d+)</span>.*?'
    r'<span>(?P<num>[\d,]+)人评价</span>',
    re.S,
)

# Create the CSV file and write the header row
with open("douban_top250.csv", mode="w", newline="", encoding="utf-8-sig") as f:
    csvwriter = csv.writer(f)
    csvwriter.writerow(["Title", "Year", "Rating", "Votes"])

    # Fetch 10 pages (25 movies per page)
    for start in range(0, 250, 25):
        url = f"https://movie.douban.com/top250?start={start}"
        print(f"Fetching page {start // 25 + 1}: {url}")
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.encoding = resp.apparent_encoding
            html = resp.text

            # Save the page for debugging (optional)
            with open(f"page_debug_{start}.html", "w", encoding="utf-8") as f_debug:
                f_debug.write(html)

            matches = list(pattern.finditer(html))
            print(f"Page {start // 25 + 1} fetched, {len(matches)} matches")
            for m in matches:
                name = m.group("name").strip()
                year = m.group("year").strip()
                score = m.group("score").strip()
                num = m.group("num").replace(",", "").strip()
                csvwriter.writerow([name, year, score, num])

            # Random delay to avoid triggering anti-scraping measures
            time.sleep(random.uniform(1, 2))
        except Exception as e:
            print(f"Fetch failed: {e}")

print("🎉 All pages fetched; results saved to douban_top250.csv")
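Before sending any requests, the pattern can be checked offline against a small fragment. The sketch below mirrors the pattern used in the script. The `sample_html` fragment is made up and only approximates the structure the pattern expects, and the vote-count suffix is written in Simplified Chinese (人评价) on the assumption that this is what the live page serves.

```python
import re

pattern = re.compile(
    r'<li>.*?<span class="title">(?P<name>.*?)</span>.*?'
    r'<br>\s*(?P<year>\d{4}).*?'
    r'<span class="rating_num"[^>]*>(?P<score>\d+\.\d+)</span>.*?'
    r'<span>(?P<num>[\d,]+)人评价</span>',
    re.S,  # let "." cross line breaks, since one movie spans several lines
)

# Hypothetical fragment, NOT copied from the real page.
sample_html = """
<li>
  <div class="info">
    <span class="title">Example Movie</span>
    <p>
      Director: Someone<br>
      1994&nbsp;/&nbsp;Somewhere&nbsp;/&nbsp;Drama
    </p>
    <span class="rating_num" property="v:average">9.7</span>
    <span>1,234,567人评价</span>
  </div>
</li>
"""

m = pattern.search(sample_html)
row = [
    m.group("name").strip(),
    m.group("year"),
    m.group("score"),
    m.group("num").replace(",", ""),  # drop thousands separators
]
print(row)  # ['Example Movie', '1994', '9.7', '1234567']
```

If the real page yields zero matches, comparing the saved page_debug_*.html files with a fragment like this is a quick way to see which part of the pattern no longer fits.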

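Result: the Douban script writes one row per matched movie (title, year, rating, vote count) to douban_top250.csv.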

Case 2: Scraping the Quotes to Scrape website

import re

import requests


def scrape_quotes():
    url = "http://quotes.toscrape.com/"
    response = requests.get(url)
    if response.status_code != 200:
        print("Failed to fetch page")
        return
    html = response.text

    # Match every quote block
    quotes_pattern = re.compile(r'<div class="quote".*?>(.*?)</div>', re.DOTALL)
    quotes = quotes_pattern.findall(html)

    results = []
    for quote in quotes:
        # Extract the quote text
        text_match = re.search(r'<span class="text".*?>(.*?)</span>', quote)
        text = text_match.group(1).strip() if text_match else "N/A"

        # Extract the author
        author_match = re.search(r'<small class="author".*?>(.*?)</small>', quote)
        author = author_match.group(1).strip() if author_match else "N/A"

        # Extract the tags
        tags = re.findall(r'<a class="tag".*?>(.*?)</a>', quote)

        results.append({"text": text, "author": author, "tags": tags})

    # Print the results
    for i, result in enumerate(results, 1):
        print(f"Quote {i}:")
        print(f"  Text: {result['text']}")
        print(f"  Author: {result['author']}")
        print(f"  Tags: {', '.join(result['tags'])}\n")


if __name__ == "__main__":
    scrape_quotes()
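One detail of the block pattern is worth checking offline: the lazy (.*?)</div> stops at the first closing </div> after the quote opens. The sketch below moves the parsing into a hypothetical helper, `parse_quotes`, and runs it on a made-up fragment. The markup is an assumption that only approximates the site's structure, with the tags wrapped in an inner div.

```python
import re

QUOTE_BLOCK = re.compile(r'<div class="quote".*?>(.*?)</div>', re.DOTALL)


def parse_quotes(html):
    """Extract text, author and tags from each quote block (illustrative helper)."""
    results = []
    for block in QUOTE_BLOCK.findall(html):
        text = re.search(r'<span class="text".*?>(.*?)</span>', block)
        author = re.search(r'<small class="author".*?>(.*?)</small>', block)
        tags = re.findall(r'<a class="tag".*?>(.*?)</a>', block)
        results.append({
            "text": text.group(1).strip() if text else "N/A",
            "author": author.group(1).strip() if author else "N/A",
            "tags": tags,
        })
    return results


# Hypothetical fragment, NOT copied from the real site.
sample_html = """
<div class="quote" itemscope>
    <span class="text" itemprop="text">An example quote.</span>
    <span>by <small class="author" itemprop="author">Jane Doe</small></span>
    <div class="tags">
        <a class="tag" href="/tag/example/">example</a>
        <a class="tag" href="/tag/regex/">regex</a>
    </div>
</div>
"""

quotes = parse_quotes(sample_html)
print(quotes)
```

In this fragment the first </div> closes the inner tags container, so the captured block still includes every tag link. If an inner div closed before the tags, the lazy match would cut the block short and the tags would be lost, which is one reason regex-based HTML parsing is fragile.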

Result: the quotes script prints each quote with its text, author, and tags.

The bs4 Module
