Python Web Scraping Examples

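Below are some common Python web scraping examples, covering different use cases and techniques:
1. Scraping simple page content
Example: fetch a page's title and description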
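import requests
from bs4 import BeautifulSoup

url = "https://www.runoob.com/"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')

# Either tag may be missing, so fall back to an empty string
title = soup.title.string if soup.title else ''
meta = soup.find('meta', attrs={'name': 'description'})
description = meta.get('content', '') if meta else ''
print(f"Title: {title}")
print(f"Description: {description}")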

2. Scraping images
Example: scrape an image site and download the images
import os
import requests
from bs4 import BeautifulSoup

url = "https://unsplash.com/s/photos/nature"
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Create a folder to store the images
os.makedirs('images', exist_ok=True)

# Find all image tags
img_tags = soup.find_all('img')
for idx, img in enumerate(img_tags):
    img_url = img.get('src')
    # Skip tags without a src, as well as relative paths and inline data: URIs
    if not img_url or not img_url.startswith('http'):
        continue
    # Download the image
    img_data = requests.get(img_url, timeout=10).content
    with open(f'images/img_{idx}.jpg', 'wb') as handler:
        handler.write(img_data)
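
The download loop above only works for `src` values that are already absolute URLs, and it saves every file with a `.jpg` extension. As a minimal sketch, assuming a hypothetical helper named `image_target` (not part of the original example), the standard library can resolve a relative `src` against the page address and pick a file extension from the URL path:

```python
import os
from urllib.parse import urljoin, urlparse

def image_target(page_url, src, idx, folder='images'):
    """Return (absolute_url, local_path) for an <img> src, or None to skip it.

    Hypothetical helper for illustration only.
    """
    if not src or src.startswith('data:'):
        return None  # nothing to download
    absolute_url = urljoin(page_url, src)
    # Take the extension from the URL path, ignoring any query string
    ext = os.path.splitext(urlparse(absolute_url).path)[1].lower()
    if ext not in ('.jpg', '.jpeg', '.png', '.gif', '.webp'):
        ext = '.jpg'  # assumed default when the URL has no usable extension
    return absolute_url, os.path.join(folder, f'img_{idx}{ext}')
```

Inside the loop, the result could be unpacked into the URL to request and the path to write to, and the tag skipped when the helper returns `None`.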

3. Scraping data and storing it
Example: scrape the Douban Movie Top 250 and save it to CSV
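import csv
import time
import requests
from bs4 import BeautifulSoup

url = "https://movie.douban.com/top250"
# Douban tends to reject requests that lack a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0'}

movies = []
# The list is paginated, 25 movies per page, via the "start" query parameter
for start in range(0, 250, 25):
    response = requests.get(url, params={'start': start}, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    for item in soup.select('.item'):
        title = item.select('.title')[0].get_text()
        rating = item.select('.rating_num')[0].get_text()
        # Rough extraction: the text before the first "/" on the credits line
        director = item.select('.bd p')[0].get_text().split('\n')[1].strip().split('/')[0]
        movies.append([title, rating, director])
    time.sleep(1)  # pause between pages

# Write to a CSV file
with open('douban_top250.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Rating', 'Director'])
    writer.writerows(movies)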

4. Scraping dynamic pages
Example: use Selenium to scrape a dynamically loaded page
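from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Launch the browser
driver = webdriver.Chrome()
try:
    driver.get("https://www.jd.com")

    # Search for a product (the query means "laptop")
    search_box = driver.find_element(By.ID, 'key')
    search_box.send_keys('筆記本電腦')
    search_box.send_keys(Keys.RETURN)

    # Wait up to 10 seconds for results instead of sleeping a fixed time
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'gl-item'))
    )

    # Collect the product list
    products = driver.find_elements(By.CLASS_NAME, 'gl-item')
    for product in products:
        try:
            name = product.find_element(By.CLASS_NAME, 'p-name').text
            price = product.find_element(By.CLASS_NAME, 'p-price').text
            print(f"Product: {name}, Price: {price}")
        except NoSuchElementException as e:
            print(e)
finally:
    # Close the browser even if something above fails
    driver.quit()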

5. Scraping data from an API
Example: fetch data from the GitHub API
import requests

# Fetch the most-starred Python repositories
url = "https://api.github.com/search/repositories"
params = {'q': 'language:python', 'sort': 'stars'}
response = requests.get(url, params=params, timeout=10)
response.raise_for_status()
data = response.json()

for item in data['items']:
    name = item['name']
    description = item['description']
    stars = item['stargazers_count']
    print(f"Repository: {name}, Description: {description}, Stars: {stars}")
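
Unauthenticated requests to the GitHub API are rate limited, and responses carry `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers (the latter as a Unix timestamp). A minimal sketch, assuming a hypothetical helper `seconds_until_reset` that is not part of the original example, shows one way to decide how long to pause before the next request:

```python
import time

def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before the next request (0 if none).

    `headers` is any mapping of response headers, e.g. `response.headers`.
    Hypothetical helper for illustration only.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get('X-RateLimit-Remaining', 1))
    if remaining > 0:
        return 0.0  # quota left, no need to wait
    reset_at = float(headers.get('X-RateLimit-Reset', now))
    return max(0.0, reset_at - now)
```

With `requests`, this could be used as `time.sleep(seconds_until_reset(response.headers))` before sending the next request. Header lookup on `response.headers` is case-insensitive, whereas a plain `dict` must use the exact key spelling shown here.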

6. Scraping data behind a login
Example: simulate a login and scrape data
import requests

login_url = "https://example.com/login"
data_url = "https://example.com/dashboard"

# Login credentials
payload = {
    'username': 'your_username',
    'password': 'your_password'
}

# Use a session so cookies persist and the login state is kept
with requests.Session() as session:
    # Send the login request
    login_response = session.post(login_url, data=payload, timeout=10)
    login_response.raise_for_status()

    # Visit a page that requires login
    response = session.get(data_url, timeout=10)
    print(response.text)

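Notes
1. Follow the site's rules: before scraping, check the target site's robots.txt file to see which pages may be crawled.
2. Use a reasonable request interval: avoid frequent requests that overload the server or get you blocked.
3. Handle anti-scraping measures: if you run into them, you can try using proxy IPs, setting request headers (User-Agent), and similar methods.
4. Legality: make sure the data you collect and the way you collect it comply with applicable laws and regulations.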
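
The first two notes can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: the `USER_AGENT` string, the `parse_robots` helper, the `RateLimiter` class, and the example rules are all assumptions made for the sketch.

```python
import time
import urllib.robotparser

USER_AGENT = 'my-crawler/0.1'  # hypothetical User-Agent string

def parse_robots(robots_text):
    """Build a parser from the text of a robots.txt file."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()

# Example rules; a real crawler would download them from the site's /robots.txt
robots = parse_robots("User-agent: *\nDisallow: /private/\n")
limiter = RateLimiter(min_interval=1.0)

for url in ["https://example.com/", "https://example.com/private/data"]:
    if not robots.can_fetch(USER_AGENT, url):
        print(f"Skipping (disallowed by robots.txt): {url}")
        continue
    limiter.wait()
    # A real request would go here, e.g.
    # requests.get(url, headers={'User-Agent': USER_AGENT}, timeout=10)
    print(f"Would fetch: {url}")
```

Sending the same `USER_AGENT` in the request headers that was used for the robots.txt check keeps the crawler's identity consistent.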
These examples can help you get started quickly with Python web scraping; choose the techniques and tools that fit your actual needs.
