Designing a Simple Python Web Crawler

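A crawler is a common tool for collecting data from web pages. This article walks through a simple crawler built on requests and BeautifulSoup that downloads a page and extracts its text.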

Installing the required libraries

First, install the two required libraries:

pip install requests beautifulsoup4

Complete code

import os
import random  # used together with time.sleep() for optional throttling (see the usage notes)
import time

import requests
from bs4 import BeautifulSoup


def simple_crawler(url, save_dir="crawled_data"):
    """
    Simple web page crawler.

    :param url: URL of the page to crawl
    :param save_dir: directory in which to save the data
    :return: the extracted text, or None on failure
    """
    try:
        # Send a browser-like User-Agent so the request is less likely
        # to be flagged as coming from a crawler
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
        }

        # Send the GET request
        response = requests.get(url, headers=headers, timeout=10)

        # Check whether the request succeeded
        if response.status_code == 200:
            # Use the detected encoding (avoids garbled Chinese text)
            response.encoding = response.apparent_encoding

            # Parse the HTML with BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract all text content
            all_text = soup.get_text()

            # Create the output directory if it does not exist yet
            os.makedirs(save_dir, exist_ok=True)

            # Save the content to a file named after the host and the current time
            filename = f"{save_dir}/{url.split('//')[-1].split('/')[0].replace('.', '_')}_{int(time.time())}.txt"
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(all_text)

            print(f"Crawled the page and saved its content to {filename}")
            return all_text
        else:
            print(f"Request failed, status code: {response.status_code}")
            return None

    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None


if __name__ == "__main__":
    # URL to crawl (replace it with a site you are allowed to crawl)
    target_url = "https://example.com"

    # Run the crawler
    content = simple_crawler(target_url)

    if content:
        # Print the first 500 characters (optional)
        print(f"\nPreview of the crawled content:\n{content[:500]}...")
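The text returned by soup.get_text() usually keeps the whitespace of the original markup, so the saved file can contain long runs of blank lines. The following is a minimal post-processing sketch, not part of the original program: normalize_text is a hypothetical helper name, and the sample string only stands in for real crawler output.

```python
def normalize_text(raw_text):
    """Hypothetical helper: strip each line and drop the empty ones."""
    stripped_lines = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in stripped_lines if line)


# Made-up sample that imitates the whitespace left behind by HTML markup
sample = "\n\n  Example Domain  \n\n\n   This domain is for use in examples.\n\n"
print(normalize_text(sample))
```

One way to use it would be to call normalize_text(all_text) before writing the file. BeautifulSoup's get_text() also accepts separator and strip arguments, which may give a similar result without a separate helper.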

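How the code works

The crawler consists of the following parts:

- Request headers: a browser-like User-Agent header lowers the chance of being caught by a site's anti-crawling checks
- Sending the request: the requests library sends an HTTP GET request to fetch the page
- Parsing: BeautifulSoup parses the HTML and extracts the plain text
- Saving: the extracted text is written to a local text file
- Error handling: request exceptions and generic exceptions are both caught, which makes the program more robust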
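The saving step derives the file name by splitting the URL string. With a URL that carries a port, such as https://example.com:8080/, that expression keeps the colon in the file name, which Windows does not accept. As a hedged alternative, the sketch below uses the standard library's urllib.parse instead; build_filename and its timestamp parameter are hypothetical names introduced here for illustration.

```python
import os
import time
from urllib.parse import urlparse


def build_filename(url, save_dir="crawled_data", timestamp=None):
    """Hypothetical helper: derive an output path from the URL's host name."""
    # .hostname excludes the port and any credentials; it is None for non-URLs
    host = urlparse(url).hostname or "unknown_host"
    safe_host = host.replace(".", "_")
    if timestamp is None:
        timestamp = int(time.time())
    return os.path.join(save_dir, f"{safe_host}_{timestamp}.txt")


# A fixed timestamp keeps the example output reproducible
print(build_filename("https://example.com:8080/docs/page.html", timestamp=1700000000))
```

Passing the timestamp explicitly is only a convenience for testing; leaving it out falls back to the current time, as in the original code.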

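Usage notes

1. Replace the URL: change https://example.com in the code to the site you want to crawl, and make sure you are allowed to crawl it.
2. Follow the rules: read the site's robots.txt before crawling and respect its crawling rules.
3. Limit the request rate: add time.sleep(random.uniform(1, 3)) between requests to space them out and avoid putting pressure on the server.
4. Lawful use: crawl only for lawful purposes such as learning and research, and do not infringe on the rights of others.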
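Notes 2 and 3 can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not part of the original program: build_robot_parser and polite_delay are hypothetical helper names, and SAMPLE_ROBOTS is made-up robots.txt content that is parsed offline so the example runs without network access. Against a real site, RobotFileParser's set_url() and read() methods would load the actual file instead.

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only for this offline illustration
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]


def build_robot_parser(robots_lines):
    """Hypothetical helper: build a parser from robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def polite_delay(min_seconds=1.0, max_seconds=3.0):
    """Hypothetical helper: sleep for a random interval and return its length."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


if __name__ == "__main__":
    parser = build_robot_parser(SAMPLE_ROBOTS)
    urls = [
        "https://example.com/index.html",
        "https://example.com/private/data.html",
    ]
    for url in urls:
        if parser.can_fetch("*", url):
            print(f"allowed: {url}")
            polite_delay()  # pause before the next request
        else:
            print(f"skipped (disallowed by robots.txt): {url}")
```

In a multi-page crawl, a call to simple_crawler(url) would sit in the allowed branch, just before the delay.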
