[Python] A Data Crawler That Obeys robots.txt Rules

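Program 1

Writing a data crawler that obeys robots.txt rules involves several steps: requesting pages, parsing the robots.txt file, scanning page content, storing data, and handling exceptions. Because there are many programming languages and each may approach crawler writing differently, the following uses Python as an example and walks through a simplified flow.

Note: the code below is only an example, not a complete, production-ready program. Real-world use also needs to handle network errors, rate-limit requests out of politeness, deal with possible storage problems, and so on.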

import requests
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser
from bs4 import BeautifulSoup


# Initialize the robots.txt parser for the site that hosts `url`
def init_robot_parser(url):
    rp = RobotFileParser()
    # robots.txt lives at the site root, whatever page `url` points to
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()
    return rp


# Fetch a page if robots.txt allows it
def crawl_page(url, user_agent='MyBot'):
    rp = init_robot_parser(url)
    if rp.can_fetch(user_agent, url):
        headers = {'User-Agent': user_agent}
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
    else:
        print(f"Crawling is disallowed for: {url}")
    return None


# Parse the page and extract data
def extract_data(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Customize the extraction code here as needed
    # Example: extract all <a> tags
    for link in soup.find_all('a'):
        href = link.get('href')
        print(href)
    # The extracted data should be stored in a database, the file system, etc.


# Main function
def main():
    url = 'http://example.com'  # target site
    user_agent = 'MyBot'  # crawler name
    html = crawl_page(url, user_agent)
    if html:
        extract_data(html)


if __name__ == "__main__":
    main()
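
To see how `can_fetch` applies the rules without touching the network, the parser can be fed robots.txt lines directly through `parse()`. The following is a minimal sketch; the rules, the `BadBot` name, the `build_parser` helper, and the example.com URLs are made up for illustration. Python's parser applies the first matching rule within a group, so the `Allow` line is placed before the broader `Disallow`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration
SAMPLE_RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Allow: /private/public-report.html
Disallow: /private/
"""


def build_parser(text):
    """Build a parser from robots.txt text without any network request."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp


rp = build_parser(SAMPLE_RULES)
base = "https://example.com"

print(rp.can_fetch("MyBot", base + "/index.html"))                  # True
print(rp.can_fetch("MyBot", base + "/private/secret.html"))         # False
print(rp.can_fetch("MyBot", base + "/private/public-report.html"))  # True
print(rp.can_fetch("BadBot", base + "/index.html"))                 # False
```

Feeding rules in this way is also a convenient way to unit-test crawler logic against specific robots.txt contents.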

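Program 2

Writing a data crawler that obeys `robots.txt` rules involves a few key steps. The following is an example of a Python crawler that obeys `robots.txt`:
1. Parse `robots.txt`: use the `urllib.robotparser` module to parse the target site's `robots.txt` file and determine which pages may be crawled.
2. Request data: use a library such as `requests` to issue network requests and fetch page content.
3. Analyze content: use a library such as `BeautifulSoup` to analyze the page content and extract the data needed.
4. Follow the crawling rules: make sure crawling honors the `Crawl-delay` directive in `robots.txt` and never fetches pages listed under `Disallow`.
Below is a condensed implementation: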

import requests
from urllib.robotparser import RobotFileParser
from time import sleep
from bs4 import BeautifulSoup


class MySpider:
    def __init__(self, base_url):
        self.base_url = base_url
        self.robots_url = base_url + "/robots.txt"
        self.robot_parser = RobotFileParser()

    def fetch_robots_txt(self):
        response = requests.get(self.robots_url, timeout=10)
        # Assumes robots.txt exists; a missing file needs extra handling
        self.robot_parser.parse(response.text.splitlines())

    def crawl(self, path):
        url = self.base_url + path
        # Check whether crawling is allowed
        if self.robot_parser.can_fetch("*", url):
            crawl_delay = self.robot_parser.crawl_delay("*")
            if crawl_delay:
                sleep(crawl_delay)  # wait as required by Crawl-delay
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response.text
        else:
            print(f"Crawling is disallowed: {url}")
        return None

    def parse(self, html):
        # Parse the HTML with BeautifulSoup or another tool
        soup = BeautifulSoup(html, 'html.parser')
        # Do the actual parsing work here
        # ...


base_url = "https://example.com"  # assume this is the root URL of the site to crawl
spider = MySpider(base_url)

# Fetch and parse robots.txt
spider.fetch_robots_txt()

# Crawl a specific path on the site
path_to_crawl = "/some-page"  # path of the page you want to crawl
html_content = spider.crawl(path_to_crawl)

if html_content:
    # Parse the fetched page content
    spider.parse(html_content)
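
The `fetch_robots_txt` method assumes the file exists. One possible way to handle the other cases is sketched below, loosely following what `RobotFileParser.read()` does for HTTP errors (401/403 means everything is disallowed, other 4xx means everything is allowed). Treating 5xx responses as "disallow everything" is an extra conservative assumption of this sketch. The helper name `parser_from_response` is hypothetical, and it takes the status code and body as plain arguments so it can be used with `requests` or any other HTTP client.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = []  # no rules at all: everything may be fetched
DISALLOW_ALL = ["User-agent: *", "Disallow: /"]


def parser_from_response(status_code, body):
    """Return a RobotFileParser configured from a robots.txt HTTP response."""
    rp = RobotFileParser()
    if status_code == 200:
        rp.parse(body.splitlines())
    elif status_code in (401, 403) or status_code >= 500:
        # Access restricted or server error: be conservative and crawl nothing
        rp.parse(DISALLOW_ALL)
    else:
        # Other statuses (typically 404): no robots.txt, so nothing is restricted
        rp.parse(ALLOW_ALL)
    return rp


url = "https://example.com/some-page"
rules = "User-agent: *\nDisallow: /some-page"
print(parser_from_response(404, "").can_fetch("*", url))     # True
print(parser_from_response(403, "").can_fetch("*", url))     # False
print(parser_from_response(200, rules).can_fetch("*", url))  # False
```

Inside `MySpider.fetch_robots_txt`, the result of `parser_from_response(response.status_code, response.text)` could be assigned to `self.robot_parser`.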

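Note that the `MySpider` code above is a basic example. Real-world use needs to consider several other factors:
- Exception handling: network requests and content parsing can raise exceptions.
- Logging: record the crawler's behavior to make tracing problems and debugging easier.
- Legal compliance: countries and regions have different legal requirements for web crawlers; understand and follow local law before writing and running one.
- More complex URL rules and crawl strategies: URL rules need finer-grained handling, and the crawl strategy should be designed to avoid putting pressure on the server.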
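
The exception-handling and logging points can be combined in a small retry helper. The sketch below is one possible shape, not a definitive implementation: `fetch_with_retry` and its parameters are hypothetical names, and `fetch` stands for any callable that returns the page text or raises on failure, so the helper itself depends only on the standard library.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("spider")


def fetch_with_retry(fetch, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on exceptions with exponential backoff.

    Returns the page text, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            text = fetch(url)
            logger.info("Fetched %s on attempt %d", url, attempt)
            return text
        except Exception as exc:  # narrow this to the HTTP client's error types in practice
            logger.warning("Attempt %d for %s failed: %s", attempt, url, exc)
            if attempt < retries:
                # Wait backoff, 2*backoff, 4*backoff, ... between attempts
                sleep(backoff * 2 ** (attempt - 1))
    logger.error("Giving up on %s after %d attempts", url, retries)
    return None
```

With `requests`, the `fetch` argument could be a small function that calls `requests.get(url, timeout=10)`, then `response.raise_for_status()`, and returns `response.text`, so that both connection errors and HTTP error statuses trigger a retry.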

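Program 3

Building an example crawler that obeys robots.txt rules is fairly involved, because it has to cover several aspects: parsing the robots.txt rules, avoiding disallowed paths, honoring the crawl interval (Crawl-delay), and so on.
The following is a simplified Python program that shows how to read and parse a robots.txt file with the urllib.robotparser library and how to implement basic page fetching with the requests library. Note that this example is for demonstration only and is not a full-featured crawler.
First, install the requests library if your Python environment does not already have it: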

pip install requests

The example program follows:

import time

import requests
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse


class SimpleCrawler:
    def __init__(self, base_url):
        self.base_url = base_url
        self.robot_parser = RobotFileParser()
        self.crawl_delay = 0
        self.parse_robots_txt()

    def parse_robots_txt(self):
        robot_url = urlparse(self.base_url)
        robot_url = f"{robot_url.scheme}://{robot_url.netloc}/robots.txt"
        self.robot_parser.set_url(robot_url)
        self.robot_parser.read()
        # crawl_delay() returns None when no Crawl-delay is specified
        self.crawl_delay = self.robot_parser.crawl_delay("*")

    def can_fetch(self, url):
        return self.robot_parser.can_fetch("*", url)

    def fetch_page(self, url):
        if self.can_fetch(url):
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response.text
        else:
            print(f"Access denied by robots.txt: {url}")
        return None

    # Note: Crawl-delay is enforced only in this loop; calling fetch_page
    # directly does not wait between requests.
    def fetch_pages(self, urls):
        for url in urls:
            page_content = self.fetch_page(url)
            if page_content:
                print(f"Fetched URL: {url[:60]}...")  # print the first 60 characters of the URL
            else:
                print(f"Failed to fetch URL: {url[:60]}...")
            # Honor the Crawl-delay from robots.txt by waiting accordingly
            if self.crawl_delay:
                time.sleep(self.crawl_delay)


# Usage example
if __name__ == "__main__":
    # Make sure this is a URL you are permitted to crawl
    base_url = "http://example.com"
    crawler = SimpleCrawler(base_url)
    url_list = [
        "http://example.com/page1",
        "http://example.com/page2",
        # Add more pages you want to crawl
    ]
    crawler.fetch_pages(url_list)

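This code defines a SimpleCrawler class that takes a base URL on initialization. It tries to read the site's robots.txt and parse its rules. The `can_fetch` method decides whether a given URL may be fetched. The `fetch_page` method does the actual fetching and obeys the Disallow rules in robots.txt. The `fetch_pages` method is a higher-level wrapper that iterates over a list of URLs and fetches each page in turn.
When using this code, make sure to:
1. Obey the target site's robots.txt rules.
2. Respect the Crawl-delay hint. This example waits for the specified time only while iterating over the URL list in `fetch_pages`; direct calls to `fetch_page` are not delayed.
3. Test thoroughly and add further error handling before deploying it in a real crawler.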
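
One way to enforce the delay for every request, including direct calls to `fetch_page`, is a small throttle object that remembers when the previous request was made. This is a sketch only; the `Throttle` class and its parameters are hypothetical names, not part of `requests` or `urllib`. The clock and sleep functions are injectable so the behavior can be tested without real waiting.

```python
import time


class Throttle:
    """Ensure at least `delay` seconds pass between consecutive calls to wait()."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay or 0  # crawl_delay() may return None
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the delay that has not elapsed yet
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In `SimpleCrawler`, a `Throttle(self.crawl_delay)` could be created after robots.txt is parsed and `wait()` called at the top of `fetch_page`, which would make the separate sleep in `fetch_pages` unnecessary.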
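`SimpleCrawler` is only a basic example. A crawler in a real environment must also deal with IP bans, handling of the various HTTP status codes, exception handling, logging, crawl-rate control, and other complications. In addition, to obey robots.txt fully, a crawler also needs to handle Sitemap and Allow directives, as well as the User-agent-specific rules and Crawl-delay values that apply to different crawlers.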
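
For the Sitemap and per-crawler Crawl-delay points, `urllib.robotparser` already exposes `crawl_delay()`, `request_rate()`, and (since Python 3.8) `site_maps()`. The following is a minimal sketch with made-up rules and a hypothetical `SlowBot` agent, parsed offline so no network access is needed:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
SAMPLE = """\
User-agent: SlowBot
Crawl-delay: 10
Disallow: /search

User-agent: *
Crawl-delay: 2
Request-rate: 3/20
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

print(parser.crawl_delay("SlowBot"))  # 10
print(parser.crawl_delay("MyBot"))    # 2 (falls back to the * group)
rate = parser.request_rate("MyBot")
print(rate.requests, rate.seconds)    # 3 20
print(parser.site_maps())             # ['https://example.com/sitemap.xml']
```

Querying with the crawler's own agent name rather than `"*"` lets the parser pick the most specific group that applies to it.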
