[Crawler] Scraping Literature Data from a Single Website Link: Title, Abstract, Authors, and More

Source code: https://github.com/Niceeggplant/Single—Site-Crawler.git

1. Project Overview

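This project is a tool that extracts key article information from a specified web page. Given an article URL, the program fetches the page content automatically and extracts fields such as the title, abstract, and authors.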

2. Technology Choices and Principles

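requests library: a widely used Python library for sending HTTP requests. It sends a request to a web server, much as a browser would, and returns the HTML text of the page. In this project it fetches the source of the target article page, which the later extraction steps work on. Usage is simple: call requests.get() with the target URL and, optionally, request headers. For example: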

import requests

url = "https://example.com/article"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(url, headers=headers)
html_text = response.text

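Setting User-Agent in the request headers makes the request look like it comes from a browser, which avoids the restrictions some sites place on non-browser clients.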
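To see what the User-Agent header changes, the sketch below prepares a request without sending it and compares the custom header with the default one that requests would otherwise use. The helper name build_request and the constant BROWSER_UA are illustrative assumptions, not part of the project's code.

```python
import requests

# Browser-like User-Agent string, copied from the example above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Prepare (but do not send) a GET request carrying a custom User-Agent.

    Hypothetical helper for illustration only.
    """
    request = requests.Request("GET", url, headers={"User-Agent": user_agent})
    return request.prepare()


prepared = build_request("https://example.com/article")

# What requests identifies itself as when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()

print("default:", default_ua)
print("custom: ", prepared.headers["User-Agent"])
```

Nothing is sent over the network here; the prepared request only shows which headers would accompany the call to requests.get().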

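BeautifulSoup library: this library parses HTML and XML documents. It turns a complex page structure into Python objects that are easy to work with, so elements can be located and extracted by tag name, class name, ID, and other attributes. In this project it parses the HTML text fetched by requests in order to extract the article's fields. The first step is to create a BeautifulSoup object, for example: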

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')

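Here html.parser is the HTML parser built into Python's standard library. Other parsers can be chosen as needed, such as lxml, which is faster but has to be installed separately.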
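As a small illustration of locating elements by tag, class name, and ID, the sketch below parses an inline HTML string instead of a live page. The markup in SAMPLE_HTML is made up for this example and does not reflect any particular site.

```python
from bs4 import BeautifulSoup

# Invented markup, used only to demonstrate the three lookup styles.
SAMPLE_HTML = """
<html><body>
  <h1 id="main-title"> Sample Article </h1>
  <div class="abstract">A short summary.</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Look up by tag name.
by_tag = soup.find("h1").get_text().strip()
# Look up by class name (class_ avoids the reserved word "class").
by_class = soup.find("div", class_="abstract").get_text().strip()
# Look up by ID.
by_id = soup.find(id="main-title").get_text().strip()

print(by_tag)
print(by_class)
print(by_id)
```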

3. Implementation Steps

Defining the extraction function

import requests
from bs4 import BeautifulSoup


def fetch_article_info(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
    }
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()  # raise an exception on 4xx/5xx responses
        soup = BeautifulSoup(response.text, 'html.parser')

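This defines the fetch_article_info function, which takes an article URL as its argument and performs the request and the parsing inside the function body. The extraction steps below continue inside the same try block.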
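The excerpt above opens a try block but does not show how failures are handled. The sketch below is one possible way to wrap the request and parsing steps, assuming that returning None on failure is acceptable; the helper name fetch_soup, the timeout value, and the error message are assumptions rather than the project's actual code.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}


def fetch_soup(url, timeout=10):
    """Fetch a page and return a BeautifulSoup object, or None on failure.

    Hypothetical helper; the timeout of 10 seconds is an assumed default.
    """
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as error:
        # RequestException is the base class for connection errors,
        # timeouts, invalid URLs, and HTTP errors raised above.
        print(f"Request failed: {error}")
        return None
    return BeautifulSoup(response.text, "html.parser")


# A URL without a scheme is rejected before any network traffic happens,
# so this call exercises the error path offline.
result = fetch_soup("not-a-url")
print(result)
```

Catching requests.RequestException rather than a bare Exception keeps programming errors visible while still covering the failures a crawler normally meets.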

Extracting the title

        title_element = soup.find('h1')
        title = title_element.get_text().strip() if title_element else 'Not found'

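soup.find('h1') looks up the first <h1> tag on the page, which is where the article title usually sits. If the tag is found, its text is taken with leading and trailing whitespace stripped; otherwise the title is set to 'Not found'.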
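The title logic can be tried in isolation on inline HTML. In the sketch below, the helper name extract_title and both sample pages are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def extract_title(soup, default="Not found"):
    """Return the stripped text of the first <h1>, or a default value."""
    title_element = soup.find("h1")
    if title_element is None:
        return default
    return title_element.get_text().strip()


# Invented sample pages: one with a heading, one without.
with_title = BeautifulSoup(
    "<html><body><h1>\n  Deep Learning Survey \n</h1></body></html>",
    "html.parser",
)
without_title = BeautifulSoup(
    "<html><body><p>No heading here</p></body></html>",
    "html.parser",
)

print(extract_title(with_title))
print(extract_title(without_title))
```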

Extracting the authors

        authors = []
        author_elements = soup.find_all('div', class_='authors')
        if not author_elements:
            # Note: <input> is a void element, so the loop below finds no <a>
            # links inside it; reading its value attribute may be needed instead.
            author_elements = soup.find_all('input', id='authors')
        for author_element in author_elements:
            author_links = author_element.find_all('a')
            for link in author_links:
                authors.append(link.get_text().strip())
        authors = ', '.join(authors) if authors else 'Not found'

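The code first looks for <div> tags with the class authors; if none are found, it looks for an <input> tag whose id is authors. It then iterates over the elements found, collects the text of each <a> link inside them, and joins the names with commas. If no names are collected, the authors field is set to 'Not found'.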
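The author lookup can be exercised the same way on inline HTML. The sketch below follows the steps just described, with one deliberate difference that is an assumption on my part: for the <input id="authors"> fallback it reads the element's value attribute, and it falls back whenever the <div> path yields no names. The helper name extract_authors and the sample markup are likewise invented for illustration.

```python
from bs4 import BeautifulSoup


def extract_authors(soup):
    """Return a comma-separated author string, or 'Not found'."""
    names = []
    # Primary path from the article: <a> links inside <div class="authors">.
    for block in soup.find_all("div", class_="authors"):
        for link in block.find_all("a"):
            text = link.get_text().strip()
            if text:
                names.append(text)
    if not names:
        # Assumed fallback: read the value attribute of <input id="authors">.
        for field in soup.find_all("input", id="authors"):
            value = field.get("value", "").strip()
            if value:
                names.append(value)
    return ", ".join(names) if names else "Not found"


# Invented sample markup covering the three cases.
DIV_PAGE = (
    '<div class="authors">'
    '<a href="#">Alice Smith</a> <a href="#">Bob Lee</a>'
    "</div>"
)
INPUT_PAGE = '<input id="authors" value="Carol Jones">'
EMPTY_PAGE = "<p>No author markup here.</p>"

for page in (DIV_PAGE, INPUT_PAGE, EMPTY_PAGE):
    print(extract_authors(BeautifulSoup(page, "html.parser")))
```

Whether the value attribute actually holds the author names depends on the target site, so that branch should be checked against the real page before relying on it.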