Using the newspaper library to fetch the news content for each URL and save the extracted article text to a file

Example code:

from newspaper import Article
from newspaper import Config
import json
from tqdm import tqdm
import os
import requests

# Load the list of news records; each record carries a 'url' field.
with open('datasource/api/news_api.json', 'r') as file:
    data = json.load(file)
print(len(data))

save_path = 'datasource/source/news_data.json'


def wr_dict(filename, dic):
    # Append dic to the JSON list stored in filename, creating the file if needed.
    if not os.path.isfile(filename):
        data = []
        data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)
    else:
        with open(filename, 'r') as f:
            data = json.load(f)
        data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)


def rm_file(file_path):
    if os.path.exists(file_path):
        os.remove(file_path)


# rm_file(save_path)
# Records saved by earlier runs; save_path must already exist.
with open(save_path, 'r') as file:
    have = json.load(file)

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.headers = {'Cookie': "cookie1=xxx;cookie2=zzzz"}
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

RETRY_ATTEMPTS = 1
count = 0


def parse_article(url):
    for attempt in range(RETRY_ATTEMPTS):
        try:
            article = Article(url, config=config)
            article.download()
            article.parse()
            return article.text
        except Exception:
            # print(f"Error retrieving article from URL '{url}'")
            return None
    return None


for idx, d in enumerate(tqdm(data)):
    # Skip the positions already covered by earlier runs.
    if idx < len(have):
        continue
    url = d['url']
    maintext = parse_article(url.strip())
    if maintext is None:
        continue
    d['body'] = maintext
    wr_dict(save_path, d)
    count = count + 1
print(count + len(have))

This code takes a dataset of news URLs, fetches the news content for each URL, and saves the extracted article text to a file.

1. Import the required libraries

from newspaper import Article
from newspaper import Config
import json
from tqdm import tqdm
import os
import requests

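newspaper: extracts article content from news sites. Article fetches an article's body text; Config sets the request headers and other options.
json: reads and writes JSON data.
tqdm: displays a progress bar.
os: handles file operations.
requests: sends HTTP requests. It is not called directly anywhere in this code, so the import could be removed.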

2. Load the news URL data

with open('datasource/api/news_api.json', 'r') as file:
    data = json.load(file)
print(len(data))

Reads the news URL records from datasource/api/news_api.json into the data variable.
Prints the length of data, that is, how many news URL records there are.

3. Define the JSON-writing function `wr_dict`

def wr_dict(filename, dic):
    if not os.path.isfile(filename):
        data = []
        data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)
    else:
        with open(filename, 'r') as f:
            data = json.load(f)
        data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)

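wr_dict appends a news record (dic) to the given JSON file. If the file does not exist, it creates the file and writes a one-element list. If the file already exists, it reads the stored list, appends the new record, and writes the whole list back, so every call rewrites the entire file.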
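Because wr_dict rereads and rewrites the full list on every call, the cost of each save grows with the file. One possible alternative, shown here only as a sketch and not as part of the original script, is the JSON Lines layout, where each record is one line and saving is a plain append. The names append_jsonl and read_jsonl and the .jsonl file name are hypothetical.

```python
import json
import os
import tempfile


def append_jsonl(filename, dic):
    """Append one record as a single JSON line; the file is created if missing."""
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(json.dumps(dic, ensure_ascii=False) + '\n')


def read_jsonl(filename):
    """Return all records from a JSON Lines file, or [] if it does not exist."""
    if not os.path.isfile(filename):
        return []
    with open(filename, 'r', encoding='utf-8') as f:
        # Blank lines are ignored so a trailing newline does not break parsing.
        return [json.loads(line) for line in f if line.strip()]


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'news_data.jsonl')  # hypothetical file name
        append_jsonl(path, {'url': 'https://example.com/a', 'body': 'first'})
        append_jsonl(path, {'url': 'https://example.com/b', 'body': 'second'})
        print(len(read_jsonl(path)))
```

The trade-off is that the output is no longer a single JSON array, so anything that reads news_data.json with json.load would need to switch to a line-by-line reader such as read_jsonl.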

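4. The file-deletion function `rm_file`

def rm_file(file_path):
    if os.path.exists(file_path):
        os.remove(file_path)

rm_file deletes the file at the given path if it exists. In the full script its call is commented out.

5. Load the already-saved news data

with open(save_path, 'r') as file:
    have = json.load(file)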

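Reads the previously saved news records from save_path into the have variable, so that the script can skip records it has already downloaded and saved. The file must already exist; otherwise open raises FileNotFoundError.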
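On a first run the save file does not exist yet. A minimal sketch of a loader that tolerates this case follows; load_existing is a hypothetical helper name, not something from the original script.

```python
import json
import os


def load_existing(path):
    """Return the saved list of records, or [] when the file is missing or empty."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return []
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)
```

With this helper, the loading step could be written as have = load_existing(save_path), and the script would start from an empty list when nothing has been saved.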

6. Configure the request headers and retry count

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.headers = {'Cookie': "cookie1=xxx;cookie2=zzzz"}
config.browser_user_agent = USER_AGENT
config.request_timeout = 10
RETRY_ATTEMPTS = 1

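Sets the request's User-Agent so that requests look like they come from a browser, and sets a 10-second request timeout.
config.headers adds a Cookie header to mimic a logged-in user, so that requests are not rejected for lack of cookies. The cookie values shown are placeholders.

7. Define the article-parsing function `parse_article`

def parse_article(url):
    for attempt in range(RETRY_ATTEMPTS):
        try:
            article = Article(url, config=config)
            article.download()
            article.parse()
            return article.text
        except Exception:
            return None
    return None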

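parse_article downloads and parses the news article at the given URL. It requests the page, downloads it, and extracts the text, returning the article body on success. On failure (for example, a network problem or an invalid URL) it returns None. Because the except branch returns immediately, a failed download is not retried, whatever the value of RETRY_ATTEMPTS.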
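If retries are actually wanted, the loop has to continue after an exception instead of returning. The sketch below shows one way to do that under stated assumptions: fetch_with_retry, download_text, and the attempts and delay parameters are hypothetical names, and the fixed pause between tries is an addition that the original script does not have.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) up to `attempts` times; return its result, or None if all fail."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            # Pause before the next try, but not after the last one.
            if attempt < attempts - 1:
                time.sleep(delay)
    return None


def download_text(url):
    """Download and parse one article with newspaper, returning its body text."""
    # Imported here so that fetch_with_retry has no third-party dependency.
    from newspaper import Article
    article = Article(url)
    article.download()
    article.parse()
    return article.text
```

A call such as fetch_with_retry(download_text, url.strip()) would then take the place of parse_article(url.strip()). Passing the download function as an argument keeps the retry logic separate from newspaper, which also makes it easy to exercise with a stub.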

8. Process each news URL, download and save the article text

for idx, d in enumerate(tqdm(data)):
    if idx < len(have):
        continue
    url = d['url']
    maintext = parse_article(url.strip())
    if maintext is None:
        continue
    d['body'] = maintext
    wr_dict(save_path, d)
    count = count + 1
print(count + len(have))

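Iterates over every news record in data, skipping those already processed (if idx < len(have): continue).
Calls parse_article on each news URL to get the article text.
If the text is retrieved, stores it in the body field of the news dictionary (d).
Uses wr_dict to append the updated dictionary to the save file.
count tracks the number of news records saved successfully in this run.
Finally prints the total number of saved records (those saved in this run plus those already in the file).
Note that the skip test compares positions, not URLs. Failed records are never written, so len(have) can be smaller than the number of records already attempted, and a rerun may then fetch and append some records a second time.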
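The position-based skip only lines up with the save file when every earlier download succeeded, since records whose download failed are not written. A sketch of a resume step that matches on the URL instead is given below; pending_records is a hypothetical helper name, and it assumes that every record in both lists has a url field, as the loop above does.

```python
def pending_records(data, have):
    """Return the records in data whose URL is not yet present in have."""
    done = {rec['url'].strip() for rec in have}
    return [rec for rec in data if rec['url'].strip() not in done]


if __name__ == '__main__':
    data = [{'url': 'https://example.com/a'},
            {'url': 'https://example.com/b'},
            {'url': 'https://example.com/c'}]
    # Suppose /a failed earlier and only /b was saved.
    have = [{'url': 'https://example.com/b', 'body': 'text of b'}]
    for rec in pending_records(data, have):
        print(rec['url'])
```

The main loop could then iterate over tqdm(pending_records(data, have)) and drop the idx test. One consequence of this design is that URLs that failed before are attempted again on every rerun.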

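Summary:

Reads the news URLs from news_api.json.
Downloads and parses the article text for each URL with the newspaper library.
Saves each successfully retrieved article to news_data.json.
Shows progress with a tqdm progress bar.
Uses wr_dict to append each record to the JSON file.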
