Python crawlers for Shanghai news: Eastday (東方網) + The Paper (澎湃新聞)

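1. Background and goal:

This continues my work on news crawlers, which I have written about before.
This post records two news sources.
Later I plan to add filtering, for example keeping only one category of news.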

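2. The result first

Filter out one category of news, generate an HTML page from it, and open that page automatically.
For example, news about technology-related crime.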
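The post does not show the page-generation step, so here is a minimal sketch of how it could look. It assumes the rows have already been filtered and carry the same four fields the crawlers write (title, url, time, source). The names render_html and write_page and the output file news.html are hypothetical.

```python
import html
import webbrowser
from pathlib import Path


def render_html(rows, page_title="Filtered news"):
    """Build a minimal HTML page with one escaped link per news row."""
    items = "\n".join(
        '<li><a href="{url}">{title}</a> ({source}, {time})</li>'.format(
            url=html.escape(str(r.get("url", "")), quote=True),
            title=html.escape(str(r.get("title", ""))),
            source=html.escape(str(r.get("source", ""))),
            time=html.escape(str(r.get("time", ""))),
        )
        for r in rows
    )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        f"<title>{html.escape(page_title)}</title></head>\n"
        f"<body><h1>{html.escape(page_title)}</h1>\n<ul>\n{items}\n</ul></body></html>\n"
    )


def write_page(rows, out_path="news.html", open_browser=True):
    """Write the page to disk and optionally open it in the default browser."""
    out = Path(out_path)
    out.write_text(render_html(rows), encoding="utf-8")
    if open_browser:
        # webbrowser.open expects a URL, so convert the local path to file://
        webbrowser.open(out.resolve().as_uri())
    return out
```

Calling write_page(rows) with the filtered rows writes news.html and opens it. Titles and URLs are escaped because they come from external sites.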

3. Process:

Code 1: crawling Eastday

I wrote this a long time ago and it still works.
It is mostly copied here, partly for my own convenience.

import os
import csv
import time
import requests

"""
# home: https://sh.eastday.com/
# 1. title, url, source, time
"""

headers = {
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36'
}


def get_data(pages):
    file_name = '5.8.400.csv'  # about 400 headlines
    has_file = os.path.exists(file_name)
    # Open the file in append mode
    with open(file_name, 'a', newline='', encoding='utf-8') as file:
        # csv.DictWriter writes each dict as one row
        columns = ['title', 'url', 'time', 'source']
        writer = csv.DictWriter(file, fieldnames=columns)
        # Write the header only when the file is new
        if not has_file:
            writer.writeheader()
        # Crawl the data: 20 pages by default, 20 items per page.
        # There are roughly 400 news items per day.
        for i in range(pages):
            print(f"Crawling page {i + 1} / {pages}")
            time.sleep(0.5)
            url = f"https://apin.eastday.com/apiplus/special/specialnewslistbyurl?specialUrl=1632798465040016&skipCount={i * 20}&limitCount=20"
            # timeout added so a stalled connection cannot hang the crawl
            resp = requests.get(url, headers=headers, timeout=10)
            if resp.status_code != 200:
                print(f"Request failed: {resp.status_code}")
                break
            ret = resp.json()
            junk = ret['data']['list']
            for x in junk:
                item = dict()
                item["time"] = x["time"]
                item['title'] = x["title"]
                item["url"] = x["url"]
                item["source"] = x["infoSource"]
                # Write the row
                writer.writerow(item)


get_data(pages=20)
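The loop above indexes straight into ret['data']['list'], so it raises if the API ever returns an empty or differently shaped body. Below is a minimal sketch of a more defensive mapping step. The helper name extract_rows is hypothetical; the field names (title, url, time, infoSource) are the ones used in the code above.

```python
def extract_rows(payload):
    """Map one decoded JSON response to a list of CSV-ready dicts.

    Missing keys become empty strings instead of raising KeyError.
    """
    data = payload.get("data") or {}
    rows = []
    for x in data.get("list") or []:
        rows.append({
            "title": x.get("title", ""),
            "url": x.get("url", ""),
            "time": x.get("time", ""),
            "source": x.get("infoSource", ""),
        })
    return rows
```

Inside the crawl loop this would be used as `for item in extract_rows(resp.json()): writer.writerow(item)`. Keeping the mapping in its own function also lets it be tested without network access.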

Code 2: crawling The Paper

This one is also simple.

import os
import csv
import time
import requests
from datetime import datetime

# Request headers
headers = {
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36',
    'Content-Type': 'application/json',  # Content-Type expected by the API
    'Referer': 'https://www.thepaper.cn/',  # referrer, in line with strict-origin-when-cross-origin
    'Origin': 'https://www.thepaper.cn'  # Origin is needed for the cross-origin request
}


def get_thepaper_data(file_name='peng_pai_400.csv', max_pages=100, channel_id='-8'):
    """
    Crawl The Paper news data and save it to a CSV file.

    Args:
        file_name: output CSV file name
        max_pages: maximum number of pages to crawl
        channel_id: news channel ID
    """
    # Check whether the file already exists
    has_file = os.path.exists(file_name)
    # Open the CSV file in append mode
    with open(file_name, 'a', newline='', encoding='utf-8') as file:
        columns = ['title', 'url', 'time', 'source']
        writer = csv.DictWriter(file, fieldnames=columns)
        if not has_file:
            writer.writeheader()

        # startTime is the current timestamp in milliseconds
        current_time = int(time.time() * 1000)
        start_time = current_time  # use the time of this run

        # Crawl the data
        for page in range(1, max_pages + 1):
            time.sleep(0.5)  # pause between requests
            payload = {
                'channelId': channel_id,
                'excludeContIds': [],  # left empty; adjust as needed
                'province': '',
                'pageSize': 20,
                'startTime': start_time,
                'pageNum': page
            }
            url = 'https://api.thepaper.cn/contentapi/nodeCont/getByChannelId'
            resp = requests.post(url, headers=headers, json=payload, timeout=10)
            if resp.status_code != 200:
                print(f"Request failed: {url}, status code: {resp.status_code}, page: {page}")
                break
            ret = resp.json()
            news_list = ret['data']['list']
            for item in news_list:
                news = {}
                news['title'] = item.get('name', '')
                news['url'] = f"https://www.thepaper.cn/newsDetail_forward_{item.get('originalContId', '')}"
                news['source'] = item.get('authorInfo', {}).get('sname', '澎湃新聞')
                # Convert the millisecond timestamp to a readable time.
                # A missing timestamp stays empty instead of raising TypeError.
                pub_ms = item.get('pubTimeLong')
                news['time'] = (
                    datetime.fromtimestamp(pub_ms / 1000).strftime('%Y-%m-%d %H:%M:%S')
                    if pub_ms else ''
                )
                # Write directly, without deduplication
                writer.writerow(news)
                print(f"Saved news: {news}")


if __name__ == "__main__":
    get_thepaper_data(file_name='peng_pai_400.csv', max_pages=20, channel_id='-8')
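Both crawlers append to their CSV on every run without deduplication, so running them twice in one day stores the same items twice. Below is a minimal sketch of one way to avoid that, assuming the URL is a good enough identity for a news item. The helper names load_seen_urls and append_unique are hypothetical.

```python
import csv
import os


def load_seen_urls(file_name):
    """Return the set of URLs already stored in the CSV (empty if no file)."""
    if not os.path.exists(file_name):
        return set()
    with open(file_name, newline="", encoding="utf-8") as f:
        return {row["url"] for row in csv.DictReader(f) if row.get("url")}


def append_unique(file_name, rows, columns=("title", "url", "time", "source")):
    """Append only rows whose URL is not in the file yet; return how many were written."""
    seen = load_seen_urls(file_name)
    has_file = os.path.exists(file_name)
    written = 0
    with open(file_name, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(columns))
        if not has_file:
            writer.writeheader()
        for row in rows:
            if row["url"] in seen:
                continue  # already stored, or repeated within this batch
            writer.writerow(row)
            seen.add(row["url"])
            written += 1
    return written
```

Reading the whole file on every call is fine at this scale (a few hundred rows per day). A much larger archive would call for a database or a persistent index instead.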

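4. Conclusion + todo

1. More data sources are needed. Candidates:

- Shobserver (上觀新聞), shobserver.com: affiliated with Jiefang Daily, covers local Shanghai cases.
- Sina News (新浪新聞), news.sina.com.cn: nationwide news, including tech crime.
- Tencent News (騰訊新聞), news.qq.com: aggregates many sources, broad coverage.

2. Aggregation: extract the news I am interested in, for example tech crime.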
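The aggregation step is still a todo, so the following is only a sketch of one possible shape for it: read the CSV files produced by the crawlers into one list, then keep the rows whose title contains a keyword. The keyword list is purely an assumption, since the post does not define what counts as tech crime. The names load_rows, match_keywords and the origin column are hypothetical.

```python
import csv
from pathlib import Path

# Hypothetical keywords; the post does not define what counts as "tech crime".
KEYWORDS = ("诈骗", "黑客", "网络", "数据泄露")


def load_rows(paths):
    """Read several crawler CSVs into one list, tagging each row with its file."""
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                row["origin"] = Path(path).stem  # which CSV the row came from
                rows.append(row)
    return rows


def match_keywords(rows, keywords=KEYWORDS):
    """Keep rows whose title contains at least one of the keywords."""
    return [r for r in rows if any(k in (r.get("title") or "") for k in keywords)]
```

With the file names used above, this would be called as `match_keywords(load_rows(['5.8.400.csv', 'peng_pai_400.csv']))`. Plain substring matching is crude and will miss or over-match some headlines; a classifier could replace it later.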

I hope this helps.
