In today's data-driven world, collecting and processing web data is a valuable skill. This article walks through a Python program that automatically fetches financial news and processes it. Beyond giving quick access to the latest financial headlines, the collected data can support later analysis and research.
Environment Setup
First, make sure the following libraries are installed in your Python environment:
```shell
pip install requests beautifulsoup4 tqdm lxml
```

Note that `concurrent.futures` is part of the Python standard library and does not need to be installed separately; `lxml` is included because it is used as the BeautifulSoup parser backend.
Walking Through the Code
We will go through the key parts of the implementation step by step.
1. Setting Up the Request Headers and Session
To mimic browser behavior, we configure suitable request headers on a shared session:
```python
import requests

session = requests.session()
session.headers['User-Agent'] = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'
)
session.headers['Referer'] = 'https://money.163.com/'
session.headers['Accept-Language'] = 'zh-CN,zh;q=0.9'
```
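If `requests` is not available, the same headers can be attached to a request with the standard library's `urllib` instead. This is a minimal sketch, not part of the original program; the target URL is simply the site's front page:

```python
import urllib.request

# The same browser-like headers, as a plain dict
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'),
    'Referer': 'https://money.163.com/',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

# Build a request object carrying the headers; note that urllib stores
# header names in capitalized form (e.g. 'User-agent')
req = urllib.request.Request('https://money.163.com/', headers=headers)
```

One advantage of a `requests` session over per-request `urllib` headers is that the session also reuses the underlying TCP connection across requests, which matters when fetching many pages.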
2. The Main Function and Data-Fetching Logic
The main function orchestrates the whole pipeline:
```python
import os
import re

import bag  # helper library used here for JSON I/O (bag.Bag.read_json / save_json)
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm


def main():
    base_url = [
        'https://money.163.com/special/00259BVP/news_flow_index.js?callback=data_callback',
        'https://money.163.com/special/00259BVP/news_flow_biz.js?callback=data_callback',
        'https://money.163.com/special/00259BVP/news_flow_fund.js?callback=data_callback',
        'https://money.163.com/special/00259BVP/news_flow_house.js?callback=data_callback',
        'https://money.163.com/special/00259BVP/news_flow_licai.js?callback=data_callback',
    ]
    kind = ['股票', '商業', '基金', '房產', '理財']
    path = r'./財經(根數據).json'
    save_path = r'./財經.json'

    # Load previously collected links, if any
    if os.path.isfile(path):
        source_ls = bag.Bag.read_json(path)
    else:
        source_ls = []

    # Collect article links from every channel
    urls = []
    for index, url in enumerate(base_url):
        urls += get_url(url, kind[index])

    # Keep only links we have not seen before (the doc URL is item [1])
    newly_added = []
    if len(source_ls) == 0:
        bag.Bag.save_json(urls, path)
        newly_added = urls
    else:
        seen = [i[1] for i in source_ls]
        for link in urls:
            if link[1] not in seen:
                newly_added.append(link)

    if len(newly_added) == 0:
        print('無新數據')
        return

    bag.Bag.save_json(newly_added + source_ls, path)

    if os.path.isfile(save_path):
        data_result = bag.Bag.read_json(save_path)
    else:
        data_result = []

    # Fetch the article bodies concurrently
    with ThreadPoolExecutor(max_workers=20) as t:
        tasks = [t.submit(get_data, url) for url in tqdm(newly_added, desc='網易財經')]
        end = [task.result() for task in tqdm(tasks, desc='網易財經')]
    bag.Bag.save_json(end + data_result, save_path)
```
3. Fetching URLs and Article Content
The `get_url` function collects article links and related metadata from a channel's paginated feed:
```python
def get_url(url, kind):
    num = 1
    result = []
    while True:
        # Page 1 uses the base URL; later pages insert _02, _03, ... before ".js"
        if num == 1:
            resp = session.get(url)
        elif num < 10:
            resp = session.get(url.replace('.js?callback=data_callback', '')
                               + f'_0{num}' + '.js?callback=data_callback')
        else:
            resp = session.get(url.replace('.js?callback=data_callback', '')
                               + f'_{num}' + '.js?callback=data_callback')
        if resp.status_code == 404:  # no more pages
            break
        num += 1

        # Pull the fields out of the JSONP payload with regular expressions
        title = re.findall(r'"title":"(.*?)"', resp.text, re.S)
        docurl = re.findall(r'"docurl":"(.*?)"', resp.text, re.S)
        label = re.findall(r'"label":"(.*?)"', resp.text, re.S)
        keyword = re.findall(r'"keywords":\[(.*?)]', resp.text, re.S)

        # Flatten each article's keyword objects into one comma-separated string
        mid = []
        for k in keyword:
            names = [j.strip() for j in re.findall(r'"keyname":"(.*?)"', str(k), re.S)]
            mid.append(','.join(names))

        for i in range(len(title)):
            result.append([title[i], docurl[i], label[i], kind, mid[i]])
    return result
```
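The paging scheme in `get_url` is easy to get wrong, so here it is isolated into a tiny helper. `page_url` is not part of the original code, just a sketch of the same URL pattern:

```python
def page_url(base, num):
    # Page 1 is the base URL; later pages insert _0N (or _NN from page 10 on)
    # between the feed name and ".js"
    if num == 1:
        return base
    suffix = f'_0{num}' if num < 10 else f'_{num}'
    return base.replace('.js?callback=data_callback', '') + suffix + '.js?callback=data_callback'
```

The loop in `get_url` then just increments the page number until the server answers 404, which is how the feed signals the last page.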
The `get_data` function downloads each collected link and extracts the article body:

```python
def get_data(ls):
    # ls = [title, docurl, label, kind, keywords]
    resp = session.get(ls[1])
    resp.encoding = 'utf8'
    resp.close()

    html = BeautifulSoup(resp.text, 'lxml')
    content = []
    p = re.compile(r'<p.*?>(.*?)</p>', re.S)
    contents = html.find_all('div', class_='post_body')
    # Grab every <p> inside the article body, then strip any remaining tags
    for info in re.findall(p, str(contents)):
        content.append(re.sub('<.*?>', '', info))
    return [ls[-1], ls[0], '\n'.join(content), ls[-2], ls[1]]
```
Running the Program
Finally, invoke the main function from the program's entry point:

```python
if __name__ == '__main__':
    main()
```
Summary
In this tutorial we built an automated data-collection and processing program in Python. It fetches financial news from the specified NetEase Finance feeds and saves the results to local JSON files, making it easy to gather and manage large amounts of financial information for later analysis and research.

I hope this article was helpful. If you have questions or suggestions, feel free to leave a comment.