[Web Scraping in Practice] Scraping Weather Data with Python

🍉CSDN小墨&曉末:https://blog.csdn.net/jd1813346972

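About the author: first-year graduate student | Statistics | sharing practical tips
Proficient in Python, Matlab, R, and other mainstream programming tools
Winner of more than ten national-level competition awards; has taken part in industry-funded research projects with budgets of 100,000 and 400,000 yuan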

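Contents

1 Introduction to Python web scraping
2 Scraping in practice
- 2.1 Importing the required packages
- 2.2 Setting the time range and city
- 2.3 Setting the fields to scrape
- 2.4 Fetching the weather data
- 2.4 Storing the results
- 2.5 Results
3 Complete code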

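This article walks through scraping city weather data. It mainly involves working out the URLs of the data files and matching fields with regular expressions, and it works well as a practice project for web scraping.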

1 Introduction to Python web scraping

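Python is a popular programming language used to build many kinds of applications, including web crawlers. A web crawler is an automated program that browses the internet and collects data. Python offers many libraries and tools that make crawlers straightforward to build.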

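The basic idea is to imitate how a person browses the web: the crawler sends a request to the target site, receives the page's HTML, and then parses that HTML to extract the data it needs. In Python, the requests and BeautifulSoup libraries are the usual tools for this.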
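The request-then-parse flow described above can be sketched as follows. To keep the sketch runnable offline, a hard-coded HTML string stands in for the text that `requests.get(url).text` would return, and the standard-library `html.parser` module stands in for BeautifulSoup. The sample page, `LinkCollector`, and `extract_links` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page content; in a real crawler this would be the text
# returned by requests.get(url).text
SAMPLE_HTML = """
<html><body>
  <h1>Weather history</h1>
  <a href="/history/2023-01">January 2023</a>
  <a href="/history/2023-02">February 2023</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Parse an HTML string and return the link targets it contains."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


print(extract_links(SAMPLE_HTML))
```

With BeautifulSoup the parsing step would usually be shorter, but the overall shape (get the page text, then pull out the pieces you need) stays the same.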

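A Python crawler is typically made up of the following parts:

Scheduler: coordinates the work of the URL manager, the downloader, and the parser.
URL manager: keeps track of the URLs waiting to be crawled and the URLs already crawled, which prevents fetching the same URL twice or crawling in a loop.
Web downloader: downloads the HTML of pages from the internet.
Web parser: parses the HTML and extracts the data that is needed.
Application: the application built from the useful data extracted from the pages.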
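The parts listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the downloader and parser are passed in as plain functions, and a small in-memory dictionary (`FAKE_SITE`) replaces real network access. All names here are hypothetical.

```python
from collections import deque


class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already seen."""

    def __init__(self):
        self.pending = deque()
        self.seen = set()

    def add(self, url):
        # Skip anything queued or crawled before, which prevents
        # duplicate and circular fetches.
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def has_next(self):
        return bool(self.pending)

    def next(self):
        return self.pending.popleft()


def crawl(start_url, download, parse):
    """Scheduler: coordinates the URL manager, downloader, and parser.

    download(url) returns page text; parse(text) returns
    (records, new_urls). Both are injected so the sketch runs offline.
    """
    manager = UrlManager()
    manager.add(start_url)
    results = []
    while manager.has_next():
        url = manager.next()
        page = download(url)
        records, new_urls = parse(page)
        results.extend(records)
        for new_url in new_urls:
            manager.add(new_url)
    return results


# Hypothetical in-memory "site": each page holds one record and some links.
FAKE_SITE = {
    "a": ("record-a", ["b", "c"]),
    "b": ("record-b", ["a"]),  # links back to "a", forming a cycle
    "c": ("record-c", []),
}


def fake_download(url):
    return url  # the "page text" is just the dictionary key in this toy


def fake_parse(page):
    record, links = FAKE_SITE[page]
    return [record], links


collected = crawl("a", fake_download, fake_parse)
print(collected)
```

Page "b" links back to "a", yet each page is visited only once because the URL manager remembers what it has already seen.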

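When developing a crawler in Python, respect the rules in the site's robots.txt file so that you do not put unnecessary load on the site or break laws and regulations. The crawler also needs to handle network exceptions and errors to stay stable and reliable.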
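Both points can be sketched with the standard library. The robots.txt content below is a made-up example, and `build_robot_parser` and `fetch_with_retry` are hypothetical helper names; the fetch function is injected so the sketch runs without network access. In a real crawler it could be a small wrapper around `requests.get`.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the site's /robots.txt before fetching anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robot_parser(robots_text):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def fetch_with_retry(fetch, url, retries=3, delay=0.0):
    """Call fetch(url), retrying on OSError up to `retries` times."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError as error:  # network errors are usually OSError subclasses
            last_error = error
            time.sleep(delay * (attempt + 1))  # simple linear back-off
    raise last_error


robots = build_robot_parser(ROBOTS_TXT)
print(robots.can_fetch("*", "http://example.com/private/data"))
print(robots.can_fetch("*", "http://example.com/weather/2023"))
```

Checking `can_fetch` before each request and wrapping the request in a retry helper keeps the crawler polite and lets it survive short network hiccups.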

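In short, Python is well suited to web crawler development. With the relevant libraries and tools, developers can build efficient, stable crawlers that collect the data they need from the internet.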

2 Scraping in practice

2.1 Importing the required packages

import requests
import pandas as pd
import re

2.2 Setting the time range and city

months = [1,2,3,4,5,6,7,8,9,10,11,12]
years = [2016,2017,2018,2019,2020,2021,2022,2023] 
citys = [59287]

The city code used here is 59287; in practice you can pick a different area or list several areas.
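When several cities are listed, it can help to see how many requests the settings imply. A small sketch using `itertools.product`; the name `tasks` is introduced here for illustration only and is not part of the original script.

```python
from itertools import product

months = list(range(1, 13))       # 1..12
years = list(range(2016, 2024))   # 2016..2023
citys = [59287]                   # append more city codes to crawl several areas

# Every (city, year, month) combination the crawler will request
tasks = list(product(citys, years, months))
print(len(tasks))
print(tasks[0], tasks[-1])
```

Each extra city code adds another 96 monthly requests for this time range.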

2.3 Setting the fields to scrape

# Weather fields to collect
index_ = ['MaxTemp', 'MinTemp', 'WindDir', 'Wind', 'Weather', 'Aqi', 'AqiInfo', 'AqiLevel']

2.4 Fetching the weather data

data = pd.DataFrame(columns=index_)  # start with an empty DataFrame
for c in citys:
    for y in years:
        for m in months:
            # Build the URL of the monthly data file; the code uses one
            # path format up to 2017-11 and another from 2017-12 onward
            if (y < 2017) or (y == 2017 and m <= 11):
                url = "http://tianqi.2345.com/t/wea_history/js/" + str(c) + "_" + str(y) + str(m) + ".js"
            else:
                url = "http://tianqi.2345.com/t/wea_history/js/" + str(y) + str(m).zfill(2) + "/" + str(c) + "_" + str(y) + str(m).zfill(2) + ".js"
            print(url)
            response = requests.get(url=url)
            if response.status_code == 200:  # only parse when the request succeeded
                response2 = response.text.replace("'", '"')  # optional: normalize quotes
                # Extract each field with a regular expression (other approaches work too)
                date = re.findall(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", response2)[:-2]
                mintemp = re.findall('yWendu:"(.*?)℃', response2)
                maxtemp = re.findall('bWendu:"(.*?)℃', response2)
                winddir = re.findall('fengxiang:"([\u4E00-\u9FA5]+)', response2)
                wind = re.findall('fengli:"([\u4E00-\u9FA5]+)', response2)
                weather = re.findall('tianqi:"([\u4E00-\u9FA5]+)~?', response2)
                aqi = re.findall(r'aqi:"(\d*)', response2)
                aqiInfo = re.findall('aqiInfo:"([\u4E00-\u9FA5]+)', response2)
                aqiLevel = re.findall(r'aqiLevel:"(\d*)', response2)
                data_spider = pd.DataFrame([maxtemp, mintemp, winddir, wind, weather, aqi, aqiInfo, aqiLevel]).T
                data_spider.columns = index_  # set the column names
                data_spider.index = date  # use the dates as the index
                data = pd.concat((data, data_spider), axis=0)  # append this month to the full table
                print('Fetched data for %s-%s' % (y, m))
            else:
                print('No data for %s-%s' % (y, m))
                break
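The URL construction and the regular-expression step can be tried in isolation. The sketch below wraps the two URL formats from the loop in a hypothetical helper, `build_history_url`, and runs a few of the patterns against a made-up payload. The payload's layout is an assumption that mirrors the field names used in the patterns, not a captured response from the site, so the `[:-2]` slice from the loop is not applied here.

```python
import re

BASE = "http://tianqi.2345.com/t/wea_history/js/"


def build_history_url(city, year, month):
    """Return the monthly data URL using the two formats from the loop."""
    if (year, month) <= (2017, 11):
        # Older format: month is not zero-padded, no sub-directory
        return f"{BASE}{city}_{year}{month}.js"
    ym = f"{year}{month:02d}"
    return f"{BASE}{ym}/{city}_{ym}.js"


# Hypothetical payload for illustration only
SAMPLE = (
    'var weather_str={city:"demo",tqInfo:['
    '{ymd:"2018-01-01",bWendu:"20℃",yWendu:"12℃",tianqi:"多雲",'
    'fengxiang:"北風",fengli:"微風",aqi:"45",aqiInfo:"優",aqiLevel:"1"},'
    '{ymd:"2018-01-02",bWendu:"18℃",yWendu:"11℃",tianqi:"陰~小雨",'
    'fengxiang:"東北風",fengli:"微風",aqi:"60",aqiInfo:"良",aqiLevel:"2"}'
    ']};'
)


def parse_month(text):
    """Extract (date, max temp, min temp, weather) tuples from the text."""
    dates = re.findall(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", text)
    max_temps = re.findall('bWendu:"(.*?)℃', text)
    min_temps = re.findall('yWendu:"(.*?)℃', text)
    # Only the Chinese characters before any "~" are captured
    weather = re.findall('tianqi:"([\u4E00-\u9FA5]+)', text)
    return list(zip(dates, max_temps, min_temps, weather))


print(build_history_url(59287, 2016, 1))
print(build_history_url(59287, 2018, 3))
for row in parse_month(SAMPLE):
    print(row)
```

Testing the patterns on a saved sample like this makes it easier to spot a field whose list comes back shorter than the others, which would misalign the columns when the lists are combined into a DataFrame.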

2.4 Storing the results

data.to_excel('D:\\天氣數據可視化\\天氣數據可視化.xlsx')
print('Scraped data:\n', data)
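Writing the file fails if the target folder does not exist yet, and writing .xlsx with pandas also needs an Excel writer package such as openpyxl. A minimal sketch of creating the folder before saving is shown below. To stay runnable with only the standard library it writes CSV through the `csv` module instead of calling `to_excel`; `save_rows` and the sample row are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def save_rows(rows, header, out_path):
    """Write rows to a CSV file, creating parent folders if needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # utf-8-sig helps Excel detect the encoding of non-ASCII text
    with out_path.open("w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return out_path


# Demo in a temporary folder so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "weather" / "weather.csv"
    saved = save_rows([["2018-01-01", "20", "12"]],
                      ["Date", "MaxTemp", "MinTemp"], target)
    print(saved.exists())
```

The same `mkdir(parents=True, exist_ok=True)` call can be placed before `data.to_excel(...)` in the script above.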

2.5 Results

3 Complete code

import requests
import pandas as pd
import re

months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
years = [2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]
citys = [59287]
index_ = ['MaxTemp', 'MinTemp', 'WindDir', 'Wind', 'Weather', 'Aqi', 'AqiInfo', 'AqiLevel']  # weather fields to collect

data = pd.DataFrame(columns=index_)  # start with an empty DataFrame
for c in citys:
    for y in years:
        for m in months:
            # Build the URL of the monthly data file; the code uses one
            # path format up to 2017-11 and another from 2017-12 onward
            if (y < 2017) or (y == 2017 and m <= 11):
                url = "http://tianqi.2345.com/t/wea_history/js/" + str(c) + "_" + str(y) + str(m) + ".js"
            else:
                url = "http://tianqi.2345.com/t/wea_history/js/" + str(y) + str(m).zfill(2) + "/" + str(c) + "_" + str(y) + str(m).zfill(2) + ".js"
            print(url)
            response = requests.get(url=url)
            if response.status_code == 200:  # only parse when the request succeeded
                response2 = response.text.replace("'", '"')  # optional: normalize quotes
                # Extract each field with a regular expression (other approaches work too)
                date = re.findall(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", response2)[:-2]
                mintemp = re.findall('yWendu:"(.*?)℃', response2)
                maxtemp = re.findall('bWendu:"(.*?)℃', response2)
                winddir = re.findall('fengxiang:"([\u4E00-\u9FA5]+)', response2)
                wind = re.findall('fengli:"([\u4E00-\u9FA5]+)', response2)
                weather = re.findall('tianqi:"([\u4E00-\u9FA5]+)~?', response2)
                aqi = re.findall(r'aqi:"(\d*)', response2)
                aqiInfo = re.findall('aqiInfo:"([\u4E00-\u9FA5]+)', response2)
                aqiLevel = re.findall(r'aqiLevel:"(\d*)', response2)
                data_spider = pd.DataFrame([maxtemp, mintemp, winddir, wind, weather, aqi, aqiInfo, aqiLevel]).T
                data_spider.columns = index_  # set the column names
                data_spider.index = date  # use the dates as the index
                data = pd.concat((data, data_spider), axis=0)  # append this month to the full table
                print('Fetched data for %s-%s' % (y, m))
            else:
                print('No data for %s-%s' % (y, m))
                break

data.to_excel('D:\\天氣數據可視化\\天氣數據可視化.xlsx')
print('Scraped data:\n', data)
