Web Scraping Series: Data Parsing with JSON (Part 3)

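Prerequisites

1. json.dumps(): Python data to a JSON string

2. json.dump(): Python data to JSON, written to a file

3. json.loads(): JSON string to Python data

4. json.load(): JSON to Python data (more convenient for file operations)

5. Practical example

Complete code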

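Prerequisites

1. No file operations involved
JSON string => converted to => Python data type: json.loads()
Python data type => converted to => JSON string: json.dumps()
2. File operations involved
File-like object containing JSON => converted to => Python data type: json.load()
Python data type => converted to => file-like object containing JSON: json.dump()
Summary: the functions without the trailing "s" are the ones that work on file objects.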
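The four functions can be compared side by side in a minimal sketch. The sample dict is made up for illustration, and io.StringIO is used here only as an in-memory stand-in for an open text file.

```python
import io
import json

data = {"name": "demo", "tags": ["a", "b"], "count": 3}  # made-up sample data

# No file involved: dumps() returns a str, loads() parses a str.
text = json.dumps(data)
print(text)  # {"name": "demo", "tags": ["a", "b"], "count": 3}
assert json.loads(text) == data

# File-like object involved: dump() writes to it, load() reads from it.
buffer = io.StringIO()   # stands in for an open text file
json.dump(data, buffer)
buffer.seek(0)           # rewind before reading back
restored = json.load(buffer)
print(restored == data)  # True
```

Both routes produce the same Python object; the only difference is whether the JSON text lives in a string or in a file-like object.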

JSON is used for data exchange:

JS (front end) -> JSON -> Python (back end)
Python (back end) -> JSON -> JS (front end)
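Because JSON has its own small set of types, some Python values change shape during the exchange. The sketch below uses a made-up dict to show the conversions performed by the standard json module.

```python
import json

# Made-up values chosen to show how Python types map to JSON types.
py_value = {"ok": True, "missing": None, "point": (1, 2), "ratio": 0.5}

text = json.dumps(py_value)
print(text)  # {"ok": true, "missing": null, "point": [1, 2], "ratio": 0.5}

back = json.loads(text)
print(back["point"])  # [1, 2]  (the tuple came back as a list)
```

True and None become true and null on the JSON side, and a tuple is written as a JSON array, so it returns as a list rather than a tuple.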

1. json.dumps(): Python data to a JSON string

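import json

# 1. Start with a Python dict
dic = {'a': 1, 'b': 2}
print(type(dic))  # <class 'dict'>
# 2. Convert the Python data to JSON
json1 = json.dumps(dic)
# 3. Print the result
print(type(json1))  # <class 'str'>: a JSON string
# 4. Add key-value pairs
dic[('2', 1)] = '元組'  # legal in Python, but a tuple cannot be a JSON key
dic["c"] = '元組'       # a string key serializes without problems
# 5. JSON keys cannot be tuples: skipkeys=True skips such keys instead of raising,
#    and ensure_ascii=False writes non-ASCII characters as-is instead of \u escapes
json2 = json.dumps(dic, skipkeys=True, ensure_ascii=False)
print(json2)  # {"a": 1, "b": 2, "c": "元組"}
print(type(json2))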
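To see what the two options change, the following sketch runs json.dumps() with and without them. The dict contents and the city value are sample data chosen for illustration.

```python
import json

dic = {"a": 1, ("2", 1): "tuple key"}  # sample data with one tuple key

error = None
try:
    json.dumps(dic)  # skipkeys defaults to False, so the tuple key is rejected
except TypeError as exc:
    error = exc
print("without skipkeys:", type(error).__name__)  # TypeError

skipped = json.dumps(dic, skipkeys=True)
print(skipped)  # {"a": 1}

escaped = json.dumps({"city": "深圳"})  # default ensure_ascii=True
readable = json.dumps({"city": "深圳"}, ensure_ascii=False)
print(escaped)   # non-ASCII characters appear as \uXXXX escapes
print(readable)  # {"city": "深圳"}
```

Note that skipkeys=True drops the offending entry silently, so it is worth checking that no needed data is lost this way.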

2. json.dump(): Python data to JSON, written to a file

import json

# 1. A Python dict
dic = {'a': 1, 'b': 2}
with open('text.json', 'w', encoding='utf-8') as f:
    # Argument 1: the data to serialize as JSON and save
    # Argument 2: the file object to write to
    json.dump(dic, f, skipkeys=True, ensure_ascii=False)
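When the saved file is meant to be read by people, the indent parameter of json.dump() may help. This is a small sketch that writes to a temporary directory (an assumption made here so the example does not touch real files) and prints what ended up on disk.

```python
import json
import os
import tempfile

dic = {"a": 1, "b": 2, "c": "元組"}

# A temporary directory keeps the sketch from touching real files.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text.json")
    with open(path, "w", encoding="utf-8") as f:
        # indent=2 puts each key on its own line, indented by two spaces
        json.dump(dic, f, ensure_ascii=False, indent=2)
    with open(path, "r", encoding="utf-8") as f:
        written = f.read()

print(written)
```

The file content is still valid JSON; only the whitespace differs from the compact default.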

3. json.loads(): JSON string to Python data

import json

# text.json holds the JSON data saved in the previous section
with open('text.json', 'r', encoding='utf-8') as f:
    data = f.read()
    print(data)
    print(type(data))  # <class 'str'>: a JSON string
    # Convert
    dic = json.loads(data)  # Python dict
    print(type(dic))

4. json.load(): JSON to Python data (more convenient for file operations)

import json

# text.json holds the JSON data saved earlier
with open('text.json', 'r', encoding='utf-8') as f:
    # Convert directly from the file object
    dic = json.load(f)  # Python dict
    print(f"Dict data: {dic}\nType: {type(dic)}")

5. Practical example

Goal: scrape the title, city, and date of each job posting on Tencent Careers.

Link: Search | Tencent Careers

Steps:

1. Find the target URL

Target URL: https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1740842427771&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=1&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn

2. Analyze the response data

To verify further, copy the entire response into an online code formatter and check that it is valid JSON.
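The same check can also be done in code. The helper below is a hypothetical convenience function (its name, is_json, is not from the original article); it relies on json.loads() raising json.JSONDecodeError for text that is not valid JSON.

```python
import json

def is_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

print(is_json('{"Data": {"Posts": []}}'))  # True
print(is_json("<html>not json</html>"))    # False
```

This is handy when an endpoint sometimes returns an HTML error page instead of the expected JSON.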

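3. Analyze the data content

After conversion to a dict, the data we need sits at this nesting level:
Data > Posts > list item n > RecruitPostName (title), LocationName (city), LastUpdateTime (date)

Important: Posts is a list.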
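The nesting can be walked through offline before sending any request. The JSON text below is a hypothetical sample that only mimics the structure described above; its values are invented and the second entry deliberately lacks a date.

```python
import json

# Hypothetical sample that mimics the structure above; all values are invented.
sample_text = '''
{"Data": {"Posts": [
  {"RecruitPostName": "Example Engineer", "LocationName": "Example City",
   "LastUpdateTime": "example date"},
  {"RecruitPostName": "Example Analyst", "LocationName": "Another City"}
]}}
'''

result = json.loads(sample_text)
posts = result["Data"]["Posts"]  # a list of dicts, one per posting

rows = []
for post in posts:
    # .get() returns the default instead of raising KeyError if a field is absent
    rows.append((
        post.get("RecruitPostName", ""),
        post.get("LocationName", ""),
        post.get("LastUpdateTime", ""),
    ))

for row in rows:
    print(row)
```

Using .get() with a default is one possible design choice; indexing with post["LastUpdateTime"] is equally valid when every field is known to be present.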

Complete code

import json
import requests

# 1. Target URL
url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1740842427771&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=1&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn'
# 2. Identify as a browser
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36"
}
# 3. Send the request and get the response
response = requests.get(url=url, headers=header)
# 4. Print the response body
# print(response.text)
print(type(response.text))  # <class 'str'>: a JSON string
# 5. Convert the JSON string to Python data
result = json.loads(response.text)
print(type(result))  # <class 'dict'>
# 6. Extract the fields we need: title, city, date
title = result['Data']['Posts']
for tit in title:
    # print(tit)
    print(f"{tit['RecruitPostName']} ,{tit['LocationName']} ,{tit['LastUpdateTime']}")
    # To strip literal "\n" sequences from a field: some_string.replace(r'\n', '')
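A natural follow-up is to save the extracted fields rather than only print them. The sketch below combines the extraction with json.dump() and json.load(). The posts list, the file name posts.json, and the temporary directory are assumptions for illustration, standing in for the real response data and a real output path.

```python
import json
import os
import tempfile

# Hypothetical postings standing in for result['Data']['Posts']; values are invented.
posts = [
    {"RecruitPostName": "Example Engineer", "LocationName": "Example City",
     "LastUpdateTime": "example date", "Extra": "not needed"},
]

# Keep only the three fields the example cares about.
wanted = ("RecruitPostName", "LocationName", "LastUpdateTime")
records = [{key: post.get(key) for key in wanted} for post in posts]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "posts.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == records)  # True
```

In a real run, path would point to a permanent location and posts would come from the parsed response.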

Extension

Steps for scraping multiple pages

1. Get the URLs of the first, second, and third pages and compare them

# Page 1
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1740904582702&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=1&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
# Page 2
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1740904630357&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=1&keyword=&pageIndex=2&pageSize=10&language=zh-cn&area=cn
# Page 3
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1740904538329&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=1&keyword=&pageIndex=3&pageSize=10&language=zh-cn&area=cn

Tip: to test, paste any of the URLs above directly into the browser's address bar and inspect the response. You can also remove parameters one at a time to see whether the response changes.
The comparison shows only two differences:
1. pageIndex: the page number
2. timestamp: the timestamp

Testing shows that keeping the timestamp fixed has no effect on the returned data, so multi-page scraping only needs to vary the value of pageIndex.

# Pseudocode:
for i in range(1, 11):
    url = f"https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1740904630357&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=1&keyword=&pageIndex={i}&pageSize=10&language=zh-cn&area=cn"
    # This yields the URLs for ten pages; the rest of the scraping works the same way
    print(url)
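As an alternative to one long f-string, the query string can be assembled from a dict with urllib.parse.urlencode from the standard library. This is only a sketch: the helper name build_url and the constant BASE_URL are introduced here for illustration, while the parameter names and values are copied from the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "https://careers.tencent.com/tencentcareer/api/post/Query"

def build_url(page_index, page_size=10):
    """Build the query URL for one page; only pageIndex changes between pages."""
    params = {
        "timestamp": "1740904630357",  # fixed value, as in the pseudocode above
        "countryId": "",
        "cityId": "",
        "bgIds": "",
        "productId": "",
        "categoryId": "",
        "parentCategoryId": "",
        "attrId": "1",
        "keyword": "",
        "pageIndex": page_index,
        "pageSize": page_size,
        "language": "zh-cn",
        "area": "cn",
    }
    return f"{BASE_URL}?{urlencode(params)}"

urls = [build_url(i) for i in range(1, 11)]
for url in urls:
    print(url)
```

Keeping the parameters in a dict makes it easier to change a single value, such as pageSize or keyword, without editing the URL by hand.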
