Scraping Boss Zhipin job data with Python requests
Libraries needed: requests, lxml
If you don't have them, install them with pip:
pip install requests
pip install lxml
This example searches for python jobs located in Beijing
The data scraped is the job title, salary, and location
First import the modules we need
import requests
from lxml import etree
The job title can be read up front with input() and passed into the url
job = input('Enter a job title: ')
Assign the url to visit to a variable
url = 'https://www.zhipin.com/job_detail/?query=%s&city=101010100&industry=&position='%job
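In query=%s the %s is a placeholder; the %job after the closing quote supplies the value that fills it
Requests to this page need headers, which lowers the chance of being identified as a crawler
On the search page press F12 and click Network; if nothing is listed, refresh the page
Only two headers are used here: user-agent and cookie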
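The % substitution inserts the job title into the url exactly as typed. A minimal sketch of building the same url with the standard library's urllib.parse.urlencode, which percent-encodes spaces and non-ASCII characters. The helper name build_search_url is hypothetical; the parameter names and the city code are the ones already in the url above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.zhipin.com/job_detail/'

def build_search_url(job, city='101010100'):
    """Return the search url for a job title (hypothetical helper)."""
    params = {
        'query': job,
        'city': city,       # 101010100 is the Beijing code used in this post
        'industry': '',
        'position': '',
    }
    # urlencode escapes each value and joins the pairs with '&'
    return BASE_URL + '?' + urlencode(params)

print(build_search_url('python'))
```

requests.get also accepts a params= dictionary and performs the same encoding, so either approach should work.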
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
'cookie': '_uab_collina=157853739340991408682799; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1578537393,1578554153; __c=1578554153; __g=-; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1578554168; __zp_stoken__=d0e7eq77rh2ql3R%2F5VwP4mPjHKu%2BjYVQMbIFSPnpEWipSXfKaWf%2FM%2FxBRat22vE%2FR4PdiD%2BDhDiSNaW%2FTjVMpYOEMTTUmxg7WSFqYfpdWi5SSIMEcHuwoKbmd%2B6tlv5ONmSF; __l=l=%2Fwww.zhipin.com%2Fjob_detail%2F%3Fquery%3D%25E4%25BA%25BA%25E5%25B7%25A5%25E6%2599%25BA%25E8%2583%25BD%26city%3D101010100%26industry%3D%26position%3D&r=&friend_source=0&friend_source=0; __a=32343010.1578537387.1578537387.1578554153.9.2.2.9'
}
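A cookie pasted into the source goes stale and has to be edited by hand each time. A minimal sketch, assuming the cookie is kept in an environment variable instead; the variable name BOSS_COOKIE and the helper build_headers are hypothetical, not something the site or requests defines.

```python
import os

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36')

def build_headers(cookie=None):
    """Build the two request headers used in this post.

    If no cookie is passed, fall back to the BOSS_COOKIE environment
    variable (a hypothetical name chosen for this sketch).
    """
    if cookie is None:
        cookie = os.environ.get('BOSS_COOKIE', '')
    headers = {'user-agent': USER_AGENT}
    if cookie:
        # only send the cookie header when there is a value to send
        headers['cookie'] = cookie
    return headers
```

The result can be passed straight to requests.get(url, headers=build_headers()).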
Send the request and keep the response body as text
res = requests.get(url,headers=headers).text
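The one-liner above waits indefinitely if the server never answers and returns error pages as if they were results. A minimal sketch that adds a timeout and a status check; fetch_html is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests

def fetch_html(url, headers, timeout=10, session=None):
    """Fetch a page and return its text, raising on HTTP errors.

    `session` may be a requests.Session; the module-level requests.get
    is used when it is omitted.
    """
    http = session if session is not None else requests
    response = http.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx
    return response.text
```

With this helper the line above becomes res = fetch_html(url, headers).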
Parse the response with etree.HTML
html = etree.HTML(res)
# job title
job_name = html.xpath('//*[@id="main"]/div/div[2]/ul/li/div/div[1]/h3/a/div[1]/text()')
# salary
salary = html.xpath('//*[@id="main"]//ul/li//h3/a/span/text()')
# location, work experience, education
site = html.xpath('//*[@id="main"]/div/div[2]/ul/li/div/div[1]/p/text()')
print('Job title:',job_name)
print('Salary:',salary)
print('Location:',site)
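The three queries above return three independent lists, so if one listing lacks a field the lists fall out of step. A minimal sketch that selects each li first and then reads the fields with relative XPaths. SAMPLE_HTML is a simplified, made-up snippet that only mirrors the nesting implied by the XPaths above; it is not the live site's markup, and the helper names are hypothetical.

```python
from lxml import etree

# Hypothetical sample page: the second listing has no salary span
SAMPLE_HTML = '''
<div id="main"><ul>
  <li><div><div>
    <h3><a><div>Python Developer</div><span>15-25K</span></a></h3>
    <p>Beijing</p>
  </div></div></li>
  <li><div><div>
    <h3><a><div>Data Engineer</div></a></h3>
    <p>Beijing</p>
  </div></div></li>
</ul></div>
'''

def first_text(node, path):
    """Return the first non-empty text match for a relative XPath, or ''."""
    for text in node.xpath(path):
        if text.strip():
            return text.strip()
    return ''

def parse_listings(page):
    """Return one dict per listing so the fields of a job stay together."""
    html = etree.HTML(page)
    rows = []
    for item in html.xpath('//*[@id="main"]//ul/li'):
        rows.append({
            'job_name': first_text(item, './/h3/a/div[1]/text()'),
            'salary': first_text(item, './/h3/a/span/text()'),
            'site': first_text(item, './/p/text()'),
        })
    return rows

for row in parse_listings(SAMPLE_HTML):
    print(row)
```

A missing field shows up as an empty string in that listing's row instead of shifting every later value.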
The script prints three lists: job titles, salaries, and locations
Full code
import requests
from lxml import etree
job = input('Enter a job title: ')
url = 'https://www.zhipin.com/job_detail/?query=%s&city=101010100&industry=&position='%job
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
'cookie': '_uab_collina=157853739340991408682799; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1578537393,1578554153; __c=1578554153; __g=-; __l=l=%2Fwww.zhipin.com%2Fjob_detail%2F%3Fquery%3D%25E4%25BA%25BA%25E5%25B7%25A5%25E6%2599%25BA%25E8%2583%25BD%26city%3D101010100%26industry%3D%26position%3D&r=&friend_source=0&friend_source=0; lastCity=101010100; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1578556532; __zp_stoken__=d0e7eq77rh2ql3R%2F5VwP4mPjHOT%2BY0u%2F2GMG6hriOPZlx6iA6NPb%2FycP1M1RRJxkLq%2FdiD%2BDhDiSNaW%2FTjVMpYOEMScFTSjVVO31G%2B8%2Bwf%2Bxs7gEcHuwoKbmd%2B6tlv5ONmSF; __a=32343010.1578537387.1578537387.1578554153.29.2.22.29'
}
res = requests.get(url,headers=headers).text
html = etree.HTML(res)
job_name = html.xpath('//*[@id="main"]/div/div[2]/ul/li/div/div[1]/h3/a/div[1]/text()')
salary = html.xpath('//*[@id="main"]//ul/li//h3/a/span/text()')
site = html.xpath('//*[@id="main"]/div/div[2]/ul/li/div/div[1]/p/text()')
print('Job title:',job_name)
print('Salary:',salary)
print('Location:',site)
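Printing whole lists is hard to read once there are many listings. A minimal sketch that pairs the three lists row by row with zip and writes them to a CSV file using only the standard library. The helper name save_jobs and the file name jobs.csv are hypothetical, and zip stops at the shortest list, so this assumes the three lists line up.

```python
import csv

def save_jobs(job_name, salary, site, path='jobs.csv'):
    """Write one CSV row per listing and return the number of rows written."""
    rows = list(zip(job_name, salary, site))
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['job_name', 'salary', 'site'])
        writer.writerows(rows)
    return len(rows)
```

Calling save_jobs(job_name, salary, site) after the XPath queries would write the results next to the script.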
Finally, note that the site's cookie is refreshed continually; if no data comes back, copy the current cookie value from the browser again
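A stale cookie does not raise an error by itself; the XPath queries simply return empty lists. A minimal sketch of a guard that makes this case visible. The helper name require_results is hypothetical, and an empty result can have other causes, such as a changed page layout.

```python
def require_results(job_name, salary, site):
    """Raise if every XPath query came back empty (hypothetical helper)."""
    if not (job_name or salary or site):
        raise RuntimeError(
            'No listings parsed: the cookie may have expired. '
            'Copy a fresh cookie from the Network tab and retry.'
        )
    return len(job_name)
```

Calling require_results(job_name, salary, site) before the print lines stops the script with a readable message instead of printing three empty lists.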