[Web Scraping] Scraping Travel Reviews and Ratings

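Take "Pudacuo National Park" on Mafengwo as an example. It has more than 3,000 reviews, but not all of them are shown to users: only 5 pages are displayed, at 15 reviews per page, which comes to just 75 reviews. That is far too few.

So I came up with a way to scrape as many reviews as possible. As far as I understand the laws and regulations around scraping, collecting data that is publicly visible is legal. At the top of the review section the reviews are grouped by category. Each category also shows at most 5 pages, but going through every category still yields far more than passively scraping only the default 5 pages, so we scrape the reviews category by category.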

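Click "All" in the category bar, then right-click and choose Inspect, or press F12, to locate the "All" element.

Collapse that element and you will see a list of <li></li> tags that holds each category's name, type, id, and so on. If there are many categories, you can collect them automatically with selenium XPath queries and then plug them into the code. Since I am only demonstrating one example and there are few category tags, I copied the list straight into the code.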

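Note that we need two attribute values from each tag:

data-type, data-category

This is how I store them (marked as the category ids in the code):

data-type: a = [0,0,1,1,1,2,2,2,2,2,0]
data-category: b = [0,2,13,12,11,134700810,173942219,112047583,112968615,143853527,1]

Note that a[i] and b[i] correspond one to one with the <li></li> tags on the page, in the same order. The order must not be changed.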
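If copying the values by hand becomes tedious, the two lists can be derived from the markup itself. The following is a minimal sketch that uses Python's standard `html.parser` rather than the selenium approach mentioned above. `SAMPLE_HTML`, its English labels, and the helper names are hypothetical, and the attribute is assumed to be spelled `data-category`, matching the `category` key in the request; check the real page before relying on it.

```python
from html.parser import HTMLParser

# Hypothetical markup modeled on the category list described in the article;
# the real page structure and labels may differ.
SAMPLE_HTML = """
<ul>
  <li data-type="0" data-category="0"><a>All</a></li>
  <li data-type="0" data-category="2"><a>With photos</a></li>
  <li data-type="1" data-category="13"><a>Positive</a></li>
</ul>
"""


class CategoryParser(HTMLParser):
    """Collect (data-type, data-category) pairs from <li> tags in document order."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        attr_map = dict(attrs)
        # Skip <li> tags that do not carry both attributes.
        if "data-type" in attr_map and "data-category" in attr_map:
            self.pairs.append(
                (int(attr_map["data-type"]), int(attr_map["data-category"]))
            )


def extract_categories(html_text):
    """Return the list of (type, category) pairs found in html_text."""
    parser = CategoryParser()
    parser.feed(html_text)
    return parser.pairs


pairs = extract_categories(SAMPLE_HTML)
a = [type_id for type_id, _ in pairs]          # data-type values
b = [category_id for _, category_id in pairs]  # data-category values
print(a, b)
```

Because both lists are built from the same pass over the tags, a[i] and b[i] stay aligned automatically.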

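Open the Network tab and press Ctrl+R to refresh.

Find the request whose Name starts with poiCommentListApi? and click Headers. The request URL shown there is the comment_url in the code (marked ①); replace it as needed for your own target.

Scroll down to Request Headers and note the two parameters 'Referer' and 'User-Agent'; replace them as needed (marked ② and ③ in the code).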

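Click Payload. If it looks different from what you expect, click any one of the category tags on the page, then scroll down the Name list and look for the entries that start with poiCommentListApi? (there will be one per click; read them from last to first to spot the pattern).

Select the last entry whose Name starts with poiCommentListApi?, click Payload, and look at the params parameter.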

So for a single attraction, the parameters that change are the review category (determined by type and category) and the page number (1 to 5).
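To make the varying parameters concrete, here is a small sketch that enumerates every (category, page) combination and builds the `params` value for each one. The helper name `build_params` is hypothetical, and `json.dumps` is used in place of string formatting; the field names and the `poi_id` value 3110 are taken from the request in this article.

```python
import json
from itertools import product

POI_ID = "3110"  # poi_id taken from the article's example request

# Category ids from the article: a[i] is the type, b[i] is the category.
a = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 0]
b = [0, 2, 13, 12, 11, 134700810, 173942219, 112047583, 112968615, 143853527, 1]


def build_params(poi_id, type_id, category_id, page):
    """Return the query dict for one page of one review category."""
    payload = {
        "poi_id": poi_id,
        "type": str(type_id),
        "category": str(category_id),
        "page": str(page),
        "just_comment": 1,
    }
    # Compact separators reproduce the JSON string used in the request.
    return {"params": json.dumps(payload, separators=(",", ":"))}


# 11 categories x 5 pages = 55 requests in total.
all_requests = [
    build_params(POI_ID, type_id, category_id, page)
    for (type_id, category_id), page in product(zip(a, b), range(1, 6))
]
print(len(all_requests))
print(all_requests[0]["params"])
```

Each dict can be passed as the `params` argument of `requests.get`, which handles URL encoding of the JSON string.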

With the analysis done, we can write the code.


import re
import time
import requests
import pandas as pd

# ① Comment-list API endpoint
comment_url = 'http://pagelet.mafengwo.cn/poi/pagelet/poiCommentListApi?'
requests_headers = {
    'Referer': 'https://www.mafengwo.cn/poi/3110.html',  # ②
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',  # ③
}

# Category ids: a[i] is data-type, b[i] is data-category (order must match)
a = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 0]
b = [0, 2, 13, 12, 11, 134700810, 173942219, 112047583, 112968615, 143853527, 1]

# Iterate through the 11 comment categories
for i in range(len(a)):
    # Each category exposes at most five pages
    for page_num in range(1, 6):
        print('Fetching page', page_num)
        requests_data = {
            'params': '{"poi_id":"3110","type":"%d","category":"%d","page":"%d","just_comment":1}'
                      % (a[i], b[i], page_num)
        }
        response = requests.get(url=comment_url, headers=requests_headers, params=requests_data)
        if response.status_code == 200:
            # The response carries escaped HTML; unescape it before matching
            page = response.content.decode('unicode-escape', 'ignore').encode('utf-8', 'ignore').decode('utf-8')
            page = page.replace('\\/', '/')

            # The page markup is Chinese, so the literal link text is matched in
            # Chinese (添加评论 / 评论); verify against the live HTML
            date_pattern = r'<a class="btn-comment _j_comment" title="添加评论">评论</a>.*?\n.*?<span class="time">(.*?)</span>'
            date_list = re.compile(date_pattern).findall(page)
            star_pattern = r'<span class="s-star s-star(\d)"></span>'
            star_list = re.compile(star_pattern).findall(page)
            comment_pattern = r'<p class="rev-txt">([\s\S]*?)</p>'
            comment_list = re.compile(comment_pattern).findall(page)

            # Strip HTML remnants and decorative characters from each comment
            best_comment = []
            for idx in range(len(date_list)):
                comment = str(comment_list[idx]).replace('&nbsp;', '')
                for junk in ('<br>', '<br />', '\n', '【', '】', '~', '*'):
                    comment = comment.replace(junk, '')
                best_comment.append(comment)

            # Append this page's rows to the CSV (no header row is written)
            df = pd.DataFrame({'date': date_list, 'rating': star_list, 'comment': best_comment})
            df.to_csv('mafengwo.csv', mode='a', encoding='gb18030', index=False, header=None)
            print('Write successful')
        else:
            print('Fetch failed')
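The chain of replace calls in the script can also be pulled out into a standalone helper, which makes the cleaning step easy to check without touching the network. This is only a sketch: the name `clean_comment` and the sample string are hypothetical, and the set of stripped characters simply mirrors the one used above.

```python
import re

# <br>, <br/>, and <br /> in any letter case
_BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)
# Decorative characters and newlines removed by the original script
_NOISE_RE = re.compile(r"[【】~*\n]")


def clean_comment(raw):
    """Strip HTML remnants and decorative characters from one review text."""
    text = raw.replace("&nbsp;", "")
    text = _BR_RE.sub("", text)
    return _NOISE_RE.sub("", text)


print(clean_comment("Great&nbsp;view<br />【tip】~go*early\n"))
```

In the main loop, `best_comment` could then be built as `[clean_comment(c) for c in comment_list]`.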

