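Approach: 1. Get the number of result pages for the position search on Lagou (拉勾網)
2. Call the API to get the position ids
3. Visit each position page by id and match the keywords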
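Pages are fetched with unirest. Lagou has anti-crawler protection and restricts clients that make many requests in a short time, so the crawler does not use multithreading and waits 10 s after each position page. Pages are parsed with nokogiri, and a regular expression keeps only the English words in the skill requirements, so the data may not be entirely accurate.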
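To make the accuracy caveat concrete, here is a minimal sketch of the same letters-only pattern applied to a made-up skill description. The sample text is hypothetical and is not taken from Lagou. The sketch only illustrates how such a pattern treats names like "C++" or "Node.js".

```ruby
# Hypothetical sample text standing in for the skill section of a job posting.
sample = 'Familiar with Python or Java; C++ and Node.js a plus. Python scripting, Linux.'

# Same pattern as the crawler: runs of ASCII letters only, lowercased.
words = sample.scan(/[a-zA-Z]+/).map(&:downcase)

# Drop single-letter tokens, as the crawler's word_count does.
words.select! { |w| w.length > 1 }

# Count occurrences of each remaining token.
counts = Hash.new(0)
words.each { |w| counts[w] += 1 }

puts counts.inspect
```

In this sample "C++" disappears entirely (only "c" is matched, and it is then dropped as a single letter), "Node.js" is split into "node" and "js", and filler words such as "with" and "and" are counted like any skill name.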
The data is persisted to Excel, and Ruby ERB is used to generate the word_cloud report.
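The report template itself is not shown in this post, so the following is only a sketch of how ERB could embed the keyword data in an HTML page. The template markup, the `words` JavaScript variable, and the sample data are assumptions made for illustration. Only the name/value shape of `@data` follows the crawler.

```ruby
require 'erb'
require 'json'

# Hypothetical data in the shape the crawler stores in @data:
# a JSON array of name/value pairs.
@data = JSON.generate([{ 'name' => 'python', 'value' => 12 },
                       { 'name' => 'linux', 'value' => 7 }])

# Hypothetical template; a real report would also load a word-cloud chart library.
template = <<~HTML
  <html>
    <body>
      <div id="word_cloud"></div>
      <script>
        var words = <%= @data %>;
      </script>
    </body>
  </html>
HTML

# Render the template in the current scope so it can see @data.
html = ERB.new(template).result(binding)
puts html
```

Writing `html` to a file, for example with `File.write`, would then give a page that a chart script can turn into the word cloud.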
Crawler code:
require 'unirest'
require 'uri'
require 'nokogiri'
require 'json'
require 'win32ole'

# Search terms, kept in Chinese because they are sent to Lagou as-is:
# position "test development engineer", city "Hangzhou".
@position = '測試開發工程師'
@city = '杭州'

# Fetch a URL with the given HTTP method and return the response body
def query_url(method, url, headers: {}, parameters: nil)
  case method
  when :get
    Unirest.get(url, headers: headers).body
  when :post
    Unirest.post(url, headers: headers, parameters: parameters).body
  end
end

# Get the number of result pages
def get_page_num(url)
  html = query_url(:get, url).force_encoding('utf-8')
  html.scan(/<span class="span totalNum">(\d+)<\/span>/).first.first
end

# Get the ids of all positions listed on one result page
def get_positionsId(url, headers: {}, parameters: nil)
  response = query_url(:post, url, headers: headers, parameters: parameters)
  positions_id = Array.new
  response['content']['positionResult']['result'].each { |i| positions_id << i['positionId'] }
  positions_id
end

# Match the English keywords on a position page
def get_skills(url)
  puts "loading url: #{url}"
  html = query_url(:get, url).force_encoding('utf-8')
  doc = Nokogiri::HTML(html)
  data = doc.css('dd.job_bt')
  data.text.scan(/[a-zA-Z]+/)
end

# Count word frequencies
def word_count(arr)
  arr.map!(&:downcase)
  arr.select! { |i| i.length > 1 }
  counter = Hash.new(0)
  arr.each { |k| counter[k] += 1 }
  # Drop words that occur only once
  counter.select! { |_, v| v > 1 }
  counter2 = counter.sort_by { |_, v| -v }.to_h
  counter2
end

# Convert the counts to a JSON array of name/value pairs
def parse(hash)
  data = Array.new
  hash.each do |k, v|
    word = Hash.new
    word['name'] = k
    word['value'] = v
    data << word
  end
  JSON data
end

# Persist the data to Excel (requires Windows with Excel installed)
def save_excel(hash)
  excel = WIN32OLE.new('Excel.Application')
  excel.visible = false
  workbook = excel.Workbooks.Add()
  worksheet = workbook.Worksheets(1)
  (1..hash.size + 1).each do |i|
    if i == 1
      # Header row: "keyword", "frequency"
      worksheet.Range("A#{i}:B#{i}").value = ['關鍵詞', '頻次']
    else
      worksheet.Range("A#{i}:B#{i}").value = [hash.keys[i - 2], hash.values[i - 2]]
    end
  end
  excel.DisplayAlerts = false
  workbook.saveas(File.dirname(__FILE__) + '\lagouspider.xls')
  workbook.saved = true
  excel.ActiveWorkbook.Close(1)
  excel.Quit()
end

# Get the number of pages
# Note: URI.encode was removed in Ruby 3.0, so this script needs an older Ruby.
url = URI.encode("https://www.lagou.com/jobs/list_#{@position}?city=#{@city}&cl=false&fromSearch=true&labelWords=&suginput=")
num = get_page_num(url).to_i
puts "Found #{num} result pages"

skills = Array.new
(1..num).each do |i|
  puts "On page #{i}"
  # Get the position ids
  url2 = URI.encode("https://www.lagou.com/jobs/positionAjax.json?city=#{@city}&needAddtionalResult=false")
  headers = { Referer: url, 'User-Agent': i % 2 == 1 ? 'Mozilla/5.0' : 'Chrome/67.0.3396.87' }
  parameters = { first: (i == 1), pn: i, kd: @position }
  positions_id = get_positionsId(url2, headers: headers, parameters: parameters)
  positions_id.each do |id|
    # Visit the position page and extract the English skill keywords
    url3 = "https://www.lagou.com/jobs/#{id}.html"
    skills.concat get_skills(url3)
    sleep 10
  end
end

count = word_count(skills)
save_excel(count)
@data = parse(count)
Results: