Scraping Ganji.com rental listings with Python 2 and Python 3, with source code analysis

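*A while ago I happened to watch a Tencent open-course video that scraped rental listings from Ganji.com. It came back to mind a few days ago, so I analyzed the site myself, wrote my own version, and then rewrote it in Python 3. I ran into a few pitfalls along the way, so I am recording them here as a summary.*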

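Scraping Ganji.com rental listings with Python 2, and analyzing the site

Analyze the target site's URLs
Find the target tags
Extract the data and write it to a CSV file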

#-*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from urlparse import urljoin
import requests
import csv

URL = 'http://jn.ganji.com/fang1/o{page}p{price}/'
# The base address is jn.ganji.com/fang1, where jn is Jinan, my city and the default after logging in.
# fang1 is the rental section, fang5 is second-hand housing, zhaopin is the jobs section, and so on;
# this time we only query fang1.
# The link can be more complex, for example
#   http://jn.ganji.com/fang1/tianqiao/h1o1p1/ or
#   http://jn.ganji.com/fang1/tianqiao/b1000e1577/
# where h is the house type, o the page and p the price range; the numbers after h and p
# follow the order of the corresponding menus on the site,
# while b and e mark the start and end of a price range you type in yourself.
#   h: house  o: page  p: price
#   jn: Jinan  fang1: rentals (zufang)  tianqiao: Tianqiao District  b: begin (1000)  e: end (1577)

ADDR = 'http://bj.ganji.com/'
start_page = 1
end_page = 5
price = 1

# Note on the file mode: the Python 2 csv module expects binary mode ('wb');
# with the wrong mode, each row written to the csv file can be preceded by an extra blank line.
# See this article: http://blog.csdn.net/pfm685757/article/details/47806469
with open('info.csv', 'wb') as f:
    csv_writer = csv.writer(f, delimiter=',')
    print 'starting'
    while start_page < end_page:
        # start_page is incremented before the request, so with start_page = 1 the first page fetched is page 2
        start_page += 1
        # Inspecting the markup shows that the target tags must be matched through several classes to be unique
        # The concrete extraction follows
        print 'get{0}'.format(URL.format(page=start_page, price=price))
        response = requests.get(URL.format(page=start_page, price=price))
        html = BeautifulSoup(response.text, 'html.parser')
        house_list = html.select('.f-list > .f-list-item > .f-list-item-wrap')
        # check house_list
        if not house_list:
            print 'No house_list'
            break
        for house in house_list:
            house_title = house.select('.title > a')[0].string.encode('utf-8')
            house_addr = house.select('.address > .area > a')[-1].string.encode('utf-8')
            house_price = house.select('.info > .price > .num')[0].string.encode('utf-8')
            house_url = urljoin(ADDR, house.select('.title > a ')[0]['href'])
            # write the row to the csv file
            csv_writer.writerow([house_title, house_addr, house_price, house_url])
print 'ending'
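
The URL pieces described in the comments above (h, o, p, b, e and the optional district segment) can be assembled by a small helper. This is only a sketch in Python 3: the function name build_list_url and its parameters are made up for illustration, the ordering of the letters is inferred from the example URLs in this post, and whether the site still accepts every combination is an assumption.

```python
def build_list_url(city="jn", district=None, house_type=None, page=None,
                   price_index=None, begin_price=None, end_price=None):
    """Assemble a rental listing URL from the path letters used in this post.

    Hypothetical helper: the letter order (h, o, p, b, e) is inferred from the
    example URLs, not from any documentation of the site.
    """
    if price_index is not None and (begin_price is not None or end_price is not None):
        raise ValueError("use either price_index (p) or begin/end price (b/e), not both")

    filters = ""
    if house_type is not None:
        filters += "h{0}".format(house_type)    # h: house type, menu position
    if page is not None:
        filters += "o{0}".format(page)          # o: page number
    if price_index is not None:
        filters += "p{0}".format(price_index)   # p: price range, menu position
    if begin_price is not None:
        filters += "b{0}".format(begin_price)   # b: start of a typed-in price range
    if end_price is not None:
        filters += "e{0}".format(end_price)     # e: end of a typed-in price range

    url = "http://{0}.ganji.com/fang1/".format(city)
    if district:
        url += district + "/"
    if filters:
        url += filters + "/"
    return url


print(build_list_url(page=2, price_index=1))
print(build_list_url(district="tianqiao", house_type=1, page=1, price_index=1))
print(build_list_url(district="tianqiao", begin_price=1000, end_price=1577))
```

The three calls reproduce the URL shapes quoted in the comments; keeping the letters in one place makes it easier to switch between the menu-based p filter and a typed-in b/e range.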

Scraping Ganji.com rental listings with Python 3

Points to note

urlparse.urljoin becomes urllib.parse.urljoin

# Python 2
from urlparse import urljoin
# Python 3
from urllib.parse import urljoin
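
As a quick check of what urljoin does with the href values pulled from the listing pages, here is a small Python 3 sketch. The helper name absolute_url and the href values are hypothetical; they only illustrate how relative, absolute and protocol-relative links resolve against a base address such as ADDR.

```python
from urllib.parse import urljoin

ADDR = 'http://bj.ganji.com/'


def absolute_url(href, base=ADDR):
    """Resolve an href taken from a page against the base address."""
    return urljoin(base, href)


# Hypothetical href values, chosen only to show the three common cases.
from_root = absolute_url('/fang1/123.htm')                      # path relative to the site root
already_absolute = absolute_url('http://jn.ganji.com/fang1/1.htm')  # absolute URL is kept as-is
protocol_relative = absolute_url('//jn.ganji.com/fang1/2.htm')  # scheme is taken from the base

print(from_root)
print(already_absolute)
print(protocol_relative)
```

A root-relative href inherits the host of the base address, so the city prefix in the base decides which city's domain ends up in the CSV.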

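Python 3 draws a strict distinction between bytes and str, and the csv module writes str, so the mode passed to open has to change from wb to w
Set the encoding to UTF-8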

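# newline='' is what the csv documentation recommends for files handed to csv.writer
with open('info.csv', 'w', encoding='utf8', newline='') as f:
    csv_writer = csv.writer(f, delimiter=',')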
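
To see that text mode with an explicit UTF-8 encoding round-trips non-ASCII fields, the following Python 3 sketch writes a couple of rows to a temporary file and reads them back. The helper names write_rows and read_rows and the sample rows are made up for illustration; they are not scraped data.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Write rows as UTF-8 text; newline='' lets the csv module control line endings."""
    with open(path, 'w', encoding='utf8', newline='') as f:
        csv.writer(f, delimiter=',').writerows(rows)


def read_rows(path):
    """Read the rows back with the same encoding."""
    with open(path, 'r', encoding='utf8', newline='') as f:
        return list(csv.reader(f, delimiter=','))


# Made-up sample rows: title, area, price, url
rows = [
    ['兩室一廳, 近地鐵', '天橋', '1500', 'http://jn.ganji.com/fang1/example1.htm'],
    ['one-bedroom flat', 'tianqiao', '1000', 'http://jn.ganji.com/fang1/example2.htm'],
]
path = os.path.join(tempfile.mkdtemp(), 'info.csv')
write_rows(path, rows)
print(read_rows(path) == rows)
```

The first title contains a comma on purpose: the csv module quotes such fields when writing, so the row still comes back as four columns.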

The complete code is as follows:

#-*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
import csv

URL = 'http://jn.ganji.com/fang1/o{page}p{price}/'
#   h: house  o: page  p: price
#   http://jn.ganji.com/fang1/tianqiao/b1000e1577/
#   jn: Jinan  fang1: rentals (zufang)  tianqiao: Tianqiao District  b: begin (1000)  e: end (1577)
# fang5 is second-hand housing and zhipin is jobs; Ganji.com's URL scheme is simple,
# so with enough time you can collect a great deal of information

ADDR = 'http://bj.ganji.com/'
start_page = 1
end_page = 5
price = 1

'''
URL = 'http://jn.ganji.com/fang1/h{huxing}o{page}b{beginPrice}e{endPrice}/'
# house type is chosen from h1-h5
# the typed-in price range uses begin and end
price='b1000e2000'
# house type:
'''
# Open the file explicitly as utf8; otherwise it is written in the platform default encoding (GBK here)
# newline='' is what the csv documentation recommends for files handed to csv.writer
with open('info.csv', 'w', encoding='utf8', newline='') as f:
    csv_writer = csv.writer(f, delimiter=',')
    print('starting')
    while start_page < end_page:
        # start_page is incremented before the request, so with start_page = 1 the first page fetched is page 2
        start_page += 1
        print('get{0}'.format(URL.format(page=start_page, price=price)))
        response = requests.get(URL.format(page=start_page, price=price))
        html = BeautifulSoup(response.text, 'html.parser')
        house_list = html.select('.f-list > .f-list-item > .f-list-item-wrap')
        # check house_list
        if not house_list:
            print('No house_list')
            break
        for house in house_list:
            house_title = house.select('.title > a')[0].string
            house_addr = house.select('.address > .area > a')[-1].string
            house_price = house.select('.info > .price > .num')[0].string
            house_url = urljoin(ADDR, house.select('.title > a ')[0]['href'])
            csv_writer.writerow([house_title, house_addr, house_price, house_url])
print('ending')
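
With start_page = 1 and end_page = 5, the while loop in the script requests pages 2 through 5, because the counter is incremented before each request. If the intent is to include the first page, one possible restructuring is sketched below. The function crawl and its fetch_page and parse_listings parameters are hypothetical stand-ins for the requests and BeautifulSoup calls, injected so that the loop can be run offline with fake pages.

```python
def crawl(fetch_page, parse_listings, first_page=1, last_page=5):
    """Collect rows from first_page to last_page inclusive.

    Stops at the first page that yields no listings, mirroring the
    'if not house_list: break' check in the script.
    """
    rows = []
    for page in range(first_page, last_page + 1):
        listings = parse_listings(fetch_page(page))
        if not listings:
            break
        rows.extend(listings)
    return rows


# Fake pages standing in for downloaded HTML; page 3 is empty.
fake_pages = {1: 'a,b', 2: 'c', 3: '', 4: 'd'}


def fake_fetch(page):
    return fake_pages.get(page, '')


def fake_parse(text):
    return [item for item in text.split(',') if item]


result = crawl(fake_fetch, fake_parse)
print(result)
```

Because the loop stops at the first empty page, the item on fake page 4 is never collected, which matches how the script gives up once a page has no listings.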

Finally, here is the resulting CSV file:
[Image: CSV file of the scraped Ganji.com rental listings]

Reposted from: https://www.cnblogs.com/fonttian/p/9162844.html
