Web Crawler -- 20. [Scrapy-Redis in Practice] Distributed Crawling of Fang.com -- Code Implementation

Table of Contents

  • I. Case Introduction
  • II. Creating the Project
  • III. settings.py Configuration
  • IV. Full Code
  • V. Deployment
    • 1. Generating requirements.txt on Windows
    • 2. Connecting to the Ubuntu server with Xshell and installing dependencies
    • 3. Modifying part of the code
    • 4. Uploading the code to the server and running it

I. Case Introduction

Scrape the listing pages of Fang.com (https://www1.fang.com/).

The source code has been updated on GitHub.

II. Creating the Project

Open a Windows terminal and change into the directory where the project will be stored:

scrapy startproject fang

cd fang\

scrapy genspider sfw "fang.com"

The project directory structure is shown below:
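
For reference, the two commands above produce the standard Scrapy layout; the start.py launcher shown later in this article is added by hand (commonly next to scrapy.cfg):

fang/
├── scrapy.cfg
├── start.py
└── fang/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── sfw.py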

III. settings.py Configuration

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
DOWNLOADER_MIDDLEWARES = {
    'fang.middlewares.UserAgentDownloadMiddleware': 543,
}
ITEM_PIPELINES = {
    'fang.pipelines.FangPipeline': 300,
}

IV. Full Code

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for fang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'fang'

SPIDER_MODULES = ['fang.spiders']
NEWSPIDER_MODULE = 'fang.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'fang (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'fang.middlewares.FangSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'fang.middlewares.UserAgentDownloadMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'fang.pipelines.FangPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class NewHouseItem(scrapy.Item):
    # province
    province = scrapy.Field()
    # city
    city = scrapy.Field()
    # name of the residential project
    name = scrapy.Field()
    # price
    price = scrapy.Field()
    # number of bedrooms (list)
    rooms = scrapy.Field()
    # floor area
    area = scrapy.Field()
    # address
    address = scrapy.Field()
    # administrative district
    district = scrapy.Field()
    # whether it is on sale
    sale = scrapy.Field()
    # URL of the detail page on Fang.com
    origin_url = scrapy.Field()


class ESFHouseItem(scrapy.Item):
    # province
    province = scrapy.Field()
    # city
    city = scrapy.Field()
    # name of the residential complex
    name = scrapy.Field()
    # rooms and halls
    rooms = scrapy.Field()
    # floor
    floor = scrapy.Field()
    # orientation
    toward = scrapy.Field()
    # year built
    year = scrapy.Field()
    # address
    address = scrapy.Field()
    # building area
    area = scrapy.Field()
    # total price
    price = scrapy.Field()
    # unit price
    unit = scrapy.Field()
    # original URL
    origin_url = scrapy.Field()

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonLinesItemExporter

from fang.items import NewHouseItem, ESFHouseItem


class FangPipeline(object):
    def __init__(self):
        self.newhouse_fp = open('newhouse.json', 'wb')
        self.esfhouse_fp = open('esfhouse.json', 'wb')
        self.newhouse_exporter = JsonLinesItemExporter(self.newhouse_fp, ensure_ascii=False)
        self.esfhouse_exporter = JsonLinesItemExporter(self.esfhouse_fp, ensure_ascii=False)

    def process_item(self, item, spider):
        # route each item to the JSON file that matches its type
        if isinstance(item, NewHouseItem):
            self.newhouse_exporter.export_item(item)
        elif isinstance(item, ESFHouseItem):
            self.esfhouse_exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.newhouse_fp.close()
        self.esfhouse_fp.close()

sfw.py:

# -*- coding: utf-8 -*-
import re

import scrapy
from fang.items import NewHouseItem, ESFHouseItem


class SfwSpider(scrapy.Spider):
    name = 'sfw'
    allowed_domains = ['fang.com']
    start_urls = ['https://www.fang.com/SoufunFamily.htm']

    def parse(self, response):
        # the city list page groups cities by province, one <tr> per row
        trs = response.xpath("//div[@class='outCont']//tr")
        province = None
        for tr in trs:
            tds = tr.xpath(".//td[not(@class)]")
            province_td = tds[0]
            province_text = province_td.xpath(".//text()").get()
            province_text = re.sub(r"\s", "", province_text)
            if province_text:
                province = province_text
            # skip the "other" (overseas) section
            if province == "其它":
                continue
            city_id = tds[1]
            city_links = city_id.xpath(".//a")
            for city_link in city_links:
                city = city_link.xpath(".//text()").get()
                city_url = city_link.xpath(".//@href").get()
                # build the new-house URL for this city
                url_module = city_url.split("//")
                scheme = url_module[0]
                domain_all = url_module[1].split("fang")
                domain_0 = domain_all[0]
                domain_1 = domain_all[1]
                if "bj." in domain_0:
                    # Beijing uses fixed subdomains
                    newhouse_url = "https://newhouse.fang.com/house/s/"
                    esf_url = "https://esf.fang.com/"
                else:
                    newhouse_url = scheme + "//" + domain_0 + "newhouse.fang" + domain_1 + "house/s/"
                    # build the second-hand house URL
                    esf_url = scheme + "//" + domain_0 + "esf.fang" + domain_1
                # yield scrapy.Request(url=newhouse_url,
                #                      callback=self.parse_newhouse,
                #                      meta={"info": (province, city)})
                yield scrapy.Request(url=esf_url,
                                     callback=self.parse_esf,
                                     meta={"info": (province, city)},
                                     dont_filter=True)

    def parse_newhouse(self, response):
        province, city = response.meta.get('info')
        lis = response.xpath("//div[contains(@class,'nl_con')]/ul/li")
        for li in lis:
            # project name
            name = li.xpath(".//div[@class='nlcd_name']/a/text()").get()
            if name is not None:
                name = name.strip()
            # house type: how many bedrooms (a list such as ['3居', '4居'])
            house_type_list = li.xpath(".//div[contains(@class,'house_type')]/a/text()").getall()
            house_type_list = list(map(lambda x: re.sub(r"\s", "", x), house_type_list))
            rooms = list(filter(lambda x: x.endswith("居"), house_type_list))
            # floor area
            area = "".join(li.xpath(".//div[contains(@class,'house_type')]/text()").getall())
            area = re.sub(r"\s|/|-", "", area)
            # address
            address = li.xpath(".//div[@class='address']/a/@title").get()
            # administrative district, e.g. 海淀, 朝陽
            district_text = "".join(li.xpath(".//div[@class='address']/a//text()").getall())
            district_match = re.search(r".*\[(.+)\].*", district_text)
            district = district_match.group(1) if district_match else None
            # sale status
            sale = li.xpath(".//div[contains(@class,'fangyuan')]/span/text()").get()
            # price
            price = "".join(li.xpath(".//div[@class='nhouse_price']//text()").getall())
            price = re.sub(r"\s|廣告", "", price)
            # detail page URL on Fang.com
            origin_url = li.xpath(".//div[@class='nlcd_name']/a/@href").get()

            item = NewHouseItem(
                name=name,
                rooms=rooms,
                area=area,
                address=address,
                district=district,
                sale=sale,
                price=price,
                origin_url=origin_url,
                province=province,
                city=city,
            )
            yield item

        # pagination
        next_url = response.xpath(".//div[@class='page']//a[@class='next']/@href").get()
        if next_url:
            yield scrapy.Request(url=response.urljoin(next_url),
                                 callback=self.parse_newhouse,
                                 meta={"info": (province, city)})

    def parse_esf(self, response):
        # province and city are passed along in the request meta
        province, city = response.meta.get('info')
        dls = response.xpath("//div[@class='shop_list shop_list_4']/dl")
        for dl in dls:
            item = ESFHouseItem(province=province, city=city)
            # name of the residential complex
            name = dl.xpath(".//p[@class='add_shop']/a/text()").get()
            if name is not None:
                item['name'] = name.strip()
            # combined info: rooms / floor / orientation / year / area
            infos = dl.xpath(".//p[@class='tel_shop']/text()").getall()
            infos = list(map(lambda x: re.sub(r"\s", "", x), infos))
            for info in infos:
                if "廳" in info:
                    item['rooms'] = info
                elif '層' in info:
                    item['floor'] = info
                elif '向' in info:
                    item['toward'] = info
                elif '年' in info:
                    item['year'] = info
                elif '㎡' in info:
                    item['area'] = info
            # address
            address = dl.xpath(".//p[@class='add_shop']/span/text()").get()
            if address is not None:
                item['address'] = address
            # total price
            price = dl.xpath("./dd[@class='price_right']/span[1]/b/text()").getall()
            if len(price) > 0:
                item['price'] = "".join(price)
            # unit price
            unit = dl.xpath("./dd[@class='price_right']/span[2]/text()").get()
            if unit is not None:
                item['unit'] = unit
            # original URL of the detail page
            detail_url = dl.xpath(".//h4[@class='clearfix']/a/@href").get()
            if detail_url is not None:
                item['origin_url'] = response.urljoin(detail_url)
            yield item

        # pagination
        next_url = response.xpath(".//div[@class='page_al']/p/a/@href").get()
        if next_url:
            yield scrapy.Request(url=response.urljoin(next_url),
                                 callback=self.parse_esf,
                                 meta={"info": (province, city)})

middlewares.py:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random


class UserAgentDownloadMiddleware(object):
    # downloader middleware that sets a random User-Agent on every request
    USER_AGENTS = [
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9.2.3) Gecko/20100401 Lightningquail/3.6.3',
        'Mozilla/5.0 (X11; ; Linux i686; rv:1.9.2.20) Gecko/20110805',
        'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1b3) Gecko/20090305',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.14) Gecko/2009091010',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.10) Gecko/2009042523',
    ]

    def process_request(self, request, spider):
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent

start.py:

from scrapy import cmdline

cmdline.execute("scrapy crawl sfw".split())

At this point, running start.py in the Windows development environment crawls the data normally.

V. Deployment

1. Generating requirements.txt on Windows

Open cmder and first activate the virtual environment:

cd C:\Users\fxd.virtualenvs\sipder_env
.\Scripts\activate

Then switch to the directory containing the project and generate requirements.txt with:

pip freeze > requirements.txt

2. Connecting to the Ubuntu server with Xshell and installing dependencies

If openssh is not installed yet, install it first:

sudo apt-get install openssh-server

Connect to the Ubuntu server, change into the directory of the virtual environment, and run:

source ./bin/activate

to enter the virtual environment. Then run:

rz

to upload requirements.txt. Then run:

pip install -r requirements.txt

to install the project dependencies.

Then install scrapy-redis:

pip install scrapy-redis

3. Modifying part of the code

To turn a Scrapy project into a Scrapy-Redis project, only three changes are needed:
(1) Change the spider's base class from scrapy.Spider to scrapy_redis.spiders.RedisSpider (or from scrapy.CrawlSpider to scrapy_redis.spiders.RedisCrawlSpider).
(2) Remove the spider's start_urls and add a redis_key = "***" attribute instead. This key is how the crawler is started from Redis later: the crawler's first URL is pushed into Redis under this key. A sketch of the modified spider header follows the configuration below.
(3) Add the following to the settings file:

# Scrapy-Redis settings
# Make sure requests are stored in and scheduled from Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Make sure all spiders share the same duplicate-filter fingerprints
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Store scraped items in Redis via the RedisPipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Keep the scrapy-redis queues in Redis instead of clearing them,
# so crawls can be paused and resumed
SCHEDULER_PERSIST = True

# Redis connection information
REDIS_HOST = '172.20.10.2'
REDIS_PORT = 6379
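
For reference, here is a minimal sketch of what the top of sfw.py looks like after changes (1) and (2); the redis_key value matches the lpush command used in step 4, and the parsing methods themselves stay unchanged from section IV:

# -*- coding: utf-8 -*-
from scrapy_redis.spiders import RedisSpider


class SfwSpider(RedisSpider):
    name = 'sfw'
    allowed_domains = ['fang.com']
    # start_urls is removed; the first URL will be pushed to this key in Redis
    redis_key = "fang:start_urls"

    # parse(), parse_newhouse() and parse_esf() remain exactly as in section IV

With RedisSpider, the spider idles until a URL appears under redis_key, which is what lets several crawler servers share a single request queue.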

4. Uploading the code to the server and running it

Compress the project files, upload the archive with the rz command in Xshell, and extract it on the server.
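
For example — assuming the project was packed into an archive named fang.zip on the Windows side and that unzip is installed on the server (both names are just for illustration) — the server-side steps would be:

rz
unzip fang.zip
cd fang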

Run the crawler:
(1) On each crawler server, change into the directory containing the spider file sfw.py and run scrapy runspider [spider file]:

scrapy runspider sfw.py

(2) On the Redis server (Windows), start the Redis service:

redis-server redis.windows.conf
If this reports an error, run the following commands in order:
redis-cli.exe
shutdown
exit
redis-server.exe redis.windows.conf

(3) Then open another Windows terminal:

redis-cli

and push a starting URL into the queue:

lpush fang:start_urls https://www.fang.com/SoufunFamily.htm

The crawl starts.

Open RedisDesktopManager to inspect the saved data.

Repeat the same steps on the other crawler server.
The project is complete!
