Python Frameworks: Scrapy Crawler (Part 2)

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs such as data mining, information processing, and archiving historical data. Although it was originally designed for page scraping (more precisely, web scraping), it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is widely applicable to data mining, monitoring, and automated testing.

Scrapy uses the Twisted asynchronous networking library to handle network communication. Its overall architecture is built from the components described below.

Scrapy's main components:

  • Engine (Scrapy)
    Handles the data flow of the whole system and triggers events; it is the core of the framework.
  • Scheduler
    Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses/links of pages to crawl): it decides which URL to fetch next and removes duplicate URLs.
  • Downloader
    Downloads page content and returns it to the spiders. (The downloader is built on Twisted, an efficient asynchronous model.)
  • Spiders
    The spiders do the real work: they extract the information they need from specific pages, the so-called items. They can also extract links so that Scrapy keeps crawling the next pages.
  • Item Pipeline
    Processes the items the spiders extract; its main jobs are persisting items, validating them, and cleaning out unwanted data. After a spider parses a page, the items are sent through the pipelines and processed in a defined order.
  • Downloader Middlewares
    A hook framework between the Scrapy engine and the downloader that processes the requests and responses passing between them.
  • Spider Middlewares
    A hook framework between the Scrapy engine and the spiders that processes the spiders' response input and request output.
  • Scheduler Middlewares
    Middleware between the Scrapy engine and the scheduler that processes the requests and responses sent between them.

Scrapy's run flow is roughly:

  1. The engine takes a URL from the scheduler for the next crawl.
  2. The engine wraps the URL in a Request and passes it to the downloader.
  3. The downloader fetches the resource and wraps it in a Response.
  4. A spider parses the Response.
  5. If an item is parsed out, it is handed to the item pipeline for further processing.
  6. If a link (URL) is parsed out, it is handed to the scheduler to wait for crawling.
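
As a minimal sketch of how steps 5 and 6 look in spider code (the domain, selectors, and field names below are made up for illustration; yielding a plain dict in place of an Item works in recent Scrapy versions):

import scrapy

class FlowSketchSpider(scrapy.Spider):
    name = "flow_sketch"                      # illustrative name
    start_urls = ["http://example.com/"]      # illustrative start URL

    def parse(self, response):
        # Step 5: yielding an item (here a plain dict) sends it to the item pipelines
        for title in response.xpath('//h2/text()').extract():
            yield {'title': title}
        # Step 6: yielding a Request sends the URL back to the scheduler for crawling
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)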

I. Installation

Linux:
    pip3 install scrapy

Windows:
    a. pip3 install wheel
    b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    c. In the download directory, run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
    d. pip3 install scrapy
    e. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/

II. Basic Usage

1. Basic commands

1. scrapy startproject <project_name>
   # Create a project in the current directory (similar to Django)

2. scrapy genspider [-t template] <name> <domain>
   # Create a spider, e.g.:
   scrapy genspider -t basic oldboy oldboy.com
   scrapy genspider -t xmlfeed autohome autohome.com.cn
   # List all available templates:  scrapy genspider -l
   # Show a template's contents:    scrapy genspider -d <template_name>

3. scrapy list
   # List the spiders in the project

4. scrapy crawl <spider_name>
   # Run a single spider; must be executed inside the project

2. Project structure and spiders

project_name/
    scrapy.cfg             # Main project configuration (the real crawler settings live in settings.py)
    project_name/
        __init__.py
        items.py           # Data models for structured data, similar to Django's Model
        pipelines.py       # Data-processing behavior, e.g. persisting the structured data
        settings.py        # Configuration: recursion depth, concurrency, download delay, etc.
        spiders/           # Spider directory: create files here and write the crawling rules
            __init__.py
            spider1.py
            spider2.py
            spider3.py

Note: spider files are usually named after the target site's domain.

spider1.py

import scrapy

class XiaoHuarSpider(scrapy.spiders.Spider):
    name = "spidername"                 # Spider name (required)
    allowed_domains = ["spider.com"]    # Allowed domains
    start_urls = [
        "http://www.flepeng.com/",      # Start URLs
    ]

    def parse(self, response):
        # Callback invoked with the response of each start URL
        pass

Windows console encoding

import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

3. A first spider

import scrapy
from scrapy.selector import HtmlXPathSelector  # Deprecated in newer versions; use Selector instead
from scrapy.http.request import Request


class DigSpider(scrapy.Spider):
    name = "dig"                             # Spider name; used on the command line to start the crawl
    allowed_domains = ["chouti.com"]         # Allowed domains
    start_urls = ['http://dig.chouti.com/']  # Start URLs

    has_request_set = {}

    def parse(self, response):
        print(response.url)
        hxs = HtmlXPathSelector(response)
        page_list = hxs.select('//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        for page in page_list:
            page_url = 'http://dig.chouti.com%s' % page
            key = self.md5(page_url)
            if key not in self.has_request_set:
                self.has_request_set[key] = page_url
                obj = Request(url=page_url, method='GET', callback=self.parse)
                yield obj

    @staticmethod
    def md5(val):
        import hashlib
        ha = hashlib.md5()
        ha.update(bytes(val, encoding='utf-8'))
        key = ha.hexdigest()
        return key

To run this spider, open a terminal, enter the project directory, and execute:

scrapy crawl dig --nolog    # --nolog suppresses log output

The important points in the code above:

  • Request is a class that wraps a user request; yielding one from a callback tells Scrapy to keep crawling.
  • HtmlXPathSelector structures the HTML and provides selector functionality.

4. Selectors

XPath path expressions:

Expression    Description
nodename      Selects all child nodes of the named node.
/             Selects from the root node.
//            Selects nodes in the document that match the selection, no matter where they are.
.             Selects the current node.
..            Selects the parent of the current node.
@             Selects attributes.

The following table lists some path expressions and their results:

Path expression    Result
bookstore          Selects all child nodes of the bookstore element.
/bookstore         Selects the root element bookstore.
                   (Note: if a path starts with a slash ( / ), it always represents an absolute path to an element.)
bookstore/book     Selects all book elements that are children of bookstore.
//book             Selects all book elements, no matter where they are in the document.
bookstore//book    Selects all book elements that are descendants of the bookstore element, wherever they sit below bookstore.
//@lang            Selects all attributes named lang.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy.selector import Selector, HtmlXPathSelector  # HtmlXPathSelector is deprecated in newer versions; Selector has the same usage
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
    <head lang="en"><meta charset="UTF-8"><title></title></head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# hxs = HtmlXPathSelector(response)
# print(hxs)
# hxs = Selector(response=response).xpath('//a')  # Select all a elements in the document
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)
# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # or
#     # v = item.xpath('*/a/span')
#     print(v)

Example: log in to Chouti automatically and upvote posts

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar
from scrapy import FormRequest


class ChouTiSpider(scrapy.Spider):
    # Spider name; the crawl is started with this name
    name = "chouti"
    # Allowed domains
    allowed_domains = ["chouti.com"]
    cookie_dict = {}
    has_request_set = {}

    def start_requests(self):
        url = 'http://dig.chouti.com/'
        # return [Request(url=url, callback=self.login)]
        yield Request(url=url, callback=self.login)

    def login(self, response):
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value
        req = Request(
            url='http://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            body='phone=8615131255089&password=pppppppp&oneMonth=1',
            cookies=self.cookie_dict,
            callback=self.check_login
        )
        yield req

    def check_login(self, response):
        req = Request(
            url='http://dig.chouti.com/',
            method='GET',
            callback=self.show,
            cookies=self.cookie_dict,
            dont_filter=True
        )
        yield req

    def show(self, response):
        # print(response)
        hxs = HtmlXPathSelector(response)
        news_list = hxs.select('//div[@id="content-list"]/div[@class="item"]')
        for new in news_list:
            # temp = new.xpath('div/div[@class="part2"]/@share-linkid').extract()
            link_id = new.xpath('*/div[@class="part2"]/@share-linkid').extract_first()
            yield Request(
                url='http://dig.chouti.com/link/vote?linksId=%s' % (link_id,),
                method='POST',
                cookies=self.cookie_dict,
                callback=self.do_favor
            )

        page_list = hxs.select('//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        for page in page_list:
            page_url = 'http://dig.chouti.com%s' % page
            import hashlib
            hash = hashlib.md5()
            hash.update(bytes(page_url, encoding='utf-8'))
            key = hash.hexdigest()
            if key in self.has_request_set:
                pass
            else:
                self.has_request_set[key] = page_url
                yield Request(
                    url=page_url,
                    method='GET',
                    callback=self.show
                )

    def do_favor(self, response):
        print(response.text)

Handling cookies

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.response.html import HtmlResponse
from scrapy.http import Request
from scrapy.http.cookies import CookieJar


class ChoutiSpider(scrapy.Spider):
    name = "chouti"
    allowed_domains = ["chouti.com"]
    start_urls = ('http://www.chouti.com/',)

    def start_requests(self):
        url = 'http://dig.chouti.com/'
        yield Request(url=url, callback=self.login, meta={'cookiejar': True})

    def login(self, response):
        print(response.headers.getlist('Set-Cookie'))
        req = Request(
            url='http://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            body='phone=8613121758648&password=woshiniba&oneMonth=1',
            callback=self.check_login,
            meta={'cookiejar': True}
        )
        yield req

    def check_login(self, response):
        print(response.text)

Note: set DEPTH_LIMIT = 1 in settings.py to limit the "recursion" depth.
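
DEPTH_LIMIT works together with the depth counter Scrapy keeps on each request; inside a callback you can inspect it through response.meta. A small sketch (the print is just for illustration):

    # inside a spider class
    def parse(self, response):
        depth = response.meta.get('depth', 0)   # 0 for start URLs, +1 for every followed link
        print(response.url, depth)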

5. Structured processing with pipelines

The examples above do only simple processing, so everything happens directly in the parse method. When you want to do more with the data you collect, you can use Scrapy's items to structure it and hand it to the pipelines for unified processing.

spiders/xiaohuar.py

import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar
from scrapy import FormRequest


class XiaoHuarSpider(scrapy.Spider):
    name = "xiaohuar"
    allowed_domains = ["xiaohuar.com"]
    start_urls = [
        "http://www.xiaohuar.com/list-1-1.html",
    ]
    # Pipeline configuration from settings (can be overridden per spider)
    # custom_settings = {
    #     'ITEM_PIPELINES': {
    #         'spider1.pipelines.JsonPipeline': 100
    #     }
    # }
    has_request_set = {}

    def parse(self, response):
        # Parse the page:
        # find the content that matches the rules (the photos) and save it,
        # then find all matching links and follow them level by level
        hxs = HtmlXPathSelector(response)

        items = hxs.select('//div[@class="item_list infinite_scroll"]/div')
        for item in items:
            src = item.select('.//div[@class="img"]/a/img/@src').extract_first()
            name = item.select('.//div[@class="img"]/span/text()').extract_first()
            school = item.select('.//div[@class="img"]/div[@class="btns"]/a/text()').extract_first()
            url = "http://www.xiaohuar.com%s" % src
            from ..items import XiaoHuarItem
            obj = XiaoHuarItem(name=name, school=school, url=url)
            yield obj

        urls = hxs.select('//a[re:test(@href, "http://www.xiaohuar.com/list-1-\d+.html")]/@href')
        for url in urls:
            key = self.md5(url)
            if key in self.has_request_set:
                pass
            else:
                self.has_request_set[key] = url
                req = Request(url=url, method='GET', callback=self.parse)
                yield req

    @staticmethod
    def md5(val):
        import hashlib
        ha = hashlib.md5()
        ha.update(bytes(val, encoding='utf-8'))
        key = ha.hexdigest()
        return key

items.py

import scrapy


class XiaoHuarItem(scrapy.Item):
    name = scrapy.Field()
    school = scrapy.Field()
    url = scrapy.Field()

pipelines.py

import json
import os
import requests


class JsonPipeline(object):
    def __init__(self):
        self.file = open('xiaohua.txt', 'w')

    def process_item(self, item, spider):
        v = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(v)
        self.file.write('\n')
        self.file.flush()
        return item


class FilePipeline(object):
    def __init__(self):
        if not os.path.exists('imgs'):
            os.makedirs('imgs')

    def process_item(self, item, spider):
        response = requests.get(item['url'], stream=True)
        file_name = '%s_%s.jpg' % (item['name'], item['school'])
        with open(os.path.join('imgs', file_name), mode='wb') as f:
            f.write(response.content)
        return item

settings.py

ITEM_PIPELINES = {
    'spider1.pipelines.JsonPipeline': 100,
    'spider1.pipelines.FilePipeline': 300,
}
# The integer values determine the order: items pass through the pipelines from the lowest
# number to the highest. By convention these numbers are defined in the 0-1000 range.
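
ITEM_PIPELINES can also be overridden per spider through the custom_settings class attribute, as hinted by the commented block in the spider above. A sketch, keeping the spider1 module path from this example:

class XiaoHuarSpider(scrapy.Spider):
    name = "xiaohuar"
    # Per-spider override of the project-wide ITEM_PIPELINES
    custom_settings = {
        'ITEM_PIPELINES': {
            'spider1.pipelines.JsonPipeline': 100,
        }
    }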

Pipelines can do more than that, as shown below:

Custom pipeline skeleton

from scrapy.exceptions import DropItem


class CustomPipeline(object):
    def __init__(self, v):
        self.value = v

    def process_item(self, item, spider):
        # Called for every item; do the processing and persistence here.
        # Returning the item passes it on to the remaining pipelines.
        return item
        # Raising DropItem discards the item so later pipelines never see it:
        # raise DropItem()

    @classmethod
    def from_crawler(cls, crawler):
        # Called once at startup to create the pipeline object
        val = crawler.settings.getint('MMMM')
        return cls(val)

    def open_spider(self, spider):
        # Called when the spider starts
        print('000000')

    def close_spider(self, spider):
        # Called when the spider is closed
        print('111111')
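
To activate this pipeline, register it in settings.py and define the setting it reads in from_crawler. A sketch, reusing the spider1 project name and the MMMM setting from the example above (the priority 200 is an arbitrary choice):

MMMM = 123   # read by CustomPipeline.from_crawler via crawler.settings.getint('MMMM')
ITEM_PIPELINES = {
    'spider1.pipelines.CustomPipeline': 200,
}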

6. Middlewares

Spider middleware

class SpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        """
        Called for each response after download, before it is handed to parse
        :param response:
        :param spider:
        :return:
        """
        pass

    def process_spider_output(self, response, result, spider):
        """
        Called with the results the spider returns
        :param response:
        :param result:
        :param spider:
        :return: must return an iterable of Request or Item objects
        """
        return result

    def process_spider_exception(self, response, exception, spider):
        """
        Called when the spider raises an exception
        :param response:
        :param exception:
        :param spider:
        :return: None to let later middlewares handle the exception, or an iterable of
                 Response/Item objects to hand to the scheduler or the pipelines
        """
        return None

    def process_start_requests(self, start_requests, spider):
        """
        Called when the spider starts
        :param start_requests:
        :param spider:
        :return: an iterable of Request objects
        """
        return start_requests

Downloader middleware

class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called for every request that needs to be downloaded, through each downloader middleware
        :param request:
        :param spider:
        :return: None to continue to the next middleware and the downloader;
                 a Response object stops process_request and starts process_response;
                 a Request object stops the middleware chain and reschedules the request;
                 raising IgnoreRequest stops process_request and starts process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        Called with the downloaded response before it reaches the spider
        :param request:
        :param response:
        :param spider:
        :return: a Response object is passed to the remaining process_response methods;
                 a Request object stops the chain and reschedules the request for download;
                 raising IgnoreRequest calls Request.errback
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the download handler or a downloader middleware's process_request() raises an exception
        :param request:
        :param exception:
        :param spider:
        :return: None to let later middlewares handle the exception;
                 a Response object stops the remaining process_exception methods;
                 a Request object stops the chain and reschedules the request for download
        """
        return None
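
Middleware classes take effect only after they are registered in settings.py. A minimal sketch, assuming both classes live in a middlewares module of the step8_king project (the paths used in the settings example at the end of this article):

SPIDER_MIDDLEWARES = {
    'step8_king.middlewares.SpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    'step8_king.middlewares.DownMiddleware1': 100,
}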

7. Custom commands

  • Create a directory of any name (e.g. commands) at the same level as spiders
  • Inside it, create crawlall.py (this file name becomes the command name)

crawlall.py

from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):        # Command arguments
        return '[options]'

    def short_desc(self):    # Command description
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
  • Add COMMANDS_MODULE = '<project name>.<directory name>' to settings.py
  • Run the command from the project directory: scrapy crawlall (see the layout sketch below)
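
A rough sketch of the resulting layout (the __init__.py inside commands is an assumption here: the directory has to be importable as a Python package for COMMANDS_MODULE to find it):

project_name/
    scrapy.cfg
    project_name/
        settings.py          # COMMANDS_MODULE = 'project_name.commands'
        spiders/
        commands/
            __init__.py      # makes commands/ an importable package (assumption)
            crawlall.py      # defines the Command class shown above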

Running a single spider from a script

from scrapy.cmdline import execute

if __name__ == '__main__':
    # "github" is the spider name here; replace it with your own spider's name
    execute(["scrapy", "crawl", "github", "--nolog"])

8. Custom extensions

A custom extension uses Scrapy's signals to register its own actions at specific points in the crawl.

from scrapy import signals


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        ext = cls(val)

        # Register the signal handlers
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

        return ext

    def spider_opened(self, spider):
        print('open')

    def spider_closed(self, spider):
        print('close')
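
The extension only runs if it is enabled in settings.py. A minimal sketch, assuming the class lives in step8_king/extensions.py (the path used in the settings example at the end of this article) and keeping the example's MMMM setting:

MMMM = 123   # value read by MyExtension.from_crawler (name taken from the example above)
EXTENSIONS = {
    'step8_king.extensions.MyExtension': 500,
}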

9. Avoiding duplicate requests

By default, Scrapy deduplicates requests with scrapy.dupefilter.RFPDupeFilter. The related settings are:

DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = False
JOBDIR = "directory where the crawl state is saved, e.g. /root/"  # the fingerprint file ends up at /root/requests.seen
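
JOBDIR is the directory used by Scrapy's job (pause/resume) support; the requests.seen file inside it is the persisted set of request fingerprints, which lets deduplication survive restarts. A usage sketch (the spider name and directory are placeholders):

scrapy crawl somespider -s JOBDIR=crawls/somespider-1    # rerunning with the same JOBDIR resumes the crawl and keeps the seen set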

Custom URL deduplication

class RepeatUrl:
    def __init__(self):
        self.visited_url = set()

    @classmethod
    def from_settings(cls, settings):
        """
        Called at initialization
        :param settings:
        :return:
        """
        return cls()

    def request_seen(self, request):
        """
        Check whether the current request has already been seen
        :param request:
        :return: True if it has been visited before, False otherwise
        """
        if request.url in self.visited_url:
            return True
        self.visited_url.add(request.url)
        return False

    def open(self):
        """
        Called when crawling starts
        :return:
        """
        print('open replication')

    def close(self, reason):
        """
        Called when the crawl finishes
        :param reason:
        :return:
        """
        print('close replication')

    def log(self, request, spider):
        """
        Log a duplicate request
        :param request:
        :param spider:
        :return:
        """
        print('repeat', request.url)
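
To use this class instead of the default filter, point DUPEFILTER_CLASS at it in settings.py. A sketch, assuming the class is saved as duplication.py inside the step8_king project used in the settings example below:

DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'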

10. Miscellaneous

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for step8_king project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# 1. Bot name
BOT_NAME = 'step8_king'

# 2. Spider module paths
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. Client User-Agent request header
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. Whether to obey robots.txt; should normally be enabled to respect what the site allows
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. Number of concurrent requests
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. Download delay in seconds
# DOWNLOAD_DELAY = 2

# The download delay setting will honor only one of:
# 7. Concurrency per domain; the download delay is also applied per domain
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# Concurrency per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and the delay is applied per IP
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. Whether cookies are enabled (handled through a cookiejar)
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. The telnet console lets you inspect and control the running crawler:
#    connect with `telnet ip port`, then issue commands
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]

# 10. Default request headers
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. Item pipelines
# ITEM_PIPELINES = {
#    'step8_king.pipelines.JsonPipeline': 700,
#    'step8_king.pipelines.FilePipeline': 500,
# }

# 12. Custom extensions, invoked through signals
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }

# 13. Maximum crawl depth; the current depth is available via request meta; 0 means no limit
# DEPTH_LIMIT = 3

# 14. Crawl order: 0 means depth-first, LIFO (default); 1 means breadth-first, FIFO
# Last in, first out: depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# First in, first out: breadth-first
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 15. Scheduler queue
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler

# 16. URL deduplication
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
"""
17. AutoThrottle algorithm
    from scrapy.contrib.throttle import AutoThrottle
    How the delay is adjusted:
    1. Get the minimum delay: DOWNLOAD_DELAY
    2. Get the maximum delay: AUTOTHROTTLE_MAX_DELAY
    3. Set the initial download delay: AUTOTHROTTLE_START_DELAY
    4. After each request finishes, take its latency, i.e. the time between sending
       the request and receiving the response headers
    5. Combine it with AUTOTHROTTLE_TARGET_CONCURRENCY:
        target_delay = latency / self.target_concurrency
        new_delay = (slot.delay + target_delay) / 2.0   # slot.delay is the previous delay
        new_delay = max(target_delay, new_delay)
        new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
        slot.delay = new_delay
"""
# Enable autothrottling
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
"""
18. HTTP cache
    Caches requests/responses that have already been fetched so they can be reused later
    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# Enable the cache
# HTTPCACHE_ENABLED = True
# Cache policy: cache every request; later identical requests are served from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# Cache policy: follow HTTP caching headers such as Cache-Control and Last-Modified
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
# Cache expiration time
# HTTPCACHE_EXPIRATION_SECS = 0
# Cache directory
# HTTPCACHE_DIR = 'httpcache'
# HTTP status codes that are never cached
# HTTPCACHE_IGNORE_HTTP_CODES = []
# Cache storage backend
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

"""
19. Proxies (set through environment variables or a custom middleware)
    from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

    Option 1: use the default middleware and os.environ
        {
            http_proxy: http://root:woshiniba@192.168.11.11:9999/
            https_proxy: http://192.168.11.11:9999/
        }

    Option 2: use a custom downloader middleware

        def to_bytes(text, encoding=None, errors='strict'):
            if isinstance(text, bytes):
                return text
            if not isinstance(text, six.string_types):
                raise TypeError('to_bytes must receive a unicode, str or bytes '
                                'object, got %s' % type(text).__name__)
            if encoding is None:
                encoding = 'utf-8'
            return text.encode(encoding, errors)

        class ProxyMiddleware(object):
            def process_request(self, request, spider):
                PROXIES = [
                    {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                    {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                    {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                    {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                    {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                    {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
                ]
                proxy = random.choice(PROXIES)
                if proxy['user_pass'] is not None:
                    request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                    encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
                    request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
                    print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
                else:
                    print("**************ProxyMiddleware no pass************" + proxy['ip_port'])
                    request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])

        DOWNLOADER_MIDDLEWARES = {
            'step8_king.middlewares.ProxyMiddleware': 500,
        }
"""

"""
20. HTTPS
    There are two cases when crawling HTTPS sites:
    1. The site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
    2. The site uses a custom certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,   # pKey object
                    certificate=v2,  # X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )

    Related classes:
        scrapy.core.downloader.handlers.http.HttpDownloadHandler
        scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
        scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
    Related settings:
        DOWNLOADER_HTTPCLIENTFACTORY
        DOWNLOADER_CLIENTCONTEXTFACTORY
"""

"""
21. Spider middleware

    class SpiderMiddleware(object):

        def process_spider_input(self, response, spider):
            # Called for each response after download, before it is handed to parse
            pass

        def process_spider_output(self, response, result, spider):
            # Called with the results the spider returns;
            # must return an iterable of Request or Item objects
            return result

        def process_spider_exception(self, response, exception, spider):
            # Called when the spider raises an exception;
            # return None to let later middlewares handle it, or an iterable of
            # Response/Item objects to hand to the scheduler or the pipelines
            return None

        def process_start_requests(self, start_requests, spider):
            # Called when the spider starts; must return an iterable of Request objects
            return start_requests

    Built-in spider middlewares:
        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
        'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
        'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
        'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,
"""
# from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    # 'step8_king.middlewares.SpiderMiddleware': 543,
}

"""
22. Downloader middleware

    class DownMiddleware1(object):

        def process_request(self, request, spider):
            # Called for every request that needs to be downloaded;
            # return None to continue, a Response to skip the download, a Request to reschedule,
            # or raise IgnoreRequest to jump to process_exception
            pass

        def process_response(self, request, response, spider):
            # Called with the downloaded response;
            # return a Response to pass it on, a Request to reschedule the download,
            # or raise IgnoreRequest to call Request.errback
            print('response1')
            return response

        def process_exception(self, request, exception, spider):
            # Called when the download handler or process_request() raises an exception;
            # return None to continue, a Response to stop further process_exception calls,
            # or a Request to reschedule the download
            return None

    Default downloader middlewares:
        {
            'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
            'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
            'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
            'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
            'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
            'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
            'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
            'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
            'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
            'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
            'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
            'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
            'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
        }
"""
# from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'step8_king.middlewares.DownMiddleware1': 100,
#    'step8_king.middlewares.DownMiddleware2': 500,
# }

This article is reposted from https://www.cnblogs.com/wupeiqi/articles/6229292.html
