Python Frameworks: Scrapy Crawler (Part 1)

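When writing a crawler, most of the running time is spent on IO requests. In a single-process, single-threaded program every URL request blocks until its response arrives, so the requests as a whole are slow.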
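As a rough illustration of that waiting cost, here is a minimal sketch in which `time.sleep` stands in for the network round trip. The `fake_fetch` helper, the 0.05 s delay and the third URL are assumptions made for the demonstration, not part of the original article.

```python
import time


def fake_fetch(url, delay=0.05):
    """Hypothetical stand-in for requests.get: sleeps to simulate network latency."""
    time.sleep(delay)
    return url


def fetch_all_sequential(urls, delay=0.05):
    """Fetch the URLs one after another and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [fake_fetch(u, delay) for u in urls]
    return results, time.perf_counter() - start


urls = ['http://www.github.com', 'http://www.bing.com', 'http://www.python.org']
results, elapsed = fetch_all_sequential(urls)
# The total time is roughly the sum of the individual delays.
print(results, round(elapsed, 2))
```

With one thread the delays simply add up, which is the cost the approaches below try to avoid.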

1. Synchronous execution

import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']
for url in url_list:
    fetch_async(url)

2. Multithreaded execution

from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)
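The thread pool version above submits the work but never reads the return values. One possible way to collect them is `concurrent.futures.as_completed`, sketched below. The `fake_fetch` helper and the example URLs are assumptions; the sleep simulates a request so the sketch runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(url):
    """Hypothetical stand-in for requests.get: sleeps, then returns a fake body."""
    time.sleep(0.1)
    return 'body of ' + url


url_list = ['http://a.example', 'http://b.example',
            'http://c.example', 'http://d.example']

start = time.perf_counter()
results = {}
with ThreadPoolExecutor(5) as pool:
    # Map each future back to the URL it was created for.
    future_to_url = {pool.submit(fake_fetch, url): url for url in url_list}
    for future in as_completed(future_to_url):
        results[future_to_url[future]] = future.result()
elapsed = time.perf_counter() - start

# The four simulated requests overlap, so this takes far less than 4 x 0.1 s.
print(results, round(elapsed, 2))
```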

3. Multithreading + callback

from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    print(future.result())


url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)
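Calling `future.result()` inside a callback re-raises any exception from the worker. A hedged sketch of a callback that checks `future.exception()` first is shown below. The `fake_fetch` helper, the example URLs and the `outcomes` list are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

outcomes = []


def fake_fetch(url):
    """Hypothetical stand-in for requests.get that fails for some URLs."""
    if 'bad' in url:
        raise ValueError('cannot fetch ' + url)
    return 'body of ' + url


def callback(future):
    # exception() returns None when the call succeeded.
    exc = future.exception()
    if exc is not None:
        outcomes.append(('error', str(exc)))
    else:
        outcomes.append(('ok', future.result()))


with ThreadPoolExecutor(2) as pool:
    for url in ['http://good.example', 'http://bad.example']:
        pool.submit(fake_fetch, url).add_done_callback(callback)

# Completion order is not guaranteed, so sort before printing.
print(sorted(outcomes))
```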

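4. Multiprocess execution

from concurrent.futures import ProcessPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']

# The __main__ guard keeps worker processes from re-running this block
# on platforms that start workers with "spawn".
if __name__ == '__main__':
    pool = ProcessPoolExecutor(5)
    for url in url_list:
        pool.submit(fetch_async, url)
    pool.shutdown(wait=True)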

5. Multiprocessing + callback

from concurrent.futures import ProcessPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    print(future.result())


url_list = ['http://www.github.com', 'http://www.bing.com']

# The __main__ guard keeps worker processes from re-running this block
# on platforms that start workers with "spawn".
if __name__ == '__main__':
    pool = ProcessPoolExecutor(5)
    for url in url_list:
        v = pool.submit(fetch_async, url)
        v.add_done_callback(callback)
    pool.shutdown(wait=True)

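All of the code above improves request performance. The drawback of multithreading and multiprocessing is that a thread or process sits idle, and is therefore wasted, while its IO is blocked. Asynchronous IO is therefore the preferred choice: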
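To make the contrast concrete, the following sketch runs 100 simulated requests on a single thread with asyncio. `asyncio.sleep` stands in for the network wait, and the `fake_fetch` coroutine is a hypothetical name, not code from the original article.

```python
import asyncio
import threading
import time


async def fake_fetch(i):
    """Hypothetical request: yields to the event loop while 'waiting' on IO."""
    await asyncio.sleep(0.1)
    return i, threading.get_ident()


async def main():
    return await asyncio.gather(*(fake_fetch(i) for i in range(100)))


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

# Every task ran on the same thread, yet the waits overlapped.
thread_ids = {tid for _, tid in results}
print(len(results), len(thread_ids), round(elapsed, 2))
```

No thread is parked per request here: while one task waits, the event loop moves on to the next.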

1. asyncio, example 1

import asyncio


# async/await replaces @asyncio.coroutine and "yield from",
# which were removed in Python 3.11.
async def func1():
    print('before...func1......')
    await asyncio.sleep(5)
    print('end...func1......')


async def main():
    await asyncio.gather(func1(), func1())


asyncio.run(main())

2. asyncio, example 2

import asyncio


async def fetch_async(host, url='/'):
    print(host, url)
    reader, writer = await asyncio.open_connection(host, 80)

    request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')

    writer.write(request_header_content)
    await writer.drain()
    text = await reader.read()
    print(host, url, text)
    writer.close()


async def main():
    return await asyncio.gather(
        fetch_async('www.cnblogs.com', '/wupeiqi/'),
        fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091'),
    )


results = asyncio.run(main())
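The raw-socket request above can be tried without internet access by pointing it at a toy server on localhost. The sketch below is only an illustration: the `handle` server, its fixed response and the `/a` and `/b` paths are assumptions added for the demo.

```python
import asyncio


async def handle(reader, writer):
    # Toy local server (hypothetical): read the request head, reply, then close.
    await reader.readuntil(b'\r\n\r\n')
    writer.write(b'HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nhi')
    await writer.drain()
    writer.close()


async def fetch(host, port, url='/'):
    reader, writer = await asyncio.open_connection(host, port)
    request = 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (url, host)
    writer.write(request.encode('utf-8'))
    await writer.drain()
    raw = await reader.read()  # read until the server closes the connection
    writer.close()
    await writer.wait_closed()
    # Split the status line from the headers, then the headers from the body.
    status_line, _, rest = raw.partition(b'\r\n')
    body = rest.partition(b'\r\n\r\n')[2]
    return status_line, body


async def main():
    # Port 0 asks the operating system for any free port.
    server = await asyncio.start_server(handle, '127.0.0.1', 0)
    port = server.sockets[0].getsockname()[1]
    try:
        return await asyncio.gather(
            fetch('127.0.0.1', port, '/a'),
            fetch('127.0.0.1', port, '/b'),
        )
    finally:
        server.close()


responses = asyncio.run(main())
print(responses)
```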

3. asyncio + aiohttp

import aiohttp
import asyncio


async def fetch_async(session, url):
    print(url)
    # Recent aiohttp versions expect requests to go through a ClientSession
    # used as an async context manager.
    async with session.get(url) as response:
        # data = await response.read()
        # print(url, data)
        print(url, response)


async def main():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            fetch_async(session, 'http://www.google.com/'),
            fetch_async(session, 'http://www.chouti.com/'),
        )


results = asyncio.run(main())

4. asyncio + requests

import asyncio
import requests


async def fetch_async(func, *args):
    loop = asyncio.get_running_loop()
    # Run the blocking call in the default thread pool so the event loop stays free.
    future = loop.run_in_executor(None, func, *args)
    response = await future
    print(response.url, response.content)


async def main():
    return await asyncio.gather(
        fetch_async(requests.get, 'http://www.cnblogs.com/wupeiqi/'),
        fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091'),
    )


results = asyncio.run(main())
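`run_in_executor` still relies on threads under the hood, so it can help to pass an explicit executor that caps how many are used. The sketch below assumes a hypothetical `blocking_fetch` function that sleeps instead of calling `requests.get`, so it runs offline; the URLs are placeholders.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_fetch(url):
    """Hypothetical blocking call standing in for requests.get."""
    time.sleep(0.1)
    return 'body of ' + url


async def main(urls):
    loop = asyncio.get_running_loop()
    # At most two blocking calls run at once; the rest queue inside the executor.
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = [loop.run_in_executor(executor, blocking_fetch, url) for url in urls]
        # gather returns results in the order the futures were passed in.
        return await asyncio.gather(*futures)


urls = ['http://a.example/', 'http://b.example/', 'http://c.example/']
results = asyncio.run(main(urls))
print(results)
```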

5. gevent + requests

from gevent import monkey

# Patch the standard library before importing requests so its sockets become cooperative.
monkey.patch_all()

import gevent
import requests


def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)


# ##### Send the requests #####
gevent.joinall([
    gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
])

# ##### Send the requests (a greenlet pool caps the number of concurrent greenlets) #####
# from gevent.pool import Pool
# pool = Pool(None)
# gevent.joinall([
#     pool.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.github.com/', req_kwargs={}),
# ])

6. grequests

import grequests

request_list = [
    grequests.get('http://httpbin.org/delay/1', timeout=0.001),
    grequests.get('http://fakedomain/'),
    grequests.get('http://httpbin.org/status/500')
]

# ##### Execute and get the list of responses #####
# response_list = grequests.map(request_list)
# print(response_list)

# ##### Execute and get the list of responses (with exception handling) #####
# def exception_handler(request, exception):
#     print(request, exception)
#     print("Request failed")

# response_list = grequests.map(request_list, exception_handler=exception_handler)
# print(response_list)

7. Twisted example

# Note: getPage is deprecated in newer Twisted releases.
from twisted.web.client import getPage, defer
from twisted.internet import reactor


def all_done(arg):
    reactor.stop()


def callback(contents):
    print(contents)


deferred_list = []

url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
    deferred_list.append(deferred)

dlist = defer.DeferredList(deferred_list)
dlist.addBoth(all_done)

reactor.run()

8. Tornado

from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop


def handle_response(response):
    """
    Handle the response. A counter has to be maintained in order to stop the
    IO loop by calling ioloop.IOLoop.current().stop().
    :param response:
    :return:
    """
    if response.error:
        print("Error:", response.error)
    else:
        print(response.body)


def func():
    url_list = [
        'http://www.baidu.com',
        'http://www.bing.com',
    ]
    for url in url_list:
        print(url)
        http_client = AsyncHTTPClient()
        # Passing a callback to fetch() requires Tornado < 6.0.
        http_client.fetch(HTTPRequest(url), handle_response)


ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()

More Twisted

from twisted.internet import reactor
from twisted.web.client import getPage
import urllib.parse


def one_done(arg):
    print(arg)
    reactor.stop()


post_data = urllib.parse.urlencode({'check_data': 'adf'})
post_data = bytes(post_data, encoding='utf8')
headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
                   method=bytes('POST', encoding='utf8'),
                   postdata=post_data,
                   cookies={},
                   headers=headers)
response.addBoth(one_done)

reactor.run()

All of the above are asynchronous IO request modules, either built into Python or provided by third-party packages. They are easy to use and greatly improve efficiency. At its core, an asynchronous IO request is [non-blocking sockets] + [IO multiplexing]:

Asynchronous IO

import select
import socket
import time


class AsyncTimeoutException(TimeoutError):
    """Raised when a request times out."""

    def __init__(self, msg):
        self.msg = msg
        super(AsyncTimeoutException, self).__init__(msg)


class HttpContext(object):
    """Wraps the basic data of a request and its response."""

    def __init__(self, sock, host, port, method, url, data, callback, timeout=5):
        """
        sock: client socket object used for the request
        host: host name to request
        port: port to request
        method: request method
        url: URL to request
        data: data sent in the request body
        callback: function called when the request completes
        timeout: request timeout in seconds
        """
        self.sock = sock
        self.callback = callback
        self.host = host
        self.port = port
        self.method = method
        self.url = url
        self.data = data
        self.timeout = timeout

        self.__start_time = time.time()
        self.__buffer = []

    def is_timeout(self):
        """Return True if this request has timed out."""
        current_time = time.time()
        return (self.__start_time + self.timeout) < current_time

    def fileno(self):
        """File descriptor of the request socket, used by select."""
        return self.sock.fileno()

    def write(self, data):
        """Append response data to the buffer."""
        self.__buffer.append(data)

    def finish(self, exc=None):
        """The response is complete: run the request's callback."""
        if not exc:
            response = b''.join(self.__buffer)
            self.callback(self, response, exc)
        else:
            self.callback(self, None, exc)

    def send_request_data(self):
        content = """%s %s HTTP/1.0\r\nHost: %s\r\n\r\n%s""" % (
            self.method.upper(), self.url, self.host, self.data,)
        return content.encode(encoding='utf8')


class AsyncRequest(object):
    def __init__(self):
        self.fds = []
        self.connections = []

    def add_request(self, host, port, method, url, data, callback, timeout):
        """Create a request."""
        client = socket.socket()
        client.setblocking(False)
        try:
            client.connect((host, port))
        except BlockingIOError as e:
            pass
            # print('connection request sent to the remote host')
        req = HttpContext(client, host, port, method, url, data, callback, timeout)
        self.connections.append(req)
        self.fds.append(req)

    def check_conn_timeout(self):
        """Check every request and abort those whose connection has timed out."""
        timeout_list = []
        for context in self.connections:
            if context.is_timeout():
                timeout_list.append(context)
        for context in timeout_list:
            context.finish(AsyncTimeoutException('Request timed out'))
            self.fds.remove(context)
            self.connections.remove(context)

    def running(self):
        """Event loop: detect which request sockets are ready and act on them."""
        while True:
            r, w, e = select.select(self.fds, self.connections, self.fds, 0.05)

            if not self.fds:
                return

            for context in r:
                sock = context.sock
                while True:
                    try:
                        data = sock.recv(8096)
                        if not data:
                            self.fds.remove(context)
                            context.finish()
                            break
                        else:
                            context.write(data)
                    except BlockingIOError as e:
                        break
                    except TimeoutError as e:
                        self.fds.remove(context)
                        self.connections.remove(context)
                        context.finish(e)
                        break

            for context in w:
                # Connected to the remote server; start sending the request data.
                if context in self.fds:
                    data = context.send_request_data()
                    context.sock.sendall(data)
                    self.connections.remove(context)

            self.check_conn_timeout()


if __name__ == '__main__':
    def callback_func(context, response, ex):
        """
        :param context: HttpContext object wrapping the request information
        :param response: content of the response
        :param ex: the exception object if an error occurred, otherwise None
        :return:
        """
        print(context, response, ex)

    obj = AsyncRequest()
    url_list = [
        {'host': 'www.google.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
        {'host': 'www.baidu.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
        {'host': 'www.bing.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
    ]
    for item in url_list:
        print(item)
        obj.add_request(**item)

    obj.running()
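The two ingredients named above, non-blocking sockets and IO multiplexing, can also be seen in miniature with the standard `selectors` module and local socket pairs. This is only a sketch under stated assumptions: the socket pairs stand in for remote servers, and the `conn-N` labels are hypothetical.

```python
import selectors
import socket

sel = selectors.DefaultSelector()

# Two connected socket pairs stand in for two remote servers (hypothetical setup).
pairs = [socket.socketpair() for _ in range(2)]
for i, (client, server) in enumerate(pairs):
    client.setblocking(False)
    sel.register(client, selectors.EVENT_READ, data='conn-%d' % i)

# Non-blocking socket: recv with no data raises BlockingIOError instead of waiting.
try:
    pairs[0][0].recv(1024)
    would_block = False
except BlockingIOError:
    would_block = True

# IO multiplexing: only the second "server" answers, and the selector
# reports just that client socket as readable.
pairs[1][1].sendall(b'hello')
ready = [(key.data, key.fileobj.recv(1024)) for key, _ in sel.select(timeout=1)]
print(would_block, ready)

sel.close()
for client, server in pairs:
    client.close()
    server.close()
```

One thread can watch many sockets this way and touch only those that are ready, which is the same idea the `select.select` loop in `AsyncRequest.running` is built on.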

This article is a repost.
