Python Web Scraping: Threads, Processes, and Coroutines

0x00 Threads

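A thread is a unit of execution that lives inside a process and shares the memory and other resources that the process provides. Running many threads therefore uses less memory than running the same number of processes. A process is like a "house" (its own independent resources), and threads are the "activities in its rooms" (they share the house's resources but run independently).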
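To make the "shared resources" point concrete, here is a minimal sketch in which several threads update one object that lives in the process's memory. The function name run_shared_counter and its parameters are made up for this illustration; the lock is there because a shared read-modify-write is not safe without one.

```python
import threading

def run_shared_counter(num_threads=4, increments=1000):
    """Start several threads that all update the same dict in this process."""
    shared = {"count": 0}      # one object, visible to every thread
    lock = threading.Lock()    # guards the read-modify-write below

    def worker():
        for _ in range(increments):
            with lock:
                shared["count"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()               # wait for every worker to finish
    return shared["count"]

if __name__ == "__main__":
    print(run_shared_counter())
```

Every worker sees the same dict without any copying or message passing, which is what separate processes would need.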

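A process is a unit of resource allocation: every running application, for example, is a process.
Multiple processes use more memory and are generally suited to CPU-bound work such as image processing or video encoding. They are not covered here.
The code below runs in a single thread: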

import requests

url = ''  # fill in the target URL
requests.get(url)

Using multiple threads

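import threading

def task(a):
    print(f"{a} child thread")

if __name__ == '__main__':
    # Pass the function's arguments through a dict
    s = threading.Thread(target=task, kwargs={"a": "bbb"})
    s.start()  # start() runs task in a new thread; run() would execute it in the current thread
    print("aaa")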
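One detail worth checking when working with threading.Thread is the difference between run() and start(): calling run() directly just invokes the target in the calling thread, while start() creates a new thread. The sketch below records which thread executes the target in each case. The names where_am_i and compare_run_and_start are hypothetical helpers for this illustration.

```python
import threading

def where_am_i(results, key):
    # Record the identity of the thread that actually executes this function
    results[key] = threading.get_ident()

def compare_run_and_start():
    results = {}
    main_id = threading.get_ident()

    t1 = threading.Thread(target=where_am_i, args=(results, "run"))
    t1.run()      # plain method call: executes in the calling thread

    t2 = threading.Thread(target=where_am_i, args=(results, "start"))
    t2.start()    # spawns a new thread
    t2.join()     # wait for it so the result is available

    # (ran in main thread via run()?, ran in main thread via start()?)
    return results["run"] == main_id, results["start"] == main_id

if __name__ == "__main__":
    print(compare_run_and_start())
```

Only the start() call gives real multithreading, so scraping code that wants concurrency should use start() followed by join().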

Creating a thread pool with 50 threads

from concurrent.futures import ThreadPoolExecutor

def task():
    for i in range(1, 1000):
        print(i)

if __name__ == '__main__':
    # Create a pool of 50 threads
    with ThreadPoolExecutor(50) as t:
        t.submit(task)
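submit() returns a Future, which is how a return value (or an exception) gets back from a pooled task. The following is a minimal sketch of collecting results. The functions square and run_pool are invented for the example and stand in for real work.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def square(n):
    return n * n

def run_pool(numbers, max_workers=5):
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each Future back to the input that produced it
        futures = {pool.submit(square, n): n for n in numbers}
        for fut in as_completed(futures):
            # result() returns the value, or re-raises the worker's exception
            results[futures[fut]] = fut.result()
    return results

if __name__ == "__main__":
    print(run_pool([1, 2, 3]))
```

Calling result() matters even when the return value is not needed: an exception raised inside a submitted task is otherwise never surfaced.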

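Example: scraping vegetable prices
The data does not appear in the page source. Filtering the browser's Network panel by XHR reveals the request that returns it: a POST request in which the current field is the page number.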

from concurrent.futures import ThreadPoolExecutor
import requests

url = 'http://www.xinfadi.com.cn/getPriceData.html'

def download(count):
    data = {"current": count, "limit": "20"}
    rep = requests.post(url=url, data=data)
    dic = rep.json()
    with open("4.csv", "a+", encoding="utf-8") as f:
        # Note: the response is a dict that contains a list of dicts.
        # Iterating over the list avoids an IndexError on a short page.
        for item in dic['list']:
            name = item['prodName']
            price = item['avgPrice']
            f.write(f"Name:{name},Average price:{price}\n")

if __name__ == '__main__':
    with ThreadPoolExecutor(50) as t:
        for i in range(1, 50):  # pages 1 through 49
            t.submit(download, count=i)
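When many threads append to the same file, rows from different pages can end up interleaved in an unpredictable order. One possible alternative, sketched here, is to let the workers only fetch and have the main thread write the file once. The field names prodName and avgPrice come from the example above; fetch_page, fake_fetch_page, collect_rows and write_csv are hypothetical names, and fake_fetch_page replaces the real POST request so the sketch runs offline.

```python
import csv
from concurrent.futures import ThreadPoolExecutor

def rows_from_page(payload):
    """Turn one JSON response into (name, price) rows; tolerates short pages."""
    return [(item["prodName"], item["avgPrice"]) for item in payload.get("list", [])]

def collect_rows(fetch_page, pages, max_workers=10):
    # fetch_page(page_number) -> dict is supplied by the caller (assumption)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        payloads = list(pool.map(fetch_page, pages))  # results come back in page order
    rows = []
    for payload in payloads:
        rows.extend(rows_from_page(payload))
    return rows

def write_csv(rows, out_path):
    # Single writer: only the calling thread touches the file
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "avg_price"])
        writer.writerows(rows)

def fake_fetch_page(page):
    # Stand-in for the real request, shaped like the response described above
    return {"list": [{"prodName": f"item{page}", "avgPrice": str(page)}]}
```

A call such as write_csv(collect_rows(fake_fetch_page, range(1, 4)), "prices_demo.csv") would produce a header line followed by three rows. In a real run, fetch_page would wrap the requests.post call.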


0x01 Coroutines

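A coroutine is a lightweight, user-space alternative to a thread that achieves efficient concurrency through cooperative multitasking. It is mostly used for I/O-bound work such as network requests and file reads and writes.
Multithreading: the operating system schedules multiple threads preemptively. This is one form of concurrency.
Asynchronous: an event loop schedules multiple tasks within a single thread. This is a concurrency model whose hallmark is high concurrency on one thread.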

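# Define a coroutine
import asyncio

async def fetch_data():
    print("Sending request...")
    await asyncio.sleep(1)  # simulate asynchronous I/O
    # time.sleep(1)  # wrong here: a synchronous sleep blocks the whole event loop
    print("Data received")
    return {"data": 42}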
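A coroutine only does work once an event loop runs it. As a rough sketch of the "single thread, high concurrency" idea, the code below runs several simulated requests with asyncio.gather and measures the total time. The variant of fetch_data that takes an argument, the run_all helper, and the 0.2-second delay are assumptions made for this illustration.

```python
import asyncio
import time

async def fetch_data(i):
    await asyncio.sleep(0.2)   # simulated non-blocking I/O
    return {"data": i}

async def run_all(n=5):
    start = time.perf_counter()
    # gather schedules all coroutines on the same event loop and
    # returns their results in the order they were passed in
    results = await asyncio.gather(*(fetch_data(i) for i in range(n)))
    elapsed = time.perf_counter() - start
    return results, elapsed

if __name__ == "__main__":
    results, elapsed = asyncio.run(run_all())
    print(results, f"{elapsed:.2f}s")
```

Because the waits overlap, the total should be close to one delay rather than five. Replacing asyncio.sleep with time.sleep would make the calls run one after another.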

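Example: scraping a novel
The page content is all present in the HTML source, so XPath is used to parse it, and the novel's text is saved to .txt files.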

import aiohttp
import asyncio
from lxml import etree
import os

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as response:
            # aiohttp takes the encoding as an argument to text()
            return await response.text(encoding="utf-8")

async def parse_down(url):
    text = await fetch(url)
    html = etree.HTML(text)
    title = html.xpath("//h1/text()")
    contents = html.xpath("//div[@id='chaptercontent']/text()")
    os.makedirs("novels", exist_ok=True)
    filename = f"{title[0]}.txt"
    filepath = os.path.join("novels", filename)
    with open(filepath, "w+", encoding='utf-8') as f:
        f.write(title[0] + '\n\n')
        for content in contents:
            # Strip the site's promotional footer from each line
            f.write(content.strip().replace("請收藏本站：https://www.bibie.cc。筆趣閣手機版：https://m.bibie.cc", "") + '\n')

async def main():
    tasks = []
    for i in range(1, 517):
        url = f'https://www.bibie.cc/html/229506/{i}.html'
        tasks.append(parse_down(url))
    await asyncio.gather(*tasks)
    print("Scraping finished")

if __name__ == '__main__':
    try:
        asyncio.run(main())
    except Exception as e:
        print(f"Scraping failed: {e}")
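Launching several hundred requests at once can overwhelm the target site or the local machine. One common way to cap the number of requests in flight is asyncio.Semaphore, sketched below. The names bounded_gather, demo and fake_download are hypothetical, and fake_download stands in for a real download coroutine such as parse_down so that the sketch runs without network access.

```python
import asyncio

async def bounded_gather(worker, items, limit=10):
    """Run worker(item) for every item, with at most `limit` running at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        async with sem:               # waits here while `limit` workers are busy
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))

async def demo(n=20, limit=5):
    state = {"in_flight": 0, "peak": 0}

    async def fake_download(i):
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0.01)     # stand-in for a network request
        state["in_flight"] -= 1
        return i

    results = await bounded_gather(fake_download, range(n), limit)
    return results, state["peak"]

if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(len(results), "downloads, peak concurrency", peak)
```

The limit value is a tuning knob, not a recommendation: a suitable number depends on the site being scraped and on how politely the scraper should behave.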

