Scraping news from Autohome (汽車之家)

requests: impersonate a browser, send an HTTP request to a URL, and get the response back as a string
- response = requests.get(url='address')
- response.content
- response.encoding = response.apparent_encoding
- response.text
bs4: parse a string of HTML
- soup = BeautifulSoup('<html>...</html>', "html.parser")
- soup.find(name='tag name')
- soup.find(name='tag name', id='il')
- soup.find(name='tag name', class_='il')
- soup.find(name='div', attrs={'id': 'auto-channel-lazyload-article', 'class': 'id'})
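The lookups listed above can be tried without touching the network. The following is a minimal sketch on a made-up HTML snippet (the markup, ids and class names are invented for illustration). It shows that the keyword for filtering by class is class_, because class is a reserved word in Python, and that attrs accepts a dict for the same purpose.

```python
from bs4 import BeautifulSoup

# Made-up markup, only for illustration.
html = """
<div id="i1" class="box">first</div>
<div id="i2" class="box">second</div>
<div id="i3" class="note">third</div>
"""

soup = BeautifulSoup(html, "html.parser")

by_name = soup.find(name="div")                   # first <div> in the document
by_id = soup.find(name="div", id="i2")            # filter by id
by_class = soup.find(name="div", class_="note")   # note the trailing underscore
by_attrs = soup.find(name="div", attrs={"id": "i2", "class": "box"})
missing = soup.find(name="div", id="nope")        # no match returns None

print(by_name.text, by_id.text, by_class.text, by_attrs.text, missing)
```

Because find returns None when nothing matches, any later attribute access such as .text should be guarded.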

1. Download the page

First, fetch the page we want to scrape:

import requests
ret = requests.get(url="https://www.autohome.com.cn/news/")
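The call above goes out with the default User-Agent of requests. If the aim is to look like a browser, a header can be passed explicitly. A minimal sketch under stated assumptions: fetch_page is a hypothetical helper, the User-Agent string is just an example value, and the 10-second timeout is an arbitrary choice.

```python
import requests

# Example browser-like header; the exact string is an arbitrary choice.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def fetch_page(url, session=None, timeout=10):
    """Send a GET request with a browser-like header and return the Response.

    Hypothetical helper; session can be any object with a requests-style get().
    """
    if session is None:
        session = requests.Session()
    resp = session.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return resp
```

With this helper, ret = fetch_page("https://www.autohome.com.cn/news/") would stand in for the plain requests.get call. Whether the site actually requires such a header is not something the original post states.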

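At this point print(ret) shows a Response object: <Response [200]>

Next, print(ret.content):

The output is the text of the whole page, but as bytes.

That is not what we need, so we switch to print(ret.text):

Now the Chinese text comes out as mojibake. We set encoding on ret to tell requests how to decode the body: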

ret.encoding = 'gbk'

Hard-coding the encoding is not very flexible, so we can do it another way:

ret.encoding = ret.apparent_encoding

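Here ret.apparent_encoding is the encoding that requests detects from the response body, so there is no need to hard-code it. print(ret.text) now displays the page correctly: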
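Why ret.text was garbled, and why setting the encoding fixes it, can be reproduced offline. A small sketch with a made-up sample string: raw plays the role of ret.content, and decoding it plays the role of ret.text. It assumes ISO-8859-1 as the wrong codec, which, as far as I know, is what requests falls back to for text responses whose headers declare no charset.

```python
title = "汽车之家"

# What the server sends: the text encoded as GBK bytes (like ret.content).
raw = title.encode("gbk")

# Decoding with the wrong codec does not fail; it silently yields mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the codec the page was written in restores the text.
fixed = raw.decode("gbk")

print(raw)
print(garbled)
print(fixed)
```

Setting ret.encoding only changes how the same bytes are decoded; ret.content itself is untouched.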

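2. Parse: extract the content we want

Now let's look at the structure of the Autohome news page:

At first glance, the news items sit in li tags under the div whose id is "auto-channel-lazyload-article". We select by id because a class name may not be unique, which makes it a poor filter.

Next, import the bs4 module at the top of the .py file. It parses the whole HTML page for us, doing the job we would otherwise attempt with regular expressions.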

from bs4 import BeautifulSoup

Parse the page with the HTML parser:

soup = BeautifulSoup(ret.text, 'html.parser')

print(type(soup)) gives <class 'bs4.BeautifulSoup'>, so soup has gone from plain text to an object.

Extract the div that contains the news:

div = soup.find(name='div',id='auto-channel-lazyload-article')

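First print(div) to check the result:

Then parse this div object a second time. What we ultimately want are the li elements inside it, so we use find_all to get all of them: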

li_list = div.find_all(name='li')

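Print li_list again:

li_list is now a list. First we need the h3 tag inside each item:

for li in li_list:
    h3 = li.find(name='h3')

Use print(h3) to inspect the h3 tags:

One of the results is None, so we go back to the page source to see why:

This looks like an ad slot, so we can filter it out with an if check:

for li in li_list:
    h3 = li.find(name='h3')
    if not h3:
        continue
    print(h3)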

Here h3 is an object; what we ultimately want is its text:

print(h3.text)

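So far we only have the headline of each li. Next we get the summary text and the link:

for li in li_list:
    h3 = li.find(name='h3')
    if not h3:
        continue
    print(h3.text)
    p = li.find(name='p')
    print(p.text)
    a = li.find('a')    # the first positional argument is name; find returns the first matching a
    print(a.attrs)      # attrs is a dict of all the tag's attributes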
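The parsing steps so far can be exercised offline. The sketch below runs them on made-up markup that only imitates the structure described above (the real page may differ), and collects each item into a dict instead of printing. parse_news is a hypothetical helper name, and the None checks for p and a are an added precaution that the original loop does not have.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the structure described in the text.
SAMPLE = """
<div id="auto-channel-lazyload-article">
  <ul>
    <li><a href="//example.com/news/1.html"><h3>Title one</h3><p>Summary one</p></a></li>
    <li class="ad">advertisement</li>
    <li><a href="//example.com/news/2.html"><h3>Title two</h3><p>Summary two</p></a></li>
  </ul>
</div>
"""


def parse_news(html):
    """Return a list of dicts with title, summary and href (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find(name="div", id="auto-channel-lazyload-article")
    if div is None:              # container missing: nothing to parse
        return []
    items = []
    for li in div.find_all(name="li"):
        h3 = li.find(name="h3")
        if h3 is None:           # items without an <h3> (e.g. ad slots) are skipped
            continue
        p = li.find(name="p")
        a = li.find(name="a")
        items.append({
            "title": h3.text,
            "summary": p.text if p is not None else "",
            "href": a.get("href") if a is not None else None,
        })
    return items


news = parse_news(SAMPLE)
for item in news:
    print(item["title"], item["href"])
```

Returning data rather than printing inside the loop makes it easier to check the result or save it later.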

Tidy up the output:

    print(h3.text, a.get('href'))
    print(p.text)
    print('\n')

While we are at it, let's download the images too:

    img = li.find('img')
    # print(img)
    src = img.get('src')
    # print(src)
    file_name = src.rsplit('__', maxsplit=1)[1]
    # print(file_name)
    ret_img = requests.get(url='https:' + src)
    with open(file_name, 'wb') as f:
        f.write(ret_img.content)
    print('\n')
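The image code above assumes that every src starts with // and contains a double underscore; if either assumption fails, 'https:' + src produces a bad URL or the [1] index raises IndexError. A slightly more defensive sketch using only the standard library is shown here. image_url_and_name is a hypothetical helper and the src values are made up for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

PAGE_URL = "https://www.autohome.com.cn/news/"


def image_url_and_name(src, page_url=PAGE_URL):
    """Return (absolute_url, local_file_name) for an <img> src (hypothetical helper)."""
    url = urljoin(page_url, src)                   # fills in the scheme for //host/path
    base = posixpath.basename(urlsplit(url).path)  # last path segment, query string excluded
    name = base.rsplit("__", maxsplit=1)[-1]       # [-1] also works when there is no "__"
    return url, name


# Made-up src values, only for illustration.
print(image_url_and_name("//img.example.com/g1/120x90_0_autohomecar__abc123.jpg"))
print(image_url_and_name("https://img.example.com/g1/plain.jpg"))
```

The returned url can be passed to requests.get and the name to open(..., 'wb') exactly as in the loop above.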

The images have now been downloaded into the current working directory.

Reposted from: https://www.cnblogs.com/Black-rainbow/p/9214707.html
