AI Basics 01: Text Data Collection

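This article covers collecting text data. AI trainers and data analysts often have to acquire data first and only then clean and label it, so data collection is the foundation for every later step.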

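1) Definition of data acquisition

Data acquisition (DAQ), also called data collection, means using a device or interface to gather data from outside a system and feed it into that system. The technique is used across a wide range of fields.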

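2) A data collection example

Suppose we need today's weather for Foshan: whether it is sunny or cloudy, the temperature, and the wind speed.

Steps: request the website to get the HTML ==> parse the HTML with BeautifulSoup and pick out what we need ==> save it to a CSV file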

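a. Install the Requests and BeautifulSoup libraries

You can run pip install requests

Alternatively, import the library in PyCharm; when PyCharm reports that the library is missing, click the prompt to install it.

Once the installation succeeds, the library name is no longer underlined.

Other libraries can be installed the same way.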

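b. Using the Requests library

requests is a very popular third-party Python library for sending HTTP requests. It provides a simple but powerful interface for talking to HTTP servers.

requests.get() is the main method for fetching an HTML page:

r = requests.get(url, params=None, **kwargs)

url: the URL of the page to fetch

params: a dictionary or byte sequence that is appended to the URL as query parameters

r: the returned Response object, which holds the server's response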
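To make the params argument concrete, here is a minimal sketch that builds a request without sending it, so it runs offline. The endpoint https://example.com/search and the query keys are made up for illustration; Request(...).prepare() is used only to inspect the URL that requests would call.

```python
import requests

# Hypothetical endpoint and query values, used only for illustration.
base_url = "https://example.com/search"
query = {"city": "foshan", "days": 7}

# Build the request without sending it, then inspect the final URL.
prepared = requests.Request("GET", base_url, params=query).prepare()
print(prepared.url)  # https://example.com/search?city=foshan&days=7
```

Passing the same dictionary as requests.get(base_url, params=query) would send the request to that URL.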

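import requests

# URL of the weather page
url = "https://www.weather.com.cn/weather/101280800.shtml"
r = requests.get(url, timeout=10)
print(r)       # the Response object, e.g. <Response [200]> on success
print(r.text)  # the full content fetched from the page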

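c. Using BeautifulSoup

BeautifulSoup is a very popular Python library for parsing HTML and XML documents. It provides a simple API for extracting data.

Install the library before using it. If it is not installed yet, use pip: pip install beautifulsoup4

Importing BeautifulSoup

In your Python script, first import BeautifulSoup. The parser (such as lxml or html.parser) is chosen by name when the document is parsed; html.parser ships with Python, while lxml has to be installed separately.

from bs4 import BeautifulSoup

Parsing an HTML or XML document

Use the BeautifulSoup class to parse an HTML or XML document. You normally pass the document content and the parser name to the BeautifulSoup constructor.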

# Sample HTML document
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

# Parse the HTML document with the html.parser parser
soup = BeautifulSoup(html_doc, 'html.parser')

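Finding elements

BeautifulSoup offers several ways to find elements, including:

find(): returns the first matching tag.

find_all(): returns all matching tags.

find_parent(), find_parents(): find parent tags.

find_next_sibling(), find_next_siblings(): find the following sibling tags.

find_previous_sibling(), find_previous_siblings(): find the preceding sibling tags.

select(): finds elements with a CSS selector.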
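The methods listed above can be tried side by side on a small document. This is only a sketch: the HTML fragment, the id "box", and the class names are invented for the demonstration.

```python
from bs4 import BeautifulSoup

# Made-up fragment, used only to exercise the search methods.
html = """
<div id="box">
  <p class="intro">First</p>
  <p class="body">Second</p>
  <p class="body">Third</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")                       # first <p> in the document
bodies = soup.find_all("p", class_="body")   # every <p class="body">
parent = first.find_parent("div")            # nearest enclosing <div>
nxt = first.find_next_sibling("p")           # next <p> at the same level
prev = nxt.find_previous_sibling("p")        # back to the first <p>
selected = soup.select("#box p.body")        # the same search as a CSS selector

print(first.string, len(bodies), parent["id"])
print(nxt.string, prev.string, [p.string for p in selected])
```

The sibling methods skip the whitespace between tags because a tag name is given as the filter.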

Example: using find() and find_all()

# Find the first <a> tag
first_link = soup.find('a')
print(first_link)  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# Find all <a> tags
all_links = soup.find_all('a')
for link in all_links:
    print(link)  # print each <a> tag in full

Example: using select() (CSS selectors)

# Use a CSS selector to find all <a> tags with class "sister"
sisters = soup.select('a.sister')
for sister in sisters:
    print(sister['href'], sister.text)  # print the link and the text content

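Getting and modifying attributes and content

The attributes and content of an element are easy to read or change.

# Get an attribute value
href = first_link['href']  # read the href attribute
print(href)  # http://example.com/elsie

# Modify an attribute value or the content
first_link['href'] = "http://newexample.com/elsie"  # change the href attribute
first_link.string = "Elsie New"  # change the text inside the <a> tag to "Elsie New"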
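Indexing with tag['name'] raises KeyError when the attribute does not exist, which is common on real pages. The sketch below, on a one-line sample tag, shows the more forgiving get() together with attrs and get_text(); the title value is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://example.com/elsie" id="link1">Elsie</a>', "html.parser")
link = soup.find("a")

print(link.attrs)           # all attributes as a dictionary
missing = link.get("title")
print(missing)              # None: get() does not raise for a missing attribute
print(link.get_text())      # the text inside the tag

link["title"] = "First sister"  # add a new attribute (made-up value)
del link["id"]                  # remove an existing attribute
print(link)
```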

3) Writing the script

Get today's weather for Foshan: whether it is sunny or cloudy, the temperature, and the wind speed.

Reference code:
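import csv
import os

import requests
from bs4 import BeautifulSoup


# Request a web page and return its HTML text
def get_html_text(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an exception for 4xx/5xx responses
        r.encoding = 'utf-8'
        print("Page fetched successfully")
        return r.text
    except requests.RequestException:
        return "Request failed"
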

# Save the data to a CSV file
def write_to_csv(file_name, data, day=1):
    if not os.path.exists(file_name):
        # New file: write the header row first (when day == 1), then the data
        with open(file_name, "w", encoding="utf-8", newline="") as f:
            f_csv = csv.writer(f)
            if day == 1:
                header = ["High temperature", "Low temperature", "Weather", "Wind speed"]
                f_csv.writerow(header)
            f_csv.writerows(data)
    else:
        # Existing file: append the data rows only
        with open(file_name, "a", encoding="utf-8", newline="") as f:
            f_csv = csv.writer(f)
            f_csv.writerows(data)
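The same "header once, then append" idea can also be written with csv.DictWriter, which ties each value to a column name. This is an alternative sketch, not the script's own approach: the column names, the sample rows, and the temporary file are all made up so the example can run anywhere.

```python
import csv
import os
import tempfile

FIELDS = ["high", "low", "weather", "wind"]  # hypothetical column names


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only for a new file."""
    is_new = not os.path.exists(path)
    with open(path, "a", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# Write twice to a temporary folder, then read everything back.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "weather_demo.csv")
append_rows(csv_path, [{"high": "25", "low": "18", "weather": "Sunny", "wind": "<3"}])
append_rows(csv_path, [{"high": "24", "low": "17", "weather": "Cloudy", "wind": "3-4"}])

with open(csv_path, encoding="utf-8", newline="") as f:
    rows_read = list(csv.DictReader(f))
print(len(rows_read), rows_read[0]["weather"], rows_read[1]["weather"])
```

Opening the file in "a" mode covers both cases, so the only branch left is whether to write the header.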

# Main program
if __name__ == '__main__':
    # URL of the weather page
    url = "https://www.weather.com.cn/weather/101280800.shtml"
    # Folder where the CSV file is saved
    file_direction = "D:\\dewi\\project2024\\myListPractice\\pythonProject1\\test_data"

    # Fetch the Foshan weather forecast page
    html_text = get_html_text(url)
    print(html_text)
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(html_text, 'html.parser')

    # Get today's weather
    # The code expects <p class="tem"> to hold the high in <span> and the low in <i>,
    # <p class="wea"> to hold the weather text, and <p class="win"> to hold the wind in <i>
    if soup.find("p", class_="tem").span is None:
        temperature_H = "N/A"  # a request made at night may have no high temperature, so check for it
    else:
        temperature_H = soup.find("p", class_="tem").span.string
    temperature_L = soup.find('p', class_='tem').i.string  # find() returns the first match: the low temperature
    weather = soup.find('p', class_='wea').string  # weather condition
    wind_speed = soup.find("p", class_="win").i.string  # wind speed

    # Put the collected data in a list
    weather_data = []
    weather_data.append([temperature_H, temperature_L, weather, wind_speed])  # a list of lists, ready for writerows(); a list of dicts would also work
    print("Today's weather:", weather_data)
    # Save to the CSV file
    write_to_csv(file_direction + "\\weather_data.csv", weather_data, day=1)
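Chained lookups such as soup.find(...).i.string raise AttributeError as soon as one tag is missing, for example when the request failed or the page layout changed. One possible way to guard against that is a small helper that falls back to a default value. The helper name tag_text and the sample fragment below are hypothetical and not part of the original script.

```python
from bs4 import BeautifulSoup


def tag_text(parent, name, default="N/A", **filters):
    """Return the stripped text of the first matching tag, or default if absent."""
    if parent is None:
        return default
    tag = parent.find(name, **filters)
    return tag.get_text(strip=True) if tag is not None else default


# Made-up fragment: the high temperature <span> is missing, as it may be at night.
sample = '<li><p class="wea">Cloudy</p><p class="tem"><i>18℃</i></p></li>'
soup = BeautifulSoup(sample, "html.parser")
tem = soup.find("p", class_="tem")

high = tag_text(tem, "span")                         # no <span>, so the default is used
low = tag_text(tem, "i")
weather = tag_text(soup, "p", class_="wea")
wind = tag_text(soup.find("p", class_="win"), "i")   # <p class="win"> is absent here
print(high, low, weather, wind)
```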

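4) Advanced practice

How do we get the lowest temperature for each day of the 7-day forecast?

We can extract the values and put them in a list.

This requires find_all(). You also need to understand the HTML structure; after that, basic syntax is enough:

HTML structure for reference, as the code below relies on it: the 7-day data sits in the div with id "7d"; its first ul contains one li per day, and each li holds a p with class "tem".

Reference code: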

import requests
from bs4 import BeautifulSoup

# URL of the weather page
url = "https://www.weather.com.cn/weather/101280800.shtml"
r = requests.get(url, timeout=20)
r.encoding = 'utf-8'
print(r)
# print(r.text)  # the full content fetched from the page

soup = BeautifulSoup(r.text, "html.parser")
# Practice find()
temperature_low = soup.find("p", class_="tem").i.string
print("First low temperature:", temperature_low)

# Practice find_all(): the low temperature for all 7 days
body = soup.body  # the <body> content
data = body.find('div', {'id': '7d'})  # the 7-day data
ul = data.find('ul')  # the first <ul>
li = ul.find_all('li')  # all <li> items
temperature_7days = []
for day in li:
    temperature_day = day.find("p", class_="tem").i.string.replace('℃', '')  # each day's low temperature
    temperature_7days.append(temperature_day)  # add to the list; for several values per day, use a list of lists
print("Low temperatures for the 7 days:", temperature_7days)
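When several values are needed per day, a list of dictionaries keeps each value labelled. The sketch below runs offline on a made-up fragment that only mimics the div#7d > ul > li layout used by the script above; the real page may differ, and the dictionary keys are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure the script above relies on.
sample = """
<div id="7d"><ul>
  <li><p class="wea">Sunny</p><p class="tem"><span>25</span>/<i>18℃</i></p></li>
  <li><p class="wea">Cloudy</p><p class="tem"><i>17℃</i></p></li>
</ul></div>
"""
soup = BeautifulSoup(sample, "html.parser")

forecast = []
for day in soup.find("div", id="7d").find("ul").find_all("li"):
    tem = day.find("p", class_="tem")
    forecast.append({
        "weather": day.find("p", class_="wea").get_text(strip=True),
        # The high temperature <span> may be missing, so fall back to "N/A".
        "high": tem.span.get_text(strip=True) if tem.span is not None else "N/A",
        "low": tem.i.get_text(strip=True).replace("℃", ""),
    })

for row in forecast:
    print(row)
```

A list built this way can be written out with csv.DictWriter, using the dictionary keys as column names.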

A little progress every day. Keep going!
