Data Science and Computing: Case Notes on Web Scraping and Data Analysis

Case 1: Scraping and Analyzing Chinese University Rankings

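I. Task Description

Goal: scrape the Chinese university ranking table from 高三網 (Gaosan Wang), extract each school's name, total score, national rank, star rating, and school tier, and save the result as a CSV file.

Page: 2021中國的大學排名一覽表_高三網 (link title: "2021 list of Chinese university rankings", 高三網)

II. Task Analysis

Data source: a table in the web page with the columns "名次" (rank), "學校名稱" (school name), "總分" (total score), "全國排名" (national rank), "星級排名" (star rating), and "辦學層次" (school tier).

Page structure: the data is nested in table > tbody > tr tags, so the table rows have to be extracted by parsing the HTML.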
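To make the table > tbody > tr structure concrete, here is a minimal sketch that walks rows and cells using only the standard library's html.parser. It is a stand-in for illustration, not the BeautifulSoup approach these notes use; the sample markup, the class name TableRowParser, and the school names are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical sample markup that mimics the table > tbody > tr layout.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>1</td><td>School A</td><td>100</td></tr>
    <tr><td>2</td><td>School B</td><td></td></tr>
  </tbody>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


row_parser = TableRowParser()
row_parser.feed(SAMPLE_HTML)
rows = row_parser.rows
print(rows)
```

An empty cell comes back as an empty string, which is how a missing total score would end up as an empty field in the CSV file.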

III. Code Implementation

1. Importing the core libraries

python

import csv  # read and write CSV files

import requests  # send HTTP requests
from bs4 import BeautifulSoup  # parse HTML

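2. Helper functions

Fetching the page content (get_html)

python

def get_html(url, time=3):
    """Return the page text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=time)  # send the GET request; `time` is the timeout in seconds
        r.raise_for_status()  # raise an exception on 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # guess the encoding from the response body
        return r.text  # return the page text
    except Exception as error:
        print(error)
        return None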

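Parsing the page data (parser)

python

def parser(html):
    soup = BeautifulSoup(html, "lxml")  # parse the HTML (requires the lxml package)
    out_list = []
    for row in soup.select("table>tbody>tr"):  # iterate over the table rows
        td_html = row.select("td")  # cells of the current row
        if len(td_html) < 6:  # skip rows that lack the expected six columns
            continue
        row_data = [
            td_html[1].text.strip(),  # school name
            td_html[2].text.strip(),  # total score
            td_html[3].text.strip(),  # national rank
            td_html[4].text.strip(),  # star rating
            td_html[5].text.strip(),  # school tier
        ]
        out_list.append(row_data)
    return out_list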

Saving to a CSV file (save_csv)

python

def save_csv(item, path):
    # newline="" lets the csv module control the line endings
    with open(path, "wt", newline="", encoding="utf-8") as f:
        csv_write = csv.writer(f)
        csv_write.writerows(item)  # write all rows at once
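The pandas steps later in these notes select a column by name ("總分"), which only works if the CSV file starts with a header row. Whether the scraped table already supplies one depends on the page markup, so the following is a minimal sketch of a variant that writes the header explicitly. The function name save_csv_with_header, the HEADER list, and the sample rows are assumptions for illustration, not part of the original code.

```python
import csv
import os
import tempfile

# Assumed column names, taken from the fields listed in the task analysis.
HEADER = ["學校名稱", "總分", "全國排名", "星級排名", "辦學層次"]


def save_csv_with_header(rows, path, header=HEADER):
    """Write a header row followed by the data rows (hypothetical helper)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)  # column names first
        writer.writerows(rows)   # then the scraped rows


# Hypothetical sample rows; the second one has an empty total score.
sample_rows = [
    ["School A", "100", "1", "8星", "Tier X"],
    ["School B", "", "2", "7星", "Tier Y"],
]

path = os.path.join(tempfile.mkdtemp(), "school_demo.csv")
save_csv_with_header(sample_rows, path)

# Read the file back to confirm the layout.
with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))
print(read_back[0])
```

If the scraped table already yields its own header row as the first data row, this extra header should be left out to avoid writing it twice.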

3. Main program

python

if __name__ == "__main__":
    url = "http://www.bspider.top/gaosan/"
    html = get_html(url)  # fetch the page
    if html:  # get_html returns None when the request fails
        out_list = parser(html)  # parse the data
        save_csv(out_list, "school.csv")  # save the file

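IV. Data Preprocessing (Handling Missing Values)

Empty values in the "總分" (total score) column can be handled with pandas in any one of the following ways.

Drop rows that contain empty fields

python

import pandas as pd

# Assumes the first row of school.csv holds the column names (e.g. "總分").
df = pd.read_csv("school.csv")
new_df = df.dropna()  # drop every row that has a missing value
print(new_df.to_string())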

Replace empty fields with a fixed value

python

df.fillna("暫無分數信息", inplace=True)  # fill missing values with the text "暫無分數信息" ("no score available")

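Replace empty fields with the mean

python

x = df["總分"].mean()  # mean of the non-missing scores
# Assign the result back instead of calling fillna(inplace=True) on the
# selected column, which newer pandas versions warn against.
df["總分"] = df["總分"].fillna(x)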

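Replace empty fields with the median

python

x = df["總分"].median()  # median of the non-missing scores
# Assign the result back instead of calling fillna(inplace=True) on the
# selected column, which newer pandas versions warn against.
df["總分"] = df["總分"].fillna(x)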
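The choice between mean and median matters when a few scores are far from the rest. The sketch below shows the difference on a tiny list, using only the standard library's statistics module as a stand-in for the pandas calls; the scores and the helper fill_missing are hypothetical.

```python
import statistics

# Hypothetical total scores; None marks a missing value.
scores = [100.0, None, 64.0, 62.0, None, 60.0]


def fill_missing(values, fill_value):
    """Return a copy of values with every None replaced by fill_value."""
    return [fill_value if v is None else v for v in values]


present = [v for v in scores if v is not None]
mean_value = statistics.mean(present)      # pulled upward by the single 100.0
median_value = statistics.median(present)  # stays near the bulk of the scores

filled_with_mean = fill_missing(scores, mean_value)
filled_with_median = fill_missing(scores, median_value)
print(mean_value, median_value)
```

Here the mean is 71.5 while the median is 63.0, so median filling keeps the imputed values closer to the typical school when the distribution is skewed.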

V. Data Analysis and Visualization

1. Data overview

The data set covers 820 schools. By star rating: 8-star (8 schools), 7-star (16), 6-star (36), 5-star (59), 4-star (103), 3-star (190), 2-star (148), 1-star (260).
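As a quick check of these figures, the following sketch sums the counts and derives each rating's share of the total. The variable names are illustrative; the counts are the ones quoted above.

```python
# Star-rating counts quoted in the data overview.
star_counts = {
    "8星": 8,
    "7星": 16,
    "6星": 36,
    "5星": 59,
    "4星": 103,
    "3星": 190,
    "2星": 148,
    "1星": 260,
}

total = sum(star_counts.values())

# Share of each rating in percent, rounded to one decimal place.
shares = {star: round(count / total * 100, 1) for star, count in star_counts.items()}

print(total)
for star, share in shares.items():
    print(f"{star}: {share}%")
```

Computing the shares from the counts avoids hand-rounded percentages, which can drift by a tenth of a point from the exact values.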

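2. Charts

Bar chart (horizontal / vertical)

python

import matplotlib.pyplot as plt
import numpy as np

plt.rcParams["font.sans-serif"] = ["SimHei"]  # use a font that can render Chinese (SimHei must be installed)

x = np.array(["8星", "7星", "6星", "5星", "4星", "3星", "2星", "1星"])  # star ratings
y = np.array([8, 16, 36, 59, 103, 190, 148, 260])  # number of schools per rating

plt.title("不同星級的學校個數")  # "Number of schools per star rating"
plt.bar(x, y)  # vertical bar chart
# plt.barh(x, y)  # horizontal bar chart
plt.show()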

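Pie chart (share of each rating)

python

import matplotlib.pyplot as plt
import numpy as np

plt.rcParams["font.sans-serif"] = ["SimHei"]  # use a font that can render Chinese

y = np.array([1, 2, 4.5, 7.2, 12.5, 23.1, 18, 31.7])  # approximate share of each star rating (%)
plt.pie(y, labels=["8星", "7星", "6星", "5星", "4星", "3星", "2星", "1星"])
plt.title("不同星級的學校個數占比")  # "Share of schools per star rating"
plt.show()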

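VI. Summary

This case walks through the full workflow: scrape the data, clean it in preprocessing, then analyze it visually.
Core techniques: requests for fetching pages, BeautifulSoup for parsing, pandas for data processing, matplotlib for visualization.
Use case: extracting and analyzing structured data to show how the university rankings are distributed.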
