【Python爬蟲】爬取公共交通路網數據

程序來自于Github，以下這篇博客作為完整的學習記錄，也callback上一篇爬取公共交通站點的博文。

Bardbo/get_bus_lines_and_stations_data_from_gaode: 這個項目是基于高德開放平臺和公交網獲取公交線路及站點數據，并生成shp文件，代碼相對粗糙，但簡單可用https://github.com/Bardbo/get_bus_lines_and_stations_data_from_gaode

1. 導入庫

首先是程序需要的python庫。

import requests
import json
import pandas as pd
from lxml import etree
import time
from tqdm import tqdm

requests：用于發送HTTP請求，獲取網頁內容或API數據。
json：用于處理JSON格式的數據。
pandas：用于數據處理和保存為CSV文件。
lxml：用于解析HTML內容。
time：用于控制爬取速度，避免過快請求導致被封禁。
tqdm：用于顯示進度條，方便查看爬取進度。

2.?獲取城市公交線路名稱（`get_bus_line_name`?函數）

def get_bus_line_name(city_phonetic):url = 'http://{}.gongjiao.com/lines_all.html'.format(city_phonetic)r = requests.get(url).textet = etree.HTML(r)line_name = et.xpath('//div[@class="list"]//a/text()')return line_name

這里同樣也需要先爬取公交線路的名稱，函數需要傳入city_phonetic也就是城市的拼音（如?changsha、wuhan），函數會返回該城市所有公交線路的名稱列表（line_name）。

`3. 爬取公交路線（get_line_station_data`?函數）

獲取公交線路的url是高德的https://restapi.amap.com/v3/bus/linename?extensions=all&key={ak}&output=json&city={city}&offset=1&keywords={line_name}

那么先來看一下官方的介紹，這里使用的是公交路線關鍵字查詢，也就是說我們要輸入公交路線的關鍵字，例如15路等等。

返回的內容如下，status記錄了查詢成功與否，buslines中記錄了查詢成功的公交路線。

接下來來看代碼：

def get_line_station_data(city, line_name, ak, city_phonetic):print(f'正在獲取-->{line_name}')time.sleep(1)url = f'https://restapi.amap.com/v3/bus/linename?extensions=all&key={ak}&output=json&city={city}&offset=1&keywords={line_name}'r = requests.get(url).textrt = json.loads(r)try:if rt['buslines']:if len(rt['buslines']) == 0:print('no data in list..')else:dt = {}dt['line_name'] = rt['buslines'][0]['name']dt['polyline'] = rt['buslines'][0]['polyline']dt['total_price'] = rt['buslines'][0]['total_price']station_name = []station_coords = []for st in rt['buslines'][0]['busstops']:station_name.append(st['name'])station_coords.append(st['location'])dt['station_name'] = station_namedt['station_corrds'] = station_coordsdm = pd.DataFrame([dt])dm.to_csv(f'{city_phonetic}_lines.csv',mode='a',header=False,index=False,encoding='utf_8_sig')else:print('data not avaliable..')with open('data not avaliable.log', 'a') as f:f.write(line_name + '\n')except:print('error.. try it again..')time.sleep(2)get_line_station_data(city, line_name, ak, city_phonetic)

函數通過高德地圖API獲取某條公交線路的詳細信息，并保存到CSV文件中。通過構造API請求URL獲取公交線路數據，解析響應并提取線路名稱、路徑、票價、站點名稱和坐標，將數據保存到CSV文件，若數據不可用則記錄日志，失敗時等待2秒后重試。

如果爬取失敗的話，檢查一下key是否達到了限額，一天只能爬取5000次，爬取公交線路比較耗費次數。

爬取完成后來看一下保存的csv，總共5列。

A列：公交線路關鍵字（名稱）
B列：公交線路polyline，也就是線路途徑的每個點（注意不是站點，公交線路的每個拐點都會被記錄）
C列：總價
D列：途徑站點關鍵字（名稱）
E列：途徑站點經緯度坐標

4. 主程序調用

if __name__ == '__main__':city = '益陽'city_phonetic = 'yiyang'ak = '###'  # 這里建議更改為自己的keystart_time = time.time()print(f'==========正在獲取 {city} 線路名稱==========')line_names = get_bus_line_name(city_phonetic)print(f'{city}在公交網上顯示共有{len(line_names)}條線路')for line_name in tqdm(line_names):get_line_station_data(city, line_name, ak, city_phonetic)end_time = time.time()print(f'我爬完啦, 耗時{end_time - start_time}秒')

自己需要設置的是最開始的city、city_phonetic、ak.

程序調用了上面的函數并記錄了爬取的時間。

5. 線路、站點轉成shp

那么有了這個csv怎么可視化呢？

那么作者就寫了DataToShp這個?類用于將公交線路和站點數據從CSV文件轉換為Shapefile格式，主要功能包括：

get_station_data：將站點坐標和名稱從字符串格式轉換為列表格式，將數據從橫向展開為縱向，并去除重復項；

get_line_data：將線路的折線數據從字符串格式轉換為列表格式;

create_station_shp：創建站點Shapefile，包含站點名稱和坐標；

create_lines_shp：創建線路Shapefile，包含線路名稱和折線坐標；

其實作者其實在這里還引入了從高德的火星坐標系轉換為WGS_84的函數，但是大部分時候這種轉換并不可靠，所以建議高德爬取的數據就搭配高德地圖進行可視化使用。

# -*- coding: utf-8 -*-
# @Author: Bardbo
# @Date:   2020-11-09 21:09:12
# @Last Modified by:   Bardbo
# @Last Modified time: 2020-11-09 21:59:35
import pandas as pd
import numpy as np
import shapefile
# import converterclass DataToShp:def __init__(self, filename):self.data = pd.read_csv(filename,names=['line_name', 'polyline', 'price','station_names', 'station_coords'])def get_station_data(self):df_stations = self.data[['station_coords', 'station_names']]# 將原本的一行字符串變為列表df_stations['station_coords'] = df_stations['station_coords'].apply(lambda x: x.replace('[', '').replace(']', '').replace('\'', '').split(', '))df_stations['station_names'] = df_stations['station_names'].apply(lambda x: x.replace('[', '').replace(']', '').replace('\'', '').split(', '))# 橫置的數據變為縱向的數據station_all = pd.DataFrame(\np.column_stack((\np.hstack(df_stations['station_coords'].repeat(list(map(len, df_stations['station_coords'])))),np.hstack(df_stations['station_names'].repeat(list(map(len, df_stations['station_names'])))))),columns=['station_coords','station_names'])# 去除重復station_all = station_all.drop_duplicates()# # 坐標轉換# station_all['st_coords_wgs84'] = station_all['station_coords'].apply(#     self.stations_to_wgs84)station_all.reset_index(inplace=True)self.stations = station_alldef get_line_data(self):df_lines = self.data[['line_name', 'polyline']]df_lines['polyline'] = df_lines['polyline'].apply(lambda x: x.split(';'))# # 坐標轉換# df_lines['lines_wgs84'] = df_lines['polyline'].apply(#     self.lines_to_wgs84)df_lines.reset_index(inplace=True)self.lines = df_lines# def stations_to_wgs84(self, coor):#     xy = coor.split(',')#     lng, lat = float(xy[0]), float(xy[1])#     return converter.gcj02_to_wgs84(lng, lat)## def lines_to_wgs84(self, coor):#     ls = []#     for c in coor:#         xy = c.split(',')#         lng, lat = float(xy[0]), float(xy[1])#         ls.append(converter.gcj02_to_wgs84(lng, lat))#     return lsdef create_station_shp(self, city_phonetic):w = shapefile.Writer(f'./data/{city_phonetic}_stations.shp')w.field('name', 'C')# 確保所有坐標都是浮動類型for i in range(len(self.stations)):coords = self.stations.loc[i, 'station_coords'].split(',')  # 獲取坐標lat = float(coords[0])  # 強制轉換為浮動類型lon = float(coords[1])  # 強制轉換為浮動類型# 確保坐標是浮動類型w.point(lat, lon)  # 寫入點w.record(self.stations.loc[i, 'station_names'])  # 寫入記錄w.close()def create_lines_shp(self, city_phonetic):w = shapefile.Writer(f'./data/{city_phonetic}_lines.shp')w.field('name', 'C')for i in range(len(self.lines)):polyline = self.lines['polyline'][i]# 如果 polyline 是字符串，則使用 split()；如果是列表，則直接使用if isinstance(polyline, list):polyline = [list(map(float, point.split(','))) for point in polyline]# 確保 polyline 是列表類型，進行寫入w.line([polyline])w.record(self.lines['line_name'][i])w.close()if __name__ == '__main__':dts = DataToShp('yiyang_lines.csv')dts.get_station_data()dts.get_line_data()dts.create_station_shp()dts.create_lines_shp()print('shp文件創建完成')

如下就是可視化后的效果【爬的時候無意中發現學校這多了兩條公交，深入鄂州，win！】