Computing average car fuel economy in Python: analyzing car fuel economy data with Python

Original title: Analyzing car fuel economy data with Python

- Download the car fuel economy dataset from http://fueleconomy.gov/geg/epadata/vehicles.csv.zip and unzip it
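The download-and-unzip step can also be scripted. The following is only a minimal sketch using the standard library; the helper names `extract_zip_bytes` and `download_and_extract` are hypothetical, and the demonstration at the end uses a tiny made-up archive so that it runs without network access.

```python
import io
import os
import tempfile
import urllib.request
import zipfile

# URL taken from the article; the helper names below are hypothetical.
DATA_URL = "http://fueleconomy.gov/geg/epadata/vehicles.csv.zip"


def extract_zip_bytes(zip_bytes, dest_dir):
    """Unpack an in-memory zip archive into dest_dir and return the extracted paths."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(dest_dir)
        return [os.path.join(dest_dir, name) for name in archive.namelist()]


def download_and_extract(url=DATA_URL, dest_dir="."):
    """Download the archive at url and unpack it into dest_dir (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return extract_zip_bytes(response.read(), dest_dir)


# Offline demonstration with a tiny made-up archive built in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("vehicles.csv", "year,comb08\n1984,17\n")

demo_dir = tempfile.mkdtemp()
demo_paths = extract_zip_bytes(buffer.getvalue(), demo_dir)
print(demo_paths)
```

For the real dataset, calling `download_and_extract()` from the notebook's working directory should leave `vehicles.csv` next to the notebook, assuming the URL is still reachable.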

- Open Jupyter Notebook (IPython Notebook) and create a New Notebook

- Enter the following commands

import pandas as pd

import numpy as np

from ggplot import *

import matplotlib.pyplot as plt

%matplotlib inline

vehicles = pd.read_csv("vehicles.csv")

vehicles.head()

Press Shift+Enter to see the following result:

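Here head is a bound method of the pandas DataFrame class that shows the first few rows. Another very useful view of a data frame is its summary, which includes the number of non-null values in each column and a count of the columns of each data type.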
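As a small illustration of these two views, the sketch below calls `head()` and `info()` on a toy data frame. The toy values are made up and stand in for `vehicles.csv`; `info()` is the pandas method that reports non-null counts and dtypes.

```python
import io

import pandas as pd

# Toy stand-in for vehicles.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "make": ["Audi", "BMW", "Fiat", "Ford", "Honda", "Saab"],
    "year": [1984, 1984, 1985, 1985, 1986, 1986],
    "displ": [2.0, 2.5, None, 1.6, 1.5, 2.0],
})

first_rows = toy.head()   # first five rows by default
first_two = toy.head(2)   # or ask for an explicit number of rows

# info() writes its summary to stdout by default; here it goes to a buffer.
summary = io.StringIO()
toy.info(buf=summary)
print(summary.getvalue())
```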

Describing the fuel economy data

- Check how many observations (rows) and variables (columns) there are
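The article does not show the commands for this step. One possible way to count rows and columns, sketched here on a made-up toy frame rather than on `vehicles`, is:

```python
import pandas as pd

# Made-up data; on the real dataset the same calls would be applied to vehicles.
toy = pd.DataFrame({
    "year": [1984, 1985, 1986],
    "comb08": [17, 21, 19],
})

n_rows = len(toy)           # number of observations
n_cols = len(toy.columns)   # number of variables
print(n_rows, n_cols)
print(toy.shape)            # the same two numbers as a (rows, columns) tuple
```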

- Look at the year information

len(pd.unique(vehicles.year))

min(vehicles.year)

max(vehicles.year)

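- Look at the fuel types

pd.value_counts(vehicles.fuelType1)

- Look at the transmission types

pd.value_counts(vehicles.trany)

In the trany variable, automatic transmissions start with A and manual transmissions start with M, so we create a new variable trany2: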

vehicles['trany2'] = vehicles.trany.str[0]

pd.value_counts(vehicles.trany2)
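To see what the `.str[0]` step does, here is a small sketch on made-up transmission labels written in the style of the trany column. It uses the method form `Series.value_counts()`, which should give the same counts as `pd.value_counts`.

```python
import pandas as pd

# Made-up transmission labels in the style of the trany column.
toy = pd.DataFrame({
    "trany": ["Automatic 4-spd", "Manual 5-spd", "Automatic 3-spd",
              "Manual 5-spd", None],
})

# .str[0] takes the first character of each string; missing values stay missing.
toy["trany2"] = toy.trany.str[0]

# value_counts() leaves missing values out by default.
counts = toy.trany2.value_counts()
print(counts)
```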

The other features can be examined in the same way.

Analyzing the trend of fuel economy over time

- First, group by year

grouped = vehicles.groupby('year')

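- Then compute the mean of three of the columns

averaged = grouped[['comb08', 'highway08', 'city08']].agg([np.mean])

- To make the analysis easier, rename the columns, then create a 'year' column containing the data frame's index

averaged.columns = ['comb08_mean', 'highway08_mean', 'city08_mean']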

averaged['year'] = averaged.index
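The group, aggregate, and rename steps can be checked on a tiny example. This is a sketch with made-up numbers; it passes the string 'mean' to agg, which is assumed here to be equivalent to passing np.mean.

```python
import pandas as pd

# Made-up fuel economy values for two years.
toy = pd.DataFrame({
    "year": [1984, 1984, 1985, 1985],
    "comb08": [16, 18, 20, 22],
    "highway08": [20, 24, 26, 28],
    "city08": [14, 16, 17, 19],
})

grouped = toy.groupby("year")
averaged = grouped[["comb08", "highway08", "city08"]].agg(["mean"])

# agg with a list yields two-level column labels such as ('comb08', 'mean'),
# so they are replaced with flat names here.
averaged.columns = ["comb08_mean", "highway08_mean", "city08_mean"]

# Copy the index (the group keys) into an ordinary column for plotting.
averaged["year"] = averaged.index
print(averaged)
```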

- Use the ggplot package to draw the result as a scatter plot

print(ggplot(averaged, aes('year', 'comb08_mean')) + geom_point(color='steelblue') + xlab("Year") +
      ylab("Average MPG") + ggtitle("All cars"))

- Remove hybrid vehicles

criteria1 = vehicles.fuelType1.isin(['Regular Gasoline', 'Premium Gasoline', 'Midgrade Gasoline'])

criteria2 = vehicles.fuelType2.isnull()

criteria3 = vehicles.atvType != 'Hybrid'

vehicles_non_hybrid = vehicles[criteria1 & criteria2 & criteria3]
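The three criteria are Boolean Series that are combined element-wise with `&`. The sketch below applies the same pattern to four made-up rows; the variable names are illustrative only.

```python
import pandas as pd

# Four made-up vehicles; only the first should pass all three criteria.
toy = pd.DataFrame({
    "fuelType1": ["Regular Gasoline", "Diesel", "Premium Gasoline", "Regular Gasoline"],
    "fuelType2": [None, None, "E85", None],
    "atvType": [None, "Diesel", "FFV", "Hybrid"],
})

is_gasoline = toy.fuelType1.isin(["Regular Gasoline", "Premium Gasoline", "Midgrade Gasoline"])
no_second_fuel = toy.fuelType2.isnull()
not_hybrid = toy.atvType != "Hybrid"

# & keeps a row only when all three conditions hold for it.
kept = toy[is_gasoline & no_second_fuel & not_hybrid]
print(kept)
```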

- Group the resulting data frame by year and compute the average fuel economy

grouped = vehicles_non_hybrid.groupby(['year'])

averaged = grouped['comb08'].agg([np.mean])

averaged['year'] = averaged.index

- Check whether cars with large engines are becoming less common

pd.unique(vehicles_non_hybrid.displ)

- Remove the NaN values and use the astype method to make sure every value is a float

criteria = vehicles_non_hybrid.displ.notnull()

vehicles_non_hybrid = vehicles_non_hybrid[criteria]

vehicles_non_hybrid.loc[:,'displ'] = vehicles_non_hybrid.displ.astype('float')

criteria = vehicles_non_hybrid.comb08.notnull()

vehicles_non_hybrid = vehicles_non_hybrid[criteria]

vehicles_non_hybrid.loc[:,'comb08'] = vehicles_non_hybrid.comb08.astype('float')

- Finally, plot with the ggplot package

print(ggplot(vehicles_non_hybrid, aes('displ', 'comb08')) + geom_point(color='steelblue') +
      xlab('Engine Displacement') + ylab('Average MPG') + ggtitle('Gasoline cars'))

- Check whether, on average, engines have been getting smaller over the years

grouped_by_year = vehicles_non_hybrid.groupby(['year'])

avg_grouped_by_year = grouped_by_year[['displ', 'comb08']].agg([np.mean])

- Having computed the means of displ and comb08, reshape the data frame

avg_grouped_by_year['year'] = avg_grouped_by_year.index

melted_avg_grouped_by_year = pd.melt(avg_grouped_by_year, id_vars='year')
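pd.melt turns a wide table (one column per measure) into a long one (one row per year and measure), which is the shape the faceted plot needs. The sketch below uses made-up numbers and flat column names, so the label column is called `variable`. With the two-level column labels produced by `agg([np.mean])`, pandas names the label columns `variable_0` and `variable_1` instead, which is presumably why the article refers to `variable_0`.

```python
import pandas as pd

# Made-up yearly averages in wide form.
wide = pd.DataFrame({
    "year": [1984, 1985],
    "displ_mean": [3.0, 2.9],
    "comb08_mean": [17.0, 18.0],
})

# Every column other than the id column becomes a (variable, value) pair.
long_form = pd.melt(wide, id_vars="year")
print(long_form)
```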

- Create a faceted plot

p = ggplot(aes(x='year', y='value', color = 'variable_0'), data=melted_avg_grouped_by_year)

p + geom_point() + facet_grid("variable_0",scales="free") # scales: 'fixed' uses the same axis scale in every panel, 'free' lets each panel use its own scale

========================================== Update ==========================================

Investigating car makes and models

The following steps take the data exploration further.

- First, see which values the cylinders variable can take

pd.unique(vehicles_non_hybrid.cylinders)

- Next, convert the cylinders variable to float so that subsets of the data frame are easy to select

vehicles_non_hybrid.cylinders = vehicles_non_hybrid.cylinders.astype('float')

pd.unique(vehicles_non_hybrid.cylinders)

- Now we can look at the number of makes that offered four-cylinder cars in each year

vehicles_non_hybrid_4 = vehicles_non_hybrid[(vehicles_non_hybrid.cylinders==4.0)]

import matplotlib.pyplot as plt

%matplotlib inline

grouped_by_year_4_cylinder = vehicles_non_hybrid_4.groupby(['year']).make.nunique()

fig = grouped_by_year_4_cylinder.plot()

fig.set_xlabel('Year')

fig.set_ylabel('Number of 4-Cylinder Maker')

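Then print(fig) is used to display the figure; see the plot below:

Analysis:

The plot shows that the number of makes offering four-cylinder cars has declined since 1980. Note, however, that this plot can be misleading, because we do not know whether the total number of makes also changed over the same period. To find out, we continue with the following steps.

- Look at the list of makes with four-cylinder cars in each year, and find the makes that appear in every year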

from functools import reduce

grouped_by_year_4_cylinder = vehicles_non_hybrid_4.groupby(['year'])

unique_makes = []

for name, group in grouped_by_year_4_cylinder:
    unique_makes.append(set(pd.unique(group['make'])))

unique_makes = reduce(set.intersection, unique_makes)

print(unique_makes)

We find that only 12 manufacturers built four-cylinder cars in every year of this period.
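The key step is folding set.intersection over the per-year sets, which leaves only the makes present in every year. A minimal sketch with hypothetical makes (not taken from the dataset):

```python
from functools import reduce

# Hypothetical makes per year, for illustration only.
makes_by_year = {
    1984: {"Ford", "Honda", "Toyota", "Saab"},
    1985: {"Ford", "Honda", "Toyota"},
    1986: {"Honda", "Toyota", "Kia"},
}

# reduce applies set.intersection pairwise: (1984 & 1985) & 1986.
common_makes = reduce(set.intersection, makes_by_year.values())
print(sorted(common_makes))
```

A make that is missing from even one year, such as Ford in 1986 here, drops out of the result.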

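Next, we look at how the models from these manufacturers have performed on fuel economy over time. This takes a slightly more involved approach. First, we create an empty list that will end up holding Booleans. We use the iterrows generator to iterate over the rows of the data frame, which yields each row together with its index. We then test whether each row's make is in the unique_makes set computed earlier, and append that Boolean to the end of the boolean_mask list.

- Finally, select the rows whose make is in the unique_makes set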

boolean_mask = []

for index, row in vehicles_non_hybrid_4.iterrows():
    make = row['make']
    boolean_mask.append(make in unique_makes)

df_common_makes = vehicles_non_hybrid_4[boolean_mask]
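The same selection can probably be written without an explicit loop. The sketch below is an alternative to the iterrows approach, not the article's method: it compares the row-by-row mask with `Series.isin` on a made-up frame.

```python
import pandas as pd

# Made-up rows and a hypothetical set of makes present in every year.
toy = pd.DataFrame({
    "make": ["Honda", "Saab", "Toyota", "Kia"],
    "comb08": [30, 22, 28, 27],
})
unique_makes = {"Honda", "Toyota"}

# Row-by-row mask, following the iterrows approach.
boolean_mask = [row["make"] in unique_makes for _, row in toy.iterrows()]

# Vectorized alternative: isin tests every value of the column at once.
vectorized_mask = toy["make"].isin(unique_makes)

df_common = toy[vectorized_mask]
print(df_common)
```

On a large data frame the vectorized form is usually faster, since iterrows builds a Series object for every row.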

- First group the data frame by year and make, then compute the mean of each group

df_common_makes_grouped = df_common_makes.groupby(['year', 'make']).agg(np.mean).reset_index()

- Finally, use ggplot's faceting to display the results

ggplot(aes(x='year', y='comb08'), data=df_common_makes_grouped) + geom_line() + facet_wrap('make')

See the figure below for the result:
