Analyzing CitiBike Data (EDA)

Data Science

CitiBike is New York City’s famous bike rental company and the largest in the USA. CitiBike launched in May 2013 and has become an essential part of the transportation network. They make commute fun, efficient, and affordable — not to mention healthy and good for the environment.

I have got the data of CitiBike riders for June 2013 from Kaggle. I will walk you through a complete exploratory data analysis, answering questions such as:

  1. Where do CitiBikers ride?
  2. When do they ride?
  3. How far do they go?
  4. Which stations are most popular?
  5. What days of the week are most rides taken on?
  6. And many more

Key learning:

I have used many parameters to tweak the plotting functions of Matplotlib and Seaborn; this article is a good way to learn them in practice.

Note:

This article is best viewed on a larger screen such as a tablet or desktop. If you find anything difficult to understand, the link to my Kaggle notebook is at the end of this article; you can drop your queries in its comment section.

Let's get started

Importing necessary libraries and reading data.

#importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#setting plot style to seaborn
plt.style.use('seaborn')

#reading data
df = pd.read_csv('../input/citibike-system-data/201306-citibike-tripdata.csv')
df.head()
CitiBike dataset

Let’s get some more information on the data.

df.info()
#sum of missing values in each column
df.isna().sum()
We have a whopping 577,703 rows to crunch and 15 columns, plus quite a few missing values. Let's deal with the missing values first.

Handling missing values

Let's first see the percentage of missing values, which will help us decide whether to drop them or not.

#calculating the percentage of missing values:
#sum of missing values in the column, divided by the total number of rows, multiplied by 100
data_loss1 = round((df['end station id'].isna().sum()/df.shape[0])*100)
data_loss2 = round((df['birth year'].isna().sum()/df.shape[0])*100)

print(data_loss1, '% of data loss if NaN rows of end station id, \nend station name, end station latitude and end station longitude dropped.\n')
print(data_loss2, '% of data loss if NaN rows of birth year dropped.')
We cannot afford to drop the rows with missing 'birth year' values. Hence, we drop the entire 'birth year' column, and drop the rows with missing values in 'end station id', 'end station name', 'end station latitude', and 'end station longitude'. Fortunately, the missing values in these four columns occur on exactly the same rows, so dropping their NaN rows still results in only 3% data loss.

#dropping NaN values
rows_before_dropping = df.shape[0]

#drop the entire birth year column
df.drop('birth year', axis=1, inplace=True)

#now left with end station id, end station name, end station latitude and end station longitude;
#these four columns have missing values in exactly the same rows,
#so dropping NaN rows from all four columns will still result in only 3% data loss
df.dropna(axis=0, inplace=True)
rows_after_dropping = df.shape[0]

#total data loss
print('% of data lost: ', ((rows_before_dropping-rows_after_dropping)/rows_before_dropping)*100)

#checking for NaN
df.isna().sum()
Let's see what gender tells us about our data

#plotting total no. of males and females
splot = sns.countplot('gender', data=df)

#adding the value above each bar: annotation
for p in splot.patches:
    #the bar value is nothing but the height of the bar
    an = splot.annotate(format(p.get_height(), '.2f'),
                        (p.get_x() + p.get_width() / 2., p.get_height()),
                        ha='center',
                        va='center',
                        xytext=(0, 10),
                        textcoords='offset points')
    an.set_size(20)  #text size

splot.axes.set_title("Gender distribution", fontsize=30)
splot.axes.set_xlabel("Gender", fontsize=20)
splot.axes.set_ylabel("Count", fontsize=20)

#adding x tick labels
splot.axes.set_xticklabels(['Unknown', 'Male', 'Female'])
plt.show()
We can see more male riders than female riders in New York City, but due to the large number of riders with unknown gender, we cannot draw any concrete conclusion. Filling in the unknown gender values is possible, but we will not do it, since those riders chose not to disclose their gender.

Subscribers vs Customers

Subscribers are the users who bought the annual pass, and customers are the ones who bought either a 24-hour pass or a 3-day pass. Let's see which the riders choose most.

user_type_count = df['usertype'].value_counts()

plt.pie(user_type_count.values,
        labels=user_type_count.index,
        autopct='%1.2f%%',
        textprops={'fontsize': 15})
plt.title('Subscribers vs Customers', fontsize=20)
plt.show()
We can see there are more yearly subscribers than 1-3 day customers, but the difference is not large; the company should focus on converting customers into subscribers with offers or sales.

How many hours do riders typically use the bike?

We have a column called 'tripduration', which holds the duration of each trip in seconds. First, we will convert it to minutes, then create bins to group the trips into 0-30 min, 30-60 min, 60-120 min, and 120 min and above. Then let's plot a graph to see how long riders typically use the bike.

#converting trip duration from seconds to minutes
df['tripduration'] = df['tripduration']/60

#creating bins (0-30min, 30-60min, 60-120min, 120 and above)
max_limit = df['tripduration'].max()
df['tripduration_bins'] = pd.cut(df['tripduration'], [0, 30, 60, 120, max_limit])

sns.barplot(x='tripduration_bins', y='tripduration', data=df, estimator=np.size)
plt.title('Usual riding time', fontsize=30)
plt.xlabel('Trip duration group', fontsize=20)
plt.ylabel('Trip Duration', fontsize=20)
plt.show()
A large number of riders ride for less than half an hour per trip, and most rides last less than an hour.

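To back up that claim numerically, here is a minimal sketch on synthetic durations; it assumes `df['tripduration']` has already been converted to minutes as above, and uses a small hand-made Series in place of the real column:

```python
import pandas as pd

# synthetic trip durations in minutes (stand-in for df['tripduration'] / 60)
trips = pd.Series([5, 12, 25, 28, 45, 50, 70, 15, 8, 95])

under_30 = (trips < 30).mean() * 100   # percentage of trips shorter than 30 min
under_60 = (trips < 60).mean() * 100   # percentage of trips shorter than 60 min
print(f"{under_30:.0f}% under 30 min, {under_60:.0f}% under 60 min")
```

On the real column, the same two lines quantify exactly how dominant the short trips are.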
Same start and end location vs different start and end location

We see in the data that some trips start and end at the same location. Let's see how many.

#number of trips that started and ended at the same station
start_end_same = df[df['start station name'] == df['end station name']].shape[0]

#number of trips that started and ended at different stations
start_end_diff = df.shape[0] - start_end_same

plt.pie([start_end_same, start_end_diff],
        labels=['Same start and end location',
                'Different start and end location'],
        autopct='%1.2f%%',
        textprops={'fontsize': 15})
plt.title('Same start and end location vs Different start and end location', fontsize=20)
plt.show()
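Question 3, "How far do they go?", can be roughly answered from the start and end station coordinates. A hedged sketch using the haversine great-circle formula; the coordinate column names match this dataset, but the two sample coordinates are illustrative values, not real rows, and the result is straight-line distance, not the route actually ridden:

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two (lat, lon) points in kilometres
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# one sample trip with made-up Manhattan coordinates
sample = pd.DataFrame({
    'start station latitude': [40.7505], 'start station longitude': [-73.9934],
    'end station latitude':   [40.7484], 'end station longitude':  [-73.9857],
})
sample['distance_km'] = haversine_km(sample['start station latitude'],
                                     sample['start station longitude'],
                                     sample['end station latitude'],
                                     sample['end station longitude'])
print(sample['distance_km'].round(2).iloc[0])
```

Applied to the full dataframe, `df['distance_km'].describe()` would summarize how far riders go.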
Riding pattern over the month

This is the part where I spent a lot of time and effort. The graph below says a lot, and technically there is a lot of code. Before looking at the code, here is an overview of what we are doing: we plot a time series to see the trend in the number of rides taken per day, and the trend in the total time the bikes were in use per day. Let's look at the code first, then I will break it down for you.

#converting string to datetime object
df['starttime'] = pd.to_datetime(df['starttime'])

#since we are dealing with a single month, we group by day,
#using count aggregation to get the number of occurrences, i.e. total trips per day
start_time_count = df.set_index('starttime').groupby(pd.Grouper(freq='D')).count()

#we have data for only one day of July, in the last row; let's drop it
start_time_count.drop(start_time_count.tail(1).index, axis=0, inplace=True)

#again grouping by day, aggregating with sum to get the total trip duration per day,
#which will be used while plotting
trip_duration_count = df.set_index('starttime').groupby(pd.Grouper(freq='D')).sum()

#again dropping the last row for the same reason
trip_duration_count.drop(trip_duration_count.tail(1).index, axis=0, inplace=True)

#plotting total rides per day,
#using start station id to get the count
fig, ax = plt.subplots(figsize=(25,10))
ax.bar(start_time_count.index, 'start station id', data=start_time_count, label='Total riders')

#bbox_to_anchor positions the legend box
ax.legend(loc="lower left", bbox_to_anchor=(0.01, 0.89), fontsize='20')
ax.set_xlabel('Days of the month June 2013', fontsize=30)
ax.set_ylabel('Riders', fontsize=40)
ax.set_title('Bikers trend for the month June', fontsize=50)

#creating a twin x axis to plot a line chart in the same figure
ax2 = ax.twinx()

#plotting total trip duration of all users per day
ax2.plot('tripduration', data=trip_duration_count, color='y', label='Total trip duration', marker='o', linewidth=5, markersize=12)
ax2.set_ylabel('Time duration', fontsize=40)
ax2.legend(loc="upper left", bbox_to_anchor=(0.01, 0.9), fontsize='20')

ax.set_xticks(trip_duration_count.index)
ax.set_xticklabels([i for i in range(1,31)])

#tweaking x and y tick labels of axes 1
ax.tick_params(labelsize=30, labelcolor='#eb4034')
#tweaking x and y tick labels of axes 2
ax2.tick_params(labelsize=30, labelcolor='#eb4034')
plt.show()

You might have understood the basic idea by reading the comments, but let me explain the process step by step:

  1. The date-time is a string; we convert it into a DateTime object.
  2. Group the data by day of the month and count the number of occurrences to plot rides per day.
  3. We have only one row with information for the month of July. This is an outlier; drop it.
  4. Repeat steps 2 and 3, with the only difference that this time we sum the data instead of counting, to get the total trip duration per day.
  5. Plot both series on a single graph using the twin-axis method.

I have used a lot of tweaking methods in matplotlib; make sure to go through each of them. If you have any doubts, drop a comment on the Kaggle notebook, linked at the end of this article.

The number of riders increases considerably toward the end of the month. There are negligible riders on the first Sunday of the month. The amount of time bikers ride decreases toward the end of the month.

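Question 5, "What days of the week are most rides taken on?", follows the same grouping idea: once 'starttime' is a datetime (as converted above), the `dt.day_name()` accessor gives ride counts per weekday. A minimal sketch on synthetic timestamps rather than the real column:

```python
import pandas as pd

# synthetic start times (stand-in for df['starttime'] after pd.to_datetime)
starttime = pd.Series(pd.to_datetime([
    '2013-06-03 08:00', '2013-06-03 18:00',                      # Mondays
    '2013-06-04 09:00',                                          # Tuesday
    '2013-06-08 11:00', '2013-06-08 14:00', '2013-06-08 16:00',  # Saturdays
]))

# count rides per weekday; value_counts sorts busiest first
rides_per_weekday = starttime.dt.day_name().value_counts()
print(rides_per_weekday)
print('Busiest day:', rides_per_weekday.index[0])
```

On the full June 2013 data this same two-liner would answer question 5 directly.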
Top 10 start stations

This is pretty straightforward: we get the occurrences of each start station using value_counts(), slice the first 10 values, then plot them.

#top 10 start stations
top_start_station = df['start station name'].value_counts()[:10]

fig, ax = plt.subplots(figsize=(20,8))
ax.bar(x=top_start_station.index, height=top_start_station.values, width=0.5)

#adding the value above each bar: annotation
for p in ax.patches:
    an = ax.annotate(format(p.get_height(), '.2f'),
                     (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha='center',
                     va='center',
                     xytext=(0, 10),
                     textcoords='offset points')
    an.set_size(20)

ax.set_title("Top 10 start locations in NY", fontsize=30)
ax.set_xlabel("Station name", fontsize=20)

#rotating the x tick labels by 45 degrees
ax.set_xticklabels(top_start_station.index, rotation=45, ha="right")
ax.set_ylabel("Count", fontsize=20)

#tweaking x and y tick labels
ax.tick_params(labelsize=15)
plt.show()
Top 10 end stations

#top 10 end stations
top_end_station = df['end station name'].value_counts()[:10]

fig, ax = plt.subplots(figsize=(20,8))
ax.bar(x=top_end_station.index, height=top_end_station.values, color='#edde68', width=0.5)

#adding the value above each bar: annotation
for p in ax.patches:
    an = ax.annotate(format(p.get_height(), '.2f'),
                     (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha='center',
                     va='center',
                     xytext=(0, 10),
                     textcoords='offset points')
    an.set_size(20)

ax.set_title("Top 10 end locations in NY", fontsize=30)
ax.set_xlabel("Street name", fontsize=20)

#rotating the x tick labels by 45 degrees
ax.set_xticklabels(top_end_station.index, rotation=45, ha="right")
ax.set_ylabel("Count", fontsize=20)

#tweaking x and y tick labels
ax.tick_params(labelsize=15)
plt.show()
The Kaggle notebook where I worked this out is linked below. Feel free to drop queries in its comment section.

Translated from: https://medium.com/towards-artificial-intelligence/analyzing-citibike-data-eda-e657409f007a

