Predictive Analysis of the Titanic Dataset
Data is simply useless until you know what it’s trying to tell you.
With this quote, we continue our quest to uncover the hidden secrets of the Titanic. ‘The Unsinkable’, as its designers and makers called it, proved that even the best of human engineering may sometimes fail when nature puts it to the test.
In the last article, we saw the different attributes of the data and took a quick glance at what the data looked like. If you haven’t read Part 1 of this blog, I recommend reading it before continuing. In this article we’ll look at the relationship of each attribute to the survival of the passengers, and continue our quest to find out whether you would have survived the sinking of the Titanic.
1. Correlation of Passenger Class with Survival
There are 3 passenger classes on the ship. Let’s find the number of passengers in each class.
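The counting step can be sketched as follows. This is a minimal sketch, not the author's original cell: the column name `Pclass` appears later in the article, but the tiny DataFrame is made-up toy data standing in for the real training file.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT real Titanic records.
# "Pclass" is assumed to hold the ticket class (1 = upper, 2 = middle, 3 = lower).
df = pd.DataFrame({"Pclass": [1, 3, 3, 2, 3, 1, 3, 2]})

# Number of passengers in each class, ordered by class rather than by frequency.
class_counts = df["Pclass"].value_counts().sort_index()
print(class_counts)
```

On the real data the same call would be applied to the loaded training DataFrame instead of the toy one.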
Now, let’s find the total number of survivors from each class.
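One way to get both the survivor counts and the survival percentages is a group-by on the class column. This is a sketch under assumptions: `Survived` is assumed to be a 0/1 column, and the rows are toy data, so the printed percentages are not the article's figures.

```python
import pandas as pd

# Toy rows for illustration only; NOT real Titanic records.
# "Survived" is assumed to be 1 for survivors and 0 otherwise.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Summing a 0/1 column counts the survivors in each class.
survivors = df.groupby("Pclass")["Survived"].sum()

# The mean of a 0/1 column is the survival rate; scale it to a percentage.
survival_pct = (df.groupby("Pclass")["Survived"].mean() * 100).round(2)

print(survivors)
print(survival_pct)
```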
The numbers show that Upper Class passengers had the highest survival rate of the three classes, at around 62.96%.
The survival percentage of Middle Class passengers is around 47.28%, which is better than that of the Lower Class but worse than that of the Upper Class.
The Lower Class was hit the hardest, with a survival percentage of just 24.23%, significantly lower than the other two classes.
The results indicate that survival in the Titanic sinking was strongly associated with the class a passenger belonged to, which suggests discrimination based on class.
2. Correlation of Gender with Survival
Let’s start by printing the number of passengers of each gender.
Now, let’s find the survival percentage of the passengers of each gender.
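The passenger count and survival percentage per gender can be computed in one pass with named aggregation. This is a sketch only: the column names `Sex` and `Survived` are assumptions, and the rows are toy data.

```python
import pandas as pd

# Toy rows for illustration only; NOT real Titanic records.
df = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "male", "female", "male", "male"],
    "Survived": [0, 1, 1, 0, 1, 0, 0, 0],
})

# One row per gender: how many passengers, and how many of them survived.
by_sex = df.groupby("Sex")["Survived"].agg(passengers="count", survivors="sum")

# Survival percentage derived from the two counts.
by_sex["survival_pct"] = (by_sex["survivors"] / by_sex["passengers"] * 100).round(2)
print(by_sex)
```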
The information suggests that women were given the highest priority when lives were being saved. About 74.2% of the women survived, compared with only 18.89% of the men. (How pure these gentlemen were! 😢)
3. Correlation of Age with Survival
Now, let’s look at the effect of age on survival. But first, let’s take a quick glance at some statistics of the age column, along with the values that are missing in the data set.
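A quick way to get both the summary statistics and the number of missing ages is sketched below. The column name `Age` is an assumption, and the values are toy data with two deliberately missing entries.

```python
import numpy as np
import pandas as pd

# Toy ages for illustration only; NOT real Titanic records.
df = pd.DataFrame({"Age": [20.0, 40.0, np.nan, 35.0, np.nan, 50.0, 5.0, 30.0]})

# count, mean, std, min, quartiles and max; missing values are skipped.
age_stats = df["Age"].describe()

# Number of passengers whose age was not recorded.
missing_ages = df["Age"].isna().sum()

print(age_stats)
print("Missing ages:", missing_ages)
```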
There are a total of 177 missing values, i.e. the ages of 177 passengers are missing from the data set. These missing values may cause problems during prediction and hence need to be addressed.
Now, let’s visualize the data by plotting some histograms.
Setting kde=True overlays a kernel density estimate on the histogram, and rug adds small tick marks at the exact points where the data were recorded.
Now, let’s check the survival in each group by plotting a graph with a KDE. The y-axis denotes the probability density given by the kernel density estimate, so the area under the KDE curve over an interval of the x-axis gives the probability of an age falling in that interval.
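The "area equals probability" reading can be checked numerically. This is a hand-rolled Gaussian KDE written for illustration, not seaborn's implementation; the ages are toy values and the bandwidth of 8 years is an arbitrary assumption (seaborn picks its bandwidth automatically).

```python
import numpy as np

# Toy ages for illustration only; NOT real Titanic records.
ages = np.array([3, 5, 15, 20, 22, 25, 30, 35, 38, 40, 50, 60], dtype=float)

def gaussian_kde(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on the grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Grid wide enough that the tails of the density are negligible at both ends.
grid = np.linspace(-60, 120, 3601)
dx = grid[1] - grid[0]
density = gaussian_kde(ages, grid, bandwidth=8.0)

# Area under the whole curve (Riemann sum): should be very close to 1.
total_area = (density * dx).sum()

# Area over the interval [20, 30): estimated probability of an age in that range.
in_twenties = (grid >= 20) & (grid < 30)
prob_20_to_30 = (density[in_twenties] * dx).sum()

print(round(total_area, 4), round(prob_20_to_30, 4))
```

The height of the curve at a single age is a density, not a probability; only areas over intervals are probabilities.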
Next, we plot the distribution of gender in each age group.
Now, let’s compare survival in each of these groups using a KDE plot.
What these histograms represent can also be read off numerically from the data.
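One possible numerical view is to bin the ages and tabulate survival per age band and gender. This is a sketch under assumptions: the band edges are illustrative choices rather than the author's, the column names are assumed, and the rows are toy data.

```python
import pandas as pd

# Toy rows for illustration only; NOT real Titanic records.
df = pd.DataFrame({
    "Age":      [4, 9, 15, 22, 25, 31, 38, 45, 52, 63],
    "Sex":      ["male", "female", "female", "male", "female",
                 "male", "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

# Illustrative age bands; each interval includes its right edge, e.g. (12, 18].
bins = [0, 12, 18, 40, 80]
labels = ["0-12", "13-18", "19-40", "41-80"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)

# Passenger count and survival rate for each (age band, gender) pair present.
summary = (
    df.groupby(["AgeGroup", "Sex"], observed=True)["Survived"]
      .agg(passengers="count", survival_rate="mean")
)
print(summary)
```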
4. Correlation of No. of Siblings/Spouses of the Passenger with Survival
Let’s start by understanding the distribution of values of this attribute.
Now, let’s plot a histogram describing the survival of passengers by their number of siblings/spouses.
The inference from the histogram can also be derived numerically from the data.
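A cross-tabulation gives the counts behind such a histogram. This is a sketch, not the author's code: `SibSp` is assumed to be the siblings/spouses column, and the rows are toy data.

```python
import pandas as pd

# Toy rows for illustration only; NOT real Titanic records.
df = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 1, 2, 3, 0],
    "Survived": [0, 1, 0, 0, 1, 1, 0, 0, 0, 1],
})

# Rows: number of siblings/spouses; columns: how many died and how many survived.
sibsp_table = pd.crosstab(df["SibSp"], df["Survived"]).rename(
    columns={0: "died", 1: "survived"}
)

# Survival percentage within each SibSp value.
total = sibsp_table["died"] + sibsp_table["survived"]
sibsp_table["survival_pct"] = (sibsp_table["survived"] / total * 100).round(2)
print(sibsp_table)
```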
5. Correlation of No. of Parents/Children with Survival
We again start with the distribution of the number of parents/children.
Two different plots can be used to show the survival of passengers by their number of parents/children: the first using ‘distplot’ and the second using ‘countplot’.
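The `countplot` variant might look like the sketch below; the `distplot` variant is not reproduced here. `Parch` is assumed to be the parents/children column, the `Outcome` helper column is introduced only for a readable legend, and the rows are toy data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy rows for illustration only; NOT real Titanic records.
df = pd.DataFrame({
    "Parch":    [0, 0, 0, 1, 1, 2, 0, 2, 1, 0],
    "Survived": [0, 1, 0, 1, 1, 0, 0, 1, 0, 0],
})

# Hypothetical helper column: text labels instead of 0/1 for the legend.
df["Outcome"] = df["Survived"].map({0: "died", 1: "survived"})

fig, ax = plt.subplots(figsize=(6, 4))

# One group of bars per Parch value, one bar per outcome within each group.
sns.countplot(data=df, x="Parch", hue="Outcome", ax=ax)
ax.set_ylabel("Number of passengers")
```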
6. Correlation of Fare with Survival
Now, let’s try to understand whether there was any regularity in the fare and whether it has any relation to survival. We start with the distribution of the fare.
Let’s plot the distribution of the fare, split by survival.
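Alongside the plot, the fare distribution can be summarized numerically, overall and per survival outcome. This is a sketch with assumed column names `Fare` and `Survived` and made-up fares.

```python
import pandas as pd

# Toy rows for illustration only; NOT real Titanic records.
df = pd.DataFrame({
    "Fare":     [7.0, 70.0, 8.0, 50.0, 8.0, 9.0, 52.0, 20.0],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0],
})

# Overall distribution of the fare.
fare_overall = df["Fare"].describe()

# The same column split by outcome; the median is less sensitive to a few
# very expensive tickets than the mean.
fare_by_survival = df.groupby("Survived")["Fare"].agg(["count", "mean", "median"])

print(fare_overall)
print(fare_by_survival)
```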
Let’s check whether the passengers were charged uniformly or not. If not, let’s try to understand which factors decided the fare of the tickets.
To check whether ‘Gender’ was a factor in deciding the fare, we plot the fare by gender for each port of embarkation and then draw the inference from it.
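The comparison behind that inference can be sketched as a pivot table of mean fare by port and gender. The column name `Embarked` and its codes (`S`, `C`, `Q` for Southampton, Cherbourg, Queenstown) are assumptions, and the fares are toy values, so the printed numbers do not reproduce the article's result.

```python
import pandas as pd

# Toy rows for illustration only; NOT real Titanic records.
df = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "C", "Q", "Q"],
    "Sex":      ["male", "female", "male", "female", "male",
                 "female", "female", "male", "female"],
    "Fare":     [10.0, 30.0, 20.0, 50.0, 40.0, 80.0, 100.0, 8.0, 8.0],
})

# Mean fare for every (port, gender) combination.
mean_fare = df.pivot_table(values="Fare", index="Embarked", columns="Sex", aggfunc="mean")

# Positive values mean women paid more on average at that port.
mean_fare["female_minus_male"] = mean_fare["female"] - mean_fare["male"]
print(mean_fare)
```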
Thus, as per the data, the mean fare charged for women was significantly higher than that for men at Cherbourg and Southampton.
To check whether ‘Embarkation’, ‘Class’ and ‘Age’ were factors in deciding the fare, we plot the fare for each port of embarkation and class, split by ‘Survival’ status, and then draw the inference from it.
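A numerical counterpart to that plot is sketched below: mean fare per port and class, plus a simple correlation between age and fare. Column names are assumed, and the toy rows were constructed so that fare depends on class and port but not on age; they say nothing about the real data.

```python
import pandas as pd

# Toy rows for illustration only; NOT real Titanic records.
df = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "C", "C"],
    "Pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "Age":      [30.0, 50.0, 30.0, 50.0, 50.0, 30.0, 50.0, 30.0],
    "Fare":     [50.0, 54.0, 8.0, 9.0, 80.0, 90.0, 12.0, 14.0],
})

# Mean fare for each port (rows) and class (columns).
fare_by_port_class = df.groupby(["Embarked", "Pclass"])["Fare"].mean().unstack("Pclass")

# Pearson correlation between age and fare; a value near 0 means no linear relation.
age_fare_corr = df["Age"].corr(df["Fare"])

print(fare_by_port_class)
print(round(age_fare_corr, 3))
```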
Thus, it is evident from the data that tickets were priced mostly on the basis of Pclass and the port of embarkation, but not on the basis of age.
7. Correlation of Embarkation with Survival
So far we have seen descriptions of the numerical attributes of the data. Here’s a look at the description of the categorical data.
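Describing only the non-numeric columns can be sketched as follows. Selecting columns with `select_dtypes` is one possible approach, not necessarily the author's; the column names are assumed and the rows are toy data.

```python
import pandas as pd

# Toy rows for illustration only; NOT real Titanic records.
df = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "Q"],
    "Fare":     [7.0, 70.0, 8.0, 50.0, 8.0],
})

# Keep the non-numeric columns only; for these, describe() reports
# count, number of unique values, most frequent value and its frequency.
categorical_summary = df.select_dtypes(exclude="number").describe()
print(categorical_summary)
```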
Next, we look at the ratio of survival of passengers from each port of embarkation.
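The ratios behind such a plot can be computed with a row-normalized cross-tabulation. This is a sketch with assumed column names and port codes, on toy data.

```python
import pandas as pd

# Toy rows for illustration only; NOT real Titanic records.
df = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 1],
})

# normalize="index" turns each row into fractions that sum to 1,
# i.e. the share of passengers from that port who died (0) or survived (1).
survival_ratio = pd.crosstab(df["Embarked"], df["Survived"], normalize="index")
print(survival_ratio)
```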
And finally, we draw a pair plot of the attributes that we have discussed so far.
As you might have noticed, we’ve ignored Passenger_Id, Name of the Passenger, Ticket and Cabin No., as they play little to no role in determining the survival of the passenger.
Thus, we tried to understand the data by visualizing it with various techniques, and uncovered various mysteries related to the Titanic. In the next article, we’ll look at the types of data and why some types of data need to be converted into a specific format before various machine learning models can be fitted to them. Thank you for joining this journey of exploration; I hope you got the experience of being a detective! 🕵
Link to the Notebook: Click Here
Link to Part 1 of this Blog: Click Here
Translated from: https://medium.com/@bapreetam/exploratory-data-analysis-a-case-study-on-titanic-data-set-part-2-96a9f3df963a