Titanic Data Analysis, Part 1: Titanic - Basics of Data Analysis

My goal was to get a better understanding of how to work with tabular data, so I challenged myself and started with the Titanic project. I think this was an excellent way to learn the basics of data analysis with Python.

You can find the competition here: https://www.kaggle.com/c/titanic

I really recommend trying it yourself if you want to learn how to analyze data and build machine learning models.

I started by importing the packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Pandas is a great package for tabular data analysis. NumPy provides a high-performance multidimensional array object and tools for working with these arrays. Matplotlib helps you generate plots, histograms, power spectra, bar charts, and more with just a few lines of code. Seaborn is built on top of Matplotlib and can be used to create attractive and informative statistical graphics.

After importing these packages, I loaded the data:

df = pd.read_csv("train.csv")

Then I had a quick look at the data:


df.head()
# Shows the first 5 rows of the table.
# To show 10 rows instead of 5, use:
df.head(10)

Figure: Screenshot of the first rows

df.tail()
# Shows the last 5 rows of the table.

I recommend starting with a look at the data so that you can be sure everything is as it should be. This helps you avoid careless mistakes later in the analysis.

df.shape
# Gives the number of rows and columns as a (rows, columns) tuple.

It is a good habit to print the shape of the data at the beginning. You can then check the number of rows and columns again later and be sure you haven't lost any data during the analysis.
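
The shape check described above can be sketched as follows. This is only a minimal illustration: the small DataFrame is made up and stands in for train.csv, and the names `rows_before`, `cols_before`, and `df_clean` are hypothetical.

```python
import pandas as pd

# Tiny made-up table standing in for train.csv (not real Titanic data).
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
})

# df.shape is a (rows, columns) tuple; record it right after loading.
rows_before, cols_before = df.shape

# Example transformation: drop rows where Age is missing.
df_clean = df.dropna(subset=["Age"])

# Compare the shape again to see exactly what the step removed.
rows_after, cols_after = df_clean.shape
print(f"rows: {rows_before} -> {rows_after}, columns: {cols_before} -> {cols_after}")
```

Comparing the shape before and after each step makes it obvious when a transformation drops more rows than expected.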

Analyze the data

Then I continued exploring the data by counting how often each value occurs in a column. This told me a lot about the content of the data.

df['Pclass'].value_counts()
# Counts how many passengers were in each class.

Figure: The number of persons in each class. 3rd class was the most common.

I prefer showing values as percentages, because they are easier to interpret and compare than raw counts.

df['Pclass'].value_counts(normalize=True)
# Same as above, but normalize=True returns each count as a fraction of the total
# (multiply by 100 to read it as a percentage).

Figure: 55% of people were in 3rd class.
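
Because `normalize=True` returns fractions between 0 and 1, a small extra step is needed to display real percentages. The sketch below shows one way to do it; the Series is made-up data standing in for `df['Pclass']`, and `fractions` and `percentages` are hypothetical names.

```python
import pandas as pd

# Made-up class labels standing in for df['Pclass'] (not real Titanic data).
pclass = pd.Series([3, 3, 1, 2, 3, 1, 3, 2], name="Pclass")

# normalize=True returns fractions that sum to 1, not percentages.
fractions = pclass.value_counts(normalize=True)

# Multiply by 100 and round to display the values as percentages.
percentages = (fractions * 100).round(1)
print(percentages)
```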

I counted the values for each column separately. In the future, I want to challenge myself to write a function that prints the value counts for all columns, but that was outside the scope of this project.
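
A function like the one mentioned above might look like the following sketch. It is not part of the original project: the name `print_value_counts`, the `max_unique` threshold, and the demo data are all assumptions made for illustration.

```python
import pandas as pd


def print_value_counts(frame, max_unique=10):
    """Print normalized value counts for every column with few distinct values.

    Returns the list of column names that were summarized.
    """
    summarized = []
    for column in frame.columns:
        # Skip columns such as names or ticket numbers that are nearly unique.
        if frame[column].nunique(dropna=False) > max_unique:
            continue
        print(frame[column].value_counts(normalize=True, dropna=False))
        print()
        summarized.append(column)
    return summarized


# Made-up demo table (not real Titanic data).
demo = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Name": ["A", "B", "C", "D"],
})

# With max_unique=3, the Name column (4 distinct values) is skipped.
summarized_columns = print_value_counts(demo, max_unique=3)
```

The threshold keeps the output readable by skipping columns where almost every value is different.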

I also wanted to understand the values in the different columns, so I used the describe() method for that.

df['Fare'].describe()
# describe() shows basic statistics such as the count, mean, minimum, and maximum.

Figure: "Fare" column values

Here you can see, for example, that the minimum ticket price was $0.00 and the maximum was $512.33.

I made several cross-tabulations to understand which variables were the most decisive for survival.

pd.crosstab(df['Survived'], df['Sex'])
# Cross-tabulates survival (rows) against sex (columns) as counts.

Here I also recommend using percentages instead of raw counts:

pd.crosstab(df['Survived'], df['Sex'], normalize=True)
# With normalize=True, each cell is a fraction of all passengers.

Figure: Same as above, just in percentages.

Cross-tabulating different variables gives you information about possible relationships between them, for example between sex and survival. Because normalize=True divides every cell by the total number of passengers, the table shows that 26% of all passengers were women who survived, and that men who did not survive made up 52% of all passengers.
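
To read the table as a survival rate within each sex, the cells need to be normalized per column rather than over the whole table. The sketch below contrasts the two options that `pd.crosstab` offers; the data is made up for illustration, and `overall` and `by_sex` are hypothetical names.

```python
import pandas as pd

# Made-up demo data (not real Titanic data): 4 women, 4 men.
demo = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male"],
})

# normalize=True: each cell is a fraction of ALL passengers.
overall = pd.crosstab(demo["Survived"], demo["Sex"], normalize=True)

# normalize="columns": each column sums to 1, so the cells are
# survival rates within each sex.
by_sex = pd.crosstab(demo["Survived"], demo["Sex"], normalize="columns")

print(overall)
print(by_sex)
```

In this demo, surviving women are 37.5% of all passengers, while the survival rate among women is 75%. The two numbers answer different questions, so it is worth stating which normalization a table uses.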

Visualize the data

Numerical values in tables are useful, but visualized data is easier to understand, at least for me. This is why I plotted histograms and bar charts, which also taught me how to visualize data. Here are a few examples:

df.hist(column='Age')

Figure: In this histogram, you can see that passengers were mostly 20 to 40 years old.

I used the seaborn library for the bar charts.

sns.countplot(x='Sex', hue='Survived', data=df);

Figure: More females survived than males.

Also, I used a heatmap to see the correlation between different columns.


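# numeric_only=True skips text columns such as Name and Sex;
# recent pandas versions raise an error on them without it.
corrmat = df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, annot=True, square=True, annot_kws={'size': 15});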

The heatmap shows a fairly strong negative correlation between Fare and Pclass: when one increases, the other decreases. This is logical, because ticket prices in 1st class are higher than in 3rd class.

If we focus on the correlations between survival and the other columns, we see a positive correlation between surviving and fare, although it is moderate rather than strong (roughly 0.26 on the training set). Passengers who paid more for their tickets were more likely to survive.
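
To focus on survival without reading the whole heatmap, one option is to pull out just the `Survived` column of the correlation matrix and sort it. The sketch below uses made-up data, so its coefficients are not the Titanic ones; `survived_corr` is a hypothetical name, and `select_dtypes` is used here as one way to keep only the numeric columns.

```python
import pandas as pd

# Made-up demo data (not real Titanic data).
demo = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 1, 3, 2, 3],
    "Fare": [7.25, 71.28, 53.10, 8.05, 13.00, 7.90],
    "Sex": ["male", "female", "female", "male", "female", "male"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corrmat = demo.select_dtypes(include="number").corr()

# Correlation of every other numeric column with Survived, sorted ascending.
survived_corr = corrmat["Survived"].drop("Survived").sort_values()
print(survived_corr)
```

Keep in mind that a correlation coefficient only measures linear association and does not by itself show that a higher fare caused survival.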

You can find the project on GitHub. Please feel free to try it yourself and comment if there is something that needs clarifying!

Thank you to the highly trained monkey (Risto Hinno) for motivating and inspiring me!

Translated from: https://medium.com/swlh/part-1-titanic-basic-of-data-analysis-ab3025d29f6e
