eda可視化_5用于探索性數據分析(EDA)的高級可視化

eda可視化

Early morning, a lady comes to meet Sherlock Holmes and Watson. Even before the lady opens her mouth and starts telling the reason for her visit, Sherlock can tell a lot about a person by his sheer power of observation and deduction. Similarly, we can deduce a lot about the data and relationship among the features before complex modelling and feeding the data to algorithms.

清晨,一位女士來見福爾摩斯和沃森。 甚至在這位女士張開嘴并開始說出拜訪原因之前,Sherlock都能憑借其觀察力和推論的絕對能力來講述一個人的事。 同樣,在進行復雜建模并將數據輸入算法之前,我們可以推斷出很多數據以及要素之間的關系。

Objective

目的

In this article, I will discuss five advanced data visualisation options to perform an advanced EDA and become Sherlock Holmes of data science. The goal is to deduce most on the relationship among different data points with minimal coding and quickest built-in options available.

在本文中,我將討論五個高級數據可視化選項,以執行高級EDA并成為數據科學的Sherlock Holmes。 目的是通過最少的編碼和最快的內置選項來推斷不同數據點之間的關系。

Step 1: We will be using the seaborn package inbuilt datasets and advanced option to illustrate the advanced data visualisation.

步驟1:我們將使用seaborn軟件包內置的數據集和高級選項來說明高級數據可視化。

import seaborn as sns
import matplotlib.pyplot as plt

Step 2:Seaborn package comes with a few in-built datasets to quickly prototype a visualisation and evaluate its suitability for EDA with own data points. In the article, we will be using the seaborn “penguins” dataset. From the online seaborn repository, the dataset is loaded with load_dataset method. We can get the list of all the inbuilt Seaborn datasets with get_dataset_names() names method.

第2步: Seaborn軟件包附帶了一些內置數據集,可快速創建可視化原型并使用自己的數據點評估其對EDA的適用性。 在本文中,我們將使用原始的“企鵝”數據集。 從在線Seaborn存儲庫中,使用load_dataset方法加載數據集。 我們可以使用get_dataset_names()names方法獲取所有內置Seaborn數據集的列表。

In the below code, in FacetGrid method, dataset name i.e. “bird”, feature by which we want to organise the visualisation of the data i.e. “island”, and feature by which we want to group i.e. hue as “specifies” is mentioned as the parameter. Further, in the “map” method X and Y-axis of the scatter plot i.e. “flipper_length_mm”, and “body_mass_g” mentioned in the example below.

在下面的代碼中,在FacetGrid方法中,數據集名稱即“ bird”,我們要通過其組織數據可視化的功能(即“ island”)和我們要對其進行分組(例如“ specify”的色相)的功能如下:參數。 此外,在“映射”方法中,在以下示例中提到的散點圖的X軸和Y軸,即“ flipper_length_mm”和“ body_mass_g”。

bird= sns.load_dataset("penguins")
g = sns.FacetGrid(bird, col="island", hue="species")
g.map(plt.scatter, "flipper_length_mm", "body_mass_g", alpha=.6)
g.add_legend()
plt.show()

Data set visualisation based on the above code plots the data points organised by “island” and colour-coded by “species”.

基于以上代碼的數據集可視化將按“島嶼”組織并按“物種”按顏色編碼的數據點繪制成圖。

With one glance we can infer that “Gentoo” species are only present on the island “Biscoe”. Gentoo species is heavier and has a longer flipper length that other species. “Adelie” species is available on all three islands and “Chinstrap” is only available on the island “Dream”. You can see that with only 5 lines of code we can get so much information without any modelling.

乍一看,我們可以推斷出“ Gentoo”物種僅存在于“ Biscoe”島上。 Gentoo物種較重,并且鰭狀肢的長度比其他物種更長。 在所有三個島上都可以使用“阿德利”物種,而在“夢”島上則可以使用“ Chinstrap”物種。 您會看到,僅用5行代碼,我們無需任何建模就可以獲得大量信息。

I will encourage you to post a comment on other information we can deduce from the below visualisation.

我鼓勵您對我們可以從下面的圖表中得出的其他信息發表評論。

Image for post

Step 3: We want to quickly get an idea on the range of weights of the penguins by species and islands. Also, identify the concentration of the weight range.

步驟3:我們想快速了解按物種和島嶼劃分的企鵝體重范圍。 另外,確定體重范圍的濃度。

With a strip plot, we can plot the weight of the penguins organised by species for each the islands.

通過條形圖,我們可以繪制每個島嶼按物種組織的企鵝的體重。

sns.stripplot(x="island", y="body_mass_g", hue= "species", data=bird, palette="Set1")
plt.show()
Image for post

Strip plot is helpful to get an insight as long as plots are not overcrowded with densely populated data points. In the island “Dream” points are densely populated in the plot, and it is a bit difficult to get meaningful information from it.

只要圖塊不被人口稠密的數據點過度擁擠,帶狀圖就有助于獲得洞察力。 在島上,“夢”點在圖中密密麻麻地填充著,很難從中獲取有意義的信息。

Swarmplot can help to visualise the range of weight of the penguins by species in each of the islands non-overlapping points.

Swarmplot可以幫助按島上每個非重疊點的物種可視化企鵝的體重范圍。

In the below code, we mention similar information like above in strip plot.

在下面的代碼中,我們在帶狀圖中提到了類似的信息。


sns.swarmplot(x="island", y="body_mass_g", hue="species",data=bird, palette="Set1", dodge=True)
plt.show()

This improves the comprehension of the data points immensely in case of densely populated data. We can infer with a glance that “Adelie” weight ranges from approx 2500 to 4800 grams and a typical Gentoo species is heavier than Chinstrap species. I will leave you to perform other exploratory data analysis based on the below swarm plot.

在數據密集的情況下,這極大地改善了數據點的理解。 我們可以一目了然地推斷出“阿德利”的重量范圍約為2500至4800克,典型的Gentoo物種比Chinstrap物種重。 我將讓您根據以下群圖執行其他探索性數據分析。

Image for post

Step 4: Next, we would like to understand the relationship between body mass and culmen length of penguins in each of the island based on their sex.

步驟4:接下來,我們想了解根據島嶼的性別,企鵝的體重與高短長度之間的關系。

In the below code, in x and y parameter features between which we are interested in identifying, the relationship is mentioned. Hue is mentioned as “sex” as we want to learn the relation for male and female penguins separately.

在下面的代碼中,在我們希望識別的x和y參數特征中,提到了這種關系。 順化被稱為“性別”,因為我們想分別學習男女企鵝的關系。

sns.lmplot(x="body_mass_g", y="culmen_length_mm", hue="sex", col="island", markers=["o", "x"],palette="Set1",data=bird)
plt.show()

We can gather that body mass and culmen length relationship in Biscoe island penguins is similar for both male and female sex. On the contrary, in the island Dream, the relationship trend is quite the opposite for male and female penguins. Are the body mass and culmen length relationship of the penguins in the island Dream is linear or not linear?

我們可以發現,在Biscoe島上,企鵝的體重和長度與男性和女性的相似。 相反,在“夢島”中,男女企鵝的關系趨勢相反。 夢島中企鵝的體重和宮長長度關系是線性的還是線性的?

We can visualise the polynomial relationship by specifying the parameter order in the lmplot method.

我們可以通過在lmplot方法中指定參數順序來可視化多項式關系。

Image for post

In the case of heavily densely populated data points, we can further extend our exploratory data analysis by visualizing the body mass culmen relationship by male and female sex separately for each island.

在人口稠密的數據點的情況下,我們可以通過可視化每個島嶼分別以男性和女性性別劃分的身體標本關系來進一步擴展探索性數據分析。

Visualization is organised by the col and row parameter mentioned in the code.

可視化由代碼中提到的col和row參數組織。

sns.lmplot(x="body_mass_g", y="culmen_length_mm", hue="sex", col="island",row="sex",order=2, markers=["o", "x"],palette="Set1",data=bird)
plt.show()
Image for post

Step 5: It helps to visualise a scatter plot and histogram side by side to get a holistic view of the spread of data points and also the frequency of observation at the same time. Joint plots are an efficient way to picture it. In the code below, x & y parameter is the feature between which we are trying to identify the relationship. As the data points are densely populated, hence we will plot a hexbin instead of a scatter plot. “Kind” parameter indicates the type of plot. If you want to know more about hexbin plot, then please refer my article 5 Powerful Visualisation with Pandas for Data Preprocessing.

第5步:這有助于并排可視化散點圖和直方圖,從而獲得數據點分布的整體視圖以及同時觀察的頻率。 聯合圖是描繪它的有效方法。 在下面的代碼中,x&y參數是我們試圖識別關系的功能。 由于數據點密集,因此我們將繪制六邊形而不是散點圖。 “種類”參數指示圖的類型。 如果您想了解更多關于hexbin圖的信息,請參閱我的文章5:使用Pandas進行功能強大的可視化以進行數據預處理 。

sns.set_palette("gist_rainbow_r")
sns.jointplot(x="body_mass_g", y="culmen_length_mm", kind="hex",data=bird )
plt.show()
Image for post

Based on the histogram we can infer that most numbers of penguins are between 3500 and 4500 grams. Can you deduce the most frequent culmen length range of the penguins?

根據直方圖,我們可以推斷出大多數企鵝數量在3500至4500克之間。 您能推斷出企鵝最常見的莖長度范圍嗎?

We can also plot the individual data points (as shown in the right side plot) inside the hexbin plot with the code shown below.

我們還可以使用下面所示的代碼在hexbin圖內繪制各個數據點(如右側圖所示)。

g = sns.jointplot(x="body_mass_g", y="culmen_length_mm", data=bird, kind="hex", color="c")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.set_axis_labels("Body Mass (in gram)", "Culmen Length ( in mm)")
plt.show()

Step 6: At last, we would like to get an overview of the spread and relationship among different features by island. Paitplots are very handy to visualise the scatter plot among different feature. The feature “island” is mentioned as the hue as we want to colour code the plot based on it.

第6步:最后,我們希望按島嶼概述不同要素之間的傳播和關系。 Paitplots非常便于查看不同特征之間的散點圖。 由于我們要根據其對地塊進行顏色編碼,因此將特征“島”稱為色相。

sns.pairplot(bird, hue="island") 
plt.show()

We can see from the visualisation that in the island Biscoe most of the penguins have shallow culmen depth but are heavier than penguins in other islands. Similarly, we can conclude that flipper length is longer for most penguins in Biscoe island but have shallow culmen depth.

從可視化中我們可以看到,在比斯科島中,大多數企鵝的陰莖深度較淺,但比其他島嶼的企鵝重。 同樣,我們可以得出結論,在比斯科島上的大多數企鵝,鰭狀肢的長度更長,但洞室深度卻較淺。

Image for post

I hope you will use these advanced visualisations for the exploratory data analysis and get a sense of data points relationship before embarking any complex modelling exercise. I would love to hear your favourite visualization plots for EDA and also a list of conclusions we can draw from the examples illustrated in this article.

我希望您可以在進行任何復雜的建模練習之前,將這些高級可視化用于探索性數據分析,并了解數據點之間的關系。 我很想聽聽您最喜歡的EDA可視化圖,以及我們可以從本文所示示例中得出的結論列表。

In case, you would like to learn data visualisation using pandas then please read by trending article on 5 Powerful Visualisation with Pandas for Data Preprocessing.

如果您想使用熊貓學習數據可視化,那么請閱讀趨勢文章5關于數據預處理的熊貓的強大可視化 。

If you are interested in learning different Scikit-Learn scalers then please do read my article Feature Scaling — Effect Of Different Scikit-Learn Scalers: Deep Dive

如果您有興趣學習其他Scikit-Learn潔牙機,請閱讀我的文章Feature Scaling —不同Scikit-Learn潔牙機的效果:深入研究

"""Full Code"""import seaborn as sns
import matplotlib.pyplot as plt
bird= sns.load_dataset("penguins")
g = sns.FacetGrid(bird, col="island", hue="species")
g.map(plt.scatter, "flipper_length_mm", "body_mass_g", alpha=.6)
g.add_legend()
plt.show()
sns.stripplot(x="island", y="body_mass_g", hue= "species", data=bird, palette="Set1")
plt.show()
sns.swarmplot(x="island", y="body_mass_g", hue="species",data=bird, palette="Set1", dodge=True)
plt.show()
sns.lmplot(x="body_mass_g", y="culmen_length_mm", hue="sex", col="island", markers=["o", "x"],palette="Set1",data=bird)
plt.show()
sns.lmplot(x="body_mass_g", y="culmen_length_mm", hue="sex", col="island",row="sex",order=2, markers=["o", "x"],palette="Set1",data=bird)
plt.show()
sns.set_palette("gist_rainbow_r")
sns.jointplot(x="body_mass_g", y="culmen_length_mm", kind="hex",data=bird )
plt.show()
g = sns.jointplot(x="body_mass_g", y="culmen_length_mm", data=bird, kind="hex", color="c")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.set_axis_labels("Body Mass (in gram)", "Culmen Length ( in mm)")
plt.show()
sns.pairplot(bird, hue="island")
plt.show()

翻譯自: https://towardsdatascience.com/5-advanced-visualisation-for-exploratory-data-analysis-eda-c8eafeb0b8cb

eda可視化

本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。
如若轉載,請注明出處:http://www.pswp.cn/news/391546.shtml
繁體地址,請注明出處:http://hk.pswp.cn/news/391546.shtml
英文地址,請注明出處:http://en.pswp.cn/news/391546.shtml

如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!

相關文章

我的AWS開發人員考試未通過。 現在怎么辦?

I have just taken the AWS Certified Developer - Associate Exam on July 1st of 2019. The result? I failed.我剛剛在2019年7月1日參加了AWS認證開發人員-聯考。結果如何? 我失敗了。 The AWS Certified Developer - Associate (DVA-C01) has a scaled score …

關系數據可視化gephi

表示對象之間的關系,可通過gephi軟件實現,軟件下載官方地址https://gephi.org/users/download/ 如何來表示兩個對象之間的關系? 把對象變成點,點的大小、顏色可以是它的兩個參數,兩個點之間的關系可以用連線表示。連線…

Hyperledger Fabric 1.0 從零開始(十二)——fabric-sdk-java應用

Hyperledger Fabric 1.0 從零開始(十)——智能合約(參閱:Hyperledger Fabric Chaincode for Operators——實操智能合約) Hyperledger Fabric 1.0 從零開始(十一)——CouchDB(參閱&a…

css跑道_如何不超出跑道:計劃種子的簡單方法

css跑道There’s lots of startup advice floating around. I’m going to give you a very practical one that’s often missed — how to plan your early growth. The seed round is usually devoted to finding your product-market fit, meaning you start with no or li…

將json 填入表格_如何將Google表格用作JSON端點

將json 填入表格UPDATE: 5/13/2020 - New Share Dialog Box steps available below.更新:5/13/2020-下面提供了 新共享對話框步驟。 Thanks Erica H!謝謝埃里卡H! Are you building a prototype dynamic web application and need to collaborate with …

leetcode 173. 二叉搜索樹迭代器

實現一個二叉搜索樹迭代器類BSTIterator ,表示一個按中序遍歷二叉搜索樹(BST)的迭代器: BSTIterator(TreeNode root) 初始化 BSTIterator 類的一個對象。BST 的根節點 root 會作為構造函數的一部分給出。指針應初始化為一個不存在…

jyputer notebook 、jypyter、IPython basics

1 、修改jupyter默認工作目錄:打開cmd,在命令行下指定想要進的工作目錄,即鍵入“cd d/ G:\0工作面試\學習記錄”標紅部分是想要進入的工作目錄。 2、Tab補全 a、在命令行輸入表達式時,按下Tab鍵即可為任意變量(對象、…

cookie和session(1)

cookie和session 1.cookie產生 識別用戶 HTTP是無狀態協議,這就回出現這種現象:當你登錄一個頁面,然后轉到登錄網站的另一個頁面,服務器無法認識到。或者說兩次的訪問,服務器不能認識到是同一個客戶端的訪問&#xff0…

熊貓數據集_為數據科學拆箱熊貓

熊貓數據集If you are already familiar with NumPy, Pandas is just a package build on top of it. Pandas provide more flexibility than NumPy to work with data. While in NumPy we can only store values of single data type(dtype) Pandas has the flexibility to st…

2018年,你想從InfoQ獲取什么內容?丨Q言Q語

- Q 言 Q 語 第 三 期 - Q言Q語是 InfoQ 推出的最新板塊, 旨在給所有 InfoQer 一個展示觀點的平臺。 每期一個主題, 不扣帽子,不論對錯,不看輸贏, 只愿跟有趣的靈魂相遇。 本期話題: 2018年,你想…

特征阻抗輸入阻抗輸出阻抗_軟件阻抗說明

特征阻抗輸入阻抗輸出阻抗by Milan Mimica米蘭米米卡(Milan Mimica) 軟件阻抗說明 (Software impedance explained) 數據處理組件之間的阻抗不匹配 (The impedance mismatch between data processing components) It all starts with the simplest signal-processing diagram …

數學建模3

數學建模3 轉載于:https://www.cnblogs.com/Forever77/p/11423169.html

leetcode 190. 顛倒二進制位(位運算)

顛倒給定的 32 位無符號整數的二進制位。 提示: 請注意,在某些語言(如 Java)中,沒有無符號整數類型。在這種情況下,輸入和輸出都將被指定為有符號整數類型,并且不應影響您的實現,因…

JAVA基礎——時間Date類型轉換

在java中有六大時間類,分別是: 1、java.util包下的Date類, 2、java.sql包下的Date類, 3、java.text包下的DateFormat類,(抽象類) 4、java.text包下的SimpleDateFormat類, 5、java.ut…

LeetCode第五天

leetcode 第五天 2018年1月6日 22.(566) Reshape the Matrix JAVA class Solution {public int[][] matrixReshape(int[][] nums, int r, int c) {int[][] newNums new int[r][c];int size nums.length*nums[0].length;if(r*c ! size)return nums;for(int i0;i<size;i){ne…

matplotlib可視化_使用Matplotlib改善可視化設計的5個魔術技巧

matplotlib可視化It is impossible to know everything, no matter how much our experience has increased over the years, there are many things that remain hidden from us. This is normal, and maybe an exciting motivation to search and learn more. And I am sure …

adb 多點觸碰_無法觸及的神話

adb 多點觸碰On Twitter, in Slack, on Discord, in IRC, or wherever you hang out with other developers on the internet, you may have heard some formulation of the following statements:在Twitter上&#xff0c;在Slack中&#xff0c;在Discord中&#xff0c;在IRC…

robot:循環遍歷數據庫查詢結果是否滿足要求

使用list類型變量{}接收查詢結果&#xff0c;再for循環遍歷每行數據&#xff0c;取出需要比較的數值 轉載于:https://www.cnblogs.com/gcgc/p/11424114.html

leetcode 74. 搜索二維矩陣(二分)

編寫一個高效的算法來判斷 m x n 矩陣中&#xff0c;是否存在一個目標值。該矩陣具有如下特性&#xff1a; 每行中的整數從左到右按升序排列。 每行的第一個整數大于前一行的最后一個整數。 示例 1&#xff1a; 輸入&#xff1a;matrix [[1,3,5,7],[10,11,16,20],[23,30,34…

rm命令

命令 ‘rm’ &#xff08;remove&#xff09;&#xff1a;刪除一個目錄中的一個或多個文件或目錄&#xff0c;也可以將某個目錄及其下屬的所有文件及其子目錄均刪除掉 語法&#xff1a;rm&#xff08;選項&#xff09;&#xff08;參數&#xff09; 默認會提示‘是否’刪除&am…