Pandas Profiling Report
Table of Contents
- Introduction
- Overview
- Variables
- Interactions
- Correlations
- Missing Values
- Sample
- Summary
Introduction
There are countless ways to perform exploratory data analysis (EDA) in Python (and in R). I do most of mine in the popular Jupyter Notebook. Once I realized there was a library that could summarize my dataset with just one line of code, I made sure to use it for every project, and the ease of this EDA tool has paid off every time. Every Data Scientist should perform EDA before running any Machine Learning model, and the developers of Pandas Profiling [2] have made it easy to view your dataset in a clean format that also describes its contents well.
The Pandas Profiling report is that EDA tool. It offers the following: an overview, variables, interactions, correlations, missing values, and a sample of your data. I will use randomly generated data as an example of this useful tool.
Overview

The overview tab in the report provides a quick glance at how many variables and observations you have, that is, the number of columns and rows. It also calculates how many cells are missing relative to the whole dataframe. Additionally, it points out duplicate rows and calculates their percentage. This tab is most similar to part of the describe function from Pandas, while providing a better user-interface (UI) experience.
The overview is broken into dataset statistics and variable types. You can also refer to the warnings and reproduction sections for more specific information on your data.
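The dataset statistics in the overview can be approximated with plain pandas. The sketch below is not how Pandas Profiling computes them internally; the helper name `overview_stats` and the small example frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def overview_stats(df: pd.DataFrame) -> dict:
    """Rough equivalents of the overview tab's dataset statistics."""
    n_rows, n_cols = df.shape
    n_cells = n_rows * n_cols
    missing_cells = int(df.isna().sum().sum())    # NaN cells across all columns
    duplicate_rows = int(df.duplicated().sum())   # rows that repeat an earlier row
    return {
        "variables": n_cols,
        "observations": n_rows,
        "missing_cells": missing_cells,
        "missing_cells_pct": 100 * missing_cells / n_cells if n_cells else 0.0,
        "duplicate_rows": duplicate_rows,
        "duplicate_rows_pct": 100 * duplicate_rows / n_rows if n_rows else 0.0,
    }


# Hypothetical example data: one missing cell and one repeated row.
example = pd.DataFrame({"A": [1, 2, 2, 4], "B": [10.0, 20.0, 20.0, np.nan]})
stats = overview_stats(example)
print(stats)
```

Percentages here are relative to all cells (for missing values) and all rows (for duplicates), which matches how the overview is described in this article.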
Next, I will discuss variables, which are also referred to as the columns or features of your dataframe.
Variables

For more granular descriptive statistics, the variables tab is the way to go. For each dataframe feature or variable, you can look at distinct and missing counts, as well as aggregations like mean, min, and max. You can also see the type of data you are working with (e.g., NUM). Not pictured is what appears when you click 'Toggle details'. This toggle reveals a whole set of additional statistics. The details include:
Statistics: quantile and descriptive
Quantile
Minimum
5th percentile
Q1
Median
Q3
95th percentile
Maximum
Range
Interquartile range (IQR)
Descriptive
Standard deviation
Coefficient of variation (CV)
Kurtosis
Mean
Median Absolute Deviation (MAD)
Skewness
Sum
Variance
Monotonicity
These statistics overlap with what the describe function returns, which I see most Data Scientists using today. However, there are a few more here, and the report presents them in an easy-to-view format.
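The quantile and descriptive statistics listed above can each be reproduced with a pandas call. The following is a minimal sketch, assuming a numeric column; the function name `describe_numeric` and the dictionary keys are made up for this example, and the report's own implementation may differ in details such as how kurtosis is normalized.

```python
import pandas as pd


def describe_numeric(s: pd.Series) -> dict:
    """Quantile and descriptive statistics for one numeric column."""
    p05, q1, median, q3, p95 = s.quantile([0.05, 0.25, 0.5, 0.75, 0.95])
    return {
        # quantile statistics
        "min": s.min(),
        "p05": p05,
        "q1": q1,
        "median": median,
        "q3": q3,
        "p95": p95,
        "max": s.max(),
        "range": s.max() - s.min(),
        "iqr": q3 - q1,
        # descriptive statistics
        "std": s.std(),
        "cv": s.std() / s.mean(),                # coefficient of variation
        "kurtosis": s.kurt(),                    # pandas reports excess kurtosis
        "mean": s.mean(),
        "mad": (s - s.median()).abs().median(),  # median absolute deviation
        "skewness": s.skew(),
        "sum": s.sum(),
        "variance": s.var(),
        "monotonic_increasing": s.is_monotonic_increasing,
        "monotonic_decreasing": s.is_monotonic_decreasing,
    }


result = describe_numeric(pd.Series([1, 2, 3, 4, 5]))
for name, value in result.items():
    print(f"{name}: {value}")
```

Note that `std` and `var` use the pandas default of `ddof=1` (sample statistics).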
Histograms
The histograms provide an easily digestible visual of your variables. You can expect to see the frequency of your variable on the y-axis and fixed-size bins (bins=15 is the default) on the x-axis.
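To see what fixed-size binning means in numbers, the counts behind such a histogram can be computed with NumPy. This is only a sketch: the value of 15 bins is taken from the statement above rather than checked against the library, and the random values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)       # fixed seed so the sketch is repeatable
values = rng.integers(0, 200, size=1000)  # hypothetical numeric column

# 15 equal-width bins between the smallest and largest value
counts, edges = np.histogram(values, bins=15)

# One line per bin: interval on the x-axis, frequency on the y-axis.
# The last bin also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:7.2f}, {right:7.2f})  {count}")
```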
Common Values
The common values section lists the most common values of your variable, with the count and frequency of each.
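A comparable table can be built with `value_counts`. The sketch below assumes that "frequency" means the share of rows holding each value; the example series and column names are hypothetical.

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a", "b"])  # hypothetical column

counts = s.value_counts()                      # sorted from most to least common
common = pd.DataFrame(
    {"count": counts, "frequency_pct": 100 * counts / len(s)}
)
print(common)
```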
Extreme Values
The extreme values section lists the values at the minimum and maximum ends of your variable, with the count and frequency of each.
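One way to approximate this view is to count each distinct value, sort by the value itself, and read off both ends. This is a sketch under the assumption that the tab lists distinct values; the example data and the choice of two rows per end are arbitrary.

```python
import pandas as pd

s = pd.Series([5, 1, 9, 1, 7, 9, 9, 3])   # hypothetical numeric column

counts = s.value_counts().sort_index()    # distinct values in ascending order
table = pd.DataFrame(
    {"count": counts, "frequency_pct": 100 * counts / len(s)}
)

smallest = table.head(2)                  # minimum end
largest = table.tail(2).iloc[::-1]        # maximum end, largest value first
print(smallest)
print(largest)
```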
Interactions

The interactions feature of the profiling report is unique in that you can choose which of your columns goes on the x-axis and which goes on the y-axis. For example, pictured above is variable A against variable A, which is why the points overlap. You can easily switch to other variables or columns to get a different plot and an excellent representation of your data points.
Correlations

Making fancier or more colorful correlation plots can be time-consuming if you build them line by line in Python. With this correlation plot, however, you can easily visualize the relationships between the variables in your data, and the plots are nicely color-coded. There are four main plots that you can display:
Pearson's r
Spearman's ρ
Kendall's τ
Phik (φk)
You may only be used to one of these correlation methods, so the others may sound confusing or unusable. Therefore, the correlation plot also comes with a details toggle that explains the meaning of each correlation you can visualize. This feature really helps when you need a refresher on correlation, as well as when you are deciding which plot(s) to use for your analysis.
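The numbers behind the first two plots can be reproduced with `DataFrame.corr`. This is a minimal sketch on random data shaped like the article's example; `method="kendall"` is also accepted by pandas, while Phik is not built into pandas and is left out here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable
df = pd.DataFrame(rng.integers(0, 200, size=(15, 6)), columns=list("ABCDEF"))

pearson = df.corr(method="pearson")    # linear relationship
spearman = df.corr(method="spearman")  # rank-based, monotonic relationship

print(pearson.round(2))
print(spearman.round(2))
```

Each result is a symmetric matrix with ones on the diagonal, which is what the report renders as a color-coded heatmap.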
Missing Values

As you can see from the plot above, the report tool also covers missing values. You can see how much of each variable is missing, shown as a count and as a matrix. It is a nice way to visualize your data before you fit any models with it. Ideally you want to see a plot like the one above, which means you have no missing values.
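The same information is available from pandas directly. The sketch below uses a small invented frame with some gaps, since the article's random data has none; the column names `missing_count` and `missing_pct` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing cells in columns A and B
df = pd.DataFrame(
    {
        "A": [1.0, np.nan, 3.0, 4.0],
        "B": [np.nan, np.nan, 1.0, 2.0],
        "C": [1, 2, 3, 4],
    }
)

missing = pd.DataFrame(
    {
        "missing_count": df.isna().sum(),     # per-column count
        "missing_pct": 100 * df.isna().mean() # per-column share of rows
    }
)
present = df.notna()  # True/False grid, the data behind a missing-value matrix

print(missing)
print(present)
```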
Sample

Sample acts similarly to the head and tail functions, which return your dataframe's first few rows or last few rows. In this example, you can see both the first rows and the last rows. I use this tab when I want a sense of where my data starts and where it ends. I recommend ranking or ordering the data first to get more benefit out of this tab, since you can then see the range of your data at a glance.
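The ordering tip can be tried out with plain pandas before generating a report. This is a small sketch on random data; sorting by column A and showing five rows are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable
df = pd.DataFrame(rng.integers(0, 200, size=(15, 6)), columns=list("ABCDEF"))

ordered = df.sort_values("A")        # order first, so both ends show the range of A
first_rows = ordered.head(5)
last_rows = ordered.tail(5)

print(first_rows)
print(last_rows)
```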
Summary

I hope this article provided you with some inspiration for your next exploratory data analysis. Being a Data Scientist can be overwhelming, and EDA is often forgotten or not practiced as much as model-building. With the Pandas Profiling report, you can perform EDA with minimal code and still get useful statistics and visualizations. That way, you can focus on the fun part of Data Science and Machine Learning: the modeling process.
To summarize, the main features of the Pandas Profiling report include an overview, variables, interactions, correlations, missing values, and a sample of your data.
Here is the code I used to install and import the libraries and to generate some dummy data for the example, followed by the one line of code that generates the Pandas Profiling report from your Pandas dataframe [10].
# install library
# !pip install pandas_profiling
import pandas_profiling
import pandas as pd
import numpy as np

# create data
df = pd.DataFrame(np.random.randint(0, 200, size=(15, 6)), columns=list('ABCDEF'))

# run your report!
df.profile_report()
# I did get an error and had to reinstall matplotlib to fix
Please feel free to comment below if you have any questions or have used this feature before. There is still some information I did not describe, but you can find more of it at the link I provided above.
Thank you for reading. I hope you enjoyed it!
Translated from: https://towardsdatascience.com/the-best-exploratory-data-analysis-with-pandas-profiling-e85b4d514583