Pandas Profiling Report

Table of Contents

Introduction
Overview
Variables
Interactions
Correlations
Missing Values
Sample
Summary

Introduction

There are countless ways to perform exploratory data analysis (EDA) in Python (and in R). I do most of mine in the popular Jupyter Notebook. Once I realized there was a library that could summarize my dataset with just one line of code, I made sure to use it on every project, and its ease of use has paid off repeatedly. EDA should come before any Machine Learning model is run, and the developers of Pandas Profiling [2] have made it easy to view your dataset in a clean format that also describes its contents well.

The Pandas Profiling report is an EDA tool that offers the following: overview, variables, interactions, correlations, missing values, and a sample of your data. I will use randomly generated data as an example of this tool.

Overview

Figure: Overview example. Screenshot by Author [3].

The overview tab in the report gives a quick look at how many variables and observations you have, that is, the number of columns and rows. It also calculates how many cells are missing relative to the whole dataframe, and it flags duplicate rows and reports their percentage. This tab is most similar to part of the describe function from Pandas, with a better user-interface (UI) experience.

The overview is broken into dataset statistics and variable types. You can also refer to the warnings and reproduction sections for more specific information on your data.
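The figures in the overview can be approximated with plain pandas. The following is a minimal sketch, not the report's own implementation; the dataframe `df` and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small hand-made dataframe with one missing cell and one duplicated row
df = pd.DataFrame({
    "A": [1, 2, 2, 4],
    "B": [10.0, 20.0, 20.0, np.nan],
})

n_obs, n_vars = df.shape                      # observations (rows), variables (columns)
missing_cells = int(df.isna().sum().sum())    # missing cells across the whole dataframe
missing_pct = missing_cells / df.size * 100   # share of all cells that are missing
duplicate_rows = int(df.duplicated().sum())   # rows identical to an earlier row
duplicate_pct = duplicate_rows / n_obs * 100  # share of rows that are duplicates

print(f"variables={n_vars}, observations={n_obs}")
print(f"missing cells={missing_cells} ({missing_pct:.1f}%)")
print(f"duplicate rows={duplicate_rows} ({duplicate_pct:.1f}%)")
print(df.dtypes.value_counts())               # rough stand-in for the variable types
```

The report does this in one pass and adds a UI on top; the sketch only shows where the numbers come from.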

Next, I will discuss variables, which are also referred to as the columns or features of your dataframe.

Variables

For more granular descriptive statistics, the variables tab is the way to go. For each feature of your dataframe you can see distinct and missing counts, along with aggregations such as mean, min, and max. You can also see the type of data you are working with (e.g., NUM). Not pictured is what appears when you click ‘Toggle details’: a much larger set of useful statistics. The details include:

Statistics: quantile and descriptive

Quantile statistics

Minimum
5th percentile
Q1
Median
Q3
95th percentile
Maximum
Range
Interquartile range (IQR)

Descriptive statistics

Standard deviation
Coefficient of variation (CV)
Kurtosis
Mean
Median Absolute Deviation (MAD)
Skewness
Sum
Variance
Monotonicity

These statistics overlap with the describe function that I see most Data Scientists using today; the report adds a few more and presents them in an easy-to-view format.
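As a rough sketch, the quantile and descriptive statistics listed above can be reproduced for a single numeric column with plain pandas. The series `s` is made up for illustration, and pandas defaults are assumed (for example, sample standard deviation and variance with ddof=1, and linear interpolation for quantiles); the report may compute some of these slightly differently.

```python
import pandas as pd

# Hypothetical numeric column used only for illustration
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="A")

quantile_stats = {
    "minimum": s.min(),
    "5th percentile": s.quantile(0.05),
    "Q1": s.quantile(0.25),
    "median": s.median(),
    "Q3": s.quantile(0.75),
    "95th percentile": s.quantile(0.95),
    "maximum": s.max(),
    "range": s.max() - s.min(),
    "IQR": s.quantile(0.75) - s.quantile(0.25),
}

descriptive_stats = {
    "standard deviation": s.std(),
    "coefficient of variation": s.std() / s.mean(),
    "kurtosis": s.kurt(),
    "mean": s.mean(),
    "MAD": (s - s.median()).abs().median(),  # median absolute deviation
    "skewness": s.skew(),
    "sum": s.sum(),
    "variance": s.var(),
    "monotonic increasing": s.is_monotonic_increasing,
}

for name, value in {**quantile_stats, **descriptive_stats}.items():
    print(f"{name}: {value}")
```

Calling `s.describe()` returns only a subset of these (count, mean, std, min, quartiles, max), which is the gap the report fills.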

Histograms

The histograms give an easily digestible visual of your variables. You can expect to see frequency on the y-axis and fixed-size bins (bins=15 is the default) on the x-axis.
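To make "fixed-size bins" concrete, here is a small sketch that computes histogram counts with NumPy rather than through the report. The input values are made up, and `bins=15` simply mirrors the default mentioned above.

```python
import numpy as np

# Hypothetical values for one variable: the integers 0..29
values = np.arange(30)

# Split the value range into 15 equal-width bins and count values per bin
counts, bin_edges = np.histogram(values, bins=15)
bin_widths = np.diff(bin_edges)

print("counts per bin:", counts)   # heights of the bars (y-axis)
print("bin edges:", bin_edges)     # boundaries of the bars (x-axis)
print("bin width:", bin_widths[0])
```

Fifteen bins always yield sixteen edges, and every bin has the same width regardless of how the values are distributed.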

Common Values

The common values section lists the values that occur most often in your variable, with their count and frequency.
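A comparable table can be built with `value_counts`. This is a sketch with a made-up series, and it assumes frequency is expressed as a percentage of all rows.

```python
import pandas as pd

# Hypothetical categorical column used only for illustration
s = pd.Series(["a", "b", "a", "c", "a", "b"], name="A")

counts = s.value_counts()  # sorted from most to least common
common = pd.DataFrame({
    "count": counts,
    "frequency (%)": counts / len(s) * 100,
})

print(common)
```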

Extreme Values

The extreme values section lists the values at the minimum and maximum ends of your variable, with their count and frequency.
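One way to get a similar view by hand is to count each distinct value, order the counts by value, and read off both ends. The series and the cutoff `n` below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical numeric column used only for illustration
s = pd.Series([5, 1, 9, 1, 7, 9, 3], name="A")

counts = s.value_counts().sort_index()  # value -> count, ordered by value
n = 2                                   # how many extremes to show on each side

minimum_values = counts.head(n)             # smallest values with their counts
maximum_values = counts.tail(n).iloc[::-1]  # largest values, largest first

print("minimum values:")
print(minimum_values)
print("maximum values:")
print(maximum_values)
```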

Interactions

The interactions feature of the profiling report is unique in that you can choose which of your columns goes on the x-axis and which on the y-axis. For example, the example screenshot plots variable A against variable A, which is why the points overlap. You can easily switch to other variables or columns to get a different plot and a clear picture of your data points.

Missing Values

The report also covers missing values. You can see how much of each variable is missing, shown as a count and as a matrix. It is a nice way to visualize your data before you build any models with it. Ideally, the plot shows every column fully populated, meaning you have no missing values.
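The numbers behind the count and matrix views can be sketched with `isna`. The dataframe here is made up and deliberately contains one missing cell so the output is not all zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one missing cell in column A
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

missing_count = df.isna().sum()   # missing cells per variable
present_count = df.notna().sum()  # populated cells per variable (the bar heights)
nullity_matrix = df.isna()        # True where a cell is missing, row by row
has_missing = bool(nullity_matrix.to_numpy().any())

print(missing_count)
print(nullity_matrix)
print("any missing values:", has_missing)
```

On the article's randomly generated integer data, `has_missing` would be `False`, which matches the fully populated plot described above.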

Sample

Sample acts similarly to the head and tail functions: it returns your dataframe's first few rows and last few rows. I use this tab when I want a sense of where my data starts and where it ends. I recommend sorting or ranking the data first to get more out of this tab, since you can then see the range of your data at a glance.
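The sorting suggestion can be sketched as follows: order the dataframe by a column of interest first, then look at the first and last rows. The dataframe and the choice of column `A` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical unsorted dataframe
df = pd.DataFrame({"A": [30, 10, 50, 20, 40], "B": list("vwxyz")})

ordered = df.sort_values("A")  # sort so that head and tail show the extremes of A
first_rows = ordered.head(2)   # smallest values of A
last_rows = ordered.tail(2)    # largest values of A

print(first_rows)
print(last_rows)
```

Passing the sorted dataframe to the report should make its sample tab show the same range, since that tab displays the first and last rows of whatever dataframe it is given.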

Summary

I hope this article provided you with some inspiration for your next exploratory data analysis. Being a Data Scientist can be overwhelming, and EDA is often forgotten or practiced less than model-building. With the Pandas Profiling report, you can perform EDA with minimal code and still get useful statistics and visualizations. That way, you can focus on the fun part of Data Science and Machine Learning: the modeling process.

To summarize, the main features of the Pandas Profiling report include overview, variables, interactions, correlations, missing values, and a sample of your data.

Here is the code I used to install and import the libraries, generate some dummy data for the example, and finally run the one line that generates the Pandas Profiling report from your Pandas dataframe [10].

# install library
# !pip install pandas_profiling
import pandas_profiling
import pandas as pd
import numpy as np

# create data
df = pd.DataFrame(np.random.randint(0, 200, size=(15, 6)), columns=list('ABCDEF'))

# run your report!
df.profile_report()
# I did get an error and had to reinstall matplotlib to fix

Please feel free to comment below if you have any questions or have used this feature before. There is still some information I did not cover, but you can find more at the link I provided above.

Thank you for reading, I hope you enjoyed it!

Translated from: https://towardsdatascience.com/the-best-exploratory-data-analysis-with-pandas-profiling-e85b4d514583
