Pandas Profiling Report: The Best Exploratory Data Analysis with Pandas Profiling

Table of Contents

  1. Introduction
  2. Overview
  3. Variables
  4. Interactions
  5. Correlations
  6. Missing Values
  7. Sample
  8. Summary

Introduction

There are countless ways to perform exploratory data analysis (EDA) in Python (and in R). I do most of mine in the popular Jupyter Notebook. Once I realized there was a library that could summarize my dataset with just one line of code, I made sure to use it on every project, and its ease of use has paid off every time. EDA should come before any Machine Learning model is trained, and the developers of Pandas Profiling [2] have made it easy to view your dataset in a clear, well-presented format that also describes its contents well.

The Pandas Profiling report is an excellent EDA tool that provides the following: an overview, variables, interactions, correlations, missing values, and a sample of your data. I will use randomly generated data as an example of this useful tool.

Overview

Overview example. Screenshot by Author [3].

The overview tab in the report gives a quick look at how many variables and observations you have, that is, the number of columns and rows. It also calculates how many cells are missing as a share of all cells in the dataframe. Additionally, it points out duplicate rows and reports their percentage. This tab is most similar to part of the describe function from Pandas, with a better user interface (UI).

The overview is broken into dataset statistics and variable types. You can also refer to the warnings and reproduction sections for more specific information on your data.
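
As a rough illustration of what the dataset statistics summarize, the same numbers can be approximated by hand with plain pandas. This is only a sketch: the helper name overview and the tiny dataframe are made up for the example, and the report itself may compute or round these values differently.

```python
import numpy as np
import pandas as pd


def overview(df: pd.DataFrame) -> dict:
    """Hypothetical helper: hand-computed dataset statistics."""
    n_rows, n_cols = df.shape
    n_cells = n_rows * n_cols
    n_missing = int(df.isna().sum().sum())   # missing cells across all columns
    n_duplicates = int(df.duplicated().sum())  # rows identical to an earlier row
    return {
        "variables": n_cols,
        "observations": n_rows,
        "missing_cells": n_missing,
        "missing_cells_pct": n_missing / n_cells if n_cells else 0.0,
        "duplicate_rows": n_duplicates,
        "duplicate_rows_pct": n_duplicates / n_rows if n_rows else 0.0,
    }


# Small made-up dataframe: one missing cell and one duplicated row
df = pd.DataFrame({"A": [1, 2, 2, np.nan], "B": [10, 20, 20, 40]})
stats = overview(df)
print(stats)
```

Doing this by hand for every project is exactly the kind of repetitive work the overview tab saves.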

Next, I will discuss variables, which are also referred to as the columns or features of your dataframe.

Variables

Variables example. Screenshot by Author [4].

For more granular descriptive statistics, the variables tab is the way to go. For each feature of your dataframe, you can look at distinct and missing counts, along with aggregations such as the mean, minimum, and maximum. You can also see the type of data you are working with (e.g., NUM). Not pictured is what you get when you click ‘Toggle details’: a whole set of additional statistics. The details include:

Statistics (quantile and descriptive)

Quantile statistics:

  • Minimum
  • 5th percentile
  • Q1
  • Median
  • Q3
  • 95th percentile
  • Maximum
  • Range
  • Interquartile range (IQR)

Descriptive statistics:

  • Standard deviation
  • Coefficient of variation (CV)
  • Kurtosis
  • Mean
  • Median Absolute Deviation (MAD)
  • Skewness
  • Sum
  • Variance
  • Monotonicity

These statistics overlap with what the describe function returns, which I see most Data Scientists using today. However, the report includes several more and presents them in an easy-to-view format.
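
To make the two lists above concrete, here is a minimal sketch of how the same statistics could be computed for one column with plain pandas. The function names are invented for this example, and the definitions are assumptions: pandas uses the sample standard deviation and variance by default, and MAD is taken here as the median absolute deviation from the median, so the report's figures may not match exactly.

```python
import pandas as pd


def quantile_stats(s: pd.Series) -> dict:
    """Hypothetical helper: quantile statistics for one numeric column."""
    q = s.quantile([0.05, 0.25, 0.5, 0.75, 0.95])
    return {
        "minimum": s.min(),
        "5th_percentile": q.loc[0.05],
        "q1": q.loc[0.25],
        "median": q.loc[0.5],
        "q3": q.loc[0.75],
        "95th_percentile": q.loc[0.95],
        "maximum": s.max(),
        "range": s.max() - s.min(),
        "iqr": q.loc[0.75] - q.loc[0.25],
    }


def descriptive_stats(s: pd.Series) -> dict:
    """Hypothetical helper: descriptive statistics for one numeric column."""
    mean = s.mean()
    std = s.std()  # sample standard deviation (ddof=1), the pandas default
    return {
        "standard_deviation": std,
        "cv": std / mean,  # coefficient of variation
        "kurtosis": s.kurt(),
        "mean": mean,
        "mad": (s - s.median()).abs().median(),  # median absolute deviation
        "skewness": s.skew(),
        "sum": s.sum(),
        "variance": s.var(),
        "monotonic": s.is_monotonic_increasing or s.is_monotonic_decreasing,
    }


values = pd.Series([1, 2, 3, 4, 5])
quantiles = quantile_stats(values)
descriptive = descriptive_stats(values)
print(quantiles)
print(descriptive)
```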

Histograms

The histograms provide an easily digestible visual of your variables. You can expect to see the frequency of your variable on the y-axis and fixed-size bins (bins=15 is the default) on the x-axis.
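
The binning behind such a histogram can be sketched with NumPy. The data below is randomly generated for the example, and 15 bins is simply the default quoted in this article; this is not the report's own plotting code.

```python
import numpy as np

# Made-up data: 1000 random integers between 0 and 199
rng = np.random.default_rng(0)
values = rng.integers(0, 200, size=1000)

# 15 equal-width bins: counts are the y-axis, edges mark the bins on the x-axis
counts, edges = np.histogram(values, bins=15)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:6.1f}, {right:6.1f}): {count}")
```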

Common Values

The common values view lists the most frequent values of your variable, along with each value's count and frequency.
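
A comparable table can be built by hand with value_counts. This is a small sketch on made-up values; the column names count and frequency are chosen for the example.

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2, 3, 1])

# value_counts sorts by count, most frequent value first
common = s.value_counts().to_frame("count")
common["frequency"] = common["count"] / len(s)  # share of all observations
print(common)
```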

Extreme Values

The extreme values view lists the smallest and largest values of your variable, along with each value's count and frequency.
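
One way to reproduce this idea by hand is to count each value, sort by the value itself, and look at both ends. The data and the choice of showing two values per end are assumptions made for this sketch.

```python
import pandas as pd

s = pd.Series([5, 1, 9, 1, 7, 9, 9, 3])

# Count each distinct value, then order by the value rather than by the count
counts = s.value_counts().sort_index()

lowest = counts.head(2)          # two smallest values with their counts
highest = counts.tail(2)[::-1]   # two largest values, largest first
print(lowest)
print(highest)
```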

Interactions

Interactions example. Screenshot by Author [5].

The interactions feature of the profiling report is unique in that you can choose which of your columns goes on the x-axis and which on the y-axis. For example, pictured above is variable A against variable A, which is why you see overlapping points. You can easily switch to other variables or columns to get a different plot and a clear picture of your data points.
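
Without the report, each pairing would need its own plot. Below is a minimal stand-in using a matplotlib scatter plot; the helper interaction_plot is invented for this example, the data is random, and the report's own rendering may look different.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 200, size=(15, 6)), columns=list("ABCDEF"))


def interaction_plot(df: pd.DataFrame, x: str, y: str):
    """Hypothetical helper: scatter one column against another."""
    fig, ax = plt.subplots()
    ax.scatter(df[x], df[y])
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    return fig, ax


# Swap "A" and "B" for any other pair of columns to get a different plot
fig, ax = interaction_plot(df, "A", "B")
fig.savefig("interaction_A_B.png")
```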

Correlations

Correlations example. Screenshot by Author [6].

Making fancier or more colorful correlation plots can be time-consuming if you build them line by line in Python. With this correlation plot, however, you can easily visualize the relationships between the variables in your data, and the plot is nicely color-coded. There are four main plots that you can display:

  • Pearson’s r
  • Spearman’s ρ
  • Kendall’s τ
  • Phik (φk)

You may only be used to one of these correlation methods, so the others may sound confusing or unusable. For that reason, the correlation plot also comes with a toggle that explains the meaning of each correlation you can visualize. This feature really helps when you need a refresher on correlation, as well as when you are deciding which plot(s) to use for your analysis.
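
The difference between the methods is easiest to see on a small example. The sketch below uses pandas' own corr method on made-up data for Pearson and Spearman only. pandas also accepts method="kendall", which as far as I know relies on SciPy being installed, and Phik comes from a separate package, so both are left out here.

```python
import pandas as pd

# Made-up data: y is linear in x, z is monotonic in x but not linear
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [1, 8, 27, 64, 125],
})

pearson = df.corr(method="pearson")    # measures linear association
spearman = df.corr(method="spearman")  # measures rank (monotonic) association

print(pearson.round(3))
print(spearman.round(3))
```

Here x and z are perfectly rank-correlated, so Spearman reports 1, while Pearson comes out below 1 because the relationship is not a straight line.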

Missing Values

Missing Values example. Screenshot by Author [7].

As you can see from the plot above, the report also covers missing values. You can see how much of each variable is missing, shown both as a count and as a matrix. It is a nice way to visualize your data before you build any models with it. Ideally you would see a plot like the one above, meaning you have no missing values.
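
The numbers behind these plots can be checked by hand with pandas. This is a sketch on a made-up dataframe that, unlike the example above, does contain missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
    "C": [7, 8, 9],
})

missing_count = df.isna().sum()   # missing cells per variable
missing_share = df.isna().mean()  # fraction of each variable that is missing
present_matrix = df.notna()       # True where a value is present, like the matrix view

print(missing_count)
print(missing_share)
print(present_matrix)
```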

Sample

Sample example. Screenshot by Author [8].

Sample acts similarly to the head and tail functions, which return your dataframe’s first or last few rows. In this example, you can see both the first rows and the last rows. I use this tab when I want a sense of where my data starts and where it ends. I recommend sorting your data first to get more out of this tab, since you can then see the range of your data at a glance.
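
The sorting suggestion can be sketched in plain pandas as follows. The column name and values are made up; sorting before looking at the first and last rows makes both ends of the range visible.

```python
import pandas as pd

df = pd.DataFrame({"A": [30, 10, 50, 20, 40]})

# Sort first, so the first and last rows show the extremes of the data
ordered = df.sort_values("A")
first_rows = ordered.head(2)
last_rows = ordered.tail(2)

print(first_rows)
print(last_rows)
```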

Summary

Photo by Elena Loshina on Unsplash [9].

I hope this article provided you with some inspiration for your next exploratory data analysis. Being a Data Scientist can be overwhelming, and EDA is often forgotten or practiced less than model-building. With the Pandas Profiling report, you can perform EDA with minimal code and still get useful statistics and visualizations. That way, you can focus on the fun part of Data Science and Machine Learning: the modeling process.

To summarize, the main features of the Pandas Profiling report are the overview, variables, interactions, correlations, missing values, and a sample of your data.

Here is the code I used to install and import the libraries, generate some dummy data for the example, and finally, the one line of code that generates the Pandas Profiling report from your Pandas dataframe [10].

# install library
# !pip install pandas_profiling
import pandas_profiling  # importing this adds profile_report() to pandas DataFrames
import pandas as pd
import numpy as np

# create data
df = pd.DataFrame(np.random.randint(0, 200, size=(15, 6)), columns=list('ABCDEF'))

# run your report!
df.profile_report()  # I did get an error and had to reinstall matplotlib to fix

Please feel free to comment below if you have any questions or have used this feature before. There is still some information I did not describe, but you can find more of it at the link I provided above.

Thank you for reading, I hope you enjoyed it!

Translated from: https://towardsdatascience.com/the-best-exploratory-data-analysis-with-pandas-profiling-e85b4d514583
