The EDA Theoretical Guide

Most data analysis problems start with understanding the data. This is often the most crucial and most complicated step, and it shapes the later decisions we make in a predictive modeling problem, including which algorithm to choose.

In this article, we walk through a complete guide to this step.

Content

  1. Reading Data
  2. Variable Identification
  3. Univariate analysis
  4. Bivariate analysis
  5. Missing values: types and analysis
  6. Outlier treatment
  7. Variable Transformation

Reading Data and Variable Identification

Reading the data means answering the following questions:

  • What is the shape of my data?
  • How many features does my data contain?
  • What does it look like?
  • What are the types of variables?
[Figure] Guide 1: Types of Variables
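The four questions above can be answered with a few pandas calls. The following is a minimal sketch, assuming pandas is available; the toy dataset and its column names (age, income, city) are hypothetical and stand in for whatever file would normally be loaded.

```python
import pandas as pd

# Hypothetical toy dataset; in practice this would usually come from a file.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52],
    "income": [32000.0, 54000.0, 61000.0, 45000.0, 80000.0],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
})

n_rows, n_cols = df.shape   # shape of the data: (rows, features)
print(df.shape)
print(df.head())            # what the data looks like
print(df.dtypes)            # type of each variable

# Variable identification: split columns into numeric and non-numeric.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```

Splitting by dtype is only a first approximation: a numeric column such as a zip code may still be categorical in meaning.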

Univariate Analysis (UA)

What is UA?

When we explore a single variable at a time from a given list of features, it is called univariate analysis (UA). Summarizing each variable on its own helps us understand the data better.

In UA we look at the following:

  • Central tendency (mean, median, mode) and dispersion of the variable
  • Distribution of the variable: symmetric, right-skewed, or left-skewed
  • Missing values and outliers
  • Count and count percent: observing the frequency of each category in a categorical variable helps us understand and deal with that variable.

Why UA?

We explore each variable and check it for anomalies such as outliers and missing values, both of which are covered later in this guide.

Methods for UA

For continuous variables:

  1. Tabular method: used to describe central tendency, dispersion, and missing values.
  2. Graphical method: used to inspect the distribution and check for outliers. Histograms show the distribution, and box plots help detect outliers.

A violin plot combines a box plot with a smoothed density curve (a kernel density estimate), so it shows roughly what a histogram and a box plot show together.

[Figure] Guide 2: Methods of Univariate Analysis for continuous variables
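As a rough illustration of the tabular method for a continuous variable, the sketch below summarizes central tendency, dispersion, missing values, and skewness with pandas. The income values are made up for the example.

```python
import pandas as pd

# Hypothetical continuous variable with one missing value and one extreme value.
income = pd.Series([32.0, 54.0, 61.0, 45.0, 80.0, None, 250.0], name="income")

summary = income.describe()            # count, mean, std, min, quartiles, max
median = income.median()               # missing values are skipped by default
n_missing = int(income.isna().sum())   # number of missing values
skewness = income.skew()               # > 0 suggests right skew, < 0 left skew

print(summary)
print("median:", median, "missing:", n_missing, "skewness:", round(skewness, 2))
```

For the graphical method, `income.plot.hist()` and `income.plot.box()` draw the histogram and box plot, assuming matplotlib is installed.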

For categorical variables:

  1. Tabular method: the “.value_counts()” operation in Python (pandas) gives a table of frequencies.
  2. Graphical method: a bar plot is usually the best graph for a categorical variable.
[Figure] Guide 3: Methods of Univariate Analysis for categorical variables
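A short sketch of the tabular method for a categorical variable, using the ".value_counts()" call named above. The city values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical variable.
city = pd.Series(["Pune", "Delhi", "Pune", "Mumbai", "Pune", "Delhi"], name="city")

counts = city.value_counts()                          # frequency of each category
percents = city.value_counts(normalize=True) * 100    # count percent

table = pd.DataFrame({"count": counts, "percent": percents.round(1)})
print(table)
```

Calling `counts.plot.bar()` on the same result draws the corresponding bar plot, assuming matplotlib is installed.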

Bivariate Analysis (BA)

What is BA?

When we study the empirical relationship between two variables, it is called bivariate analysis (BA).

Why BA?

It helps to detect anomalies, to understand how two variables depend on each other, and to see the impact of each variable on the target variable.

Methods for BA

  1. For continuous-continuous types: there are two ways to study the relationship between two continuous variables: a scatter plot and correlation analysis.

[Figure] Guide 4: Bivariate analysis for Continuous-Continuous type variables
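Correlation analysis for two continuous variables might look like the sketch below. The data and column names are hypothetical, and Pearson correlation is assumed because it is the pandas default; the article itself does not say which coefficient it has in mind.

```python
import pandas as pd

# Hypothetical pair of continuous variables.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 64, 70, 74],
})

# Pearson correlation coefficient between the two columns (pandas default).
r = df["hours_studied"].corr(df["exam_score"])

# Full correlation matrix, useful when there are many continuous features.
corr_matrix = df.corr()

print(round(r, 3))
print(corr_matrix)
```

A value of r close to +1 or -1 indicates a strong linear relationship and a value near 0 a weak one. `df.plot.scatter(x="hours_studied", y="exam_score")` draws the matching scatter plot, assuming matplotlib is installed.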

  2. For categorical-continuous types: here we can use bar plots and t-tests for the analysis.

The t-test is an inferential statistical test used to determine whether there is a significant difference between the means of two groups/categories (for more than two groups, analysis of variance, ANOVA, is the usual choice). Calculating the t statistic requires the difference between the group means, the standard deviation of each group, and the number of observations in each group.

[Figure] Guide 5: Bivariate analysis for categorical-Continuous type variables
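To make the ingredients of the t-test concrete, here is a small sketch that computes the t statistic by hand from the group means and standard deviations. It assumes Welch's form of the test, which does not require equal variances; the article does not say which variant is meant. The function name `welch_t` and the salary data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    mean_a, mean_b = mean(group_a), mean(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2   # sample variances
    standard_error = sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical continuous variable (salary, in thousands) split by a
# two-level categorical variable.
graduates = [52, 58, 61, 55, 60]
non_graduates = [45, 48, 50, 47, 44]

t_stat = welch_t(graduates, non_graduates)
print(round(t_stat, 2))
```

The larger the absolute value of the statistic, the stronger the evidence that the group means differ. Turning it into a p-value needs the t distribution; `scipy.stats.ttest_ind(graduates, non_graduates, equal_var=False)` does both steps if SciPy is installed.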

  3. For categorical-categorical types: a two-way table and the chi-square test are used to analyze the relationship between two categorical variables.
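The two-way table and the chi-square statistic can be sketched as follows. The data and column names are hypothetical, and the statistic is computed by hand (Pearson's chi-square, without continuity correction) to show where it comes from.

```python
import pandas as pd

# Hypothetical pair of categorical variables.
df = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "purchased": ["yes", "yes", "yes", "no", "no", "no", "no", "yes"],
})

# Two-way (contingency) table of observed counts.
observed = pd.crosstab(df["gender"], df["purchased"])
obs = observed.to_numpy()

# Expected counts if the two variables were independent:
# (row total * column total) / grand total for each cell.
row_totals = obs.sum(axis=1)
col_totals = obs.sum(axis=0)
expected = row_totals[:, None] * col_totals[None, :] / obs.sum()

# Pearson chi-square statistic and its degrees of freedom.
chi2 = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)

print(observed)
print("chi2:", chi2, "dof:", dof)
```

If SciPy is installed, `scipy.stats.chi2_contingency(observed)` also returns a p-value. For 2x2 tables it applies Yates' continuity correction by default, so its statistic can differ from the hand-computed one.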

Missing Values

Reasons for Missing Values

There can be various reasons for missing values in data, some of which are:

  • There may not be any response recorded.
  • There can be an error while recording the data.
  • There can be an error while reading the data, etc.

Types of Missing Values

  1. Missing Completely at Random (MCAR): the missingness has no relation to any other variable or to the value of the variable in which it occurs.

  2. Missing at Random (MAR): the missingness does not depend on the value of the variable itself but shows an observable trend in other variables. E.g., income data may be missing for people older than 60, because people of that age are generally retired.

  3. Missing Not at Random (MNAR): the missingness is related to the value of the variable itself. E.g., houses priced above Rs. 2 crore may be missing from the database, because there are few buyers at that price.

Methods of Dealing with Missing Values

There are two basic methods to deal with missing values:

  1. Deletion: we delete every row that contains a missing value from the dataset before training the model.

  2. Imputation: there are various methods by which we can fill in the missing values.

[Figure] Guide 6: Treating Missing values
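Both approaches can be sketched with pandas. The dataset is hypothetical, and the choice of median imputation for the numeric column and mode imputation for the categorical column is an assumption made for illustration; the article does not prescribe a specific imputation method.

```python
import pandas as pd

# Hypothetical dataset with missing values in both columns.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 35.0, None],
    "city": ["Pune", "Delhi", None, "Pune", "Pune"],
})

missing_per_column = df.isna().sum()   # how many values are missing where

# 1. Deletion: drop every row that contains at least one missing value.
dropped = df.dropna()

# 2. Imputation: fill numeric gaps with the median and categorical gaps
#    with the most frequent category (assumed choices, see above).
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(missing_per_column)
print(dropped)
print(imputed)
```

Deletion is simple but here it throws away three of five rows, which is why imputation is often preferred when missing values are common.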

Outliers

Types of Outliers and Their Identification

There are two types of outliers:

  1. Univariate outliers: these can be identified using a box plot.

  2. Bivariate outliers: these can be identified using a scatter plot of the two variables.

Criteria for an Outlier

Criteria for X to be an outlier, with the observations sorted in ascending order:
Q1: first quartile (25th percentile), the median of the lower half of the observations
Q2: second quartile, the median of all observations
Q3: third quartile (75th percentile), the median of the upper half of the observations
IQR: interquartile range = Q3 - Q1
X is an outlier if it satisfies: X > (Q3 + 1.5*IQR) OR X < (Q1 - 1.5*IQR)
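The 1.5*IQR rule can be applied in a few lines. The sketch below uses made-up values and pandas quantiles; note that pandas interpolates linearly between observations by default, so its quartiles can differ slightly from the "median of each half" definition on small samples.

```python
import pandas as pd

# Hypothetical observations with one suspiciously large value.
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A value is flagged when it falls outside either fence.
is_outlier = (values < lower_fence) | (values > upper_fence)
outliers = values[is_outlier]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print(outliers.tolist())
```

These fences are the same ones a standard box plot uses for its whiskers, which is why a box plot shows univariate outliers as isolated points.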

Treatment of Outliers

  1. We can delete the observation.
  2. We can impute the outlier's value using the methods discussed for imputing missing values.
  3. We can apply transformations (discussed next).

Variable Transformation

Normalization often improves the accuracy of a model. But what exactly is normalization? It is one of the techniques of variable transformation.

In variable transformation, we replace a variable with a function of itself. For example, we replace the variable x with its log value.

With variable transformation we can try to address the following issues observed earlier in the EDA process:

  1. Changing the scale of a variable (redefining its limits)
  2. Converting a non-linear relationship into a linear relationship
  3. Many algorithms perform better on symmetrically distributed variables than on skewed ones, so we can convert a skewed distribution into a more symmetric one.

Methods of Variable Transformation

  1. Non-linear transformation: we can replace the variable with its log value, square root, or cube root. Because these transformations are non-linear, they can help with all the points stated above.

  2. Binning: we can divide the continuous values into bins, converting a continuous variable into a categorical one. This can place outliers into categories that our model can deal with.
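Both methods can be sketched with NumPy and pandas. The income values, the bin edges, and the band labels below are hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable with one extreme value.
income = pd.Series([20, 25, 30, 40, 60, 120, 900], name="income")

# 1. Non-linear transformations (the log requires strictly positive values).
log_income = np.log(income)
sqrt_income = np.sqrt(income)
cbrt_income = np.cbrt(income)

print("skew before:", round(income.skew(), 2))
print("skew after log:", round(log_income.skew(), 2))

# 2. Binning: turn the continuous variable into ordered categories.
#    Intervals are closed on the right: (0, 30], (30, 100], (100, inf).
bin_edges = [0, 30, 100, float("inf")]
band_labels = ["low", "medium", "high"]
income_band = pd.cut(income, bins=bin_edges, labels=band_labels)

print(income_band.tolist())
```

After binning, the extreme value 900 simply falls into the "high" band alongside 120, so it no longer dominates the scale of the variable.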

Summing up

This is an extensive guide to Exploratory Data Analysis. It covers not only how to detect anomalies but also how to deal with them. It takes a deliberately simple approach to EDA, covering most of the core topics.

Translated from: https://towardsdatascience.com/the-eda-theoretical-guide-b7cef7653f0d
