EDA Analysis
Most data analysis problems start with understanding the data. This is the most crucial and often the most complicated step, and it shapes the later decisions in a predictive modeling problem, including which algorithm to choose.
In this article, we will walk through a complete guide to this step.
Content
- Reading Data
- Variable Identification
- Univariate analysis
- Bivariate analysis
- Missing values: types and analysis
- Outlier treatment
- Variable Transformation
Reading Data and Variable Identification
Reading the data means answering the following questions:
- What is the shape of my data?
- How many features does my data contain?
- What does it look like?
- What are the types of variables?
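The questions above can be answered with a few pandas calls. This is a minimal sketch; the DataFrame and its column names (age, city, income) are made up for illustration, and in practice the data would usually come from a file.

```python
import pandas as pd

# Hypothetical toy dataset (assumption); real data would typically be loaded from a file
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "income": [30000.0, 52000.0, 61000.0, 58000.0],
})

n_rows, n_cols = df.shape   # shape of the data: (rows, columns)
first_rows = df.head(2)     # a quick look at the first rows
col_types = df.dtypes       # type of each variable

print(n_rows, n_cols)
print(first_rows)
print(col_types)
```

Numeric types generally point to continuous variables and text types to categorical ones, although an integer column can also encode categories, so the types are a starting point rather than a final answer.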

Univariate Analysis (UA)
What is UA?
When we explore a single variable at a time from a given list of features, it is called UA. We summarize the variable, which helps us better understand the data.
We look for the following things in UA:
- Central tendency (mean, median, mode) and dispersion of the variable
- Distribution of the variable: symmetric, right-skewed, or left-skewed
- Missing values and outliers
- Count and count percent: Observing the frequency of each category in a categorical variable helps us to understand and deal with that variable.
Why UA?
We explore each variable and check for anomalies such as outliers and missing values, both of which are covered later in this article.
Methods for UA
For Continuous Variables:
- Tabular Method: Used to describe central tendency, dispersion, and missing values.
- Graphical Method: Used for examining the distribution and checking for outliers. We can use histograms to understand the distribution and box plots to detect outliers.
A violin plot combines a box plot with a kernel density estimate (a smoothed histogram), so it shows both views at once.
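The tabular method for a continuous variable might look like the sketch below. The income series is invented for illustration, and using skewness as a numeric check of the distribution's shape is an addition here, not something the article prescribes.

```python
import pandas as pd

# Hypothetical continuous variable with one missing value and one unusually large value
income = pd.Series([30.0, 32.0, 35.0, 36.0, 38.0, 40.0, None, 120.0], name="income")

summary = income.describe()       # count, mean, std, min, quartiles, max
n_missing = income.isna().sum()   # number of missing values
skewness = income.skew()          # a positive value suggests a right-skewed distribution

print(summary)
print("missing:", n_missing)
print("skewness:", round(skewness, 2))
```

For the graphical method, `income.plot.hist()` and `income.plot.box()` would draw the histogram and box plot, assuming matplotlib is installed.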

For Categorical Variables:
- Tabular Method: The ".value_counts()" method in Python (pandas) gives a table of frequencies.
- Graphical Method: A bar plot is usually the best graph for a categorical variable.
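As a small sketch of the count and count percent for a categorical variable, using an invented city column:

```python
import pandas as pd

# Hypothetical categorical variable (assumption)
city = pd.Series(["Pune", "Delhi", "Pune", "Mumbai", "Pune", "Delhi"], name="city")

counts = city.value_counts()                         # frequency of each category
percents = city.value_counts(normalize=True) * 100   # share of each category, in percent

print(counts)
print(percents.round(1))
```

Calling `counts.plot.bar()` would give the corresponding bar plot, assuming matplotlib is available.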

Bivariate Analysis (BA)
What is BA?
When we study the empirical relationship between two variables, it is called BA.
Why BA?
It helps to detect anomalies, understand how two variables depend on each other, and assess the impact of each variable on the target variable.
Methods for BA
1. For continuous-continuous types: There are two methods to study the relationship between two continuous variables: a scatter plot and correlation analysis.
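A minimal sketch of correlation analysis between two continuous variables follows. The column names and values are hypothetical, and the Pearson coefficient (the pandas default) is assumed here since the article does not name a specific correlation measure.

```python
import pandas as pd

# Hypothetical pair of continuous variables (assumption)
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 64, 70, 74],
})

# Pearson correlation coefficient: close to +1 or -1 means a strong linear relationship
r = df["hours_studied"].corr(df["exam_score"])
print("correlation:", round(r, 3))
```

`df.plot.scatter(x="hours_studied", y="exam_score")` would draw the matching scatter plot, assuming matplotlib is installed.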

2. For categorical-continuous types: Here we can use bar plots and t-tests for the analysis.
The t-test is an inferential statistic used to determine whether there is a significant difference between the means of two groups/categories; for more than two groups, analysis of variance (ANOVA) is the usual choice. Calculating a t-test requires the difference between the group means, the standard deviation of each group, and the number of observations in each group.
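The following sketch compares the mean of a continuous variable across two categories. The data is invented, SciPy is assumed to be installed, and Welch's variant of the t-test (`equal_var=False`) is a choice made for this example rather than something the article specifies.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: one categorical variable (group) and one continuous variable (score)
df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    "score": [50, 52, 49, 51, 53, 60, 62, 59, 61, 63],
})

group_means = df.groupby("group")["score"].mean()   # the values a bar plot would show

a = df.loc[df["group"] == "A", "score"]
b = df.loc[df["group"] == "B", "score"]

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

print(group_means)
print("t =", round(t_stat, 2), "p =", p_value)
```

A small p-value (commonly below 0.05) suggests the difference between the two group means is unlikely to be due to chance alone.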

3. For categorical-categorical types: A two-way table and the chi-square test are used to analyze the relationship between two categorical variables.
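A two-way table and chi-square test could be sketched as below. The gender and purchased columns are hypothetical, SciPy is assumed to be installed, and the sample is far too small for the test to be reliable; it only illustrates the mechanics.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical pair of categorical variables (assumption)
df = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "purchased": ["yes", "yes", "yes", "no", "no", "no", "no", "yes"],
})

# Two-way (contingency) table of observed counts
table = pd.crosstab(df["gender"], df["purchased"])

# Chi-square test of independence on the observed counts
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print("chi2 =", round(chi2, 3), "p =", round(p_value, 3), "dof =", dof)
```

The `expected` array holds the counts we would expect if the two variables were independent; large gaps between observed and expected counts drive the statistic up.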
Missing Values
Reasons for Missing Values
Values can be missing from data for various reasons, for example:
- No response may have been recorded.
- There may have been an error while recording the data.
- There may have been an error while reading the data, etc.
Types of Missing Values
Missing Completely at Random (MCAR): The missingness has no relation to any other variable or to the variable in which it occurs.
Missing at Random (MAR): The missingness has no relation to the values of the variable itself but may show an observable pattern in other variables. E.g., income data for people older than 60 may be missing because people of that age are generally retired.
Missing Not at Random (MNAR): The missingness is related to the values of the variable itself. E.g., houses priced above Rs. 2 crores may be missing from the database because there are few buyers at that price.
Methods of Dealing with Missing Values
There are two basic methods to deal with missing values:
Deletion: We delete all rows that contain missing values from the dataset before training the model.
Imputation: We fill in the missing values; there are various methods for doing so.
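Both approaches can be sketched in pandas as follows. The data is invented, and filling with the column median is just one possible imputation method chosen for illustration; the article does not single out a specific one.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values (assumption)
df = pd.DataFrame({
    "age": [25, np.nan, 47, 51],
    "income": [30000.0, 52000.0, np.nan, 58000.0],
})

missing_per_column = df.isna().sum()   # how many values are missing in each column

# Deletion: drop every row that contains at least one missing value
deleted = df.dropna()

# Imputation (median is an assumption here): fill each gap with its column median
imputed = df.fillna(df.median(numeric_only=True))

print(missing_per_column)
print(deleted)
print(imputed)
```

Deletion is simple but discards whole rows, here half of the data, while imputation keeps every row at the cost of introducing estimated values.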

Outliers
Types of Outliers and Their Identification
There are two types of outliers:
Univariate outliers: These can be identified using a box plot.
Bivariate outliers: These can be identified using a scatter plot of the two variables.
Criteria for an Outlier
Criteria for X to be an outlier, with the observations sorted in ascending order:
Q1: first quartile, the value below which 25% of the observations fall
Q2: median of all observations
Q3: third quartile, the value below which 75% of the observations fall
IQR: interquartile range = Q3 - Q1
X is an outlier if it satisfies: X > (Q3 + 1.5*IQR) OR X < (Q1 - 1.5*IQR)
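The 1.5*IQR rule can be checked with a short sketch like this one. The values are made up, and the quartiles use pandas' default linear interpolation, which may differ slightly from a hand calculation.

```python
import pandas as pd

# Hypothetical observations with one suspiciously large value
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the observations outside the [lower, upper] fences
outliers = values[(values < lower) | (values > upper)]

print("fences:", lower, upper)
print(outliers.tolist())
```

These fences are the same ones a standard box plot uses to draw its whiskers, which is why box plots are the usual graphical check.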
Treatment of Outliers
- We can delete the observation.
- We can impute the outlier's value using the same methods used for imputing missing values.
- We can apply transformations (discussed next).
Variable Transformation
Normalization often improves the accuracy of a model. But what exactly is normalization? It is one of the techniques of variable transformation.
In variable transformation, we replace a variable with a function of it. For example, we replace the variable x with its log value.
Variable transformation can address the following issues observed in the earlier EDA steps:
- Changing the scale of a variable (redefining its limits)
- Converting a non-linear relationship into a linear relationship
- Converting a skewed distribution into a symmetric one, since many algorithms perform better on symmetrically distributed variables than on skewed ones
Methods of Variable Transformation
Non-linear transformation: We can replace the variable with its log value, square root, or cube root. These are non-linear transformations, so they help with all the points stated above.
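As a rough illustration of how such transformations reduce skew, the sketch below compares the skewness of an invented right-skewed variable before and after each one. Note that the log requires strictly positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable (assumption); all values are positive
income = pd.Series([20, 25, 30, 35, 40, 60, 150, 400], dtype=float)

log_income = np.log(income)     # log transformation
sqrt_income = np.sqrt(income)   # square-root transformation
cbrt_income = np.cbrt(income)   # cube-root transformation

print("raw skew: ", round(income.skew(), 2))
print("sqrt skew:", round(sqrt_income.skew(), 2))
print("cbrt skew:", round(cbrt_income.skew(), 2))
print("log skew: ", round(log_income.skew(), 2))
```

Which transformation works best depends on the data, so it is worth comparing the resulting distributions rather than assuming one of them.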
Binning: We can divide the continuous values into bins, thereby converting a continuous variable into a categorical one. This may help us place outliers into categories that our model can handle.
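Binning might be sketched with `pd.cut` as follows. The bin edges and labels are arbitrary choices for illustration, not values from the article.

```python
import pandas as pd

# Hypothetical continuous variable; 110 would stand out as an extreme value
age = pd.Series([5, 17, 25, 42, 67, 110])

# Bin edges and labels are assumptions; each bin includes its right edge
age_group = pd.cut(
    age,
    bins=[0, 18, 40, 65, 120],
    labels=["child", "young_adult", "adult", "senior"],
)

print(age_group.tolist())
```

The extreme value 110 simply lands in the "senior" bin alongside 67, so it no longer behaves as an outlier in the binned variable.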
Summing Up
This is an extensive guide to Exploratory Data Analysis. It covers not only how to detect anomalies but also how to deal with them. It takes a deliberately simple approach to EDA, covering most of the core topics.
Translated from: https://towardsdatascience.com/the-eda-theoretical-guide-b7cef7653f0d