Statistical Analysis on a Dataset You Don't Understand
Recently, I took the opportunity to work on a competition held by Wells Fargo (Mindsumo). The dataset provided was just a bunch of numbers in various columns, with no indication of what the data might be. I had always thought that the analysis of data required some knowledge and understanding of the data and its domain to be done efficiently. I have attached a sample below. The dataset consisted of columns X0 through X29, which held continuous values, and a column XC, which held categorical data, i.e. 30 variables in total. I set out on further analysis of the entire dataset to understand the data.

Normality check of continuous variables
I used a QQ plot to determine whether the variables were normally distributed and to understand if there was any skew in the data. All the data points were normally distributed with very little deviation, so no processing of the data was required at this point to attain a Gaussian distribution. I prefer a QQ plot for the initial analysis because it makes it very easy to examine the data and determine the type of distribution, be it Gaussian, uniform, etc.

Once the data is determined to be normally distributed using the QQ plot, a Shapiro-Wilk test can be performed to confirm the hypothesis. The test has been designed specifically for normal distributions. The null hypothesis, in this case, is that the variable is normally distributed. If the p-value obtained is less than 0.05, the null hypothesis is rejected and the variable is assumed not to be normally distributed. All the p-values were greater than 0.05, which means that all the variables follow a normal distribution.
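The two checks above can be sketched in a few lines with SciPy. This is a minimal example on a synthetic stand-in for one of the continuous columns (the real X0..X29 data is not available here):

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for one of the continuous columns (X0..X29).
rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=500)

# QQ plot data: probplot returns theoretical vs. ordered sample quantiles;
# points lying on a straight line (r close to 1) suggest normality.
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")

# Shapiro-Wilk test: the null hypothesis is that the sample is normal.
stat, p_value = stats.shapiro(x)
if p_value < 0.05:
    print("Reject normality (p = %.3f)" % p_value)
else:
    print("Cannot reject normality (p = %.3f)" % p_value)
```

In practice you would pass `osm, osr` to a plotting library (or call `stats.probplot(x, plot=plt.gca())` with matplotlib) and run the test on each of the 30 columns.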

Categorical variable
I checked the distribution of the categorical variable to see whether the points in the dataset were evenly distributed. The distribution of the variable was as shown below. The variable consisted of 5 unique values (A, B, C, D and E), all of them more or less equally distributed.
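A quick way to perform this balance check is with pandas `value_counts`; the sample below uses a hypothetical, perfectly balanced XC column for illustration:

```python
import pandas as pd

# Hypothetical sample of the categorical column XC with five levels A..E.
xc = pd.Series(["A", "B", "C", "D", "E"] * 20, name="XC")

# Absolute counts and proportions of each category; roughly equal
# proportions indicate a balanced categorical variable.
counts = xc.value_counts()
proportions = xc.value_counts(normalize=True)
print(counts)
print(proportions)
```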

I used a One-Hot Encoding mechanism to convert the categorical variable into a binary variable for each resulting categorical value, as shown below. Although this resulted in an increase in dimensionality, I was hoping to check for correlations later on and remove or merge certain columns.
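One-hot encoding of a column like XC can be done with `pd.get_dummies`, which produces one indicator column per category (the `XC_C` name that appears later in the text follows this `prefix_value` pattern):

```python
import pandas as pd

# Hypothetical slice of the XC column; get_dummies creates one binary
# indicator column per category (XC_A .. XC_E).
df = pd.DataFrame({"XC": ["A", "B", "C", "A", "E", "D"]})
encoded = pd.get_dummies(df, columns=["XC"], prefix="XC")
print(encoded.columns.tolist())
```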

Data Correlation
Correlation is an important technique which helps in determining the relationships between variables and weeding out highly correlated data. We do this because we don't want variables that are highly correlated with each other, since they affect the final dependent variable in the same way. Correlation values range from -1 to 1, with 1 signifying a strong positive correlation between the two and -1 signifying a strong negative correlation. We also calculate the statistical significance of the correlations to determine whether the null hypothesis (there is no correlation) holds. I have taken three values as benchmarks for measuring statistical significance: 0.1, 0.05 and 0.01. The table below is a small sub-sample of the correlation values for each pair of variables. A p-value less than 0.01 signifies high statistical significance, meaning the null hypothesis can be rejected; this is represented by three '*'. The significance is weaker if the p-value is less than 0.1 but greater than 0.05, which is represented by one '*'.
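A correlation value plus its significance stars can be computed with `scipy.stats.pearsonr`. The sketch below uses two synthetic variables (not the competition data) and the same 0.1/0.05/0.01 benchmarks:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical pair of variables: x2 partially depends on x1.
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + rng.normal(size=200)

# Pearson correlation coefficient and the p-value of the test whose
# null hypothesis is "there is no correlation".
r, p = stats.pearsonr(x1, x2)

def stars(p):
    """Annotate significance at the 0.01 / 0.05 / 0.1 benchmarks."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.1:
        return "*"
    return ""

print(f"r = {r:.2f} {stars(p)}")
```

Running this pairwise over all columns reproduces the kind of starred correlation table described above.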

The dataset above did not contain any variable pairs with a high correlation value. Thus, I safely went ahead with the assumption that the variables were not related to each other. I further explored the correlation of the variables with the dependent variable y.

We are concerned with the prediction of the variable y. As seen, there are many variables which have no correlation with y and are not statistically significant at all, which means the relationship is weak and we can safely exclude them from the final dataset. These include variables such as X5, X8, X9, X10, XC_C, etc. I have not excluded the other variables which have low correlation but high statistical significance, as there may be a small subset of samples that affects the final dependent variable, and we cannot exclude such variables completely. We can further reduce the variables by merging some of them. We do this on the basis of variables which have the same correlation value with the y variable. These include —
X7, X11, X17 and X21
X1 and X23
X22 and X26
X4, X15, X19 and X25
I merged these variables using an optimization technique. Let us consider the variables X1 and X23. I achieved this by assuming a linear combination mX1 + nX23 related to y. To determine the maximum correlation, we have to calculate the optimum values for m and n. In all the cases, I assumed n to be 1 and solved for m. The equation is as shown below.

Once n is set to 1, we can easily solve for m. The resulting value can then be substituted into the linear combination above to produce the merged variable. In this way, we can merge all the above variables. If the denominator is 0, m can be taken as 1 and the equation can be solved for n instead. Make sure that the correlation values are equal for both variables before merging. After merging, I generated the correlation table again.
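Since the author's closed-form equation is shown only as an image in the original post, a numerical version of the same idea is sketched below: with n fixed at 1, search for the m that maximizes the correlation of mX1 + X23 with y. The data here is synthetic, constructed so that X1 and X23 have similar correlations with y, as in the text:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
# Hypothetical variables with similar correlation to y (like X1 and X23).
y = rng.normal(size=300)
x1 = 0.4 * y + rng.normal(size=300)
x23 = 0.4 * y + rng.normal(size=300)

def neg_corr(m):
    """Negative correlation of the merged variable m*X1 + 1*X23 with y."""
    merged = m * x1 + x23
    return -stats.pearsonr(merged, y)[0]

# Maximize the correlation by minimizing its negative over a bounded range of m.
res = optimize.minimize_scalar(neg_corr, bounds=(-10, 10), method="bounded")
m_opt = res.x
merged_corr = -res.fun
print(f"optimal m = {m_opt:.3f}, merged correlation = {merged_corr:.3f}")
```

Because m = 0 recovers X23 alone, the optimized merged variable can never correlate with y worse than X23 by itself, which is what makes the merge worthwhile.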

We can see that the correlation values for the merged variables have increased, and the statistical significance is high for all of them. I managed to reduce the number of variables from 30 to 15. I can now feed these variables into my machine learning model and check the accuracy against the validation dataset.
Training and Validating the data
I chose a Logistic Regression model for training and prediction on this dataset for multiple reasons —
- The dependent variable is binary
- The independent variables are related to the dependent variable
After training the model, I checked the model against the validation dataset and these are the results.
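The training-and-validation step looks roughly like the following with scikit-learn. The feature matrix here is a synthetic stand-in for the 15 reduced variables, with a binary target constructed for illustration; the real competition data and the author's exact hyperparameters are not available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
# Hypothetical reduced feature matrix (15 variables after merging)
# and a binary dependent variable driven by the first two features.
X = rng.normal(size=(1000, 15))
y = (X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=1000) > 0).astype(int)

# Hold out a validation split, train, then score on the held-out data.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

preds = model.predict(X_val)
print("accuracy:", accuracy_score(y_val, preds))
print("F1:", f1_score(y_val, preds))
```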

The model had an accuracy of 99.6% with an F1 score of 99.45%.
Conclusion
This was a basic exploratory and statistical analysis to reduce the number of features and ensure that there are no correlated variables in the final dataset. Using a few simple techniques, we can be assured of getting good results even if we do not initially understand what the data is. The main steps include ensuring a normal distribution of the data and an efficient encoding scheme for the categorical variables. Further, variables can be removed or merged based on the correlation among them, after which an appropriate model can be chosen for analysis. You can find the code repository at https://github.com/Raul9595/AnonyData.
Translated from: https://towardsdatascience.com/statistical-analysis-on-a-dataset-you-dont-understand-f382f43c8fa5