如何進行數據分析統計_對您不了解的數據集進行統計分析

如何進行數據分析統計

Recently, I took the opportunity to work on a competition held by Wells Fargo (Mindsumo). The dataset provided was just a bunch of numbers in various columns with no indication of what the data might be. I always thought that the analysis of data required some knowledge and understanding of the data and the domain to perform an efficient analysis. I have attached a sample below. It consisted of columns from X0 to X29 which consisted of continuous values and XC which consisted of categorical data i.e. 30 variables in total. I set out on further analysis on the entire dataset to understand the data.

[R ecently,我趁機工作由舉辦的比賽富國銀行(Mindsumo) 。 提供的數據集只是各個列中的一堆數字,沒有指示數據可能是什么。 我一直認為,數據分析需要對數據和領域有一定的了解和理解,才能進行有效的分析。 我在下面附上了一個樣本。 它由從X0X29的列組成,這些列由連續值組成,而XC則由分類數據組成,即總共30個變量。 我著手對整個數據集進行進一步分析以了解數據。

Image for post

連續變量的正態性檢查 (Normality check of continuous variables)

I used the QQ plot to determine the normality distribution of the variables and understand if there is any skew in the data. All the data points were normally distributed with very less deviation which required no processing of the data to be done at this point to attain a Gaussian distribution. I prefer a QQ plot for the initial analysis because it makes it very easy to analyze the data and determine the type of distribution be it Gaussian distribution, uniform distribution, etc.

我使用QQ圖來確定變量的正態分布,并了解數據中是否存在任何偏差。 所有數據點均以極小的偏差進行正態分布,因此在這一點上無需對數據進行任何處理即可獲得高斯分布。 我更喜歡使用QQ圖進行初始分析,因為它使分析數據和確定分布類型變得非常容易,包括高斯分布,均勻分布等。

Image for post

Once the data is determined to be a normal distribution using the QQ plot, a Shapiro Wilk test can be performed to confirm the hypothesis. It has been deigned specifically for normal distributions. The null hypothesis, in this case, is that the variable is normally distributed. If the p value obtained is less than 0.05 then the null hypothesis is rejected and it is assumed that the variable is not normally distributed. All the values seem to be greater than 0.5 which means that all the variables follow a normal distribution.

一旦使用QQ圖確定數據為正態分布,就可以執行Shapiro Wilk檢驗來確認假設。 專為正態分布而設計。 在這種情況下,零假設是變量是正態分布的。 如果獲得的p值小于0.05,則拒絕零假設,并假設變量不是正態分布的。 所有值似乎都大于0.5,這意味著所有變量都遵循正態分布。

Image for post

分類變量 (Categorical variable)

I checked the distribution of the categorical variable to check if the points in the dataset were equally distributed. The distribution of the variables was as shown below. The variable consisted of 5 unique values (A,B,C,D and E) with all the values being more or less equally distributed.

我檢查了分類變量的分布,以檢查數據集中的點是否均勻分布。 變量的分布如下所示。 該變量由5個唯一值(A,B,C,D和E)組成,所有值或多或少均等地分布。

Image for post

I used a One-Hot Encoding mechanism to convert the categorical variables to a binary variable for each resulting categorical value as shown below. Although this resulted in an increase in dimensionality, I was hoping to check for correlations later on and remove or merge certain rows.

我使用單點編碼機制將每個結果分類值的分類變量轉換為二進制變量,如下所示。 盡管這導致維數增加,但我希望以后再檢查相關性,并刪除或合并某些行。

Image for post

數據關聯 (Data Correlation)

Correlation is an important technique which helps in determining the relationships between the variables and weeding out highly correlated data. The reason we do this because we don't want variables that are highly correlated with each other since they affect the final dependent variable in the same way. The correlation values range from -1 to 1 with a value of 1 signifying a strong, positive correlation between the two and a value of -1 signifying a strong, negative correlation. We also calculate the statistical significance of the correlations to determine if the null hypothesis (There is no correlation) is valid or not. I have taken three values as benchmarks for measuring statistical significance — 0.1, 0.05 and 0.01. The below table is a small sub sample of the correlation values for each set of variables. A p value which is less than 0.01 signifies a high statistical significance and that the null hypothesis can be rejected which is represented by 3 ‘*’ while the statistical significance is lesser if the p-value is lesser than 0.1 but greater than 0.05 which is represented by 1 ‘*’.

關聯是一項重要的技術,可幫助確定變量之間的關系并清除高度相關的數據。 我們之所以這樣做,是因為我們不希望變量之間具有高度相關性,因為它們以相同的方式影響最終的因變量。 相關值的范圍是-1到1,值1表示兩者之間的強正相關,值-1表示強的負相關。 我們還計算了相關性的統計顯著性,以確定零假設(無相關性)是否有效。 我已經采取三個值作為基準,用于測量統計顯著性- 0.1,0.050.01。 下表是每組變量的相關值的一小部分子樣本。 如果p值小于0.01,則表示具有較高的統計顯著性,并且可以拒絕由3 '*'表示的原假設,而如果p值小于0.1但大于0.05,則統計學顯著性較小。用1' *'表示

Image for post

The dataset above did not have any values which had a high correlation value. Thus, I safely went ahead with the assumption that the values were not related to each other. I further explored the correlation of the variables with the dependent variable y.

上面的數據集沒有任何具有高相關值的值。 因此,我可以安全地繼續進行以下假設:這些值彼此無關。 我進一步探討了變量與因變量y的相關性。

Image for post

We are concerned with the prediction of the variable y. As seen, there are a lot of variables which don't have any correlation to y as well as are not statistically significant at all which means that the relationship is weak and we can safely exclude them from the final dataset. These include variables such as X5, X8, X9, X10, XC_C, etc. I have not excluded the other variables which have low correlation but high statistical significance as there may be a small sample which affects the final dependent variable and we cannot exclude them completely. We can further reduce the variables by merging some of them. We do this on the basis of variables which have the same correlation value with the y variable. These include —

我們關注變量y的預測。 可以看出,有很多變量與y沒有任何關系,并且在統計上根本不重要,這意味著該關系很弱,我們可以安全地將它們從最終數據集中排除。 這些變量包括X5X8X9X10XC_C等。我沒有排除其他相關性較低但具有統計學意義的變量,因為可能會有一個小的樣本影響最終因變量,因此我們不能排除它們完全。 我們可以通過合并其中的一些變量來進一步減少變量。 我們基于與y變量具有相同相關值的變量進行此操作。 這些包括 -

  • X7, X11, X17 and X21

    X7X11X17X21

  • X1 and X23

    X1X23

  • X22 and X26

    X22X26

  • X4, X15, X19 and X25

    X4X15X19X25

I merged these variables using an optimization technique. Let us consider the variable, X1 and X23. I achieved this by assuming a linear relation mX1 + nX23 with y. For determining the maximum correlation, we have to calculate the optimum value for m and n. In all the cases, I assumed n to be 1 and solved for m. The equation is as shown below.

我使用優化技術合并了這些變量。 讓我們考慮變量X1X23 。 我通過假設線性關系m X1 + n X23y來實現這一點。 為了確定最大相關性,我們必須計算mn的最佳值。 在所有情況下,我都假定n1并求解m。 公式如下所示。

Image for post

Once n is set as 1, we can easily solve for m. This can be substituted in the above linear equation for each value. In this way, we can merge all the above variables. If the denominator is 0, m can be taken as 1 and the equation can be solved for n. Make sure that the correlation values are equal for both the variables. After merging, I generated the correlation table again.

將n設置為1后 ,我們可以輕松求解m。 可以在上面的線性方程式中將其替換為每個值。 這樣,我們可以合并所有上述變量。 如果分母為0 ,則m可以取為1,并且該方程可以求解n 。 確保兩個變量的相關值相等。 合并后,我再次生成了相關表。

Image for post

We can see that the correlation values for the merged variables have increased and the statistical significance is high for all of them. I managed to reduce the number of variables from 30 to 15. I can now use these variables to feed it into my machine learning model and check the accuracy against the validation dataset.

我們可以看到,合并變量的相關值已經增加,并且所有變量的統計意義都很高。 我設法將變量的數量從30個減少到15個 。 現在,我可以使用這些變量將其輸入到我的機器學習模型中,并根據驗證數據集檢查準確性。

訓練和驗證數據 (Training and Validating the data)

I chose a Logistic Regression model for the training and predicting on this dataset for multiple reasons —

由于多種原因,我選擇了Logistic回歸模型來對該數據集進行訓練和預測-

  • The dependent variable is binary

    因變量是二進制
  • The independent variables are related to the dependent variable

    自變量與因變量有關

After training the model, I checked the model against the validation dataset and these are the results.

訓練模型后,我對照驗證數據集檢查了模型,這些是結果。

Image for post

The model had an accuracy of 99.6% with a F1 score of 99.45%.

該模型的accuracy99.6%F1 score99.45%

結論 (Conclusion)

This was a basic exploratory and statistical analysis to reduce the number of features and assure that there are no correlated variables in the final dataset. Using a few simple techniques, we can be assured of getting good results even if we do not understand what the data is initially. The main steps include ensuring a normal distribution of data and an efficient encoding scheme for the categorical variables. Further, the variables can be removed and merged based on correlation among them after which an appropriate model can be chosen for analysis. You can find the code repository at https://github.com/Raul9595/AnonyData.

這是一項基本的探索性和統計分析,目的是減少特征數量并確保最終數據集中不存在相關變量。 使用一些簡單的技術,即使我們不了解最初的數據是什么,也可以確保獲得良好的結果。 主要步驟包括確保數據的正態分布和有效的分類變量編碼方案。 此外,可以基于變量之間的相關性來刪除和合并變量,之后可以選擇適當的模型進行分析。 您可以在https://github.com/Raul9595/AnonyData中找到代碼存儲庫。

翻譯自: https://towardsdatascience.com/statistical-analysis-on-a-dataset-you-dont-understand-f382f43c8fa5

如何進行數據分析統計

本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。
如若轉載,請注明出處:http://www.pswp.cn/news/390753.shtml
繁體地址,請注明出處:http://hk.pswp.cn/news/390753.shtml
英文地址,請注明出處:http://en.pswp.cn/news/390753.shtml

如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!

相關文章

經典:區間dp-合并石子

題目鏈接 :http://acm.nyist.edu.cn/JudgeOnline/problem.php?pid737 這個動態規劃的思是,要得出合并n堆石子的最優答案可以從小到大枚舉所有石子合并的最優情況,例如要合并5堆石子就可以從,最優的23和14中得到最佳的答案。從兩堆…

常見排序算法_解釋的算法-它們是什么以及常見的排序算法

常見排序算法In its most basic form, an algorithm is a set of detailed step-by-step instructions to complete a task. For example, an algorithm to make coffee in a french press would be:在最基本的形式中,算法是一組完成任務的詳細分步說明。 例如&…

020-Spring Boot 監控和度量

一、概述 通過配置使用actuator查看監控和度量信息 二、使用 2.1、建立web項目&#xff0c;增加pom <dependency><groupId>org.springframework.boot</groupId><artifactId>spring-boot-starter-web</artifactId></dependency> 啟動項目&a…

matplotlib布局_Matplotlib多列,行跨度布局

matplotlib布局For Visualization in Python, Matplotlib library has been the workhorse for quite some time now. It has held its own even after more nimble rivals with easier code interface and capabilities like seaborn, plotly, bokeh etc. have arrived on the…

Hadoop生態系統

大數據架構-Lambda Lambda架構由Storm的作者Nathan Marz提出。旨在設計出一個能滿足實時大數據系統關鍵特性的架構&#xff0c;具有高容錯、低延時和可擴展等特性。Lambda架構整合離線計算和實時計算&#xff0c;融合不可變性&#xff08;Immutability&#xff09;&#xff0c…

javascript之 原生document.querySelector和querySelectorAll方法

querySelector和querySelectorAll是W3C提供的 新的查詢接口&#xff0c;其主要特點如下&#xff1a; 1、querySelector只返回匹配的第一個元素&#xff0c;如果沒有匹配項&#xff0c;返回null。 2、querySelectorAll返回匹配的元素集合&#xff0c;如果沒有匹配項&#xff0c;…

RDBMS數據定時采集到HDFS

[toc] RDBMS數據定時采集到HDFS 前言 其實并不難&#xff0c;就是使用sqoop定時從MySQL中導入到HDFS中&#xff0c;主要是sqoop命令的使用和Linux腳本的操作這些知識。 場景 在我們的場景中&#xff0c;需要每天將數據庫中新增的用戶數據采集到HDFS中&#xff0c;數據庫中有tim…

單詞嵌入_神秘的文本分類:單詞嵌入簡介

單詞嵌入Natural language processing (NLP) is an old science that started in the 1950s. The Georgetown IBM experiment in 1954 was a big step towards a fully automated text translation. More than 60 Russian sentences were translated into English using simple…

使用Hadoop所需要的一些Linux基礎

Linux 概念 Linux 是一個類Unix操作系統&#xff0c;是 Unix 的一種&#xff0c;它 控制整個系統基本服務的核心程序 (kernel) 是由 Linus 帶頭開發出來的&#xff0c;「Linux」這個名稱便是以 「Linus’s unix」來命名的。 Linux泛指一類操作系統&#xff0c;具體的版本有&a…

python多項式回歸_Python從頭開始的多項式回歸

python多項式回歸Polynomial regression in an improved version of linear regression. If you know linear regression, it will be simple for you. If not, I will explain the formulas here in this article. There are other advanced and more efficient machine learn…

《Linux命令行與shell腳本編程大全 第3版》Linux命令行---4

以下為閱讀《Linux命令行與shell腳本編程大全 第3版》的讀書筆記&#xff0c;為了方便記錄&#xff0c;特地與書的內容保持同步&#xff0c;特意做成一節一次隨筆&#xff0c;特記錄如下&#xff1a; 《Linux命令行與shell腳本編程大全 第3版》Linux命令行--- Linux命令行與she…

徹底搞懂 JS 中 this 機制

徹底搞懂 JS 中 this 機制 摘要&#xff1a;本文屬于原創&#xff0c;歡迎轉載&#xff0c;轉載請保留出處&#xff1a;https://github.com/jasonGeng88/blog 目錄 this 是什么this 的四種綁定規則綁定規則的優先級綁定例外擴展&#xff1a;箭頭函數this 是什么 理解this之前&a…

?如何在2分鐘內將GraphQL服務器添加到RESTful Express.js API

You can get a lot done in 2 minutes, like microwaving popcorn, sending a text message, eating a cupcake, and hooking up a GraphQL server.您可以在2分鐘內完成很多工作&#xff0c;例如微波爐爆米花&#xff0c;發送短信&#xff0c; 吃蛋糕以及連接GraphQL服務器 。 …

leetcode 1744. 你能在你最喜歡的那天吃到你最喜歡的糖果嗎?

給你一個下標從 0 開始的正整數數組 candiesCount &#xff0c;其中 candiesCount[i] 表示你擁有的第 i 類糖果的數目。同時給你一個二維數組 queries &#xff0c;其中 queries[i] [favoriteTypei, favoriteDayi, dailyCapi] 。 你按照如下規則進行一場游戲&#xff1a; 你…

回歸分析_回歸

回歸分析Machine learning algorithms are not your regular algorithms that we may be used to because they are often described by a combination of some complex statistics and mathematics. Since it is very important to understand the background of any algorith…

ruby nil_Ruby中的數據類型-True,False和Nil用示例解釋

ruby niltrue, false, and nil are special built-in data types in Ruby. Each of these keywords evaluates to an object that is the sole instance of its respective class.true &#xff0c; false和nil是Ruby中的特殊內置數據類型。 這些關鍵字中的每一個都求值為一個對…

淺嘗flutter中的動畫(淡入淡出)

在移動端開發中&#xff0c;經常會有一些動畫交互&#xff0c;比如淡入淡出,效果如圖&#xff1a; 因為官方封裝好了AnimatedOpacity Widget&#xff0c;開箱即用&#xff0c;所以我們用起來很方便&#xff0c;代碼量很少&#xff0c;做少量配置即可&#xff0c;所以&#xff0…

數據科學還是計算機科學_何時不使用數據科學

數據科學還是計算機科學意見 (Opinion) 目錄 (Table of Contents) Introduction 介紹 Examples 例子 When You Should Use Data Science 什么時候應該使用數據科學 Summary 摘要 介紹 (Introduction) Both Data Science and Machine Learning are useful fields that apply sev…

空間復雜度 用什么符號表示_什么是大O符號解釋:時空復雜性

空間復雜度 用什么符號表示Do you really understand Big O? If so, then this will refresh your understanding before an interview. If not, don’t worry — come and join us for some endeavors in computer science.您真的了解Big O嗎&#xff1f; 如果是這樣&#xf…

leetcode 523. 連續的子數組和

給你一個整數數組 nums 和一個整數 k &#xff0c;編寫一個函數來判斷該數組是否含有同時滿足下述條件的連續子數組&#xff1a; 子數組大小 至少為 2 &#xff0c;且 子數組元素總和為 k 的倍數。 如果存在&#xff0c;返回 true &#xff1b;否則&#xff0c;返回 false 。 …