EDA on Categorical and Ordinal Data

Statistics for Data Science and Machine Learning

Categorical variables are those whose possible values come from a set of options, which can be predefined or open-ended. An example is a person's gender. Ordinal variables are categorical variables whose options can be ordered by some rule, as in the Likert scale:

  • Like
  • Like Somewhat
  • Neutral
  • Dislike Somewhat
  • Dislike

To keep the examples simple, we will use a single data set throughout: a group of students who have each passed or failed two distinct exams. The results are shown in the following R×C (rows by columns) table:

[Figure: The example used in the whole article, self-generated.]

Statisticians have developed specific techniques to analyze this kind of data. The most important are:

Measures of Agreement

Percent Agreement

Calculated as the number of cases where the ratings fall in a given class, divided by the total number of ratings.

[Figure: The example with row and column totals added, self-generated.]
  • The percent agreement for passing exam 2 is 25/(25+60) = 0.294, so 29.4%.
  • The percent agreement for passing exam 1 is 30/85 = 0.353, so 35.3%.
  • The percent agreement for passing exam 1 and not passing exam 2 is 10/85 = 0.118, so 11.8%.
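The percentages above can be reproduced with a few lines of Python. This is only a sketch: the table image is not available here, so the four cell counts are an assumption inferred from the figures quoted in the text (85 students, 30 passing exam 1, 25 passing exam 2, 10 passing exam 1 but not exam 2).

```python
# Cell counts inferred from the figures quoted in the text (an assumption).
# Keys are (exam 1 result, exam 2 result).
counts = {
    ("pass", "pass"): 20,
    ("pass", "fail"): 10,
    ("fail", "pass"): 5,
    ("fail", "fail"): 50,
}
total = sum(counts.values())  # 85 students

# Marginal totals: students passing each exam, regardless of the other one.
pass_exam1 = sum(n for (e1, _), n in counts.items() if e1 == "pass")
pass_exam2 = sum(n for (_, e2), n in counts.items() if e2 == "pass")

pct_pass_exam2 = pass_exam2 / total
pct_pass_exam1 = pass_exam1 / total
pct_pass1_fail2 = counts[("pass", "fail")] / total

print(f"Pass exam 2: {pct_pass_exam2:.1%}")
print(f"Pass exam 1: {pct_pass_exam1:.1%}")
print(f"Pass exam 1, fail exam 2: {pct_pass1_fail2:.1%}")
```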

The problem with percent agreement is that it does not account for agreement that could arise purely by chance.

Cohen's Kappa

[Figure: The example used in the whole article, self-generated.]

To overcome this problem with percent agreement, we calculate Kappa as:

[Figure: Cohen's Kappa formula, self-generated.]

where P0 is the observed agreement and Pe is the agreement expected by chance, calculated as:

[Figure: P0 and Pe formulas, self-generated.]
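The formula images are not reproduced in this copy. The standard definitions, which match the worked numbers in the example, can be written as follows. The cell labels are an assumption introduced for illustration: a = pass both exams, b = pass exam 1 only, c = pass exam 2 only, d = fail both, and N = a + b + c + d.

```latex
\kappa = \frac{P_0 - P_e}{1 - P_e}, \qquad
P_0 = \frac{a + d}{N}, \qquad
P_e = \frac{(a + b)(a + c) + (c + d)(b + d)}{N^2}
```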

In our example:

  • P0 = 70/85 = 0.82
  • Pe = (30 × 25) / 85² + (55 × 60) / 85² = 0.56
  • K = (0.82 − 0.56) / (1 − 0.56) = 0.26 / 0.44 = 0.59
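As a rough check of the arithmetic, the calculation can be sketched in Python. The function name `cohens_kappa` and the 2×2 cell counts are assumptions for illustration; the counts are inferred from the totals quoted in the text, not taken from the original table image.

```python
def cohens_kappa(table):
    """Return (p0, pe, kappa) for a square table of counts.

    Rows are the categories of the first variable and columns the
    categories of the second, listed in the same order.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: share of cases on the main diagonal.
    p0 = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: products of matching marginals over n squared.
    pe = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    return p0, pe, (p0 - pe) / (1 - pe)


# Rows: exam 1 (pass, fail); columns: exam 2 (pass, fail). Assumed counts.
table = [[20, 10], [5, 50]]
p0, pe, kappa = cohens_kappa(table)
print(f"P0 = {p0:.2f}, Pe = {pe:.2f}, kappa = {kappa:.2f}")
```

Without intermediate rounding the result is about 0.60, slightly above the 0.59 obtained from the rounded values 0.26 and 0.44.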

Kappa takes values in the range [−1, 1]: 0 means that the observed agreement is the same as the chance agreement, 1 means that all cases are in agreement, and negative values, down to a minimum of −1, mean that the observed agreement is worse than chance.

The Chi-Squared Distribution

To do hypothesis testing with categorical variables we need distributions suited to this kind of data. The most common is the chi-squared distribution, a continuous theoretical probability distribution.

This distribution has a single parameter, k, the degrees of freedom. As k approaches infinity, the chi-squared distribution approaches the normal distribution.

Chi-Squared Test

This test is used to check whether two categorical variables are independent. We will use the same example to explain how to calculate it.

First, we define the hypotheses we want to test. In our case, we want to check whether passing exam 1 and passing exam 2 are independent, so:

  • H0: passing exam 1 and passing exam 2 are independent.
  • Ha: passing exam 1 and passing exam 2 are dependent.

This test relies on the difference between the observed and the expected values. To calculate the expected values (what we would expect to find if the two variables were independent), we use:

[Figure: Expected values formula, self-generated.]
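The formula image is not reproduced in this copy. The usual expression for the expected count of a cell under independence is shown below. The symbols are assumptions for illustration: R_i is the total of row i, C_j the total of column j, and N the overall number of cases.

```latex
E_{ij} = \frac{R_i \, C_j}{N}
```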

To simplify the calculations we first compute the marginals: the row and column sums that we already calculated in the second table of this post. The expected values are then calculated as:

[Figure: Expected values calculation for our example, self-generated.]

Now we have everything we need to apply the chi-squared formula:

[Figure: The chi-squared formula, self-generated.]
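The formula image is not reproduced in this copy. The standard Pearson chi-squared statistic is shown below, where O_ij and E_ij (assumed notation) are the observed and expected counts of the cell in row i and column j.

```latex
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
```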

The summation symbol means that we evaluate the formula for every combination of the values of our variables (4 in our case) and add up the results:

[Figure: Results for each term of the sum, self-generated.]

The final value is the sum of all four terms, 26.96. We now compare this result with the tabulated critical values of the chi-squared distribution, for which we need the degrees of freedom. They are calculated as (number of rows − 1) × (number of columns − 1), so in our case there is 1 degree of freedom.

According to a chi-squared table (easy to find with a web search, and available as a function in the statistical packages of most languages), the critical value for α = 0.05 with 1 degree of freedom is 3.841. Our result is much larger, so we reject the null hypothesis and conclude that passing exam 1 and passing exam 2 are dependent.
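The whole procedure (marginals, expected values, statistic, degrees of freedom, comparison with the critical value) can be sketched in plain Python. The cell counts are an assumption inferred from the totals quoted in the text, and the critical value 3.841 is hard-coded from the table lookup described above rather than computed.

```python
# Rows: exam 1 (pass, fail); columns: exam 2 (pass, fail). Assumed counts.
observed = [[20, 10], [5, 50]]

# Marginals: sums per row and per column, plus the grand total.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count of each cell if the two variables were independent.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic: sum over all cells of (O - E)^2 / E.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Critical value for alpha = 0.05 and 1 degree of freedom (from a table).
CRITICAL_VALUE = 3.841
reject_h0 = chi2 > CRITICAL_VALUE

print(f"chi2 = {chi2:.2f}, dof = {dof}, reject H0: {reject_h0}")
```

With the counts assumed here the statistic comes out at about 31, somewhat higher than the 26.96 quoted above, possibly because of rounding or a different table in the original figure. Either value is far above 3.841, so the conclusion is the same. Library routines such as `scipy.stats.chi2_contingency` can also be used; note that it applies Yates' continuity correction to 2×2 tables by default, which gives a smaller statistic.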

Correlation statistics for categorical data

Because the Pearson correlation requires variables measured on at least an interval scale, we need different measures for binary and ordinal variables. Let's introduce them:

Binary Variables

Phi is a measure of the degree of association between two binary variables. Based on the table introduced in the Cohen's Kappa section, it is calculated as:

[Figure: Formulas to calculate the phi statistic, self-generated.]
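The formula image is not reproduced in this copy. The two standard forms of the phi coefficient for a 2×2 table are shown below. The cell labels a, b, c, d (first row a, b; second row c, d) are an assumption for illustration, and n is the total number of cases.

```latex
\phi = \frac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}}, \qquad
\phi = \sqrt{\frac{\chi^2}{n}}
```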

Using the second formula, in our example, Φ = (26.96/85)^(1/2) ≈ 0.56.

Notice that the first formula can produce negative values, whereas the second can only produce non-negative ones. Here we do not care about the direction of the association, so we only analyze the absolute value.

If the data are evenly distributed (a 50–50 split), phi can reach a value of 1; otherwise its maximum attainable value is lower. In our case, a value of about 0.56 indicates a fairly strong association between the two exams, consistent with the result of the chi-squared test.
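Both forms of phi can be compared with a short Python sketch. The cell counts a, b, c, d are an assumption inferred from the totals quoted in the text, and the chi-squared value is recomputed from those counts rather than taken from the article.

```python
import math

# Assumed 2x2 counts: a = pass both, b = pass exam 1 only,
# c = pass exam 2 only, d = fail both.
a, b, c, d = 20, 10, 5, 50
n = a + b + c + d

# Product of the four marginal totals, shared by both formulas.
marginal_product = (a + b) * (c + d) * (a + c) * (b + d)

# First form: signed, so it also carries the direction of the association.
phi_signed = (a * d - b * c) / math.sqrt(marginal_product)

# Second form: derived from the chi-squared statistic, always non-negative.
chi2 = n * (a * d - b * c) ** 2 / marginal_product
phi_from_chi2 = math.sqrt(chi2 / n)

print(f"phi (signed) = {phi_signed:.3f}")
print(f"phi (from chi2) = {phi_from_chi2:.3f}")
```

With these counts both forms agree in absolute value, at roughly 0.60. Plugging the quoted χ² of 26.96 into the second form instead gives about 0.56.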

The Point-Biserial Correlation

It measures the correlation between a dichotomous variable and a continuous variable. The formula is:

[Figure: Point-biserial correlation formula, self-generated.]
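The formula image is not reproduced in this copy. A standard form of the point-biserial coefficient, consistent with the worked calculation in this section, is:

```latex
r_{pb} = \frac{\bar{x}_1 - \bar{x}_2}{s_x} \sqrt{p\,(1 - p)}
```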

Where:

  • x̄1 = mean of the continuous variable for group 1
  • x̄2 = mean of the continuous variable for group 2
  • p = proportion of class 1 in the dichotomous variable
  • s_x = standard deviation of the continuous variable

Continuing our example, suppose we obtain the following values when comparing the exam 1 variable with the number of hours studied:

  • x̄ pass = 5.5
  • x̄ not pass = 3.1
  • p = 20/25 = 0.8
  • s_x = 2

With these values we obtain (5.5 − 3.1) × (0.8 × 0.2)^(1/2) / 2 = 2.4 × 0.4 / 2 = 0.48, indicating a moderate relationship between our variables.
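The same arithmetic as a small Python sketch. The function name `point_biserial` is hypothetical, and the inputs are the supposed summary values from the example, not values computed from raw data.

```python
import math


def point_biserial(mean_1, mean_2, p, s_x):
    """Point-biserial correlation from summary statistics.

    mean_1, mean_2: means of the continuous variable in each group
    p: proportion of cases in group 1
    s_x: standard deviation of the continuous variable (all cases)
    """
    return (mean_1 - mean_2) / s_x * math.sqrt(p * (1 - p))


# Supposed values from the example: hours studied vs. passing exam 1.
r_pb = point_biserial(mean_1=5.5, mean_2=3.1, p=0.8, s_x=2.0)
print(f"r_pb = {r_pb:.2f}")
```

When working from raw data, this form matches the Pearson correlation between the 0/1 variable and the continuous one if s_x is the population standard deviation (dividing by n rather than n − 1).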

Ordinal Variables

The most widely used correlation coefficient for ordinal variables is Spearman's rank-order coefficient, usually called Spearman's r (also known as Spearman's rho).

[Figure: Spearman's r correlation coefficient for ordinal variables, self-generated.]
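The formula image is not reproduced in this copy. The usual simplified form of Spearman's coefficient, valid when there are no tied ranks, is:

```latex
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n\,(n^2 - 1)}
```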

where d_i is the difference between the ranks of the two variables for individual i, and n is the sample size.
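A minimal sketch of the calculation in Python, using the simplified formula that assumes no tied values. The data are made up for illustration (two hypothetical sets of scores for five individuals), and the helper names `ranks` and `spearman` are not from the original post.

```python
def ranks(values):
    """Return the rank (1 = smallest) of each value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's coefficient via 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    # d_i is the difference between the two ranks of individual i.
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n**2 - 1))


# Hypothetical scores given to five individuals on two ordinal scales.
scores_a = [10, 20, 30, 40, 50]
scores_b = [15, 12, 48, 35, 60]
r_s = spearman(scores_a, scores_b)
print(f"Spearman's r = {r_s:.2f}")
```

Ordinal data such as Likert answers usually contain many ties, in which case this simplified formula is only approximate. A common alternative is to assign average ranks and compute the Pearson correlation of the ranks, which is what routines such as `scipy.stats.spearmanr` do.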

Summary

In data science we often draw scatter plots of binary, categorical, or ordinal variables, or use them to color other plots. But when we calculate correlations it is easy to skip these variables, because the built-in functions in pandas (Python) or dplyr (R) do not use them.

In this post, we showed how to analyze these variables' distribution and their correlation with all the other variables.


This is the tenth post of my #100daysofML challenge. I will be publishing my progress on this challenge on GitHub, Twitter, and Medium (Adrià Serra).

https://twitter.com/CrunchyML


https://github.com/CrunchyPistacho/100DaysOfML


Translated from: https://medium.com/ai-in-plain-english/eda-on-categorical-and-ordinal-data-22f8a4407836
