How to measure the “non-linear correlation” between multiple variables?

Data Science in the Real World

This article aims to present two ways of calculating non-linear correlation between any number of discrete variables. The objective for a data analysis project is twofold: on the one hand, to know the amount of information the variables share with each other, and therefore to identify whether the data available contain the information one is looking for; and on the other hand, to identify which minimal set of variables contains the largest amount of useful information.

The different types of relationships between variables

Linearity

The best-known relationship between several variables is the linear one. This is the type of relationship measured by the classical correlation coefficient: the closer it is, in absolute value, to 1, the more closely the variables are linked by an exact linear relationship.

However, there are plenty of other potential relationships between variables, which cannot be captured by the measurement of conventional linear correlation.

[Figure: correlation between X and Y is almost 0%]

To find such non-linear relationships between variables, other correlation measures should be used. The price to pay is working only with discrete, or discretized, variables.

In addition, having a method for calculating multivariate correlations makes it possible to account for the two main types of interaction that variables may present: relationships of information redundancy or complementarity.

Redundancy

When two variables (hereafter, X and Y) share information in a redundant manner, the amount of information provided by X and Y together to predict Z will be less than the sum of the amount of information provided by X to predict Z and the amount provided by Y to predict Z.

In the extreme case, X = Y. Then, if the values taken by Z can be correctly predicted by X (and by Y) 50% of the time, the variables X and Y together still cannot predict Z perfectly (i.e. 100% of the time).

╔═══╦═══╦═══╗
║ X ║ Y ║ Z ║
╠═══╬═══╬═══╣
║ 0 ║ 0 ║ 0 ║
║ 0 ║ 0 ║ 0 ║
║ 1 ║ 1 ║ 0 ║
║ 1 ║ 1 ║ 1 ║
╚═══╩═══╩═══╝
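The redundancy in the table above can be checked numerically. The sketch below is a minimal plain-Python illustration (my own helper functions, not code from the article), estimating entropies in nats from observed frequencies:

```python
from collections import Counter
from math import log

def entropy(*cols):
    # Joint Shannon entropy (in nats) of one or more aligned discrete
    # columns, estimated from observed frequencies.
    rows = list(zip(*cols))
    n = len(rows)
    return -sum((c / n) * log(c / n) for c in Counter(rows).values())

def mutual_info(descriptors, z):
    # I(D; Z) = H(D) + H(Z) - H(D, Z)
    return entropy(*descriptors) + entropy(z) - entropy(*descriptors, z)

X = [0, 0, 1, 1]
Y = [0, 0, 1, 1]   # identical to X: the information is fully redundant
Z = [0, 0, 0, 1]

together = mutual_info([X, Y], Z)
separately = mutual_info([X], Z) + mutual_info([Y], Z)
print(together < separately)   # redundancy: joint info < sum of individual infos
```

On this table, X and Y together carry no more information about Z than X alone does, so the joint quantity is strictly smaller than the sum of the individual ones.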

Complementarity

The complementarity relationship is the exact opposite situation. In the extreme case, X provides no information about Z, and neither does Y, but the variables X and Y together allow us to predict perfectly the values taken by Z. In such a case, the correlation between X and Z is zero, as is the correlation between Y and Z, but the correlation between X, Y and Z is 100%.

These complementarity relationships only occur in the case of non-linear relationships, and must therefore be taken into account to avoid errors when trying to reduce the dimensionality of a data analysis problem: discarding X and Y because, considered independently, they provide no information about Z would be a bad idea.

╔═══╦═══╦═══╗
║ X ║ Y ║ Z ║
╠═══╬═══╬═══╣
║ 0 ║ 0 ║ 0 ║
║ 0 ║ 1 ║ 1 ║
║ 1 ║ 0 ║ 1 ║
║ 1 ║ 1 ║ 0 ║
╚═══╩═══╩═══╝
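This XOR-like table is easy to verify with a few lines of code. In the sketch below (frequency-based entropy estimates in plain Python, my own helpers rather than the article's code), X alone and Y alone carry zero information about Z, while together they determine it completely:

```python
from collections import Counter
from math import log, isclose

def entropy(*cols):
    # Joint Shannon entropy (in nats) of aligned discrete columns.
    rows = list(zip(*cols))
    n = len(rows)
    return -sum((c / n) * log(c / n) for c in Counter(rows).values())

def mutual_info(descriptors, z):
    # I(D; Z) = H(D) + H(Z) - H(D, Z)
    return entropy(*descriptors) + entropy(z) - entropy(*descriptors, z)

X = [0, 0, 1, 1]
Y = [0, 1, 0, 1]
Z = [0, 1, 1, 0]   # Z = X XOR Y

print(isclose(mutual_info([X], Z), 0.0, abs_tol=1e-12))   # X alone: nothing
print(isclose(mutual_info([Y], Z), 0.0, abs_tol=1e-12))   # Y alone: nothing
print(isclose(mutual_info([X, Y], Z), entropy(Z)))        # together: everything
```

All three checks print True: the pair (X, Y) shares exactly H(Z) of information with Z, i.e. 100% of its entropy, even though each variable alone shares none.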

Two possible measures of “multivariate non-linear correlation”

There is a significant number of possible measures of (multivariate) non-linear correlation (e.g. multivariate mutual information, the maximal information coefficient (MIC), etc.). I present here two of them whose properties, in my opinion, match exactly what one would expect from such measures. The only caveats are that they require discrete variables and are very computationally intensive.

Symmetric measure

The first one is a measure of the information shared by n variables V1, …, Vn, known (among other names) as the “dual total correlation”.

This measure of the information shared by different variables can be characterized, following the standard definition of the dual total correlation, as:

D(V1, …, Vn) = H(V1, …, Vn) − Σ_i H(Vi | V1, …, V(i−1), V(i+1), …, Vn)

where H(V) expresses the entropy of variable V.

When normalized by the joint entropy H(V1, …, Vn), this “mutual information score” takes values ranging from 0% (meaning that the n variables are not at all similar) to 100% (meaning that the n variables are identical, up to their labels).
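As a sketch (my own plain-Python implementation of the standard dual total correlation definition, not code from the article), the score can be computed directly from frequency-based entropy estimates:

```python
from collections import Counter
from math import log

def entropy(*cols):
    # Joint Shannon entropy (in nats) of aligned discrete columns.
    rows = list(zip(*cols))
    n = len(rows)
    return -sum((c / n) * log(c / n) for c in Counter(rows).values())

def dual_total_correlation(*cols):
    # D(V1..Vn) = H(V1..Vn) - sum_i H(Vi | all the other variables)
    h_joint = entropy(*cols)
    return h_joint - sum(h_joint - entropy(*(cols[:i] + cols[i + 1:]))
                         for i in range(len(cols)))

def mutual_information_score(*cols):
    # Normalized by the joint entropy: 0 (unrelated) to 1 (identical).
    return dual_total_correlation(*cols) / entropy(*cols)

A = [0, 1, 0, 1]
B = [0, 0, 1, 1]
print(mutual_information_score(A, A))   # identical variables: score 1.0
print(mutual_information_score(A, B))   # independent variables: score ~0.0
```

The two extreme cases behave as the text describes: duplicating a variable gives a score of 100%, and pairing two variables that are independent in the sample gives a score of 0%.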

This measure is symmetric, because the information shared by X and Y is exactly the same as the information shared by Y and X.

[Figure: Venn diagram of the joint entropy of V1, V2 and V3]

The Venn diagram above shows the “variability” (entropy) of the variables V1, V2 and V3 as circles. The shaded area represents the entropy shared by the three variables: it is the dual total correlation.

Asymmetric measure

The symmetry property of the usual correlation measures is sometimes criticized. Indeed, if I want to predict Y as a function of X, I do not care whether X and Y have little information in common: all I care about is that the variable X contains all the information needed to predict Y, even if Y gives very little information about X. For example, if X takes animal species as values and Y takes animal families, then X easily allows us to determine Y, but Y gives little information about X:

╔═════════════════════════════╦══════════════════════════════╗
║ Animal species (variable X) ║ Animal families (variable Y) ║
╠═════════════════════════════╬══════════════════════════════╣
║ Tiger                       ║ Feline                       ║
║ Lynx                        ║ Feline                       ║
║ Serval                      ║ Feline                       ║
║ Cat                         ║ Feline                       ║
║ Jackal                      ║ Canid                        ║
║ Dhole                       ║ Canid                        ║
║ Wild dog                    ║ Canid                        ║
║ Dog                         ║ Canid                        ║
╚═════════════════════════════╩══════════════════════════════╝

The “information score” of X for predicting Y should then be 100%, while the “information score” of Y for predicting X will be, for example, only 10%.

In plain terms, if the variables D1, …, Dn are descriptors, and the variables T1, …, Tn are target variables (to be predicted from the descriptors), then such an information score is given by the following formula:

[Formula image: the prediction score, expressed in terms of entropies]

where H(V) expresses the entropy of variable V.

This “prediction score” also ranges from 0% (the descriptors predict nothing about the target variables) to 100% (the descriptors perfectly predict the target variables). This score is, to my knowledge, completely new.
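The article does not spell the formula out in the text, but its description of the score (the share of the targets' entropy that the descriptors account for) is consistent with normalizing the shared entropy I(D; T) by the targets' entropy H(T). The sketch below is built on that assumption, which is my reading rather than necessarily the author's exact definition, and applies it to the animal example above:

```python
from collections import Counter
from math import log

def entropy(*cols):
    # Joint Shannon entropy (in nats) of aligned discrete columns.
    rows = list(zip(*cols))
    n = len(rows)
    return -sum((c / n) * log(c / n) for c in Counter(rows).values())

def prediction_score(descriptors, targets):
    # Assumed form: I(D; T) / H(T), the share of the targets' entropy
    # that the descriptors account for. Asymmetric by construction.
    h_t = entropy(*targets)
    shared = entropy(*descriptors) + h_t - entropy(*descriptors, *targets)
    return shared / h_t

species = ["tiger", "lynx", "serval", "cat", "jackal", "dhole", "wild dog", "dog"]
family = ["feline"] * 4 + ["canid"] * 4

print(prediction_score([species], [family]))   # 1.0: species fully determine the family
print(prediction_score([family], [species]))   # ~0.33: the family reveals little
```

Under this reading the asymmetry shows up exactly as in the table: the species predict the family with a score of 100%, while the family predicts the species with a score of only about 33%.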

[Figure: Venn diagram showing the share of the entropy of D1 and D2 useful to predict T1]

The shaded area in the above diagram represents the entropy shared by the descriptors D1 and D2 with the target variable T1. The difference from the dual total correlation is that the information shared by the descriptors but unrelated to the target variable is not taken into account.

Computation of the information scores in practice

A direct method to calculate the two scores presented above is based on estimating the entropies of the different variables, or groups of variables.

In R, the entropy function of the 'infotheo' package gives us exactly what we need. Computing the joint entropy of three variables V1, V2 and V3 is very simple:

library(infotheo)

df <- data.frame(V1 = c(0,0,1,1,0,0,1,0,1,1),
                 V2 = c(0,1,0,1,0,1,1,0,1,0),
                 V3 = c(0,1,1,0,0,0,1,1,0,1))

entropy(df)
[1] 1.886697

Computing the joint entropy of several variables in Python requires some additional work. The BIOLAB contributor, on the blog of the Orange software, suggests the following function:

import numpy as np
import itertools
from functools import reduce

def entropy(*X):
    entropy = sum(
        -p * np.log(p) if p > 0 else 0
        for p in (
            np.mean(reduce(np.logical_and,
                           (predictions == c for predictions, c in zip(X, classes))))
            for classes in itertools.product(*[set(x) for x in X])
        )
    )
    return entropy

V1 = np.array([0,0,1,1,0,0,1,0,1,1])
V2 = np.array([0,1,0,1,0,1,1,0,1,0])
V3 = np.array([0,1,1,0,0,0,1,1,0,1])

entropy(V1, V2, V3)
1.8866967846580784

In each case, the entropy is given in nats, the “natural unit of information”.

For a high number of dimensions, the information scores are no longer computable, as the entropy calculation becomes too computationally intensive and time-consuming. Also, it is not desirable to calculate information scores when the number of samples is not large enough relative to the number of dimensions, because the information score then “overfits” the data, just like a classical machine learning model. For instance, if only two samples are available for two variables X and Y, a linear regression will obtain a “perfect” result:

╔════╦═════╗
║ X  ║ Y   ║
╠════╬═════╣
║ 0  ║ 317 ║
║ 10 ║ 40  ║
╚════╩═════╝

[Figure: basic example of overfitting]

Similarly, let's imagine that I take temperature measurements over time, making sure to note the time of day of each measurement. I can then try to explore the relationship between the time of day and the temperature. If the number of samples I have is too small relative to the number of dimensions of the problem, chances are high that the information scores will overestimate the relationship between the two variables:

╔══════════════════╦════════════════╗
║ Temperature (°C) ║ Hour (0 to 24) ║
╠══════════════════╬════════════════╣
║ 23               ║ 10             ║
║ 27               ║ 15             ║
╚══════════════════╩════════════════╝

In the above example, and based on the only observations available, the two variables appear to be in perfect bijection: the information scores will be 100%.

It should therefore be remembered that information scores, like machine learning models, are capable of “overfitting”, much more so than linear correlation, since linear models are by nature limited in complexity.

Example of prediction score use

The Titanic dataset contains information about 887 passengers from the Titanic who were on board when the ship collided with an iceberg: the price they paid for boarding (Fare), their class (Pclass), their name (Name), their gender (Sex), their age (Age), the number of their relatives on board (Parents/Children Aboard and Siblings/Spouses Aboard) and whether they survived or not (Survived).

This dataset is typically used to determine the probability that a person had of surviving, or more simply to “predict” whether the person survived, by means of the individual data available (excluding the Survived variable).

So, for different possible combinations of the descriptors, I calculated the prediction score with respect to the Survived variable. I removed the nominative data (otherwise the prediction score would be 100% because of overfitting) and discretized the continuous variables. Some results are presented below:

[Figure: table of prediction scores for various descriptor combinations. Purely illustrative example: results depend on the discretization method]
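A table like this can be produced by looping over descriptor combinations. The sketch below uses toy stand-in rows (hand-made values, not the real Titanic data) and a frequency-based score I(D; T) / H(T), which is one plausible reading of the article's prediction score rather than its confirmed definition:

```python
from collections import Counter
from math import log

def entropy(*cols):
    # Joint Shannon entropy (in nats) of aligned discrete columns.
    rows = list(zip(*cols))
    n = len(rows)
    return -sum((c / n) * log(c / n) for c in Counter(rows).values())

def prediction_score(descriptors, target):
    # One plausible reading of the prediction score: I(D; T) / H(T).
    h_t = entropy(target)
    return (entropy(*descriptors) + h_t - entropy(*descriptors, target)) / h_t

# Toy stand-in rows (already discretized), NOT the real Titanic data.
sex      = ["f", "f", "m", "m", "m", "f", "m", "m"]
pclass   = [1, 2, 3, 1, 3, 3, 2, 3]
fare_bin = ["high", "mid", "low", "high", "low", "low", "mid", "low"]
survived = [1, 0, 0, 0, 1, 1, 0, 0]

for name, cols in [("Sex", [sex]),
                   ("Pclass", [pclass]),
                   ("Sex + Pclass + Fare", [sex, pclass, fare_bin])]:
    print(f"{name}: {prediction_score(cols, survived):.0%}")
```

Adding descriptors can never decrease this score (mutual information is monotone under adding variables), which is also why too many descriptors relative to the number of samples pushes the score toward the overfitted 100%.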

The first row of the table gives the prediction score when all the descriptors are used to predict the target variable: since this score is above 80%, it is clear that the available data enable us to predict the target variable Survived with “good precision”.

Cases of information redundancy can also be observed: the variables Fare, Pclass and Sex together are 41% correlated with the Survived variable, while the sum of the individual correlations amounts to 43% (11% + 9% + 23%).

There are also cases of complementarity: the variables Age, Fare and Sex together are almost 70% correlated with the Survived variable, while the sum of their individual correlations is not even 40% (3% + 11% + 23%).

Finally, if one wishes to reduce the dimensionality of the problem and find a “sufficiently good” model using as few variables as possible, it is better to use the three variables Age, Fare and Sex (prediction score of 69%) than the variables Fare, Parents/Children Aboard, Pclass and Siblings/Spouses Aboard (prediction score of 33%). This yields twice as much useful information with one variable fewer.

Calculating the prediction score can therefore be very useful in a data analysis project: it helps ensure that the available data contain sufficient relevant information, and it identifies the variables that are most important for the analysis.

Translated from: https://medium.com/@gdelongeaux/how-to-measure-the-non-linear-correlation-between-multiple-variables-804d896760b8
