Multicollinearity Impacts Your Data Science Project More Than You Know

Multicollinearity is likely far down on your mental list of things to check for, if it makes the list at all. It appears, however, in almost every real-life dataset, so it’s important to know how to address it.

As its name suggests, multicollinearity occurs when two (or more) features are correlated with each other, or ‘collinear’. This happens often in real datasets, since one measurement (e.g. family income) may be correlated with another (e.g. school performance). You may be unaware that many algorithms and analysis methods rely on the assumption of no multicollinearity.

Let’s take this dataset as an example; it is used to predict a student’s chance of admission from a variety of factors.

[Figure: preview of the graduate-admissions dataset]

We want to arrive at a statement like “performing research increases your chance of admission by x percent” or “each additional point on the TOEFL increases your chance of admission by y percent”. The first thought is to train a linear regression model and interpret the coefficients.

A multiple regression model achieves a mean absolute error of about 4.5 percentage points, which is fairly accurate. The coefficients are interesting to analyze, and we can make a statement like ‘each point on the GRE increases your chance of admission by 0.2%, whereas each point on your CGPA increases it by 11.4%.’

[Figure: coefficients of the full regression model]

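As a rough sketch of what this step might look like (the file and column names, such as ‘Admission_Predict.csv’ and ‘Chance of Admit’, are assumptions modeled on the Kaggle graduate-admissions dataset, not confirmed by the original):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical file and column names modeled on the Kaggle admissions data.
df = pd.read_csv("Admission_Predict.csv")
X = df.drop(columns=["Chance of Admit"])  # assumed target column
y = df["Chance of Admit"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:+.4f}")
```
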
Let’s take a look at the correlation matrix to identify which features are correlated with each other. This dataset is full of highly correlated features in general, but CGPA in particular is heavily correlated with the other features.

[Figure: correlation matrix of the features]

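A sketch of producing such a matrix with pandas, under the same assumed file name as above:

```python
import pandas as pd

df = pd.read_csv("Admission_Predict.csv")  # assumed file name, as above
corr = df.corr(numeric_only=True)
print(corr.round(2))

# Surface strongly correlated feature pairs; the 0.8 cutoff is a judgment call.
pairs = corr.abs().stack()
print(pairs[(pairs > 0.8) & (pairs < 1.0)])
```
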
Since the TOEFL score is highly correlated with the GRE score, let’s remove this feature and retrain the linear regression model. (Perhaps) Surprisingly, the mean absolute error decreases to 4.3 percentage points. The change in coefficients is also interesting: notably, the importance of the University Rating drops by almost half, the importance of Research doubles, and so on.

[Figure: coefficients after removing the TOEFL score]

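Under the same assumptions as the earlier sketch, dropping the hypothetical ‘TOEFL Score’ column and refitting might look like this:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("Admission_Predict.csv")            # assumed file name
y = df["Chance of Admit"]                            # assumed target column
X_reduced = df.drop(columns=["Chance of Admit", "TOEFL Score"])

X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print("MAE without TOEFL:", mean_absolute_error(y_test, model.predict(X_test)))
for name, coef in zip(X_reduced.columns, model.coef_):
    print(f"{name}: {coef:+.4f}")
```
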
We can take the following away from this:

  • The TOEFL score, like any other feature, can be thought of as having two components: information and noise.
  • Its information component was already represented in other variables (perhaps performing well on the GRE requires skills that also lead to performing well on the TOEFL), so it provided little new information.
  • It carried enough noise that keeping the feature for a minimal information gain was not worth the amount of noise it introduced into the model.

In other words, the TOEFL score was collinear with many of the other features. At a basic level, the model’s performance suffered; beyond that, more complex analyses of linear models, such as the interpretation of coefficients (which can be very insightful), need to be adjusted.

It’s worth exploring what the coefficients of regression models mean. For instance, if the coefficient for the GRE score is 0.2%, this means that, holding all other variables fixed, a one-point increase in GRE score translates to a 0.2% increase in the chance of admission. If we include the TOEFL score (and other highly correlated features), however, we can no longer assume that those variables remain fixed.

Hence, the coefficients go haywire and become completely uninterpretable, since there is massive information overlap. When such scenarios arise, modelling capability is limited as well. Because there is so much overlap, everything is amplified: if there is an error in one part, it is likely to propagate through the overlap to several other parts.

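A small synthetic demonstration of this (not from the original article): with two nearly identical features, ordinary least squares can assign large, offsetting coefficients that swing from one random sample to the next.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
for trial in range(3):
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.01, size=200)    # near-duplicate of x1
    y = 3 * x1 + rng.normal(scale=0.1, size=200)  # the true effect lives on x1
    coefs = LinearRegression().fit(np.column_stack([x1, x2]), y).coef_
    print(f"trial {trial}: coef_x1 = {coefs[0]:+.2f}, coef_x2 = {coefs[1]:+.2f}")
# The two coefficients can be huge and opposite-signed; only their sum is pinned down.
```
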
In general, it’s impractical to memorize whether each algorithm or technique works well with multicollinearity, but it is usually true that any model that treats features ‘the same’ (i.e., makes assumptions about feature relationships) or doesn’t measure information content is vulnerable to multicollinearity.

What does this mean?

Take, for instance, the decision tree, which is not vulnerable to multicollinearity because it explicitly measures information content (the reduction in entropy, i.e. information gain) and makes no other assumptions about relationships between features. If columns A and B are correlated with each other, the decision tree will simply choose one and discard the other (or place it very low in the tree). In this case, features are judged by their information content.

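A toy sketch of this behavior on synthetic data (not the admissions dataset), with one column deliberately duplicated:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a.copy()                       # exact duplicate of a
noise = rng.normal(size=500)       # an uninformative column
X = np.column_stack([a, b, noise])
y = 2 * a + rng.normal(scale=0.1, size=500)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
# Typically one of the duplicated columns absorbs nearly all the importance.
print(tree.feature_importances_)
```
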
On the other hand, K-Nearest Neighbors is affected by multicollinearity because it assumes every point can be represented as a coordinate in multidimensional space (e.g. (3, 2.5, 6.7, 9.8) in a training set with four features). It doesn’t measure information content and treats all features the same. Hence, one can imagine how data points of two highly correlated features would cluster together along a line, and how that would interfere with distances across dimensions.

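A tiny numeric illustration of that double-counting (the points here are made up):

```python
import numpy as np

p = np.array([1.0, 1.0])  # a feature plus an exact duplicate of it
q = np.array([3.0, 3.0])
print(np.linalg.norm(p - q))  # ~2.83: the duplicated axis counts twice
print(abs(p[0] - q[0]))       # 2.0: the distance if the feature appeared once
```
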
Principal Component Analysis is an unsupervised method, but we can still evaluate it along these criteria! The goal of PCA is to explicitly retain the variance, or structure (information), of the dataset while reducing its dimensionality, which is why it is not only generally immune to multicollinearity but is often used to reduce multicollinearity in datasets.

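A minimal sketch of PCA removing collinearity, on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=300)  # heavily correlated with x1
X = np.column_stack([x1, x2])

Z = PCA(n_components=2).fit_transform(X)
print(np.corrcoef(X.T)[0, 1])  # close to 1: the raw features are collinear
print(np.corrcoef(Z.T)[0, 1])  # close to 0: the components are uncorrelated
```
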
Most efficient solving methods for algorithms rely on matrix mathematics and linear algebra systems, essentially representations of high-dimensional spaces, which are easily thrown off by multicollinearity.

Common techniques like heavy one-hot encoding (dummy variables), in which categorical variables are represented as 0s and 1s, can also be damaging because the resulting columns form a perfectly linear relationship. Say we have three binary columns A, B, and C, indicating whether a row belongs to one of the three categories. The columns must sum to 1 in every row, and hence a perfectly linear relationship A+B+C=1 is established.

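The usual remedy is to drop one dummy column; a sketch with pandas (the ‘city’ column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"city": ["A", "B", "C", "A", "B"]})
full = pd.get_dummies(df["city"])                   # A + B + C = 1 in every row
safe = pd.get_dummies(df["city"], drop_first=True)  # keeps only B and C
print(full.sum(axis=1).unique())  # [1]: perfectly collinear with an intercept
print(safe.columns.tolist())      # ['B', 'C']
```
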
How can we identify multicollinearity?

  • Use VIF (Variance Inflation Factor) scores on a regression model to identify whether multicollinearity is present in your dataset (see the sketch after this list).
  • If the standard errors are unusually high, it may be an indicator that one error is being repeatedly propagated because of information overlap.
  • Large changes in parameters when adding or removing features indicate heavily duplicated information.
  • Create a correlation matrix. Features with correlation values consistently above 0.4 are indicators of multicollinearity.

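A sketch of computing VIF scores with statsmodels, under the same hypothetical admissions file as the earlier sketches:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

df = pd.read_csv("Admission_Predict.csv")               # assumed file name
X = add_constant(df.drop(columns=["Chance of Admit"]))  # assumed target column
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs.drop("const").round(1))  # values above ~5-10 commonly flag trouble
```
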
There are many solutions to multicollinearity:

  • Use an algorithm that is immune to multicollinearity if it is an inherent aspect of the data and other transformations are not feasible. Ridge regression, principal component regression, and partial least squares regression are all good regression alternatives (see the sketch after this list).
  • Use PCA to reduce the dimensionality of the dataset and retain only the components that matter for preserving the data’s structure. This is beneficial if the dataset is highly multicollinear overall.
  • Use a feature selection method to remove highly correlated features.
  • Obtain more data; this is the preferred method. More data allows the model to retain the current amount of information while giving context and perspective to the noise.

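A sketch of how ridge regression stabilizes the collinear case from the earlier synthetic demonstration (alpha=1.0 is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly collinear pair
y = 3 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

print("OLS:  ", LinearRegression().fit(X, y).coef_)  # unstable, offsetting values
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)    # shrunk and split ~evenly
```
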
Translated from: https://medium.com/@andre_ye/multicollinearity-impacts-your-data-science-project-more-than-you-know-8504efd706f
