Multicollinearity Impacts Your Data Science Project More Than You Know

Multicollinearity is likely far down on your mental list of things to check for, if it makes the list at all. It appears, however, in almost every real-life dataset, so it’s important to know how to address it.

As its name suggests, multicollinearity occurs when two (or more) features are correlated with each other, or ‘collinear’. This happens often in real datasets, since one measurement (e.g. family income) may be correlated with another (e.g. school performance). You may be unaware that many algorithms and analysis methods rely on the assumption of no multicollinearity.

Let’s take this dataset as an example; it is used to predict a student’s chance of admission from a variety of factors.

[Figure: preview of the graduate-admissions dataset]

We want to arrive at a statement like “performing research increases your chance of admission by x percent” or “each additional point on the TOEFL increases your chance of admission by y percent”. The first thought is to train a linear regression model and interpret the coefficients.

A multiple regression model achieves a mean absolute error of about 4.5 percentage points, which is fairly accurate. The coefficients are interesting to analyze, and we can make a statement like ‘each point on the GRE increases your chance of admission by 0.2%, whereas each point on your CGPA increases it by 11.4%.’

[Figure: coefficients of the full regression model]

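As a rough sketch of what this step might look like (the file and column names, such as ‘Admission_Predict.csv’ and ‘Chance of Admit’, are assumptions modeled on the Kaggle graduate-admissions dataset, not confirmed by the original):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical file and column names modeled on the Kaggle admissions data.
df = pd.read_csv("Admission_Predict.csv")
X = df.drop(columns=["Chance of Admit"])  # assumed target column
y = df["Chance of Admit"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:+.4f}")
```
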
Let’s take a look at the correlation matrix to identify which features are correlated with each other. This dataset is full of highly correlated features in general, but CGPA in particular is heavily correlated with the other features.

[Figure: correlation matrix of the features]

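A sketch of producing such a matrix with pandas, under the same assumed file name as above:

```python
import pandas as pd

df = pd.read_csv("Admission_Predict.csv")  # assumed file name, as above
corr = df.corr(numeric_only=True)
print(corr.round(2))

# Surface strongly correlated feature pairs; the 0.8 cutoff is a judgment call.
pairs = corr.abs().stack()
print(pairs[(pairs > 0.8) & (pairs < 1.0)])
```
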
Since the TOEFL score is highly correlated with the GRE score, let’s remove this feature and retrain the linear regression model. (Perhaps) Surprisingly, the mean absolute error decreases to 4.3 percentage points. The change in coefficients is also interesting: notably, the importance of the University Rating drops by almost half, the importance of Research doubles, and so on.

[Figure: coefficients after removing the TOEFL score]

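Under the same assumptions as the earlier sketch, dropping the hypothetical ‘TOEFL Score’ column and refitting might look like this:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("Admission_Predict.csv")            # assumed file name
y = df["Chance of Admit"]                            # assumed target column
X_reduced = df.drop(columns=["Chance of Admit", "TOEFL Score"])

X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print("MAE without TOEFL:", mean_absolute_error(y_test, model.predict(X_test)))
for name, coef in zip(X_reduced.columns, model.coef_):
    print(f"{name}: {coef:+.4f}")
```
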
We can take the following away from this:

  • The TOEFL score, like any other feature, can be thought of as having two components: information and noise.
  • Its information component was already represented in other variables (perhaps performing well on the GRE requires skills that also lead to performing well on the TOEFL), so it provided little new information.
  • It carried enough noise that keeping the feature for a minimal information gain was not worth the amount of noise it introduced into the model.

In other words, the TOEFL score was collinear with many of the other features. At a basic level, the model’s performance suffered; beyond that, more complex analyses of linear models, such as the interpretation of coefficients (which can be very insightful), need to be adjusted.

It’s worth exploring what the coefficients of regression models mean. For instance, if the coefficient for the GRE score is 0.2%, this means that, holding all other variables fixed, a one-point increase in GRE score translates to a 0.2% increase in the chance of admission. If we include the TOEFL score (and other highly correlated features), however, we can no longer assume that those variables remain fixed.

Hence, the coefficients go haywire and become completely uninterpretable, since there is massive information overlap. When such scenarios arise, modelling capability is limited as well. Because there is so much overlap, everything is amplified: if there is an error in one part, it is likely to propagate through the overlap to several other parts.

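A small synthetic demonstration of this (not from the original article): with two nearly identical features, ordinary least squares can assign large, offsetting coefficients that swing from one random sample to the next.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
for trial in range(3):
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.01, size=200)    # near-duplicate of x1
    y = 3 * x1 + rng.normal(scale=0.1, size=200)  # the true effect lives on x1
    coefs = LinearRegression().fit(np.column_stack([x1, x2]), y).coef_
    print(f"trial {trial}: coef_x1 = {coefs[0]:+.2f}, coef_x2 = {coefs[1]:+.2f}")
# The two coefficients can be huge and opposite-signed; only their sum is pinned down.
```
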
In general, it’s impractical to memorize whether each algorithm or technique works well with multicollinearity, but it is usually true that any model that treats features ‘the same’ (i.e., makes assumptions about feature relationships) or doesn’t measure information content is vulnerable to multicollinearity.

What does this mean?

Take, for instance, the decision tree, which is not vulnerable to multicollinearity because it explicitly measures information content (the reduction in entropy, i.e. information gain) and makes no other assumptions about relationships between features. If columns A and B are correlated with each other, the decision tree will simply choose one and discard the other (or place it very low in the tree). In this case, features are judged by their information content.

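A toy sketch of this behavior on synthetic data (not the admissions dataset), with one column deliberately duplicated:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a.copy()                       # exact duplicate of a
noise = rng.normal(size=500)       # an uninformative column
X = np.column_stack([a, b, noise])
y = 2 * a + rng.normal(scale=0.1, size=500)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
# Typically one of the duplicated columns absorbs nearly all the importance.
print(tree.feature_importances_)
```
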
On the other hand, K-Nearest Neighbors is affected by multicollinearity because it assumes every point can be represented as a coordinate in multidimensional space (e.g. (3, 2.5, 6.7, 9.8) in a training set with four features). It doesn’t measure information content and treats all features the same. Hence, one can imagine how data points of two highly correlated features would cluster together along a line, and how that would interfere with distances across dimensions.

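A tiny numeric illustration of that double-counting (the points here are made up):

```python
import numpy as np

p = np.array([1.0, 1.0])  # a feature plus an exact duplicate of it
q = np.array([3.0, 3.0])
print(np.linalg.norm(p - q))  # ~2.83: the duplicated axis counts twice
print(abs(p[0] - q[0]))       # 2.0: the distance if the feature appeared once
```
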
Principal Component Analysis is an unsupervised method, but we can still evaluate it along these criteria! The goal of PCA is to explicitly retain the variance, or structure (information), of the dataset while reducing its dimensionality, which is why it is not only generally immune to multicollinearity but is often used to reduce multicollinearity in datasets.

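A minimal sketch of PCA removing collinearity, on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=300)  # heavily correlated with x1
X = np.column_stack([x1, x2])

Z = PCA(n_components=2).fit_transform(X)
print(np.corrcoef(X.T)[0, 1])  # close to 1: the raw features are collinear
print(np.corrcoef(Z.T)[0, 1])  # close to 0: the components are uncorrelated
```
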
Most efficient solving methods for algorithms rely on matrix mathematics and linear algebra systems, essentially representations of high-dimensional spaces, which are easily thrown off by multicollinearity.

Common techniques like heavy one-hot encoding (dummy variables), in which categorical variables are represented as 0s and 1s, can also be damaging because the resulting columns form a perfectly linear relationship. Say we have three binary columns A, B, and C, indicating whether a row belongs to one of the three categories. The columns must sum to 1 in every row, and hence a perfectly linear relationship A+B+C=1 is established.

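The usual remedy is to drop one dummy column; a sketch with pandas (the ‘city’ column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"city": ["A", "B", "C", "A", "B"]})
full = pd.get_dummies(df["city"])                   # A + B + C = 1 in every row
safe = pd.get_dummies(df["city"], drop_first=True)  # keeps only B and C
print(full.sum(axis=1).unique())  # [1]: perfectly collinear with an intercept
print(safe.columns.tolist())      # ['B', 'C']
```
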
How can we identify multicollinearity?

  • Use VIF (Variance Inflation Factor) scores on a regression model to identify whether multicollinearity is present in your dataset (see the sketch after this list).
  • If the standard errors are unusually high, it may be an indicator that one error is being repeatedly propagated because of information overlap.
  • Large changes in parameters when adding or removing features indicate heavily duplicated information.
  • Create a correlation matrix. Features with correlation values consistently above 0.4 are indicators of multicollinearity.

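A sketch of computing VIF scores with statsmodels, under the same hypothetical admissions file as the earlier sketches:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

df = pd.read_csv("Admission_Predict.csv")               # assumed file name
X = add_constant(df.drop(columns=["Chance of Admit"]))  # assumed target column
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs.drop("const").round(1))  # values above ~5-10 commonly flag trouble
```
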
There are many solutions to multicollinearity:

  • Use an algorithm that is immune to multicollinearity if it is an inherent aspect of the data and other transformations are not feasible. Ridge regression, principal component regression, and partial least squares regression are all good regression alternatives (see the sketch after this list).
  • Use PCA to reduce the dimensionality of the dataset and retain only the components that matter for preserving the data’s structure. This is beneficial if the dataset is highly multicollinear overall.
  • Use a feature selection method to remove highly correlated features.
  • Obtain more data; this is the preferred method. More data allows the model to retain the current amount of information while giving context and perspective to the noise.

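A sketch of how ridge regression stabilizes the collinear case from the earlier synthetic demonstration (alpha=1.0 is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly collinear pair
y = 3 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

print("OLS:  ", LinearRegression().fit(X, y).coef_)  # unstable, offsetting values
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)    # shrunk and split ~evenly
```
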
Translated from: https://medium.com/@andre_ye/multicollinearity-impacts-your-data-science-project-more-than-you-know-8504efd706f
