Do Not Use a Decision Tree Like This

As one of the most popular classic machine learning algorithms, the Decision Tree is more intuitive than most others because of its explainability. In one of my previous articles, I introduced the basic idea and mechanism of a Decision Tree model. It demonstrated the model using an algorithm called ID3, which is one of the most classic ones for training a Decision Tree classifier.

If you are not that familiar with Decision Trees, it is highly recommended to check out the above article before reading this one.

To intuitively understand Decision Trees, it is indeed good to start with ID3. However, it is probably not a good idea to use it in practice. In this article, I’ll introduce a commonly used algorithm to build Decision Tree models — C4.5.

Drawbacks of Classic ID3 Algorithm

Photo by aitoff on Pixabay

Before we demonstrate the major drawbacks of the ID3 algorithm, let’s have a look at its major building blocks. Basically, the important ones are Entropy and Information Gain.

Recap of Entropy

Here is the formula of Entropy:

H(X) = -\sum_{i} P(x_i) \log_2 P(x_i)

The set “X” is everything in the node’s set, and “x_i” refers to the specific decision of each sample. Therefore, “P(x_i)” is the probability of a certain decision within that set.

[Image: the example training dataset with the features “Weather”, “Temperature” and “Wind Level” and the decision of whether to go out for running]

Let’s use the same training dataset as an example. Suppose that we have an internal node in our decision tree with “weather = rainy”. It can be seen that the final decisions are both “No”. Then, we can easily calculate the entropy of this node as follows:

H(\text{weather}=\text{rainy}) = -(1 \cdot \log_2 1 + 0 \cdot \log_2 0) = 0 \quad (\text{taking } 0 \log_2 0 = 0)

Basically, the probability of being “No” is 2/2 = 1, whereas the probability of being “Yes” is 0/2 = 0.

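To make the arithmetic above concrete, here is a minimal Python sketch (mine, not from the original article). The entropy() helper implements the formula directly, and the two-sample “No”/“No” node reproduces the zero-entropy result.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(X) = -sum(P(x_i) * log2(P(x_i))) over the decision classes in the node."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

# The "weather = rainy" node holds two samples, and both decisions are "No".
rainy_node = ["No", "No"]
print(entropy(rainy_node))  # 0.0 (Python may print -0.0): no uncertainty at all
```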

Recap of Information Gain

On top of the concept of Entropy, we can calculate the Information Gain, which is the basic criterion for deciding whether a feature should be used as the node to split on.

For example, we have three features: “Weather”, “Temperature” and “Wind Level”. When we start to build our Decision Tree using ID3, how can we decide which one of them should be used as the root node?

ID3 uses Information Gain as the criterion. The rule is to select the feature with the maximum Information Gain among all of them. Here is the formula for calculating Information Gain:

IG(T, a) = H(T) - \sum_{v \in \text{vals}(a)} \frac{|T_v|}{|T|} \, H(T_v)

where

  • “T” is the parent node and “a” is the candidate attribute; “T_v” is the child node for each value “v” of “a”

  • The notation “|T|” means the size of the set

Using the same example, when we calculate the Information Gain for the feature “Weather”, we also need to take the entropies of its child nodes (such as “Weather = Rainy”) into account. The specific derivation and calculation can be found in the article shared in the introduction.

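As a hedged illustration of the formula (a sketch of the standard calculation, not the article’s own code, with made-up records), the Information Gain of a feature is the parent entropy minus the weighted entropies of the child nodes, and ID3 picks whichever feature scores highest. It reuses the entropy() helper from the previous sketch.

```python
def information_gain(records, feature, target="decision"):
    """IG(T, a) = H(T) - sum over values v of a: |T_v|/|T| * H(T_v)."""
    parent_entropy = entropy([r[target] for r in records])
    total = len(records)
    weighted_children = 0.0
    for value in {r[feature] for r in records}:
        child_labels = [r[target] for r in records if r[feature] == value]
        weighted_children += len(child_labels) / total * entropy(child_labels)
    return parent_entropy - weighted_children

# Hypothetical records just to exercise the function.
records = [
    {"weather": "sunny", "decision": "Yes"},
    {"weather": "sunny", "decision": "Yes"},
    {"weather": "rainy", "decision": "No"},
    {"weather": "rainy", "decision": "No"},
]
print(information_gain(records, "weather"))  # 1.0: a perfect split on this toy data
```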

Major Drawbacks of Using Information Gain

The major drawback of using Information Gain as the criterion for determining which feature should be used as the root/next node is that it tends to favour features with more unique values.

But why? Let me demonstrate it using an extreme scenario. Let’s say we have the training set with one more feature: “Date”.

[Image: the training dataset with one more feature, “Date”]

You might say that the feature “Date” should not be considered in this case, because intuitively it will not help decide whether we should go out for running or not. Yes, you’re right. However, in practice we may have much more complicated datasets to classify, and we may not be able to understand all the features. So, we may not always be able to determine whether a feature makes sense or not. Here, I will just use “Date” as an example.

Now, let’s calculate the Information Gain for “Date”. We can start by calculating the entropy for one of the dates, such as “2020-01-01”.

H(\text{Date}=\text{2020-01-01}) = -(1 \cdot \log_2 1) = 0

Since there is only 1 row for each date, the final decision must be either “Yes” or “No”. So, the entropy must be 0! In terms of information theory, it is equivalent to saying:

Given the date, there is only one possible result, and it is certain. So, there is no “uncertainty” at all.

Similarly, for all the other dates, their entropies are 0, too.

Now, let’s calculate the entropy of the feature “Date” itself.

[Image: calculation of the entropy of the “Date” feature itself]

Wow, that is a pretty large number compared to the other features. So, we can calculate the Information Gain of “Date” now.

[Image: calculation of the Information Gain of “Date”]

Unsurprisingly, the Information Gain of “Date” is its own entropy, because the entropies of all its attribute values are 0.

If we calculate the Information Gain for the other three features (you can find details in the article that is linked in the introduction), they are:

  • Information Gain of Weather is 0.592

  • Information Gain of Temperature is 0.522

  • Information Gain of Wind Level is 0.306


Obviously, the Information Gain of “Date” is overwhelmingly larger than the others. Also, it would be even larger if the training dataset were larger. And don’t forget that the feature “Date” actually makes no sense for deciding whether we should go out for running or not, yet it is selected as the “best” one to be the root node.

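The bias is easy to reproduce with the helpers sketched above. In the hypothetical records below (made up for illustration, not the article’s table), a “date” column with one unique value per row beats “weather” simply because every one of its child nodes is pure:

```python
records = [
    {"date": "2020-01-01", "weather": "sunny", "decision": "Yes"},
    {"date": "2020-01-02", "weather": "sunny", "decision": "Yes"},
    {"date": "2020-01-03", "weather": "rainy", "decision": "No"},
    {"date": "2020-01-04", "weather": "rainy", "decision": "Yes"},
    {"date": "2020-01-05", "weather": "rainy", "decision": "No"},
    {"date": "2020-01-06", "weather": "sunny", "decision": "No"},
]

# Every date is unique, so each child node is pure and "date" reaches the
# maximum possible Information Gain on this data.
print(information_gain(records, "date"))     # 1.0
print(information_gain(records, "weather"))  # ~0.08, far smaller
```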

Even funnier, after we decided to use “Date” as our root node, we’re done :)

[Image: the resulting Decision Tree, with “Date” as the root node and one leaf per date]

We end up with a Decision Tree as shown above. This is because the feature “Date” is “too good”. If we use it as the root node, its attribute values alone will tell us whether we should go out for running or not. None of the other features are necessary.

Image by Clker-Free-Vector-Images on Pixabay

Yes, you may be making a face like this fish at the moment, and so am I.

Fix the Information Gain Limitation

Photo by jarmoluk on Pixabay

The easiest fix for the Information Gain limitation in the ID3 algorithm comes from another Decision Tree algorithm called C4.5. The basic idea is to use the Information Gain Ratio rather than the Information Gain.

Specifically, the Information Gain Ratio simply adds a penalty to the Information Gain by dividing it by the entropy of the split itself, i.e. the entropy of the feature’s own value distribution (often called the split information).

\text{GainRatio}(T, a) = \frac{IG(T, a)}{\text{SplitInfo}(T, a)}

In other words,

\text{GainRatio}(T, a) = \frac{H(T) - \sum_{v \in \text{vals}(a)} \frac{|T_v|}{|T|} H(T_v)}{-\sum_{v \in \text{vals}(a)} \frac{|T_v|}{|T|} \log_2 \frac{|T_v|}{|T|}}

Therefore, if we’re using C4.5 rather than ID3, the Information Gain Ratio of the feature “Date” will be as follows.

[Image: the Information Gain Ratio calculation for “Date”]

Well, it is indeed still the largest one compared to the other features, but don’t forget that we are using an extreme example in which each attribute value of the feature “Date” has only one row. In practice, the Information Gain Ratio is usually enough to avoid most of the scenarios in which Information Gain causes bias.

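A minimal sketch of the C4.5-style correction, continuing with the helpers above (my own illustration, not the exact C4.5 implementation): the gain is divided by the split information of the feature, so a feature like “date” that shatters the data into many tiny branches is penalised heavily.

```python
from collections import Counter
from math import log2

def split_information(records, feature):
    """Entropy of the feature's own value distribution: the penalty term."""
    total = len(records)
    return -sum((count / total) * log2(count / total)
                for count in Counter(r[feature] for r in records).values())

def gain_ratio(records, feature, target="decision"):
    si = split_information(records, feature)
    if si == 0:  # the feature has a single value, so splitting on it is useless
        return 0.0
    return information_gain(records, feature, target) / si

# Reusing the toy records from the previous sketch:
print(gain_ratio(records, "date"))     # 1.0 / log2(6), roughly 0.39: heavily penalised
print(gain_ratio(records, "weather"))  # ~0.08 / 1.0, roughly 0.08
```

As in the article’s extreme example, “date” can still come out on top here, but the gap shrinks dramatically; with more realistic data the penalty is normally enough.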

Other Improvements of C4.5

Photo by silviarita on Pixabay

In my opinion, using Information Gain Ratio is the most significant improvement from ID3 to C4.5. Nevertheless, there are more improvements in C4.5 that you should know.

PEP (Pessimistic Error Pruning)

If you are not familiar with the concept of “pruning” a Decision Tree, again, you may need to check out my previous article linked in the introduction.

PEP is another significant improvement in C4.5. Specifically, it prunes the tree in a top-down manner. For every internal node, the algorithm calculates its error rate, then tentatively prunes the branch and compares the error rate before and after pruning. Based on that comparison, it decides whether the branch should be kept.

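As a rough, hedged sketch of that decision (following the common textbook formulation of PEP with a 0.5 continuity correction per leaf, which may differ in detail from the exact C4.5 code), for one internal node we compare the pessimistic error of keeping the subtree with the error of collapsing it into a single leaf:

```python
from math import sqrt

def should_prune(n_samples, errors_as_leaf, leaf_errors):
    """Pessimistic Error Pruning decision for a single internal node.

    n_samples      -- number of training samples reaching this node
    errors_as_leaf -- misclassifications if the node became a majority-class leaf
    leaf_errors    -- misclassification counts of the subtree's current leaves
    """
    # Pessimistic error of keeping the subtree: observed errors + 0.5 per leaf.
    subtree_error = sum(leaf_errors) + 0.5 * len(leaf_errors)
    # Pessimistic error of replacing the subtree with a single leaf.
    leaf_error = errors_as_leaf + 0.5
    # Standard error of the subtree's estimate.
    std_error = sqrt(subtree_error * (n_samples - subtree_error) / n_samples)
    # Prune when the single leaf is not significantly worse than the subtree.
    return leaf_error <= subtree_error + std_error

# Hypothetical node: 40 samples, 6 errors as a leaf, subtree leaves err 2 + 1 + 1 times.
print(should_prune(40, 6, [2, 1, 1]))  # True: collapse this branch into a leaf
```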

Some characteristics of PEP:

  1. It is one of the Post-Pruning methods.

  2. It prunes the tree without depending on a validation dataset.

  3. It is usually quite good at avoiding overfitting, and consequently improves the performance when classifying unseen data.

Discretising the Continuous Features

C4.5 supports continuous values, so we are not limited to categorical values such as “Low”, “Medium” and “High”. Instead, C4.5 automatically detects the threshold of a continuous feature that generates the maximum Information Gain Ratio and then splits the node using this threshold.

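A minimal sketch of the idea (assuming the gain_ratio() helper above and a made-up numeric “temperature” field; using midpoints between consecutive sorted values as candidate thresholds is one common choice, not necessarily C4.5’s exact procedure):

```python
def best_threshold(records, feature, target="decision"):
    """Find the binary split 'feature <= t' that maximises the gain ratio."""
    values = sorted({r[feature] for r in records})
    best_t, best_score = None, -1.0
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2  # midpoint between consecutive observed values
        # Materialise the binary split as a temporary categorical feature.
        binned = [dict(r, _bin="<=t" if r[feature] <= t else ">t") for r in records]
        score = gain_ratio(binned, "_bin", target)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical numeric records instead of "Low"/"Medium"/"High" categories.
numeric_records = [
    {"temperature": 12, "decision": "No"},
    {"temperature": 18, "decision": "Yes"},
    {"temperature": 21, "decision": "Yes"},
    {"temperature": 27, "decision": "No"},
]
print(best_threshold(numeric_records, "temperature"))  # e.g. (15.0, ...)
```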

Summary

Image by Bessi on Pixabay

In this article, I have illustrated why ID3 is not ideal. The major reason is that its criterion, Information Gain, can be significantly biased towards features with larger numbers of distinct values.

The solution is given in another Decision Tree algorithm called C4.5. It evolves the Information Gain into the Information Gain Ratio, which reduces the impact of attributes with large numbers of distinct values.

Again, if you feel that you need more context and basic knowledge about Decision Trees, please check out my previous article.

Translated from: https://towardsdatascience.com/do-not-use-decision-tree-like-this-369769d6104d
