[Machine Learning] Entropy, Decision Trees, and Random Forests: A Summary

I. Entropy

Formula:

$$-\sum_{i=1}^{n} p(x_i)\log_2 p(x_i) = \sum_{i=1}^{n} p(x_i)\log_2\frac{1}{p(x_i)}$$
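To make the formula concrete, here is a minimal sketch (the `entropy` helper is our own name, not a library function) that evaluates it for an arbitrary probability vector; it reproduces the hand calculation that follows:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability vector."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]            # treat 0 * log(0) as 0
    return -np.sum(probs * np.log2(probs))

# 3 "no" and 7 "yes" out of 10 accounts
print(entropy([0.3, 0.7]))              # ≈ 0.8813, matching info_D below
```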


```python
import numpy as np

# Whether the account is genuine: 3 no (0.3), 7 yes (0.7)
# Information entropy before any split
info_D = 0.3*np.log2(1/0.3) + 0.7*np.log2(1/0.7)
info_D
```

0.8812908992306926

```python
# Decision tree: partition the samples and measure entropy against the target
# Three attributes: log density, friend density, genuine avatar
# Build the tree using log density
# 3 s 0.3 -------> 2 no, 1 yes
# 4 m 0.4 -------> 1 no, 3 yes
# 3 l 0.3 -------> 3 yes
info_L_D = 0.3*(2/3*np.log2(3/2) + 1/3*np.log2(3)) + 0.4*(0.25*np.log2(4) + 0.75*np.log2(4/3)) + 0.3*(1*np.log2(1))
info_L_D
```

0.5999999999999999

```python
# Information gain of splitting on log density
info_D - info_L_D
```

0.2812908992306927

```python
# Friend density
# 4 s 0.4 ---> 3 no, 1 yes
# 4 m 0.4 ---> 4 yes
# 2 l 0.2 ---> 2 yes
info_F_D = 0.4*(0.75*np.log2(4/3) + 0.25*np.log2(4)) + 0 + 0
info_F_D
```

0.32451124978365314

```python
# Information gain of splitting on friend density
info_D - info_F_D
```

0.5567796494470394
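The same bookkeeping can be wrapped in a small helper that scores every candidate attribute at once. This is a sketch under assumptions we chose ourselves — each attribute is described as a list of (weight, class counts) groups, and it reuses numpy and the `entropy` helper from the sketch above:

```python
def split_entropy(groups):
    """Weighted entropy after a split; groups is a list of (weight, counts)."""
    total = 0.0
    for weight, counts in groups:
        counts = np.asarray(counts, dtype=float)
        total += weight * entropy(counts / counts.sum())
    return total

info_D = entropy([0.3, 0.7])
gains = {
    'log density':    info_D - split_entropy([(0.3, [2, 1]), (0.4, [1, 3]), (0.3, [3])]),
    'friend density': info_D - split_entropy([(0.4, [3, 1]), (0.4, [4]), (0.2, [2])]),
}
print(max(gains, key=gains.get), gains)   # friend density wins: ≈ 0.557 vs ≈ 0.281
```

The attribute with the largest gain (friend density) is the one the tree splits on first.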

II. Decision Trees

1. Imports

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import datasets, tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
```

2. Load the data

```python
iris = datasets.load_iris()
X = iris['data']
y = iris['target']
feature_names = iris.feature_names
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1024)
```

3. Using the decision tree

```python
# Data cleaning (time-consuming), then feature engineering,
# then train the model and tune its hyperparameters.
# sklearn ships every algorithm ready to use; the pattern is:
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)
y_ = clf.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_)
```

1.0

```python
# Root-node entropy: the 120 training samples split 39/42/39 across the three classes
39/120*np.log2(120/39) + 42/120*np.log2(120/42) + 39/120*np.log2(120/39)
```

1.5840680553754911

```python
# Entropy of the impure child node holding the remaining 81 samples (42 + 39)
42/81*np.log2(81/42) + 39/81*np.log2(81/39)
```

0.9990102708804813
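These hand calculations can be checked against the fitted tree itself: scikit-learn exposes each node's impurity through the fitted estimator's `tree_` attribute, and with `criterion='entropy'` that impurity is exactly this entropy. A minimal sketch, assuming `clf` is the entropy tree fitted above:

```python
# Node 0 is the root; its children follow in depth-first order
print(clf.tree_.impurity[:3])         # expect ≈ 1.584 at the root
print(clf.tree_.n_node_samples[:3])   # expect 120 at the root
```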

```python
plt.figure(figsize=(18, 12))
_ = tree.plot_tree(clf, filled=True, feature_names=feature_names, max_depth=1)
plt.savefig('./tree.jpg')
```

```python
# Continuous attributes are split against a threshold
X_train
```

```python
# The larger a feature's standard deviation, the more spread out its values
# and the easier the classes are to separate on it
X_train.std(axis=0)
```

array([0.82300095, 0.42470578, 1.74587112, 0.75016619])

```python
# A split threshold is the midpoint of two adjacent sorted values: (1.9 + 3.3) / 2 = 2.6
np.sort(X_train[:, 2])
```
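More generally, the candidate thresholds a tree considers for a continuous feature are the midpoints between adjacent distinct sorted values. A minimal sketch for petal length (column 2), where 1.9 and 3.3 happen to be adjacent in this training split:

```python
# Candidate thresholds: midpoints of adjacent distinct sorted values
values = np.unique(X_train[:, 2])            # sorted, duplicates removed
thresholds = (values[:-1] + values[1:]) / 2  # includes (1.9 + 3.3) / 2 = 2.6
thresholds
```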
```python
%%time
# Capping max_depth makes the tree shallower: pruning
clf = DecisionTreeClassifier(criterion='entropy', max_depth=5)
clf.fit(X_train, y_train)
y_ = clf.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_))
plt.figure(figsize=(18, 12))
_ = tree.plot_tree(clf, filled=True, feature_names=feature_names)
```

1.0
Wall time: 114 ms

```python
%%time
# Same shallow tree, but with the gini criterion
clf = DecisionTreeClassifier(criterion='gini', max_depth=5)
clf.fit(X_train, y_train)
y_ = clf.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_))
plt.figure(figsize=(18, 12))
_ = tree.plot_tree(clf, filled=True, feature_names=feature_names)
```

1.0
Wall time: 113 ms

Gini coefficient formula:

$$\mathrm{gini} = \sum_{i=1}^{n} p(x_i)\,(1 - p(x_i))$$
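As code, the formula becomes a one-liner; this sketch (the `gini` helper is our own name) reproduces the two hand calculations that follow:

```python
import numpy as np

def gini(probs):
    """Gini impurity of a discrete probability vector."""
    probs = np.asarray(probs, dtype=float)
    return np.sum(probs * (1 - probs))     # equivalently: 1 - np.sum(probs**2)

print(gini([1.0, 0.0, 0.0]))               # 0.0: a perfectly pure node
print(gini([39/120, 42/120, 39/120]))      # ≈ 0.66625: the root node below
```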

```python
# One class has probability 1.0 and the rest 0
# 100% pure
gini = 1*(1 - 1)
gini
```

0

```python
# Root node again: 39, 42, 39 samples per class; the two 39-sample classes contribute equally
39/120*(1 - 39/120)*2 + 42/120*(1 - 42/120)
```

0.66625

```python
feature_names
```

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

```python
# Drop class 0 (which separates perfectly) and study where classes 1 and 2 overlap
X_train2 = X_train[y_train != 0]
y_train2 = y_train[y_train != 0]
```

```python
# Sort each feature column and read off the class labels in that order
index = np.argsort(X_train2[:, 0])
display(X_train2[:, 0][index])
y_train2[index]
```

```python
index = np.argsort(X_train2[:, 1])
display(X_train2[:, 1][index])
y_train2[index]
```

```python
index = np.argsort(X_train2[:, 2])
display(X_train2[:, 2][index])
y_train2[index]
```

```python
index = np.argsort(X_train2[:, 3])
display(X_train2[:, 3][index])
y_train2[index]
```
Decision tree models need no feature rescaling, normalization, or standardization, as the sketch below verifies.
In industry a single decision tree is rarely used on its own; it is too simple.
The upgraded versions of the decision tree are the ensemble algorithms: random forest, extremely randomized (extra) trees, gradient-boosted trees, and AdaBoost.
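The scale-invariance claim is easy to check: splits only compare a feature against a threshold, so a monotone rescaling of each feature leaves the tree's decisions unchanged. A minimal sketch, assuming the iris split above and a fixed `random_state` so both fits break ties identically:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
clf_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
clf_std = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_train), y_train)

# The two accuracies should typically match: standardization changes the
# threshold values but not which samples fall on each side of them
print(clf_raw.score(X_test, y_test))
print(clf_std.score(scaler.transform(X_test), y_test))
```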

III. Random Forests

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn import datasets
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
```

A random forest is built from many decision trees, each working exactly as described above.
Many decision trees voting together ------------> an ensemble algorithm.
What does the "random" in random forest mean? Each tree trains on a random bootstrap sample of the rows, and each split considers only a random subset of the features — see the sketch below.
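Both sources of randomness are exposed as constructor arguments in scikit-learn. A sketch that spells out the relevant parameters (the values shown are the usual classifier defaults — `max_features='sqrt'` in recent scikit-learn versions):

```python
clf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    bootstrap=True,        # each tree sees a bootstrap resample of the rows
    max_features='sqrt',   # each split searches a random subset of the features
)
```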

```python
wine = datasets.load_wine()
wine
```

{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00, 1.065e+03],
        ...,
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00, 5.600e+02]]),
 'target': array([0, 0, 0, ..., 2, 2, 2]),
 'target_names': array(['class_0', 'class_1', 'class_2'], dtype='<U7'),
 'DESCR': '.. _wine_dataset:\n\nWine recognition dataset\n...',
 'feature_names': ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']}

```python
X = wine['data']
y = wine['target']
X.shape
```

(178, 13)

Split the data:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

Train with the random forest algorithm and check the predictions and accuracy:

```python
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_ = clf.predict(X_test)
accuracy_score(y_test, y_)
```

1.0

```python
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
dt_clf.score(X_test, y_test)
```

0.9444444444444444

Compare the decision tree against the random forest over many random splits:

```python
score = 0
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    dt_clf = DecisionTreeClassifier()
    dt_clf.fit(X_train, y_train)
    score += dt_clf.score(X_test, y_test)/100
print('Decision tree accuracy over repeated runs:', score)
```

Decision tree accuracy over repeated runs: 0.909166666666666

```python
score = 0
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    score += clf.score(X_test, y_test)/100
print('Random forest accuracy over repeated runs:', score)
```

Random forest accuracy over repeated runs: 0.9808333333333332
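`ExtraTreesClassifier` was imported above but never used; the same repeated-split loop can compare it as well. Extra-trees adds a third source of randomness by drawing split thresholds at random instead of searching for the best one. A sketch under the same setup:

```python
score = 0
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    et_clf = ExtraTreesClassifier(n_estimators=100)
    et_clf.fit(X_train, y_train)
    score += et_clf.score(X_test, y_test)/100
print('Extra trees accuracy over repeated runs:', score)
```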
