基于隨機森林的糖尿病預測模型研究應用(python)

基于隨機森林的糖尿病預測模型研究應用

1、導入糖尿病數據集

In?[14]:

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
data=pd.read_csv('./糖尿病數據集.csv',encoding="gbk")
data.head()#查看前五行數據

Out[14]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

In?[2]:

data.tail()

Out[2]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
763	10	101	76	48	180	32.9	0.171	63	0
764	2	122	70	27	0	36.8	0.340	27	0
765	5	121	72	23	112	26.2	0.245	30	0
766	1	126	60	0	0	30.1	0.349	47	1
767	1	93	70	31	0	30.4	0.315	23	0

2、糖尿病樣本統計分析

提取進行樣本分析的特征

In?[2]:

##寫一個類方法做一個數據轉換操作，將1轉換成糖尿病患者，0轉換成正常人
data2=data.copy()
def tn_ftn(Outcome):if Outcome==1:return '糖尿病患者'else:return '正常人'
data2['result']=data2['Outcome'].apply(tn_ftn)##目標變量
y1=data2['result']
data2['age_groups'] = pd.cut(data2['Age'], bins=[0, 20, 40, 60,80,100],right=False)##分箱操作

In?[3]:

age_felie=data2.groupby(['age_groups','Outcome'])['result'].count().reset_index()
age_felie['age_groups']=['(0,20]正常人','(0,20]糖尿病患者','(20,40]正常人','(20,40]糖尿病患者','(40,60]正常人','(40,60]糖尿病患者','(60,80]正常人','(60,80]糖尿病患者','(80,100]正常人','(80,100]糖尿病患者']
age_felie

Out[3]:

	age_groups	Outcome	result
0	(0,20]正常人	0	0
1	(0,20]糖尿病患者	1	0
2	(20,40]正常人	0	401
3	(20,40]糖尿病患者	1	160
4	(40,60]正常人	0	76
5	(40,60]糖尿病患者	1	99
6	(60,80]正常人	0	22
7	(60,80]糖尿病患者	1	9
8	(80,100]正常人	0	1
9	(80,100]糖尿病患者	1	0

In?[4]:

fl=data2.groupby(['age_groups'])['Age'].count()
fl

Out[4]:

age_groups
[0, 20)        0
[20, 40)     561
[40, 60)     175
[60, 80)      31
[80, 100)      1
Name: Age, dtype: int64

In?[5]:

age_felie['age_groups']

Out[5]:

0        (0,20]正常人
1      (0,20]糖尿病患者
2       (20,40]正常人
3     (20,40]糖尿病患者
4       (40,60]正常人
5     (40,60]糖尿病患者
6       (60,80]正常人
7     (60,80]糖尿病患者
8      (80,100]正常人
9    (80,100]糖尿病患者
Name: age_groups, dtype: object

一、糖尿病患者在各年齡階段的年齡占比

In?[14]:

from pyecharts.charts import Pie
from pyecharts import options as opts
# 繪制餅圖
pie = Pie()
pie.add("", [list(z) for z in zip(age_felie['age_groups'].values.tolist(), list(age_felie['result']))],radius=[20,200])
pie.set_global_opts(legend_opts=opts.LegendOpts(orient="vertical", pos_bottom="50%", pos_left="75%"))
pie.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{c} \n ({d}%)"))
pie.render('各年齡階段糖尿病患者人數.html')
# pie.render_notebook()

Out[14]:

二、各年齡階段人數

In?[13]:

from pyecharts import options as opts
from pyecharts.charts import Bar# 假設age_felie已經定義并包含'age_groups'和'result'列
y_data = age_felie['result'].values
x_data = age_felie['age_groups'].values# 初始化圖表配置
init_opts = opts.InitOpts(width='1200px', height='800px')# 創建柱狀圖
bar = (Bar(init_opts).add_xaxis(x_data.tolist()).add_yaxis('糖尿病患者/正常人', y_data.tolist(), label_opts=opts.LabelOpts(position='insideTop')).set_global_opts(title_opts=opts.TitleOpts(title='各年齡階段人數'),xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=20, color='skyblue')),yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=20, color='skyblue')))
)# 渲染到HTML文件
bar.render('各年齡階段人數.html')
# bar.render_notebook()

Out[13]:

3、查看數據的描述性信息及相關性

數據的形狀

In?[15]:

data.shape

Out[15]:

(768, 9)

數據的標簽

In?[16]:

# 查看標簽分布 
print("數據集一共多少條:",data.shape[0])
print("\n")
print("糖尿病數據標簽的分布:\n")
print(data.Outcome.value_counts()) ##0代表正常人，1代表患者人數

數據集一共多少條: 768糖尿病數據標簽的分布:0    500
1    268
Name: Outcome, dtype: int64

描述信息

In?[17]:

data.describe().round(2)##保留兩位小數

Out[17]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.00	768.00	768.00	768.00	768.00	768.00	768.00	768.00	768.00
mean	3.85	120.89	69.11	20.54	79.80	31.99	0.47	33.24	0.35
std	3.37	31.97	19.36	15.95	115.24	7.88	0.33	11.76	0.48
min	0.00	0.00	0.00	0.00	0.00	0.00	0.08	21.00	0.00
25%	1.00	99.00	62.00	0.00	0.00	27.30	0.24	24.00	0.00
50%	3.00	117.00	72.00	23.00	30.50	32.00	0.37	29.00	0.00
75%	6.00	140.25	80.00	32.00	127.25	36.60	0.63	41.00	1.00
max	17.00	199.00	122.00	99.00	846.00	67.10	2.42	81.00	1.00

In?[18]:

#相關性
data.corr().round(2)

Out[18]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
Pregnancies	1.00	0.13	0.14	-0.08	-0.07	0.02	-0.03	0.54	0.22
Glucose	0.13	1.00	0.15	0.06	0.33	0.22	0.14	0.26	0.47
BloodPressure	0.14	0.15	1.00	0.21	0.09	0.28	0.04	0.24	0.07
SkinThickness	-0.08	0.06	0.21	1.00	0.44	0.39	0.18	-0.11	0.07
Insulin	-0.07	0.33	0.09	0.44	1.00	0.20	0.19	-0.04	0.13
BMI	0.02	0.22	0.28	0.39	0.20	1.00	0.14	0.04	0.29
DiabetesPedigreeFunction	-0.03	0.14	0.04	0.18	0.19	0.14	1.00	0.03	0.17
Age	0.54	0.26	0.24	-0.11	-0.04	0.04	0.03	1.00	0.24
Outcome	0.22	0.47	0.07	0.07	0.13	0.29	0.17	0.24	1.00

In?[19]:

#相關性熱力圖
#忽略警告
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(data.corr(),cmap="Blues",annot=True)

Out[19]:

<Axes: >

4、數據預處理

一、缺失值——均值填充

In?[20]:

#使用seaborn庫繪圖
import seaborn as sns
sns.set_style('whitegrid',{'font.sans-serif':['simhei','Arial']})
plt.figure(figsize=(30, 30))
g = sns.pairplot(data,x_vars=['Pregnancies','Glucose','BloodPressure','SkinThickness'],y_vars=['Age'],palette='Set1',hue='Outcome')
g = g.map_offdiag(plt.scatter)
plt.suptitle('各年齡階段的其他特征情況1', verticalalignment='bottom' , y=1,color="skyblue",size=20)
plt.show()#0為正常人，1為患有糖尿病

<Figure size 3000x3000 with 0 Axes>

In?[21]:

#使用seaborn庫繪圖
sns.set_style('whitegrid',{'font.sans-serif':['simhei','Arial']})
plt.figure(figsize=(30, 30))
g = sns.pairplot(data,x_vars=['Insulin','BMI','DiabetesPedigreeFunction'],y_vars=['Age'],palette='Set1',hue='Outcome')
g = g.map_offdiag(plt.scatter)
plt.suptitle('各年齡階段的其他特征情況2', verticalalignment='bottom' , y=1,color="skyblue",size=20)
plt.show()#0為正常人，1為患有糖尿病

<Figure size 3000x3000 with 0 Axes>

可以觀察到'Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI'上都含有0值，

從現實的實際情況來說，'Pregnancies'列含有0值是正常的，那么我們將其他列含有的0值視為缺失值，現在進行轉換，

將'Glucose','BloodPressure','SkinThickness','Insulin','BMI'上所有列含有的0值填充為NaN值，進行查看空缺值

步驟：

1、缺失值檢查

2、填充缺失值

1、缺失值檢查

第一步：將Glucose、BloodPressure、SkinThickness、Insulin、BMI中的0替換成NaN值

第二步：使用data.info()檢查缺失值

第一步：將Glucose、BloodPressure、SkinThickness、Insulin、BMI中的0替換成NaN值

In?[15]:

column = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data[column] = data[column].replace(0,np.nan)

第二步：使用data.info()檢查缺失值

In?[23]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):#   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  0   Pregnancies               768 non-null    int64  1   Glucose                   763 non-null    float642   BloodPressure             733 non-null    float643   SkinThickness             541 non-null    float644   Insulin                   394 non-null    float645   BMI                       757 non-null    float646   DiabetesPedigreeFunction  768 non-null    float647   Age                       768 non-null    int64  8   Outcome                   768 non-null    int64  
dtypes: float64(6), int64(3)
memory usage: 54.1 KB

可以很清楚的觀察到糖尿病數據集中Glucose含有5條缺失值，BloodPressure含有35條缺失值，

SkinThickness含有227條缺失值，Insulin含有374條缺失值，BMI含有11條缺失值

即缺失值數據條數從多到少排序為：Insulin、SkinThickness、BloodPressure、BMI、Glucose

2、填充缺失值

填充原因：由上述的糖尿病數據相關性可知，目標變量與特征變量之間都存在一定的相關性，

故如果刪除缺失值的話，會可能導致統計效力下降，模型的準確性和泛化能力也會受到影響

In?[16]:

data['Glucose'].fillna(data.Glucose.mean().round(0),inplace=True)
data['BloodPressure'].fillna(data.BloodPressure.mean().round(0),inplace=True)
data['SkinThickness'].fillna(data.SkinThickness.mean().round(0),inplace=True)
data['Insulin'].fillna(data.Insulin.mean().round(0),inplace=True)
data['BMI'].fillna(data.BMI.mean().round(1),inplace=True)

In?[25]:

data.head()##查看填充成功

Out[25]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148.0	72.0	35.0	156.0	33.6	0.627	50	1
1	1	85.0	66.0	29.0	156.0	26.6	0.351	31	0
2	8	183.0	64.0	29.0	156.0	23.3	0.672	32	1
3	1	89.0	66.0	23.0	94.0	28.1	0.167	21	0
4	0	137.0	40.0	35.0	168.0	43.1	2.288	33	1

二、異常值處理——中位數填充

由上述的描述信息可以看出Pregnancies、BloodPressure、Age這些值在實際生活中是正常的，那么現在需要進行對Glucose、SkinThickness、Insulin、BMI、DiabetesPedigreeFunction進行異常排查

第一步：畫出需要分析列的箱線圖，即畫出糖尿病數據集中經過缺失值填充后Glucose、SkinThickness、Insulin、BMI、DiabetesPedigreeFunction列的箱線圖

第二步：利用z-score的方法找出異常值所在的行

第三步：采用中位數對異常進行填充

In?[26]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns# 刪除指定的列
df = data.drop(['Pregnancies','BloodPressure','Age','Outcome'], axis=1)# 查看轉換后的DataFrame的數據類型
# print(df.dtypes)# 生成箱型圖
plt.figure(figsize=(15, 8))
sns.boxplot(data=df,orient= 'vertica')
plt.title('Box Plot of All Features')
plt.xlabel('Features')
plt.ylabel('Values')
#保存圖片
plt.savefig('糖尿病數據集缺失值處理后的箱線圖.png') 
plt.show()

①對Glucose列

In?[17]:

##對異常值進行足一排查
import pandas as pd
# 選擇要分析的列，Glucose——葡萄糖
column_to_analyze = 'Glucose'
# 計算該列的平均值和標準差
mean = data[column_to_analyze].mean()
std = data[column_to_analyze].std()
# 計算每個樣本的Z-score
data['z_score'] = (data[column_to_analyze] - mean) / std
# 設定一個閾值，通常選擇3作為標準，表示3個標準差之外的值為異常值
threshold = 3
# 識別異常值，即Z-score的絕對值大于閾值的樣本
data['is_outlier'] = abs(data['z_score']) > threshold
# 打印出異常值的行
print("Glucose異常值所在行:")
print(data[data['is_outlier']])

Glucose異常值所在行:
Empty DataFrame
Columns: [Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome, z_score, is_outlier]
Index: []

可以看出Glucose無異常值

②對SkinThickness列

In?[18]:

##第一步：利用Z-Score進行異常值排查
import pandas as pd
import math 
# 選擇要分析的列，SkinThickness——皮脂厚度
column_to_analyze = 'SkinThickness'
# 計算該列的平均值和標準差
mean = data[column_to_analyze].mean()
std = data[column_to_analyze].std()
# 計算每個樣本的Z-score
data['z_score'] = (data[column_to_analyze] - mean) / std
# 設定一個閾值，通常選擇3作為標準，表示3個標準差之外的值為異常值
threshold = 3
# 識別異常值，即Z-score的絕對值大于閾值的樣本
data['is_outlier'] = abs(data['z_score']) > threshold
# 打印出異常值的行
print("SkinThickness異常值所在行:")
print(data[data['is_outlier']])
# 第二步：利用中位數填充異常值
## 使用中位數替換異常值
# 計算列的中位數
median_value = data[column_to_analyze].median()
# 使用中位數替換異常值
data.loc[data['is_outlier'], column_to_analyze] = median_value

SkinThickness異常值所在行:Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
57             0    100.0           88.0           60.0    110.0  46.8   
120            0    162.0           76.0           56.0    100.0  53.2   
445            0    180.0           78.0           63.0     14.0  59.4   
579            2    197.0           70.0           99.0    156.0  34.7   DiabetesPedigreeFunction  Age  Outcome   z_score  is_outlier  
57                      0.962   31        0  3.513952        True  
120                     0.759   25        1  3.058952        True  
445                     2.420   25        1  3.855201        True  
579                     0.575   62        1  7.950196        True

③對Insulin列

In?[19]:

import pandas as pd
# 選擇要分析的列，BloodPressure——血壓
column_to_analyze = 'Insulin'
# 計算該列的平均值和標準差
mean = data[column_to_analyze].mean()
std = data[column_to_analyze].std()
# # 使用math.floor()將均值向下取整為最接近的整數
# mean_value_int = math.floor(mean)
# 計算每個樣本的Z-score
data['z_score'] = (data[column_to_analyze] - mean) / std
# 設定一個閾值，通常選擇3作為標準，表示3個標準差之外的值為異常值
threshold = 3
# 識別異常值，即Z-score的絕對值大于閾值的樣本
data['is_outlier'] = abs(data['z_score']) > threshold# 打印出異常值的行
print("Insulin異常值所在行:")
print(data[data['is_outlier']])
# 計算列的中位數
median_value = data[column_to_analyze].median()
# 使用中位數替換異常值
data.loc[data['is_outlier'], column_to_analyze] = median_value

Insulin異常值所在行:Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
8              2    197.0           70.0           45.0    543.0  30.5   
13             1    189.0           60.0           23.0    846.0  30.1   
111            8    155.0           62.0           26.0    495.0  34.0   
153            1    153.0           82.0           42.0    485.0  40.6   
186            8    181.0           68.0           36.0    495.0  30.1   
220            0    177.0           60.0           29.0    478.0  34.6   
228            4    197.0           70.0           39.0    744.0  36.7   
247            0    165.0           90.0           33.0    680.0  52.3   
286            5    155.0           84.0           44.0    545.0  38.7   
370            3    173.0           82.0           48.0    465.0  38.4   
392            1    131.0           64.0           14.0    415.0  23.7   
409            1    172.0           68.0           49.0    579.0  42.4   
415            3    173.0           84.0           33.0    474.0  35.7   
486            1    139.0           62.0           41.0    480.0  40.7   
584            8    124.0           76.0           24.0    600.0  28.7   
645            2    157.0           74.0           35.0    440.0  39.4   
655            2    155.0           52.0           27.0    540.0  38.7   
695            7    142.0           90.0           24.0    480.0  30.4   
753            0    181.0           88.0           44.0    510.0  43.3   DiabetesPedigreeFunction  Age  Outcome   z_score  is_outlier  
8                       0.158   53        1  4.554521        True  
13                      0.398   59        1  8.118329        True  
111                     0.543   46        1  3.989957        True  
153                     0.687   23        0  3.872340        True  
186                     0.615   60        1  3.989957        True  
220                     1.072   21        1  3.790007        True  
228                     2.329   31        0  6.918631        True  
247                     0.427   23        0  6.165880        True  
286                     0.619   34        0  4.578044        True  
370                     2.137   25        1  3.637105        True  
392                     0.389   21        0  3.049018        True  
409                     0.702   28        1  4.977944        True  
415                     0.258   22        1  3.742960        True  
486                     0.536   21        0  3.813531        True  
584                     0.687   52        1  5.224940        True  
645                     0.134   30        0  3.343061        True  
655                     0.240   25        1  4.519236        True  
695                     0.128   43        1  3.813531        True  
753                     0.222   26        1  4.166383        True

④對BMI列

In?[20]:

import pandas as pd
import math
# 選擇要分析的列
column_to_analyze = 'BMI'
# 計算該列的平均值和標準差
mean = data[column_to_analyze].mean()
std = data[column_to_analyze].std()
# # 使用math.floor()將均值向下取整為最接近的整數
# mean_value_int = math.floor(mean)
# 計算每個樣本的Z-score
data['z_score'] = (data[column_to_analyze] - mean) / std
# 設定一個閾值，通常選擇3作為標準，表示3個標準差之外的值為異常值
threshold = 3
# 識別異常值，即Z-score的絕對值大于閾值的樣本
data['is_outlier'] = abs(data['z_score']) > threshold
# 打印出異常值的行
print("BMI異常值所在行:")
print(data[data['is_outlier']])
# 計算列的中位數
median_value = data[column_to_analyze].median()
# 使用中位數替換異常值
data.loc[data['is_outlier'], column_to_analyze] = median_value

BMI異常值所在行:Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
120            0    162.0           76.0           29.0    100.0  53.2   
125            1     88.0           30.0           42.0     99.0  55.0   
177            0    129.0          110.0           46.0    130.0  67.1   
445            0    180.0           78.0           29.0     14.0  59.4   
673            3    123.0          100.0           35.0    240.0  57.3   DiabetesPedigreeFunction  Age  Outcome   z_score  is_outlier  
120                     0.759   25        1  3.016940        True  
125                     0.496   26        1  3.278753        True  
177                     0.319   26        1  5.038713        True  
445                     2.420   25        1  3.918738        True  
673                     0.880   22        0  3.613291        True

⑤對DiabetesPedigreeFunction列

In?[21]:

import pandas as pd
# 選擇要分析的列，DiabetesPedigreeFunction——糖尿病遺傳函數
column_to_analyze = 'DiabetesPedigreeFunction'
# 計算該列的平均值和標準差
mean = data[column_to_analyze].mean()
std = data[column_to_analyze].std()
# # 使用math.floor()將均值向下取整為最接近的整數
# mean_value_int = math.floor(mean)
# 計算每個樣本的Z-score
data['z_score'] = (data[column_to_analyze] - mean) / std
# 設定一個閾值，通常選擇3作為標準，表示3個標準差之外的值為異常值
threshold = 3
# 識別異常值，即Z-score的絕對值大于閾值的樣本
data['is_outlier'] = abs(data['z_score']) > threshold
# 打印出異常值的行
print("DiabetesPedigreeFunction異常值所在行:")
print(data[data['is_outlier']])
# 計算列的中位數
median_value = data[column_to_analyze].median()
# 使用中位數替換異常值
data.loc[data['is_outlier'], column_to_analyze] = median_value

DiabetesPedigreeFunction異常值所在行:Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
4              0    137.0           40.0           35.0    168.0  43.1   
45             0    180.0           66.0           39.0    156.0  42.0   
58             0    146.0           82.0           29.0    156.0  40.5   
228            4    197.0           70.0           39.0    156.0  36.7   
330            8    118.0           72.0           19.0    156.0  23.1   
370            3    173.0           82.0           48.0    156.0  38.4   
371            0    118.0           64.0           23.0     89.0  32.5   
395            2    127.0           58.0           24.0    275.0  27.7   
445            0    180.0           78.0           29.0     14.0  32.4   
593            2     82.0           52.0           22.0    115.0  28.5   
621            2     92.0           76.0           20.0    156.0  24.2   DiabetesPedigreeFunction  Age  Outcome   z_score  is_outlier  
4                       2.288   33        1  5.481337        True  
45                      1.893   25        1  4.289167        True  
58                      1.781   44        0  3.951134        True  
228                     2.329   31        0  5.605081        True  
330                     1.476   46        0  3.030598        True  
370                     2.137   25        1  5.025596        True  
371                     1.731   21        0  3.800226        True  
395                     1.600   25        0  3.404849        True  
445                     2.420   25        1  5.879733        True  
593                     1.699   25        0  3.703646        True  
621                     1.698   28        0  3.700627        True

數據預處理之后的描述信息

In?[34]:

data.drop(columns=['z_score']).describe().round(2)

Out[34]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.00	768.00	768.00	768.0	768.00	768.00	768.00	768.00	768.00
mean	3.85	121.69	72.39	28.9	146.22	32.29	0.45	33.24	0.35
std	3.37	30.44	12.10	8.2	56.27	6.53	0.28	11.76	0.48
min	0.00	44.00	24.00	7.0	14.00	18.20	0.08	21.00	0.00
25%	1.00	99.75	64.00	25.0	121.50	27.50	0.24	24.00	0.00
50%	3.00	117.00	72.00	29.0	156.00	32.40	0.37	29.00	0.00
75%	6.00	140.25	80.00	32.0	156.00	36.42	0.60	41.00	1.00
max	17.00	199.00	122.00	54.0	402.00	52.90	1.46	81.00	1.00

In?[35]:

data.head(10)

Out[35]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome	z_score	is_outlier
0	6	148.0	72.0	35.0	156.0	33.6	0.6270	50	1	0.468187	False
1	1	85.0	66.0	29.0	156.0	26.6	0.3510	31	0	-0.364823	False
2	8	183.0	64.0	29.0	156.0	23.3	0.6720	32	1	0.604004	False
3	1	89.0	66.0	23.0	94.0	28.1	0.1670	21	0	-0.920163	False
4	0	137.0	40.0	35.0	168.0	43.1	0.3725	33	1	5.481337	True
5	5	116.0	74.0	29.0	156.0	25.6	0.2010	30	0	-0.817546	False
6	3	78.0	50.0	32.0	88.0	31.0	0.2480	26	1	-0.675693	False
7	10	115.0	72.0	29.0	156.0	35.3	0.1340	29	0	-1.019762	False
8	2	197.0	70.0	45.0	156.0	30.5	0.1580	53	1	-0.947326	False
9	8	125.0	96.0	29.0	156.0	32.5	0.2320	54	1	-0.723983	False

三、確定糖尿病數據集中的目標值與特征變量

確定實驗二的目標變量與特征變量

In?[22]:

X=data.drop(columns=['Outcome','z_score','is_outlier'])##特征變量（刪除目標變量，其余的數據為特征變量）
y=data['Outcome']##目標變量 ----0為正常人，1為患有糖尿病

In?[23]:

X##特征變量

Out[23]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age
0	6	148.0	72.0	35.0	156.0	33.6	0.6270	50
1	1	85.0	66.0	29.0	156.0	26.6	0.3510	31
2	8	183.0	64.0	29.0	156.0	23.3	0.6720	32
3	1	89.0	66.0	23.0	94.0	28.1	0.1670	21
4	0	137.0	40.0	35.0	168.0	43.1	0.3725	33
...	...	...	...	...	...	...	...	...
763	10	101.0	76.0	48.0	180.0	32.9	0.1710	63
764	2	122.0	70.0	27.0	156.0	36.8	0.3400	27
765	5	121.0	72.0	23.0	112.0	26.2	0.2450	30
766	1	126.0	60.0	29.0	156.0	30.1	0.3490	47
767	1	93.0	70.0	31.0	156.0	30.4	0.3150	23

768 rows × 8 columns

確定實驗一的目標變量與特征變量

In?[24]:

##寫一個類方法做一個數據轉換操作，將1轉換成糖尿病患者，0轉換成正常人
data1=data
def tn_ftn(Outcome):if Outcome==1:return '糖尿病患者'else:return '正常人'
data1['result']=data1['Outcome'].apply(tn_ftn)##目標變量
y1=data1['result']

In?[25]:

X#特征變量

Out[25]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age
0	6	148.0	72.0	35.0	156.0	33.6	0.6270	50
1	1	85.0	66.0	29.0	156.0	26.6	0.3510	31
2	8	183.0	64.0	29.0	156.0	23.3	0.6720	32
3	1	89.0	66.0	23.0	94.0	28.1	0.1670	21
4	0	137.0	40.0	35.0	168.0	43.1	0.3725	33
...	...	...	...	...	...	...	...	...
763	10	101.0	76.0	48.0	180.0	32.9	0.1710	63
764	2	122.0	70.0	27.0	156.0	36.8	0.3400	27
765	5	121.0	72.0	23.0	112.0	26.2	0.2450	30
766	1	126.0	60.0	29.0	156.0	30.1	0.3490	47
767	1	93.0	70.0	31.0	156.0	30.4	0.3150	23

768 rows × 8 columns

4、糖尿病數據預測模型

實驗一：

測試數據

In?[40]:

##測試數據
data1.iloc[20:40,:].drop(columns=['Outcome','z_score','is_outlier'])

Out[40]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	result
20	3	126.0	88.0	41.0	235.0	39.3	0.704	27	正常人
21	8	99.0	84.0	29.0	156.0	35.4	0.388	50	正常人
22	7	196.0	90.0	29.0	156.0	39.8	0.451	41	糖尿病患者
23	9	119.0	80.0	35.0	156.0	29.0	0.263	29	糖尿病患者
24	11	143.0	94.0	33.0	146.0	36.6	0.254	51	糖尿病患者
25	10	125.0	70.0	26.0	115.0	31.1	0.205	41	糖尿病患者
26	7	147.0	76.0	29.0	156.0	39.4	0.257	43	糖尿病患者
27	1	97.0	66.0	15.0	140.0	23.2	0.487	22	正常人
28	13	145.0	82.0	19.0	110.0	22.2	0.245	57	正常人
29	5	117.0	92.0	29.0	156.0	34.1	0.337	38	正常人
30	5	109.0	75.0	26.0	156.0	36.0	0.546	60	正常人
31	3	158.0	76.0	36.0	245.0	31.6	0.851	28	糖尿病患者
32	3	88.0	58.0	11.0	54.0	24.8	0.267	22	正常人
33	6	92.0	92.0	29.0	156.0	19.9	0.188	28	正常人
34	10	122.0	78.0	31.0	156.0	27.6	0.512	45	正常人
35	4	103.0	60.0	33.0	192.0	24.0	0.966	33	正常人
36	11	138.0	76.0	29.0	156.0	33.2	0.420	35	正常人
37	9	102.0	76.0	37.0	156.0	32.9	0.665	46	糖尿病患者
38	2	90.0	68.0	42.0	156.0	38.2	0.503	27	糖尿病患者
39	4	111.0	72.0	47.0	207.0	37.1	1.390	56	糖尿病患者

預測診斷結果

In?[15]:

import pandas as pd
##忽略警告
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression      
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
import numpy as npdef lg_hgui():X_train,X_test,y_train,y_test=train_test_split(X,y1,test_size=0.3,random_state=25)lg=LogisticRegression(penalty='l2',max_iter=5)lg.fit(X_train,y_train)X_test1=data.iloc[20:40,:8]print("邏輯回歸預測結果：",lg.predict(X_test1))def jue_cs():X_train,X_test,y_train,y_test=train_test_split(X,y1,test_size=0.3,random_state=25)jcs=DecisionTreeClassifier(criterion='gini',max_depth=3,splitter='best')jcs.fit(X_train,y_train)X_test1=data.iloc[20:40,:8]print("決策樹預測結果：",jcs.predict(X_test1))def sj_sl():X_train,X_test,y_train,y_test=train_test_split(X,y1,test_size=0.3,random_state=25)sj=RandomForestClassifier(n_estimators=19,max_leaf_nodes=7,max_depth=4)sj.fit(X_train,y_train)X_test1=data.iloc[20:40,:8]print("隨機森林預測結果：",sj.predict(X_test1))def in_out():print("預測結果結束！")print("真實數據：",data.iloc[20:40,9:]['result'].values)   
print("\n")
while True:model=input("請輸入選擇的模型!- - - - - - - - - - - - - - - - - - -")if model == '邏輯回歸':lg_hgui()print("\n")elif model == '決策樹':jue_cs()print("\n")elif model=='隨機森林':sj_sl()else:print("\n")in_out()break

真實數據： ['正常人' '正常人' '糖尿病患者' '糖尿病患者' '糖尿病患者' '糖尿病患者' '糖尿病患者' '正常人' '正常人' '正常人''正常人' '糖尿病患者' '正常人' '正常人' '正常人' '正常人' '正常人' '糖尿病患者' '糖尿病患者' '糖尿病患者']

邏輯回歸預測結果： ['正常人' '正常人' '糖尿病患者' '正常人' '正常人' '正常人' '糖尿病患者' '正常人' '糖尿病患者' '正常人' '正常人''糖尿病患者' '正常人' '正常人' '正常人' '正常人' '正常人' '正常人' '正常人' '正常人']

決策樹預測結果： ['糖尿病患者' '正常人' '糖尿病患者' '正常人' '糖尿病患者' '糖尿病患者' '糖尿病患者' '正常人' '正常人' '正常人''正常人' '糖尿病患者' '正常人' '正常人' '正常人' '正常人' '糖尿病患者' '正常人' '正常人' '正常人']

隨機森林預測結果： ['正常人' '正常人' '糖尿病患者' '正常人' '糖尿病患者' '正常人' '糖尿病患者' '正常人' '正常人' '正常人' '正常人''糖尿病患者' '正常人' '正常人' '正常人' '正常人' '糖尿病患者' '糖尿病患者' '正常人' '正常人']

預測結果結束！

實驗二：

混淆矩陣、模型評估報告、準確率

基于邏輯回歸模型糖尿病的預測模型

In?[1288]:

%%time
import pandas as pd
from sklearn import metrics
##忽略警告
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression      
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.model_selection import cross_val_score
import numpy as np
def lg_re():X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=25)sc = StandardScaler()X_train = sc.fit_transform(X_train)X_test = sc.transform(X_test)lg=LogisticRegression(penalty='l2',max_iter=5)lg.fit(X_train,y_train)y_predict=lg.predict(X_test)print('邏輯回歸混淆矩陣:')confusion_matrix=metrics.confusion_matrix(y_test,y_predict)plt.figure(figsize=(3, 3))# 設置x軸和y軸的刻度標簽heatmap = plt.imshow(confusion_matrix, cmap=plt.cm.Reds)# # 去掉網格線plt.grid(False)for i in range(confusion_matrix.shape[0]):for j in range(confusion_matrix.shape[1]):plt.text(j, i, format(confusion_matrix[i, j], 'd'), ha="center", va="center")plt.colorbar(heatmap)plt.xticks([0,1])plt.yticks([1,0])plt.xlabel('Predicted labels')plt.ylabel('True labels')plt.show()print("\n")print("邏輯回歸模型評估報告:")print(classification_report(y_test,y_predict))#模型評估報告print("\n")# print("邏輯回歸準確率:")print("邏輯回歸準確率:",accuracy_score(y_test,y_predict).round(2))#準確率score_tr=lg.score(X_train,y_train)score_te=lg.score(X_test,y_test)print("邏輯回歸模型訓練集準確率：",score_tr.round(2))print("邏輯回歸模型測試集準確率：",score_te.round(2))score_tc= cross_val_score(lg,X,y,cv=10,scoring = 'accuracy')#使用交叉驗證print("邏輯回歸十次交叉驗證準確率:",score_tc.round(2))
lg_re()##邏輯回歸模型的準確率約為0.82

邏輯回歸混淆矩陣:

邏輯回歸模型評估報告:precision    recall  f1-score   support0       0.86      0.88      0.87       1601       0.72      0.68      0.70        71accuracy                           0.82       231macro avg       0.79      0.78      0.78       231
weighted avg       0.82      0.82      0.82       231邏輯回歸準確率: 0.82
邏輯回歸模型訓練集準確率： 0.76
邏輯回歸模型測試集準確率： 0.82
邏輯回歸十次交叉驗證準確率: [0.69 0.69 0.68 0.62 0.69 0.77 0.7  0.73 0.71 0.66]
CPU times: total: 734 ms
Wall time: 720 ms

基于決策樹模型糖尿病的預測模型

In?[818]:

%%time
from sklearn.tree import DecisionTreeClassifier
sc = StandardScaler()
X= sc.fit_transform(X)
def j_cs():X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=30)sc = StandardScaler()X_train = sc.fit_transform(X_train)X_test = sc.transform(X_test)clf=DecisionTreeClassifier(criterion='gini',max_depth=3,splitter='best')clf.fit(X_train,y_train)y_predict=clf.predict(X_test)print('決策樹混淆矩陣：')confusion_matrix=metrics.confusion_matrix(y_test,y_predict)plt.figure(figsize=(3, 3))# 設置x軸和y軸的刻度標簽heatmap = plt.imshow(confusion_matrix, cmap=plt.cm.Reds)for i in range(confusion_matrix.shape[0]):for j in range(confusion_matrix.shape[1]):plt.text(j, i, format(confusion_matrix[i, j], 'd'), ha="center", va="center")plt.colorbar(heatmap)# # 去掉網格線plt.grid(False)plt.xticks([0,1])plt.yticks([1,0])plt.xlabel('Predicted labels')plt.ylabel('True labels')plt.show()print("\n")print('決策樹模型評估報告：')print(classification_report(y_test,y_predict))print('\n')print('決策樹準確率：',accuracy_score(y_test,y_predict).round(2))print("決策樹模型訓練集準確率：",clf.score(X_train,y_train).round(2))print("決策樹模型測試集準確率：",clf.score(X_test,y_test).round(2))score_tc= cross_val_score(clf,X,y,cv=10,scoring = 'accuracy')#使用交叉驗證print("決策樹十次交叉驗證準確率:",score_tc.round(2))
j_cs()##決策樹模型的準確率約為0.78

決策樹混淆矩陣：

決策樹模型評估報告：precision    recall  f1-score   support0       0.82      0.89      0.85       1591       0.69      0.56      0.62        72accuracy                           0.78       231macro avg       0.75      0.72      0.73       231
weighted avg       0.78      0.78      0.78       231決策樹準確率： 0.78
決策樹模型訓練集準確率： 0.78
決策樹模型測試集準確率： 0.78
決策樹十次交叉驗證準確率: [0.73 0.73 0.74 0.68 0.71 0.75 0.71 0.81 0.71 0.78]
CPU times: total: 844 ms
Wall time: 839 ms

基于隨機森林模型糖尿病的預測模型

In?[1280]:

%%time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.model_selection import cross_val_score
def sj_sl():X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=25)sc = StandardScaler()X_train = sc.fit_transform(X_train)X_test = sc.transform(X_test)rfc=RandomForestClassifier(n_estimators=19,max_leaf_nodes=7,max_depth=4)rfc.fit(X_train,y_train)y_predict=rfc.predict(X_test)print('隨機森林混淆矩陣：')confusion_matrix=metrics.confusion_matrix(y_test,y_predict)plt.figure(figsize=(3, 3))# 設置x軸和y軸的刻度標簽heatmap = plt.imshow(confusion_matrix, cmap=plt.cm.Reds)for i in range(confusion_matrix.shape[0]):for j in range(confusion_matrix.shape[1]):plt.text(j, i, format(confusion_matrix[i, j], 'd'), ha="center", va="center")# # 去掉網格線plt.grid(False)plt.colorbar(heatmap)plt.xticks([0,1])plt.yticks([1,0])plt.xlabel('Predicted labels')plt.ylabel('True labels')plt.show()print('\n')print('隨機森林模型評估報告：')print(classification_report(y_test,y_predict))print('\n')print('隨機森林準確率：',accuracy_score(y_test,y_predict).round(2))print("隨機森林模型訓練集準確率：",rfc.score(X_train,y_train).round(2))print("隨機森林模型測試集準確率：",rfc.score(X_test,y_test).round(2))score_tc= cross_val_score(rfc,X,y,cv=10,scoring = 'accuracy')#使用交叉驗證print("隨機森林十次交叉驗證準確率:",score_tc.round(2))
sj_sl()##隨機森林模型的準確率約為0.84

隨機森林混淆矩陣：

隨機森林模型評估報告：precision    recall  f1-score   support0       0.87      0.90      0.88       1601       0.75      0.69      0.72        71accuracy                           0.84       231macro avg       0.81      0.80      0.80       231
weighted avg       0.83      0.84      0.83       231隨機森林準確率： 0.84
隨機森林模型訓練集準確率： 0.79
隨機森林模型測試集準確率： 0.84
隨機森林十次交叉驗證準確率: [0.73 0.73 0.75 0.64 0.73 0.78 0.78 0.78 0.7  0.82]
CPU times: total: 1.89 s
Wall time: 1.87 s

邏輯回歸、決策樹、隨機森林十次驗證準確率

In?[191]:

##導包
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['font.family'] = ['SimHei']   #設置字體為黑體
plt.rcParams['axes.unicode_minus'] = False #解決保存圖像時負號“-”顯示為方塊的問題
#由上述分別得到邏輯回歸、決策樹、隨機森林的十次交叉驗證準確率
##邏輯回歸十次交叉驗證準確率0.69 0.69 0.68 0.62 0.69 0.77 0.7  0.73 0.71 0.66
y1_Logistic=np.array([0.69,0.69,0.68,0.62,0.69,0.77,0.7,0.73,0.71,0.66]).tolist()
##決策樹十次交叉驗證準確率0.73 0.73 0.74 0.68 0.71 0.75 0.71 0.81 0.71 0.78
y2_Decision=np.array([0.73,0.73,0.74,0.68,0.71,0.75,0.71,0.81,0.71,0.78]).tolist()
##隨機森林十次交叉驗證準確率0.73,0.73,0.75,0.64,0.73,0.78,0.78,0.78,0.7,0.82
y3_Random=np.array([0.73,0.73,0.75,0.64,0.73,0.78,0.78,0.78,0.7,0.82]).tolist()
##因為是十次所以現在設置x軸時，要確定x軸的范圍是1~10
x_data=[1,2,3,4,5,6,7,8,9,10]
plt.plot(x_data,y1_Logistic,color="red" ,label="邏輯回歸")
plt.plot(x_data,y2_Decision,color="skyblue" ,label="決策樹")
plt.plot(x_data,y3_Random,color="blue" ,label="隨機森林")
plt.xticks(range(1,11))
plt.yticks([0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,1.00])
plt.legend()
plt.xlabel("十次交叉驗證")
plt.ylabel("十次交叉驗證準確率")
plt.show()

邏輯回歸準確率、決策樹準確率、隨機森林準確率柱形圖

In?[196]:

import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
plt.rcParams['font.sans-serif']='SimHei'# 用來正常顯示中文標簽
plt.rcParams['axes.unicode_minus']=False  # 用來正常顯示負號
import pandas as pd# 假設我們有一些數據
data = {'Model': ['邏輯回歸', '決策樹', '隨機森林'],'Value': [0.82, 0.78, 0.84]
}# 將數據轉換為Pandas DataFrame
df = pd.DataFrame(data)# 使用Seaborn的 barplot函數繪制柱形圖
# 在這里，我們不需要hue參數，因為我們只有一個分類變量
plt.figure(figsize=(8, 8))
sns.barplot(x='Model', y='Value', data=df)
# # 去掉網格線
plt.grid(False)
# 添加標題和軸標簽
plt.title('三種算法模型的準確率比較',fontsize=20,color="blue")
plt.xlabel('模型',fontsize=15,color="purple")
plt.ylabel('準確率',fontsize=15,color="purple")# 在每個柱子上方添加準確率數值
for i, v in enumerate(df['Value']):plt.text(i, v + 0.01, f"{v:.2f}", ha='center', va='bottom',bbox=dict(facecolor='skyblue', alpha=0.5))# 顯示圖表
plt.show()

In?[194]:

import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
plt.rcParams['font.sans-serif']='SimHei'# 用來正常顯示中文標簽
plt.rcParams['axes.unicode_minus']=False  # 用來正常顯示負號
import pandas as pd# 假設我們有一些數據
data = {'Model': ['邏輯回歸', '決策樹', '隨機森林'],'Value': [0.0996, 0.1385, 0.0952]
}# 將數據轉換為Pandas DataFrame
df = pd.DataFrame(data)# 使用Seaborn的 barplot函數繪制柱形圖
# 在這里，我們不需要hue參數，因為我們只有一個分類變量
plt.figure(figsize=(8, 8))
sns.barplot(x='Model', y='Value', data=df)
# # 去掉網格線
plt.grid(False)
# 添加標題和軸標簽
plt.title('混淆矩陣的假陰率比較',fontsize=20,color="blue")
plt.xlabel('模型',fontsize=15,color="purple")
# 在每個柱子上方添加準確率數值（百分比形式）
for i, v in enumerate(df['Value']):plt.text(i, v + 0.001, f"{v*100:.2f}%", ha='center', va='bottom',bbox=dict(facecolor='skyblue', alpha=0.5))  # 將浮點數轉換為百分比并保留一位小數
ax=plt.gca()
frame=plt.gca()
# y 軸不可見
frame.axes.get_yaxis().set_visible(False)
##去除x軸橫線
for spine in ax.spines.values():spine.set_visible(False)
plt.show()