section8

The goal of this chapter is to identify the target user groups, in order to better serve existing users.

[Knowledge points]

1. Plotting

  • Displaying Chinese characters

plt.rcParams['font.sans-serif'] = ['SimHei']   # step 1: use a font with CJK glyphs
plt.rcParams['axes.unicode_minus'] = False     # step 2: fix minus-sign rendering on axes

2. Database access

  • The SQLAlchemy engine

engine = create_engine('mysql+pymysql://root:123456@localhost:3306/datascience')

3. Reading files in batch

  • Usage of os.walk() and os.path.join()

for root, dirs, files in os.walk(path):
    for file in files:
        rfile = os.path.join(root, file)
        if rfile.split('.')[-1] == 'tsv':
            rdf = pd.read_csv(rfile, sep='\t')
            df = df.append(rdf)

4. Combining groupby() and agg() to apply different functions to different columns

  • Monthly aggregation

affc = {'payment':'sum', 'log_date':'count'}
dfm = df.groupby(['log_month', 'user_id']).agg(affc).reset_index()

  • Renaming columns

renam = {'log_date':'access_days'}
dfm.rename(columns=renam, inplace=True)

5. Using KMeans clustering

  • Clustering a single column (use reshape(-1, 1) to format it as a single-column 2-D array)

from sklearn.cluster import KMeans
a47 = action['A47'].values.reshape(-1, 1)
kms = KMeans(n_clusters=3).fit(a47)

  • The labels_ attribute holds the cluster labels

cluster = kms.labels_

  • Attach the labels to the source data and inspect the group sizes with groupby()

action['cluster'] = cluster
action.groupby(['cluster'])['user_id'].count()

  • Visualising the groups
snsdf = action[['user_id','A47','cluster']].sort_values(by='A47',ascending=False)
plt.figure(figsize=(8,5))
snsdf1 = snsdf.reset_index()
snsdf1[snsdf1['cluster']==2]['A47'].plot(color='r',label='2:重度用戶')
snsdf1[snsdf1['cluster']==1]['A47'].plot(color='g',label='1:中度用戶')
snsdf1[snsdf1['cluster']==0]['A47'].plot(color='b',label='0:輕度用戶')
plt.legend()
plt.xlabel('用戶分布')
plt.ylabel('排行榜得分')

6. Principal component analysis

  • Data preprocessing

    • Extract the columns for PCA
      paction = acc.iloc[:,3:(len(acc.columns)-1)]
    • Drop columns that are mostly zeros
      cc = paction[paction==0].count(axis=0)/len(paction)
      cc.plot()
      dd = cc[cc<.9]            # drop columns where 90% or more of the values are zero
      paction = paction[dd.index]
      paction.head()
    • Drop strongly correlated columns

      # Data overview
      corp = paction.corr()
      sns.heatmap(corp)
      mask = np.array(corp)
      mask[np.tril_indices_from(mask)] = False        # trick for drawing a lower-triangle heatmap
      sns.heatmap(corp, mask=mask)                   # the lower-triangle view makes strongly correlated pairs easy to spot
      coll = corp.columns
      corp = pd.DataFrame(np.tril(corp, -1))         # np.tril(m, -1) keeps the lower triangle and zeros the rest
      corp.columns = coll
      pac2 = paction.loc[:,(corp.abs()<.8).all()]      # all() keeps columns whose every correlation is below 0.8
      pac2.head()
    • Run the PCA

      from sklearn.decomposition import PCA
      pca = PCA()
      pca.fit(pac2)
      redio = pca.explained_variance_ratio_          # explained-variance ratio of each component after PCA
      print(redio) 
      print(pca.singular_values_)                # singular_values_ holds the singular values
    • Cumulative explained-variance curve of the components

      recu = redio.cumsum()                     # cumsum() accumulates the ratios
      plt.plot(recu)
    • Obtain the reduced data for the next step

      pca.set_params(n_components=10)              # keep 10 components
      pac3 = pd.DataFrame(pca.fit_transform(pac2))     # fit_transform() fits and returns the reduced data
      pac3.head()
    • Run KMeans again on the reduced data to classify all users, then average each behaviour column over the users in each class
    • Apply the correlation filter again to drop strongly correlated columns, leaving the final key indicators
    • Present the key indicators in a radar chart

      # First, standardise the data
      from sklearn.preprocessing import scale
      ccccc = pd.DataFrame(scale(cccc))
      ccccc.columns = cccc.columns
      # Plot
      plt.figure(figsize=(8,8))                  
      N = ccccc.shape[1]                      # number of polar axes
      angles = np.linspace(0, 2*np.pi, N, endpoint=False)    # angles that split the circle evenly for the radar chart
      angles = np.concatenate((angles,[angles[0]]))   # close the radar loop
      for i in range(len(ccccc)):
          values = ccccc.loc[i,:]                         # one row of data
          values = np.concatenate((values,[values[0]]))   # close the loop
          plt.polar(angles, values, 'o-', linewidth=2)    # draw
      plt.legend(ccccc.index, loc='lower right')
      plt.thetagrids(angles * 180/np.pi, labels=list(ccccc.columns))    # add the polar-axis labels
      plt.title('重要指標雷達圖呈現')
      

I. Imports and displaying Chinese in matplotlib

import pandas as pd
import numpy as np
import pymysql
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import os

plt.rcParams['font.sans-serif'] = ['SimHei']  # step 1: use a font with CJK glyphs
plt.rcParams['axes.unicode_minus'] = False    # step 2: fix minus-sign rendering on axes
%matplotlib inline

Database engine

engine = create_engine('mysql+pymysql://root:123456@localhost:3306/datascience')

II. Reading files in batch

def read_files(path):
    df = pd.DataFrame()
    for root, dirs, files in os.walk(path):
        for file in files:
            rfile = os.path.join(root, file)
            if rfile.split('.')[-1] == 'tsv':
                rdf = pd.read_csv(rfile, sep='\t')
                df = df.append(rdf)
    return df
action_path  = 'data/sample-data/section8/daily/action/'
dau_path = 'data/sample-data/section8/daily/dau/'
dpu_path = 'data/sample-data/section8/daily/dpu/'

action = read_files(action_path)
dau = read_files(dau_path)
dpu = read_files(dpu_path)
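Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. A sketch of an equivalent reader (assuming the same .tsv directory layout as above) that collects the frames in a list and concatenates them once:

```python
import os
import pandas as pd

def read_files_concat(path):
    """Collect all .tsv files under `path` and concatenate them once."""
    frames = []
    for root, dirs, files in os.walk(path):
        for file in files:
            rfile = os.path.join(root, file)
            if rfile.split('.')[-1] == 'tsv':
                frames.append(pd.read_csv(rfile, sep='\t'))
    # A single concat is faster than repeated append and works on pandas >= 2.0.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

A single `pd.concat` also avoids the quadratic copying that repeated `append` incurs on large file trees.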

Check data completeness and preview the head

print(action.isnull().sum().sum())
print(action.shape)
# print(action.info())
action.head()
0
(2653, 57)
(action.head() preview: columns log_date, app_name, user_id, A1 … A54)

5 rows × 57 columns

print(dau.isnull().sum().sum())
print(dau.shape)
print(dau.info())
dau.head()
0
(509754, 3)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 509754 entries, 0 to 2410
Data columns (total 3 columns):
log_date    509754 non-null object
app_name    509754 non-null object
user_id     509754 non-null int64
dtypes: int64(1), object(2)
memory usage: 15.6+ MB
None
   log_date    app_name  user_id
0  2013-05-01  game-01   608801
1  2013-05-01  game-01   712453
2  2013-05-01  game-01   776853
3  2013-05-01  game-01   823486
4  2013-05-01  game-01   113600
print(dpu.isnull().sum().sum())
print(dpu.shape)
print(dpu.info())
dpu.head()
0
(3532, 4)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3532 entries, 0 to 7
Data columns (total 4 columns):
log_date    3532 non-null object
app_name    3532 non-null object
user_id     3532 non-null int64
payment     3532 non-null int64
dtypes: int64(2), object(2)
memory usage: 138.0+ KB
None
   log_date    app_name  user_id  payment
0  2013-05-01  game-01   804005   571
1  2013-05-01  game-01   793537   81
2  2013-05-01  game-01   317717   81
3  2013-05-01  game-01   317717   81
4  2013-05-01  game-01   426525   324
# Write to the database
# action.to_sql('s8_action', engine, index=False)
# dau.to_sql('s8_dau', engine, index=False)
# dpu.to_sql('s8_dpu', engine, index=False)

III. Data preprocessing

1. Merge DAU and DPU

df = pd.merge(dau, dpu[['log_date','user_id','payment']], how='left', on=['user_id','log_date'])
df.head()
   log_date    app_name  user_id  payment
0  2013-05-01  game-01   608801   NaN
1  2013-05-01  game-01   712453   NaN
2  2013-05-01  game-01   776853   NaN
3  2013-05-01  game-01   823486   NaN
4  2013-05-01  game-01   113600   NaN
# Set payment to 0 where there is no purchase record
print(df.payment.isnull().sum())
df['payment'].fillna(0, inplace=True)
print(df.payment.isnull().sum())
507151
0
# Add a paying-user flag
df['is_pay'] = df['payment'].apply( lambda x: 1 if x>0 else 0 )
df.head()
   log_date    app_name  user_id  payment  is_pay
0  2013-05-01  game-01   608801   0.0      0
1  2013-05-01  game-01   712453   0.0      0
2  2013-05-01  game-01   776853   0.0      0
3  2013-05-01  game-01   823486   0.0      0
4  2013-05-01  game-01   113600   0.0      0

2. Monthly aggregation

# Add a month column
df['log_month'] = df['log_date'].apply(lambda x: x[0:7])
df.head()
   log_date    app_name  user_id  payment  is_pay  log_month
0  2013-05-01  game-01   608801   0.0      0       2013-05
1  2013-05-01  game-01   712453   0.0      0       2013-05
2  2013-05-01  game-01   776853   0.0      0       2013-05
3  2013-05-01  game-01   823486   0.0      0       2013-05
4  2013-05-01  game-01   113600   0.0      0       2013-05
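The string slice `x[0:7]` relies on `log_date` being formatted `YYYY-MM-DD`. A more robust sketch (not the original code) parses the dates and derives the month as a period:

```python
import pandas as pd

df = pd.DataFrame({'log_date': ['2013-05-01', '2013-06-15']})
# to_period('M') gives the calendar month regardless of day-level formatting
df['log_month'] = pd.to_datetime(df['log_date']).dt.to_period('M').astype(str)
print(df['log_month'].tolist())  # ['2013-05', '2013-06']
```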

Cleverly combining the groupby and agg functions, compute each user's monthly payment total and number of login days.

# Aggregate by month
affc = {'payment':'sum', 'log_date':'count'}
dfm = df.groupby(['log_month', 'user_id']).agg(affc).reset_index()
# Rename the column
renam = {'log_date':'access_days'}
dfm.rename(columns=renam, inplace=True)
dfm.head()
   log_month  user_id  payment  access_days
0  2013-05    65       0.0      1
1  2013-05    115      0.0      1
2  2013-05    194      0.0      1
3  2013-05    426      0.0      4
4  2013-05    539      0.0      1

4. Use KMeans to classify users and identify the top-ranked ones, i.e. heavy / medium / light users

Column A47 is the leaderboard score. The distribution shows that most users score very low, consistent with a power law.

action['A47'].hist(bins=50, figsize=(6,4))
<matplotlib.axes._subplots.AxesSubplot at 0x1c21d894240>

png

sns.distplot(action['A47'],bins=50,kde=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1c21af07a58>

png

Cluster column A47 into 3 groups

from sklearn.cluster import KMeans

a47 = action['A47'].values.reshape(-1, 1)   # .values avoids the deprecated Series.reshape
kms = KMeans(n_clusters=3).fit(a47)
cluster = kms.labels_
kms.cluster_centers_
array([[  9359.84787792],
       [ 69386.11297071],
       [185857.17948718]])
action['cluster'] = cluster
action.head()
(action.head() preview: columns log_date, app_name, user_id, A1 … A54, cluster)

5 rows × 58 columns

action.groupby(['cluster'])['user_id'].count()
cluster
0    2096
1     479
2      78
Name: user_id, dtype: int64

The clustering splits users into three groups: 0 is light users with the lowest leaderboard scores; 1 is medium users with middling scores; 2 is heavy users with high scores and few members, which matches expectations.
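One caveat: KMeans numbers its clusters arbitrarily, so 0/1/2 will not necessarily line up with light/medium/heavy on a rerun. A sketch (on made-up scores, not the chapter's data) of mapping labels to stable names by ranking the cluster centres:

```python
import numpy as np
from sklearn.cluster import KMeans

scores = np.array([1, 2, 3, 100, 110, 120, 5000, 5100]).reshape(-1, 1)
kms = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)

# Rank cluster ids by their centre value, then name them in that order.
order = np.argsort(kms.cluster_centers_.ravel())
names = {cid: name for cid, name in zip(order, ['light', 'medium', 'heavy'])}
labels = [names[c] for c in kms.labels_]
print(labels)
```

With this mapping the names depend only on the centres, not on the run-to-run label numbering.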

snsdf = action[['user_id','A47','cluster']].sort_values(by='A47',ascending=False)
snsdf['user'] = range(len(snsdf))
sns.scatterplot(x='user',y='A47',hue='cluster',data=snsdf, palette='rainbow', alpha=.2)
<matplotlib.axes._subplots.AxesSubplot at 0x1c21b9bf898>

png

snsdf = action[['user_id','A47','cluster']].sort_values(by='A47',ascending=False)
snsdf['user'] = range(len(snsdf))

plt.figure(figsize=(8,5))
snsdf1 = snsdf.reset_index()
snsdf1[snsdf1['cluster']==2]['A47'].plot(color='r',label='2:重度用戶')
snsdf1[snsdf1['cluster']==1]['A47'].plot(color='g',label='1:中度用戶')
snsdf1[snsdf1['cluster']==0]['A47'].plot(color='b',label='0:輕度用戶')
plt.legend()
plt.xlabel('用戶分布')
plt.ylabel('排行榜得分')
Text(0,0.5,'排行榜得分')

png

Restrict the data to the top-ranked users, i.e. the higher-scoring heavy and medium users, for the analysis that follows

acc = action[action['cluster']>=1]
acc.head()
(acc.head() preview: columns log_date, app_name, user_id, A1 … A54, cluster)

5 rows × 58 columns

5. Principal component analysis

Extract the key features

paction = acc.iloc[:,3:(len(acc.columns)-1)]
paction.index=acc.user_id
paction.head()
(paction.head() preview: columns A1 … A54, indexed by user_id)

5 rows × 54 columns

1. Drop columns that are mostly zeros

cc = paction[paction==0].count(axis=0)/len(paction)
print(cc.head())
cc.plot()
A1    1.000000
A2    0.926391
A3    1.000000
A4    0.994614
A5    0.055655
dtype: float64

<matplotlib.axes._subplots.AxesSubplot at 0x1c21bbb1470>

png

# cc[cc>.8]
dd = cc[cc<.95]
paction = paction[dd.index]
paction.head()
(paction.head() preview after dropping zero-heavy columns: 32 columns A2, A5, A6, …, A54, indexed by user_id)

5 rows × 32 columns

2. Drop strongly correlated columns

corp = paction.corr()
plt.figure(figsize=(15,8))
sns.heatmap(corp)
<matplotlib.axes._subplots.AxesSubplot at 0x1c21bc094a8>

png

Functions used to draw a lower-triangle heatmap

mask = np.array(corp)
mask[np.tril_indices_from(mask)] = False
fig,ax = plt.subplots()
fig.set_size_inches(15,8)
sns.heatmap(corp,mask=mask)
<matplotlib.axes._subplots.AxesSubplot at 0x1c21bc09400>

png

Take the lower triangle of the matrix; to take the upper triangle instead, use np.triu(m, 1)

coll = corp.columns
corp = pd.DataFrame(np.tril(corp, -1))
corp.columns = coll
corp.head()
(corp.head() preview: the lower-triangular correlation matrix; the upper triangle is zeroed)

5 rows × 32 columns

pac2 = paction.loc[:,(corp.abs()<.7).all()]      # keep columns whose every correlation is below 0.7
pac2.head()
(pac2.head() preview: 16 remaining columns A2, A11, A12, A13, A20, A23, A24, A43, A44, A46, A48, A49, A50, A51, A53, A54, indexed by user_id)
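The selector `(corp.abs() < .7).all()` works because `DataFrame.all()` reduces over rows by default, yielding one boolean per column; `.loc[:, mask]` then keeps only the columns whose every lower-triangle correlation is under the threshold. A minimal illustration with a hypothetical correlation fragment:

```python
import numpy as np
import pandas as pd

# A toy lower-triangle correlation fragment: column 'b' has one value >= 0.7.
corr = pd.DataFrame({'a': [0.0, 0.0], 'b': [0.9, 0.0], 'c': [0.2, 0.1]},
                    index=['x', 'y'])
mask = (corr.abs() < 0.7).all()        # per column: True only if every entry < 0.7
data = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])
kept = data.loc[:, mask]
print(list(kept.columns))  # ['a', 'c']
```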

Run the PCA

from sklearn.decomposition import PCA
pca = PCA()
pca.fit(pac2)
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,svd_solver='auto', tol=0.0, whiten=False)
redio = pca.explained_variance_ratio_
print(redio) 
print(pca.singular_values_)  
[9.97843804e-01 1.92024564e-03 1.20120771e-04 5.57014208e-05
 2.67905481e-05 1.54533752e-05 9.31262940e-06 4.38846214e-06
 3.02317261e-06 8.36725295e-07 1.31874979e-07 9.78197162e-08
 3.86464536e-08 2.94647596e-08 1.82272465e-08 7.54580333e-09]
[3.96183910e+04 1.73797668e+03 4.34684952e+02 2.96004755e+02
 2.05284590e+02 1.55911168e+02 1.21032418e+02 8.30848288e+01
 6.89599635e+01 3.62791414e+01 1.44027941e+01 1.24044853e+01
 7.79687146e+00 6.80796010e+00 5.35458829e+00 3.44523057e+00]
recu = redio.cumsum()
print(recu)
x = np.arange(len(recu))
plt.plot(recu, color='r')
[0.9978438  0.99976405 0.99988417 0.99993987 0.99996666 0.99998212
 0.99999143 0.99999582 0.99999884 0.99999968 0.99999981 0.99999991
 0.99999994 0.99999997 0.99999999 1.        ]
[<matplotlib.lines.Line2D at 0x1c21dadada0>]

png
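Instead of reading the cut-off from the cumulative curve by eye, it can be computed directly; scikit-learn also accepts a float `n_components`, meaning "keep enough components to explain that fraction of the variance". A sketch on random data (not the chapter's frame):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 16) * np.linspace(10, 0.1, 16)  # columns with decaying variance

recu = PCA().fit(X).explained_variance_ratio_.cumsum()
k = int(np.argmax(recu >= 0.99)) + 1   # smallest k reaching 99% explained variance

pca99 = PCA(n_components=0.99).fit(X)  # asks PCA to pick that k automatically
print(k, pca99.n_components_)
```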

Obtain the reduced data

pca.set_params(n_components=10)
pac3 = pd.DataFrame(pca.fit_transform(pac2))
pacsse = pac3.copy()
pac3.head()
(pac3.head() preview: the first 5 rows of the 10 principal-component columns 0 … 9)

6. Cluster with KMeans

from sklearn.cluster import KMeans

km = KMeans(n_clusters=5)
km.fit(pac3)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',random_state=None, tol=0.0001, verbose=0)
clu = km.labels_
pac3['clu'] = clu
pac3.head()
(pac3.head() preview: components 0 … 9 plus the new clu label column)
pac3.groupby('clu')[2].count()
clu
0     90
1    113
2    122
3    109
4    123
Name: 2, dtype: int64

#### palette colour styles:
Accent, Accent_r, Blues, Blues_r, BrBG, BrBG_r, BuGn, BuGn_r, BuPu, BuPu_r, CMRmap, CMRmap_r, Dark2, Dark2_r, GnBu, GnBu_r, Greens, Greens_r, Greys, Greys_r, OrRd, OrRd_r, Oranges, Oranges_r, PRGn, PRGn_r, Paired, Paired_r, Pastel1, Pastel1_r, Pastel2, Pastel2_r, PiYG, PiYG_r, PuBu, PuBuGn, PuBuGn_r, PuBu_r, PuOr, PuOr_r, PuRd, PuRd_r, Purples, Purples_r, RdBu, RdBu_r, RdGy, RdGy_r, RdPu, RdPu_r, RdYlBu, RdYlBu_r, RdYlGn, RdYlGn_r, Reds, Reds_r, Set1, Set1_r, Set2, Set2_r, Set3, Set3_r, Spectral, Spectral_r, Vega10, Vega10_r, Vega20, Vega20_r, Vega20b, Vega20b_r, Vega20c, Vega20c_r, Wistia, Wistia_r, YlGn, YlGnBu, YlGnBu_r, YlGn_r, YlOrBr, YlOrBr_r, YlOrRd, YlOrRd_r, afmhot, afmhot_r, autumn, autumn_r, binary, binary_r, bone, bone_r, brg, brg_r, bwr, bwr_r, cool, cool_r, coolwarm, coolwarm_r, copper, copper_r, cubehelix, cubehelix_r, flag, flag_r, gist_earth, gist_earth_r, gist_gray, gist_gray_r, gist_heat, gist_heat_r, gist_ncar, gist_ncar_r, gist_rainbow, gist_rainbow_r, gist_stern, gist_stern_r, gist_yarg, gist_yarg_r, gnuplot, gnuplot2, gnuplot2_r, gnuplot_r, gray, gray_r, hot, hot_r, hsv, hsv_r, icefire, icefire_r, inferno, inferno_r, jet, jet_r, magma, magma_r, mako, mako_r, nipy_spectral, nipy_spectral_r, ocean, ocean_r, pink, pink_r, plasma, plasma_r, prism, prism_r, rainbow, rainbow_r, rocket, rocket_r, seismic, seismic_r, spectral, spectral_r, spring, spring_r, summer, summer_r, tab10, tab10_r, tab20, tab20_r, tab20b, tab20b_r, tab20c, tab20c_r, terrain, terrain_r, viridis, viridis_r, vlag, vlag_r, winter, winter_r

plt.figure(figsize=(13,7))
sns.scatterplot(x=0, y=1, data=pac3,style='clu',hue='clu', palette='autumn')
<matplotlib.axes._subplots.AxesSubplot at 0x1c21db35438>

png

Attach the cluster labels back to the original data

pac4 = pac2.copy()
pac4['cluster'] = list(pac3.clu)
pac4.head()
(pac4.head() preview: the 16 selected columns plus cluster, indexed by user_id)
# Mean of each cluster
clu5 = pac4.groupby('cluster').mean()
# Drop a highly correlated column
clu5.drop(columns='A53',inplace=True)
c5cor = clu5.corr()
plt.figure(figsize=(15,8))
sns.heatmap(c5cor,annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1c21d92a780>

png

ccrp = pd.DataFrame(np.tril(c5cor,-1))
ccrp.columns = clu5.columns
cccc = clu5.loc[:,(ccrp.abs()<.95).all()]
cccc
         A2        A20       A23       A24       A44       A46        A50        A51       A54
cluster
0        0.022222  0.322222  0.655556  0.167691  0.858193  27.600000  10.666667  2.011111  166.711111
1        0.079646  0.274336  0.362832  0.095231  0.844027  20.159292   3.008850  1.469027  102.106195
2        0.073770  0.377049  0.336066  0.070628  0.849343  24.737705   4.286885  1.844262  121.909836
3        0.018349  0.229358  0.284404  0.098252  0.845981  24.119266   5.266055  1.733945  146.871560
4        0.203252  0.292683  0.243902  0.063686  0.775076  18.983740   2.130081  0.975610   84.032520
from sklearn.preprocessing import scale

ccccc = pd.DataFrame(scale(cccc))
ccccc.columns = cccc.columns
ccccc
   A2        A20       A23       A24       A44       A46       A50       A51       A54
0 -0.855590  0.468859  1.918400  1.862020  0.785882  1.422970  1.867773  1.118457  1.424282
1  0.002962 -0.503392 -0.094337 -0.104961  0.315530 -0.940402 -0.688647 -0.381093 -0.746672
2 -0.084884  1.582038 -0.278379 -0.772826  0.492038  0.513827 -0.261998  0.656909 -0.081200
3 -0.913505 -1.416613 -0.633601 -0.022944  0.380387  0.317394  0.064879  0.351742  0.757602
4  1.851016 -0.130892 -0.912083 -0.961289 -1.973837 -1.313789 -0.982007 -1.746015 -1.354012
plt.figure(figsize=(8,8))
# number of polar axes
N = ccccc.shape[1]
# angles that split the circle evenly for the radar chart
angles = np.linspace(0, 2*np.pi, N, endpoint=False)
# close the radar loop
angles = np.concatenate((angles,[angles[0]]))
for i in range(len(ccccc)):
    # one row of data
    values = ccccc.loc[i,:]
    # close the loop
    values = np.concatenate((values,[values[0]]))
    # draw
    plt.polar(angles, values, 'o-', linewidth=2)
plt.legend(ccccc.index, loc='lower right')
# add the polar-axis labels
plt.thetagrids(angles * 180/np.pi, labels=list(ccccc.columns))
plt.title('重要指標雷達圖呈現')
Text(0.5,1.05,'重要指標雷達圖呈現')

png

Dimensionality reduction without preprocessing

dfp = acc.iloc[:,3:(len(acc.columns)-1)]
dfp.index=acc.user_id
dfp.head()
(dfp.head() preview: columns A1 … A54, indexed by user_id)

5 rows × 54 columns

from sklearn.decomposition import PCA

pca = PCA(whiten=False)
pca.fit(dfp)
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,svd_solver='auto', tol=0.0, whiten=False)
retio = pca.explained_variance_ratio_
# print(retio) 
# print(pca.singular_values_)
rec = retio.cumsum()
print(rec)
x = np.arange(len(rec))
plt.plot(rec, color='r')
[0.9996008  0.99995245 0.99997489 0.99999016 0.9999933  0.99999564
 0.99999759 0.99999838 0.99999897 0.9999995  0.99999962 0.99999972
 0.99999979 0.99999986 0.9999999  0.99999993 0.99999996 0.99999997
 0.99999997 0.99999998 0.99999998 0.99999999 0.99999999 0.99999999
 0.99999999 1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.        ]
[<matplotlib.lines.Line2D at 0x1c21f406780>]

png

pca.set_params(n_components=10)
pacsse = pd.DataFrame(pca.fit_transform(dfp))
pacsse.head()
(pacsse.head() preview: the first 5 rows of the 10 principal-component columns 0 … 9)

Use the elbow method to find the best K

from sklearn.cluster import KMeans

df_features = pacsse  # the data to cluster
# Choose k via SSE
SSE = []  # sum of squared errors for each k
for k in range(1,9):
    estimator = KMeans(n_clusters=k)  # build the clusterer
    estimator.fit(df_features)
    SSE.append(estimator.inertia_)
X = range(1,9)
plt.xlabel('k')
plt.ylabel('SSE')
plt.plot(X,SSE,'o-')
[<matplotlib.lines.Line2D at 0x1c2211cac50>]

png

Clearly, standardising the data first is not appropriate

# Clearly, standardising the data first is not appropriate
df_features = pd.DataFrame(scale(pacsse))
SSE = []
for k in range(1,9):
    estimator = KMeans(n_clusters=k)
    estimator.fit(df_features)
    SSE.append(estimator.inertia_)
X = range(1,9)
plt.xlabel('k')
plt.ylabel('SSE')
plt.plot(X,SSE,'o-')
[<matplotlib.lines.Line2D at 0x1c2213bc438>]

png

km = KMeans(n_clusters=4)
km.fit(pacsse)
clu = km.labels_
pacsse['clu'] = clu
pacsse.head()
(pacsse.head() preview: components 0 … 9 plus the new clu label column)
pacsse.groupby('clu')[2].count()
clu
0    153
1    344
2     54
3      6
Name: 2, dtype: int64
plt.figure(figsize=(13,7))
sns.scatterplot(x=0, y=1, data=pacsse,style='clu',hue='clu', palette='autumn')
<matplotlib.axes._subplots.AxesSubplot at 0x1c22118b668>

png

Clearly, clustering the unpreprocessed data is problematic: the first and second principal components are visibly correlated.

pac4 = pac2.copy()
pac4['cluster'] = list(pacsse.clu)
pac4.head()

clu5 = pac4.groupby('cluster').mean()
clu5.drop(columns='A53',inplace=True)
c5cor = clu5.corr()
plt.figure(figsize=(15,8))
sns.heatmap(c5cor,annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1c22145a4e0>

png

ccrp = pd.DataFrame(np.tril(c5cor,-1))
ccrp.columns = clu5.columns
cccc = clu5.loc[:,(ccrp.abs()<.95).all()]
cccc
         A12       A20       A51       A54
cluster
0        3.398693  0.228758  1.810458  146.287582
1        1.938953  0.316860  1.433140  101.531977
2        4.592593  0.407407  1.870370  169.777778
3        2.166667  0.166667  1.666667  213.833333
from sklearn.preprocessing import scale

ccccc = pd.DataFrame(scale(cccc))
ccccc.columns = cccc.columns
ccccc
   A12       A20       A51       A54
0  0.352533 -0.562784  0.684599 -0.285229
1 -1.021705  0.406288 -1.555764 -1.388557
2  1.476502  1.402249  1.040338  0.293858
3 -0.807330 -1.245753 -0.169173  1.379928
plt.figure(figsize=(8,8))
# number of polar axes
N = ccccc.shape[1]
# angles that split the circle evenly for the radar chart
angles = np.linspace(0, 2*np.pi, N, endpoint=False)
# close the radar loop
angles = np.concatenate((angles,[angles[0]]))
for i in range(len(ccccc)):
    # one row of data
    values = ccccc.loc[i,:]
    # close the loop
    values = np.concatenate((values,[values[0]]))
    # draw
    plt.polar(angles, values, 'o-', linewidth=2)
plt.legend(ccccc.index, loc='lower right')
# add the polar-axis labels
plt.thetagrids(angles * 180/np.pi, labels=list(ccccc.columns))
plt.title('重要指標雷達圖呈現')
Text(0.5,1.05,'重要指標雷達圖呈現')

png

Reposted from: https://www.cnblogs.com/cvlas/p/9537532.html

