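Data Visualization (11): Analyzing a Restaurant Information Table with Pandas: Crosstabs, Outlier Analysis, Multidimensional Analysis, and Other Advanced Operations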


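If you like my blog, remember to leave a like and a follow! Your support is what keeps me writing. The dataset is available in my resource download section.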

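- Case 3: Restaurant information table analysis
  - Question 1: Count restaurants by type and draw a horizontal bar chart
  - Question 2: Count restaurants by city and draw a vertical bar chart
  - Question 3: Use a crosstab to view the number of restaurants of each type in each city
  - Question 4: Find the 10 restaurants with the most reviews
  - Question 5: Find the outliers in per-capita spend (values that are too large) and remove them
  - Question 6: Group by type, compute the maximum, minimum, and mean per-capita spend, and draw a comparative horizontal bar chart
  - Question 7: Draw a scatter plot with service on the x-axis and taste on the y-axis
  - Question 8: Draw a scatter plot with per-capita spend on the x-axis and service, taste, and environment on the y-axis, each in a different color
  - Question 9: For the first-tier cities Beijing, Shanghai, Guangzhou, and Shenzhen, draw four small pie charts in one figure showing the restaurant share of '川菜', '湘菜', '江浙菜', '東北菜', '粵菜', '徽菜', '客家菜', '贛菜', '湖北菜'
  - Question 10: As above, for Beijing, Shanghai, Guangzhou, and Shenzhen, draw four small pie charts in one figure showing the share of the 10 most common restaurant types in each city
  - Question 11: Use jieba to segment all restaurant names and find the 10 most frequent words, keeping only words longer than one character
  - Question 12: Draw a word cloud from the segmentation result above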

Case 3: Restaurant information table analysis

# Prepare the environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']  # a font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly

import warnings
warnings.filterwarnings('ignore')

# Load the data
df = pd.read_csv('data/catering.csv', encoding='gb2312')
df.sample(5)

# Inspect column types and non-null counts
df.info()


# Count the distinct values in each column
df.nunique()

# List the distinct values of the 類型 (restaurant type) column
df['類型'].unique()


Question 1: Count restaurants by type and draw a horizontal bar chart

# Count restaurants by type and draw a horizontal bar chart
s = df.類型.value_counts(ascending=True)
display(s)
fig = plt.figure(figsize=(8,20))
plt.barh(s.index, s.values, height=1)
plt.title('不同類型餐飲店數量對比')
plt.xlabel('店數量')
plt.ylabel('餐飲類型')
plt.grid()
plt.show()

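
To make the sort order in Question 1 easier to see, here is a minimal sketch on a made-up table (the `toy` DataFrame is my own invention, not the real dataset). It only shows how `value_counts(ascending=True)` orders the result before it is handed to `barh`.

```python
import pandas as pd

# Hypothetical toy data; the real table has one row per restaurant.
toy = pd.DataFrame({"類型": ["川菜", "粵菜", "川菜", "火鍋", "川菜", "粵菜"]})

# ascending=True puts the rarest type first. barh draws the first entry
# at the bottom, so the most common type ends up at the top of the chart.
counts = toy["類型"].value_counts(ascending=True)
print(counts)
```

With the default descending order, the most common type would sit at the bottom of the horizontal chart instead.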

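Question 2: Count restaurants by city and draw a vertical bar chart

# Count restaurants by city and draw a vertical bar chart
s = df.城市.value_counts()
display(s)
fig = plt.figure(figsize=(20, 10))
plt.bar(s.index, s.values, width=0.8, color='green', alpha=0.7)
plt.title('不同城市餐飲店數量對比')
plt.xticks(rotation=90)
plt.xlabel('城市')    # cities run along the x-axis
plt.ylabel('店數量')  # bar height is the restaurant count
plt.grid()
plt.show()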


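Question 3: Use a crosstab to view the number of restaurants of each type in each city

# Crosstab of city against restaurant type; margins=True adds the 'All' totals
pd.crosstab(df.城市, df.類型, margins=True).sort_values(by='All', ascending=False)

# Work on a copy so that dropping missing values leaves df untouched
df_ = df.copy()
# Drop the rows where 點評 (review count) is missing
df_.dropna(subset=['點評'], inplace=True)
# Summary statistics of the review counts
display(df_.點評.describe())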

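
For readers who have not used `pd.crosstab` before, the following is a small sketch on invented data (the `toy` DataFrame is an assumption, not the real table) showing what `margins=True` adds and why the `All` row ends up first after sorting.

```python
import pandas as pd

# Hypothetical toy data standing in for the restaurant table.
toy = pd.DataFrame({
    "城市": ["北京", "北京", "上海", "上海", "上海", "廣州"],
    "類型": ["川菜", "粵菜", "川菜", "川菜", "粵菜", "粵菜"],
})

# Rows are cities, columns are types, each cell is a row count.
# margins=True appends an 'All' row and an 'All' column with the totals.
table = pd.crosstab(toy["城市"], toy["類型"], margins=True)

# Sorting by the 'All' column ranks cities by total restaurants; the
# 'All' row holds the grand total, so it sorts to the top.
table = table.sort_values(by="All", ascending=False)
print(table)
```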

Question 4: Find the 10 restaurants with the most reviews

# Find the 10 restaurants with the most reviews
df_.sort_values(by='點評', ascending=False)[:10]
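
As a side note, the same top-N lookup can be written with `DataFrame.nlargest`, which avoids sorting the whole table. This is only a sketch on made-up rows (the `toy` DataFrame and its shop names are hypothetical), not a replacement for the code above.

```python
import pandas as pd

# Hypothetical toy data; one review count is missing on purpose.
toy = pd.DataFrame({
    "店名": ["A", "B", "C", "D"],
    "點評": [120.0, None, 980.0, 45.0],
})

# Drop rows without a review count, then keep the 2 largest.
top2 = toy.dropna(subset=["點評"]).nlargest(2, "點評")
print(top2)
```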

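# Start again from a fresh copy of the table before dropping missing values
df_ = df.copy()
# Drop the rows where 人均 (per-capita spend) is missing
df_.dropna(subset=['人均'], inplace=True)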


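Question 5: Find the outliers in per-capita spend (values that are too large) and remove them

# Find the outliers in 人均 (values that are too large)
# If the outliers stay in, later plots get squeezed into a narrow range
def out_range(s: pd.Series, a: int):
    # Flag values more than a standard deviations away from the mean
    bool_inds = (s < s.mean() - a*s.std()) | (s > s.mean() + a*s.std())
    return s[bool_inds].index

display(out_range(df_['人均'], 3))
df_.drop(out_range(df_['人均'], 3), axis=0, inplace=True)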

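
The rule used in Question 5 is the familiar "mean plus or minus a standard deviations" test. Here is a self-contained sketch of it on made-up prices (the `prices` Series is invented for illustration), so the effect can be checked by hand.

```python
import pandas as pd

def out_range(s: pd.Series, a: float) -> pd.Index:
    """Return the index labels of values outside mean +/- a * std."""
    lower = s.mean() - a * s.std()
    upper = s.mean() + a * s.std()
    return s[(s < lower) | (s > upper)].index

# Hypothetical per-capita prices: 20 ordinary values and one extreme one.
prices = pd.Series([40, 45, 50, 55, 60] * 4 + [5000], dtype=float)

outliers = out_range(prices, 3)   # only the label of the 5000 entry
cleaned = prices.drop(outliers)   # the series without that entry
print(list(outliers), cleaned.max())
```

One caveat: the extreme value itself inflates both the mean and the standard deviation, so on a very small sample a single outlier may not cross the 3-sigma line at all. On a table with thousands of rows this is rarely a concern.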

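Question 6: Group by type, compute the maximum, minimum, and mean per-capita spend, and draw a comparative horizontal bar chart

# Group by 類型 and compute the mean, maximum, and minimum of 人均
# (string names avoid the deprecation warning newer pandas raises for np.mean etc.)
df_1 = np.round(df_.groupby(by='類型')['人均'].agg(['mean', 'max', 'min']))
df_1

# Draw the DataFrame above as a comparative horizontal bar chart
df_1.plot.barh(figsize=(8,25))
plt.show()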

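
To show what the grouped table in Question 6 looks like before plotting, here is a minimal sketch on invented numbers (the `toy` DataFrame is an assumption). Each aggregation name becomes one column, and `plot.barh` then draws one bar per column for every type.

```python
import pandas as pd

# Hypothetical toy data: two types with a few per-capita prices each.
toy = pd.DataFrame({
    "類型": ["川菜", "川菜", "粵菜", "粵菜", "粵菜"],
    "人均": [60.0, 80.0, 90.0, 120.0, 150.0],
})

# One row per type, one column per aggregation, rounded to whole yuan.
stats = toy.groupby("類型")["人均"].agg(["mean", "max", "min"]).round()
print(stats)
```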

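Question 7: Draw a scatter plot with service on the x-axis and taste on the y-axis

# Scatter plot with 服務 (service) on the x-axis and 口味 (taste) on the y-axis
plt.figure(figsize=(10,10))
plt.scatter(x=df_.服務, y=df_.口味)
plt.xlabel('服務')
plt.ylabel('口味')
plt.grid()
plt.show()
# The two ratings are roughly positively correlated:
# restaurants rated higher for service also tend to be rated higher for taste,
# which suggests that better service may go hand in hand with better taste reviews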

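
The "roughly positively correlated" reading of the scatter plot can be backed by a number. Below is a sketch using `Series.corr`, which computes the Pearson coefficient by default; the ratings are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical ratings for five restaurants.
toy = pd.DataFrame({
    "服務": [6.5, 7.0, 7.5, 8.0, 8.5],
    "口味": [6.8, 7.1, 7.4, 8.1, 8.4],
})

# Pearson correlation: values near +1 mean the two ratings rise together.
r = toy["服務"].corr(toy["口味"])
print(round(r, 3))
```

A high coefficient only says the two ratings move together; it does not show which one drives the other.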

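Question 8: Draw a scatter plot with per-capita spend on the x-axis and service, taste, and environment on the y-axis, each in a different color

# 人均 on the x-axis; 服務, 口味, and 環境 on the y-axis, one color each
plt.figure(figsize=(10,10))
plt.scatter(df_.人均, df_.服務, color='r', label='服務', alpha=0.3, edgecolors='none')
plt.scatter(df_.人均, df_.口味, color='g', label='口味', alpha=0.3, edgecolors='none')
plt.scatter(df_.人均, df_.環境, color='b', label='環境', alpha=0.3, edgecolors='none')
plt.xlim(0, 300)    # zoom in so the points are not squeezed together
plt.ylim(5.5, 9.5)  # zoom in so the points are not squeezed together
plt.legend()
plt.show()
# Most restaurants cluster below 100 yuan per person,
# with all three ratings between 6.5 and 8.5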
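
The observation that most points sit below 100 yuan with ratings between 6.5 and 8.5 can also be measured rather than read off the plot. The sketch below does that on invented rows (the `toy` DataFrame is hypothetical): a boolean mask marks the rows inside the band, and the mean of the mask is the share of such rows.

```python
import pandas as pd

# Hypothetical toy data: six restaurants with price and three ratings.
toy = pd.DataFrame({
    "人均": [35, 60, 80, 95, 150, 280],
    "服務": [7.0, 7.4, 8.1, 6.9, 8.8, 9.0],
    "口味": [7.2, 7.5, 8.0, 7.1, 8.6, 9.1],
    "環境": [6.8, 7.3, 8.2, 6.4, 8.9, 9.2],
})

ratings = toy[["服務", "口味", "環境"]]
# True where all three ratings fall inside [6.5, 8.5].
in_band = ratings.ge(6.5).all(axis=1) & ratings.le(8.5).all(axis=1)
# Also require a per-capita price below 100.
mask = toy["人均"].lt(100) & in_band

share = mask.mean()  # fraction of rows inside the dense region
print(share)
```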

Question 9: For the first-tier cities Beijing, Shanghai, Guangzhou, and Shenzhen, draw four small pie charts in one figure showing the restaurant share of '川菜', '湘菜', '江浙菜', '東北菜', '粵菜', '徽菜', '客家菜', '贛菜', '湖北菜'

# Four small pie charts in one figure for Beijing, Shanghai, Guangzhou, and Shenzhen,
# showing the restaurant share of the nine cuisines listed in types
types = ['川菜', '湘菜', '江浙菜', '東北菜', '粵菜', '徽菜', '客家菜', '贛菜', '湖北菜']

# Combine both conditions into one mask instead of chaining two boolean filters
in_types = df_['類型'].isin(types)
bj = df_[(df_['城市']=='北京') & in_types]['類型'].value_counts()
sh = df_[(df_['城市']=='上海') & in_types]['類型'].value_counts()
gz = df_[(df_['城市']=='廣州') & in_types]['類型'].value_counts()
sz = df_[(df_['城市']=='深圳') & in_types]['類型'].value_counts()

fig = plt.figure(figsize=(12,12))

ax1 = fig.add_subplot(2,2,1)
ax1.pie(bj.values, labels=bj.index, explode=np.ones(len(bj.index))*0.1, autopct='%.2f%%')
ax1.set_title('北京')
ax2 = fig.add_subplot(2,2,2)
ax2.pie(sh.values, labels=sh.index, explode=np.ones(len(sh.index))*0.1, autopct='%.2f%%')
ax2.set_title('上海')
ax3 = fig.add_subplot(2,2,3)
ax3.pie(gz.values, labels=gz.index, explode=np.ones(len(gz.index))*0.1, autopct='%.2f%%')
ax3.set_title('廣州')
ax4 = fig.add_subplot(2,2,4)
ax4.pie(sz.values, labels=sz.index, explode=np.ones(len(sz.index))*0.1, autopct='%.2f%%')
ax4.set_title('深圳')
plt.show()

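
The percentages that `autopct` prints on each slice can be reproduced directly, which is handy for checking a chart. This sketch uses invented rows (the `toy` DataFrame and the shortened `types` list are assumptions) and combines the city and cuisine conditions into a single boolean mask.

```python
import pandas as pd

# Hypothetical toy data standing in for the restaurant table.
toy = pd.DataFrame({
    "城市": ["北京", "北京", "北京", "北京", "上海", "上海"],
    "類型": ["川菜", "川菜", "粵菜", "火鍋", "川菜", "湘菜"],
})
types = ["川菜", "湘菜", "粵菜"]  # shortened list, for illustration only

# Both conditions are evaluated on the same index and joined with &.
mask = (toy["城市"] == "北京") & toy["類型"].isin(types)

# normalize=True returns fractions; times 100 gives the pie percentages.
shares = toy.loc[mask, "類型"].value_counts(normalize=True) * 100
print(shares.round(2))
```

Note that the shares are relative to the selected cuisines only, not to all restaurants in the city, which is also what the pie charts show.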

Question 10: As above, for Beijing, Shanghai, Guangzhou, and Shenzhen, draw four small pie charts in one figure showing the share of the 10 most common restaurant types in each city

# Same layout as above: four small pie charts in one figure,
# each showing the 10 most common restaurant types in that city
bj = df_[ df_['城市']=='北京' ]['類型'].value_counts()[:10]
sh = df_[ df_['城市']=='上海' ]['類型'].value_counts()[:10]
gz = df_[ df_['城市']=='廣州' ]['類型'].value_counts()[:10]
sz = df_[ df_['城市']=='深圳' ]['類型'].value_counts()[:10]

fig = plt.figure(figsize=(12,12))

# explode must have one entry per slice, so size it from the data
ax1 = fig.add_subplot(2,2,1)
ax1.pie(bj.values, labels=bj.index, explode=np.ones(len(bj))*0.1, autopct='%.2f%%')
ax1.set_title('北京')
ax2 = fig.add_subplot(2,2,2)
ax2.pie(sh.values, labels=sh.index, explode=np.ones(len(sh))*0.1, autopct='%.2f%%')
ax2.set_title('上海')
ax3 = fig.add_subplot(2,2,3)
ax3.pie(gz.values, labels=gz.index, explode=np.ones(len(gz))*0.1, autopct='%.2f%%')
ax3.set_title('廣州')
ax4 = fig.add_subplot(2,2,4)
ax4.pie(sz.values, labels=sz.index, explode=np.ones(len(sz))*0.1, autopct='%.2f%%')
ax4.set_title('深圳')
plt.show()

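
Since the four cities go through exactly the same steps, the data preparation can be folded into a loop. The sketch below covers only that part and runs on invented rows (`toy`, `top_n`, and `per_city` are my own names); it also sizes `explode` from the data, because a city may have fewer types than the requested top N.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 深圳 has only two distinct types.
toy = pd.DataFrame({
    "城市": ["北京"] * 5 + ["深圳"] * 3,
    "類型": ["川菜", "川菜", "粵菜", "火鍋", "湘菜", "粵菜", "粵菜", "川菜"],
})

top_n = 3
per_city = {}
for city in ["北京", "深圳"]:
    # Most common types in this city, at most top_n of them.
    counts = toy.loc[toy["城市"] == city, "類型"].value_counts().head(top_n)
    # One explode offset per slice that actually exists.
    explode = np.full(len(counts), 0.1)
    per_city[city] = (counts, explode)

for city, (counts, explode) in per_city.items():
    print(city, len(counts), len(explode))
```

Each `(counts, explode)` pair could then be passed to `ax.pie(counts.values, labels=counts.index, explode=explode, autopct='%.2f%%')` on its own subplot.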

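Question 11: Use jieba to segment all restaurant names and find the 10 most frequent words, keeping only words longer than one character

# Segment all restaurant names with jieba and find the 10 most frequent words,
# keeping only words longer than one character
# This time the names are joined by string concatenation
import jieba
ss = df['店名'].sum()
ss = ss.replace('.', "")
lt = jieba.lcut(ss)
results = {}
for word in lt:
    if len(word) > 1 and '店' not in word:  # also skip words that contain '店'
        results[word] = results.get(word, 0) + 1
words = list(results.items())
words.sort(key=lambda x: x[1], reverse=True)
words[:10]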

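
The counting loop in Question 11 can also be written with `collections.Counter` from the standard library. In this sketch the `tokens` list is hand-written and merely stands in for the output of `jieba.lcut`, so it runs without jieba installed.

```python
from collections import Counter

# Hypothetical tokens, standing in for the list returned by jieba.lcut.
tokens = ["火鍋", "店", "重慶", "火鍋", "小吃店", "燒烤", "重慶", "火鍋", "的"]

# Same filter as above: longer than one character and no '店' inside.
freq = Counter(w for w in tokens if len(w) > 1 and "店" not in w)

# most_common(n) returns the n most frequent (word, count) pairs.
top = freq.most_common(2)
print(top)
```

A `Counter` is a `dict` subclass, so it should also be usable wherever the `results` dictionary is used.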

Question 12: Draw a word cloud from the segmentation result above

# Draw a word cloud from the word frequencies computed above
from wordcloud import WordCloud
wordcloud = WordCloud(font_path='./SimHei.ttf', width=1000, height=1000, background_color='white')
wordcloud.fit_words(results)
plt.figure(figsize=(15,15))
axs = plt.imshow(wordcloud)  # render the word cloud as an image
plt.axis('off')              # hide the axes
plt.show()