目錄
- 序言:第三方庫及所需材料
- 函數模塊介紹
- 分詞
- 詞頻統計
- 條形圖繪制
- 詞云繪制
- 主函數
- 效果預覽
- 全部代碼
序言:第三方庫及所需材料
編程語言:Python3.9。
編程環境:Anaconda3,Spyder5。
使用到的主要第三方庫:jieba-0.42.1,wordcloud-1.8.2.2,matplotlib-3.5.1。
- 文本數據:txt格式,本文以2023年內蒙古自治區政府工作報告為例,命名為“2023.txt”。
- 停用詞:“cn_stopwords.txt”,網絡下載
- 字體文件:tff格式,本文使用方正粗黑宋簡體,命名為“fzch.tff”
- 主題背景圖片:本文使用白底內蒙古自治區地圖,命名為“R-C.png”
以上文件置于py文件的同級目錄下,使用相對路徑讀取。
函數模塊介紹
具體的代碼可見全部代碼部分,這部分只介紹思路和相應的函數模塊
分詞
在主函數中讀取文本數據,調用分詞函數cutWord,使用jieba分詞庫和停用詞表對文本進行分詞操作,并返回詞語組成的列表。
def cutWord(text): words=jieba.cut(text)stopwords = {}.fromkeys([ line.rstrip() for line in open('cn_stopwords.txt',encoding='utf-8') ])finalwords = []for word in words:if word not in stopwords:if (word != "。" and word != ",") :finalwords.append(word) return finalwords
詞頻統計
將詞語列表傳入詞頻統計函數countWord,去除單字詞和換行符后,統計各詞語出現的頻率,并返回各詞語的頻數列表。
def countWord(text):counts={}for word in text: if len(word) == 1 or word=='\n':#單個詞和換行符不計算在內continueelse:if word not in counts.keys():counts[word]=1else:counts[word]+=1return counts
條形圖繪制
將詞頻字典傳入高頻詞條形圖繪制函數drawBar,根據注釋傳入參數,選擇前RANGE項詞語和圖像橫豎
def drawBar(countdict,RANGE, heng):#函數來源于:https://blog.csdn.net/leokingszx/article/details/101456624,有改動#dicdata:字典的數據。#RANGE:截取顯示的字典的長度。#heng=0,代表條狀圖的柱子是豎直向上的。heng=1,代表柱子是橫向的。考慮到文字是從左到右的,讓柱子橫向排列更容易觀察坐標軸。by_value = sorted(countdict.items(),key = lambda item:item[1],reverse=True)print(by_value[:20])x = []y = []plt.figure(figsize=(9, 6))for d in by_value:x.append(d[0])y.append(d[1])if heng == 0:plt.bar(x[0:RANGE], y[0:RANGE])plt.show()return elif heng == 1:plt.barh(x[0:RANGE], y[0:RANGE])plt.show()return else:return "heng的值僅為0或1!"
詞云繪制
將詞語列表傳入詞云繪制函數drawWordCloud,繪制詞云圖。進一步地,將詞語列表傳入詞云繪制函數drawWordCloudwithMap,以內蒙古自治區地圖為背景繪制詞云圖。
def drawWordCloud(textList):wc = WordCloud(font_path ="fzch.ttf",background_color="white",width=1800,height=1200).fit_words(countdict)plt.figure(figsize=(18, 12))plt.imshow(wc)plt.axis("off")plt.show()def drawWordCloudwithMap(textList):d = path.dirname(__file__)map_coloring = np.array(Image.open(path.join(d, "R-C.png"))) wc = WordCloud(font_path ="fzch.ttf",mask=map_coloring,background_color="white",width=1800,height=1200).fit_words(countdict)plt.figure(figsize=(18, 12))plt.imshow(wc)plt.axis("off")plt.show()
主函數
if __name__ == "__main__":with open('2023.txt','r',encoding='utf-8') as f:text=f.read()#讀取文本cutText=cutWord(text)#jieba分詞countdict=countWord(cutText)#生成詞頻字典drawBar(countdict,10,0)#繪制詞語出現次數前10的豎向條形圖 drawBar(countdict,20,1)#繪制詞語出現次數前20的橫向條形圖 drawWordCloud(cutText)#繪制詞云圖drawWordCloudwithMap(cutText)#以地圖為背景繪制詞云圖
效果預覽
全部代碼
# -*- coding: utf-8 -*-
# @Time : 2023/11/22
# @Author : Ryo_Yuki
# @Software: Spyderimport jieba
import jieba.analyse
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from os import path
plt.rcParams['font.sans-serif'] = ['SimHei'] # 用來正常顯示中文標簽def cutWord(text): words=jieba.cut(text)stopwords = {}.fromkeys([ line.rstrip() for line in open('cn_stopwords.txt',encoding='utf-8') ])finalwords = []for word in words:if word not in stopwords:if (word != "。" and word != ",") :finalwords.append(word) return finalwordsdef countWord(text):counts={}for word in text: if len(word) == 1 or word=='\n':#單個詞和換行符不計算在內continueelse:if word not in counts.keys():counts[word]=1else:counts[word]+=1return countsdef drawBar(countdict,RANGE, heng):#函數來源于:https://blog.csdn.net/leokingszx/article/details/101456624,有改動#dicdata:字典的數據。#RANGE:截取顯示的字典的長度。#heng=0,代表條狀圖的柱子是豎直向上的。heng=1,代表柱子是橫向的。考慮到文字是從左到右的,讓柱子橫向排列更容易觀察坐標軸。by_value = sorted(countdict.items(),key = lambda item:item[1],reverse=True)print(by_value[:20])x = []y = []plt.figure(figsize=(9, 6))for d in by_value:x.append(d[0])y.append(d[1])if heng == 0:plt.bar(x[0:RANGE], y[0:RANGE])plt.show()return elif heng == 1:plt.barh(x[0:RANGE], y[0:RANGE])plt.show()return else:return "heng的值僅為0或1!"def drawWordCloud(textList):wc = WordCloud(font_path ="fzch.ttf",background_color="white",width=1800,height=1200).fit_words(countdict)plt.figure(figsize=(18, 12))plt.imshow(wc)plt.axis("off")plt.show()def drawWordCloudwithMap(textList):d = path.dirname(__file__)map_coloring = np.array(Image.open(path.join(d, "R-C.png"))) wc = WordCloud(font_path ="fzch.ttf",mask=map_coloring,background_color="white",width=1800,height=1200).fit_words(countdict)plt.figure(figsize=(18, 12))plt.imshow(wc)plt.axis("off")plt.show()#主函數
if __name__ == "__main__":with open('2023.txt','r',encoding='utf-8') as f:text=f.read()#讀取文本cutText=cutWord(text)#jieba分詞countdict=countWord(cutText)#生成詞頻字典drawBar(countdict,10,0)#繪制詞語出現次數前10的豎向條形圖 drawBar(countdict,20,1)#繪制詞語出現次數前20的橫向條形圖 drawWordCloud(cutText)#繪制詞云圖drawWordCloudwithMap(cutText)#以地圖為背景繪制詞云圖