Text segmentation, word-frequency counting, bar charts, and themed word clouds with Python, jieba, and wordcloud

Preface: third-party libraries and required materials

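Programming language: Python 3.9.
Environment: Anaconda3, Spyder 5.
Main third-party libraries: jieba 0.42.1, wordcloud 1.8.2.2, matplotlib 3.5.1. The code also imports numpy and PIL to load the background image.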

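Text data: a txt file. This article uses the 2023 government work report of the Inner Mongolia Autonomous Region as the example, saved as "2023.txt".
Stop words: "cn_stopwords.txt", downloaded from the internet.
Font file: ttf format. This article uses the Founder bold Song typeface (方正粗黑宋簡體), saved as "fzch.ttf".
Theme background image: a map of the Inner Mongolia Autonomous Region on a white background, saved as "R-C.png".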

Place these files in the same directory as the .py file; the code reads them with relative paths.
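Relative paths such as '2023.txt' are resolved against the current working directory, which in an IDE may differ from the folder that holds the script. The following is a minimal sketch, not part of the original code, of checking that the four input files are present before running the analysis. The helper name missing_files and the constant REQUIRED_FILES are assumptions introduced for illustration.

```python
from pathlib import Path

# File names as listed in this article
REQUIRED_FILES = ("2023.txt", "cn_stopwords.txt", "fzch.ttf", "R-C.png")


def missing_files(base_dir, names=REQUIRED_FILES):
    """Return the names in `names` that do not exist as files under `base_dir`."""
    base = Path(base_dir)
    return [name for name in names if not (base / name).is_file()]


# Check the current working directory. When running a script file,
# Path(__file__).resolve().parent gives the script's own folder instead.
print(missing_files(Path.cwd()))
```

An empty list means every file was found; otherwise the list names the files that still need to be copied into place.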

Function modules

The complete code is in the Full code section; this part only explains the approach and the corresponding functions.

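Word segmentation

The main block reads the text data and calls the segmentation function cutWord, which segments the text with jieba, filters the result against the stop-word list, and returns a list of words.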

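def cutWord(text):
    # Segment the text with jieba (default accurate mode)
    words = jieba.cut(text)
    # Load the stop words into a set, one word per line
    with open('cn_stopwords.txt', encoding='utf-8') as f:
        stopwords = {line.rstrip() for line in f}
    finalwords = []
    for word in words:
        # Drop stop words and the two most common punctuation marks
        if word not in stopwords and word != "。" and word != "，":
            finalwords.append(word)
    return finalwords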
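The filtering step can be tried without jieba or the report text. The sketch below uses a hand-made token list and an in-memory stop-word list as stand-ins; in the article's code the tokens come from jieba.cut and the stop words from cn_stopwords.txt. The helper names load_stopwords and filter_tokens are hypothetical.

```python
import io


def load_stopwords(lines):
    """Build a set of stop words from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}


def filter_tokens(tokens, stopwords, punctuation=("。", "，")):
    """Keep tokens that are neither stop words nor listed punctuation."""
    return [t for t in tokens if t not in stopwords and t not in punctuation]


# Stand-in for open('cn_stopwords.txt', encoding='utf-8')
stopword_file = io.StringIO("的\n和\n\n了\n")
stopwords = load_stopwords(stopword_file)

# Stand-in for the output of jieba.cut(text)
tokens = ["经济", "的", "发展", "和", "民生", "，", "改善", "了", "。"]
print(filter_tokens(tokens, stopwords))  # ['经济', '发展', '民生', '改善']
```

Holding the stop words in a set keeps each membership test cheap, which matters when the token list is long.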

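Word-frequency counting

The word list is passed to the counting function countWord, which skips single-character words and newline characters, counts how often each remaining word occurs, and returns a dictionary that maps each word to its count.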

def countWord(text):
    counts = {}
    for word in text:
        # Skip single-character words and newline characters
        if len(word) == 1 or word == '\n':
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts
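The same counting can be written with collections.Counter from the standard library. This is an alternative sketch rather than the article's implementation; the function name count_words and the sample list are made up for illustration.

```python
from collections import Counter


def count_words(words):
    """Count words, skipping single-character tokens (a lone newline is one character too)."""
    return Counter(w for w in words if len(w) > 1)


sample = ["发展", "经济", "发展", "的", "\n", "民生", "发展", "经济"]
counts = count_words(sample)
print(counts.most_common(2))  # [('发展', 3), ('经济', 2)]
```

A Counter is a dict subclass, so it can be passed wherever the article's code expects the word-frequency dictionary.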

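Bar chart

The word-frequency dictionary is passed to drawBar, which draws a bar chart of the most frequent words. As the comments in the function describe, its arguments select how many of the top words to show (RANGE) and whether the bars are vertical or horizontal (heng).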

def drawBar(countdict, RANGE, heng):
    # Adapted from https://blog.csdn.net/leokingszx/article/details/101456624
    # countdict: the word-frequency dictionary.
    # RANGE: how many of the most frequent entries to display.
    # heng=0: vertical bars. heng=1: horizontal bars. Because text reads from
    # left to right, horizontal bars make the axis labels easier to read.
    by_value = sorted(countdict.items(), key=lambda item: item[1], reverse=True)
    print(by_value[:20])
    x = []
    y = []
    plt.figure(figsize=(9, 6))
    for d in by_value:
        x.append(d[0])
        y.append(d[1])
    if heng == 0:
        plt.bar(x[0:RANGE], y[0:RANGE])
        plt.show()
        return
    elif heng == 1:
        plt.barh(x[0:RANGE], y[0:RANGE])
        plt.show()
        return
    else:
        return "heng must be 0 or 1!"
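One detail worth knowing when heng=1: a horizontal bar chart places the first item at the bottom of the axis, so a list sorted in descending order shows the most frequent word at the bottom. The sketch below separates the ranking step from the plotting step and optionally reverses the order; the helper top_items and the demo dictionary are assumptions for illustration, not part of the original code.

```python
def top_items(countdict, n, largest_on_top=False):
    """Return (labels, values) for the n most frequent words.

    With largest_on_top=True the order is reversed, because a horizontal
    bar chart places the first item at the bottom of the axis.
    """
    ranked = sorted(countdict.items(), key=lambda item: item[1], reverse=True)[:n]
    if largest_on_top:
        ranked.reverse()
    labels = [word for word, _ in ranked]
    values = [count for _, count in ranked]
    return labels, values


demo = {"发展": 30, "经济": 18, "民生": 12, "生态": 7}
print(top_items(demo, 3))
print(top_items(demo, 3, largest_on_top=True))
```

The returned lists can be passed directly to plt.bar(labels, values) or plt.barh(labels, values).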

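Word cloud

The word list is passed to drawWordCloud to draw a word cloud. It is then passed to drawWordCloudwithMap, which draws a word cloud with the map of the Inner Mongolia Autonomous Region as its background shape.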

def drawWordCloud(textList):
    # Build the frequencies from the word list instead of relying on a global countdict
    countdict = countWord(textList)
    wc = WordCloud(font_path="fzch.ttf", background_color="white",
                   width=1800, height=1200).fit_words(countdict)
    plt.figure(figsize=(18, 12))
    plt.imshow(wc)
    plt.axis("off")
    plt.show()

def drawWordCloudwithMap(textList):
    countdict = countWord(textList)
    d = path.dirname(__file__)
    map_coloring = np.array(Image.open(path.join(d, "R-C.png")))
    # With a mask, the cloud takes the shape of the mask; width and height are ignored
    wc = WordCloud(font_path="fzch.ttf", mask=map_coloring, background_color="white",
                   width=1800, height=1200).fit_words(countdict)
    plt.figure(figsize=(18, 12))
    plt.imshow(wc)
    plt.axis("off")
    plt.show()
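The wordcloud library treats only pure white pixels of the mask as masked out, so a map image whose background is slightly off-white may let words spill outside the outline. The following sketch, assuming an RGB or RGBA uint8 array and a hypothetical helper named whiten_background, forces near-white pixels to pure white before the array is used as a mask. The threshold of 250 is an arbitrary choice for illustration.

```python
import numpy as np


def whiten_background(image, threshold=250):
    """Return a copy of an RGB(A) uint8 array with near-white pixels set to pure white."""
    out = np.array(image, dtype=np.uint8, copy=True)
    # A pixel is near-white when all three colour channels reach the threshold
    near_white = (out[..., :3] >= threshold).all(axis=-1)
    out[near_white, :3] = 255
    return out


# Tiny 2x2 demo image: white background, one off-white pixel, one "map" pixel
img = np.full((2, 2, 3), 255, dtype=np.uint8)
img[0, 0] = (252, 253, 251)  # off-white background pixel
img[1, 1] = (30, 60, 90)     # pixel inside the map outline

mask = whiten_background(img)
# Pixels that are not pure white are the ones words may be drawn on
drawable = ~(mask == 255).all(axis=-1)
print(drawable)
```

In the article's code, the result of np.array(Image.open(...)) could be passed through such a function before being given to WordCloud as mask.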

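Main block

if __name__ == "__main__":
    with open('2023.txt', 'r', encoding='utf-8') as f:
        text = f.read()  # read the text
    cutText = cutWord(text)  # segment with jieba
    countdict = countWord(cutText)  # build the word-frequency dictionary
    drawBar(countdict, 10, 0)  # vertical bar chart of the 10 most frequent words
    drawBar(countdict, 20, 1)  # horizontal bar chart of the 20 most frequent words
    drawWordCloud(cutText)  # draw the word cloud
    drawWordCloudwithMap(cutText)  # draw the word cloud with the map as background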

Result preview

[Image: preview of the output; not included in this copy]

Full code

# -*- coding: utf-8 -*-
# @Time    : 2023/11/22
# @Author  : Ryo_Yuki
# @Software: Spyder
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from os import path

plt.rcParams['font.sans-serif'] = ['SimHei']  # so that Chinese labels render correctly

def cutWord(text):
    # Segment the text with jieba (default accurate mode)
    words = jieba.cut(text)
    # Load the stop words into a set, one word per line
    with open('cn_stopwords.txt', encoding='utf-8') as f:
        stopwords = {line.rstrip() for line in f}
    finalwords = []
    for word in words:
        # Drop stop words and the two most common punctuation marks
        if word not in stopwords and word != "。" and word != "，":
            finalwords.append(word)
    return finalwords

def countWord(text):
    counts = {}
    for word in text:
        # Skip single-character words and newline characters
        if len(word) == 1 or word == '\n':
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts

def drawBar(countdict, RANGE, heng):
    # Adapted from https://blog.csdn.net/leokingszx/article/details/101456624
    # countdict: the word-frequency dictionary.
    # RANGE: how many of the most frequent entries to display.
    # heng=0: vertical bars. heng=1: horizontal bars. Because text reads from
    # left to right, horizontal bars make the axis labels easier to read.
    by_value = sorted(countdict.items(), key=lambda item: item[1], reverse=True)
    print(by_value[:20])
    x = []
    y = []
    plt.figure(figsize=(9, 6))
    for d in by_value:
        x.append(d[0])
        y.append(d[1])
    if heng == 0:
        plt.bar(x[0:RANGE], y[0:RANGE])
        plt.show()
        return
    elif heng == 1:
        plt.barh(x[0:RANGE], y[0:RANGE])
        plt.show()
        return
    else:
        return "heng must be 0 or 1!"

def drawWordCloud(textList):
    # Build the frequencies from the word list instead of relying on a global countdict
    countdict = countWord(textList)
    wc = WordCloud(font_path="fzch.ttf", background_color="white",
                   width=1800, height=1200).fit_words(countdict)
    plt.figure(figsize=(18, 12))
    plt.imshow(wc)
    plt.axis("off")
    plt.show()

def drawWordCloudwithMap(textList):
    countdict = countWord(textList)
    d = path.dirname(__file__)
    map_coloring = np.array(Image.open(path.join(d, "R-C.png")))
    # With a mask, the cloud takes the shape of the mask; width and height are ignored
    wc = WordCloud(font_path="fzch.ttf", mask=map_coloring, background_color="white",
                   width=1800, height=1200).fit_words(countdict)
    plt.figure(figsize=(18, 12))
    plt.imshow(wc)
    plt.axis("off")
    plt.show()

# Main block
if __name__ == "__main__":
    with open('2023.txt', 'r', encoding='utf-8') as f:
        text = f.read()  # read the text
    cutText = cutWord(text)  # segment with jieba
    countdict = countWord(cutText)  # build the word-frequency dictionary
    drawBar(countdict, 10, 0)  # vertical bar chart of the 10 most frequent words
    drawBar(countdict, 20, 1)  # horizontal bar chart of the 20 most frequent words
    drawWordCloud(cutText)  # draw the word cloud
    drawWordCloudwithMap(cutText)  # draw the word cloud with the map as background
