LLM中的N-Gram、TF-IDF和Word embedding

文章目錄

1. N-Gram和TF-IDF：通俗易懂的解析
- 1.1 N-Gram：讓AI學會"猜詞"的技術
- - 1.1.1 基本概念
  - 1.1.2 工作原理
  - 1.1.3 常見類型
  - 1.1.4 應用場景
  - 1.1.5 優缺點
- 1.2 TF-IDF：衡量詞語重要性的尺子
- - 1.2.1 基本概念
  - 1.2.2 計算公式
  - 1.2.3 為什么需要TF-IDF？
  - 1.2.4 應用場景
  - 1.2.5 實際案例
  - 1.2.6 優缺點
- 1.3 總結對比
- 簡單示例
2. Word Embedding（詞嵌入）
- - 2.1 🌍 把詞語變成"坐標"
  - 2.2 🔍 Word Embedding 是什么？
  - 2.3 💡 為什么要用 Word Embedding？
  - 2.4 🛠? 舉個實際例子
  - 2.5 📚 常見的 Word Embedding 方法
  - 2.6 ? 簡單總結
3. Word2Vec
- 3.1 Word2Vec的兩種模型
- - (1) CBOW (Continuous Bag of Words)
  - (2) Skip-gram
- 3.2 Word2Vec的實現步驟
- - Step 1: 數據預處理
  - Step 2: 構建神經網絡模型
  - Step 3: 訓練模型
  - Step 4: 提取詞向量
- 3.3 關鍵優化技術
- - (1) 負采樣（Negative Sampling）
  - (2) 層次 Softmax（Hierarchical Softmax）
- 3.4 代碼示例（Python）
- 3.5 總結
4. 實操
- 4.1 用N-Gram和TF-IDF為酒店建立內容推薦系統
- - 4.1.1 準備
  - 4.1.2 步驟
  - 4.1.3 示例代碼
  - 4.1.4 結果
- 4.2 用Word Embedding為三國演義找相似詞
- - 4.2.1 準備
  - 4.2.2 步驟
  - 4.2.3 示例代碼
  - 4.2.4 結果

1. N-Gram和TF-IDF：通俗易懂的解析

1.1 N-Gram：讓AI學會"猜詞"的技術

1.1.1 基本概念

N-Gram是一種讓計算機理解語言規律的基礎方法，主要用于預測文本中下一個可能出現的詞。它的核心思想很簡單：假設一個詞的出現只和前面的幾個詞有關。

舉個例子：

"我想吃"后面接"蘋果"的概率，可能比接"游泳"更高
輸入法在你打出"dddd"時推薦"帶帶弟弟"就是基于這種原理

1.1.2 工作原理

分段統計：把文本拆成連續的詞組合（比如2個詞的"我吃"，3個詞的"我想吃"），統計每個組合出現的次數
計算概率：用"下一個詞出現的次數除以當前組合出現的總次數"得到條件概率
處理零概率：給從未出現過的組合分配很小的概率，避免完全排除可能性

1.1.3 常見類型

Unigram（一元組）：單個詞為一組（如"我"、“喜歡”）
Bigram（二元組）：兩個連續詞為一組（如"我喜歡"、“喜歡學習”）
Trigram（三元組）：三個連續詞為一組（如"我喜歡學習"）

1.1.4 應用場景

手機輸入法候選詞預測
文本生成（如自動補全句子）
拼寫檢查（判斷詞語組合是否合理）
搜索引擎查詢擴展

1.1.5 優缺點

? 優點：

簡單易實現，計算效率高
可解釋性強，易于調試

? 缺點：

只能記住有限上下文（長句子容易出錯）
需要大量數據訓練
對未見過的新詞組合預測能力差

1.2 TF-IDF：衡量詞語重要性的尺子

1.2.1 基本概念

TF-IDF（詞頻-逆文檔頻率）是一種評估詞語重要性的方法，它考慮兩個因素：

詞頻（TF）：詞在文檔中出現的頻率
逆文檔頻率（IDF）：詞在整個文檔集合中的罕見程度

簡單說：一個詞在本文中出現越多（TF高），同時在別的文章中出現越少（IDF高），就越重要。

1.2.2 計算公式

TF-IDF = TF × IDF

其中：

TF = 詞在文檔中的出現次數 / 文檔總詞數
IDF = log(文檔總數 / 包含該詞的文檔數)

1.2.3 為什么需要TF-IDF？

直接統計詞頻會有一個問題：像"的"、"是"這種詞雖然出現很多，但對理解內容沒幫助。TF-IDF通過IDF降低了這類詞的權重。

1.2.4 應用場景

搜索引擎排序（找出文檔真正重要的詞）
文本分類（如新聞分類）
關鍵詞自動提取
推薦系統（分析用戶興趣）

1.2.5 實際案例

如果分析專利文檔：

"中國"可能詞頻高但IDF低（很多文檔都提到）
"專利"詞頻適中但IDF高（較少文檔提到）
→ "專利"的TF-IDF值會更高，更能代表主題

1.2.6 優缺點

? 優點：

簡單有效，易于計算
能自動過濾常見無意義詞

? 缺點：

不考慮詞語順序和語義關系
對同義詞處理不好（如"電腦"和"計算機"）

1.3 總結對比

特性	N-Gram	TF-IDF
主要用途	預測下一個詞/生成文本	評估詞語重要性/文檔特征提取
核心思想	詞語出現的概率依賴前幾個詞	重要=在本文檔多見+在其它文檔少見
典型應用	輸入法、機器翻譯、拼寫檢查	搜索引擎、文本分類、關鍵詞提取
優勢	保持語言連貫性	識別文檔關鍵主題詞
局限	長距離依賴差、需要大量訓練數據	忽略詞語順序和語義關系

兩者常結合使用，比如先用TF-IDF提取重要詞，再用N-Gram分析這些詞的關系。

簡單示例

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np# 示例詞庫
words = ["蘋果", "香蕉", "橙子", "葡萄", "菠蘿", "芒果", "西瓜", "草莓", "藍莓", "櫻桃","蘋果手機", "蘋果電腦", "蘋果汁", "紅蘋果", "青蘋果"
]# 定義n-gram函數（這里使用2-gram）
def get_ngrams(word, n=2):return [word[i:i+n] for i in range(len(word)-n+1)]# 為每個詞生成n-gram特征
word_ngrams = [" ".join(get_ngrams(word)) for word in words]
print("詞語的2-gram表示示例:")
for word, ngram in zip(words[:5], word_ngrams[:5]):print(f"{word} → {ngram}")# 使用TF-IDF向量化
vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split())
tfidf_matrix = vectorizer.fit_transform(word_ngrams)# 定義查找相似詞的函數
def find_similar_words(target_word, top_n=5):# 生成目標詞的n-gramtarget_ngram = " ".join(get_ngrams(target_word))# 轉換為TF-IDF向量target_vec = vectorizer.transform([target_ngram])# 計算余弦相似度similarities = cosine_similarity(target_vec, tfidf_matrix)# 獲取最相似的詞similar_indices = np.argsort(similarities[0])[::-1][1:top_n+1]  # 排除自己print(f"\n與'{target_word}'最相似的{top_n}個詞:")for idx in similar_indices:print(f"{words[idx]}: {similarities[0][idx]:.3f}")# 測試示例
find_similar_words("蘋果", top_n=5)
find_similar_words("菠蘿", top_n=3)
find_similar_words("蘋果手機", top_n=3)

2. Word Embedding（詞嵌入）

2.1 🌍 把詞語變成"坐標"

想象你是一個外星人，第一次來地球學習人類的語言。你發現：

單詞 “貓” 和 “狗” 經常一起出現（因為它們都是寵物）。
單詞 “蘋果” 和 “香蕉” 也經常一起出現（因為它們都是水果）。
但 “貓” 和 “蘋果” 幾乎不會同時出現（因為它們屬于不同類別）。

于是，你決定給每個單詞分配一個 “坐標”（比如在三維空間里的位置）：

"貓" → [0.8, 0.2, 0.1]
"狗" → [0.7, 0.3, 0.1]
"蘋果" → [0.1, 0.9, 0.4]
"香蕉" → [0.2, 0.8, 0.3]

這樣：
? 相似的詞（比如貓和狗）坐標接近。
? 不相似的詞（比如貓和蘋果）坐標遠離。

2.2 🔍 Word Embedding 是什么？

Word Embedding 就是通過數學方法，把單詞變成 一串數字（向量），讓計算機能通過這些數字：

理解詞語的意思（比如"貓"和"狗"都是動物）。
計算詞語的關系（比如"國王 - 男 + 女 ≈ 女王"）。

2.3 💡 為什么要用 Word Embedding？

直接給單詞編號（比如"貓=1，狗=2"）會丟失語義信息。而 Word Embedding 能：

壓縮信息：用少數幾個數字表示復雜含義。
發現規律：自動學習"貓→狗"和"蘋果→香蕉"的相似關系。
兼容算法：機器學習模型（如神經網絡）只能處理數字，不能直接處理文字。

2.4 🛠? 舉個實際例子

假設用 3 維向量表示詞語：

"科技" → [0.9, 0.1, 0.2]
"手機" → [0.8, 0.2, 0.3]
"水果" → [0.1, 0.9, 0.4]

計算機看到：

"科技" 和 "手機" 的向量接近 → 它們相關。
"科技" 和 "水果" 的向量遠離 → 它們無關。

2.5 📚 常見的 Word Embedding 方法

Word2Vec：通過上下文預測詞語（比如"貓愛吃__" → 預測"魚"）。
GloVe：統計詞語共同出現的頻率（比如"貓"和"狗"經常一起出現）。
BERT（現代方法）：結合上下文動態調整向量（比如"蘋果"在"吃蘋果"和"蘋果手機"中含義不同）。

2.6 ? 簡單總結

Word Embedding 就是 讓計算機通過數字"理解"詞語，像人類一樣知道"貓和狗相似，但和蘋果無關"。它是自然語言處理（NLP）的基礎技術，用于翻譯、搜索、聊天機器人等場景。

3. Word2Vec

3.1 Word2Vec的兩種模型

(1) CBOW (Continuous Bag of Words)

目標：用上下文詞語預測中心詞（適合小型數據集）。
例子：
句子："我愛自然語言處理"
假設窗口大小為 2（左右各 2 個詞），則：
- 輸入：["我", "愛", "語言", "處理"]（上下文）
- 輸出："自然"（中心詞）

(2) Skip-gram

目標：用中心詞預測上下文詞語（適合大型數據集）。
例子：
同一句子 "我愛自然語言處理"，窗口大小為 2：
- 輸入："自然"（中心詞）
- 輸出：["我", "愛", "語言", "處理"]（上下文）

CBOW vs Skip-gram：

CBOW 訓練更快，適合高頻詞。
Skip-gram 對低頻詞效果更好，但需要更多數據。

3.2 Word2Vec的實現步驟

Step 1: 數據預處理

分詞（如用 jieba 對中文分詞）。
構建詞匯表（給每個詞分配唯一 ID，如 我=0, 愛=1, 自然=2...）。

Step 2: 構建神經網絡模型

Word2Vec 本質上是一個 單隱層神經網絡，結構如下：

輸入層 → 隱藏層（Embedding 層） → 輸出層（Softmax）

輸入層：詞語的 one-hot 編碼（如 "自然" = [0, 0, 1, 0, 0]）。
隱藏層：權重矩陣（即詞向量表），維度 = [詞匯表大小, 嵌入維度]（如 300 維）。
輸出層：預測上下文詞的概率（Softmax 歸一化）。

Step 3: 訓練模型

輸入一個詞（如 "自然" 的 one-hot 向量 [0, 0, 1, 0, 0]）。
乘以權重矩陣，得到隱藏層的 詞向量（如 [0.2, -0.5, 0.7, ...]）。
用 Softmax 計算預測的上下文詞概率。
通過反向傳播（Backpropagation）更新權重，使預測更準。

Step 4: 提取詞向量

訓練完成后，隱藏層的權重矩陣就是詞向量表！

例如，"自然" 的詞向量是權重矩陣的第 3 行（假設 "自然" 的 ID=2）。

3.3 關鍵優化技術

直接計算 Softmax 對大規模詞匯表效率極低，因此 Word2Vec 用兩種優化方法：

(1) 負采樣（Negative Sampling）

問題：Softmax 要計算所有詞的概率，計算量太大。
解決：每次訓練只采樣少量負樣本（隨機選非上下文詞），優化目標變為：
- 最大化真實上下文詞的概率。
- 最小化負樣本詞的概率。

(2) 層次 Softmax（Hierarchical Softmax）

用哈夫曼樹（Huffman Tree）編碼詞匯表，將計算復雜度從 O(N) 降到 O(log N)。
每個詞對應樹的一個葉子節點，預測時只需計算路徑上的節點概率。

3.4 代碼示例（Python）

用 gensim 庫快速訓練 Word2Vec：

from gensim.models import Word2Vec# 示例數據（已分詞的句子）
sentences = [["我", "愛", "自然", "語言", "處理"],["深度", "學習", "真", "有趣"]
]# 訓練模型（Skip-gram + 負采樣）
model = Word2Vec(sentences,vector_size=100,  # 詞向量維度window=5,         # 上下文窗口大小min_count=1,      # 忽略低頻詞sg=1,             # 1=Skip-gram, 0=CBOWnegative=5,       # 負采樣數epochs=10         # 訓練輪次
)# 獲取詞向量
vector = model.wv["自然"]  # "自然"的詞向量
print(vector)# 找相似詞
similar_words = model.wv.most_similar("自然", topn=3)
print(similar_words)  # 輸出：[('語言', 0.92), ('學習', 0.88), ...]

3.5 總結

核心思想：用上下文學習詞向量（CBOW/Skip-gram）。
關鍵步驟：
1. 分詞 → 構建詞匯表 → one-hot 編碼。
2. 訓練單隱層神經網絡，提取隱藏層權重作為詞向量。
優化方法：負采樣、層次 Softmax 加速訓練。
應用場景：語義搜索、推薦系統、機器翻譯等。

Word2Vec 的優點是簡單高效，但缺點是無法處理多義詞（如"蘋果"在水果和公司語境中含義不同）。后續的 GloVe、BERT 等模型對此做了改進。

4. 實操

4.1 用N-Gram和TF-IDF為酒店建立內容推薦系統

4.1.1 準備

西雅圖酒店數據集：

下載地址：https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Seattle_Hotels.csv
字段：name,address,desc
目標：基于用戶選擇的酒店，推薦相似度高的Top10個其他酒店
方法：計算當前酒店特征向量與整個酒店特征矩陣的余弦相似度，取相似度最大的Top-k個

4.1.2 步驟

Step1，對酒店描述（Desc）進行特征提取
- N-Gram，提取N個連續字的集合，作為特征
- TF-IDF，按照(min_df,max_df)提取關鍵詞，并生成TFIDF矩陣
Step2，計算酒店之間的相似度矩陣
- 余弦相似度
Step3，對于指定的酒店，選擇相似度最大的Top-K個酒店進行輸出

4.1.3 示例代碼

import pandas as pd
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re
pd.options.display.max_columns = 30
import matplotlib.pyplot as plt
# 支持中文
plt.rcParams['font.sans-serif'] = ['SimHei']  # 用來正常顯示中文標簽
df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
# 數據探索
# print(df.head())
print('數據集中的酒店個數：', len(df))# 創建英文停用詞列表
ENGLISH_STOPWORDS = {'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"
}def print_description(index):example = df[df.index == index][['desc', 'name']].values[0]if len(example) > 0:print('Name:', example[1])print(example[0])
print('第10個酒店的描述：')
print_description(10)# 得到酒店描述中n-gram特征中的TopK個特征,默認n=1即1-gram,k=None，表示所有的特征)
def get_top_n_words(corpus, n=1, k=None):# 統計ngram詞頻矩陣，使用自定義停用詞列表vec = CountVectorizer(ngram_range=(n, n), stop_words=list(ENGLISH_STOPWORDS)).fit(corpus)bag_of_words = vec.transform(corpus)"""print('feature names:')print(vec.get_feature_names())print('bag of words:')print(bag_of_words.toarray())"""sum_words = bag_of_words.sum(axis=0)words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]# 按照詞頻從大到小排序words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)return words_freq[:k]
# 生成n=1.k=20的可視圖
# n_gram=1
# common_words = get_top_n_words(df['desc'], n=n_gram,k=20)
# # 生成n=3.k=20的可視圖
n_gram=3
common_words = get_top_n_words(df['desc'], n=n_gram,k=20)
# common_words = get_top_n_words(df['desc'], 3, 20)
print(f"comon_words are \n {common_words}")
df1 = pd.DataFrame(common_words, columns = ['desc' , 'count'])
df1.groupby('desc').sum()['count'].sort_values().plot(kind='barh', title=f'去掉停用詞后，酒店描述中的Top20-{n_gram}單詞')
plt.savefig(f'./top20-{n_gram}words.png')
plt.show()# 文本預處理
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
# 使用自定義的英文停用詞列表替代nltk的stopwords
STOPWORDS = ENGLISH_STOPWORDS
# 對文本進行清洗
def clean_text(text):# 全部小寫text = text.lower()# 用空格替代一些特殊符號，如標點text = REPLACE_BY_SPACE_RE.sub(' ', text)# 移除BAD_SYMBOLS_REtext = BAD_SYMBOLS_RE.sub('', text)# 從文本中去掉停用詞text = ' '.join(word for word in text.split() if word not in STOPWORDS)return text
# 對desc字段進行清理，apply針對某列
df['desc_clean'] = df['desc'].apply(clean_text)
#print(df['desc_clean'])# 建模
df.set_index('name', inplace = True)
# 使用TF-IDF提取文本特征，使用自定義停用詞列表,min_df=0.01：如果有1000篇文檔，只保留至少在10篇文檔中出現的詞(1000×1%)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0.01, stop_words=list(ENGLISH_STOPWORDS))
# 針對desc_clean提取tfidf
tfidf_matrix = tf.fit_transform(df['desc_clean'])
# print('TFIDF feature names:')
# print(tf.get_feature_names_out())
print('length of feature_names_out:')
print(len(tf.get_feature_names_out()))
# print('tfidf_matrix:')
# print(tfidf_matrix)
print('tfidf_matrix shape=')
print(tfidf_matrix.shape)
# 計算酒店之間的余弦相似度（線性核函數）
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
# print(f'cosine_similarities為\n {cosine_similarities}')
print("conine_similarities.shape=")
print(cosine_similarities.shape)
indices = pd.Series(df.index) #df.index是酒店名稱# 基于相似度矩陣和指定的酒店name，推薦TOP10酒店
def recommendations(name, cosine_similarities = cosine_similarities):recommended_hotels = []# 找到想要查詢酒店名稱的idxidx = indices[indices == name].index[0]# print('idx=', idx)# 對于idx酒店的余弦相似度向量按照從大到小進行排序score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending = False)# 取相似度最大的前10個（除了自己以外）top_10_indexes = list(score_series.iloc[1:11].index)# 放到推薦列表中for i in top_10_indexes:recommended_hotels.append(list(df.index)[i])return recommended_hotels
hotel_name='Hilton Seattle Airport & Conference Center'
recommended=recommendations(hotel_name)
print(f"top 10 similar to {hotel_name} are\n")
for i in range(len(recommended)):print (f"top{(i+1):02d}        {recommended[i]}")
# print(recommendations('Hilton Seattle Airport & Conference Center'))
# print(recommendations('The Bacon Mansion Bed and Breakfast'))
# #print(result)

4.1.4 結果

數據集中的酒店個數： 152
第10個酒店的描述：
Name: W Seattle
Soak up the vibrant scene in the Living Room Bar and get in the mix with our live music and DJ series before heading to a memorable dinner at TRACE. Offering inspired seasonal fare in an award-winning atmosphere, it's a not-to-be-missed culinary experience in downtown Seattle. Work it all off the next morning at FIT?, our state-of-the-art fitness center before wandering out to explore many of the area's nearby attractions, including Pike Place Market, Pioneer Square and the Seattle Art Museum. As always, we've got you covered during your time at W Seattle with our signature Whatever/Whenever? service - your wish is truly our command.
comon_words are [('pike place market', 85), ('seattle tacoma international', 21), ('tacoma international airport', 21), ('free wi fi', 19), ('washington state convention', 17), ('seattle art museum', 16), ('place market seattle', 16), ('state convention center', 15), ('within walking distance', 14), ('high speed internet', 14), ('space needle pike', 12), ('needle pike place', 11), ('south lake union', 11), ('downtown seattle hotel', 10), ('sea tac airport', 10), ('home away home', 9), ('heart downtown seattle', 8), ('link light rail', 8), ('free high speed', 8), ('24 hour fitness', 7)]
length of feature_names_out:
3347
tfidf_matrix shape=
(152, 3347)
conine_similarities.shape=
(152, 152)
top 10 similar to Hilton Seattle Airport & Conference Center aretop01        Embassy Suites by Hilton Seattle Tacoma International Airport
top02        DoubleTree by Hilton Hotel Seattle Airport
top03        Seattle Airport Marriott
top04        Four Points by Sheraton Downtown Seattle Center
top05        Motel 6 Seattle Sea-Tac Airport South
top06        Hampton Inn Seattle/Southcenter
top07        Radisson Hotel Seattle Airport
top08        Knights Inn Tukwila
top09        Hotel Hotel
top10        Home2 Suites by Hilton Seattle Airport

在這里插入圖片描述

4.2 用Word Embedding為三國演義找相似詞

4.2.1 準備

準備三國演義的txt文件

4.2.2 步驟

Step1，先對文件進行分詞（用jieba包）
Step2，設置模型參數進行訓練
Step3，計算兩個詞的相似度、找出一個詞或幾個詞加減后的最相近詞。

4.2.3 示例代碼

word_seg.py

# -*-coding: utf-8 -*-
# 對txt文件進行中文分詞
import jieba
import os
from utils import files_processing# 源文件所在目錄
source_folder = './three_kingdoms/source'
segment_folder = './three_kingdoms/segment'
# 字詞分割，對整個文件內容進行字詞分割
def segment_lines(file_list,segment_out_dir,stopwords=[]):for i,file in enumerate(file_list):segment_out_name=os.path.join(segment_out_dir,'segment_{}.txt'.format(i))with open(file, 'rb') as f:document = f.read()document_cut = jieba.cut(document)sentence_segment=[]for word in document_cut:if word not in stopwords:sentence_segment.append(word)result = ' '.join(sentence_segment)result = result.encode('utf-8')with open(segment_out_name, 'wb') as f2:f2.write(result)# 對source中的txt文件進行分詞，輸出到segment目錄中
file_list=files_processing.get_files_list(source_folder, postfix='*.txt')
segment_lines(file_list, segment_folder)

word_similarity_three_kingdoms.py

# -*-coding: utf-8 -*-
# 先運行 word_seg進行中文分詞，然后再進行word_similarity計算
# 將Word轉換成Vec，然后計算相似度 
from gensim.models import word2vec
import multiprocessing# 如果目錄中有多個文件，可以使用PathLineSentences
segment_folder = './three_kingdoms/segment'
# 切分之后的句子合集
sentences = word2vec.PathLineSentences(segment_folder)
#=============== 設置模型參數，進行訓練
model = word2vec.Word2Vec(sentences, vector_size=100, window=3, min_count=1)
model.save('./three_kingdoms/model/word2Vec.model')
print(model.wv.similarity('曹操', '劉備'))
print(model.wv.similarity('曹操', '張飛'))
query_name = "曹操"
similar_words = model.wv.most_similar(query_name, topn=5)
print(f"與{query_name}最相似的5個詞:")
for word, similarity in similar_words:print(f"{word}: {similarity:.3f}")
print("曹操+劉備-張飛=?")
similar_words = model.wv.most_similar(positive=['曹操', '劉備'], negative=['張飛'], topn=5)
for word, similarity in similar_words:print(f"{word}: {similarity:.3f}")
#================= 設置模型參數，進行訓練
model2 = word2vec.Word2Vec(sentences, vector_size=128, window=5, min_count=5, workers=multiprocessing.cpu_count())
# 保存模型
model2.save('./three_kingdoms/model/word2Vec.model2')
print(model2.wv.similarity('曹操', '劉備'))
print(model2.wv.similarity('曹操', '張飛'))query_name = "曹操"
similar_words = model2.wv.most_similar(query_name, topn=5)
print(f"與{query_name}最相似的5個詞:")
for word, similarity in similar_words:print(f"{word}: {similarity:.3f}")
print("曹操+劉備-張飛=?")
similar_words = model2.wv.most_similar(positive=['曹操', '劉備'], negative=['張飛'], topn=5)
for word, similarity in similar_words:print(f"{word}: {similarity:.3f}")

4.2.4 結果

0.9805809
0.9755627
與曹操最相似的5個詞:
孫權: 0.988
司馬懿: 0.987
已: 0.986
孔明: 0.986
沮授: 0.986
曹操+劉備-張飛=?
某: 0.992
丞相: 0.991
臣: 0.990
既: 0.989
大叫: 0.989
0.82772493
0.7733702
與曹操最相似的5個詞:
孫權: 0.959
喝: 0.953
回報: 0.953
大叫: 0.952
其事: 0.950
曹操+劉備-張飛=?
臣: 0.976
何為: 0.964
丞相: 0.962
朕: 0.960
主公: 0.959Process finished with exit code 0