Optimizing for text clustering
Improving the TF-IDF feature engineering
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    max_features=5000,
    min_df=2,
    max_df=0.8,
    token_pattern=r"\b\w+\b"
)
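One caveat: with token_pattern=r"\b\w+\b", unsegmented Chinese text produces very coarse tokens, because consecutive Chinese characters contain no word boundaries. The text later fed to this vectorizer (comments_to_cluster) is therefore assumed to be pre-segmented and space-joined. A minimal sketch of that preprocessing step, where raw_comments is a hypothetical list of unsegmented comment strings not taken from the original code:

import jieba

# segment each comment with jieba and re-join with spaces so that
# token_pattern=r"\b\w+\b" can split the text into individual words
comments_to_cluster = [" ".join(jieba.lcut(text)) for text in raw_comments]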
Dynamically selecting the best number of clusters n_clusters
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = tfidf.fit_transform(comments_to_cluster)
best_k = 0
best_silhouette = -1
for k in range(5, 9):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_silhouette:
        best_silhouette = score
        best_k = k
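It can also help to record the silhouette score for every candidate k rather than only the winner, so the choice of best_k is easy to audit. A small sketch of that variation (the scores dict below is new here, not part of the original code):

scores = {}
for k in range(5, 9):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.4f}")
print(f"chosen best_k = {max(scores, key=scores.get)}")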
Improving the clustering algorithm
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

kmeans_predictor = make_pipeline(
    TfidfVectorizer(
        tokenizer=jieba.lcut,        # segment Chinese text with jieba
        ngram_range=(1, 2),
        max_features=5000,
        min_df=2,
        max_df=0.8,
        token_pattern=r"\b\w+\b"     # ignored when tokenizer is set; sklearn will warn
    ),
    Normalizer(norm="l2"),           # L2-normalize so KMeans approximates cosine distance
    KMeans(n_clusters=best_k, random_state=42, n_init=10)
)
comments_data_clean = comments_data[comments_data["sentiment_category"].isin([1, 3])]

kmeans_predictor.fit(comments_data_clean["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(comments_data_clean["comment_text"])

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps["tfidfvectorizer"]
kmeans_model = kmeans_predictor.named_steps["kmeans"]
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    # sort the cluster center weights in descending order to find the dominant terms
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    # top_n_words (the number of keywords per theme) is assumed to be defined earlier
    top_word = " ".join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[
    comments_data["sentiment_category"].isin([1, 3]),
    "positive_cluster_theme",
] = [kmeans_top_word[x] for x in kmeans_cluster_label]
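To quickly sanity-check the assignment, the distribution of comments over the extracted themes can be summarized with a value count. This snippet is a follow-up, not part of the original code:

# how many of the clustered comments fall under each extracted theme
print(
    comments_data.loc[
        comments_data["sentiment_category"].isin([1, 3]),
        "positive_cluster_theme",
    ].value_counts()
)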
Submission score