Advanced Score Improvement for AI Analysis of Product-Promotion Video Comments [Datawhale AI Summer Camp]

Table of Contents

Competition Recap
Optimization 1
Optimization 2

Competition Recap

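| Module | Item | Description / Example |
|---|---|---|
| Background | Overview | Participants build an end-to-end comment analysis system covering three tasks: product identification, multi-dimensional sentiment analysis, and comment clustering with theme extraction. |
| Product identification | Input | `video_desc` (video description) + `video_tags` (tags) |
| | Output | Product name (e.g., Xfaiyx Smart Translator/Recorder) |
| Multi-dimensional sentiment analysis | Sentiment dimensions | Sentiment polarity (5 classes); user scenario; user question; user suggestion |
| | Challenge | Handling implicit expressions, e.g., "這重量出門帶著剛好" ("this weight is just right to carry when going out") implies a travel scenario |
| Comment clustering and theme extraction | Clustering target | Cluster analysis over 5 categories of comments |
| | Output example | Theme words such as: short battery life / slow charging / severe overheating |
| Competition goal | AI goal | Extract product and user insights from raw comments and turn them into business intelligence |
| Evaluation | Product identification | Accuracy: the proportion of products identified correctly |
| | Sentiment analysis | Macro-averaged F1: measures multi-class performance |
| | Comment clustering | Silhouette coefficient (Silhouette Score): evaluates clustering quality |
| Dataset | Video data | 85 rows, 4 fields, `product_name` partially labeled |
| | Comment data | 6,477 rows, 12 fields, some sentiment fields already labeled |
| Challenges | Low labeling ratio | Only about 15% of samples are manually labeled |
| | Generalization | Performance on unlabeled samples needs to improve |
| | Recommended methods | Semi-supervised learning (e.g., UDA); prompt learning |
| Final goal | Summary | Build a complete AI pipeline: product identification → sentiment analysis → clustering and theme extraction |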
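To make the three evaluation metrics concrete, here is a minimal sketch of how each can be computed with scikit-learn. The labels and feature vectors are invented toy values, not competition data, and the official scoring script may differ in details.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, silhouette_score

# Toy labels, invented for illustration only (not competition data).
y_true = [1, 2, 3, 1, 2, 3]
y_pred = [1, 2, 1, 1, 2, 3]

# Accuracy: fraction of predictions that match the true label exactly.
acc = accuracy_score(y_true, y_pred)

# Macro F1: compute F1 per class, then take the unweighted mean,
# so rare classes count as much as frequent ones.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Silhouette score: needs the feature vectors and the cluster assignment.
# Values near 1 mean tight, well-separated clusters.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
cluster_labels = [0, 0, 1, 1]
sil = silhouette_score(X, cluster_labels)

print(f"accuracy={acc:.3f}  macro_f1={macro_f1:.3f}  silhouette={sil:.3f}")
```

Because macro F1 averages over classes without weighting, a class that is never predicted pulls the score down sharply, which matters when only a small share of comments carries a given label.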

Optimization 1

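Wrap the TF-IDF + classification/clustering flow in a Pipeline

Note: make_pipeline() chains TfidfVectorizer with SGDClassifier / KMeans into a single estimator, which simplifies the training and prediction steps.

Encapsulate the clustering + top-keyword extraction logic in a function

Note: the extract_cluster_theme(...) function handles both text clustering and theme-word extraction in one place, reducing redundant code.

Consolidate the text-field preprocessing

Note: video_desc and video_tags are concatenated into a text field that is used to train the classification model.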
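Two details of the points above are easy to miss: `fillna("")` is what keeps a missing description or tag from turning the combined string into NaN, and `make_pipeline()` names each step after its lowercased class name, which is why the steps can later be looked up by name. A minimal sketch on invented toy rows follows; it uses the default English tokenizer instead of `jieba.lcut` purely to stay dependency-free, which is an assumption of this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy rows invented for illustration; the real data comes from origin_videos_data.csv.
toy_videos = pd.DataFrame({
    "video_desc": ["portable translator demo", None,
                   "voice recorder unboxing", "translator travel review"],
    "video_tags": ["translator travel", "recorder meeting", None, "translator"],
    "product_name": ["translator", "recorder", "recorder", None],
})

# Without fillna(""), any row with a missing field would yield NaN after concatenation.
toy_videos["text"] = (
    toy_videos["video_desc"].fillna("") + " " + toy_videos["video_tags"].fillna("")
)

# make_pipeline derives step names from the lowercased class names.
clf = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
step_names = list(clf.named_steps)
print(step_names)

# Train on the labeled rows only, then predict every row.
labeled = toy_videos[toy_videos["product_name"].notnull()]
clf.fit(labeled["text"], labeled["product_name"])
predictions = clf.predict(toy_videos["text"])
print(predictions)
```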

import os
import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

# -----------------------------
# 1. Load data
# -----------------------------
video_data = pd.read_csv("origin_videos_data.csv")
comments_data = pd.read_csv("origin_comments_data.csv")

# Merge the video text fields as the input for product prediction
video_data["text"] = video_data["video_desc"].fillna("") + " " + video_data["video_tags"].fillna("")

# -----------------------------
# 2. Product name prediction (classification task)
# -----------------------------
product_name_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut, max_features=50),
    SGDClassifier()
)
video_train = video_data[~video_data["product_name"].isnull()]
product_name_predictor.fit(video_train["text"], video_train["product_name"])
video_data["product_name"] = product_name_predictor.predict(video_data["text"])

# -----------------------------
# 3. Multi-dimensional classification of comment sentiment & attributes
# -----------------------------
target_cols = ['sentiment_category', 'user_scenario', 'user_question', 'user_suggestion']

for col in target_cols:
    predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), SGDClassifier())
    train_data = comments_data[~comments_data[col].isnull()]
    predictor.fit(train_data["comment_text"], train_data[col])
    comments_data[col] = predictor.predict(comments_data["comment_text"])

# -----------------------------
# 4. Clustering + theme extraction wrapped in a function
# -----------------------------
def extract_cluster_theme(dataframe, filter_cond, target_column, n_clusters=5, top_n_words=10):
    """Cluster a specific subset of comments and extract theme words."""
    cluster_texts = dataframe[filter_cond]["comment_text"]
    kmeans_pipeline = make_pipeline(
        TfidfVectorizer(tokenizer=jieba.lcut),
        KMeans(n_clusters=n_clusters, random_state=42)
    )
    kmeans_pipeline.fit(cluster_texts)
    cluster_labels = kmeans_pipeline.predict(cluster_texts)

    # Extract the top theme words of each cluster
    tfidf = kmeans_pipeline.named_steps['tfidfvectorizer']
    kmeans = kmeans_pipeline.named_steps['kmeans']
    feature_names = tfidf.get_feature_names_out()
    cluster_centers = kmeans.cluster_centers_
    top_keywords = []
    for i in range(n_clusters):
        indices = cluster_centers[i].argsort()[::-1][:top_n_words]
        keywords = ' '.join([feature_names[idx] for idx in indices])
        top_keywords.append(keywords)

    # Write the theme words into the target column
    dataframe.loc[filter_cond, target_column] = [top_keywords[label] for label in cluster_labels]

# -----------------------------
# 5. Run cluster theme extraction for the five dimensions
# -----------------------------
extract_cluster_theme(
    comments_data,
    comments_data["sentiment_category"].isin([1, 3]),
    "positive_cluster_theme"
)
extract_cluster_theme(
    comments_data,
    comments_data["sentiment_category"].isin([2, 3]),
    "negative_cluster_theme"
)
extract_cluster_theme(
    comments_data,
    comments_data["user_scenario"] == 1,
    "scenario_cluster_theme"
)
extract_cluster_theme(
    comments_data,
    comments_data["user_question"] == 1,
    "question_cluster_theme"
)
extract_cluster_theme(
    comments_data,
    comments_data["user_suggestion"] == 1,
    "suggestion_cluster_theme"
)

# -----------------------------
# 6. Export the predictions
# -----------------------------
os.makedirs("submit", exist_ok=True)
video_data[["video_id", "product_name"]].to_csv("submit/submit_videos.csv", index=False)
comments_data[[
    'video_id', 'comment_id', 'sentiment_category',
    'user_scenario', 'user_question', 'user_suggestion',
    'positive_cluster_theme', 'negative_cluster_theme',
    'scenario_cluster_theme', 'question_cluster_theme',
    'suggestion_cluster_theme'
]].to_csv("submit/submit_comments.csv", index=False)
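The script fits each classifier on every labeled row, so it gives no local estimate of the score before submission. One way to compare variants offline is to hold out part of the labeled rows and compute macro F1 on them. The sketch below is only an illustration under stated assumptions: the toy comments and the `label` column are invented, and the default English tokenizer stands in for `jieba.lcut`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy labeled comments, invented for illustration (1 = positive, 2 = negative).
toy = pd.DataFrame({
    "comment_text": [
        "great battery and fast charging", "love it works great",
        "great sound quality love it", "fast shipping great product",
        "love the design great value", "works great love the battery",
        "bad battery slow charging", "terrible sound very bad",
        "slow and bad quality", "bad product terrible value",
        "terrible design slow response", "very slow bad battery",
    ],
    "label": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
})

# Hold out 25% of the labeled rows; stratify keeps the class ratio in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    toy["comment_text"], toy["label"],
    test_size=0.25, stratify=toy["label"], random_state=0,
)

predictor = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
predictor.fit(X_train, y_train)

# Macro F1 on the held-out rows mirrors the sentiment-analysis metric.
macro_f1 = f1_score(y_val, predictor.predict(X_val), average="macro")
print(f"held-out macro F1: {macro_f1:.3f}")
```

With only about 15% of the samples labeled, a single split is noisy; repeating it with several `random_state` values gives a steadier picture.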

Comparison of results

[Image: comparison of results]

Optimization 2

import os
import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

# -----------------------------
# 1. Load data
# -----------------------------
video_data = pd.read_csv("origin_videos_data.csv")
comments_data = pd.read_csv("origin_comments_data.csv")

# Merge video description + tags into the input field of the product classifier
video_data["text"] = video_data["video_desc"].fillna("") + " " + video_data["video_tags"].fillna("")

# -----------------------------
# 2. Product name prediction (classification task)
# -----------------------------
# Build the product classifier: TF-IDF (at most 50 terms) + SGD classifier
# (well suited to large-scale sparse features)
product_name_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut, max_features=50),
    SGDClassifier()
)
# Train on the rows that have a ground-truth label
video_train = video_data[~video_data["product_name"].isnull()]
product_name_predictor.fit(video_train["text"], video_train["product_name"])

# Predict the product name for all videos
video_data["product_name"] = product_name_predictor.predict(video_data["text"])

# Optional optimizations:
# - Model swap: `SGDClassifier` can be replaced with `LogisticRegression`, `XGBoost`, `RandomForest`, etc.
# - Better tokenization: `jieba` can be replaced with `pkuseg`, `LAC`, or a `BERT` tokenizer (stronger but slower)
# - Add n-grams: `ngram_range=(1,2)` captures keyword combinations and can improve classification accuracy

# -----------------------------
# 3. Multi-dimensional classification of comment sentiment & attributes
# -----------------------------
# Comment attribute labels to predict (classification tasks)
target_cols = ['sentiment_category', 'user_scenario', 'user_question', 'user_suggestion']

# Train one TF-IDF + SGD classifier per target column
for col in target_cols:
    predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), SGDClassifier())
    train_data = comments_data[~comments_data[col].isnull()]
    predictor.fit(train_data["comment_text"], train_data[col])
    comments_data[col] = predictor.predict(comments_data["comment_text"])

# Optional optimizations:
# - Use `MultiOutputClassifier` to build a joint multi-label classifier
# - With imbalanced classes, consider adding `class_weight='balanced'`
# - Print a `classification_report` to get per-class metrics and guide tuning

# -----------------------------
# 4. Clustering + theme extraction wrapped in a function
# -----------------------------
def extract_cluster_theme(dataframe, filter_cond, target_column, n_clusters=5, top_n_words=10):
    """Cluster the comments selected by the given condition with KMeans, extract
    the top keywords of each cluster, and write them into the theme column."""
    cluster_texts = dataframe[filter_cond]["comment_text"]

    # Build the clustering model: TF-IDF + KMeans
    kmeans_pipeline = make_pipeline(
        TfidfVectorizer(tokenizer=jieba.lcut),
        KMeans(n_clusters=n_clusters, random_state=42)
    )
    kmeans_pipeline.fit(cluster_texts)
    cluster_labels = kmeans_pipeline.predict(cluster_texts)

    # Extract each cluster's top keywords (the n terms with the highest
    # TF-IDF weight in the cluster center)
    tfidf = kmeans_pipeline.named_steps['tfidfvectorizer']
    kmeans = kmeans_pipeline.named_steps['kmeans']
    feature_names = tfidf.get_feature_names_out()
    cluster_centers = kmeans.cluster_centers_
    top_keywords = []
    for i in range(n_clusters):
        indices = cluster_centers[i].argsort()[::-1][:top_n_words]
        keywords = ' '.join([feature_names[idx] for idx in indices])
        top_keywords.append(keywords)

    # Assign the matching theme label to every comment in the filtered subset
    dataframe.loc[filter_cond, target_column] = [top_keywords[label] for label in cluster_labels]

# Optional optimizations:
# - Swap the clustering algorithm: `KMeans` → `MiniBatchKMeans` (faster), `LDA` (more semantic),
#   `HDBSCAN` (no need to specify the number of clusters)
# - TF-IDF can take `max_features`, `stop_words`, `ngram_range`, etc. for a richer representation
# - Add `TSNE` / `UMAP` dimensionality reduction to visualize the cluster distribution
# - Save the most representative samples (e.g., the comments closest to each cluster center)

# -----------------------------
# 5. Run cluster theme extraction for the five dimensions
# -----------------------------
# Extract themes for the following comment subsets and write them into the given columns
extract_cluster_theme(
    comments_data,
    comments_data["sentiment_category"].isin([1, 3]),
    "positive_cluster_theme"
)
extract_cluster_theme(
    comments_data,
    comments_data["sentiment_category"].isin([2, 3]),
    "negative_cluster_theme"
)
extract_cluster_theme(
    comments_data,
    comments_data["user_scenario"] == 1,
    "scenario_cluster_theme"
)
extract_cluster_theme(
    comments_data,
    comments_data["user_question"] == 1,
    "question_cluster_theme"
)
extract_cluster_theme(
    comments_data,
    comments_data["user_suggestion"] == 1,
    "suggestion_cluster_theme"
)

# Optional optimizations:
# - Add exception handling so the program does not crash when a cluster subset is empty
# - If multilingual data must be supported later, replace the tokenizer and clustering logic
#   with more general versions

# -----------------------------
# 6. Export the predictions
# -----------------------------
# Create the output directory
os.makedirs("submit", exist_ok=True)

# Export the product predictions
video_data[["video_id", "product_name"]].to_csv("submit/submit_videos.csv", index=False)

# Export the comment classification + cluster theme results
comments_data[[
    'video_id', 'comment_id', 'sentiment_category',
    'user_scenario', 'user_question', 'user_suggestion',
    'positive_cluster_theme', 'negative_cluster_theme',
    'scenario_cluster_theme', 'question_cluster_theme',
    'suggestion_cluster_theme'
]].to_csv("submit/submit_comments.csv", index=False)

Comparison of results:
[Image: comparison of results]
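Since comment clustering is scored with the silhouette coefficient, the fixed `n_clusters=5` is a natural knob to tune, and the code comments also suggest guarding against empty subsets. The following is a minimal sketch of both ideas. `pick_n_clusters` and its `candidates` parameter are hypothetical names introduced here, the toy comments are invented, and the default English tokenizer replaces `jieba.lcut`.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def pick_n_clusters(texts, candidates=(2, 3, 4)):
    """Hypothetical helper: return (best_k, best_score), or (None, None)
    when there are too few usable texts to cluster."""
    # Drop missing or blank comments before vectorizing.
    texts = [t for t in texts if isinstance(t, str) and t.strip()]
    # silhouette_score requires 2 <= k <= n_samples - 1.
    usable = [k for k in candidates if 2 <= k <= len(texts) - 1]
    if not usable:
        return None, None

    X = TfidfVectorizer().fit_transform(texts)
    best_k, best_score = None, None
    for k in usable:
        labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
        if len(set(labels)) < 2:
            continue  # degenerate clustering, silhouette is undefined
        score = silhouette_score(X, labels)
        if best_score is None or score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Toy negative comments, invented for illustration.
toy_comments = [
    "battery drains fast", "battery drains quickly", "battery life short",
    "charging is slow", "charging takes long", "slow charging speed",
    "gets hot", "very hot when charging",
]
best_k, best_score = pick_n_clusters(toy_comments)
print(best_k, best_score)

# An empty subset returns (None, None) instead of raising.
empty_result = pick_n_clusters([])
```

The chosen value could then be passed as `n_clusters` to `extract_cluster_theme`, skipping the call when `best_k` is `None`.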