I. Competition Objectives
Using the Spark 4.0 Ultra large model as the base, perform product identification, sentiment analysis, and clustering analysis on the video and comment data, and ultimately produce an evaluation of promotion effectiveness; then improve the accuracy of that evaluation by tuning the model.
II. Competition Environment
1. Base platform: the Spark 4.0 Ultra large model
2. Competition data: videos and video comments
The dataset contains 85 anonymized promotional-video records and 6,477 comment text records.
It includes a small training set with manual annotations (covering only the product-identification and sentiment-analysis labels) and an unannotated test set.
- Data format of the video content text:
No. | Variable | Type | Description |
---|---|---|---|
1 | video_id | string | video ID |
2 | video_desc | string | video description |
3 | video_tags | string | video tags |
4 | product_name | string | name of the promoted product |
- Data format of the comment text:
No. | Variable | Type | Description |
---|---|---|---|
1 | video_id | string | video ID |
2 | comment_id | string | comment ID |
3 | comment_text | string | comment text |
4 | sentiment_category | int | sentiment category of the comment toward the product |
5 | user_scenario | int | whether the comment involves a usage scenario; 0 = no, 1 = yes |
6 | user_question | int | whether the comment involves a user question; 0 = no, 1 = yes |
7 | user_suggestion | int | whether the comment involves a user suggestion; 0 = no, 1 = yes |
8 | positive_cluster_theme | string | theme keywords of the cluster formed from positive comments |
9 | negative_cluster_theme | string | theme keywords of the cluster formed from negative comments |
10 | scenario_cluster_theme | string | theme keywords of the cluster formed from usage-scenario comments |
11 | question_cluster_theme | string | theme keywords of the cluster formed from user-question comments |
12 | suggestion_cluster_theme | string | theme keywords of the cluster formed from user-suggestion comments |
III. Competition Tasks
[Product identification] Accurately identify the promoted product from the video data.
[Sentiment analysis] Perform multi-dimensional sentiment analysis on the comment text; the dimensions are described in the data format above.
[Comment clustering] For each product, cluster the comments that fall under the specified dimensions and distill theme keywords for each cluster.
IV. Competition Steps
The baseline below follows the steps provided by the competition.
1. Load the data
import pandas as pd
video_data = pd.read_csv("origin_videos_data.csv")
comments_data = pd.read_csv("origin_comments_data.csv")
2. Randomly sample 10 rows from the video table
video_data.sample(10)
3. Preview the comments table (head() returns the first five rows, not just the header)
comments_data.head()
4. Merge the two video text fields into a single field used for product identification
video_data["text"] = video_data["video_desc"].fillna("") + " " + video_data["video_tags"].fillna("")
5. Import the tokenizer and the classification and clustering toolkits
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC  # alternative classifier; not used in this baseline
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
6. Tokenize the merged text and predict product_name for every video
# Train on the videos that have a manual product_name label, then predict for all videos
product_name_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut, max_features=50),
    SGDClassifier(),
)
product_name_predictor.fit(
    video_data[~video_data["product_name"].isnull()]["text"],
    video_data[~video_data["product_name"].isnull()]["product_name"],
)
video_data["product_name"] = product_name_predictor.predict(video_data["text"])
In both pipelines above, max_features=50 keeps only the 50 most frequent vocabulary terms as features during vectorization, which reduces dimensionality and improves computational efficiency.
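A minimal illustration of the cap, using three toy sentences (made up for this example, not competition data):

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["这款面膜补水效果很好", "面膜价格便宜效果不错", "补水保湿值得推荐"]
full = TfidfVectorizer(tokenizer=jieba.lcut).fit(docs)
capped = TfidfVectorizer(tokenizer=jieba.lcut, max_features=5).fit(docs)
print(len(full.vocabulary_))    # every distinct token becomes a feature
print(len(capped.vocabulary_))  # only the 5 highest-frequency tokens are kept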
7. Inspect the columns of the comment table
comments_data.columns
8. Predict the comment labels: the sentiment category plus the three binary dimensions
# For each label column, train on the annotated rows and predict labels for all comments
for col in ['sentiment_category', 'user_scenario', 'user_question', 'user_suggestion']:
    predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), SGDClassifier())
    predictor.fit(
        comments_data[~comments_data[col].isnull()]["comment_text"],
        comments_data[~comments_data[col].isnull()][col],
    )
    comments_data[col] = predictor.predict(comments_data["comment_text"])
9. Set the number of keywords to extract per cluster
top_n_words = 20
What top_n_words means
Theme representation: top_n_words determines how many keywords are extracted from each cluster to represent that cluster's theme. For example, with top_n_words = 20, each cluster theme consists of 20 keywords, selected by their weight ranking in the cluster centroid.
Effect on interpretation: extracting more keywords gives a fuller description of a theme but can introduce noise; extracting fewer risks an incomplete description.
In short, top_n_words sets how many of the most important terms are taken from each cluster, and so shapes how the cluster themes are understood.
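To make the ranking concrete, here is a tiny illustration of how the largest centroid weights select the theme words; the weights and vocabulary below are made up for the example:

import numpy as np

# Hypothetical centroid weights for a 5-term vocabulary (illustration only)
centroid = np.array([0.1, 0.7, 0.0, 0.4, 0.2])
feature_names = np.array(["价格", "效果", "物流", "包装", "味道"])

top_n_words = 3
top_idx = centroid.argsort()[::-1][:top_n_words]  # indices of the 3 largest weights
print(" ".join(feature_names[top_idx]))           # -> 效果 包装 味道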
10. Comment clustering 1: positive-sentiment themes
# Cluster the comments with positive-leaning sentiment (categories 1 and 3)
kmeans_predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=5))
kmeans_predictor.fit(comments_data[comments_data["sentiment_category"].isin([1, 3])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(
    comments_data[comments_data["sentiment_category"].isin([1, 3])]["comment_text"]
)

# Rank each cluster centroid's TF-IDF weights and keep the top_n_words terms as the theme
kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

# Write each comment's cluster theme back to the positive_cluster_theme column
comments_data.loc[comments_data["sentiment_category"].isin([1, 3]), "positive_cluster_theme"] = [
    kmeans_top_word[x] for x in kmeans_cluster_label
]
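The baseline fixes n_clusters=5 for this and every clustering step below. As an optional alternative, not part of the baseline (the candidate range and random_state are arbitrary choices), the cluster count could be picked by silhouette score:

from sklearn.metrics import silhouette_score

pos_text = comments_data[comments_data["sentiment_category"].isin([1, 3])]["comment_text"]
X = TfidfVectorizer(tokenizer=jieba.lcut).fit_transform(pos_text)
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))  # a higher silhouette means better-separated clusters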
11. Comment clustering 2: negative-sentiment themes
# Same procedure for comments with negative-leaning sentiment (categories 2 and 3)
kmeans_predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=5))
kmeans_predictor.fit(comments_data[comments_data["sentiment_category"].isin([2, 3])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(
    comments_data[comments_data["sentiment_category"].isin([2, 3])]["comment_text"]
)

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["sentiment_category"].isin([2, 3]), "negative_cluster_theme"] = [
    kmeans_top_word[x] for x in kmeans_cluster_label
]
12. Comment clustering 3: user-scenario themes
# Same procedure for comments flagged as usage scenarios (user_scenario == 1)
kmeans_predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=5))
kmeans_predictor.fit(comments_data[comments_data["user_scenario"].isin([1])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(
    comments_data[comments_data["user_scenario"].isin([1])]["comment_text"]
)

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["user_scenario"].isin([1]), "scenario_cluster_theme"] = [
    kmeans_top_word[x] for x in kmeans_cluster_label
]
13. Comment clustering 4: user-question themes
# Same procedure for comments flagged as user questions (user_question == 1)
kmeans_predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=5))
kmeans_predictor.fit(comments_data[comments_data["user_question"].isin([1])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(
    comments_data[comments_data["user_question"].isin([1])]["comment_text"]
)

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["user_question"].isin([1]), "question_cluster_theme"] = [
    kmeans_top_word[x] for x in kmeans_cluster_label
]
14. Comment clustering 5: user-suggestion themes
# Same procedure for comments flagged as user suggestions (user_suggestion == 1)
kmeans_predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=5))
kmeans_predictor.fit(comments_data[comments_data["user_suggestion"].isin([1])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(
    comments_data[comments_data["user_suggestion"].isin([1])]["comment_text"]
)

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["user_suggestion"].isin([1]), "suggestion_cluster_theme"] = [
    kmeans_top_word[x] for x in kmeans_cluster_label
]
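Steps 10 through 14 repeat the same logic; only the row mask and the target column change. As an optional refactor (a sketch under the same assumptions as the baseline, not the official code), the pattern can be collapsed into one helper:

def cluster_theme(mask, target_col, n_clusters=5):
    """Cluster the masked comments and write each comment's cluster keywords to target_col."""
    texts = comments_data.loc[mask, "comment_text"]
    pipe = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=n_clusters))
    labels = pipe.fit_predict(texts)
    names = pipe.named_steps["tfidfvectorizer"].get_feature_names_out()
    themes = [
        " ".join(names[i] for i in center.argsort()[::-1][:top_n_words])
        for center in pipe.named_steps["kmeans"].cluster_centers_
    ]
    comments_data.loc[mask, target_col] = [themes[x] for x in labels]

# Example: reproduces step 14 in one call
cluster_theme(comments_data["user_suggestion"].isin([1]), "suggestion_cluster_theme")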
15. Create the output directory
mkdir submit
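The mkdir command assumes shell access (e.g. a !mkdir cell in a Jupyter-style notebook). An equivalent in plain Python:

import os
os.makedirs("submit", exist_ok=True)  # same effect as `mkdir submit`, but no error if it already exists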
16. Write the submission files and package them
video_data[["video_id", "product_name"]].to_csv("submit/submit_videos.csv", index=None)
comments_data[['video_id', 'comment_id', 'sentiment_category','user_scenario', 'user_question', 'user_suggestion','positive_cluster_theme', 'negative_cluster_theme','scenario_cluster_theme', 'question_cluster_theme','suggestion_cluster_theme']].to_csv("submit/submit_comments.csv", index=None)
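The step name mentions packaging, but the code above only writes the two CSV files. One way to produce the zip for upload, a standard-library sketch in which the archive name is my own choice:

import shutil

# Bundle the submit/ directory into submit.zip in the current directory
shutil.make_archive("submit", "zip", root_dir=".", base_dir="submit")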
V. Tuning Steps
To improve the competition score, this write-up tunes the baseline in the following ways.
1. Set n_clusters = 5, with the default sample of 10 videos.
Result:
Score details
2. Extract 20 keywords per cluster (top_n_words = 20), still keeping n_clusters = 5.
Final result:
Score details
3. Increase the sample to 20 videos, keeping everything else the same as in experiment 2.
The score was lower: some of the video records are empty and have no manual annotations. Annotating them by hand might improve the result.
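To check that claim, a quick count of the unannotated videos (run on the raw file, since step 6 overwrites product_name with predictions):

import pandas as pd

raw = pd.read_csv("origin_videos_data.csv")
missing = raw["product_name"].isnull().sum()
print(f"{missing} of {len(raw)} videos have no manually annotated product_name")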
Summary:
The best tuned score reached just over 228 points; it could be optimized further after additional manual annotation.