I. Competition Objectives
Using the Spark 4.0 Ultra large model as the base, perform product identification, sentiment analysis, and clustering analysis on the video and comment data, and ultimately produce an evaluation of promotion effectiveness; then improve the accuracy of that evaluation by tuning the model.
II. Competition Environment
1. Base platform: the Spark 4.0 Ultra large model
2. Competition data: videos and video comments
The dataset contains 85 anonymized promotional-video records and 6,477 comment text records.
It includes a small training set with manual annotations (covering only the product-identification and sentiment-analysis labels) and an unannotated test set.
- Data format of the video content text:
No. | Variable | Type | Description |
---|---|---|---|
1 | video_id | string | video ID |
2 | video_desc | string | video description |
3 | video_tags | string | video tags |
4 | product_name | string | name of the promoted product |
- Data format of the comment text:
No. | Variable | Type | Description |
---|---|---|---|
1 | video_id | string | video ID |
2 | comment_id | string | comment ID |
3 | comment_text | string | comment text |
4 | sentiment_category | int | sentiment category of the comment toward the product |
5 | user_scenario | int | whether the comment involves a usage scenario; 0 = no, 1 = yes |
6 | user_question | int | whether the comment involves a user question; 0 = no, 1 = yes |
7 | user_suggestion | int | whether the comment involves a user suggestion; 0 = no, 1 = yes |
8 | positive_cluster_theme | string | theme keywords of the cluster formed from positive comments |
9 | negative_cluster_theme | string | theme keywords of the cluster formed from negative comments |
10 | scenario_cluster_theme | string | theme keywords of the cluster formed from usage-scenario comments |
11 | question_cluster_theme | string | theme keywords of the cluster formed from user-question comments |
12 | suggestion_cluster_theme | string | theme keywords of the cluster formed from user-suggestion comments |
III. Competition Tasks
[Product identification] Accurately identify the promoted product from the video data.
[Sentiment analysis] Perform multi-dimensional sentiment analysis on the comment text; the dimensions are described in the data format above.
[Comment clustering] For each product, cluster the comments that fall under the specified dimensions and distill theme keywords for each cluster.
IV. Competition Steps
The baseline below follows the steps provided by the competition.
1. Load the data
import pandas as pd
video_data = pd.read_csv("origin_videos_data.csv")
comments_data = pd.read_csv("origin_comments_data.csv")
2. Randomly sample 10 rows from the video table
video_data.sample(10)
3. Preview the comments table (head() returns the first five rows, not just the header)
comments_data.head()
4. Merge the two video text fields into a single field used for product identification
video_data["text"] = video_data["video_desc"].fillna("") + " " + video_data["video_tags"].fillna("")
5. Import the tokenizer and the classification and clustering toolkits
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC  # alternative classifier; not used in this baseline
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
6. Tokenize the merged text and predict product_name for every video
# Train on the videos that have a manual product_name label, then predict for all videos
product_name_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut, max_features=50),
    SGDClassifier(),
)
product_name_predictor.fit(
    video_data[~video_data["product_name"].isnull()]["text"],
    video_data[~video_data["product_name"].isnull()]["product_name"],
)
video_data["product_name"] = product_name_predictor.predict(video_data["text"])
In both pipelines above, max_features=50 keeps only the 50 most frequent vocabulary terms as features during vectorization, which reduces dimensionality and improves computational efficiency.
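A minimal illustration of the cap, using three toy sentences (made up for this example, not competition data):

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["这款面膜补水效果很好", "面膜价格便宜效果不错", "补水保湿值得推荐"]
full = TfidfVectorizer(tokenizer=jieba.lcut).fit(docs)
capped = TfidfVectorizer(tokenizer=jieba.lcut, max_features=5).fit(docs)
print(len(full.vocabulary_))    # every distinct token becomes a feature
print(len(capped.vocabulary_))  # only the 5 highest-frequency tokens are kept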
7. Inspect the columns of the comment table
comments_data.columns
8. Predict the comment labels: the sentiment category plus the three binary dimensions
# For each label column, train on the annotated rows and predict labels for all comments
for col in ['sentiment_category', 'user_scenario', 'user_question', 'user_suggestion']:
    predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), SGDClassifier())
    predictor.fit(
        comments_data[~comments_data[col].isnull()]["comment_text"],
        comments_data[~comments_data[col].isnull()][col],
    )
    comments_data[col] = predictor.predict(comments_data["comment_text"])
9. Set the number of keywords to extract per cluster
top_n_words = 20
What top_n_words means
Theme representation: top_n_words determines how many keywords are extracted from each cluster to represent that cluster's theme. For example, with top_n_words = 20, each cluster theme consists of 20 keywords, selected by their weight ranking in the cluster centroid.
Effect on interpretation: extracting more keywords gives a fuller description of a theme but can introduce noise; extracting fewer risks an incomplete description.
In short, top_n_words sets how many of the most important terms are taken from each cluster, and so shapes how the cluster themes are understood.
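To make the ranking concrete, here is a tiny illustration of how the largest centroid weights select the theme words; the weights and vocabulary below are made up for the example:

import numpy as np

# Hypothetical centroid weights for a 5-term vocabulary (illustration only)
centroid = np.array([0.1, 0.7, 0.0, 0.4, 0.2])
feature_names = np.array(["价格", "效果", "物流", "包装", "味道"])

top_n_words = 3
top_idx = centroid.argsort()[::-1][:top_n_words]  # indices of the 3 largest weights
print(" ".join(feature_names[top_idx]))           # -> 效果 包装 味道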
10. Comment clustering 1: positive-sentiment themes
# Cluster the comments with positive-leaning sentiment (categories 1 and 3)
kmeans_predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=5))
kmeans_predictor.fit(comments_data[comments_data["sentiment_category"].isin([1, 3])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(
    comments_data[comments_data["sentiment_category"].isin([1, 3])]["comment_text"]
)

# Rank each cluster centroid's TF-IDF weights and keep the top_n_words terms as the theme
kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

# Write each comment's cluster theme back to the positive_cluster_theme column
comments_data.loc[comments_data["sentiment_category"].isin([1, 3]), "positive_cluster_theme"] = [
    kmeans_top_word[x] for x in kmeans_cluster_label
]
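The baseline fixes n_clusters=5 for this and every clustering step below. As an optional alternative, not part of the baseline (the candidate range and random_state are arbitrary choices), the cluster count could be picked by silhouette score:

from sklearn.metrics import silhouette_score

pos_text = comments_data[comments_data["sentiment_category"].isin([1, 3])]["comment_text"]
X = TfidfVectorizer(tokenizer=jieba.lcut).fit_transform(pos_text)
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))  # a higher silhouette means better-separated clusters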
11. Comment clustering 2: negative-sentiment themes
# Same procedure for comments with negative-leaning sentiment (categories 2 and 3)
kmeans_predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=5))
kmeans_predictor.fit(comments_data[comments_data["sentiment_category"].isin([2, 3])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(
    comments_data[comments_data["sentiment_category"].isin([2, 3])]["comment_text"]
)

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["sentiment_category"].isin([2, 3]), "negative_cluster_theme"] = [
    kmeans_top_word[x] for x in kmeans_cluster_label
]
12. Comment clustering 3: user-scenario themes
# Same procedure for comments flagged as usage scenarios (user_scenario == 1)
kmeans_predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=5))
kmeans_predictor.fit(comments_data[comments_data["user_scenario"].isin([1])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(
    comments_data[comments_data["user_scenario"].isin([1])]["comment_text"]
)

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["user_scenario"].isin([1]), "scenario_cluster_theme"] = [
    kmeans_top_word[x] for x in kmeans_cluster_label
]
13. Comment clustering 4: user-question themes
# Same procedure for comments flagged as user questions (user_question == 1)
kmeans_predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=5))
kmeans_predictor.fit(comments_data[comments_data["user_question"].isin([1])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(
    comments_data[comments_data["user_question"].isin([1])]["comment_text"]
)

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["user_question"].isin([1]), "question_cluster_theme"] = [
    kmeans_top_word[x] for x in kmeans_cluster_label
]
14. Comment clustering 5: user-suggestion themes
# Same procedure for comments flagged as user suggestions (user_suggestion == 1)
kmeans_predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=5))
kmeans_predictor.fit(comments_data[comments_data["user_suggestion"].isin([1])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(
    comments_data[comments_data["user_suggestion"].isin([1])]["comment_text"]
)

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["user_suggestion"].isin([1]), "suggestion_cluster_theme"] = [
    kmeans_top_word[x] for x in kmeans_cluster_label
]
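Steps 10 through 14 repeat the same logic; only the row mask and the target column change. As an optional refactor (a sketch under the same assumptions as the baseline, not the official code), the pattern can be collapsed into one helper:

def cluster_theme(mask, target_col, n_clusters=5):
    """Cluster the masked comments and write each comment's cluster keywords to target_col."""
    texts = comments_data.loc[mask, "comment_text"]
    pipe = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=n_clusters))
    labels = pipe.fit_predict(texts)
    names = pipe.named_steps["tfidfvectorizer"].get_feature_names_out()
    themes = [
        " ".join(names[i] for i in center.argsort()[::-1][:top_n_words])
        for center in pipe.named_steps["kmeans"].cluster_centers_
    ]
    comments_data.loc[mask, target_col] = [themes[x] for x in labels]

# Example: reproduces step 14 in one call
cluster_theme(comments_data["user_suggestion"].isin([1]), "suggestion_cluster_theme")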
15. Create the output directory
mkdir submit
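The mkdir command assumes shell access (e.g. a !mkdir cell in a Jupyter-style notebook). An equivalent in plain Python:

import os
os.makedirs("submit", exist_ok=True)  # same effect as `mkdir submit`, but no error if it already exists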
16. Write the submission files and package them
video_data[["video_id", "product_name"]].to_csv("submit/submit_videos.csv", index=None)
comments_data[['video_id', 'comment_id', 'sentiment_category','user_scenario', 'user_question', 'user_suggestion','positive_cluster_theme', 'negative_cluster_theme','scenario_cluster_theme', 'question_cluster_theme','suggestion_cluster_theme']].to_csv("submit/submit_comments.csv", index=None)
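The step name mentions packaging, but the code above only writes the two CSV files. One way to produce the zip for upload, a standard-library sketch in which the archive name is my own choice:

import shutil

# Bundle the submit/ directory into submit.zip in the current directory
shutil.make_archive("submit", "zip", root_dir=".", base_dir="submit")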
V. Tuning Steps
To improve the competition score, this write-up tunes the baseline in the following ways.
1. Set n_clusters = 5, with the default sample of 10 videos.
Result:
Score details
2. Extract 20 keywords per cluster (top_n_words = 20), still keeping n_clusters = 5.
Final result:
Score details
3. Increase the sample to 20 videos, keeping everything else the same as in experiment 2.
The score was lower: some of the video records are empty and have no manual annotations. Annotating them by hand might improve the result.
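To check that claim, a quick count of the unannotated videos (run on the raw file, since step 6 overwrites product_name with predictions):

import pandas as pd

raw = pd.read_csv("origin_videos_data.csv")
missing = raw["product_name"].isnull().sum()
print(f"{missing} of {len(raw)} videos have no manually annotated product_name")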
Summary:
The best tuned score reached just over 228 points; it could be optimized further after additional manual annotation.