Datawhale AI Summer Camp [Updating]

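Summer Camp Overview
LLM Technology (Text) Track: Using AI to Analyze Comments on Product-Promotion Videos
Machine Learning (Data Mining) Track: Using AI to Predict New Users

Summer Camp Overview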

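This AI Summer Camp is a large-scale AI learning program launched by Datawhale over the summer break. It brings together resources from industry, academia, and research with the strength of the open-source community to give learners hands-on project experience, improving their professional skills and employability. Partner organizations include iFLYTEK, Ant Group, the ModelScope community, Alibaba Cloud Tianchi, Intel, Inspur Information, and the Shanghai Academy of AI for Science. The camp is held online and is free of charge throughout.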

The first session offers three tracks: LLM technology, machine learning, and MCP Server development.

LLM Technology (Text) Track: Using AI to Analyze Comments on Product-Promotion Videos

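This hands-on track is based on the "User Insight Challenge Based on Product-Promotion Video Comments" track of the 2025 iFLYTEK AI Developer Competition, hosted by iFLYTEK, so that participants learn by doing.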

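Datawhale provides a Baseline so that beginners with no prior experience can get familiar with the environment and start from scratch; Task 1 of the camp is simply to run the Baseline end to end. The code is written in Python and uses TF-IDF features with a linear classifier and KMeans clustering to perform product identification, sentiment analysis, and comment clustering. It scores roughly 176. The camp provides a web-based programming environment built on ModelScope Notebook.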

The Baseline is analyzed below, cell by cell:

[1] Import the Pandas library and read the two CSV files into DataFrames. origin_videos_data.csv holds the video data, and origin_comments_data.csv holds the comment data.

# [1]
import pandas as pd
video_data = pd.read_csv("origin_videos_data.csv")
comments_data = pd.read_csv("origin_comments_data.csv")

[2] Randomly sample 10 rows from the video data.

# [2]
video_data.sample(10)

[3] Show the first few rows of the comment data.

# [3]
comments_data.head()

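[4] Combine the video description (video_desc) and the video tags (video_tags) into a new text feature (text). Missing values in either column are first filled with empty strings, so that a missing value does not turn the whole concatenated string into NaN.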

# [4]
video_data["text"] = video_data["video_desc"].fillna("") + " " + video_data["video_tags"].fillna("")
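To see why the fillna("") calls matter, here is a minimal sketch on a made-up two-row DataFrame. The values are invented for illustration; only the column names come from the Baseline. In pandas, concatenating a string with a missing value yields a missing value, so without fillna the text of any row with a gap would be lost entirely.

```python
import pandas as pd

# Toy data: invented values; only the column names follow the Baseline.
toy = pd.DataFrame({
    "video_desc": ["great vacuum cleaner", None],
    "video_tags": [None, "#translator #travel"],
})

# Without fillna: a missing part makes the whole concatenation missing.
naive_text = toy["video_desc"] + " " + toy["video_tags"]

# With fillna, as in the Baseline: the available part is kept.
safe_text = toy["video_desc"].fillna("") + " " + toy["video_tags"].fillna("")

print(naive_text.isna().tolist())  # both rows are missing
print(safe_text.tolist())          # both rows keep their available text
```

The leading or trailing space left by the empty side is harmless for TF-IDF, since the tokenizer output is what ends up in the vocabulary.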

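[5] Import the libraries and tools used in the following cells:

jieba: Chinese word segmentation.
TfidfVectorizer: converts text into TF-IDF feature vectors.
SGDClassifier: a linear classifier trained with stochastic gradient descent.
LinearSVC: a linear support vector classifier.
KMeans: clustering.
make_pipeline: builds a machine learning pipeline.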

# [5]
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

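[6] This cell builds a model that predicts the product name:

TfidfVectorizer, with jieba as the tokenizer, processes the text and keeps only the 50 terms with the highest term frequency across the corpus (max_features=50).
SGDClassifier is trained as the classification model.
The product name is then predicted for every video, including those whose product name was originally missing. Note that this also overwrites the labels of videos that already had one.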

# [6]
product_name_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut, max_features=50), SGDClassifier()
)
product_name_predictor.fit(
    video_data[~video_data["product_name"].isnull()]["text"],
    video_data[~video_data["product_name"].isnull()]["product_name"],
)
video_data["product_name"] = product_name_predictor.predict(video_data["text"])

[7] Check the column names of the comment data.

# [7]
comments_data.columns

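[8] This cell predicts several label columns for the comments:

The four columns sentiment category, user scenario, user question, and user suggestion are predicted in turn.
For each column, a model is trained on the labeled rows and then used to predict that column for all comments.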

# [8]
for col in ['sentiment_category', 'user_scenario', 'user_question', 'user_suggestion']:
    predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), SGDClassifier())
    predictor.fit(
        comments_data[~comments_data[col].isnull()]["comment_text"],
        comments_data[~comments_data[col].isnull()][col],
    )
    comments_data[col] = predictor.predict(comments_data["comment_text"])
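The loop above trains every classifier with default settings. One possible way to tune such a pipeline is a small grid search with cross-validation. The sketch below is only an illustration under stated assumptions: the toy comments and labels are invented, whitespace splitting stands in for jieba.lcut, and the parameter grid and the f1_macro scoring are arbitrary choices, not values recommended by the Baseline or the competition.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Toy labeled comments (invented); 1 = positive, 0 = negative.
texts = [
    "love this vacuum", "works great love it",
    "great price great quality", "really love the sound",
    "broke after a week", "terrible battery broke fast",
    "bad quality terrible", "waste of money bad",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# str.split stands in for jieba.lcut to keep the sketch dependency-free.
pipeline = make_pipeline(
    TfidfVectorizer(tokenizer=str.split),
    SGDClassifier(random_state=0),
)

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "sgdclassifier__alpha": [1e-5, 1e-4, 1e-3],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data, the labeled rows of each column would take the place of texts and labels, and search.best_estimator_ could then predict the column for all comments.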

[9] Set the number of topic words extracted per cluster in the later clustering analysis to 10.

# [9]
top_n_words = 10

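[10] This cell clusters the positive comments, i.e. those whose sentiment category is 1 or 3:

The K-means algorithm splits the comments into 2 clusters.
For each cluster, the 10 terms with the largest TF-IDF weights in the cluster center are taken as its topic words.
The cluster topic is written back to the comment data.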

# [10]
kmeans_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=2)
)

kmeans_predictor.fit(comments_data[comments_data["sentiment_category"].isin([1, 3])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(comments_data[comments_data["sentiment_category"].isin([1, 3])]["comment_text"])

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["sentiment_category"].isin([1, 3]), "positive_cluster_theme"] = [kmeans_top_word[x] for x in kmeans_cluster_label]

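[11] Cluster the negative comments, i.e. those whose sentiment category is 2 or 3, following the same procedure as for the positive comments.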

# [11]
kmeans_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=2)
)

kmeans_predictor.fit(comments_data[comments_data["sentiment_category"].isin([2, 3])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(comments_data[comments_data["sentiment_category"].isin([2, 3])]["comment_text"])

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["sentiment_category"].isin([2, 3]), "negative_cluster_theme"] = [kmeans_top_word[x] for x in kmeans_cluster_label]

[12-14] Cluster the user-scenario, user-question, and user-suggestion comments in turn, using the same procedure as the sentiment clustering above.

# [12]
kmeans_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=2)
)

kmeans_predictor.fit(comments_data[comments_data["user_scenario"].isin([1])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(comments_data[comments_data["user_scenario"].isin([1])]["comment_text"])

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["user_scenario"].isin([1]), "scenario_cluster_theme"] = [kmeans_top_word[x] for x in kmeans_cluster_label]

# [13]
kmeans_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=2)
)

kmeans_predictor.fit(comments_data[comments_data["user_question"].isin([1])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(comments_data[comments_data["user_question"].isin([1])]["comment_text"])

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["user_question"].isin([1]), "question_cluster_theme"] = [kmeans_top_word[x] for x in kmeans_cluster_label]

# [14]
kmeans_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=2)
)

kmeans_predictor.fit(comments_data[comments_data["user_suggestion"].isin([1])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(comments_data[comments_data["user_suggestion"].isin([1])]["comment_text"])

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["user_suggestion"].isin([1]), "suggestion_cluster_theme"] = [kmeans_top_word[x] for x in kmeans_cluster_label]

[15] Use a shell command to create a folder named submit that will hold the result files.

# [15]
!mkdir submit

[16] Save the processed video data and comment data as CSV files in the submit folder created in the previous cell.

# [16]
video_data[["video_id", "product_name"]].to_csv("submit/submit_videos.csv", index=None)
comments_data[[
    'video_id', 'comment_id', 'sentiment_category',
    'user_scenario', 'user_question', 'user_suggestion',
    'positive_cluster_theme', 'negative_cluster_theme',
    'scenario_cluster_theme', 'question_cluster_theme',
    'suggestion_cluster_theme'
]].to_csv("submit/submit_comments.csv", index=None)

[17] Use a shell command to compress the submit folder into submit.zip for easy submission.

# [17]
!zip -r submit.zip submit/

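Some parts of this Baseline have shortcomings and could be improved:

[1] It does not check whether the file paths are correct or whether the files have the expected format.
[5] LinearSVC is imported but never used, so the import is redundant.
[6] max_features=50 may leave the model with too few features and lower its prediction accuracy.
[8] The models are trained without any hyperparameter tuning, which may hurt prediction quality.
[10] n_clusters=2 is hard-coded and may not reflect the real cluster structure of the data.
[11] Category 3 is included in both the positive and the negative clustering, which may make the clustering results less accurate.
[12-14] These cells also use a fixed n_clusters=2, which may not suit the characteristics of the different kinds of data.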
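For the hard-coded n_clusters=2, one common heuristic is to try several values and keep the one with the highest silhouette score. The sketch below shows how that could look; it is an assumption, not something the Baseline does. The helper pick_n_clusters, the candidate range 2 to 5, and the toy comments are all hypothetical, and whitespace splitting again stands in for jieba.lcut.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def pick_n_clusters(texts, tokenizer, candidates=range(2, 6), random_state=0):
    """Return (best_k, scores): the candidate k with the highest silhouette score.

    Hypothetical helper; the candidate range is an arbitrary assumption.
    """
    X = TfidfVectorizer(tokenizer=tokenizer).fit_transform(texts)
    scores = {}
    for k in candidates:
        if k >= X.shape[0]:
            break  # the silhouette score needs fewer clusters than samples
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = float(silhouette_score(X, labels))
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy comments (invented) covering three loose topics.
toy_comments = [
    "battery lasts long", "battery life great", "battery charge fast",
    "shipping slow late", "shipping box damaged", "shipping late again",
    "price too high", "price fair cheap", "price discount good",
]
best_k, scores = pick_n_clusters(toy_comments, tokenizer=str.split)
print(best_k, scores)
```

The silhouette score is only one possible criterion; on short, noisy comments it can be flat across k, so the chosen value is worth checking against the leaderboard score as well.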

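Machine Learning (Data Mining) Track: Using AI to Predict New Users

The Baseline is analyzed below, cell by cell:

[1] This cell installs the LightGBM library with pip. LightGBM is an efficient gradient-boosted decision tree framework that is widely used for classification, regression, and similar machine learning tasks. The later cells train the model with it, so it has to be installed first.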

# [1]
!pip install lightgbm

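[2] This cell imports the core libraries and tools needed for data processing and model training:

pandas and numpy: reading and cleaning data, numerical computation;
json: handling any JSON-formatted data;
lightgbm: the gradient boosting library used to build the prediction model;
sklearn.model_selection.StratifiedKFold: stratified cross-validation, which keeps the class distribution consistent across folds;
sklearn.metrics.f1_score: model evaluation (F1 score);
sklearn.preprocessing.LabelEncoder: encoding categorical features;
finally, warnings.filterwarnings('ignore') suppresses warnings at run time so that they do not clutter the output.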

# [2]
import pandas as pd
import numpy as np
import json
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

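[3] This cell loads the data and builds time features:

Load the training set (train.csv) and the test set (testA_data.csv) into DataFrames, and create the base DataFrame for the submission (submit);
Concatenate the training and test sets into full_df for any later feature processing over the whole data;
Extract time features for the three DataFrames (train_df, test_df, full_df):
- convert the raw millisecond timestamp (common_ts) to datetime format (ts);
- extract the day of the month (day), the day of the week (dayofweek), and the hour (hour) from ts as new features;
- drop the temporary ts column to avoid redundant data.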

# [3]
%%time
# 1. Load the data
train_df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./testA_data.csv')
submit = test_df[['did']]
full_df = pd.concat([train_df, test_df], axis=0)

# 2. Time feature engineering
for df in [train_df, test_df, full_df]:
    # Convert the millisecond timestamp to datetime
    df['ts'] = pd.to_datetime(df['common_ts'], unit='ms')
    # Extract time features
    df['day'] = df['ts'].dt.day
    df['dayofweek'] = df['ts'].dt.dayofweek
    df['hour'] = df['ts'].dt.hour
    # Drop the temporary ts column
    df.drop(['ts'], axis=1, inplace=True)
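As a quick check of what the time-feature loop produces, the sketch below runs the same conversion on a single made-up timestamp. The value 1700000000000 is chosen only because it is easy to verify by hand; the column name common_ts follows the Baseline.

```python
import pandas as pd

# One made-up millisecond timestamp, for illustration only.
toy = pd.DataFrame({"common_ts": [1_700_000_000_000]})

# unit="ms" interprets the integer as milliseconds since the Unix epoch (UTC).
toy["ts"] = pd.to_datetime(toy["common_ts"], unit="ms")  # 2023-11-14 22:13:20
toy["day"] = toy["ts"].dt.day              # 14
toy["dayofweek"] = toy["ts"].dt.dayofweek  # 1 (Monday = 0, so Tuesday)
toy["hour"] = toy["ts"].dt.hour            # 22

print(toy.iloc[0].to_dict())
```

Because the conversion yields UTC wall-clock time, the hour feature is a UTC hour. If user activity follows local time, shifting the timestamp before extracting hour may be worth trying, though whether it helps is something to verify on the validation folds.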

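[4] This cell analyzes how many users the training and test sets share:

Extract the unique user identifiers (did) from the training set (train_df) and the test set (test_df) and convert them to sets;
Compute the intersection of the two sets (overlap_dids), i.e. the users that appear in both;
Count the overlapping users and compute their share of all training-set users and of all test-set users;
Print the results to assess how consistent the user distribution is between the two sets, as a reference when judging how well the model generalizes.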

# [4]
%%time
############################### Quick analysis
# Get the unique did values in train and test
train_dids = set(train_df['did'].unique())
test_dids = set(test_df['did'].unique())

# Intersection
overlap_dids = train_dids & test_dids

# Counts
num_overlap = len(overlap_dids)
num_train = len(train_dids)
num_test = len(test_dids)

# Ratios
ratio_in_train = num_overlap / num_train if num_train > 0 else 0
ratio_in_test = num_overlap / num_test if num_test > 0 else 0

# Print the results
print(f"Overlapping did count: {num_overlap}")
print(f"Share of train: {ratio_in_train:.4f} ({num_overlap}/{num_train})")
print(f"Share of test: {ratio_in_test:.4f} ({num_overlap}/{num_test})")

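[5] This cell label-encodes the categorical features:

Define the list of categorical features to encode (cat_features), including device brand, network type, region, operating system, and so on;
Initialize a dictionary (label_encoders) that stores the encoder of each feature;
For each categorical feature:
- use LabelEncoder to map the category values to integers starting from 0;
- fit the encoder on the values of the training and test sets combined, so that the encoding is consistent across the two datasets;
- replace the original feature with the encoded values and keep the encoder for later use.
The encoding turns non-numeric categorical features into a numeric form that the model can process.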

# [5]
%%time
# Features to encode
cat_features = [
    'device_brand', 'ntt', 'operator', 'common_country',
    'common_province', 'common_city', 'appver', 'channel',
    'os_type', 'udmap'
]
# Initialize the encoder dictionary
label_encoders = {}

for feature in cat_features:
    # Create an encoder that maps categories to integers 0..N-1
    le = LabelEncoder()
    # Combine all categories from the training and test sets
    all_values = pd.concat([train_df[feature], test_df[feature]]).astype(str)
    # Fit the encoder on all possible values
    le.fit(all_values)
    # Store the encoder
    label_encoders[feature] = le
    # Apply the encoding
    train_df[feature] = le.transform(train_df[feature].astype(str))
    test_df[feature] = le.transform(test_df[feature].astype(str))
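The reason for fitting the encoder on the training and test values combined can be seen on a tiny example. In the sketch below the brand names are invented; it shows that a LabelEncoder fitted on the training column alone raises a ValueError when the test column contains a category it has not seen, while an encoder fitted on the union assigns consistent codes to both.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented values standing in for one categorical column.
train_col = pd.Series(["huawei", "xiaomi", "huawei"])
test_col = pd.Series(["xiaomi", "oppo"])

# Fitted on the training values only: "oppo" is unknown at transform time.
le_train_only = LabelEncoder().fit(train_col)
try:
    le_train_only.transform(test_col)
    unseen_failed = False
except ValueError:
    unseen_failed = True

# Fitted on the union, as in the Baseline: both columns share one mapping.
le_all = LabelEncoder().fit(pd.concat([train_col, test_col]))
train_codes = le_all.transform(train_col)
test_codes = le_all.transform(test_col)

print(unseen_failed)              # True
print(le_all.classes_.tolist())   # ['huawei', 'oppo', 'xiaomi']
print(train_codes.tolist())       # [0, 2, 0]
print(test_codes.tolist())        # [2, 1]
```

The codes follow the sorted order of the category strings, so they carry no meaning beyond identity; tree models such as LightGBM can still split on them, but the numeric order itself should not be read as a ranking.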

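[6] This cell prepares the input data for model training:

Define the list of features used by the model (features), consisting of raw features (such as device and region information) and time features (such as hour and day of week);
From the training set (train_df), take the feature columns as the model input (X_train) and the label column (is_new_did, the target to predict) as y_train;
From the test set (test_df), take the same feature columns as X_test for later prediction.
This step selects and splits the model's input data.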

# [6]
%%time
# Raw features + time features (no target-encoding or aggregate features are built in this Baseline)
features = [
    # Raw features
    'mid', 'eid', 'device_brand', 'ntt', 'operator', 'common_country', 'common_province', 'common_city',
    'appver', 'channel', 'os_type', 'udmap',
    # Time features
    'hour', 'dayofweek', 'day', 'common_ts'
]

# Prepare the training and test data
X_train = train_df[features]
y_train = train_df['is_new_did']
X_test = test_df[features]

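[7] This cell trains the LightGBM model with cross-validation:

Define the function find_optimal_threshold, which searches a fixed list of candidate thresholds for the one that maximizes the F1 score (a binary classification metric);
Configure the LightGBM parameters (params), including the objective, tree depth, and learning rate, and set a time-based random seed;
Train with five-fold stratified cross-validation (StratifiedKFold):
- in each fold, split into training and validation sets, train the model, and use early stopping (early_stopping) to prevent overfitting;
- predict probabilities on the validation set, search for the best threshold, compute the F1 score, and store each fold's model and results;
- predict on the test set, adding each fold's prediction divided by the number of folds so that the final test-set probability is the average over folds.
This step completes model training, validation, and the preliminary test-set prediction.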

# [7]
%%time
# 6. F1 threshold optimization
def find_optimal_threshold(y_true, y_pred_proba):
    """Find the threshold that maximizes the F1 score."""
    best_threshold = 0.5
    best_f1 = 0
    for threshold in [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4]:
        y_pred = (y_pred_proba >= threshold).astype(int)
        f1 = f1_score(y_true, y_pred)
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold
    return best_threshold, best_f1

# 7. Model training and cross-validation
import time
# Generate the random seed dynamically (from the current time)
seed = int(time.time()) % 1000000  # current timestamp modulo 1,000,000, to keep it small
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'max_depth': 12,
    'num_leaves': 63,
    'learning_rate': 0.1,
    'feature_fraction': 0.7,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'min_child_samples': 10,
    'verbose': -1,
    'n_jobs': 8,
    'seed': seed  # use the dynamically generated seed
}

# Five-fold cross-validation with a fixed random_state, so the split stays the same across runs
n_folds = 5
kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
test_preds = np.zeros(len(X_test))
fold_thresholds = []
fold_f1_scores = []
models = []
oof_preds = np.zeros(len(X_train))
oof_probas = np.zeros(len(X_train))

print("\nStarting model training...")
for fold, (train_idx, val_idx) in enumerate(kf.split(X_train, y_train)):
    print(f"\n======= Fold {fold+1}/{n_folds} =======")
    X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

    # Create the datasets (categorical features are not passed explicitly here)
    train_set = lgb.Dataset(X_tr, label=y_tr)
    val_set = lgb.Dataset(X_val, label=y_val)

    # Train the model
    model = lgb.train(
        params,
        train_set,
        num_boost_round=1000,
        valid_sets=[train_set, val_set],
        callbacks=[
            lgb.early_stopping(stopping_rounds=50, verbose=False),
            lgb.log_evaluation(period=100)
        ]
    )
    models.append(model)

    # Predict on the validation set
    val_pred_proba = model.predict(X_val)
    oof_probas[val_idx] = val_pred_proba

    # Threshold optimization
    best_threshold, best_f1 = find_optimal_threshold(y_val, val_pred_proba)
    fold_thresholds.append(best_threshold)

    # Compute F1 with the optimized threshold
    val_pred_labels = (val_pred_proba >= best_threshold).astype(int)
    fold_f1 = f1_score(y_val, val_pred_labels)
    fold_f1_scores.append(fold_f1)
    oof_preds[val_idx] = val_pred_labels
    print(f"Fold {fold+1} Optimal Threshold: {best_threshold:.4f}")
    print(f"Fold {fold+1} F1 Score: {fold_f1:.5f}")

    # Predict on the test set and accumulate the fold average
    test_preds += model.predict(X_test) / n_folds
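The candidate list in find_optimal_threshold stops at 0.4, so a better threshold above that value can never be selected. A minimal sketch of a variant that scans a finer, wider grid is shown below. The function find_threshold_on_grid and its default grid from 0.05 to 0.95 are hypothetical, and the labels and probabilities are invented toy values; whether a wider grid actually improves the leaderboard score would need to be checked on the real folds.

```python
import numpy as np
from sklearn.metrics import f1_score


def find_threshold_on_grid(y_true, y_pred_proba, grid=None):
    """Return (threshold, f1) maximizing F1 over a grid of candidate thresholds.

    Hypothetical variant of the Baseline's find_optimal_threshold.
    """
    if grid is None:
        grid = np.arange(0.05, 0.96, 0.01)  # assumed default: 0.05 to 0.95 in steps of 0.01
    best_threshold, best_f1 = 0.5, -1.0
    for threshold in grid:
        y_pred = (y_pred_proba >= threshold).astype(int)
        f1 = f1_score(y_true, y_pred, zero_division=0)
        if f1 > best_f1:
            best_threshold, best_f1 = float(threshold), float(f1)
    return best_threshold, best_f1


# Toy check with made-up labels and probabilities.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.05, 0.10, 0.45, 0.55, 0.60, 0.30, 0.90, 0.20])

threshold, f1 = find_threshold_on_grid(y_true, y_prob)
print(threshold, f1)
```

In this toy data the positives are only separated perfectly by a threshold between 0.45 and 0.55, which lies outside the Baseline's candidate list. A finer grid also fits the validation fold more closely, so averaging the per-fold thresholds, as the Baseline already does, remains a sensible safeguard.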

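[8] This cell evaluates the model and generates the predictions:

Evaluate the overall cross-validation performance: average the best thresholds of the folds, use that threshold to turn the out-of-fold (OOF) probabilities on the training set into labels, and compute the final OOF F1 score;
Print the cross-validation results, including the average threshold, the F1 score of each fold, the average F1 score, and the OOF F1 score, to assess the model's stability;
Generate the test-set predictions: use the average threshold to convert the test-set probabilities into labels (is_new_did) and save them to the submission file (submit.csv);
Analyze feature importance: take the gain-based feature importance of the first fold's model and print the top 10 features as a reference for feature engineering.
This step covers the whole path from model evaluation to the submission file.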

# [8]
# 8. Overall evaluation
# Use the average of the per-fold thresholds
avg_threshold = np.mean(fold_thresholds)
final_oof_preds = (oof_probas >= avg_threshold).astype(int)
final_f1 = f1_score(y_train, final_oof_preds)

print("\n===== Final Results =====")
print(f"Average Optimal Threshold: {avg_threshold:.4f}")
print(f"Fold F1 Scores: {[f'{s:.5f}' for s in fold_f1_scores]}")
print(f"Average Fold F1: {np.mean(fold_f1_scores):.5f}")
print(f"OOF F1 Score: {final_f1:.5f}")

# 9. Test-set prediction and submission file
# Predict with the average threshold
test_pred_labels = (test_preds >= avg_threshold).astype(int)
submit['is_new_did'] = test_pred_labels

# Save the submission file
submit[['is_new_did']].to_csv('submit.csv', index=False)
print("\nSubmission file saved: submit.csv")
print(f"Predicted new user ratio: {test_pred_labels.mean():.4f}")
print(f"Test set size: {len(test_pred_labels)}")

# 10. Feature importance analysis (first fold's model, gain-based)
feature_importance = pd.DataFrame({
    'Feature': features,
    'Importance': models[0].feature_importance(importance_type='gain')
}).sort_values('Importance', ascending=False)

print("\nTop 10 Features:")
print(feature_importance.head(10))