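Competition goal:
- Predict the click-through rate of items on web pages by analyzing online system logs and user-behavior data.
- This is a binary classification problem: we only need to predict whether the user clicks.
- Ideally the model outputs a probability, for example the probability that a user clicks a given ad.
Competition website
Files: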
train - Training set. 10 days of click-through data, ordered chronologically. Non-clicks and clicks are subsampled according to different strategies.
test - Test set. 1 day of ads for testing your model predictions.
sampleSubmission.csv - Sample submission file in the correct format, corresponds to the All-0.5 Benchmark.
Fields:
- id: ad identifier
- click: 0/1 for non-click/click
- hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
- C1 – anonymized categorical variable
- banner_pos
- site_id
- site_domain
- site_category
- app_id
- app_domain
- app_category
- device_id
- device_ip
- device_model
- device_type
- device_conn_type
- C14-C21 – anonymized categorical variables
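Initial analysis:
- This is a click-through-rate prediction problem, i.e. binary classification.
- From a first look, the fields fall into four groups: user, site, ad, and time.
- Time should be an important feature worth analyzing closely, since people tend to look at different things at different times of day.
- Site category is also likely to correlate strongly with the user.
- Device type may reflect the user's rough income bracket and spending level.
- And so on. There are surely many more relationships among these fields, and we should put ourselves in the user's position when thinking about them.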
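Since time is expected to matter, one possible first step is to decode the `hour` field, which is documented as YYMMDDHH. The sketch below is only an illustration on two toy values; the derived column names `hour_of_day` and `day_of_week` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame using the documented YYMMDDHH encoding of the `hour` field.
df = pd.DataFrame({"hour": [14091123, 14102100]})

# Parse the integer into a timestamp (two-digit year, month, day, hour).
ts = pd.to_datetime(df["hour"].astype(str), format="%y%m%d%H")

# Derived columns (illustrative names, not from the original post).
df["hour_of_day"] = ts.dt.hour
df["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
print(df)
```

Hour of day and day of week can then be treated as categorical features like the other fields.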
Load Data
import pandas as pd

# Initial setup
train_filename = "train_small.csv"  # the original dataset is large, so load a downsampled sample first
test_filename = "test.csv"
submission_filename = "submit.csv"

training_set = pd.read_csv(train_filename)
Explore Data
training_set.shape
(99999, 24)
# First, take a look at the data
training_set.head(10)
index | id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.000009e+18 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 2 | 15706 | 320 | 50 | 1722 | 0 | 35 | -1 | 79 |
1 | 1.000017e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
2 | 1.000037e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
3 | 1.000064e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15706 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
4 | 1.000068e+19 | 0 | 14102100 | 1005 | 1 | fe8cc448 | 9166c161 | 0569f928 | ecad2386 | 7801e8d9 | ... | 1 | 0 | 18993 | 320 | 50 | 2161 | 0 | 35 | -1 | 157 |
5 | 1.000072e+19 | 0 | 14102100 | 1005 | 0 | d6137915 | bb1ef334 | f028772b | ecad2386 | 7801e8d9 | ... | 1 | 0 | 16920 | 320 | 50 | 1899 | 0 | 431 | 100077 | 117 |
6 | 1.000072e+19 | 0 | 14102100 | 1005 | 0 | 8fda644b | 25d4cfcd | f028772b | ecad2386 | 7801e8d9 | ... | 1 | 0 | 20362 | 320 | 50 | 2333 | 0 | 39 | -1 | 157 |
7 | 1.000092e+19 | 0 | 14102100 | 1005 | 1 | e151e245 | 7e091613 | f028772b | ecad2386 | 7801e8d9 | ... | 1 | 0 | 20632 | 320 | 50 | 2374 | 3 | 39 | -1 | 23 |
8 | 1.000095e+19 | 1 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 2 | 15707 | 320 | 50 | 1722 | 0 | 35 | -1 | 79 |
9 | 1.000126e+19 | 0 | 14102100 | 1002 | 0 | 84c7ba46 | c4e18dd6 | 50e219e0 | ecad2386 | 7801e8d9 | ... | 0 | 0 | 21689 | 320 | 50 | 2496 | 3 | 167 | 100191 | 23 |
10 rows × 24 columns
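- There are 24 columns: id, the click label, and 22 features, many of which are categorical.
- The training sample has 99999 rows, which is a workable size: neither too many nor too few.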
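One detail in the table above is that `id` is displayed in scientific notation, which suggests pandas parsed it as a float. A float64 cannot represent every 19 to 20 digit integer exactly, so distinct ids could collide. A minimal sketch of one way around this, assuming it is acceptable to keep `id` as a string; the two rows below are made up:

```python
import io
import pandas as pd

# Two made-up rows in the competition's layout (truncated to three columns).
csv_text = (
    "id,click,hour\n"
    "10000169349117863715,0,14102100\n"
    "10000371904215119486,1,14102100\n"
)

# Default parsing vs. forcing `id` to stay a string.
as_default = pd.read_csv(io.StringIO(csv_text))
as_string = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(as_default["id"].dtype)     # numeric dtype chosen by pandas
print(as_string["id"].tolist())   # exact digits preserved
```

This only matters where the id has to round-trip exactly, for example when writing the submission file.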
training_set.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Data columns (total 24 columns):
id 99999 non-null float64
click 99999 non-null int64
hour 99999 non-null int64
C1 99999 non-null int64
banner_pos 99999 non-null int64
site_id 99999 non-null object
site_domain 99999 non-null object
site_category 99999 non-null object
app_id 99999 non-null object
app_domain 99999 non-null object
app_category 99999 non-null object
device_id 99999 non-null object
device_ip 99999 non-null object
device_model 99999 non-null object
device_type 99999 non-null int64
device_conn_type 99999 non-null int64
C14 99999 non-null int64
C15 99999 non-null int64
C16 99999 non-null int64
C17 99999 non-null int64
C18 99999 non-null int64
C19 99999 non-null int64
C20 99999 non-null int64
C21 99999 non-null int64
dtypes: float64(1), int64(14), object(9)
memory usage: 18.3+ MB
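- Because the data has already been preprocessed, it is complete and has no missing values, which saves us a lot of work.
- Many fields are categorical and need to be encoded.
- The numeric columns are all int64 (apart from id, which was read as float64), but we still need to check whether their ranges are comparable; otherwise they need to be normalized.
- Next, look at the distribution of the numeric columns.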
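As one possible way to encode the categorical fields, the sketch below uses scikit-learn's `OneHotEncoder`. It is an illustration on made-up rows, not the encoding used in the original notebook; only the column names come from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy rows; the values are made up, only the column names come from the dataset.
train = pd.DataFrame({
    "site_category": ["28905ebd", "f028772b", "28905ebd"],
    "device_type": [1, 1, 0],
})
test = pd.DataFrame({"site_category": ["50e219e0"], "device_type": [1]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train = encoder.fit_transform(train)  # SciPy sparse matrix
X_test = encoder.transform(test)

print(X_train.shape)
print(X_test.toarray())
```

For very high-cardinality fields such as device_id or device_ip, a one-hot matrix becomes very wide, so feature hashing may be a more practical alternative.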
# Summary statistics for the training set
training_set.describe()
statistic | id | click | hour | C1 | banner_pos | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 9.999900e+04 | 99999.000000 | 99999.0 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 |
mean | 9.500834e+18 | 0.174902 | 14102100.0 | 1005.034440 | 0.198302 | 1.055741 | 0.199272 | 17682.106071 | 318.333943 | 56.818988 | 1964.029090 | 0.789328 | 131.735447 | 37874.606366 | 88.555386 |
std | 5.669435e+18 | 0.379885 | 0.0 | 1.088705 | 0.402641 | 0.583986 | 0.635271 | 3237.726956 | 11.931998 | 36.924283 | 394.961129 | 1.223747 | 244.077816 | 48546.369299 | 45.482979 |
min | 3.237563e+13 | 0.000000 | 14102100.0 | 1001.000000 | 0.000000 | 0.000000 | 0.000000 | 375.000000 | 120.000000 | 20.000000 | 112.000000 | 0.000000 | 33.000000 | -1.000000 | 13.000000 |
25% | 4.183306e+18 | 0.000000 | 14102100.0 | 1005.000000 | 0.000000 | 1.000000 | 0.000000 | 15704.000000 | 320.000000 | 50.000000 | 1722.000000 | 0.000000 | 35.000000 | -1.000000 | 61.000000 |
50% | 1.074496e+19 | 0.000000 | 14102100.0 | 1005.000000 | 0.000000 | 1.000000 | 0.000000 | 17654.000000 | 320.000000 | 50.000000 | 1993.000000 | 0.000000 | 35.000000 | -1.000000 | 79.000000 |
75% | 1.457544e+19 | 0.000000 | 14102100.0 | 1005.000000 | 0.000000 | 1.000000 | 0.000000 | 20362.000000 | 320.000000 | 50.000000 | 2306.000000 | 2.000000 | 39.000000 | 100083.000000 | 156.000000 |
max | 1.844670e+19 | 1.000000 | 14102100.0 | 1010.000000 | 5.000000 | 5.000000 | 5.000000 | 21705.000000 | 728.000000 | 480.000000 | 2497.000000 | 3.000000 | 1835.000000 | 100248.000000 | 157.000000 |
- The numeric columns differ widely in range, so they need to be normalized later.
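A minimal sketch of that normalization step, assuming standardization (zero mean, unit variance) with scikit-learn's `StandardScaler`. The numbers are made up and only loosely resemble C14 and C18. Note that fields documented as anonymized categorical variables may be better served by categorical encoding than by scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales, loosely modeled on C14 and C18.
X = np.array([[15706.0, 0.0], [18993.0, 3.0], [21689.0, 2.0], [17654.0, 0.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # learn mean/std from the training rows only

# Reuse the fitted scaler on new rows rather than refitting.
X_new_scaled = scaler.transform(np.array([[20362.0, 1.0]]))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```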
# id: ad identifier
# click: 0/1 for non-click/click
# hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
# C1 -- anonymized categorical variable
# banner_pos
# site_id
# site_domain
# site_category
# app_id
# app_domain
# app_category
# device_id
# device_ip
# device_model
# device_type
# device_conn_type
# C14-C21 -- anonymized categorical variables
# sklearn.cross_validation and sklearn.externals.joblib have been removed from
# scikit-learn; use sklearn.model_selection and the standalone joblib package.
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from utils import load_df  # load_df comes from a local utils module that is not shown here
# Evaluation metrics
def print_metrics(true_values, predicted_values):
    print("Accuracy: ", metrics.accuracy_score(true_values, predicted_values))
    print("AUC: ", metrics.roc_auc_score(true_values, predicted_values))
    print("Confusion Matrix: ", metrics.confusion_matrix(true_values, predicted_values))
    print(metrics.classification_report(true_values, predicted_values))

# Fit a classifier
def classify(classifier_class, train_input, train_targets):
    classifier_object = classifier_class()
    classifier_object.fit(train_input, train_targets)
    return classifier_object

# Save the model to disk
def save_model(clf):
    joblib.dump(clf, 'classifier.pkl')
train_data = load_df('train_small.csv').values
train_data.shape  # still 99999 rows
(99999L, 14L)
train_data[:,:]
array([[       0, 14102100,     1005, ...,       35,       -1,       79],
       [       0, 14102100,     1005, ...,       35,   100084,       79],
       [       0, 14102100,     1005, ...,       35,   100084,       79],
       ...,
       [       0, 14102100,     1005, ...,       35,       -1,       79],
       [       1, 14102100,     1005, ...,       35,       -1,       79],
       [       0, 14102100,     1005, ...,       35,       -1,       79]],
      dtype=int64)
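The `load_df` helper is imported from a `utils` module that is not shown. Judging only from the output above (14 integer columns, with click in the first position), it plausibly drops the nine string columns and, for training data, the id. The function below is a hypothetical stand-in written under that assumption, not the author's implementation.

```python
import io
import pandas as pd

def load_df(path_or_buffer, training=True):
    """Hypothetical stand-in for utils.load_df (the original is not shown)."""
    df = pd.read_csv(path_or_buffer)
    # Keep only numeric columns, i.e. drop the hashed string fields.
    df = df.select_dtypes(include="number")
    if training:
        # Training: drop id so that click becomes the first column.
        df = df.drop(columns=["id"])
    return df

# Made-up rows with a reduced set of columns, read from an in-memory buffer.
csv_text = (
    "id,click,hour,C1,site_id,device_type\n"
    "1,0,14102100,1005,1fbe01fe,1\n"
    "2,1,14102100,1002,84c7ba46,0\n"
)
train_data = load_df(io.StringIO(csv_text)).values
print(train_data.shape)
```

With `training=False` this sketch keeps id as the first column, which matches how the test data is sliced when the submission is built.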
Train a baseline first. For a baseline, the natural choice is logistic regression (LR), the baseline model widely accepted in industry.
# Train and save the model
X_train, X_test, y_train, y_test = train_test_split(
    train_data[:, 1:], train_data[:, 0], test_size=0.3, random_state=0)
classifier = classify(LogisticRegression, X_train, y_train)  # use the LR model
predictions = classifier.predict(X_test)
print_metrics(y_test, predictions)  # evaluate the classifier with several metrics
save_model(classifier)  # save the model
Accuracy: 0.8233
AUC: 0.5
Confusion Matrix: [[24699 0]
 [ 5301 0]]
             precision    recall  f1-score   support
          0       0.82      1.00      0.90     24699
          1       0.00      0.00      0.00      5301
avg / total       0.68      0.82      0.74     30000
E:\Anaconda2\soft\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
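The baseline results lead to the following conclusions:
- The model predicts no-click for every sample and still reaches 82.33% accuracy, so this score clearly does not reflect a useful model.
- The confusion matrix shows that every actual click was predicted as no-click. The likely cause is class imbalance: ads are rarely clicked, so most labels are no-click, which biases the model toward predicting no-click.
- The experiment shows that accuracy can be a very misleading measure of how well the model is doing.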
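The AUC of 0.5 has a simple explanation: `print_metrics` passes the hard 0/1 labels from `predict` to `roc_auc_score`, and a constant prediction has no ranking information. The sketch below scores predicted probabilities instead and tries `class_weight="balanced"` as one possible response to the imbalance. It runs on synthetic data from `make_classification`, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the click data: roughly 83% negatives.
X, y = make_classification(n_samples=2000, weights=[0.83], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# A constant prediction carries no ranking information, so its AUC is 0.5.
constant_auc = roc_auc_score(y_test, np.zeros(len(y_test)))

# class_weight="balanced" reweights samples inversely to class frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Score with click probabilities rather than hard 0/1 labels.
proba = clf.predict_proba(X_test)[:, 1]
print("constant AUC:", constant_auc)
print("AUC:", roc_auc_score(y_test, proba))
print("log loss:", log_loss(y_test, proba))
```

Reweighting shifts the predicted probabilities toward the minority class, so when well-calibrated probabilities are the goal (for example under a log-loss metric), the unweighted model may be the better choice.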
# About 82.5% of the samples are no-click, which matches our analysis: the classes are heavily imbalanced.
training_set[training_set["click"] == 0].shape[0] * 1.0 / training_set.shape[0]
0.8250982509825098
import numpy as np
from pandas import DataFrame

# Write the predictions in the required submission format
def create_submission(ids, predictions, filename='submission.csv'):
    submissions = np.concatenate(
        (ids.reshape(len(ids), 1), predictions.reshape(len(predictions), 1)), axis=1)
    df = DataFrame(submissions)
    df.to_csv(filename, header=['id', 'click'], index=False)

classifier = joblib.load('classifier.pkl')
test_data_df = load_df('test.csv', training=False)
ids = test_data_df.values[:, 0]
predictions = classifier.predict(test_data_df.values[:, 1:])
create_submission(ids, predictions)
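The code above submits hard 0/1 labels, while the stated goal is a click probability and the sample submission is an all-0.5 benchmark. A possible variant writes probabilities instead. `create_proba_submission` is a hypothetical helper, and the ids and probabilities below are made up; with the real model the probabilities would presumably come from `classifier.predict_proba(features)[:, 1]`.

```python
import io
import numpy as np
import pandas as pd

def create_proba_submission(ids, click_proba, target):
    """Hypothetical helper: write id and predicted click probability."""
    df = pd.DataFrame({
        "id": np.asarray(ids, dtype=str),          # keep ids as exact strings
        "click": np.asarray(click_proba, float),   # probability of a click
    })
    df.to_csv(target, index=False)
    return df

# Made-up ids and probabilities, written to an in-memory buffer for illustration.
buffer = io.StringIO()
create_proba_submission(
    ["10000174058809263569", "10000182526920855428"], [0.17, 0.5], buffer)
print(buffer.getvalue())
```

Passing a file path instead of the buffer writes the same content to disk.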