Kaggle Competition in Practice, Part 3

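Continuing from the previous post, this one does two things:

1. Join the tables processed earlier into one large table.

2. Derive features by combining each categorical feature with each numeric feature, pair by pair.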

# In[89]:

# Start concatenating the tables
transaction = pd.concat([new_transaction, history_transaction], axis=0, ignore_index=True)  # ignore_index=True builds a fresh index

# In[91]:

transaction['purchase_month'] = transaction['purchase_date'].apply(lambda x:'-'.join(x.split(' ')[0].split('-')[:2]))  # extract the year-month first; the hour is extracted in the next cell

# In[92]:

transaction['purchase_hour_section'] = transaction['purchase_date'].apply(lambda x: x.split(' ')[1].split(':')[0]).astype(int)
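
The two string-splitting steps above can also be written with pandas' datetime accessor. The sketch below compares both approaches on a made-up two-row frame. The sample dates and the names `demo`, `month_dt` and `hour_dt` are illustrative only, and it assumes `purchase_date` strings look like `YYYY-MM-DD HH:MM:SS`.

```python
import pandas as pd

# Hypothetical sample; the real purchase_date column comes from the transaction tables
demo = pd.DataFrame({'purchase_date': ['2017-06-25 15:33:07', '2018-01-02 08:05:59']})

# String-splitting approach, as in the cells above
demo['purchase_month'] = demo['purchase_date'].apply(
    lambda x: '-'.join(x.split(' ')[0].split('-')[:2]))
demo['purchase_hour_section'] = demo['purchase_date'].apply(
    lambda x: x.split(' ')[1].split(':')[0]).astype(int)

# Equivalent vectorized approach: parse once, then read fields via .dt
parsed = pd.to_datetime(demo['purchase_date'])
demo['month_dt'] = parsed.dt.strftime('%Y-%m')
demo['hour_dt'] = parsed.dt.hour

print(demo)
```

On a table with millions of rows, parsing once and using `.dt` is usually faster than a Python-level `apply`, though that depends on the data.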

# In[95]:

transaction['purchase_month'] = change_object_cols(transaction['purchase_month'].fillna(-1).astype(str))

# In[96]:

cols = ['merchant_id', 'most_recent_sales_range', 'most_recent_purchases_range', 'category_4']

# In[98]:

# Merge in the merchant columns
transaction=pd.merge(transaction,merchant[cols],how='left',on='merchant_id')
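
A left merge keeps every transaction row, but it has two side effects worth checking: merchants missing from the right table produce NaN, and duplicate `merchant_id` values in the right table would multiply rows. The sketch below shows both on toy frames. `trans_demo` and `merch_demo` are made-up names and data, and `validate='many_to_one'` is an optional guard that the cell above does not use.

```python
import pandas as pd

# Hypothetical toy tables standing in for transaction and merchant
trans_demo = pd.DataFrame({'card_id': ['c1', 'c2', 'c3'],
                           'merchant_id': ['m1', 'm2', 'm9']})
merch_demo = pd.DataFrame({'merchant_id': ['m1', 'm2'],
                           'category_4': [0, 1]})

# validate='many_to_one' raises MergeError if merchant_id is not unique on the right
merged = pd.merge(trans_demo, merch_demo, how='left', on='merchant_id',
                  validate='many_to_one')

# m9 has no match, so its category_4 is NaN and the column becomes float
print(merged)
```

These NaN values are the reason the merged columns are filled with -1 a few cells later.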

# In[99]:

numeric_cols = ['purchase_amount', 'installments']

# In[100]:

category_cols = ['authorized_flag', 'city_id', 'category_1','category_3',
                 'merchant_category_id','month_lag','most_recent_sales_range',
                 'most_recent_purchases_range', 'category_4',
                 'purchase_month', 'purchase_hour_section', 'purchase_day']

# In[101]:

id_cols = ['card_id', 'merchant_id']

# In[102]:

# Handle missing values once more on the merged table (fill NaN with -1)
transaction[cols[1:]] = transaction[cols[1:]].fillna(-1).astype(int)

# In[103]:

transaction[category_cols] =transaction[category_cols].fillna(-1).astype(str)
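
One detail about the order of these two conversions: a column that contains NaN is stored as float, so converting it straight to string keeps the `.0` suffix. Casting to int first, as done for the merchant columns above, avoids that. A small sketch on a made-up Series (`s`, `direct` and `via_int` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical float column with a missing value
s = pd.Series([1.0, np.nan, 3.0])

# Straight to string: the float formatting survives
direct = s.fillna(-1).astype(str).tolist()

# Through int first: clean integer labels
via_int = s.fillna(-1).astype(int).astype(str).tolist()

print(direct)   # ['1.0', '-1.0', '3.0']
print(via_int)  # ['1', '-1', '3']
```

This matters later because the category value becomes part of each derived feature name.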

# In[104]:

# Export to CSV
transaction.to_csv("d:/transaction_d_pre.csv",index=False)

# In[105]:

del transaction

# In[106]:

gc.collect()

# In[107]:

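# Start feature engineering: combine features pairwise so that each card_id ends up with a single record.
# Concretely, for each card_id, take the sum of feature C over the rows where feature A equals 1, and so on.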
from datetime import datetime

# In[108]:

# Try the idea on a small toy dataset first
d1={'card_id':[1,2,1,3],'A':[1, 2, 1, 2],
    'B':[2, 1, 2, 2], 'C':[4, 5, 1, 5], 'D':[7, 5, 4, 8]}

# In[110]:

t1=pd.DataFrame(d1)

# In[111]:

numeric_cols = ['C', 'D']
category_cols = ['A', 'B']

# In[112]:

# In[113]:

# Create a dict keyed by card_id, each value an empty dict
features={}
card_all=t1['card_id'].values.tolist()  # collect all card_id values
for card in card_all:
    features[card]={}

# In[114]:

features

# In[115]:

columns=t1.columns.tolist()  # list of all column names

# In[116]:

columns

# In[129]:

idx = columns.index('card_id')
idx

# In[122]:

# Positions of the categorical columns
category_cols_index=[columns.index(col)for col in category_cols]

# In[123]:

numeric_cols_index=[columns.index(col)for col in numeric_cols]

# In[130]:

# Combine the categorical and numeric fields pairwise
for i in range(t1.shape[0]):
    va=t1.loc[i].values  # values of the current row
    card=va[idx]  # card_id of the current row
    for cate_ind in category_cols_index:
        for num_ind in numeric_cols_index:
            # Feature name: <category column>&<category value>&<numeric column>
            col_name = '&'.join([columns[cate_ind], str(va[cate_ind]), columns[num_ind]])
            # Accumulate the numeric value under that feature name for this card
            features[card][col_name] = features[card].get(col_name, 0) + va[num_ind]

# In[131]:

features

# In[135]:

# Convert to a DataFrame
df = pd.DataFrame(features).T.reset_index()  # move the card_id index into a column

# In[137]:

cols = df.columns.tolist()

# In[139]:

df.columns = ['card_id'] + cols[1:]  # these two statements rename the first column (the old index) to card_id

The final output is the set of pairwise feature combinations and their values, one row per card_id, as shown in the figure:
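
For reference, the same table can be rebuilt on the toy data with `groupby` and `unstack` instead of a row-by-row loop. This is a sketch of an alternative and not the method used above. `parts`, `wide` and `df_vec` are made-up names, and the toy frame is repeated so the block runs on its own.

```python
import pandas as pd

# Same toy data as in the cells above
t1 = pd.DataFrame({'card_id': [1, 2, 1, 3], 'A': [1, 2, 1, 2],
                   'B': [2, 1, 2, 2], 'C': [4, 5, 1, 5], 'D': [7, 5, 4, 8]})
numeric_cols = ['C', 'D']
category_cols = ['A', 'B']

parts = []
for cate in category_cols:
    for num in numeric_cols:
        # Sum `num` per (card_id, category value), then spread the category values into columns
        wide = t1.groupby(['card_id', cate])[num].sum().unstack(cate)
        # Same naming scheme as the loop: <category column>&<category value>&<numeric column>
        wide.columns = ['&'.join([cate, str(v), num]) for v in wide.columns]
        parts.append(wide)

# One row per card_id; combinations a card never had stay NaN
df_vec = pd.concat(parts, axis=1).rename_axis('card_id').reset_index()
print(df_vec)
```

For card 1 this gives `A&1&C` = 5 and `A&1&D` = 11 (the sums over its two rows), while `A&2&C` is NaN because card 1 never has A equal to 2. On the full transaction table the grouped version avoids iterating over rows in Python, at the cost of building one intermediate frame per feature pair.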
