Feature Preprocessing on Kaggle

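Having just started out in data science, I wanted to try Kaggle on my own. After working through the beginner Titanic and House Prices competitions, I feel I can write a basic baseline, but getting the details right, and reaching a score worth showing, still takes some theoretical analysis.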

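This post introduces the principles and techniques used in Kaggle competitions. Everything here comes from Coursera; since the course is in English and fairly easy to follow, the notes below are written directly in English.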

  • Reference
    How to Win a Data Science Competition: Learn from Top Kagglers

Features: numeric, categorical, ordinal, datetime, coordinate, text

Numeric features

For preprocessing purposes, models fall into two groups: tree-based models and non-tree-based models.


Scaling

For example, suppose we apply the KNN algorithm to a few instances and calculate the distance between each instance and the object we want to classify. If one feature has a much larger scale than the others, that dimension dominates the distance, and the nearest neighbour is decided almost entirely by that feature.

Tree-based models don't depend on scaling.

Non-tree-based models depend heavily on scaling.

How to do it

sklearn:

  1. To [0,1]
    sklearn.preprocessing.MinMaxScaler
    X = (X - X.min()) / (X.max() - X.min())
  2. To mean=0, std=1
    sklearn.preprocessing.StandardScaler
    X = (X - X.mean()) / X.std()

    • If you want to use KNN, you can go one step further: the larger a feature's scale is, the more important it will be for KNN. So we can tune the scaling parameter to boost the features that seem more important to us and see if this helps.
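
As a minimal sketch of the scaling options above, the snippet below uses made-up data (the two columns, an age and an income, are illustrative assumptions and not from the course). It checks that the manual formulas agree with the scikit-learn scalers. One caveat: `StandardScaler` divides by the population standard deviation (`ddof=0`, the NumPy default), whereas pandas' `std()` defaults to `ddof=1`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (hypothetical): column 0 is an age, column 1 is an income.
X = np.array([[25.0, 30000.0],
              [30.0, 90000.0],
              [60.0, 31000.0]])

# Unscaled Euclidean distances from row 0 to rows 1 and 2 are driven
# almost entirely by the income column, the large-scale feature.
dist_raw = np.linalg.norm(X[1:] - X[0], axis=0 + 1)

# To [0, 1]: equivalent to (X - X.min(0)) / (X.max(0) - X.min(0)).
X_minmax = MinMaxScaler().fit_transform(X)
X_minmax_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# To mean=0, std=1: equivalent to (X - X.mean(0)) / X.std(0) with ddof=0.
X_std = StandardScaler().fit_transform(X)

# After scaling, both features contribute comparably to the distance.
dist_scaled = np.linalg.norm(X_minmax[1:] - X_minmax[0], axis=1)

# Boosting one feature for KNN: multiply its scaled column by a weight.
# The weight 2.0 is an arbitrary value that would be tuned on validation data.
X_weighted = X_minmax * np.array([2.0, 1.0])
```

In a real pipeline the scaler would be fitted on the training data only and then applied to the test data with `transform`.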

Outliers

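Outliers can make a model deviate from the trend of the main bulk of the data, like the red line in the original figure (not reproduced here).

We can clip feature values between two chosen values, a lower bound and an upper bound, for example the 1st and 99th percentiles of the feature.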
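
Clipping can be sketched with NumPy as follows. The data is synthetic, and the choice of the 1st and 99th percentiles as bounds is an assumption made for illustration; any lower and upper bound can be used.

```python
import numpy as np

# Synthetic feature: 1000 well-behaved values plus two extreme outliers.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, size=1000), [50.0, -80.0]])

# Choose the bounds as percentiles of the feature (assumed: 1st and 99th).
lower, upper = np.percentile(x, [1, 99])

# Values below `lower` become `lower`, values above `upper` become `upper`.
x_clipped = np.clip(x, lower, upper)
```

When there is a separate test set, the bounds would normally be computed on the training data and reused for the test data.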

  • Rank Transformation

Rank transformation makes the spaces between sorted values equal. If we have outliers, it often behaves better than min-max scaling, because it moves the outliers closer to the other objects.

Linear models, KNN, and neural networks can benefit from this method.

rank([-100, 0, 1e5]) == [0,1,2]  
rank([1000,1,10]) == [2,0,1]

scipy:

scipy.stats.rankdata
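
A short sketch of the rank transform with `scipy.stats.rankdata`. Note that `rankdata` returns ranks starting at 1, so 1 is subtracted here to match the zero-based examples above; with the default `method="average"`, tied values share the mean of their ranks.

```python
import numpy as np
from scipy.stats import rankdata

# rankdata is 1-based: the smallest value gets rank 1.
ranks_a = rankdata([-100, 0, 1e5])   # 1-based ranks
ranks_b = rankdata([1000, 1, 10])

# Subtract 1 to obtain the zero-based ranks used in the text.
ranks_a0 = ranks_a - 1
ranks_b0 = ranks_b - 1

# Ties receive the average of the ranks they span (default method).
ranks_ties = rankdata([10, 20, 20, 30])
```

To rank test data consistently, one option is to rank the concatenated train and test values, or to store the value-to-rank mapping learned on the training data.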

  • Other methods

    1. Log transform: np.log(1 + x)
    2. Raising to the power < 1: np.sqrt(x + 2/3)
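
Both transforms can be sketched on a toy array (the values are made up). They compress large values towards the bulk of the data and assume non-negative input; `np.log1p(x)` computes `np.log(1 + x)`.

```python
import numpy as np

# Toy non-negative feature spanning several orders of magnitude.
x = np.array([0.0, 9.0, 99.0, 9999.0])

# Log transform: log(1 + x), defined at x = 0.
x_log = np.log1p(x)

# Raising to a power < 1: here the square root, as in the text.
x_sqrt = np.sqrt(x + 2 / 3)
```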

Feature Generation

Depends on

a. Prior knowledge
b. Exploratory data analysis


Ordinal features

Examples:

  • Ticket class: 1,2,3
  • Driver’s license: A, B, C, D
  • Education: kindergarten, school, undergraduate, bachelor, master, doctoral

Processing

1.Label Encoding

  • Alphabetical (sorted)
    [S,C,Q] -> [2, 0, 1] (sorted order is C, Q, S; codes start at 0)

    sklearn.preprocessing.LabelEncoder

  • Order of appearance
    [S,C,Q] -> [0, 1, 2]

    pandas.factorize

Both variants work fine with tree-based models, because trees can split the feature and extract most of the useful values in categories on their own. Non-tree-based models, on the other hand, usually can't use this feature effectively.
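
The two label-encoding variants can be compared on a small example. The `embarked` series below is made-up data in the style of the Titanic `Embarked` column; both libraries produce codes starting at 0.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column (hypothetical values).
embarked = pd.Series(["S", "C", "Q", "S", "C"])

# Alphabetical (sorted): C -> 0, Q -> 1, S -> 2.
encoder = LabelEncoder()
codes_sorted = encoder.fit_transform(embarked)

# Order of appearance: S -> 0, C -> 1, Q -> 2.
codes_seen, uniques = pd.factorize(embarked)
```

`encoder.classes_` and `uniques` hold the category for each code, which is what is needed to apply the same mapping to new data.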

2.Frequency Encoding
[S,C,Q] -> [0.5, 0.3, 0.2]

# titanic is a pandas DataFrame with an 'Embarked' column
encoding = titanic.groupby('Embarked').size()
encoding = encoding / len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)

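If several categories have the same frequency, they become indistinguishable after this encoding; a rank transform can help with such ties:

from scipy.stats import rankdata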

Frequency encoding can also help linear models: if the frequency of a category is correlated with the target value, a linear model will utilize this dependency.
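
A self-contained version of frequency encoding, using a made-up ten-row frame so that the frequencies come out as 0.5, 0.3 and 0.2. `value_counts(normalize=True)` is used here as a shorthand for dividing the group sizes by the number of rows.

```python
import pandas as pd

# Toy stand-in for the Titanic data: 5 x S, 3 x C, 2 x Q.
titanic = pd.DataFrame({"Embarked": list("SSSSSCCCQQ")})

# Share of rows per category.
encoding = titanic["Embarked"].value_counts(normalize=True)

# Replace each category by its frequency.
titanic["enc"] = titanic["Embarked"].map(encoding)
```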

3.One-hot Encoding

pandas.get_dummies

It gives each category of a feature its own new column and is often used for non-tree-based models.
It can slow down tree-based models, and with many categories most of the new columns are zeros, so we store the result as a sparse matrix. Most popular libraries can work with sparse matrices directly, for example XGBoost and LightGBM.
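
A minimal sketch of one-hot encoding in dense and sparse form, on made-up data. `pandas.get_dummies` returns a dense frame, while scikit-learn's `OneHotEncoder` returns a SciPy sparse result by default; `handle_unknown="ignore"` is an optional choice made here so that unseen categories encode as all zeros.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (hypothetical values).
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Dense one-hot columns: Embarked_C, Embarked_Q, Embarked_S.
dense = pd.get_dummies(df["Embarked"], prefix="Embarked")

# Sparse one-hot encoding: only the non-zero entries are stored.
encoder = OneHotEncoder(handle_unknown="ignore")
onehot_sparse = encoder.fit_transform(df[["Embarked"]])
```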

Feature generation

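Interactions of categorical features can help linear models and KNN.

One simple way to build an interaction is to concatenate the string values of two categorical features and then one-hot encode the result.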
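
Building an interaction by string concatenation can be sketched like this. The column names `pclass` and `sex` and the values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data with two categorical features (hypothetical).
df = pd.DataFrame({
    "pclass": [3, 1, 2, 3],
    "sex": ["male", "female", "female", "male"],
})

# Concatenate the two features into one combined category.
df["pclass_sex"] = df["pclass"].astype(str) + "_" + df["sex"]

# One-hot encode the combined feature: one column per observed pair.
interaction = pd.get_dummies(df["pclass_sex"])
```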


Datetime and Coordinates

Date and time

1.Periodicity

Position within a repeating cycle, such as the day of the week or the hour of the day.

2.Time since

a. Row-independent moment  
For example: since 00:00:00 UTC, 1 January 1970.

b. Row-dependent important moment  
Number of days left until the next holiday, or time passed since the last holiday.

3.Difference between dates

If a row has the dates of two events, we can add a date_diff feature which indicates the number of days between these events.
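
The three kinds of datetime features can be sketched with pandas. The column names `signup` and `last_purchase` and the dates are made up for illustration.

```python
import pandas as pd

# Toy data: two event dates per row (hypothetical columns).
df = pd.DataFrame({
    "signup": pd.to_datetime(["2017-01-01", "2017-03-15"]),
    "last_purchase": pd.to_datetime(["2017-01-10", "2017-04-01"]),
})

# 1. Periodicity: position within a cycle (Monday=0 ... Sunday=6).
df["signup_dow"] = df["signup"].dt.dayofweek
df["signup_month"] = df["signup"].dt.month

# 2. Time since a row-independent moment: days since 1 January 1970.
df["days_since_epoch"] = (df["signup"] - pd.Timestamp("1970-01-01")).dt.days

# 3. Difference between dates: days between the two events.
df["date_diff"] = (df["last_purchase"] - df["signup"]).dt.days
```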

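Coordinates

1.Interesting places from train/test data or additional data

Generate the distance between the instance and an interesting place, such as a flat or an old building (anything that is meaningful for the task).

2.Aggregated statistics

For example, the price of the surrounding buildings.

3.Rotation

Sometimes rotating the coordinates helps the model classify the instances more precisely. Tree-based models split on one coordinate at a time, so a boundary that is diagonal in the original axes can become a single split after rotation.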
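
A sketch of two coordinate features, distance to a point of interest and rotated coordinates. The points, the point of interest at the origin, and the 45-degree angle are all assumptions; the coordinates are treated as planar, so real latitude/longitude data would need a geodesic distance instead.

```python
import numpy as np

# Toy planar coordinates, one row per instance (hypothetical).
xy = np.array([[1.0, 1.0],
               [2.0, 0.0],
               [0.0, 3.0]])

# Distance from each instance to a point of interest (hypothetical location).
poi = np.array([0.0, 0.0])
dist_to_poi = np.linalg.norm(xy - poi, axis=1)

# Rotate all points counter-clockwise by 45 degrees around the origin.
theta = np.deg2rad(45.0)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
xy_rot = xy @ rotation.T
```

The rotated columns can be added next to the original ones, so the model can choose whichever axes give the cleaner split.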


Missing data

Hidden NaN, numeric

Drawing a histogram of a feature can reveal hidden missing values. In the original figure (not reproduced here), the histogram shows an isolated peak at -1.

It is obvious that -1 is a hidden NaN: it has no meaning for this feature.

Fillna approaches

1.-999, -1, etc. (outside the feature range)

This is useful because it gives trees the possibility to put missing values into a separate category. The downside is that the performance of linear models and neural networks can suffer.

2.mean, median

The second method is usually beneficial for simple linear models and neural networks. But for trees it can then be harder to select the objects which had missing values in the first place.
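
The first two fill strategies can be sketched with pandas on a toy series in which -1 plays the role of a hidden NaN. The values and the sentinel -999 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy feature where -1 is a hidden placeholder for "missing".
raw = pd.Series([1.0, 2.0, -1.0, 4.0, -1.0, 100.0])

# Make the hidden NaN explicit first.
s = raw.replace(-1.0, np.nan)

# 1. Fill with a value outside the feature range (friendly to trees).
filled_sentinel = s.fillna(-999)

# 2. Fill with the median (friendlier to linear models and neural networks).
#    pandas skips NaN when computing the median.
filled_median = s.fillna(s.median())
```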

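3.Reconstruct:

  • Isnull
    Add a binary isnull feature that marks the rows where the value was missing.

  • Prediction

  • Replace the missing data with the mean or median grouped by another feature.
    But this can go wrong if the missing values were already filled with a placeholder such as -999 before the statistic was computed: the placeholder then distorts the mean of every category that contains it.

The way to handle this is to ignore missing values while calculating means for each category.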
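
The isnull indicator and the group-wise fill, including the pitfall, can be sketched as follows. The frame, its column names, and the -999 placeholder are made up for illustration; pandas' `groupby(...).mean()` skips NaN by default, which is exactly the "ignore missing values" behaviour.

```python
import numpy as np
import pandas as pd

# Toy data: a numeric feature with gaps and a categorical feature.
df = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B"],
    "value": [1.0, 3.0, np.nan, 10.0, np.nan],
})

# Isnull: remember which rows were missing before filling anything.
df["value_isnull"] = df["value"].isnull()

# Pitfall: filling with -999 first drags every group mean far away.
bad_means = df.assign(value=df["value"].fillna(-999)).groupby("category")["value"].mean()

# Correct: compute group means on the raw column, NaN rows are ignored.
good_means = df.groupby("category")["value"].mean()

# Fill each missing value with the mean of its own category.
df["value_filled"] = df["value"].fillna(df["category"].map(good_means))
```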

  • Treating values which are not present in the train data

Just generate a new feature indicating the number of occurrences of the value in the data (frequency).

  • XGBoost can handle NaN
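
The occurrence-count feature for categories that appear only in the test data can be sketched like this. It assumes that "the data" means train and test concatenated, so that a category unseen in train still gets a count; the frames and the column name `cat` are made up.

```python
import pandas as pd

# Toy train and test sets; category "C" never appears in train.
train = pd.DataFrame({"cat": ["A", "A", "B", "A"]})
test = pd.DataFrame({"cat": ["B", "C", "C"]})

# Count occurrences over train and test together (assumption, see above).
counts = pd.concat([train["cat"], test["cat"]]).value_counts()

# Map the counts back onto both sets as a new numeric feature.
train["cat_count"] = train["cat"].map(counts)
test["cat_count"] = test["cat"].map(counts)
```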

4.Remove rows with missing values

This one is possible, but it can lead to loss of important samples and a quality decrease.


Text

Bag of words

Text preprocessing

1.Lowercase

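2.Lemmatization and Stemming

Stemming chops off word endings with heuristic rules, while lemmatization uses a vocabulary and morphological analysis to return the dictionary form of a word.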

3.Stopwords

Examples:
1.Articles or prepositions
2.Very common words

sklearn.feature_extraction.text.CountVectorizer:
max_df

  • max_df : float in range [0.0, 1.0] or int, default=1.0
    When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

CountVectorizer

The number of times a term occurs in a given document

sklearn.feature_extraction.text.CountVectorizer
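
A small bag-of-words sketch with `CountVectorizer`, combining lowercasing, stop words and `max_df`. The three documents and the explicit stop-word list are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical).
docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs",
]

# Lowercase the text and drop an explicit list of stop words.
vectorizer = CountVectorizer(lowercase=True, stop_words=["the", "on", "and"])
X = vectorizer.fit_transform(docs)          # sparse document-term matrix
vocab = sorted(vectorizer.vocabulary_)      # learned terms, alphabetical

# max_df=0.5 also drops terms found in more than half of the documents;
# "sat" occurs in 2 of 3 documents, so it is removed.
vectorizer_max_df = CountVectorizer(stop_words=["the", "on", "and"], max_df=0.5)
vectorizer_max_df.fit(docs)
vocab_max_df = sorted(vectorizer_max_df.vocabulary_)
```

Each row of `X` holds the number of times each vocabulary term occurs in the corresponding document.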

TF-IDF

TF-IDF re-weights the count features into floating-point values that are more suitable for use by a classifier.

  • Term frequency
    tf = 1 / x.sum(axis=1)[:, None]
    x = x * tf

  • Inverse Document Frequency
    idf = np.log(x.shape[0] / (x > 0).sum(0))
    x = x * idf
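
The two formulas above can be run on a small dense count matrix. The counts are made up, and the sketch assumes every document contains at least one term and every term occurs in at least one document, so no division by zero occurs. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalisation by default, so its numbers will differ from this plain version.

```python
import numpy as np

# Toy count matrix: rows are documents, columns are terms.
x = np.array([[2.0, 1.0, 1.0],
              [0.0, 1.0, 3.0],
              [0.0, 2.0, 0.0]])

# Term frequency: normalise each row so that it sums to 1.
tf = 1 / x.sum(axis=1)[:, None]
x_tf = x * tf

# Inverse document frequency: log(number of documents / documents containing the term).
idf = np.log(x.shape[0] / (x > 0).sum(0))

# Down-weight terms that occur in many documents.
x_tfidf = x_tf * idf
```

A term that occurs in every document (the middle column here) gets an IDF of 0 and is effectively removed.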

N-gram


sklearn.feature_extraction.text.CountVectorizer:
ngram_range, analyzer

  • ngram_range : tuple (min_n, max_n)
    The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
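
A short sketch of word and character n-grams with `CountVectorizer`. The example strings are made up; note that the default tokenizer drops one-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word n-grams: unigrams and bigrams (min_n=1, max_n=2).
word_vectorizer = CountVectorizer(ngram_range=(1, 2))
word_vectorizer.fit(["this is a sentence"])
word_ngrams = sorted(word_vectorizer.vocabulary_)

# Character n-grams: analyzer="char" with bigrams only.
char_vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
char_vectorizer.fit(["abc"])
char_ngrams = sorted(char_vectorizer.vocabulary_)
```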

Embeddings(~word2vec)

It converts each word to a vector in some sophisticated space, which usually has several hundred dimensions.

a. Relatively small vectors

b. Values in vector can be interpreted only in some cases

c. Words with similar meanings often have similar embeddings


Reposted from: https://www.cnblogs.com/bjwu/p/8970821.html
