Kaggle competition data

This article was originally written by Shahul ES and posted on the Neptune blog.

In this article, I discuss tips and tricks for improving the performance of a binary classification model on structured (tabular) data. They are drawn from the solutions to some of Kaggle's top tabular data competitions. Without further ado, let's begin.

These are the five competitions I went through to write this article:

Home Credit Default Risk
Santander Customer Transaction Prediction
VSB Power Line Fault Detection
Microsoft Malware Prediction
IEEE-CIS Fraud Detection

Dealing with larger datasets

One issue you might face in any machine learning competition is the size of the dataset. If your data is large (3 GB or more, for Kaggle kernels and basic laptops), you may find it difficult to load and process with limited resources. The following articles and kernels cover techniques that I have found useful in such situations.

Faster data loading with pandas.
Data compression techniques to reduce the size of data by 70%.
Optimize memory by reducing the size of some attributes.
Use open-source libraries such as Dask to read and manipulate the data; Dask performs parallel computing and saves memory.
Use cuDF.
Convert data to Parquet format.
Convert data to Feather format.
Reduce memory usage to make better use of RAM.
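
As a rough illustration of the memory-reduction idea in the list above, the sketch below downcasts numeric columns and converts repetitive string columns to the pandas category dtype. The function name `reduce_memory_usage`, the 50% cardinality threshold, and the toy data are assumptions made for this example; they are not taken from any of the linked kernels.

```python
import numpy as np
import pandas as pd


def reduce_memory_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with smaller dtypes where that is safe.

    Hypothetical helper: integers and floats are downcast with
    pd.to_numeric, and string columns with few distinct values
    become categoricals.
    """
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_float_dtype(series):
            out[col] = pd.to_numeric(series, downcast="float")
        elif pd.api.types.is_string_dtype(series):
            # Assumed rule of thumb: only convert when values repeat a lot.
            if series.nunique() < 0.5 * len(series):
                out[col] = series.astype("category")
    return out


# Toy frame standing in for a large competition table.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "count": rng.integers(0, 100, size=10_000),        # small integers
    "amount": rng.random(10_000) * 1000.0,             # 64-bit floats
    "city": rng.choice(["A", "B", "C"], size=10_000),  # repeated strings
})
small = reduce_memory_usage(toy)
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

How much you save depends entirely on the value ranges in your own data, and downcasting floats loses precision, so it is worth checking that model scores do not change afterwards.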

Data exploration

Data exploration always helps you understand the data better and gain insights from it. Before developing machine learning models, top competitors read and carry out a lot of exploratory data analysis (EDA) on the data. This helps with feature engineering and data cleaning.

EDA for Microsoft malware detection.
Time series EDA for malware detection.
Complete EDA for home credit loan prediction.
Complete EDA for Santander prediction.
EDA for VSB Power Line Fault Detection.

Data preparation

After data exploration, the first thing to do is use those insights to prepare the data, tackling issues such as class imbalance and the encoding of categorical data. Let's look at the methods used to do this.

Methods to tackle class imbalance.
Data augmentation with the Synthetic Minority Oversampling Technique (SMOTE).
Fast in-place shuffle for augmentation.
Finding synthetic samples in the dataset.
Signal denoising used in signal processing competitions.
Finding patterns of missing data.
Methods to handle missing data.
An overview of various encoding techniques for categorical data.
Building a model to predict missing values.
Random shuffling of data to create a new synthetic training set.
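
One plausible reading of the shuffle-based augmentation mentioned in the list above is sketched below: for a chosen class, each feature column is permuted independently among the rows of that class, and the resulting synthetic rows are appended to the training set. The function `shuffle_augment` and its parameters are hypothetical, and the approach only makes sense when features are roughly independent given the class.

```python
import numpy as np


def shuffle_augment(X, y, target_class=1, copies=2, seed=0):
    """Append synthetic rows for one class built by per-column shuffling.

    Hypothetical helper, not the exact code of any competition kernel.
    """
    rng = np.random.default_rng(seed)
    X_cls = X[y == target_class]
    synthetic = []
    for _ in range(copies):
        X_new = X_cls.copy()
        for j in range(X_new.shape[1]):
            # Permute column j on its own; this keeps each feature's
            # marginal distribution but breaks links between features.
            X_new[:, j] = rng.permutation(X_new[:, j])
        synthetic.append(X_new)
    X_aug = np.vstack([X] + synthetic)
    y_new = np.full(copies * len(X_cls), target_class, dtype=y.dtype)
    y_aug = np.concatenate([y, y_new])
    return X_aug, y_aug


# Toy imbalanced data: roughly 10% positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)
X_aug, y_aug = shuffle_augment(X, y, target_class=1, copies=2)
print(y.mean(), y_aug.mean())  # positive rate before and after
```

Apply this only to the training folds, never to validation data, so that the validation score still reflects real rows.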

Feature engineering

Next, you can check the most popular features and feature engineering techniques used in these top Kaggle competitions. Feature engineering varies from problem to problem, depending on the domain.

Target encoding with cross-validation for better encoding.
Entity embedding to handle categories.
Encoding cyclic features for deep learning.
Manual feature engineering methods.
Automated feature engineering techniques using featuretools.
Top handcrafted features used in Microsoft malware detection.
Denoising NN for feature extraction.
Feature engineering using the RAPIDS framework.
Things to remember while processing features using LGBM.
Lag features and moving averages.
Principal component analysis for dimensionality reduction.
LDA for dimensionality reduction.
Best handcrafted LGBM features for Microsoft malware detection.
Generating frequency features.
Dropping variables with different train and test distributions.
Aggregate time series features for the Home Credit competition.
Time series features used in Home Credit Default Risk.
Scale, standardize, and normalize with sklearn.
Handcrafted features for the Home Credit Default Risk competition.
Handcrafted features used in Santander Transaction Prediction.
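
Two of the techniques listed above, cross-validated target encoding and frequency features, can be sketched in a few lines. This is a minimal illustration under assumed names: the helpers `oof_target_encode` and `frequency_encode` and the columns `merchant` and `is_fraud` are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(train, col, target, n_splits=5, seed=0):
    """Out-of-fold mean-target encoding of one categorical column.

    Each row is encoded with category means computed on the other folds,
    so a row's own label never leaks into its encoding.
    """
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    prior = train[target].mean()  # fallback for categories unseen in a fold
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        means = train.iloc[fit_idx].groupby(col)[target].mean()
        values = train.iloc[enc_idx][col].map(means)
        encoded.iloc[enc_idx] = values.to_numpy(dtype=float)
    return encoded.fillna(prior)


def frequency_encode(train, col):
    """Replace each category by its relative frequency in the column."""
    freq = train[col].value_counts(normalize=True)
    return train[col].map(freq).astype(float)


# Toy data with hypothetical column names.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "merchant": rng.choice(list("ABCD"), size=200),
    "is_fraud": rng.integers(0, 2, size=200),
})
df["merchant_te"] = oof_target_encode(df, "merchant", "is_fraud")
df["merchant_freq"] = frequency_encode(df, "merchant")
print(df.head())
```

For the test set, a common choice is to encode with category means computed on the full training set; smoothing toward the prior is another frequent refinement that is left out here for brevity.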

Feature selection

After generating many features from your data, you need to decide which of them to use in your model to get the most performance out of it. This step also includes identifying the impact each feature has on your model. Let's look at some of the most popular feature selection methods.

Six ways to do feature selection using sklearn.
Permutation feature importance.
Adversarial feature validation.
Feature selection using null importances.
Tree explainer using SHAP.
DeepNN explainer using SHAP.
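
To make permutation feature importance from the list above concrete, here is a small sketch using scikit-learn's `permutation_importance` on synthetic data. The dataset, the random forest, and all parameter values are assumptions chosen for the example; with `shuffle=False`, `make_classification` places the informative features in the first columns, which makes the result easy to check.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: columns 0-2 are informative, columns 3-7 are noise.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one validation column at a time and record the drop in AUC.
result = permutation_importance(
    model, X_val, y_val, scoring="roc_auc", n_repeats=5, random_state=0
)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f}")
```

Features whose shuffling barely changes the validation score are candidates for removal, although strongly correlated features can mask each other under this measure.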

Modeling

After handcrafting and selecting your features, you should choose the right machine learning algorithm to make your predictions. This is a collection of some of the most used ML models in structured data classification challenges.

Random forest classifier.
XGBoost: gradient boosted decision trees.
LightGBM for distributed and faster training.
CatBoost to handle categorical data.
Naive Bayes classifier.
Gaussian naive Bayes model.
LGBM + CNN model used in the 3rd place solution of Santander Customer Transaction Prediction.
Knowledge distillation in neural networks.
Follow-the-regularized-leader method.
Comparison between LGB boosting methods (goss, gbdt and dart).
NN + focal loss experiment.
Keras NN with time series splitter.
5th place NN architecture with code for Santander Transaction Prediction.

Hyperparameter tuning

LGBM hyperparameter tuning methods.
Automated model tuning methods.
Parameter tuning with hyperopt.
Bayesian optimization for hyperparameter tuning.
GPyOpt hyperparameter optimisation.

Evaluation

Choosing a suitable validation strategy is very important for avoiding huge shake-ups or poor model performance on the private test set.

The traditional 80:20 split does not work in many cases. In most cases, cross-validation gives a better estimate of model performance than a single train-validation split.

There are different variations of k-fold cross-validation, such as group k-fold, and you should choose the one that fits your data.

K-fold cross-validation.
Stratified k-fold cross-validation.
Group k-fold cross-validation.
Adversarial validation to check whether the train and test distributions are similar.
Time series split validation.
Extensive time series splitter.
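
The difference between two of the splitters listed above can be seen on toy data. The sketch below is only an illustration: the 5% positive rate and the `groups` array (standing in for something like customer IDs) are assumptions. `StratifiedKFold` keeps the class ratio nearly constant across folds, while `GroupKFold` guarantees that no group appears in both the training and validation parts of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))
y = (rng.random(n) < 0.05).astype(int)   # about 5% positives
groups = rng.integers(0, 50, size=n)     # hypothetical customer IDs

# Stratified folds: each validation fold has nearly the same positive rate.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
strat_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]

# Group folds: count groups shared between train and validation per split.
gkf = GroupKFold(n_splits=5)
overlaps = [
    len(set(groups[train_idx]) & set(groups[val_idx]))
    for train_idx, val_idx in gkf.split(X, y, groups=groups)
]

print("positive rate per stratified fold:", np.round(strat_rates, 3))
print("shared groups per group fold:", overlaps)
```

Stratification suits imbalanced targets; grouping suits data where rows from the same entity would otherwise leak between train and validation.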

Note:

There are various metrics that you can use to evaluate the performance of your tabular models. A number of useful classification metrics are listed and explained in a resource linked from the original post.

Other training tricks

GPU acceleration for LGBM.
Use the GPU efficiently.
Free Keras memory.
Save and load models to save runtime and memory.

Ensemble

In a competitive environment, you won't get to the top of the leaderboard without ensembling. Selecting the appropriate ensembling or stacking method is very important for getting the most out of your models.

Let's look at some of the popular ensembling techniques used in Kaggle competitions:

Weighted average ensemble.
Stacked generalization ensemble.
Out-of-fold predictions.
Blending with linear regression.
Use Optuna to determine blending weights.
Power average ensemble.
Power 3.5 blending strategy.
Blending diverse models.
Different stacking approaches.
AUC weight optimization.
Geometric mean for low-correlation predictions.
Weighted rank average.
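
Several of the blending schemes listed above reduce to a line or two of NumPy. The sketch below shows one common formulation of each; the function names, the weights, and the toy predictions are assumptions for illustration, and competition solutions vary in the details (for example, in how the power average is normalised).

```python
import numpy as np
from scipy.stats import rankdata


def weighted_average(preds, weights):
    """Weighted arithmetic mean of model predictions (rows = models)."""
    return np.average(np.asarray(preds, dtype=float), axis=0, weights=weights)


def power_average(preds, power=3.5):
    """Mean of predictions raised to a power, which emphasises
    rows that the models score with high confidence."""
    return np.mean(np.asarray(preds, dtype=float) ** power, axis=0)


def geometric_mean(preds):
    """Geometric mean, computed in log space for numerical stability."""
    clipped = np.clip(np.asarray(preds, dtype=float), 1e-12, None)
    return np.exp(np.mean(np.log(clipped), axis=0))


def weighted_rank_average(preds, weights):
    """Average of per-model ranks scaled to (0, 1]; useful for AUC,
    which depends only on the ordering of the scores."""
    ranks = np.array([rankdata(p) / len(p) for p in preds])
    return np.average(ranks, axis=0, weights=weights)


# Toy predictions from three hypothetical models on four rows.
preds = np.array([
    [0.10, 0.40, 0.80, 0.90],
    [0.20, 0.30, 0.70, 0.95],
    [0.05, 0.50, 0.60, 0.85],
])
weights = [0.5, 0.3, 0.2]

avg_blend = weighted_average(preds, weights)
pow_blend = power_average(preds, power=3.5)
geo_blend = geometric_mean(preds)
rank_blend = weighted_rank_average(preds, weights)
print(avg_blend, pow_blend, geo_blend, rank_blend, sep="\n")
```

Blending weights should be chosen on out-of-fold predictions rather than on the public leaderboard, otherwise the blend can overfit to it.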

Final thoughts

In this article, you saw many popular and effective ways to improve the performance of your tabular data binary classification model. Hopefully, you will find them useful in your projects.

You can find more in-depth articles for machine learning practitioners on the Neptune blog.

Translated from: https://medium.com/neptune-ai/tabular-data-binary-classification-all-tips-and-tricks-from-5-kaggle-competitions-51667b21876e
