Machine Learning from Start to Finish with Scikit-Learn

This notebook covers the basic Machine Learning process in Python step-by-step. Go from raw data to at least 78% accuracy on the Titanic Survivors dataset.

Steps Covered

  1. Importing a DataFrame
  2. Visualize the Data
  3. Cleanup and Transform the Data
  4. Encode the Data
  5. Split Training and Test Sets
  6. Fine Tune Algorithms
  7. Cross Validate with KFold
  8. Upload to Kaggle

CSV to DataFrame

CSV files can be loaded into a DataFrame by calling pd.read_csv. After loading the training and test files, print a sample to see what you're working with.

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data_train = pd.read_csv('../input/train.csv')
data_test = pd.read_csv('../input/test.csv')

data_train.sample(3)

Out[1]:

     PassengerId  Survived  Pclass  Name                       Sex     Age   SibSp  Parch  Ticket  Fare     Cabin  Embarked
47   48           1         3       O'Driscoll, Miss. Bridget  female  NaN   0      0      14311   7.7500   NaN    Q
296  297          0         3       Hanna, Mr. Mansour         male    23.5  0      0      2693    7.2292   NaN    C
530  531          1         2       Quick, Miss. Phyllis May   female  2.0   1      1      26360   26.0000  NaN    S
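The sample already shows NaN in the Age and Cabin columns. A quick way to see how widespread missing values are is to count them per column. The sketch below rebuilds a few columns of the three sampled rows by hand as a stand-in (the `sample_rows` name is only for illustration); on the real data the equivalent call would be data_train.isna().sum().

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the three sampled rows shown above (illustrative only).
sample_rows = pd.DataFrame({
    "PassengerId": [48, 297, 531],
    "Age": [np.nan, 23.5, 2.0],
    "Cabin": [np.nan, np.nan, np.nan],
    "Embarked": ["Q", "C", "S"],
})

# Count missing values per column.
missing_counts = sample_rows.isna().sum()
print(missing_counts)
```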

Visualizing Data

Visualizing data is crucial for recognizing underlying patterns to exploit in the model.

In [2]:

sns.barplot(x="Embarked", y="Survived", hue="Sex", data=data_train);

[Figure: bar plot of Survived by Embarked, split by Sex]

In [3]:

sns.pointplot(x="Pclass", y="Survived", hue="Sex", data=data_train,
              palette={"male": "blue", "female": "pink"},
              markers=["*", "o"], linestyles=["-", "--"]);

[Figure: point plot of Survived by Pclass, split by Sex]
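By default, sns.barplot and sns.pointplot draw the mean of y for each group, so the plots above are survival rates per group. The same numbers can be read off with a groupby. A minimal sketch on a made-up six-row frame (`toy` is hypothetical data, not the Titanic set):

```python
import pandas as pd

# Made-up rows, only to show the computation.
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Pclass": [3, 3, 1, 3, 1, 1],
    "Survived": [0, 1, 1, 0, 1, 1],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean()
print(rates)
```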

Transforming Features

  1. Aside from 'Sex', the 'Age' feature is second in importance. To avoid overfitting, I'm grouping people into logical human age groups.
  2. Each Cabin starts with a letter. I bet this letter is much more important than the number that follows, so let's slice it off.
  3. Fare is another continuous value that should be simplified. I ran data_train.Fare.describe() to get the distribution of the feature, then placed the values into quartile bins accordingly.
  4. Extract information from the 'Name' feature. Rather than use the full name, I extracted the last name and name prefix (Mr., Mrs., etc.), then appended them as their own features.
  5. Lastly, drop useless features (Ticket, Name, and Embarked).
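For item 4, splitting on spaces works for names shaped like "Braund, Mr. Owen Harris", but it keeps the trailing comma on the last name and misreads multi-word surnames. The sketch below contrasts that with splitting on the comma first. `split_name` is a hypothetical helper shown as an alternative, not the approach this notebook uses, and the two name strings are just illustrative inputs.

```python
def split_on_spaces(name):
    # Space-based split: first token as last name, second token as prefix.
    parts = name.split(' ')
    return parts[0], parts[1]

def split_name(name):
    # Hypothetical alternative: the last name is everything before the first
    # comma, and the prefix is the first token after it.
    last, rest = name.split(',', 1)
    return last.strip(), rest.split()[0]

simple = "Braund, Mr. Owen Harris"
multi = "Vander Planke, Mrs. Julius"   # illustrative two-word surname

print(split_on_spaces(simple))
print(split_on_spaces(multi))
print(split_name(simple))
print(split_name(multi))
```

Whether the cleaner split helps the model is a separate question; it mainly changes which strings end up sharing a label.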

In [4]:

def simplify_ages(df):
    df.Age = df.Age.fillna(-0.5)
    bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
    group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
    categories = pd.cut(df.Age, bins, labels=group_names)
    df.Age = categories
    return df

def simplify_cabins(df):
    df.Cabin = df.Cabin.fillna('N')
    df.Cabin = df.Cabin.apply(lambda x: x[0])
    return df

def simplify_fares(df):
    df.Fare = df.Fare.fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.Fare, bins, labels=group_names)
    df.Fare = categories
    return df

def format_name(df):
    df['Lname'] = df.Name.apply(lambda x: x.split(' ')[0])
    df['NamePrefix'] = df.Name.apply(lambda x: x.split(' ')[1])
    return df

def drop_features(df):
    return df.drop(['Ticket', 'Name', 'Embarked'], axis=1)

def transform_features(df):
    df = simplify_ages(df)
    df = simplify_cabins(df)
    df = simplify_fares(df)
    df = format_name(df)
    df = drop_features(df)
    return df

data_train = transform_features(data_train)
data_test = transform_features(data_test)
data_train.head()

Out[4]:

   PassengerId  Survived  Pclass  Sex     Age          SibSp  Parch  Fare        Cabin  Lname       NamePrefix
0  1            0         3       male    Student      1      0      1_quartile  N      Braund,     Mr.
1  2            1         1       female  Adult        1      0      4_quartile  C      Cumings,    Mrs.
2  3            1         3       female  Young Adult  0      0      1_quartile  N      Heikkinen,  Miss.
3  4            1         1       female  Young Adult  1      0      4_quartile  C      Futrelle,   Mrs.
4  5            0         3       male    Young Adult  0      0      2_quartile  N      Allen,      Mr.
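pd.cut uses right-closed intervals by default, so with the age bins above a value of exactly 5 still lands in 'Baby', and the -0.5 placeholder for missing ages falls in (-1, 0], which is labeled 'Unknown'. A small sketch with a few hand-picked ages (the `ages` values are made up for illustration):

```python
import pandas as pd

bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student',
               'Young Adult', 'Adult', 'Senior']

# One missing age plus a few values near and inside the bin edges.
ages = pd.Series([None, 0.42, 5, 22, 80], dtype="float64").fillna(-0.5)

# Each value is mapped to the label of the (left, right] interval containing it.
binned = pd.cut(ages, bins, labels=group_names)
print(binned.tolist())
```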

In [5]:

sns.barplot(x="Age", y="Survived", hue="Sex", data=data_train);

[Figure: bar plot of Survived by Age group, split by Sex]

In [6]:

sns.barplot(x="Cabin", y="Survived", hue="Sex", data=data_train);

[Figure: bar plot of Survived by Cabin letter, split by Sex]

In [7]:

sns.barplot(x="Fare", y="Survived", hue="Sex", data=data_train);

[Figure: bar plot of Survived by Fare quartile, split by Sex]

Some Final Encoding

The last part of the preprocessing phase is to normalize labels. The LabelEncoder in Scikit-learn converts each unique string value into a number, making our data more flexible for various algorithms.

The result is a table of numbers that looks scary to humans, but beautiful to machines.

In [8]:

from sklearn import preprocessing

def encode_features(df_train, df_test):
    features = ['Fare', 'Cabin', 'Age', 'Sex', 'Lname', 'NamePrefix']
    df_combined = pd.concat([df_train[features], df_test[features]])

    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

data_train, data_test = encode_features(data_train, data_test)
data_train.head()

Out[8]:

   PassengerId  Survived  Pclass  Sex  Age  SibSp  Parch  Fare  Cabin  Lname  NamePrefix
0  1            0         3       1    4    1      0      0     7      100    19
1  2            1         1       0    0    1      0      3     2      182    20
2  3            1         3       0    7    0      0      0     7      329    16
3  4            1         1       0    7    1      0      3     2      267    20
4  5            0         3       1    7    0      0      1     7      15     19
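The encoder above is fitted on the training and test values combined. One reason this matters is that LabelEncoder raises an error when asked to transform a label it did not see during fit. A minimal sketch with made-up cabin letters (the lists are illustrative, not taken from the data):

```python
from sklearn.preprocessing import LabelEncoder

train_cabins = ["N", "C", "N", "E"]   # made-up training values
test_cabins = ["N", "T"]              # "T" never appears in the training list

# Fitting on the training values alone fails on the unseen label.
le = LabelEncoder().fit(train_cabins)
try:
    le.transform(test_cabins)
    unseen_failed = False
except ValueError:
    unseen_failed = True

# Fitting on the combined values gives every label a code.
le_all = LabelEncoder().fit(train_cabins + test_cabins)
codes = le_all.transform(test_cabins)
print(list(le_all.classes_), codes.tolist(), unseen_failed)
```

The integer codes follow the sorted order of the labels, so they carry no meaning beyond identifying the category.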

Splitting up the Training Data

Now it's time for some Machine Learning.

First, separate the features (X) from the labels (y).

X_all: All features minus the value we want to predict (Survived).

y_all: Only the value we want to predict.

Second, use Scikit-learn to randomly shuffle this data into four variables. In this case, I'm training on 80% of the data, then testing against the other 20%.

Later, this data will be reorganized into a KFold pattern to validate the effectiveness of a trained algorithm.

In [9]:

from sklearn.model_selection import train_test_split

X_all = data_train.drop(['Survived', 'PassengerId'], axis=1)
y_all = data_train['Survived']

num_test = 0.20
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=23)
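To make the 80/20 split concrete, here is a small sketch on stand-in arrays with 891 rows, the size of the training set used in the KFold cell of this notebook. `X_demo` and `y_demo` are placeholders, not the real features. With test_size=0.20, scikit-learn rounds the test portion up, so the split comes out as 712 training rows and 179 test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the Titanic training set.
X_demo = np.arange(891 * 2).reshape(891, 2)
y_demo = np.arange(891) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=23)

print(X_tr.shape, X_te.shape)
```

Fixing random_state makes the shuffle repeatable, so the same rows land in the test set on every run.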

Fitting and Tuning an Algorithm

Now it's time to figure out which algorithm is going to deliver the best model. I'm going with the RandomForestClassifier, but you can drop any other classifier here, such as Support Vector Machines or Naive Bayes.

In [10]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV

# Choose the type of classifier.
clf = RandomForestClassifier()

# Choose some parameter combinations to try.
# ('auto' meant the same as 'sqrt' for this classifier in older scikit-learn
# releases and is rejected by recent ones, so it is left out of the grid.)
parameters = {'n_estimators': [4, 6, 9],
              'max_features': ['log2', 'sqrt'],
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10],
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1, 5, 8]
             }

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)

# Run the grid search
grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
clf = grid_obj.best_estimator_

# Fit the best algorithm to the data.
clf.fit(X_train, y_train)

Out[10]:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=5, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=3,
            min_weight_fraction_leaf=0.0, n_estimators=9, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
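A grid search tries every combination in the parameter dictionary, so the number of candidates is the product of the list lengths. After fitting, the winning combination and its cross-validated score are available as best_params_ and best_score_. The sketch below uses a deliberately tiny grid on synthetic data from make_classification; `X_demo`, `y_demo`, and `small_grid` are stand-ins, not the notebook's data or grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic stand-in data; not the Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

small_grid = {'n_estimators': [4, 9], 'max_depth': [2, 5]}
n_candidates = len(ParameterGrid(small_grid))   # 2 * 2 combinations

search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid,
                      scoring='accuracy', cv=3)
search.fit(X_demo, y_demo)

print(n_candidates)
print(search.best_params_)
print(search.best_score_)
```

With the default refit=True, best_estimator_ has already been refitted on all the data passed to fit, so fitting it again afterwards is harmless but not required.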

In [11]:

predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))

0.798882681564
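accuracy_score is simply the fraction of predictions that match the true labels; with 179 held-out rows, a score of about 0.7989 would correspond to 143 correct predictions. A confusion matrix breaks that single number down by class. A small sketch on made-up labels (`y_true` and `y_pred` are illustrative):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]   # made-up labels
y_pred = [1, 0, 0, 1, 0, 1]   # made-up predictions

# Fraction of positions where the prediction equals the label.
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes (labels 0 and 1).
cm = confusion_matrix(y_true, y_pred)

print(acc)
print(cm)
```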

Validate with KFold

Is this model actually any good? It helps to verify the effectiveness of the algorithm using KFold. This will split our data into 10 buckets, then run the algorithm using a different bucket as the test set for each iteration.

In [12]:

from sklearn.model_selection import KFold

def run_kfold(clf):
    # sklearn.cross_validation was removed in scikit-learn 0.20;
    # model_selection.KFold takes n_splits and yields indices from split().
    kf = KFold(n_splits=10)
    outcomes = []
    fold = 0
    for train_index, test_index in kf.split(X_all):
        fold += 1
        X_train, X_test = X_all.values[train_index], X_all.values[test_index]
        y_train, y_test = y_all.values[train_index], y_all.values[test_index]
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        outcomes.append(accuracy)
        print("Fold {0} accuracy: {1}".format(fold, accuracy))
    mean_outcome = np.mean(outcomes)
    print("Mean Accuracy: {0}".format(mean_outcome))

run_kfold(clf)

Fold 1 accuracy: 0.8111111111111111
Fold 2 accuracy: 0.8651685393258427
Fold 3 accuracy: 0.7640449438202247
Fold 4 accuracy: 0.8426966292134831
Fold 5 accuracy: 0.8314606741573034
Fold 6 accuracy: 0.8202247191011236
Fold 7 accuracy: 0.7528089887640449
Fold 8 accuracy: 0.8089887640449438
Fold 9 accuracy: 0.8876404494382022
Fold 10 accuracy: 0.8426966292134831
Mean Accuracy: 0.8226841448189763
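The manual loop above can also be expressed with cross_val_score, which clones the estimator for each fold and returns one score per fold. This is a shorter alternative, not what the notebook itself runs; the sketch uses synthetic data (`X_demo`, `y_demo`) and an arbitrary small forest as stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; not the Titanic features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=9, random_state=0)

# Ten unshuffled folds; each fold is the test set exactly once.
scores = cross_val_score(model, X_demo, y_demo,
                         cv=KFold(n_splits=10), scoring='accuracy')

print(scores)
print(scores.mean())
```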

Predict the Actual Test Data

And now for the moment of truth. Make the predictions, export the CSV file, and upload them to Kaggle.

In [13]:

ids = data_test['PassengerId']
predictions = clf.predict(data_test.drop('PassengerId', axis=1))

output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })
# output.to_csv('titanic-predictions.csv', index = False)
output.head()

Out[13]:

   PassengerId  Survived
0  892          0
1  893          0
2  894          0
3  895          0
4  896          0
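The to_csv call in the cell above is commented out. As a rough sketch of what the exported file would contain, the example below writes a small stand-in frame to a string instead of a file (`demo_output` and its values are made up). Passing index=False keeps the DataFrame index out of the file, so only the two columns built above are written.

```python
import pandas as pd

# Made-up ids and predictions standing in for the real output frame.
demo_output = pd.DataFrame({'PassengerId': [892, 893, 894],
                            'Survived': [0, 0, 1]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = demo_output.to_csv(index=False)
csv_lines = csv_text.splitlines()
print(csv_text)
```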

Reposted from: https://my.oschina.net/cloudcoder/blog/1068712
