Feature Scaling: Effect of Different Scikit-Learn Scalers (Deep Dive)


Inside AI

In supervised machine learning, we compute the value of an output variable by supplying input variable values to an algorithm. The algorithm relates the input and output variables through a mathematical function, for example:

Output variable value = (2.4 * Input Variable 1) + (6 * Input Variable 2) + 3.5

Every machine learning algorithm rests on a few specific assumptions. To build an accurate model, we need to ensure that the input data meets those assumptions. If the data fed to the algorithm does not satisfy them, the prediction accuracy of the model is compromised.

Many supervised algorithms in sklearn work best when the input features look more or less like standard normally distributed data: centred around zero, with variances of the same order of magnitude. If one input variable ranges from 1 to 10 and another from 4,000 to 700,000, the second variable will dominate, and the algorithm may not learn from the other features as expected.

In this article, I will illustrate the effect of scaling the input variables with different scikit-learn scalers, using three different regression algorithms.

In the code below, we import the packages we will use for the analysis. We will create the test data with the help of make_regression.

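from sklearn.datasets import make_regression
import numpy as np
import matplotlib.pyplot as plt  # needed for the scatter plots drawn later
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler, MinMaxScaler, PowerTransformer,
                                   RobustScaler, MaxAbsScaler)
from sklearn.linear_model import Ridge, HuberRegressor, LinearRegression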

We will use a sample of 100 records with three independent (input) variables. We will then inject three outliers by overwriting the first three records with values drawn with np.random.normal.

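X, y, coef = make_regression(n_samples=100, n_features=3, noise=2, tail_strength=0.5, coef=True, random_state=0)
# Overwrite the first three records as outliers (np.random is unseeded, so the values vary per run)
X[:3] = 1 + 0.9 * np.random.normal(size=(3, 3))
y[:3] = 1 + 2 * np.random.normal(size=3)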

We print the real coefficients of the sample dataset as a reference, to compare with the coefficients the models estimate.

print("The real coefficients are ", coef)

We will train the algorithms on 80 records and reserve the remaining 20, which the algorithms have not seen, for testing the accuracy of the model.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

We will study the scaling effect of the scikit-learn StandardScaler, MinMaxScaler, PowerTransformer, RobustScaler and MaxAbsScaler.

# Note: despite its name, this list holds the scalers, not the regression models
regressors = [StandardScaler(), MinMaxScaler(),
              PowerTransformer(method='yeo-johnson'),
              RobustScaler(quantile_range=(25, 75)), MaxAbsScaler()]

All the regression models we will use are collected in a list object.

models = [Ridge(alpha=1.0), HuberRegressor(), LinearRegression()]

In the code below, we scale the training and test input variables by calling each scaler in turn from the regressors list defined earlier (which, despite its name, holds the scalers). For each scaler, we draw a scatter plot of the original and the scaled first input variable against the training target, to get an insight into what the scaling does. We will see each of these plots a little later in this article.

Further, we fit each of the models with the scaled input variables from each scaler and predict the values of the dependent variable for the test dataset.

for regressor in regressors:
    # Fit the scaler on the training data only, then apply it to both sets
    X_train_scaled = regressor.fit_transform(X_train)
    X_test_scaled = regressor.transform(X_test)
    # Plot the scaled and the original first feature against the training target
    Scaled = plt.scatter(X_train_scaled[:, 0], y_train, marker='^', alpha=0.8)
    Original = plt.scatter(X_train[:, 0], y_train)
    plt.legend((Scaled, Original), ('Scaled', 'Original'), loc='best', fontsize=13)
    plt.xlabel("Feature 1")
    plt.ylabel("Train Target")
    plt.show()
    for model in models:
        reg_lin = model.fit(X_train_scaled, y_train)
        y_pred = reg_lin.predict(X_test_scaled)
        print("The calculated coefficients with ", model, "and", regressor, reg_lin.coef_)

Finally, the coefficients from each model fit are printed for comparison with the real coefficients.

Figure: Results in tabular format

At first glance, we can see that the same regression estimator yields different coefficient values depending on the scaler. The coefficients obtained with MaxAbsScaler and MinMaxScaler are quite far from the true coefficient values. Keep in mind that coefficients fitted on scaled inputs are expressed per unit of the scaled feature, so part of this gap reflects the change of units rather than prediction error. Even so, this example shows how much the choice of scaler matters for the resulting model.
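To make the comparison with the real coefficients concrete, here is a minimal sketch of converting coefficients fitted on standardised inputs back to the original units. It is an illustration, not the article's code: the `factors` array that stretches the features is a hypothetical addition, and the sketch covers only LinearRegression with StandardScaler, where dividing by the scaler's `scale_` attribute undoes the change of units.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with known coefficients (no outliers injected in this sketch).
X_demo, y_demo, true_coef = make_regression(
    n_samples=100, n_features=3, noise=2, coef=True, random_state=0)

# Hypothetical step: stretch the features so they live on very different scales.
factors = np.array([1.0, 50.0, 1000.0])
X_raw = X_demo * factors
coef_in_raw_units = true_coef / factors  # true coefficients per raw unit

# Fit on standardised inputs: coefficients are per standard deviation.
std_scaler = StandardScaler()
X_std = std_scaler.fit_transform(X_raw)
coef_scaled = LinearRegression().fit(X_std, y_demo).coef_

# Convert back to raw units by dividing by each feature's standard deviation.
coef_back = coef_scaled / std_scaler.scale_

# Reference: ordinary least squares fitted directly on the raw inputs.
coef_raw_fit = LinearRegression().fit(X_raw, y_demo).coef_

print("true (raw units):     ", coef_in_raw_units)
print("fitted on scaled data:", coef_scaled)
print("converted back:       ", coef_back)
print("fitted on raw data:   ", coef_raw_fit)
```

The converted coefficients match the raw-data fit, which suggests comparing coefficients only after bringing them back to a common unit. Other scalers and regularised models need their own conversion and are not covered here.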

As a self-exploration and learning exercise, I encourage you to calculate the R2 score and Root Mean Square Error (RMSE) for each scaler and model combination, on both the training and test sets, and compare them with each other.
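One possible starting point for this exercise is sketched below. It rebuilds the same synthetic data, but with two assumptions that are not in the article: the outliers come from a seeded generator (`rng`) so that runs are reproducible, and only Ridge and LinearRegression are included to keep the sketch short (HuberRegressor can be added to the `models` dictionary in the same way). The dictionary names and the `scores` structure are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, PowerTransformer,
                                   RobustScaler, StandardScaler)

# Assumption: a seeded generator, so the injected outliers are reproducible.
rng = np.random.default_rng(0)
X, y = make_regression(n_samples=100, n_features=3, noise=2,
                       tail_strength=0.5, random_state=0)
X[:3] = 1 + 0.9 * rng.normal(size=(3, 3))
y[:3] = 1 + 2 * rng.normal(size=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

scalers = {
    "StandardScaler": StandardScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "PowerTransformer": PowerTransformer(method="yeo-johnson"),
    "RobustScaler": RobustScaler(quantile_range=(25, 75)),
    "MaxAbsScaler": MaxAbsScaler(),
}
models = {"Ridge": Ridge(alpha=1.0), "LinearRegression": LinearRegression()}

scores = {}  # (scaler, model, split) -> (R2, RMSE)
for scaler_name, scaler in scalers.items():
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    for model_name, model in models.items():
        model.fit(X_train_scaled, y_train)
        for split, X_part, y_part in (("train", X_train_scaled, y_train),
                                      ("test", X_test_scaled, y_test)):
            pred = model.predict(X_part)
            r2 = r2_score(y_part, pred)
            rmse = np.sqrt(mean_squared_error(y_part, pred))
            scores[(scaler_name, model_name, split)] = (r2, rmse)
            print(f"{scaler_name:16s} {model_name:16s} {split:5s} "
                  f"R2={r2:.4f} RMSE={rmse:.4f}")
```

One thing to look for in the output: LinearRegression fits an intercept, so its scores should be the same for the four scalers that only shift and rescale each feature, and differ only for PowerTransformer, which is nonlinear. Ridge penalises coefficient size, so its scores change with the scaler.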

Now that we understand the importance of scaling and of selecting a suitable scaler, we will look at the inner workings of each scaler.

Standard Scaler: It is one of the most popular scalers in real-life machine learning projects. The mean and standard deviation of each input variable are determined separately. The scaler then subtracts the mean from each data point and divides by the standard deviation, transforming each variable to zero mean and unit standard deviation. It does not bound the values to a specific range, which can be an issue for a few algorithms.
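The arithmetic described above can be checked by hand. The following is a small sketch on hypothetical toy data (`X_demo` is made up for illustration) that compares a manual standardisation with the output of StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
X_demo = np.array([[1.0, 4000.0],
                   [5.0, 250000.0],
                   [10.0, 700000.0]])

# Manual standardisation: subtract the column mean and divide by the column
# standard deviation (population form, ddof=0, as StandardScaler does).
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

scaler = StandardScaler()
scaled = scaler.fit_transform(X_demo)

print("learned means:", scaler.mean_)
print("learned standard deviations:", scaler.scale_)
print("matches manual calculation:", np.allclose(manual, scaled))
print("column means after scaling:", scaled.mean(axis=0))
print("column std after scaling:", scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, but their minimum and maximum values are not fixed, which is the unbounded range mentioned above.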

Figure: Standard Scaler, original vs. scaled plot based on the code discussed in the article

MinMax Scaler: With a MinMax Scaler, all the numeric values of a feature are scaled to lie between 0 and 1:

Xscaled = (X - Xmin) / (Xmax - Xmin)

MinMax scaling is strongly affected by outliers. If the dataset contains one or more extreme outliers, the min-max scaler squeezes the normal values into a narrow band so that the outliers fit within the 0 to 1 range. We saw earlier that the predicted coefficients with the MinMax scaler are approximately three times the real coefficients. I recommend not using MinMax Scaler on datasets with outliers.
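The squeezing effect is easy to see on a tiny example. This sketch uses a hypothetical column (`values`) with four ordinary readings and one extreme outlier, and applies both the formula and MinMaxScaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature: four ordinary values and one extreme outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# The formula applied by hand: (X - Xmin) / (Xmax - Xmin)
manual_minmax = (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))

minmax_scaled = MinMaxScaler().fit_transform(values)

print("scaled values:", minmax_scaled.ravel())
print("matches the formula:", np.allclose(manual_minmax, minmax_scaled))
print("largest ordinary value after scaling:", minmax_scaled[:4].max())
```

The outlier maps to 1, while the four ordinary values all land below 0.01, so almost all of the 0 to 1 range is spent on a single point.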

Figure: Robust Scaler, original vs. scaled plot based on the code discussed in the article

Robust Scaler: Robust scaler is one of the best-suited scalers for datasets with outliers. It centres each feature on its median and scales it according to the interquartile range (IQR). The IQR is the range between the 25th and 75th percentiles, which contains the middle 50% of the data points.
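As a rough illustration of the median and interquartile range at work, the sketch below scales a hypothetical column (`readings`) containing one extreme outlier, both by hand and with RobustScaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature: four ordinary values and one extreme outlier.
readings = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# By hand: subtract the median, divide by the interquartile range (75th - 25th percentile).
median = np.median(readings, axis=0)
q25, q75 = np.percentile(readings, [25, 75], axis=0)
manual_robust = (readings - median) / (q75 - q25)

robust_scaled = RobustScaler(quantile_range=(25, 75)).fit_transform(readings)

print("median:", median, "IQR:", q75 - q25)
print("scaled values:", robust_scaled.ravel())
print("matches manual calculation:", np.allclose(manual_robust, robust_scaled))
```

The median and IQR are barely moved by the outlier, so the ordinary values keep a sensible spread (from -1 to 0.5 here) while the outlier stays far away instead of compressing everything else.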

Power Transformer Scaler: The power transformer tries to make the data more Gaussian-like. It applies a power transformation whose parameter is chosen by maximum likelihood estimation, in order to stabilize variance and minimize skewness. Sometimes the power transformer fails to produce Gaussian-like results, so it is important to plot and check the scaled data.
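Besides plotting, a quick numeric check is to compare skewness before and after the transformation. The sketch below does this on hypothetical right-skewed data drawn from an exponential distribution; the data, the seed and the use of `scipy.stats.skew` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed feature (exponential distribution, seeded for reproducibility).
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=(1000, 1))

# Yeo-Johnson transform; the power parameter lambda is estimated from the data.
pt = PowerTransformer(method="yeo-johnson")
transformed = pt.fit_transform(skewed)

skew_before = skew(skewed.ravel())
skew_after = skew(transformed.ravel())

print("fitted lambda:", pt.lambdas_)
print("skewness before:", skew_before)
print("skewness after:", skew_after)
```

A skewness close to zero after the transformation is a good sign, but it does not replace looking at the plot: a distribution can have low skewness and still be far from Gaussian.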

Figure: Power Transformer Scaler, original vs. scaled plot based on the code discussed in the article

MaxAbs Scaler: MaxAbsScaler is best suited to scaling sparse data. It scales each feature by dividing it by the maximum absolute value of that feature. Because it does not shift the data, zero entries stay zero and sparsity is preserved.

Figure: MaxAbs Scaler, original vs. scaled plot based on the code discussed in the article

For example, if an input variable has the original values [2, -1, 0, 1], MaxAbs will scale them to [1, -0.5, 0, 0.5]: it divides each value by the largest absolute value, i.e. 2. It is not advisable to use it on datasets with large outliers.
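The example above can be reproduced in a few lines. The sketch also passes the same column as a SciPy sparse matrix, an addition made here to illustrate the sparse-data use case; the variable names are illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# The column from the example: largest absolute value is 2.
column = np.array([[2.0], [-1.0], [0.0], [1.0]])
maxabs_scaled = MaxAbsScaler().fit_transform(column)
print("dense result:", maxabs_scaled.ravel())

# The same data stored as a sparse matrix: the output stays sparse,
# and the zero entry is never materialised.
sparse_column = sparse.csr_matrix(column)
sparse_scaled = MaxAbsScaler().fit_transform(sparse_column)
print("still sparse:", sparse.issparse(sparse_scaled))
print("stored non-zero entries:", sparse_scaled.nnz)
```

Since the scaler only divides and never subtracts, the zero entry remains zero, which is what makes it a natural fit for sparse inputs.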

We have learnt that scaling the input variables with a suitable scaler is as vital as selecting the right machine learning algorithm. A few of the scalers are quite sensitive to outliers, while others are robust. Each of the scalers in Scikit-Learn has its strengths and limitations, and we need to be mindful of them when using it.

This also highlights the importance of performing exploratory data analysis (EDA) first, to identify the presence or absence of outliers and other idiosyncrasies that will guide the selection of an appropriate scaler.

You can learn more about this area in my article, 5 Advanced Visualisation for Exploratory data analysis (EDA).

If you would like to learn a structured approach to identifying the appropriate independent variables for accurate predictions, read my article "How to identify the right independent variables for Machine Learning Supervised".

"""Full Code"""
from sklearn.datasets import make_regression
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler, MinMaxScaler, PowerTransformer,
                                   RobustScaler, MaxAbsScaler)
from sklearn.linear_model import Ridge, HuberRegressor, LinearRegression
import matplotlib.pyplot as plt

# Synthetic regression data with known coefficients
X, y, coef = make_regression(n_samples=100, n_features=3, noise=2, tail_strength=0.5, coef=True, random_state=0)
print("The real coefficients are ", coef)

# Overwrite the first three records as outliers (np.random is unseeded, so the values vary per run)
X[:3] = 1 + 0.9 * np.random.normal(size=(3, 3))
y[:3] = 1 + 2 * np.random.normal(size=3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Note: despite its name, this list holds the scalers, not the regression models
regressors = [StandardScaler(), MinMaxScaler(), PowerTransformer(method='yeo-johnson'),
              RobustScaler(quantile_range=(25, 75)), MaxAbsScaler()]
models = [Ridge(alpha=1.0), HuberRegressor(), LinearRegression()]

for regressor in regressors:
    # Fit the scaler on the training data only, then apply it to both sets
    X_train_scaled = regressor.fit_transform(X_train)
    X_test_scaled = regressor.transform(X_test)
    Scaled = plt.scatter(X_train_scaled[:, 0], y_train, marker='^', alpha=0.8)
    Original = plt.scatter(X_train[:, 0], y_train)
    plt.legend((Scaled, Original), ('Scaled', 'Original'), loc='best', fontsize=13)
    plt.xlabel("Feature 1")
    plt.ylabel("Train Target")
    plt.show()
    for model in models:
        reg_lin = model.fit(X_train_scaled, y_train)
        y_pred = reg_lin.predict(X_test_scaled)
        print("The calculated coefficients with ", model, "and", regressor, reg_lin.coef_)

Translated from: https://towardsdatascience.com/feature-scaling-effect-of-different-scikit-learn-scalers-deep-dive-8dec775d4946
