The Math Behind the Linear Regression Algorithm
Inside AI
Linear regression is one of the most popular algorithms and was used in many fields well before the advent of computers. Today, with powerful computers, we can solve multi-dimensional linear regression problems that were not feasible earlier. In single- or multi-dimensional linear regression, the basic mathematical concept is the same.
Today, with machine learning libraries such as Scikit-learn, it is possible to use linear regression in modelling without understanding the mathematical concepts behind it. In my opinion, it is essential for a data scientist or machine learning professional to understand the mathematical concepts and the logic behind an algorithm before using it.
Most of us have not studied advanced mathematics and statistics, and the mathematical notation and jargon behind the algorithms can be intimidating. In this article, I will explain the math and logic behind linear regression with simplified Python code and easy math to build your understanding.
Overview
We will start with a simple linear equation with one variable and no intercept/bias. First, we will walk through the step-by-step approach that packages like Scikit-learn take to solve linear regression, and along the way we will meet the important concept of gradient descent. Then we will look at an example of a simple linear equation with one variable and an intercept/bias.
Step 1: We will use the Python package NumPy to work with a sample dataset and Matplotlib to plot various graphs for visualisation.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Step 2: Let us consider a simple scenario where a single input/independent variable controls the outcome/dependent variable value. In the code below, we declare two NumPy arrays to hold the values of the independent and dependent variables.
Independent_Variable=np.array([1,2,3,12,15,17,20,21,5,7,9,10,3,12,15,17,20,7])
Dependent_Variable=np.array([7,14,21,84,105,116.1,139,144.15,32.6,50.1,65.4,75.4,20.8,83.4,103.15,110.9,136.6,48.7])
Step 3: Let us quickly draw a scatter plot to understand the data points.
plt.scatter(Independent_Variable, Dependent_Variable, color='green')
plt.xlabel('Independent Variable/Input Parameter')
plt.ylabel('Dependent Variable/ Output Parameter')
plt.show()

Our goal is to formulate a linear equation that can predict the dependent variable value with minimum error for a given independent/input variable.
Dependent Variable = Constant * Independent Variable
In mathematical terms, Y = constant * X
In terms of visualisation, we need to find the best-fit line, i.e. the line that gives the minimum error over the points.
The measure of this error is known as the loss function in the machine learning world, and our aim is to make it as small as possible.

For each assumed constant value in the equation Y = constant * X, we can calculate the loss over all the independent data points. The goal is to find the constant for which this loss is minimum and to use it to formulate the equation. Please note that in the loss function, m stands for the number of points; in the current example we have 18 points, so 1/2m translates to 1/36. Do not be put off by the loss function formula: we calculate the loss as the sum of the squared differences between the calculated value and the actual value for each data point, and then divide it by twice the number of points. We will decipher it step by step below with the help of Python code.
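To make this concrete, here is a minimal sketch of that loss written as a small helper function (the name loss_for_slope is my own, not from the original article; it assumes NumPy has been imported as np and the two arrays from Step 2 are available):

def loss_for_slope(slope, x, y):
    # sum of squared differences between calculated and actual values, divided by 2m
    residuals = slope * x - y
    return np.sum(residuals ** 2) / (2 * len(x))

For example, loss_for_slope(7, Independent_Variable, Dependent_Variable) gives the loss of the candidate line Y = 7 * X on our 18 points.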
Step 4: To understand the core idea and the math behind identifying the equation, we will consider the limited set of constant values listed in the code below and calculate the loss function for each of them.
In an actual linear regression algorithm, the candidate constants used for the loss calculation are spaced at particular gaps. Initially, the gap between two consecutive candidate constants is large; as we move closer to the actual solution, constants with smaller gaps are considered. In the machine learning world, the learning rate is the step by which the constant is increased or decreased between loss calculations.
m=[-5,-3,-1,1,3,5,6.6,7,8.5,9,11,13,15]
Step 5: In the code below, we calculate the loss function for each value of the constant (i.e. each value in the list m declared in the earlier step) over all input and output data points.
We store the calculated loss for each constant in a NumPy array "errormargin".
errormargin=np.array([])
for slope in m:
    counter=0
    sumerror=0
    for x in Independent_Variable:
        yhat=slope*x
        error=(yhat-Dependent_Variable[counter])*(yhat-Dependent_Variable[counter])
        sumerror=error+sumerror
        counter=counter+1
    cost=sumerror/(2*18)   # divide by 2m, with m = 18 data points
    errormargin=np.append(errormargin,cost)
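For reference, the same array of losses can be produced with a single vectorized expression; a minimal sketch assuming the arrays and the list m defined above, dividing by 2m exactly as in the loop:

errormargin=np.array([np.sum((slope*Independent_Variable-Dependent_Variable)**2)/(2*18) for slope in m])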
Step 6: We plot the calculated loss against each candidate constant to determine the best constant value.
plt.plot(m,errormargin)
plt.xlabel("Slope Values")
plt.ylabel("Loss Function")
plt.show()
The value of the constant at which the curve reaches its lowest point is the constant with which we can formulate the equation of the line.

In our example, the curve is at its lowest point around a constant value of 6.8.
A line with this value, Y = 6.8 * X, best fits the data points with minimum error.
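If you prefer to read this off programmatically instead of from the plot, the candidate with the smallest computed loss can be picked directly from the grid; a minimal sketch assuming the list m and the array errormargin built above (this returns the nearest grid value, not the exact minimiser):

best_slope=m[np.argmin(errormargin)]   # grid value with the lowest loss
print("Candidate slope with the lowest loss:", best_slope)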

This approach of following the loss function and identifying the true values of the equation's fixed parameters at the lowest point of the loss curve is the idea behind gradient descent (gradient descent itself reaches that lowest point iteratively, by stepping in the direction that reduces the loss, rather than by evaluating every candidate). For simplicity we have considered one variable, so the loss function is a 2-dimensional curve. In the case of multiple linear regression, the surface that gradient descent moves over is multi-dimensional.
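For comparison with the grid approach above, here is a minimal sketch of how gradient descent itself would find the same constant iteratively; the variable names, the starting value, the learning rate of 0.0001 and the 1000 steps are my own illustrative choices, not from the original article:

slope=0.0                         # arbitrary starting guess
learning_rate=0.0001
for step in range(1000):
    yhat=slope*Independent_Variable
    error=yhat-Dependent_Variable
    gradient=np.sum(error*Independent_Variable)/len(Independent_Variable)   # d(loss)/d(slope)
    slope=slope-learning_rate*gradient   # step downhill along the loss curve
print("Slope found by gradient descent:", slope)

Run on the 18 sample points above, this lands close to the value we read off the loss curve.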
We have now seen the inner workings of calculating the coefficient of the independent variable. Next, let us learn, step by step, how to calculate both the coefficient and the intercept/bias in linear regression.
Step 1: Just as before, let us consider a sample set of independent and dependent variable values. These are the input and output data points available to us, and our goal is to formulate a linear equation that can predict the dependent variable value with minimum error for a given independent/input variable.
Dependent Variable = (Coefficient * Independent Variable) + Constant
In mathematical terms, y = (Coefficient * x) + c
Please note that the coefficient is itself a constant term: it is the constant that multiplies the independent variable in the equation.
Independent_Variable=np.array([1,2,4,3,5])
Dependent_Variable=np.array([1,3,3,2,5])
Step 2: We assume that the initial values of the coefficient "m" and the constant "c" are both zero, and after each iteration of the error calculation we adjust m and c using a small learning rate of 0.001. An epoch is one pass of this calculation over all the available data points. As we increase the number of epochs the solution becomes more accurate, but it consumes more time and computing power. Based on the business case, we can decide the acceptable error in the calculated values and stop the iterations accordingly.
LR=0.001
m=0
c=0
epoch=0
Step 3: In the code below, we run 1100 epochs over the available dataset and calculate the coefficient and the constant value.
For each independent data point, we calculate the dependent value (i.e. yhat) and then the error between the calculated and the actual dependent value.
Based on this error, we update the values of the coefficient and the constant for the next iteration:
New Coefficient = Current Coefficient - (Learning Rate * Error * Independent Variable Value)

New Constant = Current Constant - (Learning Rate * Error)

(The independent variable value appears in the coefficient update because, with yhat = (m * x) + c, the derivative of the squared error with respect to m is error * x, whereas the derivative with respect to c is simply the error. This matches the code below.)
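As a worked example of a single update (my own illustration, using the first sample point x = 1, y = 1 and the starting values m = 0, c = 0, LR = 0.001):

# yhat  = (0 * 1) + 0            = 0
# error = yhat - y               = -1
# m     = 0 - (0.001 * -1 * 1)   = 0.001
# c     = 0 - (0.001 * -1)       = 0.001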
while epoch<1100:
    epoch=epoch+1
    counter=0
    for x in Independent_Variable:
        yhat=(m*x)+c                     # predicted value with the current m and c
        error=yhat-Dependent_Variable[counter]
        c=c-(LR*error)                   # constant update uses the error alone
        m=m-(LR*error*x)                 # coefficient update uses error * x
        counter=counter+1
We check the values of the coefficient and the constant after the 1100 epochs over the available dataset.
print("The final value of m", m)
print("The final value of c", c)

Mathematically, it can be represented as y = (0.81 * x) + 0.33
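As a quick sanity check before comparing with Scikit-learn, we can use the learned m and c to predict the sample points and look at the remaining error; a minimal sketch, assuming the variables defined above:

predictions=(m*Independent_Variable)+c
print("Predictions:", predictions)
print("Mean squared error:", np.mean((predictions-Dependent_Variable)**2))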

Finally, let us compare the result above with the output of the Scikit-learn linear regression algorithm.
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(Independent_Variable.reshape(-1,1), Dependent_Variable)
print(reg.coef_)
print(reg.intercept_)

After 1100 epochs over the available dataset, the calculated values of the coefficient and the constant/bias are very close to the output of the Scikit-learn linear regression algorithm.
I hope this article gave you a firm understanding of the behind-the-scenes mathematical calculations and concepts in linear regression. We have also seen how gradient descent is applied to find the optimal solution. In the case of multiple linear regression, the math and logic remain the same; they simply scale to more dimensions.
Translated from: https://towardsdatascience.com/linear-regression-algorithm-under-the-hood-math-for-non-mathematicians-c228d244e3f3