Linear Regression Algorithm Under the Hood: Math for Non-Mathematicians

Inside AI

Linear regression is one of the most popular algorithms, used across many fields since well before the advent of computers. Today, with powerful computers, we can solve multi-dimensional linear regression problems that were not tractable earlier. Whether single- or multi-dimensional, the basic mathematical concept is the same.

Today, with machine learning libraries such as Scikit-learn, it is possible to use linear regression in modelling without understanding the mathematical concepts behind it. In my opinion, it is essential for data scientists and machine learning professionals to understand the mathematics and logic behind an algorithm before using it.

Most of us have not studied advanced mathematics and statistics, and the mathematical notation and jargon behind the algorithms can be intimidating. In this article, I will explain the math and logic behind linear regression with simplified Python code and easy math, to build your understanding.

Overview

We will start with a simple linear equation with one variable and no intercept/bias. First, we will walk through the step-by-step approach that packages like Scikit-learn take to solve linear regression; along the way, we will meet the important concept of gradient descent. Then, we will work through an example of a simple linear equation with one variable and an intercept/bias.

Step 1: We will use the Python package NumPy to work with a sample dataset and Matplotlib to plot various graphs for visualisation.

import numpy as np
import matplotlib.pyplot as plt

Step 2: Let us consider a simple scenario in which a single input/independent variable controls the outcome/dependent variable value. In the code below, we declare two NumPy arrays to hold the values of the independent and dependent variables.

Independent_Variable=np.array([1,2,3,12,15,17,20,21,5,7,9,10,3,12,15,17,20,7])
Dependent_Variable=np.array([7,14,21,84,105,116.1,139,144.15,32.6,50.1,65.4,75.4,20.8,83.4,103.15,110.9,136.6,48.7])

Step 3: Let us quickly draw a scatter plot to understand the data points.

plt.scatter(Independent_Variable, Dependent_Variable, color='green')
plt.xlabel('Independent Variable/Input Parameter')
plt.ylabel('Dependent Variable/Output Parameter')
plt.show()

[Figure: scatter plot of the sample data points]

Our goal is to formulate a linear equation that predicts the dependent-variable value with minimum error for any given independent/input variable value:

Dependent Variable = Constant * Independent Variable

In mathematical terms, Y = constant * X

In terms of visualisation, we need to find the best-fit line, the one that gives the minimum error across the points.

The error measure we minimise is known as the loss function in the machine learning world.

Loss = (1/2m) * Σ (yhat_i - y_i)²

For each assumed constant value in the equation Y = constant * X, we can calculate the loss over all the data points. The goal is to find the constant for which this loss is minimum and use it to formulate the equation. Please note that in the loss function, m stands for the number of points; in the current example we have 18 points, so 1/2m translates to 1/36. Do not be terrified by the formula: we calculate the loss as the sum of the squared differences between the calculated and actual values of each data point, divided by twice the number of points. We will decipher it step by step with the help of Python code below.
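As a quick illustration with just the first two data points (1, 7) and (2, 14): for a trial constant of 5, the calculated values are 5 * 1 = 5 and 5 * 2 = 10, so the loss over those two points would be ((5 - 7)² + (10 - 14)²) / (2 * 2) = (4 + 16) / 4 = 5.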

Step 4: To understand the core idea and math behind identifying the equation, we will consider the limited set of constant values listed in the code below and calculate the loss function for each of them.

In actual linear regression algorithms, candidate constants are evaluated at particular gaps. Initially, the gap between two consecutive candidates considered for the loss calculation is large; as we move closer to the actual solution, constants with smaller gaps are considered. In the machine learning world, the learning rate is the step by which the constant is increased or decreased between loss-function calculations.

m=[-5,-3,-1,1,3,5,6.6,7,8.5,9,11,13,15]
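As a rough sketch of that coarse-to-fine idea (an illustration with arbitrary grid bounds, not Scikit-learn's actual implementation; it assumes the arrays from Step 2 are already defined):

# Loss for one candidate slope, matching the 1/2m formula with m = 18 points.
def loss(slope):
    residuals = slope * Independent_Variable - Dependent_Variable
    return np.sum(residuals ** 2) / (2 * len(Independent_Variable))

coarse = np.arange(-5, 15, 2.0)                  # widely spaced candidates
best = coarse[np.argmin([loss(s) for s in coarse])]
fine = np.arange(best - 2.0, best + 2.0, 0.1)    # narrower band, smaller gap
best = fine[np.argmin([loss(s) for s in fine])]
print(best)                                      # lands near 6.9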

Step 5: In the code below, we calculate the loss function for each constant value (i.e. the values in the list m declared in the earlier step) over all the input and output data points.

We store the calculated loss for each constant in a NumPy array named errormargin.

errormargin = np.array([])
for slope in m:
    counter = 0
    sumerror = 0
    for x in Independent_Variable:
        yhat = slope * x
        error = (yhat - Dependent_Variable[counter]) * (yhat - Dependent_Variable[counter])
        sumerror = error + sumerror
        counter = counter + 1
    cost = sumerror / 36   # 1/2m with m = 18 points
    errormargin = np.append(errormargin, cost)
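Equivalently, the same loss values can be computed with NumPy broadcasting instead of explicit loops (an optional rewrite, not part of the original walkthrough):

slopes = np.array(m)                                      # shape (13,)
yhats = slopes[:, None] * Independent_Variable[None, :]   # shape (13, 18)
errormargin = ((yhats - Dependent_Variable) ** 2).sum(axis=1) / 36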

Step 6: We will plot the calculated loss against each constant value to determine the actual constant.

plt.plot(m,errormargin)
plt.xlabel("Slope Values")
plt.ylabel("Loss Function")
plt.show()

The constant value at which the curve is at its lowest point is the real constant with which we can formulate the equation of the line.

[Figure: loss function plotted against the candidate slope values]

In our example, the curve is at its lowest point around the constant value 6.8.
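We can also pull the best candidate out of the arrays programmatically (a small addition to the original walkthrough):

best_slope = m[np.argmin(errormargin)]
print(best_slope)   # prints 7, the lowest-loss candidate in the list; the smooth curve bottoms out slightly lower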

A line with this value, Y = 6.8 * X, best fits the data points with minimum error.

[Figure: the line Y = 6.8 * X drawn through the data points]

This approach of tracing the loss function and identifying the true values of the equation's fixed parameters at the lowest point of the loss curve is known as gradient descent. For simplicity we have considered one variable, so the loss function is a 2-dimensional curve; in the case of multiple linear regression, the gradient descent surface is multi-dimensional.
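Strictly speaking, gradient descent does not scan a fixed list of candidates: it starts from a guess and repeatedly steps downhill along the derivative of the loss curve. A minimal sketch for our one-coefficient case, with an illustrative learning rate and iteration count, might look like this:

a = 0.0      # initial guess for the constant in Y = a * X
LR = 0.001   # learning rate (illustrative choice)
n = len(Independent_Variable)
for _ in range(1000):
    # derivative of the 1/2n loss with respect to the constant a
    gradient = np.sum((a * Independent_Variable - Dependent_Variable) * Independent_Variable) / n
    a = a - LR * gradient
print(a)     # settles near 6.9, the bottom of the loss curve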

We have learnt the inner workings of calculating the coefficient of the independent variable. Next, let us learn, step by step, how to calculate both the coefficient and the intercept/bias in linear regression.

Step 1: Just like earlier, let us consider a sample set of independent- and dependent-variable values. These are the available input and output data points; our goal is to formulate a linear equation that predicts the dependent-variable value with minimum error for an independent/input variable.

Dependent Variable = (Coefficient * Independent Variable) + Constant

In mathematical terms, y = (Coefficient * x) + c

Please note that the coefficient is also a constant term; it is simply the one multiplied by the independent variable in the equation.

Independent_Variable=np.array([1,2,4,3,5])
Dependent_Variable=np.array([1,3,3,2,5])

Step 2: We will assume initial values of zero for the coefficient m and the constant c. After each iteration of the error calculation, we will adjust m and c in proportion to the error, scaled by a small learning rate of 0.001. An epoch is one pass of this calculation over the entire set of available data points. As we increase the number of epochs, the solution becomes more accurate, but it consumes more time and computing power. Based on the business case, we can also decide on an acceptable error in the calculated values at which to stop the iterations.

LR=0.001
m=0
c=0
epoch=0

Step 3: In the code below, we run 1100 epochs over the available dataset and calculate the coefficient and constant values.

For each independent data point, we calculate the dependent value (i.e. yhat) and then the error between the calculated and actual dependent values.

Based on the error, we update the coefficient and constant for the next iteration:

New constant = current constant - (learning rate * error)

New coefficient = current coefficient - (learning rate * error * independent-variable value)

while epoch < 1100:
    epoch = epoch + 1
    counter = 0
    for x in Independent_Variable:
        yhat = (m * x) + c
        error = yhat - Dependent_Variable[counter]
        c = c - (LR * error)
        m = m - (LR * error * x)
        counter = counter + 1

We check the values of the coefficient and constant after 1100 epochs over the available dataset.

print("The final value of m", m)
print("The final value of c", c)

[Figure: printed final values of m and c]

Mathematically, the result can be represented as y = (0.81 * x) + 0.33.
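For example, at x = 4 the fitted line predicts y = (0.81 * 4) + 0.33 = 3.57, against the observed value of 3.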

[Figure: the fitted line y = (0.81 * x) + 0.33 drawn through the data points]
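If, as mentioned in Step 2, we would rather stop on an acceptable error than after a fixed number of epochs, one possible variant looks like this (the tolerance and the safety cap are illustrative choices, not part of the original article):

LR = 0.001
m, c = 0.0, 0.0
tolerance = 1e-6
for epoch in range(100000):   # safety cap on the number of epochs
    m_old, c_old = m, c
    for counter, x in enumerate(Independent_Variable):
        error = (m * x) + c - Dependent_Variable[counter]
        c = c - (LR * error)
        m = m - (LR * error * x)
    if abs(m - m_old) < tolerance and abs(c - c_old) < tolerance:
        break   # the parameters barely changed over a full epoch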

Finally, let us compare the earlier output with the result of the Scikit-learn linear regression algorithm.

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(Independent_Variable.reshape(-1,1), Dependent_Variable)
print(reg.coef_)
print(reg.intercept_)
[Figure: printed coefficient and intercept from Scikit-learn]

With 1100 epochs over the available dataset, the calculated coefficient and constant/bias come out very close to the output of the Scikit-learn linear regression algorithm.
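As an additional sanity check (an optional addition, not in the original article), NumPy's built-in least-squares polynomial fit should roughly agree with both results:

coef, intercept = np.polyfit(Independent_Variable, Dependent_Variable, 1)   # degree-1 least-squares fit
print(coef, intercept)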

I hope this article gave you a firm understanding of the behind-the-scenes mathematical calculations and concepts of linear regression. We have also seen how gradient descent is applied to find the optimal solution. In the case of multiple linear regression, the math and logic remain the same; it simply scales further, into more dimensions.

Translated from: https://towardsdatascience.com/linear-regression-algorithm-under-the-hood-math-for-non-mathematicians-c228d244e3f3
