Day 8: What is Linear Regression with Derivation.
Introduction to Linear Regression and a derivation of the Ordinary Least Squares method.
Jun 08, 2021

Simple Linear Regression / Univariate Linear Regression
As the name suggests, this algorithm is used for regression problems, and from the name we can also infer that Linear Regression is a linear model. This means the algorithm establishes a linear relationship between the input variable (X) and a single output variable (Y). It is the simplest model in Machine Learning.
When there are multiple input variables (X), it is called Multiple Linear Regression.
Want to jump right to the code? Check out the complete code on GitHub.
In this algorithm we consider one input variable, X, and one output variable, Y, and establish a linear relationship between them, defined as follows:
$$ Y = \beta_{0} + \beta_{1}X $$
- $\beta_{1}$ is called the scale factor, slope, or coefficient
- $\beta_{0}$ is called the intercept or bias coefficient
This equation is the familiar slope-intercept equation of a line, i.e. $y = mx + b$, where $m = \beta_{1}$ (slope of the line) and $b = \beta_{0}$ (intercept). Hence, in the Simple Linear Regression model we want to draw a line through X and Y that defines the relationship between them.
Assumptions for Linear Regression
- A linear relationship between features and the target variable.
- Additivity means that the effect of a change in one feature on the target variable does not depend on the values of the other features. For example, suppose a model for predicting a company's revenue has two features: the number of items "a" sold and the number of items "b" sold. When the company sells more of item "a", the revenue increases, independently of how many items "b" are sold. But if customers who buy "a" stop buying "b", the additivity assumption is violated.
- No correlation between features (no multicollinearity). Collinear features inflate the variance of the coefficient estimates and make them hard to interpret.
- Homoscedasticity, i.e. the residuals have roughly constant variance across the range of the features (a quick numerical check for the last two assumptions is sketched after this list).
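These assumptions are easier to internalize with a quick numerical check. The sketch below is an added illustration (not from the original post) on a hypothetical toy dataset: it eyeballs collinearity with a feature correlation matrix and homoscedasticity by comparing residual spread across two halves of the data.

```python
import numpy as np

# Hypothetical toy data with two features, used only to illustrate the checks.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 + 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Collinearity check: off-diagonal entries close to +/-1 signal trouble.
print("feature correlation matrix:\n", np.corrcoef(X, rowvar=False))

# Homoscedasticity check: residual spread should be similar across the range of a feature.
design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ beta
print("residual std (lower / upper half of X[:, 0]):",
      residuals[X[:, 0] < 0].std(), residuals[X[:, 0] >= 0].std())
```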
Different types of Linear Regression
- Univariate Linear Regression (Simple Linear Regression) - a technique used to model the relationship between a single input independent variable (feature) and an output dependent variable using a linear model, i.e. a line.
- Multiple Linear Regression - a statistical technique that uses several explanatory variables to predict the outcome of a single dependent variable.
- Ridge Regression - a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.
- Lasso Regression - a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). A small scikit-learn sketch comparing these estimators follows this list.
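To make the last two variants concrete, here is a minimal, illustrative scikit-learn sketch (an addition, not the original post's code) fitting plain, Ridge, and Lasso regression on the same hypothetical data; the `alpha` values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Hypothetical data: three features, the third nearly a copy of the first (collinear).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)
y = 4 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# Ridge shrinks the unstable collinear coefficients; Lasso tends to zero some out.
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```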
How does Simple Linear Regression work?
The main task of this algorithm is to find the best line which fits the given input data.
The hypothesis is defined by
$$ h_{\theta}(x_{i}) = \beta_{0} + \beta_{1} x_{i}$$
which can be rewritten as
$$\hat{y_{i}} = \beta_{0} + \beta_{1} x_{i} \tag{1}$$
where $h_{\theta}(x_{i})$ represents the predicted response for the $i^{th}$ observation.
Ordinary Least Squares
One of the most common and accurate methods for fitting this line is the Ordinary Least Squares (OLS) method.
Let us create a dataset first.
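The original code for this step isn't shown here (see the GitHub link); as a stand-in, here is a minimal sketch that generates a comparable synthetic dataset with NumPy. The coefficients, noise level, and sample size are arbitrary assumptions.

```python
import numpy as np

# Synthetic data for illustration: y is a line plus Gaussian noise.
rng = np.random.default_rng(7)
n = 50
x = rng.uniform(0, 10, size=n)
true_b0, true_b1 = 2.0, 3.5              # assumed "ground truth" intercept and slope
y = true_b0 + true_b1 * x + rng.normal(scale=2.0, size=n)
```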

Our goal is to find the line that fits the given data best.
That means the error between the predicted values and the actual values should be minimal.
The error is the (vertical) distance (d) of an actual point from the fitted line. The sum of the squared errors over all points can be denoted by E, so we can say
$$ E = \sum_{i=1}^{n} (y_{(actual)} - y_{(predicted)})^2$$
which, in terms of the $i^{th}$ prediction $\hat{y_{i}}$, can be written as
$$ E = \sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2 \tag{2}$$
where n is the total number of data points.
The errors are squared because some points lie above the line and some below; squaring prevents positive and negative errors from cancelling each other out.
The algorithm's goal is to minimize this error function.
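Continuing the (hypothetical) synthetic dataset above, the error function is straightforward to evaluate for any candidate pair of coefficients, which makes the minimization goal concrete:

```python
import numpy as np

def sum_squared_error(b0, b1, x, y):
    """Error E from equation (2): sum of squared residuals for the line b0 + b1*x."""
    y_hat = b0 + b1 * x
    return np.sum((y - y_hat) ** 2)

# A few arbitrary candidate lines; the best fit is the one with the smallest E.
# x and y are the arrays from the synthetic dataset sketch above.
for b0, b1 in [(0.0, 1.0), (2.0, 3.5), (5.0, 2.0)]:
    print(f"b0={b0}, b1={b1}, E={sum_squared_error(b0, b1, x, y):.1f}")
```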
Derivation
- Given n inputs and outputs $(x_{i}, y_{i})$
- Find the best fit for the line in equation 1, $\hat{y_{i}} = \beta_{0} + \beta_{1} x_{i}$
- The best fit line should have minimum error. To achieve this, we minimize the error function defined above.
Using equation 1 and equation 2
$$ E = \sum_{i=1}^{n} (y_{i} - \beta_{0} - \beta_{1} x_{i})^2 \tag{3} $$
To minimize the above cost function we use what we learned in calculus: a univariate optimization involves taking the derivative and setting it equal to 0. Similarly, this minimization problem is solved by setting the partial derivatives equal to 0. That is, we take the partial derivative of (3) with respect to $\beta_{0}$ and set it equal to 0, and then do the same for $\beta_{1}$. This gives us,
$$
\frac{\partial E}{\partial \beta_{0}}=\sum_{i=1}^{n}-2\left(y_{i}-\beta_{0}-\beta_{1} x_{i}\right)=0 \tag{4}
$$
and
$$
\frac{\partial E}{\partial \beta_{1}}=\sum_{i=1}^{n}-2x_{i}\left(y_{i}-\beta_{0}-\beta_{1} x_{i}\right)=0 \tag{5}
$$
respectively.
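If you would like to double-check the calculus, here is a small optional sketch (an addition, not part of the original post) that verifies both partial derivatives symbolically with SymPy:

```python
import sympy as sp

beta0, beta1, x_i, y_i = sp.symbols("beta0 beta1 x_i y_i")
E_i = (y_i - beta0 - beta1 * x_i) ** 2   # a single term of the error function (3)

d_b0 = sp.diff(E_i, beta0)
d_b1 = sp.diff(E_i, beta1)

# Both differences simplify to 0, confirming the summands of equations (4) and (5).
print(sp.simplify(d_b0 - (-2) * (y_i - beta0 - beta1 * x_i)))        # 0
print(sp.simplify(d_b1 - (-2) * x_i * (y_i - beta0 - beta1 * x_i)))  # 0
```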
Before going further, note the fact that
$$
\sum_{i=1}^{n} y_{i}=n \bar{y} \tag{6}
$$
where $\bar{y}$ is the mean of y.
Now let us find the value of $\beta_{0}$ first, from equation 4.
We can simply divide both sides by -2, which gives
$$
\frac{\partial E}{\partial \beta_{0}}=\sum_{i=1}^{n} \left(y_{i}-\beta_{0}-\beta_{1} x_{i}\right)=0
$$
Distributing the sum (the constant $\beta_{0}$ appears $n$ times) gives
$$
\sum_{i=1}^{n} y_{i}-n \beta_{0}-\beta_{1} \sum_{i=1}^{n} x_{i}=0
$$
Using equation 6 and its analogue for x, $\sum_{i=1}^{n} x_{i}=n \bar{x}$, this becomes
$$
n \beta_{0}=n \bar{y}-n \beta_{1} \bar{x}
$$
We simply divide everything by n and get the neat result
$$
\beta_{0}=\bar{y}-\beta_{1} \bar{x} \tag{7}
$$
Now let us find the value of $\beta_{1}$ from equation 5
We can again divide both sides by -2 and expand, which gives
$$
\sum_{i=1}^{n}\left(x_{i} y_{i}-\beta_{0} x_{i}-\beta_{1} x_{i}^{2}\right)=0
$$
Using equation 7, we can substitute $\beta_{0}$ in the above equation
$$
\sum_{i=1}^{n}\left(x_{i} y_{i}-\left(\bar{y}-\beta_{1} \bar{x}\right) x_{i}-\beta_{1} x_{i}^{2}\right)=0
$$
Note that the summation applies to every term in the above equation, so we can distribute it to get,
$$
\sum_{i=1}^{n} x_{i} y_{i}-\bar{y} \sum_{i=1}^{n} x_{i}+\beta_{1} \bar{x} \sum_{i=1}^{n} x_{i}-\beta_{1} \sum_{i=1}^{n} x_{i}^{2}=0
$$
Using the analogue of equation 6 for x, $\sum_{i=1}^{n} x_{i}=n \bar{x}$, and solving for $\beta_{1}$, we get
$$
\beta_{1}=\frac{\sum_{i=1}^{n} x_{i} y_{i}-n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_{i}^{2}-n \bar{x}^{2}}
$$
You can either look up or derive for yourself that $\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)=\sum_{i=1}^{n} x_{i} y_{i}-n \bar{x} \bar{y}$, and similarly that $\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}=\sum_{i=1}^{n} x_{i}^{2}-n \bar{x}^{2}$. Both identities follow from a little algebra; give it a shot yourself.
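As an added step for clarity, here is one way to expand the first identity (the second follows the same pattern), using equation 6 and $\sum_{i=1}^{n} x_{i}=n \bar{x}$:
$$
\begin{aligned}
\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right) &=\sum_{i=1}^{n}\left(x_{i} y_{i}-\bar{y} x_{i}-\bar{x} y_{i}+\bar{x} \bar{y}\right) \\
&=\sum_{i=1}^{n} x_{i} y_{i}-n \bar{x} \bar{y}-n \bar{x} \bar{y}+n \bar{x} \bar{y} \\
&=\sum_{i=1}^{n} x_{i} y_{i}-n \bar{x} \bar{y}
\end{aligned}
$$
Substituting both identities into the expression for $\beta_{1}$, we arrive at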
$$
\beta_{1}=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}
$$
Voilà! We are done.
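To tie the derivation together, here is a minimal sketch (reusing the hypothetical `x` and `y` arrays generated earlier) that computes $\beta_{1}$ and $\beta_{0}$ from the closed-form expressions and cross-checks them against NumPy's `polyfit`:

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form simple linear regression: beta_1 from the final formula, beta_0 from (7)."""
    x_bar, y_bar = x.mean(), y.mean()
    beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1

beta_0, beta_1 = ols_fit(x, y)           # x, y from the synthetic dataset sketch above
print("beta_0 =", round(beta_0, 3), "beta_1 =", round(beta_1, 3))

# Cross-check with NumPy's least-squares polynomial fit (degree 1 returns [slope, intercept]).
check_b1, check_b0 = np.polyfit(x, y, deg=1)
print("np.polyfit:", round(check_b0, 3), round(check_b1, 3))
```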
Get the complete code on GitHub.