# Day 9: Linear Regression from scratch using NumPy.

**Linear regression implementation from scratch using NumPy and Scikit Learn both.**

Jun 09, 2021

### Linear Regression Implementation using Ordinary Least Square in NumPy

In the previous article, we had derived the formula for Univariate Linear Regression which is

$$

Y = \beta_{0} + \beta_{1}X \tag{1}

$$

where

$$

\beta_{1}=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} \tag{2}

$$

and

$$

\beta_{0}=\bar{y}-\beta_{1} \bar{x} \tag{3}

$$

Now let us use the same to find the best fit line.

Wanna jump right to code, check out complete code on Github.

#### Create a dataset

In order to apply Linear Regression, we need a dataset. We use amazing Sklearn library to create this dataset

Let us first import all the needed libraries

```
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import make_regression
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 7]
# Helper function to plot line on graph
def plot_line(ax, slope, intercept, *args, **kwargs):
x_vals = np.array(ax.get_xlim())
y_vals = intercept + (slope * x_vals)
ax.plot(x_vals, y_vals , *args, **kwargs)
```

Now we create a dataset with just 1 feature of 100 samples and visualise the same

```
# Generate dataset for regression with singl feature
X, y = make_regression(n_samples=100, n_features=1, noise=15.0, random_state=6)
fig = plt.figure(figsize=(10,8))
plt.scatter(X, y)
plt.xlabel("X")
plt.ylabel("y")
```

Now in order to use equation 1 mentioned above we need to calculate $\beta_{0}$ and $\beta_{1}$. Equations for calculating both of them is mentioned on the top which we derived in the [previous article](https://nandeshwar.in/100-days-of-deep-learning/what-is-linear-regression-with-derivation/).

We first calculate the mean of both X and y ie $\bar{X}$ and $\bar{y}$ respectively. And then we do the vector calculation using the same for **equation 2**.

```
# mean_x = np.mean(X)
mean_y = np.mean(y)
# from equation 2
b1 = sum(np.dot((X-mean_x).T, y-mean_y))/sum(np.square(X-mean_x))
# from equation 3
b0 = mean_y - b1 * mean_x
print(f"b0:{b0} b1:{b1}")
# Output
# b0:[-0.22071127] b1:[55.63351962]
```

##### Now we have values for both $\beta_{0}$ and $\beta_{1}$.

Let us plot this to see how it performs on our dataset

```
ax = plt.gca()
x_min, x_max = min(X), max(X)
y_min, y_max = min(y), max(y)
ax.scatter(X, y, label='Scatter Plot')
plot_line(ax, b1[0], b0[0], *['r'], **{'linewidth': 1, 'label': 'Regression Line'})
plt.legend()
plt.xlabel("X")
plt.ylabel("y")
plt.show()
```

##### Great!

#### Scikit Learn Implementation

Now let us implement the same algorithm using sklearn library

```
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
model = LinearRegression()
model.fit(X, y)
# Y Prediction
y_pred = model.predict(X)
print(model.score(X, y))
# Output
# 0.9380085983573465
print(model.intercept_, model.coef_)
# Output
#(-0.22071126775203442, array([55.63351962]))
```

We can see that intercept and coefficient calculated by Sklearn are exactly as the values calulated by our NumPy model above.

```
# Calculating RMSE and R2 Score
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2_score = model.score(X, y)
print('RMSE = ', np.sqrt(mse))
print('R2 Score =', r2_score)
# Output
# RMSE = 14.230445772845755
# R2 Score = 0.9380085983573465
```

We have very low value of RMSE score and a good R^{2} score. Our model is pretty good.

Get the complete code on GitHub.