# Day 9: Linear Regression from scratch using NumPy.

Linear regression implemented from scratch with NumPy, then checked against scikit-learn.

Jun 09, 2021

### Linear Regression Implementation using Ordinary Least Square in NumPy

In the previous article, we derived the formula for univariate linear regression, which is

$$Y = \beta_{0} + \beta_{1}X \tag{1}$$

where

$$\beta_{1}=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} \tag{2}$$

and

$$\beta_{0}=\bar{y}-\beta_{1} \bar{x} \tag{3}$$

Now let us use these formulas to find the best-fit line.
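Before moving to a generated dataset, it can help to see equations 2 and 3 on numbers small enough to check by hand. The following is a minimal sketch; the five data points are made up purely for illustration and are not part of the dataset used later.

```python
import numpy as np

# Tiny made-up dataset (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_bar = x.mean()  # 3.0
y_bar = y.mean()  # 4.0

# Equation 2: slope = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
numerator = np.sum((x - x_bar) * (y - y_bar))  # 6.0
denominator = np.sum((x - x_bar) ** 2)         # 10.0
beta_1 = numerator / denominator

# Equation 3: intercept = y_bar - slope * x_bar
beta_0 = y_bar - beta_1 * x_bar

print(f"beta_0={beta_0:.2f} beta_1={beta_1:.2f}")
# beta_0=2.20 beta_1=0.60
```

By hand, the numerator is 4 + 0 + 0 + 0 + 2 = 6 and the denominator is 4 + 1 + 0 + 1 + 4 = 10, so the slope is 0.6 and the intercept is 4 - 0.6 × 3 = 2.2.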

Want to jump straight to the code? Check out the complete code on GitHub.

#### Create a dataset

To apply linear regression, we need a dataset. We will use scikit-learn's `make_regression` to generate one.

Let us first import all the libraries we need.

import numpy as np

from matplotlib import pyplot as plt
from sklearn.datasets import make_regression

%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 7]

# Helper function to plot a line across the current x-range of the axes
def plot_line(ax, slope, intercept, *args, **kwargs):
    x_vals = np.array(ax.get_xlim())
    y_vals = intercept + (slope * x_vals)
    ax.plot(x_vals, y_vals, *args, **kwargs)

Now we create a dataset of 100 samples with a single feature and visualise it.

# Generate dataset for regression with a single feature
X, y = make_regression(n_samples=100, n_features=1, noise=15.0, random_state=6)

fig = plt.figure(figsize=(10,8))

plt.scatter(X, y)
plt.xlabel("X")
plt.ylabel("y")

To use equation 1, we need to calculate $\beta_{0}$ and $\beta_{1}$. The equations for both are given at the top of this article and were derived in the [previous article](https://nandeshwar.in/100-days-of-deep-learning/what-is-linear-regression-with-derivation/).

We first calculate the means of X and y, i.e. $\bar{x}$ and $\bar{y}$ respectively, and then use them in a vectorised calculation of **equation 2**.

mean_x = np.mean(X)
mean_y = np.mean(y)

# from equation 2
b1 = sum(np.dot((X-mean_x).T, y-mean_y))/sum(np.square(X-mean_x))

# from equation 3
b0 = mean_y - b1 * mean_x

print(f"b0:{b0} b1:{b1}")

# Output
# b0:[-0.22071127] b1:[55.63351962]

Now we have values for both $\beta_{0}$ and $\beta_{1}$.

Let us plot the resulting line to see how well it fits our dataset.
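The values above print as one-element arrays because `make_regression` returns `X` with shape `(n_samples, 1)` while `y` has shape `(n_samples,)`. If plain scalars are preferred, one option is to flatten `X` first. The sketch below assumes stand-in data generated with NumPy's random generator (the true slope 2.0 and intercept 3.0 are arbitrary choices for illustration), not the dataset from this article.

```python
import numpy as np

# Hypothetical stand-in data with the same shapes as make_regression output
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))                                  # shape (100, 1)
y = 3.0 + 2.0 * X.ravel() + rng.normal(scale=0.5, size=100)    # shape (100,)

# Flatten X to 1-D so that every intermediate result is 1-D or scalar
x = X.ravel()
x_c = x - x.mean()   # centred x
y_c = y - y.mean()   # centred y

# Equation 2 as a ratio of two dot products
b1 = np.dot(x_c, y_c) / np.dot(x_c, x_c)

# Equation 3
b0 = y.mean() - b1 * x.mean()

print(X.shape, y.shape, np.ndim(b0), np.ndim(b1))
# (100, 1) (100,) 0 0
```

With scalars, there is no need to index into the result (as in `b1[0]`) before passing it to a plotting helper.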

ax = plt.gca()
ax.scatter(X, y, label='Scatter Plot')

# b1 and b0 are one-element arrays, so take the first element of each
plot_line(ax, b1[0], b0[0], 'r', linewidth=1, label='Regression Line')
plt.legend()
plt.xlabel("X")
plt.ylabel("y")
plt.show()

#### Scikit Learn Implementation

Now let us fit the same model using the scikit-learn library.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(X, y)

# Y Prediction
y_pred = model.predict(X)

print(model.score(X, y))
# Output
# 0.9380085983573465

print(model.intercept_, model.coef_)
# Output
#(-0.22071126775203442, array([55.63351962]))

We can see that the intercept and coefficient calculated by scikit-learn match the values calculated by our NumPy implementation above.
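Comparing printed numbers by eye works here, but the agreement can also be checked programmatically. The following is a small sketch on a made-up five-point dataset (not the one generated in this article) that compares the closed-form estimates with scikit-learn's fitted attributes using `np.isclose`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Closed-form estimates from equations 2 and 3
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# scikit-learn expects a 2-D feature matrix, hence the reshape
model = LinearRegression().fit(x.reshape(-1, 1), y)

# Compare with a floating-point tolerance instead of exact equality
same_slope = np.isclose(b1, model.coef_[0])
same_intercept = np.isclose(b0, model.intercept_)
print(same_slope, same_intercept)
```

A tolerance-based comparison is used because the two routes compute the same quantities through different floating-point operations, so the last few digits may differ.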

# Calculating RMSE and R2 Score
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2_score = model.score(X, y)

print('RMSE = ', rmse)
print('R2 Score =', r2_score)

# Output
# RMSE =  14.230445772845755
# R2 Score = 0.9380085983573465

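The RMSE of about 14.2 is close to the noise level of 15.0 that we passed to `make_regression`, so the remaining error is roughly what the added noise alone would produce. The R2 score of about 0.94 means the line explains most of the variance in y. Our model fits this dataset well.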
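In keeping with the from-scratch theme, both metrics can also be computed directly from their definitions: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a made-up five-point dataset and a line with intercept 2.2 and slope 0.6 (the least-squares fit for those points); these values are for illustration only.

```python
import numpy as np

# Tiny made-up dataset and its least-squares line (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
b0, b1 = 2.2, 0.6

y_pred = b0 + b1 * x
residuals = y - y_pred

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE = {rmse:.4f}, R2 = {r2:.4f}")
# RMSE = 0.6928, R2 = 0.6000
```

Note that RMSE is expressed in the units of y, so whether a value is "low" depends on the scale of the target, whereas R2 is unitless.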

Get the complete code on GitHub.
