Day 9: Linear Regression from scratch using NumPy.

Linear regression implemented from scratch using both NumPy and Scikit-Learn.

By Nandeshwar

Jun 09, 2021



Linear Regression Implementation using Ordinary Least Squares in NumPy

In the previous article, we derived the formula for univariate linear regression:

$$
Y = \beta_{0} + \beta_{1}X \tag{1}
$$

where

$$
\beta_{1}=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} \tag{2}
$$

and

$$
\beta_{0}=\bar{y}-\beta_{1} \bar{x} \tag{3}
$$
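To make these formulas concrete, here is a tiny worked example (illustrative numbers, not from the dataset below): for the three points $(1, 2)$, $(2, 4)$ and $(3, 6)$, we have $\bar{x} = 2$ and $\bar{y} = 4$, so

$$
\beta_{1} = \frac{(1-2)(2-4) + (2-2)(4-4) + (3-2)(6-4)}{(1-2)^{2} + (2-2)^{2} + (3-2)^{2}} = \frac{2 + 0 + 2}{1 + 0 + 1} = 2
$$

$$
\beta_{0} = \bar{y} - \beta_{1}\bar{x} = 4 - 2 \cdot 2 = 0
$$

giving the line $Y = 2X$, which indeed passes through all three points.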

Now let us use these equations to find the best-fit line.

Want to jump right to the code? Check out the complete code on GitHub.

Create a dataset

In order to apply linear regression, we need a dataset. We use the Scikit-Learn library to generate one.

Let us first import all the needed libraries

import numpy as np

from matplotlib import pyplot as plt
from sklearn.datasets import make_regression

%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 7]


# Helper function to plot line on graph
def plot_line(ax, slope, intercept, *args, **kwargs):
    x_vals = np.array(ax.get_xlim())
    y_vals = intercept + (slope * x_vals)
    ax.plot(x_vals, y_vals , *args, **kwargs)
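For instance, a hypothetical call like this draws the line $y = 1 + 2x$ as a green dashed line across the axis's current x-range:

# Hypothetical usage of the helper above: slope 2, intercept 1
ax = plt.gca()
ax.set_xlim(-3, 3)
plot_line(ax, 2.0, 1.0, 'g--', label='y = 1 + 2x')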

Now we create a dataset of 100 samples with a single feature and visualise it:

# Generate dataset for regression with a single feature
X, y = make_regression(n_samples=100, n_features=1, noise=15.0, random_state=6)

fig = plt.figure(figsize=(10,8))

plt.scatter(X, y)
plt.xlabel("X")
plt.ylabel("y")

[Figure: scatter plot of the generated dataset (day9-fig1.png)]
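Before we compute anything, it is worth checking the shapes make_regression returns (a quick aside), since the vectorised calculations below rely on them:

print(X.shape, y.shape)

# Output
# (100, 1) (100,)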

Now, in order to use equation 1 above, we need to calculate $\beta_{0}$ and $\beta_{1}$. The equations for calculating both of them are given at the top; we derived them in the [previous article](https://nandeshwar.in/100-days-of-deep-learning/what-is-linear-regression-with-derivation/).

We first calculate the means of X and y, i.e. $\bar{x}$ and $\bar{y}$ respectively, and then use them in the vectorised computation of **equation 2**.

mean_x = np.mean(X)
mean_y = np.mean(y)

# from equation 2
b1 = sum(np.dot((X-mean_x).T, y-mean_y))/sum(np.square(X-mean_x))

# from equation 3
b0 = mean_y - b1 * mean_x

print(f"b0:{b0} b1:{b1}")

# Output
# b0:[-0.22071127] b1:[55.63351962]

Now we have values for both $\beta_{0}$ and $\beta_{1}$.
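As a cross-check (a small sketch, not required for the rest of the post), np.polyfit with degree 1 solves the same least-squares problem and should recover the same slope and intercept:

# Cross-check: a degree-1 polynomial fit is the same least-squares line
slope, intercept = np.polyfit(X.ravel(), y, deg=1)
print(f"slope:{slope} intercept:{intercept}")
# Expected: slope ~55.63, intercept ~-0.22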

Let us plot this line to see how it fits our dataset:

ax = plt.gca()
ax.scatter(X, y, label='Scatter Plot')

plot_line(ax, b1[0], b0[0], 'r', linewidth=1, label='Regression Line')
plt.legend()
plt.xlabel("X")
plt.ylabel("y")
plt.show()

[Figure: scatter plot with the fitted regression line (day9-fig2.png)]

Great!
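Before moving on, we can also score this NumPy fit ourselves as a quick sanity check (a minimal sketch reusing the b0 and b1 computed above):

# Predict with our coefficients and compute the root mean squared error
y_hat = (b0 + b1 * X).ravel()
print(np.sqrt(np.mean((y - y_hat) ** 2)))
# This should match the RMSE Scikit-Learn reports below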

Scikit-Learn Implementation

Now let us implement the same algorithm using the Scikit-Learn library:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(X, y)

# Y Prediction
y_pred = model.predict(X)

print(model.score(X, y))
# Output
# 0.9380085983573465

print(model.intercept_, model.coef_)
# Output
#(-0.22071126775203442, array([55.63351962]))

We can see that the intercept and coefficient calculated by Scikit-Learn match the values calculated by our NumPy implementation above.
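We can make that comparison explicit with a small programmatic check (illustrative):

# Both implementations should agree to floating-point precision
assert np.allclose(model.intercept_, b0)
assert np.allclose(model.coef_, b1)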

# Calculating RMSE and R2 Score
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2_score = model.score(X, y)

print('RMSE = ', rmse)
print('R2 Score =', r2_score)

# Output
# RMSE =  14.230445772845755
# R2 Score = 0.9380085983573465

The RMSE is low (close to the standard deviation of the noise, 15.0, that we added when generating the data) and the R2 score is high. Our model fits the data well.
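For intuition, the R2 score can also be computed directly from its definition, $R^{2} = 1 - SS_{res}/SS_{tot}$ (a short sketch reusing y_pred from above):

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
print(1 - ss_res / ss_tot)
# Output should match model.score(X, y)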

Get the complete code on GitHub.


Tags

Code
Machine Learning
Beginner