Linear regression Part 2

By Mohendra Roy

In our earlier class, we have discussed the cost function and its importance in linear regression. The objective in any regression method is to minimise the cost function. In our previous class, we did it manually. However, we need an optimization algorithm to do it automatically. The gradient descent algorithm is one of the best optimization methods for linear regression problem. Here we will try to understand about this algorithm.

In [1]:
import numpy as np
from IPython.display import display, Math, Latex
import matplotlib.pyplot as plt
from matplotlib.legend_handler import HandlerLine2D
import pylab as plb
from IPython.display import Image, display

As we discussed, we choose the sum of square error as a cost function for this problem. Here our objective is to find the minimum value of C by adjusting the parameter a and b. For simplicity let us consider a = P1 and b = P2

In [2]:
display(Math(r'C(P_1, P_2) =\frac 1{2m} \sum_{i=1}^m (H_i - Y_i)^2')) # cost function 

Gradient descent:

 The equation of gradient descent is as given below
In [3]:
display(Math(r'P_k = P_k -\alpha\frac{\partial }{\partial P_k}C(P_1,P_2) \ ( for \ k =1 \ and \ k =2)'))

Here α is the learning rate or the step at which the gradient will descent. For simplicity let us take P2 = 0. The gradient of the cost function J(P1) with respect to P1 can be illustrated as given below. Now if the gradient is positive the value of P1 will decrease (as shown by the green arrow). However, if the gradient is negative then P1 will increase (as shown by the maroon arrow), and eventually, it will reach to minima of J with a step of α.

In [4]:
display(Image(filename='Linear Regression Gradnt.png'))

The learning rate will tell us that how quickly or slowly the optimization process will happen. However, there is a trade-off between the learning rate and accuracy of the optimization. This can be well explained with this diagram below. Let us suppose one is to step down with a step of two (as shown in red line) then it will not reach the bottom floor. If the step down happens in one steps then it will reach to the bottom floor, however, this will take much time compared to a two-step process. Therefore, we have to choose an optimized learning rate that will provide fast and better accuracy.

In [5]:
display(Image(filename='Learning Rate.png'))

Now if we expand the equation of the gradient descent as below.

In [6]:
display(Math(r'P_k = P_k -\alpha\frac{\partial }{\partial P_k}C(P_1,P_2) \ ( for \ k =1 \ and \ k =2)'))


In [7]:
display(Math(r'\frac{\partial }{\partial P_k}C(P_1,P_2) =\frac{\partial }{\partial P_k}\frac{1}{2m}\sum_{i =1}^{m}(P_2 + P_1X^i-Y^i)^2'))
In [8]:
display(Math(r'for\ P_1:\ \frac{\partial }{\partial P_1}\frac{1}{2m}\sum_{i =1}^{m}(P_2 + P_1X^i-Y^i)^2 = \frac{1}{2m}\sum_{i =1}^{m}(P_2 + P_1X^i-Y^i)X^i'))
In [9]:
display(Math(r'for\ P_2:\ \frac{\partial }{\partial P_2}\frac{1}{2m}\sum_{i =1}^{m}(P_2 + P_1X^i-Y^i)^2 = \frac{1}{2m}\sum_{i =1}^{m}(P_2 + P_1X^i-Y^i)'))

In linear regression, the gradient descent will always find a global minimum, since there are no local minima for linear regression. Here we will implement the gradient descent algorithm from scratch.

In [10]:
# function for evaluation of cost 
def cost_f(X,Y,P): # X and Y are for the data and P is the parameters for the hypothesis.
    H = P[0]*X + P[1]
    C = (np.sum(np.square(H - Y)))/2*len(X)
    return C

In [11]:
# Function for gradient descent.
def g_d(alpha, X, Y, P, num_itr): # alpha=learning rate, num_itr = number of iteration, P = parameters of hypothesis
    cf = np.zeros((num_itr))
    pplt = np.zeros((2, num_itr))
    for i in range(num_itr):
        p1_new = P[0] - alpha*( np.sum( (P[0]*X + P[1] - Y)*X )  ) / 2*len(X)
        p2_new = P[1] - alpha*( np.sum( (P[0]*X + P[1] - Y) )  ) / 2*len(X)
        # Update of new parameters
        P[0] = p1_new
        P[1] = p2_new
        cf[i] = cost_f(X,Y,P)
        pplt[0,i] = P[0]
        pplt[1,i] = P[1]
    return P, pplt, cf
In [12]:
# initial parameters P1 = 0 and P2 = 0 
P = [0.0, 0.0]
P = np.array(P)
In [13]:
X = [1, 2, 3, 4, 5]
X = np.array(X)
#X = X/max(X)
Y = [10, 40, 50, 78, 83]
Y = np.array(Y)
#Y = Y/max(Y)
In [14]:
itr = 100 # number of iteration 
alpha = 0.01 # learning rate
In [15]:
# Applying gradient descent algorithm 
PP, PPlt, CF = g_d(alpha, X, Y, P, itr) 

PP stored the optimized parameters, 

PPlt stored all the values of parametres of each iteration

CF stored the cost fuction of each iteration

'\nPP stored the optimized parameters, \n\nPPlt stored all the values of parametres of each iteration\n\nCF stored the cost fuction of each iteration\n\n'
In [16]:
# display the figure that was generated in plotly using the optimized values with gradient descent algorithm.
display(Image(filename='optimization of parmeters.png'))

This is the surface plot of the P1, P2 and cost function. From this graph, it is clear that the optimum value of P1 =18 and P2 = -2.

In [17]:
# arranging data for plot.
P1f = PP[0]
P2f = PP[1]
In [18]:
Hf = P1f*X + P2f 
In [19]:
line1, = plt.plot(X,Y, "ro", markersize=5, label='Original Data')
line2, = plt.plot(X,Hf, label='Optimised Model')
plt.title('X vs Y')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=4)})

The above graph is the automatically optimised linear regression model. Next, we will implement the same in TensorFlow



Leave a Reply

Your email address will not be published. Required fields are marked *