Gradient Descents

In this post, we are going to discuss Gradient Descents; what it is and how to use it. Gradient Descent is used in many machine learning algorithms to find the local miniumum in differentiable functions where there is a linear relationship. It goes hand in hand with optimization problems. In this tutorial, we will focus on linear regressions. A linear regression is used to model the relationship between one or more variables. In our case, we have an input \(x\) and an output \(y\), but we do not know the slope \(m\) or the intercept \(b\).

\[y = mx + b\]

To build our model, we need to tune two parameters, namely \(m\) and \(b\). To do so, we need to use the mean squared error (MSE) as our loss function. This function measures the average squared difference between the estimated values and the actual value. We expect this to be 0 if our predicted values equal our actual value.

\[MSE = \frac{1}{N}\Sigma_{i=1}^{n}(Y_{i}-\hat Y_{i})^{2} \] Therefore, we will need optimize the MSE for a linear regression using the gradient descent. A gradient is defined as follows:

\[ \nabla J(\theta) = [\frac{\partial J}{\partial \theta_{1}}, \frac{\partial J}{\partial \theta_{2}}, ..., \frac{\partial J}{\partial \theta_{p}}] \] The gradient descent is defined in the following equation. The algorithm will iterate over and over using the gradient descent to find new values of \(m\) and \(b\). This will continue until MSE converges to zero, letting us know that our predicted values equal to our actual values.

\[ \theta_{n + 1} = \theta_{n} - \eta \nabla J(\theta_{n})\]

Let’s first solve for \(\nabla J(\theta_{n})\) by plugging in the MSE equation and finding the partial derivatives with respect to \(m\) and \(b\).

\[MSE = \frac{1}{N}\Sigma_{i=1}^{n}(y_{i}-(mx_{i}+b))^{2} \]

\[ f(m,b) = \frac{1}{N}\Sigma_{i=1}^{n}(y_{i}-(mx_{i}+b))^{2} \] \[ \frac{\partial f(m,b)}{\partial m} = \frac{1}{N}\Sigma_{i=1}^{n}-2(y_{i}-(mx_{i}+b)) \frac{\partial f}{\partial m}(mx_{i}+b))\] \[ \frac{\partial f(m,b)}{\partial m} = \frac{1}{N}\Sigma_{i=1}^{n}-2x_{i}(y_{i}-(mx_{i}+b))\]

\[ \frac{\partial f(m,b)}{\partial b} = \frac{1}{N}\Sigma_{i=1}^{n}-2(y_{i}-(mx_{i}+b)) \frac{\partial f}{\partial b}(mx_{i}+b))\] \[ \frac{\partial f(m,b)}{\partial b} = \frac{1}{N}\Sigma_{i=1}^{n}-2(y_{i}-(mx_{i}+b))\] Now we have our two final equations. We can begin implementing them.

\[ \hat m = m - \eta(\frac{1}{N}\Sigma_{i=1}^{n}-2x_{i}(y_{i}-(mx_{i}+b)))\]

\[ \hat b = b - \eta(\frac{1}{N}\Sigma_{i=1}^{n}-2(y_{i}-(mx_{i}+b)))\] We will input two vectors \(x\) and \(y\) and solve for \(m\) and \(b\). In addition, we set a learning rate, \(\eta\) of 0.001. This controls the magnitude of the gradient. If it is too large, we will miss the minima, but if it is too smmall, the number of iterations will increase. For this function, we will iterate 10,000 times and break the loop if \(MSE \le \epsilon\), \(\epsilon = 0.001\).

gradient_descent <- function(x,y, epsilon = 0.001, lr = 0.001, steps = 10000){
  epsilon = 0.0001
  m = runif(1) #Initialize a random number
  b = runif(1) #Initialize a random number
  N = length(x)
  log = c()
  mseloss = c()
  for (step in 1:steps){
    f = y - (m*x + b) #Subtracting the estimated value from the actual value
    mseloss = c(mseloss, sum(f ** 2) / N)
    dm <- -2 / N * sum(x %*% f)
    db <- -2 / N * sum(f)
    m = m - lr * dm
    b = b - lr * db
    log = rbind(log, c(m,b))
    if (sum(f ** 2) / N <= epsilon){ #stop the algorithm once we have converged to zero
      break
    }
  }
  list = list('m' = m, 'b' = b, 'steps' = step, 'log' = log, 'mseloss' = mseloss)
  return(list)
}

We can intialize a simple example by setting x = c(1,2,3,4,5) and y = 2 * x. We expect m = 2 and b = 0.

x = c(1,2,3,4,5)
y = 2 * x
ptm <- proc.time()
res <- gradient_descent(x,y)
proc.time() - ptm
##    user  system elapsed 
##   0.700   0.437   1.139
res$m
## [1] 1.98832
res$b
## [1] 0.04216883
cat('The predicted values are:', head(res$m * x + res$b))
## The predicted values are: 2.030489 4.018809 6.007129 7.995448 9.983768
cat('\nThe actual values are:', head(y))
## 
## The actual values are: 2 4 6 8 10

We can observe convergance by plotting the MSE against the number of iterations.

library(ggplot2)
df <- data.frame('iters' = 1:res$steps, 'mseloss' = res$mseloss)
ggplot(df, aes(x = iters, y = mseloss)) + geom_point() + theme_minimal() + ylab('MSE Loss')

It appears to converge early but continues on to meet our requirement of \(MSE \le \epsilon\). Let’s repeat this with a larger data set. We would expect this to take longer.

x = runif(10000)
y = 2 * x
ptm <- proc.time()
res <- gradient_descent(x,y)
proc.time() - ptm
##    user  system elapsed 
##   2.374   1.306   3.685
res$m
## [1] 1.65309
res$b
## [1] 0.1860987
cat('The predicted values are:', head(res$m * x + res$b))
## The predicted values are: 0.3324867 1.440416 0.2772778 0.1880146 0.2445843 0.4983348
cat('\nThe actual values are:', head(y))
## 
## The actual values are: 0.1771083 1.517542 0.1103135 0.002317988 0.07075917 0.3777606
df <- data.frame('iters' = 1:res$steps, 'mseloss' = res$mseloss)
ggplot(df, aes(x = iters, y = mseloss)) + geom_point() + theme_minimal() + ylab('MSE Loss')

As you can see, the predicted values don’t match the actual values. In addition, the MSE loss does not converge. This suggests we may need more iterations. Let’s try with 50,000 iterations.

ptm <- proc.time()
res <- gradient_descent(x,y, steps = 50000)
proc.time() - ptm
##    user  system elapsed 
##   9.934   5.783  15.746
res$m
## [1] 1.965831
res$b
## [1] 0.01832982
cat('The predicted values are:', head(res$m * x + res$b))
## The predicted values are: 0.1924123 1.509946 0.1267587 0.0206082 0.0878801 0.3896366
cat('\nThe actual values are:', head(y))
## 
## The actual values are: 0.1771083 1.517542 0.1103135 0.002317988 0.07075917 0.3777606
df <- data.frame('iters' = 1:res$steps, 'mseloss' = res$mseloss)
ggplot(df, aes(x = iters, y = mseloss)) + geom_point() + theme_minimal() + ylab('MSE Loss')

Now it looks like we have converged and reached an optimal value. However, the time it took was much longer. To wrap up, we were able to determmine a suitable \(m\) and \(b\) to satisfy our optimization problem. However, this algorithm can be slow if we are working with large matrices. In that case, we can use Stochastic Gradient Descent and use only a subset of the matrix with one sample at a time for the optimimization.

Let’s see what this would look like.

s_gradient_descent <- function(x,y, epsilon = 0.001, lr = 0.001, steps = 10000, batch_size = 1){
  epsilon = 0.0001
  m = runif(1) #Initialize a random number
  b = runif(1) #Initialize a random number
  log = c()
  mseloss = c()
  for (step in 1:steps){
    index <- sample.int(length(y), batch_size)
    N = length(index)
    Xs <- x[index]
    Ys <- y[index]
    f = Ys - (m*Xs + b) #Subtracting the estimated value from the actual value
    mseloss = c(mseloss, sum(f ** 2) / N)
    dm <- -2 / N * sum(Xs %*% f)
    db <- -2 / N * sum(f)
    m = m - lr * dm
    b = b - lr * db
    log = rbind(log, c(m,b))
    if (sum(f ** 2) / N <= epsilon){ #stop the algorithm once we have converged to zero
      break
    }
  }
  list = list('m' = m, 'b' = b, 'steps' = step, 'log' = log, 'mseloss' = mseloss)
  return(list)
}

With the above function, we will only be using a subset of our data for optimization. In this case, I will set it to 100. Feel free to play around with the parameters.

ptm <- proc.time()
sgd_res <- s_gradient_descent(x,y, steps = 50000, batch = 100)
proc.time() - ptm
##    user  system elapsed 
##   5.224   3.328   8.565
sgd_res$m
## [1] 1.960288
sgd_res$b
## [1] 0.02123877
cat('The predicted values are:', head(sgd_res$m * x + sgd_res$b))
## The predicted values are: 0.1948304 1.508649 0.1293619 0.02351073 0.09059295 0.3914986
cat('\nThe actual values are:', head(y))
## 
## The actual values are: 0.1771083 1.517542 0.1103135 0.002317988 0.07075917 0.3777606

As you can see, we nearly cut out run time in half by just using stochastic gradient descent.

Additonal Resources:

Written on April 5, 2021

[ tutorials  R  machine-learning  ]