# Stochastic gradient descent in R

### In Data Science

Stochastic gradient descent (SGD) is a simple but very efficient approach to fitting linear classifiers and regressors under convex loss functions.

## What is Stochastic gradient descent

Although gradient descent is one of the first topics studied in optimization theory and machine learning, it is hard to overstate the importance of one of its modifications: stochastic gradient descent, which we will often refer to simply as SGD (Stochastic Gradient Descent).

Recall that the essence of gradient descent is to minimize a function by taking small steps in the direction in which the function decreases fastest. The name of the method comes from mathematical analysis: the gradient, the vector of partial derivatives of a function, points in the direction of the function's fastest increase. By moving in the direction of the anti-gradient, one therefore decreases the values of the function the fastest.

## Stochastic gradient descent in R: example #1
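Before the worked example, here is a minimal sketch of the plain gradient step described above, minimizing the one-dimensional function f(x) = x², whose derivative is 2x. The starting point and learning rate are illustrative choices, not taken from the example below:

```r
# Minimize f(x) = x^2 with plain gradient descent: f'(x) = 2x
eta <- 0.1               # learning rate (step size)
x <- 5                   # illustrative starting point
for (i in 1:100) {
  x <- x - eta * 2 * x   # step against the gradient
}
x                        # very close to the minimizer 0
```

Each iteration multiplies x by (1 - 2η), so the iterates contract geometrically toward the minimum at 0.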

Step 1: Packages are installed and loaded to create a model showing gradient descent.


```r
# Installing packages
install.packages("e1071")
install.packages("caTools")
install.packages("caret")

# Loading packages
library(e1071)
library(caTools)
library(caret)
```



Step 2: The mtcars dataset, pre-loaded in R, is attached for use with gradient descent.


```r
attach(mtcars)
```



Step 3: A linear regression model predicts miles per gallon (mpg, mileage) from the hp (horsepower) variable in the mtcars dataset. Suitable values for the intercept and slope are determined using the lm() function, which fits a linear model and returns the best-fitting intercept and slope coefficients.


```r
# Model building
model_grad <- lm(mpg ~ hp, data = mtcars)
```



Step 4: The coefficients in the model are determined using coef().


```r
# Coefficients in model
coef(model_grad)

#Output:
## (Intercept)          hp
## 30.09886054 -0.06822828

# The intercept is 30.09886054 on the y axis and the gradient (slope) is -0.06822828.
# The negative slope indicates an inverse relationship between mpg and horsepower:
# a one-unit increase in horsepower results in roughly a 0.068-unit decrease in mpg,
# that is, mileage.
```
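The interpretation above can be checked by reproducing a fitted value by hand from the reported coefficients. The horsepower value 110 is an illustrative choice (it happens to be the hp of the first car in mtcars):

```r
# Reproducing a fitted value by hand from the reported coefficients
intercept <- 30.09886054
slope     <- -0.06822828
hp_value  <- 110                                # illustrative horsepower value
predicted_mpg <- intercept + slope * hp_value   # about 22.59 mpg
predicted_mpg
```

This matches what predict() returns for the same hp value, since a simple linear model is just intercept + slope × x.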



Step 5: predictions are made for the model using predict().


```r
# Model prediction
pred_mpg <- predict(model_grad)   # fitted mpg values for the mtcars rows
head(pred_mpg)
```

Moreover, the fitted regression line can be drawn over the data using abline(). The mean squared error (MSE) can also be calculated by summing the squared differences between the observed y-values and the predicted y-values, and then dividing by the number of observations n.
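The MSE computation just described can be sketched directly on the mtcars fit:

```r
# Mean squared error of the fitted model, computed by hand
model_grad <- lm(mpg ~ hp, data = mtcars)
predicted  <- predict(model_grad)                      # fitted mpg values
mse <- sum((mtcars$mpg - predicted)^2) / nrow(mtcars)  # sum of squared errors / n
mse
```

The same value is obtained more compactly with `mean((mtcars$mpg - predicted)^2)`.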

• The method is suitable for dynamic (online) learning, when training objects arrive in a stream and the weight vector w must be updated quickly.
• The algorithm can learn on extremely large samples, because a random subsample can be sufficient for training.
• Different learning strategies are possible. If the sample is extremely large, or learning happens online, it is acceptable not to store the training objects. If the sample is small, the same objects can be presented for training repeatedly.
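The online setting in the first bullet can be sketched as a single-observation SGD update for linear regression. The function name, the simulated stream, and the true parameters (2, 3) are all illustrative assumptions:

```r
# One online SGD update for linear regression: when a new (x_i, y_i) arrives,
# nudge the weight vector w using only that single observation.
sgd_update <- function(w, x_i, y_i, eta = 0.01) {
  e <- y_i - sum(w * x_i)   # residual for this one object
  w + eta * e * x_i         # gradient step on the squared loss
}

w <- c(0, 0)                # (intercept, slope), start at zero
set.seed(1)
for (i in 1:20000) {        # simulated stream: y = 2 + 3x + noise
  x <- runif(1)
  y <- 2 + 3 * x + rnorm(1, sd = 0.1)
  w <- sgd_update(w, c(1, x), y)
}
w                           # close to the true parameters (2, 3)
```

No training objects are stored: each observation is used once for an update and then discarded, which is exactly what makes the method viable for streams.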

## Stochastic gradient descent in R: example #2

Gradient descent is quite easy to implement in R as a function, which we call gradientR below:


```r
gradientR <- function(y, X, eta, iters){
  epsilon = 0.0001                                  # convergence threshold
  X = as.matrix(data.frame(rep(1, length(y)), X))   # add intercept column
  N = dim(X)[1]                                     # number of observations
  print("Initialize parameters...")
  theta.init = as.matrix(rnorm(n = dim(X)[2], mean = 0, sd = 1)) # Initialize theta
  theta.init = t(theta.init)
  e = t(y) - theta.init %*% t(X)                    # initial residuals
  grad.init = -(2/N) * e %*% X                      # initial gradient
  theta = theta.init - eta * (1/N) * grad.init
  l2loss = c()
  for(i in 1:iters){
    l2loss = c(l2loss, sqrt(sum((t(y) - theta %*% t(X))^2)))
    e = t(y) - theta %*% t(X)
    grad = -(2/N) * e %*% X
    theta = theta - eta * (2/N) * grad
    if(sqrt(sum(grad^2)) <= epsilon){               # stop when the gradient is small
      break
    }
  }
  print("Algorithm converged")
  print(paste("Final gradient norm is", sqrt(sum(grad^2))))
  values <- list("coef" = t(theta), "l2loss" = l2loss)
  return(values)
}
```



Let’s also make a function that estimates the parameters with the normal equations:


```r
normalest <- function(y, X){
  X = data.frame(rep(1, length(y)), X)        # add intercept column
  X = as.matrix(X)
  theta = solve(t(X) %*% X) %*% t(X) %*% y    # closed-form least squares
  return(theta)
}
```



Now let’s make up some fake data and see gradient descent in action with η = 100 and 1000 epochs:


```r
y  = rnorm(n = 10000, mean = 0, sd = 1)
x1 = rnorm(n = 10000, mean = 0, sd = 1)
x2 = rnorm(n = 10000, mean = 0, sd = 1)
x3 = rnorm(n = 10000, mean = 0, sd = 1)
x4 = rnorm(n = 10000, mean = 0, sd = 1)
x5 = rnorm(n = 10000, mean = 0, sd = 1)

ptm <- proc.time()
gdec.eta1 = gradientR(y = y, X = data.frame(x1, x2, x3, x4, x5), eta = 100, iters = 1000)

#Output:
## "Initialize parameters..."
## "Algorithm converged"
## "Final gradient norm is 9.84293888300476e-05"

proc.time() - ptm

#Output:
##    user  system elapsed
##   0.232   0.004   0.237
```




Let’s check if we got the correct parameter values


```r
normalest(y = y, X = data.frame(x1, x2, x3, x4, x5))

#Output:
##                           [,1]
## rep.1..length.y..  0.006146655
## x1                 0.011377991
## x2                -0.005618812
## x3                 0.004558885
## x4                -0.003782078
## x5                -0.001489584
```

```r
gdec.eta1$coef # Coefficients from gradient descent

#Output:
##                           [,1]
## rep.1..length.y..  0.006162578
## x1                 0.011349010
## x2                -0.005620819
## x3                 0.004591632
## x4                -0.003783501
## x5                -0.001495606
```

Pretty close. Let’s take a look at the L2-loss for each epoch:

```r
plot(1:length(gdec.eta1$l2loss), gdec.eta1$l2loss, xlab = "Epoch", ylab = "L2-loss")
lines(1:length(gdec.eta1$l2loss), gdec.eta1$l2loss)
```

What if we decreased the learning rate to 10?

```r
#Output:
## "Initialize parameters..."
## "Algorithm converged"
## "Final gradient norm is 0.0493919754917037"

plot(1:length(gdec.eta1$l2loss), gdec.eta1$l2loss, xlab = "Epoch", ylab = "L2-loss")
lines(1:length(gdec.eta1$l2loss), gdec.eta1$l2loss)
```

## Problems with the Stochastic Gradient Descent method

• The algorithm may fail to converge, or may converge too slowly (see algorithm convergence).
• As a rule, the functional Q is multi-extremal, and the gradient descent process can get "stuck" in one of its local minima. To combat this, the technique of shaking the coefficients (jogging of weights) is used: whenever the functional stabilizes, random modifications of the vector w are made in a fairly large neighborhood of its current value, and gradient descent is restarted from the new points.
• With a large feature-space dimension n and/or a small sample length l, overfitting is possible: classification becomes unstable and the error probability increases, and the norm of the weight vector grows strongly. To deal with this drawback, the weight decay method is used: the possible growth of the norm of w is limited by adding a penalty term to Q(w), typically Q(w) + (τ/2)‖w‖².
• If the activation function has horizontal asymptotes, the process can fall into a state of "paralysis": at large values of the scalar product ⟨w, xᵢ⟩, the derivative φ′ becomes close to zero and the vector w stops changing significantly. It is therefore common practice to normalize the features beforehand.
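The weight decay idea from the third bullet can be sketched as a modified gradient step. The function name, step size, and penalty strength τ are illustrative assumptions:

```r
# Weight decay: a gradient step on Q(w) + (tau/2) * ||w||^2. The extra tau * w
# term in the gradient shrinks w toward zero on every step, which limits the
# growth of its norm.
weight_decay_step <- function(w, grad_Q, eta = 0.1, tau = 0.01) {
  w - eta * (grad_Q + tau * w)
}

# Even with a zero loss gradient, the weights shrink slightly each step:
weight_decay_step(c(1, -1), grad_Q = c(0, 0))   # each entry moves toward 0
```

Larger τ shrinks the weights more aggressively, trading some fit for stability.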