## In Data Science

Stochastic gradient descent (SGD) is a simple but very efficient approach to fitting linear classifiers and regressors to convex loss functions

Management

## What is Stochastic gradient descent

Although gradient descent is one of the first topics studied in optimization theory and machine learning, it is difficult to overestimate the importance of one of its modifications - stochastic gradient descent, which we will often refer to simply as SGD (Stochastic Gradient Descent).

Recall that the essence of gradient descent is to minimize a function by making small steps towards the fastest decreasing function. The name of the method comes from mathematical analysis: the vector of partial derivatives of a function defines the direction of the fastest increase of this function. So by moving towards the antigradient of the function, one can decrease the values of this function the fastest.

### SGD in neural networks

Gradient descent in a neural network is an optimization algorithm that is used to find parameter values or coefficients of a function that minimizes a cost function. Gradient descent is also known as a neural network engine. Gradient descent is mainly used when the parameters cannot be calculated analytically, i.e. in linear algebra, and must be searched by an optimization algorithm. There are three types of gradient descent: batch gradient descent, stochastic gradient descent and mini-batch gradient descent. Batch gradient descent refers to computing the derivative of all training data before computing the update. Stochastic gradient descent refers to computing the derivative of each instance of training data and computing the update immediately. Linear regression is an example of gradient descent, as in y = mx + c, where "y" is the response or outcome variable, "m" is the gradient of the linear trend line, "x" is the predictor variable, and "c" is the intercept, where the intercept is the point on the y axis where the predictor x value is always zero.

## Stochastic gradient descent in R: example #1

**Step 1: **Packages are installed and loaded to create a model showing gradient descent.

```
# Installing Packages
install.packages(""e1071"")
install.packages(""caTools"")
install.packages(""caret"")
# Loading package
library(e1071)
library(caTools)
library(caret)
```

**Step 2: **the mtcars dataset pre-loaded in R is attached to use gradient descent.

```
# Loading data
attach(mtcars)
```

**Step 3:** A linear regression model constructs a prediction of miles per gallon (mileage) using the hp (horsepower) function using the mtcars dataset. Suitable values for intersection and slope are determined using the lm() function, which creates a linear model that determines the most appropriate intersection and slope or coefficients.

```
# Model Building
model_grad <- lm(mpg ~ hp, data = mtcars)
```

**Step 4:** The coefficients in the model are determined using coef().

```
# Coefficients in model
coef(model_grad)
(Intercept) hp
30.09886054 -0.06822828
# The intercept is set at 30.09886054 on y axis and that the gradient is set at -0.06822828. The negative gradient
# tells that there is an inverse relationship between mpg and horse power with the one unit increase in horse power results in a
# 0.06 unit decrease in mpg that is mileage.
```

**Step 5: **predictions are made for the model using predict().

```
# Model prediction
y_pred <- predict(model_grad)
```

Moreover, model construction can be performed using abline(). The root-mean-square error or mse can also be calculated by summing the squares of the differences between the observed y-value and the predicted y-value and then dividing by the number of observations equal to n.

### Advantages of Stochastic Gradient Descent method

- The method is suitable for dynamic (online) training, when training objects arrive in streams, and it is necessary to quickly update the vector w.
- The algorithm is able to learn on excessively large samples due to the fact that a random subsample can be sufficient for training.
- Different learning strategies are possible. If the sample is excessively large, or learning occurs dynamically, it is acceptable not to retain training objects. If the sample is small, then the same objects can be repeatedly presented for training.

## Stochastic gradient descent in R: example #2

quite easy to implement in R as a function which we call **gradientR** below:

```
gradientR<-function(y, X, epsilon,eta, iters){
epsilon = 0.0001
X = as.matrix(data.frame(rep(1,length(y)),X))
N= dim(X)[1]
print(""Initialize parameters..."")
theta.init = as.matrix(rnorm(n=dim(X)[2], mean=0,sd = 1)) # Initialize theta
theta.init = t(theta.init)
e = t(y) - theta.init%*%t(X)
grad.init = -(2/N)%*%(e)%*%X
theta = theta.init - eta*(1/N)*grad.init
l2loss = c()
for(i in 1:iters){
l2loss = c(l2loss,sqrt(sum((t(y) - theta%*%t(X))^2)))
e = t(y) - theta%*%t(X)
grad = -(2/N)%*%e%*%X
theta = theta - eta*(2/N)*grad
if(sqrt(sum(grad^2)) <= epsilon){
break
}
}
print(""Algorithm converged"")
print(paste(""Final gradient norm is"",sqrt(sum(grad^2))))
values<-list(""coef"" = t(theta), ""l2loss"" = l2loss)
return(values)
}
```

Let’s also make a function that estimates the parameters with the normal equations:

```
normalest <- function(y, X){
X = data.frame(rep(1,length(y)),X)
X = as.matrix(X)
theta = solve(t(X)%*%X)%*%t(X)%*%y
return(theta)
}
```

Now let’s make up some fake data and see gradient descent in action with η=100η=100 and 1000 epochs:

```
y = rnorm(n = 10000, mean = 0, sd = 1)
x1 = rnorm(n = 10000, mean = 0, sd = 1)
x2 = rnorm(n = 10000, mean = 0, sd = 1)
x3 = rnorm(n = 10000, mean = 0, sd = 1)
x4 = rnorm(n = 10000, mean = 0, sd = 1)
x5 = rnorm(n = 10000, mean = 0, sd = 1)
ptm <- proc.time()
gdec.eta1 = gradientR(y = y, X = data.frame(x1,x2,x3, x4,x5), eta = 100, iters = 1000)
#Output:
## [1] ""Initialize parameters...""
## [1] ""Algorithm converged""
## [1] ""Final gradient norm is 9.84293888300476e-05""
proc.time() - ptm
#Output:
## user system elapsed
## 0.232 0.004 0.237
```

Let’s check if we got the correct parameter values

```
normalest(y=y, X = data.frame(x1,x2,x3,x4,x5))
#Output:
## [,1]
## rep.1..length.y.. 0.006146655
## x1 0.011377991
## x2 -0.005618812
## x3 0.004558885
## x4 -0.003782078
## x5 -0.001489584
gdec.eta1$coef #Coefficients from gradient descent
#Output:
## [,1]
## rep.1..length.y.. 0.006162578
## x1 0.011349010
## x2 -0.005620819
## x3 0.004591632
## x4 -0.003783501
## x5 -0.001495606
```

Pretty close. Let’s take a look at the L2-loss for each epoch:

```
plot(1:length(gdec.eta1$l2loss),gdec.eta1$l2loss,xlab = ""Epoch"", ylab = ""L2-loss"")
lines(1:length(gdec.eta1$l2loss),gdec.eta1$l2loss)
```

What if we decreased the learning rate to 10?

```
#Output:
## [1] ""Initialize parameters...""
## [1] ""Algorithm converged""
## [1] ""Final gradient norm is 0.0493919754917037""
plot(1:length(gdec.eta1$l2loss),gdec.eta1$l2loss,xlab = ""Epoch"", ylab = ""L2-loss"")
lines(1:length(gdec.eta1$l2loss),gdec.eta1$l2loss)
```

## Problems with Stochastic Gradient Descent method

- The algorithm may fail to converge, or may converge too slowly (see algorithm convergence).
- As a rule, the Q-functional is multi-extreme and the gradient descent process can get "stuck" in one of its local minima. To fight it the technique of shaking of coefficients (jog of weights) is used. It consists in making random modifications of the vector w in a rather large neighborhood of the current value at each stabilization of the functional and starting the process of gradient descent from new points.
- With large feature space dimension n and/or small sample length l, overtraining is possible, i.e. classification becomes unstable and error probability increases. This strongly increases the norm of the weight vector. To deal with this drawback, the weights decay method is used. It consists in limiting the possible growth of the w norm by adding a penalty component to Q(w):
- If the activation function has horizontal asymptotes, the process can fall into a state of "paralysis". At high values of the scalar product <w, x_i> the value of \varphi^\prime becomes close to zero and the vector w ceases to change significantly. Therefore it is common practice to normalise the features beforehand