Optimization Functions

·         Constant Learning Rate Algorithms
·                     Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
·                     The same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but instead perform a larger update for rarely occurring features.
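The effect of the learning rate can be seen on a toy problem. The sketch below is only an illustration: the function f(x) = x**2, the starting point, and the four learning rates are assumptions chosen for the example, and run_fixed_lr is a hypothetical helper name.

```python
def run_fixed_lr(lr, steps=50, x0=5.0):
    """Minimise f(x) = x**2 with a constant learning rate; return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x   # derivative of x**2
        x -= lr * grad   # same step size at every iteration
    return x

# Too small, reasonable, oscillating, diverging (values picked for this toy problem).
for lr in (0.001, 0.1, 1.0, 1.1):
    print(f"lr={lr}: x after 50 steps = {run_fixed_lr(lr):.4g}")
```

On this problem the smallest rate has barely moved from the starting point after 50 steps, 0.1 gets very close to the minimum at 0, 1.0 bounces between two points without ever approaching the minimum, and 1.1 moves further away with every step.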

Stochastic Gradient Descent: Performs a parameter update for each training example. Each update is computationally inexpensive, but the updates are noisy, so the loss fluctuates and may not settle exactly at the optimum.

Batch gradient descent: Computes the gradient of the cost function w.r.t. the parameters θ for the entire training dataset, so it performs only one update per pass over the data.

Mini-batch gradient descent: Takes the best of both worlds and performs an update for every mini-batch of n training examples.
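The three variants above differ only in how many examples feed each update, which a single loop with a batch_size argument can show. This is a minimal sketch under stated assumptions: a linear model with squared-error loss, synthetic noise-free data, and a hypothetical helper fit_linear_model; none of these come from the original post.

```python
import numpy as np

def fit_linear_model(X, y, batch_size, lr=0.05, epochs=30, seed=0):
    """Gradient descent on squared error; returns (weights, number of updates)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n_updates = 0
    for _ in range(epochs):
        order = rng.permutation(len(X))            # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)  # mean gradient over this batch
            w -= lr * grad                         # constant learning rate
            n_updates += 1
    return w, n_updates

# Synthetic data, assumed purely for illustration.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

# batch_size=1 is stochastic, len(X) is batch, anything in between is mini-batch.
for name, batch_size in [("stochastic", 1), ("mini-batch", 32), ("batch", len(X))]:
    w, n_updates = fit_linear_model(X, y, batch_size)
    print(f"{name:>10}: {n_updates:5d} updates, w = {np.round(w, 3)}")
```

With the same number of epochs, the stochastic run makes one update per example and the batch run makes one update per epoch, so at this learning rate the batch run ends noticeably further from true_w than the other two.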

·         Adaptive Learning Rate Algorithms
Advantage: The learning rate is adapted as training proceeds, so minimal manual tuning is needed.

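Adagrad: Good results. Adagrad adapts the learning rate to each individual parameter, performing larger updates for parameters tied to infrequent features and smaller updates for parameters tied to frequent ones.
Disadvantage: It stores extra state for every parameter, which makes each step more expensive than plain SGD, and its accumulated squared gradients keep growing, so the effective learning rate shrinks over time.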
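The per-parameter scaling can be sketched in a few lines. This is an illustrative version of the Adagrad update, not a library implementation: the badly scaled quadratic, the learning rates, and the names adagrad, grad_fn, and scales are assumptions made for the example.

```python
import numpy as np

def adagrad(grad_fn, w0, lr=0.5, steps=100, eps=1e-8):
    """Adagrad sketch: divide each step by the root of the summed squared gradients."""
    w = np.asarray(w0, dtype=float).copy()
    g_sq_sum = np.zeros_like(w)                  # one accumulator per parameter
    for _ in range(steps):
        g = grad_fn(w)
        g_sq_sum += g ** 2                       # only ever grows
        w -= lr * g / (np.sqrt(g_sq_sum) + eps)  # per-parameter step size
    return w

# Toy loss 0.5 * (100 * w[0]**2 + 0.01 * w[1]**2): one steep and one flat direction.
scales = np.array([100.0, 0.01])

def grad_fn(w):
    return scales * w

w_adagrad = adagrad(grad_fn, [1.0, 1.0])

# Plain gradient descent with one shared rate, kept small enough for the steep direction.
w_gd = np.array([1.0, 1.0])
for _ in range(100):
    w_gd -= 0.01 * grad_fn(w_gd)

print("Adagrad:", w_adagrad)
print("Plain GD:", w_gd)
```

In this setup Adagrad drives both coordinates toward zero, while plain gradient descent with a single shared rate has hardly moved the flat coordinate after the same number of steps.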

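Adam: Adaptive Moment Estimation. It typically converges faster than Adagrad with good results, and is the most popular choice today.
Disadvantage: It keeps two running averages for every parameter, so each step costs more computation and memory than plain SGD.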
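A minimal sketch of the Adam update shows the two running averages it keeps per parameter. The toy quadratic, the learning rate of 0.05, and the names adam, grad_fn, and scales are assumptions for illustration; beta1, beta2, and eps are set to commonly used defaults.

```python
import numpy as np

def adam(grad_fn, w0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam sketch: momentum on the gradient plus per-parameter scaling."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)   # running average of gradients (first moment)
    v = np.zeros_like(w)   # running average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction: m and v start at zero
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy loss 0.5 * (100 * w[0]**2 + 0.01 * w[1]**2), assumed for illustration.
scales = np.array([100.0, 0.01])

def grad_fn(w):
    return scales * w

w_adam = adam(grad_fn, [1.0, 1.0])
print("Adam:", w_adam)
```

Unlike the Adagrad accumulator, v here is a decaying average, so old gradients are gradually forgotten rather than summed forever.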

For a more in-depth and mathematical explanation, refer to the following link:

