Optimization Functions
· Constant Learning Rate Algorithms
Challenges:
· Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge (see the small numerical sketch after these challenges).
· The same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.
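To make the first challenge concrete, here is a minimal sketch (not from the original post) that runs plain gradient descent on the one-dimensional function f(x) = x², whose gradient is 2x; the starting point, step counts and the three learning rates are purely illustrative choices.

# Gradient descent on f(x) = x^2 (minimum at x = 0, gradient 2x) to show
# how the learning rate alone decides between slow convergence and divergence.
def descend(lr, x0=5.0, steps=50):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(descend(lr=0.001))  # too small: after 50 steps x is still ~4.5, barely moved
print(descend(lr=0.5))    # well chosen: jumps to 0 almost immediately
print(descend(lr=1.1))    # too large: |x| grows every step, the iterates diverge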
Stochastic Gradient Descent: Performs a parameter update for each training example; computationally inexpensive per step, though the noisy updates may keep it from settling exactly at the optimum.
Batch gradient descent: Computes the gradient of the cost function w.r.t. the parameters θ for the entire training dataset.
Mini-batch gradient descent: Takes the best of both worlds and performs an update for every mini-batch of n training examples.
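As a rough illustration of how the three variants differ only in how much data feeds each update, here is a minimal NumPy sketch for linear regression with a mean-squared-error gradient; the function names (compute_gradient, batch_gd, sgd, minibatch_gd) and the hyperparameter defaults are illustrative, not from the original post.

import numpy as np

# Gradient of the mean-squared error for linear regression, used only to
# illustrate the three update schemes.
def compute_gradient(theta, X, y):
    return 2.0 / len(y) * X.T @ (X @ theta - y)

def batch_gd(theta, X, y, lr=0.01, epochs=100):
    # Batch GD: one update per epoch, using the entire dataset.
    for _ in range(epochs):
        theta -= lr * compute_gradient(theta, X, y)
    return theta

def sgd(theta, X, y, lr=0.01, epochs=100):
    # SGD: one update per training example, visited in shuffled order.
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            theta -= lr * compute_gradient(theta, X[i:i+1], y[i:i+1])
    return theta

def minibatch_gd(theta, X, y, lr=0.01, epochs=100, batch_size=32):
    # Mini-batch GD: one update per mini-batch of batch_size examples.
    for _ in range(epochs):
        idx = np.random.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            theta -= lr * compute_gradient(theta, X[batch], y[batch])
    return theta

The only difference between the three loops is the slice of (X, y) used for each gradient, which is exactly the trade-off between the accuracy of the gradient estimate and the cost per update.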
· Adaptive Learning Rate Algorithms
Advantage: The learning rate is adapted for each parameter as iterations proceed, so manual intervention is minimal.
Adagrad: Gives good results. Adagrad adapts updates to each individual parameter, performing larger or smaller updates depending on that parameter's importance.
Disadvantage: Computationally expensive, since a squared-gradient accumulator must be stored and updated for every parameter.
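A minimal sketch of the Adagrad update just described, assuming NumPy arrays for the parameters, gradient and accumulator; the helper name adagrad_update and the default values are illustrative.

import numpy as np

def adagrad_update(theta, grad, cache, lr=0.01, eps=1e-8):
    # Per-parameter learning rates: the step shrinks for parameters whose
    # squared gradients have accumulated a lot in `cache`.
    cache += grad ** 2
    theta -= lr * grad / (np.sqrt(cache) + eps)
    return theta, cache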
Adam: Adaptive Moment Estimation is faster than Adagrad, gives good results, and is the most popular optimizer today.
Disadvantage: Computationally expensive.
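And a corresponding sketch of a single Adam step, again assuming NumPy arrays; the function name and the defaults (the commonly cited beta1=0.9, beta2=0.999) are illustrative choices, not taken from the post.

import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Combines a momentum-like first moment (m) with an RMS-scaled second
    # moment (v), both bias-corrected; t is the step count starting at 1.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

Keeping both moment estimates per parameter is what makes Adam adaptive like Adagrad while avoiding Adagrad's ever-shrinking step size.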
For a more in-depth and mathematical explanation, refer to the following link:
https://towardsdatascience.com/