Loss Functions and their Use
This is an overview of the loss functions in most common use, for beginners and experienced data explorers alike. Dive right in!
But before we begin, a few basic ML wizards need an introduction.

Absolute Value takes care of the difference between original data and predicted data. If it comes out negative, he removes the negativity with his Modulus charm.

A Coefficient is like a house elf. He sticks to his master, which is a variable/feature (like X in "Y = mX + C"), and contributes towards either increasing or decreasing X's weight in the equation, as required.

Best Fit is an artist. Mostly she uses Wingardium Leviosa to splash colors across the skies of graphs, plotting lines or curves that capture the pattern of the data points (in n dimensions). The curves are drawn so that they deviate as little as possible from the data points.

Confusion Matrix actually helps you muggles ward off your confusion. It is a matrix that tabulates original data against predicted data, and it is very useful for understanding how effectively our spells make predictions. A quick sketch of one appears below.
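For muggles who like it spelled out, here is a toy confusion matrix in Python; the labels are invented purely for illustration, and scikit-learn's confusion_matrix would conjure the same table in one line.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # original data
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # predicted data

cm = np.zeros((2, 2), dtype=int)  # rows = actual class, columns = predicted class
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print(cm)
# [[3 1]   3 true negatives, 1 false positive
#  [1 3]]  1 false negative, 3 true positives
```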
Enough introductions for now, let's just begin. *Yawns*
Loss functions and their use
· The L1 regularization technique is called Lasso Regression. It adds the "absolute value of magnitude" of each coefficient as a penalty term to the loss function. Lasso shrinks the less important features' coefficients to zero, thus removing some features altogether. It works well for feature selection when we have a huge number of features.

· The L2 technique is called Ridge Regression. Ridge regression adds the "squared magnitude" of each coefficient as a penalty term to the loss function. This works very well to avoid over-fitting. A small sketch of both appears below.
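To watch the two penalties at work, here is a minimal sketch using scikit-learn on synthetic data; the feature count, alpha values and random seed are all invented for illustration.

```python
# Lasso (L1) vs. Ridge (L2) on synthetic data where only the first
# two of five features actually matter.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)  # L2 penalty

print(lasso.coef_)  # irrelevant coefficients are (typically) exactly 0
print(ridge.coef_)  # the same coefficients are shrunk, but stay non-zero
```

On data like this, Lasso typically banishes the three irrelevant coefficients to exactly zero while Ridge merely shrinks them, which is precisely the feature-selection behaviour described above.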
·
Regressive
loss functions
i)
Mean Square Error/Ridge:
measures
the average of the squares of the errors or deviations—that
is, the difference between the estimator and what is estimated.
ii)
Mean Absolute Error/Lasso:
It is the absolute
difference between the measured value and “true” value. It decreases the weight
for outlier errors.
iii)
Smooth Absolute Error: It is the absolute difference
between the measured value and “true” value for predictions lying close to real
value (close to best fit) and it is the square of difference between measured and “true” value for
outliers (or points far off from best fit). Basically, it is a
combination of MSE and MAE.
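Here are plain-NumPy sketches of the three regressive losses. The delta threshold in the smooth variant is an assumed knob marking where "close to the best fit" ends; this piecewise form is commonly known as the Huber loss.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Square Error: the average of squared deviations
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: the average of absolute deviations
    return np.mean(np.abs(y_true - y_pred))

def smooth_absolute_error(y_true, y_pred, delta=1.0):
    # Squared near the best fit, absolute for outliers; a combination
    # of MSE and MAE. delta marks where "close" ends and "outlier" begins.
    err = np.abs(y_true - y_pred)
    return np.mean(np.where(err <= delta,
                            0.5 * err ** 2,
                            delta * (err - 0.5 * delta)))
```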
· Classification loss functions: misclassification measures

i) 0-1 loss function: you count the number of misclassified items. There is nothing more behind it; it is a very basic loss function. For instance, it can be read off the confusion matrix, which shows the numbers of misclassifications and correct classifications. 0-1 loss penalizes misclassifications and assigns the smallest loss to the solution that has the greatest number of correct classifications.
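As basic as it sounds, here is the 0-1 loss in a single NumPy incantation; I report it as a fraction, though multiplying by the number of items gives the raw count.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    # fraction of misclassified items (raw count / number of items)
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))
```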
ii) Hinge loss function (L2 regularized): the hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). Basically, it assigns no penalty to points that sit safely beyond the margin drawn between the classes (the margin passing through the nearest points of each class on either side), and it penalizes points in proportion to how far they stray past that margin; the L2 variant squares this penalty.
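A sketch of both forms, assuming labels in {-1, +1} and raw classifier scores; reading "(L2 regularized)" as the squared-hinge variant is my assumption.

```python
import numpy as np

def hinge_loss(y_true, scores):
    # zero penalty beyond the margin, linear growth past it
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

def squared_hinge_loss(y_true, scores):
    # the squared (L2) variant penalizes margin violations quadratically
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores) ** 2)
```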
iii) Logistic loss: this function displays a convergence rate similar to that of the hinge loss, and since it is smooth (differentiable everywhere, unlike the hinge loss), gradient descent (he's a master wizard! Will talk about him in another long parchment. For now, refer to the wiki) methods can be utilized. However, the logistic loss function does not assign zero penalty to any point. Instead, points that are correctly classified with high confidence are penalized less. This structure makes the logistic loss function sensitive to outliers in the data.
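A numerically stable sketch, again assuming labels in {-1, +1} and raw classifier scores:

```python
import numpy as np

def logistic_loss(y_true, scores):
    # log(1 + exp(-y * s)), computed via logaddexp to avoid overflow;
    # small for confidently correct points, but never exactly zero
    return np.mean(np.logaddexp(0.0, -y_true * scores))
```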
iv) Cross entropy/log loss: measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.
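A sketch of the binary form, for labels in {0, 1} and predicted probabilities; the eps clipping is an assumed guard against taking log(0).

```python
import numpy as np

def cross_entropy(y_true, p_pred, eps=1e-12):
    # binary cross-entropy; grows without bound as the predicted
    # probability diverges from the true label
    p = np.clip(p_pred, eps, 1 - eps)
    return np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))
```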
There are a multitude of loss functions yet to be explored, but this was the scope of my current study. Fret not, I will keep you posted every now and then! Happy wizarding!