Loss Functions and their Use

This is an overview of the loss functions most commonly used by both beginners and experienced data explorers. Dive right in!

But before we begin, a few basic ML wizards need an introduction.



Absolute Value takes care of the difference between the original data and the predicted data. If that difference comes out negative, he removes the negativity with his Modulus charm.

A Coefficient is like a house elf. He sticks to his master, which is a variable/feature (like X in Y = mX + C), and effectively contributes towards either increasing or decreasing X's weight in the equation, as required.

Best Fit is an artist. Mostly she uses Wingardium Leviosa to splash colors across the skies of graphs, plotting lines or curves that capture the pattern of the data points (in n dimensions). The curves are drawn so that they deviate as little as possible from the data points.
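To make the house elf and the artist a little more concrete, here is a minimal sketch, assuming NumPy is available and using made-up data points, that asks Best Fit to draw a straight line Y = mX + C through a few points and reports the coefficient m:

import numpy as np

# A handful of made-up data points, purely for illustration
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# np.polyfit with degree 1 finds the best-fit line Y = mX + C,
# i.e. the line with minimum squared deviation from the data points
m, C = np.polyfit(X, Y, 1)
print(f"coefficient m = {m:.3f}, intercept C = {C:.3f}")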

The Confusion Matrix actually helps you muggles ward off your confusion. It is a matrix that shows the actual labels against the predicted labels. Very useful for understanding how effectively our spells can make predictions.
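If you want to summon a confusion matrix yourself, here is a minimal sketch, assuming scikit-learn is installed and using made-up labels:

from sklearn.metrics import confusion_matrix

# Actual labels vs. what our spell (the model) predicted (made-up values)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are the actual classes, columns the predicted classes:
# diagonal entries are correct classifications, off-diagonal entries are mistakes
print(confusion_matrix(y_true, y_pred))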

Enough introductions for now, let's just begin. *Yawns*


Loss functions and their use


·         The L1 regularization technique is called Lasso Regression. It adds the absolute value of the magnitude of each coefficient as a penalty term to the loss function. Lasso shrinks the less important features' coefficients to zero, thus removing some features altogether. It works well for feature selection when we have a huge number of features.

·         The L2 regularization technique is called Ridge Regression. Ridge regression adds the squared magnitude of each coefficient as a penalty term to the loss function. This works very well to avoid over-fitting. A small sketch of both penalty terms follows just after this bullet.
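To see what these two penalty terms actually compute, here is a minimal sketch, assuming NumPy and a made-up coefficient vector. In practice, scikit-learn's Lasso and Ridge estimators add these penalties to the loss for you.

import numpy as np

def l1_penalty(coefficients, alpha=1.0):
    # Lasso penalty: alpha times the sum of absolute coefficient values
    return alpha * np.sum(np.abs(coefficients))

def l2_penalty(coefficients, alpha=1.0):
    # Ridge penalty: alpha times the sum of squared coefficient values
    return alpha * np.sum(np.square(coefficients))

w = np.array([0.5, -2.0, 0.0, 3.0])   # hypothetical coefficients
print("L1 penalty:", l1_penalty(w))   # 5.5
print("L2 penalty:", l2_penalty(w))   # 13.25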


·         Regressive loss functions
i)    Mean Square Error/Ridge: measures the average of the squares of the errors or deviations, that is, the average squared difference between the predicted values and the actual values.


ii)    Mean Absolute Error/Lasso: It is the average absolute difference between the measured value and the "true" value. Compared to MSE, it gives less weight to outlier errors, making it more robust to outliers.

iii)    Smooth Absolute Error: It is the squared difference between the measured and "true" value for predictions lying close to the real value (close to the best fit), and the absolute difference for outliers (points far off from the best fit). Basically, it is a combination of MSE and MAE, also known as the Huber loss. A small sketch of these three losses follows right after this list.
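Here is a minimal sketch of the three regressive losses, assuming NumPy and made-up predictions; the delta threshold in the smooth absolute error is an assumed cut-off between "close" points and outliers.

import numpy as np

def mse(y_true, y_pred):
    # Mean Square Error: average of the squared differences
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: average of the absolute differences
    return np.mean(np.abs(y_true - y_pred))

def smooth_absolute_error(y_true, y_pred, delta=1.0):
    # Squared term for errors within delta of the best fit, absolute term for outliers
    error = y_true - y_pred
    squared_part = 0.5 * error ** 2
    absolute_part = delta * (np.abs(error) - 0.5 * delta)
    return np.mean(np.where(np.abs(error) <= delta, squared_part, absolute_part))

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # made-up "true" values
y_pred = np.array([2.5, 0.0, 2.0, 9.0])    # made-up predictions
print(mse(y_true, y_pred), mae(y_true, y_pred), smooth_absolute_error(y_true, y_pred))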

·         Classification loss functions: Misclassification measures

i)    0-1 loss function: You simply count the number of misclassified items; there is nothing more to it, it is a very basic loss function. For instance, it can be read off the confusion matrix, which shows the number of misclassifications and correct classifications. The 0-1 loss penalizes misclassifications and assigns the smallest loss to the solution with the greatest number of correct classifications.

ii)    Hinge loss function: The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs), where it is typically paired with L2 regularization. Basically, it penalizes points that land inside the margin or on the wrong side of it, in proportion to how far they are from the margin boundary defined by the nearest points of each class; points classified correctly and outside the margin incur zero loss.

iii)    Logistic Loss: This function displays a similar convergence rate to the hinge loss, and since it is smooth and differentiable everywhere (unlike the hinge loss, which has a kink at the margin), gradient descent (he's a master wizard! Will talk about him in another long parchment; for now, refer to the wiki) methods can be used. However, the logistic loss does not assign zero penalty to any point; instead, points that are correctly classified with high confidence are simply penalized less. This structure makes the logistic loss sensitive to outliers in the data.


iv)    Cross entropy, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. A small sketch of these four classification losses follows right after this list.
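Here is a minimal sketch of the four classification losses above, assuming NumPy and made-up labels and scores. The hinge and logistic losses below use labels in {-1, +1} together with raw model scores (and this is the standard, unsquared hinge), while the 0-1 loss and cross entropy use labels in {0, 1} with predicted probabilities.

import numpy as np

def zero_one_loss(y_true, y_pred):
    # 0-1 loss: the fraction of misclassified items
    return np.mean(y_true != y_pred)

def hinge_loss(y_true, scores):
    # Hinge loss for labels in {-1, +1}: zero once the margin y * f(x) reaches 1
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

def logistic_loss(y_true, scores):
    # Logistic loss for labels in {-1, +1}: smooth and never exactly zero
    return np.mean(np.log(1.0 + np.exp(-y_true * scores)))

def cross_entropy(y_true, probs, eps=1e-12):
    # Log loss for labels in {0, 1} and predicted probabilities in (0, 1)
    probs = np.clip(probs, eps, 1.0 - eps)
    return np.mean(-(y_true * np.log(probs) + (1 - y_true) * np.log(1 - probs)))

# Made-up predictions, purely for illustration
y_pm = np.array([1, -1, 1, -1])             # labels as -1 / +1
scores = np.array([2.0, -0.5, -1.0, -3.0])  # raw model scores f(x)
y_01 = np.array([1, 0, 1, 0])               # the same labels as 0 / 1
probs = np.array([0.9, 0.4, 0.3, 0.05])     # predicted probabilities of class 1

print(zero_one_loss(y_01, (probs >= 0.5).astype(int)))
print(hinge_loss(y_pm, scores))
print(logistic_loss(y_pm, scores))
print(cross_entropy(y_01, probs))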


There is a multitude of loss functions yet to be explored, but this was the scope of my current study. Fret not, I will keep you posted every now and then! Happy wizarding!
