EVERYTHING YOU NEED TO KNOW ABOUT COMMONLY USED LOSS/COST FUNCTIONS.
WHY ARE ERRORS CALCULATED?
Errors give a measure of how far our predicted value is from the actual value it was meant to predict. They also tell us how well our ML/DL model is performing on the training dataset and how the model is likely to perform on future data. In machine learning models, the calculated errors help in tweaking hyperparameters, and in deep learning models, the calculated errors are sent to optimizers to update the weights properly in order to reach the global minimum.
HOW ARE ERRORS CALCULATED?
We have different loss/cost functions for calculating errors. Depending on the model we are dealing with, whether it is a regression, binary classification, or multi-class classification model, different loss/cost functions can be used.
WHAT IS THE DIFFERENCE BETWEEN A LOSS AND A COST FUNCTION?
There is only a thin line of difference between a loss function and a cost function. A loss function calculates the error for each individual data point in a given dataset, while a cost function calculates the average loss over the whole dataset. For example, in stochastic gradient descent (a way of passing data through a neural network and updating the weights), each data point is passed through the neural network and its error is calculated; here a loss function is used. In batch or mini-batch gradient descent, the average of the losses is calculated; here we use a cost function.
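To make the distinction concrete, here is a minimal NumPy sketch (the target and prediction values are made up for illustration): the loss is computed per data point, while the cost is the average over the dataset.

```python
import numpy as np

# Hypothetical targets and predictions, just for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# Loss: an error value for each individual data point (what SGD works with)
per_point_loss = (y_true - y_pred) ** 2
print(per_point_loss)   # [0.25 0.25 0.   1.  ]

# Cost: the average of those losses (what batch/mini-batch GD works with)
cost = per_point_loss.mean()
print(cost)             # 0.375
```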
DIFFERENT TYPES OF LOSS/COST FUNCTIONS
REGRESSION PROBLEMS
-L1 LOSS FUNCTION
The L1 loss function is also known as least absolute deviations (LAD), least absolute errors (LAE), or least absolute value (LAV). This loss function calculates the absolute magnitude of the difference between the actual value and the predicted value. It only takes the magnitude into account, not the direction.
Advantages - It performs well on data containing outliers, since the error grows only linearly and outliers are not magnified.
Disadvantages - When the difference between the true output and the predicted output is very small (usually after some number of iterations), the cost is also small, since it is just the linear difference between them, and hence learning becomes very slow.
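As a minimal sketch (in NumPy, with made-up numbers), L1 loss averages the absolute differences; note how the outlier in the last position contributes only linearly:

```python
import numpy as np

def l1_loss(y_true, y_pred):
    # Mean absolute error: average of |y_true - y_pred|
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0, 100.0])   # last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2,  10.0])
print(l1_loss(y_true, y_pred))   # 22.6 - the outlier adds only a linear penalty
```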
-L2 LOSS FUNCTION
The L2 loss function is also known as least squared error. This function calculates the squared difference between the actual value and the predicted value. Since the value is squared, the direction becomes meaningless.
Advantages - The loss is magnified; it is like looking at the loss through a magnifying lens, so learning will be faster in the case of L2 loss.
Disadvantages - It is sensitive to outliers. Since the error values are squared, the presence of outliers will produce very large errors.
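For comparison, here is the same sketch with the L2 loss; with the same made-up outlier as above, the squared term now dominates the result:

```python
import numpy as np

def l2_loss(y_true, y_pred):
    # Mean squared error: average of (y_true - y_pred)^2
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0, 100.0])   # same outlier as before
y_pred = np.array([1.1, 1.9, 3.2,  10.0])
print(l2_loss(y_true, y_pred))   # 2025.015 - the squared outlier dominates
```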
-HUBER LOSS FUNCTION
Huber loss is a combination of both the L1 loss and the L2 loss, and it is less sensitive to outliers than mean squared error. L1 loss has an issue when the difference between the true and predicted output is small, and L2 loss has an issue when the difference is large (i.e., in the case of outliers); Huber loss addresses both problems. It introduces a delta parameter, which is a hyperparameter: if the absolute difference between the actual value and the predicted value is less than delta (i.e., for smaller errors), a quadratic function is used (the difference is squared and divided by 2); otherwise, a linear function is used.
Disadvantages - The main challenge with Huber loss is finding the right delta value, which is done through an iterative process.
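A minimal NumPy sketch of Huber loss, assuming a hand-picked delta of 1.0 (in practice delta is tuned as described above); it is quadratic for residuals below delta and linear beyond it:

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    residual = np.abs(y_true - y_pred)
    # Quadratic branch for small residuals (L2-like behaviour)
    quadratic = 0.5 * residual ** 2
    # Linear branch for large residuals (L1-like behaviour, robust to outliers)
    linear = delta * residual - 0.5 * delta ** 2
    return np.mean(np.where(residual <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0, 100.0])
y_pred = np.array([1.1, 1.9, 3.2,  10.0])
print(huber_loss(y_true, y_pred))   # outlier is penalised linearly, not squared
```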
BINARY CLASSIFICATION PROBLEMS
HINGE LOSS
This function is used in binary classification problems. It is useful for maximum-margin classification, so it works well with Support Vector Machines (SVMs). Depending on which class we are trying to predict, the true label t is +1 or -1, and y is the raw classifier output; the loss is max(0, 1 - t·y). When t and y have the same sign (and the margin t·y is at least 1), the class is predicted correctly and the loss is 0. Otherwise, for opposite signs of y and t, the loss increases linearly with the value of y.
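Here is a minimal NumPy sketch of the hinge loss max(0, 1 - t·y), with made-up labels t and raw scores y:

```python
import numpy as np

def hinge_loss(t, y):
    # t holds true labels in {-1, +1}; y holds raw classifier scores
    return np.mean(np.maximum(0.0, 1.0 - t * y))

t = np.array([ 1.0, -1.0,  1.0])
y = np.array([ 2.0, -0.5, -0.3])   # correct, inside the margin, misclassified
print(hinge_loss(t, y))   # (0 + 0.5 + 1.3) / 3 = 0.6
```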
BINARY CROSS ENTROPY
This is also called sigmoid cross entropy: it is a sigmoid activation followed by a cross-entropy loss.
Since the predicted value is the output of a sigmoid function (sigma) applied to a weighted combination of the features x, it always lies between 0 and 1, while the actual value y is either 0 or 1.
Binary cross-entropy formula (eq. 1):
J = -(1/N) * Σ [ y * log(ŷ) + (1 - y) * log(1 - ŷ) ]
where y is the actual label and ŷ is the predicted probability.
Let's look at some desirable properties of the binary cross-entropy cost function:
1st property: for a correct classification (i.e., y-actual approximately equal to y-predicted), substituting the values of y-predicted and y-actual in eq. 1 makes J tend to approximately zero.
2nd property: for a misclassification, substituting the values of y-predicted and y-actual in eq. 1 makes J tend to infinity.
3rd property: since the actual value is either 0 or 1 and the predicted value lies between 0 and 1, the term inside the brackets of eq. 1 is always negative (the log of a number between 0 and 1 is negative). Multiplying by the negative sign outside the brackets makes the whole expression positive, so the value of J is always greater than or equal to zero.
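These properties are easy to check numerically; here is a minimal NumPy sketch of eq. 1 (the clipping epsilon is an assumption added to avoid log(0)):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # eq. 1: J = -mean(y*log(y_hat) + (1-y)*log(1-y_hat))
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# 1st property: correct classification -> J close to zero
print(binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.99, 0.01])))
# 2nd property: confident misclassification -> J very large
print(binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.01, 0.99])))
# 3rd property: in both cases J >= 0
```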
MULTI CLASS CLASSIFICATION PROBLEMS
MULTI CLASS CROSS ENTROPY
When we are dealing with multi-class classification problems, we use multi-class cross entropy. Its mathematical presentation is:
J = -(1/N) Σᵢ Σ_c y(i,c) * log(ŷ(i,c))
where y(i,c) is 1 if sample i belongs to class c (and 0 otherwise), and ŷ(i,c) is the predicted probability that sample i belongs to class c.
This is basically an extension of binary cross entropy.
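A minimal NumPy sketch of the formula above, assuming one-hot labels and already-normalised predicted probabilities:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot labels of shape (N, C); y_pred: class probabilities (N, C)
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[1, 0, 0], [0, 1, 0]])              # one-hot labels
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # predicted probabilities
print(categorical_cross_entropy(y_true, y_pred))       # ~0.29
```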
SOFTMAX CROSS ENTROPY
It is a combination of the soft-max activation function plus the cross-entropy loss, used for multi-class classification problems. The mathematical presentation of the soft-max function is:
softmax(z)_j = e^(z_j) / Σ_k e^(z_k)
The cross-entropy loss is then applied to the soft-max outputs, giving the combination of soft-max activation plus cross-entropy loss.
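A minimal NumPy sketch of this combination, assuming raw network outputs (logits) and one-hot labels; the max-shift inside softmax is a standard numerical-stability trick:

```python
import numpy as np

def softmax(z):
    # Shift by the row max for numerical stability, then normalise
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_cross_entropy(y_true, logits, eps=1e-12):
    # Soft-max activation followed by cross-entropy loss
    probs = np.clip(softmax(logits), eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(probs), axis=1))

logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])  # raw network outputs
y_true = np.array([[1, 0, 0], [0, 1, 0]])              # one-hot labels
print(softmax_cross_entropy(y_true, logits))
```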
******************************************************************
Thanks for reading the article! Wanna connect with me?
Here is a link to my LinkedIn profile