
Badreddine Chaguer

Senior Data Scientist / Co-founder

Focal Loss vs. Binary Cross Entropy Loss

September 11, 2024

Binary classification models are usually trained with the binary cross entropy (BCE) loss function:

\[
\mathcal{L}_{\text{BCE}} = -\big[\, y \log(p) + (1 - y)\,\log(1 - p) \,\big]
\]

where \( p \) is the predicted probability of the positive class and \( y \in \{0, 1\} \) is the true label.

For simplicity, let’s define \( p_t \) as follows:

\[
p_t =
\begin{cases}
p & \text{if } y = 1 \\
1 - p & \text{otherwise}
\end{cases}
\]

…then we can also express the cross-entropy loss function as:

\[
\mathcal{L}_{\text{BCE}} = -\log(p_t)
\]
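To make the two equivalent forms concrete, here is a minimal NumPy sketch (the function names and example values are mine, not from the article) that computes BCE both directly and via \( p_t \):

```python
import numpy as np

def bce_loss(p, y, eps=1e-7):
    """Binary cross entropy, written directly from the definition."""
    p = np.clip(p, eps, 1 - eps)                      # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_loss_pt(p, y, eps=1e-7):
    """The same loss expressed through p_t = p if y == 1 else 1 - p."""
    p_t = np.where(y == 1, p, 1 - p)
    return -np.log(np.clip(p_t, eps, 1 - eps))

p = np.array([0.9, 0.3, 0.3])   # predicted probability of class 1
y = np.array([1,   1,   0  ])   # true labels
print(bce_loss(p, y))     # same values...
print(bce_loss_pt(p, y))  # ...from both forms
```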

One often-overlooked limitation of BCE loss is that it treats the probability predictions for both classes equally, as seen in the symmetry of the loss function:

Article image

For better understanding, look at the table below. It shows two instances: one from the minority class and one from the majority class, both having the same loss value:

Article image

This becomes problematic with imbalanced datasets, where instances from the majority class are often "easily classifiable."

Therefore, a loss value of, say, −log(0.3) coming from the majority class should ideally be given less weight than the same loss value coming from the minority class.

Article image

Focal Loss is a useful alternative to address this issue. It is defined as follows:

\[
\mathcal{L}_{\text{FL}} = -(1 - p_t)^{\gamma}\,\log(p_t)
\]

As shown above, Focal Loss multiplies the BCE term by an additional down-weighting factor, \( (1 - p_t)^{\gamma} \), where \( \gamma \) is a hyperparameter.
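As a quick illustration, here is a minimal NumPy sketch of this loss (the function name and default \( \gamma \) are my own choices, not the article's); setting \( \gamma = 0 \) recovers plain BCE:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Focal Loss: BCE scaled by the down-weighting factor (1 - p_t)^gamma."""
    p_t = np.where(y == 1, p, 1 - p)
    p_t = np.clip(p_t, eps, 1 - eps)
    down_weight = (1 - p_t) ** gamma   # close to 0 for confident predictions
    return -down_weight * np.log(p_t)
```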

When plotting BCE (for class y = 1) and Focal Loss (for class y = 1, with \( \gamma = 3 \)), the following curves are obtained:

Article image

As shown in the figure above, Focal Loss reduces the impact of predictions where the model is already confident.

Additionally, a higher value of \( \gamma \) increases the amount of down-weighting, as illustrated in the plot below:

Article image
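To see this effect numerically, here is a small sketch (the probability values and the \( \gamma \) grid are my own, not the article's) comparing an uncertain prediction (\( p_t = 0.3 \)) with a confident one (\( p_t = 0.9 \)):

```python
import numpy as np

gammas = [0.0, 1.0, 3.0, 5.0]          # gamma = 0 recovers plain BCE
for p_t in (0.3, 0.9):
    for gamma in gammas:
        loss = -((1 - p_t) ** gamma) * np.log(p_t)
        print(f"p_t={p_t:.1f}  gamma={gamma:.0f}  loss={loss:.4f}")
```

As \( \gamma \) grows, the loss for the confident prediction collapses toward zero far faster than the loss for the uncertain one.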

However, while Focal Loss reduces the contribution of confident predictions, it isn't a complete solution on its own.

Even with Focal Loss, the function remains symmetric between the two classes, just like BCE:

Article image

To address this, we need to introduce an additional weighting parameter \( \alpha \), which is the inverse of the class frequency, as shown below:

Article image

Therefore, the final loss function is:

\[
\mathcal{L}_{\text{FL}} = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)
\]
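The article doesn't show its implementation, so here is one possible PyTorch sketch of the \( \alpha \)-balanced Focal Loss (the function name and defaults are assumptions; \( \alpha = 0.9 \) roughly reflects inverse-frequency weighting when the positive class makes up about 10% of the data):

```python
import torch

def balanced_focal_loss(probs, targets, alpha=0.9, gamma=2.0, eps=1e-7):
    """Alpha-balanced Focal Loss for binary classification.

    probs   : predicted probability of the positive class, shape (N,)
    targets : ground-truth labels as floats in {0., 1.}, shape (N,)
    alpha   : weight for the positive (minority) class; the negative class gets 1 - alpha
    gamma   : down-weighting strength (gamma = 0 recovers alpha-weighted BCE)
    """
    p_t = torch.where(targets == 1, probs, 1 - probs).clamp(eps, 1 - eps)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t)).mean()
```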

By combining down-weighting and inverse weighting, the model focuses more on learning patterns from difficult examples, rather than being overly confident with easy ones.

To evaluate the effectiveness of Focal Loss in handling class imbalance, I created a dummy classification dataset with a 90:10 imbalance ratio:

Article image
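The article doesn't include the data-generation code; a dataset like this can be reproduced roughly with scikit-learn's `make_classification` (the exact parameter values below are my assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Two informative features, two classes with a roughly 90:10 imbalance
X, y = make_classification(
    n_samples=2000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    weights=[0.9, 0.1],   # ~90% majority class, ~10% minority class
    class_sep=1.0,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```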

Next, I trained two neural network models, both with the same two-hidden-layer architecture (a minimal training sketch follows the list below):

  • One using BCE loss
  • One using Focal Loss
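The exact architecture and training code aren't shown in the article; the sketch below is one way to set this up in PyTorch, reusing `balanced_focal_loss` from earlier and the `X_train`/`y_train` split from the dataset sketch (layer sizes, epochs, and learning rate are all assumptions):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small classifier with two hidden layers, outputting P(y = 1)."""
    def __init__(self, in_dim=2, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train(model, X, y, loss_fn, epochs=200, lr=1e-2):
    X = torch.as_tensor(X, dtype=torch.float32)
    y = torch.as_tensor(y, dtype=torch.float32)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return model

bce_model   = train(MLP(), X_train, y_train, nn.BCELoss())
focal_model = train(MLP(), X_train, y_train, balanced_focal_loss)
```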

The decision region plot and test accuracy for these models are shown below:

Article image

It is evident that:

  • The model trained with BCE loss (left) predominantly predicts the majority class.
  • The model trained with Focal Loss (right) pays more attention to patterns in the minority class. Consequently, it performs better.