
Badreddine Chaguer

Senior Data Scientist / Co-founder

Focal Loss vs. Binary Cross Entropy Loss

September 11, 2024

Binary classification models are usually trained with the binary cross entropy (BCE) loss function:

\[
\mathcal{L}_{\text{BCE}} = -\big[\, y \log(p) + (1 - y)\,\log(1 - p) \,\big]
\]

where \( p \) is the predicted probability of the positive class and \( y \in \{0, 1\} \) is the true label.

For simplicity, let’s define \( p_t \) as follows:

\[
p_t =
\begin{cases}
p & \text{if } y = 1 \\
1 - p & \text{otherwise}
\end{cases}
\]

…then we can also express the cross-entropy loss function as:

\[
\mathcal{L}_{\text{BCE}} = -\log(p_t)
\]
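To make the two equivalent forms concrete, here is a minimal NumPy sketch (the function names and example values are mine, not from the article) that computes BCE both directly and via \( p_t \):

```python
import numpy as np

def bce_loss(p, y, eps=1e-7):
    """Binary cross entropy, written directly from the definition."""
    p = np.clip(p, eps, 1 - eps)                      # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_loss_pt(p, y, eps=1e-7):
    """The same loss expressed through p_t = p if y == 1 else 1 - p."""
    p_t = np.where(y == 1, p, 1 - p)
    return -np.log(np.clip(p_t, eps, 1 - eps))

p = np.array([0.9, 0.3, 0.3])   # predicted probability of class 1
y = np.array([1,   1,   0  ])   # true labels
print(bce_loss(p, y))     # same values...
print(bce_loss_pt(p, y))  # ...from both forms
```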

One often-overlooked limitation of BCE loss is that it treats the probability predictions for both classes equally, as seen in the symmetry of the loss function:

Article image

For better understanding, look at the table below. It shows two instances: one from the minority class and one from the majority class, both having the same loss value:

Article image

This becomes problematic with imbalanced datasets, where instances from the majority class are often "easily classifiable."

Therefore, a loss value of, say, −log(0.3) coming from the majority class should ideally be given less weight than the same loss value coming from the minority class.

Article image

Focal Loss is a useful alternative to address this issue. It is defined as follows:

\[
\mathcal{L}_{\text{FL}} = -(1 - p_t)^{\gamma}\,\log(p_t)
\]

As shown above, Focal Loss multiplies the BCE term by an additional down-weighting factor, \( (1 - p_t)^{\gamma} \), where \( \gamma \) is a hyperparameter.
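As a quick illustration, here is a minimal NumPy sketch of this loss (the function name and default \( \gamma \) are my own choices, not the article's); setting \( \gamma = 0 \) recovers plain BCE:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Focal Loss: BCE scaled by the down-weighting factor (1 - p_t)^gamma."""
    p_t = np.where(y == 1, p, 1 - p)
    p_t = np.clip(p_t, eps, 1 - eps)
    down_weight = (1 - p_t) ** gamma   # close to 0 for confident predictions
    return -down_weight * np.log(p_t)
```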

When plotting BCE (for class y = 1) and Focal Loss (for class y = 1, with \( \gamma = 3 \)), the following curves are obtained:

Article image

As shown in the figure above, Focal Loss reduces the impact of predictions where the model is already confident.

Additionally, a higher value of \( \gamma \) increases the amount of down-weighting, as illustrated in the plot below:

Article image
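To see this effect numerically, here is a small sketch (the probability values and the \( \gamma \) grid are my own, not the article's) comparing an uncertain prediction (\( p_t = 0.3 \)) with a confident one (\( p_t = 0.9 \)):

```python
import numpy as np

gammas = [0.0, 1.0, 3.0, 5.0]          # gamma = 0 recovers plain BCE
for p_t in (0.3, 0.9):
    for gamma in gammas:
        loss = -((1 - p_t) ** gamma) * np.log(p_t)
        print(f"p_t={p_t:.1f}  gamma={gamma:.0f}  loss={loss:.4f}")
```

As \( \gamma \) grows, the loss for the confident prediction collapses toward zero far faster than the loss for the uncertain one.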

However, while Focal Loss reduces the contribution of confident predictions, it isn't a complete solution on its own.

Even with Focal Loss, the function remains symmetric between the two classes, just like BCE:

Article image

To address this, we need to introduce an additional weighting parameter \( \alpha \), which is the inverse of the class frequency, as shown below:

Article image

Therefore, the final loss function is:

\[
\mathcal{L}_{\text{FL}} = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)
\]
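The article doesn't show its implementation, so here is one possible PyTorch sketch of the \( \alpha \)-balanced Focal Loss (the function name and defaults are assumptions; \( \alpha = 0.9 \) roughly reflects inverse-frequency weighting when the positive class makes up about 10% of the data):

```python
import torch

def balanced_focal_loss(probs, targets, alpha=0.9, gamma=2.0, eps=1e-7):
    """Alpha-balanced Focal Loss for binary classification.

    probs   : predicted probability of the positive class, shape (N,)
    targets : ground-truth labels as floats in {0., 1.}, shape (N,)
    alpha   : weight for the positive (minority) class; the negative class gets 1 - alpha
    gamma   : down-weighting strength (gamma = 0 recovers alpha-weighted BCE)
    """
    p_t = torch.where(targets == 1, probs, 1 - probs).clamp(eps, 1 - eps)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t)).mean()
```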

By combining down-weighting and inverse weighting, the model focuses more on learning patterns from difficult examples, rather than being overly confident with easy ones.

To evaluate the effectiveness of Focal Loss in handling class imbalance, I created a dummy classification dataset with a 90:10 imbalance ratio:

Article image
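The article doesn't include the data-generation code; a dataset like this can be reproduced roughly with scikit-learn's `make_classification` (the exact parameter values below are my assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Two informative features, two classes with a roughly 90:10 imbalance
X, y = make_classification(
    n_samples=2000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    weights=[0.9, 0.1],   # ~90% majority class, ~10% minority class
    class_sep=1.0,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```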

Next, I trained two neural network models, both with the same two-hidden-layer architecture (a minimal training sketch follows the list below):

  • One using BCE loss
  • One using Focal Loss
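The exact architecture and training code aren't shown in the article; the sketch below is one way to set this up in PyTorch, reusing `balanced_focal_loss` from earlier and the `X_train`/`y_train` split from the dataset sketch (layer sizes, epochs, and learning rate are all assumptions):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small classifier with two hidden layers, outputting P(y = 1)."""
    def __init__(self, in_dim=2, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train(model, X, y, loss_fn, epochs=200, lr=1e-2):
    X = torch.as_tensor(X, dtype=torch.float32)
    y = torch.as_tensor(y, dtype=torch.float32)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return model

bce_model   = train(MLP(), X_train, y_train, nn.BCELoss())
focal_model = train(MLP(), X_train, y_train, balanced_focal_loss)
```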

The decision region plot and test accuracy for these models are shown below:

Article image

It is evident that:

  • The model trained with BCE loss (left) predominantly predicts the majority class.
  • The model trained with Focal Loss (right) pays more attention to patterns in the minority class. Consequently, it performs better.