Hi there
I’m new to this forum and not sure where best to raise the following topic.
The tutorial Classification on imbalanced data first trains a simple sequential net with a sigmoid output activation, and then moves on to class weights and resampling techniques. But the last two plots of the tutorial, ROC and precision-recall, show that, almost regardless of the chosen threshold, the first model outperforms the reweighted/resampled models on all metrics on the test set. So I have 3 questions:
- What is the justification for reweighting and resampling given that they do not result in better models? Also, the stats community does not seem to find a good reason, see https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he.
- Would it make sense to split the logic of the tutorial into two steps: modelling class probabilities (which is what the sigmoid output already provides) and making a decision, i.e. choosing a threshold and only then predicting classes instead of class probabilities? (See the first sketch below.)
- Would it make sense to emphasize the cross entropy / log loss a bit more? Reasoning: the tutorial states that accuracy is not a helpful metric for imbalanced data, but it does not say which metric to prefer instead. Cross entropy, as a proper scoring rule, is a good metric to compare models and find out which one gives the best predictions of the class probabilities. (See the second sketch below.)
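
To make the second question concrete, here is a minimal sketch of the two-step logic. The names `model`, `val_features` and `val_labels` are placeholders for the tutorial's trained Keras model and its validation split, and maximising F1 is just one example of a decision criterion; any cost-based rule could be used instead:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Step 1: the network only estimates class probabilities.
# `model`, `val_features`, `val_labels` stand in for the tutorial's
# trained Keras model and its validation split.
p_val = model.predict(val_features).ravel()

# Step 2: the decision. Pick a threshold on the validation set, here by
# maximising F1; any other cost/utility criterion would work the same way.
precision, recall, thresholds = precision_recall_curve(val_labels, p_val)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]

# Class predictions only appear at the very end, after the threshold is chosen.
val_pred = (p_val >= best_threshold).astype(int)
```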
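And for the third question, a sketch of how the models could be ranked by cross entropy on held-out data; lower log loss means better probability estimates. `baseline_model` and `weighted_model` are placeholders for the models the tutorial trains without and with class weights:

```python
from sklearn.metrics import log_loss

# Placeholders: two trained Keras models (without and with class weights),
# evaluated on the same validation split as above.
for name, m in [("baseline", baseline_model), ("class-weighted", weighted_model)]:
    p = m.predict(val_features).ravel()
    print(f"{name}: log loss = {log_loss(val_labels, p):.4f}")
```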