Aug 29, 2022


Good article! I find that many people resort to balancing the data merely because they insist on building only a hard classifier (rather than one that predicts class probabilities) and implicitly use the default decision threshold of 0.5 on the predicted probability (for binary classification). (Unfortunately, scikit-learn encourages this.) Re-balancing the training set skews the predicted probabilities and is a very indirect, inconvenient, and misleading way of effectively changing the decision threshold. Changing the decision threshold can be done more directly and transparently, and some form of it is needed whenever the cost of a false positive differs from the cost of a false negative, which is almost always the case (especially when the data is imbalanced) when the "classification" is actually a business decision that must manage the trade-off between false negatives and false positives. See my:

How To Deal With Imbalanced Classification, Without Re-balancing the Data (Before considering over-sampling your data, try simply tuning your classification decision threshold)
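
For concreteness, here is a minimal sketch of the direct approach in scikit-learn: train on the original unbalanced data, predict class probabilities, and derive the decision threshold from the misclassification costs instead of using the default 0.5. The dataset, model choice, and cost numbers below are my own assumptions for illustration, not from the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative imbalanced binary dataset (~5% positives); any classifier
# that implements predict_proba works the same way.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Train on the ORIGINAL, unbalanced data -- no over- or under-sampling.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict class probabilities rather than hard labels.
proba = model.predict_proba(X_test)[:, 1]

# Pick the threshold from the business costs, not the default 0.5.
# With calibrated probabilities, predicting positive pays off when
# p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn).
cost_fn, cost_fp = 10.0, 1.0  # assumed costs, for illustration only
threshold = cost_fp / (cost_fp + cost_fn)  # ~0.091 here, not 0.5

y_pred = (proba >= threshold).astype(int)
```

Tuning the threshold this way leaves the predicted probabilities intact, whereas re-balancing the training data would skew them.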



Written by David B Rosen (PhD)

Lead Data Scientist for Automated Credit Approval (WW) at (not speaking for) IBM. linkedin.com/in/rosen1/
