David B Rosen (PhD)
Jun 3, 2022


The cost to the business of a false positive is unlikely to equal the cost of a false negative, so you need to consider multiple decision thresholds on predict_proba() rather than blindly accepting the implicit default threshold of 0.5 (balancing the training data shifts this implicit threshold, but not necessarily by the ideal amount). See my article: https://towardsdatascience.com/how-to-deal-with-imbalanced-classification-without-re-balancing-the-data-8a3c02353fe3
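
For illustration, here is a minimal sketch of what choosing a cost-minimizing threshold on predict_proba() might look like. The dataset, classifier, and the two cost values are placeholder assumptions, not figures from the article:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical business costs (assumptions for illustration only):
COST_FP = 1.0  # cost of acting on a false positive
COST_FN = 5.0  # cost of missing a true positive

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]  # probability of the positive class

# Sweep candidate thresholds on a held-out validation set and pick the
# one with the lowest total expected cost, instead of assuming 0.5.
thresholds = np.linspace(0.01, 0.99, 99)
costs = []
for t in thresholds:
    pred = (proba >= t).astype(int)
    fp = np.sum((pred == 1) & (y_val == 0))
    fn = np.sum((pred == 0) & (y_val == 1))
    costs.append(fp * COST_FP + fn * COST_FN)

best = thresholds[int(np.argmin(costs))]
print(f"Lowest-cost threshold: {best:.2f} (default would be 0.50)")
```

Note that the threshold is tuned on a validation split here; tuning it on the final test set would bias the reported performance.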

Also, it is incorrect to apply ordinary cross-validation to an already-oversampled training set (even one oversampled by ADASYN or SMOTE), because the CV splits will place instances in one fold and copies or derivatives of those same instances in another fold, so the folds are no longer independent. This leads to optimistic bias in the CV results and therefore potentially to choosing the wrong hyperparameters, with excessive overfitting masked by this bias (see the sketch below). This issue and ways to overcome it (in addition to my favorite method, which is to avoid rebalancing in the first place and adjust the threshold instead) are discussed in a paragraph of the same article, which links to another article with a full explanation.
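
A minimal sketch of the contrast, using a synthetic dataset and a placeholder classifier (my own illustration, not code from the article): the biased variant oversamples before cross-validating, while the unbiased variant puts SMOTE inside an imbalanced-learn Pipeline so resampling is re-fit on each training fold only and validation folds stay untouched:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)

# Biased: oversample first, then cross-validate. Synthetic derivatives of
# an instance can land in a different fold than the instance itself, so
# validation folds are contaminated and scores come out optimistic.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf = LogisticRegression(max_iter=1000)
biased = cross_val_score(clf, X_res, y_res, cv=5, scoring="f1")

# Unbiased: SMOTE runs inside the pipeline, on each training fold only.
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", LogisticRegression(max_iter=1000))])
unbiased = cross_val_score(pipe, X, y, cv=5, scoring="f1")

print(f"Oversample-then-CV F1:        {biased.mean():.3f}")
print(f"CV with in-fold oversampling: {unbiased.mean():.3f}")
```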

