Good article -- I was not aware of how flawed the mean decrease in impurity (MDI) method could be!
One pitfall of permutation feature importance: if some important features are highly correlated or redundant with one another, then permuting just one of them will have little effect on performance, because the model can still recover the same information from the others, even though permuting the whole group at once might cause a large loss of performance. One workaround is to look for groups of highly correlated features and permute each group together.
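To make the group idea concrete, here is a minimal NumPy sketch, not a definitive implementation. The helper `grouped_permutation_importance`, the toy `model_predict` (a stand-in for a fitted model that leans equally on two near-duplicate columns), and the synthetic data are all assumptions made up for illustration.

```python
import numpy as np

def grouped_permutation_importance(score_fn, X, y, groups, n_repeats=10, seed=0):
    """Mean drop in score when all columns of a group are permuted together.

    score_fn(X, y) returns a float where higher is better.
    groups maps a label to a list of column indices.
    """
    rng = np.random.default_rng(seed)
    baseline = score_fn(X, y)
    importances = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # One shared row permutation per group: relationships inside the
            # group stay intact, only the link to y (and other columns) breaks.
            perm = rng.permutation(X.shape[0])
            X_perm[:, cols] = X[np.ix_(perm, cols)]
            drops.append(baseline - score_fn(X_perm, y))
        importances[name] = float(np.mean(drops))
    return importances

# Synthetic data (assumption): x0 and x1 are near-duplicates, x2 is pure noise.
rng = np.random.default_rng(42)
n = 2000
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.01 * rng.normal(size=n), rng.normal(size=n)])
y = z + 0.1 * rng.normal(size=n)

def model_predict(X):
    # Toy stand-in for a fitted model that splits its weight across x0 and x1.
    return 0.5 * (X[:, 0] + X[:, 1])

def r2_score_fn(X, y):
    # Coefficient of determination of the toy model's predictions.
    resid = y - model_predict(X)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

imp = grouped_permutation_importance(
    r2_score_fn, X, y,
    groups={"x0": [0], "x1": [1], "x0+x1": [0, 1], "x2": [2]},
)
for name, value in imp.items():
    print(f"{name:6s} importance: {value:.3f}")
```

In this toy setup, permuting x0 or x1 alone leaves the model half of its signal, so each looks only moderately important, while permuting the pair together removes the signal entirely.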
Usually linear (Pearson) correlation would be used for this grouping, but the close relationship among some features could be nonlinear. So one might instead try, for example, the area under the ROC curve (AUC) of each feature as the sole predictor of each other feature, over all possible pairs, allowing each relationship to run in either the positive or the negative direction. This could be tricky for categorical features with many values, though.
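A rough sketch of the pairwise AUC idea follows, under assumptions of my own: AUC needs a binary target, so each target feature is split at its median here, and the helper names `auc` and `pairwise_auc_affinity` are hypothetical. Note that this only picks up monotonic relationships (linear or not); something like a symmetric U-shape would still score near 0.5.

```python
import numpy as np

def auc(scores, labels):
    """ROC area via pairwise comparison: P(score_pos > score_neg), ties count 0.5."""
    pos = scores[labels]
    neg = scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def pairwise_auc_affinity(X):
    """A[i, j] = direction-free AUC of feature i predicting 'feature j > its median'."""
    n_features = X.shape[1]
    A = np.ones((n_features, n_features))
    for j in range(n_features):
        # Assumption: binarize the target feature at its median.
        target = X[:, j] > np.median(X[:, j])
        for i in range(n_features):
            if i == j:
                continue
            a = auc(X[:, i], target)
            # Allow the relationship to be positive or negative.
            A[i, j] = max(a, 1.0 - a)
    return A

# Synthetic data (assumption): x1 and x2 are nonlinear monotone functions of x0,
# x3 is unrelated noise.
rng = np.random.default_rng(0)
n = 400
x0 = rng.normal(size=n)
X = np.column_stack([
    x0,
    np.exp(2.0 * x0),                         # increasing, strongly nonlinear
    -x0 ** 3 + 0.01 * rng.normal(size=n),     # decreasing, nonlinear
    rng.normal(size=n),                       # independent
])

A = pairwise_auc_affinity(X)
print(np.round(A, 2))
```

Features whose affinity with each other exceeds some chosen threshold could then be grouped and permuted together. The pairwise comparison in `auc` is quadratic in the number of rows, so a rank-based formula would be the better choice for large data sets.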