Always check the evidence in Weight of Evidence

During one of ADC’s recent projects, we developed a behavioral scorecard, modelling the probability of a client becoming “At-Risk” in the coming months. We used a technique called Weight of Evidence (WoE) scaling on our data. Some blogs promote WoE scaling as a way to automate the data preparation process. I’m not so sure you should, and here’s why. For those of you not familiar with the technique, there are good explanations here https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html and here http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/.

The Pros

WoE scaling is popular in credit risk modelling, so it must have some attractive features. And it does. WoE mitigates the need for extensive outlier analysis, as extreme values simply end up in the first and last bins and thus won’t ruin the fit of your model. Since bins are used, we can create an extra bin for NAs, so no treatment of missing values is necessary either. Then there is the modelling of non-linear effects. Figure 1 is a nice example of non-linearity that would be hard to model without WoE, because of the quirky nature of the relationship. Finally, fewer parameters are needed for categorical variables: each factor level is replaced by a single numerical WoE value, so only one coefficient has to be estimated. So far, so good, right? It seems like a great technique that solves a lot of modelling problems for you. Then why am I not convinced about WoE scaling?

Figure 1: Red dots are WoE values, grey bars are number of observations per bin
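
To make this concrete, here is a minimal sketch of what the transformation might look like in pandas. The data, column names, and bin count are all invented for illustration, and I use the ln(% goods / % bads) convention for WoE.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Invented data: a skewed numeric feature with missing values and a
# binary target (1 = "At-Risk", 0 = not).
income = rng.lognormal(mean=10, sigma=1, size=n)
p_risk = 1 / (1 + np.exp(np.log(income) - 10))    # risk falls with income
target = rng.binomial(1, p_risk)
income[rng.random(n) < 0.05] = np.nan             # 5% missing

df = pd.DataFrame({"income": income, "target": target})

# Bin into quantiles; missing values get their own bin, so no imputation is
# needed, and extreme values simply land in the outer bins.
df["bin"] = pd.qcut(df["income"], q=10)
df["bin"] = df["bin"].cat.add_categories("missing").fillna("missing")

# WoE per bin: ln(share of all goods in the bin / share of all bads in it).
stats = df.groupby("bin", observed=True)["target"].agg(bads="sum", total="count")
stats["goods"] = stats["total"] - stats["bads"]
stats["woe"] = np.log((stats["goods"] / stats["goods"].sum())
                      / (stats["bads"] / stats["bads"].sum()))

# Replace the raw feature by its bin's WoE value; a categorical variable is
# handled the same way, with one WoE value per level.
df["income_woe"] = df["bin"].map(stats["woe"])
print(stats)
```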

The Cons

My concerns with WoE scaling are both practical and conceptual. Let’s start with the practical issue: binning. WoE works on binned numerical variables, and the log odds ratio is calculated per bin. But how should I choose my bins? Do I use empirical quantiles? Do I divide the range of the variable into equal parts? Or should I choose the bins by hand? The next question that naturally arises is: how many bins do I choose? Too few bins cause harsh aggregation that loses much of the information in the variable. Too many bins may cause overfitting in the WoE transformation and may leave too few observations per bin. And when is that number of observations “low”, and when is it “sufficient”?
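
To illustrate how arbitrary these choices are, here is a small sketch that computes WoE for the same simulated variable under two common binning schemes; the data and the strength of the relationship are again invented. With a skewed feature, equal-width bins leave the outer bins nearly empty, and a bin that happens to contain only goods or only bads even sends the log odds off to ±infinity.

```python
import numpy as np
import pandas as pd

def woe_table(bins, y):
    """Observation counts and WoE (ln % goods / % bads) per bin."""
    df = pd.DataFrame({"bin": bins, "y": y})
    stats = df.groupby("bin", observed=True)["y"].agg(bads="sum", total="count")
    stats["goods"] = stats["total"] - stats["bads"]
    stats["woe"] = np.log((stats["goods"] / stats["goods"].sum())
                          / (stats["bads"] / stats["bads"].sum()))
    return stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=5_000)             # skewed feature
y = rng.binomial(1, 1 / (1 + np.exp(0.5 - 0.3 * x)))   # risk rises with x

# Same variable, same target, two defensible binning choices, and thus two
# different WoE transformations of the same information:
print(woe_table(pd.qcut(x, q=5), y))     # equal-frequency bins
print(woe_table(pd.cut(x, bins=5), y))   # equal-width bins: sparse at the top
```

Both schemes are defensible, yet they imply different transformations of the same variable, and nothing in the procedure itself tells you which one to trust.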

The second, more conceptual issue I see with the WoE transformation is that it fixes non-linear effects in place. When doing a multivariate regression on WoE-transformed data, we essentially model a linear combination of the univariate, possibly non-linear relations. But what if the non-linearity that is present univariately can be explained by other variables? Since the relationships were fixed beforehand, a correction is no longer possible. Maybe the quadratic relationship in Figure 2 would have disappeared once we added more variables. The dips and peaks in the graph look like outliers, but they are based on around 50K and 100K observations respectively. What to do there? Without investigating why these relationships exist, a multivariate regression using WoE is just smacking a bunch of univariate relationships together and hoping for the best.

Figure 2: Again, red dots are WoE values, grey bars are number of observations per bin
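
This failure mode is easy to reproduce. In the sketch below (all assumptions invented), the target truly depends only on a second variable x2, which happens to be a noisy function of x1. Univariately, the WoE of x1 shows a convincing quadratic pattern, yet once x2 enters the regression the effect of x1 vanishes. A WoE transformation of x1 fitted up front would have locked that spurious shape into the model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 50_000

# The target depends only on x2, but x2 is a noisy function of x1, so x1
# *looks* quadratically related to the risk.
x1 = rng.normal(size=n)
x2 = x1**2 + rng.normal(scale=0.5, size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(x2 - 1))))

# Univariate WoE of x1: high in the middle bins, low at the edges, i.e. a
# clean quadratic pattern that invites a "non-linear effect" story.
df = pd.DataFrame({"bin": pd.qcut(x1, q=10), "y": y})
stats = df.groupby("bin", observed=True)["y"].agg(bads="sum", total="count")
stats["goods"] = stats["total"] - stats["bads"]
stats["woe"] = np.log((stats["goods"] / stats["goods"].sum())
                      / (stats["bads"] / stats["bads"].sum()))
print(stats["woe"])

# Control for x2 and the "effect" of x1 vanishes: its coefficient is ~0.
model = LogisticRegression().fit(np.column_stack([x1, x2]), y)
print(model.coef_)    # first entry (x1) near zero, second (x2) clearly not
```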

The Bottom Line

I think WoE scaling is a nice technique that offers a lot of advantages in the modelling process. However, I disagree with the notion that it can be used to effectively automate the processing of input variables for logistic regression. If explainability of the model is not a requirement and non-linear effects are suspected to be present, you might just as well go for full-blown machine learning approaches like tree-based methods or neural networks. If explainability is required, all WoE transformations should be investigated and hand-checked by the modeler, to ensure that you end up with a correct, well-supported model. In short: Always check the evidence.