I have always felt that the concept of cross-validation is a bit of a hack. It is a fundamentally random process: not only will you get different results from one run to the next, but you can also "accidentally" end up with training or validation sets that aren't representative of the full dataset. This, in turn, can lead to various specific problems, which I will get into below. That being said, statistical tests of significance are a complicated world to navigate when you cannot assume normality. For the sake of simplicity, then, I tend to fall back on cross-validation. Here are some hacks that I apply to the hack that is cross-validation, in order to make it a bit more reliable.

a) The problem of "extreme" values

The other day I used cross-validation on a dataset that contained a handful of extreme values of the predictor (in the 50,000 to 500,000 range), while the rest of the dataset was concentrated in the 0 to 5,000 range. To be clear, these are not outliers with respect to the regression, i.e. their Cook's distance is not significantly different from that of the rest of the data. The result was that, due to the random nature of the split between the training and validation sets, these few extreme values all ended up in the validation set. The training data was modeled with some polynomial that worked well within the training data. But then, when it came time to validate the model, the polynomial extrapolated to ridiculous values, and the cross-validated Mean Absolute Error was horrible.

Here is what it looked like, where the blue dots are the raw data (because the bad model inflates the scale, you don't see many dots, but in fact several data points sit on top of each other) and the orange line is the fit of the polynomial model: [figure omitted]. Another, similar situation gave a graph like this one: [figure omitted].

Ok, so I overfitted and cross-validation told me about it... how is that broken? Cross-validation did what it was intended to do, no? Yes, it did.
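A small synthetic sketch of this failure mode (the square-root relationship, the degree-4 polynomial, and the unlucky split are all invented for illustration, not the actual dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the situation described: most x values lie in 0..5000,
# with a handful of extreme points far beyond that range.
x_bulk = rng.uniform(0, 5000, 200)
x_extreme = np.array([50_000.0, 120_000.0, 500_000.0])
y_bulk = 2.0 * np.sqrt(x_bulk) + rng.normal(0, 5, x_bulk.size)
y_extreme = 2.0 * np.sqrt(x_extreme)

# An unlucky random split: every extreme point lands in the validation set.
x_train, y_train = x_bulk, y_bulk
x_val, y_val = x_extreme, y_extreme

# A polynomial fits the bulk of the data well...
coefs = np.polyfit(x_train, y_train, deg=4)
mae_train = np.mean(np.abs(np.polyval(coefs, x_train) - y_train))

# ...but extrapolates wildly on the extreme validation points,
# so the cross-validated MAE explodes.
mae_val = np.mean(np.abs(np.polyval(coefs, x_val) - y_val))
print(f"train MAE: {mae_train:.1f}, validation MAE: {mae_val:.3g}")
```

The point is that the validation MAE is orders of magnitude larger than the training MAE, even though the model is perfectly reasonable over the domain it was actually trained on.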
The problem, however, is that I failed to model the real underlying relationship, because the training data didn't contain crucial data points. And this failure is a direct result of splitting the data for cross-validation purposes. In other words, I overfitted because of cross-validation, and then it came back and slapped me on the wrist for it!
In fact, if we take a step back and look at what happened, we see that I modeled a dataset over a given domain (roughly 0 to 5,000), and then validated the model by estimating values beyond the domain for which it was built. That is extrapolation. It is a basic fact of statistics that you cannot extrapolate beyond the domain of a given model and expect it to give sensible results. It follows that one should never produce a validation set containing values of the independent variables above or below the domain of the training set. And yet, as far as I know, this caveat isn't commonly mentioned in statistical learning textbooks (or maybe I missed the small print!).

My solution to this problem was to sort the dataset along the x-axis and assign the top and bottom 5th percentiles of the data (the choice of percentile is itself arbitrary) directly to the training set. The validation set is then selected randomly from the remaining data. This way I guarantee that the extreme values belong to the training set, not the validation set, and therefore that the validation process never extrapolates beyond the trained domain.
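The procedure above can be sketched as a small helper (the function name, the 5% tail fraction, and the 20% validation fraction are my own illustrative choices):

```python
import numpy as np

def extreme_aware_split(x, val_frac=0.2, tail_pct=5, seed=None):
    """Return (train_idx, val_idx), forcing the lowest and highest
    `tail_pct` percent of x values into the training set so validation
    never extrapolates beyond the trained domain."""
    rng = np.random.default_rng(seed)
    order = np.argsort(x)
    n_tail = max(1, int(len(x) * tail_pct / 100))

    # Extreme x values (both tails) go straight into the training set.
    tail_idx = np.concatenate([order[:n_tail], order[-n_tail:]])
    middle_idx = order[n_tail:-n_tail]

    # The validation set is drawn randomly from the remaining middle.
    shuffled = rng.permutation(middle_idx)
    n_val = int(len(x) * val_frac)
    val_idx = shuffled[:n_val]
    train_idx = np.concatenate([tail_idx, shuffled[n_val:]])
    return train_idx, val_idx

# Quick check on toy data: validation x values stay inside the training domain.
x = np.random.default_rng(1).uniform(0, 500_000, 100)
tr, va = extreme_aware_split(x, seed=0)
assert x[va].min() >= x[tr].min() and x[va].max() <= x[tr].max()
```

Note that this only constrains the domain of a single predictor; with several independent variables you would want to apply the same idea to each one (or to some notion of the data's convex hull).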
Author: Simon Ouellette
March 2018