I have always felt that cross-validation is a bit of a hack. It's a fundamentally random process: not only will you get different results from one run to the next, but you can also "accidentally" end up with training or validation sets that aren't representative of the full dataset. This, in turn, can lead to various specific problems, which I will get into below.
That being said, statistical tests of significance are a complicated world to navigate when you cannot assume normality. For the sake of simplicity then, I tend to fall back to cross-validation. Here are some hacks that I apply to the hack that is cross-validation, in order to make it a bit more reliable.
a) The problem of "extreme" values
The other day I used cross-validation on a dataset that contained a handful of extreme values of the predictor (in the 50,000 to 500,000 range), while the rest of the dataset was concentrated in the 0 to 5,000 range. To be clear, these are not outliers in the regression sense: their Cook's distance isn't significantly different from that of the rest of the data.
The result was that, due to the random nature of the split between the training and validation sets, these few extreme values all ended up in the validation set. I fitted a polynomial that worked well within the training data. But when it came to validating the model, the polynomial extrapolated to ridiculous values, and the cross-validated Mean Absolute Error was horrible. Here is what it looked like:
Here the blue dots are the raw data (because the bad model inflates the scale you don't see many dots, but several data points sit on top of each other) and the orange line is the fit of the polynomial model.
Another similar situation gave a graph like this one:
OK, so I over-fitted and cross-validation told me about it... how is that broken? Cross-validation did exactly what it was intended to do, no? Yes, it did. The problem, however, is that I failed to model the real underlying relationship, because the training data was missing crucial data points, and that failure is a direct result of splitting the data for cross-validation purposes. In other words, I over-fitted because of cross-validation, and then it came back and slapped me on the wrist for it!
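A minimal synthetic sketch of this failure mode (all numbers are made up to mimic the situation above: the bulk of the x-values live in 0 to 5,000, a handful of extreme points sit far beyond, and the true relationship is taken to be a square root purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# The "training" bulk: x in 0..5,000 with a gentle concave trend plus noise.
x_train = rng.uniform(0, 5000, 200)
y_train = np.sqrt(x_train) + rng.normal(0, 1, 200)

# Fit a degree-6 polynomial on a rescaled x (for numerical stability).
# It fits the training range quite well...
coefs = np.polyfit(x_train / 5000, y_train, deg=6)
train_mae = np.mean(np.abs(np.polyval(coefs, x_train / 5000) - y_train))

# ...but explodes when evaluated far outside that range, which is exactly
# what a validation set full of extreme x-values forces it to do.
x_val = np.array([50_000.0, 200_000.0, 500_000.0])
y_val = np.sqrt(x_val)
val_mae = np.mean(np.abs(np.polyval(coefs, x_val / 5000) - y_val))

print(f"MAE inside the trained domain: {train_mae:.2f}")
print(f"MAE when extrapolating:        {val_mae:.3g}")
```

The in-domain MAE is on the order of the noise, while the extrapolated MAE is astronomically larger: the polynomial's leading term dominates as soon as you leave the fitted domain.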
In fact, if we take a step back and look at what happened, we see that I modeled a dataset over a given domain (0 to 5,000-ish), and then validated the model by estimating values beyond the domain over which it was built. This is extrapolation. It is a basic fact of statistics that you cannot extrapolate beyond the domain of a given model and expect it to give sensible results. It follows that one should never produce a validation set containing values of the independent variables that fall outside the domain of the training set. And, as far as I know, this isn't a caveat commonly mentioned in statistical learning theory textbooks (or maybe I missed the small print!).
My solution to this problem was to sort my dataset by the x-axis values and place the top and bottom 5% of the data (the choice of percentile is itself arbitrary) directly into my training set. The validation set is then selected randomly from the remaining data. This guarantees that the extreme values belong to the training set rather than the validation set, and that, therefore, the validation process will not extrapolate beyond the trained domain.
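The splitting scheme above can be sketched as follows. The function name, the `val_frac` parameter, and the default validation fraction of 25% are my own illustrative choices; only the "force the top and bottom 5% into training" rule comes from the recipe itself:

```python
import numpy as np

def split_extremes_to_train(x, y, val_frac=0.25, pct=5, rng=None):
    """Split (x, y) so that extreme x-values always land in the training set.

    The bottom and top `pct` percent of points (sorted by x) go straight
    into training; the validation set is drawn randomly from the remaining
    "interior" points. Validation therefore stays inside the trained
    domain and the model is never asked to extrapolate.
    """
    rng = np.random.default_rng(rng)
    x, y = np.asarray(x), np.asarray(y)
    order = np.argsort(x)
    k = max(1, int(np.ceil(len(x) * pct / 100)))

    extreme_idx = np.concatenate([order[:k], order[-k:]])  # forced into training
    interior_idx = order[k:-k]

    # Randomly pick the validation set from the interior points only.
    n_val = int(round(len(interior_idx) * val_frac))
    shuffled = rng.permutation(interior_idx)
    val_idx = shuffled[:n_val]
    train_idx = np.concatenate([extreme_idx, shuffled[n_val:]])

    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])
```

By construction, every x-value in the validation set lies between the smallest and largest x-values of the training set, so the validation error measures interpolation quality only.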