I just finished reading Marcos Lopez de Prado’s chapter on Fractional Differencing in his new book, Advances in Financial Machine Learning. I have 2 questions/concerns about it:
1. If integrated time series are vulnerable to spurious regression, why wouldn’t fractionally integrated time series be “fractionally spurious”?
2. Is it really true that there can be some predictive information in the levels (or “memory”) of the time series which can be captured by fractional differencing, but not integer differencing?
Because my background is machine learning, rather than pure math, I’m going to try to answer these questions using data and empirical evidence, rather than mathematical deduction.
To attempt to answer the first question, I rely on the fact that if you correlate various unrelated random walks with each other (without differencing them first), it will produce erroneously large correlation coefficients. This is how I will estimate the extent to which fractional differencing reduces (if at all) “spuriousness”. The traditional approach is to transform the fully integrated time series (i.e., I(1)) into a stationary I(0) time series by differencing it once (d = 1). The claim of fractional differencing is that it isn’t necessary to difference to order 1: a fractional value, such as 0.5, can be sufficient. Specifically, I will do the following:
a) I will generate 1000 random walks.
b) I will correlate them between each other, as is, in their original I(1) form. I expect to see a certain amount of spurious correlation in the form of correlation coefficients that differ statistically from zero.
c) I will then start differencing these time series with d = 0.1. I will re-run the correlations. I will repeat this with d = 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0 (i.e. traditional differencing of order 1). For each analysis, I will keep the mean absolute correlation of the results.
d) I will then plot, for each value of d, from 0 to 1, the corresponding mean absolute rho that was obtained for each.
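The steps above can be sketched in plain Python. This is a scaled-down illustration (10 walks of 500 points rather than 1000, so it runs quickly); the weight recursion w_k = -w_{k-1} * (d - k + 1) / k and the fixed 50-observation window are my own choices for the sketch, not necessarily those of the original code:

```python
import random
random.seed(0)

def frac_diff_weights(d, n):
    # w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k  (binomial expansion of (1 - B)^d)
    w = [1.0]
    for k in range(1, n):
        w.append(-w[-1] * (d - k + 1) / k)
    return w

def frac_diff(series, d, window=50):
    # fixed-width window fractional differencing
    w = frac_diff_weights(d, window)
    return [sum(w[k] * series[t - k] for k in range(window))
            for t in range(window, len(series))]

def mean_abs_corr(series_list):
    def corr(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        va = sum((x - ma) ** 2 for x in a)
        vb = sum((y - mb) ** 2 for y in b)
        return cov / (va * vb) ** 0.5
    cs = [abs(corr(series_list[i], series_list[j]))
          for i in range(len(series_list))
          for j in range(i + 1, len(series_list))]
    return sum(cs) / len(cs)

# generate a handful of unrelated random walks
walks = [[0.0] for _ in range(10)]
for walk in walks:
    for _ in range(500):
        walk.append(walk[-1] + random.gauss(0, 1))

raw = mean_abs_corr(walks)                                  # spuriously large
diffed = mean_abs_corr([frac_diff(w, 1.0) for w in walks])  # close to zero
```

Looping the last line over d = 0.1 through 1.0 and plotting the results gives the chart described below.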
Here is the code. Here are the results:
The relationship between the value of d and the corresponding amount of spurious correlation discovered is not linear. We can see that it drops and quickly reaches diminishing returns at around d = 0.75, where it falls below the threshold of statistical significance (the orange line). This means that there is no significant decrease in spurious correlation associated with increasing d values, once we’ve reached approximately d = 0.75 (in the case of my artificial time series – the plateau might be reached at different values with differently constructed time series).
In other words, it appears that fractional differencing does indeed remove the risk of spurious correlation to the same extent as full differencing does, as long as we stay in that optimal range.
The next question then is: can we gain an improvement in predictive power from not using d = 1 (if the time series has long-term memory)? In an attempt to answer the second question, I will try to build a data generating process with memory, such that the same model performs more poorly when fitted on an I(0) version of it than when fitted on a fractionally differenced version of the time series. This will demonstrate that integer differencing eliminates some predictive component that fractional differencing preserves.
The goal of that model will be to predict changes out of sample in the underlying time series, using as predictor lags of the modified (i.e. fully or fractionally differentiated) version of itself. After all, out of sample predictive power is the ultimate test for anything. If it passes this test, I’m happy.
Here is the code. Here are the results (note that, in this chart, the R2 value reported is relative to the R2 obtained by using d = 1, on the same validation set with the same model):
My time series with long-term memory is an Ornstein-Uhlenbeck process with a deterministic trend. The trend is there to make it non-stationary (otherwise, it would already be stationary – no need to difference it!).
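A minimal sketch of such a data generating process might look like this (the parameter values theta, sigma and trend are illustrative choices on my part, not the ones behind the chart):

```python
import random
random.seed(0)

def ou_with_trend(n, theta=0.1, mu=0.0, sigma=1.0, trend=0.05):
    # discretized Ornstein-Uhlenbeck level plus a deterministic linear trend
    out, level = [], 0.0
    for t in range(n):
        level += theta * (mu - level) + random.gauss(0, sigma)  # mean reversion + noise
        out.append(level + trend * t)  # the trend makes the series non-stationary
    return out
```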
On 50,000 artificially generated data points from my Ornstein-Uhlenbeck process with trend, I keep 10% as validation set. I train a simple regression model, where I try to predict the change (d = 1) in the time series at t, given the fractionally differenced value at t - 1.
I fit this model 11 times, using the values of d 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0. In each case, I also run the Augmented Dickey-Fuller test on the differenced time series, to see whether it is considered stationary or not. Notice in the chart that the relative R2 begins to increase after the ADF p-value hits 0. Then, it starts decreasing again until we reach d = 1. In other words: the optimal d value is around 0.5.
We showed that fractional differencing eliminates spurious regression as thoroughly as full differencing, as long as we choose a reasonable value of d. This optimal value can be determined by running an ADF test (the time series should be sufficiently differenced to pass the test), while simultaneously making sure that the time series is well correlated to the original, I(1) version.
In addition to that, we confirmed the claim that the levels of some time series contain important predictive information which, if we do not eliminate by full differencing, can be used to produce models of superior out of sample (non-spurious) predictive ability.
Thanks to Mirza Trokic for the Python fractional differencing code!
The word entropy could essentially be replaced with “randomness”. Generally used in the context of classification problems (in machine learning, that is), entropy tells us how random the distribution of a certain class in the dataset is.
Let’s take a binary classification problem, to simplify things. We are training a model to predict whether the stock market will go up or down tomorrow. If, in our dataset, 50% of the time the market goes up, and 50% of the time the market goes down, then the next day return of the stock market has a very high entropy (it’s purely random). Mathematically, the binary information entropy is calculated as:
So if you do the math, the entropy of the stock market example is 1. If instead the stock market went up 100% of the time and went down 0% of the time (or vice versa), the entropy would fall to 0.
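For the curious, here is a tiny Python version of that calculation (using log base 2, with the usual 0 * log(0) = 0 convention):

```python
from math import log2

def binary_entropy(p):
    # H(p) = -p * log2(p) - (1 - p) * log2(1 - p), with 0 * log(0) taken as 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)
```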
What then, is cross-entropy? In machine learning, it is a measure of the “distance” or “divergence” between the model’s distribution of class outcomes and the dataset’s distribution of class outcomes. More specifically, for each “actual vs predicted” pair in our dataset:
Here the actual value will be 1 or 0 (instead of some kind of probability), but the predicted value used in the formula is the predicted probability, not the most likely class. Otherwise, notice the similarity to the information entropy formula above. The other difference is that in cross-entropy we use the actual value relative to the predicted value.
Intuitively, it means that when the dataset says “1” and we predict a probability of “1”, we add 0 (the math is: -1 x log(1) – 0 x log(0)) to the total error sum. When we predict a probability of “0.1” but the dataset said “1”, we add 3.322 (the math is: -1 x log(0.1) – 0 x log(0.9)) to the total error sum. So the farther the predicted probability is from the actual label, the higher the error will be.
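Here is a sketch of that per-observation cross-entropy in Python (log base 2 to match the numbers above; the clipping epsilon is my own addition to avoid log(0) on perfectly confident predictions):

```python
from math import log2

def cross_entropy(actual, predicted_prob):
    # -y * log2(p_hat) - (1 - y) * log2(1 - p_hat)
    eps = 1e-12  # avoid log(0) for fully confident predictions
    p = min(max(predicted_prob, eps), 1 - eps)
    return -actual * log2(p) - (1 - actual) * log2(1 - p)
```

Summing (or averaging) this quantity over all the "actual vs predicted" pairs gives the total cross-entropy loss.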
In part 3 of the SVM series, we saw the theoretical framework behind the solution to the optimal separating hyperplane.
Here we are going to quickly discuss, at a high-level, the traditional method of solving it, i.e. the Sequential Minimal Optimization algorithm, and we will also have a quick look at the more scalable/modern techniques based on the Gradient Descent "family" of algorithms.
The Sequential Minimal Optimization algorithm
In this algorithm, we start with our alphas at some initial values, and then we iteratively update them in pairs, in order to approach the optimal solution. The main reason we update them in pairs is that it allows us to calculate a closed form (analytical) solution to the best direction and magnitude by which to update the alphas in any given iteration.
Remember that the alphas act as weights for a weighted average of points on each side of the separating hyperplane. These 2 averages constitute two points forming the vector w that was previously discussed.
So, if we have 3 support vectors on a given side of the separating hyperplane, for example, choosing the value of their alphas is the same as moving the "center point" (the weighted average) towards an optimal position somewhere within the area covered by the triangle formed by these 3 support vectors. And when you pick two alphas, you are essentially moving the optimal center point somewhere along the line segment formed by the two corresponding support vectors.
As an example of a possible optimization path: in a first step, we pick the alphas for p1 and p2, and optimize the weights relative to each other. We choose 2/3 * p1 and 1/3 * p2. p3 is still at alpha = 0, so there is no p3 component in the resulting center point.
In a second step, we pick the alphas of p1 and p3. We discover that the optimal relative weight between those 2 is half and half. After these 2 steps then, the center point ends up in the middle of the area formed by the 3 support vectors p1, p2 and p3, because their respective alphas are now 1/3 each: 1/3 * p1 + 1/3 * p2 + 1/3 * p3.
The SMO algorithm's job is to identify the correct values of alphas such that:
In the SMO all we do (in a simplified way) is to keep repeating the following until convergence:
1. we pick two alphas
2. we optimize the value of those two alphas while keeping all of the other ones fixed, using a closed-form solution.
(see more detailed pseudo-code below)
Conceptually then, we keep iteratively moving the 2 center points on each side of the separating hyperplane, along the axis formed by the two selected alphas' corresponding support vectors at each iteration, until we reach the center points that best optimize the formula above.
Let's wrap up this brief overview with some pseudo-code for a simplified version (i.e. minus some performance improvements that complicate things a little) of the SMO algorithm:
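As a rough illustration, a compressed Python sketch in the spirit of Platt's *simplified* SMO (linear kernel, and the second alpha chosen at random instead of via the usual selection heuristics) might look like this:

```python
import random
random.seed(0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def smo_train(X, y, C=1.0, tol=1e-3, max_passes=20):
    n = len(X)
    alphas = [0.0] * n
    b = 0.0

    def f(x):
        # decision function: sum of alpha_k * y_k * <x_k, x> + b
        return sum(alphas[k] * y[k] * dot(X[k], x) for k in range(n)) + b

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E_i = f(X[i]) - y[i]
            # only touch alphas that violate the KKT conditions
            if not ((y[i] * E_i < -tol and alphas[i] < C) or
                    (y[i] * E_i > tol and alphas[i] > 0)):
                continue
            j = random.choice([k for k in range(n) if k != i])
            E_j = f(X[j]) - y[j]
            ai_old, aj_old = alphas[i], alphas[j]
            # box bounds keep 0 <= alpha <= C and preserve sum(alpha_k * y_k) = 0
            if y[i] != y[j]:
                L, H = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
            else:
                L, H = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
            if L == H:
                continue
            eta = 2 * dot(X[i], X[j]) - dot(X[i], X[i]) - dot(X[j], X[j])
            if eta >= 0:
                continue
            # closed-form optimum for alpha_j along the constraint line, then clip
            alphas[j] = min(H, max(L, aj_old - y[j] * (E_i - E_j) / eta))
            if abs(alphas[j] - aj_old) < 1e-5:
                continue
            alphas[i] = ai_old + y[i] * y[j] * (aj_old - alphas[j])
            b1 = (b - E_i - y[i] * (alphas[i] - ai_old) * dot(X[i], X[i])
                  - y[j] * (alphas[j] - aj_old) * dot(X[i], X[j]))
            b2 = (b - E_j - y[i] * (alphas[i] - ai_old) * dot(X[i], X[j])
                  - y[j] * (alphas[j] - aj_old) * dot(X[j], X[j]))
            if 0 < alphas[i] < C:
                b = b1
            elif 0 < alphas[j] < C:
                b = b2
            else:
                b = (b1 + b2) / 2
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alphas, b, f

# toy separable dataset: two "+" points and two "-" points
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
y = [1, 1, -1, -1]
alphas, b, f = smo_train(X, y)
```

On a toy separable dataset like this one it converges in a handful of passes; production implementations add the pair-selection heuristics and error caching that we glossed over here.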
This YouTube video is a pretty good introduction to the intuition (and even some of the math) behind the SMO, in its simple form:
Scalable Gradient Descent-based approaches
In the era of Big Data we've mostly abandoned the closed form solution approach, in favor of a more stochastic approach. This is because SMO becomes quite slow on larger data sets.
The principle is the same as SMO, in that we iteratively modify values of alpha to approach the optimal solution. While there are a few variations being used and researched to optimize the SVM (and large-scale machine learning algorithms in general), the fairly "tried and tested" (and common) approach is Stochastic Gradient Descent.
Before even attempting any explanation, I highly recommend this beautiful blog article that displays a few optimization algorithms (including Gradient Descent: the start point to understanding SGD) in an interactive visual manner:
Nothing beats an interactive visual chart, when it comes to intuitive learning. As for the rest, here it is:
Stochastic Gradient Descent is really just Gradient Descent, but applied on randomly selected sub-sets of the data. We do this because statistically speaking, if the parameters are optimal for a randomly sampled sub-set of the data, they should be optimal for the entire data. This assumption allows us to work on much less data, and is thus a performance and scalability improvement over the traditional Gradient Descent algorithm. We can sometimes push this performance improvement to an extreme and calculate error over a single example per iteration of the algorithm, but a more conservative approach is to use a "mini-batch" of a pre-determined size (i.e. a meta-parameter of the algorithm that is selected by the user).
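Here is a minimal mini-batch SGD sketch for a 2-parameter linear model (the learning rate, batch size and step count are arbitrary illustrative values, as is the function name):

```python
import random
random.seed(42)

def sgd_linear_fit(points, lr=0.2, batch_size=8, steps=4000):
    # points: list of (x, y) pairs; fit y ~ m*x + b by mini-batch SGD
    m, b = 0.0, 0.0
    for _ in range(steps):
        # the only difference from full-batch GD: sample a random mini-batch
        batch = random.sample(points, min(batch_size, len(points)))
        n = float(len(batch))
        g_m = sum(-(2 / n) * x * (y - (m * x + b)) for x, y in batch)
        g_b = sum(-(2 / n) * (y - (m * x + b)) for x, y in batch)
        m -= lr * g_m
        b -= lr * g_b
    return m, b

# noiseless toy data on the line y = 3x + 1
points = [(i / 20, 3 * (i / 20) + 1) for i in range(21)]
m, b = sgd_linear_fit(points)
```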
Let's quickly go over Gradient Descent, since it is at the heart of SGD. In Gradient Descent, we start from an initial point in parameter space, and we iteratively move towards some other point in parameter space where the error is minimal. At each iteration, the error is computed by plugging the current parameter values into the loss function that is being minimized and evaluating its output (the "error", as it is sometimes called).
The "gradient" part in "Gradient Descent" comes from how we select the next direction in parameter space to move to. At each iteration, the algorithm calculates which changes to the parameters will result in the steepest descent towards a lower error in the next step. Here's a code example of a step in the Gradient Descent process, with 2 parameters b and m, and the learning rate meta-parameter which is user controlled (and represents how big a step we take in the direction of error minimization).
def stepGradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        # partial derivatives of the mean squared error with respect to b and m
        b_gradient += -(2/N) * (points[i].y - ((m_current * points[i].x) + b_current))
        m_gradient += -(2/N) * points[i].x * (points[i].y - ((m_current * points[i].x) + b_current))
    # move b and m one step against the gradient
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]
It then, of course, takes a step in the direction of steepest descent. For a more in-depth coverage of the GD family of optimization algorithms, I recommend this blog post: http://sebastianruder.com/optimizing-gradient-descent/index.html
There are some particularities to implementing this in the context of an SVM problem, but I won't get into those details as the purpose of this series is to be more of a high-level intuitive overview.
In Part 2 we've seen a description of the optimization problem to solve in order to train our SVM
(in other words, in order to find the ideal separation hyperplane).
In this part, we will start looking into how exactly we are going to solve this problem. In textbooks you
will find a lot of mathematical derivations and proofs here, I will try to avoid that, since my focus is on
intuitive understanding rather than mathematical rigor.
So let's start. The first breakthrough that was made here in the development of support vector machines was the realization that if you derive the Lagrangian of the optimization problem to solve, most parameters disappear and you are left with a simpler form, which also happens to open up some interesting doors (namely, the Kernel trick, which we will discuss in a later article).
Lagrange multipliers and optimization:
I will start by providing a relatively simple explanation of the Lagrangian technique, how it works and why. Then we will skip to the result (the steps of the Lagrangian derivation can be found in many textbooks if you are interested).
A convex function is a function that graphically looks like this:
Finding the minimum for these functions can be done by taking the derivative of the function, and finding the point at which the derivative equals 0. This is because at the bottom of the convex function, the function momentarily stops changing (hence a derivative of 0) before it reverses direction and starts going back up.
But remember that we have a slightly more complicated problem to solve here: we have a set of constraints. This is where Lagrange multipliers come in.
The idea of Lagrange multipliers is as follows:
Suppose you have a constraint:
for all i.
And suppose your objective function is:
The Lagrangian is simply the objective function + your constraint formulated in such a way that f(x) = 0. For each such constraint term we also must add a coefficient, called the Lagrange multiplier.
Formulating the constraint in f(x) = 0 form:
So for this example, the Lagrangian is:
The point at which the derivative of L equals 0 corresponds to the lowest point of the convex function where the constraint also happens NOT to be violated.
In other words, when you take the derivatives of L, it's with respect to the objective function's variables AND the newly introduced Lagrange multipliers (the alphas). These derivatives must be set to 0.
It's a bit like a game where you are asked to pick the value of the objective function's variables to maximize your score: the smaller the value of the function, the more points you win (for a minimization problem). But at the same time, your opponent is allowed to choose the Lagrange multiplier arbitrarily to maximize the value of the function, thereby making you lose points. The optimal play is clearly to set w (in this example) such that it is insensitive to the value of alpha (in mathematical terms: such that the derivative with respect to alpha is 0).
To summarize then, the idea is to get as close as possible to the point where the derivative of the objective function equals 0, while making sure that the sensitivity to the Lagrange multiplier is also 0.
The Support Vector Expansion:
Ok, so now we apply the same technique to our SVM optimization problem, calculate the Lagrangian, take the partial derivatives of it with respect to w, b (the parameters to optimize), set those derivatives to 0 and substitute them into the Lagrangian for the reasons just explained (this is where we skip the math), and we get:
Lo and behold, the parameters w and b entirely vanished, and the whole problem is reduced to finding the right values of the Lagrange multipliers alpha! (a few additional mathematical tricks were used to get to this result, but the Lagrangian technique is the most fundamental part of it)
Furthermore, from this result, we can derive (see textbooks) the "Support Vector expansion" results:
Let's think a bit about what that means, and why it makes sense. The math behind this result is undeniable, but it doesn't really help you understand what is going on. First of all, what the Support Vector Expansion is saying is that you can express the separating hyperplane entirely in terms of a linear combination of some data points.
These data points that play a role in determining the hyperplane are the ones for which the corresponding alphas are non-zero. These happen, in fact, to be the Support Vectors! (we'll see later how the alphas are picked)
Ok, but how does it make sense that the w vector (or, equivalently, the decision boundary/hyperplane) can be entirely calculated from a few data points? To understand that, let's look at the simplest case:
We have 2 data points to classify. The best hyperplane in this scenario is perpendicular to the line segment (or vector, if you prefer) formed from point a to point b. It might not be visually obvious that this line is what best maximizes the separating margin, but if you think of Pythagoras' theorem, it should be "mathematically evident" that in this case the distance from each point to the decision boundary is larger than, say, if the boundary were placed horizontally or vertically between the points:
(the width of the margin, represented by the line segment c, is, according to Pythagoras' theorem, greater than either of the line segments that would be formed, a or b, if we placed the separating hyperplane horizontally or vertically between the two points)
Ok. I'm going to pretend this was convincing enough and keep going. The next logical step is to realize, in this scenario, that w is equal to b - a. It's a simple fact of linear algebra that the vector connecting two points is obtained by subtracting one point from the other.
We've just shown, in the simplest case, why the hyperplane can be described as a subtraction of a negative-labeled support vector from a positive-labeled support vector. To generalize this example to cases where you have more than 1 support vector on each side of the decision boundary, realize that the w vector is then the average of the support vectors on one side minus the average of the vectors on the other side. So you are still ultimately producing the w vector as a subtraction of two vectors across the decision boundary, except that now the 2 points each represent the "gravity center" on each side (metaphorically speaking).
So there you have it, by summing and subtracting support vectors, we have expressed, in the general case, the w vector.
The next step:
Now we have expressed the optimization problem only in terms of alphas. We understand that the weight vector w isn't necessary, you can simply express the SVM's separating hyperplane as a function of a linear combination (or weighted vote) of carefully selected data points: the support vectors. Furthermore we understand that the selection mechanism is done via setting the alpha values correctly.
In summary, we have reached the conclusion that the entire business of SVMs is about finding the vectors that contribute the most to the determination of the decision hyperplane (i.e. the Support Vectors) and then "drawing" a vector w through them.
The next step, then, is to figure out how to identify the alpha values. Traditionally this is done through an algorithm called the Sequential Minimal Optimization (or SMO for short) algorithm, developed by John Platt. We will get into how this algorithm works in a later article. Note also that with the era of Big Data there is on-going research into more scalable algorithms (Stochastic Gradient Descent, L-BFGS, etc.) to numerically determine these optimal alphas.
Ok, you read Part 1 and you understand how to draw a hyperplane. You're now officially a machine learning expert. The rest is superficial.
But let's continue building this support vector machine.
The first step is to find the parameters of the hyperplane (i.e. the parameters w and b) that correspond to the best hyperplane we could draw in order to properly separate the "+" and "-" observations.
How do we define the "best" separation? While writing this article I stumbled upon a blog post that explains the answer to this question so well that I couldn't do it better. So I will redirect you to this post, there is no point in me re-doing the work: http://www.svm-tutorial.com/2014/11/svm-understanding-math-part-1/
Quick summary: the best separating hyperplane is the one that will be right in the middle of the "gap" between the "+" and "-" support vectors.
By definition, the margin is the distance between that separating hyperplane and the closest data points on each side (i.e. the support vectors).
Therefore, we can define the SVM problem (i.e. finding the best separating hyperplane) as a constrained optimization problem:
1) maximize the separating margin. Since the margin is the distance between the separating hyperplane and the closest data points, maximizing it pushes the hyperplane as far as possible from those closest data points on each side.
2) But because we don't want to maximize the margin past the closest data points, we need to specify a constraint to make it stop at the support vectors. In other words, we need to tell the maximization algorithm when to stop.
Mathematically, this means that we minimize the function:
Subject to the constraint:
Let's first look at the objective function: we're essentially minimizing w (and for the moment let's ignore the less relevant stuff around it). If you remember the conclusions from part 1 of this series, minimizing w is the same as maximizing the separating margin.
Also, to be clear, because w is a vector, when I talk about minimizing w, what we're doing is trying to minimize the absolute value of the weights as a whole. The norm of w, written ||w||, is:
Where the w's are the components of the w vector, one per dimension. In other words, with two features, w1 is the weight on the x-axis feature, and w2 is the weight on the y-axis feature.
The other things around the ||w|| are there as mathematical devices to simplify derivatives later on -- they don't contribute much to the intuitive understanding of what is going on.
As for the constraint, it's saying that all vectors in the dataset should be correctly classified as being >= +1 if in the "+" class, and <= -1 if in the "-" class. Another way of putting it is that there should be no data points within the gap formed by the margin's edges.
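To make the constraint concrete, here is a throwaway Python check (the function name is mine, purely for illustration):

```python
def satisfies_constraints(w, b, X, y):
    # the constraint: y_i * (<w, x_i> + b) >= 1 for every observation,
    # i.e. no data point falls strictly inside the margin
    return all(
        yi * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, yi in zip(X, y)
    )
```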
Alright, we've successfully formulated the optimization problem that will allow us to determine the ideal separating hyperplane for a given dataset, and hopefully we've understood it as well. Stay tuned for part 3, where we will begin explaining how to solve this optimization problem.
If you want a deeper look into the math of the optimization problem we just described, I highly recommend this blog post, by the same author of the previous blog post I linked to.
What follows in this series of articles is an explanation of Support Vector Machines that focuses on intuitive understanding, rather than purely mathematical derivation. The target audience is those who have a basic knowledge of machine learning vocabulary, but aren't necessarily Ph.Ds in math.
SVMs are binary classifiers (although there exists Support Vector Regression, as well as ways to classify multinomial data -- but we won't go into that, at least not yet) whose entire purpose is to draw a line (when the feature set is 2-dimensional) or a hyperplane (when the feature set is higher-dimensional). This line will be the line that separates the "+" class from the "-" class observations (we'll call them "+" and "-", but technically we use +1 and -1 as the numerical labels).
The equation for a line or hyperplane, as typically seen in machine learning textbooks is:
This should be obvious, if you remember high school math, because a 2-d line is y = mx + b, so if we have more than 2 dimensions it's y = m1x1 + m2x2 + ... + b, which is more concisely expressed as y = <w, x> + b, the dot product between all w's and all x's.
But here lies a first source of confusion already. Often, next to that formula, in textbooks they'll show a 2D plot with a decision boundary line that separates the "+" observations from the "-" observations. Well, plot twist: that decision boundary that you see is NOT w, nor is it <w, x> + b.
w is actually a vector that is orthogonal to that decision boundary.
The decision boundary that you often see in those graphs is, rather, all values of x that satisfy the following equation:
As a reminder, also, keep in mind that the y axis in those graphs is NOT the y or f(x) of the f(x) = <w, x> + b formula. The y axis is in fact the second x variable, because the assumption behind those graphs is always that there are two features, x1 and x2.
So what is y = <w, x> + b? It's the decision plane: If y > 0, it classifies the observation as a "+" label, if y < 0 it classifies it as a "-" label. It's the function you will be using to perform classification. But to truly visualize this function, it is necessary to produce a 3D graph of the previous 2D one, where the Z axis is the f(x) of the formula.
That black line across the center is the decision boundary, where the values of <w,x> + b give 0. Before that, the values of y are below zero, hence classified as "-", and beyond that line, the values are above zero, hence classified as "+".
Before we delve further into the roles of w and b in that picture, we will introduce the notion of support vectors and separating margin.
The Support Vectors are those observations in the picture that are closest to the decision boundary, on each side of it.
The separating margin is the distance from the decision boundary to these support vectors. As we estimate the parameters of the SVM, we will be choosing them so that the y value for each of these support vectors is +1 or -1 (depending on if they are "+" examples or "-" examples, respectively).
The role of b in the function y = <w,x> + b is to decide the positioning of the decision boundary. In other words, it decides where the value of y crosses 0. Increasing or decreasing b then has the effect of pulling this boundary back into the "-" region or pushing it further into the "+" region.
The role of w is to control the "scaling", i.e. the width of the margin. In other words, it controls how big the step in the x variables needs to be in order to get to a y value +1 or -1. To understand what exactly I mean by that, it suffices to look at two examples: one where the w's are all equal to 1, and one where the w's are all equal to 2.
Let's have x1 = 1 and x2 = 1, for simplicity, and b = 0:
1) with w = (1, 1): f(x) = 1 * 1 + 1 * 1 = 2
2) with w = (2, 2): f(x) = 2 * 1 + 2 * 1 = 4
So in the first scenario, with w = 1, the same x values gave a classification of +2, while in the second scenario, the same x values gave y = +4. This means that the width of the margin (i.e. the distance from the decision boundary required to reach +1 and -1) in the first scenario was represented by the x values (0.5, 0.5), while the x values (0.25, 0.25) are enough in the second scenario to reach f(x) = +1. The thing to remember here is that the greater the values of w, the smaller the margin.
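You can verify these numbers with a one-line decision function:

```python
def f(x, w, b):
    # the decision function y = <w, x> + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

print(f([1, 1], [1, 1], 0))        # w = (1, 1): f = 2
print(f([1, 1], [2, 2], 0))        # w = (2, 2): same x values, f = 4
print(f([0.5, 0.5], [1, 1], 0))    # with w = (1, 1), x = (0.5, 0.5) reaches f = 1
print(f([0.25, 0.25], [2, 2], 0))  # with w = (2, 2), x = (0.25, 0.25) is already enough
```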
Ok, that should cover the introductory theory. You're ready to delve into the core of the Support Vector Machines. Stay tuned for part 2.
I always felt that the concept of cross-validation is a bit of a hack. It's a fundamentally random process, and not only does that mean that you will get different results from one run to the next, but it also means that you could "accidentally" end up with training or validation sets that aren't representative of the full dataset. This, in turn, can lead to various specific problems, which I will get into below.
That being said, statistical tests of significance are a complicated world to navigate when you cannot assume normality. For the sake of simplicity then, I tend to fall back to cross-validation. Here are some hacks that I apply to the hack that is cross-validation, in order to make it a bit more reliable.
a) The problem of "extreme" values
The other day I used cross-validation on a dataset that contained a handful of extreme values for the predictor (in the 50,000 to 500,000 range) while the rest of the dataset was concentrated in the 0 to 5000 range. To be clear, these are not outliers in the regression, i.e. their Cook's distance isn't significantly different than the rest of the data.
The result was that, due to the random nature of the split between the training and validation set, these few extreme values all ended up in the validation set. The training data was modeled with some polynomial that worked well within the training data. But then, when it came to validating the model, the polynomial extrapolated to ridiculous values. The cross-validated Mean Absolute Error was horrible. Here is what it looked like:
Where the blue dots are the raw data (due to the scale being inflated from the bad model, you don't see many dots, but in fact there are several data points on top of each other) and the orange line is the fit of the polynomial model.
Another similar situation gave a graph like this one:
Ok, so I over-fitted and cross-validation told me about it... how was that broken? Cross-validation did what it was intended to do, no? Yes, it did. The problem however, is that I failed to model the real underlying relationship, because the training data didn't contain crucial data points. And this failure is a direct result of splitting the data for cross-validation purposes. In other words, I over-fitted because of cross-validation, and then it came back and slapped me on the wrist for it!
In fact, if we take a step back and look at what happened, we see that I modeled some dataset over a given domain (0 to 5,000-ish), and then I validated it by estimating values beyond the domain for which I built the model. This is extrapolation. It is a basic fact of statistics that you cannot extrapolate beyond the domain of a given model (and expect it to give sensible results). Therefore, it follows that one should never produce a validation set containing values of the independent variables that are above or below the domain of the training data set. And, as far as I know, this isn't a caveat commonly mentioned in statistical learning theory textbooks (or maybe I missed the small print!).
My solution to this problem was to sort my dataset along the x-axis values and take the top and bottom 5th percentiles of data (the choice of the percentile value is itself arbitrary) to include them immediately in my training set. The validation set would then be selected randomly from the remaining data. This way I guarantee that my extreme values belong to my training set, not my validation set, and that, therefore, the validation process will not extrapolate over the trained domain.
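The split described above can be sketched in a few lines of plain Scala. Everything here (the object name `ExtremeAwareSplit`, the parameter names, the 5% / 20% defaults) is hypothetical and just illustrates the idea: force the bottom and top tails of the x-values into the training set, then draw the validation set at random from the middle.

```scala
import scala.util.Random

object ExtremeAwareSplit {
  // Hypothetical helper: split (x, y) pairs so that the lowest and highest
  // tailFraction of x-values are forced into the training set, and the
  // validation set is drawn at random from the remaining middle.
  def split(
      data: Seq[(Double, Double)],
      tailFraction: Double = 0.05,
      validationFraction: Double = 0.2,
      seed: Long = 42L
  ): (Seq[(Double, Double)], Seq[(Double, Double)]) = {
    val sorted = data.sortBy(_._1)
    val k = math.max(1, (tailFraction * data.size).round.toInt)
    val (bottom, rest) = sorted.splitAt(k)          // lowest k x-values
    val (middle, top)  = rest.splitAt(rest.size - k) // highest k x-values
    // Only the middle is shuffled; the tails always go to training.
    val shuffled = new Random(seed).shuffle(middle)
    val nVal = (validationFraction * data.size).round.toInt min shuffled.size
    val (validation, middleTrain) = shuffled.splitAt(nVal)
    (bottom ++ top ++ middleTrain, validation)
  }
}
```

With this split, the validation x-values are guaranteed to lie strictly inside the range covered by the training x-values, so the validation step never asks the model to extrapolate.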
I just finished implementing the Augmented Dickey-Fuller test using Scala and Apache Spark. The objective of the ADF test is, given some time series X(t), to determine whether this time series is stationary (i.e. revolves around a constant or slow-moving mean) or non-stationary (i.e. the expected value of the next observation is simply the previous observation itself, rather than some fixed mean). This is generally referred to as checking for a unit root.
The intuition and derivation follow. Let's start by describing our time series as a function of a lag of itself (plus a constant alpha to allow for an intercept in the regression):
X(t) = alpha + b * X(t - 1) + e(t)
where e(t) is a white-noise error term.
If X(t) is non-stationary, then b will be equal to 1 (which is what we call a unit root). If X(t) is mean-reverting, b will be less than 1. And in the extreme case where X(t) is pure white noise around its mean, with no memory at all, b will be equal to 0.
So immediately we see that the Dickey-Fuller test will be about testing for the value of b.
Our null hypothesis will be that b = 1 (non-stationary), and our alternative hypothesis will be that b < 1 (stationary).
We can't test for b as is, because X(t) and X(t - 1) are integrated (a.k.a I(1), a.k.a. non-stationary) under the null hypothesis, and when a variable is integrated, the Central Limit Theorem does not apply, so the traditional coefficient estimation techniques can't be used.
Instead, we subtract X(t - 1) from both sides of the equation, and obtain the following:
X(t) - X(t - 1) = alpha + (b - 1) * X(t - 1) + e(t)
i.e., writing delta X(t) for the left-hand side and p for b - 1:
delta X(t) = alpha + p * X(t - 1) + e(t)
This formulation is more appropriate for regression analysis: the dependent variable is now stationary.
We must transform our hypothesis accordingly, because now it is about p = b - 1, not b itself:
Null hypothesis: p = 0 (because under the null, p = b - 1 = 1 - 1 = 0)
Alternative hypothesis: p < 0
X(t - 1) itself is still integrated under the null hypothesis (thus the estimator for p would not follow a t-distribution), but Dickey & Fuller have gone through the trouble of building the distribution table appropriate for this test.
So we are simply going to calculate the t-statistic for this regression, and compare against their table to see if we reject the null hypothesis or not.
So we run a linear regression to obtain the estimated p coefficient, and plug it into t = p / SE(p) to get the corresponding t-statistic. The standard error of p is
SE(p) = sqrt( [ sum of (y - y hat)^2 / (n - 2) ] / sum of (x - x bar)^2 )
where y hat represents the fitted values of delta X in the main regression, y represents the observed values thereof, x is the right-hand-side variable X(t - 1), and x bar is its average.
The 5% critical value of the Dickey-Fuller distribution (for the regression with a constant term) is -2.87 for a sample of 500 or more values. If the t-statistic < -2.87, you can reject the null hypothesis and conclude at the 5% significance level that the time series is indeed stationary.
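The whole procedure up to this point fits in a short function. Below is a minimal sketch in plain Scala (no Spark, unlike my actual implementation), with an illustrative object name `DickeyFuller`: it runs the OLS regression of delta X(t) on X(t - 1) with an intercept and returns the t-statistic of p, to be compared against the DF table.

```scala
object DickeyFuller {
  // Sketch of the (non-augmented) Dickey-Fuller t-statistic.
  // Regresses dx(t) = alpha + p * x(t-1) by ordinary least squares and
  // returns t = p / SE(p), to be compared against the DF critical values
  // (e.g. -2.87 at the 5% level for samples of 500 or more).
  def tStatistic(series: Array[Double]): Double = {
    val x  = series.dropRight(1)                               // x(t-1)
    val dx = series.sliding(2).map(w => w(1) - w(0)).toArray   // delta x(t)
    val n  = x.length.toDouble
    val xBar  = x.sum / n
    val dxBar = dx.sum / n
    val sxx = x.map(v => (v - xBar) * (v - xBar)).sum
    // OLS slope and intercept
    val p = x.zip(dx).map { case (a, b) => (a - xBar) * (b - dxBar) }.sum / sxx
    val alpha = dxBar - p * xBar
    // Residual sum of squares of the fitted regression
    val ssr = x.zip(dx).map { case (a, b) =>
      val fitted = alpha + p * a
      (b - fitted) * (b - fitted)
    }.sum
    val se = math.sqrt(ssr / (n - 2) / sxx)
    p / se
  }
}
```

On white noise (clearly stationary) the t-statistic comes out strongly negative, well below -2.87; on a random walk it typically sits near zero, so the null of a unit root is not rejected.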
But that's just the ordinary Dickey-Fuller test. What if your time series is indeed mean-reverting, but its mechanics are more complicated than a simple dependence on the previous lag (i.e. higher-order lags of the changes are autocorrelated)?
This is where the Augmented version comes in:
delta X(t) = alpha + p * X(t - 1) + c1 * delta X(t - 1) + ... + cn * delta X(t - n) + e(t)
In other words, you simply add up to n lags of the delta of X to your regression equation.
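Concretely, augmenting the regression just means widening the design matrix with the lagged deltas. The sketch below (object name `AdfDesign` is hypothetical) builds the (response, regressor-row) pairs for n lags; the coefficient fitted on the x(t - 1) column is the p whose t-statistic gets compared to the DF table, exactly as before.

```scala
object AdfDesign {
  // Sketch: build the regression rows for an ADF test with nLags lagged deltas.
  // Each row pairs deltaX(t) with the regressors
  //   (1.0 (intercept), x(t-1), deltaX(t-1), ..., deltaX(t-nLags)).
  def rows(series: Array[Double], nLags: Int): Array[(Double, Array[Double])] = {
    val dx = series.sliding(2).map(w => w(1) - w(0)).toArray // dx(t) = series(t+1) - series(t)
    // First usable delta index: we need nLags previous deltas available.
    (nLags until dx.length).map { t =>
      val y = dx(t)                 // delta x(t)
      val lagLevel = series(t)      // the previous level x(t-1)
      val lagDeltas = (1 to nLags).map(i => dx(t - i)).toArray
      (y, 1.0 +: lagLevel +: lagDeltas)
    }.toArray
  }
}
```

Feeding these rows to any multiple-regression routine (Spark's, in my case) and reading off the standard error of the x(t - 1) coefficient gives the augmented test statistic.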
Note that it is possible to automatically find the optimal maximum lag order n using statistical significance tests, but this may be material for another post...