I just finished reading Marcos Lopez de Prado’s chapter on Fractional Differencing in his new book, Advances in Financial Machine Learning. I have 2 questions/concerns about it:
1. If integrated time series are vulnerable to spurious regression, why wouldn’t fractionally integrated time series be “fractionally spurious”?
2. Is it really true that there can be some predictive information in the levels (or “memory”) of the time series which can be captured by fractional differencing, but not integer differencing?
Because my background is machine learning, rather than pure math, I’m going to try to answer these questions using data and empirical evidence, rather than mathematical deduction.
To attempt to answer the first question, I rely on the fact that correlating various unrelated random walks with each other (without differencing them first) produces erroneously large correlation coefficients. This is how I will estimate the extent to which fractional differencing reduces (if at all) “spuriousness”. The traditional approach is to transform the fully integrated time series (i.e., I(1)) into a stationary I(0) time series by differencing it once (d = 1). The claim of fractional differencing is that it isn’t necessary to difference to order 1: a fractional value, such as 0.5, can be sufficient. Specifically, I will do the following:
a) I will generate 1000 random walks.
b) I will correlate them with each other, as is, in their original I(1) form. I expect to see a certain amount of spurious correlation, in the form of correlation coefficients that differ statistically from zero.
c) I will then start differencing these time series with d = 0.1. I will re-run the correlations. I will repeat this with d = 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0 (i.e. traditional differencing of order 1). For each analysis, I will keep the mean absolute correlation of the results.
d) I will then plot, for each value of d from 0 to 1, the corresponding mean absolute rho that was obtained.
Here is the code. Here are the results:
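As a minimal sketch of steps (a) through (d), assuming the standard binomial-weight construction of the (1 - B)^d operator with a fixed-width window (the window size, number of walks, and d grid are scaled down for speed, and all function names are mine):

```python
import numpy as np

def frac_diff_weights(d, size):
    # binomial expansion of (1 - B)^d: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

def frac_diff(x, d, window=20):
    # fixed-width window fractional differencing (one of the variants in the book)
    w = frac_diff_weights(d, window)[::-1]
    return np.array([w @ x[i - window:i] for i in range(window, len(x))])

rng = np.random.default_rng(0)
walks = np.cumsum(rng.standard_normal((100, 500)), axis=1)  # 100 unrelated random walks

results = {}
for d in [0.0, 0.25, 0.5, 0.75, 1.0]:
    diffed = np.array([frac_diff(walk, d) for walk in walks])
    rho = np.corrcoef(diffed)
    # mean absolute pairwise correlation, off-diagonal entries only
    results[d] = np.abs(rho[np.triu_indices_from(rho, k=1)]).mean()
    print(d, round(results[d], 3))
```

Note that with d = 1 the weights collapse to (1, -1, 0, 0, ...), i.e. ordinary first differencing, so the d grid smoothly interpolates between the raw walks and their first differences.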
The relationship between the value of d and the corresponding amount of spurious correlation discovered is not linear. We can see that it drops and quickly reaches diminishing returns at around d = 0.75, where it falls below the threshold of statistical significance (the orange line). This means that there is no significant decrease in spurious correlation associated with increasing d values, once we’ve reached approximately d = 0.75 (in the case of my artificial time series – the plateau might be reached at different values with differently constructed time series).
In other words, it appears that fractional differencing does indeed remove the risk of spurious correlation to the same extent as full differencing does, as long as we stay in that optimal range.
The next question then is: can we gain an improvement in predictive power by not using d = 1 (if the time series has long-term memory)? In an attempt to answer the second question, I will try to build a data generating process with memory, such that the same model performs more poorly when fitted on an I(0) version of the series than when fitted on a fractionally differenced version. This will demonstrate that integer differencing eliminated some predictive component that fractional differencing preserved.
The goal of that model will be to predict changes out of sample in the underlying time series, using as predictor lags of the modified (i.e. fully or fractionally differentiated) version of itself. After all, out of sample predictive power is the ultimate test for anything. If it passes this test, I’m happy.
Here is the code. Here are the results (note that, in this chart, the R2 value reported is relative to the R2 obtained by using d = 1, on the same validation set with the same model):
My time series with long-term memory is an Ornstein-Uhlenbeck process with a deterministic trend. The trend is there to make it non-stationary (otherwise, it would already be stationary – no need to difference it!).
On 50,000 artificially generated data points from my Ornstein-Uhlenbeck process with trend, I keep 10% as a validation set. I train a simple regression model, where I try to predict the change (d = 1) in the time series at t, given the fractionally differenced value at t - 1.
I fit this model 11 times, using d values of 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0. In each case, I also run the Augmented Dickey-Fuller test on the differenced time series, to see whether it is considered stationary or not. Notice in the chart that the relative R2 begins to increase after the ADF p-value hits 0. Then, it starts decreasing again until we reach d = 1. In other words: the optimal d value is around 0.5.
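A compressed reconstruction of this experiment (the theta, sigma, trend slope, window size and series length are my own guesses, not the values from the original code, and the ADF test is omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ornstein-Uhlenbeck process mean-reverting toward a deterministic trend line
n, theta, sigma, slope, window = 20_000, 0.05, 1.0, 0.01, 20
x = np.zeros(n)
for t in range(1, n):
    x[t] = x[t - 1] + theta * (slope * t - x[t - 1]) + sigma * rng.standard_normal()

def frac_diff(s, d):
    # fixed-width window fractional differencing via binomial weights of (1 - B)^d
    w = [1.0]
    for k in range(1, window):
        w.append(-w[-1] * (d - k + 1) / k)
    w = np.array(w[::-1])
    return np.array([w @ s[i - window:i] for i in range(window, len(s))])

def validation_r2(d):
    feat = frac_diff(x, d)           # frac-diffed value "as of" t - 1
    tgt = np.diff(x)[window - 1:]    # the d = 1 change in x at t
    cut = int(0.9 * len(feat))       # last 10% is the validation set
    X = np.column_stack([np.ones(cut), feat[:cut]])
    beta, *_ = np.linalg.lstsq(X, tgt[:cut], rcond=None)
    resid = tgt[cut:] - (beta[0] + beta[1] * feat[cut:])
    return 1 - resid.var() / tgt[cut:].var()

for d in [0.0, 0.5, 1.0]:
    print(d, round(validation_r2(d), 4))
```

The intuition is that for a mean-reverting process, the (detrended) level at t - 1 carries information about the next change, and a partially differenced series retains more of that level information than the fully differenced one.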
We showed that fractional differencing eliminates spurious regression as thoroughly as full differencing, as long as we choose a reasonable value of d. This optimal value can be determined by running an ADF test (the time series should be sufficiently differenced to pass the test), while simultaneously making sure that the time series is well correlated to the original, I(1) version.
In addition to that, we confirmed the claim that the levels of some time series contain important predictive information which, if not eliminated by full differencing, can be used to produce models with superior out-of-sample (non-spurious) predictive ability.
Thanks to Mirza Trokic for the Python fractional differencing code!
The word entropy could essentially be replaced with “randomness”. Generally used in the context of classification problems (in machine learning, that is), entropy tells us how random the distribution of a certain class in the dataset is.
Let’s take a binary classification problem, to simplify things. We are training a model to predict whether the stock market will go up or down tomorrow. If, in our dataset, 50% of the time the market goes up, and 50% of the time the market goes down, then the next day return of the stock market has a very high entropy (it’s purely random). Mathematically, the binary information entropy is calculated as: H(p) = -p × log2(p) - (1 - p) × log2(1 - p), where p is the probability of the positive class.
So if you do the math, the entropy of the stock market example is 1. If instead the stock market went up 100% of the time and went down 0% of the time (or vice versa), the entropy would fall to 0.
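A quick numerical check of those two cases, with a small helper of my own (the `binary_entropy` name is mine):

```python
from math import log2

def binary_entropy(p):
    # H(p) = -p*log2(p) - (1 - p)*log2(1 - p), with 0*log2(0) taken as 0
    return 0.0 - sum(x * log2(x) for x in (p, 1 - p) if x > 0)

print(binary_entropy(0.5))  # → 1.0  (coin-flip market: maximum randomness)
print(binary_entropy(1.0))  # → 0.0  (market always goes up: no randomness)
```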
What then, is cross-entropy? In machine learning, it is a measure of the “distance” or “divergence” between the model’s distribution of class outcomes and the dataset’s distribution of class outcomes. More specifically, for each “actual vs predicted” pair in our dataset, we add: -actual × log2(predicted) - (1 - actual) × log2(1 - predicted).
Here the actual value will be 1 or 0 (rather than some kind of probability), but the predicted value used in the formula is the predicted probability, not the most likely class. Otherwise, notice the similarity to the information entropy formula above: the key difference is that in cross-entropy the two factors in each term come from different distributions, namely the actual labels and the model’s predicted probabilities.
Intuitively, it means that when the dataset says “1” and we predict a probability of “1”, we add 0 (the math is: -1 × log2(1) - 0 × log2(0), where the 0 × log2(0) term is taken as 0) to the total error sum. When we predict a probability of “0.1” but the dataset said “1”, we add 3.322 (the math is: -1 × log2(0.1) - 0 × log2(0.9)) to the total error sum. So the farther the predicted probability is from the actual label, the higher the error will be.
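The same arithmetic as a small sketch, using base-2 logs to match the 3.322 figure (the epsilon clamp and the function name are my additions, to avoid log2(0)):

```python
from math import log2

def binary_cross_entropy(actual, predicted, eps=1e-12):
    # -actual*log2(p) - (1 - actual)*log2(1 - p), with p clamped away from 0 and 1
    p = min(max(predicted, eps), 1 - eps)
    return -actual * log2(p) - (1 - actual) * log2(1 - p)

print(round(binary_cross_entropy(1, 1.0), 3))  # → 0.0   (perfect prediction)
print(round(binary_cross_entropy(1, 0.1), 3))  # → 3.322 (confidently wrong)
```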
VIX is the CBOE Volatility index, which attempts to track the volatility of the S&P 500, based on the implied volatility in the 30-day options of the S&P 500. Being an index, this is not a directly tradable instrument.
A quick word about implied volatility: an option’s theoretical price is determined by the underlying asset’s current price, the option strike price, the time until expiry, the risk-free interest rate, and the underlying asset’s volatility. For example, have a look at this Black-Scholes calculator. We can directly observe all of these input parameters, except for the volatility. By mathematically “reverse engineering” the volatility parameter from all of the other observed values, we obtain the implied volatility (i.e. what the market believes to be the volatility of the underlying asset).
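As a sketch of that “reverse engineering”, assuming the Black-Scholes model for a European call and a simple bisection search over sigma (all function names here are mine):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    # Black-Scholes price of a European call option
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

def implied_vol(price, S, K, T, r, lo=1e-6, hi=5.0, tol=1e-8):
    # bisection works because the call price is monotonically increasing in sigma
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# round trip: price an option at 20% vol, then recover the vol from the price
p = bs_call(S=100, K=100, T=30 / 365, r=0.01, sigma=0.20)
print(round(implied_vol(p, 100, 100, 30 / 365, 0.01), 4))  # → 0.2
```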
If you're interested in the details of the VIX value calculation, you can have a look at the CBOE page that describes the steps of the calculation. It also explains the distinction, and relation, between the VIX volatility and the Black-Scholes implied volatility as discussed above.
In 2004, the VIX index became indirectly tradable through the introduction of futures contracts. Since then, it has been possible to speculate on the upcoming volatility for each month of the year (in fact, as of July 23rd, 2015, there are also weekly contracts available). If you look at the CBOE VX Futures Daily Settlement Prices web page you can see the “current” prices for each of the upcoming futures contracts. For example:
This means that there is a VIX Futures contract set to expire at 01/17/2018 which is currently priced at 10.575 (and so forth). Notice that the prices, at the time of this writing, increase as we get farther and farther away from the expiry date. This is a consequence of investors' uncertainty about the future, and is called contango. It is conventional wisdom that we are in contango most of the time (in “normal” market periods). Once in a while, however, when investors start fearing near-immediate disaster, we can end up in what’s called backwardation: i.e. the futures set to expire soon are priced higher than the ones set to expire later.
A bit more terminology: the difference between the current price of a futures contract and the current spot price is called the basis. The spot price refers to the current price of the underlying asset (as opposed to the price of the futures contract on that asset). The VIX term structure refers to the shape made by the futures prices at different expiries (i.e. it is the term structure that can be in contango or backwardation).
Various exchange-traded funds and notes have since been created to replicate the VIX (or its inverse, with or without leverage):
If you're seeing this for the first time, you might be tempted to short and hold VXX. The fact is that most of the time you will be making quite a lot of money by doing that. However, it is an extremely risky strategy, as volatility spikes can be enormous, and can easily reach ~500%. Imagine the consequence of shorting VXX when volatility spikes by 500%... Update: if you held such a short position on VXX on February 6th, you'd be in trouble!
A trader might then wonder: do the VIX futures forecast the VIX spot price, or is it vice versa? Another way to put this is: does the futures basis have a predictive power on the spot price? Or, if the VIX futures tend to be mispriced and the spot price is a better predictor, does the basis have a (negative) predictive power on the upcoming changes in the price of the futures?
A study was conducted to determine exactly that. Two distinct regression models were estimated:
(1) is a regression where we attempt to predict the upcoming change in the VIX spot price based on the current futures basis.
(2) is a regression where we attempt to predict the upcoming change in the VIX futures’ price based on the current futures basis.
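To make the structure of the two models concrete, here is a toy version on synthetic data (the coefficients and noise levels are entirely invented; the point is only the shape of equations (1) and (2), not the study's actual results):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 250                                    # roughly one year of daily observations
basis = rng.standard_normal(n)             # futures price minus spot price (toy values)
d_spot = rng.standard_normal(n)            # toy world: basis has no power on spot changes
d_futures = -0.3 * basis + 0.5 * rng.standard_normal(n)  # toy world: basis negatively
                                                         # predicts futures price changes

def ols_slope(y, x):
    # slope of y = a + b*x fitted by ordinary least squares
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

print("eq (1): delta spot ~ basis, slope =", round(ols_slope(d_spot, basis), 3))
print("eq (2): delta futures ~ basis, slope =", round(ols_slope(d_futures, basis), 3))
```

In this toy setup, only equation (2) recovers a significant (negative) slope, which mirrors the qualitative conclusion of the study described below.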
The researchers' conclusion is that equation (2) is the statistically significant one, not (1). This suggests that VIX futures tend to be mispriced. It appears that there is a risk premium built into the price, possibly coming from the fact that these instruments are often used for hedging purposes. This also means that, from a trading perspective, when we are in contango, we should short the futures, because their price will decrease as we approach expiry. When we are in backwardation, we should buy the futures, because their price will increase as we approach expiry. Note that this strategy could theoretically be replicated by using one of the ETFs/ETNs, instead of the futures.
It should be noted, however, that this regression only explains about 10% of the variation in the futures’ price, which is why the researchers then go on to isolate the risk premium by hedging with mini-S&P 500 futures. This should eliminate changes in portfolio value derived from overall volatility movements, since the S&P 500 is negatively correlated with the VIX, at about -80%.
With a different methodology, another researcher arrived at a similar conclusion: he rejected what he called the expectations hypothesis, i.e. the hypothesis that the VIX term structure reflects expected changes in the VIX.
These instructions refer to this workshop event.
In this track we will use the R statistical software to perform simple data modelling.
In this track we will use Tensorflow and Python to build advanced machine learning and deep learning models.
Installing Tensorflow & Python on your laptop is optional. I will give you remote SSH access, during the workshop, to a server that has Tensorflow installed on it.
If you want to install Tensorflow on your laptop:
Summary of requirements if you install TF:
If you don't want to install Tensorflow on your laptop:
I will make my home tensorflow-ready server available to you during the workshop sessions. You'll just need an SSH client (I recommend Putty for Windows users, and the default ssh client for ubuntu/linux users) to access the server and run the jobs.
You should also install an SCP client (WinSCP for Windows) and an FTP client (I recommend the FileZilla FTP client for Windows). I haven't yet fully determined the means by which we'll transfer code and results, but I'm pretty sure it will be either SCP and/or FTP, so you should make sure you can use both of these.
The way it will work is you will write up a python script on your laptop (or, if you're comfortable with command-line text editor tools like vim or nano, you can write the exercises directly on the server in your own folder), and then you will upload them to your folder on the server via either SCP or FTP (to be determined).
Then, you will use a script I made, that will queue up your job for execution as soon as the GPU is available. The output of your python script will be available in a .log file once executed by the queuing system. More specific instructions will be provided during the workshop itself.
Summary of requirements if you don't install TF:
I gave a presentation at the Montreal Quantitative Trading meetup yesterday, on the topic of simulating the order book using Deep Learning (in order to improve backtesting accuracy). Here are the slides (in PDF format):
I've been setting up Tensorflow/CUDA machines 4 times now, so I thought I'd write down my process:
- a GTX 1080 (but will probably work with other NVIDIA graphics cards -- as long as it's NVIDIA)
- Python 2.7
- Bazel 0.5.2
- CUDA 8.0
- cuDNN 5.1
- Tensorflow r1.2
WARNING: if you choose any other version of the above frameworks/libraries, you're on your own: these instructions might fail. You might be surprised how easily a small change in a minor version number of one of the above software packages can make everything fail...
1. Install build essentials & Java

$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install build-essential cmake g++ gfortran git pkg-config wget curl
$ sudo apt-get install openjdk-8-jdk
2. Install Python & related packages
$ pip install numpy
$ pip install scipy
$ pip install h5py
3. (Optional) Install OpenBLAS for increased performance
$ mkdir ~/OpenBLAS
$ cd ~/OpenBLAS
$ git clone https://github.com/xianyi/OpenBLAS.git
$ cd OpenBLAS
$ make FC=gfortran -j16
$ sudo make PREFIX=/usr/local install
4. Install latest nvidia drivers (361 or newer)
$ sudo apt-get install linux-image-generic linux-image-extra-virtual
$ sudo apt-get install linux-source linux-headers-generic
$ sudo nano /etc/modprobe.d/blacklist-nouveau.conf
(you can use vim or emacs, whatever you're more comfortable with as a text editor)
Add the following lines to the file:
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
$ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
$ sudo update-initramfs -u
$ sudo reboot
Go to the NVIDIA website: http://www.nvidia.com/Download/index.aspx and download the drivers appropriate to your graphics card and OS.
Note: even though some drivers have a version specific to Ubuntu, I find the generic "Linux 64-Bit" easier to install, and I've had less problems with the latter than the former.
Press Ctrl-Alt-F1 to enter console mode. Login.
$ sudo service lightdm stop
$ sudo init 3
$ sudo sh <name of NVIDIA driver runfile you downloaded>
$ sudo reboot
$ cat /proc/driver/nvidia/version
This should display something along the lines of:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.66 Tue Aug 1 16:02:12 PDT 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
Should display something along the lines of:
| NVIDIA-SMI 384.66 Driver Version: 384.66 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| 0 Tesla K80 Off | 00000000:00:1E.0 Off | 0 |
| N/A 60C P0 73W / 149W | 0MiB / 11439MiB | 100% Default |
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
| No running processes found |
5. Install CUDA 8.0
Go to the NVIDIA webpage: https://developer.nvidia.com/cuda-80-ga2-download-archive
On the page, select: Linux > x86_64 > Ubuntu > 16.04 > runfile (local).
Download the base installer and Patch 2.
$ sudo sh cuda_8.0.61_375.26_linux.run
When asked if you want to install the nvidia drivers, say no.
Note: at the end of the installation you will probably see a warning message:
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 361.00 is required for CUDA 8.0 functionality to work.
You can ignore this message and continue to the next step.
$ gedit ~/.bashrc
(or any other editor you want)
At the end of the file, add:

# NVIDIA CUDA Toolkit
export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH

Then reload your profile:

$ source ~/.bashrc
$ nvcc --version
You should see a message saying that you have Cuda compilation tools, release 8.0
6. Install cuDNN v5.1
Once again go to NVIDIA's website: https://developer.nvidia.com/cudnn
Go to Download.
Login (or create an account if you haven't already)
Choose "Download cuDNN v5.1 (Jan 20, 2017), for CUDA 8.0"
Choose "cuDNN v5.1 Library for Linux"
$ cd ~/Downloads/
$ tar xvf cudnn*.tgz
$ cd cuda
$ sudo cp */*.h /usr/local/cuda/include/
$ sudo cp */libcudnn* /usr/local/cuda/lib64/
$ sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
7. Install Bazel 0.5.2 from GitHub Releases
I used the file: bazel_0.5.2-linux-x86_64.deb
$ sudo apt-get install ./bazel_0.5.2-linux-x86_64.deb
8. Install Tensorflow r1.2 from sources:
$ git clone https://github.com/tensorflow/tensorflow
$ cd tensorflow
$ git checkout r1.2
$ ./configure

For most choices you can use the default, BUT you must say 'y' for the question about CUDA support.
(Note: this will be very resource intensive.)
$ bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ pip install /tmp/tensorflow_pkg/tensorflow-*.whl
The generated .whl filename depends on your Python version and platform; if the last line doesn't work, list the contents of /tmp/tensorflow_pkg and use the exact filename that is in there instead.
9. (Optional) Test your Tensorflow installation
$ cd ~
$ python

In the Python interpreter, type:

import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))

This should display: "Hello, TensorFlow!"
You can test further with:
a = tf.constant(10)
b = tf.constant(32)
print(sess.run(a + b))
This should display: 42
10. (Optional) Install TFLearn
$ pip install tflearn
Troubleshooting: if your Bazel installation gets into a broken state, remove it:

$ sudo apt-get purge bazel

Then re-follow the steps in Item 7 above.
If you get stuck in an infinite login loop after a driver or kernel update, press Ctrl-Alt-F1 to go to console mode, then run:
$ sudo ubuntu-drivers autoinstall
$ sudo reboot
I faced this infinite login loop problem twice, and both times the above steps solved it for me.
Sometimes we want to reduce tensors based on a given index's value. I had that specific problem today. Tensorflow's reduce operations are not terribly flexible, and will only reduce along an entire axis, indiscriminately. Specifically, let's say for example that I have a Q-table (in reinforcement learning, a Q-table lists, for each state of the environment and possible action in that state, the fitness value of that action/state combination):
q = tf.constant([[[5.0, 3.5], [4.0, 7.8]],
                 [[3.5, 2.0], [1.5, 8.9]]])  # example values, chosen to match the outputs below
Here the first axis represent possible states (in this case, only 2 states), and on the second axis we have a list of action-value "pairs". Let's say that for whatever reason, I now need to return, for each state, the action-value pair that has the maximum value.
As far as I know (and if I'm wrong about this, please save me from unnecessary code complexity and do let me know in the comment section below), Tensorflow won't let you do this in a single operation. Instead you'll have to use a "pattern" of operations that I will describe shortly, which can be followed for any similar index-specific reduce operation that you need to do (be it a min, max, any, all, sum, mean, etc.)
So, what happens when I use tf.reduce_max:
reduced_q = tf.reduce_max(q, 1)
output = sess.run(reduced_q)
[[ 5. 7.80000019]
[ 3.5 8.89999962]]
In other words, for each state it took the max of the values at index 0 across pairs, and the max of the values at index 1 across pairs. This is not what we want: we want the whole pair whose value at index 1 is the maximum.
Instead, here is the solution I came up with. First, let's address the 2-dimensional case:
q_2d = [[5.0, 3.5],
        [4.0, 7.8]]  # example values: one (action, value) pair per row
max_indices = tf.argmax(tf.slice(q_2d, [0, 1], [-1, 1]), 0)
max_entries = tf.gather(q_2d, max_indices)
First, we slice along the inner dimension to get a version of the tensor containing only the values we want to max. We perform tf.argmax, which returns the index of the entries containing those max-valued pairs. Finally, we use tf.gather, to obtain a tensor from these specific indices.
This works correctly for 2d tensors. However, if you apply this code to the original 3d q tensor, you will find that it still doesn't return what we want. First, we need to change the code to use tf.gather_nd (the "N-dimensional" version of tf.gather). Even then, however, you will find that the indices as returned by tf.argmax do not correctly index the entries we want from tf.gather_nd: tf.argmax returns only the within-state index of each max pair (a tensor of shape [2, 1]), but tf.gather_nd needs full coordinates of shape [2, 2], where the first column is the state index.
So we will have to do some extra manipulation:
max_indices = tf.argmax(tf.slice(q, [0, 0, 1], [-1, -1, 1]), 1)
inc_tensor = tf.constant([[0], [1]], dtype=tf.int64)  # the state indices
final_indices = tf.concat((inc_tensor, max_indices), 1)
max_entries = tf.gather_nd(q, final_indices)
The inc_tensor constant will have to be adjusted on a case-by-case basis depending on the actual dimensionality of your tensor. And we get:
[[ 4. 7.80000019]
[ 1.5 8.89999962]]
as desired. I hope that helps someone.
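For readers who want to sanity-check the logic outside of a TensorFlow session, here is the same index-specific reduce pattern in plain NumPy (the q values are mine, chosen to reproduce the outputs shown above; fancy indexing stands in for tf.argmax + tf.gather_nd):

```python
import numpy as np

# Q-table: 2 states x 2 (action, value) pairs, same shape as the tf example
q = np.array([[[5.0, 3.5], [4.0, 7.8]],
              [[3.5, 2.0], [1.5, 8.9]]])

# within each state, find the index of the pair with the largest value (index 1)
max_idx = q[:, :, 1].argmax(axis=1)            # -> array([1, 1]) here

# gather the full (action, value) pair for each state
max_entries = q[np.arange(q.shape[0]), max_idx]
print(max_entries)
```

This recovers the [[4, 7.8], [1.5, 8.9]] result from above, pairing each state index with its argmax index just as inc_tensor and tf.concat did.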
A long time ago, back when dinosaurs roamed the earth, there was Hadoop. It allowed people to hunt mammoths in a distributed fashion, and it was great.
But then, some guy called Matei Zaharia decided that writing intermediate output to disk was an unnecessarily inefficient way of doing things (after all, hard disks are several orders of magnitude slower than RAM -- and RAM is cheap nowadays). He came up with Spark, a solution that does exactly(-ish) what Hadoop did, but in memory (unless there isn’t enough of it available, in which case it will use the hard disk as swap space). This was a revolution, and he became a rock star.
Spark is still the industry standard at the moment: it is both mature enough to be embraced by more conservative companies (I said “conservative”, not prehistoric companies) -- and innovative enough not to be officially surpassed in its scalability and performance… yet.
That being said, there are contenders. Apache Flink is one, proposing several improvements over Spark. Its biggest selling point, from what I understand, is that its streaming system is more efficient, so Flink would be better suited for streaming applications.
The other one, which came out much more recently and is the development that got me excited enough to write this blog post, is called Ray. According to its GitHub commit history, the project was started in February 2016 (while Apache Flink, according to the same source, was started in December 2010). Its aim is to replace Spark for machine learning tasks. It's not yet at its alpha stage, so it isn't seriously considered for any production-quality work, and currently appeals only to hardcore tinkerers and enthusiasts such as myself.
So what’s supposed to be better about it? How is it different from Spark (and Flink)?
First, while Spark supports Java, Scala, Python and R as programming languages, Ray only supports Python (and I assume C++, since it’s written in that language -- but since I’m a civilized human being, I won’t use that).
Spark 1, Ray 0.
Ray replaces the “bulk synchronous” paradigm that is present in Spark. The latter forced you to design your tasks such that all partitions were going to take more or less the same processing time; otherwise, the tasks that end more quickly would have to wait for that one looooong task before moving on. This resulted in having to implement sometimes complex strategies just to make sure that your partitions were “balanced”.
It can also handle GPU-based computation natively, which is a great improvement over Spark, especially with regards to deep learning tasks. In fact, it is “Tensorflow-ready” and its focus is ease-of-use for deep learning and machine learning applications.
Overall, while Spark valued throughput over low latency, Ray’s aim is to bring the latter to the world of distributed machine learning.
The video and slides for my presentation at the Spark Summit in Boston on February 8th are now available online:
The slides can be obtained here:
I was invited to present about the spark-timeseries library last week at the Montreal Apache Spark meetup.
The slides of the presentation are available here.
Coming up next: I will be giving a talk on the same topic at the upcoming Spark Summit in Boston, on February 8th!