Cointegration & Unit Roots
thanks to Ron M. for pointing out the topic

When studying stock prices we usually start with the prices themselves, which constitute a series of numbers, each associated with a particular time.
For example: P_0, P_1, P_2, ... where P_0 is the starting price and P_1 is the next price etc.

>Are we talking daily closing prices ... or what?
It doesn't matter, but we can (if we wish) think of them as daily closing prices. In any case, the series of numbers is called a time series.
The prices don't change smoothly, but may be considered to have some trend upon which is superimposed some random component, as in Figure 1 (where the "trend" is shown in red).

We'd like to associate with the random component some probability distribution characterized by a Mean, Variance, etc.
However, it's clear that any Mean we associate with the prices will increase with time if the trend is increasing and ...

>For me the trend is always down!
Pay attention. In order to analyze the evolution of prices (in particular the random component) one normally considers not the prices themselves but the returns P_1/P_0 - 1, P_2/P_1 - 1, ... and their distribution. For example, the returns which give rise to the time series in Figure 1 might have a distribution as in Figure 2.
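To make the returns series concrete, here is a minimal sketch that turns a list of prices into returns. The function name simple_returns and the prices are made up for illustration; they are not the data behind Figure 1.

```python
def simple_returns(prices):
    """Return [P_1/P_0 - 1, P_2/P_1 - 1, ...] for a list of prices."""
    return [later / earlier - 1.0 for earlier, later in zip(prices, prices[1:])]


# Hypothetical prices, chosen only to illustrate the calculation.
prices = [100.0, 102.0, 99.96, 104.958]
print(simple_returns(prices))   # roughly +2%, -2%, +5%
```

Note that a series of n+1 prices gives only n returns.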

>You're assuming a normal or lognormal distribution?
To investigate the random component one normally makes some assumption regarding the distribution of returns, and normal or lognormal are the most popular assumptions. Indeed, the usual assumption is that the returns are random selections from a distribution where the parameters of the distribution (Mean, Variance etc.) don't change with time.

Note that the series of returns is also a time series. A significant difference between the return time series and the stock price time series is that the stock prices have a Mean which changes with time. Such a time series is called non-stationary.


Figure 1

Figure 2

Figure 3

>And the returns time series ... is it stationary?
It would be if the distribution parameters (Mean, Variance etc.) are constant over time.

>And are they?
Are they constant over time? Look at Figure 3 where we note the Mean and Variance (that's Standard Deviation^2) of the monthly returns for GE stock, over the 1960s, 1970s etc.

>I'd say non-stationary.
Yes, me too.
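The kind of check behind Figure 3 can be sketched as follows: split the returns into consecutive blocks (decades, say) and compare the Mean and Variance of each block. The helper block_stats and the sample numbers are assumptions for illustration only; they are not the GE data.

```python
from statistics import mean, pvariance


def block_stats(returns, block_size):
    """Split returns into consecutive full blocks; give (Mean, Variance) of each."""
    stats = []
    for start in range(0, len(returns) - block_size + 1, block_size):
        block = returns[start:start + block_size]
        stats.append((mean(block), pvariance(block)))
    return stats


# Made-up monthly returns, NOT the GE data behind Figure 3.
sample = [0.01, 0.03, -0.02, 0.02, 0.10, -0.12, 0.15, -0.09]
for block_mean, block_var in block_stats(sample, 4):
    print(f"Mean = {block_mean:.4f}, Variance = {block_var:.6f}")
```

If the numbers differ a lot from block to block, that is a hint (not a proof) that the series is non-stationary.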

Another thing we note (about how one often analyzes stock prices and/or returns) is the connection between two stocks in our portfolio.
It's customary to consider the correlation between the two returns series, measured by the Pearson Correlation (or its square, R-squared). Portfolio optimization and risk-reward analysis have usually been based upon correlation of returns, but a more recent consideration is the so-called cointegration.

>Huh?
When dealing with stock prices, it's been customary to first consider the difference between successive prices (and consider, instead, returns). This "differencing" eliminates any trends which may be involved and ...

>What's this cointegration stuff?
Correlation is used to measure the ... uh, correlation between returns. When the returns for one stock go up or down, does the other tend to go up or down as well? It's a short term measure of interdependence.

On the other hand, cointegration attempts to measure common trends in prices over the long haul. For example, suppose that the time series associated with two stock returns have a high correlation and the prices have high cointegration, as in Figure 4a. See the similarity in stock price trends?

>Yeah, so?
Okay, suppose we change the returns on one stock by a very small amount, replacing a return r by r - 0.001 (for example). The correlation between r and r - 0.001 is high. (Indeed, it's 100% !!)

>But the stock prices are no longer cointegrated, eh?
Yes, as in Figure 4b.
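The r versus r - 0.001 example can be tried numerically. The sketch below uses simulated returns (a Normal distribution with made-up parameters, which is an assumption and not the data behind Figures 4a and 4b); pearson and price_path are hypothetical helper names.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def price_path(p0, returns):
    """Compound a starting price through a list of returns."""
    path = [p0]
    for r in returns:
        path.append(path[-1] * (1.0 + r))
    return path


random.seed(1)
# Assumed daily returns: Mean 0.05%, Standard Deviation 1%, about ten years.
returns_a = [random.gauss(0.0005, 0.01) for _ in range(2500)]
returns_b = [r - 0.001 for r in returns_a]      # shave 0.001 off every return

path_a = price_path(100.0, returns_a)
path_b = price_path(100.0, returns_b)

print("correlation of returns:", pearson(returns_a, returns_b))
print("final price B / final price A:", path_b[-1] / path_a[-1])
```

The correlation comes out at 1 (to rounding), yet the second price ends at a small fraction of the first: the returns move together but the prices drift apart.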

>Okay, I know how to calculate correlation, but how do I calculate ...?
How to calculate cointegration? That's what we'll talk about.


Figure 4a

Figure 4b


First, notice that if stock price (at time = n) increased according to P_n = P_0 + r n + e_n where r is some growth rate and e_n is a random variable with Mean = 0 and Variance = v (and the correlation between pairs, such as e_n and e_{n-m}, is zero), then the mean of P_n would be P_0 + r n (hence it's changing with the time, n) but the Variance would be constant at v (introduced by the random component e_n).

>Huh?
You can find the stat stuff here.
Anyway, if you could identify "r" then the NEW variable P_n - (P_0 + r n) would have constant Mean = 0 as well as constant Variance v. This NEW variable would be stationary, eh? In fact, from the relation P_n = P_0 + r n + e_n we can see that P_n - P_{n-1} = r + e_n - e_{n-1} so this differencing has produced a stationary time series since e_n - e_{n-1} has constant statistical parameters like: Mean = 0, Variance = 2v.
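The claim that differencing leaves a Mean of r and a Variance of 2v can be checked with a quick simulation. This is only a sketch: the Normal distribution for e_n, the parameter values and the helper names trend_series and mean_and_variance are all assumptions made for the illustration.

```python
import random


def trend_series(p0, r, v, n, seed=0):
    """P_k = p0 + r*k + e_k for k = 1..n, with e_k ~ Normal(0, v) (assumed)."""
    rng = random.Random(seed)
    sd = v ** 0.5
    return [p0 + r * k + rng.gauss(0.0, sd) for k in range(1, n + 1)]


def mean_and_variance(values):
    m = sum(values) / len(values)
    return m, sum((x - m) ** 2 for x in values) / len(values)


r, v = 0.5, 4.0
prices = trend_series(p0=100.0, r=r, v=v, n=20000)
diffs = [b - a for a, b in zip(prices, prices[1:])]   # P_n - P_{n-1}

diff_mean, diff_var = mean_and_variance(diffs)
print("Mean of differences:", diff_mean, "(compare with r =", r, ")")
print("Variance of differences:", diff_var, "(compare with 2v =", 2 * v, ")")
```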

>A stock price that increases in a straight line? I've never seen any ...?


Figure 5

That's just an example. However, we might also be talking about spending habits where, as time progresses, we spend more and the national population expenditures might be considered to grow linearly ... with a random component. Or, the time series defined by P_n = P_0 + r n + e_n might be a relation between the logarithms of successive stock prices, like log[U_n] = log[U_0] + r n + e_n meaning that U_n = U_0 e^{r n + e_n} implying a growth from some initial price of U_0 with annual (or monthly or daily) log returns having a Mean equal to r.


Now let's consider a Random Walk where P_n = P_{n-1} + e_n.

Then we'd have P_1 = P_0 + e_1 and P_2 = P_1 + e_2 = P_0 + e_1 + e_2 and, eventually, P_n = P_0 + e_1 + e_2 + ... + e_n. Then the Variance is:
[1]     VAR[P_n] = VAR[e_1] + VAR[e_2] + ... + VAR[e_n] = n v.

>That makes the Standard Deviation grow like SQRT(n), right?
Yes, the volatility would increase as time progressed ... growing without bound as n increases.
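Formula [1] can be checked by running many random walks side by side and measuring the Variance across them at each step. A minimal sketch, assuming Normal steps and made-up sizes (the function name variance_across_walks is hypothetical):

```python
import random


def variance_across_walks(n_steps, n_walks, v, seed=0):
    """Variance of P_n - P_0 across independent walks, for n = 1..n_steps."""
    rng = random.Random(seed)
    sd = v ** 0.5
    positions = [0.0] * n_walks          # P_n - P_0 for each walk
    variances = []
    for _ in range(n_steps):
        # Every walk takes one more independent step e_n ~ Normal(0, v).
        positions = [p + rng.gauss(0.0, sd) for p in positions]
        m = sum(positions) / n_walks
        variances.append(sum((p - m) ** 2 for p in positions) / n_walks)
    return variances


variances = variance_across_walks(n_steps=100, n_walks=2000, v=1.0)
for n in (25, 50, 100):
    print(n, round(variances[n - 1], 2))   # compare with n * v
```

The measured Variance should grow roughly in proportion to n, so the Standard Deviation grows like SQRT(n).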


Now consider a time series defined by P_n = ρ P_{n-1} + e_n where, as before, e_n is a random variable with Mean = 0 and Variance = v.
Then P_1 = ρ P_0 + e_1, P_2 = ρ P_1 + e_2 = ρ^2 P_0 + ρ e_1 + e_2 and, eventually, P_n = ρ^n P_0 + ρ^{n-1} e_1 + ρ^{n-2} e_2 + ... + e_n.

Or, to put it differently:
P_n = e_n + ρ e_{n-1} + ρ^2 e_{n-2} + ρ^3 e_{n-3} + ...

Note that this expression defines the time series P_n as a moving average and ...

>Huh?
It's a sum with yesterday's e_{n-1} having a weight of ρ and the day before having a weight of ρ^2 and the day before that having ...

>Yeah. So?
So the Mean of P_n is the sum of the means, which is zero, and the Variance is the sum of the variances (because the random components e_n, e_{n-1}, etc. have zero correlation) so Variance is:
[2]     VAR[P_n] = VAR[e_n] + VAR[ρ e_{n-1}] + VAR[ρ^2 e_{n-2}] + ... = v + ρ^2 v + ρ^4 v + ρ^6 v + ... = v / (1 - ρ^2)

>Where's P0 and how did you get ...?
Okay, the fact that VAR[ρ e] = ρ^2 VAR[e] is here and, uh ... we'll assume the time series is infinite and goes back forever.

>To simplify the math, eh?
Well ... yes, and it gives us a nice formula for the Variance since 1 + ρ^2 + ρ^4 + ρ^6 + ... = 1/(1 - ρ^2).

>Okay, I'll stick in ρ = 10 and get ...
Uh, no, we can only write 1 + ρ^2 + ρ^4 + ρ^6 + ... = 1/(1 - ρ^2)   if   -1 < ρ < 1.

However, if -1 < ρ < 1 then we can see from [2] that the Mean and Variance (and other stochastic parameters associated with P_n, like covariance) are constants so this time series is stationary. Of course, if ρ = 1 then we're in trouble.
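Formula [2] can be compared with a simulation. The sketch below generates a long series with ρ = 0.8 and v = 1 (made-up values), discards an initial stretch so the starting value is forgotten (a stand-in for "goes back forever"), and measures the Variance. Normal e_n and the function name ar1_sample_variance are assumptions for the illustration.

```python
import random


def ar1_sample_variance(rho, v, n, burn_in=1000, seed=0):
    """Sample Variance of P_k = rho * P_{k-1} + e_k, with e_k ~ Normal(0, v)."""
    rng = random.Random(seed)
    sd = v ** 0.5
    p = 0.0
    for _ in range(burn_in):             # let the series forget its start
        p = rho * p + rng.gauss(0.0, sd)
    values = []
    for _ in range(n):
        p = rho * p + rng.gauss(0.0, sd)
        values.append(p)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n


rho, v = 0.8, 1.0
estimate = ar1_sample_variance(rho, v, n=100000)
theory = v / (1 - rho ** 2)              # formula [2], valid for -1 < rho < 1
print("simulated:", estimate, " formula [2]:", theory)
```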

>ρ = 1? Is that the unit root stuff?
Yes. In fact, the assumption that we started our time series in the infinite past gives rise to a stationary time series.
If we had started at some t = 0, we'd get a Variance sum which ended with ρ^{2n-2} v, so it depends upon n ... so it wouldn't be stationary.
Also, if we had ρ > 1 (or ρ < -1) it wouldn't be stationary.
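For the finite-start case, the Variance can be written in closed form. Starting from a fixed P_0 at time 0 and using only the zero-correlation assumption on the random components, the geometric sum gives:

```latex
\begin{aligned}
\mathrm{VAR}[P_n] &= v\left(1 + \rho^2 + \rho^4 + \dots + \rho^{2n-2}\right) \\
                  &= v\,\frac{1-\rho^{2n}}{1-\rho^{2}} \qquad (\rho^2 \neq 1)
\end{aligned}
```

For -1 < ρ < 1 this approaches v / (1 - ρ^2), formula [2], as n grows. For ρ = 1 the sum has n equal terms and gives n v, formula [1].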


Now consider two time series defined by:
[3A]     P_n = P_0 e^{r n} e_n
[3B]     P_n = P_{n-1} e^r e_n

If e_n = 1 (for every n), then [3A] and [3B] are the same. In fact, [3B] gives P_n = P_0 e^{r n} e_1 e_2 ... e_n.

>That notation ... using "e" for the random component and for e^r ... you've done that before ... it's confusing!
Then concentrate!
Anyway, one may be tempted to say that there is little to choose between [3A] and [3B]. However, we can write:
[3A.1]     log[P_n] = log[P_0] + r n + log[e_n]
[3B.1]     log[P_n] = log[P_0] + r n + log[e_1] + log[e_2] + ... + log[e_n]

For the [3B] time series, there's a sum of random components (not just the current component, as in [3A]) and that makes the Variance grow with time, n.
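The contrast between [3A.1] and [3B.1] shows up quickly in a simulation. In the sketch below the terms log[P_0] + r n are left out because they are constants and don't affect the Variance. The assumption that log[e_k] is Normal with Standard Deviation s, the sizes, and the function name log_price_variances are all choices made for the illustration.

```python
import random


def log_price_variances(n_steps, n_paths, s, seed=0):
    """Variance of log[P_n] across paths at n = n_steps, for [3A] and [3B]."""
    rng = random.Random(seed)
    logs_3a, logs_3b = [], []
    for _ in range(n_paths):
        shocks = [rng.gauss(0.0, s) for _ in range(n_steps)]   # log[e_1..e_n]
        logs_3a.append(shocks[-1])      # [3A.1]: only log[e_n] appears
        logs_3b.append(sum(shocks))     # [3B.1]: log[e_1] + ... + log[e_n]

    def variance(values):
        m = sum(values) / len(values)
        return sum((x - m) ** 2 for x in values) / len(values)

    return variance(logs_3a), variance(logs_3b)


s, n_steps = 0.02, 50
var_3a, var_3b = log_price_variances(n_steps=n_steps, n_paths=2000, s=s)
print("[3A]:", var_3a, "(compare with s^2 =", s ** 2, ")")
print("[3B]:", var_3b, "(compare with n s^2 =", n_steps * s ** 2, ")")
```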

>Is this stuff useful?
We'll see, but the point is that having a stationary time series is very handy. If, on the other hand, the Mean and/or Variance change from day to day, it's more awkward to do the statistical analysis. We'd like to see if some combination of the terms in the time series will result in a stationary time series.

>Like considering the returns instead of the prices?
Yes. Taking differences in successive prices or considering the percentage change ... that can eliminate trends as we saw in Figure 1.

>But calculating returns isn't the same as taking differences between successive prices ... is it?
Well, no. The returns involve a ratio:  P_{n+1} / P_n. However, the difference between the logarithms gives log[P_{n+1}] - log[P_n] = log[P_{n+1} / P_n].
The effect of differencing is to remove the trend (called, would-you-believe, "de-trending") and that may be bad as it removes the possibility of detecting common trends between two or more stocks. Nevertheless, it allows one to deal with a stationary time series. Imagine how one would determine the correlation between two time series if we dealt with prices or other non-stationary series. Now, using cointegration, one hopes to eliminate the need to calculate correlation coefficients yet identify common trends.
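The link between returns and differences of logarithms can be seen side by side in a small sketch (made-up prices, hypothetical function name log_returns). Each log difference equals log[1 + return], so for small returns the two numbers are close.

```python
import math


def log_returns(prices):
    """Return [log(P_1/P_0), log(P_2/P_1), ...] for a list of prices."""
    return [math.log(later / earlier) for earlier, later in zip(prices, prices[1:])]


prices = [100.0, 102.0, 99.96, 104.958]     # hypothetical prices
logs = log_returns(prices)
simple = [later / earlier - 1.0 for earlier, later in zip(prices, prices[1:])]

for g, r in zip(logs, simple):
    print(f"log difference = {g:.5f}   return = {r:.5f}")
```

A handy side effect: the log differences add up to log[last price / first price], which is why differencing logarithms behaves like ordinary differencing.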

>Huh?
Suppose we wanted a portfolio of stocks chosen so as to "follow" the DOW. If we could arrange it so the tracking error (the difference between our portfolio and the DOW) was stationary, we'd be happy. It'd mean that our portfolio might deviate from the DOW but these deviations would have a Mean = 0 so portfolio values would oscillate about the DOW and ...

>Isn't that Mean Reversion?
Yes, if cointegration exists then there'd be mean reversion to the DOW index.

>Okay, so what IS cointegration? I mean ...
Two (or more) non-stationary time series are said to be cointegrated if a linear combination of the terms results in a stationary time series.
For example, if U_n and V_n are non-stationary but U_n - C V_n is stationary (for some constant C), then the two series are cointegrated (and there's an underlying, common trend). This would be the case if the "error", e_n = U_n - C V_n is stationary and therefore has time-independent statistical parameters: Mean, Variance and Autocovariance.
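Here is a rough sketch of the definition at work, under loud assumptions: V_n is a simulated random walk, U_n is built as C V_n plus stationary noise (so the pair is cointegrated by construction), and C is estimated by an ordinary least-squares fit. The names ols_slope_intercept, true_c and the parameter values are made up. This only eyeballs whether the spread looks stationary; it is not a formal statistical test for cointegration.

```python
import random


def ols_slope_intercept(x, y):
    """Least-squares fit y ~ slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def variance(values):
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


random.seed(7)
n, true_c = 5000, 2.0

v_series = [0.0]                                  # V_n: a random walk
for _ in range(n - 1):
    v_series.append(v_series[-1] + random.gauss(0.0, 1.0))
# U_n = C V_n + stationary noise, so U and V share one common trend.
u_series = [true_c * v + random.gauss(0.0, 1.0) for v in v_series]

c_hat, a_hat = ols_slope_intercept(v_series, u_series)
spread = [u - c_hat * v - a_hat for u, v in zip(u_series, v_series)]

half = n // 2
print("estimated C:", c_hat)
print("Variance of V:", variance(v_series))
print("Variance of spread, 1st half:", variance(spread[:half]))
print("Variance of spread, 2nd half:", variance(spread[half:]))
```

V wanders far from where it started, but the spread U_n - C V_n keeps about the same small Variance in both halves, which is what one would expect from a stationary error.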

>Autocovariance?
Yes, the covariance between e_n and e_{n-m}, namely the Mean of (e_n - M)(e_{n-m} - M) ... where M is the Mean of the e_n.
This Mean can depend upon m (the "lag") but NOT upon n (the time). That (along with the time independence of the Mean and Variance) would make the time series e_n stationary. Note that, if m = 0, then the mean of (e_n - M)(e_{n-m} - M) = (e_n - M)^2 is just the Variance ... or (Standard Deviation)^2.
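A sample version of the Autocovariance can be sketched in a few lines. The function name autocovariance is hypothetical, and averaging over the n - m available pairs is one common convention among several.

```python
def autocovariance(series, lag):
    """Mean of (e_n - M)(e_{n-lag} - M) over the available pairs."""
    n = len(series)
    m = sum(series) / n                    # M, the Mean of the series
    products = [(series[i] - m) * (series[i - lag] - m) for i in range(lag, n)]
    return sum(products) / len(products)


errors = [1.0, 2.0, 3.0, 4.0]              # made-up numbers
print(autocovariance(errors, 0))           # lag 0: just the Variance
print(autocovariance(errors, 1))
```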

See also Spearman Rank Correlation