Distributions

Normal, Log-normal and other assorted Distributions ... continued from Part II

We start with a jillion numbers: g₁, g₂, g₃, ... g_N where N is very large. For convenience, we'll refer to this set as simply {g}.

>Wait! What are we talking about here? Haven't we done this before?
Yes, but I want to talk about different distributions of, say, monthly stock gains and ...

>Okay, please proceed.
Gee, thanks. Anyway, we count how many of these numbers are less than x. It'll depend upon x, so we'll call it F(x).

>We'll call it? What's it?
We're counting the number of g's less than, say, "2". That number ... that's F(2). The number less than 1 we'll call F(1). The number less than ...
>Okay, I get it, but a picture is worth a thousand ...
Okay, here's a picture
Every time we get a count, we get a point on the chart.
For example, the number of g's less than 2 is 6616. The number less than 0 is 4011.
>It looks like you're working with 10,000 numbers.
You got it, and ...
>And it looks like they all lie between -10 and +10.
Yes, in this particular example, but ...
>So N = 10,000.

This is just an example. The numbers can be anything. The graph of F(x) could look like anything, except that it necessarily starts at the value "0" on the far left and increases to the number of members of the set {g}.

Notice that we can calculate the number of g's between x=0 and x=2:
There are 6616 less than 2 and 4011 less than 0 so there are 6616-4011=2605 between x=0 and x=2.

Usually, we divide F(x) by the number of members of the set {g} - in our case 10,000 - so F(x) gives the fraction which are less than some given x. The graph of this "new" F(x) would then go from 0 at the far left to 1 on the far right and, for our example, the fraction less than 2 is 0.6616 and less than 0 is 0.4011 so 0.2605 or 26.05% of the g's lie between 0 and 2.
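If you'd like to try the counting yourself, here's a minimal Python sketch. The 10,000 numbers are made up (I'm assuming random draws from a bell-shaped distribution, so the counts won't match the chart exactly), and the function names count_less_than and fraction_less_than are just labels for this illustration.

```python
import bisect
import random


def count_less_than(sorted_g, x):
    """Number of g's strictly less than x: the un-normalized F(x)."""
    # bisect_left returns how many entries of a sorted list are < x.
    return bisect.bisect_left(sorted_g, x)


def fraction_less_than(sorted_g, x):
    """Fraction of g's strictly less than x: F(x) scaled to run from 0 to 1."""
    return count_less_than(sorted_g, x) / len(sorted_g)


# Assumption: made-up data, 10,000 random numbers (not the author's set {g}).
random.seed(0)
g = sorted(random.gauss(1.0, 3.0) for _ in range(10_000))

between = count_less_than(g, 2) - count_less_than(g, 0)
print("number less than 2:", count_less_than(g, 2))
print("number less than 0:", count_less_than(g, 0))
print("number between 0 and 2:", between)
print("fraction between 0 and 2:", between / len(g))
```

Sorting once and using a binary search keeps each count cheap, which matters when you want F(x) at many values of x to draw the chart.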

>Are these numbers called g or are they called x? It's confusing, I mean ...
Uh ... I'm calling the original numbers g, like g₁, g₂, etc. However, when I want to talk about a particular g-value, I use the symbol x. For example, I refer to the number of g's less than some particular value x. Clear?

>No. Anyway, you've got some guy called F(x). Does he have a name?
The Cumulative Distribution Function.
In general, if we wanted to know the fraction of g's between x and x+Δx,
it'd be F(x+Δx) - F(x) and, for small Δx, we can write F(x+Δx) - F(x) ≈ F'(x)Δx
where F'(x) is the slope of F(x) at the place x. We'll call this slope f(x), so
F'(x) = f(x)
and the fraction of g's between x and x+Δx is then (to a good approximation, for small Δx):
F(x+Δx) - F(x) = f(x) Δx
and if we sum all these fractions we'd get all the g's, so:
Σf(x) Δx = 1

Note that the fraction lying in an interval of length Δx, at the place x, is f(x) Δx. That's important to remember. I'll say it again:

The fraction lying in an interval of length Δx, at the place x, is f(x) Δx and Σf(x) Δx = 1

>I take it that Σ means we add them all up and "1" means we've included 100% of the g's.
Right. For small Δx, we write this sum as an integral and get our first magic equation:
∫ f(x) dx = 1     (1)
where the integral runs over all values of x.

>An integral?
Don't worry about it. It's a wee bit of calculus.
>A very wee bit?
Yes. I promise.
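For those who'd rather see the sum than the integral, here's a rough Python sketch. It chops the range of some made-up numbers into intervals of length Δx, estimates f(x) in each interval as (fraction in the interval)/Δx, then checks that Σf(x) Δx comes out to 1. The data, the number of intervals and the name density_from_sample are all assumptions for illustration.

```python
import random


def density_from_sample(g, n_bins):
    """Return (left edges, f values, dx), with f = (fraction in interval) / dx."""
    lo, hi = min(g), max(g)
    dx = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in g:
        # The largest value would land in interval n_bins, so clamp it
        # into the last interval.
        index = min(int((value - lo) / dx), n_bins - 1)
        counts[index] += 1
    n = len(g)
    f = [c / n / dx for c in counts]
    edges = [lo + i * dx for i in range(n_bins)]
    return edges, f, dx


# Assumption: made-up data, 10,000 random numbers.
random.seed(1)
g = [random.gauss(1.0, 3.0) for _ in range(10_000)]

edges, f, dx = density_from_sample(g, n_bins=40)
total = sum(fx * dx for fx in f)
print("sum of f(x)*dx =", round(total, 6))
```

Each term f(x) Δx is just the fraction of g's in that interval, so the total has to be 1 no matter how the numbers are distributed.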

Okay, now we want to know the average value of the g's. We do this like so:

Suppose we're determining the average grade on a test and we know that a fraction 0.1 of the students (that's 10%) got a grade of 45, 0.3 got a grade of 65 and 0.6 got a grade of 85.
The average grade is 0.1(45) + 0.3(65) + 0.6(85) = 75

>Don't tell me! To get the average of the g's we'd determine the fraction having the value x₁, say n₁, the fraction having the value x₂, say n₂, etc., then ... uh ... we'd calculate n₁(x₁) + n₂(x₂) + ...

Very good. You've been eating your smart pills. Okay, for the case we're considering, where
the fraction having the value x₁ is f(x₁)Δx,
the fraction having the value x₂ is f(x₂)Δx
etc.,
we'd calculate: x₁f(x₁)Δx + x₂f(x₂)Δx + ... = Σ x f(x) Δx

which gives us our second magic equation:
m = ∫ x f(x) dx     (2)
where m is the average, or Mean.

Now we want to measure how far the g's are from their Mean, m. We calculate the average of the squared deviations:

(1/N){(g₁-m)² + (g₂-m)² + ... }

but, as above, we count the fraction having the value x₁, namely f(x₁)Δx, and the fraction having the value x₂, namely f(x₂)Δx, etc. and use:

(x₁-m)²f(x₁)Δx + (x₂-m)²f(x₂)Δx + ...

which brings us to our third magic equation:

S² = ∫ (x-m)² f(x) dx     (3)

where S² is the mean squared deviation from the Mean.
>And S is called ... what?
S, the Root Mean Square (or RMS) deviation, is called the Standard Deviation.
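To see the Mean and the Standard Deviation fall out of the same "fraction times value" recipe, here's a small Python sketch using the test grades from above (fractions 0.1, 0.3 and 0.6 with grades 45, 65 and 85). The function names are made up for this illustration.

```python
import math


def mean_from_fractions(values, fractions):
    """Sum of (fraction having a value) * (that value)."""
    return sum(p * x for x, p in zip(values, fractions))


def variance_from_fractions(values, fractions):
    """Sum of (fraction having a value) * (squared deviation from the Mean)."""
    m = mean_from_fractions(values, fractions)
    return sum(p * (x - m) ** 2 for x, p in zip(values, fractions))


grades = [45, 65, 85]
fractions = [0.1, 0.3, 0.6]

m = mean_from_fractions(grades, fractions)
S2 = variance_from_fractions(grades, fractions)
S = math.sqrt(S2)

print("Mean:", round(m, 6))
print("Mean squared deviation:", round(S2, 6))
print("Standard Deviation:", round(S, 4))
```

For the grades, the Mean is 75 and the mean squared deviation is 0.1(30)² + 0.3(10)² + 0.6(10)² = 180, so S is the square root of 180, about 13.4.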

>So, what does f(x) look like?
Since it's the slope of the Cumulative Distribution (which grows from 0 to 1), then we expect f(x) to begin (at the far left) with the value 0 then increase (as the slope of F(x) increases), then decrease again (as the slope of F(x) decreases to 0).

>So, what does f(x) look like?

Here's a picture:

>And f(x), I presume, is the "Density".
Oh ... yeah. Did I forget to mention that?

There are a couple of popular distributions when considering the monthly (daily, weekly, yearly?) returns of stocks. The first is the infamous Bell Curve:

f(x) = [1/(S√(2π))] exp{-(x-m)²/(2S²)}     (4) Normal Distribution

The funny guys (like 2π) are there so that equation (1) is satisfied.

Also, the graphs we've used above are from a Normal Distribution with m = 1 and S = 3.
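If you want to convince yourself that the funny guys do their job, here's a Python sketch that plugs the Normal density into the three magic equations using simple sums of f(x) Δx. I'm assuming the standard bell-curve formula, f(x) = exp{-(x-m)²/(2S²)} / (S√(2π)), with m = 1 and S = 3 as in the graphs. The integrate helper, its midpoint rule and the cut-off at 12 Standard Deviations are my own choices for the illustration.

```python
import math


def normal_density(x, m, S):
    """Bell curve with Mean m and Standard Deviation S (assumed standard form)."""
    return math.exp(-((x - m) ** 2) / (2 * S * S)) / (S * math.sqrt(2 * math.pi))


def integrate(func, lo, hi, n=20_000):
    """Midpoint-rule sum of func(x) * dx over n intervals from lo to hi."""
    dx = (hi - lo) / n
    return sum(func(lo + (i + 0.5) * dx) for i in range(n)) * dx


m, S = 1.0, 3.0
# Assumption: 12 Standard Deviations either side is "far left" to "far right".
lo, hi = m - 12 * S, m + 12 * S

area = integrate(lambda x: normal_density(x, m, S), lo, hi)
mean = integrate(lambda x: x * normal_density(x, m, S), lo, hi)
var = integrate(lambda x: (x - mean) ** 2 * normal_density(x, m, S), lo, hi)

print("equation (1), total area:", round(area, 6))
print("equation (2), Mean:", round(mean, 6))
print("equation (3), S squared:", round(var, 6))
```

The three sums should come out very close to 1, m and S² respectively, which is what equations (1), (2) and (3) ask for.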

The next is the Log-normal Distribution. In this case, we assume that the logarithms of the g's have a Normal Distribution. Because we're considering log(g₁), log(g₂), etc., the numbers g₁, g₂, etc. had better be positive!
>Why?
Because log(g) isn't defined (as a real number) unless g > 0. Hence, when considering a Log-normal distribution of returns, we consider g₁, g₂, etc. to be the Gain Factors.
>Remind me.
If the monthly return is 2.3%, the Gain Factor for that month is 1.023, meaning that $1.00 will grow to $1.023 during that month. Since the Gain Factors are always positive - assuming you don't lose everything in one month (!) - then we can consider the distribution of their logarithms.
>I assume that the logarithm is the natural log.
Yes, to the base e = 2.71828, roughly. Anyway, here's the picture.

If we plot the distribution of logarithms, log(g₁), log(g₂), ... log(g_N),
we'd get a Normal curve as shown, where m and S are now the Mean and Standard Deviation of the logarithms!
>That sounds tough, I mean ...
Actually, the Mean is easy. For example, suppose {g} were a set of annual Gain Factors. The average logarithm, m, is

(1/N){ log(g₁)+log(g₂)+...+log(g_N) }	= (1/N)log(g₁g₂...g_N)
	= log{(g₁g₂...g_N)^(1/N)}
	= log(G)

where G = (g₁g₂...g_N)^(1/N) is the annualized Gain Factor, that is, the geometric Mean of the g's.
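Here's a quick Python check that the average logarithm really is the logarithm of the geometric Mean. The four annual Gain Factors are invented for the example, and the function names are just for illustration.

```python
import math


def mean_log(gain_factors):
    """Average of the natural logarithms of the Gain Factors."""
    return sum(math.log(g) for g in gain_factors) / len(gain_factors)


def geometric_mean(gain_factors):
    """(g1 * g2 * ... * gN) ** (1/N)."""
    return math.prod(gain_factors) ** (1 / len(gain_factors))


# Assumption: made-up annual Gain Factors (+10%, -5%, +20%, +3%).
gain_factors = [1.10, 0.95, 1.20, 1.03]

m = mean_log(gain_factors)
G = geometric_mean(gain_factors)

print("average logarithm m:", round(m, 6))
print("log(G):", round(math.log(G), 6))
print("annualized Gain Factor G:", round(G, 6))
```

The first two printed numbers agree, so once you know the annualized Gain Factor you already have the Mean of the logarithms.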

Note that the Normal density for the logarithms z = log(x), call it f(z), contains the term (z-m). Since z=log(x) and m=log(G) we get (z-m)=log(x)-log(G)=log(x/G).
That allows us to write the Log-normal density distribution like so:

Put z-m=log(x/G)
Since we must satisfy equation (1), above, then changing from z to x requires changing Δz to (1/x)Δx
(that is, dz = dx/x) so we must change f Δz to f Δx/x
We get, finally f(x), our density distribution for x (as opposed to, f(z), the distribution for z):

f(x) = [1/(x S√(2π))] exp{-[log(x/G)]²/(2S²)}     (5) Log-normal Distribution

with x > 0 since x is now a Gain Factor ... and S is now the Standard Deviation of log(x) !
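Here's a Python sketch of the change from z to x. It assumes the Log-normal density takes the usual form, f(x) = exp{-[log(x/G)]²/(2S²)} / (x S√(2π)), checks that it equals the Normal density of z = log(x) divided by x, and checks that it still satisfies equation (1). The values G = 1.01 and S = 0.05 are made up, as are the helper names and the integration range.

```python
import math


def normal_density(z, m, S):
    """Normal density for z, with Mean m and Standard Deviation S."""
    return math.exp(-((z - m) ** 2) / (2 * S * S)) / (S * math.sqrt(2 * math.pi))


def lognormal_density(x, G, S):
    """Density for the Gain Factor x when log(x) is Normal (assumed form)."""
    return math.exp(-(math.log(x / G) ** 2) / (2 * S * S)) / (
        x * S * math.sqrt(2 * math.pi)
    )


def integrate(func, lo, hi, n=20_000):
    """Midpoint-rule sum of func(x) * dx over n intervals from lo to hi."""
    dx = (hi - lo) / n
    return sum(func(lo + (i + 0.5) * dx) for i in range(n)) * dx


# Assumption: made-up monthly numbers, geometric Mean 1.01 and S = 0.05.
G, S = 1.01, 0.05

x = 1.023
from_z = normal_density(math.log(x), math.log(G), S) / x
print("f(x) directly:", round(lognormal_density(x, G, S), 6))
print("f(z)/x:", round(from_z, 6))

# Gain Factors between 0.5 and 2.0 cover essentially everything for this S.
area = integrate(lambda t: lognormal_density(t, G, S), 0.5, 2.0)
print("total area:", round(area, 6))
```

The extra 1/x is exactly what keeps the total area at 1 after switching from Δz to Δx.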

>We must change f Δz to f Δx/x ? What's that about?
It's because dz = d(log(x)) = (1/x) dx ... but don't worry about it.

>You promised just a wee bit of calculus!
Yes ... uh, it's because of the logarithm, you see. If x changes by a tiny amount from x to x+Δx, then its logarithm will change from log(x) to log(x+Δx) = log(x[1+Δx/x]) = log(x) + log(1+Δx/x) and, for tiny values of Δx/x, log(1+Δx/x) ≈ Δx/x so the logarithm will change by a tiny amount Δx/x and we can see that x in the denominator so that ...

>Please ... please, continue.
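For anyone who'd like to see that wee bit of calculus in numbers, here's a tiny Python sketch comparing the actual change in log(x) with Δx/x as Δx shrinks. The starting value x = 1.023 is borrowed from the Gain Factor example, the step sizes are arbitrary and log_change_error is a made-up helper.

```python
import math


def log_change_error(x, dx):
    """Gap between the true change in log(x) and the estimate dx/x."""
    exact = math.log(x + dx) - math.log(x)
    approx = dx / x
    return abs(exact - approx)


x = 1.023
for dx in (0.1, 0.01, 0.001):
    exact = math.log(x + dx) - math.log(x)
    print(
        "dx =", dx,
        " change in log:", round(exact, 8),
        " dx/x:", round(dx / x, 8),
        " gap:", round(log_change_error(x, dx), 8),
    )
```

The gap shrinks much faster than Δx itself, which is why Δz can be replaced by Δx/x when Δx is tiny.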

Okay. Notice that, if

1. F(x) is a Normal cumulative distribution with Mean = 0 and Standard Deviation = 1, then
2. F((x-A)/B) is a Normal cumulative distribution with Mean = A and Standard Deviation = B, and
3. F([log(x)-A]/B) describes a distribution where it's log(x) which has a Normal distribution and it's log(x) which has the Mean = A and Standard Deviation = B.

>Number 3 is our Log-normal distribution, right?
Right. Notice that, for the Log-normal distribution, the geometric Mean, G, plays the central role (unlike the Normal distribution where it's the arithmetic Mean).
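The three cumulative distributions can be sketched in Python using the error function from the standard library. I'm assuming the usual identity F(z) = [1 + erf(z/√2)]/2 for the Normal cumulative distribution with Mean 0 and Standard Deviation 1. The values A = 0.01 and B = 0.05 are made up, and the function names are only for illustration.

```python
import math


def standard_normal_cdf(z):
    """F(z) for a Normal distribution with Mean 0 and Standard Deviation 1."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_cdf(x, A, B):
    """F((x-A)/B): x is Normal with Mean A and Standard Deviation B."""
    return standard_normal_cdf((x - A) / B)


def lognormal_cdf(x, A, B):
    """F([log(x)-A]/B): log(x) is Normal with Mean A and Standard Deviation B."""
    return standard_normal_cdf((math.log(x) - A) / B)


# Assumption: made-up Mean and Standard Deviation of the logarithms.
A, B = 0.01, 0.05
G = math.exp(A)  # geometric Mean, since A = log(G)

print("Normal: fraction below the Mean A:", normal_cdf(A, A, B))
print("Log-normal: fraction below G:", round(lognormal_cdf(G, A, B), 6))
print("fraction within one Standard Deviation:",
      round(standard_normal_cdf(1) - standard_normal_cdf(-1), 4))
```

Under these assumptions, half the Gain Factors lie below G in the Log-normal case, just as half the values lie below the arithmetic Mean in the Normal case.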

But if we choose the Mean m and Standard Deviation S to match S&P 500 returns, neither distribution is a very good match.

>So?
So, let's try another tack
... to get this:
where the horizontal axis corresponds to the Gain Factors, not the gains themselves, so 0.8 means a gain of 0.8 - 1 = - 0.2 or a 20% loss and 1.1 means a 10% gain and ...
>Yeah, yeah. I got it.

>But I thought that individual stocks are supposed to be lognormal ... not the S&P 500.

Yes. Some say that individual stocks are more closely approximated by the lognormal distribution than a collection of stocks. For example, here are some samples, for comparison:

>The density distribution is rather erratic, eh?

Okay, here are the cumulative distributions:

However, I'd like to consider a distribution which approximates (for example) the S&P distribution:
for Part IV