Learning outcomes

  • should be able to calculate a sample correlation
  • should be able to test whether a correlation is zero
  • should be able to estimate a confidence interval for a correlation
  • should be able to explain the difference between correlation and regression

Regression and Correlation

So far we have been looking at regression, where we try to explain \(Y\) with a covariate, \(X\). In this module we will look at a slightly different problem, where \(Y\) and \(X\) are both random, and we want to know if they are correlated. The difference here is that we are not thinking of \(X\) as explaining \(Y\); instead they are just related, e.g. they may both be responding to a common cause.

We can summarise this relationship by estimating the covariance (Definition 3.9.1):

\[ \textrm{Cov}(X,Y) = \textrm{E}(XY) - \textrm{E}(X)\textrm{E}(Y) \]

The sample covariance is defined in a similar way:

\[ \widehat{\textrm{Cov}}(X,Y) = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n}\left(\sum_{i=1}^n x_iy_i - n\bar{x}\bar{y}\right) \]

Clearly if \(\widehat{\textrm{Cov}}(X,Y)>0\), \(y_i\) tends to be larger than \(\bar{y}\) when \(x_i>\bar{x}\), and vice versa. In contrast, if \(y_i\) tends to be larger than \(\bar{y}\) when \(x_i<\bar{x}\), the covariance will be negative.

One problem with using the covariance as a summary of association is that it depends on the scales of the variables: if we write \(x^{*}_i = C x_i\) then

\[ \widehat{\textrm{Cov}}(X^{*},Y) = \frac{1}{n}\left(\sum_{i=1}^n x_i^{*}y_i - n\bar{x}^{*}\bar{y}\right) = \frac{1}{n}\left(\sum_{i=1}^n Cx_iy_i - nC\bar{x}\bar{y}\right) = C\,\widehat{\textrm{Cov}}(X,Y) \]

The upside to this is that we can choose a convenient \(C\) to standardise the data. Specifically, we divide each variable by its standard deviation, which gives the correlation coefficient:

\[ \rho(X,Y) = \frac{\textrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \textrm{Cov}\left(\frac{X-\mu_X}{\sigma_X}, \frac{Y-\mu_Y}{\sigma_Y}\right) \] Like the covariance, this will be positive if larger values of \(X\) tend to be associated with larger values of \(Y\). It is 0 if the covariance is 0, and \(\pm1\) if there is an exact linear relationship.
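
As a quick numerical illustration (a minimal Python sketch with simulated data, not from the book): rescaling \(x\) changes the covariance but leaves the correlation untouched.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    y = 0.5 * x + rng.normal(size=100)      # simulated data with some association

    def cov(a, b):
        # sample covariance, using the 1/n convention from these notes
        return np.mean((a - a.mean()) * (b - b.mean()))

    def corr(a, b):
        return cov(a, b) / (a.std() * b.std())   # np.std() also divides by n

    print(cov(x, y), corr(x, y))            # the raw data
    print(cov(10 * x, y), corr(10 * x, y))  # covariance scales by C = 10, correlation is unchanged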

Prove that \(\rho(X,Y) = \textrm{Cov}\left(\frac{X-\mu_X}{\sigma_X}, \frac{Y-\mu_Y}{\sigma_Y}\right)\)

Answer

Note that \(\textrm{E}(X) = \mu_X\), so \(\textrm{E}\left(\frac{X-\mu_X}{\sigma_X}\right) = 0\) (and similarly for \(Y\)).

\[ \begin{align} \textrm{Cov}\left(\frac{X-\mu_X}{\sigma_X}, \frac{Y-\mu_Y}{\sigma_Y}\right) &= \textrm{E}\left(\frac{X-\mu_X}{\sigma_X}\frac{Y-\mu_Y}{\sigma_Y}\right) -\textrm{E}\left(\frac{X-\mu_X}{\sigma_X}\right) \textrm{E}\left(\frac{Y-\mu_Y}{\sigma_Y}\right) \\ &= \frac{1}{\sigma_X \sigma_Y} \textrm{E}\left[(X-\mu_X)(Y-\mu_Y)\right] \\ &= \frac{\textrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \rho(X,Y) \end{align} \]

Prove that \(|\rho(X,Y)| \le 1\)

Hint: use \(\textrm{Var}(aX + bY) = a^2 \textrm{Var}(X) + b^2 \textrm{Var}(Y) + 2ab \textrm{Cov}(X,Y)\) (for some simple values of \(a\) and \(b\))

Hint: use \(X^{*} = \frac{X - \mu_X}{\sigma_X}\), where \(\mu_X\) and \(\sigma_X\) are the mean and standard deviation of \(X\)
Answer Theorem 11.4.1

Prove that \(|\rho(X,Y)| = 1\) if \(Y = aX+b\) for some constants \(a \neq 0\) and \(b\)

Answer The second half of Theorem 11.4.1

Here are some correlated variables, so you get a feeling for what they look like:
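
To generate examples like these yourself, here is a minimal Python sketch (my own, not from the notes) that draws samples from bivariate normal distributions with a few different correlations:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)
    fig, axes = plt.subplots(1, 4, figsize=(12, 3))
    for ax, rho in zip(axes, [-0.9, 0.0, 0.5, 0.9]):
        # draw from a bivariate normal with unit variances and correlation rho
        xy = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=200)
        ax.scatter(xy[:, 0], xy[:, 1], s=10)
        ax.set_title(f"rho = {rho}")
    plt.show()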

If you want to look at something more interactive, this is a nice app to explore correlations.

Estimating a Correlation Coefficient

We can estimate the correlation coefficient by plugging in the estimates of the covariance and variances, i.e. we go from

\[ \rho(X,Y) = \frac{\textrm{E}\left[(X-\textrm{E}(X))(Y-\textrm{E}(Y))\right]}{\sqrt{\textrm{Var}(X)\textrm{Var}(Y)}}= \frac{\textrm{E}(XY) - \textrm{E}(X)\textrm{E}(Y)}{\sqrt{\textrm{Var}(X)\textrm{Var}(Y)}} \]

to the sample estimator:

\[ r = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i -\bar{Y})}{\sqrt{\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2 \frac{1}{n}\sum_{i=1}^n(Y_i-\bar{Y})^2}} \]

Show that the correlation coefficient can also be estimated as \(r = \frac{\sum_{i=1}^n X_iY_i - n\bar{X}\bar{Y}}{\sqrt{\left(\sum_{i=1}^n X_i^2-n\bar{X}^2\right) \left(\sum_{i=1}^n Y_i^2-n\bar{Y}^2\right)}}\)

(you should have done much of this already: I am mainly asking about this to remind you of the tricks involved)

Answer

We can do the numerator and denominator separately. First the numerator:

\[ \begin{align} \sum_{i=1}^{n}(X_i - \bar{X})(Y_i -\bar{Y}) &=\sum_{i=1}^{n}X_iY_i - \sum_{i=1}^{n}X_i\bar{Y} - \sum_{i=1}^{n}Y_i\bar{X}+ \sum_{i=1}^{n}\bar{X}\bar{Y} \\ &= \sum_{i=1}^{n}X_iY_i - \bar{Y}\sum_{i=1}^{n}X_i - \bar{X}\sum_{i=1}^{n}Y_i+ \sum_{i=1}^{n}\bar{X}\bar{Y} \\ &= \sum_{i=1}^{n}X_iY_i - n\bar{Y}\bar{X} - n\bar{X}\bar{Y}+ n\bar{X}\bar{Y} \\ &= \sum_{i=1}^{n}X_iY_i - n\bar{Y}\bar{X} \\ \end{align} \] And in the denominator, we only need to look at one of \(X\) or \(Y\):

\[ \begin{align} \sum_{i=1}^{n}(X_i - \bar{X})^2 &=\sum_{i=1}^{n}X_i^2 - 2\sum_{i=1}^{n}X_i\bar{X} + \sum_{i=1}^{n}\bar{X}^2 \\ &=\sum_{i=1}^{n}X_i^2 - 2n\bar{X}^2 + n\bar{X}^2 \\ &=\sum_{i=1}^{n}X_i^2 - n\bar{X}^2 \\ \end{align} \]

So, putting these together (and cancelling some \(\frac{1}{n}\)’s) we get

\[ \begin{align} r &= \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i -\bar{Y})}{\sqrt{\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2 \frac{1}{n}\sum_{i=1}^n(Y_i-\bar{Y})^2}} \\ &= \frac{\frac{1}{n}\left(\sum_{i=1}^{n}X_iY_i - n\bar{Y}\bar{X}\right)}{\frac{1}{n}\sqrt{\left(\sum_{i=1}^{n}X_i^2 - n\bar{X}^2\right)\left(\sum_{i=1}^{n}Y_i^2 - n\bar{Y}^2\right)}} \\ &= \frac{\sum_{i=1}^{n}X_iY_i - n\bar{Y}\bar{X}}{\sqrt{\left(\sum_{i=1}^{n}X_i^2 - n\bar{X}^2\right)\left(\sum_{i=1}^{n}Y_i^2 - n\bar{Y}^2\right)}} \\ \end{align} \]
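
As a numerical check (a small sketch with simulated data), the two forms of the estimator give the same value, and agree with numpy's built-in np.corrcoef():

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    y = 2 + 0.7 * x + rng.normal(size=50)
    n = len(x)

    # centred form
    r1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
        np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))

    # "sums of products" form from the exercise
    r2 = (np.sum(x * y) - n * x.mean() * y.mean()) / np.sqrt(
        (np.sum(x**2) - n * x.mean()**2) * (np.sum(y**2) - n * y.mean()**2))

    print(r1, r2, np.corrcoef(x, y)[0, 1])   # all three should agree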

The Bivariate Normal Distribution

The sample correlation is a summary of the data, but if we want to do more with bivariate problems we need a model for them. A convenient choice is to generalise the normal distribution, to a bivariate normal distribution. This is a natural choice, partly because a lot of statistical problems involve normal distributions, and the bivariate normal can be extended to more than two dimensions, to give the multivariate normal distribution.

Remember that the pdf of a normal distribution is

\[ f_{Y}(y) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^2}(y-\mu)^2} \]

We can develop the bivariate normal distribution following §11.5. The path is a bit tortuous, so we will do it in bits. We will start off by assuming that \(X\) and \(Y\) are independent standard normal random variables, and build on that.

Show that if \(X\) and \(Y\) are independent standard normal random variables, the joint pdf is \(f_{X,Y}(x,y) = \frac{1}{2\pi}e^{-\frac{1}{2}(x^2+y^2)}\)

Answer \[ \begin{align} f_{X,Y}(x,y) &= f_{X}(x)f_{Y}(y) \\ &= \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}y^2} \\ &=\frac{1}{2\pi}e^{-\frac{1}{2}(x^2+y^2)} \end{align} \] The first line is because they are independent, and the second line is because they are standard (i.e. plug in \(\mu_X=\mu_Y=0\), \(\sigma_X=\sigma_Y=1\)).

An intuitively obvious extension is to write the exponent as a general quadratic form in \(x\) and \(y\) (eqn. 11.5.2):

\[ f_{X,Y}(x,y) = Ke^{-\frac{1}{2}c(x^2-2vxy + y^2)} \]

We want to find \(K\), \(c\), and \(v\) (or at least some of their properties) to make sure the marginal pdfs of \(X\) and \(Y\) are normal distributions. We will start by finding \(K\), the normalising constant. This means we have to evaluate this double integral:

\[ 1/K = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\frac{1}{2}c(x^2-2vxy + y^2)} dx dy \]

(we can also happily reverse the order of integration). We do that by completing the square, i.e. writing \(ay^2 + by + c = A (y - B)^2 + C\). Doing this lets us integrate over \(y\) first.

Complete the square to show that \(x^2-2vxy + y^2 = (y - vx)^2 + (1-v^2)x^2\)

Answer Top of p571.

Now we can write the pdf as \(f_{X,Y}(x,y) = Ke^{-\frac{1}{2}c(1-v^2)x^2} e^{-\frac{1}{2}c(y-vx)^2}\). This means we can integrate with respect to \(y\) first. Even better, we can do the integration by inspection.

What is \(\int_{-\infty}^\infty{e^{-\frac{1}{2}c(x-b)^2}}dx\)?

Answer

This looks like part of a normal distribution pdf with mean \(b\) and variance \(1/c\), so

\[ \int_{-\infty}^\infty{e^{-\frac{1}{2}c(x-b)^2}}dx = \sqrt{\frac{2 \pi}{c}} \]

(the usual proof of this Gaussian integral squares it and changes to polar coordinates)

Show that \(K = \frac{c \sqrt{1-v^2}}{2\pi}\) by evaluating the integral \(1/K = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\frac{1}{2}c(x^2-2vxy + y^2)} dx dy\)

(you may want to reverse the order of integration)

Answer Middle of p571.

Note that this implies \(|v|<1\) if we want to avoid \(K\) being a complex number.

We can choose \(c\) to be anything positive, and it turns out that \(c=(1-v^2)^{-1}\) is a good choice. So, substituting these in we get

\[ \begin{align} f_{X,Y}(x,y) &= \frac{(1-v^2)^{-1} \sqrt{1-v^2}}{2\pi}e^{-\frac{1}{2}(1-v^2)^{-1}(x^2-2vxy + y^2)} \\ &= \frac{1}{2\pi\sqrt{1-v^2}}e^{-\frac{1}{2}\frac{(x^2-2vxy + y^2)}{(1-v^2)}} \\ &= \frac{1}{2\pi\sqrt{1-v^2}} e^{-\frac{1}{2}x^2}e^{-\frac{1}{2}\frac{(y-vx)^2}{(1-v^2)}} \end{align} \] (this is eqn. 11.5.3). The last line comes from completing the square. Note that this separates out \(e^{-\frac{1}{2}x^2}\) from \(e^{-\frac{1}{2}\frac{(y-vx)^2}{(1-v^2)}}\). This makes integrating with respect to \(y\) easier, but is also useful when working with conditional distributions.

This is great. We have an equation. But before we celebrate and go home, it would be nice to understand what it is. First, we want the marginal distributions to be normal distributions, i.e. if we integrate out \(y\), \(x\) should follow a normal distribution.

Show that the marginal distribution of \(x\) is a standard normal distribution, i.e. \(f_X(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}\)

Hint This is similar to part of the calculation of \(K\).
Answer Bottom of p571.
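
A quick numerical check of this (a sketch, not a substitute for the proof): integrate the joint pdf over \(y\) for a few values of \(x\), and compare with the standard normal density. The value of \(v\) below is arbitrary.

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    v = 0.6   # any value with |v| < 1

    def joint_pdf(x, y):
        # the standardised bivariate normal pdf (eqn. 11.5.3)
        return np.exp(-0.5 * (x**2 - 2*v*x*y + y**2) / (1 - v**2)) / (2 * np.pi * np.sqrt(1 - v**2))

    for x in [-1.5, 0.0, 0.7]:
        marginal, _ = quad(lambda y: joint_pdf(x, y), -np.inf, np.inf)
        print(x, marginal, norm.pdf(x))   # the last two columns should match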

The second property we want to look at is the correlation between \(X\) and \(Y\). Because they both have variance 1, this is the same as the covariance. And because they both have mean 0, this is the same as \(\textrm{E}(XY)\).

Prove that \(\rho(X,Y)=\textrm{E}(XY)=v\).

Hint: \(\textrm{E}(XY)=\int\int xyf_{X,Y}(x,y)dxdy\).
Hints
  • Use the completed-square version of the pdf
  • \(\textrm{E}(X)=\int xf_{X}(x)dx\) (but you need to spot where this is useful).
  • If \(\textrm{E}(X)=0\), then \(\textrm{Var}(X)=\int x^2f_{X}(x)dx\)
Answer Top of p572. Where the authors ask “why?”, the answer is “because \(\textrm{E}(X)=\int xf_{X}(x)dx\)”.

Now we have the pdf for a bivariate normal distribution where the marginal distributions are standard normal distributions. We can generalise to any normal distributions by replacing \(x\) with \(\frac{x-\mu_x}{\sigma_x}\) (and \(y\) with \(\frac{y-\mu_y}{\sigma_y}\)). This gives us

\[ f_{X,Y}(x,y) = \frac{1}{2\pi \sigma_x\sigma_y\sqrt{1-\rho^2}}\exp \left\{ -\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_x)^2}{\sigma_x^2} - 2\rho \frac{(x-\mu_x)(y-\mu_y)}{\sigma_x \sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2} \right] \right\} \]

The extra \(\sigma_x \sigma_y\) in the denominator appears from the integration: in the calculation for \(K\), the two Gaussian integrals now have \(c/\sigma_x^2\) and \(c/\sigma_y^2\) in place of \(c\).
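
As a sanity check on the general form (a sketch with arbitrary parameter values), the pdf above agrees with scipy's built-in multivariate normal density:

    import numpy as np
    from scipy.stats import multivariate_normal

    mu_x, mu_y, sd_x, sd_y, rho = 1.0, -2.0, 2.0, 0.5, 0.3   # arbitrary parameters

    def bvn_pdf(x, y):
        q = ((x - mu_x)**2 / sd_x**2
             - 2 * rho * (x - mu_x) * (y - mu_y) / (sd_x * sd_y)
             + (y - mu_y)**2 / sd_y**2)
        return np.exp(-q / (2 * (1 - rho**2))) / (2 * np.pi * sd_x * sd_y * np.sqrt(1 - rho**2))

    cov = [[sd_x**2, rho * sd_x * sd_y], [rho * sd_x * sd_y, sd_y**2]]
    mvn = multivariate_normal(mean=[mu_x, mu_y], cov=cov)

    print(bvn_pdf(0.5, -1.0), mvn.pdf([0.5, -1.0]))   # should be equal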

Some Properties of the Bivariate Normal

If we know one of \(X\) or \(Y\), what does this say about the other? In essence, what is the distribution of \(Y\) given \(X=x\)?

Show that \(Y|X=x \sim N\left(\mu_Y + \frac{\rho\sigma_Y}{\sigma_X}(x-\mu_X), (1-\rho^2) \sigma^2_Y\right)\)

Hint: start by assuming \(X\) and \(Y\) are standard normal distributions.

Hint Use the completed-the-square formula, and then recognise the distribution
Answer Theorem 11.5.1.
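
One way to convince yourself of the result before proving it (a simulation sketch, with arbitrary parameters): draw from a bivariate normal, keep the pairs whose \(x\) values land close to a chosen value, and compare the mean and variance of the corresponding \(y\) values with the theoretical conditional mean and variance.

    import numpy as np

    mu_x, mu_y, sd_x, sd_y, rho = 0.0, 10.0, 1.0, 2.0, 0.7   # arbitrary parameters
    x0 = 1.0   # condition on X being (approximately) this value

    rng = np.random.default_rng(3)
    cov = [[sd_x**2, rho * sd_x * sd_y], [rho * sd_x * sd_y, sd_y**2]]
    xy = rng.multivariate_normal([mu_x, mu_y], cov, size=500_000)

    y_near = xy[np.abs(xy[:, 0] - x0) < 0.05, 1]   # y values whose x is roughly x0

    print(y_near.mean(), mu_y + rho * sd_y / sd_x * (x0 - mu_x))   # conditional mean
    print(y_near.var(),  (1 - rho**2) * sd_y**2)                   # conditional variance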

Parameter Estimation

The maximum likelihood estimates of the parameters are the sample means and variances, and the sample correlation. Showing this is a bit messy (the correlation appears to involve a cubic equation), so go ahead if you want, but I will make sure this is not on the exam. Especially as I have not worked these out yet either.

Show that the m.l.e. for \(\mu_X\) is \(\bar{x}=\frac{1}{n}\sum_{i=1}^nx_i\) and \(\mu_Y\) is \(\bar{y}=\frac{1}{n}\sum_{i=1}^ny_i\)

Hint The way I did it involves a bit of algebra. Get rid of the terms that are constant with respect to \(\mu_X\), and differentiate w.r.t. \(\mu_X\).
Answer

This is the log-likelihood: we want to differentiate with respect to \(\mu_y\), so we clean it up a bit: \[ \begin{align} L &= \sum_{i=1}^n \log f_{X,Y}(x_i,y_i) = \sum_{i=1}^n \left( -\log(2\pi \sigma_x\sigma_y\sqrt{1-\rho^2}) -\frac{1}{2(1-\rho^2)}\left[\frac{(x_i-\mu_x)^2}{\sigma_x^2} - 2\rho \frac{(x_i-\mu_x)(y_i-\mu_y)}{\sigma_x \sigma_y} + \frac{(y_i-\mu_y)^2}{\sigma_y^2} \right]\right) \\ &= C_1 -\left[ - \sum_{i=1}^{n}(y_i-\mu_y) \frac{2\rho(x_i-\mu_x)}{2(1-\rho^2)\sigma_x \sigma_y} + \sum_{i=1}^{n}\frac{(y_i-\mu_y)^2}{2(1-\rho^2)\sigma_y^2} \right] \\ &= C_1 + \sum_{i=1}^{n}(y_i-\mu_y) \frac{2\rho(x_i-\mu_x)}{2(1-\rho^2)\sigma_x \sigma_y} - \sum_{i=1}^{n}\frac{y_i^2-2y_i\mu_y + \mu_y^2}{2(1-\rho^2)\sigma_y^2} \\ &= C_2 -\mu_y \frac{2\rho\sum_{i=1}^{n}(x_i-\mu_x)}{2(1-\rho^2)\sigma_x \sigma_y} +\frac{2\mu_y\sum_{i=1}^{n}y_i - n\mu_y^2}{2(1-\rho^2)\sigma_y^2} \\ &= C_2 -\mu_y \frac{2\rho n(\bar{x}-\mu_x)}{2(1-\rho^2)\sigma_x \sigma_y} +\frac{2\mu_y n\bar{y} - n\mu_y^2}{2(1-\rho^2)\sigma_y^2} \\ &= C_2 -\frac{n}{(1-\rho^2)\sigma_y}\left( \mu_y \frac{\rho (\bar{x}-\mu_x)}{ \sigma_x} -\frac{\mu_y \bar{y} - \mu_y^2/2}{\sigma_y}\right) \\ \end{align} \]

Where \(C_1=-n\log(2\pi \sigma_x\sigma_y\sqrt{1-\rho^2}) -\frac{\sum_{i=1}^{n}(x_i-\mu_x)^2}{2(1-\rho^2)\sigma_x^2}\), and \(C_2=C_1 +\frac{2\rho\sum_{i=1}^{n}y_i(x_i-\mu_x)}{2(1-\rho^2)\sigma_x \sigma_y}-\frac{\sum_{i=1}^{n}y_i^2}{2(1-\rho^2)\sigma_y^2}\).

Differentiating with respect to \(\mu_y\) and setting to 0:

\[ \begin{align} \frac{\partial L}{\partial \mu_y} &=\frac{n}{(1-\rho^2)\sigma_y}\left(\frac{\bar{y} - \mu_y}{\sigma_y} - \frac{\rho (\bar{x}-\mu_x)}{\sigma_x}\right) =0 \\ 0 &=\sigma_y\left(\frac{\bar{y} - \mu_y}{\sigma_y} - \frac{\rho (\bar{x}-\mu_x)}{\sigma_x}\right) \\ 0 &= \bar{y} - \mu_y - \frac{\sigma_y\rho (\bar{x}-\mu_x)}{\sigma_x} \\ \mu_y &= \bar{y} - \frac{\sigma_y\rho (\bar{x}-\mu_x)}{\sigma_x} \end{align} \] By symmetry,

\[ \mu_x = \bar{x} - \frac{\sigma_x\rho (\bar{y}-\mu_y)}{\sigma_y} \]

By inspection, we can solve these by setting \(\mu_x=\bar{x}\) and \(\mu_y=\bar{y}\)

Show that the m.l.e. for \(\sigma^2_X\) is \(s^2_x=\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2\) and \(\sigma^2_y\) is \(s^2_y=\frac{1}{n}\sum_{i=1}^n(y_i-\bar{y})^2\).

Answer not available Sorry.

Show that the m.l.e. for \(\rho\) is the sample correlation, \(r_{xy}=\frac{\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{s_x s_y}\)

Answer not available
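
Even without the algebra, the claims can be checked numerically (a sketch with simulated data; the optimiser and the log/arctanh parameterisation are my own choices, not part of the course): maximise the bivariate normal log-likelihood and compare with the sample means, the \(1/n\) variances, and the sample correlation.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(7)
    xy = rng.multivariate_normal([1, 2], [[1, 0.6], [0.6, 2]], size=200)
    x, y = xy[:, 0], xy[:, 1]
    n = len(x)

    def negloglik(theta):
        mx, my, log_sx, log_sy, z = theta
        sx, sy, rho = np.exp(log_sx), np.exp(log_sy), np.tanh(z)   # keeps sd > 0 and |rho| < 1
        q = ((x - mx)**2 / sx**2
             - 2 * rho * (x - mx) * (y - my) / (sx * sy)
             + (y - my)**2 / sy**2)
        return np.sum(np.log(2 * np.pi * sx * sy * np.sqrt(1 - rho**2)) + q / (2 * (1 - rho**2)))

    fit = minimize(negloglik, x0=[0, 0, 0, 0, 0])
    mx, my, log_sx, log_sy, z = fit.x
    print(mx, my, np.exp(2 * log_sx), np.exp(2 * log_sy), np.tanh(z))      # numerical m.l.e.s
    print(x.mean(), y.mean(), x.var(), y.var(), np.corrcoef(x, y)[0, 1])   # closed-form answers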

Inference

Now that we can calculate estimates of the correlation, we hit another problem: the sampling distribution is not simple. Thus we will list the results without going into the details.

Testing \(H_0: \rho=0\)

The nicest result here is that if \(\rho=0\) and the data come from a bivariate normal distribution,

\[ T_{n-2} =\frac{\sqrt{n-2}R}{\sqrt{1-R^2}} \]

has a \(t\) distribution with \(n-2\) degrees of freedom. The sampling distribution of \(r\) is known, but it is

\[ p(r) = \frac{(n-2)\,\Gamma(n-1)}{\sqrt{2\pi}\,\Gamma(n-\frac{1}{2})} \frac{(1-\rho^2)^{(n-1)/2}(1-r^2)^{(n-4)/2}}{(1-\rho r)^{n-3/2}} \; {}_{2}F_{1}\left(\frac{1}{2},\frac{1}{2};\frac{2n-1}{2};\frac{\rho r+1}{2} \right) \]

where \(\Gamma(z)\) is the usual Gamma function, and \({}_{2}F_{1}(a,b;c;d)\) is a hypergeometric function1. The reason for showing this is (a) to show that not everything is a \(t\) distribution, and (b) sometimes things are nasty. The traditional approach in these situations has been to find an approximation to the sampling statistic, or to a function of the sampling statistic. This has usually been a normal distribution, because as \(N \to \infty\) most things become normal (in essence, you can do a Taylor series expansion about the true value of a parameter, and the third and higher order terms disappear fairly quickly, as long as the sampling distribution is well behaved).
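
For the \(t\)-based test, a minimal sketch (with simulated data) computes the \(T_{n-2}\) statistic from above directly; scipy's pearsonr uses the same result, so the p-values should agree:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(11)
    xy = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=30)
    x, y = xy[:, 0], xy[:, 1]
    n = len(x)

    r = np.corrcoef(x, y)[0, 1]
    T = np.sqrt(n - 2) * r / np.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(T), df=n - 2)   # two-sided p-value for H0: rho = 0

    print(r, T, p)
    print(stats.pearsonr(x, y))            # same r and (essentially) the same p-value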

For the correlation, the approximation that is often used is due to Fisher. He suggested using an inverse hyperbolic tangent (arctanh) transformation:

\[ z = \frac{1}{2}\ln\frac{1+r}{1-r} \]

which is approximately normally distributed, with mean \(\frac{1}{2}\ln\frac{1+\rho}{1-\rho}\) and variance

\[ s^2_z = \frac{1}{n-3} \]

For Example 11.5.1, calculate the estimate of \(\rho\), \(r\), and a 95% confidence interval for \(\rho\)

Hint Remember the confidence interval is calculated for \(z\), not \(r\).
Answer

The book tells us that \(r=-0.453\). So

\[ z = \frac{1}{2}\ln\frac{1+r}{1-r} = \frac{1}{2}\ln\frac{1-0.453}{1+0.453} = \frac{1}{2}\ln 0.376 = -0.489 \]

The standard error is \(\sqrt{1/(20-3)}=\sqrt{0.0588}=0.243\), so the approximate 95% confidence interval for \(z\) is \(-0.489 \pm 1.96\times0.243 = -0.489 \pm 0.475 = (-0.964, -0.014)\). But this is for \(z\), not \(r\), so we need to back-transform:

\[ \begin{align} z &= \frac{1}{2}\ln\frac{1+r}{1-r} \\ e^{2z} &= \frac{1+r}{1-r} \\ (1-r)e^{2z} &= 1+r \\ e^{2z} - 1 &= r+re^{2z} \\ e^{2z} - 1 &= r(1+e^{2z}) \\ r &= \frac{e^{2z} - 1}{e^{2z} + 1} \end{align} \]

Now we plug in the limits of the confidence interval:

\[ \left(\frac{e^{-0.964\times2} - 1}{e^{-0.964\times2} + 1}, \frac{e^{-0.014\times2} - 1}{e^{-0.014\times2} + 1}\right) = (-0.75, -0.014) \]

So although the estimate is statistically significant, the confidence interval includes values of the correlation which range from “almost nothing” to “strong”.
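
A minimal sketch reproducing this calculation (\(r=-0.453\) and \(n=20\) are from Example 11.5.1; np.arctanh is exactly \(\frac{1}{2}\ln\frac{1+r}{1-r}\), and np.tanh is its inverse):

    import numpy as np
    from scipy.stats import norm

    r, n = -0.453, 20                       # from Example 11.5.1
    z = np.arctanh(r)                       # Fisher's z transformation
    se = np.sqrt(1 / (n - 3))
    lo, hi = z - norm.ppf(0.975) * se, z + norm.ppf(0.975) * se

    print(z, se)                            # roughly -0.489 and 0.243
    print(np.tanh(lo), np.tanh(hi))         # roughly (-0.75, -0.014)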


  1. No, me neither.↩︎