Q–Q plots

Statistics at NTNU

Øyvind Bakke, Department of Mathematical Sciences, NTNU

To be used in TMA4315 GLM, H2017 (Version 29.08.2017)

A Q–Q (quantile–quantile) plot provides a graphical check of whether data come from a particular distribution.

Assume that we have independent observations \(Y_i\),  \(i=1\), \(\ldots\), \(n\), from a continuous distribution having cdf \(F\).

We need some facts that you may already know:

  1. The \(F(Y_i)\) are uniformly distributed on the unit interval:

    It may seem a bit abstract to use \(Y_i\) as an argument of its own cdf, but let us nevertheless attempt to find the cdf of \(F(Y_i)\). Assume for simplicity that \(F\) is strictly increasing, so that the inverse \(F^{-1}\) exists. Let \(p\) be a number in the unit interval. Then \(F(Y_i)\leq p\) is equivalent to \(Y_i\leq F^{-1}(p)\), and \(P(F(Y_i)\leq p)=P(Y_i\leq F^{-1}(p))=F(F^{-1}(p))=p\).

    (Note that \(F^{-1}(p)\) is the number that \(Y_i\) does not exceed with probability \(p\), that is, the \(p\)-quantile – the quantile function of the distribution evaluated at \(p\).)

    We have shown that the cdf of \(F(Y_i)\) has value \(p\) at each \(p\) in the unit interval. This is exactly the cdf of the uniform distribution on the unit interval.

  2. The \(k\)th order statistic of a random sample of \(n\) uniformly distributed variables on the unit interval has expected value \(\frac{k}{n+1}\):

    Recall that the \(k\)th order statistic is the \(k\)th-smallest value of the sample. In general, its pdf is given by \(n\binom{n-1}{k-1}(F_X(x))^{k-1}(1-F_X(x))^{n-k}f_X(x)\), where \(f_X\) and \(F_X\) are the pdf and the cdf of the variables of the sample, respectively – see e.g. the notes on order statistics (in Norwegian) from the course TMA4245 Statistics or the Wikipedia entry on order statistics.

    In the case that the variables are uniformly distributed on the unit interval, \(f_X(x)=1\) and \(F_X(x)=x\) for \(x\) in the unit interval, and the pdf of the \(k\)th order statistic reduces to \(n\binom{n-1}{k-1}x^{k-1}(1-x)^{n-k}\), which you can check is the pdf of a beta distribution with parameters \(\alpha=k\) and \(\beta=n-k+1\). You may not know this distribution, but you can find it in the statistical tables used in TMA4245 or refer to Wikipedia. Its expected value is \(\frac\alpha{\alpha+\beta}=\frac{k}{n+1}\). This makes sense intuitively: Consider a random sample of \(n\) uniformly distributed variables on the unit interval. Then the expected values of the smallest, the 2nd smallest, the 3rd smallest and so on, are \(\frac1{n+1}\), \(\frac2{n+1}\), \(\frac3{n+1}\), all the way up to \(\frac n{n+1}\).
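Both facts are easy to check by simulation. The following is a minimal sketch and not part of the original argument; the choice of the exponential distribution for fact 1, the sample sizes and the seed are arbitrary assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Fact 1: F(Y_i) is uniform on the unit interval.
# Illustrated with exponential data (an arbitrary choice of F).
y <- rexp(10000)
u <- pexp(y)                          # F applied to its own observations
ks_p <- ks.test(u, "punif")$p.value   # compares u with the uniform cdf
hist(u)                               # should look roughly flat

# Fact 2: the k-th order statistic of n uniforms has mean k/(n+1).
n <- 5
sims <- replicate(20000, sort(runif(n)))  # n x 20000 matrix; row k holds the k-th order statistic
mean_os <- rowMeans(sims)
rbind(simulated = mean_os, theory = (1:n) / (n + 1))
```

The two rows of the final table should agree to about two decimals, and a large value of `ks_p` indicates no evidence against uniformity.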

Now, denote by \(Y_{(k)}\) the \(k\)th order statistic of the \(Y_i\). Then the \(k\)th order statistic of the \(F(Y_i)\) is \(F(Y_{(k)})\), since \(F\) is increasing. By (1) and (2) above, the expected value of \(F(Y_{(k)})\) is \(EF(Y_{(k)})=\frac k{n+1}\).

Thus, we would expect \(F(Y_{(k)})\approx\frac k{n+1}\), and, consequently, \(Y_{(k)}\approx F^{-1}\bigl(\frac k{n+1}\bigr)\). The right-hand side is often called the \(k\)th \((n+1)\)-quantile of the distribution of the \(Y_i\). The left-hand side may be called an empirical or a sample quantile, and the relation we have seen shows that the empirical quantiles serve as estimates of the quantiles of the distribution.

As a small example, consider a sample of \(5\) standard normally distributed variables. Let’s compare the order statistics with the \(6\)-quantiles of the standard normal distribution:

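set.seed(1) # fixes the seed so that the output below is reproducible
y<-rnorm(5) # the Y_i
y_ordered<-sort(y) # empirical quantiles (order statistics)
zq<-qnorm(1:5/6) # standard normal 6-quantiles
rbind(y_ordered,zq) # compare the sorted sample with the quantiles
##                 [,1]       [,2]      [,3]      [,4]      [,5]
## y_ordered -0.8356286 -0.6264538 0.1836433 0.3295078 1.5952808
## zq        -0.9674216 -0.4307273 0.0000000 0.4307273 0.9674216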

The Q–Q plot is a plot of the \(k\)th order statistic against the \(k\)th \((n+1)\)-quantile of \(F\),  \(k=1\), \(\ldots\), \(n\). If the data really come from the distribution having cdf \(F\), we expect the points to approximately follow the line having slope one and going through the origin. Let’s have a look at a larger example:

y<-rnorm(100)
zq<-qnorm(1:100/101)
plot(zq,sort(y))
abline(0,1,col="red") # adds line of slope 1 through origin

As expected, the relationship is approximately linear.
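Nothing in the construction is specific to the normal distribution: any quantile function \(F^{-1}\) can be used. As a sketch, the hypothetical helper `qq_points` below (a name introduced here for illustration, not a function from R or from these notes) returns the plot coordinates for an arbitrary quantile function, and is tried on exponential data against exponential quantiles.

```r
# Hypothetical helper: Q-Q plot coordinates for any quantile function qfun.
qq_points <- function(y, qfun = qnorm, ...) {
  n <- length(y)
  data.frame(theoretical = qfun((1:n) / (n + 1), ...),  # k-th (n+1)-quantiles of F
             sample = sort(y))                           # order statistics
}

set.seed(1)                          # arbitrary seed
pts <- qq_points(rexp(100), qexp)    # exponential data, exponential quantiles
plot(pts$theoretical, pts$sample)
abline(0, 1, col = "red")            # line of slope 1 through origin
```

Since the data here do come from the distribution used for the quantiles, the points should again lie close to the red line, with more scatter in the right tail.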

The idea of the Q–Q plot is to get an indication of whether the random sample really comes from a distribution having cdf \(F\). Consider instead a random sample where \(Y_i+1\) has the exponential distribution with expectation \(1\) (then still \(EY_i=0\) and \(\operatorname{Var}Y_i=1\)), but with \(F\) still being the standard normal cdf:

y<-rexp(100)-1
plot(zq,sort(y))
abline(0,1,col="red") # adds line of slope 1 through origin

We see no linear relationship.

What about a random sample from a non-standard normal distribution if \(F\) is the standard normal cdf, \(\Phi\)? Assume that the \(Y_i\) are normally distributed with mean \(\mu\) and standard deviation \(\sigma\). Then the \(\frac{Y_i-\mu}\sigma\) are standard normally distributed, and repeating the above argument with the standardized variables, we get \(\frac{Y_{(k)}-\mu}\sigma\approx\Phi^{-1}\bigl(\frac k{n+1}\bigr)\), or \(Y_{(k)}\approx\mu+\sigma\,\Phi^{-1}\bigl(\frac k{n+1}\bigr)\). The relation is still approximately linear, but with intercept \(\mu\) and slope \(\sigma\)! This means that we can use a Q–Q plot with \(F=\Phi\), called a normal Q–Q plot, to check for normality in general. The points should approximately follow a straight line.
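The intercept and slope can be checked numerically. The sketch below fits a least-squares line to the Q–Q points; this is only a convenient way of reading off an intercept and a slope, not the method used by the R plotting functions, and the values \(\mu=5\), \(\sigma=4\) and the seed are arbitrary assumptions.

```r
set.seed(1)                        # arbitrary seed
n <- 100
y <- rnorm(n, mean = 5, sd = 4)    # mu = 5, sigma = 4 (arbitrary choices)
zq <- qnorm((1:n) / (n + 1))       # standard normal (n+1)-quantiles
fit <- lm(sort(y) ~ zq)            # least-squares line through the Q-Q points
coef(fit)                          # intercept should be near mu, slope near sigma
```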

Let’s make normal Q–Q plots for a random sample of non-standard normally distributed variables and for a random sample of exponentially distributed variables:

y<-rnorm(100,5,4) # not standard normal
y2<-rexp(100) # exponential
plot(zq,sort(y))
abline(0,1,col="red") # adds line of slope 1 through origin
plot(zq,sort(y2))
abline(0,1,col="red") # adds line of slope 1 through origin

Indeed, the plot for normal data shows an approximately linear relationship, in contrast to the one for exponential data. But, as you can see, the red lines having slope one and going through the origin no longer help to guide our eyes.

Instead, a line going through the first and third quartiles is often drawn. We illustrate using the R functions qqnorm and qqline instead of coding it on our own:

qqnorm(y)
qqline(y,col="red")
qqnorm(y2)
qqline(y2,col="red")
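For comparison, the line through the quartiles can also be coded by hand. The sketch below assumes the default settings of qqline (standard normal quantiles at probabilities \(\frac14\) and \(\frac34\), and the default quantile type of R); the variable names are chosen here for illustration.

```r
set.seed(1)                 # arbitrary seed
y <- rnorm(100, 5, 4)

# Two points the line should pass through: theoretical vs. empirical quartiles
x_q <- qnorm(c(0.25, 0.75))
y_q <- quantile(y, c(0.25, 0.75), names = FALSE)  # default quantile type (7)

slope <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

qqnorm(y)
abline(intercept, slope, col = "red")   # should coincide with qqline(y)
```

Because the line is determined by the quartiles only, it is not influenced by a few extreme observations in the tails.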

In this course, the main use of Q–Q plots will be using the functions qqnorm and qqline to check data for normality.

Finally, some technical details about qqnorm and qqline: The \(x\) coordinates used by qqnorm are of the form \(\Phi^{-1}\bigl(\frac{k-1/2}n\bigr)\), or \(\Phi^{-1}\bigl(\frac{k-3/8}{n+1/4}\bigr)\) if \(n\leq10\), rather than \(\Phi^{-1}\bigl(\frac k{n+1}\bigr)\). We used the approximation \(Y_{(k)}\approx F^{-1}\bigl(\frac k{n+1}\bigr)\) for the Q–Q plots, but it can be shown that a slightly more accurate and more precisely formulated approximation is \(EY_{(k)}\approx\Phi^{-1}\bigl(\frac {k-1/2}n\bigr)\) in the normal case (note the expectation on the left-hand side).
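These probabilities are what the R function ppoints returns, so the two formulas can be verified directly. In the sketch below, the sample sizes 100 and 5 and the seed are arbitrary; each comparison should return TRUE.

```r
n_large <- 100
n_small <- 5

# For n > 10: (k - 1/2) / n
all.equal(ppoints(n_large), ((1:n_large) - 1/2) / n_large)
# For n <= 10: (k - 3/8) / (n + 1/4)
all.equal(ppoints(n_small), ((1:n_small) - 3/8) / (n_small + 1/4))

set.seed(1)                          # arbitrary seed
y <- rnorm(n_large)
qq <- qqnorm(y, plot.it = FALSE)     # returns the coordinates without plotting
all.equal(sort(qq$x), qnorm(ppoints(n_large)))
```

Note that qqnorm returns the \(x\) coordinates in the order of the original observations, which is why they are sorted before the comparison.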

For qqline, the line goes through two points having \(x\) coordinates \(\Phi^{-1}\bigl(\frac14\bigr)\) and \(\Phi^{-1}\bigl(\frac34\bigr)\). The \(y\) coordinates are the empirical first and third quartiles (empirical \(\frac14\)- and \(\frac34\)-quantiles) given by the data. Loosely, the first empirical quartile should be a number such that a fourth of the data are less than the number, and similarly for the third quartile. But \(n\) need not be divisible by four, and even when it is, for example if \(n=100\), any number between the 25th- and the 26th-smallest observations could qualify. There are several ways to define an empirical \(p\)-quantile. The one used by default by qqline is to determine \(k\) such that \(\frac{k-1}{n-1}<p\leq\frac k{n-1}\), and use linear interpolation at \(p\) between \(\bigl(\frac{k-1}{n-1},Y_{(k)}\bigr)\) and \(\bigl(\frac k{n-1},Y_{(k+1)}\bigr)\) as the \(p\)-quantile.
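The interpolation rule can be written out and compared with the default of the R function quantile. The helper `emp_quantile` below is hypothetical (a name introduced here for illustration), and the handling of \(p=0\) is an added assumption, since the rule above does not cover that case.

```r
# Hypothetical helper: empirical p-quantile by the interpolation rule described above.
emp_quantile <- function(y, p) {
  ys <- sort(y)
  n <- length(ys)
  k <- ceiling(p * (n - 1))      # k with (k-1)/(n-1) < p <= k/(n-1)
  if (k == 0) return(ys[1])      # assumption: p = 0 gives the smallest observation
  w <- p * (n - 1) - (k - 1)     # relative position of p in the interval, in (0, 1]
  (1 - w) * ys[k] + w * ys[k + 1]
}

set.seed(1)                      # arbitrary seed
y <- rnorm(100)
c(manual = emp_quantile(y, 0.25), builtin = quantile(y, 0.25, names = FALSE))
```

With \(n=100\) and \(p=\frac14\) we get \(k=25\), and the quantile is \(\frac14Y_{(25)}+\frac34Y_{(26)}\), a number between the 25th- and the 26th-smallest observations.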