Exercise Module 5: Regression

Instructions:

This module accompanies the lectures from week 5.

This document contains information, questions, R code, and plots.

Hints and reminders are in bold

Questions appear in blue.

# R code looks like this.
# Comments are in grey and explain the code

2+2 # add two and two

## [1] 4

# outputs from R appear like the above ^^^^

Part A = Why and when to use linear regression
Part B = Finding a best line (trial and error)
Part C = Finding a best line (maximum likelihood estimation)
Part D = Assumptions
Part E = Olympics data (running a regression)
Part F = Quantifying uncertainty
Part G = Interpretting
Part H = Reflection

R skills

Importing data
Running functions
Plotting
Using abline() to plot lines
Using lm() to estimate parameters of a linear regression
Using coef() to show the maximum likelihood estimate of parameters
Using confint() to get confidence intervals
Using the predict() function

Part A: Why and when to use linear regression

If you want a recap on variables and data types then go here: https://qmbio.math.ntnu.no/theory/all-about-data/

What do you already know about regression? Discuss in your groups.

Why do we use linear regression?

At this point we will introduce a new motivation for statistical modelling. In previous weeks we have focused on something called inference. This is when you want to say something about a population from a sample. But there can also be another motivation, this is prediction. We can use models to predict the futher state of our variables.

Example inference: do zombies run faster than humans?

Example prediction: how much faster will I be if I turn into a zombie?

For linear regression the aim is to predict values of a response variable from values of an explanatory variable. It can also quantify an association/relationship between two variables. In simple terms, we fit a straight line to: 1) estimate a relationship between $X$ and $Y$, 2) predict change in $Y$ from change in $X$. It does this using a straight line as a model.

Linear regression assumes a causal relationship between $X$ and $Y$, i.e. it assumes that $X$ does something to $Y$. However, we can never actually test this with this method, we can only quantify the patterns. To test if $X$ really does have a causal affect on $Y$, you would need experiments. Otherwise, you never know if the relationship is a coincidence e.g. https://www.tylervigen.com/spurious-correlations

Example questions you can answer with linear regression:

How does temperature change with time?
How does the amount of crime change with temperature?
How does relative plant biomass change with light intensity?
What will the temperature be in 2100?
How tall will someone be if they have hair 30cm long?

So, we know that regressions look at relationships and also prediction…

Great, so when do we use linear regression?

Linear regression is used when you have a continuous numeric response variable ($Y$) AND continuous numeric explanatory variables ($X$). This means that for any value of $X$, we can predict a value of $Y$. Because a linear regression model has a component that is a continuous straight line, the variables going in also need to be continuous.

If our data are in categories (instead of continuous) e.g. blue and green, then we cannot predict a $Y$ value for any value of $X$ because we do not know what values lie between our categories (is it teal, turquoise, or yellow?). We also do not know if the same relationship would be present.

The straight line is also important for deciding when to use a linear regression. As with any model, we want our linear regression model to represent how our data were generated. Therefore, we would like it to take a similar shape to our data, i.e. we want our data to follow a straight line (roughly).

There are some other criteria we need to consider, these are introduced in Part D.

How is the description above different to the t-test last week? (think about what question you asked with the t-test)

Based on the description above, which of these datasets do you think is suitable for a linear regression?

Regression

Part B: Finding a best line 1 (trial and error)

Today we have given you some data. This data is made up. x is the explanatory variable and y is the reponse variable. Both have a mean of 0, this is important. The data are created using the code below.

Create the data x and y as objects in R.

# First save your data as objects x and y
x <- -10:10
y <- c(-41.64, -11.04, -20.71, -3.89, -23.71, -18.42, -16.21, 
       -23.22, -16.47, 15.72, -7.43, 14.1, -6, -12.04, 7.53, 23.26, 
       28.44, 36.9, 5.45, 46.78, 22.58)

We now want to create a model that represents how these data were generated. We will use a regression model because we have numeric data and we think there is a relationship between our two variables. When we model using a regression, we want to mathematically define a straight line that best represents the data we have. So, ideally we want one that follows the same pattern as our data and explains as much variation in the data as possible.

The line is defined by two parameters: $\alpha$ = the intercept, where the line crosses the Y axis and $\beta$ the slope of the line (steepness/gradient). We can alter the position of the line using these two parameters. For our model we also need to estimate the variance in the residuals, but we will get to this later.

We have tried out one line below. We plot it using the abline() function. abline() plots straight lines and takes the arguments a = intercept and b = slope, so it is very convenient to use here.

# Then plot the data points
plot(x,y)

# Add a line with intercept = 0, slope = 1, colour = red
abline(a=0, b=1, col=2)

plot of chunk unnamed-chunk-3

Our line is too shallow, it does not follow the pattern of the data. But we want to check this mathematically too. In regression, we want to minimise the distances between the data points and the model line (residuals). This might be familiar from your last statistics course. We can summarise the residuals by adding all of the distances up to give a single measure of distance between the model and the data. When we do this we must square the distances first otherwise the sum will = 0 (some points are above the line and some below so they cancel). This is called the sum of squares.

To calculate the sum of squares you need to:

1) predict a value of $Y$ for each $X$ using the linear equation and fill in your chosen values of $\alpha$ and $\beta$:

predicted_Y = $\alpha$ + $\beta$*X

2) calculate the differences between the predicted $Y$ values and the real ones

3) square the differences

4) add them all up

You can do this in R by using it like a calculator. Just remember to save each step as an object. You should be able to do it for all $X$ values at once.

If you want the code

CLICK HERE

## To find the sum of squares when alpha = 0 and beta = 1

E.y <- 0 + 1*x # Calculate your expected y value for each x

# Calculate the difference between the expected y and the actual y (residuals)
residual <- y - E.y 

# Square all the residuals
SS <- residual^2

# Sum the squared residuals
sum(SS)

Otherwise have a go yourself!

Have a go at altering the slope of the line ($\beta$) to try and find a better line. You do not need to alter a ($\alpha$) because the line must go through the mean of x and the mean of y (more on this in Part D) which is an intercept of 0, so this is fixed. Try a few times until you are satisfied. (doesn't have to be the best, just one you are happy with.)
Calculate the sum of squares each time and keep note of it.
What is the best sum of squares you got? Write it on the board nearest to you.
What would it look like if you plotted the sum of squares against each slope value? i.e. sum of squares on the y-axis and the slope/$\beta$ values on the x-axis

Part C: Finding a best line 2 (maximum likelihood estimation)

In the same way as we did in previous weeks, we have chosen a model (linear regression) and this can be represented as an equation:

\[ y_i = \alpha + \beta x_i + \varepsilon_i \]

To fit the model to our data, we need to find estimates of the parameters. In Part B, you tried to do this by trying different things, but this takes a lot of time and will be different for each group. So, instead, we use maximum likelihood estimation to find the values of the parameters that make our data most likely. This should be the same for every group.

The parameters are:

$\alpha$ = the intercept of the line

$\beta$ = the slope of the line

$\sigma^2$ = the variance of the residuals

The first step for this would be to write out the likelihood equation for this model.

I want to see the maths

The log-likelihood is

\[ l(\mathbf{y}|\mathbf{x}, \alpha, \beta, \sigma^2) = - \frac{n}{2} \log{\sigma^2} - \sum_{i=1}^n {\frac{{(y_i - \alpha - \beta x_i)^2}}{2\sigma^2} } \]

This is quadratic in $y_i$, so this is the same minimising the sums of squares, i.e. the least squares estimate (that you did in Part B)

Once you have this you can write out equations to get the maximum likelihood estimator for each parameter. In reality, while $\sigma^2$ needs to be estimated as part of the model, we do not really use it in interpretation.

I want to see the equations

\[ \begin{aligned} \hat{\alpha} &= \bar{y} - \hat{\beta} \bar{x} \ \end{aligned} \]

\[ \begin{aligned} \hat{\beta} &= \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) }{\sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2} \ \end{aligned} \]

\[ \begin{aligned} \hat{\sigma^2} &= \frac{1}{N} \sum_{i=1}^{N} \left(y_i - (\hat{\alpha} + \hat{\beta}x_i \right))^2 \ \end{aligned} \]

In words:

$\hat{\alpha}$ = the mean of $y$ minus the mean of $x$ multiplied by $\hat{\beta}$

$\hat{\beta}$ = the covariance of $x$ multiplied by the covariance of $y$ divided by the variance of $x$

$\hat{\sigma^2}$ = the variance of the residuals from your model

You can test these out by calculating using these formulas then compare to the output from R below.

Otherwise keep reading to see how to do this in R.

Luckily for us, R can calculate the maximum likelihood estimates for us. To do this we use the function lm(), it stands for linear model (should seem familiar). We began to use this last week. It takes the argument of a formula in form: y~x. The function will fit the regression model using maximum likelihood estimation and give us the maximum likelihood estimates of $\alpha$ and $\beta$ as an output.

Use lm() to fit the actual regression line to the data from Part B.

Remember to save the output as an object.

Code hint

ModelObject <- lm(y~x)

Now you have a model object you can use a new function called coef() to look at the estimates of $\alpha$ and $\beta$. The arugment you need to put in is the object you saved above.

Code hint

coef(YourModelObject)

How do these compare to your estimates for the slope in Part B?

Think about what these estimates mean in relation to the aim of regression i.e. to estimate a relationship between two variables. Which bit tells you about a relationship?

Part D: Assumptions

The linear regression model has several criteria that need to be met in order to use it appropriately and for the statistics to behave as we expect. We call these assumptions. Assumptions are part of most statistical models and they are called this because the maths behind these models assumes these things are true. If they are not, the model can behave strangely and the results are invalid.

Assumptions for linear regression:

Normally distributed error (residuals)
Error has equal variance along line
Mean of the residuals = 0
The relationship is linear

The regression line must also go through the point that = the mean of $X$ and the mean of $Y$.

We will cover these more next week, including how to see if they are met, and what to do if they are not.

Part E: Olympics data

In this part of the module we will give you some real data to model using a linear regression. These data are the official fastest times for men and women for the 100m sprint (running) from 1948 to 2004.

The data can be found at: https://www.math.ntnu.no/emner/ST2304/2019v/Week5/Times.csv

It is a .csv file with headers.

Import the data and save as an object.

Code hint

Times <- read.csv('https://www.math.ntnu.no/emner/ST2304/2019v/Week5/Times.csv', header=T)

The question we want to ask here is:

How do the times of men/women change with year? Times are the response and Year the explanatory variable.

In Part C, you used the lm() function and the formula arugment. But lm() can actually take more arguments. We will need to use one more arugment in this part (this is similar to what we tried to do in plot() in week 1, but it did not work then. It should now!!).

Arguments of lm(): lm(formula, data)

formula = Y ~ X
data = your data

Y is the response variable and X is the explanatory variable.

Example: lm(columnNAME1 ~ columnNAME2, data = YourDataFile)

When using the data argument, you need to use the column names for the variables you want to include.

Fit the regression for looking at the relationship between year and time using lm(), make sure to assign the output as an object.

Code hint

# You will need to look at the column names to find the X and Y variables ??? 
# you can use:
head(YourData)
str(YourData)
colnames(YourData)

#Remember to use ?FunctionName for help on arguments

Look at the results using coef() like above.

Have a go at plotting your data and add the regression line using abline()

Make sure to label your axes (arguments = xlab and ylab) and choose the colour of your line.

Remember the data arugment does not work in plot. You need to use YourDataFile$columnNAME1

Code hint

plot(x=Times$Year, y=Times$WomenTimes, 
     ylab="Times (s)", xlab="Year")

abline(a = YourIntercept, b = YourSlope, 
       col = "hotpink")

Part F: Quantifying uncertainty

As you know from previous weeks, any estimates we make for parameters are not certain. There are several possible values for any parameter that could be true but we cannot know which is true. We can try and quantify how uncertain we are in our estimate and which values might be more plausible.

For regression, it is the same process as we have come across before. Using the likelihood, we can get a distribution for each parameter. This distribution has our maximum likelihood estimate in the centre but some variation around it. However, now that we have several parameters to estimate, it gets a bit more complex.

The graph below shows the likelihood curves for $\alpha$ and $\beta$ plotted against each other. As we have two curves, this makes a surface. You can see higher numbers in blue. You should notice there is a correlation between the two parameters. This is because they are not independent.

As a regression line must go through the mean of X and the mean of Y ($(\bar{x}, \bar{y})$), this fixes the line at that point. If you increase the intercept ($\alpha$) you would therefore need to also decrease the slope ($\beta$).

## NULL

plot of chunk unnamed-chunk-10

All of this makes it difficult to estimate the uncertainty by hand. But, it is very easy to calculate the confidence intervals from an lm().

We use the function confint() e.g. confint(YourModelObject)

Calculate the confidence intervals for your model in Part E.

Based on what you know already about the different parameters, work out what the confidence intervals are telling you.

Now that you have the confidence intervals and have thought about them, it can be helpful to visualise them. To plot them takes a few steps. First is plotting the data and the regression line as you have done before.

For the next step, we need to introduce another function here (sorry!) predict().

predict() predicts new values of $Y$ based on values of $X$. It takes several arguments:

object = your model object from lm()
newdata = a dataframe of new $X$ values to predict from
interval = can take values “confidence” or “prediction”

There are others but we will not use them here. Today, we will only use arguments object and interval. We want to get predictions based on our model and we want confidence intervals (we will look at prediction later).

You know you can plot your regression line using abline()

Read the comments to see what the code does, DO NOT just copy and paste

To plot your confidence intervals you need these two bits of code:

# generate some predictions based on the confidence interval
predictions <- predict(YourModelObject, interval = 'confidence')

# look at the predictions
predictions
# There are three columns of results
# the first is the mean prediction (the actual regression line)
# the second is the lower end of the confidence interval
# the third is the upper end of the confidence interval

# This bit add the lines to the plot
# [,2] = second column, [2,] would be second row

# plot the lower bound, column 2 in predictions
lines(YourXValues, predictions[,2], lty=2)

# plot the upper bound, column 3 in predictions
lines(YourXValues, predictions[,3], lty=2)

# lty=2 gives a dashed line

Part G: Interpretting

You now have estimates of $\alpha$ and $\beta$ and the associated confidence intervals for your regression model.

Although we did plot a line, remember the drawing is not your model, the equations and numbers are!

But what do these mean in terms of the relationship between 100m times and year?

Have a go at picking out the key results from your model.

Prepare a summary of one key result to tell another group.

PAUSE HERE.

Prediction

Wait a minute…. this has all been about understanding, I thought we could also predict with regression?

Regression can also be used for prediction. We can predict within our dataset and outside.

Why might we want to predict values of Y? (discuss together)

In what ways could prediction be good or bad? (discuss together)

When we use a regression for prediction it has uncertainty. Uncertainty in relationship is captured by confidence intervals. Prediction uncertainty captured by prediction intervals. A prediction interval gives a range of values that can be considered likely to contain a future observation.

For prediction we need to convey uncertainty in estimated relationship and scatter around the line (error). So, prediction error includes the variance of the residuals too ($\sigma^2$ is finally useful!)

Definition: While confidence intervals are used to represent uncertainty in parameters and to relate them to an underlying population, prediction intervals are used to represent uncertainty in predictions. It gives a plausible range for the next observation of the response (i.e. for the new value of Y). It is a type of confidence interval, but in regression it will be wider than the confidence interval of the line (which includes uncertainty in our parameters $\alpha$ and $\beta$). A 95% prediction interval tells you, if you were to collect a sample and run the analysis, then go out an collect a new observation of the response variable ($Y$) with particular value of the explanatory variable ($X$) many many times AND each time draw a prediction interval, 95% of the time, the new observation would fall in within the prediction interval.

Part H: Reflection

This is getting ready for next week, when we look at how good our models are.

Some points to think about:

How well do you think your line captures the data?
How could you work out how good the fit is? What would you want to look at?
What did you find out about the times?
Is there anything more might you want to know? (E.g. does temperature impact running speed?)