Regression Module

Instructions:

This document contains information, questions, R code, and plots.

Hints and reminders are in bold

Questions appear in blue.

# R code looks like this.
# Comments are in grey and explain the code

2+2 # add two and two
## [1] 4
# outputs from R appear like the above ^^^^
  • Part A = Why and when to use linear regression
  • Part B = Finding a best line (trial and error)
  • Part C = Finding a best line (maximum likelihood estimation)
  • Part D = Assumptions
  • Part E = Olympics data (running a regression)
  • Part F = Quantifying uncertainty
  • Part G = Interpreting
  • Part H = Reflection

An extra resource this week

As part of the development work we are doing to improve this course, we are creating some webpages that will support the information given in this course. They are not all complete, but the one for linear regression is nearly finished (links within the document are not finished yet).

You can find the page here.

Not all of the material will be useful and it goes a bit beyond what we will cover today. But some sections are useful so I will add links below when they are.

Useful sections:

  • Introduction
  • Which questions?
  • Type of data
  • Model details
  • Parameters
  • Quantify uncertainty
  • Draw conclusions

We are also interested in feedback on the webpage.

It would be really helpful for us if you could fill in this feedback form https://forms.gle/8yLgA1MViRMZKV1a7. Hopefully we will get a reference group sorted soon to provide more feedback to us.

Different R coding

If you look at the ‘Worked example’ part of the webpage, you will notice a different style of R coding used for plots and data manipulation. To learn more about this style (if you like the plots), try this tutorial: click here.


R skills


Begin with the intro lecture

https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=e0a8364f-0ee6-4fca-80da-acc900f859b0

and video 2

https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=de5c4738-3455-4577-a325-acc900f8662a

Part A: Why and when to use linear regression

What do you already know about regression? Discuss in your groups.

(No model answer for this, just what you can remember.)

Why do we use linear regression?

At this point we will introduce a new motivation for statistical modelling. In previous weeks we have focused on something called inference. This is when you want to say something about a population from a sample. But there can also be another motivation: prediction. We can use models to predict the future state of our variables.

Check out the introduction, which questions? and type of data sections here.

Great, I read the page, so when do we use linear regression?

Linear regression is used when you have a continuous numeric response variable (\(Y\)) AND continuous numeric explanatory variables (\(X\)). This means that for any value of \(X\), we can predict a value of \(Y\). Because a linear regression model has a component that is a continuous straight line, the variables going in also need to be continuous.

If our data are in categories (instead of continuous) e.g. blue and green, then we cannot predict a \(Y\) value for any value of \(X\) because we do not know what values lie between our categories (is it teal, turquoise, or yellow?). We also do not know if the same relationship would be present.

The straight line is also important for deciding when to use a linear regression. As with any model, we want our linear regression model to represent how our data were generated. Therefore, we would like it to take a similar shape to our data, i.e. we want our data to follow a straight line (roughly).

There are some other criteria we need to consider, these are introduced in Part D.

How is the description above different to the t-test last week? (think about what question you asked with the t-test)

Answer

The description above describes something continuous, whereas in the t-test one of our variables had groups (zombie or human). In the t-test, because we had groups, we were testing the difference in means between those groups. In regression, we look at the relationship between two variables: the change in Y for a change of 1 unit in X.


Based on the description above, which of these datasets do you think is suitable for a linear regression?

Is regression ok?

Answer

Answer explanation in the video. But panels 3 and 4 should be ok. Could maybe make panel 2 work but panel 1 is not suitable.


Once you have had a go, watch a summary in video 3

https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=1f07494d-ede2-4ed2-b246-acc900f86cb8


Part B: Finding a best line 1 (trial and error)

Today we have given you some data. These data are made up. x is the explanatory variable and y is the response variable. Both have a mean of 0; this is important. The data are created using the code below.

Create the data x and y as objects in R.

# First save your data as objects x and y
x <- -10:10
y <- c(-41.64, -11.04, -20.71, -3.89, -23.71, -18.42, -16.21, 
       -23.22, -16.47, 15.72, -7.43, 14.1, -6, -12.04, 7.53, 23.26, 
       28.44, 36.9, 5.45, 46.78, 22.58)

We now want to create a model that represents how these data were generated. We will use a regression model because we have numeric data and we think there is a relationship between our two variables. When we model using a regression, we want to mathematically define a straight line that best represents the data we have. So, ideally we want one that follows the same pattern as our data and explains as much variation in the data as possible.

The line is defined by two parameters: \(\alpha\) = the intercept, where the line crosses the Y axis and \(\beta\) the slope of the line (steepness/gradient). We can alter the position of the line using these two parameters. For our model we also need to estimate the variance in the residuals, but we will get to this later.

For more detail on the model details and parameters look at those sections on the webpage here.

You can also watch video 4 (note the correction to the residuals part)

https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=81460c19-df44-4c8b-94c0-acc900f8c96f

We have tried out one candidate model line below. We plot it using the abline() function. abline() plots straight lines and takes the arguments a = intercept and b = slope, so it is very convenient to use here.

# Then plot the data points
plot(x,y)

# Add a line with intercept = 0, slope = 1, colour = red
abline(a=0, b=1, col="red")

plot of candidate model line

Our line is too shallow; it does not follow the pattern of the data. But we want to check this mathematically too. In regression, we want to minimise the distances between the data points and the model line (the residuals). This might be familiar from your last statistics course. We can summarise the residuals by adding all of the distances up to give a single measure of distance between the model and the data. When we do this we must square the distances first, otherwise positive and negative distances cancel out (some points are above the line and some below). This is called the sum of squares.

To calculate the sum of squares you need to:

  1. predict a value of \(Y\) for each \(X\) using the linear equation and fill in your chosen values of \(\alpha\) and \(\beta\):

predicted_Y = \(\alpha\) + \(\beta\)*X

  2. calculate the differences between the predicted \(Y\) values and the real ones

  3. square the differences

  4. add them all up

You can do this in R by using it like a calculator. Just remember to save each step as an object. You should be able to do it for all \(X\) values at once.

If you want the code

CLICK HERE

## To find the sum of squares when alpha = 0 and beta = 1

E.y <- 0 + 1 * x # Calculate your expected y value for each x

# Calculate the difference between the expected y and the actual y (residuals)
residual <- y - E.y 

# Square all the residuals
SS <- residual^2

# Sum the squared residuals
sum(SS)


Otherwise have a go yourself! Or use this app: click here.

Answer

Answer explanation in the video.

The best answer for the sum of squares is given by a slope (\(\beta\)) value of 3 which gives a sum of squares of 3423.242.

If you plot the sum of squares against each slope value, you should get a U-shaped plot. The lowest point will be at a slope of 3.
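If you want to check this yourself, here is one possible way to do it in R (a small sketch, not the only way), reusing the x and y objects created above:

# Compute the sum of squares for a range of candidate slope values,
# keeping the intercept fixed at 0
slopes <- seq(0, 6, by = 0.25)
SS_all <- sapply(slopes, function(b) sum((y - (0 + b * x))^2))

# Plot sum of squares against slope: this should give the U shape,
# with the lowest point at a slope of about 3
plot(slopes, SS_all, type = "b",
     xlab = "Slope (beta)", ylab = "Sum of squares")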


Once you have finished part B watch the summary in video 5

https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=3045c9ec-89bf-4f3f-b908-acc900f8854c


Part C: Finding a best line 2 (maximum likelihood estimation)

In the same way as we did in previous weeks, we have chosen a model (linear regression) and this can be represented as an equation:

\[ y_i = \alpha + \beta x_i + \varepsilon_i \]

To fit the model to our data, we need to find estimates of the parameters. In Part B, you did this by trial and error, but that takes a lot of time and will give different answers for each group. So, instead, we use maximum likelihood estimation to find the values of the parameters that make our data most likely. This should give the same answer for every group.

The parameters are:

\(\alpha\) = the intercept of the line

\(\beta\) = the slope of the line

\(\sigma^2\) = the variance of the residuals

The first step for this would be to write out the likelihood equation for this model.

I want to see the maths

The log-likelihood is

\[ l(\mathbf{y}|\mathbf{x}, \alpha, \beta, \sigma^2) = - \frac{n}{2} \log{\sigma^2} - \sum_{i=1}^n {\frac{{(y_i - \alpha - \beta x_i)^2}}{2\sigma^2} } \]

The second part of this is (minus) the sum of squared residuals divided by \(2\sigma^2\), so maximising the likelihood over \(\alpha\) and \(\beta\) is the same as minimising the sum of squares, i.e. the least squares estimates (which you did in Part B).
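If you prefer to see this numerically rather than algebraically, the small sketch below (not part of the original exercise) evaluates the log-likelihood for a range of slope values using the x and y from Part B, with the intercept fixed at 0 and \(\sigma\) fixed at an arbitrary value. The slope with the highest log-likelihood is the same as the slope with the lowest sum of squares.

# Candidate slope values
slopes <- seq(0, 6, by = 0.1)

# Log-likelihood for each slope (intercept fixed at 0, sigma fixed at an arbitrary 15)
loglik <- sapply(slopes, function(b) sum(dnorm(y, mean = 0 + b * x, sd = 15, log = TRUE)))

# Sum of squares for each slope (as in Part B)
SS_all <- sapply(slopes, function(b) sum((y - (0 + b * x))^2))

slopes[which.max(loglik)] # slope with the highest log-likelihood
slopes[which.min(SS_all)] # slope with the lowest sum of squares: the same value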

Once you have this you can write out equations to get the maximum likelihood estimator for each parameter. In reality, while \(\sigma^2\) needs to be estimated as part of the model, we do not really use it in interpretation.

I want to see the equations

\[ \begin{aligned} \hat{\alpha} &= \bar{y} - \hat{\beta} \bar{x} \\ \end{aligned} \]

\[ \begin{aligned} \hat{\beta} &= \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) }{\sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2} \\ \end{aligned} \]

\[ \begin{aligned} \hat{\sigma}^2 &= \frac{1}{N} \sum_{i=1}^{N} \left(y_i - (\hat{\alpha} + \hat{\beta}x_i) \right)^2 \\ \end{aligned} \]

In words:

\(\hat{\alpha}\) = the mean of \(y\) minus the mean of \(x\) multiplied by \(\hat{\beta}\)

\(\hat{\beta}\) = the covariance of \(x\) and \(y\) divided by the variance of \(x\)

\(\hat{\sigma^2}\) = the variance of the residuals from your model

You can test these out by calculating the estimates using these formulas and comparing them to the output from R below.
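For example, a small sketch (one way to do the check, using the x and y from Part B):

# beta hat: covariance of x and y divided by the variance of x
beta.hat <- cov(x, y) / var(x)

# alpha hat: mean of y minus beta hat times the mean of x
alpha.hat <- mean(y) - beta.hat * mean(x)

alpha.hat
beta.hat

# ML estimate of the residual variance (note it divides by N, not N - 2)
sigma2.hat <- mean((y - (alpha.hat + beta.hat * x))^2)
sigma2.hat

# alpha.hat and beta.hat should match the intercept and slope from coef(lm(y ~ x)) below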

Otherwise keep reading to see how to do this in R.

There is also a colour coded summary in video 6

https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=d7aa92a9-7f57-4659-aaa2-acc900f88bc2

Luckily for us, R can calculate the equivalent of maximum likelihood estimates for us. To do this we use the function lm(), it stands for linear model (should seem familiar). We began to use this last week. It takes the argument of a formula in form: y~x. The function will fit the regression model and give us the maximum likelihood estimates of \(\alpha\) and \(\beta\) as an output.

Use lm() to fit the actual regression line to the data from Part B.

Remember to save the output as an object.

Code hint.

ModelObject <- lm(y~x)


Answer.

ModelObject <- lm(y~x)


Now you have a model object you can use a new function called coef() to look at the estimates of \(\alpha\) and \(\beta\). The argument you need to put in is the object you saved above.

Code hint.

coef(YourModelObject)

Answer.

coef(ModelObject)
##  (Intercept)            x 
## -0.000952381  3.021441558


How do these compare to your estimates for the slope in Part B?

Answer

Will depend on what you got as your best answer in Part B, but should be close.


Think about what these estimates mean in relation to the aim of regression i.e. to estimate a relationship between two variables. Which bit tells you about a relationship?

Answer

The beta value tells you about the relationship.



Part D: Assumptions

The linear regression model has several criteria that need to be met in order to use it appropriately and for the statistics to behave as we expect. We call these assumptions. Assumptions are part of most statistical models and they are called this because the maths behind these models assumes these things are true. If they are not, the model can behave strangely and the results are invalid.

Assumptions for linear regression:

The key assumptions are that the relationship between \(X\) and \(Y\) is linear, that the residuals are independent of each other, that they have constant variance, and that they are normally distributed. The regression line must also go through the point given by the mean of \(X\) and the mean of \(Y\).

We will cover these more next week, including how to see if they are met, and what to do if they are not.


Part E: Olympics data

In this part of the module we will give you some real data to model using a linear regression. These data are the official fastest times for men and women for the 100m sprint (running) from 1948 to 2004.

The data can be found at: https://www.math.ntnu.no/emner/ST2304/2019v/Week5/Times.csv

It is a .csv file with headers.

Import the data and save as an object.

Code hint.

Times <- read.csv('https://www.math.ntnu.no/emner/ST2304/2019v/Week5/Times.csv', header=T)


The question we want to ask here is:

How do the times of men/women change with year? Times are the response and Year the explanatory variable.

In Part C, you used the lm() function and the formula argument. But lm() can actually take more arguments. We will need to use one more argument in this part.

Arguments of lm(): lm(formula, data)

Y is the response variable and X is the explanatory variable.

Example: lm(columnNAME1 ~ columnNAME2, data = YourDataFile)

When using the data argument, you need to use the column names for the variables you want to include.

Fit the regression for looking at the relationship between year and time using lm(), make sure to assign the output as an object. Choose either Men’s or Women’s times.

We recommend some of your group choose Men’s times and some Women’s so you can share the results later.

Code hint

# You will need to look at the column names to find the X and Y variables.
# To see them, you can use:
head(YourData)
str(YourData)
colnames(YourData)

#Remember to use ?FunctionName for help on arguments


Answer

OlympicModelW <- lm(WomenTimes ~ Year, data  = Times)

OlympicModelM <- lm(MenTimes ~ Year, data  = Times)

#Remember to use ?FunctionName for help on arguments


Look at the results using coef() like above.

Answer

coef(OlympicModelW)

coef(OlympicModelM)


Have a go at plotting your data and add the regression line using abline()

Make sure to label your axes (arguments = xlab and ylab) and choose the colour of your line.

Remember the data argument does not work in plot(). You need to use YourDataFile$columnNAME1

Code hint

plot(x=Times$Year, y=Times$WomenTimes, 
     ylab="Times (s)", xlab="Year")

abline(a = YourIntercept, b = YourSlope, 
       col = "hotpink") 

Answer.

par(mfrow=c(1,2)) # make two plots in the plot window

plot(x=Times$Year, y=Times$WomenTimes, 
     ylab="Times (s)", xlab="Year")

abline(a = coef(OlympicModelW)[1], 
       b = coef(OlympicModelW)[2], 
       col = "purple") 

plot(x=Times$Year, y=Times$MenTimes, 
     ylab="Times (s)", xlab="Year")

abline(a = coef(OlympicModelM)[1], 
       b = coef(OlympicModelM)[2], 
       col = "green") 

plot of the Women's and Men's times against Year, each with the fitted regression line



Part F: Quantifying uncertainty

As you know from previous weeks, any estimates we make for parameters are not certain. There are several possible values for any parameter that could be true but we cannot know which is true. We can try and quantify how uncertain we are in our estimate and which values might be more plausible.

For regression, it is the same process as we have come across before. Using the likelihood, we can get a distribution for the estimator of each parameter. This distribution has our maximum likelihood estimate in the centre but some variation around it. However, now that we have several parameters to estimate, it gets a bit more complex.

The graph below shows the likelihood for \(\alpha\) and \(\beta\) plotted against each other. Because there are two parameters, the likelihood forms a surface rather than a single curve. Higher values are shown in blue. You should notice there is a correlation between the two parameters: they are not independent.

As a regression line must go through the mean of X and the mean of Y (\((\bar{x}, \bar{y})\)), this fixes the line at that point. If you increase the intercept (\(\alpha\)) you would therefore need to also decrease the slope (\(\beta\)).

plot of the joint likelihood surface for \(\alpha\) and \(\beta\), showing the correlation between the two parameters
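If you are curious how such a surface can be made, here is a rough sketch (not the original plotting code): it computes the sum of squares for a grid of intercept and slope values for the women's times, and draws it with contour(). Lower sum of squares means higher likelihood, and the diagonal shape of the contours shows the correlation between the two parameters.

# Grid of candidate intercepts and slopes (ranges chosen to roughly cover
# the confidence intervals shown below)
alphas <- seq(25, 60, length.out = 100)
betas <- seq(-0.025, -0.007, length.out = 100)

# Sum of squares for every combination of intercept and slope
SS.surface <- outer(alphas, betas, Vectorize(function(a, b)
  sum((Times$WomenTimes - (a + b * Times$Year))^2)))

# Contour plot of the surface: the diagonal ridge shows the correlation
# between intercept and slope
contour(alphas, betas, SS.surface,
        xlab = "Intercept (alpha)", ylab = "Slope (beta)")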

All of this makes it difficult to estimate the uncertainty by hand. But, it is very easy to calculate the confidence intervals from an lm().

For more theory on uncertainty in linear regression look at the Quantify Uncertainty section of the webpage here.

We use the function confint() e.g. confint(YourModelObject)

Calculate the confidence intervals for your model in Part E.

Answer

confint(OlympicModelW)
##                   2.5 %       97.5 %
## (Intercept) 29.18663838 55.192123521
## Year        -0.02231223 -0.009152059
confint(OlympicModelM)
##                   2.5 %       97.5 %
## (Intercept) 21.02003411 36.687965892
## Year        -0.01346441 -0.005535594


Based on what you know already about the different parameters, work out what the confidence intervals are telling you.

Answer

From previous sections, you should know that the slope parameter is the one we are most interested in to tell us about the relationship between \(X\) and \(Y\).

In this case, for both examples we can see the confidence interval for the slope (second row of the output, labelled ‘Year’) does not span 0. There is a clear direction of the effect of Year on Times, which is negative. So, times are decreasing over the years. The strength of this relationship is less certain: it could plausibly be as strong as about -0.022 (women) or -0.013 (men) seconds per year, or as weak as about -0.009 or -0.006 seconds per year.

For the intercept, the uncertainty is wider. The intercept tells us how fast Olympic runners would have been in Year 0, which is a strange thing to want to know.
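As a small side note (not part of the original exercise), one way to make the intercept easier to interpret is to shift the explanatory variable, for example counting years since 1948 so the intercept becomes the predicted time in 1948. The object name below is just an illustration.

# Sketch: shift Year so that 1948 becomes 0; the intercept is then the
# predicted time in 1948 instead of in "Year 0"
OlympicModelW.shifted <- lm(WomenTimes ~ I(Year - 1948), data = Times)

coef(OlympicModelW.shifted)    # slope unchanged, intercept now interpretable
confint(OlympicModelW.shifted) # the intercept interval should now be much narrower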


Now that you have the confidence intervals and have thought about them, it can be helpful to visualise them. To plot them takes a few steps. First is plotting the data and the regression line as you have done before.

For the next step, we need to introduce another function here (sorry!) predict().

predict() predicts new values of \(Y\) based on values of \(X\). Its main arguments are object (your saved model object), newdata (a data frame of new \(X\) values to predict for), and interval (the type of interval to return, e.g. 'confidence' or 'prediction').

There are others but we will not use them here. Today, we will only use the arguments object and interval. We want to get predictions based on our model and we want confidence intervals (we will look at prediction intervals later).

You know you can plot your regression line using abline()

Read the comments to see what the code does, DO NOT just copy and paste

To plot your confidence intervals you need these two bits of code:

# generate some predictions based on the confidence interval
predictions <- predict(YourModelObject, interval = 'confidence')

# look at the predictions
predictions
# There are three columns of results
# the first is the mean prediction (the actual regression line)
# the second is the lower end of the confidence interval
# the third is the upper end of the confidence interval

# This bit adds the lines to the plot
# [,2] = second column, [2,] would be second row

# REMEMBER to re-run your plot of the regression line

# plot the lower bound, column 2 in predictions
lines(YourXValues, predictions[,2], lty=2)

# plot the upper bound, column 3 in predictions
lines(YourXValues, predictions[,3], lty=2)

# lty=2 gives a dashed line

Answer

# generate some predictions based on the confidence interval
predictionsW <- predict(OlympicModelW, interval = 'confidence')

predictionsM <- predict(OlympicModelM, interval = 'confidence')

# look at the predictions
predictionsW
##         fit      lwr      upr
## 1  11.54317 11.32666 11.75968
## 2  11.48024 11.28563 11.67484
## 3  11.41731 11.24338 11.59124
## 4  11.35438 11.19941 11.50935
## 5  11.29145 11.15301 11.42990
## 6  11.22852 11.10321 11.35383
## 7  11.16560 11.04887 11.28232
## 8  11.10267 10.98895 11.21638
## 9  11.03974 10.92302 11.15646
## 10 10.97681 10.85150 11.10212
## 11 10.91388 10.77544 11.05232
## 12 10.85095 10.69598 11.00592
## 13 10.78802 10.61410 10.96195
## 14 10.72510 10.53049 10.91970
## 15 10.66217 10.44566 10.87868
predictionsM
##       fit       lwr       upr
## 1  10.348 10.217556 10.478444
## 2  10.310 10.192754 10.427246
## 3  10.272 10.167212 10.376788
## 4  10.234 10.140633 10.327367
## 5  10.196 10.112590 10.279410
## 6  10.158 10.082503 10.233497
## 7  10.120 10.049676 10.190324
## 8  10.082 10.013487 10.150513
## 9  10.044  9.973676 10.114324
## 10 10.006  9.930503 10.081497
## 11  9.968  9.884590 10.051410
## 12  9.930  9.836633 10.023367
## 13  9.892  9.787212  9.996788
## 14  9.854  9.736754  9.971246
## 15  9.816  9.685556  9.946444
# There are three columns of results
# the first is the mean prediction (the actual regression line)
# the second is the lower end of the confidence interval
# the third is the upper end of the confidence interval

# This bit adds the lines to the plot
# [,2] = second column, [2,] would be second row

plot(x=Times$Year, y=Times$WomenTimes, 
     ylab="Times (s)", xlab="Year")

abline(a = coef(OlympicModelW)[1], 
       b = coef(OlympicModelW)[2], 
       col = "purple") 

# plot the lower bound, column 2 in predictions
lines(Times$Year, predictionsW[,2], lty=2)

# plot the upper bound, column 3 in predictions
lines(Times$Year, predictionsW[,3], lty=2)

plot of the Women's times against Year with the regression line and dashed confidence interval bounds

# lty=2 gives a dashed line



Part G: Interpreting

You now have estimates of \(\alpha\) and \(\beta\) and the associated confidence intervals for your regression model.

Although we did plot a line, remember the drawing is not your model, the equations and numbers are!

But what do these mean in terms of the relationship between 100m times and year?

Have a go at picking out the key results from your model.

Answer

The key result is the slope value and the associated confidence intervals.

It tells you the strength of the relationship between x and y: how much y changes for each 1-unit change in x. The confidence intervals around this estimate tell you how confident you can be in the direction and strength of the relationship.

Remember to interpret it in the original units e.g. seconds per year.


Prepare a summary of one key result to tell another member of your group.

Then have a go at presenting your results.

Once you had a go look at a summary in video 7

https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=effad08c-ee1d-4e59-87f5-acc900f8c9b1

Prediction

Wait a minute…. this has all been about understanding, I thought we could also predict with regression?

Regression can also be used for prediction. We can predict within our dataset and outside.

Why might we want to predict values of Y? (discuss together)

Answer

To predict future values, or new values within the dataset.

This might be needed if you cannot do experiments e.g. under climate change.

Or if you didn’t measure a particular value e.g. if you want to know the time of a runner in a non-Olympic year. See if you can think of other examples.


In what ways could prediction be good or bad? (discuss together)

Answer

BAD: the same relationship may not hold in the future or for new x values you have never measured; assuming it does is a big assumption. A linear regression can also predict from minus infinity to plus infinity for both variables, which might not be realistic, e.g. runners cannot complete the race in less than 0 seconds.

GOOD: can give insight into likely outcomes in scenarios we have not experienced.


When we use a regression for prediction, the prediction has uncertainty. Uncertainty in the relationship is captured by confidence intervals; uncertainty in predictions is captured by prediction intervals. A prediction interval gives a range of values that can be considered likely to contain a future observation.

For prediction we need to convey both the uncertainty in the estimated relationship and the scatter around the line (error). So, prediction error includes the variance of the residuals too (\(\sigma^2\) is finally useful!).

Definition: While confidence intervals are used to represent uncertainty in parameter estimates and to relate them to an underlying population, prediction intervals are used to represent uncertainty in predictions. A prediction interval gives a plausible range for the next observation of the response (i.e. for a new value of \(Y\)). It is a type of confidence interval, but in regression it will be wider than the confidence interval of the line (which only includes uncertainty in our parameters \(\alpha\) and \(\beta\)). A 95% prediction interval tells you: if you were to collect a sample and run the analysis, then go out and collect a new observation of the response variable (\(Y\)) at a particular value of the explanatory variable (\(X\)), many, many times, AND each time draw a prediction interval, then 95% of the time the new observation would fall within the prediction interval.
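As a quick preview (a minimal sketch, assuming the model objects from Part E), you can ask predict() for a prediction interval by setting interval = 'prediction' and supplying new \(X\) values with newdata:

# Prediction interval for the women's time in a new year (2008, just outside the data)
predict(OlympicModelW, newdata = data.frame(Year = 2008),
        interval = "prediction")

# For comparison, the confidence interval for the same year is narrower
predict(OlympicModelW, newdata = data.frame(Year = 2008),
        interval = "confidence")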

Prediction uncertainty explained a bit more in video 8

https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=8ce31c0a-b243-4f49-bab3-acc900f896a7

You will practice prediction during the Exercise this week.


Part H: Reflection

This is getting ready for next week, when we look at how good our models are.

Some points to think about:

Answers will come for this part next week.

REMINDER It would be really helpful for us if you could fill in this feedback form https://forms.gle/8yLgA1MViRMZKV1a7. Hopefully we will get a reference group sorted soon to provide more feedback to us.