Exercise 8: Model selection

Instructions:

Hints and reminders are italic

Questions appear in blue.

Answers are in red or as R code.


Resources:

This week we want you to make some decisions on when to do each analysis. Therefore, we have included the more detailed instructions and code in separate HTMLs, rather than here in the exercise.


The challenge: How can we create the best dinosaur exhibit at a new zoo?

You are the board of directors of a new zoo opening in Norway. You want your zoo to be both exciting and educational, to teach visitors about all different kinds of plants and animals from throughout time. You have been very excited by new advances in cloning technology that you saw in Jurassic Park. You have set up a team of biologists to try and clone some dinosaurs to complete your “Ancient History” exhibit. They have been trialling different cloning techniques to try and work out the best protocol.

You have also set up another team to investigate public opinion of dinosaurs. It is very expensive to clone and to keep dinosaurs so you want to make sure that you are investing in the right ones.

Both teams have sent their results back to you. Now you must analyse the data and decide how to set up your dinosaur exhibit.

Your job is to find out how to use resources most efficiently to create an exciting dinosaur exhibit.



General questions

1. Why do we have model selection in statistics?

We use model selection in statistics to find appropriate explanatory variables to include in our model. Once we have chosen the model type that best describes how our data were generated, e.g. a Poisson GLM or a linear regression with normal errors, we then need to decide which explanatory variables to include to explain our response. In the real world there are often many explanatory variables we could include, with or without interactions. There is a balance between including everything, and so explaining a lot of variance, and keeping the model simple. The more variables you add, the more parameters must be estimated from the data, which can lead to over-fitting (in the extreme, when we have as many or more parameters than data points). Therefore, we need to conduct model selection. Every variable we add will increase the variance explained, but not always because it actually influences the response (even random data will increase R squared - see lecture slide 7). So we need ways to determine whether a variable is worth adding or not. We do this in two ways: either by confirming a hypothesis or by exploring which variables best explain the data.
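A minimal sketch of this last point, using simulated data (all names here are illustrative, not part of the exercise data): adding a purely random explanatory variable still increases R squared, even though it has no effect on the response.

# Small simulation: adding a meaningless variable still raises R squared
set.seed(42)
n <- 50
x <- rnorm(n)                    # a real explanatory variable
y <- 2 + 0.5 * x + rnorm(n)      # response generated from x only
junk <- rnorm(n)                 # random noise, unrelated to y

summary(lm(y ~ x))$r.squared         # R squared with the real variable
summary(lm(y ~ x + junk))$r.squared  # slightly higher, even though junk is noise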

2. What are two of the different aims of model selection?

The two different aims are: 1) hypothesis testing/confirmatory model selection, i.e. does this extra variable or interaction have a statistically significant effect on our response (our Y variable)? 2) exploratory model selection, i.e. which of the many variables we measured best explain variance in our response (Y)? For example: of these 3 variables we measured, which explain more variation than would be expected by chance and are worth including despite the extra parameters we need to estimate?

3. How do you perform model selection for each of these? I.e. name the technique, don't give all the details.

1) Conduct confirmatory model selection using anova() (for LMs) or analysis of deviance (for GLMs). 2) Conduct exploratory model selection using AIC or BIC.
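Both techniques are used on real data later in this exercise, but only for LMs. As a rough sketch of what an analysis of deviance for a GLM and an information-criterion comparison look like in R, using simulated count data (the object names are illustrative only, not part of the exercise):

# Simulated count data just to illustrate the two techniques
set.seed(1)
SimData <- data.frame(Treatment = rep(c("A", "B"), each = 30))
SimData$Counts <- rpois(60, lambda = ifelse(SimData$Treatment == "A", 4, 6))

# 1) Confirmatory: analysis of deviance (likelihood ratio test) for GLMs
glm0 <- glm(Counts ~ 1, family = poisson, data = SimData)          # H0
glm1 <- glm(Counts ~ Treatment, family = poisson, data = SimData)  # H1
anova(glm0, glm1, test = "Chisq")

# 2) Exploratory: compare candidate models with an information criterion
AIC(glm0, glm1)
BIC(glm0, glm1)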


Which variables influence cloning success?

This data has been collected by your team of scientists in the cloning facility. They have been trying to clone several different species of dinosaur using fossils of different ages and different lab procedures. They have recorded the 'success' of the cloning as an index created from the number of viable embryos created, the longevity of the embryos, and the cost of the cloning method. The index takes positive and negative values, with positive values indicating greater success from the investment. This is called SuccessIndex in the data and is the response variable.

The explanatory variables they collected are: Age, the age of the fossil being cloned, in millions of years; Size, the average adult body weight of the dinosaur species being cloned, in metric tons; Herbivore, an indicator of whether the species is a herbivore (TRUE) or a carnivore (FALSE).

Think about what kind of data (continuous or categorical) each of these are. It will help you with interpreting.

It is thought that some of these variables might explain the variation in the cloning success index, but it is not yet known which.

The dataset for this question can be found at https://www.math.ntnu.no/emner/ST2304/2019v/Week10/CloneData.csv

As always, the first step is to import the data and assign it to an object then plot it. You can use the whole web link above to import the data. It is a csv file with column names (header) included. pairs() should work here

CloneData <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week10/CloneData.csv", header=T)
pairs(CloneData)

[Figure: pairs plot of CloneData]

4. Look at the question at the start of this section. Is this question confirmatory or exploratory? Why?

Exploratory, because it asks which of many variables explain the response, rather than stating a specific hypothesis about any one of them.

Based on your answer to question 4, open the appropriate help HTML for this section.

5. Conduct model selection for answering “which variables influence cloning success?” To answer this question include a bullet point list of the steps you take to do this. You can include a line or two of R code with each bullet point but you should not need a lot.

Correct steps are in the R code for this exercise.

Students should conduct exploratory model selection to find the variables that best explain variation in SuccessIndex, using AIC or BIC; either is fine but the choice should be justified (a sketch of the two penalty terms, computed by hand, follows the code below). BIC has a higher penalty for extra parameters, so students should explain why this would or would not be beneficial: do you want to aim for a simple model, or to explain the most variance without the extra penalty?

# Steps of the model selection

# create a null model with no variables
# then one for each explanatory variable alone
null <- lm(SuccessIndex ~ 1, data = CloneData)
model1 <- lm(SuccessIndex ~ Age, data = CloneData)
model2 <- lm(SuccessIndex ~ Size, data = CloneData)
model3 <- lm(SuccessIndex ~ Herbivore, data = CloneData)

AIC(null, model1, model2, model3) # find AIC
##        df      AIC
## null    2 310.6009
## model1  3 286.2750
## model2  3 233.2745
## model3  3 309.3978
BIC(null, model1, model2, model3) # find BIC
##        df      BIC
## null    2 315.8112
## model1  3 294.0905
## model2  3 241.0900
## model3  3 317.2133
library(bestglm)
## Loading required package: leaps
# create Xy to use in bestglm
NewData <- CloneData[ ,c(2:4, 1)] # reorder so the response (SuccessIndex) is last, as bestglm() expects

# now we can use bestglm()
bestglm(NewData, IC="AIC")
## AIC
## BICq equivalent for q in (2.66453525910038e-15, 0.907315757683734)
## Best Model:
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -22.4050184 6.99244363  -3.204176 1.833486e-03
## Age           0.5422586 0.05379267  10.080529 8.984456e-17
## Size         -0.8334241 0.05374822 -15.506077 5.251352e-28
bestglm(NewData, IC="BIC")
## BIC
## BICq equivalent for q in (2.66453525910038e-15, 0.907315757683734)
## Best Model:
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -22.4050184 6.99244363  -3.204176 1.833486e-03
## Age           0.5422586 0.05379267  10.080529 8.984456e-17
## Size         -0.8334241 0.05374822 -15.506077 5.251352e-28
# Should get a model with Age and Size
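To see where the AIC and BIC penalties in the tables above come from, here is a small optional check computing both criteria by hand for model2; the AIC() and BIC() calls above are all that is actually needed.

# AIC = -2*logLik + 2*k and BIC = -2*logLik + log(n)*k
# (k = number of estimated parameters, n = number of observations)
k <- attr(logLik(model2), "df")   # parameters in model2: intercept, slope, sigma
n <- nobs(model2)

-2 * as.numeric(logLik(model2)) + 2 * k       # should match AIC(model2)
-2 * as.numeric(logLik(model2)) + log(n) * k  # should match BIC(model2)
# BIC penalises each extra parameter by log(n) instead of 2, so for n above
# about 8 it favours simpler models than AIC does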

6. Interpret the results from the model selection. Include reference to the model selection and the final model you end up with, i.e. you should also mention what effect any variables have.

Both AIC and BIC select the model containing Age and Size (Herbivore is not included). Age has a positive relationship with cloning success: the success index increases by 0.54 for every 1 million year increase in fossil age. Size has a negative relationship: the success index decreases by 0.83 for every 1 metric ton increase in body weight. Neither confidence interval crosses 0, so we would be unlikely to see these results if nothing was going on.
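To back up the statement about the confidence intervals, the selected model can be refitted and confint() checked directly (a quick sketch; FinalModel is just an illustrative name, and the estimates should match the bestglm output above).

# Refit the selected model and check the confidence intervals
FinalModel <- lm(SuccessIndex ~ Age + Size, data = CloneData)
coef(FinalModel)     # estimates should match the bestglm output above
confint(FinalModel)  # neither interval should include 0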


Does the size of a dinosaur affect their popularity?

This data was collected from a large survey of the general public. The participants were asked to rate, on a continuous scale (0-100), how much they liked different dinosaur species. The species all differed in size. The board members (your team) think that visitors to your zoo will be more excited to see bigger dinosaurs because bigger dinosaurs are more popular.

The dataset has columns: PopularityScore, the popularity score of the dinosaur species; Weight, the weight of the species in metric tons (a measure of size).

The data for this question can be found at https://www.math.ntnu.no/emner/ST2304/2019v/Week10/DinoData.csv

As always, the first step is to import the data and assign it to an object then plot it. You can use the whole web link above to import the data. It is a csv file with column names (header) included. pairs() should work here

DinoData <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week10/DinoData.csv", header=T)
pairs(DinoData)

[Figure: pairs plot of DinoData]

7. Look at the question at the start of this section. Is this question confirmatory or exploratory? Why?

Confirmatory, because it tests a specific hypothesis about a relationship between size and popularity.

8. Conduct model selection for answering “does the size (weight) of a dinosaur affect its popularity?” To answer this question include a bullet point list of the steps you take to do this. You can include a line or two of R code with each bullet point but you should not need a lot.

Correct steps will be in the R code below. Should conduct confirmatory model selection to test whether there is a relationship between popularity and weight.

# Begin by creating the H0 (null) model
ModelH0 <- lm(PopularityScore ~ 1, data = DinoData)
# Then create the H1 model with Weight
ModelH1 <- lm(PopularityScore ~ Weight, data = DinoData)

# Compare the two with an F-test - make sure the models are in the correct
# order (H0 first), so the Df in the anova output is 1, NOT -1
anova(ModelH0, ModelH1)

9. Interpret the results from the model selection. Include reference to the model selection and the final model you end up with, i.e. you should also mention what effect any variables have.

The F value from the anova is 5.1742, with a probability of 0.0251 of getting this value or higher if H0 is true. This is lower than the usual cut-off of 0.05, so we can reject H0 and conclude that the weight of a species does have a relationship with popularity: bigger dinosaurs are more popular, by 0.5 points per metric ton. The confidence interval does not span 0, so we are unlikely to see this effect if H0 is true. However, the R squared for this model is very low (0.05): although statistically significant, Weight does not explain much variation in PopularityScore.
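The numbers referred to here can be read from the H1 model directly; a short sketch, assuming the models were fitted as ModelH0 and ModelH1 as in the code above.

# Estimates, confidence intervals and R squared for the H1 model
summary(ModelH1)$coefficients  # slope for Weight (about 0.5 per metric ton)
confint(ModelH1)               # the Weight interval should not span 0
summary(ModelH1)$r.squared     # low (about 0.05): Weight explains little variation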

Recommendation

10. Based on all of your results, what would you recommend as a way to create an efficient and exciting exhibit?

Recommend bigger dinosaurs and older fossils. But bigger dinosaurs are also harder to clone, so a balance is needed. The results conflict, which makes the decision hard. There is no single correct answer, but all answers should discuss the conflicting results: big dinosaurs are more popular but harder to clone. Some mention of the low R squared for the popularity model should also be included; this indicates that more than just size influences popularity. Also, all dinosaurs are very popular, so it may be fine to have smaller ones. (These are just examples of points and justification.)