Excerise 8: Dinosaur park

SOLUTION

Instructions:

This document contains information, questions, R code, and plots.

Hints and reminders are bold

Questions appear in blue.

Answers are in red.

Needs to be completed and handed in by 24th March 23:59


Rationale:

This week we want you to make some decisions on when to do each analysis. Therefore, we have included more detailed instructions in separate HTMLs with code not here in the exercise.

Resources:

R this week:

All R parts explained in extra code htmls.

Might need to install bestglm package.


The challenge: How can we create the best dinosaur exhibit at a new zoo?

You are the board of directors of a new zoo opening in Norway. You want your zoo to be both exciting and educational, to teach visitors about all different kinds of plants and animals from throughout time. You have been very excited by new advances in cloning technology that you saw in Jurassic Park. You have set up a team of biologists to try and clone some dinosaurs to complete your “Ancient History” exhibit. They have been trialling different cloning techniques to try and work out the best protocol.

You have also set up another team to investigate public opinion of dinosaurs. It is very expensive to clone and to keep dinosaurs so you want to make sure that you are investing in the right ones.

Both teams have sent their results back you. Now you must analyse the data and decide on how to set up your dinosaur exhibit.

Your job is to find out how to use resources most efficiently to create an exciting dinosaur exhibit.

Dino


Part A: Setting aims

This part is a bit different to previous exercises. We realise that we have been giving you feedback and solutions without asking you what is most useful! This week we will take the chance to ask for your engagement with feedback and improvement. We will begin and end with questions related to your group's development and to ask what feedback would help you reach the level you want to get to.

A1. Go back through the exercises you have had corrected by the teaching assistants and look at the solutions (they are on the webpage). Based on looking at these, choose (and write down) two things you want to work on improving for this week's exercise.

Here you should write some aims e.g. explain confidence intervals better, do the R code without hints etc. They will be personal to you.

We've chosen our goals, now let us do the exercise!


Part B: Concepts

B1. Why do we have model selection in statistics?

We use model selection in statistics to try and find appropriate explanatory variables to include in our model. Once we have chosen the model that best explains how our data were generated e.g. a Poisson GLM or a linear regression with normal error, we then need to decide which explanatory variables to include to explain our response. Often in the real world, there are many different explanatory variables we could include, either with or without interactions. There is a balance between including everything and explaining a lot of variance and keeping a model simple. The more variables you add, the more parameters you estimate from the data which could lead to a model becoming over-fitted (where we have equal to or more variables than data). Therefore, we need to conduct model selection. All variables we add will increase the variance explained but not always because they actually influence the response (even random data will increase R squared - see lecture module). So, we need ways to determine whether a variable is worth adding or not. We do this in two ways, either by confirming a hypothesis or exploring which variables best explain the data.

B2. What are two of the different aims of model selection?

The two different aims are 1) hypothesis testing/confirmatory model selection i.e. does this extra variable or interaction effect have a statistically significant effect on our response (our Y variable)? 2) which variables, of many, best explain variance in our response (Y)/ exploratory model selection i.e. of these 3 variables we measure, which explain more variation than would be expected by chance and are worth including despite the extra parameters we need to estimate?

B3. How do you perform model selection for each of these? i.e. name the technique don't give all the details

1) Conduct confirmatory model selection using anova (for linear models) or analysis of deviance (generalised linear models - learn this in a week or so). 2) Conduct exploratory model selection using the AIC or BIC.


Part C: Which variables influence cloning success?

These data have been collected by your team of scientists in the cloning facility. They have been trying to clone several different species of dinosaur using fossils of different ages and different lab procedures. They have recorded the 'success' of the cloning as an index created from the number of viable embryos created, longevity of embryos, and the cost of the cloning method. The index has positive and negative values, positive indicating greater success from the investment. This is called SuccessIndex in the data and is the response variable.

The explanatory variables they collected are: Age this is age of the fossil being cloned in million years, Size this is the average adult body weight of the dinosaur species being cloned in metric tons, Herbivore this is an indicator of whether the species is a herbivore (TRUE) or a carnivore (FALSE).

Think about what kind of data (continuous or categorical) each of these are. It will help you with interpreting.

It is thought that some of these variables might explain the variation in cloning success index. But it is not yet known which.

The dataset for this questions can be found at https://www.math.ntnu.no/emner/ST2304/2020v/Week10/CloneData_2020v.csv

As always, the first step is to import the data and assign it to an object. You can use the whole web link above to import the data. It is a csv file with column names (header) included.

CloneData <- read.csv("https://www.math.ntnu.no/emner/ST2304/2020v/Week10/CloneData_2020v.csv", header=T)
pairs(CloneData)

plot of chunk unnamed-chunk-1

C1. Look at the question at the start of this section (in the title of Part C). Is this question confirmatory or exploratory? Why?

Exploratory because it is asking which variables, of many, and not a specific hypothesis about any of them.

Based on your answer to question C1, open the appropriate help HTML for this section.

C2. Conduct model selection for answering “which variables influence cloning success?” To answer this question include a bullet point list of the steps you take to do this. You can include a line or two of R code with each bullet point but you should not need a lot.

Correct steps will be in the R code for this exercise.

Should conduct exploratory model selection to find the variables that best explain variation in SuccessIndex. Should use the AIC or BIC, either is fine but should be justified. BIC has a higher penalty for extra parameters, so you should choose why this would be beneficial or not. Do you want to aim for a simple model? Or to explain most variance without extra penalty?

# Steps of the model selection

# create a null model with no variables
# then one for each explanatory variable alone
null <- lm(SuccessIndex ~ 1, data = CloneData)
model1 <- lm(SuccessIndex ~ Age, data = CloneData)
model2 <- lm(SuccessIndex ~ Size, data = CloneData)
model3 <- lm(SuccessIndex ~ Herbivore, data = CloneData)

AIC(null, model1, model2, model3) # find AIC
##        df      AIC
## null    2 294.6107
## model1  3 253.0163
## model2  3 230.8343
## model3  3 292.0531
BIC(null, model1, model2, model3) # find BIC
##        df      BIC
## null    2 299.8210
## model1  3 260.8318
## model2  3 238.6498
## model3  3 299.8686
library(bestglm)
## Loading required package: leaps
# create Xy to use in bestglm
NewData <- CloneData[ ,c(2:4, 1)] # find the best of all combinations

# now we can use bestglm()
bestglm(NewData, IC="AIC")
## AIC
## BICq equivalent for q in (0, 0.869174304607287)
## Best Model:
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -37.1383401 5.58392208  -6.650942 1.724241e-09
## Age           0.6179022 0.04294342  14.388754 8.421557e-26
## Size         -0.7215383 0.04292571 -16.809001 1.715255e-30
bestglm(NewData, IC="BIC")
## BIC
## BICq equivalent for q in (0, 0.869174304607287)
## Best Model:
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -37.1383401 5.58392208  -6.650942 1.724241e-09
## Age           0.6179022 0.04294342  14.388754 8.421557e-26
## Size         -0.7215383 0.04292571 -16.809001 1.715255e-30
# Should get a model with Age and Size

C3. Interpret the results from the model selection. Include reference to model selection and the final model you end up with. I.e. you should also mention what the effect any variables have

Model selection showed that the model with age and size, was the best supported. Both the BIC and AIC agreed on this model being the best balance of fit and complexity. This was shown by it having the lowest AIC and BIC values. Age has a positive relationship with cloning success, success increases 0.618 (standard error of 0.043) for every 1 million year increase in age. Size has a negative relationship, success decreases 0.722 (standard error of 0.043) for every 1 ton increase in size. Neither sets of confidence intervals cross 0. So, we have reasonable support for the direction of the effects. Even when we consider uncertainty, the direction is still clear.


Part D: Does the size of a dinosaur affect their popularity?

These data were collected from a large survey of the general public. The participants were asked to rate, on a continuous scale (0-100), how much they liked different dinosaur species. The species all differed in size. The board members (your team) think that visitors to your zoo will be more excited to see bigger dinosaurs because bigger dinosaurs are more popular.

The dataset has columns: PopularityScore the popularity score of the dinosaur species, Weight weight of the species in metric tons (a measure of size).

The data for this question can be found at https://www.math.ntnu.no/emner/ST2304/2020v/Week10/DinoData_2020v.csv

As always, the first step is to import the data and assign it to an object. You can use the whole web link above to import the data. It is a csv file with column names (header) included.

DinoData <- read.csv("https://www.math.ntnu.no/emner/ST2304/2020v/Week10/DinoData_2020v.csv", header=T)
pairs(DinoData)

plot of chunk unnamed-chunk-3

D1. Look at the question at the start of this section (in the title for Part D). Is this question confirmatory or exploratory? Why?

Confirmatory, it is a specific hypothesis about a relationship between size and popularity.

D2. Conduct model selection for answering “does the size (weight) of a dinosaur affect its popularity?” To answer this question include a bullet point list of the steps you take to do this. You can include a line or two of R code with each bullet point but you should not need a lot.

Correct steps will be in the R code below. Should conduct confirmatory model selection to test whether there is a relationship between popularity and weight.

# Begin by creating H0 model 
ModelH0 <- lm(PopularityScore ~ 1, data = DinoData)
# Then create H1 model 
ModelH1 <- lm(PopularityScore ~ Weight, data = DinoData)

# Compare with 
anova(ModelH0, ModelH1) 
## Analysis of Variance Table
## 
## Model 1: PopularityScore ~ 1
## Model 2: PopularityScore ~ Weight
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     99 510.43                              
## 2     98 484.88  1    25.545 5.1628 0.02526 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# - make sure they are in correct order
# The DF in the anova output should be 1 NOT -1

D3. Interpret the results from the model selection. Include reference to model selection and the final model you end up with. I.e. you should also mention what the effect of any variables are

The F-value from the anova ia 5.1628 with probability of getting this value or higher if H0 is true of 0.02526. This lower than the usual cut off of 0.05 so we can reject H0. We can conclude that Weight of species does have a relationship with popularity. Bigger dinosaurs are more popular by 0.5 points per metric ton. Confidence intervals do not span 0 (0.06 to 0.95), so we are unlikely to see this effect if H0 is true. R squared for this model is very low (0.05). Although statistically significant, weight does not explain much variation in PopularityScore.


Part E: Recommendation

E1. Based on all of your results, what would you recommend as a way to create an efficient and exciting exhibit?

Recommend bigger dinosaurs and older fossils. But bigger are also harder to clone - so needs a balance. Got conflicting results so could be hard to decide. There is no single correct answer but all answers should discuss the conflicting results. That big dinosaurs are more population but harder to clone. Some mention of the low R squared for popularity model should also be included. This indicates that there is more influencing popularity than just size. Also all dinosaurs are very popular, so maybe it is ok to have smaller ones. (These are just examples of points and justification).


Part F: Reflection and feedback

F1. Look back at your answer to A1. How well do you think you did at improving the areas you chose to work on? Give a sentence or two for each.

F2. What would you like feedback on this week?

This question (F2) might be hard to answer, but it will be really helpful if you do. It might be that you want feedback on how well you improved the areas you mentioned in A1. On the other hand, you might be very confident that your work on those parts was good and you might prefer feedback on what other areas you can improve in future. You may want feedback on interpretation, or concepts, or R code, or on everything. It is up to your group what will be most helpful! You will still get a solution to this exercise so you can see a model answer. Therefore, choose feedback that will help beyond the solution.

Feedback should be to help you improve, so let us know what will be most helpful.