Exercise 8: Model selection

Instructions:

Hints and reminders are italic

Questions appear in blue.

This exercise needs to be completed and handed in by 21st March.


Resources:

This week we want you to decide for yourselves when to use each analysis. We have therefore put the more detailed instructions, with code, in separate HTML help files rather than here in the exercise.


The challenge: How can we create the best dinosaur exhibit at a new zoo?

You are the board of directors of a new zoo opening in Norway. You want your zoo to be both exciting and educational, to teach visitors about all different kinds of plants and animals from throughout time. You have been very excited by new advances in cloning technology that you saw in Jurassic Park. You have set up a team of biologists to try and clone some dinosaurs to complete your “Ancient History” exhibit. They have been trialling different cloning techniques to try and work out the best protocol.

You have also set up another team to investigate public opinion of dinosaurs. It is very expensive to clone and keep dinosaurs, so you want to make sure that you are investing in the right ones.

Both teams have sent their results back to you. Now you must analyse the data and decide how to set up your dinosaur exhibit.

Your job is to find out how to use resources most efficiently to create an exciting dinosaur exhibit.



General questions

1. Why do we have model selection in statistics?

2. What are two of the different aims of model selection?

3. How do you perform model selection for each of these? (i.e. name the technique; you do not need to give all the details)


Which variables influence cloning success?

This data has been collected by your team of scientists in the cloning facility. They have been trying to clone several different species of dinosaur using fossils of different ages and different lab procedures. They have recorded the 'success' of the cloning as an index created from the number of viable embryos created, longevity of embryos, and the cost of the cloning method. The index has positive and negative values, positive indicating greater success from the investment. This is called SuccessIndex in the data and is the response variable.

The explanatory variables they collected are:

Age: the age of the fossil being cloned, in millions of years.

Size: the average adult body weight of the dinosaur species being cloned, in metric tons.

Herbivore: an indicator of whether the species is a herbivore (TRUE) or a carnivore (FALSE).

Think about what kind of data (continuous or categorical) each of these variables is. It will help you with interpreting the results.
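If you are unsure how R has stored a column, you can ask it directly. The sketch below uses a small made-up data frame (the object name toy and all of its values are assumptions for illustration, not the real cloning data) with the same column names as described above.

```r
# Toy data frame with the column names described above.
# The values are made up for illustration; they are NOT the real data.
toy <- data.frame(
  SuccessIndex = c(0.8, -1.2, 2.1, 0.3),
  Age          = c(66, 150, 72, 201),
  Size         = c(7.0, 0.5, 12.3, 2.2),
  Herbivore    = c(TRUE, FALSE, TRUE, FALSE)
)

# Storage class of each column
col_types <- sapply(toy, class)
print(col_types)

# Compact overview of the whole data frame
str(toy)
```

In lm(), numeric columns are treated as continuous covariates, while a logical column such as Herbivore is treated as a categorical variable with two levels.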

It is thought that some of these variables might explain the variation in the cloning success index, but it is not yet known which.

The dataset for this question can be found at https://www.math.ntnu.no/emner/ST2304/2019v/Week10/CloneData.csv

As always, the first step is to import the data, assign it to an object, and then plot it. You can use the whole web link above to import the data. It is a csv file with column names (header) included. pairs() should work for plotting here.
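A minimal sketch of the import and plot step is shown below. The helper name import_and_plot is hypothetical (it is not part of the course material); it only wraps read.csv() and pairs() so that the same two steps can be reused for any csv file with a header row.

```r
# Hypothetical helper: read a csv file that has a header row,
# draw a scatterplot matrix of all its columns, and return the data frame.
import_and_plot <- function(path_or_url) {
  dat <- read.csv(path_or_url, header = TRUE)
  pairs(dat)  # one panel for every pair of columns
  dat
}

# Usage with the course link (needs an internet connection):
# CloneData <- import_and_plot(
#   "https://www.math.ntnu.no/emner/ST2304/2019v/Week10/CloneData.csv"
# )
```

You do not need a function for this; calling read.csv() and then pairs() directly on the resulting object does the same thing.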

4. Look at the question at the start of this section. Is this question confirmatory or exploratory? Why?

Based on your answer to question 4, open the appropriate help HTML for this section.

5. Conduct model selection for answering “which variables influence cloning success?” To answer this question include a bullet point list of the steps you take to do this. You can include a line or two of R code with each bullet point but you should not need a lot.

6. Interpret the results from the model selection. Include reference to the model selection and the final model you end up with, i.e. you should also describe the effect of any variables in the final model.
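As a reminder of the mechanics only, the sketch below fits two candidate linear models with lm() and prints the coefficients of one of them. It uses SIMULATED stand-in data (the object sim and the numbers used to generate it are assumptions, not the real CloneData.csv), so the output says nothing about the real answer. How you compare the candidate models depends on your answer to question 4 and is described in the help HTMLs.

```r
set.seed(1)  # make the simulated example reproducible

# SIMULATED stand-in data, NOT the real cloning data
n <- 50
sim <- data.frame(
  Age       = runif(n, 66, 250),
  Size      = runif(n, 0.1, 20),
  Herbivore = sample(c(TRUE, FALSE), n, replace = TRUE)
)
sim$SuccessIndex <- 1 - 0.01 * sim$Age + 0.5 * sim$Herbivore + rnorm(n)

# Two of the possible candidate models
m_age  <- lm(SuccessIndex ~ Age, data = sim)
m_full <- lm(SuccessIndex ~ Age + Size + Herbivore, data = sim)

# Estimated effects and their 95% confidence intervals
print(coef(m_full))
print(confint(m_full))
```

For a continuous variable such as Age, the coefficient is the expected change in SuccessIndex per unit increase. For the logical variable, R reports a coefficient named HerbivoreTRUE, which is the expected difference between herbivores and carnivores (FALSE is the reference level).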


Does the size of a dinosaur affect their popularity?

This data was collected from a large survey of the general public. The participants were asked to rate, on a continuous scale (0-100), how much they liked different dinosaur species. The species all differed in size. The board members (your team) think that visitors to your zoo will be more excited to see bigger dinosaurs because bigger dinosaurs are more popular.

The dataset has the following columns:

PopularityScore: the popularity score of the dinosaur species.

Weight: the weight of the species in metric tons (a measure of size).

The data for this question can be found at https://www.math.ntnu.no/emner/ST2304/2019v/Week10/DinoData.csv

As always, the first step is to import the data, assign it to an object, and then plot it. You can use the whole web link above to import the data. It is a csv file with column names (header) included. pairs() should work for plotting here.

7. Look at the question at the start of this section. Is this question confirmatory or exploratory? Why?

Based on your answer to question 7, open the appropriate help HTML for this section.

8. Conduct model selection for answering “does the size (weight) of a dinosaur affect its popularity?” To answer this question include a bullet point list of the steps you take to do this. You can include a line or two of R code with each bullet point but you should not need a lot.

9. Interpret the results from the model selection. Include reference to the model selection and the final model you end up with, i.e. you should also describe the effect of any variables in the final model.


Recommendation

10. Based on all of your results, what would you recommend as a way to create an efficient and exciting exhibit?