This document contains information, questions, R code, and plots.
Hints and reminders are bold
Questions appear in blue.
# R code looks like this. # Comments are in grey and explain the code 2+2 # add two and two
##  4
# outputs from R appear like the above ^^^^
You should work through the exercise in groups and write answers in a separate document to be submitted on Blackboard by Thursday 30th January.
Answer all questions. The assignment can be written in any format (e.g.
should easily be able to include pictures, R code, and text.
To complete this exercise you will need to use:
The exercise is designed to use statistical theory in an applied context to solve biology related problems similar to those you will face in real life. This week the exercise looks at how we can use maximum likelihood estimation to make biological and political conclusions. These exercises will get you to use all steps in statistical modelling (choosing models, estimating parameters, estimating uncertainty, and interpretation of results.)
We will give some help with R and some hints, these will reduce each week. We also expect you to use some problem solving yourselves and to ask the assistants for help.
We would expect this to take you at least 2 hours.
In the exam you will be given the R help pages for any function that is used.
Here you can practice using them. Type
? followed by the function name to bring up
the help pages for that function e.g.
For the function
optimise() that we use this week.
This function provides an easy way for us to optimise over a parameter (find the 'best' value). What we consider the 'best' value of a parameter depends on what criteria we want to optimise. For most of the work we are doing we want to find the parameter that maximises the likelihood. We can use optimise to calculate the likelihood for all possible values of a parameter (within bounds we set) and find the maximum of these.
Arugments we have to provide to
optimise() (the arguments optimise takes):
f, the function to be optimised, for us that is the likelihood.
lower, the lower bound of parameter values to try.
upper, the upper bound of parameter values to try.
maximum, TRUE or FALSE, whether to maximise or minimise the function. We want to maximise.
You are a team of scientists studying the reintroduction of dragons into Trondheim.
Dragons were reintroduced last year as part of rewilding efforts. The exact species of dragon introduced is a seed eater and so no threat to people or livestock. They also help maintain lower vegetation across the river, which helps kayakers and other river users. Since reintroduction you have been collecting data on these dragons as part of your research. Now the government wants you to present your findings to help them choose a management strategy:
From previous work you know that an average population abundance of 35 individuals is needed to maintain the population and keep vegetation under control. An average population abundance of more than 50 individuals will start to put pressure on resources and increase risks of fire.
It is your job to analyse your data and advise the government how to manage the population.
One of your team has been collecting data on the sex of dragon eggs. Conveniently, male dragons are purple and females are red (even as eggs).
They have collected data from 100 eggs.
Their data can be found at (https://www.math.ntnu.no/emner/ST2304/2020v/Week03/DragonEggData.csv).
This file has a header and is a
You can either download and then import it using read.csv() or import directly using the whole web address above.
DO NOT open the file in excel and re-save after downloading.
The data has two columns
sex. In the sex column 0 = male and 1 = female.
Remember: to access a particular column of a data frame you use $ e.g. sex_ratio_data$Egg_number
First thing we should always do is look at our data.
A1. Plot a histogram of the data and look at the raw data in R.
This is not a pretty graph but lets you see what sort of data you have.
From this data you can calculate the observed proportion of female eggs:
A2. What is the proportion of female eggs in the observed data? (include one line on how you calculated it)
Now you have the proportion of female eggs that were observed. But this does not tell us much. You need to complete some statistical analyses to make generalisable statements about the results.
But which distribution can we use as a model?
You want to choose a model to represent the data. Just like last week, this is binary data (male or female) and each egg is an independent trial. It should be familiar that we can use the binomial distribution as our model. As we have chosen the binomial distribution, you know that there is one unknown parameter to estimate \(p\) = the probability of success (being female).
We have a model, how do we estimate the parameter?
A3i. Begin by calculating the likelihood for different probabilities of being female (here being female = getting a success).
dbinom()function to find the density for different parameter values. In this case the density is the same as the likelihood (use ?dbinom if you can't remember what arguments it takes)
A3ii. Then plot the likelihood curve.
A4. At what point along the curve is the maximum likelihood estimate? and How can we find it, mathematically? (hint: you can describe the mathematical process in words or use the equation)
Below is the code to find the maximum likelihood estimate of the probability of being female in R and plot it on the graph. You could also use the equations from last week and find the maximum likelihood estimate analytically.
But we present an example using
optimise(), this is easier for more complex likelihoods.
# f = the dbinom() function # You now add an upper and lower bound for the probability (0 and 1) # R will only calculate the likelihood for parameter values between 0 and 1 # You still need to include the arguments of the number of successes # and number of trials. dbinom() needs these to run # You have set maximum to TRUE MLE.binom <- optimise(f=dbinom, lower=0, upper=1, x=NSuccess, size=NTrials, maximum=TRUE) # $maximum takes only the maximum value from the output # Look at the result: MLE.binom # $maximum = the parameter value that gave the highest likelihood # $objective = the likelihood vlaue for that parameter
# Now to plot the results # Repeat the plot from A3ii plot(x=Probabilities, y=likelihoods, type="l", xlab="Probability of female", ylab="Likelihood") # add a point at the maximum likelihood points(MLE.binom$maximum, # the parameter value MLE.binom$objective, # the maximum likelihood value col="red", pch=16) # pch = 16 and col = "red", makes the point red and filled in
You should now have a plot of the likelihoods and the probability (our parameter) with the maximum likelihood. Although you have managed to calculate this, it is not the whole picture.
A5. Why is the maximum likelihood estimate alone not enough to draw statistical conclusions? and What other information should we include?
Below is the code to calculate the confidence intervals for the maximum likelihood estimate and plot them, using a normal approximation.
A6. Write a definition for the standard error. You can use equations but it must also include words.
r <- 31 # number of successes N <- 100 # total number of trials p <- 0.31 # proportion of success (rounded) phat <- r/N # get maximum likelihood estimate stderr <- sqrt(phat*(1-phat)/N) # creating the standard error # create normal approximation of likelihood likelihoods.approx <- dnorm(Probabilities, phat, stderr) # scale this by the maximum likelihood (becomes relative likelihood) # i.e. relative to the maximum calculated likelihoods.approx <- likelihoods.approx/max(likelihoods.approx) # plot the previous likelihood (binomial) plot(x=Probabilities, y=likelihoods, type="l", ylab="Likelihood", xlab="Probability") # add a line for the new normal approximation likelihood # need to multiply the relative likelhoods by the maximum from the binomial lines(Probabilities, likelihoods.approx*max(likelihoods), col="red") # calculate the lower and upper confidence intervals using the standard error CI_lower <- MLE.binom$maximum - 1.96*stderr CI_upper <- MLE.binom$maximum + 1.96*stderr # display the rounded result round(c(CI_lower, CI_upper), 2) # mark the CIs on the plot - can do with lines, polygons or points # I have chosen lines because it is simple code abline(v=0.31) # MLE line abline(v=c(CI_lower, CI_upper), col="red") # CI lines # add text labels x and y tell R the coordinates to put text at text(x=0.1, y=0.04, labels="Lower CI", col="red") text(x=0.57, y=0.04, labels="Upper CI", col="red")
Now, that you have estimated parameters and uncertainty, it is time to interpret the results.
A7. Do the dragons have a skewed sex ratio as eggs (is it different to 50-50)? Use the confidence intervals in your answer. Explain your answer
Another member of your team has been collecting data on the number of adult dragons in Trondheim. They have set up a feeding station along the river. They have counted the number of dragons that visited the feeding station in a single day. The number of dragons they counted was 32.
This is just a single count, if you repeated the count on more days each one would differ. Just like the amount of land and sea we sampled differed each time we moved the globe.
But which distribution can we use as a model?
The data you have are counts, someone went out and counted dragons. Count data, can only be whole numbers (integers) that are always positive (non-negative). You cannot have half a dragon or negative dragons. The Poisson distribution also has these properties, therefore it is the appropriate choice. So, here we assume that the count data has come from a Poisson distribution.
In the same way as for binomial data, we want to find what parameter values (but this time for the Poisson distribution), which produce the highest likelihood given our observed data, those that are most likely to give rise to our data.
Here you only have information on when you did see a dragon, not how many you did not see one (there is no record of failures).
You need to save your number of observations as an object in R.
ObservedAbundance <- 32
The Poisson distribution is characterised by a single parameter. Lambda \( \lambda \) which is the mean (and also the variance) of the distribution. As with the binomial distribution (and \(p\), the probability), we can also find the maximum likelihood estimate for the Poisson, but here we are finding the likelihood of different \( \lambda \) (means).
Below you can see the code to find the likelihoods of different \( \lambda \) and then plot them.
Details are as follows:
Remember: you can only have whole dragons.
dpois()function to find the likelihoods (dpois(x=ObservedData, lambda=TestLambdas) or use ?dpois)
# Begin by creating a sequence of lambdas to find likelihoods for lambdas <- seq(1,70,1) # 70 lambdas (means) # Calculate the likelihoods likelihoods.Pois <- dpois(x=ObservedAbundance, lambdas) # plot the outcome plot(lambdas, likelihoods.Pois, type="l", ylab="Likelihood", xlab="Lambda (mean)")
To find the maximum likelihood we again use
# replot the outcome plot(lambdas, likelihoods.Pois, type="l", ylab="Likelihood", xlab="Lambda (mean)") # This time we use optimise on the dpois function used above # Set lower and upper to the bounds of your lambdas above MLE.Pois <- optimise(dpois, lower=0, upper=70, x = ObservedAbundance, maximum = TRUE) MLE.Pois # plot at point at the maximum likelihood points(MLE.Pois$maximum, # parameter value that gives the highest likelihood MLE.Pois$objective, # the actual likelihood value for that parameter col="blue", pch=16)
You now have a maximum likelihood estimate for the mean abundance (\( \lambda \)) of adult dragons in Trondheim. As with the sex ratio though, this is not all of the information that we need for statistical inference. We would also like a measure of our confidence (uncertainty) in our estimate.
The following code calculates the confidence intervals for the maximum likelihood estimate of \( \lambda \)
Remember: the variance is the square of the standard deviation.
dnorm() takes the standard deviation = sqrt(lambda).
# calculate approximate likelihoods from the normal distribution # mean is MLE from Poisson and standard deviation is square root of MLE Poisson likelihoods.approx.Pois <- dnorm(lambdas, MLE.Pois$maximum, sqrt(MLE.Pois$maximum)) likelihoods.approx.Pois <- likelihoods.approx.Pois/max(likelihoods.approx.Pois) plot(lambdas, likelihoods.Pois, type="l", ylab="Likelihood", xlab="Probability") lines(lambdas, likelihoods.approx.Pois*max(likelihoods.Pois), col="blue") stderr_Pois <- sqrt(MLE.Pois$maximum) # standard deviation as normal approximation CI_lower <- MLE.Pois$maximum - 1.96*stderr_Pois CI_upper <- MLE.Pois$maximum + 1.96*stderr_Pois ApproxCI <- round(c(CI_lower, CI_upper),2) ApproxCI # Exact CIs by knowing the maths ## You do not need to know this!!! It's just here so we can plot them. ExactCI <- qchisq(p=c(0.025, 0.975), df = 2*MLE.Pois$maximum)/2 # Mark on the plot - can do with lines, polygons or points # This plots a polygon for the exact confidence intervals InExactCI <- lambdas > ExactCI & lambdas < ExactCI # This code shades the plot under the curve and between the exact confidence intervals polygon(x=c(lambdas[InExactCI], max(lambdas[InExactCI]), min(lambdas[InExactCI])), y=c(likelihoods.Pois[InExactCI], 0,0), border=NA, col="grey70") # YOU ONLY NEED TO KNOW HOW TO PLOT POLYGONS IF YOU ARE INTERESTED # Then mark on the plot approximate intervals # can do with lines, polygons or points # This one is a line abline(v=32) # MLE line abline(v=c(CI_lower, CI_upper), col="blue") # CI lines text(x=10, y=0.04, labels="Lower\nCI", col="blue") text(x=50, y=0.04, labels="Upper\nCI", col="blue") # Adds a legend (key) to the plot in the top right corner # Labelled Exact CI and Approx CI, with a grey and blue filled square and no line around the edge (bty = 'n') legend("topright", c("Exact CI", "Approx CI"), fill=c("grey70", "blue"), bty='n')
The approximate 95% confidence intervals for the Poisson distribution are 21 to 43 (to nearest whole dragon) with a maximum likelihood estimate of 32 individuals.
B1. Interpret the confidence intervals here. What do they represent?
B2. Are you happy with the normal approximation and the confidence intervals here? Why/Why not?
Your team have analysed two sets of data on the dragon population in Trondheim.
Now you have to create a management recommendation for the local government.
C1. Produce a recommendation for the government. It should include: