Exercise 1: Confidence/uncertainty

SOLUTION

Instructions:

This document contains information, questions, R code, and plots.

Hints and reminders are italic

Questions appear in blue.

Answers in red or as R code!.

# R code looks like this.
# Comments are in grey and explain the code

2+2 # add two and two
## [1] 4
# outputs from R appear like the above ^^^^

You should work through the exercise in groups and write answers in a separate document to be submitted on Blackboard by Thursday 31st January.

A new R function we will use (optimise)

This function provides an easy way for us to optimise over a parameter (find the 'best' value). What we consider the 'best' value of a parameter depends on what criteria we want to optimise. For most of the work we are doing we want to find the parameter that maximises the likelihood. We can use optimise to calculate the likelihood for all possible values of a parameter (within bounds we set) and find the maximum of these.

Things we have to provide to optimise (the arguments optimise takes):


The challenge: management of dragons in Trondheim

You are a team of scientists studying the reintroduction of dragons into Trondheim.

Dragons were reintroduced last year as part of rewilding efforts. The exact species of dragon introduced is a seed eater and so no threat to people or livestock. They also help maintain lower vegetation across the river, which helps kayakers and other river users. Since reintroduction you have been collecting data on these dragons as part of your research. Now the government wants you to present your findings to help them choose a management strategy:

From previous work you know that an average population abundance of 35 individuals is needed to maintain the population and keep vegetation under control. An average population abundance of more than 50 individuals will start to put pressure on resources and increase risks of fire.

It is your job to analyse your data and advise the government how to manage the population.


Analysis one: sex ratio of dragon eggs

Dragon picture

One of your team has been collecting data on the sex of dragon eggs. Conveniently male dragons are purple and females are red (even as eggs).

They have collected data from 100 eggs. Their data can be found at (https://www.math.ntnu.no/emner/ST2304/2019v/Week3/sex_ratio_data.csv). This file has a header and is a csv. You can either download and then import it using read.csv() or import directly using the whole we address above.

In the sex column 0 = male and 1 = female.

Remember: to access a particular column of a data frame you use $ e.g. sex_ratio_data$Egg_number

First thing we should always do is look at our data. Plot a histogram of the sex column. This is not a pretty graph but lets you see what sort of data you have.

# import the data and name it
SexRatioData <- read.csv('https://www.math.ntnu.no/emner/ST2304/2019v/Week3/sex_ratio_data.csv', header=T) 

# plot the histogram - I have changed the title (main) and 
# the x axis label (xlab) manually
hist(SexRatioData$sex, xlab="Sex (0 = male, 1 = female)", 
     main = "Histogram of sex ratio data")

plot of chunk unnamed-chunk-2

From this data you can calculate the observed proportion of female eggs:

1. What is the proportion of female eggs in the observed data? (include one line on how you calculated it)

0.39 are female, code below.

# There are several ways to do this. The exact way does not matter as long as it is correct.

# create an object called ProportionFemale 

# Option 1:
# Sum the numbers in sex column (basically counts all the 1s) 
# then divide by the total number of trials (100)
# round() rounds the numbers, here to 2 decimal places.
ProportionFemale <- round(sum(SexRatioData$sex)/length(SexRatioData$sex),2) 

# Option 2: 
# We sum up the number of 1s and divide by the total number. 
# mean() is a sneaky way of doing this.
ProportionFemale <- mean(SexRatioData$sex)

# display answer
ProportionFemale # 0.39
## [1] 0.39

Now you have the proportion of female eggs that were observed.

But this does not tell us much. You need to complete some statistical analyses to make generalisable statements about the results.

2a. Begin by calculating the likelihood for different probabilities of being female (here being female = getting a success).

Code for 2a and 2b below.

2b. Then plot the likelihood curve.

# save the number of successes as an object
NSuccess <- sum(SexRatioData$sex) # 39
# save the number of trials as an object (total number of eggs)
NTrials <- length(SexRatioData$sex) # 100

# Help! what probabilites (probability of being female) do I use?
# You can use any but chosing a large number of possibilities covering 
# from almost 0 to almost 1.
# seq() is a good function to use. 
# It creates a vector of sequence of numbers from the start and end you choose.
Probabilities <- seq(from=0.001, to=0.999, by=0.001) # 999 numbers

# generate likelihoods
likelihoods <- dbinom(x=NSuccess, size=NTrials, prob=Probabilities)

# plot the curve as a line (type="l")
# manually set x and y labels (xlab and ylab)
plot(x=Probabilities, y=likelihoods, type="l", xlab="Probability of female", 
     ylab="Likelihood") 

plot of chunk unnamed-chunk-4

3a. At what point along the curve is the maximum likelihood estimate? 3b. How can we find it, mathematically? (hint: you can describe the mathematical process in words it does not have to be the equation)

We can differentiate the likelihood, and find the sex ratio where the slope is zero. or we can optimise

Below is the code to find the maximum likelihood estimate of the probability of being female in R and plot it on the graph.

This is where we use optimise

# f = the dbinom function we already used above
# We now add an upper and lower bound for the probability (0 and 1)
# We still need to include the arguments of the number of successes
# and number of trials. dbinom needs these to run
# We have set maximum to TRUE
MLE.binom <- optimise(f=dbinom, lower=0, upper=1, x=NSuccess, 
             size=NTrials, maximum=TRUE)$maximum
# $maximum takes only the maximum value from the output
MLE.binom
## [1] 0.3899995
# Now we plot it

# repeat the plot from above
plot(x=Probabilities, y=likelihoods, type="l", xlab="Probability of female", 
     ylab="Likelihood") 

# add a point at the maximum likelihood
# pch=16 and col="red", makes the point red and filled in
points(MLE.binom, max(likelihoods), col="red", pch=16) 

plot of chunk unnamed-chunk-5

You should now have a plot of the likelihoods and the probability with the maximum likelihood. Although you have managed to calculate this, it is not the whole picture.

4a. Why is the maximum likelihood estimate alone not enough to draw statistical conclusions? 4b. What other information should we include?

It is not enough because it does not tell us anything about uncertainty/confidence. If we took another sample we would get a different answer, this should be represented. What other information is the confidence interval. or standard error

Below is the code to calculate the confidence intervals for the maximum likelihood estimate and plot them, using a normal approximation.

5.Write a definition for the standard error.

The standard error is the standard deviation of the sampling distribution of a statistic. If we were to estimate a statistic (e.g. probability) many times using different samples from the population, the expected standard deviation of the different estimated probabilities would be the standard error. (p22 of New statistics with R, can also google) Formula (sqrt(variance/n)) is also acceptable. But should include some reference to the sampling distribution.

r <- 39 # number of successes
N <- 100 # total number of trials
p <- 0.39 # proportion of success
phat <- r/N 
stderr <- sqrt(phat*(1-phat)/N) # creating the standard error

# create normal approximation of likelihood
likelihoods.approx <- dnorm(Probabilities, phat, stderr) 
# scale this by the maximum likelihood (becomes relative likelihood)
likelihoods.approx <- likelihoods.approx/max(likelihoods.approx)

# plot the previous likelihood (binomial)
plot(x=Probabilities, y=likelihoods, type="l", ylab="Likelihood", xlab="Probability")
# add a line for the new normal approximation likelihood
# need to multiply the relative likelhoods by the maximum from the binomial
lines(Probabilities, likelihoods.approx*max(likelihoods), col="red")

# calculate the lower and upper confidence intervals using the standard error
CI_lower <- MLE.binom - 1.96*stderr
CI_upper <- MLE.binom + 1.96*stderr
# display the rounded result
round(c(CI_lower, CI_upper), 2)
## [1] 0.29 0.49
# mark the CIs on the plot - can do with lines, polygons or points
# I have chosen lines because it is simple code
abline(v=0.39) # MLE line
abline(v=c(CI_lower, CI_upper), col="red") # CI lines
# add text labels (\n creates a new line)
text(x=0.2, y=0.04, labels="Lower\nCI", col="red") 
text(x=0.57, y=0.04, labels="Upper\nCI", col="red")

plot of chunk unnamed-chunk-6

6. Do the dragons have a skewed sex ratio as eggs (is it different to 50-50)? Use the confidence intervals in your answer. Explain your answer

The CI for probability of being female is 0.29 to 0.49, which excludes 0.5. So there seem to be fewer females than would be expected from a 50:50 ratio You do not need the MLE, which is 0.39, to reach a conclusion. But should still be included.


Analysis two: adult dragon numbers

Another member of your team has been collecting data on the number of adult dragons in Trondheim. They have set up a feeding station along the river. They have counted the number of dragons that visited the feeding station in a single day. The number of dragons they counted was 30.

This is just a single count, if you repeated the count on more days each one would differ. Just like the amount of land and sea we sampled differed each time we threw the beach ball.

Here we assume that our count data has come from a Poisson distribution. Count data, can only be whole numbers that are always positive, the Poisson distribution also has these properties. Therefore it is the appropriate choice.

In the same way as for binomial data, we want to find what parameter values (but this time for the Poisson distribution), which produce the highest likelihood given our observed data, those that are most likely to give rise to our data.

Here you only have information on when you did see a dragon, not how many you did not see one (there is no record of failures).

You need to save your number of observations as an object in R.

ObservedAbundance <- 30

The Poisson distribution is characterised by a single parameter. Lambda \( \lambda \) which is the mean (and also the variance) of the distribution. As with the binomial distribution (and p, the probability), we can also find the maximum likelihood estimate for the Poisson, but here we are finding the likelihood of different \( \lambda \) (means).

Below you can see the code to find the likelihoods of different \( \lambda \) and then plot them. Details are as follows:

Remember: we can only have whole dragons

# Begin by creating a sequence of lambdas to find likelihoods for
lambdas <- seq(1,70,1) # 70 lambdas (means)

# Calculate the likelihoods
likelihoods.Pois <- dpois(x=ObservedAbundance, lambdas)

# plot the outcome
plot(lambdas, likelihoods.Pois, type="l", ylab="Likelihood", xlab="Lambda (mean)")

plot of chunk unnamed-chunk-8

To find the maximum likelihood we again use optimise.

# replot the outcome
plot(lambdas, likelihoods.Pois, type="l", ylab="Likelihood", xlab="Lambda (mean)")

# This time we use optimise on the dpois function used above
# Set lower and upper to the bounds of your lambdas above
MLE.Pois <- optimise(dpois, lower=0, upper=70, x = ObservedAbundance, 
                     maximum = TRUE)$maximum
MLE.Pois
## [1] 30
# plot at point at the maximum likelihood 
points(MLE.Pois, max(likelihoods.Pois), col="blue", pch=16) 

plot of chunk unnamed-chunk-9

You now have a maximum likelihood estimate for the mean abundance (\( \lambda \)) of adult dragons in Trondheim. As with the sex ratio though, this is not all of the information that we need for statistical inference. We would also like a measure of our confidence (uncertainty) in our estimate.

The following code calculates the confidence intervals for the maximum likelihood estimate of \( \lambda \)

Remember: the variance is the square of the standard deviation. dnorm() takes the standard deviation = sqrt(lambda)

# calculate approximate likelihoods from normal distribution
# mean is MLE from Poisson and standard deviation is square root of MLE Poisson
likelihoods.approx.Pois <- dnorm(lambdas, MLE.Pois, sqrt(MLE.Pois)) 
likelihoods.approx.Pois <- likelihoods.approx.Pois/max(likelihoods.approx.Pois)

plot(lambdas, likelihoods.Pois, type="l", ylab="Likelihood", xlab="Probability")
lines(lambdas, likelihoods.approx.Pois*max(likelihoods.Pois), col="blue")

stderr_Pois <- sqrt(MLE.Pois) # standard deviation as normal approximation
CI_lower <- MLE.Pois - 1.96*stderr_Pois
CI_upper <- MLE.Pois + 1.96*stderr_Pois
round(c(CI_lower, CI_upper),2)
## [1] 19.26 40.74
# Exact CIs by knowing the maths
## You do not need to know this!!! It's just here so we can plot them.
ExactCI <- qchisq(p=c(0.025, 0.975), df = 2*MLE.Pois)/2

# Mark on the plot - can do with lines, polygons or points
# This plots a polygon for the exact confidence intervals
InExactCI <- lambdas > ExactCI[1] & lambdas < ExactCI[2]

# This code shades the plot under the curve and between the exact confidence intervals
polygon(x=c(lambdas[InExactCI], max(lambdas[InExactCI]), min(lambdas[InExactCI])),
        y=c(likelihoods.Pois[InExactCI], 0,0), border=NA, col="grey70")

# Then mark on the plot approximate intervals
# can do with lines, polygons or points
# This one is a line
abline(v=30) # MLE line
abline(v=c(CI_lower, CI_upper), col="blue") # CI lines
text(x=10, y=0.04, labels="Lower\nCI", col="blue")
text(x=50, y=0.04, labels="Upper\nCI", col="blue")

# Adds a legend (key) to the plot in the top right corner
# Labelled Exact CI and Approx CI, with a grey and blue filled square and no line around the edge (bty = 'n')
legend("topright", c("Exact CI", "Approx CI"), 
       fill=c("grey70", "blue"), bty='n')

plot of chunk unnamed-chunk-10

The approximate 95% confidence intervals for the Poisson distribution are 19 to 41 (to nearest whole dragon) with a maximum likelihood estimate of 30 individuals.

7. Interpret the confidence intervals here. What do they represent?

The confidence intervals are 95% confidence intervals so they represent the boundaries within which we would likely expect the true population value of the parameter (here the mean of the Poisson) to be found. If we were to repeat our sampling many times and each time create a confidence interval, on average the true value would be within the confidence interval 95% of the time. Here we our uncertainty spans from 19 to 41 dragons. Numbers of dragons outside of this range are less likely, according to our model.

8. Are you happy with the normal approximation and the confidence intervals here? Why/Why not?

It does not matter if the students chose happy or unhappy. The key thing is to mention that the approximate and exact CIs are not the same. They do not cover the same proportion of the distribution. But they are quite similar. So the approximate are probably ok to use.


Recommendation

Your team have analysed two sets of data on the dragon population in Trondheim. Now you have to create a management recommendation for the local government.

9. Produce a recommendation for the government. It should include:

The exact answer here will depend on choices, i.e. do you want to be cautious in terms of dragon numbers or in terms of potential damage? The MLE is a bit lower than would be ideal. However the confidence interval does contain the ideal population mean abundance so if you are cautious of having too many dragons you may want to follow management strategy 1. But most of the interval is below ideal, so strategy 3 is also well justified. Particulary strategy 3 could be chosen in dragon conservation is the main priority, after all, these are generally nice dragons. Whatever management strategy choosen there should be a discussion of the upper and lower bounds of the confidence intervals when justifying the answer.

You have information on sex ratio. Which is too high (too few females). There should be at least some reference to this. But we have no data that could let us look at the causes of either pattern (sex ratio or abundance). Speculating on mechanisms is ok as long as it is biologically plausible but they must be aware that they cannot say much about anything we have no data on. This is how new studies are designed, informed by previous work.

Anything biologically sensible is accepted here. We would expecting answers such as; higher sample size, repeated counts along the river, environmental data, food availability. Any of these things would allow you to get better estimates of abundance and start to explore some mechanisms and relationships that could influence abundance or sex ratio.