Dataset 2: Iris petals


As described in the module, these data come from three species of iris. They include measurements of petal width and petal length. Here you will look at how the length of petals influences their width.

The data are in a .csv file with a header; the URL is in the import code below.

Important! When you import the data, make sure they are in the right format. Here you need Species to be a factor and PetalLength to be numeric. See the code below.

irisdata <- read.csv("https://www.math.ntnu.no/emner/ST2304/2020v/Week09/irisdata.csv", header = TRUE)

# str() checks the data structure
str(irisdata)

# the variables may already look ok, but convert them explicitly to be sure
irisdata$Species <- as.factor(irisdata$Species)
irisdata$PetalLength <- as.numeric(irisdata$PetalLength)

# now check again
str(irisdata)
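
The type check can also be done programmatically rather than by reading the str() output. The sketch below is only an illustration: it assumes R's built-in iris data as a stand-in for the course file, with columns renamed to match the names used here, so species labels and values may differ from irisdata.csv.

```r
# Stand-in for the course file (assumption): R's built-in iris data,
# with columns renamed to match the names used in this exercise.
irisdata <- data.frame(
  Species = as.character(iris$Species),
  PetalLength = iris$Petal.Length,
  PetalWidth = iris$Petal.Width,
  stringsAsFactors = FALSE
)

# Convert explicitly, as in the code above
irisdata$Species <- as.factor(irisdata$Species)
irisdata$PetalLength <- as.numeric(irisdata$PetalLength)

# Confirm the types; both should be TRUE
is.factor(irisdata$Species)
is.numeric(irisdata$PetalLength)

# The first level listed is the reference level that lm() will use
levels(irisdata$Species)
```

Checking levels() is worthwhile because the reference level decides what the intercept and the species coefficients mean later on.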

The columns in the dataframe used in this exercise are Species, PetalLength and PetalWidth.

You want to find out how petal length and species affect petal width.

1. What model will you use to answer this? (1 mark)

1 point for any of: linear model, regression, ANCOVA. No points for ANOVA or t-test. There is one categorical and one continuous explanatory variable here.

2. What type of variables do you have and which are response or explanatory? (3 marks)

1 point for each of: PetalWidth = response, continuous. Species = explanatory, categorical. PetalLength = explanatory, continuous.

We have given code below to run a model for the data. Think about what type of model is being run. It is good practice to consider whether you would have chosen the same one.

# Model 1
model1 <- lm(PetalWidth ~ Species+PetalLength, data = irisdata)

coef(model1)

confint(model1)

3. Interpret the output of the model. What does it tell you about the effect of petal length on petal width? (5 marks)

1 mark for: the intercept is the intercept of the line for Species 1 = color. 1 mark for: the effect of petal length is positive (longer petals are also wider: 0.23 cm (0.16 to 0.29) wider for every 1 cm of length). 1 mark for: this effect is the same for all species. 1 mark for: Species Setosa seems to have a lower petal width than Species color (0.44 cm lower, -0.64 to -0.23), but Species Virginica is higher (0.4 cm, 0.29 to 0.52). 1 mark for: the confidence intervals for the species do not cross 0, so the direction seems clear. To get each mark you should mention both the coefficient estimate and its confidence interval.
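
One way to see what the coefficients mean is to rebuild a prediction by hand. The sketch below is illustrative only: it assumes R's built-in iris data as a stand-in for the course file (renamed columns), so the reference species and the numbers will differ from those quoted above. The petal length of 4 cm is an arbitrary example value.

```r
# Stand-in data (assumption): built-in iris, renamed to match this exercise
irisdata <- data.frame(
  Species = iris$Species,
  PetalLength = iris$Petal.Length,
  PetalWidth = iris$Petal.Width
)

model1 <- lm(PetalWidth ~ Species + PetalLength, data = irisdata)
b <- coef(model1)

ref_level <- levels(irisdata$Species)[1]    # reference species (intercept)
other_level <- levels(irisdata$Species)[2]  # a non-reference species

len <- 4  # example petal length in cm

# Reference species: intercept + slope * length
width_ref <- b[["(Intercept)"]] + b[["PetalLength"]] * len

# Other species: same slope, line shifted by the species coefficient
width_other <- width_ref + b[[paste0("Species", other_level)]]

# predict() should give the same two values
newdata <- data.frame(
  Species = factor(c(ref_level, other_level), levels = levels(irisdata$Species)),
  PetalLength = len
)
predicted <- predict(model1, newdata = newdata)

c(width_ref, width_other)
predicted
```

The species coefficients shift the line up or down, while the single PetalLength coefficient gives every species the same slope.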

Below is the code to make some graphs to check the model fit.

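# Graph 1: residuals vs fitted values

# names chosen so they do not mask the residuals() and fitted() functions
model1_residuals <- residuals(model1)
model1_fitted <- fitted(model1)
plot(model1_fitted, model1_residuals, xlab = "Fitted values", ylab = "Residuals")

# Normal QQ plot of the residuals
qqnorm(model1_residuals)
qqline(model1_residuals)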

4. What are the assumptions of the model? (5 marks)

1 mark each for: linearity, equal variance, independence, no outliers, normality of residuals.

5. Are the assumptions met? Reference which plot you use to decide and why you make the choice. (6 marks)

3 marks for: Not quite - equal variance, checked using the residuals vs fitted plot, where the variance differs between species. Also 3 marks for: Not quite - normality of residuals, checked using the normal QQ plot, where there is quite a lot of deviation at the edges.
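
The visual impression of unequal variance can be backed up with a quick numerical summary of the residual spread per species. This is a sketch only, assuming R's built-in iris data as a stand-in for the course file (renamed columns), so the values will not match irisdata.csv exactly. It supplements the plots and does not replace them.

```r
# Stand-in data (assumption): built-in iris, renamed to match this exercise
irisdata <- data.frame(
  Species = iris$Species,
  PetalLength = iris$Petal.Length,
  PetalWidth = iris$Petal.Width
)

model1 <- lm(PetalWidth ~ Species + PetalLength, data = irisdata)

# Standard deviation of the residuals within each species;
# clearly different values suggest unequal variance
resid_sd <- tapply(residuals(model1), irisdata$Species, sd)
resid_sd
```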

6. What other plot might you also want for checking assumptions? (1 mark)

1 mark: Cook's distance.
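
A minimal sketch of how a Cook's distance plot could be made in base R. It assumes R's built-in iris data as a stand-in for the course file (renamed columns); cooks.distance() is the standard function in the stats package.

```r
# Stand-in data (assumption): built-in iris, renamed to match this exercise
irisdata <- data.frame(
  Species = iris$Species,
  PetalLength = iris$Petal.Length,
  PetalWidth = iris$Petal.Width
)

model1 <- lm(PetalWidth ~ Species + PetalLength, data = irisdata)

# One Cook's distance per observation
cooks <- cooks.distance(model1)

# Spike plot: tall spikes mark observations with a large influence on the fit
plot(cooks, type = "h", xlab = "Observation", ylab = "Cook's distance")

# The three most influential observations
head(sort(cooks, decreasing = TRUE), 3)
```

The built-in shortcut plot(model1, which = 4) draws a similar plot directly from the model object.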

Here is code for another model on the same data.

# Model 2
model2 <- lm(PetalWidth ~ Species*PetalLength, data = irisdata)

coef(model2)

confint(model2)

7. How is this model different to the first one? (1 mark)

1 mark: it now includes an interaction between Species and PetalLength.
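
To make the difference concrete: in an R formula, Species*PetalLength is shorthand for both main effects plus their interaction. The sketch below checks this, assuming R's built-in iris data as a stand-in for the course file (renamed columns); model2_long is a name introduced here only for illustration.

```r
# Stand-in data (assumption): built-in iris, renamed to match this exercise
irisdata <- data.frame(
  Species = iris$Species,
  PetalLength = iris$Petal.Length,
  PetalWidth = iris$Petal.Width
)

# Short form, as used for Model 2
model2 <- lm(PetalWidth ~ Species * PetalLength, data = irisdata)

# Long form: main effects plus the interaction written out
model2_long <- lm(PetalWidth ~ Species + PetalLength + Species:PetalLength,
                  data = irisdata)

# The two fits should have identical coefficients
all.equal(coef(model2), coef(model2_long))

# The interaction terms are the ones with a colon in their name
names(coef(model2))
```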

8. Given the new model, does this change your interpretation of the effect of petal length on petal width? Why? (4 marks)

1 mark for: yes. 1 mark for: Species Setosa no longer has a clear effect, as its confidence interval now spans 0. 1 mark for: the effects of petal length and Species Virginica still seem very similar and in the same direction. 1 mark for: Species Virginica also seems to have an interaction: the effect of petal length on petal width is weaker for this species, but still positive. It is important to know here that the interaction term gives the difference in the slope of petal width on petal length relative to the reference species, and to include confidence intervals and estimates in the answer.
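
Because each interaction term is a difference in slope, the slope for each species can be recovered by adding it to the PetalLength coefficient. The sketch below illustrates this, assuming R's built-in iris data as a stand-in for the course file (renamed columns), so the reference species and the numbers will differ from the course output.

```r
# Stand-in data (assumption): built-in iris, renamed to match this exercise
irisdata <- data.frame(
  Species = iris$Species,
  PetalLength = iris$Petal.Length,
  PetalWidth = iris$Petal.Width
)

model2 <- lm(PetalWidth ~ Species * PetalLength, data = irisdata)
b2 <- coef(model2)
lev <- levels(irisdata$Species)

# Reference species: slope is the PetalLength coefficient itself.
# Other species: PetalLength coefficient plus that species' interaction term.
slopes <- c(
  b2[["PetalLength"]],
  b2[["PetalLength"]] + b2[paste0("Species", lev[-1], ":PetalLength")]
)
names(slopes) <- lev
slopes
```

A negative interaction term therefore means a shallower slope for that species, not necessarily a negative one.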

9. Which model do you prefer, why? (3 marks)

1 mark for: I prefer model 2. 1 mark for: because it gives clearer estimates of some effects. 1 mark for: model 2 shows that there is an interaction for Species Virginica.
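
The marks above rest on interpreting the estimates. As an optional supplement, not required for the marks, the two nested models could also be compared formally. The sketch below uses an F-test via anova(), assuming R's built-in iris data as a stand-in for the course file (renamed columns).

```r
# Stand-in data (assumption): built-in iris, renamed to match this exercise
irisdata <- data.frame(
  Species = iris$Species,
  PetalLength = iris$Petal.Length,
  PetalWidth = iris$Petal.Width
)

model1 <- lm(PetalWidth ~ Species + PetalLength, data = irisdata)
model2 <- lm(PetalWidth ~ Species * PetalLength, data = irisdata)

# F-test of whether the two extra interaction terms improve the fit
comparison <- anova(model1, model2)
comparison
```

A small p-value in the second row would support keeping the interaction, in line with the preference for model 2.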