--- title: "Lecture 8: Categorical Variables" author: "Bob O'Hara" date: "bob.ohara@ntnu.no" output: beamer_presentation --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` ## Before we start... Reference Group # Exercise 1 and R # Problem 1: Bird Brains We want to look at the relationship between the log of body size and log of brain size. Adapt the code above to answer the following questions: - First, read in the data ```{r GetBirdBrainsData, fig.height=4, echo=TRUE} BirdBrains <- read.csv("../Data/BirdBrains.csv") plot(BirdBrains$logBodyMass, BirdBrains$logBrainMass) ``` # Problem 1: Bird Brains ### What is the relationship between log of body size and log of brain size? What is the effect of increasing body size? Does brain size increase proportionally (e.g. if you double body size, does brain size double?)? - fit a model ```{r FitBirdBrainModel, echo=TRUE} brains.mod <- lm(logBrainMass~logBodyMass, data=BirdBrains) # just show the coefficients round(summary(brains.mod)$coefficients, 3) ``` # Problem 1: Bird Brains ### What is the relationship between log of body size and log of brain size? What is the effect of increasing body size? Does brain size increase proportionally (e.g. if you double body size, does brain size double?)? - the relationship is positive: a larger body size means a larger brain size. - if the brain size doubles, logBrainSize increases by $\log(2) = `r round(log(2), 2)`$. - So body mass increases by $`r round(coef(brains.mod)[2], 2)` \times \log(2) = `r round(log(2)*coef(brains.mod)[2], 2)`$. This is less than $\log(2)$, so it is less than proportional # Problem 1: Bird Brains 1. What is the relationship between log of body size and log of brain size? What is the effect of increasing body size? Does brain size increase proportionally (e.g. if you double body size, does brain size double?)? - We can see that it is less than proportional by plotting the original data ```{r PlotBirdBrainsData, fig.height=4, echo=TRUE} plot(BirdBrains$Body.mass, BirdBrains$Brain.mass) abline(lm(Brain.mass~Body.mass, data=BirdBrains)) ``` # Model Fit ### How good is the model fit? Does it look like there any problems with the fit, e.g. outliers, curavture in the data? First, from the summary, we can see that the $R^2$ is `r round(100*summary(brains.mod)$r.square)`%, which means that the model is really good: only `r round(100*(1-summary(brains.mod)$r.square))`% of the variance is unexplained ```{r SummaryBrainsModel, comment=NA} summ2 <- paste(capture.output(print(summary(brains.mod), digits=2)), "\n") cat(summ2[16:18]) ``` # Model Fit ### How good is the model fit? Does it look like there any problems with the fit, e.g. outliers, curavture in the data? Now some residual plots ```{r ResidPlots, echo=TRUE, fig.height=4} plot(fitted(brains.mod), resid(brains.mod)) ``` This looks OK: it's a blob. There doesn't seem to be any curvature. # Model Fit ### How good is the model fit? Does it look like there any problems with the fit, e.g. outliers, curavture in the data? Normal Probability plots ```{r QQPlots, echo=TRUE, fig.height=4} qqnorm(resid(brains.mod)) qqline(resid(brains.mod)) ``` Looks largely OK, but tails might be a bit thick # Model Fit ### For the parrots (Order Psitacciformes), predict what their brain size would be if they were normal birds. Compare the predicted and real values - does the distribution of their brain sizes look the same as other birds? 
# Predict Parrots

### For the parrots (Order Psitacciformes), predict what their brain size would be if they were normal birds. Compare the predicted and real values - does the distribution of their brain sizes look the same as other birds?

First, extract the parrots and make the predictions

```{r PredictParrots, echo=TRUE, fig.height=4}
BirdBrains$IsParrot <- BirdBrains$Order=="Psitacciformes"
ParrotPred <- BirdBrains[BirdBrains$IsParrot,]
p.pred <- predict(brains.mod, newdata = ParrotPred, interval = "prediction")
# You do not need to worry about this next line!
rownames(p.pred) <- sapply(ParrotPred$Species.name., function(str) gsub('^.* ', paste0(substr(str, 1, 1), ". "), str))
```

# Compare Parrots

### For the parrots (Order Psitacciformes), predict what their brain size would be if they were normal birds. Compare the predicted and real values - does the distribution of their brain sizes look the same as other birds?

We should compare the real & predicted values

```{r ShowPredictParrots, echo=TRUE, comment=NA}
print(cbind(ParrotPred$logBrainMass, p.pred), digits=2)
```

# Plot Parrots

Better to plot. A few ways to do this

```{r PlotPredictParrots, echo=TRUE, fig.height=2, fig.width=3}
par(mar=c(4.1,4.1,0.1,1), cex=0.8)
plot(p.pred[,"fit"], ParrotPred$logBrainMass, xlim=range(p.pred), xlab="Fitted", ylab="Actual")
abline(0,1) # add 1:1 line
```

# Parrot Differences

### For the parrots (Order Psitacciformes), predict what their brain size would be if they were normal birds. Compare the predicted and real values - does the distribution of their brain sizes look the same as other birds?

Or calculate differences:

```{r SubtractPredictParrots, echo=TRUE, comment=NA}
round(ParrotPred$logBrainMass[1:3] - p.pred[1:3,"fit"], 2)
round(ParrotPred$logBrainMass[4:6] - p.pred[4:6,"fit"], 2)
```

All positive, so the real values are greater than the predictions: parrots have bigger brains than expected, compared to other birds.

# Categorical Variables, aka Factors

```{r GetBSData}
Data.all <- read.csv("../Data/EggBodySize.csv")
Data <- Data.all[!apply(Data.all, 1, function(x) any(is.na(x))),
                 c("species", "development.category", "Egg.mass..g.", "female.mass..g.", "mm..n..mean.femur")]
Data$logEggMass <- log(Data$Egg.mass..g.)
Data$logFemaleMass <- log(Data$female.mass..g.)
```

So far we have dealt with continuous variables. But not everything is continuous.

# Design of Experiments

A lot of the theory was developed for designed experiments

- field trials at Rothamsted
- lab studies
- clinical trials

# Examples

For historical reasons, this is also called "Analysis of Variance"

- Testing different crop varieties
- Testing effects of drugs/poisons

# Real Data

Effects of sugar and chocolate temperature on breaking chocolate cake

- Bake a chocolate cake
- hold 1 end
- lift other end
- measure angle at which it breaks
- repeat until you get your Master's degree

# One categorical Variable, 2 levels

```{r GetCake, results='asis', echo=FALSE, message=FALSE}
library(lme4)
data("cake")
Sugar <- cake[cake$recipe!="B" & cake$temp=="225",] # remove recipe B & use temperature of 225C
Sugar$treatment <- c("Control", "Treatment")[1+grepl("A", Sugar$recipe)]
```

Use data for 2 recipes: "normal" (control), and more sugar. For now only use 1 temperature

```{r SimData2lvls, fig.height=3}
par(mar=c(4.1,6,1,1), cex=1.5)
boxplot(angle~treatment, horizontal=TRUE, las=1, data=Sugar)
```

"Easy" example: control and treatment - same as a t-test (a quick check is on the next slide)

Question: does the treatment (i.e. adding sugar) have an effect?

- does the cake break more easily?
- is it different from the control?
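# One categorical Variable, 2 levels

Because the comparison is the same as a t-test, one way to check this (a sketch added here, not part of the original slides) is to run the t-test directly:

```{r TTestCheck, echo=TRUE, comment=NA}
# classic two-sample t-test assuming equal variances: the t statistic and
# p-value should match the regression fitted later (the sign of the mean
# difference depends on which group is listed first)
t.test(angle ~ treatment, data = Sugar, var.equal = TRUE)
```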
# Experimental Design

We assume that individual units are assigned at random to Control and Treatment

(the full experiment has cakes baked on different days, which we will ignore, and temperature, which we will include later)

# The Model

We assume that the responses in the control and treatment groups are normally distributed, and the only difference is their mean:

$$
\begin{aligned}
y_{i,control} &= \mu_{control} + \varepsilon_i \\
y_{i,treatment} &= \mu_{treatment} + \varepsilon_i \\
\varepsilon_i &\sim N(0, \sigma^2)
\end{aligned}
$$

We can write this like this:

$$
\begin{aligned}
y_{i} &= \alpha + \beta_j + \varepsilon_i \\
\varepsilon_i &\sim N(0, \sigma^2)
\end{aligned}
$$

where $\beta_j$ is the difference between the treatment and control means (we can set $\beta_1=0$ for the control), and $\alpha$ is an intercept: the control mean. From this, $E(y_i) = \alpha + \beta_j$

# Regression

We could also use regression

- set $X = 0$ for the control
- set $X = 1$ for the treatment ($X$ is called an 'indicator variable')

The coefficient is the change in the response when $X$ changes by 1, i.e. from control to treatment

The intercept is the mean for the control

# Regression

```{r RegressANOVA, echo=TRUE}
Sugar$X <- as.numeric(Sugar$treatment=="Treatment")
mod.reg <- lm(angle~X, data=Sugar)
print(summary(mod.reg)$coefficients, digits=3)
```

So the difference is `r round(coef(mod.reg)["X"], 2)`, with a standard error of `r round(summary(mod.reg)$coefficients["X","Std. Error"], 2)`.

# Regression

But why put the intercept on the control? Why not put it mid-way between the Control & Treatment, for example?

```{r RegressANOVA2, echo=TRUE}
Sugar$X.c <- as.numeric(Sugar$treatment=="Treatment")-0.5
mod.reg2 <- lm(angle~X.c, data=Sugar)
print(summary(mod.reg2)$coefficients, digits=3)
```

The difference is the same, but the intercept is now at `r round(coef(mod.reg2)['(Intercept)'], 2)`

# Regression

```{r PlotRegressANOVA}
par(mfrow=c(1,2))
plot(Sugar$X, Sugar$angle, main = "Intercept at Control")
abline(mod.reg, col=2)
abline(v=0, col=1, lty=3)
plot(Sugar$X.c, Sugar$angle, main = "Centred Intercept")
abline(mod.reg2, col=2)
abline(v=0, col=1, lty=3)
```

# More than 2 levels

With Control & Treatment we have 2 levels, but what if we have more?

```{r SimData3levels, fig.height=4, echo=TRUE}
Recipes <- cake[cake$temp=="225",] # use temp. of 225C
par(mar=c(4.1,6,1,1), cex=1.5)
boxplot(Recipes$angle~Recipes$recipe, horizontal=TRUE, las=1)
```

# More than 2 levels

We can use the same regression trick

```{r ShowData3levels, fig.height=5, echo=TRUE, comment=NA}
Recipes$HotChoc <- as.numeric(Recipes$recipe=="B") # hotter chocolate
Recipes$Sugar <- as.numeric(Recipes$recipe=="C") # more sugar
head(Recipes[,c("recipe", "HotChoc", "Sugar", "angle")])
```

# Drawing 3 levels

```{r DrawData3levels, fig.height=5}
library(scatterplot3d)
scatterplot3d(x=Recipes$HotChoc, y=Recipes$Sugar, z=Recipes$angle, bg=2, cex.axis = 1.4,
              las=1, color=4, grid = FALSE, lab=c(1,1,2), asp=0.01, angle=55, pch=1, cex.symbols=2)
```

Nothing can be Treatment 1 _and_ Treatment 2
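# More than 2 levels

A quick tabulation (an added check, not in the original slides) confirms that no cake has both indicators set at once:

```{r CheckIndicators, echo=TRUE, comment=NA}
# cross-tabulate the two dummy variables: the count for HotChoc=1 & Sugar=1 is zero
table(HotChoc = Recipes$HotChoc, Sugar = Recipes$Sugar)
```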
# More than 2 levels

Fit the model...

```{r FitData3levels, fig.height=5, echo=TRUE, comment=NA}
mod.3lvl <- lm(angle ~ HotChoc + Sugar, data=Recipes)
print(summary(mod.3lvl)$coefficients, digits=3)
```

We now have effects for the differences between hotter chocolate and the control, and between more sugar and the control

# Changing the Intercept

Make Recipe B the intercept

```{r FitData3levelsX1, fig.height=5, echo=TRUE}
Recipes$ColdChoc <- as.numeric(Recipes$recipe=="A") # colder chocolate
Recipes$Sugar <- as.numeric(Recipes$recipe=="C") # more sugar
mod.3lvl.1 <- lm(angle ~ ColdChoc + Sugar, data=Recipes)
print(summary(mod.3lvl.1)$coefficients, digits=3)
```

The estimate for chocolate temperature was `r round(coef(mod.3lvl)["HotChoc"], 2)`, now it is `r round(coef(mod.3lvl.1)["ColdChoc"], 2)`

The estimate for Sugar before was `r round(coef(mod.3lvl)["Sugar"], 2)`, now we have `r round(coef(mod.3lvl.1)["Sugar"], 2)`

- now it is a contrast between "more sugar" (recipe C) and "hotter chocolate" (recipe B)

# The Design Matrix Reloaded

Last time I introduced multiple regression as

$$
\mathbf{Y} = X \mathbf{\beta} + \mathbf{\varepsilon}
$$

where $X$ is the design matrix

Today I have been building the design matrix with 0's and 1's

```{r ShowDesignMatrix, echo=FALSE, comment=NA}
head(Recipes[,c("recipe", "HotChoc", "Sugar")])
```

# The Design Matrix Reloaded

Building the design matrix by hand is a pain, so we get the computer to do it.

In the data, recipe is coded as a factor (i.e. categorical):

```{r FactorRecipe, echo=TRUE, comment=NA}
str(Recipes$recipe)
```

with different levels

```{r FactorRecipeLevels, echo=TRUE, comment=NA}
levels(Recipes$recipe)
```

# The Design Matrix Reloaded

All this means the computer knows what to do

```{r DesignMatrixOneWay, echo=TRUE, comment=NA}
DesignMatrix <- model.matrix(~recipe, data=Recipes)
head(DesignMatrix)
```

# The Design Matrix Reloaded

We can thus just fit the model without worrying

```{r ANOVA1, echo=TRUE, comment=NA}
mod.an <- lm(angle~recipe, data=Recipes)
print(summary(mod.an)$coefficients, digits=3)
```

...except about how to interpret the parameter estimates

# The Design Matrix Contrasted

The coefficients are the same as before:

```{r ANOVACoef, echo=TRUE, comment=NA}
coef(mod.an)
```

```{r ANOVAReg, echo=TRUE, comment=NA}
coef(mod.3lvl)
```

# Contrasts

By default, R sets the first level of a factor as the intercept, and contrasts the other levels with that first level

- this makes sense here, but not if (for example) we are comparing 6 varieties of barley

We can see the contrasts R uses:

```{r TrtContrasts, echo=TRUE, comment=NA}
contrasts(Recipes$recipe)
```

Each contrast is a column

- for contrast B, B gets a value of 1
- for contrast C, C gets a value of 1

# Sum to Zero Contrasts

We can also use different contrasts:

```{r ContrastsSum, echo=TRUE, comment=NA}
Recipes$recipeS <- C(Recipes$recipe, contr = contr.sum)
contrasts(Recipes$recipeS)
```

- harder to interpret
- but can be more flexible if you have specific hypotheses

# Sum to Zero Contrasts

```{r ANOVASum, echo=TRUE, comment=NA}
mod.anS <- lm(angle~recipeS, data=Recipes)
print(summary(mod.anS)$coefficients, digits=3)
```

# No intercept

We can also "cheat" by fitting a model with no intercept, by putting "-1" on the right hand side of the model:

```{r ANOVANoI, echo=TRUE, comment=NA}
mod.NoI <- lm(angle~recipe - 1, data=Recipes)
print(summary(mod.NoI)$coefficients, digits=3)
```
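# No intercept

With a single factor and no intercept, each coefficient is simply the mean angle for that recipe; a quick check (added here, not in the original slides):

```{r GroupMeans, echo=TRUE, comment=NA}
# group means of the breaking angle: these should equal the no-intercept coefficients
aggregate(angle ~ recipe, data = Recipes, FUN = mean)
```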
# Two categorical Variables

We do not have to consider only one variable: just like multiple regression, we can consider several. E.g. with the cakes, temperature was also a factor that was controlled

```{r CakesTwoWayData, echo=TRUE, comment=NA}
# use temperatures of 175C & 225C
Cakes2 <- cake[cake$temp %in% c("175", "225"),]
```

# Two categorical Variables

```{r PlotCakesTwoWayData, echo=TRUE, comment=NA, fig.height=5}
par(mar=c(4.1,6,1,1), cex=1.5)
boxplot(Cakes2$angle~Cakes2$recipe + Cakes2$temp, horizontal=TRUE, las=1)
```

# Two categorical Variables

The contrasts for the temperature can be written in the same way

```{r TwoWayDataTempContrast, echo=TRUE, comment=NA}
Cakes2$tempF <- factor(Cakes2$temp)
contrasts(Cakes2$tempF)
```

# Two categorical Variables

The design matrix is now like this:

```{r DesignMatrixTwoWayData, echo=TRUE, comment=NA}
head(model.matrix(~recipe + tempF, data=Cakes2))
```

We now just have an extra column for temperature

# Two categorical Variables: Fitting the model

We can fit the model, just like last time!

```{r FitTwoWayModel, echo=TRUE, comment=NA}
mod.2way <- lm(angle~recipe + tempF, data=Cakes2)
print(summary(mod.2way)$coefficients, digits=2)
```

Little effect of recipe, big effect of temperature

- a higher temperature means the cake bends more before breaking

# Why the intercept?

What happens if we remove the intercept?

```{r FitTwoWayModelNoInt, echo=TRUE, comment=NA}
mod.2way.NoInt <- lm(angle~recipe + tempF-1, data=Cakes2)
print(summary(mod.2way.NoInt)$coefficients, digits=2)
```

We get 3 levels of recipe, but still only one of tempF

# Why the intercept?

The problem is that we can only remove the intercept for one factor, otherwise there are too many moving parts.

```{r DesignMatrix2wayNoInt, echo=TRUE, comment=NA}
head(model.matrix(~recipe + tempF-1, data=Cakes2))
```

If we had a 175°C effect, we could subtract it from all of the recipe effects and add it to the 225°C effect & get the same model

# Next Week

More complicated models: interactions

- when the recipe effect depends on temperature

# Problem 2

We can simulate some bad data and look at what happens. To simulate some "good" data, we can do this:

```{R GoodData, echo=TRUE}
# define the parameters
alpha <- 5; beta <- 5; sigma <- 1 # standard deviation
# 100 samples from a uniform distribution between 0 and 1
x <- runif(100, 0, 1) # random uniform distribution
# calculate E(y) using the defined values of alpha & beta,
# and the simulated values of x
mu <- alpha + beta*x
# simulate y, with mean mu and standard deviation sigma
y <- rnorm(length(mu), mu, sigma)
```

# Problem 2

Simulate the 'bad' data

```{R SimBadData, echo=TRUE}
# the true relationship is quadratic in x
mu.bad <- alpha + beta*x^2
y.bad <- rnorm(length(mu.bad), mu.bad, sigma)
```

# Problem 2

Fit the straight-line model to the bad data

```{R FitBadData, echo=TRUE, fig.height=4}
mod.bad <- lm(y.bad~x)
plot(x, y.bad)
abline(mod.bad) # add the fitted line
```

This doesn't look too bad.

# Problem 2: Residuals

```{R BadResiduals, fig.height=3}
par(mfrow=c(1,3))
plot(fitted(mod.bad), resid(mod.bad), main="Residuals vs Fitted")
qqnorm(resid(mod.bad), main="Normal Probability Plot"); qqline(resid(mod.bad))
plot(fitted(mod.bad), cooks.distance(mod.bad), main="Cook's D")
```

The residuals look curved. The QQ-plot looks fine; Cook's D is OK.
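# Problem 2: Fixing the Model

Since the 'bad' data were simulated with a quadratic term, one remedy for the curvature (a sketch going slightly beyond the exercise) is to add $x^2$ to the model:

```{R FitQuadratic, echo=TRUE, fig.height=3}
# refit with a quadratic term and re-check the residuals
mod.quad <- lm(y.bad ~ x + I(x^2))
plot(fitted(mod.quad), resid(mod.quad), main="Residuals vs Fitted, quadratic model")
```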