--- title: "Lecture 8: Categorical Variables" author: "Bob O'Hara" date: "bob.ohara@ntnu.no" output: beamer_presentation --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` ## Before we start... Reference Group # Exercise 1 and R # Problem 1: Bird Brains We want to look at the relationship between the log of body size and log of brain size. Adapt the code above to answer the following questions: - First, read in the data ```{r GetBirdBrainsData, fig.height=4, echo=TRUE} BirdBrains <- read.csv("../Data/BirdBrains.csv") plot(BirdBrains$logBodyMass, BirdBrains$logBrainMass) ``` # Problem 1: Bird Brains ### What is the relationship between log of body size and log of brain size? What is the effect of increasing body size? Does brain size increase proportionally (e.g. if you double body size, does brain size double?)? - fit a model ```{r FitBirdBrainModel, echo=TRUE} brains.mod <- lm(logBrainMass~logBodyMass, data=BirdBrains) # just show the coefficients round(summary(brains.mod)$coefficients, 3) ``` # Problem 1: Bird Brains ### What is the relationship between log of body size and log of brain size? What is the effect of increasing body size? Does brain size increase proportionally (e.g. if you double body size, does brain size double?)? - the relationship is positive: a larger body size means a larger brain size. - if the brain size doubles, logBrainSize increases by $\log(2) = `r round(log(2), 2)`$. - So body mass increases by $`r round(coef(brains.mod)[2], 2)` \times \log(2) = `r round(log(2)*coef(brains.mod)[2], 2)`$. This is less than $\log(2)$, so it is less than proportional # Problem 1: Bird Brains 1. What is the relationship between log of body size and log of brain size? What is the effect of increasing body size? Does brain size increase proportionally (e.g. if you double body size, does brain size double?)? - We can see that it is less than proportional by plotting the original data ```{r PlotBirdBrainsData, fig.height=4, echo=TRUE} plot(BirdBrains$Body.mass, BirdBrains$Brain.mass) abline(lm(Brain.mass~Body.mass, data=BirdBrains)) ``` # Model Fit ### How good is the model fit? Does it look like there any problems with the fit, e.g. outliers, curavture in the data? First, from the summary, we can see that the $R^2$ is `r round(100*summary(brains.mod)$r.square)`%, which means that the model is really good: only `r round(100*(1-summary(brains.mod)$r.square))`% of the variance is unexplained ```{r SummaryBrainsModel, comment=NA} summ2 <- paste(capture.output(print(summary(brains.mod), digits=2)), "\n") cat(summ2[16:18]) ``` # Model Fit ### How good is the model fit? Does it look like there any problems with the fit, e.g. outliers, curavture in the data? Now some residual plots ```{r ResidPlots, echo=TRUE, fig.height=4} plot(fitted(brains.mod), resid(brains.mod)) ``` This looks OK: it's a blob. There doesn't seem to be any curvature. # Model Fit ### How good is the model fit? Does it look like there any problems with the fit, e.g. outliers, curavture in the data? Normal Probability plots ```{r QQPlots, echo=TRUE, fig.height=4} qqnorm(resid(brains.mod)) qqline(resid(brains.mod)) ``` Looks largely OK, but tails might be a bit thick # Model Fit ### For the parrots (Order Psitacciformes), predict what their brain size would be if they were normal birds. Compare the predicted and real values - does the distribution of their brain sizes look the same as other birds? 
# Predict Parrots

### For the parrots (Order Psitacciformes), predict what their brain size would be if they were normal birds. Compare the predicted and real values - does the distribution of their brain sizes look the same as other birds?

First, extract the parrots and make the predictions

```{r PredictParrots, echo=TRUE, fig.height=4}
BirdBrains$IsParrot <- BirdBrains$Order=="Psitacciformes"
ParrotPred <- BirdBrains[BirdBrains$IsParrot,]
p.pred <- predict(brains.mod, newdata = ParrotPred, interval = "prediction")
# You do not need to worry about this next line!
rownames(p.pred) <- sapply(ParrotPred$Species.name., function(str) gsub('^.* ', paste0(substr(str, 1, 1), ". "), str))
```

# Compare Parrots

### For the parrots (Order Psitacciformes), predict what their brain size would be if they were normal birds. Compare the predicted and real values - does the distribution of their brain sizes look the same as other birds?

We should compare the real & predicted values

```{r ShowPredictParrots, echo=TRUE, comment=NA}
print(cbind(ParrotPred$logBrainMass, p.pred), digits=2)
```

# Plot Parrots

Better to plot. A few ways to do this

```{r PlotPredictParrots, echo=TRUE, fig.height=2, fig.width=3}
par(mar=c(4.1,4.1,0.1,1), cex=0.8)
plot(p.pred[,"fit"], ParrotPred$logBrainMass, xlim=range(p.pred), xlab="Fitted", ylab="Actual")
abline(0,1) # add 1:1 line
```

# Parrot Differences

### For the parrots (Order Psitacciformes), predict what their brain size would be if they were normal birds. Compare the predicted and real values - does the distribution of their brain sizes look the same as other birds?

Or calculate differences:

```{r SubtractPredictParrots, echo=TRUE, comment=NA}
round(ParrotPred$logBrainMass[1:3] - p.pred[1:3,"fit"], 2)
round(ParrotPred$logBrainMass[4:6] - p.pred[4:6,"fit"], 2)
```

All positive, so the real values are greater than the predictions: parrots have bigger brains than expected, compared to other birds.

# Categorical Variables, aka Factors

```{r GetBSData}
Data.all <- read.csv("../Data/EggBodySize.csv")
Data <- Data.all[!apply(Data.all, 1, function(x) any(is.na(x))),
                 c("species", "development.category", "Egg.mass..g.", "female.mass..g.", "mm..n..mean.femur")]
Data$logEggMass <- log(Data$Egg.mass..g.)
Data$logFemaleMass <- log(Data$female.mass..g.)
```

So far we have dealt with continuous variables. But not everything is continuous.

# Design of Experiments

A lot of the theory was developed for designed experiments

- field trials at Rothamsted
- lab studies
- clinical trials

# Examples

For historical reasons, this is also called "Analysis of Variance"

- Testing different crop varieties
- Testing effects of drugs/poisons

# Real Data

Effects of sugar and chocolate temperature on breaking chocolate cake

- Bake a chocolate cake
- hold 1 end
- lift other end
- measure angle at which it breaks
- repeat until you get your Master's degree

# One categorical Variable, 2 levels

```{r GetCake, results='asis', echo=FALSE, message=FALSE}
library(lme4)
data("cake")
Sugar <- cake[cake$recipe!="B" & cake$temp=="225",] # remove recipe B & use temperature of 225C
Sugar$treatment <- c("Control", "Treatment")[1+grepl("A", Sugar$recipe)]
```

Use data for 2 recipes: "normal" (control), and more sugar. For now only use 1 temperature

```{r SimData2lvls, fig.height=3}
par(mar=c(4.1,6,1,1), cex=1.5)
boxplot(angle~treatment, horizontal=TRUE, las=1, data=Sugar)
```

"Easy" example: control and treatment - same as a t-test (a quick check is on the next slide)

Question: does the treatment (i.e. adding sugar) have an effect?

- does the cake break more easily?
- is it different from the control?
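# One categorical Variable, 2 levels

Because the comparison is the same as a t-test, one way to check this (a sketch added here, not part of the original slides) is to run the t-test directly:

```{r TTestCheck, echo=TRUE, comment=NA}
# classic two-sample t-test assuming equal variances: the t statistic and
# p-value should match the regression fitted later (the sign of the mean
# difference depends on which group is listed first)
t.test(angle ~ treatment, data = Sugar, var.equal = TRUE)
```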
# Experimental Design

We assume that individual units are assigned at random to Control and Treatment

(the full experiment has cakes baked on different days, which we will ignore, and temperature, which we will include later)

# The Model

We assume that the responses in the control and treatment groups are normally distributed, and the only difference is their mean:

$$
\begin{aligned}
y_{i,control} &= \mu_{control} + \varepsilon_i \\
y_{i,treatment} &= \mu_{treatment} + \varepsilon_i \\
\varepsilon_i &\sim N(0, \sigma^2)
\end{aligned}
$$

We can write this like this:

$$
\begin{aligned}
y_{i} &= \alpha + \beta_j + \varepsilon_i \\
\varepsilon_i &\sim N(0, \sigma^2)
\end{aligned}
$$

where $\beta_j$ is the difference between the treatment and control means (we can set $\beta_1=0$ for the control), and $\alpha$ is an intercept: the control mean. From this, $E(y_i) = \alpha + \beta_j$

# Regression

We could also use regression

- set $X = 0$ for the control
- set $X = 1$ for the treatment ($X$ is called an 'indicator variable')

The coefficient is the change in the response when $X$ changes by 1, i.e. from control to treatment

The intercept is the mean for the control

# Regression

```{r RegressANOVA, echo=TRUE}
Sugar$X <- as.numeric(Sugar$treatment=="Treatment")
mod.reg <- lm(angle~X, data=Sugar)
print(summary(mod.reg)$coefficients, digits=3)
```

So the difference is `r round(coef(mod.reg)["X"], 2)`, with a standard error of `r round(summary(mod.reg)$coefficients["X","Std. Error"], 2)`.

# Regression

But why put the intercept on the control? Why not put it mid-way between the Control & Treatment, for example?

```{r RegressANOVA2, echo=TRUE}
Sugar$X.c <- as.numeric(Sugar$treatment=="Treatment")-0.5
mod.reg2 <- lm(angle~X.c, data=Sugar)
print(summary(mod.reg2)$coefficients, digits=3)
```

The difference is the same, but the intercept is now at `r round(coef(mod.reg2)['(Intercept)'], 2)`

# Regression

```{r PlotRegressANOVA}
par(mfrow=c(1,2))
plot(Sugar$X, Sugar$angle, main = "Intercept at Control")
abline(mod.reg, col=2)
abline(v=0, col=1, lty=3)
plot(Sugar$X.c, Sugar$angle, main = "Centred Intercept")
abline(mod.reg2, col=2)
abline(v=0, col=1, lty=3)
```

# More than 2 levels

With Control & Treatment we have 2 levels, but what if we have more?

```{r SimData3levels, fig.height=4, echo=TRUE}
Recipes <- cake[cake$temp=="225",] # use temp. of 225C
par(mar=c(4.1,6,1,1), cex=1.5)
boxplot(Recipes$angle~Recipes$recipe, horizontal=TRUE, las=1)
```

# More than 2 levels

We can use the same regression trick

```{r ShowData3levels, fig.height=5, echo=TRUE, comment=NA}
Recipes$HotChoc <- as.numeric(Recipes$recipe=="B") # hotter chocolate
Recipes$Sugar <- as.numeric(Recipes$recipe=="C") # more sugar
head(Recipes[,c("recipe", "HotChoc", "Sugar", "angle")])
```

# Drawing 3 levels

```{r DrawData3levels, fig.height=5}
library(scatterplot3d)
scatterplot3d(x=Recipes$HotChoc, y=Recipes$Sugar, z=Recipes$angle, bg=2, cex.axis = 1.4,
              las=1, color=4, grid = FALSE, lab=c(1,1,2), asp=0.01, angle=55, pch=1, cex.symbols=2)
```

Nothing can be Treatment 1 _and_ Treatment 2
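# More than 2 levels

A quick tabulation (an added check, not in the original slides) confirms that no cake has both indicators set at once:

```{r CheckIndicators, echo=TRUE, comment=NA}
# cross-tabulate the two dummy variables: the count for HotChoc=1 & Sugar=1 is zero
table(HotChoc = Recipes$HotChoc, Sugar = Recipes$Sugar)
```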
# More than 2 levels

Fit the model...

```{r FitData3levels, fig.height=5, echo=TRUE, comment=NA}
mod.3lvl <- lm(angle ~ HotChoc + Sugar, data=Recipes)
print(summary(mod.3lvl)$coefficients, digits=3)
```

We now have effects for the differences between hotter chocolate and the control, and between more sugar and the control

# Changing the Intercept

Make Recipe B the intercept

```{r FitData3levelsX1, fig.height=5, echo=TRUE}
Recipes$ColdChoc <- as.numeric(Recipes$recipe=="A") # colder chocolate
Recipes$Sugar <- as.numeric(Recipes$recipe=="C") # more sugar
mod.3lvl.1 <- lm(angle ~ ColdChoc + Sugar, data=Recipes)
print(summary(mod.3lvl.1)$coefficients, digits=3)
```

The estimate for chocolate temperature was `r round(coef(mod.3lvl)["HotChoc"], 2)`, now it is `r round(coef(mod.3lvl.1)["ColdChoc"], 2)`

The estimate for Sugar before was `r round(coef(mod.3lvl)["Sugar"], 2)`, now we have `r round(coef(mod.3lvl.1)["Sugar"], 2)`

- now it is a contrast between "more sugar" (recipe C) and "hotter chocolate" (recipe B)

# The Design Matrix Reloaded

Last time I introduced multiple regression as

$$
\mathbf{Y} = X \mathbf{\beta} + \mathbf{\varepsilon}
$$

where $X$ is the design matrix

Today I have been building the design matrix with 0's and 1's

```{r ShowDesignMatrix, echo=FALSE, comment=NA}
head(Recipes[,c("recipe", "HotChoc", "Sugar")])
```

# The Design Matrix Reloaded

Building the design matrix by hand is a pain, so we get the computer to do it.

In the data, recipe is coded as a factor (i.e. categorical):

```{r FactorRecipe, echo=TRUE, comment=NA}
str(Recipes$recipe)
```

with different levels

```{r FactorRecipeLevels, echo=TRUE, comment=NA}
levels(Recipes$recipe)
```

# The Design Matrix Reloaded

All this means the computer knows what to do

```{r DesignMatrixOneWay, echo=TRUE, comment=NA}
DesignMatrix <- model.matrix(~recipe, data=Recipes)
head(DesignMatrix)
```

# The Design Matrix Reloaded

We can thus just fit the model without worrying

```{r ANOVA1, echo=TRUE, comment=NA}
mod.an <- lm(angle~recipe, data=Recipes)
print(summary(mod.an)$coefficients, digits=3)
```

...except about how to interpret the parameter estimates

# The Design Matrix Contrasted

The coefficients are the same as before:

```{r ANOVACoef, echo=TRUE, comment=NA}
coef(mod.an)
```

```{r ANOVAReg, echo=TRUE, comment=NA}
coef(mod.3lvl)
```

# Contrasts

By default, R sets the first level of a factor as the intercept, and contrasts the other levels with that first level

- this makes sense here, but not if (for example) we are comparing 6 varieties of barley

We can see the contrasts R uses:

```{r TrtContrasts, echo=TRUE, comment=NA}
contrasts(Recipes$recipe)
```

Each contrast is a column

- for contrast B, B gets a value of 1
- for contrast C, C gets a value of 1

# Sum to Zero Contrasts

We can also use different contrasts:

```{r ContrastsSum, echo=TRUE, comment=NA}
Recipes$recipeS <- C(Recipes$recipe, contr = contr.sum)
contrasts(Recipes$recipeS)
```

- harder to interpret
- but can be more flexible if you have specific hypotheses

# Sum to Zero Contrasts

```{r ANOVASum, echo=TRUE, comment=NA}
mod.anS <- lm(angle~recipeS, data=Recipes)
print(summary(mod.anS)$coefficients, digits=3)
```

# No intercept

We can also "cheat" by fitting a model with no intercept, by putting "-1" on the right hand side of the model:

```{r ANOVANoI, echo=TRUE, comment=NA}
mod.NoI <- lm(angle~recipe - 1, data=Recipes)
print(summary(mod.NoI)$coefficients, digits=3)
```
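# No intercept

With a single factor and no intercept, each coefficient is simply the mean angle for that recipe; a quick check (added here, not in the original slides):

```{r GroupMeans, echo=TRUE, comment=NA}
# group means of the breaking angle: these should equal the no-intercept coefficients
aggregate(angle ~ recipe, data = Recipes, FUN = mean)
```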
# Two categorical Variables

We do not have to consider only one variable: just like multiple regression, we can consider several. E.g. with the cakes, temperature was also a factor that was controlled

```{r CakesTwoWayData, echo=TRUE, comment=NA}
# use temperatures of 175C & 225C
Cakes2 <- cake[cake$temp %in% c("175", "225"),]
```

# Two categorical Variables

```{r PlotCakesTwoWayData, echo=TRUE, comment=NA, fig.height=5}
par(mar=c(4.1,6,1,1), cex=1.5)
boxplot(Cakes2$angle~Cakes2$recipe + Cakes2$temp, horizontal=TRUE, las=1)
```

# Two categorical Variables

The contrasts for the temperature can be written in the same way

```{r TwoWayDataTempContrast, echo=TRUE, comment=NA}
Cakes2$tempF <- factor(Cakes2$temp)
contrasts(Cakes2$tempF)
```

# Two categorical Variables

The design matrix is now like this:

```{r DesignMatrixTwoWayData, echo=TRUE, comment=NA}
head(model.matrix(~recipe + tempF, data=Cakes2))
```

We now just have an extra column for temperature

# Two categorical Variables: Fitting the model

We can fit the model, just like last time!

```{r FitTwoWayModel, echo=TRUE, comment=NA}
mod.2way <- lm(angle~recipe + tempF, data=Cakes2)
print(summary(mod.2way)$coefficients, digits=2)
```

Little effect of recipe, big effect of temperature

- a higher temperature means the cake bends more before breaking

# Why the intercept?

What happens if we remove the intercept?

```{r FitTwoWayModelNoInt, echo=TRUE, comment=NA}
mod.2way.NoInt <- lm(angle~recipe + tempF-1, data=Cakes2)
print(summary(mod.2way.NoInt)$coefficients, digits=2)
```

We get 3 levels of recipe, but still only one of tempF

# Why the intercept?

The problem is that we can only remove the intercept for one factor, otherwise there are too many moving parts.

```{r DesignMatrix2wayNoInt, echo=TRUE, comment=NA}
head(model.matrix(~recipe + tempF-1, data=Cakes2))
```

If we had a 175°C effect, we could subtract it from all of the recipe effects and add it to the 225°C effect & get the same model

# Next Week

More complicated models: interactions

- when the recipe effect depends on temperature

# Problem 2

We can simulate some bad data and look at what happens. To simulate some "good" data, we can do this:

```{R GoodData, echo=TRUE}
# define the parameters
alpha <- 5; beta <- 5; sigma <- 1 # standard deviation
# 100 samples from a uniform distribution between 0 and 1
x <- runif(100, 0, 1) # random uniform distribution
# calculate E(y) using the defined values of alpha & beta,
# and the simulated values of x
mu <- alpha + beta*x
# simulate y, with mean mu and standard deviation sigma
y <- rnorm(length(mu), mu, sigma)
```

# Problem 2

Simulate the 'bad' data

```{R SimBadData, echo=TRUE}
# the true relationship is quadratic in x
mu.bad <- alpha + beta*x^2
y.bad <- rnorm(length(mu.bad), mu.bad, sigma)
```

# Problem 2

Fit the straight-line model to the bad data

```{R FitBadData, echo=TRUE, fig.height=4}
mod.bad <- lm(y.bad~x)
plot(x, y.bad)
abline(mod.bad) # add the fitted line
```

This doesn't look too bad.

# Problem 2: Residuals

```{R BadResiduals, fig.height=3}
par(mfrow=c(1,3))
plot(fitted(mod.bad), resid(mod.bad), main="Residuals vs Fitted")
qqnorm(resid(mod.bad), main="Normal Probability Plot"); qqline(resid(mod.bad))
plot(fitted(mod.bad), cooks.distance(mod.bad), main="Cook's D")
```

The residuals look curved. The QQ-plot looks fine; Cook's D is OK.
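# Problem 2: Fixing the Model

Since the 'bad' data were simulated with a quadratic term, one remedy for the curvature (a sketch going slightly beyond the exercise) is to add $x^2$ to the model:

```{R FitQuadratic, echo=TRUE, fig.height=3}
# refit with a quadratic term and re-check the residuals
mod.quad <- lm(y.bad ~ x + I(x^2))
plot(fitted(mod.quad), resid(mod.quad), main="Residuals vs Fitted, quadratic model")
```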