(Lastest changes: 26.08.2018. Theory for w1 is up to date. Missing: QC of ILw1.)
Jump to IL for first week
Jump to IL for second week
Construct a model to help understand the relationship between a response and one or several explanatory variables. [Correlation, or cause and effect?]
Construct a model to predict the reponse from a set of (one or several) explanatory variables. [More or less “black box”]
Munich, 1999: 3082 observations on 9 variables.
rent: the net rent per month (in Euro).rentsqm: the net rent per month per square meter (in Euro).area: Living area in square meters.yearc: year of construction.location: quality of location: a factor indicating whether the location is average location, 1, good location, 2, and top location, 3.bath: quality of bathroom: a a factor indicating whether the bath facilities are standard, 0, or premium, 1.kitchen: Quality of kitchen: 0 standard 1 premium.cheating: central heating: a factor 0 without central heating, 1 with central heating.district: District in Munich.More information in Fahrmeir et. al., (2013) page 5.
library("gamlss.data")
library(GGally)
ggpairs(rent99,lower = list(combo = wrap(ggally_facethist, binwidth = 0.5)))
Interesting questions
rent and area?rent?area the same on rent for appartments at average, good and top location? (interaction)\({\bf Y}: (n \times 1)\) vector of responses (random variable) [e.g. one of the following: rent, rent pr sqm, weight of baby, ph of lake, volume of tree]
\({\bf X}: (n \times p)\) design matrix [e.g. location of flat, gestation age of baby, chemical measurement of the lake, height of tree]
\({\bf \beta}: (p \times 1)\) vector of regression parameters (intercept included, so \(p=k+1\))
\({\bf \varepsilon}: (n\times 1)\) vector of random errors. Used in “traditional way”.
We assume that pairs \(({\bf x}_i^T,y_i)\) (\(i=1,...,n\)) are measured from sampling units. That is, the observation pair \(({\bf x}_1^T,y_1)\) is independent from \(({\bf x}_2^T,y_2)\), and so on.
Study the print-out and discuss the following questions:
model.matrix are. (Hint: coding of location)## Answers
## a) rent or rentsqm for response and all other as covariates.
## b) rent, rentsqm (continuous).
## c) area (continuous), yearc (ordinal or continuous), location (categorical - ordinal), bath, kichen, cheating (categorical, nominal?), district (categorical - nominal).
## d) our design matrix, contains numerical values for the continuous covariates, and dummy variable coding for categorical covariates (more about this later).
library("gamlss.data")
ds=rent99
colnames(ds)
summary(ds)
dim(ds)
head(ds)
str(ds$location)
contrasts(ds$location)
X=model.matrix(rentsqm~area+yearc+location+bath+kitchen+cheating+district,data=ds)
head(X)
## [1] "rent" "rentsqm" "area" "yearc" "location" "bath"
## [7] "kitchen" "cheating" "district"
## rent rentsqm area yearc
## Min. : 40.51 Min. : 0.4158 Min. : 20.00 Min. :1918
## 1st Qu.: 322.03 1st Qu.: 5.2610 1st Qu.: 51.00 1st Qu.:1939
## Median : 426.97 Median : 6.9802 Median : 65.00 Median :1959
## Mean : 459.44 Mean : 7.1113 Mean : 67.37 Mean :1956
## 3rd Qu.: 559.36 3rd Qu.: 8.8408 3rd Qu.: 81.00 3rd Qu.:1972
## Max. :1843.38 Max. :17.7216 Max. :160.00 Max. :1997
## location bath kitchen cheating district
## 1:1794 0:2891 0:2951 0: 321 Min. : 113
## 2:1210 1: 191 1: 131 1:2761 1st Qu.: 561
## 3: 78 Median :1025
## Mean :1170
## 3rd Qu.:1714
## Max. :2529
## [1] 3082 9
## rent rentsqm area yearc location bath kitchen cheating district
## 1 109.9487 4.228797 26 1918 2 0 0 0 916
## 2 243.2820 8.688646 28 1918 2 0 0 1 813
## 3 261.6410 8.721369 30 1918 1 0 0 1 611
## 4 106.4103 3.547009 30 1918 2 0 0 0 2025
## 5 133.3846 4.446154 30 1918 2 0 0 1 561
## 6 339.0256 11.300851 30 1918 2 0 0 1 541
## Factor w/ 3 levels "1","2","3": 2 2 1 2 2 2 1 1 1 2 ...
## 2 3
## 1 0 0
## 2 1 0
## 3 0 1
## (Intercept) area yearc location2 location3 bath1 kitchen1 cheating1
## 1 1 26 1918 1 0 0 0 0
## 2 1 28 1918 1 0 0 0 1
## 3 1 30 1918 0 0 0 0 1
## 4 1 30 1918 1 0 0 0 0
## 5 1 30 1918 1 0 0 0 1
## 6 1 30 1918 1 0 0 0 1
## district
## 1 916
## 2 813
## 3 611
## 4 2025
## 5 561
## 6 541
\[{\bf Y=X \beta}+{\bf\varepsilon}\] is called a classical linear model if the following is true:
\(\text{E}(\bf{\varepsilon})=\bf{0}\).
\(\text{Cov}(\bf{\varepsilon})=\text{E}(\bf{\varepsilon}\bf{\varepsilon}^T)=\sigma^2\bf{I}\).
The design matrix has full rank, \(\text{rank}({\bf X})=k+1=p\).
The classical normal linear regression model is obtained if additionally
For random covariates these assumptions are to be understood conditionally on \(\bf{X}\).
Independent pairs \((Y_i, {\bf x}_i)\) for \(i=1,\ldots,n\).
In multiple linear regression there are two popular methods for estimating the regression parameters in \(\beta\): maximum likelihood and least squares. These two methods give the same estimator when we assume the normal linear regression model. We will in this module focus on maximum likelihood estimation, since that can be used also when we have non-normal responses (modules 3-6: binomial, Poisson, gamma, multinomial).
We assume that pairs of covariates and response are measured independently of each other: \(({\bf x}_i,Y_i)\), and \(Y_i\) follows the distribution specified above, and \({\bf x}_i\) is fixed.
\[L(\beta)=\prod_{i=1}^n L_i(\beta)=\prod_{i=1}^n f(y_i; \beta)\] Q: fill in with the normal density for \(f\) and the multiple linear regression model.
The log-likelihood is just the natural log of the likelihood, and we work with the log-likelihood because this makes the mathematics simpler - since we work with exponential families. The main aim with the likelihood is to maximize it to find the maximum likelihood estimate, and since the log is a monotone function the maximum of the log-likelihood will be in the same place as the maximum of the likelihood.
\[ l(\beta)=\ln L(\beta)=\sum_{i=1}^n \ln L_i(\beta)=\sum_{i=1}^n l_i(\beta)\\ \] Observe that the log-likelihood is a sum of invidual contributions for each observation pair \(i\).
Q: fill in with the normal density for \(f\) and the multiple linear regression model.
H¨{a}rdle and Simes (2015), page 65.
Rule 1: \[\frac{\partial}{\partial \boldsymbol{\beta}}({\bf d}^T\boldsymbol{\beta})=\frac{\partial}{\partial \boldsymbol{\beta}}(\sum_{j=1}^p d_j \beta_j)={\bf d}\] Rule 2: \[\frac{\partial}{\partial \boldsymbol{\beta}}(\boldsymbol{\beta}^T{\bf D}\boldsymbol{\beta})=\frac{\partial}{\partial \boldsymbol{\beta}}(\sum_{j=1}^p \sum_{k=1}^p \beta_j d_{jk} \beta_k)=2{\bf D}\boldsymbol{\beta}\]
Rule 3: The Hessian of the quadratic form \(\boldsymbol{\beta}^T{\bf D}\boldsymbol{\beta}\) is
\[\frac{\partial^2 \boldsymbol{\beta}^T{\bf D}\boldsymbol{\beta}}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^T}= 2{\bf D}\]
The score function is a \(p\times 1\) vector, \(s(\beta)\), with the partial derivatives of the log-likelihood with respect to the \(p\) elements of the \(\beta\) vector.
\[s(\beta)=\frac{\partial l(\beta)}{\partial \beta}= \sum_{i=1}^n \frac{\partial l_i(\beta)}{\partial \beta}= \sum_{i=1}^n s_i(\beta)\]
Again, observe that the score function is a sum of individual contributions for each observation pair \(i\).
Q: fill in for the multiple linear regression model.
To find the maximum likelihood estimate \(\hat{\beta}\) we solve the set of \(p\) equations: \[s(\hat{\beta})=0\]
Q: fill in for the multiple linear regression model. Specify what the normal equations are.
For the normal linear regression model, these equations \(s(\hat{\beta})=0\) have a solution to be written on closed form.
\[ \hat\beta=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\]
Q: How can you see that least squares and ML gives the same estimator?
But, for other distribution than the normal we get a set of non-linear equations when we look at \(s(\hat{\beta})=0\), and then we will use the Newton-Raphson or Fisher Scoring iterative methods.
Observed Fisher information matrix \(H(\beta)\)
\[H(\beta) = -\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T} = -\frac{\partial s(\beta)}{\partial\beta^T}\]
so this is minus the Hessian of the loglikelihood. \(H(\beta)\) may be considered as a local measure of information that the likelihood contains. The higher the curvature of the log-likelihood near its maximum the more information is providd by the likelihood about the unknown parameter. Since we look at minus the Hessian, we have a positive \(H(\beta)\) near the maximum.
Q: Calculate this for the multiple linear regression model. What is the dimension of \(H(\beta)\)?
In addition we also use the expected Fisher information matrix \(F(\beta)\) which we may find in two ways, one is by taking the mean of the observed Fisher information matrix:
\[F(\beta) = E\left( -\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T} \right).\]
Q: Calculate this for the multiple linear regression model. What is the dimension of \(F(\beta)\)?
In Module 3 we need the Fisher information matrix in the Newton-Raphson method, and also to find the (asympotic) covariance matrix of our estimated coefficents \(\hat{\beta}\) - so much more about this then.
Explain what the values under Estimate mean in practice.
fit=lm(rentsqm~area+yearc+location+bath+kitchen+cheating,data=ds)
summary(fit)
##
## Call:
## lm(formula = rentsqm ~ area + yearc + location + bath + kitchen +
## cheating, data = ds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4303 -1.4131 -0.1073 1.3244 8.6452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -45.475484 3.603775 -12.619 < 2e-16 ***
## area -0.032330 0.001648 -19.618 < 2e-16 ***
## yearc 0.026959 0.001846 14.606 < 2e-16 ***
## location2 0.777133 0.076870 10.110 < 2e-16 ***
## location3 1.725068 0.236062 7.308 3.45e-13 ***
## bath1 0.762808 0.157559 4.841 1.35e-06 ***
## kitchen1 1.136908 0.183088 6.210 6.02e-10 ***
## cheating1 1.765261 0.129068 13.677 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.031 on 3074 degrees of freedom
## Multiple R-squared: 0.3065, Adjusted R-squared: 0.3049
## F-statistic: 194.1 on 7 and 3074 DF, p-value: < 2.2e-16
Reproduce the values under Estimate by calculating without the use of lm.
X=model.matrix(rentsqm~area+yearc+location+bath+kitchen+cheating,data=ds)
Y=ds$rentsqm
betahat=solve(t(X)%*%X)%*%t(X)%*%Y
#betahat-fit$coefficients
print(betahat)
## [,1]
## (Intercept) -45.47548356
## area -0.03233033
## yearc 0.02695857
## location2 0.77713297
## location3 1.72506792
## bath1 0.76280784
## kitchen1 1.13690814
## cheating1 1.76526110
(Optional - known from TMA4267)
First, we define predictions as \(\hat{\bf Y}={\bf X}\hat{\beta}\), and inserted the ML (and LS) estimate we get \(\hat{\bf Y}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf Y}\).
We define the projection matrix \[ {\bf H}={\bf X}({\bf X}^T{\bf X})^{-1} {\bf X}^T\] called the hat matrix. This simplifies the notation for the predictions, \[\hat{\bf Y}={\bf H}{\bf Y}\] so the hat matrix is putting the hat on the response \({\bf Y}\).
In addition we define residuals as \[\begin{align*} \hat{\varepsilon}&={\bf Y}-\hat{\bf Y} \\ \hat{\varepsilon}&={\bf Y}-{\bf HY}=({\bf I-H}){\bf Y}\\ \end{align*}\]so we have a second projection matrix \[ {\bf I-H}={\bf I}-{\bf X}({\bf X}^T{\bf X})^{-1} {\bf X}^T \]
(Optional - known from TMA4267)
Nice graphical presentations: Putanen, Styan and Isotalo: Matrix Tricks for Linear Statistical Models: Our Personal Top Twenty, Figure 8.3.
\[ \hat{\sigma}^2=\frac{1}{n-p}({\bf Y}-{\bf X}{\hat{\beta}})^T({\bf Y}-{\bf X}{\bf \hat{\beta}})=\frac{\text{SSE}}{n-p}\]
In the generalized linear models setting (remember exponential family from Module 1) we will look at the parameter \(\sigma^2\) as a nuisance parameter = parameter that is not of interest to us. Our our focus will be on the parameters of interest - which will be related to the mean of the response, which is modelled using our covariate - so the regression parameters \(\beta\) are therefore our prime focus.
However, to perform inference we need an estimator for \(\sigma^2\).
The maximum likelihood estimator for \(\sigma^2\) is \(\frac{\text{SSE}}{n}\), which is found from maximizing the likelihood inserted our estimate of \(\hat{\beta}\) \[L(\hat{\beta},{\sigma^2})=(\frac{1}{2\pi})^{n/2}(\frac{1}{\sigma^2})^{n/2}\exp(-\frac{1}{2\sigma^2} ({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta}))\] \[\begin{align*} l(\hat{\beta},{\sigma^2})&=\text{ln}(L(\hat{\beta},{\sigma^2}))\\ &=-\frac{n}{2}\text{ln}(2\pi)-\frac{n}{2}\text{ln}\sigma^2-\frac{1}{2\sigma^2} ({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta})\\ \end{align*}\]The score vector with respect to \(\sigma^2\) is \[\frac{\partial l}{\partial \sigma^2}=0-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta}) \] Solving \(\frac{\partial l}{\partial \sigma^2}=0\) gives us the estimator \[ \frac{1}{n}({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n}\] But, this estimator is biased.
To prove this you may use the trace-formula, that is \(\text{E}({\bf Y}^T {\bf A}{\bf Y})=\text{tr}({\bf A}\text{Cov}({\bf Y}))+\text{E}({\bf Y})^T{\bf A}\text{E}({\bf Y})\), and we use that \(\text{SSE}={\bf Y}^T ({\bf I}-{\bf H}){\bf Y}\). This was done in class notes from TMA4267 - lecture 10
But, the estimator is asympotically unbiased (unbiased when the sample size \(n\) increases to infinity).
When an unbiased version is preferred, it is found using restricted maximum likelihood (REML). We will look into REML-estimation in Module 7. In our case the (unbiased) REML estimate is \[ \hat{\sigma}^2=\frac{1}{n-p}({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n-p}\]
The restricted maximum likelihood estimate is used in lm.
Q: What does it mean that the REML estimate is unbiased? Where is the estimate \(\hat{\sigma}\) in the regression output? (See output from lm for the rent index example.)
## Answer:
## Residual standard error: 2.031 is sigmahat
To be able to do inference (=make confidence intervals, prediction intervals, test hypotheses) we need to know about the properties of our parameter estimators in the (normal) linear model.
Least squares and maximum likelihood estimator for \({\bf \beta}\): \[ \hat{\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\] with \(\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})\).
Restricted maximum likelihood estimator for \({\bf \sigma}^2\): \[ \hat{\sigma}^2=\frac{1}{n-p}({\bf Y}-{\bf X}\hat{\beta})^T({\bf Y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n-p}\] with \(\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}\).
Statistic for inference about \(\beta_j\), \(c_{jj}\) is diagonal element \(j\) of \(({\bf X}^T{\bf X})^{-1}\). \[ T_j=\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\sim t_{n-p}\] This requires that \(\hat{\beta}_j\) and \(\hat{\sigma}\) are independent (see below).
However, when we work with large samples then \(n-p\) becomes large and the \(t\) distribution goes to a normal distribution, so we may use the standard normal in place of the \(t_{n-p}\).
Asymptotically we have: \[\hat{\beta}\sim N_{p}(\beta,\tilde{\sigma}^2({\bf X}^T{\bf X})^{-1})\]. and \[ T_j=\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\tilde{\sigma}}\sim N(0,1)\] where \(\tilde{\sigma}^2=\frac{\text{SSE}}{n}\) (the ML estimator).
Q: Pointing forwards: do you see any connection between the covariance matrix of \(\hat{\boldsymbol{\beta}}\) and the Fisher information?
Can be proven using knowledge from TMA4267 on properties of the multivariate normal distribution.
Independence: Let \({\bf X}_{(p \times 1)}\) be a random vector from \(N_p({\bf \mu},{\bf \Sigma})\). Then \({\bf A}{\bf X}\) and \({\bf B}{\bf X}\) are independent iff \({\bf A}{\bf \Sigma}{\bf B}^T={\bf 0}\).
We have:
\({\bf Y}\sim N_n({\bf X}{\bf \beta},\sigma^2{\bf I})\)
\({\bf AY}=\hat{{\bf \beta}}=({\bf X}^T{\bf X})^{-1}{\bf X}^T {\bf Y}\), and
\({\bf BY}=({\bf I}-{\bf H}){\bf Y}\).
Now \({\bf A}\sigma^2{\bf I}{\bf B}^T=\sigma^2 {\bf A}{\bf B}^T=\sigma^2 ({\bf X}^T{\bf X})^{-1}{\bf X}^T ({\bf I}-{\bf H})={\bf 0}\)
since \({\bf X}({\bf I}-{\bf H})={\bf X}-{\bf HX}={\bf X}-{\bf X}={\bf 0}\).
We conclude that \(\hat{{\bf \beta}}\) is independent of \(({\bf I}-{\bf H}){\bf Y}\),
and, since SSE=function of \(({\bf I}-{\bf H}){\bf Y}\): SSE=\({\bf Y}^T({\bf I}-{\bf H}){\bf Y}\),
then \(\hat{{\bf \beta}}\) and SSE are independent, and the result with \(T_j\) being t-distributed with \(n-p\) degrees of freedom is correct.
Remark: a similar result will exist for GLMs, using the concept of orthogonal parameters.
In the normal linear model we have made the following assumptions.
Linearity of covariates: \({\bf Y}={\bf X}\beta+\varepsilon\). Problem: non-linear relationship?
Homoscedastic error variance: \(\text{Cov}(\varepsilon)=\sigma^2 {\bf I}\). Problem: Non-constant variance of error terms
Uncorrelated errors: \(\text{Cov}(\varepsilon_i,\varepsilon_j)=0\).
Additivity of errors: \({\bf Y}={\bf X}\beta+\varepsilon\)
Assumption of normality: \(\varepsilon \sim N_n({\bf 0},\sigma^2{\bf I})\)
The same assumptions are made when we do things the GLM way for the normal linear model.
In addtion the following might cause problems:
Read this for yourself. You do not need to understand this in detail, but is useful to have a basic idea why we look for a straight line in a QQ-plot. There is one question about this in the ILw1.
Go to separate R Markdown or html document: QQ–plot as Rmd or QQ–plot as html
If we assume the normal linear model then we know that the residuals (\(n\times 1\) vector) \[\hat{\varepsilon}={\bf Y}-{\bf \hat{Y}}=({\bf I}-{\bf H}){\bf Y}\] has a normal (singular) distribution with mean \(\text{E}(\hat{\varepsilon})={\bf 0}\) and covariance matrix \(\text{Cov}(\hat{ \varepsilon})=\sigma^2({\bf I}-{\bf H})\) where \({\bf H}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T\).
This means that the residuals (possibly) have different variance, and may also be correlated.
Our best guess for the error \(\varepsilon\) is the residual vector \(\hat{\varepsilon}\), and we may think of the residuals as predictions of the errors. Be aware: don’t mix errors (the unobserved) with the residuals (“observed”).
But, we see that the residuals are not independent and may have different variance, therefore we will soon define variants of the residuals that we may use to assess model assumptions after a data set is fitted.
Q: How can we say that the residuals can have different variance and may be correlated? Why is that a problem?
We would like to check the model assumptions - we see that they are all connected to the error terms. But, but we have not observed the error terms \(\varepsilon\) so they can not be used for this. However, we have made “predictions” of the errors - our residuals. And, we want to use our residuals to check the model assumptions.
That is, we want to check that our errors are independent, homoscedastic (same variance for each observation), and not dependent on our covariates - and we want to use the residuals (observed) in place of the errors (unobserved). Then it would have been great if the residuals have these properties when the underlying errors have. To amend our problem we need to try to fix the residual so that they at least have equal variances. We do that by working with standardized or studentized residuals.
\[r_i=\frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1-h_{ii}}}\] where \(h_ii\) is the \(i\)th diagonal element of the hat matrix \({\bf H}\).
In R you can get the standardized residuals from an lm-object (named fit) by rstandard(fit).
\[r^*_i=\frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}}\] where \(\hat{\sigma}_{(i)}\) is the estimated error variance in a model with observation number \(i\) omitted. This seems like a lot of work, but it can be shown that it is possible to calculated the studentized residuals directly from the standardized residuals: \[r^*_i=r_i (\frac{n-p-1}{n-p-r_i^2})^{1/2}\]
In R you can get the studentized residuals from an lm-object (named fit) by rstudent(fit).
Some important plots
Dependence of the residuals on the predicted value: wrong regression model?
Nonconstant variance: transformation or weighted least squares is needed?
Plot the residuals, \(r^*_i\), against predictor variable or functions of predictor variables. Trend suggest that transformation of the predictors or more terms are needed in the regression.
Assessing normality of errors: QQ-plots and histograms of residuals. As an additional aid a test for normality can be used, but must be interpreted with caution since for small sample sizes the test is not very powerful and for large sample sizes even very small deviances from normality will be labelled as significant.
Plot the residuals, \(r^*_i\), versus time or collection order (if possible). Look for dependence or autocorrelation.
Residuals can be used to check model assumptions, and also to discover outliers.
More information on the plots here: http://data.library.virginia.edu/diagnostic-plots/ and http://ggplot2.tidyverse.org/reference/fortify.lm.html
You can use the function fortify.lm in ggplot2 to create a dataframe from an lm-object, which ggplot uses automatically when given a lm-object. This can be used to plot diagnostic plots.
For simplicity we use the Munch rent index with rent as response and only area as the only covariate. (You may change the model to a more complex one, and rerun the code chunks.)
fit <- lm(rent ~ area, data=rent99) # Run a regression analysis
format(head(fortify(fit)), digits = 4L)
# to show what fortify implicitly does in ggplot
# for more information ggplot2::fortify.lm
## rent area .hat .sigma .cooksd .fitted .resid .stdresid
## 1 109.9 26 0.001312 158.8 5.870e-04 260.0 -150.00 -0.9454
## 2 243.3 28 0.001219 158.8 1.678e-05 269.6 -26.31 -0.1658
## 3 261.6 30 0.001130 158.8 6.956e-06 279.2 -17.60 -0.1109
## 4 106.4 30 0.001130 158.8 6.711e-04 279.2 -172.83 -1.0891
## 5 133.4 30 0.001130 158.8 4.779e-04 279.2 -145.85 -0.9191
## 6 339.0 30 0.001130 158.8 8.032e-05 279.2 59.79 0.3768
A plot with the fitted values of the model on the x-axis and the residuals on the y-axis shows if the residuals have non-linear patterns. The plot can be used to test the assumption of a linear relationship between the response and the covariates. If the residuals are spread around a horizontal line with no distinct patterns, it is a good indication on no non-linear relationships, and a good model.
Does this look like a good plot for this data set?
ggplot(fit, aes(.fitted, .stdresid)) +
geom_point(pch = 21) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_smooth(se = FALSE, col = "red", size = 0.5, method = "loess") +
labs(x = "Fitted values", y = "Standardized residuals", title = "Fitted values vs standardized residuals", subtitle = deparse(fit$call))
This plot shows if the residuals are Gaussian (normally) distributed. If they follow a straigt line it is an indication that they are, and else they are probably not.
ggplot(fit, aes(sample = .stdresid)) +
stat_qq(pch = 19) +
geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
labs(x = "Theoretical quantiles", y = "Standardized residuals", title = "Normal Q-Q", subtitle = deparse(fit$call))
library(nortest)
ad.test(rstudent(fit))
##
## Anderson-Darling normality test
##
## data: rstudent(fit)
## A = 6.4123, p-value = 9.809e-16
This is also called spread-location plot. It shows if the residuals are spread equally along the ranges of predictors. Can be used to check the assumption of equal variance (homoscedasticity). A good plot is one with a horizontal line with randomly spread points.
Is this plot good for your data?
ggplot(fit, aes(.fitted, sqrt(abs(.stdresid)))) +
geom_point() +
geom_smooth(se = FALSE, col = "red", size = 0.5, method = "loess") +
labs(x = "Fitted values", y = expression(sqrt("Standardized residuals")), title = "Scale-location", subtitle = deparse(fit$call))
This plot can reveal influential outliers. Not all outliers are influential in linear regression; even though data have extreme values, they might not be influential to determine the regression line (the results don’t differ much if they are removed from the data set). These influential outliers can be seen as observations that does not get along with the trend in the majority of the observations. In plot.lm, dashed lines are used to indicate the Cook’s distance, instead of using the size of the dots as is done here.
Cook’s distance is the Euclidean distance between the \(\mathbf{\hat{y}}\) (the fitted values) and \(\mathbf{\hat{y}}_{(i)}\) (the fitted values calculated when the \(i\)-th observation is omitted from the regression). This is then a measure on how much the model is influences by observation \(i\). The distance is scaled, and a rule of thumb is to examine observations with Cook’s distance larger than 1, and give some attention to those with Cook’s distance above 0.5.
Leverage is defined as the diagonal elements of the hat matrix, i.e., the leverage of the \(i\)-th data point is \(h_{ii}\) on the diagonal of \(\mathbf{H = X(X^TX)^{-1}X^T}\). A large leverage indicated that the observation (\(i\)) has a large influence on the estimation results, and that the covariate values (\(\mathbf{x}_i\)) are unusual.
ggplot(fit, aes(.hat, .stdresid)) +
geom_smooth(se = FALSE, col = "red", size = 0.5, method = "loess") +
geom_point(aes(size = .cooksd)) +
scale_size_continuous("Cook's dist.") +
labs(x = "Leverage", y = "Standardized residuals", title = "Residuals vs Leverage", subtitle = deparse(fit$call))
(Some observations does not fit our model, but if we fit a more complex model this may change.)
(read for yourself - topic of ILw1)
Example: consider our rent dataset with rent as reponse, and continuous covariate area and categorical covariate location. Let the location be a factor with levels average, good, excellent.
library(gamlss.data)
library(tidyverse)
library(GGally)
ds=rent99 %>%
select(location, area,rent)
levels(ds$location)
# change to meaningful names
levels(ds$location)=c("average","good","excellent")
ggpairs(ds)
## [1] "1" "2" "3"
Q: comment on what you see in the ggpairs plot.
Categorical covariates may either be ordered or unordered. We will only consider unordered categories here. In general, we could like to estimate regression coefficients for all levels for the categorical covariates. However, if we want to include an intercept in our model we can only include codings for one less variable than the number of levels we have - or else our design matrix will not have full rank.
Q: Assume you have a categorical variable with three levels. Check for yourself that making a design matrix with one intercept and three columns with dummy (0-1) variable coding will result in a matrix that is singular.
# make "wrong" dummy variable coding with 3 columns
n=length(ds$location)
X=cbind(rep(1,n),ds$area,rep(0,n),rep(0,n),rep(0,n))
X[ds$location=="average",3]=1
X[ds$location=="good",4]=1
X[ds$location=="excellent",5]=1
X[c(1,3,69),]
library(Matrix)
dim(X)
rankMatrix(X)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 26 0 1 0
## [2,] 1 30 1 0 0
## [3,] 1 55 0 0 1
## [1] 3082 5
## [1] 4
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 6.843415e-13
This is why we need to instead work with different ways of coding categorical variables. One solution is to not include an intercept in the model, but that is often not what we want. We will look at two other solutions - one where we decide on a reference category (that we not include in the coding, and therefore is kind of included in the intercept - this is called “treatment coding”) and one where we require that the the sum of the coeffisients are zero (called “effect coding). This mainly effects how we interpret parameter estimates and communicate our findings to the world.
If we fit a regression model with lm to the data with rent as response and area and location as covariates, a model matrix is made - and how to handle the categorical variable is either specified the call to lm in contrasts=list(location="contr.treatment") (or to model.matrix) or globally for all categorical variables with options(contrasts=c("contr.treatment","contr.poly"))- where first element give choice for unordered factor (then treatment contrast is default) and second for ordered (and then this polynomial contrast is default). We will only work with unordered factors now.
This is the default coding. The reference level is automatically chosen as the “lowest” level (sorted alphabetically). For our example this means that the reference category for location is “average”. If we instead wanted “good” to be reference category we could relevel the factor.
X1=model.matrix(~area+location,data=ds)
X1[c(1,3,69),]
ds$locationRELEVEL=relevel(ds$location,ref="good")
X2=model.matrix(~area+locationRELEVEL,data=ds)
X2[c(1,3,69),]
## (Intercept) area locationgood locationexcellent
## 1 1 26 1 0
## 3 1 30 0 0
## 69 1 55 0 1
## (Intercept) area locationRELEVELaverage locationRELEVELexcellent
## 1 1 26 0 0
## 3 1 30 1 0
## 69 1 55 0 1
So, what does this mean in practice? Model 1 has average as reference category and model 2 good.
fit1=lm(rent~area+location,data=ds,contrasts = list(location="contr.treatment"))
summary(fit1)
fit2=lm(rent~area+locationRELEVEL,data=ds,contrasts = list(locationRELEVEL="contr.treatment"))
summary(fit2)
##
## Call:
## lm(formula = rent ~ area + location, data = ds, contrasts = list(location = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -790.98 -100.89 -4.87 94.47 1004.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.0867 8.6947 14.732 < 2e-16 ***
## area 4.7056 0.1202 39.142 < 2e-16 ***
## locationgood 28.0040 5.8662 4.774 1.89e-06 ***
## locationexcellent 131.1075 18.2614 7.180 8.73e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared: 0.3555, Adjusted R-squared: 0.3549
## F-statistic: 566 on 3 and 3078 DF, p-value: < 2.2e-16
##
##
## Call:
## lm(formula = rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -790.98 -100.89 -4.87 94.47 1004.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.0907 9.4950 16.439 < 2e-16 ***
## area 4.7056 0.1202 39.142 < 2e-16 ***
## locationRELEVELaverage -28.0040 5.8662 -4.774 1.89e-06 ***
## locationRELEVELexcellent 103.1034 18.4021 5.603 2.30e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared: 0.3555, Adjusted R-squared: 0.3549
## F-statistic: 566 on 3 and 3078 DF, p-value: < 2.2e-16
Q: Comment on the print-out. How do we interpret the intercept estimate?
This is an equally useful and popular coding - and this is the coding that is preferred when working with analysis of variance in general. The effect coding assumes that the sum of the effects for the levels of the factor sums to zero, and this is done with the following coding scheme (Model 3 with the original location and 4 with the releveled version.)
X3=model.matrix(~area+location,data=ds,contrasts=list(location="contr.sum"))
X3[c(1,3,69),]
X4=model.matrix(~area+locationRELEVEL,data=ds,contrasts=list(locationRELEVEL="contr.sum"))
X4[c(1,3,69),]
## (Intercept) area location1 location2
## 1 1 26 0 1
## 3 1 30 1 0
## 69 1 55 -1 -1
## (Intercept) area locationRELEVEL1 locationRELEVEL2
## 1 1 26 1 0
## 3 1 30 0 1
## 69 1 55 -1 -1
Observe the coding scheme. This means that when we find “the missing location level estimate” as the negative of the sum of the parameter estimates for the other estimated levels.
So, what does this mean in practice?
fit3=lm(rent~area+location,data=ds,contrasts = list(location="contr.sum"))
summary(fit3)
fit4=lm(rent~area+locationRELEVEL,data=ds,contrasts = list(locationRELEVEL="contr.sum"))
summary(fit4)
##
## Call:
## lm(formula = rent ~ area + location, data = ds, contrasts = list(location = "contr.sum"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -790.98 -100.89 -4.87 94.47 1004.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181.1238 10.6383 17.026 < 2e-16 ***
## area 4.7056 0.1202 39.142 < 2e-16 ***
## location1 -53.0372 6.6428 -7.984 1.98e-15 ***
## location2 -25.0331 6.7710 -3.697 0.000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared: 0.3555, Adjusted R-squared: 0.3549
## F-statistic: 566 on 3 and 3078 DF, p-value: < 2.2e-16
##
##
## Call:
## lm(formula = rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.sum"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -790.98 -100.89 -4.87 94.47 1004.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181.1238 10.6383 17.026 < 2e-16 ***
## area 4.7056 0.1202 39.142 < 2e-16 ***
## locationRELEVEL1 -25.0331 6.7710 -3.697 0.000222 ***
## locationRELEVEL2 -53.0372 6.6428 -7.984 1.98e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared: 0.3555, Adjusted R-squared: 0.3549
## F-statistic: 566 on 3 and 3078 DF, p-value: < 2.2e-16
Q: Comment on the print-out. How do we now interpret the intercept estimate?
(read for yourself)
To illustrate how interactions between covariates can be included we use the ozone data set from the ElemStatLearn library. This data set is measurements from 1973 in New York and contains 111 observations of the following variables:
ozone : ozone concentration (ppm)radiation : solar radiation (langleys)temperature : daily maximum temperature (F)wind : wind speed (mph)We start by fitting a multiple linear regression model to the data, with ozone as our response variable and temperature and wind as covariates.
| ozone | radiation | temperature | wind |
|---|---|---|---|
| 41 | 190 | 67 | 7.4 |
| 36 | 118 | 72 | 8.0 |
| 12 | 149 | 74 | 12.6 |
| 18 | 313 | 62 | 11.5 |
| 23 | 299 | 65 | 8.6 |
| 19 | 99 | 59 | 13.8 |
##
## Call:
## lm(formula = ozone ~ temperature + wind, data = ozone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.160 -13.209 -3.089 10.588 98.470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.2008 23.6083 -2.846 0.00529 **
## temperature 1.8265 0.2504 7.293 5.32e-11 ***
## wind -3.2993 0.6706 -4.920 3.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.72 on 108 degrees of freedom
## Multiple R-squared: 0.5817, Adjusted R-squared: 0.574
## F-statistic: 75.1 on 2 and 108 DF, p-value: < 2.2e-16
The model can be written as: \[Y = \beta_0 + \beta_1 x_t + \beta_2 x_w + \varepsilon\] In this model we have assumed that increasing the value of one covariate is independent of the other covariates. For example: by increasing the temperature by one-unit always increases the response value by \(\beta_2 \approx 1.651\), regardless of the value of wind.
However, one might think that the covariate wind (wind speed) might act differently upon ozone for different values of temperature and vice verse. \[\begin{aligned} Y &= \beta_0 + \beta_1 x_t + \beta_2 x_w + \beta_3\cdot(x_t \cdot x_w) +\varepsilon \\ &= \beta_0 + (\beta_1 + \beta_3 x_w) \cdot x_t + \beta_2 x_w + \varepsilon \\ &= \beta_0 + \beta_1 x_t + (\beta_2 + \beta_3 x_t) \cdot x_w + \varepsilon \end{aligned}.\] We fit this model in R. An interaction term can be included in the model using the * symbol.
Q: Look at the summary below. Is this a better model than without the interaction term? It the term significant?
ozone.int = lm(ozone~temperature+wind+ temperature*wind, data=ozone)
summary(ozone.int)
##
## Call:
## lm(formula = ozone ~ temperature + wind + temperature * wind,
## data = ozone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.929 -11.190 -3.037 8.209 97.440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -239.94146 48.59004 -4.938 2.92e-06 ***
## temperature 4.00151 0.59311 6.747 8.02e-10 ***
## wind 13.60882 4.28070 3.179 0.00193 **
## temperature:wind -0.21747 0.05446 -3.993 0.00012 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.36 on 107 degrees of freedom
## Multiple R-squared: 0.636, Adjusted R-squared: 0.6258
## F-statistic: 62.31 on 3 and 107 DF, p-value: < 2.2e-16
Below we see that the interaction term is highly significant. The \(p\)-value is very small, so that there is strong evidence that \(\beta_3 \neq 0\). Furthermore, \(R^2_{\text{adj}}\) has increased, indicating that more of the variability in the data has been explained by the model (than without the interaction).
Interpretation of the interaction term:
If we now increase the temperature by \(10^{\circ}\) F, the increase in wind speed will be \[(\hat \beta_1+\hat \beta_3 \cdot x_w) \cdot 10 = (4.0 -0.22 \cdot x_w) \cdot 10 = 40-2.2 x_w \text{ units}.\]
If we increase the wind speed by 10 mph, the increase in temperature will be \[(\hat \beta_2 + \hat \beta_3 \cdot x_t) \cdot 10 = (14 -0.22 \cdot x_t) \cdot 10 = 140-2.2 x_t \text{ units}.\]
The hierarchical principle
It is possible that the interaction term is higly significant, but the main effects are not.
In our ozone.int model above: the main effects are temperature and wind. The hierarchical principle states that if we include an interaction term in our model, the main effects are also to be included, even if they are not significant. This means that if the coefficients \(\hat \beta_1\) or \(\hat \beta_2\) would be insignificant, while the coefficient \(\hat \beta_3\) is significant, \(\hat \beta_1\) and \(\hat \beta_2\) should still be included in the model.
There reasons for this is that a model with interaction terms, but without the main effects is hard to interpret.
We create a new variable temp.cat which is a temperature as a qualitative covariate with two levels and fit the model: \[\begin{aligned}y&=\beta_0 + \beta_1 x_w + \begin{cases} \beta_2 + \beta_3 x_w &\text{ if temperature="low"}\\ 0 &\text{ if temperature = "high"}\end{cases} \\\\ &= \begin{cases} (\beta_0 + \beta_2) + (\beta_1 + \beta_3) \cdot x_w &\text{ if temperature="low"}\\ \beta_0 + \beta_1 x_w &\text{ if temperature="high""} \end{cases} \end{aligned}\]
temp.cat = ifelse(ozone$temperature < mean(ozone$temperature), "low", "high")
ozone2 = cbind(ozone, temp.cat)
kable(head(ozone2))
ozone.int2 = lm(ozone~wind + temp.cat+ temp.cat*wind, data=ozone2)
summary(ozone.int2)
| ozone | radiation | temperature | wind | temp.cat |
|---|---|---|---|---|
| 41 | 190 | 67 | 7.4 | low |
| 36 | 118 | 72 | 8.0 | low |
| 12 | 149 | 74 | 12.6 | low |
| 18 | 313 | 62 | 11.5 | low |
| 23 | 299 | 65 | 8.6 | low |
| 19 | 99 | 59 | 13.8 | low |
##
## Call:
## lm(formula = ozone ~ wind + temp.cat + temp.cat * wind, data = ozone2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.291 -9.091 -1.307 11.227 71.815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.0450 7.5004 15.872 < 2e-16 ***
## wind -6.7235 0.8195 -8.204 5.61e-13 ***
## temp.catlow -92.6316 12.9466 -7.155 1.09e-10 ***
## wind:temp.catlow 6.0544 1.1999 5.046 1.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.26 on 107 degrees of freedom
## Multiple R-squared: 0.6393, Adjusted R-squared: 0.6291
## F-statistic: 63.2 on 3 and 107 DF, p-value: < 2.2e-16
interceptlow = coef(ozone.int2)[1]+coef(ozone.int2)[3]
slopelow = coef(ozone.int2)[2]+coef(ozone.int2)[4]
intercepthigh = coef(ozone.int2)[1]
slopehigh = coef(ozone.int2)[2]
ggplot(ozone)+geom_line(aes(y=interceptlow+slopelow*wind, x = wind), col="blue")+geom_line(aes(y=intercepthigh+slopehigh*wind, x=wind), col="red")+ylab("ozone")
Write down the GLM way for the multiple linear regression model. Explain.
Write down the likelihood and loglikelihood. Then define the score vector.
What is the set of equations we solve to find parameter estimates? What if we could not find a closed form solution to our set of equations - what could we do then?
Define the observed and the expected Fisher information matrix. What dimension does these matrices have? What can these matrices tell us?
A core finding is \(\hat \beta\). \[ \hat{\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\] with \(\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})\).
Show that \(\hat{\beta}\) has this distribution with the given mean and covariance matrix. What does this imply for the distribution of the \(j\)th element of \(\hat{\beta}\)? In particular, how can we calculate the variance of \(\hat{\beta}_j\)?
Explain the difference between error and residual. What are the properties of the raw residuals? Why don’t we want to use the raw residuals for model check? What is our solution to this?
That is the theoretical intercept and slope of a QQ–plot based on a normal sample?
Fit the regression model with first rent and then rentsqm as reponse and following covariates: area, location (dummy variable coding using location2 and location3), bath, kitchen and cheating (central heating).
#CODE
Consentrate on the response-model you choose for the rest of the tasks.
Next week: more on inference on this data set.
In a design matrix orthogonal columns gives diagonal \({\bf X}^T {\bf X}\). What does that mean? How can we get orthogonal columns?
If we have orthogonal columns, will then simple (only one covariate) and multiple estimated regression coefficients be different? Explain.
What is multicollinearity? Is that a problem? Why (not)?
Background material for this task: [Categorical covariates - dummy and effect coding)(#categorical)
We will study a dataset where we want to model income as response and two unordered categorical covariates genderand place (location).
income <- c(300, 350, 370, 360, 400, 370, 420, 390,
400,430,420, 410, 300, 320, 310, 305,
350, 370, 340, 355,370, 380, 360, 365)
gender <- c(rep("Male", 12),rep("Female",12))
place <- rep(c(rep("A",4),rep("B",4),rep("C",4)),2)
data <- data.frame(income,gender=factor(gender,levels=c("Female","Male")),
place=factor(place,levels=c("A","B","C")))
#CODE ggpairs
Check out the interaction.plot(data$gender,data$place,data$income). What does it show? Do we need an interaction term if we want to model a MLR with income as response? Is there a ggplot version of the interaction plot?
Check our plot.design(income~place+gender, data = data). What does it show?
First, use treatment contrast (dummy variable coding) and fit a MLR with income as response and gender and place as covariates. Explain what your model estimates mean. In particular, what is the interpretation of the intercept estimate?
Now, turn to sum-zero contrast (effect coding). Explain what your model estimates mean. Now, what is the intercept estimate? Calculate the estimate for place=C.
Next week we connect this to linear hypotheses and ANOVA.
This part of the module was marked “self-study”. Go through this together in the group, and make sure that you understand.
(this problem was also given in TMA4268 Statistical learning)
ggplot. You use ggplot. Why this? To see how things looks when the model is correct and wrong.#Homoscedastic errore
n=1000
x=seq(-3,3,length=n)
beta0=-1
beta1=2
xbeta=beta0+beta1*x
sigma=1
e1=rnorm(n,mean=0,sd=sigma)
y1=xbeta+e1
ehat1=residuals(lm(y1~x))
plot(x,y1,pch=20)
abline(beta0,beta1,col=1)
plot(x,e1,pch=20)
abline(h=0,col=2)
#Heteroscedastic errors
sigma=(0.1+0.3*(x+3))^2
e2=rnorm(n,0,sd=sigma)
y2=xbeta+e2
ehat2=residuals(lm(y2~x))
plot(x,y2,pch=20)
abline(beta0,beta1,col=2)
plot(x,e2,pch=20)
abline(h=0,col=2)
n=1000
beta=matrix(c(0,1,1/2,1/3),ncol=1)
set.seed(123)
x1=rnorm(n,0,1); x2=rnorm(n,0,2); x3=rnorm(n,0,3)
X=cbind(rep(1,n),x1,x2,x3)
y=X%*%beta+rnorm(n,0,2)
fit=lm(y~x1+x2+x3)
yhat=predict(fit)
summary(fit)
ehat=residuals(fit); estand=rstandard(fit); estud=rstudent(fit)
plot(yhat,ehat,pch=20)
points(yhat,estand,pch=20,col=2)
UNDER CONSTRUCTION :-)
install.packages(c("gamlss.data","tidyverse", "GGally", "Matrix","nortest"))