To be used:
R
, Rstudio
, R Markdown
, and get familiar with related topicsMunich rent
data set and the ggplot graphics (from tidyverse libraries
from data science)You know simple linear regression from TMA4240/TMA4245 (or similar introductory courses), and multiple linear regression (from TMA4267 Linear statistical models or TMA4255 Applied Statistics). We will stay with regression (for the whole course) - but make expansions in several directions.
What will not change:
our target is a random response (from some statistical distribution): continuous, nominal or ordinal, we have
fixed covariates (or explanatory variables) (in a design matrix): quantitative or qualitative, and
unknown regression parameters.
We will consider relationships between the conditional mean of Y, \(\text{E}(Y\mid {\bf x})=\mu\), and linear combinations of the covariates in a linear predictor \(\eta_i=\beta_0+\beta_1 x_{i1}+\cdots +\beta_k x_{ik}={\bf x}_i^T {\bf \beta}\).
For most of the course we will assume observation pairs (\(Y_i,{\bf x}_i\)) are independent \(i=1,\ldots,n\), but we will also consider clustered pairs (in Module 7: Linear mixed effects models LMM and Generalized linear mixed effects models GLMM).
H2016: Principles of statistical modelling and inference. Likelihood theory. General theory for generalised linear models, with applications to regression models for normally distributed data, logistic regression for binary and multinomial data, Poisson regression models and log-linear models for contingency tables. Extensions of GLM-theory to, for example, models for over-dispersion and quasi-likelihood estimation.
H2017: Added linar mixed and general linear mixed models.
Textbook: Fahrmeir, Kneib, Lang, Marx (2013): “Regression. Models, Methods and Applications” https://link.springer.com/book/10.1007%2F978-3-642-34333-9 (free ebook for NTNU students). Tentative reading list: main parts of Chapters 2, 3 (repetition), 5, 6, 7, Appendix B.4.
The modules of this course are:
Introduction (the module page you are reading now) [week 34]
Multiple linear regression (emphasis on likelihood) [week 35-36]
Binary regression (binary individual and grouped response) [week 37-38]
Poisson and gamma regression (count, non-normal continuous) [week 39-40]
GLM in general and quasi likelihood (exponential family, link function) [week 41-42]
6. Multinomial GLM and contingency tables [week 43]
Linear mixed models (clustered data, repeated measurements) [week 44-45]
Generalized mixed effects models [week 46]
Discussion and conclusion [week 47]
The Framingham Heart Study is a study of the etiology (i.e. underlying causes) of cardiovascular disease (CVD), with participants from the community of Framingham in Massachusetts, USA https://www.framinghamheartstudy.org/. This dataset is subset of a teaching version of the Framingham data, used with permission from the Framingham Heart Study.
We will focus on modelling systolic blood pressure using data from \(n=2600\) persons. For each person in the data set we have measurements of the following seven variables
SYSBP
systolic blood pressure (mmHg),
SEX
1=male, 2=female,
AGE
age (years) at examination,
CURSMOKE
current cigarette smoking at examination: 0=not current smoker, 1= current smoker,
BMI
body mass index (kg/m\(^2\)),
TOTCHOL
serum total cholesterol (mg/dl), and
BPMEDS
use of anti-hypertensive medication at examination: 0=not currently using, 1=currently using.
A multiple normal linear regression model was fitted to the data set with \(-\frac{1}{\sqrt{SYSBP}}\) as response and all the other variables as covariates.
The data set is here called thisds
.
modelB=lm(-1/sqrt(SYSBP)~SEX+AGE+CURSMOKE+BMI+TOTCHOL+BPMEDS,data=thisds)
summary(modelB)
##
## Call:
## lm(formula = -1/sqrt(SYSBP) ~ SEX + AGE + CURSMOKE + BMI + TOTCHOL +
## BPMEDS, data = thisds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0207366 -0.0039157 -0.0000304 0.0038293 0.0189747
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.103e-01 1.383e-03 -79.745 < 2e-16 ***
## SEX -2.989e-04 2.390e-04 -1.251 0.211176
## AGE 2.378e-04 1.434e-05 16.586 < 2e-16 ***
## CURSMOKE -2.504e-04 2.527e-04 -0.991 0.321723
## BMI 3.087e-04 2.955e-05 10.447 < 2e-16 ***
## TOTCHOL 9.288e-06 2.602e-06 3.569 0.000365 ***
## BPMEDS 5.469e-03 3.265e-04 16.748 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.005819 on 2593 degrees of freedom
## Multiple R-squared: 0.2494, Adjusted R-squared: 0.2476
## F-statistic: 143.6 on 6 and 2593 DF, p-value: < 2.2e-16
PLAN: We will recapitulate what we have learned in TMA4267 Linear statistical models, and then focus likelihood theory, and how this can be implemented in Compulsory exercise 1.
Textbook: Chapter 3 (from TMA4267) and parts of Appendix B4.
How can be model a respons that is not a continuous variable? Here we look at present/absent, true/false, healthy/diseased.
About 60 beetles were exposed to each of 8 different concentrations of CS\(_2\) (data on log10-dose), and the number killed at each of the concentrations were recorded.
#install.packages("investr")
library(investr)
head(beetle)
## ldose n y
## 1 1.6907 59 6
## 2 1.7242 60 13
## 3 1.7552 62 18
## 4 1.7842 56 28
## 5 1.8113 63 52
## 6 1.8369 59 53
frac=beetle$y/beetle$n
plot(beetle$ldose,frac,pch=20)
What might be the distribution of the number of dead beetles, \(Y_i\) at a given dose \(x_i\)? Dose \(x_i\) was given to \(n_i\) beetles.
\[ Y_i=bin(n_i,\pi_i)\] where \(\pi_i\) =probability for a beetle to die at dose \(x_i\) and \(n_i\)= number of beetles treated with dose \(x_i\). A linear model for \(\pi_i\) estimated by ordinary least squares is problematic because
\(0 \le \pi_i \le 1\) can not be guaranteed by a linear expression \(\beta_0+\beta_1 x_i\), and
\(\text{Var}(Y_i) =n_i \pi_i (1-\pi_i)\) is non-constant (heteroscedastic) variance.
The “usual” solution to this is logistic regression where the relationship between the mean of the response and the predictor is not linear, but instead \[ \ln(\frac{\pi_i}{1-\pi_i})= \beta_0+\beta_1 x_i \] or equivalently \[\pi_i=\frac{\exp(\beta_0+\beta_1 x_i)}{1+\exp(\beta_0+\beta_1 x_i)}\]
Then \(0\le \pi_i\le 1\). We estimate the model by Maximum Likelihood (ML), while taking into account that the responses are binomially distributed.
fit=glm(cbind(beetle$y,beetle$n-beetle$y)~ldose,data=beetle,family=binomial)
summary(fit)
##
## Call:
## glm(formula = cbind(beetle$y, beetle$n - beetle$y) ~ ldose, family = binomial,
## data = beetle)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5941 -0.3944 0.8329 1.2592 1.5940
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -60.717 5.181 -11.72 <2e-16 ***
## ldose 34.270 2.912 11.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 284.202 on 7 degrees of freedom
## Residual deviance: 11.232 on 6 degrees of freedom
## AIC: 41.43
##
## Number of Fisher Scoring iterations: 4
thisrange=range(beetle$ldose)
xs=seq(thisrange[1],thisrange[2],length=100)
predicted=predict(fit,newdata=data.frame(ldose=xs),type="response")
plot(beetle$ldose,frac)
lines(xs,predicted)
PLAN: In this module we will study the binary regression, work on parameter estimation and interpretation of parameter estimates using odds, work with both individual and grouped data, test linear hypotheses, look at criteria for model fit and model choice, and discuss overdispersion.
Textbook: 2.3 and 5.1.
Count data - the number of times and event occurs - is common. In one famous study British doctore were in 1951 sent a questionnaire about whether they smoked tobacco - and later information about their death were collected. Questions that were asked were: Is the death rate higher for smokers than for non-smokers? If so, by how much? And, how is this related to age?
library(boot)
#n=person-year, ns=smoker-years,
#age=midpoint 10 year age group,
#y=number of deaths due to cad, smoke=smoking status
head(breslow)
## age smoke n y ns
## 1 40 0 18790 2 0
## 2 50 0 10673 12 0
## 3 60 0 5710 28 0
## 4 70 0 2585 28 0
## 5 80 0 1462 31 0
## 6 40 1 52407 32 52407
To investigate this we will look at different ways of relating the expected number of deaths and the number of doctors at risk in the observation period for each smoke and age group. We will do this by assuming a Poission distribution for the number of deaths, and linking this to a linear predictor.
When we work with continuous data - like life times, costs and claim sized - these may not be negative, and the their distribution often follow a right skewed distribution. We will look at effect a one of more covariates that may work multiplicative on the response and see how we may fit that using gamma regression on the log scale of the response.
Textbook: 5.2 and 5.3
We will see that normal, binary, Poisson and gamma regression all have the same underlying features:
The mean of the response, \(\mu=\text{E}(Y)\), is connected to the linear predictor \(\eta={\bf x}^T \beta\) by a link function: \(\eta=g(\mu)\) or, alternatively, by a response function \(\mu=h(\eta)\) - where \(g=h^{-1}\) (inverse functions).
The distribution of the response can be written as a univariate exponential family (we work with that in this first module).
This leads to a unified framework, and maximum likelihood estimation can be written on a generalized form for all the GLMs. In addition we can present statistical inference and asymptotic properties of estimators on a common form. Finally, we may expand this to quasi-likelihood models by just specifying mean and variance (not distribution) and solve using generalized estimation equations.
This part is rather mathematical - but is built on the findings of modules 1-4.
Textbook: 5.4 and 5.5
Compulsory exercise 2 will cover modules 3-5.
In a study on the effect of sleep deprivation the average reaction time per day were measured. On day 0 the subjects had their normal amount of sleep. Starting that night they were restricted to 3 hours of sleep per night. The observations represent the average reaction time on a series of tests given each day to each subject. This was measured for 18 subjects for 10 days (days 0-9).
library(lme4)
## Loading required package: Matrix
library(ggplot2) # see more on ggplot later in this module
gg <- ggplot(sleepstudy, aes(x = Days, y = Reaction))
gg <- gg + geom_point(color = "blue", alpha = 0.7)
gg <- gg + geom_smooth(method = "lm", color = "black")
gg <- gg + theme_bw()
gg <- gg + facet_wrap(~Subject)
gg
We observe that each subject’s reaction time increases approximately linearly with the number of sleepdeprived days. But, it appears that subjects have different slopes and intercepts.
As a first model we may assume that there is a common intercept and slope for the population - called fixed effects, but allow for random deviations for the intercept and slope for each individual. In linear mixed effects models we assume that the random intercepts and slopes are drawn from normal distributions and estimate the variance in these distribution. Such a model will make observations correlated within subjects.
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
summary(fm1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
## Data: sleepstudy
##
## REML criterion at convergence: 1743.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9536 -0.4634 0.0231 0.4634 5.1793
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Subject (Intercept) 612.09 24.740
## Days 35.07 5.922 0.07
## Residual 654.94 25.592
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.405 6.825 36.838
## Days 10.467 1.546 6.771
##
## Correlation of Fixed Effects:
## (Intr)
## Days -0.138
Here the population fixed effects estimates are an intercept of 251.4 ms and a slope of 10.47 ms/day. The random effects for the intercept and the slope have estimated standard deviations 24.74 ms and 5.92 ms/day.
In this module we will look at different models for clustered (schools, families) and repeated measurement (e.g. over time) using regression with fixed and random effects.
Textbook: 2.4, 7.1, 7.3
We generalize our model in Module 7 - on normal responses - to binary responses.
Textbook: 7.2, 7.5, 7.7
Compulsory exercise 3 will cover modules 6-8.
Model specification: an equation linking the response and the explanatory variables, and a probability distribution for the response.
Estimation of the parameters in the model
Checking the adequacy of the model, how well it fits the data.
Inference: confidence intervals, hypothesis tests, interpretation of results.
Both theoretic derivations and practical analysis in R will be emphasized.
(new in H2017 in italic)
Knowledge.
The student can assess whether a generalised linear model can be used in a given situation and can further carry out and evaluate such a statistical analysis. The student has substantial knowledge of generalised linear models and associated inference and evaluation methods. This includes regression models for Gaussian distributed data, logistic regression for binary and multinomial data, Poisson regression and log-linear models for contingency tables.
The student has theoretical knowledge about linear mixed models and generelized linear mixed effects models, and associated inference and evaluation of the models. Main emphasis is on Gaussian and binomial data.
Skills.
The student can assess whether a generalised linear model or a generalized linear mixed model can be used in a given situation, and can further carry out and evaluate such a statistical analysis.
The plenary lectures is rather passive in nature - for the students - and held in classical auditorium (F3). They provide the first step into the new module
has focus on student activity and understanding though discussing with fellow students and with the lecturer/TA - in groups.
Lecturer gives a short introduction to current state, and present a problem.
Students work together in groups on problem.
The problem is presented on the digital screen, and the students discuss by interacting around the screen and often by producing R code and interpreting analysis output - all presented on the digital screen. One student may act as the “teacher”.
If the problem is of a theoretical flavour, or drawing is needed - the students work on the whiteboards next to the digital screen.
Lecturer summarizes solutions to the problem with input from the student groups.
has mainly focus on programming and interpretation - with some theory - and can be worked on in small groups. Will be a test of accuired understanding, and will constitute 30% of the final evaluation.
In this course we will look at models where the distribution of the response variable, \(y_i\), can be written in the form of a univariate exponential family \[ f(y_i\mid \theta_i)=\exp \left( \frac{y_i \theta_i-b(\theta_i)}{\phi}\cdot w_i + c(y_i, \phi, w_i) \right) \] where
\(\theta_i\) is called the canonical parameter and is a parameter of interest
\(\phi\) is called a nuisance parameter (and is not of interest to us=therefore a nuisance (plage))
\(w_i\) is a weight function, in most cases \(w_i=1\)
b and c are known functions.
It can be shown that \(\text{E}(Y_i)=b'(\theta_i)\) and \(\text{Var}(Y_i)=b''(\theta_i)\cdot \frac{\phi}{w}\).
Remark: slightly different versions of writing the exponential family exists, but we will use this version in our course (a different version might be used in TMA4295, but the basic findings are the same).
To make the notation easier for these tasks, we omit the \(i\) subscript.
NB: you may first use \(n=1\) in the binomial (which then is called Bernoulli) - since that is much easier than a general \(n\).
Hint: https://wiki.math.ntnu.no/tma4245/tema/begreper/discrete and nearly the same parameterization for showing the binomial is member of exponential https://www.youtube.com/watch?v=7mNrsFr7P_A.
Hint: https://wiki.math.ntnu.no/tma4245/tema/begreper/discrete and first part of Sannsynlighetsmaksimering
Then, what about the (univariate) normal? (requirements, probability density function f(x) and member?)
Finally, the gamma distribution? (requirements, pdf f(x) and member?)
Hint: https://wiki.math.ntnu.no/tma4245/tema/begreper/continuous
See also table on page 303 in the textbook. Mettes solution: https://www.math.ntnu.no/emner/TMA4315/2017h/Module1ExponentialFamily.pdf
http://r-pkgs.had.co.nz (We will make an R package in the exercise part of this course.)
Do we need GitHub or Bitbucket in our course? https://www.youtube.com/watch?v=w3jLJU7DT5E and https://techcrunch.com/2012/07/14/what-exactly-is-github-anyway/
The module pages (you are reading the Module 1 page now), are written using R Markdown. To work with the module pages you either copy-paste snippets of R code from the module page over in your editor window in Rstudio, or copy the Rmd-version of the module page (1Intro.Rmd) into your Rstudio editor window (then you can edit directly in Rmarkdown document - to make it into your personal copy).
If you choose the latter: To compile the R code we use knitr
(termed “knit”) to produce a html-page you press “knit” in menu of the editor window, but first you need to install packages: rmarkdown
and devtools
(from CRAN). For the module pages the needed R packages will always be listed in the end of the module pages.
#install.packages("rmarkdown")
library(rmarkdown)
#install.packages("devtools")
library(devtools)
#install_github("yixuan/prettydoc")
library(prettydoc)
If you want to learn more about the R Markdown (that you may use for the compulsory exercises) this is a good read:
http://r4ds.had.co.nz/r-markdown.html (Chapter 27: R Markdown from the “R for Data Science” book), and
the Rstudio cheat sheet on R Markdown is here: https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf.
Then you see that you can make a pdf-file in addition to a html-file (for your reports you may choose either).
If you only want to extract the R code from a R Markdown file you may do that using the function purl
from library knitr
. To produce a file “1Intro.R” from this “1Intro.Rmd” file:
require(knitr)
purl("https://www.math.ntnu.no/emner/TMA4315/2017h/1Intro.Rmd")
The file will then be saved in your working directory, that you see with getwd()
.
And to work with either the 1Intro.R or 1Intro.Rmd file you will have to first install the following libraries:
install.packages("rmarkdown")
install.packages("prettydoc")
install.packages("gamlss.data")
install.packages("tidyverse")
install.packages("ggpubr")
install.packages("investr")
install.packages("lme4")
We will use this data set when working with multiple linear regression (next module), so this is a good way to start to know the data set and the ggplot functions, which can be installed together with a suite of useful libraries from tidyverse
.
A version of the Munic Rent Index data is available as rent
in library catdata
from CRAN.
#install.packages("gamlss.data")
#install.packages("tidyverse")
library(gamlss.data)
#library(tidyverse) #automatically loads ggplot2
Get to know the rent
data.
ds=rent99
colnames(ds)
## [1] "rent" "rentsqm" "area" "yearc" "location" "bath"
## [7] "kitchen" "cheating" "district"
dim(ds)
## [1] 3082 9
summary(ds)
## rent rentsqm area yearc
## Min. : 40.51 Min. : 0.4158 Min. : 20.00 Min. :1918
## 1st Qu.: 322.03 1st Qu.: 5.2610 1st Qu.: 51.00 1st Qu.:1939
## Median : 426.97 Median : 6.9802 Median : 65.00 Median :1959
## Mean : 459.44 Mean : 7.1113 Mean : 67.37 Mean :1956
## 3rd Qu.: 559.36 3rd Qu.: 8.8408 3rd Qu.: 81.00 3rd Qu.:1972
## Max. :1843.38 Max. :17.7216 Max. :160.00 Max. :1997
## location bath kitchen cheating district
## 1:1794 0:2891 0:2951 0: 321 Min. : 113
## 2:1210 1: 191 1: 131 1:2761 1st Qu.: 561
## 3: 78 Median :1025
## Mean :1170
## 3rd Qu.:1714
## Max. :2529
Then, head for plotting with ggplot
but first take a quick look at the ggplot2
library:
Before you continue you should have read the start of the Visualisation chapter that explains the ggplot grammar. Yes, you start with creating the coordinate system with ggplot
and then add layers. What does the following words mean: mapping, aesthetic, geom function mean in the ggplot
setting?
First, look at plotting rentsqm
for different values of location
- with panels of scatter plots and with boxplots
ggplot(data=ds)+
geom_point(mapping=aes(area,rentsqm))+
facet_wrap(~location,nrow=1)
ggplot(data = ds, mapping = aes(x = location, y = rentsqm)) +
geom_boxplot()
So, location matters.
But, should we use rent
or rentsqm
as response?
#install.packages("ggpubr")
require(ggpubr)
## Loading required package: ggpubr
## Loading required package: magrittr
plot1 <- ggplot(data=ds) +
geom_density(mapping=aes(rent),kernel="gaussian")
plot2 <- ggplot(data=ds) +
geom_density(mapping=aes(rentsqm),kernel="gaussian")
ggarrange(plot1, plot2, ncol=2)
So, which response will we use? And, what if we would include area
as covariate? I have plotted two plots together below, more on mixing graphs on the same page (we need ggprbr, gridExtra and cowplot packages) https://www.r-bloggers.com/ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page/
Relationship between rent
or rentsqm
and area
plot1 <- ggplot(data=ds,aes(area,rent))+
geom_point(mapping=aes(area,rent),size=0.5)
plot2 <- ggplot(data=ds)+
geom_point(mapping=aes(area,rentsqm),size=0.5)
ggarrange(plot1, plot2, ncol=2)
So, if we include area as a covariate, we may look at residuals when using rent
or rentsqm
. More about diagnostic plots in Module 2 - but - which plot below looks more random?
lm.rent=lm(rent~area,data=ds)
summary(lm.rent)
##
## Call:
## lm(formula = rent ~ area, data = ds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -786.63 -104.88 -5.69 95.93 1009.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.5922 8.6135 15.63 <2e-16 ***
## area 4.8215 0.1206 39.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 158.8 on 3080 degrees of freedom
## Multiple R-squared: 0.3417, Adjusted R-squared: 0.3415
## F-statistic: 1599 on 1 and 3080 DF, p-value: < 2.2e-16
lm.rentsqm=lm(rentsqm~area,data=ds)
summary(lm.rentsqm)
##
## Call:
## lm(formula = rentsqm ~ area, data = ds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9622 -1.5737 -0.1102 1.5861 9.9674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.46883 0.12426 76.20 <2e-16 ***
## area -0.03499 0.00174 -20.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.291 on 3080 degrees of freedom
## Multiple R-squared: 0.1161, Adjusted R-squared: 0.1158
## F-statistic: 404.5 on 1 and 3080 DF, p-value: < 2.2e-16
p1<-ggplot(lm.rent, aes(.fitted, .resid))+geom_point()
p1<-p1+stat_smooth(method="loess")+geom_hline(yintercept=0, col="red", linetype="dashed")
p1<-p1+xlab("Fitted values")+ylab("Residuals")
p1<-p1+ggtitle("Rent: Residual vs Fitted Plot")+theme_bw()
p2<-ggplot(lm.rentsqm, aes(.fitted, .resid))+geom_point()
p2<-p2+stat_smooth(method="loess")+geom_hline(yintercept=0, col="red", linetype="dashed")
p2<-p2+xlab("Fitted values")+ylab("Residuals")
p2<-p2+ggtitle("rentsqm: Residual vs Fitted Plot")+theme_bw()
ggarrange(p1, p2, ncol=2)
Take home message: for the mean of the response may differ with out covariates - that is why we use regression. For the normal linear regression it is not the response that is supposed to have mean zero, but the error term - more about this in Module 2. And, is the variance of the residuals independent of the fitted values? Yes, more in Module 2.
Grolemund and Hadwick (2017): “R for Data Science”, http://r4ds.had.co.nz
Hadwick (2009): “ggplot2: Elegant graphics for data analysis” textbook: https://link.springer.com/book/10.1007%2F978-0-387-98141-3
If you want to see more of the powers of ggplot, combined with a nice story: https://www.andrewheiss.com/blog/2017/08/10/exploring-minards-1812-plot-with-ggplot2/
R-bloggers: https://https://www.r-bloggers.com/ is a good place to look for tutorials.