Our textbook is James et al. (2013): An Introduction to Statistical Learning - with Applications in R (ISL), Chapter 1 and 2.3, plus Rbeginner and Rintermediate.
What is statistical learning?
Statistical learning refers to a vast set of tools for understanding data (according to our textbook, page 1).
We focus on the whole chain: model-method-algorithm-interpretation, and on both prediction and understanding (inference). Statistical learning is a statistical discipline.
So, what is the difference between machine learning and statistical learning?
Well, machine learning is more focused on the algorithmic part of learning, and is a discipline in computer science.
But, many methods/algorithms are common to the fields of statistical learning and machine learning.
What about data science?
In data science the aim is to extract knowledge and understanding from data, which requires a combination of statistics, mathematics, numerics, computer science and informatics.
This encompasses the whole process of data acquisition/scraping, going from unstructured to structured data, setting up a data model, performing data analysis, implementing tools and interpreting results.
The Framingham Heart Study is a study of the etiology (i.e. underlying causes) of cardiovascular disease (CVD), with participants from the community of Framingham in Massachusetts, USA https://www.framinghamheartstudy.org/. (In Norway we have the Health survey of Nord-Trøndelag, HUNT - but not with data available for teaching.)
We will focus on modelling systolic blood pressure using data from \(n=2600\) persons. For each person in the data set we have measurements of the following seven variables
SYSBP systolic blood pressure (mmHg),
SEX 1=male, 2=female,
AGE age (years) at examination,
CURSMOKE current cigarette smoking at examination: 0=not current smoker, 1= current smoker,
BMI body mass index ( \(kg/m^2\) ),
TOTCHOL serum total cholesterol (mg/dl), and
BPMEDS use of anti-hypertensive medication at examination: 0=not currently using, 1=currently using.
Etiology of CVD - model
A multiple normal linear regression model was fitted to the data set with \(-\frac{1}{\sqrt{SYSBP}}\) as response (output) and all the other variables as covariates (inputs).
The results are used to formulate hypotheses about the etiology of CVD - to be studied in new trials.
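Such a model can be fitted in R with `lm()`. A minimal sketch follows; since the Framingham data are not publicly available, a simulated stand-in data frame with the same seven variables (hypothetical values, not the real study data) is used:

```r
# Sketch only: normal linear regression with -1/sqrt(SYSBP) as response
# and the remaining six variables as covariates. The data frame below is
# simulated to mimic the structure of the Framingham data set.
set.seed(1)
n <- 2600
framingham <- data.frame(
  SYSBP    = rnorm(n, mean = 130, sd = 20),
  SEX      = factor(sample(1:2, n, replace = TRUE)),
  AGE      = round(runif(n, 30, 70)),
  CURSMOKE = factor(sample(0:1, n, replace = TRUE)),
  BMI      = rnorm(n, mean = 26, sd = 4),
  TOTCHOL  = rnorm(n, mean = 230, sd = 40),
  BPMEDS   = factor(sample(0:1, n, replace = TRUE, prob = c(0.9, 0.1)))
)

fit <- lm(-1/sqrt(SYSBP) ~ SEX + AGE + CURSMOKE + BMI + TOTCHOL + BPMEDS,
          data = framingham)
summary(fit)$coefficients  # estimates, standard errors, t- and p-values
```

The categorical covariates (SEX, CURSMOKE, BPMEDS) are coded as factors so that `lm()` creates dummy variables for them automatically.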
The iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in 1936. The data set contains three plant species {setosa, virginica, versicolor} and four features measured for each corresponding sample: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.
One method: In this plot the small black dots represent correctly classified iris plants, while the red dots represent misclassifications. The big black dots represent the class means.
Competing method: sometimes a more suitable boundary is not linear.
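The two kinds of boundaries can be illustrated on the iris data with linear and quadratic discriminant analysis from the MASS package (a sketch of the idea, not necessarily the exact methods behind the plots above):

```r
# Linear vs. non-linear class boundaries on the iris data:
# lda() gives linear boundaries, qda() gives quadratic (curved) boundaries.
library(MASS)

lda_fit <- lda(Species ~ ., data = iris)
qda_fit <- qda(Species ~ ., data = iris)

lda_pred <- predict(lda_fit, iris)$class
qda_pred <- predict(qda_fit, iris)$class

# Misclassifications on the training data (illustration only - a fair
# comparison would use a separate test set or cross-validation)
sum(lda_pred != iris$Species)
sum(qda_pred != iris$Species)
```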
Gene expression in rats
In a collaboration with the Faculty of Medicine the relationship between inborn maximal oxygen uptake and skeletal muscle gene expression was studied.
Rats were artificially selected for high- and low running capacity (HCR and LCR, respectively),
and either kept sedentary or trained.
Transcripts significantly related to running capacity and training were identified (moderated t-tests from two-way anova models, false discovery rate controlled).
To further present the findings, a heat map of the most significant transcripts was shown (transcripts with high expression in red, transcripts with low expression in yellow).
This is hierarchical cluster analysis with the Pearson correlation distance measure.
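The clustering behind such a heat map can be sketched in base R. Here a small simulated expression matrix stands in for the real transcript data (rows = transcripts, columns = samples; all names are hypothetical):

```r
# Hierarchical clustering with Pearson correlation distance, as used
# for the transcript heat map. Simulated expression matrix for illustration.
set.seed(1)
expr <- matrix(rnorm(20 * 10), nrow = 20,
               dimnames = list(paste0("transcript", 1:20),
                               paste0("sample", 1:10)))

# Distance between transcripts: 1 - Pearson correlation of their profiles
d  <- as.dist(1 - cor(t(expr)))
hc <- hclust(d, method = "average")

# heatmap() draws the clustered matrix; reversing heat.colors() maps
# high expression to red and low expression to yellow
heatmap(expr, Rowv = as.dendrogram(hc), col = rev(heat.colors(64)))
```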
To sum up - there are three main types of problems discussed in this course:
Regression
Classification
Unsupervised methods: e.g. clustering
using data from science, technology, industry, economy/finance, …
Who is this course for?
Primary requirements
Bachelor level: 3rd year students from Science or Technology programmes, and master's/PhD level students with an interest in performing statistical analyses.
Statistics background: TMA4240/45 Statistics or equivalent.
No background in statistical software needed: but we will use the R statistical software extensively in the course.
Not a prerequisite, but some background in computing is an advantage - preferably an introductory course in informatics, like TDT4105 or TDT4110.
Overlap
TDT4173 Machine learning and case based reasoning: courses differ in philosophy (computer science vs. statistics). See Bb under FAQ for more details.
TMA4267 Linear Statistical Models: useful to know about multivariate random vectors, covariance matrices and the multivariate normal distribution. Overlap only for Multiple linear regression (M3).
Focus: both statistical theory and running analyses
The course has a focus on statistical theory, but all models and methods on the reading list will also be investigated using (mostly) available functions in R and real data sets.
It is important that, by the end of the course, the student can analyse all types of data covered in the course - not just understand the theory.
And vice versa - this is not a “we learn how to perform data analysis” course: the student must also understand the models, methods and algorithms used.
There is a final written exam (70% on final grade) in addition to compulsory exercises (30% on final grade).
About the course
Course content
Statistical learning, multiple linear regression, classification, resampling methods, model selection/regularization, non-linearity, support vector machines, tree-based methods, unsupervised methods, neural nets.
Learning outcome
Knowledge. The student has knowledge about the most popular statistical learning models and methods that are used for prediction and inference in science and technology. Emphasis is on regression- and classification-type statistical models.
Skills. The student knows, based on an existing data set, how to choose a suitable statistical model, apply sound statistical methods, and perform the analyses using statistical software. The student knows how to present the results from the statistical analyses, and which conclusions can be drawn from the analyses.
Learning methods and activities
Lectures, exercises and works (projects).
Portfolio assessment is the basis for the grade awarded in the course. This portfolio comprises a written final examination (70%) and works (projects) (30%). The results for the constituent parts are to be given in %-points, while the grade for the whole portfolio (course grade) is given by the letter grading system. Retake of examination may be given as an oral examination. The lectures may be given in English.
Estimating \(f\) (regression, classification), prediction accuracy vs model interpretability.
Supervised vs. unsupervised learning
Bias-variance trade-off
The Bayes classifier and KNN - a flexible method for regression and classification
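KNN classification is available in the class package (part of the standard R distribution). A minimal sketch on the iris data, with an arbitrary illustrative choice of k:

```r
# KNN classification with class::knn() on the iris data. The number of
# neighbours k is a tuning parameter: small k gives a flexible (wiggly)
# decision boundary, large k a smoother one.
library(class)
set.seed(1)

train_idx <- sample(nrow(iris), 100)      # random training set of size 100
train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
mean(pred != test_y)  # test misclassification rate
```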
For the IL on 17.01 we target students who do not plan to take TMA4267 Linear statistical models, and we work with random vectors, covariance matrices and the multivariate normal distribution (very useful before Modules 3 and 4).
Two of the most commonly used resampling methods are cross-validation and the bootstrap. Cross-validation is often used to choose appropriate values for tuning parameters. The bootstrap is often used to provide a measure of accuracy of a parameter estimate.
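The bootstrap idea can be sketched in a few lines of base R on toy data (simulated here for illustration): resample the observations with replacement and look at the spread of the recomputed statistic.

```r
# Bootstrap estimate of the standard error of the sample median,
# a statistic with no simple closed-form standard error.
set.seed(1)
x <- rnorm(50, mean = 2)  # toy data

boot_medians <- replicate(1000, median(sample(x, replace = TRUE)))
sd(boot_medians)  # bootstrap estimate of the standard error of the median
```

Cross-validation follows a similar resampling spirit, but splits the data into folds to estimate test error; it is treated in detail later in the course.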
Mette will interrupt and talk in plenum only if there are common issues that all need to know/pay attention to.
Friday January 12
(Martina and Thea present, Mette present 12-12.50)
12.15: New students (not present on Wednesday) can work in groups with the questions under “R, Rstudio, CRAN and GitHub - and R Markdown”, and then go on to Rbeginner.html and Rintermediate.html.
12.15: Students who attended Wednesday continue from where they left off on Wednesday.
P. Dalgaard: Introductory statistics with R, 2nd edition, Springer, which is also available freely to NTNU students as an ebook: Introductory Statistics with R.