Welcome to the practicals for the course aDAV. In the practicals we will get hands-on experience with the materials in the lectures by doing theoretical exercises, programming in R, and completing assignments. This is the first practical. In this practical, we will start with an exercise on the distinction between supervised and unsupervised learning, and then proceed with R. We will briefly introduce how we are going to work with RStudio for the remainder of the course, and get to know (some of) the datasets that we are going to work with.
From these practicals, nothing needs to be handed in. You are doing these practicals to get experience with the material from the lectures and to practice for the assignments (which do need to be handed in).
Supervised and unsupervised learning
In the lecture and reading material for this week we have seen that machine learning methods can be classified by the characteristics of the data and the tasks they aim to solve. A main distinction to be made is that between supervised and unsupervised learning.
In this exercise, we will practice this distinction using 10 examples.
A biologist friend of yours asks you to assess the phylogenetic relation between different species of birds using genetic information (traits). The outcome of your analysis is a phylogenetic tree in which species that diverged more recently are closely linked together.
Figure: Phylogenetic tree of life built using ribosomal RNA sequences, after Karl Woese. Image credit: Modified from Eric Gaba, Wikimedia Commons.
A pharmaceutical company is developing a drug to treat a certain type of cancer. The first step is looking for targets: they ask you to analyse data consisting of the differential expression of 10,000 genes in micro-RNA chips from different tissues. You want to uncover groups of genes that display similar expression patterns in each of the tissues.
# A: Supervised learning - classification
# B: Supervised learning - classification
# C: Unsupervised learning
# D: Supervised learning - regression
# E: Unsupervised learning
# F: Supervised learning - classification
# G: Supervised learning - regression
# H: Unsupervised learning
# I: Supervised learning. Classification (if focused on predictions, “pass or not pass”) or regression (if focused on the inference, probability estimation).
# J: Unsupervised learning
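To make the distinction concrete, here is a minimal sketch that contrasts the two settings in R. It is not part of the exercise: the built-in iris data and the choice of lm() and kmeans() are assumptions made purely for illustration.

```r
# Supervised (regression): the outcome Sepal.Length is observed, and the
# model learns to predict it from the other measurements.
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
predicted <- predict(fit, newdata = iris)

# Unsupervised: no outcome is used; k-means only groups similar flowers.
set.seed(123)  # k-means starts from randomly chosen centres
measurements <- iris[, c("Sepal.Length", "Sepal.Width",
                         "Petal.Length", "Petal.Width")]
clusters <- kmeans(measurements, centers = 3)

# Compare the discovered groups with the species labels that the
# clustering algorithm never saw.
table(cluster = clusters$cluster, species = iris$Species)
```

In the supervised case there is a known outcome to compare predictions against; in the unsupervised case the species labels are only used afterwards, to inspect the groups.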
R projects and Markdown files
We assume that you are already familiar with R and RStudio, as outlined in the entry requirements of the course. In addition, we assume you are familiar with using RStudio projects and R Markdown files, as outlined in the course Preparation. If you haven’t completed the course preparation tab on the website yet, please do so before next class. If you feel you still lack some R skills, there are some sources mentioned under Preparation.
In this course we will always work in a project. During this practical, we will work in the project 01_R_intro_students.Rproj. Open it in RStudio and run the following code in the console. Do you know where the file “sometext.txt” is located on your disk?
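The code chunk this question refers to is not reproduced here. As a sketch of the idea, assuming the chunk writes a small text file using a relative path (the file contents below are hypothetical), something like the following could be run in the console:

```r
# Hypothetical stand-in for the original chunk: write one line of text to
# a file that is given by a relative path.
writeLines("Hello from the aDAV practical!", "sometext.txt")

# A relative path is resolved against the working directory. When RStudio
# was opened through the .Rproj file, this is the project folder.
getwd()
normalizePath("sometext.txt")
file.exists("sometext.txt")
```

Working in a project means that relative paths like this one keep working when the whole project folder is moved or handed in.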
In addition, we will make extensive use of R Markdown (.Rmd) files. With R Markdown files, we can easily create documents which seamlessly combine text, code, and plots. The document you are reading right now was generated from an R Markdown file.
Open the file r_introduction_stu.Rmd from the Files pane in RStudio.
Create an HTML file from it by clicking Knit > Knit to HTML at the top of the source pane.
The assignments in week 3 and week 8 need to be handed in as an R project folder with data, an .Rmd file, and the .html file generated from it.
From practical 2 onwards, we will always start by opening the R project and doing the assignments directly in the R Markdown file. Under each assignment, you can insert an R chunk and input your code there. If you prefer, you may also work directly in an .R file for the practicals. Note that if you do this, you will still have to work with an .Rmd file for the assignments.
Datasets from the ISLR package
The first book we are using in this course is Introduction to Statistical Learning, abbreviated as ISLR. The authors use several datasets throughout the book, which are packaged in the R package ISLR. The datasets are: Auto, Caravan, Carseats, College, Credit, Default, Hitters, Khan, NCI60, OJ, Portfolio, Smarket, Wage, Weekly.
Install this package in R by running the following in the console:
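The installation command itself is not shown above. A minimal sketch, assuming installation from CRAN (the mirror URL and the check for an existing installation are additions for illustration), could look like this:

```r
# Install ISLR from CRAN only when it is not yet available on this system.
# The repos URL is an assumption; any CRAN mirror works.
if (!requireNamespace("ISLR", quietly = TRUE)) {
  install.packages("ISLR", repos = "https://cloud.r-project.org")
}

# Installing happens once; loading is needed in every new R session.
library(ISLR)
```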
You only need to install packages once. When they are installed on your system, you can always load them in your environment using library(). Let’s have a closer look at some of the datasets we will be working with.
Use head() to look at the first few rows of the Hitters dataset. What data does this dataset contain? What are the variable types? Hint: to get more information on what each column represents, this dataset comes with a neat help file that can be accessed through the R help system.
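One way to approach this, as a sketch that assumes the ISLR package is already installed (str() is an extra suggestion, not something the exercise asks for):

```r
library(ISLR)

# First six rows of the data frame
head(Hitters)

# Compact overview of every column and its type
str(Hitters)

# Open the help page; only useful in an interactive session
if (interactive()) help("Hitters", package = "ISLR")
```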
##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Andy Allanson      293   66     1   30  29    14     1    293    66      1
## -Alan Ashby         315   81     7   24  38    39    14   3449   835     69
## -Alvin Davis        479  130    18   66  72    76     3   1624   457     63
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
## -Andres Galarraga   321   87    10   39  42    30     2    396   101     12
## -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
##                   CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Andy Allanson       30   29     14      A        E     446      33     20
## -Alan Ashby         321  414    375      N        W     632      43     10
## -Alvin Davis        224  266    263      A        W     880      82     14
## -Andre Dawson       828  838    354      N        E     200      11      3
## -Andres Galarraga    48   46     33      N        E     805      40      4
## -Alfredo Griffin    501  336    194      A        W     282     421     25
##                   Salary NewLeague
## -Andy Allanson        NA         A
## -Alan Ashby        475.0         N
## -Alvin Davis       480.0         A
## -Andre Dawson      500.0         N
## -Andres Galarraga   91.5         N
## -Alfredo Griffin   750.0         A
# The dataset Hitters contains records and salaries for baseball players. Examples of records contained are number of times at bat, hits, home runs, runs, runs batted in, walks, and number of years in the major leagues. Most of the variables are numeric (i.e., interval or ratio variables), but some variables like 'League' and 'Division' are of the factor type (i.e., categorical variables).
Now look at the first few rows of the Boston dataset, contained in the MASS library. What data does this dataset contain? What are the variable types? Hint: this dataset also comes with a neat help file that can be accessed through the R help system.
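A possible way to do this, sketched under the assumption that the MASS package is available (it ships with most R installations); the sapply() call is an extra suggestion for checking the variable types:

```r
library(MASS)

# First six rows of the Boston housing data
head(Boston)

# Storage class of each column (numeric or integer)
sapply(Boston, class)

# Open the help page; only useful in an interactive session
if (interactive()) help("Boston", package = "MASS")
```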
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# The dataset Boston contains housing values (the variable 'medv') and other information about Boston suburbs, such as per capita crime rate. All variables are numeric (or integer, meaning numeric values without decimals).
Use summary() to create a summary of the Boston dataset. What are the range and median of the per capita crime rate by town? And what is the range of the average number of rooms per dwelling?
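As a sketch, assuming the MASS package is available: summary() gives the overview for all columns at once, and the separate calls to range() and median() are an optional way to pull out just the requested numbers.

```r
library(MASS)

# Minimum, quartiles, mean and maximum for every column
summary(Boston)

# The specific quantities asked for
crime_range  <- range(Boston$crim)   # per capita crime rate by town
crime_median <- median(Boston$crim)
rooms_range  <- range(Boston$rm)     # average number of rooms per dwelling

crime_range
crime_median
rooms_range
```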
##       crim                zn             indus            chas
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000
##       nox               rm             age              dis
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127
##       rad              tax           ptratio          black
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90
##      lstat            medv
##  Min.   : 1.73   Min.   : 5.00
##  1st Qu.: 6.95   1st Qu.:17.02
##  Median :11.36   Median :21.20
##  Mean   :12.65   Mean   :22.53
##  3rd Qu.:16.95   3rd Qu.:25.00
##  Max.   :37.97   Max.   :50.00
# The range of the per capita crime rate is 0.01 - 88.98, with a median of 0.26 (hence, we have a lot of low values and only very few high values within the given range!). The average number of rooms per dwelling ranges from 3.56 to 8.78.
Throughout this course, try to maintain a consistent and legible style for your code. This is very important, as it will make your collaborators, as well as future you, happy. Being able to read and understand your own code after a year of not looking at it is possible if you use a consistent style and informative comments where necessary.
Try to adhere to this style for your assignments, too. Tip: in RStudio, you can display a vertical line at 80 characters to know when your code exceeds this. You can do this at Tools > Global Options > Code > Display > Show margin.
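As an illustration of what consistent and legible code can look like, here is a small sketch. The conventions shown (descriptive snake_case names, `<-` for assignment, spaces around operators, one argument per line for long calls, lines under 80 characters) are one common choice and an assumption here, not necessarily the exact style prescribed for the course; the data are a hypothetical toy example.

```r
# Hypothetical toy data, not one of the course datasets.
players <- data.frame(
  league = c("A", "N", "A", "N"),
  salary = c(480, 475, NA, 500)
)

# Mean salary per league. The formula interface of aggregate() drops rows
# with a missing salary by default, so the NA does not affect the result.
mean_salary <- aggregate(
  salary ~ league,
  data = players,
  FUN = mean
)

mean_salary
```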
If you have followed this practical, you are all set for the remainder of this course! Next week, we will have a closer look at visualization using ggplot.