Introduction to aDAV in R


Welcome to the practicals for the course aDAV. In the practicals we will get hands-on experience with the materials in the lectures by doing theoretical exercises, programming in R, and completing assignments. This is the first practical. In this practical, we will start with an exercise on the distinction between supervised and unsupervised learning, and then proceed with R. We will briefly introduce how we are going to work with R and RStudio for the remainder of the course, and get to know (some of) the datasets that we are going to work with.

  • Make sure you have read and completed the preparations.
  • Read through the course schedule. Note that assignments are due in week 5 and 9.

From these practicals, nothing needs to be handed in. You are doing these practicals to get experience with the material from the lectures and to practice for the assignments (which do need to be handed in).



Supervised and unsupervised learning

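In the lecture and reading material for this week we have seen that machine learning can be classified on the characteristics of the data and the tasks it aims to solve. A main distinction to be made is that between supervised and unsupervised learning. In supervised learning the data contain a response variable that we want to predict; in unsupervised learning there is no response variable and we look for structure in the data instead.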

In this exercise, we will practice this distinction using 10 examples.


  1. Classify the following examples into Supervised learning tasks and Unsupervised learning tasks. If they are Supervised, indicate whether a classification task or a regression task applies. After 10 minutes, we will discuss the correct answers in class. Hint: always start by asking yourself whether you would expect a response variable in the data!

  1. You work at a consultancy company, and a bank hired your services. Your customer asks you to develop a model that helps them predict which loan applicants will default (not be able to pay the loan). Final goal: predict good or bad applicants.
  2. Is there a cat in the picture? Recognize cats in pictures of animals and other objects. Final goal: cat or not cat.
  3. A biologist friend of yours asks you to assess the phylogenetic relation between different species of birds using genetic information (traits). The outcome of your analysis is a phylogenetic tree in which species that diverged more recently are closely linked together.
     [Figure: Phylogenetic tree of life built using ribosomal RNA sequences, after Karl Woese. Image credit: modified from Eric Gaba, Wikimedia Commons.]

  4. Using pre-existing data and satellite images containing information such as albedo, infrared refraction, colour absorption, etc., and known surface areas covered by solar panels, train a model capable of estimating the total area of solar panels per image.
  5. Market segmentation: group customers into segments based on their purchasing behaviours, age, maximal educational degree attained, etc.
  6. Predict whether nodules in a tomography are benign or malignant. You will use a dataset in which the images have been manually annotated by a committee of medical doctors.
  7. Use news and social media analytics to predict changes in the stock market. You will use a small number of stocks as target indicators, and web-scraped text from social media and the news. You have access to previous instances of this data, and you want to predict the values for your indicators in the near future.
  8. Build a recommendation system for songs. It suggests new artists based on their multidimensional “proximity” to the ones liked by the user: “Because you liked ______ you may also like ______.”
  9. An applied researcher wants to evaluate whether the stress felt by students affects the relation between the amount of study time and the probability of success on an exam. You will use data collected in a classroom experiment.
  10. A pharmaceutical company is developing a drug to treat a certain type of cancer. The first step is looking for targets: they ask you to analyse data consisting of the differential expression of 10,000 genes in micro-RNA chips from different tissues. You want to uncover groups of genes that display similar expression patterns in each of the tissues.

# 1: Supervised learning - classification
# 2: Supervised learning - classification
# 3: Unsupervised learning
# 4: Supervised learning - regression
# 5: Unsupervised learning
# 6: Supervised learning - classification
# 7: Supervised learning - regression
# 8: Unsupervised learning
# 9: Supervised learning. Classification (if focused on prediction, "pass or not pass") or regression (if focused on inference, i.e. probability estimation).
# 10: Unsupervised learning
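
To make the distinction concrete, here is a minimal sketch in R. It does not use any of the course data: the built-in `mtcars` dataset, the chosen variables, and the choice of two clusters are assumptions made purely for illustration. The two supervised models are told which column is the response; the clustering step is not given a response at all.

```r
# Supervised learning, regression: the numeric response mpg is observed
reg_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Supervised learning, classification: the binary response am
# (0 = automatic, 1 = manual) is observed
clf_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Unsupervised learning: no response variable, we only look for groups
# of similar cars (2 clusters is an arbitrary choice for this sketch)
set.seed(1)
car_features <- scale(mtcars[, c("mpg", "wt", "hp")])
clusters <- kmeans(car_features, centers = 2)

# How many cars ended up in each cluster
table(clusters$cluster)
```

Note that the formula interface (`response ~ predictors`) is what marks a model as supervised here; `kmeans()` only receives the feature matrix.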


R projects and Markdown files

We assume that you are already familiar with R and RStudio, as outlined in the entry requirements of the course. In addition, we assume you are familiar with using RStudio projects and R Markdown files, as outlined in the course Preparation. If you haven’t completed the course preparation tab on the website yet, please do so before next class. If you feel you still lack some R skills, there are some sources mentioned under Preparation.

In this course we will always work in a project. During this practical, we will work in the project 01_R_intro_students.


  1. Open the file 01_R_intro_students.Rproj in RStudio and run the following code in the console. Do you know where the file “sometext.txt” is located on your disk?

print(readLines("data/sometext.txt"))
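
If you are unsure where the file lives, the following sketch may help. It relies on the fact that opening an `.Rproj` file sets the working directory to the project folder, so the relative path `data/sometext.txt` is resolved from there. The variable names are only illustrative.

```r
# The folder R is currently working in (the project folder when the
# .Rproj file was opened in RStudio)
project_dir <- getwd()
print(project_dir)

# Does the relative path point to an existing file from here?
text_found <- file.exists("data/sometext.txt")
print(text_found)

# The full path that the relative path is expected to resolve to
full_path <- normalizePath("data/sometext.txt", mustWork = FALSE)
print(full_path)
```

If `text_found` is `FALSE`, the project was most likely not opened through the `.Rproj` file.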

In addition, we will make extensive use of .Rmd files, R Markdown files. With R Markdown files, we can easily create documents which seamlessly combine text, code, and plots. The document you are reading right now was generated from an R Markdown file.


  1. Open the file r_introduction_stu.Rmd from the Files pane in RStudio.

RStudio may ask you to install several packages. You should allow it to!
If these do not install, you should install and load rmarkdown, knitr, and the tidyverse yourself.
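
As a sketch of how you could check this yourself, the code below tests which of the three packages are available. The install line is deliberately commented out so that nothing is installed without you choosing to do so; the variable names are illustrative.

```r
needed <- c("rmarkdown", "knitr", "tidyverse")

# requireNamespace() returns TRUE when a package is installed and loadable
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All three packages are installed.")
}
```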


  1. Make sure you can output the R Markdown file to an HTML file using Knit > Knit to HTML at the top of the source pane.
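
The Knit button can also be reproduced from the console with `rmarkdown::render()`. The sketch below assumes that the file `r_introduction_stu.Rmd` sits in the current working directory; it only renders when both the file and the rmarkdown package are found.

```r
rmd_file <- "r_introduction_stu.Rmd"

# Only try to render when the file and the rmarkdown package are available
can_render <- file.exists(rmd_file) &&
  requireNamespace("rmarkdown", quietly = TRUE)

if (can_render) {
  # Writes an .html file next to the .Rmd file
  rmarkdown::render(rmd_file, output_format = "html_document")
} else {
  message("Cannot render: check the working directory and that rmarkdown is installed.")
}
```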

The assignments in week 3 and week 8 need to be handed in as an R project folder with data, a .Rmd file, and the .html file generated from it.

From practical 2 onwards, we will always start by opening the R project and do the assignments directly in the R Markdown file. Under each assignment, you can insert an R chunk and input your code there. If you prefer, you may also work directly in a .R file for the practicals. Note that if you do this, you will still have to work with an .Rmd file for the assignments.



Datasets from the ISLR package

The first book we are using in this course is An Introduction to Statistical Learning, abbreviated as ISLR. The authors use several datasets throughout the book which are packaged in the R package ISLR. The datasets are: Auto, Caravan, Carseats, College, Credit, Default, Hitters, Khan, NCI60, OJ, Portfolio, Smarket, Wage, Weekly.


  1. Install and load the package in R by running the following in the console

install.packages("ISLR")
library(ISLR)

You only need to install packages once. When they are installed on your system, you can always load them in your environment using library(). Let’s have a closer look at some of the datasets we will be working with.
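
The "install once, load every session" pattern can be sketched as follows. The code checks whether ISLR is already installed before loading it, and then lists the datasets that the package provides; the variable names are illustrative.

```r
# TRUE when ISLR is already installed on this system
islr_available <- requireNamespace("ISLR", quietly = TRUE)

if (islr_available) {
  library(ISLR)  # load the package for this session
  # Names of all datasets that ship with the package
  islr_datasets <- data(package = "ISLR")$results[, "Item"]
  print(islr_datasets)
} else {
  message('ISLR is not installed yet; run install.packages("ISLR") once.')
}
```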


  1. Look at the Default dataset by running the following in the console. What does this dataset contain?

View(Default)

  1. Use the function head() to look at the first few rows of the Hitters dataset. What data does this dataset contain? What are the variable types? Hint: to get more information on what each column represents, this dataset comes with a neat help file that can be accessed through ?Hitters

head(Hitters)
##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Andy Allanson      293   66     1   30  29    14     1    293    66      1
## -Alan Ashby         315   81     7   24  38    39    14   3449   835     69
## -Alvin Davis        479  130    18   66  72    76     3   1624   457     63
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
## -Andres Galarraga   321   87    10   39  42    30     2    396   101     12
## -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
##                   CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Andy Allanson       30   29     14      A        E     446      33     20
## -Alan Ashby         321  414    375      N        W     632      43     10
## -Alvin Davis        224  266    263      A        W     880      82     14
## -Andre Dawson       828  838    354      N        E     200      11      3
## -Andres Galarraga    48   46     33      N        E     805      40      4
## -Alfredo Griffin    501  336    194      A        W     282     421     25
##                   Salary NewLeague
## -Andy Allanson        NA         A
## -Alan Ashby        475.0         N
## -Alvin Davis       480.0         A
## -Andre Dawson      500.0         N
## -Andres Galarraga   91.5         N
## -Alfredo Griffin   750.0         A
# The dataset Hitters contains records and salaries for baseball players. Examples of records contained are number of times at bat, hits, home runs, runs, runs batted in, walks and number of years in the major leagues. Most of the variables are numeric (i.e., interval or ratio variables), but some variables like 'League' and 'Division' are of the factor type (i.e., categorical variables).
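
Rather than reading the variable types off the printed rows, you can ask R for them. The helper below, `describe_columns()`, is a hypothetical function written for this sketch (it is not part of ISLR or base R); it returns the class and the number of missing values for every column of a data frame.

```r
# Hypothetical helper: one row per column with its class and number of NAs
describe_columns <- function(df) {
  data.frame(
    variable  = names(df),
    class     = vapply(df, function(x) class(x)[1], character(1)),
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1)),
    row.names = NULL
  )
}

# Apply it to Hitters when the ISLR package is installed
if (requireNamespace("ISLR", quietly = TRUE)) {
  print(describe_columns(ISLR::Hitters))
}
```

The built-in function `str()` gives similar information in a more compact form.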

  1. Do the same for the Boston dataset, contained in the MASS library. What data does this dataset contain? What are the variable types? Hint: this dataset also comes with a neat help file that can be accessed through ?Boston

library(MASS)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# The dataset Boston contains housing values (the variable 'medv') and other information about Boston suburbs, such as per capita crime rate. All variables are numeric (or integer, meaning numeric values without decimals). 

  1. Use the function summary() to create a summary of the Boston dataset. What is the range and median per capita crime rate by town? And what is the range of the average number of rooms per dwelling?

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# The range of per capita crime rate is 0.01 - 88.98, with a median of 0.26 (hence, we have a lot of low values and only a few high values within the given range!). The average number of rooms per dwelling ranges from 3.56 to 8.78.
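
The same numbers can be extracted directly instead of being read from the summary table. This is a small sketch; the variable names are illustrative.

```r
library(MASS)  # provides the Boston dataset

crim_range  <- range(Boston$crim)   # minimum and maximum crime rate
crim_median <- median(Boston$crim)
crim_mean   <- mean(Boston$crim)
rm_range    <- range(Boston$rm)     # minimum and maximum number of rooms

# A mean far above the median signals a right-skewed distribution:
# many low values and a few very high ones
c(median = crim_median, mean = crim_mean)
```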


Code style

Throughout this course, try to maintain a consistent and legible style for your code. This is very important as it will make your collaborators, as well as future you, happy. Being able to read and understand your own code after a year of not looking at it is possible if you use a consistent style and informative comments where necessary.


  1. Read through the style guide on Hadley Wickham’s website.

Try to adhere to this style for your assignments, too. Tip: in RStudio, you can display a vertical line at 80 characters to know when your code exceeds this. You can do this at Tools > Global Options > Code > Display > Show margin.
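
As a small illustration of what such a style looks like in practice, compare the two versions below. Both compute the same mean; the data and object names are made up for this sketch.

```r
# Harder to read: no spaces, cryptic names, `=` for assignment,
# two statements on one line
x=c(6.5,7.0,8.5);m=mean(x)

# Easier to read: spaces around operators and after commas,
# informative names, `<-` for assignment, one statement per line
exam_grades <- c(6.5, 7.0, 8.5)
mean_grade <- mean(exam_grades)
```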



Conclusion

If you have followed this practical, you are all set for the remainder of this course! Next week, we will have a closer look at visualization using ggplot.