Preparation and Materials


Introduction

During this course, we will exclusively use R for data visualization and analysis, via RStudio.

The open-source programming language R is focussed on statistics and data analysis, with several built-in options and example datasets for performing most common types of analysis. For example, we can perform a regression and plot it to show the relationship between weight and miles per gallon of the cars in the built-in mtcars dataset:

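# Fit a linear regression of miles per gallon (mpg) on weight (wt)
result <- lm(mpg ~ wt, data = mtcars)
# Scatter plot of the data, with the fitted line drawn dashed (lty = 2)
plot(mpg ~ wt, data = mtcars)
abline(reg = result, lty = 2)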

RStudio is the de facto standard integrated development environment (IDE) for R: a computer program that enables us to easily write programs, scripts, documents, and even entire blogs, websites, and journal articles using R in the background.

We assume that you are already familiar with R and RStudio, as outlined in the entry requirements of the course. However, if you haven’t yet installed R on your (current) computer or need a refresher, see below for how to install R and RStudio, along with some resources to familiarize yourself with the basics.

RStudio projects

A feature we are going to use a lot is RStudio’s projects. A project is a file folder with code, data, and other files related to a single project. An R project folder contains an .Rproj file which you can open from RStudio. This automatically sets that folder as the working directory, meaning any files in it can be loaded relative to this directory.
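As a minimal sketch of what loading files relative to the working directory means, the code below imitates a project folder in a temporary directory. The folder name my_project and the file data/cars.csv are made up for this illustration, and the call to setwd() only stands in for what opening the .Rproj file does for you.

```r
# Imitate a project folder in a temporary directory (hypothetical layout)
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE,
           showWarnings = FALSE)
write.csv(mtcars, file.path(project_dir, "data", "cars.csv"),
          row.names = FALSE)

# Opening the .Rproj file in RStudio sets the working directory for you;
# here we set it by hand and remember the previous one
old_wd <- setwd(project_dir)

# Files can now be loaded with paths relative to the project folder
cars <- read.csv("data/cars.csv")
head(cars)

# Restore the previous working directory
setwd(old_wd)
```

Because the path data/cars.csv is relative, the same code keeps working when the whole project folder is moved or handed in.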

After opening an R project, RStudio shows the name of the project in the top-right corner of the program, above the environment panel. By clicking on it, you can close it, open another project, create new projects, and quickly access your latest projects. You can also open projects in new RStudio sessions.

R Markdown

In this course, we will make extensive use of .Rmd files, R Markdown files. With R Markdown files, we can easily create documents which seamlessly combine text, code, and plots. Even the website you are reading right now was generated from an R Markdown file.

If you scroll through the file, you may see that there is a specific syntax associated with R Markdown files. At the start, there is some information about the document and how it should be output, and in the document itself is the text with a lot of pound signs (#), underscores (_) and backticks (`). If you are still unfamiliar with using R Markdown files, please read through the following tutorials on rmarkdown.rstudio.com before next class:

RStudio may ask you to install several packages. You should allow it to!
If these do not install, you should install and load rmarkdown, knitr, and the tidyverse yourself.
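One possible way to do this by hand is sketched below. The variable names are our own, and the interactive() guard is an added precaution so that nothing is downloaded when the code runs in a non-interactive script.

```r
# Packages named above
required_pkgs <- c("rmarkdown", "knitr", "tidyverse")

# Check which of them can already be found on this machine
is_installed <- vapply(required_pkgs, requireNamespace, logical(1),
                       quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

# Install only what is missing, and only in an interactive session
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Loading is then done with library(), for example: library(tidyverse)
```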


Make sure you can output the R Markdown file you created to an HTML file using Knit > Knit to HTML at the top of the source pane.
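For illustration, the sketch below writes a very small R Markdown file and renders it from the console with rmarkdown::render(), which is the function the Knit button relies on. The file name example.Rmd and its contents are made up for this example, and the rendering step is skipped when rmarkdown or pandoc is not available.

```r
# Three backticks, built with strrep() only to avoid nesting backticks
# inside this example
fence <- strrep("`", 3)

# Write a minimal R Markdown file to a temporary location
# (hypothetical file name and contents)
rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(c(
  "---",
  "title: \"A minimal example\"",
  "output: html_document",
  "---",
  "",
  "# A heading",
  "",
  "Some text with _emphasis_, followed by a code chunk:",
  "",
  paste0(fence, "{r}"),
  "summary(mtcars$mpg)",
  fence
), rmd_file)

# Render to HTML only if rmarkdown and pandoc are available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(html_file)  # path to the generated .html file
}
```

The block between the --- lines is the document information mentioned above; the pound sign marks a heading, the underscores mark emphasis, and the backticks delimit the code chunk.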


The assignments in week 3 and week 8 need to be handed in as an R project folder with data, a .Rmd file, and the .html file generated from it.

Readings

Introduction to Statistical Learning

The first book we are using in this course is Introduction to Statistical Learning, abbreviated as ISLR. The link will direct you to the website of the book, where a pdf of the (first edition of the) book is available online for free and can be downloaded. Under ‘resources’, you can also find a link to a free online course on the book which includes a very nice series of (short) lectures!

Data visualization

The second book we are using in this course is Data Visualization - A practical introduction by Kieran Healy, which we will abbreviate as DatVis. The link will direct you to a preprint version of the book which is available online for free. Next week we will start with chapter 1, learning all about the basic principles of data visualization and perception.

R Shiny apps

In week 7, we will explore making interactive visualizations using R Shiny apps. For this, we will use the book Mastering Shiny by Hadley Wickham. This book is currently still under development (and intended for a late 2020 release by O’Reilly Media), but can already be read online.

Text mining

At the end of the course, we will have a look at some basic text mining, for which we will use the book Text Mining with R by Julia Silge and David Robinson. Nice to know: this entire book and its website were made using R Markdown! We will start text mining in week 9, and cover chapters 1 to 3.

The tidyverse

Another useful resource for this course is Hadley Wickham’s book R for Data Science, or R4DS. Again, this entire book and its website were made using R Markdown! Hadley Wickham and R4DS use a specific dialect of R, based on a set of packages called the tidyverse. Because these packages are very useful and great for data science, we will be using the tidyverse throughout this course.
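To give a rough impression of what this dialect looks like, the sketch below computes the same summary twice: once in base R and once with dplyr, one of the tidyverse packages. The choice of example (mean miles per gallon per number of cylinders in mtcars) and the variable names are ours, and the tidyverse part only runs when dplyr is installed.

```r
# Base R: mean miles per gallon for each number of cylinders
base_result <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(base_result)

# Tidyverse (dplyr): the same summary written as a pipeline of verbs
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  tidy_result <- mtcars %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg))
  print(tidy_result)
}
```

Both versions return one row per value of cyl; the tidyverse version reads from top to bottom as a sequence of steps.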

Code style

Throughout this course, try to maintain a consistent and legible style for your code. This is very important, as it will make your collaborators, as well as future you, happy. Being able to read and understand your own code after a year of not looking at it is only possible if you use a consistent style and informative comments where necessary.


Read through the style guide on Hadley Wickham’s website.


Try to adhere to this style for your assignments, too. Tip: in RStudio, you can display a vertical line at 80 characters to know when your code exceeds this. You can do this at Tools > Global Options > Code > Display > Show margin.
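As a small illustration, the sketch below contrasts a hard-to-read line with a version that follows a few common conventions of this style: descriptive snake_case names, <- for assignment, spaces around operators, and lines shorter than 80 characters. The object names are made up for this example.

```r
# Harder to read: cryptic name, no spaces, `=` used for assignment
m=lm(mpg~wt,data=mtcars)

# Easier to read: descriptive name, `<-`, and spaces around operators
mpg_by_weight <- lm(mpg ~ wt, data = mtcars)

# Long calls can be split over several lines to stay within 80 characters
heavy_cars <- subset(
  mtcars,
  wt > 3.5,
  select = c(mpg, wt, cyl)
)
```

Both model objects contain the same fit; only the second line tells a reader what it is.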

R and RStudio

1. Install R

R can be obtained here. We won’t use R directly in the course, but rather call it through RStudio. RStudio does not include R itself, so R still needs to be installed.

2. Install RStudio Desktop

RStudio is an Integrated Development Environment (IDE). It can be obtained as stand-alone software here. The free and open source RStudio Desktop version is sufficient. Also ensure that you have installed a TeX distribution. To do this, run the following in RStudio:

  # Install and load the tinytex R package
  install.packages("tinytex")
  library(tinytex)
  # Download and install the TinyTeX distribution itself
  install_tinytex()

3. Familiarize yourself with the basics.

Take a look at this video if you aren’t familiar with RStudio. Additionally, since this is not a course on programming with R but rather a course on data analysis and visualization, you should ensure you are familiar with some basics. For these basics, you should check out:

  • The first two chapters of Introduction to R on DataCamp
  • Simply installing R, playing around, and reading Workflow basics in Hadley Wickham’s book R for Data Science.
  • Interactive R Course: Install R, and in the console type the following lines one by one:
  install.packages("swirl")
  library(swirl)
  swirl()
    Then follow the guide to run the R Programming: The basics of Programming in R interactive course.

What if the steps above do not work for me? If all else fails, or you have insufficient rights on your machine, one of the following web-based services will offer a solution.

1. Open a free account on rstudio.cloud. You can run your own cloud-based RStudio environment there.

2. Use Utrecht University’s MyWorkPlace. You will have access to R and RStudio there. You may need to reinstall packages in new sessions during the course.

Naturally, you will need internet access to use these services.