**Introduction**

During this practical, two different classification methods will be covered: K-nearest neighbours and logistic regression.

One of the packages we are going to use is `class`. For this, you will probably need to run `install.packages("class")` before running the `library()` functions.

```
library(MASS)
library(class)
library(ISLR)
library(tidyverse)
```

This practical is mainly based on the `Default` dataset, which contains credit card loan data for 10 000 people. The goal is to classify credit card cases as `Yes` or `No` based on whether they will default on their loan.

**Create a scatterplot of the `Default` dataset, where `balance` is mapped to the x position, `income` is mapped to the y position, and `default` is mapped to the colour. Can you see any interesting patterns already?**

**Add `facet_grid(cols = vars(student))` to the plot. What do you see?**
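One possible way to build the two plots described above is sketched below. It assumes the `ISLR` and `ggplot2` packages are installed; the object name `default_scatter` and the `alpha` setting are illustrative choices, not part of the exercise.

```r
library(ISLR)     # provides the Default data
library(ggplot2)  # plotting (also loaded by the tidyverse)

# Scatterplot: balance on x, income on y, default status as colour
default_scatter <- ggplot(Default, aes(x = balance, y = income, colour = default)) +
  geom_point(alpha = 0.5)  # transparency (an assumed choice) reduces overplotting

default_scatter

# The same plot, split into one panel per student status
default_scatter + facet_grid(cols = vars(student))
```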

**Transform “student” into a dummy variable using `ifelse()` (0 = not a student, 1 = student). Then, randomly split the `Default` dataset into a training set `default_train` (80%) and a validation set `default_valid` (20%).**
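A minimal sketch of the recoding and the split, using base R only. The seed value and the names `default_df` and `train_rows` are assumptions made for this illustration; any random 80/20 split would do.

```r
library(ISLR)  # provides the Default data

set.seed(45)  # arbitrary seed, assumed here only for reproducibility

# Work on a copy so the original Default data stays untouched
default_df <- Default
default_df$student <- ifelse(default_df$student == "Yes", 1, 0)

# Draw 80% of the row indices at random for training
train_rows <- sample(nrow(default_df), size = round(0.8 * nrow(default_df)))

default_train <- default_df[train_rows, ]
default_valid <- default_df[-train_rows, ]

dim(default_train)
dim(default_valid)
```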

If you haven’t used the function `ifelse()` before, feel free to review it in Chapter 5, Control Flow (in particular *Section 5.2.2*), of Hadley Wickham’s book Advanced R, which provides a concise overview of choice functions (`if()`) and vectorised if (`ifelse()`).
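As a quick illustration of the difference, the sketch below applies both forms to a small made-up vector (`status` and `single_status` are hypothetical names).

```r
# ifelse() is vectorised: it evaluates the test for every element
status <- c("Yes", "No", "No", "Yes")
dummy <- ifelse(status == "Yes", 1, 0)
dummy  # 1 0 0 1

# if() expects a single TRUE or FALSE, so it suits one-off choices
single_status <- "No"
label <- if (single_status == "Yes") "student" else "not a student"
label  # "not a student"
```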

**K-Nearest Neighbours**

Now that we have explored the dataset, we can start on the task of classification. We can imagine a credit card company wanting to predict whether a customer will default on the loan so they can take steps to prevent this from happening.

The first method we will be using is k-nearest neighbours (KNN). It classifies each data point based on a majority vote of the k points closest to it. In `R`, the `class` package contains a `knn()` function to perform KNN.

**Create class predictions for the validation set (`default_valid`) using the `knn()` function. Use `student`, `balance`, and `income` (but no basis functions of those variables) in the `default_train` dataset. Set k to 5. Store the predictions in a variable called `knn_5_pred`.**
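The call could look roughly like the sketch below. To keep the block self-contained it first recreates one possible version of the split; the seed and the names `default_df`, `train_rows` and `predictors` are assumptions for this illustration.

```r
library(ISLR)   # provides the Default data
library(class)  # provides knn()

set.seed(45)  # arbitrary seed, assumed here only for reproducibility

# Recreate the dummy coding and the 80/20 split (one possible version)
default_df <- Default
default_df$student <- ifelse(default_df$student == "Yes", 1, 0)
train_rows <- sample(nrow(default_df), size = round(0.8 * nrow(default_df)))
default_train <- default_df[train_rows, ]
default_valid <- default_df[-train_rows, ]

predictors <- c("student", "balance", "income")

# knn() takes the training predictors, the points to classify,
# the training labels, and the number of neighbours
knn_5_pred <- knn(
  train = default_train[, predictors],
  test  = default_valid[, predictors],
  cl    = default_train$default,
  k     = 5
)

head(knn_5_pred)
```

Note that `knn()` uses Euclidean distance on the variables as given, so a variable on a large scale such as `income` weighs more heavily in the distance than `student` or `balance`.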

*Remember*: make sure to review the `knn()` function through the *help* panel in the GUI or by typing `?knn` into the console. For further guidance on the `knn()` function, please see *Section 4.6.5* in An Introduction to Statistical Learning.

**Create two scatter plots with income and balance as in the first plot you made: one with the true class (`default`) mapped to the colour aesthetic, and one with the predicted class (`knn_5_pred`) mapped to the colour aesthetic. Hint: Add the predicted class `knn_5_pred` to the `default_valid` dataset before starting your `ggplot()` call of the second plot. What do you see?**

**Repeat the same steps, but now with a `knn_2_pred` vector generated from a 2-nearest neighbours algorithm. Are there any differences?**

So far we have manually tested two different values for K. Although this is useful for exploring your data, you should use cross-validation to find the optimal value for K.
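As a rough sketch of what such a search could look like, the `class` package also provides `knn.cv()`, which performs leave-one-out cross-validation on the training data. The candidate values in `k_values`, the seed, and the helper names are assumptions for this illustration; other cross-validation schemes are equally valid.

```r
library(ISLR)   # provides the Default data
library(class)  # provides knn.cv()

set.seed(45)  # arbitrary seed; knn also breaks ties at random

# Recreate the dummy coding and the 80/20 split (one possible version)
default_df <- Default
default_df$student <- ifelse(default_df$student == "Yes", 1, 0)
train_rows <- sample(nrow(default_df), size = round(0.8 * nrow(default_df)))
default_train <- default_df[train_rows, ]

predictors <- c("student", "balance", "income")
k_values <- c(1, 3, 5, 7, 9)  # assumed candidate values

# Leave-one-out accuracy on the training set for each candidate k
cv_accuracy <- sapply(k_values, function(k) {
  cv_pred <- knn.cv(
    train = default_train[, predictors],
    cl    = default_train$default,
    k     = k
  )
  mean(cv_pred == default_train$default)
})

data.frame(k = k_values, accuracy = cv_accuracy)
best_k <- k_values[which.max(cv_accuracy)]
best_k
```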

**Assessing classification**

The confusion matrix is an insightful summary of the plots we have made and the correct and incorrect classifications therein. A confusion matrix can be made in `R` with the `table()` function by entering two `factor`s:

```
conf_2NN <- table(predicted = knn_2_pred, true = default_valid$default)
conf_2NN
```

```
##          true
## predicted   No  Yes
##       No  1899   55
##       Yes   31   15
```

To learn more about these, please see *Section 4.4.3* in An Introduction to Statistical Learning, which discusses confusion matrices in the context of another classification method, linear discriminant analysis (LDA).

**What would this confusion matrix look like if the classification were perfect?**

**Make a confusion matrix for the 5-nn model and compare it to that of the 2-nn model. What do you conclude?**

**Comparing performance becomes easier when obtaining more specific measures. Calculate the specificity, sensitivity, accuracy and the precision.**
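As a worked illustration, the sketch below computes the four measures from the 2-nn confusion matrix printed earlier, treating `Yes` (default) as the positive class. That choice of positive class is an assumption; your own counts will differ depending on the random split.

```r
# Rebuild the printed 2-nn confusion matrix (rows = predicted, columns = true)
conf_2NN <- as.table(matrix(
  c(1899, 31, 55, 15),
  nrow = 2,
  dimnames = list(predicted = c("No", "Yes"), true = c("No", "Yes"))
))

TN <- conf_2NN["No", "No"]    # predicted No,  truly No
FN <- conf_2NN["No", "Yes"]   # predicted No,  truly Yes
FP <- conf_2NN["Yes", "No"]   # predicted Yes, truly No
TP <- conf_2NN["Yes", "Yes"]  # predicted Yes, truly Yes

sensitivity <- TP / (TP + FN)          # 15 / 70     (about 0.214)
specificity <- TN / (TN + FP)          # 1899 / 1930 (about 0.984)
accuracy    <- (TP + TN) / sum(conf_2NN)  # 1914 / 2000 = 0.957
precision   <- TP / (TP + FP)          # 15 / 46     (about 0.326)

round(c(sensitivity = sensitivity, specificity = specificity,
        accuracy = accuracy, precision = precision), 3)
```

The high accuracy mostly reflects the large share of non-defaulters; the low sensitivity shows that most actual defaulters are missed.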

**Logistic regression**

KNN directly predicts the class of a new observation using a majority vote of the existing observations closest to it. In contrast to this, logistic regression predicts the `log-odds` of belonging to category 1. These log-odds can then be transformed to probabilities by performing an inverse logit transform:

$$p = \frac{1}{1 + e^{-\alpha}}$$

where α indicates log-odds for being in class 1 and *p* is the probability.
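The transform is easy to check numerically. In the sketch below, `inv_logit()` is a helper written for this illustration; base R's `plogis()` computes the same function.

```r
# Inverse logit: maps log-odds (alpha) to a probability p
inv_logit <- function(alpha) 1 / (1 + exp(-alpha))

inv_logit(0)  # 0.5: log-odds of 0 means both classes are equally likely

# Compare with the built-in logistic distribution function
log_odds <- c(-4, -1, 0, 1, 4)
all.equal(inv_logit(log_odds), plogis(log_odds))  # TRUE
```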

Therefore, logistic regression is a `probabilistic` classifier as opposed to a `direct` classifier such as KNN: indirectly, it outputs a probability which can then be used in conjunction with a cutoff (usually 0.5) to classify new observations.

Logistic regression in `R` is done with the `glm()` function, which stands for generalized linear model. Here we have to indicate that the outcome is modelled not with a Gaussian (normal) distribution but with a `binomial` distribution.

**Use `glm()` with argument `family = binomial` to fit a logistic regression model `lr_mod` to the `default_train` data.**

Now that we have fitted a model, we can use the `predict()` method to output the estimated probabilities for each point in the training dataset. By default, `predict()` outputs the log-odds, but we can transform these back using the inverse logit function from before or by setting the argument `type = "response"` within the `predict()` call.
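The sketch below fits one possible version of the model and shows that both routes to the probabilities agree. The model formula with all three predictors, the seed, and the helper names are assumptions for this illustration.

```r
library(ISLR)  # provides the Default data

set.seed(45)  # arbitrary seed, assumed here only for reproducibility

# Recreate the dummy coding and the 80/20 split (one possible version)
default_df <- Default
default_df$student <- ifelse(default_df$student == "Yes", 1, 0)
train_rows <- sample(nrow(default_df), size = round(0.8 * nrow(default_df)))
default_train <- default_df[train_rows, ]

# Logistic regression: the second factor level ("Yes") is modelled as class 1
lr_mod <- glm(default ~ student + balance + income,
              family = binomial, data = default_train)

log_odds <- predict(lr_mod)                     # default: link scale (log-odds)
probs    <- predict(lr_mod, type = "response")  # probability scale

# Applying the inverse logit to the log-odds gives the same probabilities
all.equal(plogis(log_odds), probs)
```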

**Visualise the predicted probabilities versus observed class for the training dataset in `lr_mod`. You can choose for yourself which type of visualisation you would like to make. Write down your interpretations along with your plot.**

Another advantage of logistic regression is that we get coefficients we can interpret.

**Look at the coefficients of the `lr_mod` model and interpret the coefficient for `balance`. What would the probability of default be for a person who is not a student, has an income of 40000, and a balance of 3000 dollars at the end of each month? Is this what you expect based on the plots we’ve made before?**
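One way to approach this is sketched below: the probability is computed once by hand from the coefficients and once with `predict()`. The model formula, the seed, and names such as `new_person` are assumptions for this illustration, and the exact numbers depend on the random split.

```r
library(ISLR)  # provides the Default data

set.seed(45)  # arbitrary seed, assumed here only for reproducibility

# Recreate the dummy coding, the split, and the model (one possible version)
default_df <- Default
default_df$student <- ifelse(default_df$student == "Yes", 1, 0)
train_rows <- sample(nrow(default_df), size = round(0.8 * nrow(default_df)))
default_train <- default_df[train_rows, ]
lr_mod <- glm(default ~ student + balance + income,
              family = binomial, data = default_train)

b <- coef(lr_mod)
b

# exp(coefficient) is the multiplicative change in the odds of default
# for each extra dollar of balance, holding the other variables fixed
exp(b["balance"])

# By hand: linear predictor (log-odds), then the inverse logit
log_odds_person <- b["(Intercept)"] + b["student"] * 0 +
  b["balance"] * 3000 + b["income"] * 40000
prob_by_hand <- plogis(log_odds_person)

# The same via predict()
new_person <- data.frame(student = 0, balance = 3000, income = 40000)
prob_predict <- predict(lr_mod, newdata = new_person, type = "response")

c(by_hand = unname(prob_by_hand), predict = unname(prob_predict))
```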

Let’s visualise the effect `balance` has on the predicted default probability.

**Create a data frame called `balance_df` with 3 columns and 500 rows: `student` always 0, `balance` ranging from 0 to 3000, and `income` always the mean income in the `default_train` dataset.**

**Use this dataset as the `newdata` in a `predict()` call using `lr_mod` to output the predicted probabilities for different values of `balance`. Then create a plot with the `balance_df$balance` variable mapped to x and the predicted probabilities mapped to y. Is this in line with what you expect?**
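The two steps above could be sketched as follows. The model formula, the seed, and the names `plot_df` and `predprob` are assumptions for this illustration.

```r
library(ISLR)     # provides the Default data
library(ggplot2)  # plotting (also loaded by the tidyverse)

set.seed(45)  # arbitrary seed, assumed here only for reproducibility

# Recreate the dummy coding, the split, and the model (one possible version)
default_df <- Default
default_df$student <- ifelse(default_df$student == "Yes", 1, 0)
train_rows <- sample(nrow(default_df), size = round(0.8 * nrow(default_df)))
default_train <- default_df[train_rows, ]
lr_mod <- glm(default ~ student + balance + income,
              family = binomial, data = default_train)

# 500 hypothetical non-students with mean income and increasing balance
balance_df <- data.frame(
  student = rep(0, 500),
  balance = seq(0, 3000, length.out = 500),
  income  = rep(mean(default_train$income), 500)
)

# Predicted default probability for each hypothetical person
plot_df <- cbind(
  balance_df,
  predprob = predict(lr_mod, newdata = balance_df, type = "response")
)

ggplot(plot_df, aes(x = balance, y = predprob)) +
  geom_line() +
  labs(x = "Balance", y = "Predicted probability of default")
```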

**Create a confusion matrix just as the one for the KNN models by using a cutoff predicted probability of 0.5. Does logistic regression perform better?**
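A minimal sketch of this step, evaluated on the validation set so that it is comparable to the KNN tables. The model formula, the seed, and the names `lr_prob`, `lr_pred` and `conf_lr` are assumptions for this illustration.

```r
library(ISLR)  # provides the Default data

set.seed(45)  # arbitrary seed, assumed here only for reproducibility

# Recreate the dummy coding, the split, and the model (one possible version)
default_df <- Default
default_df$student <- ifelse(default_df$student == "Yes", 1, 0)
train_rows <- sample(nrow(default_df), size = round(0.8 * nrow(default_df)))
default_train <- default_df[train_rows, ]
default_valid <- default_df[-train_rows, ]
lr_mod <- glm(default ~ student + balance + income,
              family = binomial, data = default_train)

# Predicted probabilities for the validation set
lr_prob <- predict(lr_mod, newdata = default_valid, type = "response")

# Apply the 0.5 cutoff; fixing the levels keeps the table 2 x 2
# even if one class is never predicted
lr_pred <- factor(ifelse(lr_prob > 0.5, "Yes", "No"), levels = c("No", "Yes"))

conf_lr <- table(predicted = lr_pred, true = default_valid$default)
conf_lr
```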

**Calculate the specificity, sensitivity, accuracy and the precision for the logistic regression using the above confusion matrix. Again, compare the logistic regression to KNN.**

**Final Exercise**

Now let’s do another - slightly less guided - round of KNN and/or logistic regression on a new dataset in order to predict the outcome for a specific case. We will use the Titanic dataset also discussed in the lecture. The data can be found in the `/data` folder of your project. Before creating a model, explore the data, for example by using `summary()`.

**Create a model (using knn or logistic regression) to predict whether a 14 year old boy from the 3rd class would have survived the Titanic disaster.**

**Would the passenger have survived if they were a 14 year old girl in 2nd class?**