During this practical, two different classification methods will be covered: K-nearest neighbours and logistic regression.
One of the packages we are going to use is class. For this, you will probably need to
install.packages("class") before running the
library(MASS) library(class) library(ISLR) library(tidyverse)
This practical will be mainly based around the
default dataset which contains credit card loan data for 10 000 people. With the goal being to classify credit card cases as
no based on whether they will default on their loan.
balanceis mapped to the x position,
incomeis mapped to the y position, and
defaultis mapped to the colour. Can you see any interesting patterns already?
Default %>% arrange(default) %>% # so the yellow dots are plotted after the blue ones ggplot(aes(x = balance, y = income, colour = default)) + geom_point(size = 1.3) + theme_minimal() + scale_colour_viridis_d() # optional custom colour scale
# People with high remaining balance are more likely to default. # There seems to be a low-income group and a high-income group
facet_grid(cols = vars(student))to the plot. What do you see?
Default %>% arrange(default) %>% # so the yellow dots are plotted after the blue ones ggplot(aes(x = balance, y = income, colour = default)) + geom_point(size = 1.3) + theme_minimal() + scale_colour_viridis_d() + facet_grid(cols = vars(student))
# The low-income group is students!
ifelse()(0 = not a student, 1 = student). Then, randomly split the Default dataset into a training set
default_train(80%) and a validation set
If you haven’t used the function
ifelse() before, please feel free to review it in Chapter 5 Control Flow (particular section 5.2.2) in Hadley Wickham’s Book Advanced R, this provides a concise overview of choice functions (
if()) and vectorised if (
default_df <- Default %>% mutate(student = ifelse(student == "Yes", 1, 0)) %>% mutate(split = sample(rep(c("train", "valid"), times = c(8000, 2000)))) default_train <- default_df %>% filter(split == "train") %>% select(-split) default_valid <- default_df %>% filter(split == "valid") %>% select(-split)
Now that we have explored the dataset, we can start on the task of classification. We can imagine a credit card company wanting to predict whether a customer will default on the loan so they can take steps to prevent this from happening.
The first method we will be using is k-nearest neighbours (KNN). It classifies datapoints based on a majority vote of the k points closest to it. In
class package contains a
knn() function to perform knn.
income(but no basis functions of those variables) in the
default_traindataset. Set k to 5. Store the predictions in a variable called
Remember: make sure to review the
knn() function through the help panel on the GUI or through typing “?knn” into the console. For further guidance on the
knn() function, please see Section 4.6.5 in An introduction to Statistical Learning
knn_5_pred <- knn( train = default_train %>% select(-default), test = default_valid %>% select(-default), cl = as_factor(default_train$default), k = 5 )
default) mapped to the colour aesthetic, and one with the predicted class (
knn_5_pred) mapped to the colour aesthetic. Hint: Add the predicted class
default_validdataset before starting your
ggplot()call of the second plot. What do you see?
# first plot is the same as before default_valid %>% arrange(default) %>% ggplot(aes(x = balance, y = income, colour = default)) + geom_point(size = 1.3) + scale_colour_viridis_d() + theme_minimal() + labs(title = "True class")
# second plot maps pred to colour bind_cols(default_valid, pred = knn_5_pred) %>% arrange(default) %>% ggplot(aes(x = balance, y = income, colour = pred)) + geom_point(size = 1.3) + scale_colour_viridis_d() + theme_minimal() + labs(title = "Predicted class (5nn)")
# there are quite some misclassifications: many "No" predictions # with "Yes" true class and vice versa.
knn_2_predvector generated from a 2-nearest neighbours algorithm. Are there any differences?
knn_2_pred <- knn( train = default_train %>% select(-default), test = default_valid %>% select(-default), cl = as_factor(default_train$default), k = 2 ) # second plot maps pred to colour bind_cols(default_valid, pred = knn_2_pred) %>% arrange(default) %>% ggplot(aes(x = balance, y = income, colour = pred)) + geom_point(size = 1.3) + scale_colour_viridis_d() + theme_minimal() + labs(title = "Predicted class (2nn)")
# compared to the 5-nn model, more people get classified as "Yes" # Still, the method is not perfect
During this we have manually tested two different values for K, this although useful in exploring your data. To know the optimal value for K, you should use cross validation.
The confusion matrix is an insightful summary of the plots we have made and the correct and incorrect classifications therein. A confusion matrix can be made in
R with the
table() function by entering two
conf_2NN <- table(predicted = knn_2_pred, true = default_valid$default) conf_2NN
## true ## predicted No Yes ## No 1899 55 ## Yes 31 15
To learn more these, please see Section 4.4.3 in An Introduction to Statistical Learning, where it discusses Confusion Matrices in the context of another classification method Linear Discriminant Analysis (LDA).
# All the observations would fall in the yes-yes or no-no categories; # the off-diagonal elements would be 0 like so: table(predicted = default_valid$default, true = default_valid$default)
## true ## predicted No Yes ## No 1930 0 ## Yes 0 70
conf_5NN <- table(predicted = knn_5_pred, true = default_valid$default) conf_5NN
## true ## predicted No Yes ## No 1922 61 ## Yes 8 9
# the 2nn model has more true positives (yes-yes) but also more false # positives (truly no but predicted yes). Overall the 5nn method has # slightly better accuracy (proportion of correct classifications).
spec_2NN <- conf_2NN[1,1] / (conf_2NN[1,1] + conf_2NN[2,1]) spec_2NN
##  0.9839378
spec_5NN <- conf_5NN[1,1] / (conf_5NN[1,1] + conf_5NN[2,1]) spec_5NN
##  0.9958549
sens_2NN <- conf_2NN[2,2] / (conf_2NN[2,2] + conf_2NN[1,2]) sens_2NN
##  0.2142857
sens_5NN <- conf_5NN[2,2] / (conf_5NN[2,2] + conf_5NN[1,2]) sens_5NN
##  0.1285714
acc_2NN <- (conf_2NN[1,1] + conf_2NN[2,2]) / sum(conf_2NN) acc_2NN
##  0.957
acc_5NN <- (conf_5NN[1,1] + conf_5NN[2,2]) / sum(conf_5NN) acc_5NN
##  0.9655
prec_2NN <- conf_2NN[2,2] / (conf_2NN[2,2] + conf_2NN[2,1]) prec_2NN
##  0.326087
prec_5NN <- conf_5NN[2,2] / (conf_5NN[2,2] + conf_5NN[2,1]) prec_5NN
##  0.5294118
# The 5NN model has better specificity, but worse sensitivity. However, the overall accuracy of the 5NN model is (slightly) better. When we look at the precision, the 5NN model performs a lot better compared to the 2NN model.
KNN directly predicts the class of a new observation using a majority vote of the existing observations closest to it. In contrast to this, logistic regression predicts the
log-odds of belonging to category 1. These log-odds can then be transformed to probabilities by performing an inverse logit transform:
p = 1⁄(1 + ℇ -α)
where α indicates log-odds for being in class 1 and p is the probability.
Therefore, logistic regression is a
probabilistic classifier as opposed to a
direct classifier such as KNN: indirectly, it outputs a probability which can then be used in conjunction with a cutoff (usually 0.5) to classify new observations.
Logistic regression in
R happens with the
glm() function, which stands for generalized linear model. Here we have to indicate that the residuals are modeled not as a Gaussian (normal distribution), but as a
family = binomialto fit a logistic regression model
lr_mod <- glm(default ~ ., family = binomial, data = default_train)
Now we have generated a model, we can use the
predict() method to output the estimated probabilities for each point in the training dataset. By default
predict outputs the log-odds, but we can transform it back using the inverse logit function of before or setting the argument
type = "response" within the predict function.
lr_mod. You can choose for yourself which type of visualisation you would like to make. Write down your interpretations along with your plot.
tibble(observed = default_train$default, predicted = predict(lr_mod, type = "response")) %>% ggplot(aes(y = predicted, x = observed, colour = observed)) + geom_point(position = position_jitter(width = 0.2), alpha = .3) + scale_colour_manual(values = c("dark blue", "orange"), guide = "none") + theme_minimal() + labs(y = "Predicted probability to default")
# I opted for a raw data display of all the points in the test set. Here, # we can see that the defaulting category has a higher average probability # for a default compared to the "No" category, but there are still data # points in the "No" category with high predicted probability for defaulting.
Another advantage of logistic regression is that we get coefficients we can interpret.
lr_modmodel and interpret the coefficient for
balance. What would the probability of default be for a person who is not a student, has an income of 40000, and a balance of 3000 dollars at the end of each month? Is this what you expect based on the plots we’ve made before?
coefs <- coef(lr_mod) coefs["balance"]
## balance ## 0.005672977
# The higher the balance, the higher the log-odds of defaulting. Precisely: # Each dollar increase in balance increases the log-odds by 0.0058. # Let's calculate the log-odds for our person logodds <- coefs + 4e4*coefs + 3e3*coefs # Let's convert this to a probability 1 / (1 + exp(-logodds))
## (Intercept) ## 0.9982497
# probability of .998 of defaulting. This is in line with the plots of before # because this new data point would be all the way on the right.
Let’s visualise the effect
balance has on the predicted default probability.
balance_dfwith 3 columns and 500 rows:
balanceranging from 0 to 3000, and
incomealways the mean income in the
balance_df <- tibble( student = rep(0, 500), balance = seq(0, 3000, length.out = 500), income = rep(mean(default_train$income), 500) )
lr_modto output the predicted probabilities for different values of
balance. Then create a plot with the
balance_df$balancevariable mapped to x and the predicted probabilities mapped to y. Is this in line with what you expect?
balance_df$predprob <- predict(lr_mod, newdata = balance_df, type = "response") balance_df %>% ggplot(aes(x = balance, y = predprob)) + geom_line(col = "dark blue", size = 1) + theme_minimal()
# Just before 2000 in the first plot is where the ratio of # defaults to non-defaults is 50-50. So this line is exactly what we expect!
pred_prob <- predict(lr_mod, newdata = default_valid, type = "response") pred_lr <- factor(pred_prob > .5, labels = c("No", "Yes")) conf_logreg <- table(predicted = pred_lr, true = default_valid$default) conf_logreg
## true ## predicted No Yes ## No 1925 47 ## Yes 5 23
# logistic regression performs better in every way than knn. This depends on # your random split so your mileage may vary
spec_logreg <- conf_logreg[1,1] / (conf_logreg[1,1] + conf_logreg[2,1]) spec_logreg
##  0.9974093
sens_logreg <- conf_logreg[2,2] / (conf_logreg[2,2] + conf_logreg[1,2]) sens_logreg
##  0.3285714
acc_logreg <- (conf_logreg[1,1] + conf_logreg[2,2]) / sum(conf_logreg) acc_logreg
##  0.974
prec_logreg <- conf_logreg[2,2] / (conf_logreg[2,2] + conf_logreg[2,1]) prec_logreg
##  0.8214286
# Now we can very clearly see that logisitc regression performs a lot better compared to KNN, especially the increase in precision is impressive!
Now let’s do another - slightly less guided - round of KNN and/or logistic regression on a new dataset in order to predict the outcome for a specific case. We will use the Titanic dataset also discussed in the lecture. The data can be found in the
/data folder of your project. Before creating a model, explore the data, for example by using
titanic <- read_csv("data/Titanic.csv")
## Parsed with column specification: ## cols( ## Name = col_character(), ## PClass = col_character(), ## Age = col_double(), ## Sex = col_character(), ## Survived = col_double() ## )
# I'll do a logistic regression with all interactions lr_mod_titanic <- glm(Survived ~ PClass * Sex * Age, data = titanic) predict(lr_mod_titanic, newdata = tibble( PClass = c( "3rd", "2nd"), Age = c( 14, 14), Sex = c("male", "female") ), type = "response" )
## 1 2 ## 0.2215689 0.9230483
# So our hypothetical passenger does not have a large survival probability: # our model would classify the boy as not surviving. The girl would likely # survive however. This is due to the women and children getting preferred # access to the lifeboats. Also 3rd class was way below deck.