# Linear regression for data science

Introduction

In this practical, you will learn how to handle many variables with regression by using variable selection techniques, shrinkage techniques, and how to tune hyper-parameters for these techniques. This practical has been derived from chapter 6 of ISLR. In addition, you will need for loops (see also the Basics: For Loops tab on the course website under week 3) and data manipulation techniques from Dplyr.

One of the packages we are going to use is `glmnet`. For this, you will probably need to `install.packages("glmnet")` before running the `library()` functions.

``````library(ISLR)
library(glmnet)
library(tidyverse)``````

Best subset selection

Our goal for today is to to predict `Salary` from the `Hitters` dataset from the `ISLR` package. First, we will prepare a dataframe `baseball` from the `Hitters` dataset where you remove the baseball players for which the `Salary` is missing. Use the following code:

``baseball <- Hitters %>% filter(!is.na(Salary))``

We can check how many baseball players are left using:

``nrow(baseball)``
``##  263``

1. Create `baseball_train` (50%), `baseball_valid` (30%), and `baseball_test` (20%) datasets.

Hint: In addition to once again using the `dplyr` function `filter()` to specify which players you want in the new data frame, consider how you can split the amount of players into three groups of these specified sizes.

``````split <- c(rep("train", 132), rep("valid", 79), rep("test",  52))
baseball <- baseball %>% mutate(split = sample(split))

baseball_train <- baseball %>% filter(split == "train")
baseball_valid <- baseball %>% filter(split == "valid")
baseball_test  <- baseball %>% filter(split == "test")``````

1. Using your knowledge of `ggplot` from lab 2, plot the salary information of the train, validate and test groups using `geom_histogram()` or `geom_density()`
``````ggplot() +
geom_histogram(data = baseball_train, mapping = aes(x = Salary), alpha = 0.3, colour = "Blue") +
geom_histogram(data = baseball_valid, mapping = aes(x = Salary), alpha = 0.3, colour = "Red") +
geom_histogram(data = baseball_test, mapping = aes(x = Salary), alpha = 0.3, colour = "Orange")  ``````
``````## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`````` ``````ggplot() +
geom_density(data = baseball_train, mapping = aes(x = Salary), alpha = 0.3, colour = "Blue") +
geom_density(data = baseball_valid, mapping = aes(x = Salary), alpha = 0.3, colour = "Red") +
geom_density(data = baseball_test, mapping = aes(x = Salary), alpha = 0.3, colour = "Orange")  `````` We will use the following function which we called `lm_mse()` to obtain the mse on the validation dataset for predictions from a linear model:

``````lm_mse <- function(formula, train_data, valid_data) {
y_name <- as.character(formula)
y_true <- valid_data[[y_name]]

lm_fit <- lm(formula, train_data)
y_pred <- predict(lm_fit, newdata = valid_data)

mean((y_true - y_pred)^2)
}``````

Note that the input consists of (1) a formula, (2) a training dataset, and (3) a test dataset.

1. Try out the function with the formula `Salary ~ Hits + Runs`, using `baseball_train` and `baseball_valid`.

``lm_mse(Salary ~ Hits + Runs, baseball_train, baseball_valid)``
``##  142631.1``

We have pre-programmed a function for you to generate a character vector for all formulas with a set number of `p` variables. You can load the function into your environment by sourcing the `.R` file it is written in:

``source("generate_formulas.R")``

You can use it like so:

1. Create a character vector of all predictor variables from the `Hitters` dataset. `colnames()` may be of help. Note that `Salary` is not a predictor variable.

``````x_vars <- colnames(Hitters)
x_vars <- x_vars[x_vars != "Salary"]``````

1. Using the function `generate_formulas()`, generate all formulas with as outcome `Salary` and 3 predictors from the `Hitters` data. Assign this to a variable called `formulas`. There should be 969 elements in this vector.

``````formulas <- generate_formulas(p = 3, x_vars = x_vars, y_var = "Salary")
length(formulas)``````
``##  969``

1. Use a `for loop` to find the best set of 3 predictors in the `Hitters` dataset based on MSE. Use the `baseball_train` and `baseball_valid` datasets.

When creating the `for loop`, use the function `as.formula()` from the stats package to loop over all the equations contained in `formulas`. `as.formula()` transforms the characters of the input to a formula, so we can actually use it as a formula in our code.

To select the best formula with the best MSE, use the function `which.min()`, which presents the lowest value from the list provided.

``````# Initialise a vector we will fill with MSE values
mses <- rep(0, 969)

# loop over all the formulas
for (i in 1:969) {
mses[i] <- lm_mse(as.formula(formulas[i]), baseball_train, baseball_valid)
}

# select the formula with the lowest MSE
best_3_preds <- formulas[which.min(mses)]``````

1. Do the same for 1, 2 and 4 predictors. Now select the best model from the models with the best set of 1, 2, 3, or 4 predictors in terms of its out-of-sample MSE

``````# Generate formulas
formulas_1 <- generate_formulas(p = 1, x_vars = x_vars, y_var = "Salary")
formulas_2 <- generate_formulas(p = 2, x_vars = x_vars, y_var = "Salary")
formulas_4 <- generate_formulas(p = 4, x_vars = x_vars, y_var = "Salary")

# Initialise a vector we will fill with MSE values
mses_1 <- rep(0, length(formulas_1))
mses_2 <- rep(0, length(formulas_2))
mses_4 <- rep(0, length(formulas_4))

# loop over all the formulas
for (i in 1:length(formulas_1)) {
mses_1[i] <- lm_mse(as.formula(formulas_1[i]), baseball_train, baseball_valid)
}

for (i in 1:length(formulas_2)) {
mses_2[i] <- lm_mse(as.formula(formulas_2[i]), baseball_train, baseball_valid)
}

for (i in 1:length(formulas_4)) {
mses_4[i] <- lm_mse(as.formula(formulas_4[i]), baseball_train, baseball_valid)
}

# Compare mses
min(mses_1)
min(mses_2)
min(mses)
min(mses_4)

# min(mses_4) is lowest of them all!
# So let's see which model that is

formulas_4[which.min(mses_4)]``````
``````##  111606.5
##  102359.1
##  97916.01
##  97461.05
##  "Salary ~ Years + CAtBat + CHits + PutOuts"``````

1. Calculate the test MSE for the model with the best number of predictors.

``````# Estimate model and calculate mse
lm_best <- lm(Salary ~ Walks + CAtBat + CHits + CRBI, data = baseball %>% filter(split != "test"))
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)
mse(baseball_test\$Salary, predict(lm_best, newdata = baseball_test))``````
``##  103458.7``

1. Using the model with the best number of predictors, create a plot comparing predicted values (mapped to x position) versus observed values (mapped to y position) of `baseball_test`.

``````# create a plot
tibble(
y_true = baseball_test\$Salary,
y_pred = predict(lm_best, newdata = baseball_test)
) %>%
ggplot(aes(x = y_pred, y = y_true)) +
geom_abline(slope = 1, intercept = 0, lty = 2) +
geom_point() +
theme_minimal()`````` Through enumerating all possibilities, we have selected the best subset of at most 4 non-interacting predictors for the prediction of baseball salaries. This method works well for few predictors, but the computational cost of enumeration increases quickly to the point where it is not feasible to enumerate all combinations of variables: Regularization with glmnet

`glmnet` is a package that implements efficient (quick!) algorithms for LASSO and ridge regression, among other things.

1. Skim through the help file of `glmnet`. We are going to perform a linear regression with normal (gaussian) error terms. What format should our data be in?

``````# We need to input a predictor matrix x and a response (outcome) variable y,
# as well as a family = "gaussian" ``````

Again, we will try to predict baseball salary, this time using all the available variables and using the LASSO penalty to perform subset selection. For this, we first need to generate an input matrix.

1. First generate the input matrix using (a variation on) the following code. Remember that the “.” in a formula means “all available variables”. Make sure to check that this `x_train` looks like what you would expect.

``````x_train <- model.matrix(Salary ~ ., data = baseball_train %>% select(-split))
``````##   (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1           1   315   81     7   24  38    39    14   3449   835     69   321
## 2           1   496  141    20   65  78    37    11   5628  1575    225   828
## 3           1   321   87    10   39  42    30     2    396   101     12    48
## 4           1   594  169     4   74  51    35    11   4408  1133     19   501
## 5           1   185   37     1   23   8    21     2    214    42      1    30
## 6           1   298   73     0   24  24     7     3    509   108      0    41
##   CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  414    375       1         1     632      43     10          1
## 2  838    354       1         0     200      11      3          1
## 3   46     33       1         0     805      40      4          1
## 4  336    194       0         1     282     421     25          0
## 5    9     24       1         0      76     127      7          0
## 6   37     12       0         1     121     283      9          0``````

The `model.matrix()` function takes a dataset and a formula and outputs the predictor matrix where the categorical variables have been correctly transformed into dummy variables, and it adds an intercept. It is used internally by the `lm()` function as well!

1. Using `glmnet()`, perform a LASSO regression with the generated `x_train` as the predictor matrix and `Salary` as the response variable. Set the `lambda` parameter of the penalty to 15. NB: Remove the intercept column from the `x_matrix``glmnet` adds an intercept internally.

``````result <- glmnet(x      = x_train[, -1],          # X matrix without intercept
y      = baseball_train\$Salary,  # Salary as response
family = "gaussian",             # Normally distributed errors
alpha  = 1,                      # LASSO penalty
lambda = 15)                     # Penalty value``````

1. The coefficients for the variables are in the `beta` element of the list generated by the `glmnet()` function. Which variables have been selected? You may use the `coef()` function.

``rownames(coef(result))[which(coef(result) != 0)]``
``````##   "(Intercept)" "Hits"        "Runs"        "Walks"       "CHmRun"
##   "CRuns"       "CRBI"        "DivisionW"   "PutOuts"     "Assists"
##  "NewLeagueN"``````

1. Create a predicted versus observed plot for the model you generated with the `baseball_valid` data. Use the `predict()` function for this! What is the MSE on the validation set?

``````x_valid <- model.matrix(Salary ~ ., data = baseball_valid %>% select(-split))[, -1]
y_pred <- as.numeric(predict(result, newx = x_valid))

tibble(Predicted = y_pred, Observed = baseball_valid\$Salary) %>%
ggplot(aes(x = Predicted, y = Observed)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, lty = 2) +
theme_minimal() +
labs(title = "Predicted versus observed salary")`````` ``mse(baseball_valid\$Salary, y_pred)``
``##  131955.2``

Tuning lambda

Like many methods of analysis, regularized regression has a tuning parameter. In the previous section, we’ve set this parameter to 15. The `lambda` parameter changes the strength of the shrinkage in `glmnet()`. Changing the tuning parameter will change the predictions, and thus the MSE. In this section, we will select the tuning parameter based on out-of-sample MSE.

1. Fit a LASSO regression model on the same data as before, but now do not enter a specific `lambda` value. What is different about the object that is generated? Hint: use the `coef()` and `plot()` methods on the resulting object.

``````result_nolambda <- glmnet(x = x_train[, -1], y = baseball_train\$Salary,
family = "gaussian", alpha  = 1)

# This object contains sets of coefficients for different values of lambda,
# i.e., different models ranging from an intercept-only model (very high
# lambda) to almost no shrinkage (very low lambda).

plot(result_nolambda)`````` For deciding which value of lambda to choose, we could work similarly to what we have don in the best subset selection section before. However, the `glmnet` package includes another method for this task: cross validation.

1. Use the `cv.glmnet` function to determine the `lambda` value for which the out-of-sample MSE is lowest using 15-fold cross validation. As your dataset, you may use the training and validation sets bound together with bind_rows(). What is the best lambda value?

Remember to remove column 21 in your dataset (the mutated column from Question 1. a) using `[ , -21]`, which designated which observations should be split into which groups), as this is not a variable within the Hitters Dataset (which you have called Baseball). And to call the specific data `lambda.min` from the result of using `cv.glmnet`.

``````x_cv <- model.matrix(Salary ~ ., bind_rows(baseball_train, baseball_valid)[, -21])[, -1]
result_cv <- cv.glmnet(x = x_cv, y = c(baseball_train\$Salary, baseball_valid\$Salary), nfolds = 15)
best_lambda <- result_cv\$lambda.min
best_lambda``````
``##  1.774744``

1. Try out the plot() method on this object. What do you see? What does this tell you about the bias-variance tradeoff?

``plot(result_cv)`` ``````# the MSE is high with very small values of lambda (no shrinkage) and
# with very large values of lambda (intercept-only model).

# introducing a bit of bias lowers the variance relatively strongly
# (fewer variables in the model) and therefore the MSE is reduced.``````

It should be noted, that for all these previous exercises they can also be completed using the Ridge Method which is not covered in much depth during this practical session. To learn more about this method please refer back Section 6.2 in the An Introduction to Statistical Learning Textbook.

This last exercise is optional. You can also opt to view the answer when made available and try to understand what is happening in the code.

1. Create a bar plot comparing the test set (baseball_test) MSE of (a) linear regression with all variables, (b) the best subset selection regression model we created, (c) LASSO with lambda set to 50, and (d) LASSO with cross-validated lambda. As training dataset, use the rows in both the `baseball_train` and `baseball_valid`

``````# create this new training dataset and the test dataset
train_data <- bind_rows(baseball_train, baseball_valid)[, -21]
x_test <- model.matrix(Salary ~ ., data = baseball_test %>% select(-split))[, -1]

# generate predictions from the models
y_pred_ols <- predict(lm(Salary ~ ., data = train_data), newdata = baseball_test)
y_pred_sub <- predict(lm(Salary ~ Runs + CHits + Division + PutOuts, data = train_data),
newdata = baseball_test)
# these two use x_cv from the previous exercises
y_pred_las <- as.numeric(predict(glmnet(x_cv, train_data\$Salary, lambda = 50), newx = x_test))
y_pred_cv  <- as.numeric(predict(result_cv, newx = x_test, s = best_lambda))

# Calculate MSEs
mses <- c(
mse(baseball_test\$Salary, y_pred_ols),
mse(baseball_test\$Salary, y_pred_sub),
mse(baseball_test\$Salary, y_pred_las),
mse(baseball_test\$Salary, y_pred_cv)
)

# Create a plot
tibble(Method = as_factor(c("lm", "subset", "lasso", "cv_las")), MSE = mses) %>%
ggplot(aes(x = Method, y = MSE, fill = Method)) +
geom_bar(stat = "identity", col = "black") +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Comparison of test set MSE for different prediction methods") +
scale_fill_viridis_d() # different colour scale`````` 