# Moving beyond linearity

Introduction

In this practical, we will learn about nonlinear extensions to regression using basis functions and how to create, visualise, and interpret them. Parts of it are adapted from the practicals in ISLR chapter 7.

One of the packages we are going to use is `splines`. For this, you will probably need to `install.packages("splines")` before running the `library()` functions, however depending on when you installed `R`, it may be a base function within your library.

Additionally this practical will use lots of functions from the `tidyverse` package `dplyr`. Therefore, if you are struggling with there usage, please review the guidance in the dplyr basics tab.

``````library(MASS)
library(splines)
library(ISLR)
library(tidyverse)``````

Median housing prices in Boston do not have a linear relation with the proportion of low SES households. Today we are going to focus exclusively on prediction.

``````Boston %>%
ggplot(aes(x = lstat, y = medv)) +
geom_point() +
theme_minimal()`````` First, we need a way of visualising the predictions. The below function `pred_plot()` does this for us: the function `pred_plot()` takes as input an `lm` object, which outputs the above plot but with a prediction line generated from the model object using the `predict()` method.

``````pred_plot <- function(model) {
x_pred <- seq(min(Boston\$lstat), max(Boston\$lstat), length.out = 500)
y_pred <- predict(model, newdata = tibble(lstat = x_pred))

Boston %>%
ggplot(aes(x = lstat, y = medv)) +
geom_point() +
geom_line(data = tibble(lstat = x_pred, medv = y_pred), size = 1, col = "blue") +
theme_minimal()
}``````
``````pred_plot <- function(model) {
# First create predictions for all values of lstat
# create the x-values for which we should predict the expected housing prices
x_pred <- seq(min(Boston\$lstat), max(Boston\$lstat), length.out = 500)
# predict the expected housing prices for the values of x using the lm model we gave as input to the model
y_pred <- predict(model, newdata = tibble(lstat = x_pred))

# Then create a ggplot object with a line based on those predictions
# use Boston data
Boston %>%
# from the boston data, put SES on the x-axis and housing prices on the y-axis
ggplot(aes(x = lstat, y = medv)) +
# use a scatterplot to visualize the relationship between SES and housing prices
geom_point() +
# add a prediction line to the plot using y_pred obtained above
geom_line(data = tibble(lstat = x_pred, medv = y_pred), size = 1, col = "blue") + #
# use nice theme for our plot
theme_minimal()
}``````

1. Create a linear regression object called `lin_mod` which models `medv` as a function of `lstat`. Check if the prediction plot works by running `pred_plot(lin_mod)`. Do you see anything out of the ordinary with the predictions?

``````lin_mod <- lm(medv ~ lstat, data = Boston)
pred_plot(lin_mod)`````` ``# the predicted median housing value is below 0 for high values.``

Polynomial regression

The first extension to linear regression is polynomial regression, with basis functions βj • x1j (ISLR, p. 270).

1. Create another linear model `pn3_mod`, where you add the second and third-degree polynomial terms `I(lstat^2)` and `I(lstat^3)` to the formula. Create a `pred_plot()` with this model.

``````pn3_mod <- lm(medv ~ lstat + I(lstat^2) + I(lstat^3), data = Boston)
pred_plot(pn3_mod)`````` The function `poly()` can automatically generate a matrix which contains columns with polynomial basis function outputs.

1. Play around with the poly() function. What output does it generate with the arguments `degree = 3` and `raw = TRUE`?

``poly(1:5, degree = 3, raw = TRUE)``
``````##      1  2   3
## [1,] 1  1   1
## [2,] 2  4   8
## [3,] 3  9  27
## [4,] 4 16  64
## [5,] 5 25 125
## attr(,"degree")
##  1 2 3
## attr(,"class")
##  "poly"   "matrix"``````
``# these are the original column (1:5), then squared, and then cubed.``

1. Use the poly() function directly in the model formula to create a 3rd-degree polynomial regression predicting `medv` using `lstat`. Compare the prediction plot to the previous prediction plot you made. What happens if you change the poly() function to `raw = FALSE`?

``````pn3_mod2 <- lm(medv ~ poly(lstat, 3, raw = TRUE), data = Boston)
pred_plot(pn3_mod2)`````` ``````# The plot is exactly the same as the previous prediction plot
# By default, with raw = FALSE, poly() computes an orthogonal polynomial.
# This does not change the fitted values but you can see whether a certain order in the polynomial significantly improves the regression over the lower orders.
# You can check the summary(pn3_mod2).``````

Piecewise constant (Step function)

Another basis function we can use is a step function. For example, we can split the `lstat` variable into two groups based on its median and take the average of these groups to predict `medv`.

1. Create a model called `pw2_mod` with one predictor: `I(lstat <= median(lstat))`. Create a pred_plot with this model. Use the coefficients in `coef(pw2_mod)` to find out what the predicted value for a low-lstat neighbourhood is.

``````pw2_mod <- lm(medv ~ I(lstat <= median(lstat)), data = Boston)
pred_plot(pw2_mod)`````` ``coef(pw2_mod)``
``````##                   (Intercept) I(lstat <= median(lstat))TRUE
##                      16.67747                      11.71067``````
``# the predicted value for low-lstat neighbourhoods is 16.68 + 11.71 = 28.39``

1. Use the `cut()` function in the formula to generate a piecewise regression model called `pw5_mod` that contains 5 equally spaced sections. Again, plot the result using `pred_plot`.

``````pw5_mod <- lm(medv ~ cut(lstat, 5), data = Boston)
pred_plot(pw5_mod)`````` Note that the sections generated by `cut()` are equally spaced in terms of `lstat`, but they do not have equal amounts of data. In fact, the last section has only 9 data points to work with:

``table(cut(Boston\$lstat, 5))``
``````##
## (1.69,8.98] (8.98,16.2] (16.2,23.5] (23.5,30.7]   (30.7,38]
##         183         183          94          37           9``````

1. Optional: Create a piecewise regression model `pwq_mod` where the sections are not equally spaced, but have equal amounts of training data. Hint: use the `quantile()` function.

``````brks <- c(-Inf, quantile(Boston\$lstat, probs = c(.2, .4, .6, .8)), Inf)
pwq_mod <- lm(medv ~ cut(lstat, brks), data = Boston)
pred_plot(pwq_mod)`````` ``table(cut(Boston\$lstat, brks))``
``````##
## (-Inf,6.29] (6.29,9.53] (9.53,13.3] (13.3,18.1] (18.1, Inf]
##         102         101         101         101         101``````

Splines

Using splines, we can combine polyniomials with step functions (as is done in piecewise polynomical regression, see ISLR and the lecture) AND take out the discontinuities.

The `bs()` function from the `splines` package does all the work for us.

1. Create a cubic spline model `bs1_mod` with a knot at the median using the `bs()` function.

hint: If you are not sure how to use the function `bs()`, check out the first example at the bottom of the help file by using `?bs`.

``````bs1_mod <- lm(medv ~ bs(lstat, knots = median(lstat)), data = Boston)
summary(bs1_mod)``````
``````##
## Call:
## lm(formula = medv ~ bs(lstat, knots = median(lstat)), data = Boston)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -13.5106  -3.0547  -0.7488   2.1178  27.1383
##
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         50.085      1.549   32.34   <2e-16 ***
## bs(lstat, knots = median(lstat))1  -24.362      2.331  -10.45   <2e-16 ***
## bs(lstat, knots = median(lstat))2  -31.781      1.927  -16.49   <2e-16 ***
## bs(lstat, knots = median(lstat))3  -44.421      3.488  -12.73   <2e-16 ***
## bs(lstat, knots = median(lstat))4  -35.831      3.092  -11.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.206 on 501 degrees of freedom
## Multiple R-squared:  0.6822, Adjusted R-squared:  0.6796
## F-statistic: 268.8 on 4 and 501 DF,  p-value: < 2.2e-16``````

1. Create a prediction plot from the `bs1_mod` object using the `plot_pred()` function.

``pred_plot(bs1_mod)`` Note that this line fits very well, but at the right end of the plot, the curve slopes up. Theoretically, this is unexpected – always pay attention to which predictions you are making and whether that behaviour is in line with your expectations.

The last extension we will look at is the natural spline. This works in the same way as the cubic spline, with the additional constraint that the function is required to be linear at the boundaries. The `ns()` function from the `splines` package is for generating the basis representation for a natural spline.

1. Create a natural cubic spline model (`ns3_mod`) with 3 degrees of freedom using the `ns()` function. Plot it, and compare it to the `bs1_mod`.

``````ns3_mod <- lm(medv ~ ns(lstat, df = 3), data = Boston)
pred_plot(ns3_mod)`````` ``# Still a good fit but with linear ends.``

1. Plot `lin_mod`, `pn3_mod`, `pw5_mod`, `bs1_mod`, and `ns3_mod` and give them nice titles by adding `+ ggtitle("My title")` to the plot. You may use the function `plot_grid()` from the package `cowplot` to put your plots in a grid.

``````library(cowplot)
plot_grid(
pred_plot(lin_mod) + ggtitle("Linear regression"),
pred_plot(pn3_mod) + ggtitle("Polynomial"),
pred_plot(pw5_mod) + ggtitle("Piecewise constant"),
pred_plot(bs1_mod) + ggtitle("Cubic spline"),
pred_plot(ns3_mod) + ggtitle("Natural spline")
)`````` Programming exercise (optional)

1. Use 12-fold cross validation to determine which of the 5 methods (`lin`, `pn3`, `pw5`, `bs1`, and `ns3`) has the lowest out-of-sample MSE.

``````# first create an mse function
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)

# add a 12 split column to the boston dataset so we can cross-validate
boston_cv <- Boston %>% mutate(split = sample(rep(1:12, length.out = nrow(Boston))))

# prepare an output matrix with 12 slots per method for mse values
output_matrix <- matrix(nrow = 12, ncol = 5)
colnames(output_matrix) <- c("lin", "pn3", "pw5", "bs1", "ns3")

# loop over the splits, run each method, and return the mse values
for (i in 1:12) {
train <- boston_cv %>% filter(split != i)
test  <- boston_cv %>% filter(split == i)

brks <- c(-Inf, 7, 15, 22, Inf)

lin_mod <- lm(medv ~ lstat,                            data = train)
pn3_mod <- lm(medv ~ poly(lstat, 3),                   data = train)
pw5_mod <- lm(medv ~ cut(lstat, brks),                 data = train)
bs1_mod <- lm(medv ~ bs(lstat, knots = median(lstat)), data = train)
ns3_mod <- lm(medv ~ ns(lstat, df = 3),                data = train)

output_matrix[i, ] <- c(
mse(test\$medv, predict(lin_mod, newdata = test)),
mse(test\$medv, predict(pn3_mod, newdata = test)),
mse(test\$medv, predict(pw5_mod, newdata = test)),
mse(test\$medv, predict(bs1_mod, newdata = test)),
mse(test\$medv, predict(ns3_mod, newdata = test))
)
}

# this is the comparison of the methods
colMeans(output_matrix)``````
``````##      lin      pn3      pw5      bs1      ns3
## 38.76726 29.36707 37.95038 27.51021 28.68858``````
``````# we can show it graphically too
tibble(names = as_factor(colnames(output_matrix)),
mse   = colMeans(output_matrix)) %>%
ggplot(aes(x = names, y = mse, fill = names)) +
geom_bar(stat = "identity") +
theme_minimal() +
scale_fill_viridis_d(guide = "none") +
labs(
x     = "Method",
y     = "Mean squared error",
title = "Comparing regression method prediction performance"
)`````` 