Simple linear regression with tidyverse
In this tutorial we will learn how to perform simple linear regression between two numerical variables in R using lm() function. The resulting object from a linear regression analysis using lm() is unwieldy and not intuitive for beginners. We use broom package that is part of tidymodels to make the messy output from lm() into human friendly tidy data frames.
First let us load the packages needed.
library(tidyverse)
library(palmerpenguins)
library(broom)
theme_set(theme_bw(16))Our Penguins data looks like this.
penguins %>% head()
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex , year We will be using two numerical variables, bill length and flipper length of Penguins, and look at their association using the linear model. First, let use visualize the relationship between the two variables with a scatter plot made with ggplot2.
penguins %>%
ggplot(aes(bill_length_mm, flipper_length_mm))+
geom_point()+
geom_smooth(method="lm")
ggsave("regression_quanitative_variables.png")Linear Regression with lm() in R
lm() function in R let us build linear regression model. We provide the formula describing our linear model as the first argument and then data as second argument. Note that in the simple linear regression we specify the model as bill_length_mm ~ flipper_length_mm.
lm_res = lm(bill_length_mm ~ flipper_length_mm,
data=penguins)We also save the result of our linear regression model. By simply using the resulting variable name we get the basic information about the model, the formula used in the model and the coefficients after fitting the data.
lm_res
Call:
lm(formula = bill_length_mm ~ flipper_length_mm, data = penguins)
Coefficients:
(Intercept) flipper_length_mm
-7.2649 0.2548Extracting the results linear regression result object
Often we need more information from the results of linear regression model. In R, we can use summary() function on the lm object to get more information about the model.
The results from summary() function contain summary of residuals of the model, coefficient estimates and the p-values. Extracting any information from the summary() of linear model is difficult.
summary(lm_res)
Call:
lm(formula = bill_length_mm ~ flipper_length_mm, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-8.5792 -2.6715 -0.5721 2.0148 19.1518
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.26487 3.20016 -2.27 0.0238 *
flipper_length_mm 0.25477 0.01589 16.03 % broom::glance()
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
1 0.431 0.429 4.13 257. 1.74e-43 1 -969. 1944. 1955.
# 3 more variables: deviance , df.residual , nobs broom’s tidy() function to clean the lm results
clean lm() result object with broomWith broom, it is much easier to get the results from a linear regression model. tidy() function in broom package will give the coefficient estimates and p-values in a nice tabular form as a tibble.
lm_res %>%
broom::tidy()
# A tibble: 2 × 5
term estimate std.error statistic p.value
1 (Intercept) -7.26 3.20 -2.27 2.38e- 2
2 flipper_length_mm 0.255 0.0159 16.0 1.74e-43For example we can see that in the “term” column our flipper length variable and see that it has significant association with bill length.
With conf.int=TRUE option we can also get the confidence interval for our estimates.
lm_res %>%
broom::tidy(conf.int=TRUE)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
1 (Intercept) -7.26 3.20 -2.27 2.38e- 2 -13.6 -0.970
2 flipper_length_mm 0.255 0.0159 16.0 1.74e-43 0.224 0.286With augment() function in broom we can add the original data to results from linear model.
lm_res %>%
augment()
# A tibble: 342 × 9
.rownames bill_length_mm flipper_length_mm .fitted .resid .hat .sigma
1 1 39.1 181 38.8 0.252 0.00881 4.13
2 2 39.5 186 40.1 -0.622 0.00622 4.13
3 3 40.3 195 42.4 -2.11 0.00344 4.13
4 5 36.7 193 41.9 -5.21 0.00385 4.12
5 6 39.3 190 41.1 -1.84 0.00469 4.13
6 7 38.9 181 38.8 0.0518 0.00881 4.13
7 8 39.2 195 42.4 -3.21 0.00344 4.13
8 9 34.1 193 41.9 -7.81 0.00385 4.11
9 10 42 190 41.1 0.859 0.00469 4.13
10 11 37.8 186 40.1 -2.32 0.00622 4.13
# 332 more rows
# 2 more variables: .cooksd , .std.resid