Simple linear regression with tidyverse

broom

tidymodels

Published

September 13, 2023

In this tutorial we will learn how to perform simple linear regression between two numerical variables in R using lm() function. The resulting object from a linear regression analysis using lm() is unwieldy and not intuitive for beginners. We use broom package that is part of tidymodels to make the messy output from lm() into human friendly tidy data frames.

First let us load the packages needed.

library(tidyverse)
library(palmerpenguins)
library(broom)
theme_set(theme_bw(16))

Our Penguins data looks like this.

penguins %>% head()

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                                                
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex , year

We will be using two numerical variables, bill length and flipper length of Penguins, and look at their association using the linear model. First, let use visualize the relationship between the two variables with a scatter plot made with ggplot2.

penguins %>%
  ggplot(aes(bill_length_mm, flipper_length_mm))+
  geom_point()+
  geom_smooth(method="lm")
ggsave("regression_quanitative_variables.png")

Linear Regression with lm() in R

lm() function in R let us build linear regression model. We provide the formula describing our linear model as the first argument and then data as second argument. Note that in the simple linear regression we specify the model as bill_length_mm ~ flipper_length_mm.

lm_res = lm(bill_length_mm ~ flipper_length_mm, 
           data=penguins)

We also save the result of our linear regression model. By simply using the resulting variable name we get the basic information about the model, the formula used in the model and the coefficients after fitting the data.

lm_res

Call:
lm(formula = bill_length_mm ~ flipper_length_mm, data = penguins)

Coefficients:
      (Intercept)  flipper_length_mm  
          -7.2649             0.2548

Extracting the results linear regression result object

Often we need more information from the results of linear regression model. In R, we can use summary() function on the lm object to get more information about the model.

The results from summary() function contain summary of residuals of the model, coefficient estimates and the p-values. Extracting any information from the summary() of linear model is difficult.

summary(lm_res)

Call:
lm(formula = bill_length_mm ~ flipper_length_mm, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.5792 -2.6715 -0.5721  2.0148 19.1518 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -7.26487    3.20016   -2.27   0.0238 *  
flipper_length_mm  0.25477    0.01589   16.03   % broom::glance()

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
                              
1     0.431         0.429  4.13      257. 1.74e-43     1  -969. 1944. 1955.
# 3 more variables: deviance , df.residual , nobs

broom’s tidy() function to clean the lm results

clean lm() result object with broomWith broom, it is much easier to get the results from a linear regression model. tidy() function in broom package will give the coefficient estimates and p-values in a nice tabular form as a tibble.

lm_res %>%
  broom::tidy()

# A tibble: 2 × 5
  term              estimate std.error statistic  p.value
                                
1 (Intercept)         -7.26     3.20       -2.27 2.38e- 2
2 flipper_length_mm    0.255    0.0159     16.0  1.74e-43

For example we can see that in the “term” column our flipper length variable and see that it has significant association with bill length.

With conf.int=TRUE option we can also get the confidence interval for our estimates.

lm_res %>%
  broom::tidy(conf.int=TRUE)

# A tibble: 2 × 7
  term              estimate std.error statistic  p.value conf.low conf.high
                                         
1 (Intercept)         -7.26     3.20       -2.27 2.38e- 2  -13.6      -0.970
2 flipper_length_mm    0.255    0.0159     16.0  1.74e-43    0.224     0.286

With augment() function in broom we can add the original data to results from linear model.

lm_res %>%
  augment() 

# A tibble: 342 × 9
   .rownames bill_length_mm flipper_length_mm .fitted  .resid    .hat .sigma
                                         
 1 1                   39.1               181    38.8  0.252  0.00881   4.13
 2 2                   39.5               186    40.1 -0.622  0.00622   4.13
 3 3                   40.3               195    42.4 -2.11   0.00344   4.13
 4 5                   36.7               193    41.9 -5.21   0.00385   4.12
 5 6                   39.3               190    41.1 -1.84   0.00469   4.13
 6 7                   38.9               181    38.8  0.0518 0.00881   4.13
 7 8                   39.2               195    42.4 -3.21   0.00344   4.13
 8 9                   34.1               193    41.9 -7.81   0.00385   4.11
 9 10                  42                 190    41.1  0.859  0.00469   4.13
10 11                  37.8               186    40.1 -2.32   0.00622   4.13
# 332 more rows
# 2 more variables: .cooksd , .std.resid

--- title: "Simple linear regression with tidyverse" date: 2023-09-13 categories: ['broom', 'tidymodels'] format: html: code-fold: false code-tools: true --- In this tutorial we will learn how to perform simple linear regression between two numerical variables in R using lm() function. The resulting object from a linear regression analysis using lm() is unwieldy and not intuitive for beginners. We use broom package that is part of tidymodels to make the messy output from lm() into human friendly tidy data frames. First let us load the packages needed. ```r library(tidyverse) library(palmerpenguins) library(broom) theme_set(theme_bw(16)) ``` Our Penguins data looks like this. ```r penguins %>% head() # A tibble: 6 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 # ℹ 2 more variables: sex , year ``` We will be using two numerical variables, bill length and flipper length of Penguins, and look at their association using the linear model. First, let use visualize the relationship between the two variables with a scatter plot made with ggplot2. ```r penguins %>% ggplot(aes(bill_length_mm, flipper_length_mm))+ geom_point()+ geom_smooth(method="lm") ggsave("regression_quanitative_variables.png") ``` ### Linear Regression with lm() in R lm() function in R let us build linear regression model. We provide the formula describing our linear model as the first argument and then data as second argument. Note that in the simple linear regression we specify the model as **bill_length_mm ~ flipper_length_mm**. ```r lm_res = lm(bill_length_mm ~ flipper_length_mm, data=penguins) ``` We also save the result of our linear regression model. By simply using the resulting variable name we get the basic information about the model, the formula used in the model and the coefficients after fitting the data. ```r lm_res Call: lm(formula = bill_length_mm ~ flipper_length_mm, data = penguins) Coefficients: (Intercept) flipper_length_mm -7.2649 0.2548 ``` ### Extracting the results linear regression result object Often we need more information from the results of linear regression model. In R, we can use summary() function on the lm object to get more information about the model. The results from summary() function contain summary of residuals of the model, coefficient estimates and the p-values. Extracting any information from the summary() of linear model is difficult. ```r summary(lm_res) Call: lm(formula = bill_length_mm ~ flipper_length_mm, data = penguins) Residuals: Min 1Q Median 3Q Max -8.5792 -2.6715 -0.5721 2.0148 19.1518 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -7.26487 3.20016 -2.27 0.0238 * flipper_length_mm 0.25477 0.01589 16.03 % broom::glance() # A tibble: 1 × 12 r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC 1 0.431 0.429 4.13 257. 1.74e-43 1 -969. 1944. 1955. # 3 more variables: deviance , df.residual , nobs ``` ### broom's tidy() function to clean the lm results ![clean lm() result object with broom](https://rstats101.com/wp-content/uploads/2023/09/broom_logo-259x300.png) clean lm() result object with broomWith broom, it is much easier to get the results from a linear regression model. tidy() function in broom package will give the coefficient estimates and p-values in a nice tabular form as a tibble. ```r lm_res %>% broom::tidy() # A tibble: 2 × 5 term estimate std.error statistic p.value 1 (Intercept) -7.26 3.20 -2.27 2.38e- 2 2 flipper_length_mm 0.255 0.0159 16.0 1.74e-43 ``` For example we can see that in the "term" column our flipper length variable and see that it has significant association with bill length. With conf.int=TRUE option we can also get the confidence interval for our estimates. ```r lm_res %>% broom::tidy(conf.int=TRUE) # A tibble: 2 × 7 term estimate std.error statistic p.value conf.low conf.high 1 (Intercept) -7.26 3.20 -2.27 2.38e- 2 -13.6 -0.970 2 flipper_length_mm 0.255 0.0159 16.0 1.74e-43 0.224 0.286 ``` With augment() function in broom we can add the original data to results from linear model. ```r lm_res %>% augment() # A tibble: 342 × 9 .rownames bill_length_mm flipper_length_mm .fitted .resid .hat .sigma 1 1 39.1 181 38.8 0.252 0.00881 4.13 2 2 39.5 186 40.1 -0.622 0.00622 4.13 3 3 40.3 195 42.4 -2.11 0.00344 4.13 4 5 36.7 193 41.9 -5.21 0.00385 4.12 5 6 39.3 190 41.1 -1.84 0.00469 4.13 6 7 38.9 181 38.8 0.0518 0.00881 4.13 7 8 39.2 195 42.4 -3.21 0.00344 4.13 8 9 34.1 193 41.9 -7.81 0.00385 4.11 9 10 42 190 41.1 0.859 0.00469 4.13 10 11 37.8 186 40.1 -2.32 0.00622 4.13 # 332 more rows # 2 more variables: .cooksd , .std.resid ```