Many linear regression models with tidyverse

Published

October 28, 2023

In this tutorial, we will learn how to build many simple linear regression models in R using tidyverse. We will start with a simple approach that is not scalable to illustrate the challenges. Then we will show a naive approach using for loop to build many linear regression models. And finally we will show an elegant solution of building many linear regression models using map() function in purrr in tidyverse.

Let us load the packages needed . We will use penguins data to build many linear regression models with lm().

library(tidyverse)
library(palmerpenguins)
theme_set(theme_bw(16))

penguins %>% head()

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                                                
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex , year

Let us say we are interested in building linear regression models between bill length and flipper length for all three penguin species in the data. Here we just visualize the linear association of these two quantitative variables.

penguins %>% 
  ggplot(aes(bill_length_mm, flipper_length_mm))+
  geom_point()+
  facet_wrap(~species)

many linear regression model data as scatter plot ### Naive approach to build many linear models In our example we need to build three simple linear regression models. A naive approach is to build 3 linear regression models one by one manually. To do that, we might filter the orginal data to a subset we want to build one linear regression model as shown below. Here we get data for one species

Adelie_df % 
  filter(species=="Adelie")

And build linear regression model for one species using lm() function in R.

lm_Adelie = lm(bill_length_mm ~flipper_length_mm, data=Adelie_df)

We can then get the results from the linear regression model. Here we find that bill length is strongly associated with flipper length in Adelie penguins.

lm_Adelie %>% tidy()

# A tibble: 2 × 5
  term              estimate std.error statistic   p.value
                                 
1 (Intercept)         13.6      6.00        2.27 0.0249   
2 flipper_length_mm    0.133    0.0315      4.21 0.0000447

With the naive approach, we will do the same but now for the second species.

Chinstrap_df % 
  filter(species=="Chinstrap") 

lm_Chinstrap = lm(bill_length_mm ~flipper_length_mm, data=Chinstrap_df)
lm_Chinstrap %>% tidy()

# A tibble: 2 × 5
  term              estimate std.error statistic   p.value
                                 
1 (Intercept)          5.59     9.96       0.562 0.576    
2 flipper_length_mm    0.221    0.0508     4.34  0.0000492

Many linear regression models with a for loop

The naive approach described above is not scalable. A better option is to use loop through each species with a for loop and build linear regression model.

To do that let us first get the list of species that we would like to loop through.

p_species  = penguins %>%
  count(species) %>%
  pull(species)
p_species

[1] Adelie    Chinstrap Gentoo   
Levels: Adelie Chinstrap Gentoo

And then we can write a for loop to go over each element in the species vector to build linear regression model.

for (i in seq_along(p_species)){
  # get species name
  s % 
    filter(species==s)
  # fit lm 
  lm_fit % tidy()
  # print results
  print(lm_fit_tidy)
}

# A tibble: 2 × 5
  term              estimate std.error statistic   p.value
                                 
1 (Intercept)         13.6      6.00        2.27 0.0249   
2 flipper_length_mm    0.133    0.0315      4.21 0.0000447
# A tibble: 2 × 5
  term              estimate std.error statistic   p.value
                                 
1 (Intercept)          5.59     9.96       0.562 0.576    
2 flipper_length_mm    0.221    0.0508     4.34  0.0000492
# A tibble: 2 × 5
  term              estimate std.error statistic  p.value
                                
1 (Intercept)        -20.7      7.04       -2.94 3.88e- 3
2 flipper_length_mm    0.314    0.0324      9.69 8.60e-17

Although building many linear models using a for loop is scalable, it is often difficult to use the results from each model. In this example we have just printed the results for convenience. Often one might want to save the result and access different aspect of the results later.

Many linear models with purrr in tidyverse

Tidyverse’ purrr package can help building many linear models easily and access the results from linear models. In the example below let us walk through step by step on building many models.

The basic idea is to split the data as we like using tidyverse framework using list columns in a dataframe and building many models

The basic idea is that we start with the full dataframe and perform many linear models on a subset of the dataframe. We do that by using “the nested data frame”. We use group_by() to get grouped dataframe and use nest() function to create a nested dataframe.

penguins_by_species %
  group_by(species) %>%
  nest()
penguins_by_species

# A tibble: 3 × 2
# Groups:   species [3]
  species   data              
                   
1 Adelie    
2 Gentoo    
3 Chinstrap

Our nested dataframe looks like this where one column is our penguin species and the other column has a subset of the dataframe for each grouping variable. Basically, we have created a data frame with a column that is a list of smaller data frames.

penguins_by_species

# A tibble: 3 × 2
# Groups:   species [3]
  species   data              
                   
1 Adelie    
2 Gentoo    
3 Chinstrap

We can check single element of the data column as shown below. Here we have a data frame containing Adelei penguins.

penguins_by_species$data[[1]]

# A tibble: 152 × 7
   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex    year
                                             
 1 Torge…           39.1          18.7               181        3750 male   2007
 2 Torge…           39.5          17.4               186        3800 fema…  2007
 3 Torge…           40.3          18                 195        3250 fema…  2007
 4 Torge…           NA            NA                  NA          NA    2007
 5 Torge…           36.7          19.3               193        3450 fema…  2007
 6 Torge…           39.3          20.6               190        3650 male   2007
 7 Torge…           38.9          17.8               181        3625 fema…  2007
 8 Torge…           39.2          19.6               195        4675 male   2007
 9 Torge…           34.1          18.1               193        3475    2007
10 Torge…           42            20.2               190        4250    2007
# ℹ 142 more rows

Now we are almost ready to fit linear models. Let us first write a helper function which takes a dataframe as input and fits linear regression model on two of the variables in the dataframe.

linear_reg %
  mutate(lm_fit = map(data, linear_reg))

Just as we could get a peek at the dataframe before, we can access the elements of list columns using similar approach with the column name and double square brackets to get the linear model results.

lm_by_species$lm_fit[[1]]

Call:
lm(formula = bill_length_mm ~ flipper_length_mm, data = df)

Coefficients:
      (Intercept)  flipper_length_mm  
          13.5871             0.1327

We can go ahead and add further columns, like tidied results using broom’s tidy() function

penguins_by_species %>%
  mutate(lm_fit = map(data, linear_reg)) %>%
  mutate(tidy = map(lm_fit, broom::tidy)) %>%
  unnest(tidy)

# A tibble: 6 × 8
# Groups:   species [3]
  species   data              lm_fit term  estimate std.error statistic  p.value
                                      
1 Adelie                 (Int…   13.6      6.00       2.27  2.49e- 2
2 Adelie                 flip…    0.133    0.0315     4.21  4.47e- 5
3 Gentoo                 (Int…  -20.7      7.04      -2.94  3.88e- 3
4 Gentoo                 flip…    0.314    0.0324     9.69  8.60e-17
5 Chinstrap     (Int…    5.59     9.96       0.562 5.76e- 1
6 Chinstrap     flip…    0.221    0.0508     4.34  4.92e- 5

And we wanted a specific variable, we can use filter to get the rows we needed.

penguins_by_species %>%
  mutate(lm_fit = map(data, linear_reg),
         tidy = map(lm_fit, broom::tidy)) %>%
  unnest(tidy) %>%
  filter(term=="flipper_length_mm")

# A tibble: 3 × 8
# Groups:   species [3]
  species   data              lm_fit term  estimate std.error statistic  p.value
                                      
1 Adelie                 flip…    0.133    0.0315      4.21 4.47e- 5
2 Gentoo                 flip…    0.314    0.0324      9.69 8.60e-17
3 Chinstrap     flip…    0.221    0.0508      4.34 4.92e- 5

Here is a quick plot looking at the effect sizes and the std error of the many linear models in a series of steps with purrr.

penguins_by_species %>%
  mutate(lm_fit = map(data, linear_reg),
         tidy = map(lm_fit, broom::tidy)) %>%
  unnest(tidy) %>%
  filter(term=="flipper_length_mm") %>%
  ggplot(aes(species, estimate,
                  ymin = estimate - std.error,
                  ymax = estimate + std.error)) +
  geom_pointrange()

--- title: "Many linear regression models with tidyverse" date: 2023-10-28 categories: [] format: html: code-fold: false code-tools: true --- In this tutorial, we will learn how to build **many** simple linear regression models in R using tidyverse. We will start with a simple approach that is not scalable to illustrate the challenges. Then we will show a naive approach using for loop to build many linear regression models. And finally we will show an elegant solution of building many linear regression models using map() function in purrr in tidyverse. Let us load the packages needed . We will use penguins data to build many linear regression models with lm(). ```r library(tidyverse) library(palmerpenguins) theme_set(theme_bw(16)) ``` ```r penguins %>% head() # A tibble: 6 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 # ℹ 2 more variables: sex , year ``` Let us say we are interested in building linear regression models between bill length and flipper length for all three penguin species in the data. Here we just visualize the linear association of these two quantitative variables. ```r penguins %>% ggplot(aes(bill_length_mm, flipper_length_mm))+ geom_point()+ facet_wrap(~species) ``` ![many linear regression model data as scatter plot](https://rstats101.com/wp-content/uploads/2023/10/many_linear_models_data.png) many linear regression model data as scatter plot ### Naive approach to build many linear models In our example we need to build three simple linear regression models. A naive approach is to build 3 linear regression models one by one manually. To do that, we might filter the orginal data to a subset we want to build one linear regression model as shown below. Here we get data for one species ```r Adelie_df % filter(species=="Adelie") ``` And build linear regression model for one species using lm() function in R. ```r lm_Adelie = lm(bill_length_mm ~flipper_length_mm, data=Adelie_df) ``` We can then get the results from the linear regression model. Here we find that bill length is strongly associated with flipper length in Adelie penguins. ```r lm_Adelie %>% tidy() # A tibble: 2 × 5 term estimate std.error statistic p.value 1 (Intercept) 13.6 6.00 2.27 0.0249 2 flipper_length_mm 0.133 0.0315 4.21 0.0000447 ``` With the naive approach, we will do the same but now for the second species. ```r Chinstrap_df % filter(species=="Chinstrap") lm_Chinstrap = lm(bill_length_mm ~flipper_length_mm, data=Chinstrap_df) lm_Chinstrap %>% tidy() # A tibble: 2 × 5 term estimate std.error statistic p.value 1 (Intercept) 5.59 9.96 0.562 0.576 2 flipper_length_mm 0.221 0.0508 4.34 0.0000492 ``` ### Many linear regression models with a for loop The naive approach described above is not scalable. A better option is to use loop through each species with a for loop and build linear regression model. To do that let us first get the list of species that we would like to loop through. ```r p_species = penguins %>% count(species) %>% pull(species) p_species [1] Adelie Chinstrap Gentoo Levels: Adelie Chinstrap Gentoo ``` And then we can write a for loop to go over each element in the species vector to build linear regression model. ```r for (i in seq_along(p_species)){ # get species name s % filter(species==s) # fit lm lm_fit % tidy() # print results print(lm_fit_tidy) } # A tibble: 2 × 5 term estimate std.error statistic p.value 1 (Intercept) 13.6 6.00 2.27 0.0249 2 flipper_length_mm 0.133 0.0315 4.21 0.0000447 # A tibble: 2 × 5 term estimate std.error statistic p.value 1 (Intercept) 5.59 9.96 0.562 0.576 2 flipper_length_mm 0.221 0.0508 4.34 0.0000492 # A tibble: 2 × 5 term estimate std.error statistic p.value 1 (Intercept) -20.7 7.04 -2.94 3.88e- 3 2 flipper_length_mm 0.314 0.0324 9.69 8.60e-17 ``` Although building many linear models using a for loop is scalable, it is often difficult to use the results from each model. In this example we have just printed the results for convenience. Often one might want to save the result and access different aspect of the results later. ### Many linear models with purrr in tidyverse Tidyverse' purrr package can help [building many linear models easily and access the results](https://r4ds.had.co.nz/many-models.html) from linear models. In the example below let us walk through step by step on building many models. The basic idea is to split the data as we like using tidyverse framework using list columns in a dataframe and building many models The basic idea is that we start with the full dataframe and perform many linear models on a subset of the dataframe. We do that by using "the nested data frame". We use group_by() to get grouped dataframe and use nest() function to create a nested dataframe. ```r penguins_by_species % group_by(species) %>% nest() penguins_by_species # A tibble: 3 × 2 # Groups: species [3] species data 1 Adelie 2 Gentoo 3 Chinstrap ``` Our nested dataframe looks like this where one column is our penguin species and the other column has a subset of the dataframe for each grouping variable. Basically, we have created a data frame with a column that is a list of smaller data frames. ```r penguins_by_species # A tibble: 3 × 2 # Groups: species [3] species data 1 Adelie 2 Gentoo 3 Chinstrap ``` We can check single element of the data column as shown below. Here we have a data frame containing Adelei penguins. ```r penguins_by_species$data[[1]] # A tibble: 152 × 7 island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year 1 Torge… 39.1 18.7 181 3750 male 2007 2 Torge… 39.5 17.4 186 3800 fema… 2007 3 Torge… 40.3 18 195 3250 fema… 2007 4 Torge… NA NA NA NA 2007 5 Torge… 36.7 19.3 193 3450 fema… 2007 6 Torge… 39.3 20.6 190 3650 male 2007 7 Torge… 38.9 17.8 181 3625 fema… 2007 8 Torge… 39.2 19.6 195 4675 male 2007 9 Torge… 34.1 18.1 193 3475 2007 10 Torge… 42 20.2 190 4250 2007 # ℹ 142 more rows ``` Now we are almost ready to fit linear models. Let us first write a helper function which takes a dataframe as input and fits linear regression model on two of the variables in the dataframe. ```r linear_reg % mutate(lm_fit = map(data, linear_reg)) ``` Just as we could get a peek at the dataframe before, we can access the elements of list columns using similar approach with the column name and double square brackets to get the linear model results. ```r lm_by_species$lm_fit[[1]] Call: lm(formula = bill_length_mm ~ flipper_length_mm, data = df) Coefficients: (Intercept) flipper_length_mm 13.5871 0.1327 ``` We can go ahead and add further columns, like tidied results using broom's tidy() function ```r penguins_by_species %>% mutate(lm_fit = map(data, linear_reg)) %>% mutate(tidy = map(lm_fit, broom::tidy)) %>% unnest(tidy) # A tibble: 6 × 8 # Groups: species [3] species data lm_fit term estimate std.error statistic p.value 1 Adelie (Int… 13.6 6.00 2.27 2.49e- 2 2 Adelie flip… 0.133 0.0315 4.21 4.47e- 5 3 Gentoo (Int… -20.7 7.04 -2.94 3.88e- 3 4 Gentoo flip… 0.314 0.0324 9.69 8.60e-17 5 Chinstrap (Int… 5.59 9.96 0.562 5.76e- 1 6 Chinstrap flip… 0.221 0.0508 4.34 4.92e- 5 ``` And we wanted a specific variable, we can use filter to get the rows we needed. ```r penguins_by_species %>% mutate(lm_fit = map(data, linear_reg), tidy = map(lm_fit, broom::tidy)) %>% unnest(tidy) %>% filter(term=="flipper_length_mm") # A tibble: 3 × 8 # Groups: species [3] species data lm_fit term estimate std.error statistic p.value 1 Adelie flip… 0.133 0.0315 4.21 4.47e- 5 2 Gentoo flip… 0.314 0.0324 9.69 8.60e-17 3 Chinstrap flip… 0.221 0.0508 4.34 4.92e- 5 ``` Here is a quick plot looking at the [effect sizes and the std error](https://datavizpyr.com/geoms-for-visualizing-uncertainty-ggplot2/) of the many linear models in a series of steps with purrr. ```r penguins_by_species %>% mutate(lm_fit = map(data, linear_reg), tidy = map(lm_fit, broom::tidy)) %>% unnest(tidy) %>% filter(term=="flipper_length_mm") %>% ggplot(aes(species, estimate, ymin = estimate - std.error, ymax = estimate + std.error)) + geom_pointrange() ```