Many linear regression models with tidyverse
In this tutorial, we will learn how to build many simple linear regression models in R using tidyverse. We will start with a simple approach that is not scalable to illustrate the challenges. Then we will show a naive approach using for loop to build many linear regression models. And finally we will show an elegant solution of building many linear regression models using map() function in purrr in tidyverse.
Let us load the packages needed . We will use penguins data to build many linear regression models with lm().
library(tidyverse)
library(palmerpenguins)
theme_set(theme_bw(16))penguins %>% head()
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex , year Let us say we are interested in building linear regression models between bill length and flipper length for all three penguin species in the data. Here we just visualize the linear association of these two quantitative variables.
penguins %>%
ggplot(aes(bill_length_mm, flipper_length_mm))+
geom_point()+
facet_wrap(~species)
many linear regression model data as scatter plot ### Naive approach to build many linear models In our example we need to build three simple linear regression models. A naive approach is to build 3 linear regression models one by one manually. To do that, we might filter the orginal data to a subset we want to build one linear regression model as shown below. Here we get data for one species
Adelie_df %
filter(species=="Adelie")And build linear regression model for one species using lm() function in R.
lm_Adelie = lm(bill_length_mm ~flipper_length_mm, data=Adelie_df)We can then get the results from the linear regression model. Here we find that bill length is strongly associated with flipper length in Adelie penguins.
lm_Adelie %>% tidy()
# A tibble: 2 × 5
term estimate std.error statistic p.value
1 (Intercept) 13.6 6.00 2.27 0.0249
2 flipper_length_mm 0.133 0.0315 4.21 0.0000447With the naive approach, we will do the same but now for the second species.
Chinstrap_df %
filter(species=="Chinstrap")
lm_Chinstrap = lm(bill_length_mm ~flipper_length_mm, data=Chinstrap_df)
lm_Chinstrap %>% tidy()
# A tibble: 2 × 5
term estimate std.error statistic p.value
1 (Intercept) 5.59 9.96 0.562 0.576
2 flipper_length_mm 0.221 0.0508 4.34 0.0000492Many linear regression models with a for loop
The naive approach described above is not scalable. A better option is to use loop through each species with a for loop and build linear regression model.
To do that let us first get the list of species that we would like to loop through.
p_species = penguins %>%
count(species) %>%
pull(species)
p_species
[1] Adelie Chinstrap Gentoo
Levels: Adelie Chinstrap GentooAnd then we can write a for loop to go over each element in the species vector to build linear regression model.
for (i in seq_along(p_species)){
# get species name
s %
filter(species==s)
# fit lm
lm_fit % tidy()
# print results
print(lm_fit_tidy)
}
# A tibble: 2 × 5
term estimate std.error statistic p.value
1 (Intercept) 13.6 6.00 2.27 0.0249
2 flipper_length_mm 0.133 0.0315 4.21 0.0000447
# A tibble: 2 × 5
term estimate std.error statistic p.value
1 (Intercept) 5.59 9.96 0.562 0.576
2 flipper_length_mm 0.221 0.0508 4.34 0.0000492
# A tibble: 2 × 5
term estimate std.error statistic p.value
1 (Intercept) -20.7 7.04 -2.94 3.88e- 3
2 flipper_length_mm 0.314 0.0324 9.69 8.60e-17Although building many linear models using a for loop is scalable, it is often difficult to use the results from each model. In this example we have just printed the results for convenience. Often one might want to save the result and access different aspect of the results later.
Many linear models with purrr in tidyverse
Tidyverse’ purrr package can help building many linear models easily and access the results from linear models. In the example below let us walk through step by step on building many models.
The basic idea is to split the data as we like using tidyverse framework using list columns in a dataframe and building many models
The basic idea is that we start with the full dataframe and perform many linear models on a subset of the dataframe. We do that by using “the nested data frame”. We use group_by() to get grouped dataframe and use nest() function to create a nested dataframe.
penguins_by_species %
group_by(species) %>%
nest()
penguins_by_species
# A tibble: 3 × 2
# Groups: species [3]
species data
1 Adelie
2 Gentoo
3 Chinstrap Our nested dataframe looks like this where one column is our penguin species and the other column has a subset of the dataframe for each grouping variable. Basically, we have created a data frame with a column that is a list of smaller data frames.
penguins_by_species
# A tibble: 3 × 2
# Groups: species [3]
species data
1 Adelie
2 Gentoo
3 Chinstrap We can check single element of the data column as shown below. Here we have a data frame containing Adelei penguins.
penguins_by_species$data[[1]]
# A tibble: 152 × 7
island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
1 Torge… 39.1 18.7 181 3750 male 2007
2 Torge… 39.5 17.4 186 3800 fema… 2007
3 Torge… 40.3 18 195 3250 fema… 2007
4 Torge… NA NA NA NA 2007
5 Torge… 36.7 19.3 193 3450 fema… 2007
6 Torge… 39.3 20.6 190 3650 male 2007
7 Torge… 38.9 17.8 181 3625 fema… 2007
8 Torge… 39.2 19.6 195 4675 male 2007
9 Torge… 34.1 18.1 193 3475 2007
10 Torge… 42 20.2 190 4250 2007
# ℹ 142 more rowsNow we are almost ready to fit linear models. Let us first write a helper function which takes a dataframe as input and fits linear regression model on two of the variables in the dataframe.
linear_reg %
mutate(lm_fit = map(data, linear_reg))Just as we could get a peek at the dataframe before, we can access the elements of list columns using similar approach with the column name and double square brackets to get the linear model results.
lm_by_species$lm_fit[[1]]
Call:
lm(formula = bill_length_mm ~ flipper_length_mm, data = df)
Coefficients:
(Intercept) flipper_length_mm
13.5871 0.1327We can go ahead and add further columns, like tidied results using broom’s tidy() function
penguins_by_species %>%
mutate(lm_fit = map(data, linear_reg)) %>%
mutate(tidy = map(lm_fit, broom::tidy)) %>%
unnest(tidy)
# A tibble: 6 × 8
# Groups: species [3]
species data lm_fit term estimate std.error statistic p.value
1 Adelie (Int… 13.6 6.00 2.27 2.49e- 2
2 Adelie flip… 0.133 0.0315 4.21 4.47e- 5
3 Gentoo (Int… -20.7 7.04 -2.94 3.88e- 3
4 Gentoo flip… 0.314 0.0324 9.69 8.60e-17
5 Chinstrap (Int… 5.59 9.96 0.562 5.76e- 1
6 Chinstrap flip… 0.221 0.0508 4.34 4.92e- 5And we wanted a specific variable, we can use filter to get the rows we needed.
penguins_by_species %>%
mutate(lm_fit = map(data, linear_reg),
tidy = map(lm_fit, broom::tidy)) %>%
unnest(tidy) %>%
filter(term=="flipper_length_mm")
# A tibble: 3 × 8
# Groups: species [3]
species data lm_fit term estimate std.error statistic p.value
1 Adelie flip… 0.133 0.0315 4.21 4.47e- 5
2 Gentoo flip… 0.314 0.0324 9.69 8.60e-17
3 Chinstrap flip… 0.221 0.0508 4.34 4.92e- 5Here is a quick plot looking at the effect sizes and the std error of the many linear models in a series of steps with purrr.
penguins_by_species %>%
mutate(lm_fit = map(data, linear_reg),
tidy = map(lm_fit, broom::tidy)) %>%
unnest(tidy) %>%
filter(term=="flipper_length_mm") %>%
ggplot(aes(species, estimate,
ymin = estimate - std.error,
ymax = estimate + std.error)) +
geom_pointrange()