Many linear regression models with tidyverse

Learn many linear regression models with tidyverse in R. Practical tutorial with examples.

Published

October 28, 2023

Introduction

When you need to fit the same linear regression model to different subsets of your data, the tidyverse provides powerful tools to streamline this process. This approach is particularly useful when analyzing grouped data where you want to compare model coefficients, predictions, or performance metrics across categories.

Getting Started

library(tidyverse)
library(palmerpenguins)
library(broom)

Example 1: Basic Usage

The Problem

We want to examine the relationship between bill length and bill depth for each penguin species separately. Instead of manually filtering data and fitting three separate models, we’ll use tidyverse functions to automate this process.

Step 1: Nest data by groups

We’ll organize our data so each species gets its own dataset.

penguin_nested <- penguins |>
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  group_by(species) |>
  nest()

penguin_nested

This creates a tibble where each row represents a species, and the data column contains the subset of observations for that species.

Step 2: Fit models to each group

We’ll apply the same linear model to each nested dataset.

penguin_models <- penguin_nested |>
  mutate(
    model = map(data, ~ lm(bill_depth_mm ~ bill_length_mm, data = .x))
  )

penguin_models

Now each species has its own fitted linear model stored in the model column.

Step 3: Extract model summaries

We’ll use broom functions to get clean model outputs.

model_summaries <- penguin_models |>
  mutate(
    tidy_model = map(model, tidy),
    glance_model = map(model, glance)
  )

model_summaries

The tidy_model column contains coefficient information, while glance_model contains model-level statistics like R-squared values.

Step 4: Unnest results for comparison

We’ll extract the coefficient information for easy comparison across species.

coefficients <- model_summaries |>
  select(species, tidy_model) |>
  unnest(tidy_model)

coefficients

Now we can easily compare slopes, intercepts, and significance levels across all three penguin species.

Example 2: Practical Application

The Problem

We’re analyzing car performance data and want to understand how weight affects fuel efficiency (mpg) for cars with different numbers of cylinders. We also want to generate predictions and assess model quality for each cylinder group.

Step 1: Prepare and nest the data

We’ll group cars by cylinder count and create nested datasets.

car_nested <- mtcars |>
  select(mpg, wt, cyl) |>
  group_by(cyl) |>
  nest() |>
  filter(map_lgl(data, ~ nrow(.x) >= 3))

car_nested

We filtered to ensure each group has at least 3 observations for meaningful regression analysis.

Step 2: Fit models and extract diagnostics

We’ll fit weight-to-mpg models and extract comprehensive model information.

car_analysis <- car_nested |>
  mutate(
    model = map(data, ~ lm(mpg ~ wt, data = .x)),
    tidy = map(model, tidy),
    glance = map(model, glance)
  )

Each cylinder group now has its own model with diagnostic information readily available.

Step 3: Compare model performance

We’ll examine R-squared values and other fit statistics across cylinder groups.

model_performance <- car_analysis |>
  select(cyl, glance) |>
  unnest(glance) |>
  select(cyl, r.squared, adj.r.squared, p.value)

model_performance

This shows us which cylinder groups have the strongest weight-mpg relationships and which models explain the most variance.

Step 4: Generate predictions

We’ll create predictions for new weight values for each cylinder group.

new_weights <- tibble(wt = seq(2, 5, by = 0.5))

predictions <- car_analysis |>
  mutate(
    pred = map(model, ~ predict(.x, newdata = new_weights))
  ) |>
  select(cyl, pred) |>
  unnest(pred) |>
  bind_cols(rep(list(new_weights), nrow(car_nested))) |>
  unnest(wt)

These predictions allow us to visualize and compare the weight-mpg relationship across different cylinder groups.

Summary

Use nest() after group_by() to create separate datasets for each group you want to model
Apply map() with modeling functions to fit the same model structure to each nested dataset
Leverage broom functions (tidy(), glance(), augment()) to extract clean model outputs
Use unnest() to expand results back into regular tibble format for easy comparison and visualization
This workflow scales efficiently from a few groups to hundreds of models with minimal code changes

--- title: "Many linear regression models with tidyverse" description: "Learn many linear regression models with tidyverse in R. Practical tutorial with examples." date: 2023-10-28 categories: [] format: html: code-fold: false code-tools: true --- ## Introduction When you need to fit the same linear regression model to different subsets of your data, the tidyverse provides powerful tools to streamline this process. This approach is particularly useful when analyzing grouped data where you want to compare model coefficients, predictions, or performance metrics across categories. ## Getting Started ```r library(tidyverse) library(palmerpenguins) library(broom) ``` ## Example 1: Basic Usage ### The Problem We want to examine the relationship between bill length and bill depth for each penguin species separately. Instead of manually filtering data and fitting three separate models, we'll use tidyverse functions to automate this process. ### Step 1: Nest data by groups We'll organize our data so each species gets its own dataset. ```r penguin_nested <- penguins |> filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |> group_by(species) |> nest() penguin_nested ``` This creates a tibble where each row represents a species, and the `data` column contains the subset of observations for that species. ### Step 2: Fit models to each group We'll apply the same linear model to each nested dataset. ```r penguin_models <- penguin_nested |> mutate( model = map(data, ~ lm(bill_depth_mm ~ bill_length_mm, data = .x)) ) penguin_models ``` Now each species has its own fitted linear model stored in the `model` column. ### Step 3: Extract model summaries We'll use broom functions to get clean model outputs. ```r model_summaries <- penguin_models |> mutate( tidy_model = map(model, tidy), glance_model = map(model, glance) ) model_summaries ``` The `tidy_model` column contains coefficient information, while `glance_model` contains model-level statistics like R-squared values. ### Step 4: Unnest results for comparison We'll extract the coefficient information for easy comparison across species. ```r coefficients <- model_summaries |> select(species, tidy_model) |> unnest(tidy_model) coefficients ``` Now we can easily compare slopes, intercepts, and significance levels across all three penguin species. ## Example 2: Practical Application ### The Problem We're analyzing car performance data and want to understand how weight affects fuel efficiency (mpg) for cars with different numbers of cylinders. We also want to generate predictions and assess model quality for each cylinder group. ### Step 1: Prepare and nest the data We'll group cars by cylinder count and create nested datasets. ```r car_nested <- mtcars |> select(mpg, wt, cyl) |> group_by(cyl) |> nest() |> filter(map_lgl(data, ~ nrow(.x) >= 3)) car_nested ``` We filtered to ensure each group has at least 3 observations for meaningful regression analysis. ### Step 2: Fit models and extract diagnostics We'll fit weight-to-mpg models and extract comprehensive model information. ```r car_analysis <- car_nested |> mutate( model = map(data, ~ lm(mpg ~ wt, data = .x)), tidy = map(model, tidy), glance = map(model, glance) ) ``` Each cylinder group now has its own model with diagnostic information readily available. ### Step 3: Compare model performance We'll examine R-squared values and other fit statistics across cylinder groups. ```r model_performance <- car_analysis |> select(cyl, glance) |> unnest(glance) |> select(cyl, r.squared, adj.r.squared, p.value) model_performance ``` This shows us which cylinder groups have the strongest weight-mpg relationships and which models explain the most variance. ### Step 4: Generate predictions We'll create predictions for new weight values for each cylinder group. ```r new_weights <- tibble(wt = seq(2, 5, by = 0.5)) predictions <- car_analysis |> mutate( pred = map(model, ~ predict(.x, newdata = new_weights)) ) |> select(cyl, pred) |> unnest(pred) |> bind_cols(rep(list(new_weights), nrow(car_nested))) |> unnest(wt) ``` These predictions allow us to visualize and compare the weight-mpg relationship across different cylinder groups. ## Summary - Use [`nest()`](/tidyr/how-to-use-nest-in-r.html) after [`group_by()`](/dplyr/how-to-use-groupby-in-r.html) to create separate datasets for each group you want to model - Apply `map()` with modeling functions to fit the same model structure to each nested dataset - Leverage broom functions (`tidy()`, `glance()`, `augment()`) to extract clean model outputs - Use [`unnest()`](/tidyr/how-to-use-unnest-in-r.html) to expand results back into regular tibble format for easy comparison and visualization - This workflow scales efficiently from a few groups to hundreds of models with minimal code changes --- ## Related Posts - [Simple linear regression with tidyverse](/how-to/simple-linear-regression-with-tidyverse.html) - [How to compute annualized return of a stock with tidyverse](/how-to/compute-annualized-return-of-a-stock.html) - [How to Replace NAs with Column mean using tidyverse](/how-to/replace-nas-with-column_mean-using-tidyverse.html) - [3 ways to rank numbers with tidyverse](/dplyr/rank-numbers-with-tidyverse.html) - [How to use select() in R](/dplyr/how-to-use-select-in-r.html)