How to create a nested dataframe with lists
Introduction
A nested dataframe stores complex data structures like lists or other dataframes within individual cells. This approach is particularly useful when you need to group related observations together while maintaining the ability to perform vectorized operations across groups, such as storing multiple measurements per subject or organizing hierarchical data.
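To make the idea concrete before using real data, here is a minimal sketch of a hand-built list-column. The `subjects` table and its values are hypothetical, invented only for illustration; it assumes the tibble package is installed.

```r
library(tibble)

# Hypothetical data: each subject has a different number of measurements,
# so the measurements column is a list rather than an atomic vector
subjects <- tibble(
  subject = c("A", "B"),
  measurements = list(c(5.1, 4.9, 5.3), c(6.0, 6.2))
)

subjects

# Number of measurements stored in each cell
lengths(subjects$measurements)
```

Each row still describes one subject, but the cell in `measurements` holds a whole vector instead of a single value.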
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The Problem
We want to create a simple nested dataframe that groups penguin data by species and stores all observations for each species in a single list-column. This lets us work with grouped data while keeping it in a compact, organized structure.
Step 1: Group and nest the data
We’ll start by grouping the penguins dataset by species and creating nested data.
nested_penguins <- penguins |>
  drop_na() |>
  group_by(species) |>
  nest()
nested_penguins
This creates a dataframe where each species has its own row, and all observations for that species are stored in a list-column called data.
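As an alternative to `group_by()` followed by `nest()`, the columns to pack can be named directly in `nest()`. The following is a sketch of that variant, not a replacement for the approach above; the name `nested_by_species` is made up for illustration.

```r
library(tidyr)
library(palmerpenguins)

# Pack every column except species into a list-column named data.
# Unlike the group_by() route, the result is not a grouped tibble.
nested_by_species <- penguins |>
  drop_na() |>
  nest(data = -species)

nested_by_species
```

The practical difference is that the `group_by()` version keeps the grouping on the result, which affects how later `mutate()` or `summarise()` calls behave.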
Step 2: Examine the structure
Let’s explore what our nested dataframe looks like and access the nested data.
# View the structure
str(nested_penguins)
# Access data for first species
nested_penguins$data[[1]]
Each element in the data column contains a complete dataframe with all the original columns except the grouping variable (species).
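Positional access with `[[1]]` depends on row order, so it can help to inspect every nested dataframe at once or to pull one out by species name. A small sketch under the same setup as above; the column names `n_rows` and `n_cols` and the object `adelie` are illustrative only.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(palmerpenguins)

nested_penguins <- penguins |>
  drop_na() |>
  group_by(species) |>
  nest()

# Row and column counts of each nested dataframe
nested_penguins |>
  mutate(
    n_rows = map_int(data, nrow),
    n_cols = map_int(data, ncol)
  )

# Pull one species' data by name rather than by position
adelie <- nested_penguins |>
  filter(species == "Adelie") |>
  pull(data) |>
  pluck(1)

names(adelie)
```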
Step 3: Create custom list-columns
We can manually create list-columns with specific vectors or summaries for each group.
nested_with_summary <- penguins |>
  drop_na() |>
  group_by(species) |>
  summarise(
    count = n(),
    bill_lengths = list(bill_length_mm),
    avg_mass = mean(body_mass_g)
  )
nested_with_summary
This creates a dataframe where bill_lengths contains a vector of all bill lengths for each species, stored as list elements.
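One quick sanity check on a hand-built list-column is to confirm that each element has as many values as the group has rows. The sketch below repeats the summary from this step and adds that check; `checked` and `n_bills` are names chosen for illustration.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

nested_with_summary <- penguins |>
  drop_na() |>
  group_by(species) |>
  summarise(
    count = n(),
    bill_lengths = list(bill_length_mm),
    avg_mass = mean(body_mass_g)
  )

# lengths() returns the length of each list element,
# which should equal the per-species row count
checked <- nested_with_summary |>
  mutate(n_bills = lengths(bill_lengths))

all(checked$n_bills == checked$count)
```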
Example 2: Practical Application
The Problem
Imagine we’re analyzing car performance data and want to group cars by number of cylinders, then store various measurements as lists while computing summary statistics. This structure allows us to keep detailed data accessible while having summaries readily available.
Step 1: Create nested structure with multiple list-columns
We’ll group the mtcars dataset and create several list-columns containing different measurements.
nested_cars <- mtcars |>
  rownames_to_column("car_name") |>
  group_by(cyl) |>
  summarise(
    car_names = list(car_name),
    mpg_values = list(mpg),
    horsepower = list(hp),
    car_count = n()
  )
nested_cars
Now each cylinder group has lists containing car names, MPG values, and horsepower ratings, plus a summary count.
Step 2: Work with the nested data
We can perform operations on our list-columns using map functions.
# Calculate statistics for each group
nested_cars |>
  mutate(
    avg_mpg = map_dbl(mpg_values, mean),
    max_hp = map_dbl(horsepower, max),
    top_car = map2_chr(car_names, mpg_values, ~ .x[which.max(.y)])
  )
This adds new columns showing average MPG, maximum horsepower, and the most fuel-efficient car for each cylinder group.
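Because a list-column can hold any R object, the same map pattern extends beyond summary numbers. As one possible illustration (not part of the original example), the sketch below fits a separate linear model of `mpg` on `wt` for each cylinder group and stores the fitted models in a list-column; `models_by_cyl`, `fit`, and `wt_slope` are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(purrr)

models_by_cyl <- as_tibble(mtcars) |>
  nest(data = -cyl) |>
  mutate(
    # One linear model per cylinder group, stored in a list-column
    fit = map(data, ~ lm(mpg ~ wt, data = .x)),
    # Extract the estimated coefficient for wt from each model
    wt_slope = map_dbl(fit, ~ coef(.x)[["wt"]])
  )

models_by_cyl |>
  select(cyl, wt_slope)
```

Keeping the model objects next to the data they were fitted on makes it straightforward to pull out further quantities later with additional `map()` calls.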
Step 3: Unnest when needed
We can easily convert back to a regular dataframe structure when we need to work with individual observations.
# Unnest specific columns
nested_cars |>
  select(cyl, car_names, mpg_values) |>
  unnest(cols = c(car_names, mpg_values)) |>
  head(8)
This expands the nested structure back into individual rows, making it easy to switch between nested and unnested formats as needed.
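To confirm that unnesting recovers one row per original observation, the round trip can be checked directly. This is a minimal sketch that rebuilds a reduced version of `nested_cars`; `unnested_cars` is an illustrative name.

```r
library(dplyr)
library(tidyr)
library(tibble)

nested_cars <- mtcars |>
  rownames_to_column("car_name") |>
  group_by(cyl) |>
  summarise(
    car_names = list(car_name),
    mpg_values = list(mpg)
  )

# Unnest both list-columns in one call so names and values stay aligned
unnested_cars <- nested_cars |>
  unnest(cols = c(car_names, mpg_values))

# One row per original car
nrow(unnested_cars) == nrow(mtcars)
```

Listing both columns in a single `unnest()` call matters here: unnesting them in two separate steps would pair every car name with every MPG value within a group instead of matching them element by element.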
Summary
- Nested dataframes store complex data structures like lists within individual cells, perfect for grouped or hierarchical data
- Use nest() after group_by() to automatically create nested structures, or summarise() with list() for custom list-columns
- List-columns can contain vectors, dataframes, or any R objects, providing flexible data organization options
- Apply functions across nested data using map() family functions for efficient group-wise operations
- Convert back to regular format using unnest() when you need to work with individual observations rather than groups