How to create a nested dataframe with lists

Learn how to create a nested dataframe with lists in R. Practical tutorial with examples.

Published

November 27, 2024

Introduction

A nested dataframe stores complex objects, such as vectors, lists, or other dataframes, inside the individual cells of a list-column. This is useful when you want one row per group while keeping that group's related observations together, for example multiple measurements per subject or hierarchical data, and still want to apply functions across groups.
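To make the idea concrete, here is a minimal sketch of a list-column built by hand. The `subjects` tibble and its values are made-up toy data for illustration, not part of the datasets used later.

```r
library(tibble)

# Hypothetical toy data: two subjects with different numbers of measurements
subjects <- tibble(
  subject = c("A", "B"),
  measurements = list(c(1.2, 3.4, 5.6), c(7.8, 9.0))
)

subjects

# Each cell of the list-column holds a whole vector
lengths(subjects$measurements)
```

An ordinary column could not hold vectors of different lengths in each row, which is why the values are wrapped in `list()`.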

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We want to create a simple nested dataframe that groups penguin data by species and stores all observations for each species in a single list-column. This lets us work with grouped data while keeping it in a compact structure with one row per species.

Step 1: Group and nest the data

We’ll start by grouping the penguins dataset by species and creating nested data.

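nested_penguins <- penguins |>
  drop_na() |>           # remove rows with any missing values
  group_by(species) |>   # one group per species
  nest()                 # collapse each group's rows into a list-column named data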

nested_penguins

This creates a dataframe where each species has its own row, and all observations for that species are stored in a list-column called data.
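One detail worth knowing is that the result of `group_by()` followed by `nest()` is still a grouped dataframe. The sketch below illustrates this and shows an alternative that skips `group_by()`. It uses the built-in `mtcars` data rather than penguins so that it runs without extra packages; the object names are only illustrative.

```r
library(dplyr)
library(tidyr)

# group_by() + nest() keeps the grouping on the result
nested_grouped <- mtcars |>
  group_by(cyl) |>
  nest()

is_grouped_df(nested_grouped)

# Drop the grouping once it is no longer needed
nested_plain <- nested_grouped |>
  ungroup()

# Alternative: nest every column except cyl, without calling group_by()
nested_direct <- mtcars |>
  nest(data = -cyl)

nested_direct
```

Both approaches give one row per `cyl` value with a `data` list-column; they differ only in whether the result remains grouped.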

Step 2: Examine the structure

Let’s explore what our nested dataframe looks like and access the nested data.

# View the structure
str(nested_penguins)

# Access data for first species
nested_penguins$data[[1]]

Each element in the data column contains a complete dataframe with all the original columns except the grouping variable (species).
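Besides pulling out one element with `[[`, it can help to summarise every nested dataframe at once. This is a small sketch on `mtcars` (chosen so it runs without extra packages); the `n_rows` and `col_names` columns are names picked for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

nested <- mtcars |>
  nest(data = -cyl)

# Count the rows and list the column names of each nested dataframe
nested |>
  mutate(
    n_rows = map_int(data, nrow),
    col_names = map(data, names)
  )

# The grouping column is absent from each nested dataframe
"cyl" %in% names(nested$data[[1]])
```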

Step 3: Create custom list-columns

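We can also build list-columns by hand. Wrapping a column in list() inside summarise() stores each group's values as a single list element, and ordinary one-value summaries can sit alongside it.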

nested_with_summary <- penguins |>
  drop_na() |>
  group_by(species) |>
  summarise(
    count = n(),
    bill_lengths = list(bill_length_mm),
    avg_mass = mean(body_mass_g)
  )

nested_with_summary

This creates a dataframe with one row per species, where each cell of bill_lengths holds a numeric vector of all bill lengths for that species.
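The same pattern works for summaries that return more than one value. As a hedged sketch, the example below stores the minimum and maximum of a column as a length-2 vector per group; it uses `mtcars` instead of penguins so it runs without extra packages, and `mpg_range` is a name chosen for illustration.

```r
library(dplyr)

range_summary <- mtcars |>
  group_by(cyl) |>
  summarise(
    count = n(),
    mpg_range = list(range(mpg))  # length-2 vector: c(min, max)
  )

range_summary

# Inspect the stored range for the first group
range_summary$mpg_range[[1]]
```

Wrapping the call in `list()` is what keeps the result at one row per group even though `range()` returns two values.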

Example 2: Practical Application

The Problem

Imagine we’re analyzing car performance data and want to group cars by number of cylinders, then store various measurements as lists while computing summary statistics. This structure allows us to keep detailed data accessible while having summaries readily available.

Step 1: Create nested structure with multiple list-columns

We’ll group the mtcars dataset and create several list-columns containing different measurements.

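nested_cars <- mtcars |>
  rownames_to_column("car_name") |>   # move row names into a regular column
  group_by(cyl) |>
  summarise(
    car_names = list(car_name),       # character vector per group
    mpg_values = list(mpg),           # numeric vector per group
    horsepower = list(hp),            # numeric vector per group
    car_count = n()                   # ordinary one-value summary
  )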

nested_cars

Now each cylinder group has lists containing car names, MPG values, and horsepower ratings, plus a summary count.
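A quick sanity check is to confirm that every list-column holds one value per car in its group. The sketch below rebuilds `nested_cars` so it is self-contained; the `n_names`, `n_mpg`, and `consistent` columns are illustrative names.

```r
library(dplyr)
library(tibble)

nested_cars <- mtcars |>
  rownames_to_column("car_name") |>
  group_by(cyl) |>
  summarise(
    car_names = list(car_name),
    mpg_values = list(mpg),
    horsepower = list(hp),
    car_count = n()
  )

# lengths() returns the length of each list element
checked <- nested_cars |>
  mutate(
    n_names = lengths(car_names),
    n_mpg = lengths(mpg_values),
    consistent = n_names == car_count & n_mpg == car_count
  )

checked
```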

Step 2: Work with the nested data

We can perform operations on our list-columns using map functions.

# Calculate statistics for each group
nested_cars |>
  mutate(
    avg_mpg = map_dbl(mpg_values, mean),
    max_hp = map_dbl(horsepower, max),
    top_car = map2_chr(car_names, mpg_values, ~ .x[which.max(.y)])
  )

This adds new columns showing average MPG, maximum horsepower, and the most fuel-efficient car for each cylinder group.
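The same statistics can be computed when the groups are stored as nested dataframes (as in Example 1) rather than as separate vectors. This is a sketch of that variant, not the approach used above; the lambda argument `d` stands for one group's dataframe.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)

nested_df <- mtcars |>
  rownames_to_column("car_name") |>
  nest(data = -cyl)

# Each element of data is a dataframe, so columns are reached with $
result <- nested_df |>
  mutate(
    avg_mpg = map_dbl(data, \(d) mean(d$mpg)),
    max_hp = map_dbl(data, \(d) max(d$hp)),
    top_car = map_chr(data, \(d) d$car_name[which.max(d$mpg)])
  )

result |>
  select(cyl, avg_mpg, max_hp, top_car)
```

With a single `data` column, one `map_*()` call can use several variables at once, so `map2_chr()` is not needed here.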

Step 3: Unnest when needed

We can easily convert back to a regular dataframe structure when we need to work with individual observations.

# Unnest specific columns
nested_cars |>
  select(cyl, car_names, mpg_values) |>
  unnest(cols = c(car_names, mpg_values)) |>
  head(8)

This expands the nested structure back into individual rows, making it easy to switch between nested and unnested formats as needed.
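To illustrate the round trip, the sketch below nests a dataframe and then unnests it again, checking that no rows or columns are lost. Row and column order may differ from the original, so the check compares counts and names rather than testing for identical objects. The object names are illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

cars <- mtcars |>
  rownames_to_column("car_name") |>
  as_tibble()

# Nest by cyl, then expand the data list-column back into rows
round_trip <- cars |>
  nest(data = -cyl) |>
  unnest(cols = data)

nrow(round_trip) == nrow(cars)
setequal(names(round_trip), names(cars))
```

When several list-columns are unnested together, as with car_names and mpg_values above, the elements in each row need compatible lengths so that the values line up.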

Summary

Nested dataframes store complex data structures like lists within individual cells, perfect for grouped or hierarchical data
Use nest() after group_by() to automatically create nested structures, or summarise() with list() for custom list-columns
List-columns can contain vectors, dataframes, or any R objects, providing flexible data organization options
Apply functions to each element of a list-column with the map() family (for example map_dbl() and map2_chr()) to perform group-wise operations
Convert back to regular format using unnest() when you need to work with individual observations rather than groups

