How to use expand() in R
Introduction
The tidyr::expand() function creates a data frame containing all possible combinations of the specified variables. It takes vectors or columns and generates every unique combination, creating a complete grid of possibilities. This function is particularly useful when you need to ensure your data includes all theoretical combinations of categorical variables, even if some combinations don’t exist in your original dataset.
Use expand() when you need to make implicitly missing combinations explicit, fill gaps in time series data, create lookup tables, or prepare data for modeling where every factor combination must be represented.
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
Let’s start with a simple example using the penguins dataset to create all combinations of species and islands:
# Basic expansion of species and island combinations
penguins |>
  expand(species, island)
This creates a data frame with all 9 possible combinations of the 3 penguin species (Adelie, Chinstrap, Gentoo) and 3 islands (Biscoe, Dream, Torgersen). Some combinations, such as Chinstrap penguins on Biscoe island, never occur in the original data, but expand() includes them anyway.
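To see which combinations expand() added, one option is to compare the full grid against the combinations that actually occur. The following is a minimal sketch, assuming dplyr's anti_join() is an acceptable way to do the comparison; the object names all_combos, observed, and missing_combos are illustrative only.

```r
library(tidyr)
library(dplyr)
library(palmerpenguins)

# Every theoretical species-island pair
all_combos <- penguins |>
  expand(species, island)

# Pairs that actually occur in the data
observed <- penguins |>
  distinct(species, island)

# Pairs in the grid with no matching observation
missing_combos <- all_combos |>
  anti_join(observed, by = c("species", "island"))

nrow(all_combos)      # 9
nrow(observed)        # 5
nrow(missing_combos)  # 4
```

The rows of missing_combos are the species-island pairs that no penguin in the dataset belongs to.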
You can also expand with specific values rather than existing columns:
# Expand with custom values
expand(
  tibble(),
  year = 2007:2009,
  month = 1:12
)
This generates all 36 combinations of the years 2007-2009 and the months 1-12, creating a complete time grid.
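When there is no data frame to start from, tidyr's crossing() may be a more direct way to build the same grid from plain vectors. This is a small sketch of that alternative, not a replacement for the approach above; time_grid is an illustrative name.

```r
library(tidyr)

# crossing() takes vectors directly and returns the sorted, de-duplicated grid
time_grid <- crossing(
  year = 2007:2009,
  month = 1:12
)

nrow(time_grid)  # 36
```

expand_grid() is a close relative that takes the same kind of input but does not sort or de-duplicate it.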
Example 2: Practical Application
A common real-world scenario is preparing data for analysis where every combination of the grouping variables must appear. Let’s say we want to analyze penguin body mass across all species-island-year combinations, including those with no observations:
# Create complete grid and join with actual data
complete_penguin_grid <- penguins |>
  expand(species, island, year) |>
  left_join(
    penguins |>
      group_by(species, island, year) |>
      summarise(
        avg_body_mass = mean(body_mass_g, na.rm = TRUE),
        n_penguins = n(),
        .groups = "drop"
      ),
    by = c("species", "island", "year")
  ) |>
  mutate(
    # mean() returns NaN when every mass in a group is missing; store NA instead
    avg_body_mass = ifelse(is.nan(avg_body_mass), NA, avg_body_mass),
    # Combinations absent from the data have no count after the join; record 0
    n_penguins = replace_na(n_penguins, 0)
  )
This workflow expands all possible combinations of species, island, and year, then joins the summarized data onto that grid. Combinations with no observations get NA for average body mass and 0 for count, giving us a complete picture of data availability.
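The expand-then-join pattern can also be written with tidyr's complete(), which the tidyr documentation describes as a wrapper around expand(), a join, and replace_na(). The sketch below assumes the same summary as above and applies complete() to it; penguin_summary and complete_summary are illustrative names.

```r
library(tidyr)
library(dplyr)
library(palmerpenguins)

# Summarise the combinations that occur in the data
penguin_summary <- penguins |>
  group_by(species, island, year) |>
  summarise(
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    n_penguins = n(),
    .groups = "drop"
  )

# Add the missing combinations; fill the count with 0, leave the mean as NA
complete_summary <- penguin_summary |>
  complete(species, island, year, fill = list(n_penguins = 0L))

nrow(complete_summary)  # 27
```

Because species and island are factors, complete() uses all of their levels, so the grid still covers pairs that never appear in the summary.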
Another practical use is expanding within groups:
# Expand within groups
penguins |>
  group_by(species) |>
  expand(island, year = full_seq(year, 1)) |>
  ungroup()
This creates all island-year combinations within each species group, using full_seq() to ensure consecutive years are included even if missing from the original data.
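Since every year from 2007 to 2009 is present in penguins, the effect of full_seq() and of grouping is easier to see on data with gaps. The following sketch uses a small made-up tibble (visits, with hypothetical site and year columns) purely for illustration.

```r
library(tidyr)
library(dplyr)

# full_seq() fills in the missing steps between the minimum and maximum
full_seq(c(2007, 2009), period = 1)  # 2007 2008 2009

# Hypothetical data: site "a" skips 2008, site "b" was visited only in 2008
visits <- tibble(
  site = c("a", "a", "b"),
  year = c(2007, 2009, 2008)
)

# Ungrouped: the year range is computed over the whole data frame
ungrouped_grid <- visits |>
  expand(site, year = full_seq(year, 1))

# Grouped: the year range is computed separately within each site
grouped_grid <- visits |>
  group_by(site) |>
  expand(year = full_seq(year, 1)) |>
  ungroup()

nrow(ungrouped_grid)  # 6
nrow(grouped_grid)    # 4
```

The grouped version yields fewer rows because site "b" only spans a single year, so its sequence has one value.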
You can also combine expand() with nesting() to preserve existing combinations while expanding others:
penguins |>
  expand(nesting(species, island), year = 2007:2009)
This maintains only the species-island combinations that actually exist in the data while expanding across all specified years.
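A quick way to check what nesting() changes is to compare row counts with and without it. This is a minimal sketch; nested_grid and full_grid are illustrative names.

```r
library(tidyr)
library(palmerpenguins)

# Only observed species-island pairs, crossed with every year
nested_grid <- penguins |>
  expand(nesting(species, island), year = 2007:2009)

# Every species, island, and year combination
full_grid <- penguins |>
  expand(species, island, year = 2007:2009)

nrow(nested_grid)  # 15 (5 observed pairs x 3 years)
nrow(full_grid)    # 27 (3 species x 3 islands x 3 years)
```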
Summary
- expand() generates all possible combinations of the specified variables, creating a complete grid that includes combinations not present in your original data
- It’s invaluable for making missing combinations explicit, preparing data for modeling, and identifying gaps in your dataset
- Combine expand() with left_join() and other dplyr functions to build analyses that account for every theoretical combination in your data