How to use complete() in R
Introduction
The complete() function from tidyr turns implicit missing values into explicit ones by adding rows for combinations that are absent from your dataset. Implicit missing values are combinations that should exist but have no row at all, such as missing dates in a time series or missing factor combinations in grouped data.
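As a minimal sketch of what "implicit" means here, the toy table below (made-up stores and numbers, not from any real dataset) has no row at all for store B in 2022. Calling complete() adds that row with an NA in the value column.

```r
library(tidyr)

# Made-up toy data: store "B" has no row for 2022,
# so that store/year combination is implicitly missing.
sales <- tibble::tibble(
  store = c("A", "A", "B"),
  year  = c(2021, 2022, 2021),
  units = c(10, 12, 7)
)

# complete() adds the absent store/year row; units is NA for it
sales_complete <- complete(sales, store, year)
print(sales_complete)
```

The original three rows are kept unchanged; only the absent combination is added.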
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The Problem
Imagine you have survey data where not every participant answered every question, creating gaps in your dataset. You need to explicitly show these missing combinations to properly analyze response patterns.
Step 1: Create sample data with missing combinations
Let’s create a simple dataset that’s missing some obvious combinations.
survey_data <- tibble(
  participant = c(1, 1, 2, 2, 3),
  question = c("A", "B", "A", "C", "B"),
  response = c(4, 3, 5, 2, 4)
)
print(survey_data)
Notice that participant 1 never answered question C, participant 2 never answered question B, and participant 3 only answered question B.
Step 2: Identify the complete structure
First, let’s see what combinations should exist in our data.
survey_data |>
  expand(participant, question)
This shows all possible combinations of participants and questions that could exist.
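If you only want to see which combinations are absent, one possible approach is to anti-join the expanded grid against the original data. This is a sketch using dplyr's anti_join() on the same survey data; the name missing_combos is just an illustrative choice.

```r
library(dplyr)
library(tidyr)

survey_data <- tibble(
  participant = c(1, 1, 2, 2, 3),
  question = c("A", "B", "A", "C", "B"),
  response = c(4, 3, 5, 2, 4)
)

# Build every participant/question pair, then keep only the pairs
# that have no matching row in the original data
missing_combos <- survey_data |>
  expand(participant, question) |>
  anti_join(survey_data, by = c("participant", "question"))

print(missing_combos)
```

With three participants and three questions there are nine possible pairs, five of which are present, so four pairs are reported as missing.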
Step 3: Fill in missing combinations
Now we’ll use complete() to add the missing combinations to our original data.
survey_data |>
  complete(participant, question)
The missing combinations now appear with NA values for response, making the gaps in our data explicit.
Step 4: Fill missing values with defaults
We can provide default values for the missing combinations.
complete_survey <- survey_data |>
  complete(participant, question, fill = list(response = 0))
print(complete_survey)
Now all missing responses are filled with 0, indicating no response was given.
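One detail worth knowing about fill: by default it also replaces NA values that were already in the data, not just the rows that complete() creates. The sketch below assumes tidyr 1.2.0 or later, where the explicit argument is available, and uses a made-up variant of the survey in which participant 1 was asked question B but skipped it.

```r
library(tidyr)

# Made-up variant of the survey: participant 1 has a row for
# question B, but the response itself is NA (an explicit missing value)
survey_with_na <- tibble::tibble(
  participant = c(1, 1, 2),
  question = c("A", "B", "A"),
  response = c(4, NA, 5)
)

# Default behaviour: pre-existing NAs and newly added rows are both filled
filled_all <- complete(
  survey_with_na, participant, question,
  fill = list(response = 0)
)

# explicit = FALSE: only the rows added by complete() are filled;
# the pre-existing NA is left alone
filled_new <- complete(
  survey_with_na, participant, question,
  fill = list(response = 0),
  explicit = FALSE
)
```

Which behaviour you want depends on whether a skipped answer and a never-asked question should be treated the same way in your analysis.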
Example 2: Practical Application
The Problem
You’re analyzing penguin data and want to ensure you have entries for every species on every island, even if no penguins of that species were observed there. This is crucial for accurate statistical analysis and visualization.
Step 1: Examine the current data structure
Let’s look at the species-island combinations in our penguin data.
penguin_summary <- penguins |>
  filter(!is.na(species), !is.na(island)) |>
  count(species, island, name = "count") |>
  arrange(species, island)
print(penguin_summary)
We can see that not all species appear on all islands in our dataset.
Step 2: Complete all species-island combinations
Now let’s ensure every species has an entry for every island.
complete_penguins <- penguin_summary |>
  complete(species, island, fill = list(count = 0))
print(complete_penguins)
Missing combinations now show a count of 0, indicating no observations of that species on that island.
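Because species and island are factors in the penguins data, it helps to know that complete() expands factor columns over their full set of levels, including levels that never occur in the rows. The sketch below uses made-up counts and a hypothetical island factor with one unobserved level to show the effect.

```r
library(tidyr)

# Made-up counts: the island factor declares a "Dream" level
# that does not occur in any row
counts <- tibble::tibble(
  species = c("Adelie", "Gentoo"),
  island = factor(c("Biscoe", "Biscoe"), levels = c("Biscoe", "Dream")),
  count = c(3, 5)
)

# complete() uses all factor levels, so "Dream" rows are added
with_levels <- complete(counts, species, island, fill = list(count = 0))
print(with_levels)
```

If you only want the values actually seen in the data, dropping unused levels first (for example with droplevels()) is one way to restrict the expansion.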
Step 3: Create a time series example with missing months
Let’s create a monthly observation dataset with some missing months.
observations <- tibble(
  date = as.Date(c("2023-01-01", "2023-03-01", "2023-05-01", "2023-07-01")),
  species = c("Adelie", "Chinstrap", "Gentoo", "Adelie"),
  count = c(15, 8, 12, 20)
)
Notice we’re missing February, April, and June observations.
Step 4: Fill in missing months
We’ll complete the time series to include all months.
complete_observations <- observations |>
  complete(
    date = seq(min(date), max(date), by = "month"),
    fill = list(count = 0, species = "Unknown")
  )
print(complete_observations)
Now we have entries for every month, with appropriate defaults for missing data.
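An alternative, if you would rather have a zero count for each species in each month than an "Unknown" placeholder, is to include species in the completion alongside the date sequence. This is a sketch of that variation on the same observations data; by_species is an illustrative name.

```r
library(tidyr)

observations <- tibble::tibble(
  date = as.Date(c("2023-01-01", "2023-03-01", "2023-05-01", "2023-07-01")),
  species = c("Adelie", "Chinstrap", "Gentoo", "Adelie"),
  count = c(15, 8, 12, 20)
)

# Cross every species with every month between the first and last date;
# species/month pairs with no observation get a count of 0
by_species <- complete(
  observations,
  species,
  date = seq(min(date), max(date), by = "month"),
  fill = list(count = 0)
)
print(by_species)
```

This yields one row per species per month (three species across seven months), which is usually the shape you want for plotting one line per species.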
Summary
- complete() makes implicit missing values explicit by adding rows for missing combinations
- Use expand() first to preview what the complete structure would look like
- The fill parameter lets you specify default values for missing combinations
- It’s essential for time series analysis, ensuring continuous date sequences
- Particularly useful for grouped data where you need consistent factor level combinations