How to use unite() in R
Introduction
The tidyr::unite() function combines multiple columns into a single column by concatenating their values with a specified separator. It is useful for creating composite identifiers, combining categorical variables, and formatting labels for plots or reports where information from several columns needs to appear together.
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
Let’s start with a simple example using the Palmer penguins dataset to create a combined species-island identifier:
penguins |>
  unite(col = "species_island",
        species, island,
        sep = "_") |>
  select(species_island, bill_length_mm, bill_depth_mm)
In this example, unite() takes the species and island columns and combines them into a new column called species_island using an underscore as the separator. The original columns are removed by default, leaving us with the new combined column along with the other variables we selected.
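To see exactly what unite() returns, here is a minimal sketch on a small hand-made data frame. The `toy` data and its values are invented for illustration and are not taken from palmerpenguins.

```r
library(tidyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  species = c("Adelie", "Gentoo"),
  island  = c("Torgersen", "Biscoe"),
  mass_g  = c(3750, 5000)
)

# Combine species and island; the input columns are dropped by default
united <- unite(toy, col = "species_island", species, island, sep = "_")

names(united)          # "species_island" "mass_g"
united$species_island  # "Adelie_Torgersen" "Gentoo_Biscoe"
```

The new column takes the position of the first input column, and its values are plain character strings.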
You can also keep the original columns by setting remove = FALSE:
penguins |>
  unite(col = "species_island",
        species, island,
        sep = "_",
        remove = FALSE) |>
  select(species, island, species_island, body_mass_g)
Example 2: Practical Application
Here’s a more practical example where we build a composite group identifier from several characteristics, then use it for grouping and analysis:
penguins |>
  drop_na() |>
  unite(col = "penguin_id",
        species, island, sex, year,
        sep = "-") |>
  group_by(penguin_id) |>
  summarise(
    count = n(),
    avg_bill_length = mean(bill_length_mm),
    avg_body_mass = mean(body_mass_g),
    .groups = "drop"
  ) |>
  filter(count >= 5) |>
  arrange(desc(avg_body_mass))
This example demonstrates how unite() works with other tidyverse functions. We first remove rows with missing values, then combine species, island, sex, and year into a single identifier. Grouping by this identifier lets us compute the count, average bill length, and average body mass for each combination; we then keep combinations with at least five penguins and sort them by average body mass.
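As a design note, grouping by a united key yields the same groups as grouping by the original columns directly, provided the separator does not also occur inside the values. The sketch below checks this on a small invented data frame (`toy`, `group_key`, and all values are hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  sex     = c("male", "female", "male", "male", "female"),
  mass_g  = c(3800, 3400, 5500, 5300, 4700)
)

# Group by a single united key
by_key <- toy |>
  unite(col = "group_key", species, sex, sep = "-") |>
  group_by(group_key) |>
  summarise(count = n(), avg_mass = mean(mass_g), .groups = "drop")

# Group by the original columns instead
by_cols <- toy |>
  group_by(species, sex) |>
  summarise(count = n(), avg_mass = mean(mass_g), .groups = "drop")

# Both approaches produce one row per species/sex combination
nrow(by_key) == nrow(by_cols)
```

The united key is mainly worthwhile when a single label column is needed downstream, for example as a plot axis or a report heading.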
Another practical application is formatting data for labels or reports:
penguins |>
  drop_na(bill_length_mm, bill_depth_mm) |>
  unite(col = "bill_dimensions",
        bill_length_mm, bill_depth_mm,
        sep = " x ",
        remove = FALSE) |>
  unite(col = "penguin_label",
        species, bill_dimensions,
        sep = ": ") |>
  select(penguin_label, body_mass_g, flipper_length_mm) |>
  slice_head(n = 10)
Here we create descriptive labels by first combining bill dimensions with " x " as a separator, then combining the species name with these dimensions using a colon separator. This creates human-readable labels perfect for plots or reports.
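One detail worth knowing when building labels from numeric columns is that the united column is always text. The sketch below repeats the two-step labelling on a tiny invented data frame (`toy` and its values are hypothetical) and inspects the result:

```r
library(tidyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  species        = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.5),
  bill_depth_mm  = c(18.7, 14.3)
)

labelled <- toy |>
  # Step 1: join the two measurements, keeping the numeric originals
  unite(col = "bill_dimensions", bill_length_mm, bill_depth_mm,
        sep = " x ", remove = FALSE) |>
  # Step 2: prefix the species name
  unite(col = "label", species, bill_dimensions, sep = ": ")

labelled$label                     # "Adelie: 39.1 x 18.7" "Gentoo: 46.5 x 14.3"
class(labelled$label)              # "character"
is.numeric(labelled$bill_length_mm)  # TRUE, because step 1 used remove = FALSE
```

Keeping the numeric columns with remove = FALSE means they stay available for calculations, while the label is used only for display.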
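You can also control how missing values are handled. By default, unite() pastes the literal string "NA" into the result; with na.rm = TRUE, missing values are dropped before the columns are joined. In penguins, island and year happen to be complete, so here the argument acts as a safeguard: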
penguins |>
  unite(col = "location_year",
        island, year,
        sep = "_",
        na.rm = TRUE) |>
  count(location_year, species) |>
  arrange(desc(n))
Summary
- unite() is perfect for creating composite identifiers, formatted labels, or combining categorical variables for analysis and visualization
- The function offers flexibility through parameters like sep for custom separators, remove to control whether original columns are kept, and na.rm to handle missing values appropriately
- It integrates seamlessly with other tidyverse functions, making it ideal for data preparation workflows where you need to reshape data before analysis or create meaningful grouping variables
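To make the effect of na.rm concrete, here is a minimal sketch on a small invented data frame that does contain missing values (`toy` and its values are hypothetical):

```r
library(tidyr)

# Hypothetical toy data with missing values, invented for illustration only
toy <- data.frame(
  island = c("Biscoe", "Dream", NA),
  sex    = c("male", NA, "female")
)

# Default (na.rm = FALSE): NA is pasted in as the text "NA"
with_na <- unite(toy, col = "key", island, sex, sep = "_")
with_na$key     # "Biscoe_male" "Dream_NA" "NA_female"

# na.rm = TRUE: missing values are dropped before joining
without_na <- unite(toy, col = "key", island, sex, sep = "_", na.rm = TRUE)
without_na$key  # "Biscoe_male" "Dream" "female"
```

Note that with na.rm = TRUE the separator is omitted as well when one side is missing, so keys built this way can have fewer parts than the number of input columns.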