tidyr unite(): combine multiple columns into one

tidyr
tidyr unite()
Learn tidyr unite() to combine multiple columns into one. R tutorial with practical examples.
Published

January 20, 2023

Introduction

The unite() function from the tidyr package is a powerful tool for combining multiple columns into a single column. This function is particularly useful when you need to create composite identifiers, concatenate text fields, or prepare data for analysis by merging related information.

You’ll commonly use unite() when working with datasets where related information is stored across separate columns - such as combining first and last names, merging date components (year, month, day), or creating unique identifiers from multiple categorical variables. The function provides flexible options for specifying separators and handling missing values, making it an essential tool for data cleaning and preparation workflows.

Getting Started

First, let’s load the required packages:

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

Let’s start with a simple example using the built-in mtcars dataset. We’ll combine the number of cylinders and gears to create a new identifier:

# Create a sample dataset
car_data <- mtcars |>
  rownames_to_column("car_model") |>
  select(car_model, cyl, gear, mpg) |>
  slice_head(n = 6)

# Basic unite() usage
car_data |>
  unite(col = "cyl_gear", cyl, gear, sep = "_")

In this example, unite() combines the cyl and gear columns into a new column called cyl_gear, using an underscore as the separator. The original columns are removed by default.

To keep the original columns, use the remove = FALSE parameter:

car_data |>
  unite(col = "cyl_gear", cyl, gear, sep = "_", remove = FALSE)

Example 2: Practical Application

Let’s work with the Palmer Penguins dataset to create a more comprehensive example. We’ll combine species and island information to create unique group identifiers, and also demonstrate handling missing values:

# Prepare penguin data with some modifications
penguin_data <- penguins |>
  select(species, island, sex, year, body_mass_g) |>
  slice_head(n = 10)

# Create a comprehensive identifier combining species, island, and year
penguin_data |>
  unite(col = "species_location_year", 
        species, island, year, 
        sep = "-", 
        remove = FALSE) |>
  select(species_location_year, everything())

Here’s a more advanced example that demonstrates different separator options and handling of missing data:

# Create sample data with missing values
sample_data <- tibble(
  id = 1:5,
  first_name = c("John", "Jane", "Bob", "Alice", NA),
  last_name = c("Doe", "Smith", NA, "Johnson", "Brown"),
  department = c("Sales", "IT", "HR", "Finance", "Sales")
)

# Unite names with different approaches
sample_data |>
  unite(col = "full_name", 
        first_name, last_name, 
        sep = " ", 
        na.rm = TRUE) |>  # Remove NA values before uniting
  unite(col = "employee_id", 
        id, department, 
        sep = "_") |>
  select(employee_id, full_name)

The na.rm = TRUE parameter is particularly useful when dealing with missing data, as it excludes NA values from the concatenation rather than including them as “NA” text.

You can also use unite() with conditional logic in a more complex pipeline:

penguins |>
  filter(!is.na(sex)) |>
  unite(col = "species_sex", species, sex, sep = "_") |>
  group_by(species_sex, island) |>
  summarise(
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    count = n(),
    .groups = "drop"
  ) |>
  arrange(desc(avg_body_mass))

Summary

The unite() function is an essential tool for data manipulation that allows you to combine multiple columns efficiently. Key takeaways include:

  • Use unite(col = "new_name", col1, col2, sep = "separator") for basic concatenation
  • Set remove = FALSE to preserve original columns
  • Use na.rm = TRUE to handle missing values gracefully
  • Combine unite() with other tidyverse functions for powerful data transformation pipelines
  • The function works seamlessly with the pipe operator for clean, readable code

Whether you’re creating composite keys, formatting text data, or preparing datasets for analysis, unite() provides a clean and flexible solution for column combination tasks.