tidyr unite(): combine multiple columns into one

tidyr

tidyr unite()

Learn tidyr unite() to combine multiple columns into one. R tutorial with practical examples.

Published

January 20, 2023

Introduction

The unite() function from the tidyr package is a powerful tool for combining multiple columns into a single column. This function is particularly useful when you need to create composite identifiers, concatenate text fields, or prepare data for analysis by merging related information.

You’ll commonly use unite() when working with datasets where related information is stored across separate columns - such as combining first and last names, merging date components (year, month, day), or creating unique identifiers from multiple categorical variables. The function provides flexible options for specifying separators and handling missing values, making it an essential tool for data cleaning and preparation workflows.

Getting Started

First, let’s load the required packages:

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

Let’s start with a simple example using the built-in mtcars dataset. We’ll combine the number of cylinders and gears to create a new identifier:

# Create a sample dataset
car_data <- mtcars |>
  rownames_to_column("car_model") |>
  select(car_model, cyl, gear, mpg) |>
  slice_head(n = 6)

# Basic unite() usage
car_data |>
  unite(col = "cyl_gear", cyl, gear, sep = "_")

In this example, unite() combines the cyl and gear columns into a new column called cyl_gear, using an underscore as the separator. The original columns are removed by default.

To keep the original columns, use the remove = FALSE parameter:

car_data |>
  unite(col = "cyl_gear", cyl, gear, sep = "_", remove = FALSE)

Example 2: Practical Application

Let’s work with the Palmer Penguins dataset to create a more comprehensive example. We’ll combine species and island information to create unique group identifiers, and also demonstrate handling missing values:

# Prepare penguin data with some modifications
penguin_data <- penguins |>
  select(species, island, sex, year, body_mass_g) |>
  slice_head(n = 10)

# Create a comprehensive identifier combining species, island, and year
penguin_data |>
  unite(col = "species_location_year", 
        species, island, year, 
        sep = "-", 
        remove = FALSE) |>
  select(species_location_year, everything())

Here’s a more advanced example that demonstrates different separator options and handling of missing data:

# Create sample data with missing values
sample_data <- tibble(
  id = 1:5,
  first_name = c("John", "Jane", "Bob", "Alice", NA),
  last_name = c("Doe", "Smith", NA, "Johnson", "Brown"),
  department = c("Sales", "IT", "HR", "Finance", "Sales")
)

# Unite names with different approaches
sample_data |>
  unite(col = "full_name", 
        first_name, last_name, 
        sep = " ", 
        na.rm = TRUE) |>  # Remove NA values before uniting
  unite(col = "employee_id", 
        id, department, 
        sep = "_") |>
  select(employee_id, full_name)

The na.rm = TRUE parameter is particularly useful when dealing with missing data, as it excludes NA values from the concatenation rather than including them as “NA” text.

You can also use unite() with conditional logic in a more complex pipeline:

penguins |>
  filter(!is.na(sex)) |>
  unite(col = "species_sex", species, sex, sep = "_") |>
  group_by(species_sex, island) |>
  summarise(
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    count = n(),
    .groups = "drop"
  ) |>
  arrange(desc(avg_body_mass))

Summary

The unite() function is an essential tool for data manipulation that allows you to combine multiple columns efficiently. Key takeaways include:

Use unite(col = "new_name", col1, col2, sep = "separator") for basic concatenation
Set remove = FALSE to preserve original columns
Use na.rm = TRUE to handle missing values gracefully
Combine unite() with other tidyverse functions for powerful data transformation pipelines
The function works seamlessly with the pipe operator for clean, readable code

Whether you’re creating composite keys, formatting text data, or preparing datasets for analysis, `unite()` provides a clean and flexible solution for column combination tasks.

--- title: "tidyr unite(): combine multiple columns into one" description: "Learn tidyr unite() to combine multiple columns into one. R tutorial with practical examples." date: 2023-01-20 categories: ['tidyr', 'tidyr unite()'] format: html: code-fold: false code-tools: true --- ## Introduction The [`unite()`](/tidyr/how-to-use-unite-in-r.html) function from the tidyr package is a powerful tool for combining multiple columns into a single column. This function is particularly useful when you need to create composite identifiers, concatenate text fields, or prepare data for analysis by merging related information. You'll commonly use `unite()` when working with datasets where related information is stored across separate columns - such as combining first and last names, merging date components (year, month, day), or creating unique identifiers from multiple categorical variables. The function provides flexible options for specifying separators and handling missing values, making it an essential tool for data cleaning and preparation workflows. ## Getting Started First, let's load the required packages: ```r library(tidyverse) library(palmerpenguins) ``` ## Example 1: Basic Usage Let's start with a simple example using the built-in `mtcars` dataset. We'll combine the number of cylinders and gears to create a new identifier: ```r # Create a sample dataset car_data <- mtcars |> rownames_to_column("car_model") |> select(car_model, cyl, gear, mpg) |> slice_head(n = 6) # Basic unite() usage car_data |> unite(col = "cyl_gear", cyl, gear, sep = "_") ``` In this example, `unite()` combines the `cyl` and `gear` columns into a new column called `cyl_gear`, using an underscore as the separator. The original columns are removed by default. To keep the original columns, use the `remove = FALSE` parameter: ```r car_data |> unite(col = "cyl_gear", cyl, gear, sep = "_", remove = FALSE) ``` ## Example 2: Practical Application Let's work with the Palmer Penguins dataset to create a more comprehensive example. We'll combine species and island information to create unique group identifiers, and also demonstrate handling missing values: ```r # Prepare penguin data with some modifications penguin_data <- penguins |> select(species, island, sex, year, body_mass_g) |> slice_head(n = 10) # Create a comprehensive identifier combining species, island, and year penguin_data |> unite(col = "species_location_year", species, island, year, sep = "-", remove = FALSE) |> select(species_location_year, everything()) ``` Here's a more advanced example that demonstrates different separator options and handling of missing data: ```r # Create sample data with missing values sample_data <- tibble( id = 1:5, first_name = c("John", "Jane", "Bob", "Alice", NA), last_name = c("Doe", "Smith", NA, "Johnson", "Brown"), department = c("Sales", "IT", "HR", "Finance", "Sales") ) # Unite names with different approaches sample_data |> unite(col = "full_name", first_name, last_name, sep = " ", na.rm = TRUE) |> # Remove NA values before uniting unite(col = "employee_id", id, department, sep = "_") |> select(employee_id, full_name) ``` The `na.rm = TRUE` parameter is particularly useful when dealing with missing data, as it excludes NA values from the concatenation rather than including them as "NA" text. You can also use `unite()` with conditional logic in a more complex pipeline: ```r penguins |> filter(!is.na(sex)) |> unite(col = "species_sex", species, sex, sep = "_") |> group_by(species_sex, island) |> summarise( avg_body_mass = mean(body_mass_g, na.rm = TRUE), count = n(), .groups = "drop" ) |> arrange(desc(avg_body_mass)) ``` ## Summary The `unite()` function is an essential tool for data manipulation that allows you to combine multiple columns efficiently. Key takeaways include: - Use `unite(col = "new_name", col1, col2, sep = "separator")` for basic concatenation - Set `remove = FALSE` to preserve original columns - Use `na.rm = TRUE` to handle missing values gracefully - Combine `unite()` with other tidyverse functions for powerful data transformation pipelines - The function works seamlessly with the pipe operator for clean, readable code Whether you're creating composite keys, formatting text data, or preparing datasets for analysis, `unite()` provides a clean and flexible solution for column combination tasks. --- ## Related Posts - [tidyr's separate_delim_wider(): Split a string into columns](/tidyr/tidyrs-separate_delim_wider-split-a-string-into-columns.html) - [How to Separate a Column into Multiple Rows in R: Hint tidyr's spearate_row()](/tidyr/separate-a-collapsed-column-into-multiple-rows.html) - [How to apply a function on multiple columns using across()](/dplyr/apply-a-function-on-multiple-columns-using-across.html) - [How to rename one or more columns of a dataframe](/dplyr/rename-one-or-more-columns-of-a-dataframe.html) - [How to use unite() in R](/tidyr/how-to-use-unite-in-r.html)