tidyr’s separate_delim_wider(): Split a string into columns

tidyr
tidyr separate_delim_wider()
Master tidyr separate_delim_wider() in R. Step-by-step guide with practical examples.
Published

July 3, 2024

The separate_delim_wider() function from tidyr splits strings containing delimited data into multiple columns. This is essential when you have concatenated information in a single column that needs to be separated for analysis or visualization.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We often encounter data where multiple values are stored in one column, separated by delimiters like commas, semicolons, or pipes. Let’s create a sample dataset with penguin species and island information combined in one column.

Step 1: Create sample data

First, we’ll create a dataset with combined information that needs splitting.

penguin_data <- tibble(
  id = 1:4,
  species_island = c("Adelie_Torgersen", "Chinstrap_Dream", 
                     "Gentoo_Biscoe", "Adelie_Dream")
)
penguin_data

This creates a simple dataset where species and island names are combined with underscores.

Step 2: Apply separate_delim_wider()

Now we’ll split the combined column into separate columns using the underscore delimiter.

penguin_data |>
  separate_delim_wider(
    cols = species_island,
    delim = "_",
    names = c("species", "island")
  )

The function successfully splits each string at the underscore, creating two new columns with our specified names.

Step 3: Handle the original column

By default, the original column is removed, but we can keep it by setting keep = TRUE.

penguin_data |>
  separate_delim_wider(
    cols = species_island,
    delim = "_", 
    names = c("species", "island"),
    keep = TRUE
  )

Now we have both the original combined column and the new separated columns.

Example 2: Practical Application

The Problem

Let’s work with a more realistic scenario using car data where multiple specifications are stored together. We’ll create a dataset with combined engine information that includes displacement, cylinders, and transmission type separated by pipes.

Step 1: Create realistic car data

We’ll build a dataset similar to what you might encounter in automotive databases.

car_specs <- tibble(
  model = c("Toyota Camry", "Honda Accord", "Ford Mustang", "Chevy Corvette"),
  engine_info = c("2.5L|4cyl|Auto", "1.5L|4cyl|Manual", 
                  "5.0L|8cyl|Auto", "6.2L|8cyl|Manual")
)
car_specs

This represents a common data format where technical specifications are concatenated with pipe delimiters.

Step 2: Split the engine specifications

We’ll separate the engine information into individual columns for easier analysis.

car_specs |>
  separate_delim_wider(
    cols = engine_info,
    delim = "|",
    names = c("displacement", "cylinders", "transmission")
  )

Each specification now has its own column, making the data much more useful for filtering and analysis.

Step 3: Handle inconsistent data

Sometimes data has varying numbers of elements. Let’s see how to handle this gracefully.

messy_data <- tibble(
  car = c("Car A", "Car B", "Car C"),
  specs = c("2.0L|4cyl|Auto|AWD", "1.8L|4cyl", "3.0L|6cyl|Manual")
)

messy_data |>
  separate_delim_wider(
    cols = specs,
    delim = "|",
    names = c("engine", "cylinders", "trans", "drivetrain"),
    too_few = "align_start"
  )

The too_few = "align_start" parameter fills missing values with NA when rows have fewer elements than expected.

Step 4: Working with excess data

When some rows have more elements than expected, we can handle the excess appropriately.

messy_data |>
  separate_delim_wider(
    cols = specs,
    delim = "|",
    names = c("engine", "cylinders"),
    too_many = "merge"
  )

The too_many = "merge" option combines excess elements into the final column.

Summary

  • separate_delim_wider() splits delimited strings into multiple columns efficiently
  • Use the names parameter to specify meaningful column names for your separated data
  • The keep parameter controls whether to retain the original column alongside new ones
  • Handle inconsistent data with too_few and too_many parameters for robust data processing
  • This function is invaluable for cleaning datasets where multiple values are inappropriately stored together