tidyverse all_of(): select columns from a vector

tidyselect
Published

November 2, 2022

Introduction

The all_of() function is a tidyselect helper, available through dplyr and the rest of the tidyverse, that selects the columns named in a character vector. It is especially useful when column names are stored in variables, or when you need to select columns programmatically based on external input or conditions.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

You want to select specific columns from a dataset, but the column names are stored in a character vector rather than typed directly. This commonly happens when column names come from user input or configuration files.

Step 1: Create a vector of column names

First, we’ll define which columns we want to select by storing their names in a character vector.

# Define columns we want to analyze
selected_cols <- c("species", "bill_length_mm", "body_mass_g")

# View our penguin data
head(penguins, 3)

The vector selected_cols now contains the exact column names we want to extract from our dataset.

Step 2: Select columns using all_of()

Now we’ll use all_of() within select() to choose only the columns specified in our vector.

# Select columns using all_of()
penguin_subset <- penguins |>
  select(all_of(selected_cols))

# Check the result
head(penguin_subset)

The all_of() function is strict: it checks that every name in the vector exists in the data and throws an error if any is missing. Here all three names exist, so the result is a dataset with only our three chosen variables.
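To see this check in action, the sketch below asks for a column that does not exist. The name wing_span is made up for illustration and is not a column of penguins; tryCatch() is used only so the script keeps running and the error message can be inspected.

```r
library(dplyr)
library(palmerpenguins)

# "wing_span" is a hypothetical name that is not a column of penguins
bad_cols <- c("species", "wing_span")

# all_of() refuses to select when any requested name is missing,
# so select() raises an error; we capture its message instead of stopping
result <- tryCatch(
  penguins |> select(all_of(bad_cols)),
  error = function(e) conditionMessage(e)
)

# result holds the error message rather than a data frame
print(result)
```

Failing early like this is usually what you want when the vector is supposed to match the data exactly, since a typo in a column name is caught immediately.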

Step 3: Compare with direct selection

Let’s see how this differs from selecting columns directly by name.

# Direct selection (traditional method)
direct_selection <- penguins |>
  select(species, bill_length_mm, body_mass_g)

# Both methods produce identical results
identical(penguin_subset, direct_selection)

Both approaches yield the same result, but all_of() provides flexibility when column names are stored in variables.
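One reason to wrap the vector in all_of() rather than passing it to select() bare is ambiguity: a variable can share its name with a column. The sketch below deliberately creates such a clash. The variable named species is an assumption made for the demonstration and is not part of the original example.

```r
library(dplyr)
library(palmerpenguins)

# A variable that happens to share its name with a column of penguins
species <- c("island", "year")

# Bare name: select() finds the column called species in the data
by_name <- penguins |> select(species)

# all_of(): the contents of the variable are used instead
by_vector <- penguins |> select(all_of(species))

names(by_name)
names(by_vector)
```

With all_of(), the argument is always read as a variable from your environment, so the selection does not change if the data later gains a column with the same name. Recent tidyselect versions also warn when an external vector is used in select() without all_of() or any_of().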

Example 2: Practical Application

The Problem

You’re analyzing multiple datasets with different combinations of measurement columns. You need to create a flexible function that can select measurement columns based on what’s available in each dataset, and you want to avoid errors when some expected columns don’t exist.

Step 1: Create sample datasets with different columns

Let’s simulate having different datasets by creating subsets with varying column combinations.

# Create datasets with different available columns
dataset1 <- penguins |> select(species, bill_length_mm, bill_depth_mm, island)
dataset2 <- penguins |> select(species, body_mass_g, flipper_length_mm, year)

# Check what columns each dataset has
names(dataset1)
names(dataset2)

Now we have two datasets with different measurement columns available.

Step 2: Define flexible column selection

We’ll define a vector of desired columns, use intersect() to keep only the names that are present in the dataset, and pass the result to all_of().

# Define preferred measurement columns
measurement_cols <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")

# Select measurements from dataset1 (some columns exist)
measurements1 <- dataset1 |>
  select(species, all_of(intersect(measurement_cols, names(dataset1))))

head(measurements1)

The intersect() call drops any desired names that are missing from the dataset (here, body_mass_g), so all_of() never receives a column name that does not exist.

Step 3: Apply the same logic to a different dataset

Now we’ll use the same approach with our second dataset that has different available columns.

# Select measurements from dataset2 (different columns exist)  
measurements2 <- dataset2 |>
  select(species, all_of(intersect(measurement_cols, names(dataset2))))

head(measurements2)

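For dataset2, only body_mass_g matches the desired list, so the result contains species and body_mass_g. The same code selects only the measurement columns that exist in each dataset, without raising an error.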
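tidyselect also provides any_of(), a sibling of all_of() that silently ignores names missing from the data. The sketch below compares it with the intersect() pattern on a dataset built the same way as dataset2; it is offered as an alternative to consider, not as a replacement for the approach in this tutorial.

```r
library(dplyr)
library(palmerpenguins)

dataset2 <- penguins |> select(species, body_mass_g, flipper_length_mm, year)
measurement_cols <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")

# Pattern from this tutorial: filter the names first, then select strictly
with_intersect <- dataset2 |>
  select(species, all_of(intersect(measurement_cols, names(dataset2))))

# any_of() skips names that are not present in the data
with_any_of <- dataset2 |>
  select(species, any_of(measurement_cols))

# Compare the two results
identical(with_intersect, with_any_of)
```

The intersect() version makes the filtering step explicit and leaves the list of available names in hand for logging or further checks, while any_of() is shorter when you only need the selection.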

Step 4: Create a reusable function

Finally, let’s wrap this logic into a reusable function for any dataset.

# Create function for flexible column selection
# Note: assumes `data` has a species column
select_measurements <- function(data, desired_cols) {
  # Keep only the desired names that are present in `data`
  available_cols <- intersect(desired_cols, names(data))
  data |> select(species, all_of(available_cols))
}

# Test the function
result1 <- select_measurements(dataset1, measurement_cols)
result2 <- select_measurements(dataset2, measurement_cols)

Our function now works with any dataset that has a species column, selecting only the available measurement columns without throwing errors.
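Because species is hard-coded inside the function, it fails on data without that column. One possible generalization is sketched below. The function name select_measurements_by and the id_cols parameter are hypothetical additions for illustration, and the message about skipped columns is an optional extra.

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical variant: id_cols is an assumed parameter, not part of the original
select_measurements_by <- function(data, desired_cols, id_cols = "species") {
  # Report which desired columns are absent so they are not dropped silently
  missing_cols <- setdiff(desired_cols, names(data))
  if (length(missing_cols) > 0) {
    message("Skipping missing columns: ", paste(missing_cols, collapse = ", "))
  }
  available_cols <- intersect(desired_cols, names(data))
  # id_cols must exist (all_of() errors otherwise); measurements are optional
  data |> select(all_of(id_cols), all_of(available_cols))
}

measurement_cols <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")
dataset2 <- penguins |> select(species, body_mass_g, flipper_length_mm, year)

# Use year as the identifying column instead of species
by_year <- select_measurements_by(dataset2, measurement_cols, id_cols = "year")
names(by_year)
```

Keeping all_of() strict for the identifying columns means a missing key column still fails loudly, while missing measurements are only reported.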

Summary

  • all_of() selects columns using a character vector of column names, enabling programmatic column selection, and throws an error if any name is missing
  • It’s particularly useful when column names are stored in variables or come from external sources
  • Combine all_of() with intersect() to safely select only existing columns from a desired list
  • This approach prevents errors when working with datasets that have varying column structures
  • all_of() makes your code more flexible and reusable compared to hard-coding column names