How to use across() in R
Introduction
The across() function is a powerful tool in the dplyr package that allows you to apply the same operation to multiple columns simultaneously. Instead of writing repetitive code to perform similar transformations on different columns, across() lets you select multiple columns and apply functions to them in a single, elegant expression.
You’ll find across() particularly useful when you need to summarize multiple numeric columns, apply the same transformation to several variables, or perform operations on columns that share common characteristics. It’s commonly used within summarise(), mutate(), and other dplyr verbs to make your data manipulation code more concise and maintainable.
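To make the contrast concrete, here is a minimal sketch comparing the repetitive approach with across(). The `measurements` data frame and its columns are made up for illustration and are not part of the penguins examples that follow.

```r
library(dplyr)

# Hypothetical toy data, invented for this comparison only
measurements <- tibble(
  group  = c("a", "a", "b", "b"),
  height = c(1, 3, 5, 7),
  weight = c(10, 20, 30, 50)
)

# Repetitive version: one expression per column
by_hand <- measurements |>
  group_by(group) |>
  summarise(height = mean(height), weight = mean(weight))

# across() version: select the columns once, apply the function to each
with_across <- measurements |>
  group_by(group) |>
  summarise(across(c(height, weight), mean))

# Both approaches should produce the same summary table
all.equal(by_hand, with_across)
```

With two columns the saving is small, but the across() version stays the same length however many columns are selected.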
Getting Started
First, let’s load the required packages:
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
Let’s start with a simple example using the penguins dataset. Suppose we want to calculate the mean of all numeric columns:
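penguins |>
  # Extra arguments such as na.rm go inside the lambda; passing them
  # through across()'s ... is deprecated as of dplyr 1.1.0
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
We can also apply multiple functions to the same columns by providing a list of functions: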
penguins |>
  summarise(across(c(bill_length_mm, bill_depth_mm, flipper_length_mm),
                   # Each list name becomes the suffix of the output column
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd = ~ sd(.x, na.rm = TRUE))))
Here’s how to use across() with mutate() to transform multiple columns. Let’s convert millimeter measurements to centimeters:
penguins |>
  # Swap the "_mm" suffix for "_cm" so the new names match the new unit
  mutate(across(ends_with("_mm"), ~ .x / 10, .names = "{sub('_mm$', '_cm', .col)}")) |>
  select(species, contains("_cm"))
Example 2: Practical Application
Now let’s work with a more complex real-world scenario. Imagine we’re analyzing penguin data and need to create a comprehensive summary report grouped by species and island. We’ll use across() to efficiently calculate multiple statistics:
penguin_summary <- penguins |>
  group_by(species, island) |>
  summarise(
    count = n(),
    across(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
           list(
             mean = ~ mean(.x, na.rm = TRUE),
             median = ~ median(.x, na.rm = TRUE),
             min = ~ min(.x, na.rm = TRUE),
             max = ~ max(.x, na.rm = TRUE)
           ),
           .names = "{.col}_{.fn}"),
    .groups = "drop"
  )
penguin_summary
We can also combine across() with predicate-based selection. Let’s standardize (z-score) the numeric measurements within each species, keeping the original columns alongside the new ones:
penguins_standardized <- penguins |>
  group_by(species) |>
  # year is numeric but not a measurement, so exclude it from the selection
  mutate(across(where(is.numeric) & !year,
                ~ scale(.x)[, 1],
                .names = "{.col}_std")) |>
  ungroup()
penguins_standardized |>
  select(species, contains("_std"))
Here’s another practical example where we handle missing values differently for different types of columns:
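penguins_cleaned <- penguins |>
  mutate(
    # Numeric columns: replace missing values with the column median
    across(where(is.numeric), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)),
    # penguins stores species, island, and sex as factors, not character vectors,
    # so select with is.factor and make NA an explicit level (forcats >= 1.0.0)
    across(where(is.factor), ~ fct_na_value_to_level(.x, level = "Unknown"))
  )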
penguins_cleaned |>
  summarise(across(everything(), ~ sum(is.na(.x))))
Summary
The across() function is essential for efficient data manipulation in R. Key takeaways include:
- Use across(where(condition), function) to apply operations to columns meeting specific criteria
- Combine across() with column selection helpers like starts_with(), ends_with(), or contains()
- Apply multiple functions using lists: across(cols, list(mean = mean, sd = sd))
- Control output names with the .names parameter using {.col} and {.fn} placeholders
- The ~ syntax allows for more complex transformations within across()
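As a compact illustration of several of these takeaways together, the sketch below uses starts_with(), a named list of functions, and a custom .names pattern. The `scores` data frame is hypothetical and exists only for this example. The functions are written with R's native `\(x)` lambda (R 4.1 or later), which can be used in place of the `~` shorthand.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
scores <- tibble(
  id        = 1:3,
  test_math = c(80, 90, NA),
  test_art  = c(70, NA, 60),
  age       = c(10, 11, 12)
)

scores_summary <- scores |>
  summarise(
    across(
      starts_with("test_"),                        # only the test_* columns
      list(
        mean      = \(x) mean(x, na.rm = TRUE),    # mean ignoring NAs
        n_missing = \(x) sum(is.na(x))             # count of NAs
      ),
      .names = "{.fn}_{.col}"                      # function name first, then column
    )
  )

scores_summary
```

Putting `{.fn}` before `{.col}` yields names such as mean_test_math. The default pattern is `{.col}_{.fn}`, so the two placeholders can be swapped to suit whichever naming scheme the rest of an analysis uses.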