How to apply a function on multiple columns using across()
Introduction
The across() function in R’s dplyr package is a powerful tool for applying functions to multiple columns simultaneously. Instead of writing repetitive code to perform the same operation on different columns, across() allows you to select multiple columns and apply transformations efficiently in a single step.
This function is particularly useful when you need to summarize data, transform variables, or perform calculations across several columns that share similar characteristics. Whether you’re calculating means for numeric variables, converting data types, or applying custom functions to selected columns, across() streamlines your workflow and makes your code more readable and maintainable.
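To make the repetition concrete, here is a minimal sketch comparing the column-by-column approach with the across() equivalent. The toy data frame df and its column names (a, b, c) are made up for illustration and are not part of the penguins examples that follow.

```r
library(dplyr)

# Hypothetical toy data, used only for this illustration
df <- tibble(
  a = c(1, 2, NA),
  b = c(4, NA, 6),
  c = c(7, 8, 9)
)

# Without across(): one expression per column
manual <- df |>
  summarise(
    a = mean(a, na.rm = TRUE),
    b = mean(b, na.rm = TRUE),
    c = mean(c, na.rm = TRUE)
  )

# With across(): one expression covers every selected column
with_across <- df |>
  summarise(across(a:c, ~ mean(.x, na.rm = TRUE)))

# Compare the two one-row results
all.equal(manual, with_across)
```

The across() version stays the same length no matter how many columns the selection matches, which is where the savings come from on wider data.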
Getting Started
First, let’s load the required packages. We’ll use the tidyverse for data manipulation and the penguins dataset from the palmerpenguins package for our examples.
library(tidyverse)
library(palmerpenguins)
# Preview the penguins dataset
glimpse(penguins)
Example 1: Basic Usage
Let’s start with a simple example using the penguins dataset. We’ll calculate the mean of all numeric columns, removing missing values.
# Basic usage: calculate means for all numeric columns
# (extra arguments such as na.rm go inside the lambda; passing them
# through across() itself is deprecated as of dplyr 1.1.0)
penguins |>
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
# Apply across() to specific columns by name
penguins |>
  summarise(across(c(bill_length_mm, bill_depth_mm, flipper_length_mm),
                   ~ mean(.x, na.rm = TRUE)))
# Use column selection helpers
penguins |>
  summarise(across(ends_with("_mm"), ~ mean(.x, na.rm = TRUE)))
You can also apply multiple functions to the same columns:
# Apply multiple functions using a named list of lambdas
penguins |>
  summarise(across(where(is.numeric),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd = ~ sd(.x, na.rm = TRUE))))
Example 2: Practical Application
Let’s explore a more comprehensive example that demonstrates across() in a real-world scenario. We’ll analyze penguin measurements by species and island, applying different transformations and summaries.
# Group by species and island, then calculate multiple statistics
penguin_summary <- penguins |>
  group_by(species, island) |>
  summarise(
    # Count observations
    n = n(),
    # Calculate means and standard deviations for measurement columns
    across(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
           list(mean = ~ mean(.x, na.rm = TRUE),
                sd = ~ sd(.x, na.rm = TRUE)),
           .names = "{.col}_{.fn}"),
    .groups = "drop"
  )
# View the results
penguin_summary
Here’s another practical example showing data transformation:
# Transform multiple columns: standardize numeric variables
penguins_standardized <- penguins |>
  mutate(across(where(is.numeric),
                ~ scale(.x)[, 1],
                .names = "{.col}_scaled")) |>
  select(species, island, ends_with("_scaled"))
# Convert multiple columns to different data types
penguins_transformed <- penguins |>
  mutate(
    # Convert character columns to factors (penguins has none, since
    # species, island, and sex are already factors, so this is a no-op here)
    across(where(is.character), as.factor),
    # Round numeric columns to 1 decimal place
    across(where(is.numeric), ~ round(.x, 1))
  )
You can also use across() with conditional logic:
# Apply different functions based on column characteristics
penguins |>
  group_by(species) |>
  summarise(
    across(where(is.numeric) & !contains("year"),
           list(min = ~ min(.x, na.rm = TRUE),
                max = ~ max(.x, na.rm = TRUE))),
    # Count distinct values in the non-grouping factor columns
    across(where(is.factor), ~ length(unique(.x))),
    .groups = "drop"
  )
Summary
The across() function is an essential tool for efficient data manipulation in R. Key takeaways include:
- Use across() to apply functions to multiple columns simultaneously, reducing code duplication
- Combine it with selection helpers like where(), starts_with(), ends_with(), and contains() for flexible column selection
- Apply multiple functions using lists and control output names with the .names parameter
- across() works seamlessly with group_by() for grouped operations
- Use anonymous functions with ~ syntax for custom transformations
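As a short sketch of the selection-helper and anonymous-function points, the example below selects columns with starts_with() and writes the same transformation twice: once with the ~ shorthand and once with the base R \(x) lambda (available from R 4.1). The toy data frame measurements and its column names are hypothetical, introduced only for illustration.

```r
library(dplyr)

# Hypothetical toy data: two length columns, assumed to be in millimetres
measurements <- tibble(
  id = 1:3,
  len_head = c(10, 20, 30),
  len_tail = c(40, 50, 60)
)

# Formula shorthand: .x stands for the current column
cm_formula <- measurements |>
  mutate(across(starts_with("len_"), ~ .x / 10, .names = "{.col}_cm"))

# Equivalent base R lambda (requires R >= 4.1)
cm_lambda <- measurements |>
  mutate(across(starts_with("len_"), \(x) x / 10, .names = "{.col}_cm"))

# Compare the two results
all.equal(cm_formula, cm_lambda)
```

Both forms add len_head_cm and len_tail_cm while leaving id untouched, so the choice between them is mostly a matter of style.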