dplyr across(): Compute column-wise mean

Master dplyr across() to compute column-wise mean. Complete R tutorial with examples using real datasets.

Published

October 29, 2022

Introduction

The across() function in dplyr applies the same function to multiple columns in a single call. When computing column-wise means, across() eliminates the need to write a separate mean() call for each column, which keeps your analysis shorter and easier to read. It is particularly useful for datasets with several numeric variables that all need the same statistical operation.

You’ll want to use across() when you need to calculate means for several columns at once, when creating summary statistics for groups of variables, or when you want to apply the same transformation to multiple columns while maintaining clean, reproducible code. It’s especially valuable in exploratory data analysis and when preparing summary reports.

Getting Started

First, let’s load the required packages:

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

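Let’s start with a simple example using the penguins dataset to calculate means across all numeric columns. Note that where(is.numeric) also picks up the year column, so the result includes a mean of year:

# na.rm goes inside the lambda; passing it through across()'s ... is deprecated as of dplyr 1.1.0
penguins |>
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))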
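
To see what across() saves, here is a minimal sketch that compares a hand-written summary with the across() version. The tibble `toy` and its values are made up for illustration and are not part of the penguins data; the two results should agree.

```r
library(dplyr)

# Hypothetical toy data (not from palmerpenguins), one NA per column
toy <- tibble(
  length_mm = c(10, 20, NA),
  depth_mm  = c(1, NA, 3)
)

# Hand-written: one mean() call per column
by_hand <- toy |>
  summarise(
    length_mm = mean(length_mm, na.rm = TRUE),
    depth_mm  = mean(depth_mm, na.rm = TRUE)
  )

# across(): the same function applied to every numeric column
with_across <- toy |>
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

# Compare the two one-row results
all.equal(by_hand, with_across)
```

The hand-written version grows by one line per column, while the across() call stays the same length no matter how many numeric columns the data has.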

We can also specify columns by name or use selection helpers:

# Select the three columns by name; na.rm is set inside the lambda
penguins |>
  summarise(across(c(bill_length_mm, bill_depth_mm, flipper_length_mm), 
                   ~ mean(.x, na.rm = TRUE)))
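
The selection-helper route mentioned above could look like the following sketch. It uses a small made-up tibble, `toy`, that only borrows the penguins column names (the values are hypothetical), so it runs without palmerpenguins installed.

```r
library(dplyr)

# Hypothetical toy data that borrows the penguins column names
toy <- tibble(
  species           = c("A", "A", "B"),
  bill_length_mm    = c(40, 42, 50),
  bill_depth_mm     = c(18, 20, 15),
  flipper_length_mm = c(190, NA, 210),
  body_mass_g       = c(3500, 3700, 5000)
)

# ends_with() selects the three millimetre columns
mm_means <- toy |>
  summarise(across(ends_with("_mm"), ~ mean(.x, na.rm = TRUE)))

# starts_with() selects only the two bill columns
bill_means <- toy |>
  summarise(across(starts_with("bill"), ~ mean(.x, na.rm = TRUE)))

names(mm_means)
names(bill_means)
```

Helpers such as starts_with(), ends_with(), and contains() match on column names, so they keep working if more columns following the same naming pattern are added later.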

To make the output more readable, we can add custom names:

# .names is a glue template; {.col} is replaced by each input column name
penguins |>
  summarise(across(c(bill_length_mm, bill_depth_mm, flipper_length_mm), 
                   ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"))

Example 2: Practical Application

Now let’s explore a more complex real-world scenario. Suppose we want to calculate column-wise means for different penguin species and compare body measurements:

species_means <- penguins |>
  group_by(species) |>
  summarise(across(c(bill_length_mm, bill_depth_mm, 
                     flipper_length_mm, body_mass_g), 
                   ~ mean(.x, na.rm = TRUE), .names = "avg_{.col}"),
            .groups = "drop")

species_means

We can also create a more comprehensive summary that includes multiple statistics:

penguin_summary <- penguins |>
  group_by(species, island) |>
  summarise(across(c(bill_length_mm, flipper_length_mm, body_mass_g),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        sd = ~sd(.x, na.rm = TRUE)),
                   .names = "{.fn}_{.col}"),
            sample_size = n(),
            .groups = "drop")

penguin_summary
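
When across() receives a named list of functions, the output names come from a glue template. The sketch below, on a made-up one-column tibble (`toy` and its values are hypothetical), contrasts the default template "{.col}_{.fn}" with the custom "{.fn}_{.col}" used above.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(mass_g = c(10, 20, 30))

# Named list of functions; the list names become {.fn}
fns <- list(
  mean = ~ mean(.x, na.rm = TRUE),
  sd   = ~ sd(.x, na.rm = TRUE)
)

# Default naming for a list of functions: "{.col}_{.fn}"
default_names <- toy |>
  summarise(across(mass_g, fns))

# Custom naming puts the statistic first
custom_names <- toy |>
  summarise(across(mass_g, fns, .names = "{.fn}_{.col}"))

names(default_names)  # "mass_g_mean" "mass_g_sd"
names(custom_names)   # "mean_mass_g" "sd_mass_g"
```

Putting the statistic first makes it easy to pick out all means or all standard deviations later with starts_with("mean_") or starts_with("sd_").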

For a practical visualization of our column-wise means, we can reshape and plot the data:

species_means |>
  pivot_longer(cols = starts_with("avg_"),
               names_to = "measurement",
               values_to = "mean_value") |>
  mutate(measurement = str_remove(measurement, "avg_")) |>
  ggplot(aes(x = species, y = mean_value, fill = species)) +
  geom_col() +
  facet_wrap(~measurement, scales = "free_y") +
  theme_minimal() +
  labs(title = "Average Body Measurements by Penguin Species",
       x = "Species", y = "Mean Value") +
  theme(legend.position = "none")

Figure: Faceted bar chart of average body measurements by penguin species, computed with dplyr across().
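
The reshaping step in the plotting pipeline can be inspected on its own before any plot is drawn. This is a minimal sketch on a made-up summary table, `toy_means`, whose values are hypothetical; the pattern is anchored as "^avg_" here so that only the prefix is removed, which is a small variation on the pipeline above.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical wide summary, shaped like species_means
toy_means <- tibble(
  species            = c("A", "B"),
  avg_bill_length_mm = c(40, 50),
  avg_body_mass_g    = c(3500, 5000)
)

# One row per species-measurement pair, with the "avg_" prefix stripped
long <- toy_means |>
  pivot_longer(cols = starts_with("avg_"),
               names_to = "measurement",
               values_to = "mean_value") |>
  mutate(measurement = str_remove(measurement, "^avg_"))

long
```

The long table has one row per species and measurement, which is the shape that facet_wrap(~measurement) needs to draw one panel per measurement.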

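Here’s another practical example that passes a custom function to across(): for each species, it averages only the values above that group’s 25th percentile, that is, roughly the upper 75% of each column. As before, where(is.numeric) also includes the year column: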

penguins |>
  group_by(species) |>
  summarise(across(where(is.numeric), 
                   ~mean(.x[.x > quantile(.x, 0.25, na.rm = TRUE)], 
                         na.rm = TRUE),
                   .names = "upper75_mean_{.col}"))
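
To check what that lambda computes, here is a small worked sketch on a made-up vector (the tibble `toy` and its values are hypothetical). With the values 1 to 5 plus one NA, the 25th percentile under R's default quantile type is 2, so only 3, 4, and 5 are averaged.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(mass_g = c(1, 2, 3, 4, 5, NA))

# The cutoff used inside the lambda: the 25th percentile, ignoring NA
cutoff <- quantile(toy$mass_g, 0.25, na.rm = TRUE)

# Same lambda as in the penguins example, applied to one column
result <- toy |>
  summarise(across(mass_g,
                   ~ mean(.x[.x > quantile(.x, 0.25, na.rm = TRUE)],
                          na.rm = TRUE),
                   .names = "upper75_mean_{.col}"))

unname(cutoff)               # 2
result$upper75_mean_mass_g   # 4
```

The comparison `.x > cutoff` yields NA for the missing value, so the subset still contains an NA; the outer na.rm = TRUE is what drops it before averaging.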

Summary

The `across()` function streamlines column-wise mean calculations by applying the same function to multiple columns in one call. Key takeaways: use `where(is.numeric)` to select all numeric columns automatically, keeping in mind that this includes columns such as `year`; pass `na.rm = TRUE` inside the function, for example `~ mean(.x, na.rm = TRUE)`, when columns contain missing values; use the `.names` argument for custom column names; and combine `across()` with `group_by()` for grouped summaries. This approach reduces code repetition and keeps your analysis easier to read and maintain.
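
The point about missing values can be seen in a minimal sketch; the tibble `toy` and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: x has a missing value, y does not
toy <- tibble(x = c(1, 2, NA), y = c(4, 5, 6))

# Bare mean(): any NA in a column makes that column's mean NA
without_na_rm <- toy |>
  summarise(across(everything(), mean))

# Lambda with na.rm = TRUE: missing values are dropped first
with_na_rm <- toy |>
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

without_na_rm$x  # NA
with_na_rm$x     # 1.5
```

Columns without missing values, such as y here, give the same mean either way.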
