How to calculate z-scores using tidyverse in R
Introduction
The pick() function, introduced in dplyr 1.1.0, selects columns with tidyselect syntax from inside data-masking verbs such as mutate(), summarise(), and count(). Unlike select(), which returns a modified data frame, pick() returns the selected columns as a tibble in the middle of an expression, so you can hand them to any function that takes a data frame, such as rowSums() or dense_rank(). To apply a function to each selected column separately, use across() instead.
Setup
Let’s start by loading the required packages and exploring our dataset:
library(tidyverse)
library(palmerpenguins)

penguins |> head()

The Palmer penguins dataset contains measurements for three penguin species, with both numeric and factor variables, which makes it a convenient dataset for demonstrating pick().
Basic Column Selection with pick()
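pick() only works inside a data-masking verb; calling it directly on a data frame raises an error. The simplest way to see what it returns is to wrap it in reframe(), which unpacks the resulting tibble into ordinary columns:
penguins |>
  reframe(pick(species, sex))

For plain column selection like this, select() does the same job. You can also use tidyselect helpers like starts_with() to select columns based on naming patterns: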
penguins |>
  reframe(pick(starts_with("s")))

This selects all columns whose names start with “s” (species and sex in our case).
Using pick() for Row-wise Calculations
One of the most powerful uses of pick() is performing calculations across multiple columns in each row. Here’s how to sum all numeric columns:
df <- tibble(x = 1:2, y = 3:4, z = 5:6)
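df |>
  mutate(total = rowSums(pick(where(is.numeric))))

pick(where(is.numeric)) selects all numeric columns as a tibble, and rowSums() sums across them for each row. The predicate has to be wrapped in where(); passing a bare is.numeric is deprecated in tidyselect. This is much cleaner than spelling out each column name.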
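To see that the selection really is driven by column type, here is a minimal sketch with a made-up tibble (the name scores and its columns are illustrative, not from the examples above) that mixes a text column with numeric ones:

```r
library(dplyr)

# Hypothetical data: one label column plus three numeric columns
scores <- tibble(id = c("a", "b"), x = 1:2, y = 3:4, z = 5:6)

# where(is.numeric) skips `id`, so only x, y and z are summed per row
totals <- scores |>
  mutate(total = rowSums(pick(where(is.numeric))))

totals
```

The same pattern works with rowMeans() or any other function that accepts a data frame of numeric columns.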
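pick() versus across()
pick() and across() accept the same tidyselect syntax, but they are not nested: across() takes the selection directly and applies a function to each column, while pick() returns the columns together as one tibble. To standardize all numeric columns (that is, convert them to z-scores), use across() on its own:
df |>
  mutate(across(where(is.numeric), ~ (.x - mean(.x)) / sd(.x)))

across(where(is.numeric), ...) selects the numeric columns and applies the standardization function to each one in turn.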
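As a quick sanity check on the standardization, the sketch below keeps the original columns and writes the z-scores to new ones. The helper z_score(), the tibble measurements, and the "_z" suffix are all assumptions made for illustration:

```r
library(dplyr)

# Hypothetical data with two numeric columns on different scales
measurements <- tibble(x = c(1, 2, 3), y = c(10, 20, 60))

# Hypothetical helper: z-score of a numeric vector, ignoring missing values
z_score <- function(v) (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)

# .names keeps x and y and adds x_z and y_z alongside them
z <- measurements |>
  mutate(across(where(is.numeric), z_score, .names = "{.col}_z"))

# Each standardized column should have mean 0 and standard deviation 1
z |> summarise(across(ends_with("_z"), list(mean = mean, sd = sd)))
```

Base R's scale() computes the same values, but it returns a matrix, so the explicit formula is often easier to use inside mutate().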
Working with Character Columns
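The same where() predicates select text columns. In penguins, species, island, and sex are stored as factors rather than character vectors, so where(is.character) would match nothing; select them with where(is.factor) and convert:
penguins |>
  mutate(across(where(is.factor), ~ toupper(as.character(.x))))

This converts every factor column to an uppercase character column. where(is.factor) identifies the columns by type, so you don't need to name them explicitly.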
Advanced Selection Patterns
You can use various tidyselect helpers with pick() for sophisticated column selection:
penguins |>
  reframe(pick(ends_with("g")))

This selects columns whose names end with “g”; in penguins that is only body_mass_g. You can also pass pick() to ranking functions, which accept a data frame:
penguins |>
  mutate(rank = dense_rank(pick(ends_with("g")))) |>
  arrange(rank)
Using pick() with count()
pick() is also useful for counting combinations of variables:
penguins |>
  count(pick(starts_with("s")))

This counts unique combinations of all columns starting with “s”, providing a frequency table of species-sex combinations.
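Because pick() takes tidyselect syntax, it can also bridge a user-supplied selection into count() inside your own function. The wrapper below is a sketch; count_by() and the tibble shapes are hypothetical names introduced for illustration:

```r
library(dplyr)

# Hypothetical wrapper: count combinations of whichever columns the caller selects
count_by <- function(data, cols) {
  data |> count(pick({{ cols }}))
}

# Hypothetical data with two columns starting with "s"
shapes <- tibble(
  shape = c("circle", "circle", "square"),
  size  = c("small", "small", "large"),
  value = 1:3
)

counts <- shapes |> count_by(starts_with("s"))
counts
```

The embrace operator {{ }} forwards the caller's selection, so count_by(c(shape, size)) or count_by(where(is.character)) should work the same way.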
Complex Data Transformations
For more complex scenarios, you can store the selected columns together in a single data frame column:
df <- tibble(
  x = c(3, 2, 2, 2, 1),
  y = c(0, 2, 1, 1, 4),
  z1 = c("a", "a", "a", "b", "a"),
  z2 = c("c", "d", "d", "a", "c")
)
df |> mutate(cols = pick(x, y))

This creates a new column, cols, that is itself a tibble holding x and y (a “packed” data frame column, not a list of nested tibbles). It is mainly useful as input to functions that expect a data frame.
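A short sketch of how the resulting column can be inspected and flattened again, using tidyr::unpack(). The object names packed and flat, and the smaller three-row tibble, are illustrative assumptions:

```r
library(dplyr)
library(tidyr)

# Hypothetical three-row version of the data above
small <- tibble(x = c(3, 2, 1), y = c(0, 2, 4))

packed <- small |> mutate(cols = pick(x, y))

# The new column is itself a tibble with one row per row of `small`
is.data.frame(packed$cols)
packed$cols$y

# unpack() spreads it back into ordinary columns;
# names_sep prefixes the new names so they don't clash with x and y
flat <- packed |> unpack(cols, names_sep = "_")
names(flat)
```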
Summarizing with pick()
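Inside summarise(), pick() is useful when an aggregate needs several columns at once as a data frame. For per-column aggregates such as means, pass the same selection to across() instead:
df <- tibble(
  id = 1:5,
  var1 = rnorm(5),
  var2 = rnorm(5),
  category = letters[1:5]
)
df |>
  summarise(across(where(is.numeric), mean))

This calculates the mean of every numeric column and skips category. Note that id is an integer column, so it is averaged too; use where(is.double) or name the columns explicitly to leave it out.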
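For an aggregate that genuinely needs the columns together, here is a sketch that counts rows with no missing values per group. The tibble readings and its columns are hypothetical; complete.cases() is base R and takes a data frame, which is what pick() supplies:

```r
library(dplyr)

# Hypothetical data with missing values in two numeric columns
readings <- tibble(
  group = c("a", "a", "b", "b"),
  var1  = c(1, 2, 3, 4),
  var2  = c(2, 5, NA, 8)
)

# Count rows per group where every numeric column is non-missing
summary_tbl <- readings |>
  summarise(
    n_complete = sum(complete.cases(pick(where(is.numeric)))),
    .by = group
  )

summary_tbl
```

The .by argument is available from dplyr 1.1.0 onwards, the same release that introduced pick(); group_by() before summarise() would give the same result.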
Summary
pick() returns the selected columns as a tibble rather than as column references, which makes it the right tool whenever a function inside mutate(), summarise(), or count() needs several columns at once. It accepts the full tidyselect vocabulary, including predicates wrapped in where(), such as where(is.numeric), and helpers such as starts_with(). Use pick() for row-wise calculations, joint rankings, and counting combinations; use across() with the same selection when you want to transform or summarise each column separately, as in the z-score example. Either way, selecting by type or name pattern keeps the code concise and avoids hard-coding column names.