How to use where() in R
Introduction
The where() function in dplyr allows you to select columns based on their data type or other column properties. Instead of selecting columns by name, you can select all numeric columns, all character columns, or columns that meet specific criteria. This is particularly useful when working with wide datasets where manually specifying column names would be tedious.
Getting Started
library(tidyverse)
library(palmerpenguins)Example 1: Basic Usage
The Problem
You have a dataset with mixed data types and want to perform operations only on specific types of columns. Let’s say you want to find the mean of all numeric columns in the penguins dataset.
Step 1: Examine the dataset structure
First, let’s look at what types of columns we’re working with.
penguins |>
glimpse()This shows us the data types for each column - we can see numeric columns like bill_length_mm and character columns like species.
Step 2: Select all numeric columns
Now we’ll use where() to select only numeric columns.
penguins |>
select(where(is.numeric)) |>
head()The where(is.numeric) function identifies and selects all columns that contain numeric data, excluding factors and character columns.
Step 3: Calculate summary statistics
We can now easily calculate means for all numeric columns at once.
penguins |>
select(where(is.numeric)) |>
summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))This gives us the mean values for all numeric columns in one operation, automatically handling missing values.
Example 2: Practical Application
The Problem
You’re analyzing the mtcars dataset and need to standardize all numeric variables while keeping categorical variables unchanged. You also want to identify which columns have missing values across different data types.
Step 1: Standardize numeric columns
Let’s standardize (z-score) all numeric columns in mtcars.
mtcars_standardized <- mtcars |>
mutate(across(where(is.numeric), ~ scale(.x)[,1]))
head(mtcars_standardized)The where(is.numeric) ensures we only apply the scaling function to numeric columns, leaving any categorical data untouched.
Step 2: Check for missing values by data type
Now let’s see how many missing values exist in different types of columns.
penguins |>
summarise(
numeric_na = sum(is.na(select(cur_data(), where(is.numeric)))),
character_na = sum(is.na(select(cur_data(), where(is.character))))
)This approach lets us quickly assess data quality across different column types without manually specifying each column.
Step 3: Apply different transformations by type
We can apply type-specific operations in a single pipeline.
penguins |>
mutate(
across(where(is.numeric), ~ round(.x, 1)),
across(where(is.character), ~ str_to_upper(.x))
) |>
head()This simultaneously rounds numeric columns to one decimal place and converts character columns to uppercase.
Step 4: Filter columns by custom criteria
You can also use custom functions with where() to select columns based on specific conditions.
penguins |>
select(where(~ n_distinct(.x, na.rm = TRUE) > 10)) |>
head()This selects only columns that have more than 10 distinct values, useful for identifying continuous variables or high-cardinality categorical variables.
Summary
where()selects columns based on data type or properties rather than column names- Use
where(is.numeric),where(is.character), orwhere(is.factor)for common data types
- Combine
where()withacross()to apply functions to columns of specific types - Create custom selection criteria by passing any function that returns TRUE/FALSE for each column
Perfect for wide datasets where manual column specification would be impractical