dplyr ends_with(): select columns that end with a suffix
Introduction
The ends_with() function is a selection helper, available in dplyr through tidyselect, that picks columns by the suffix of their names. It is particularly useful for datasets whose naming conventions give related variables a common ending, such as measurements in different units, time periods, or categories.
You’ll find ends_with() invaluable when dealing with survey data (questions ending in “_score”), longitudinal studies (variables ending in “_2023”, “_2024”), or scientific measurements (variables ending in “_mm”, “_cm”). Instead of manually typing each column name, ends_with() provides a clean, efficient way to select multiple related columns at once, making your code more readable and maintainable.
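As a minimal sketch of the survey case, the snippet below builds a small hypothetical tibble (the name survey, its columns, and its values are all invented for illustration) and keeps the ID column plus every column ending in "_score":

```r
library(dplyr)

# Hypothetical survey data; names and values are made up for illustration
survey <- tibble(
  respondent_id = 1:3,
  stress_score  = c(12, 18, 9),
  sleep_score   = c(7, 5, 8),
  wave_2023     = c("a", "b", "a")
)

# Keep the ID plus every column whose name ends in "_score"
survey_scores <- survey |>
  select(respondent_id, ends_with("_score"))

colnames(survey_scores)
```

The wave_2023 column is dropped because its name does not end in "_score", so adding further "_score" columns later would not require any change to the select() call.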
Getting Started
First, let’s load the required packages for this tutorial:
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
Let’s start with a simple example using the penguins dataset. We’ll select all columns that end with “_mm” to get the physical measurements:
# View the column names to see what we're working with
colnames(penguins)
# Select columns ending with "_mm"
penguin_measurements <- penguins |>
  select(ends_with("_mm"))
# Check the selected columns
colnames(penguin_measurements)
This selects bill_length_mm, bill_depth_mm, and flipper_length_mm. You can also combine ends_with() with explicit column names, or pass several suffixes at once:
# Select species and all measurement columns
penguins |>
  select(species, ends_with("_mm"))
# Select multiple suffixes
penguins |>
  select(ends_with(c("_mm", "_g")))
Example 2: Practical Application
Let’s create a more complex example by analyzing penguin body measurements and calculating summary statistics. We’ll use ends_with() to efficiently work with measurement columns:
# Calculate mean measurements by species for all "_mm" columns
penguin_summary <- penguins |>
  group_by(species) |>
  summarise(
    across(ends_with("_mm"), ~ mean(.x, na.rm = TRUE)),
    .groups = "drop"
  ) |>
  # Round to one decimal place
  mutate(across(ends_with("_mm"), ~ round(.x, 1)))
print(penguin_summary)
Here’s another practical example where we standardize (z-score) all measurement columns:
# Standardize all measurement columns
penguins_standardized <- penguins |>
  mutate(
    across(ends_with("_mm"), ~ scale(.x)[, 1])
  ) |>
  select(species, island, ends_with("_mm"))
# View the first few rows
head(penguins_standardized)
We can also use ends_with() for data quality checks, such as finding missing values across measurement columns:
# Count missing values in measurement columns
missing_summary <- penguins |>
  summarise(
    across(ends_with("_mm"), ~ sum(is.na(.x)))
  )
print(missing_summary)
For a more advanced application, let’s create a correlation matrix for all measurement variables:
# Create correlation matrix for measurement columns
correlation_matrix <- penguins |>
  select(ends_with("_mm")) |>
  cor(use = "complete.obs") |>
  round(3)
print(correlation_matrix)
Summary
The ends_with() function is an essential tool for efficient column selection in dplyr. Key takeaways include:
- Pattern matching: ends_with() selects columns based on suffix patterns, perfect for consistently named variables
- Flexible usage: Works seamlessly with other dplyr functions like select(), mutate(), across(), and summarise()
- Multiple patterns: You can specify multiple suffixes using a character vector
- Clean code: Reduces repetitive column naming and makes your code more maintainable
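The matching rules behind these takeaways can be checked on a small example. The sketch below uses a hypothetical tibble called readings (its column names are invented purely to exercise the rules) and relies on the ignore.case argument of ends_with(), which defaults to TRUE:

```r
library(dplyr)

# Hypothetical data; column names are invented to show the matching rules
readings <- tibble(
  site      = c("A", "B"),
  depth_mm  = c(10.2, 11.5),
  width_MM  = c(3.1, 2.9),
  mass_g    = c(250, 310),
  mm_offset = c(0.1, 0.2)
)

# Default matching ignores case, so depth_mm and width_MM both match;
# mm_offset does not, because "_mm" is not at the end of its name
default_match <- readings |>
  select(ends_with("_mm")) |>
  colnames()

# ignore.case = FALSE restricts the match to the exact case
strict_match <- readings |>
  select(ends_with("_mm", ignore.case = FALSE)) |>
  colnames()

# A character vector of suffixes selects columns matching any of them
multi_match <- readings |>
  select(ends_with(c("_mm", "_g"))) |>
  colnames()
```

If your column names mix cases deliberately (for example, "_MM" meaning something different from "_mm"), setting ignore.case = FALSE is the safer choice.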