dplyr contains(): select columns that contain a string
Introduction
The contains() function in dplyr is a powerful selection helper that allows you to choose columns based on partial string matches in their names. This function is particularly useful when working with datasets that have many columns with similar naming patterns, such as survey data with multiple questions sharing prefixes, or datasets with various measurement types that share common suffixes.
You’ll find contains() invaluable when you need to quickly subset your data to focus on specific groups of variables without manually typing each column name. It’s commonly used in data cleaning, exploratory analysis, and when preparing data for visualization or modeling. The function works seamlessly with dplyr’s select() function and other tidyverse operations.
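As a minimal sketch of the idea, the snippet below builds a small, hypothetical survey table (the `survey` tibble and its column names are made up for illustration and are not part of any dataset used in this post) and selects columns by a fragment of their names:

```r
library(dplyr)

# Hypothetical survey data: the column names are illustrative only
survey <- tibble(
  id = 1:3,
  q1_score = c(4, 5, 3),
  q2_score = c(2, 4, 5),
  q1_comment = c("ok", "good", "fine"),
  age = c(31, 45, 27)
)

# Columns whose names contain the literal text "score"
score_cols <- survey |>
  select(contains("score"))

names(score_cols)

# An explicitly named column plus every column whose name contains "q1"
q1_cols <- survey |>
  select(id, contains("q1"))

names(q1_cols)
```

Here `contains()` treats its argument as literal text rather than a regular expression, so `"score"` matches any column name that includes that exact run of characters.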
Getting Started
First, let’s load the required packages and examine our dataset:
library(tidyverse)
library(palmerpenguins)
# Take a look at the penguins dataset structure
glimpse(penguins)
Example 1: Basic Usage
Let’s start with a simple example using the penguins dataset. We’ll select columns that contain the word “length”:
# Select columns containing "length"
penguins_length <- penguins |>
  select(contains("length"))
# View the selected columns
colnames(penguins_length)
# We can also combine contains() with other column selections
penguins_subset <- penguins |>
  select(species, contains("length"))
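head(penguins_subset)
The contains() function ignores case by default (ignore.case = TRUE), so contains("LENGTH") selects the same columns as contains("length"). To make the match case-sensitive, set ignore.case = FALSE:
# Case-insensitive by default: "LENGTH" still matches bill_length_mm and flipper_length_mm
penguins |>
  select(contains("LENGTH")) |>
  head()
# Case-sensitive: no penguins column name contains upper-case "LENGTH", so no columns are selected
penguins |>
  select(contains("LENGTH", ignore.case = FALSE)) |>
  head()
Example 2: Practical Application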
Now let’s explore a more complex real-world scenario. Suppose we want to analyze only the bill measurements and perform some calculations:
# Select all bill-related measurements and perform analysis
bill_analysis <- penguins |>
  select(species, island, contains("bill")) |>
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  mutate(
    bill_ratio = bill_length_mm / bill_depth_mm,
    bill_area = bill_length_mm * bill_depth_mm
  ) |>
  group_by(species) |>
  summarise(
    avg_length = mean(bill_length_mm),
    avg_depth = mean(bill_depth_mm),
    avg_ratio = mean(bill_ratio),
    avg_area = mean(bill_area),
    .groups = "drop"
  )
print(bill_analysis)
We can also use contains() with multiple patterns or combine it with other selection helpers:
# Select columns containing either "bill" or "flipper"
morphology_data <- penguins |>
  select(
    species,
    sex,
    contains("bill"),
    contains("flipper")
  ) |>
  # Keep complete rows only (drop_na() comes from tidyr, loaded with tidyverse)
  drop_na()
# Calculate correlations between bill and flipper measurements
correlation_analysis <- morphology_data |>
  select(contains("mm")) |>
  cor() |>
  round(3)
print(correlation_analysis)
Here’s another practical example that uses contains() inside across() to summarise every matching column:
# Create a summary of the measurement columns (names containing "mm")
measurement_summary <- penguins |>
  select(species, contains("mm")) |>
  group_by(species) |>
  summarise(
    across(contains("mm"),
           list(mean = ~mean(.x, na.rm = TRUE),
                sd = ~sd(.x, na.rm = TRUE)),
           .names = "{.col}_{.fn}")
  )
print(measurement_summary)
Summary
The contains() function is an essential tool for efficient column selection in dplyr. Key takeaways include:
- Use contains("string") within select() to choose columns with names containing specific text
- Combine contains() with other selection helpers and column names for flexible data subsetting
- Matching is case-insensitive by default; set ignore.case = FALSE for case-sensitive matching
- contains() also works inside across(), so it can drive mutate() and summarise() calls
- This approach is particularly valuable for datasets with systematic naming conventions
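To check the case-handling and multiple-pattern behaviour yourself, here is a minimal sketch on a small, hypothetical tibble (the `measurements` object and its mixed-case column names are invented for illustration):

```r
library(dplyr)

# Hypothetical data frame with mixed-case column names
measurements <- tibble(
  Bill_Length = c(39.1, 46.5),
  bill_depth = c(18.7, 17.9),
  Flipper_Length = c(181, 217),
  mass = c(3750, 5200)
)

# Default behaviour: case is ignored, so "bill" matches both bill columns
default_match <- measurements |>
  select(contains("bill")) |>
  names()

# Case-sensitive matching: only the lower-case column name is matched
strict_match <- measurements |>
  select(contains("bill", ignore.case = FALSE)) |>
  names()

# A character vector of patterns selects the union of the matches
multi_match <- measurements |>
  select(contains(c("depth", "flipper"))) |>
  names()
```

Passing a character vector to a single contains() call is an alternative to writing several contains() calls inside select(); both forms select the same columns.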