dplyr contains(): select columns that contain a string

Master dplyr contains() to select columns that contain a string. A complete R tutorial with examples using a real dataset.
Published: August 5, 2022

Introduction

The contains() function is a selection helper, provided by tidyselect and re-exported by dplyr, that chooses columns whose names contain a given literal string. The pattern is matched as plain text anywhere in the name, not as a regular expression. This is particularly useful for datasets with many similarly named columns, such as survey data where several questions share a prefix, or measurement data where several columns share a unit suffix.

You’ll find contains() handy when you need to subset your data to a group of related variables without typing each column name. It’s commonly used in data cleaning, exploratory analysis, and when preparing data for visualization or modeling. It is used inside select() and other functions that accept tidy-select column specifications, such as across() and rename_with().

Getting Started

First, let’s load the required packages and examine our dataset:

library(tidyverse)
library(palmerpenguins)

# Take a look at the penguins dataset structure
glimpse(penguins)

Example 1: Basic Usage

Let’s start with a simple example using the penguins dataset. We’ll select columns that contain the word “length”:

# Select columns containing "length"
penguins_length <- penguins |>
  select(contains("length"))

# View the selected columns
colnames(penguins_length)

# We can also combine contains() with other column selections
penguins_subset <- penguins |>
  select(species, contains("length"))

head(penguins_subset)

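The contains() function ignores case by default (ignore.case = TRUE), so the pattern "LENGTH" still matches bill_length_mm and flipper_length_mm. Writing the argument out, as below, only makes that default explicit; pass ignore.case = FALSE when you need case-sensitive matching:

# Case-insensitive selection (ignore.case = TRUE is already the default)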
penguins |>
  select(contains("LENGTH", ignore.case = TRUE)) |>
  head()
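
To make the effect of ignore.case concrete, here is a minimal sketch that compares the default with ignore.case = FALSE on the same upper-case pattern. The object names default_match and strict_match are illustrative only; the expected results assume the standard palmerpenguins column names.

```r
library(dplyr)
library(palmerpenguins)

# Default (ignore.case = TRUE): the upper-case pattern still matches
default_match <- penguins |>
  select(contains("LENGTH")) |>
  colnames()

# Case-sensitive: no column name contains upper-case "LENGTH",
# so the selection is empty rather than an error
strict_match <- penguins |>
  select(contains("LENGTH", ignore.case = FALSE)) |>
  colnames()

default_match  # "bill_length_mm" "flipper_length_mm"
strict_match   # character(0)
```

Note that a pattern with no matches returns a data frame with zero columns instead of raising an error, so it is worth checking the result when a pattern might be misspelled.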

Example 2: Practical Application

Now let’s explore a more complex real-world scenario. Suppose we want to analyze only the bill measurements and perform some calculations:

# Select all bill-related measurements and perform analysis
bill_analysis <- penguins |>
  select(species, island, contains("bill")) |>
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  mutate(
    bill_ratio = bill_length_mm / bill_depth_mm,
    bill_area = bill_length_mm * bill_depth_mm
  ) |>
  group_by(species) |>
  summarise(
    avg_length = mean(bill_length_mm),
    avg_depth = mean(bill_depth_mm),
    avg_ratio = mean(bill_ratio),
    avg_area = mean(bill_area),
    .groups = "drop"
  )

print(bill_analysis)

We can also use contains() with multiple patterns or combine it with other selection helpers:

# Select columns containing either "bill" or "flipper"
morphology_data <- penguins |>
  select(
    species, 
    sex,
    contains("bill"),
    contains("flipper")
  ) |>
  drop_na()  # drop rows with any missing value; the `.` placeholder is not available with the native pipe |>

# Calculate correlations between bill and flipper measurements
correlation_analysis <- morphology_data |>
  select(contains("mm")) |>
  cor() |>
  round(3)

print(correlation_analysis)
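
As an alternative to repeating contains() for each pattern, recent tidyselect versions also accept a character vector of patterns in a single call, keeping any column that matches at least one of them. The sketch below assumes such a version is installed (vector support arrived around tidyselect 1.0.0); vector_cols and separate_cols are illustrative names.

```r
library(dplyr)
library(palmerpenguins)

# One contains() call with a vector of patterns: a column is kept
# if its name contains any of the patterns
vector_cols <- penguins |>
  select(species, sex, contains(c("bill", "flipper"))) |>
  colnames()

# The same selection written as two separate contains() calls
separate_cols <- penguins |>
  select(species, sex, contains("bill"), contains("flipper")) |>
  colnames()

# Check that both spellings select the same columns in the same order
identical(vector_cols, separate_cols)
```

Either form works; the vector form tends to be easier to maintain when the list of patterns is stored in a variable.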

Here’s another practical example that uses pattern matching to summarise several columns at once:

# Summarise the millimetre measurements (columns whose names contain "mm")
measurement_summary <- penguins |>
  select(species, contains("mm")) |>
  group_by(species) |>
  summarise(
    across(contains("mm"), 
           list(mean = ~mean(.x, na.rm = TRUE),
                sd = ~sd(.x, na.rm = TRUE)),
           .names = "{.col}_{.fn}")
  )

print(measurement_summary)
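
One caveat: contains("mm") matches "mm" anywhere in a column name, not only at the end. In penguins this makes no difference, but in other data it can pick up unrelated columns. The sketch below uses a small hypothetical data frame (toy, with an invented comment column) to show where contains() and ends_with() diverge.

```r
library(dplyr)

# Hypothetical data: "comment" happens to contain "mm" but is not a measurement
toy <- data.frame(
  id = 1:2,
  comment = c("ok", "recheck"),
  bill_length_mm = c(39.1, 46.5)
)

# contains() matches the substring anywhere in the name
contains_cols <- toy |> select(contains("mm")) |> colnames()

# ends_with() only matches names that finish with the suffix
ends_with_cols <- toy |> select(ends_with("mm")) |> colnames()

contains_cols   # "comment" "bill_length_mm"
ends_with_cols  # "bill_length_mm"
```

When the pattern is really a unit suffix, ends_with("_mm") is usually the safer choice.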

Summary

The contains() function is an essential tool for efficient column selection in dplyr. Key takeaways include:

  • Use contains("string") within select() to choose columns with names containing specific text
  • Combine contains() with other selection helpers and column names for flexible data subsetting
  • Matching ignores case by default; set ignore.case = FALSE for case-sensitive matching
  • contains() can be used anywhere tidy selection is accepted, including across() inside mutate() and summarise()
  • This approach is particularly valuable for datasets with systematic naming conventions
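
As a brief illustration of using contains() outside select(), the following sketch standardises every millimetre column inside mutate() via across(). The "_z" suffix and the penguins_z and z_cols names are illustrative choices, not part of the original examples.

```r
library(dplyr)
library(palmerpenguins)

# Add a z-scored copy of every column whose name contains "_mm"
penguins_z <- penguins |>
  mutate(
    across(
      contains("_mm"),
      ~ (.x - mean(.x, na.rm = TRUE)) / sd(.x, na.rm = TRUE),
      .names = "{.col}_z"  # e.g. bill_length_mm becomes bill_length_mm_z
    )
  )

# List the newly created columns
z_cols <- penguins_z |>
  select(ends_with("_z")) |>
  colnames()

z_cols
```

Because the columns are chosen by pattern, any additional "_mm" column added to the data later would be standardised by the same code.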

By mastering contains(), you’ll be able to work more efficiently with wide datasets and write code that keeps working when columns following the same naming pattern are added or removed.