dplyr contains(): select columns that contain a string

Master dplyr contains() to select columns that contain a string. Complete R tutorial with examples using real datasets.

Published

August 5, 2022

Introduction

The contains() function is a selection helper (provided by tidyselect and re-exported by dplyr) that chooses columns whose names contain a given literal string. The match is a plain substring match, not a regular expression. This function is particularly useful when working with datasets that have many columns with similar naming patterns, such as survey data with multiple questions sharing prefixes, or datasets with various measurement types that share common suffixes.

You’ll find contains() invaluable when you need to quickly subset your data to focus on specific groups of variables without manually typing each column name. It’s commonly used in data cleaning, exploratory analysis, and when preparing data for visualization or modeling. The function works seamlessly with dplyr’s select() function and other tidyverse operations.
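As a minimal sketch of the survey scenario described above, the block below builds a tiny made-up tibble and keeps only the columns whose names contain "score". The data frame and every column name in it are invented for illustration.

```r
library(dplyr)

# Hypothetical survey data; all column names are invented for this sketch
survey <- tibble(
  respondent_id = 1:3,
  q1_score = c(4, 5, 3),
  q2_score = c(2, 4, 4),
  q1_comment = c("ok", "good", "fine"),
  age = c(34, 51, 27)
)

# Keep every column whose name contains the substring "score"
score_cols <- survey |>
  select(contains("score"))

names(score_cols)
# "q1_score" "q2_score"
```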

Getting Started

First, let’s load the required packages and examine our dataset:

library(tidyverse)
library(palmerpenguins)

# Take a look at the penguins dataset structure
glimpse(penguins)

Example 1: Basic Usage

Let’s start with a simple example using the penguins dataset. We’ll select columns that contain the word “length”:

# Select columns containing "length"
penguins_length <- penguins |>
  select(contains("length"))

# View the selected columns
colnames(penguins_length)

# We can also combine contains() with other column selections
penguins_subset <- penguins |>
  select(species, contains("length"))

head(penguins_subset)

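The contains() function ignores case by default (ignore.case = TRUE), so an uppercase pattern still matches lowercase column names. Set ignore.case = FALSE when you need a case-sensitive match:

# Case-insensitive selection (ignore.case = TRUE is the default,
# spelled out here only to make the behaviour explicit)
penguins |>
  select(contains("LENGTH", ignore.case = TRUE)) |>
  head()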
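To check the effect of ignore.case directly, here is a small sketch that compares both settings. It uses a hand-made one-row tibble as a stand-in for penguins (an assumption made so the block runs without palmerpenguins); in tidyselect, ignore.case defaults to TRUE.

```r
library(dplyr)

# Stand-in tibble reusing three penguins column names
demo <- tibble(
  species = "Adelie",
  bill_length_mm = 39.1,
  flipper_length_mm = 181
)

# Default (ignore.case = TRUE): "LENGTH" matches the lowercase names
default_match <- demo |>
  select(contains("LENGTH")) |>
  names()

# Case-sensitive: "LENGTH" matches nothing, so no columns are selected
strict_match <- demo |>
  select(contains("LENGTH", ignore.case = FALSE)) |>
  names()

default_match
strict_match
```

Note that a pattern with no matches does not raise an error; select() simply returns a tibble with zero columns.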

Example 2: Practical Application

Now let’s explore a more complex real-world scenario. Suppose we want to analyze only the bill measurements and perform some calculations:

# Select all bill-related measurements and perform analysis
bill_analysis <- penguins |>
  select(species, island, contains("bill")) |>
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  mutate(
    bill_ratio = bill_length_mm / bill_depth_mm,
    bill_area = bill_length_mm * bill_depth_mm
  ) |>
  group_by(species) |>
  summarise(
    avg_length = mean(bill_length_mm),
    avg_depth = mean(bill_depth_mm),
    avg_ratio = mean(bill_ratio),
    avg_area = mean(bill_area),
    .groups = "drop"
  )

print(bill_analysis)

We can also use contains() with multiple patterns or combine it with other selection helpers:

# Select columns containing either "bill" or "flipper"
morphology_data <- penguins |>
  select(
    species, 
    sex,
    contains("bill"),
    contains("flipper")
  ) |>
  drop_na()  # drop rows with any missing value (tidyr, loaded by tidyverse)

# Calculate correlations between bill and flipper measurements
correlation_analysis <- morphology_data |>
  select(contains("mm")) |>
  cor() |>
  round(3)

print(correlation_analysis)
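Instead of calling contains() once per pattern, you can pass a character vector; tidyselect takes the union of the matches. The sketch below compares the two forms on a hand-made tibble that reuses some penguins column names (a stand-in chosen so the block is self-contained).

```r
library(dplyr)

# Stand-in tibble reusing penguins column names
demo <- tibble(
  species = "Adelie",
  bill_length_mm = 39.1,
  bill_depth_mm = 18.7,
  flipper_length_mm = 181,
  body_mass_g = 3750
)

# One contains() call per pattern
separate_calls <- demo |>
  select(contains("bill"), contains("flipper")) |>
  names()

# A single contains() call with a vector of patterns
one_vector <- demo |>
  select(contains(c("bill", "flipper"))) |>
  names()

identical(separate_calls, one_vector)
```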

Here’s another practical example that uses contains() inside across() to summarise several columns at once:

# Create a summary focusing on measurements (columns whose names contain "mm")
measurement_summary <- penguins |>
  select(species, contains("mm")) |>
  group_by(species) |>
  summarise(
    across(contains("mm"), 
           list(mean = ~mean(.x, na.rm = TRUE),
                sd = ~sd(.x, na.rm = TRUE)),
           .names = "{.col}_{.fn}")
  )

print(measurement_summary)
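In penguins, every column containing "mm" also ends with it, so the two notions coincide. That is not guaranteed in general: contains() matches the substring anywhere in the name. The sketch below illustrates the difference using a made-up tibble; the mm_notes column is hypothetical and does not exist in penguins.

```r
library(dplyr)

demo <- tibble(
  bill_length_mm = 39.1,
  mm_notes = "checked",   # hypothetical column, invented for this sketch
  body_mass_g = 3750
)

# contains() matches "mm" anywhere in the column name
contains_mm <- demo |>
  select(contains("mm")) |>
  names()

# ends_with() only matches names that finish with "mm"
ends_mm <- demo |>
  select(ends_with("mm")) |>
  names()

contains_mm
ends_mm
```

When the suffix is what you actually mean, ends_with() is the stricter choice.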

Summary

The contains() function is an essential tool for efficient column selection in dplyr. Key takeaways include:

Use contains("string") within select() to choose columns with names containing specific text
Combine contains() with other selection helpers and column names for flexible data subsetting
Matching is case-insensitive by default; set ignore.case = FALSE for case-sensitive matching
contains() works seamlessly with other dplyr functions like mutate(), summarise(), and across()
This approach is particularly valuable for datasets with systematic naming conventions
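As one more illustration of the point about mutate() and across(), here is a rough sketch that converts every "mm" column to centimetres. The two-row tibble is a stand-in for penguins, and the "_as_cm" suffix is an arbitrary naming choice made for this example.

```r
library(dplyr)

# Stand-in tibble reusing penguins column names
demo <- tibble(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.1),
  flipper_length_mm = c(181, 211)
)

# Add a centimetre version of each column whose name contains "mm"
in_cm <- demo |>
  mutate(across(contains("mm"), ~ .x / 10, .names = "{.col}_as_cm"))

names(in_cm)
```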

By mastering contains(), you’ll be able to work more efficiently with wide datasets and write selections that keep working when new columns following the same naming pattern are added.
