dplyr filter(): How to select rows with partially matching string

dplyr
Master dplyr filter() to how to select rows with partially matching string. Complete R tutorial with examples using real datasets.
Published

July 11, 2022

Introduction

String matching is a fundamental task in data analysis, and dplyr’s filter() function provides powerful tools for selecting rows based on partial string matches. Unlike exact matching, partial string matching allows you to find rows where a column contains, starts with, or ends with specific text patterns.

This technique is invaluable when working with messy real-world data where you need to find records containing certain keywords, filter by partial names, or identify entries with specific patterns. Whether you’re searching for all products containing “premium” in their names or finding all locations with “North” in their description, partial string matching with filter() makes these tasks straightforward and efficient.

Getting Started

First, let’s load the required packages and prepare our data:

library(tidyverse)
library(palmerpenguins)

# Preview the penguins dataset
glimpse(penguins)

Example 1: Basic Usage

Let’s start with simple partial string matching using the penguins dataset. We’ll use str_detect() to find rows where the species name contains specific patterns:

# Filter penguins with "Adelie" in species name
penguins |>
  filter(str_detect(species, "Adelie")) |>
  select(species, island, bill_length_mm)

# Filter islands containing "Dream"
penguins |>
  filter(str_detect(island, "Dream")) |>
  select(species, island, year) |>
  head()

# Case-insensitive matching
penguins |>
  filter(str_detect(tolower(species), "chin")) |>
  select(species, island, body_mass_g) |>
  head()

You can also use the grepl() function as an alternative to str_detect():

# Using grepl() for the same result
penguins |>
  filter(grepl("Adelie", species)) |>
  select(species, island, flipper_length_mm) |>
  head()

Example 2: Practical Application

Now let’s explore more advanced partial matching techniques with real-world scenarios. We’ll work with the mtcars dataset to demonstrate various string matching patterns:

# Convert mtcars rownames to a column for easier manipulation
cars_data <- mtcars |>
  rownames_to_column(var = "car_name")

# Find all cars with "Merc" in the name (Mercedes models)
mercedes_cars <- cars_data |>
  filter(str_detect(car_name, "Merc")) |>
  select(car_name, mpg, hp, wt) |>
  arrange(desc(mpg))

print(mercedes_cars)

Let’s combine multiple string patterns and conditions:

# Find cars that start with "Ford" or contain "Camaro"
american_muscle <- cars_data |>
  filter(str_detect(car_name, "^Ford|Camaro")) |>
  select(car_name, mpg, hp, qsec) |>
  arrange(desc(hp))

# Find cars ending with numbers (model years)
cars_with_numbers <- cars_data |>
  filter(str_detect(car_name, "\\d+$")) |>
  select(car_name, cyl, disp, hp) |>
  arrange(car_name)

Here’s a more complex example combining partial string matching with other conditions:

# Find high-performance cars (hp > 150) with specific name patterns
performance_cars <- cars_data |>
  filter(
    hp > 150,
    str_detect(car_name, "Merc|Ford|Pontiac"),
    !str_detect(car_name, "280")  # Exclude specific models
  ) |>
  select(car_name, mpg, hp, qsec) |>
  mutate(
    performance_category = case_when(
      hp > 200 ~ "High Performance",
      hp > 150 ~ "Performance",
      TRUE ~ "Standard"
    )
  ) |>
  arrange(desc(hp))

You can also use negative matching to exclude certain patterns:

# Find all cars NOT containing "Merc" or "Toyota"
non_luxury <- cars_data |>
  filter(!str_detect(car_name, "Merc|Toyota")) |>
  select(car_name, mpg, hp) |>
  filter(mpg > 20) |>
  arrange(desc(mpg))

Summary

Partial string matching with dplyr::filter() is essential for flexible data filtering. Key functions include str_detect() for pattern detection and grepl() as an alternative. Use ^ for start-of-string matching, $ for end-of-string, and | for multiple patterns. Combine with tolower() for case-insensitive searches and ! for negative matching. These techniques enable powerful data exploration and cleaning workflows when working with text-based filtering criteria.