dplyr filter(): How to select rows with partially matching string
Introduction
String matching is a fundamental task in data analysis, and dplyr’s filter() function provides powerful tools for selecting rows based on partial string matches. Unlike exact matching, partial string matching allows you to find rows where a column contains, starts with, or ends with specific text patterns.
This technique is invaluable when working with messy real-world data where you need to find records containing certain keywords, filter by partial names, or identify entries with specific patterns. Whether you’re searching for all products containing “premium” in their names or finding all locations with “North” in their description, partial string matching with filter() makes these tasks straightforward and efficient.
Getting Started
First, let’s load the required packages and prepare our data:
library(tidyverse)
library(palmerpenguins)
# Preview the penguins dataset
glimpse(penguins)Example 1: Basic Usage
Let’s start with simple partial string matching using the penguins dataset. We’ll use str_detect() to find rows where the species name contains specific patterns:
# Filter penguins with "Adelie" in species name
penguins |>
filter(str_detect(species, "Adelie")) |>
select(species, island, bill_length_mm)
# Filter islands containing "Dream"
penguins |>
filter(str_detect(island, "Dream")) |>
select(species, island, year) |>
head()
# Case-insensitive matching
penguins |>
filter(str_detect(tolower(species), "chin")) |>
select(species, island, body_mass_g) |>
head()You can also use the grepl() function as an alternative to str_detect():
# Using grepl() for the same result
penguins |>
filter(grepl("Adelie", species)) |>
select(species, island, flipper_length_mm) |>
head()Example 2: Practical Application
Now let’s explore more advanced partial matching techniques with real-world scenarios. We’ll work with the mtcars dataset to demonstrate various string matching patterns:
# Convert mtcars rownames to a column for easier manipulation
cars_data <- mtcars |>
rownames_to_column(var = "car_name")
# Find all cars with "Merc" in the name (Mercedes models)
mercedes_cars <- cars_data |>
filter(str_detect(car_name, "Merc")) |>
select(car_name, mpg, hp, wt) |>
arrange(desc(mpg))
print(mercedes_cars)Let’s combine multiple string patterns and conditions:
# Find cars that start with "Ford" or contain "Camaro"
american_muscle <- cars_data |>
filter(str_detect(car_name, "^Ford|Camaro")) |>
select(car_name, mpg, hp, qsec) |>
arrange(desc(hp))
# Find cars ending with numbers (model years)
cars_with_numbers <- cars_data |>
filter(str_detect(car_name, "\\d+$")) |>
select(car_name, cyl, disp, hp) |>
arrange(car_name)Here’s a more complex example combining partial string matching with other conditions:
# Find high-performance cars (hp > 150) with specific name patterns
performance_cars <- cars_data |>
filter(
hp > 150,
str_detect(car_name, "Merc|Ford|Pontiac"),
!str_detect(car_name, "280") # Exclude specific models
) |>
select(car_name, mpg, hp, qsec) |>
mutate(
performance_category = case_when(
hp > 200 ~ "High Performance",
hp > 150 ~ "Performance",
TRUE ~ "Standard"
)
) |>
arrange(desc(hp))You can also use negative matching to exclude certain patterns:
# Find all cars NOT containing "Merc" or "Toyota"
non_luxury <- cars_data |>
filter(!str_detect(car_name, "Merc|Toyota")) |>
select(car_name, mpg, hp) |>
filter(mpg > 20) |>
arrange(desc(mpg))