dplyr filter(): How to select rows with partially matching string

dplyr

Master dplyr filter() to how to select rows with partially matching string. Complete R tutorial with examples using real datasets.

Published

July 11, 2022

Introduction

String matching is a fundamental task in data analysis, and dplyr’s filter() function provides powerful tools for selecting rows based on partial string matches. Unlike exact matching, partial string matching allows you to find rows where a column contains, starts with, or ends with specific text patterns.

This technique is invaluable when working with messy real-world data where you need to find records containing certain keywords, filter by partial names, or identify entries with specific patterns. Whether you’re searching for all products containing “premium” in their names or finding all locations with “North” in their description, partial string matching with filter() makes these tasks straightforward and efficient.

Getting Started

First, let’s load the required packages and prepare our data:

library(tidyverse)
library(palmerpenguins)

# Preview the penguins dataset
glimpse(penguins)

Example 1: Basic Usage

Let’s start with simple partial string matching using the penguins dataset. We’ll use str_detect() to find rows where the species name contains specific patterns:

# Filter penguins with "Adelie" in species name
penguins |>
  filter(str_detect(species, "Adelie")) |>
  select(species, island, bill_length_mm)

# Filter islands containing "Dream"
penguins |>
  filter(str_detect(island, "Dream")) |>
  select(species, island, year) |>
  head()

# Case-insensitive matching
penguins |>
  filter(str_detect(tolower(species), "chin")) |>
  select(species, island, body_mass_g) |>
  head()

You can also use the grepl() function as an alternative to str_detect():

# Using grepl() for the same result
penguins |>
  filter(grepl("Adelie", species)) |>
  select(species, island, flipper_length_mm) |>
  head()

Example 2: Practical Application

Now let’s explore more advanced partial matching techniques with real-world scenarios. We’ll work with the mtcars dataset to demonstrate various string matching patterns:

# Convert mtcars rownames to a column for easier manipulation
cars_data <- mtcars |>
  rownames_to_column(var = "car_name")

# Find all cars with "Merc" in the name (Mercedes models)
mercedes_cars <- cars_data |>
  filter(str_detect(car_name, "Merc")) |>
  select(car_name, mpg, hp, wt) |>
  arrange(desc(mpg))

print(mercedes_cars)

Let’s combine multiple string patterns and conditions:

# Find cars that start with "Ford" or contain "Camaro"
american_muscle <- cars_data |>
  filter(str_detect(car_name, "^Ford|Camaro")) |>
  select(car_name, mpg, hp, qsec) |>
  arrange(desc(hp))

# Find cars ending with numbers (model years)
cars_with_numbers <- cars_data |>
  filter(str_detect(car_name, "\\d+$")) |>
  select(car_name, cyl, disp, hp) |>
  arrange(car_name)

Here’s a more complex example combining partial string matching with other conditions:

# Find high-performance cars (hp > 150) with specific name patterns
performance_cars <- cars_data |>
  filter(
    hp > 150,
    str_detect(car_name, "Merc|Ford|Pontiac"),
    !str_detect(car_name, "280")  # Exclude specific models
  ) |>
  select(car_name, mpg, hp, qsec) |>
  mutate(
    performance_category = case_when(
      hp > 200 ~ "High Performance",
      hp > 150 ~ "Performance",
      TRUE ~ "Standard"
    )
  ) |>
  arrange(desc(hp))

You can also use negative matching to exclude certain patterns:

# Find all cars NOT containing "Merc" or "Toyota"
non_luxury <- cars_data |>
  filter(!str_detect(car_name, "Merc|Toyota")) |>
  select(car_name, mpg, hp) |>
  filter(mpg > 20) |>
  arrange(desc(mpg))

Summary

Partial string matching with `dplyr::filter()` is essential for flexible data filtering. Key functions include `str_detect()` for pattern detection and `grepl()` as an alternative. Use `^` for start-of-string matching, `$` for end-of-string, and `|` for multiple patterns. Combine with `tolower()` for case-insensitive searches and `!` for negative matching. These techniques enable powerful data exploration and cleaning workflows when working with text-based filtering criteria.

--- title: "dplyr filter(): How to select rows with partially matching string" description: "Master dplyr filter() to how to select rows with partially matching string. Complete R tutorial with examples using real datasets." date: 2022-07-11 categories: ['dplyr'] format: html: code-fold: false code-tools: true --- ## Introduction String matching is a fundamental task in data analysis, and `dplyr`'s [`filter()`](/dplyr/how-to-use-filter-in-r.html) function provides powerful tools for selecting rows based on partial string matches. Unlike exact matching, partial string matching allows you to find rows where a column contains, starts with, or ends with specific text patterns. This technique is invaluable when working with messy real-world data where you need to find records containing certain keywords, filter by partial names, or identify entries with specific patterns. Whether you're searching for all products containing "premium" in their names or finding all locations with "North" in their description, partial string matching with `filter()` makes these tasks straightforward and efficient. ## Getting Started First, let's load the required packages and prepare our data: ```r library(tidyverse) library(palmerpenguins) # Preview the penguins dataset glimpse(penguins) ``` ## Example 1: Basic Usage Let's start with simple partial string matching using the `penguins` dataset. We'll use `str_detect()` to find rows where the species name contains specific patterns: ```r # Filter penguins with "Adelie" in species name penguins |> filter(str_detect(species, "Adelie")) |> select(species, island, bill_length_mm) # Filter islands containing "Dream" penguins |> filter(str_detect(island, "Dream")) |> select(species, island, year) |> head() # Case-insensitive matching penguins |> filter(str_detect(tolower(species), "chin")) |> select(species, island, body_mass_g) |> head() ``` You can also use the `grepl()` function as an alternative to `str_detect()`: ```r # Using grepl() for the same result penguins |> filter(grepl("Adelie", species)) |> select(species, island, flipper_length_mm) |> head() ``` ## Example 2: Practical Application Now let's explore more advanced partial matching techniques with real-world scenarios. We'll work with the `mtcars` dataset to demonstrate various string matching patterns: ```r # Convert mtcars rownames to a column for easier manipulation cars_data <- mtcars |> rownames_to_column(var = "car_name") # Find all cars with "Merc" in the name (Mercedes models) mercedes_cars <- cars_data |> filter(str_detect(car_name, "Merc")) |> select(car_name, mpg, hp, wt) |> arrange(desc(mpg)) print(mercedes_cars) ``` Let's combine multiple string patterns and conditions: ```r # Find cars that start with "Ford" or contain "Camaro" american_muscle <- cars_data |> filter(str_detect(car_name, "^Ford|Camaro")) |> select(car_name, mpg, hp, qsec) |> arrange(desc(hp)) # Find cars ending with numbers (model years) cars_with_numbers <- cars_data |> filter(str_detect(car_name, "\\d+$")) |> select(car_name, cyl, disp, hp) |> arrange(car_name) ``` Here's a more complex example combining partial string matching with other conditions: ```r # Find high-performance cars (hp > 150) with specific name patterns performance_cars <- cars_data |> filter( hp > 150, str_detect(car_name, "Merc|Ford|Pontiac"), !str_detect(car_name, "280") # Exclude specific models ) |> select(car_name, mpg, hp, qsec) |> mutate( performance_category = case_when( hp > 200 ~ "High Performance", hp > 150 ~ "Performance", TRUE ~ "Standard" ) ) |> arrange(desc(hp)) ``` You can also use negative matching to exclude certain patterns: ```r # Find all cars NOT containing "Merc" or "Toyota" non_luxury <- cars_data |> filter(!str_detect(car_name, "Merc|Toyota")) |> select(car_name, mpg, hp) |> filter(mpg > 20) |> arrange(desc(mpg)) ``` ## Summary Partial string matching with `dplyr::filter()` is essential for flexible data filtering. Key functions include `str_detect()` for pattern detection and `grepl()` as an alternative. Use `^` for start-of-string matching, `$` for end-of-string, and `|` for multiple patterns. Combine with `tolower()` for case-insensitive searches and `!` for negative matching. These techniques enable powerful data exploration and cleaning workflows when working with text-based filtering criteria. --- ## Related Posts - [How to filter rows in a dataframe: dplyr's filter()](/dplyr/dplyr-filter-select-rows-in-a-dataframe.html) - [dplyr contains(): select columns that contains a string](/dplyr/dplyr-contains-select-columns-that-contains-a-string.html) - [dplyr n_distinct(): count unique elements or rows](/dplyr/dplyr-n_distinct-count-unique-combinations.html) - [tidyr's separate_delim_wider(): Split a string into columns](/tidyr/tidyrs-separate_delim_wider-split-a-string-into-columns.html) - [How to Separate a Column into Multiple Rows in R: Hint tidyr's spearate_row()](/tidyr/separate-a-collapsed-column-into-multiple-rows.html)