How to use filter() in R
dplyr::filter() Tutorial
Introduction
The filter() function from the dplyr package is used to subset rows from a data frame based on logical conditions. It’s one of the most fundamental data manipulation functions in the tidyverse, allowing you to keep only the rows that meet your specified criteria while discarding all others. You would use filter() when you need to extract specific observations from your dataset - for example, selecting only data from a particular year, filtering for values above a certain threshold, or excluding missing values. This function is part of the dplyr package, which is included in the tidyverse collection of packages, and it’s essential for exploratory data analysis and data cleaning workflows.
Syntax
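filter(.data, ..., .preserve = FALSE)
Key arguments:
- .data: A data frame or tibble to filter
- ...: Logical predicates defined in terms of the variables in .data. Multiple conditions are combined with & (AND logic). Only rows where every condition evaluates to TRUE are kept; rows where a condition is FALSE or NA are dropped
- .preserve: Only relevant when .data is grouped. When FALSE (default), the grouping structure is recalculated based on the resulting data; when TRUE, the original grouping is kept as is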
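The effect of .preserve is easiest to see on grouped data. The following is a minimal sketch on a made-up tibble (the names toy, g and x are hypothetical and not part of the penguins data); it counts the groups left after filtering under each setting.

```r
library(dplyr)

# Hypothetical toy data with two groups, "a" and "b"
toy <- tibble(g = c("a", "a", "b"), x = c(1, 5, 2))
grouped <- toy |> group_by(g)

# Default (.preserve = FALSE): the grouping is recalculated, so group "b",
# which has no rows left after filtering, is dropped from the structure
dropped <- grouped |> filter(x > 3)
n_groups(dropped)

# .preserve = TRUE: the original grouping structure is kept,
# including the group that no longer contains any rows
kept <- grouped |> filter(x > 3, .preserve = TRUE)
n_groups(kept)
```

Both results contain the same single row; only the grouping metadata differs, which matters mainly when later steps rely on the full set of groups.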
Example 1: Basic Usage
Let’s start with a simple example using the Palmer Penguins dataset:
library(tidyverse)
library(palmerpenguins)
# Filter penguins with bill length greater than 45mm
filtered_penguins <- penguins |>
  filter(bill_length_mm > 45)
head(filtered_penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 46.0 21.5 194 4200
2 Adelie Torgersen 48.7 19.3 196 3700
3 Adelie Torgersen 46.5 17.9 192 3800
4 Adelie Torgersen 47.8 18.1 178 4850
5 Adelie Torgersen 48.2 14.3 210 4600
6 Adelie Torgersen 46.1 20.2 198 4400
# ℹ 2 more variables: sex <fct>, year <int>
This filtered our original 344 penguins down to only those with bill lengths exceeding 45mm. The filter() function evaluated the condition bill_length_mm > 45 for each row and kept only those where it was TRUE. Rows where the condition was FALSE or NA (a missing bill length) were dropped.
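To make the row-by-row evaluation concrete, here is a small sketch on a made-up tibble (toy, id and bill are hypothetical names). It builds the logical vector by hand and compares filter() with a base R equivalent.

```r
library(dplyr)

# Hypothetical toy data with one missing measurement
toy <- tibble(id = 1:4, bill = c(44.0, 46.5, NA, 50.1))

# The condition is a logical vector with one element per row:
# FALSE, TRUE, NA, TRUE
cond <- toy$bill > 45

# filter() keeps the rows where the condition is TRUE;
# the FALSE row and the NA row are both dropped
kept <- toy |> filter(bill > 45)

# Base R equivalent: which() returns only the TRUE positions
base_kept <- toy[which(cond), ]
```

Here kept and base_kept contain the rows with id 2 and 4.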
Example 2: Practical Application
Here’s a more complex real-world example where we want to analyze large Gentoo penguins from a specific island and year:
# Filter for large Gentoo penguins from Biscoe island in 2008, then summarize
large_gentoo_analysis <- penguins |>
  filter(
    species == "Gentoo",
    island == "Biscoe",
    year == 2008,
    body_mass_g > 5000
  ) |>
  summarise(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    .by = sex
  )
large_gentoo_analysis
# A tibble: 2 × 4
sex count avg_bill_length avg_flipper_length
<fct> <int> <dbl> <dbl>
1 male 13 47.4 222.
2 female 2 45.6 212.
This example demonstrates multiple filter conditions combined together (they work as AND logic by default), followed by grouping and summarization to create meaningful insights about our filtered subset.
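As a quick check of the AND behaviour, the sketch below uses a made-up tibble (toy and its columns are hypothetical) to compare comma-separated conditions with the same conditions joined explicitly by &.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("Gentoo", "Gentoo", "Adelie", "Gentoo"),
  year    = c(2008L, 2009L, 2008L, 2008L),
  mass    = c(5200, 5400, 3900, 4800)
)

# Conditions separated by commas
with_commas <- toy |>
  filter(species == "Gentoo", year == 2008, mass > 5000)

# The same conditions joined with &
with_ampersand <- toy |>
  filter(species == "Gentoo" & year == 2008 & mass > 5000)

# Both approaches select the same rows
all.equal(with_commas, with_ampersand)
```

Only the first row satisfies all three conditions, so each result has a single row. Commas tend to be easier to read when every condition must hold.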
Example 3: Advanced Usage
Here are some advanced filtering techniques including handling missing values and using complex conditions:
# Advanced filtering with multiple conditions and NA handling
advanced_filter <- penguins |>
  filter(
    # Remove rows with missing bill measurements
    !is.na(bill_length_mm) & !is.na(bill_depth_mm),
    # Complex condition: either large Adelie or any Chinstrap
    (species == "Adelie" & body_mass_g > 4000) | species == "Chinstrap",
    # Use %in% for multiple values
    year %in% c(2007, 2009)
  ) |>
  arrange(desc(body_mass_g))
head(advanced_filter, 3)
# A tibble: 3 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Chinstrap Dream 55.8 19.8 207 4000
2 Chinstrap Dream 52.5 20.0 220 4000
3 Adelie Biscoe 42.5 20.7 197 4400
# ℹ 2 more variables: sex <fct>, year <int>
This example shows how to combine multiple logical operators (&, |, !) and use functions like is.na() and %in% within filter conditions.
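Where the parentheses go changes which rows are returned when & and | are mixed. The sketch below illustrates this on a made-up tibble (toy, species and mass are hypothetical stand-ins for the penguins columns).

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo"),
  mass    = c(4200, 3500, 3600, 4100, 5000)
)

# Large Adelie, or any Chinstrap regardless of mass
explicit <- toy |>
  filter((species == "Adelie" & mass > 4000) | species == "Chinstrap")

# In R, & binds more tightly than |, so dropping the parentheses
# gives the same rows here; the parentheses mainly document intent
implicit <- toy |>
  filter(species == "Adelie" & mass > 4000 | species == "Chinstrap")

# Moving the parentheses changes the meaning:
# Adelie or Chinstrap, and in both cases mass above 4000
regrouped <- toy |>
  filter((species == "Adelie" | species == "Chinstrap") & mass > 4000)
```

On this toy data, explicit and implicit keep three rows (masses 4200, 3600, 4100), while regrouped keeps two (4200 and 4100).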
Common Mistakes
1. Using = instead of == for equality:
# Wrong
penguins |> filter(species = "Adelie")
# Correct
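penguins |> filter(species == "Adelie")
2. Forgetting how filter() treats missing values:
# filter() silently drops rows where the condition evaluates to NA,
# so penguins with a missing bill_length_mm are excluded here
penguins |> filter(bill_length_mm > 40)
# Adding !is.na() returns the same rows; it only makes the intent explicit
penguins |> filter(bill_length_mm > 40, !is.na(bill_length_mm))
# To keep the rows with missing values, ask for them explicitly
penguins |> filter(bill_length_mm > 40 | is.na(bill_length_mm))
3. Misunderstanding multiple conditions (they’re combined with AND by default):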
# This filters for penguins that are BOTH Adelie AND Gentoo (impossible!)
penguins |> filter(species == "Adelie", species == "Gentoo")
# Use | for OR logic instead
penguins |> filter(species == "Adelie" | species == "Gentoo")
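The third mistake is easy to confirm by counting rows. The sketch below uses a made-up tibble (toy is a hypothetical name) to show that contradictory AND conditions return nothing, and that %in% is a compact alternative to chaining | when matching one column against several values.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"))

# Contradictory AND conditions: no row can satisfy both
none <- toy |> filter(species == "Adelie", species == "Gentoo")
nrow(none)

# OR logic written with |
either_or <- toy |> filter(species == "Adelie" | species == "Gentoo")

# The same selection written with %in%
either_in <- toy |> filter(species %in% c("Adelie", "Gentoo"))
```

An unexpectedly empty result like none is often the first sign that conditions meant as alternatives were written as commas.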