How to use filter() in R

dplyr

dplyr filter()

Published

February 20, 2026

dplyr::filter() Tutorial

Introduction

The filter() function from the dplyr package subsets the rows of a data frame using logical conditions. It is one of the most fundamental data manipulation functions in the tidyverse: it keeps the rows for which every condition evaluates to TRUE and drops the rest, including rows where a condition evaluates to NA. Typical uses include selecting the observations from a particular year, keeping values above a threshold, or excluding missing values. dplyr is loaded as part of the tidyverse, and filter() is a staple of exploratory data analysis and data cleaning workflows.

Syntax

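filter(.data, ..., .by = NULL, .preserve = FALSE)

Key arguments:

- .data: A data frame or tibble to filter
- ...: Logical conditions written in terms of the variables in .data. Multiple conditions are combined with & (AND logic), and only rows where every condition is TRUE are kept
- .by: Optional columns to group by for this one call only (dplyr 1.1.0 and later)
- .preserve: Relevant only when .data is grouped. When FALSE (default), the grouping structure is recalculated based on the resulting data; when TRUE, the grouping is kept as is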
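As a minimal sketch of how these arguments behave, the snippet below uses a small made-up data frame (`toy`, which is not part of this tutorial's data) to show that comma-separated conditions return the same rows as a single condition joined with `&`. It also shows `.by` applying a condition within each group, which assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 5, 10, 20)
)

# Comma-separated conditions are combined with AND ...
with_commas <- toy |> filter(group == "a", value > 2)

# ... so this returns the same rows as joining them with &
with_and <- toy |> filter(group == "a" & value > 2)

identical(with_commas, with_and)

# .by evaluates the condition separately within each group:
# keep rows whose value exceeds the mean of their own group
above_group_mean <- toy |> filter(value > mean(value), .by = group)
above_group_mean
```

Without `.by`, `mean(value)` would be computed over the whole data frame rather than per group.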

Example 1: Basic Usage

Let’s start with a simple example using the Palmer Penguins dataset:

library(tidyverse)
library(palmerpenguins)

# Filter penguins with bill length greater than 45mm
filtered_penguins <- penguins |> 
  filter(bill_length_mm > 45)

head(filtered_penguins)

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           46.0          21.5               194        4200
2 Adelie  Torgersen           48.7          19.3               196        3700
3 Adelie  Torgersen           46.5          17.9               192        3800
4 Adelie  Torgersen           47.8          18.1               178        4850
5 Adelie  Torgersen           48.2          14.3               210        4600
6 Adelie  Torgersen           46.1          20.2               198        4400
# ℹ 2 more variables: sex <fct>, year <int>

This reduces the original 344 penguins to those with bill lengths above 45 mm. filter() evaluates the condition bill_length_mm > 45 for each row and keeps only the rows where it is TRUE, so penguins with a missing bill length are dropped as well.
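To make the row-by-row evaluation concrete, here is a small sketch on made-up data (the `toy` tibble and its values are assumptions for illustration, not rows from the penguins dataset). It shows the logical vector that the condition produces and the rows that survive.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  id = 1:5,
  bill_length_mm = c(39.1, 46.0, NA, 50.2, 44.9)
)

# The condition is evaluated once per row, giving TRUE, FALSE, or NA
keep <- toy$bill_length_mm > 45
keep

# filter() keeps only the rows where the condition is TRUE;
# rows where it is FALSE or NA are dropped
filtered <- toy |> filter(bill_length_mm > 45)
filtered$id

# The same rows in base R; which() skips NA in the same way
base_rows <- toy[which(keep), ]
base_rows$id
```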

Example 2: Practical Application

Here’s a more complex real-world example where we want to analyze large Gentoo penguins from a specific island and year:

# Filter for large Gentoo penguins from Biscoe island in 2008, then summarize
large_gentoo_analysis <- penguins |> 
  filter(species == "Gentoo", 
         island == "Biscoe", 
         year == 2008, 
         body_mass_g > 5000) |> 
  summarise(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    .by = sex
  )

large_gentoo_analysis

# A tibble: 2 × 4
  sex    count avg_bill_length avg_flipper_length
  <fct>  <int>           <dbl>              <dbl>
1 male      13            47.4               222.
2 female     2            45.6               212.

This example combines several filter conditions separated by commas (which filter() treats as AND logic), then computes a per-sex summary with summarise() and its .by argument, which requires dplyr 1.1.0 or later.
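The filter-then-summarise pattern can be sketched on a few made-up rows (the `toy` tibble is an assumption for illustration). The sketch compares per-operation grouping via `.by`, assuming dplyr 1.1.0 or later, with the older `group_by()` form. The two give the same numbers but may order the groups differently.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  sex = c("male", "female", "male", "female"),
  body_mass_g = c(5200, 5100, 5600, 4800)
)

heavy <- toy |> filter(body_mass_g > 5000)

# Per-operation grouping with .by (assumes dplyr 1.1.0 or later);
# groups appear in order of first appearance
by_arg <- heavy |>
  summarise(count = n(), avg_mass = mean(body_mass_g), .by = sex)

# The group_by() form; groups are sorted by the grouping key
grouped <- heavy |>
  group_by(sex) |>
  summarise(count = n(), avg_mass = mean(body_mass_g))

by_arg
grouped
```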

Example 3: Advanced Usage

Here are some advanced filtering techniques including handling missing values and using complex conditions:

# Advanced filtering with multiple conditions and NA handling
advanced_filter <- penguins |> 
  filter(
    # Remove rows with missing bill measurements
    !is.na(bill_length_mm) & !is.na(bill_depth_mm),
    # Complex condition: either large Adelie or any Chinstrap
    (species == "Adelie" & body_mass_g > 4000) | species == "Chinstrap",
    # Use %in% for multiple values
    year %in% c(2007, 2009)
  ) |> 
  arrange(desc(body_mass_g))

head(advanced_filter, 3)

# A tibble: 3 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
1 Chinstrap Dream            55.8          19.8               207        4000
2 Chinstrap Dream            52.5          20.0               220        4000
3 Adelie    Biscoe           42.5          20.7               197        4400
# ℹ 2 more variables: sex <fct>, year <int>

This example shows how to combine multiple logical operators (&, |, !) and use functions like is.na() and %in% within filter conditions.
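When mixing `&` and `|`, it helps to remember that R evaluates `&` before `|`. The sketch below, on made-up data (the `toy` tibble is an assumption for illustration), shows that the parentheses in the example above only make the default grouping explicit, while moving them changes which rows are kept.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  species = c("Adelie", "Adelie", "Chinstrap", "Gentoo"),
  body_mass_g = c(3500, 4200, 3600, 5000)
)

# Explicit parentheses: (large Adelie) OR (any Chinstrap)
explicit <- toy |>
  filter((species == "Adelie" & body_mass_g > 4000) | species == "Chinstrap")

# Same result without parentheses, because & binds more tightly than |
implicit <- toy |>
  filter(species == "Adelie" & body_mass_g > 4000 | species == "Chinstrap")

identical(explicit, implicit)

# Moving the parentheses changes the meaning:
# Adelie AND (large OR Chinstrap), which can only match large Adelie
moved <- toy |>
  filter(species == "Adelie" & (body_mass_g > 4000 | species == "Chinstrap"))

nrow(explicit)
nrow(moved)
```

Writing the parentheses out, as the example above does, is usually the safer choice for readability.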

Common Mistakes

1. Using = instead of == for equality:

# Wrong: = names an argument instead of comparing, so dplyr raises an error
penguins |> filter(species = "Adelie")

# Correct
penguins |> filter(species == "Adelie")

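2. Forgetting that filter() silently drops rows where the condition is NA:

# Rows with a missing bill_length_mm are dropped, because NA > 40 is NA, not TRUE
penguins |> filter(bill_length_mm > 40)

# To keep the rows with missing values, ask for them explicitly
penguins |> filter(bill_length_mm > 40 | is.na(bill_length_mm))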
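Missing values are easiest to overlook with `!=` and other negated conditions. The following sketch uses made-up data (the `toy` tibble is an assumption for illustration) to show a row with a missing value disappearing from a "not equal" filter, and one way to keep it.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  id  = 1:4,
  sex = c("male", "female", NA, "female")
)

# NA != "male" is NA, so the row with missing sex is dropped as well
not_male <- toy |> filter(sex != "male")
not_male$id

# Keep rows with missing sex by asking for them explicitly
not_male_or_missing <- toy |> filter(sex != "male" | is.na(sex))
not_male_or_missing$id
```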

3. Misunderstanding multiple conditions (they’re combined with AND by default):

# This asks for penguins that are BOTH Adelie AND Gentoo, so it returns zero rows
penguins |> filter(species == "Adelie", species == "Gentoo")

# Use | for OR logic instead
penguins |> filter(species == "Adelie" | species == "Gentoo")
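A short sketch on made-up data (the `toy` tibble is an assumption for illustration) confirms the behaviour described above, and shows `%in%` as a more compact way to write the OR version when matching one column against several values.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  id = 1:4,
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie")
)

# No row can satisfy both conditions, so the result has zero rows
both <- toy |> filter(species == "Adelie", species == "Gentoo")
nrow(both)

# OR keeps rows matching either species
either <- toy |> filter(species == "Adelie" | species == "Gentoo")

# %in% expresses the same thing more compactly
either_in <- toy |> filter(species %in% c("Adelie", "Gentoo"))

identical(either, either_in)
```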

Related Functions

- slice(): Select rows by position rather than logical conditions
- distinct(): Remove duplicate rows from a data frame
- sample_n() / slice_sample(): Randomly sample a specific number of rows
- top_n() / slice_max(): Select top n rows based on a variable
- select(): Choose columns (while filter chooses rows)
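For a quick sense of how these related functions differ from `filter()`, here is a minimal sketch on made-up data (the `toy` tibble is an assumption for illustration). It uses the newer `slice_max()` rather than `top_n()`.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  id = c(1, 2, 2, 3),
  score = c(10, 30, 30, 20)
)

# slice(): rows by position
first_two <- toy |> slice(1:2)

# distinct(): drop duplicate rows (the repeated id 2 row)
unique_rows <- toy |> distinct()

# slice_max(): the row with the largest score, after removing duplicates
top <- unique_rows |> slice_max(score, n = 1)

# select(): choose columns instead of rows
only_score <- toy |> select(score)
```

By default `slice_max()` keeps ties, so applying it to `toy` directly would return both duplicated rows; that is why duplicates are removed first here.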