How to filter rows in a dataframe: dplyr’s filter()

dplyr
dplyr filter()
Published: July 9, 2022

Introduction

The filter() function from dplyr keeps the rows of a data frame for which a logical condition is TRUE and drops all others. It is the standard way to work with a subset of your data, whether you are exploring a dataset or preparing it for visualization or modeling.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Filtering

The Problem

We need to extract specific rows from the penguins dataset based on simple conditions. Let’s start by filtering penguins from a single species and then explore different logical operators.

Step 1: Filter by single condition

We’ll filter the dataset to show only Adelie penguins.

penguins |>
  filter(species == "Adelie") |>
  head()

This returns only rows where the species column equals “Adelie”, giving us a subset of the original dataset.
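Note that == compares strings exactly, including case. The sketch below illustrates this on a small toy data frame (the values are made up for illustration and are not taken from the penguins data); it assumes only that dplyr is installed.

```r
library(dplyr)

# Toy data, invented for this example (not the penguins dataset)
toy <- tibble(
  id      = 1:4,
  species = c("Adelie", "adelie", "Gentoo", "Adelie")
)

# Exact match: only rows spelled exactly "Adelie" are kept
exact <- toy |>
  filter(species == "Adelie")

# Case-insensitive match: lower-case the column before comparing
any_case <- toy |>
  filter(tolower(species) == "adelie")
```

Here exact contains the rows with id 1 and 4, while any_case also picks up the lower-case spelling in row 2.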

Step 2: Filter with numeric conditions

Now let’s filter penguins based on their body mass, keeping only those heavier than 4000 grams.

penguins |>
  filter(body_mass_g > 4000) |>
  select(species, body_mass_g, sex) |>
  head()

This keeps only penguins heavier than 4000 grams. Because of head(), only the first six matching rows are displayed; replace head() with nrow() or count() to see how many penguins meet the criterion.
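One detail of numeric conditions is easy to miss: filter() keeps only rows where the condition evaluates to TRUE, so rows with a missing value in the tested column are dropped as well. A minimal sketch on toy data (values invented for illustration) shows this, along with one way to count the matches and one way to keep the NA rows deliberately.

```r
library(dplyr)

# Toy data, invented for this example; row 3 has a missing body mass
toy <- tibble(
  id          = 1:5,
  body_mass_g = c(3750, 4200, NA, 4000, 5100)
)

# NA > 4000 is NA, not TRUE, so row 3 is dropped along with the light rows
heavy <- toy |>
  filter(body_mass_g > 4000)

# Count the matching rows instead of reading the count off a preview
n_heavy <- nrow(heavy)

# Keep the missing values on purpose by adding an explicit is.na() test
heavy_or_na <- toy |>
  filter(body_mass_g > 4000 | is.na(body_mass_g))
```

In this toy example heavy holds rows 2 and 5 (4000 itself is not greater than 4000), and heavy_or_na additionally holds row 3.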

Step 3: Combine multiple conditions

We can use multiple conditions with the & operator to be more specific in our filtering.

penguins |>
  filter(species == "Adelie" & body_mass_g > 4000) |>
  select(species, body_mass_g, island) |>
  head()

This gives us only Adelie penguins that weigh more than 4000 grams, demonstrating how to combine logical conditions.
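filter() also accepts several conditions separated by commas and combines them with AND, so the comma form is an alternative to writing & yourself. The following sketch, using invented toy data, compares the two spellings.

```r
library(dplyr)

# Toy data, invented for this example
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Adelie"),
  body_mass_g = c(3700, 4300, 5000, 4100)
)

# Conditions joined explicitly with &
with_amp <- toy |>
  filter(species == "Adelie" & body_mass_g > 4000)

# The same conditions as separate arguments; filter() ANDs them together
with_comma <- toy |>
  filter(species == "Adelie", body_mass_g > 4000)

# Both approaches select the same rows
same_result <- isTRUE(all.equal(with_amp, with_comma))
```

Which form to use is mostly a matter of taste; the comma form tends to read well when each condition sits on its own line.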

Example 2: Advanced Filtering Techniques

The Problem

Real data analysis often calls for more involved filters. We might want to exclude missing values, match several categories at once, combine alternatives with OR, or keep values that fall within a range. Let’s work through each of these with practical examples.

Step 1: Filter out missing values

First, let’s remove rows with missing values in the sex column.

penguins |>
  filter(!is.na(sex)) |>
  count(sex)

is.na(sex) is TRUE where sex is missing; negating it with ! keeps only the rows where sex is recorded. Other columns may still contain missing values.
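When several columns must be non-missing at once, the same idea extends in two ways: list one !is.na() test per column, or use if_all() to apply the test across a set of columns. The sketch below uses invented toy data and assumes a dplyr version that provides if_all() (1.0.4 or later).

```r
library(dplyr)

# Toy data, invented for this example; rows 2 and 3 each have one NA
toy <- tibble(
  id          = 1:4,
  sex         = c("male", NA, "female", "female"),
  body_mass_g = c(3750, 3800, NA, 4100)
)

# One test per column, combined with AND by the comma
complete_a <- toy |>
  filter(!is.na(sex), !is.na(body_mass_g))

# The same test applied to both columns with if_all()
complete_b <- toy |>
  filter(if_all(c(sex, body_mass_g), ~ !is.na(.x)))
```

Both versions keep rows 1 and 4 only. The if_all() form scales better when the list of columns is long.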

Step 2: Filter with multiple categories

Now we’ll filter for multiple species using the %in% operator.

penguins |>
  filter(species %in% c("Adelie", "Chinstrap")) |>
  count(species, island)

This shows us the distribution of Adelie and Chinstrap penguins across different islands, excluding Gentoo penguins.
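To express the opposite selection, put ! in front of the %in% test. A small sketch on invented toy data follows; note that %in% returns FALSE rather than NA for missing values, so the negated filter keeps rows where the column is NA.

```r
library(dplyr)

# Toy data, invented for this example; the last species is missing
toy <- tibble(
  id      = 1:5,
  species = c("Adelie", "Chinstrap", "Gentoo", "Gentoo", NA)
)

# Keep rows whose species is one of the listed values
kept <- toy |>
  filter(species %in% c("Adelie", "Chinstrap"))

# Keep everything else; ! applies to the whole %in% expression
excluded <- toy |>
  filter(!species %in% c("Adelie", "Chinstrap"))
```

Here kept holds rows 1 and 2, and excluded holds rows 3, 4 and 5, including the row with a missing species.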

Step 3: Filter with OR conditions

Let’s find penguins that are either very light or very heavy using the OR operator |.

penguins |>
  filter(body_mass_g < 3000 | body_mass_g > 5500) |>
  select(species, body_mass_g, sex) |>
  arrange(body_mass_g)

This captures the extremes of body mass in our dataset, useful for identifying outliers or unusual specimens.
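When | is mixed with & in one condition, operator precedence matters: R evaluates & before |, so parentheses are needed to group the OR part. The sketch below uses invented toy data to show how the result changes.

```r
library(dplyr)

# Toy data, invented for this example
toy <- tibble(
  id          = 1:4,
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(2900, 3800, 2950, 6000)
)

# Without parentheses this reads as: light | (heavy & Gentoo)
no_parens <- toy |>
  filter(body_mass_g < 3000 | body_mass_g > 5500 & species == "Gentoo")

# With parentheses: (light | heavy) & Gentoo
with_parens <- toy |>
  filter((body_mass_g < 3000 | body_mass_g > 5500) & species == "Gentoo")
```

no_parens keeps rows 1, 3 and 4, because the light Adelie in row 1 satisfies the first alternative on its own. with_parens keeps only the Gentoo rows 3 and 4.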

Step 4: Filter with ranges

Finally, let’s filter penguins with bill lengths within a specific range.

penguins |>
  filter(bill_length_mm >= 45 & bill_length_mm <= 50) |>
  select(species, bill_length_mm, bill_depth_mm) |>
  arrange(desc(bill_length_mm))

This keeps penguins whose bill length lies between 45 and 50 mm, inclusive at both ends, sorted from longest to shortest bill.
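dplyr also provides between(), a shorthand for an inclusive range test of this kind. The sketch below compares it with the >= and <= form on invented toy data whose values sit on and just outside the boundaries.

```r
library(dplyr)

# Toy data, invented for this example
toy <- tibble(
  id             = 1:5,
  bill_length_mm = c(44.9, 45, 47.5, 50, 50.1)
)

# Range test written with two comparisons
with_ops <- toy |>
  filter(bill_length_mm >= 45 & bill_length_mm <= 50)

# The same inclusive range written with between()
with_between <- toy |>
  filter(between(bill_length_mm, 45, 50))
```

Both versions keep rows 2, 3 and 4: the boundary values 45 and 50 are included, while 44.9 and 50.1 are not.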

Summary

  • Use filter() with comparison operators (==, !=, >, <, >=, <=) for basic row selection
  • Combine conditions with & (AND) or | (OR) to create complex filtering criteria
  • Use %in% to filter for multiple values within the same column efficiently
  • Drop rows with a missing value in a given column using !is.na(column); other columns are not checked
  • Chain filter() with other dplyr functions like select() and arrange() for comprehensive data manipulation