How to use filter() in R
Introduction
The filter() function from the dplyr package subsets the rows of a data frame, keeping only those for which a logical condition is TRUE. It is a core tool for data analysis: use it whenever you need to narrow a dataset down to the observations that meet your criteria.
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The Problem
We want to explore the penguins dataset by selecting only specific rows that meet certain criteria. Let’s start with simple filtering operations to understand how filter() works.
Step 1: Filter by a single condition
First, let’s filter penguins to show only those from the Adelie species.
# Filter for Adelie penguins only
adelie_penguins <- penguins |>
  filter(species == "Adelie")
head(adelie_penguins)
This creates a new data frame containing only the 152 Adelie penguins from the original 344 observations.
Step 2: Filter with numeric conditions
Now let’s filter penguins with body mass greater than 4000 grams.
# Filter for heavy penguins
heavy_penguins <- penguins |>
  filter(body_mass_g > 4000)
nrow(heavy_penguins)
This returns the number of penguins weighing more than 4000 grams, which is about half of the dataset because the median body mass is 4050 grams, and shows how numeric comparisons work in filter().
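Two details of numeric comparisons are easy to miss: > is strict, and filter() keeps only rows where the condition is TRUE, so rows with a missing value are dropped. The sketch below illustrates both on a small made-up data frame (toy_penguins and its values are hypothetical stand-ins, not rows from palmerpenguins).

```r
library(dplyr)

# Hypothetical stand-in data, not taken from palmerpenguins
toy_penguins <- data.frame(
  id = 1:5,
  body_mass_g = c(3500, 4200, NA, 4000, 5100)
)

# Strictly greater than 4000: the 4000 g row and the NA row are both excluded
heavy <- toy_penguins |>
  filter(body_mass_g > 4000)

# Greater than or equal to 4000: the 4000 g row is kept, the NA row is still dropped
at_least_4000 <- toy_penguins |>
  filter(body_mass_g >= 4000)

heavy$id          # ids 2 and 5
at_least_4000$id  # ids 2, 4 and 5
```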
Step 3: Filter with multiple conditions
Let’s combine conditions using the AND operator to find large male penguins.
# Filter for large male penguins
large_males <- penguins |>
  filter(sex == "male" & body_mass_g > 4500)
head(large_males)
This demonstrates how to use multiple conditions simultaneously, returning only male penguins weighing over 4500 grams.
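filter() also accepts several conditions separated by commas and combines them with AND, so the & form is one of two equivalent spellings. A minimal sketch on made-up data (toy_penguins is a hypothetical stand-in) suggests how the two compare:

```r
library(dplyr)

# Hypothetical stand-in data, not taken from palmerpenguins
toy_penguins <- data.frame(
  id = 1:4,
  sex = c("male", "female", "male", NA),
  body_mass_g = c(4800, 5000, 4300, 4700)
)

# Conditions joined explicitly with &
with_and <- toy_penguins |>
  filter(sex == "male" & body_mass_g > 4500)

# The same conditions passed as separate arguments
with_comma <- toy_penguins |>
  filter(sex == "male", body_mass_g > 4500)

# Both spellings keep the same rows; the row with a missing sex is dropped
with_and$id
with_comma$id
```

Which form to use is mostly a matter of taste; the comma form tends to read more cleanly when there are many conditions.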
Example 2: Practical Application
The Problem
Imagine you’re a researcher studying penguin populations across different islands and years. You need to analyze specific subsets of data to understand patterns in bill dimensions and body mass. This requires more complex filtering operations that combine multiple criteria.
Step 1: Filter by multiple categories
Let’s find penguins from specific islands and species combinations.
# Filter for Gentoo penguins from Biscoe island
gentoo_biscoe <- penguins |>
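  filter(species == "Gentoo" & island == "Biscoe")
summary(gentoo_biscoe$bill_length_mm)
This gives us the 124 Gentoo penguins recorded on Biscoe island, allowing focused analysis of this population. In this dataset every Gentoo penguin was observed on Biscoe, so the island condition removes no additional rows here, but the same pattern matters whenever a species occurs on several islands.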
Step 2: Filter using the OR operator
Now let’s find penguins that are either very light or very heavy.
# Filter for extreme weights
extreme_weights <- penguins |>
  filter(body_mass_g < 3000 | body_mass_g > 5500) |>
  select(species, body_mass_g, sex)
extreme_weights
This identifies penguins at the extremes of the weight distribution, useful for studying outliers or exceptional cases.
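When OR is mixed with AND in one condition, R evaluates & before |, so parentheses decide what the filter means. The following sketch uses made-up data (toy_penguins is a hypothetical stand-in) to show how the result changes:

```r
library(dplyr)

# Hypothetical stand-in data, not taken from palmerpenguins
toy_penguins <- data.frame(
  id = 1:4,
  sex = c("male", "female", "male", "female"),
  body_mass_g = c(2900, 2800, 5800, 6000)
)

# Without parentheses this reads as: light, OR (heavy AND male)
no_parens <- toy_penguins |>
  filter(body_mass_g < 3000 | body_mass_g > 5500 & sex == "male")

# With parentheses: (light OR heavy) AND male
with_parens <- toy_penguins |>
  filter((body_mass_g < 3000 | body_mass_g > 5500) & sex == "male")

no_parens$id    # ids 1, 2 and 3: the light female is included
with_parens$id  # ids 1 and 3: males only
```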
Step 3: Filter with the %in% operator
Let’s filter for penguins from multiple islands at once.
# Filter for penguins from Dream or Torgersen islands
dream_torgersen <- penguins |>
  filter(island %in% c("Dream", "Torgersen"))
table(dream_torgersen$island, dream_torgersen$species)
This creates a contingency table showing species distribution across the two selected islands, demonstrating efficient multi-value filtering.
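The same operator can be negated with ! to keep everything except the listed values. As a rough illustration on made-up data (toy_penguins and wanted are hypothetical names introduced here):

```r
library(dplyr)

# Hypothetical stand-in data, not taken from palmerpenguins
toy_penguins <- data.frame(
  id = 1:5,
  island = c("Biscoe", "Dream", "Torgersen", "Biscoe", "Dream")
)

wanted <- c("Dream", "Torgersen")

# Rows whose island is one of the wanted values
on_wanted <- toy_penguins |>
  filter(island %in% wanted)

# Rows whose island is none of the wanted values
not_on_wanted <- toy_penguins |>
  filter(!island %in% wanted)

on_wanted$id      # ids 2, 3 and 5
not_on_wanted$id  # ids 1 and 4
```

Writing island == wanted instead would compare element by element and recycle the shorter vector, which is rarely what is intended, so %in% is the safer choice for matching against a set of values.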
Step 4: Filter and remove missing values
Finally, let’s create a clean dataset for analysis by filtering out missing values.
# Filter complete cases for bill measurements
complete_bills <- penguins |>
  filter(!is.na(bill_length_mm) & !is.na(bill_depth_mm))
nrow(complete_bills)
This removes rows with missing bill measurements, giving us 342 complete observations ready for analysis.
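For this particular task, tidyr (loaded as part of the tidyverse) offers drop_na() as a shorthand. The sketch below, on made-up data (toy_penguins is a hypothetical stand-in), compares it with the filter() approach; it is an alternative spelling, not a replacement for filter() in general.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in data, not taken from palmerpenguins
toy_penguins <- data.frame(
  id = 1:4,
  bill_length_mm = c(39.1, NA, 46.5, 50.0),
  bill_depth_mm = c(18.7, 17.4, NA, 15.2)
)

# Keep rows where both bill measurements are present
with_filter <- toy_penguins |>
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm))

# drop_na() removes rows with NA in any of the named columns
with_drop_na <- toy_penguins |>
  drop_na(bill_length_mm, bill_depth_mm)

with_filter$id   # ids 1 and 4
with_drop_na$id  # ids 1 and 4
```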
Summary
- Use filter() to subset rows based on logical conditions with comparison operators (==, >, <, >=, <=)
- Combine multiple conditions with AND (&) to make filtering more restrictive
- Use OR (|) to include rows meeting any of several conditions
- The %in% operator efficiently filters for multiple values in a single column
- Always handle missing values appropriately using is.na() or !is.na() in your filter conditions