How to filter rows in a dataframe: dplyr’s filter()


Learn how to filter rows in a dataframe with dplyr’s filter() in this R tutorial, with practical examples and code snippets.

Published

July 9, 2022

Introduction

The filter() function from dplyr is used to extract rows from a dataframe that meet specific conditions. It’s essential for data analysis when you need to work with subsets of your data based on logical criteria. This function is particularly useful for exploratory data analysis and preparing data for visualization or modeling.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Filtering

The Problem

We need to extract specific rows from the penguins dataset based on simple conditions. Let’s start by filtering penguins from a single species and then explore different logical operators.

Step 1: Filter by single condition

We’ll filter the dataset to show only Adelie penguins.

penguins |>
  filter(species == "Adelie") |>
  head()

This returns only rows where the species column equals “Adelie”, giving us a subset of the original dataset.

Step 2: Filter with numeric conditions

Now let’s filter penguins based on their body mass, keeping only those heavier than 4000 grams.

penguins |>
  filter(body_mass_g > 4000) |>
  select(species, body_mass_g, sex) |>
  head()

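The filter keeps only penguins heavier than 4000 grams, and head() shows the first six of them. Rows with a missing body_mass_g are dropped as well, because the comparison evaluates to NA rather than TRUE.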
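
To see how many rows a numeric filter keeps, rather than previewing a few of them, the filtered result can be stored and counted. The following is a minimal sketch; the name heavy is only illustrative and does not appear in the original tutorial.

```r
library(dplyr)            # also attached by library(tidyverse)
library(palmerpenguins)

# Keep the rows that pass the numeric condition (illustrative name: heavy)
heavy <- penguins |>
  filter(body_mass_g > 4000)

# Number of rows kept, instead of a six-row preview from head()
nrow(heavy)

# Counts missing masses among the kept rows; this is 0 because
# NA > 4000 evaluates to NA and filter() drops such rows
sum(is.na(heavy$body_mass_g))
```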

Step 3: Combine multiple conditions

We can use multiple conditions with the & operator to be more specific in our filtering.

penguins |>
  filter(species == "Adelie" & body_mass_g > 4000) |>
  select(species, body_mass_g, island) |>
  head()

This gives us only Adelie penguins that weigh more than 4000 grams, demonstrating how to combine logical conditions.
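
filter() also accepts several conditions separated by commas and combines them with AND, so the comma form can stand in for &. The sketch below compares the two spellings; the names with_ampersand and with_comma are illustrative only.

```r
library(dplyr)            # also attached by library(tidyverse)
library(palmerpenguins)

# Conditions joined explicitly with &
with_ampersand <- penguins |>
  filter(species == "Adelie" & body_mass_g > 4000)

# The same conditions passed as separate arguments
with_comma <- penguins |>
  filter(species == "Adelie", body_mass_g > 4000)

# Check that both spellings select the same rows
all.equal(with_ampersand, with_comma)
```

Which form to use is mostly a matter of taste; the comma form tends to read more cleanly when there are many AND conditions.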

Example 2: Advanced Filtering Techniques

The Problem

In real data analysis, we often need more complex filters. We might want to exclude missing values, match any of several categories, accept either of two conditions, or keep values within a range. Let’s explore these techniques with practical examples.

Step 1: Filter out missing values

First, let’s remove rows with missing values in the sex column.

penguins |>
  filter(!is.na(sex)) |>
  count(sex)

The expression !is.na(sex) is TRUE only where sex is recorded, so filter() drops the rows with a missing sex, and count() then tallies only the female and male penguins.
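
As an alternative spelling, tidyr (attached by library(tidyverse)) provides drop_na(), which removes rows with missing values in the named columns. The sketch below assumes tidyr is installed and checks that both approaches keep the same number of rows; the names via_filter and via_drop_na are illustrative.

```r
library(dplyr)            # also attached by library(tidyverse)
library(tidyr)            # provides drop_na()
library(palmerpenguins)

# Approach from the tutorial: negate is.na() inside filter()
via_filter <- penguins |>
  filter(!is.na(sex))

# Alternative: drop rows where the sex column is missing
via_drop_na <- penguins |>
  drop_na(sex)

# Compare how many rows each approach keeps
nrow(via_filter) == nrow(via_drop_na)
```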

Step 2: Filter with multiple categories

Now we’ll filter for multiple species using the %in% operator.

penguins |>
  filter(species %in% c("Adelie", "Chinstrap")) |>
  count(species, island)

This shows us the distribution of Adelie and Chinstrap penguins across different islands, excluding Gentoo penguins.
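
The same operator can be negated to exclude a set of values instead of keeping them. This is a small sketch of that idea; the name not_listed is illustrative. Because %in% binds more tightly than !, the expression reads as "not (species in the list)".

```r
library(dplyr)            # also attached by library(tidyverse)
library(palmerpenguins)

# Keep every species except the two listed
not_listed <- penguins |>
  filter(!species %in% c("Adelie", "Chinstrap"))

# Tally the species that remain
not_listed |>
  count(species)
```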

Step 3: Filter with OR conditions

Let’s find penguins that are either very light or very heavy using the OR operator |.

penguins |>
  filter(body_mass_g < 3000 | body_mass_g > 5500) |>
  select(species, body_mass_g, sex) |>
  arrange(body_mass_g)

This captures the extremes of body mass in our dataset, useful for identifying outliers or unusual specimens.
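
When | is mixed with &, grouping matters: in R, & is evaluated before |, so parentheses are needed to apply an AND condition to the whole OR expression. The sketch below illustrates this with a hypothetical extra condition on species; the names grouped and ungrouped are illustrative.

```r
library(dplyr)            # also attached by library(tidyverse)
library(palmerpenguins)

# Parentheses: Adelie penguins that are very light or very heavy
grouped <- penguins |>
  filter(species == "Adelie" & (body_mass_g < 3000 | body_mass_g > 5500))

# No parentheses: parsed as (Adelie & light) | heavy,
# so very heavy penguins of any species can pass
ungrouped <- penguins |>
  filter(species == "Adelie" & body_mass_g < 3000 | body_mass_g > 5500)

# Compare which species appear in each result
count(grouped, species)
count(ungrouped, species)
```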

Step 4: Filter with ranges

Finally, let’s filter penguins with bill lengths within a specific range.

penguins |>
  filter(bill_length_mm >= 45 & bill_length_mm <= 50) |>
  select(species, bill_length_mm, bill_depth_mm) |>
  arrange(desc(bill_length_mm))

This keeps penguins whose bill length is between 45 and 50 mm, inclusive at both ends, sorted from the longest bill to the shortest.
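
dplyr also offers between(), a shortcut for an inclusive range test. The sketch below checks that it selects the same rows as the pair of comparisons; the names with_operators and with_between are illustrative.

```r
library(dplyr)            # also attached by library(tidyverse)
library(palmerpenguins)

# Range written as two comparisons joined with &
with_operators <- penguins |>
  filter(bill_length_mm >= 45 & bill_length_mm <= 50)

# Same range with between(), which includes both endpoints
with_between <- penguins |>
  filter(between(bill_length_mm, 45, 50))

# Check that both spellings select the same rows
all.equal(with_operators, with_between)
```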

Summary

Use filter() with comparison operators (==, >, <, >=, <=) for basic row selection
Combine conditions with & (AND) or | (OR) to create complex filtering criteria
Use %in% to filter for multiple values within the same column efficiently
Remove missing values with !is.na() to ensure clean data for analysis
Chain filter() with other dplyr functions like select() and arrange() for comprehensive data manipulation

