Using dplyr’s anti_join() to find rows with no match in another dataframe

dplyr anti_join()
Complete guide to dplyr anti_join() in R. Learn with practical examples and step-by-step explanations.
Published

July 5, 2024

Introduction

The anti_join() function from dplyr returns the rows of one dataframe that have no matching key values in another dataframe. It’s particularly useful for finding missing records, spotting keys that don’t appear in a reference table, or cleaning data by removing unwanted matches. Think of it as the complement of semi_join(): it keeps only the rows of the first dataframe that an inner join would drop, and it never adds columns from the second dataframe.
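As a minimal sketch of that idea, the toy tables below (hypothetical data, not part of the penguins dataset) show anti_join() keeping exactly the rows that inner_join() drops. The names orders and known_customers are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data for illustration only
orders <- tibble(
  order_id = 1:4,
  customer = c("ana", "ben", "cy", "dee")
)
known_customers <- tibble(customer = c("ana", "cy"))

# Rows of `orders` that have a match in `known_customers`
matched <- inner_join(orders, known_customers, by = "customer")

# Rows of `orders` that have no match in `known_customers`
unmatched <- anti_join(orders, known_customers, by = "customer")

matched$customer
unmatched$customer
```

Because each customer appears at most once in known_customers here, the two results together account for every row of orders.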

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We want to find which penguins in our dataset are NOT from the islands on our study list. This helps us identify records that might need special handling or exclusion.

Step 1: Create our main dataset

We’ll start with the penguins data and select relevant columns for our analysis.

penguins_data <- penguins |>
  select(species, island, bill_length_mm) |>
  filter(!is.na(bill_length_mm))

head(penguins_data)

This gives us a dataset of 342 penguin observations across three islands, after dropping the 2 rows with a missing bill length.

Step 2: Define islands of interest

We’ll create a reference dataframe containing only the islands we want to focus on.

focus_islands <- tibble(
  island = c("Biscoe", "Dream")
)

focus_islands

Now we have a simple reference table with just two islands we’re studying.

Step 3: Find penguins NOT on focus islands

We’ll use anti_join() to identify penguins from islands not in our focus list.

excluded_penguins <- penguins_data |>
  anti_join(focus_islands, by = "island")

nrow(excluded_penguins)
unique(excluded_penguins$island)

The anti_join() call returns 51 penguins, all from Torgersen, the only island that wasn’t in our focus list.
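One way to sanity-check a result like this is to compare it with the equivalent filter() on a vector of values. The sketch below rebuilds the objects from the steps above so it can run on its own; via_join and via_filter are names introduced here for illustration.

```r
library(dplyr)
library(palmerpenguins)

penguins_data <- penguins |>
  select(species, island, bill_length_mm) |>
  filter(!is.na(bill_length_mm))

focus_islands <- tibble(island = c("Biscoe", "Dream"))

# Filtering join against the reference table
via_join <- penguins_data |>
  anti_join(focus_islands, by = "island")

# Same selection expressed as an ordinary filter on a vector of values
via_filter <- penguins_data |>
  filter(!island %in% focus_islands$island)

# Both approaches should keep the same number of rows
nrow(via_join) == nrow(via_filter)

# Per-island counts of what anti_join() kept
count(via_join, island)
```

For a single key column the two forms are interchangeable. anti_join() tends to pay off when the reference list already lives in a dataframe or when matching involves several columns at once.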

Example 2: Practical Application

The Problem

Imagine we’re conducting a follow-up study and have a list of specific penguins we’ve already analyzed. We need to identify which penguins from our complete dataset haven’t been studied yet, so we can prioritize them for new research.

Step 1: Create a studied penguins dataset

We’ll simulate a dataset of penguins that have already been analyzed in previous research.

studied_penguins <- penguins |>
  select(species, island, bill_length_mm, body_mass_g) |>
  filter(
    species == "Adelie",
    island == "Torgersen",
    !is.na(bill_length_mm)
  ) |>
  slice_head(n = 20)

This represents 20 Adelie penguins from Torgersen that we’ve already studied.

Step 2: Prepare the complete dataset

We’ll create our full dataset of potential study subjects.

all_penguins <- penguins |>
  select(species, island, bill_length_mm, body_mass_g) |>
  filter(!is.na(bill_length_mm), !is.na(body_mass_g))

nrow(all_penguins)

Our complete dataset has 342 penguins with complete measurements.

Step 3: Find unstudied penguins

We’ll use anti_join() to identify penguins that haven’t been included in previous research.

unstudied_penguins <- all_penguins |>
  anti_join(studied_penguins, 
            by = c("species", "island", "bill_length_mm", "body_mass_g"))

nrow(unstudied_penguins)

The anti_join identifies 322 penguins that don’t match our previously studied group.
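One caveat: anti_join() matches on values, not on the identity of a row. If two different penguins shared the same values in all four key columns, studying one would remove both from the result. A hedged sketch of the issue, using hypothetical toy data and an assumed surrogate key column bird_id that does not exist in palmerpenguins:

```r
library(dplyr)

# Hypothetical measurements; rows 1 and 3 are different animals
# that happen to share identical values
birds <- tibble(
  species = c("Adelie", "Adelie", "Adelie"),
  bill_length_mm = c(39.1, 40.3, 39.1),
  body_mass_g = c(3750, 3250, 3750)
) |>
  mutate(bird_id = row_number())  # bird_id is an assumed surrogate key

# Pretend only the first bird has been studied
studied <- birds |> slice_head(n = 1)

# Matching on values drops both look-alike rows
by_values <- birds |>
  anti_join(studied, by = c("species", "bill_length_mm", "body_mass_g"))

# Matching on the surrogate key drops only the studied row
by_id <- birds |>
  anti_join(studied, by = "bird_id")

nrow(by_values)
nrow(by_id)
```

When the data has a real identifier, or one can be added with row_number() before subsetting, joining on it avoids this kind of accidental over-removal.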

Step 4: Analyze the remaining candidates

Let’s examine what types of penguins still need to be studied.

unstudied_penguins |>
  count(species, island, sort = TRUE)

This shows us the distribution of unstudied penguins across species and islands, helping prioritize future research efforts.

Summary

  • anti_join() returns rows from the first dataframe that have no matching values in the second dataframe
  • It’s perfect for identifying missing records, exclusions, or data that doesn’t meet certain criteria
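  • The by argument specifies which columns to match on; if it is omitted, dplyr matches on all columns the two dataframes share and prints a message saying which ones it used
  • anti_join() is a filtering join: it compares only the key columns, and the result contains only columns from the first dataframe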
  • It’s commonly used in data cleaning workflows to remove unwanted records or identify gaps in data collection
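As a closing sketch, here is how these points look when the key columns have different names in the two dataframes. The tables samples and processed are hypothetical; the named-vector form of by is standard dplyr syntax.

```r
library(dplyr)

# Hypothetical tables whose key columns have different names
samples <- tibble(
  sample_id = c("s1", "s2", "s3"),
  weight = c(10, 12, 9)
)
processed <- tibble(id = "s2", lab = "north")

# Named vector: the name comes from `samples`, the value from `processed`
pending <- samples |>
  anti_join(processed, by = c("sample_id" = "id"))

pending$sample_id  # ids with no match in `processed`
names(pending)     # only columns from `samples`; `lab` is never added
```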