How to use distinct() in R
Introduction
The distinct() function from the dplyr package is a powerful tool for removing duplicate rows from data frames. It’s essential for data cleaning and exploratory data analysis when you need to identify unique observations or combinations of variables.
You’ll commonly use distinct() when working with messy datasets that contain duplicate records, when you want to find all unique values in specific columns, or when preparing data for analysis by ensuring each observation appears only once. The function is particularly useful in data preprocessing workflows where duplicate entries can skew results or create misleading insights.
Unlike base R’s unique() function, which compares entire rows, distinct() fits directly into piped tidyverse workflows and can deduplicate on a chosen subset of columns, optionally keeping the remaining columns via .keep_all.
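To make the comparison concrete, here is a minimal sketch on a small made-up table. The sightings data frame and its values are hypothetical and exist only for illustration; they are not part of palmerpenguins.

```r
library(dplyr)

# Hypothetical toy data: the first two rows are exact repeats
sightings <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  island  = c("Dream", "Dream", "Biscoe", "Biscoe"),
  year    = c(2007, 2007, 2008, 2009)
)

# Base R: drop rows that are duplicated across all columns
base_result <- unique(sightings)

# dplyr: with no arguments, distinct() also compares all columns
dplyr_result <- sightings |>
  distinct()

# dplyr: deduplicate on selected columns only
by_location <- sightings |>
  distinct(species, island)
```

On this toy table the first two calls both drop the repeated row, while the last call returns only the species and island columns, one row per combination.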
Getting Started
First, let’s load the required packages for this tutorial:
library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage
Let’s start with basic applications of distinct() using the penguins dataset. The simplest use case removes all duplicate rows from the entire dataset:
data(penguins)
# Remove completely duplicate rows
penguins_unique <- penguins |>
  distinct()

# Check the difference in row counts
nrow(penguins)
nrow(penguins_unique)

You can also find distinct values for specific columns. This is useful when you want to see all unique combinations of certain variables:
# Find unique species
penguins |>
  distinct(species)

# Find unique combinations of species and island
penguins |>
  distinct(species, island)

By default, distinct() returns only the columns you specify. To keep all other columns, add the .keep_all = TRUE argument; dplyr then keeps the first row it encounters for each combination:
# Keep all columns while finding unique species-island combinations
penguins |>
  distinct(species, island, .keep_all = TRUE)

Example 2: Practical Application
Let’s explore a more complex scenario where distinct() becomes crucial for data analysis. Imagine you’re studying penguin populations and need to know which combinations of species, island, sex, and body-size category actually occur in the data:
# Keep one representative row per species, island, sex, and size category
penguin_environments <- penguins |>
  filter(!is.na(bill_length_mm), !is.na(body_mass_g)) |>
  mutate(
    size_category = case_when(
      body_mass_g < 3500 ~ "Small",
      body_mass_g >= 3500 & body_mass_g < 4500 ~ "Medium",
      body_mass_g >= 4500 ~ "Large"
    )
  ) |>
  distinct(species, island, sex, size_category, .keep_all = TRUE) |>
  arrange(species, island, sex)

This workflow demonstrates distinct() in a realistic data pipeline where you’re identifying unique combinations of biological and environmental factors. You can also use distinct() with computed variables:
# Find unique bill length categories by species
penguins |>
  filter(!is.na(bill_length_mm)) |>
  mutate(
    bill_category = cut(bill_length_mm,
                        breaks = 3,
                        labels = c("Short", "Medium", "Long"))
  ) |>
  distinct(species, bill_category) |>
  arrange(species, bill_category)

For quality control purposes, you might want to identify and examine potential duplicates before removing them:
# Identify rows that would be considered duplicates
potential_duplicates <- penguins |>
  group_by(species, island, bill_length_mm, bill_depth_mm,
           flipper_length_mm, body_mass_g, sex, year) |>
  filter(n() > 1) |>
  ungroup()

# Then remove the duplicates, keeping the first occurrence of each row
cleaned_penguins <- penguins |>
  distinct(species, island, bill_length_mm, bill_depth_mm,
           flipper_length_mm, body_mass_g, sex, year,
           .keep_all = TRUE)

Summary
The distinct() function is an indispensable tool for data cleaning and exploration in R. Key takeaways include:
- Use distinct() without arguments to remove completely duplicate rows
- Specify column names to find unique combinations of those variables
- Add .keep_all = TRUE to retain all columns in your output
- Combine distinct() with other dplyr verbs for powerful data processing pipelines
- Always consider whether you want to examine duplicates before removing them
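
As a rough sketch of the last takeaway, the snippet below reviews duplicates on a key column before dropping them. The records table, its id column, and the values are made up for illustration, and using count() with semi_join() is just one possible way to surface the repeated rows.

```r
library(dplyr)

# Hypothetical measurements: id "A1" was recorded twice with different masses
records <- tibble(
  id          = c("A1", "A1", "B2", "C3"),
  body_mass_g = c(3700, 3750, 4200, 5000)
)

# Step 1: list the ids that occur more than once
dup_keys <- records |>
  count(id, name = "n_rows") |>
  filter(n_rows > 1)

# Step 2: pull the full rows for those ids so they can be reviewed
dup_rows <- records |>
  semi_join(dup_keys, by = "id")

# Step 3: deduplicate on id; .keep_all = TRUE keeps the first row per id
deduped <- records |>
  distinct(id, .keep_all = TRUE)
```

Because distinct() keeps the first row for each id, the 3700 g record survives here and the 3750 g record is dropped. If a different row should win, sort with arrange() before calling distinct().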