How to use distinct() in R

Learn how to use distinct() in R with practical examples. Step-by-step guide with code you can copy and run immediately.

Published

February 21, 2026

Introduction

The distinct() function from the dplyr package removes duplicate rows from a data frame, keeping the first occurrence of each unique row. It’s a staple of data cleaning and exploratory data analysis whenever you need to identify unique observations or unique combinations of variables.

You’ll commonly use distinct() when working with messy datasets that contain duplicate records, when you want to find all unique values in specific columns, or when preparing data for analysis by ensuring each observation appears only once. The function is particularly useful in data preprocessing workflows where duplicate entries can skew results or create misleading insights.

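Base R’s unique() also drops duplicate rows from a data frame, but it always compares every column. distinct() lets you name the columns that define uniqueness, slots directly into piped tidyverse workflows, and is described in the dplyr documentation as considerably faster than unique() on data frames.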
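To make the comparison concrete, here is a minimal sketch on a toy data frame. The `orders` tibble and its columns are made up for illustration and are not part of the tutorial data.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 3 are exact copies
orders <- tibble(
  customer = c("ana", "ben", "ana", "cy"),
  item     = c("tea", "tea", "tea", "mug")
)

# Base R: unique() compares every column
base_result <- unique(orders)

# dplyr: with no arguments, distinct() does the same
dplyr_result <- distinct(orders)

# distinct() can also restrict the comparison to chosen columns
items_only <- distinct(orders, item)

base_result
dplyr_result
items_only
```

The first two calls return the same three rows, while the last one returns only the unique values of `item`.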

Getting Started

First, let’s load the required packages for this tutorial:

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

Let’s start with basic applications of distinct() using the penguins dataset. The simplest use case removes all duplicate rows from the entire dataset:

data(penguins)

# Remove completely duplicate rows
penguins_unique <- penguins |> 
  distinct()

# Compare row counts; they are equal if the data has no exact duplicate rows
nrow(penguins)
nrow(penguins_unique)
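If your data has no exact duplicates, the two row counts will match and the effect is hard to see. The sketch below uses a small made-up tibble (`readings` is a hypothetical name) that is known to contain one repeated row, so the difference is visible.

```r
library(dplyr)

# Hypothetical measurements; rows 1 and 3 are exact copies
readings <- tibble(
  site  = c("A", "B", "A", "A"),
  value = c(1.5, 2.0, 1.5, 1.7)
)

readings_unique <- readings |>
  distinct()

# Number of rows dropped as exact duplicates
rows_dropped <- nrow(readings) - nrow(readings_unique)
rows_dropped
```

Only the row that matches another row in every column is removed; the fourth row shares a site with the first but has a different value, so it stays.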

You can also find distinct values for specific columns. This is useful when you want to see all unique combinations of certain variables:

# Find unique species
penguins |> 
  distinct(species)

# Find unique combinations of species and island
penguins |> 
  distinct(species, island)
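When you only need to know how many unique values or combinations exist, dplyr’s n_distinct() and count() are handy companions to distinct(). This is a small sketch on made-up data; the `visits` tibble is an assumption for illustration.

```r
library(dplyr)

# Hypothetical toy data
visits <- tibble(
  species = c("x", "x", "y", "y", "y"),
  island  = c("i1", "i2", "i1", "i1", "i1")
)

# How many unique species are there?
species_total <- n_distinct(visits$species)

# How many rows fall into each species-island combination?
combo_counts <- visits |>
  count(species, island)

species_total
combo_counts
```

count() returns the same combinations as distinct(species, island), plus a column `n` with the number of rows behind each one.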

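By default, distinct() only returns the columns you specify. To keep all other columns, use the .keep_all = TRUE argument. For each unique combination, distinct() then keeps the first matching row, so the values in the other columns depend on row order: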

# Keep all columns while finding unique species-island combinations
penguins |> 
  distinct(species, island, .keep_all = TRUE)
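Because the first row of each combination is the one that survives, you can control which row is kept by sorting before calling distinct(). The following sketch assumes a made-up `weights` tibble and keeps the heaviest record per species.

```r
library(dplyr)

# Hypothetical toy data
weights <- tibble(
  species = c("x", "x", "y", "y"),
  mass_g  = c(3000, 4200, 5100, 4800)
)

# First row per species in the original order
first_seen <- weights |>
  distinct(species, .keep_all = TRUE)

# Sort first so the heaviest row per species is the one that is kept
heaviest <- weights |>
  arrange(desc(mass_g)) |>
  distinct(species, .keep_all = TRUE)

first_seen
heaviest
```

Without the arrange() step, species "x" keeps its 3000 g row simply because that row came first.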

Example 2: Practical Application

Let’s explore a more complex scenario where distinct() does real work in an analysis. Imagine you’re studying penguin populations and want one representative record for each combination of species, island, sex, and body-size category:

# Keep one row per species, island, sex, and size category
penguin_environments <- penguins |> 
  filter(!is.na(bill_length_mm), !is.na(body_mass_g)) |> 
  mutate(
    size_category = case_when(
      body_mass_g < 3500 ~ "Small",
      body_mass_g >= 3500 & body_mass_g < 4500 ~ "Medium",
      body_mass_g >= 4500 ~ "Large"
    )
  ) |> 
  distinct(species, island, sex, size_category, .keep_all = TRUE) |> 
  arrange(species, island, sex)

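The penguin_environments pipeline shows distinct() in a realistic workflow: it reduces the data to one row per combination of species, island, sex, and size category. Because .keep_all = TRUE is set, the remaining columns come from the first penguin in each combination. You can also apply distinct() to variables you compute in the pipeline: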

# Find unique bill length categories by species
penguins |> 
  filter(!is.na(bill_length_mm)) |> 
  mutate(
    bill_category = cut(bill_length_mm, 
                       breaks = 3, 
                       labels = c("Short", "Medium", "Long"))
  ) |> 
  distinct(species, bill_category) |> 
  arrange(species, bill_category)
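distinct() also accepts expressions directly, so a simple computed variable does not need its own mutate() step. This is a minimal sketch on made-up data; the `lengths` tibble and the 45 mm cutoff are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data
lengths <- tibble(
  species = c("x", "x", "y", "y"),
  bill_mm = c(38, 39, 47, 52)
)

# Compute the flag inside distinct() instead of a separate mutate()
bill_flags <- lengths |>
  distinct(species, long_bill = bill_mm >= 45)

bill_flags
```

The result contains only `species` and the new `long_bill` column, with one row per unique pair.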

For quality control purposes, you might want to identify and examine potential duplicates before removing them:

# Identify rows that would be considered duplicates
potential_duplicates <- penguins |> 
  group_by(species, island, bill_length_mm, bill_depth_mm, 
           flipper_length_mm, body_mass_g, sex, year) |> 
  filter(n() > 1) |> 
  ungroup()

# Then remove the duplicates; these are all eight columns of penguins,
# so this gives the same result as distinct() with no arguments
cleaned_penguins <- penguins |> 
  distinct(species, island, bill_length_mm, bill_depth_mm, 
           flipper_length_mm, body_mass_g, sex, year, 
           .keep_all = TRUE)
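If you want a simple record of what the cleaning step did, you can compare row counts before and after. The sketch below uses a made-up `records` tibble and groups by every column with across(everything()), which is one way to avoid typing all the column names.

```r
library(dplyr)

# Hypothetical toy data with two sets of exact duplicates
records <- tibble(
  id    = c(1, 2, 2, 3, 3, 3),
  score = c(10, 20, 20, 30, 30, 31)
)

# Rows that have at least one exact copy (all columns compared)
dupes <- records |>
  group_by(across(everything())) |>
  filter(n() > 1) |>
  ungroup()

cleaned <- records |>
  distinct()

# Simple audit trail: rows before, rows after, rows dropped
audit <- c(
  before  = nrow(records),
  after   = nrow(cleaned),
  dropped = nrow(records) - nrow(cleaned)
)
audit
```

Inspecting `dupes` before trusting `cleaned` lets you confirm that the repeated rows really are data-entry duplicates rather than legitimate repeat observations.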

Summary

The distinct() function is an indispensable tool for data cleaning and exploration in R. Key takeaways include:

Use distinct() without arguments to remove completely duplicate rows
Specify column names to find unique combinations of those variables
Add .keep_all = TRUE to retain all columns; the first row of each combination is the one kept
Combine distinct() with other dplyr verbs for powerful data processing pipelines
Always consider whether you want to examine duplicates before removing them

Master these techniques to ensure your datasets are clean and your analyses are based on truly unique observations.
