dplyr n_distinct(): count unique elements or rows

Published: September 14, 2024

Introduction

The n_distinct() function in dplyr counts the number of unique (distinct) values in a vector or across multiple columns. It’s particularly useful when you need to quickly determine how many different categories, groups, or combinations exist in your data without actually listing them out.
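Before turning to a data frame, the behaviour on a plain vector can be sketched as follows. The vector `x` is a hypothetical toy example, not part of the penguins data.

```r
library(dplyr)

# Hypothetical toy vector, used only for illustration
x <- c("a", "b", "a", "c", NA)

n_distinct(x)                # 4: "a", "b", "c" and NA
n_distinct(x, na.rm = TRUE)  # 3: NA is dropped before counting
length(unique(x))            # 4: base R equivalent of the first call
```

By default a missing value counts as one distinct value, which is why the first and third calls agree.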

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We want to count how many unique species of penguins exist in our dataset. This is a common first step in exploratory data analysis to understand the diversity of categorical variables.

Step 1: Count unique values in a single column

First, let’s see how many distinct penguin species we have.

penguins |> 
  summarise(unique_species = n_distinct(species))

This returns a one-row tibble whose unique_species column is 3, the number of distinct penguin species in the dataset.
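If you also want to see which values sit behind the count, or need the count as a bare integer, something like the following should work. The object names `species_values` and `species_count` are illustrative only.

```r
library(dplyr)
library(palmerpenguins)

# The distinct values themselves, one row per species
species_values <- penguins |>
  distinct(species)

species_values

# The same count as a plain integer rather than a one-row tibble
species_count <- n_distinct(penguins$species)

species_count
```

The number of rows returned by distinct() matches the value returned by n_distinct().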

Step 2: Count unique values with grouping

Now let’s count unique species within each island to see the distribution.

penguins |> 
  group_by(island) |> 
  summarise(
    unique_species = n_distinct(species),
    total_penguins = n()
  )

This shows how many different species live on each island and the total penguin count per island.
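To check which species produce each island's count, one option is to list the island and species pairs that actually occur. This is a small sketch; `island_species` is an illustrative name.

```r
library(dplyr)
library(palmerpenguins)

# One row per island-species pair present in the data
island_species <- penguins |>
  distinct(island, species) |>
  arrange(island, species)

island_species
```

The number of rows per island in this table equals the unique_species value from the grouped summary.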

Step 3: Count unique combinations

We can count unique combinations across multiple columns simultaneously.

penguins |> 
  summarise(
    unique_combinations = n_distinct(species, island, sex, na.rm = TRUE)
  )

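This counts how many unique combinations of species, island, and sex occur in the data. With several columns, na.rm = TRUE drops any row in which at least one of them is missing, so penguins with an unrecorded sex are excluded from the count.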
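The effect of na.rm on combinations is easiest to see on a tiny table. The tibble `toy` below is hypothetical, and the behaviour shown is the one documented for recent dplyr releases (1.1.0 and later); older versions may differ.

```r
library(dplyr)

# Hypothetical data: the last row has a missing sex
toy <- tibble(
  species = c("A", "A", "B", "B"),
  sex     = c("f", "f", "m", NA)
)

combo_counts <- toy |>
  summarise(
    with_na    = n_distinct(species, sex),               # (A, f), (B, m), (B, NA)
    without_na = n_distinct(species, sex, na.rm = TRUE)  # the (B, NA) row is dropped
  )

combo_counts
```

Here with_na is 3 and without_na is 2, because the row with a missing sex is removed as a whole rather than being matched on species alone.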

Example 2: Practical Application

The Problem

A researcher wants to analyze the diversity of penguin measurements to understand sampling completeness. They need to know how many unique body mass values were recorded and identify potential data quality issues across different groups.

Step 1: Examine measurement diversity

Let’s count unique body mass values to understand measurement precision.

penguins |> 
  summarise(
    unique_body_mass = n_distinct(body_mass_g, na.rm = TRUE),
    total_records = n(),
    missing_values = sum(is.na(body_mass_g))
  )

The result shows how many distinct body mass values were recorded, alongside the total number of rows and how many of those rows have no body mass measurement.

Step 2: Compare diversity across species

Now we’ll examine measurement diversity within each species group.

penguins |> 
  group_by(species) |> 
  summarise(
    unique_bill_lengths = n_distinct(bill_length_mm, na.rm = TRUE),
    unique_bill_depths = n_distinct(bill_depth_mm, na.rm = TRUE),
    sample_size = n()
  )

This comparison shows how many distinct bill measurements each species has. Read the counts against sample_size, since a larger sample naturally yields more distinct values.
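One way to put the species on a comparable footing is to divide the distinct count by the number of non-missing measurements. This is a rough sketch rather than an established diversity metric; the column names `measured` and `distinct_ratio` are made up for this example.

```r
library(dplyr)
library(palmerpenguins)

bill_ratio <- penguins |>
  group_by(species) |>
  summarise(
    unique_bill_lengths = n_distinct(bill_length_mm, na.rm = TRUE),
    measured            = sum(!is.na(bill_length_mm)),    # non-missing measurements
    distinct_ratio      = unique_bill_lengths / measured  # share of values that are distinct
  )

bill_ratio
```

A ratio close to 1 means almost every measurement is different, while a low ratio means many birds share the same recorded value.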

Step 3: Identify sampling patterns by year

Finally, let’s examine how sampling diversity changed over time.

penguins |> 
  group_by(year) |> 
  summarise(
    unique_species = n_distinct(species),
    unique_islands = n_distinct(island),
    # Counts species/island/sex combinations, not individual birds
    unique_combinations = n_distinct(species, island, sex, na.rm = TRUE)
  )

This analysis reveals whether sampling was consistent over time: identical counts in every year suggest that the same range of species and islands was covered each year.

Step 4: Create a comprehensive diversity report

Let’s combine multiple diversity metrics into a single summary.

diversity_report <- penguins |> 
  summarise(
    across(where(is.numeric), ~ n_distinct(.x, na.rm = TRUE), .names = "unique_{.col}"),
    across(where(is.factor), ~ n_distinct(.x), .names = "unique_{.col}")
  )

diversity_report

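This creates a one-row overview with the number of distinct values in every numeric and factor column. Note that the factor columns are counted without na.rm = TRUE, so any NA in a factor such as sex is counted as one extra distinct value. Add na.rm = TRUE to the second across() call to exclude it.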
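A one-row report with many columns can be hard to read. As a possible variation, the sketch below counts every column with na.rm = TRUE (a deliberate change from the report above) and reshapes the result to one row per variable with tidyr's pivot_longer(). The names `diversity_long`, `variable`, and `n_unique` are illustrative.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

diversity_long <- penguins |>
  summarise(
    # Count distinct non-missing values in every column
    across(everything(), ~ n_distinct(.x, na.rm = TRUE))
  ) |>
  pivot_longer(
    cols      = everything(),
    names_to  = "variable",
    values_to = "n_unique"
  ) |>
  arrange(desc(n_unique))

diversity_long
```

Sorting by n_unique puts the continuous measurements at the top and the categorical variables at the bottom.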

Summary

  • n_distinct() counts unique values directly, without returning the values themselves
  • Use na.rm = TRUE to exclude missing values from the count; by default NA counts as one distinct value
  • Combine with group_by() to count unique values within different categories
  • Pass several columns at once to count unique combinations of their values
  • Use with across() to quickly assess diversity across many variables at once