dplyr n_distinct(): count unique elements or rows
Introduction
The n_distinct() function in dplyr counts the number of unique (distinct) values in a vector or across multiple columns. It’s particularly useful when you need to quickly determine how many different categories, groups, or combinations exist in your data without actually listing them out.
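As a quick illustration of the counting behavior on a plain vector, here is a minimal sketch. The vector `x` is a made-up toy example, not part of the penguins data.

```r
library(dplyr)

# Hypothetical toy vector (assumption: not taken from the penguins data)
x <- c("a", "b", "a", NA)

n_distinct(x)                # counts "a", "b", and NA as three values
n_distinct(x, na.rm = TRUE)  # drops NA first, leaving "a" and "b"
length(unique(x))            # base R equivalent of the first call
```

By default a missing value is treated as a value of its own, which is why `na.rm = TRUE` appears in several of the examples that follow.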
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The Problem
We want to count how many unique species of penguins exist in our dataset. This is a common first step in exploratory data analysis to understand the diversity of categorical variables.
Step 1: Count unique values in a single column
First, let’s see how many distinct penguin species we have.
penguins |>
  summarise(unique_species = n_distinct(species))
This returns a one-row tibble whose single value, 3, is the number of unique penguin species in the dataset.
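If you also want to see which values are behind the count, one option is `distinct()`, which lists the values instead of counting them. The sketch below only cross-checks the two approaches on the same column.

```r
library(dplyr)
library(palmerpenguins)

# List the distinct species rather than counting them
species_list <- penguins |>
  distinct(species)

species_list

# The number of listed rows should agree with n_distinct()
nrow(species_list) == n_distinct(penguins$species)
```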
Step 2: Count unique values with grouping
Now let’s count unique species within each island to see the distribution.
penguins |>
  group_by(island) |>
  summarise(
    unique_species = n_distinct(species),
    total_penguins = n()
  )
This shows how many different species live on each island and the total penguin count per island.
Step 3: Count unique combinations
We can count unique combinations across multiple columns simultaneously.
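penguins |>
  summarise(
    unique_combinations = n_distinct(species, island, sex, na.rm = TRUE)
  )
This counts how many unique combinations of species, island, and sex exist in our data. With na.rm = TRUE, a row is left out of the count if any of the three columns is missing (the documented behavior in dplyr 1.1.0 and later).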
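The effect of `na.rm = TRUE` on combinations is easier to see on a small table. The sketch below uses a hypothetical tibble `toy` (not part of palmerpenguins) and assumes dplyr 1.1.0 or later, where a row is dropped if any of the supplied columns is missing.

```r
library(dplyr)

# Hypothetical toy data (assumption: invented for illustration only)
toy <- tibble(
  species = c("A", "A", "B", "B", "B"),
  sex     = c("f", "f", "m", NA, NA)
)

combo_counts <- toy |>
  summarise(
    # (A, f), (B, m), and (B, NA) are counted
    with_na    = n_distinct(species, sex),
    # rows with a missing sex are dropped, leaving (A, f) and (B, m)
    without_na = n_distinct(species, sex, na.rm = TRUE)
  )

combo_counts
```

Under that assumption the two columns should report 3 and 2 respectively.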
Example 2: Practical Application
The Problem
A researcher wants to analyze the diversity of penguin measurements to understand sampling completeness. They need to know how many unique body mass values were recorded and identify potential data quality issues across different groups.
Step 1: Examine measurement diversity
Let’s count unique body mass values to understand measurement precision.
penguins |>
  summarise(
    unique_body_mass = n_distinct(body_mass_g, na.rm = TRUE),
    total_records = n(),
    missing_values = sum(is.na(body_mass_g))
  )
This shows how many distinct body mass values were recorded, alongside the total number of rows and the number of missing measurements.
Step 2: Compare diversity across species
Now we’ll examine measurement diversity within each species group.
penguins |>
  group_by(species) |>
  summarise(
    unique_bill_lengths = n_distinct(bill_length_mm, na.rm = TRUE),
    unique_bill_depths = n_distinct(bill_depth_mm, na.rm = TRUE),
    sample_size = n()
  )
This comparison shows whether some species have more varied measurements than others.
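Raw counts of unique values tend to grow with sample size, so species with more records can look more varied simply because they were measured more often. One possible adjustment, sketched below, is to divide the unique count by the number of non-missing measurements. The `share_unique` column is a hypothetical metric introduced here for illustration, not something dplyr provides.

```r
library(dplyr)
library(palmerpenguins)

bill_length_variety <- penguins |>
  group_by(species) |>
  summarise(
    unique_bill_lengths = n_distinct(bill_length_mm, na.rm = TRUE),
    non_missing = sum(!is.na(bill_length_mm)),
    # Hypothetical metric: share of non-missing measurements that are distinct
    share_unique = unique_bill_lengths / non_missing
  )

bill_length_variety
```

A share close to 1 means almost every measurement is different from the others, while a lower share means many values repeat.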
Step 3: Identify sampling patterns by year
Finally, let’s examine how sampling diversity changed over time.
penguins |>
  group_by(year) |>
  summarise(
    unique_species = n_distinct(species),
    unique_islands = n_distinct(island),
    # Counts species-island-sex combinations, not individual penguins
    unique_combinations = n_distinct(species, island, sex, na.rm = TRUE)
  )
This analysis shows whether the same species, islands, and species-island-sex combinations were sampled in each year.
Step 4: Create a comprehensive diversity report
Let’s combine multiple diversity metrics into a single summary.
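diversity_report <- penguins |>
  summarise(
    # Numeric columns: count distinct values, ignoring NA
    across(where(is.numeric), ~ n_distinct(.x, na.rm = TRUE), .names = "unique_{.col}"),
    # Factor columns: NA is counted as its own value here
    across(where(is.factor), ~ n_distinct(.x), .names = "unique_{.col}")
  )
diversity_report
This creates a one-row overview of the number of distinct values in every variable in the dataset. Because the factor columns are counted without na.rm = TRUE, a missing value counts as its own category, so sex reports 3 rather than 2.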
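A one-row, many-column summary can be awkward to read. As one possible variation, the sketch below reshapes the counts into a long table with `tidyr::pivot_longer()`. It deliberately uses `na.rm = TRUE` for every column, which differs from the report above, and the names `diversity_long`, `variable`, and `n_unique` are placeholders chosen for this example.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

diversity_long <- penguins |>
  summarise(
    # Count distinct non-missing values in every column
    across(everything(), ~ n_distinct(.x, na.rm = TRUE))
  ) |>
  pivot_longer(
    everything(),
    names_to = "variable",
    values_to = "n_unique"
  ) |>
  # Most varied variables first
  arrange(desc(n_unique))

diversity_long
```

Each row now describes one variable, which makes it easier to sort, filter, or plot the counts.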
Summary
- n_distinct() efficiently counts unique values without creating lists of those values
- Use na.rm = TRUE to exclude missing values from the count
- Combine with group_by() to count unique values within different categories
- Apply to multiple columns simultaneously to count unique combinations
- Use with across() to quickly assess diversity across many variables at once