How to count unique values with n_distinct() in R
Introduction
The n_distinct() function in dplyr counts the number of unique values in a vector or dataset. It is useful for data exploration and quality checks because it shows at a glance how varied your data is. You’ll find it particularly handy when examining categorical variables, checking for duplicates, or summarizing data by groups.
Setup
Let’s start by loading the tidyverse package and creating some sample data to work with:
library(tidyverse)
# Create sample data with some duplicate IDs and a missing value
df <- tibble(
  id = c(2, 4, 1, 2, 3, 4, NA),
  amount = c(250, 200, 150, 250, 300, 120, 200)
)
df
Our dataset contains 7 rows but notice that some ID values are repeated, and we have one missing value (NA).
Basic Usage of n_distinct()
A simple way to count distinct values is to extract a column with pull() and pass it to n_distinct():
df |>
  pull(id) |>
  n_distinct()
This counts all unique values including NA, so we get 5 distinct values (1, 2, 3, 4, and NA).
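To count several columns in one call, one option is to combine n_distinct() with across() inside summarize(). This is a minimal sketch on the same sample data; the result name distinct_per_column is introduced here only for illustration.

```r
library(dplyr)

# Same sample data as in the Setup section
df <- tibble(
  id = c(2, 4, 1, 2, 3, 4, NA),
  amount = c(250, 200, 150, 250, 300, 120, 200)
)

# Apply n_distinct() to every column; NA is counted as a value by default
distinct_per_column <- df |>
  summarize(across(everything(), n_distinct))

distinct_per_column
```

The result is a one-row tibble with one count per column, which makes it easy to scan a wide dataset for columns with unexpectedly few or many distinct values.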
Handling Missing Values
To exclude missing values from the count, set the na.rm argument to TRUE:
df |>
  pull(id) |>
  n_distinct(na.rm = TRUE)
Now we get 4 distinct values, excluding the NA.
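As a quick side-by-side check of the two behaviours, the sketch below runs both calls on a plain vector holding the same values as the id column. The names ids, with_na, and without_na are illustrative, not part of dplyr.

```r
library(dplyr)

# Same values as the id column above; `ids` is an illustrative name
ids <- c(2, 4, 1, 2, 3, 4, NA)

with_na <- n_distinct(ids)                   # NA counts as its own value
without_na <- n_distinct(ids, na.rm = TRUE)  # NA is dropped before counting

with_na
without_na
```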
Alternative Approaches
You can get the same count as n_distinct() with its default settings (NA included) using base R's unique() and length():
df |>
  pull(id) |>
  unique() |>
  length()
This approach first gets unique values, then counts them. However, n_distinct() is more concise and handles missing values more explicitly.
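Because unique() keeps NA as a value, the base R version needs an extra filtering step to exclude missing values. A minimal sketch, assuming a plain vector with the same values as the id column (the names ids, base_with_na, and base_without_na are illustrative):

```r
# Same values as the id column above; `ids` is an illustrative name
ids <- c(2, 4, 1, 2, 3, 4, NA)

# unique() keeps NA, so it is included in the count
base_with_na <- length(unique(ids))

# Drop NA explicitly before counting
base_without_na <- length(unique(ids[!is.na(ids)]))

base_with_na
base_without_na
```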
Direct Column Access
You can also use n_distinct() directly on a column without pipes:
n_distinct(df$id)
This base R syntax is shorter for simple cases but doesn’t integrate as well with dplyr workflows.
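The direct form is convenient for a one-line duplicate check: if the number of distinct values equals the number of rows, no value is repeated. The sketch below applies this to the sample data; ids_are_unique is an illustrative name.

```r
library(dplyr)

# Same sample data as in the Setup section
df <- tibble(
  id = c(2, 4, 1, 2, 3, 4, NA),
  amount = c(250, 200, 150, 250, 300, 120, 200)
)

# TRUE only if every row has its own id (NA counts as one value here)
ids_are_unique <- n_distinct(df$id) == nrow(df)

ids_are_unique
```

Here the check returns FALSE, since 7 rows share only 5 distinct id values.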
Counting Distinct Rows
When applied to an entire data frame, n_distinct() counts unique combinations of all columns:
df |>
  n_distinct()
This tells us how many completely unique rows exist in our dataset.
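Two other ways to arrive at a distinct-row count are sketched below: dropping duplicate rows with distinct() and counting what remains, or passing the columns to n_distinct() as separate vectors. The result names are illustrative.

```r
library(dplyr)

# Same sample data as in the Setup section
df <- tibble(
  id = c(2, 4, 1, 2, 3, 4, NA),
  amount = c(250, 200, 150, 250, 300, 120, 200)
)

# Collapse duplicate rows, then count the rows that are left
rows_via_distinct <- df |>
  distinct() |>
  nrow()

# Passing columns as separate vectors counts unique (id, amount) pairs
rows_via_vectors <- n_distinct(df$id, df$amount)

rows_via_distinct
rows_via_vectors
```

Both give 6 for the sample data, because the pair (2, 250) appears twice.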
Using n_distinct() with Group Operations
One of the most powerful applications is counting distinct values within groups:
# Example with grouped data
df_grouped <- tibble(
  category = c("A", "A", "B", "B", "B"),
  value = c(10, 20, 10, 30, 20)
)
df_grouped |>
  group_by(category) |>
  summarize(distinct_values = n_distinct(value))
This shows how many distinct values exist within each category, which is invaluable for understanding data distribution across groups.
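A grouped distinct count is often easier to interpret next to the group's row count from n(), since a gap between the two signals repeated values. The sketch below uses a small made-up tibble (df_repeats and its values are hypothetical, chosen so that each group contains a repeat).

```r
library(dplyr)

# Hypothetical data in which each category repeats a value
df_repeats <- tibble(
  category = c("A", "A", "A", "B", "B"),
  value = c(10, 10, 20, 30, 30)
)

group_summary <- df_repeats |>
  group_by(category) |>
  summarize(
    rows = n(),                           # rows in the group
    distinct_values = n_distinct(value),  # unique values in the group
    has_repeats = n() > n_distinct(value) # TRUE if any value is repeated
  )

group_summary
```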
Summary
The n_distinct() function is an essential tool for exploratory data analysis in R. Use it to quickly count unique values in columns, check data quality, and summarize categorical variables. Remember to use na.rm = TRUE when you want to exclude missing values from your counts. Combined with group_by(), it becomes even more powerful for understanding patterns within different subsets of your data.