dplyr n_distinct(): count unique elements or rows

Master dplyr n_distinct() to count unique elements or rows. Complete R tutorial with examples using real datasets.

Published

September 14, 2024

Introduction

The n_distinct() function in dplyr counts the number of unique (distinct) values in a vector, or the number of unique combinations of values when you pass it several vectors or columns. It’s particularly useful when you need to quickly determine how many different categories, groups, or combinations exist in your data without listing them.
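As a minimal sketch of the idea, the snippet below calls n_distinct() on a plain vector and compares it with the base R idiom length(unique(x)). The vector `islands` is a made-up toy example, not part of the penguins data.

```r
library(dplyr)

# Hypothetical toy vector (an assumption for illustration only)
islands <- c("Biscoe", "Dream", "Biscoe", "Torgersen", "Dream")

# Number of distinct values in the vector
n_distinct(islands)       # 3

# Base R equivalent for a single vector
length(unique(islands))   # 3
```

For a single vector the two calls give the same count; n_distinct() is mainly more convenient inside summarise() and when several columns are involved.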

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We want to count how many unique species of penguins exist in our dataset. This is a common first step in exploratory data analysis to understand the diversity of categorical variables.

Step 1: Count unique values in a single column

First, let’s see how many distinct penguin species we have.

penguins |> 
  summarise(unique_species = n_distinct(species))

This returns a one-row tibble whose single value is 3: the dataset contains three penguin species (Adelie, Chinstrap, and Gentoo).

Step 2: Count unique values with grouping

Now let’s count unique species within each island to see the distribution.

penguins |> 
  group_by(island) |> 
  summarise(
    unique_species = n_distinct(species),
    total_penguins = n()
  )

This shows how many different species live on each island and the total penguin count per island.
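One way to sanity-check a grouped n_distinct() is to compute the same counts by another route: drop duplicate pairs with distinct(), then count rows per group. The sketch below uses a small hypothetical tibble `toy` (an assumption, standing in for penguins) so the result is easy to verify by eye.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
toy <- tibble(
  island  = c("A", "A", "A", "B", "B"),
  species = c("x", "x", "y", "x", "x")
)

# Approach from the text: n_distinct() inside a grouped summarise()
via_n_distinct <- toy |>
  group_by(island) |>
  summarise(unique_species = n_distinct(species))

# Alternative: keep one row per (island, species) pair, then count per island
via_distinct <- toy |>
  distinct(island, species) |>
  count(island, name = "unique_species")

via_n_distinct
via_distinct
```

Both tibbles should report 2 species for island A and 1 for island B. The n_distinct() version is shorter and lets you compute other summaries, such as n(), in the same call.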

Step 3: Count unique combinations

We can count unique combinations across multiple columns simultaneously.

penguins |> 
  summarise(
    unique_combinations = n_distinct(species, island, sex, na.rm = TRUE)
  )

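This counts how many unique combinations of species, island, and sex exist in our data. When several columns are supplied, na.rm = TRUE excludes any row in which at least one of those values is missing, so penguins with an unrecorded sex are left out of the count.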
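A small toy example can make the effect of na.rm on combinations concrete. The tibble `toy` below is hypothetical; it has one row with a missing sex, which is counted as its own combination unless na.rm = TRUE is set.

```r
library(dplyr)

# Hypothetical toy data: the second row has a missing sex
toy <- tibble(
  species = c("x", "x", "y", "y"),
  sex     = c("female", NA, "male", "male")
)

combos <- toy |>
  summarise(
    # (x, female), (x, NA), (y, male): three distinct combinations
    with_na    = n_distinct(species, sex),
    # the row containing NA is dropped first, leaving two combinations
    without_na = n_distinct(species, sex, na.rm = TRUE)
  )

combos
```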

Example 2: Practical Application

The Problem

A researcher wants to analyze the diversity of penguin measurements to understand sampling completeness. They need to know how many unique body mass values were recorded and identify potential data quality issues across different groups.

Step 1: Examine measurement diversity

Let’s count unique body mass values to understand measurement precision.

penguins |> 
  summarise(
    unique_body_mass = n_distinct(body_mass_g, na.rm = TRUE),
    total_records = n(),
    missing_values = sum(is.na(body_mass_g))
  )

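This reveals how many different body mass measurements exist. Comparing that number with the total record count and the number of missing values shows how often measurements repeat and how complete the column is.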
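The reason for passing na.rm = TRUE and reporting missing values separately is that, by default, n_distinct() treats NA as one more distinct value. A minimal sketch with a hypothetical vector `mass` (made-up numbers, not taken from the dataset):

```r
library(dplyr)

# Hypothetical measurements with two missing values
mass <- c(3750, 3800, 3750, NA, NA)

n_distinct(mass)                  # 3: 3750, 3800, and NA
n_distinct(mass, na.rm = TRUE)    # 2: NA is excluded
sum(is.na(mass))                  # 2 missing records
```

All missing values collapse into a single extra category, however many there are, which is why the missing count is worth reporting on its own.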

Step 2: Compare diversity across species

Now we’ll examine measurement diversity within each species group.

penguins |> 
  group_by(species) |> 
  summarise(
    unique_bill_lengths = n_distinct(bill_length_mm, na.rm = TRUE),
    unique_bill_depths = n_distinct(bill_depth_mm, na.rm = TRUE),
    sample_size = n()
  )

This comparison shows whether some species have more varied measurements than others.

Step 3: Identify sampling patterns by year

Next, let’s examine how sampling diversity changed over time.

penguins |> 
  group_by(year) |> 
  summarise(
    unique_species = n_distinct(species),
    unique_islands = n_distinct(island),
    unique_combinations = n_distinct(species, island, sex, na.rm = TRUE)
  )

This analysis reveals whether sampling was consistent across years and locations.

Step 4: Create a comprehensive diversity report

Let’s combine multiple diversity metrics into a single summary.

diversity_report <- penguins |> 
  summarise(
    across(where(is.numeric), ~ n_distinct(.x, na.rm = TRUE), .names = "unique_{.col}"),
    across(where(is.factor), ~ n_distinct(.x), .names = "unique_{.col}")
  )

diversity_report

This creates a one-row overview with one unique_* column per variable, showing the diversity of every numeric and factor column in the dataset.
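A one-row report with many columns can be hard to scan. One possible follow-up, sketched here on a hypothetical tibble `toy` rather than on penguins, is to reshape the report into a long table with tidyr's pivot_longer() and sort it. The column names `variable` and `n_unique` are arbitrary choices for this illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with a factor, a numeric, and an integer column
toy <- tibble(
  species = factor(c("x", "x", "y")),
  mass    = c(10, 10, NA),
  year    = c(2007L, 2008L, 2009L)
)

# Wide report: one column of distinct counts per variable
report <- toy |>
  summarise(across(everything(), ~ n_distinct(.x, na.rm = TRUE)))

# Long report: one row per variable, most diverse first
report_long <- report |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_unique") |>
  arrange(desc(n_unique))

report_long
```

In this toy case the long table lists year (3), species (2), and mass (1), which is easier to read than a single wide row.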

Summary

n_distinct() efficiently counts unique values without creating lists of those values
Use na.rm = TRUE to exclude missing values from the count; by default, NA is counted as one distinct value
Combine with group_by() to count unique values within different categories
Apply to multiple columns simultaneously to count unique combinations
Use with across() to quickly assess diversity across many variables at once
