3 ways to rank numbers with tidyverse
Introduction
Ranking is a fundamental operation in data analysis: it helps identify top performers, outliers, or the ordering within groups. This post compares three ranking functions, base R's rank() and dplyr's dense_rank() and row_number(), which handle ties differently and suit different analytical needs.
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The Problem
We need to understand how the three ranking functions behave differently when applied to the same dataset. Let’s compare their outputs using penguin body mass data.
Step 1: Create Sample Data
We’ll filter for a subset of penguins to clearly see ranking differences.
sample_penguins <- penguins |>
  filter(species == "Adelie", island == "Torgersen") |>
  select(species, body_mass_g) |>
  slice_head(n = 8) |>
  arrange(body_mass_g)
This gives us a small, manageable dataset. The three functions only disagree where values are tied, so tied rows are the ones to watch. Rows with a missing body mass, if any, sort last and receive NA ranks.
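Because the ranking functions only differ on tied values, it can help to check whether a sample contains any ties before comparing them. The sketch below uses a hypothetical toy tibble (toy, not the penguins sample) and counts how often each value occurs.

```r
library(dplyr)

# Hypothetical toy data (not the penguins sample): two rows share 3450.
toy <- tibble(body_mass_g = c(3250, 3450, 3450, 3800))

# Count occurrences of each value and keep those that appear more than once.
tied_values <- toy |>
  count(body_mass_g, name = "n_rows") |>
  filter(n_rows > 1)

tied_values
```

An empty result would mean there are no ties, in which case all three ranking functions return the same ordering.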
Step 2: Apply All Three Ranking Methods
Now we’ll add columns showing each ranking method side by side.
ranked_penguins <- sample_penguins |>
  mutate(
    standard_rank = rank(body_mass_g),
    dense_rank = dense_rank(body_mass_g),
    row_number = row_number(body_mass_g)
  )
Each function handles ties differently: rank() assigns tied values the average of the ranks they span, dense_rank() gives tied values the same rank and leaves no gap afterward, and row_number() gives every row a unique rank, breaking ties by order of appearance.
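The tie behavior is easiest to see on a tiny vector. This is a minimal sketch on a hypothetical toy vector x (not taken from the penguins data) with a single tie.

```r
library(dplyr)

# Hypothetical toy vector with one tie (the two 20s).
x <- c(10, 20, 20, 30)

rank(x)        # base R, ties averaged:       1.0 2.5 2.5 4.0
dense_rank(x)  # ties share a rank, no gaps:  1 2 2 3
row_number(x)  # ties broken by position:     1 2 3 4
```

The two tied values would occupy positions 2 and 3, so rank() reports their average, 2.5.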
Step 3: Examine the Results
Let’s view the comparison to understand the differences.
ranked_penguins |>
  arrange(body_mass_g) |>
  print()
Where values are tied, the columns diverge: standard ranking averages the tied ranks, dense ranking keeps the numbering consecutive, and row numbering gives each tied row its own rank in order of appearance.
Example 2: Practical Application
The Problem
A wildlife researcher wants to identify the heaviest penguins within each species for a health study. They need the top 3 penguins per species, but must handle ties appropriately to ensure no penguin is unfairly excluded from the study.
Step 1: Clean and Prepare Data
First, we’ll remove any penguins with missing body mass measurements.
clean_penguins <- penguins |>
  filter(!is.na(body_mass_g)) |>
  select(species, island, sex, body_mass_g)
This ensures our ranking calculations work with complete data only.
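For context, here is a small sketch of what the ranking functions do when missing values are left in. The vector x is a hypothetical example, not drawn from the dataset. By default all three return NA for a missing input, so removing those rows up front mainly keeps the later steps explicit.

```r
library(dplyr)

# Hypothetical toy vector with one missing measurement.
x <- c(3750, NA, 3250)

dense_rank(x)  # 2 NA 1
row_number(x)  # 2 NA 1
rank(x)        # 2 NA 1 (base R default: na.last = "keep")
```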
Step 2: Apply Dense Ranking by Species
We’ll use dense_rank() to ensure tied penguins receive the same rank without skipping numbers.
species_ranks <- clean_penguins |>
  group_by(species) |>
  mutate(
    weight_rank = dense_rank(desc(body_mass_g))
  ) |>
  arrange(species, weight_rank)
Dense ranking ensures that if two penguins tie for 2nd place, the next penguin gets 3rd place (not 4th), maximizing study inclusion.
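The "3rd place, not 4th" behavior can be checked on a small grouped example. This sketch uses hypothetical toy data (the names toy, group, and mass_g are illustrative) and contrasts dense_rank() with dplyr's min_rank(), which does skip ranks after a tie.

```r
library(dplyr)

# Hypothetical toy data: in group "a", two animals tie for 2nd heaviest.
toy <- tibble(
  group  = c("a", "a", "a", "a", "b", "b"),
  mass_g = c(5000, 4800, 4800, 4600, 4000, 3900)
)

ranked <- toy |>
  group_by(group) |>
  mutate(
    dense = dense_rank(desc(mass_g)),  # no gap after the tie
    gaps  = min_rank(desc(mass_g))     # skips a rank after the tie
  ) |>
  ungroup()

ranked
```

In group "a" the 4600 g animal gets rank 3 under dense_rank() but rank 4 under min_rank(), so a filter of rank <= 3 keeps it only in the dense case.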
Step 3: Filter Top Performers
Now we’ll extract the top 3 ranked penguins from each species.
top_penguins <- species_ranks |>
  filter(weight_rank <= 3) |>
  arrange(species, weight_rank, desc(body_mass_g))
This keeps every penguin whose weight is among the three heaviest distinct values for its species, so a tie never pushes a penguin out of the study.
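As a point of comparison, dplyr's slice_max() also selects top rows per group, but it only keeps extra rows that are tied with the n-th row. The sketch below, on hypothetical toy data, shows how the two approaches can select different numbers of rows. It is an illustration of the trade-off, not a replacement for the approach used here.

```r
library(dplyr)

# Hypothetical toy data: one group with ties at the 2nd and 3rd distinct weights.
toy <- tibble(
  group  = c("a", "a", "a", "a", "a"),
  mass_g = c(5000, 4800, 4800, 4600, 4600)
)

# Dense-rank filter: keeps the three heaviest distinct weights.
by_dense <- toy |>
  group_by(group) |>
  filter(dense_rank(desc(mass_g)) <= 3) |>
  ungroup()

# slice_max(): keeps the top 3 rows, plus any rows tied with the 3rd.
by_slice <- toy |>
  group_by(group) |>
  slice_max(mass_g, n = 3, with_ties = TRUE) |>
  ungroup()

nrow(by_dense)  # 5
nrow(by_slice)  # 3
```

The dense-rank filter is the more inclusive of the two, which matches the goal of not excluding tied penguins.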
Step 4: Verify Results by Species
Let’s examine how many penguins we selected per species.
top_penguins |>
  count(species, name = "selected_penguins") |>
  arrange(species)
A species can have more than 3 penguins selected whenever there is a tie at any of the top three ranks, which is exactly what we want for inclusive research.
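To see where the extra penguins come from, the count can be paired with the number of distinct weights per species. This is a sketch on a hypothetical stand-in for top_penguins (toy_top, with made-up species labels and weights).

```r
library(dplyr)

# Hypothetical stand-in for top_penguins: species "A" has a tie at 4800.
toy_top <- tibble(
  species     = c("A", "A", "A", "A", "B", "B", "B"),
  body_mass_g = c(5000, 4800, 4800, 4600, 4000, 3900, 3800)
)

selection_summary <- toy_top |>
  group_by(species) |>
  summarise(
    selected_penguins = n(),                     # rows kept
    distinct_weights  = n_distinct(body_mass_g), # at most 3 after the dense-rank filter
    .groups = "drop"
  )

selection_summary
```

Whenever selected_penguins exceeds distinct_weights, the difference is the number of extra rows admitted by ties.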
Summary
- rank() assigns average ranks to tied values, useful for statistical calculations requiring continuous ranking scales
- dense_rank() gives consecutive integer ranks without gaps, ideal when you need compact ranking that includes all tied values
- row_number() gives every row a unique rank, breaking ties by order of appearance, perfect when you need exactly one rank per observation regardless of ties
- Group-wise ranking with group_by() enables sophisticated analyses like finding top performers within categories
- Choose your ranking method based on how you want ties handled: averaged, consecutive, or broken by order of appearance