3 ways to rank numbers with tidyverse

Learn 3 ways to rank numbers with tidyverse in R. Practical tutorial with examples.
Published: September 16, 2023

Introduction

Ranking data is a fundamental operation in data analysis that helps identify top performers, outliers, or ordering within groups. This tutorial compares three ranking functions: base R’s rank(), and dplyr’s dense_rank() and row_number(). Each handles ties differently and suits different analytical needs.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We need to understand how the three ranking functions behave differently when applied to the same dataset. Let’s compare their outputs using penguin body mass data.

Step 1: Create Sample Data

We’ll filter for a subset of penguins to clearly see ranking differences.

# Keep Adelie penguins from Torgersen that have a recorded body mass
sample_penguins <- penguins |>
  filter(species == "Adelie", island == "Torgersen", !is.na(body_mass_g)) |>
  select(species, body_mass_g) |>
  slice_head(n = 8) |>
  arrange(body_mass_g)

This gives us a manageable dataset. Whether it contains tied body masses depends on the rows selected, and the three methods only disagree where ties occur.

Step 2: Apply All Three Ranking Methods

Now we’ll add columns showing each ranking method side by side.

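# Column names differ from the function names to avoid confusion
ranked_penguins <- sample_penguins |>
  mutate(
    standard_rank = rank(body_mass_g),   # base R: ties get the average rank
    dense = dense_rank(body_mass_g),     # dplyr: ties share a rank, no gaps
    row_num = row_number(body_mass_g)    # dplyr: ties broken by row order
  )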

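Each function handles ties differently: rank() gives tied values the average of the ranks they span, dense_rank() gives them the same rank and leaves no gap afterwards, and row_number() gives every row a unique rank, breaking ties by order of appearance.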
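Because a subset of real data may not contain ties, a tiny vector with a deliberate tie makes the contrast easier to see. The values below are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical body masses with a deliberate tie at 3500
ties_demo <- tibble(mass = c(3200, 3500, 3500, 3900)) |>
  mutate(
    avg_rank = rank(mass),        # 1.0 2.5 2.5 4.0 (ties share the average)
    dense    = dense_rank(mass),  # 1   2   2   3   (ties share, no gap)
    row_num  = row_number(mass)   # 1   2   3   4   (first occurrence wins)
  )

ties_demo
```

The two birds at 3500 g occupy positions 2 and 3, so rank() reports 2.5 for both, dense_rank() reports 2 for both, and row_number() hands out 2 and 3 in row order.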

Step 3: Examine the Results

Let’s view the comparison to understand the differences.

ranked_penguins |>
  arrange(body_mass_g) |>
  print()

Notice how tied values receive different treatments: standard ranking averages ties, dense ranking maintains consecutive numbering, while row numbering breaks ties by order of appearance.
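These treatments correspond to the ties.method argument of base R’s rank(). The sketch below, using a hypothetical vector, checks that dplyr’s min_rank() and row_number() give the same ranks as the "min" and "first" methods.

```r
library(dplyr)

# Hypothetical values with a tie at 20
x <- c(20, 10, 20, 30)

base_avg   <- rank(x, ties.method = "average")  # base default: 2.5 1 2.5 4
base_min   <- rank(x, ties.method = "min")      # 2 1 2 4
base_first <- rank(x, ties.method = "first")    # 2 1 3 4

all(base_min == min_rank(x))      # TRUE
all(base_first == row_number(x))  # TRUE
```

dense_rank() has no direct ties.method counterpart, since every method in rank() leaves a gap after a tie.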

Example 2: Practical Application

The Problem

A wildlife researcher wants to identify the heaviest penguins within each species for a health study. They need the top 3 penguins per species, but must handle ties appropriately to ensure no penguin is unfairly excluded from the study.

Step 1: Clean and Prepare Data

First, we’ll remove any penguins with missing body mass measurements.

clean_penguins <- penguins |>
  filter(!is.na(body_mass_g)) |>
  select(species, island, sex, body_mass_g)

This ensures our ranking calculations work with complete data only.
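To see why missing values matter, note that dplyr’s ranking functions give a missing value the rank NA. A minimal sketch with hypothetical values:

```r
library(dplyr)

# Hypothetical body masses; the second bird was never weighed
mass <- c(4000, NA, 3500)

dense_heaviest <- dense_rank(desc(mass))  # 1 NA 2
row_heaviest   <- row_number(desc(mass))  # 1 NA 2

# Comparing NA with a number yields NA, so a rank filter cannot select that bird
keep <- dense_heaviest <= 3               # TRUE NA TRUE
```

filter() drops rows where the condition is NA, so unweighed birds would fall out of a top 3 filter anyway. Removing them up front makes that explicit.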

Step 2: Apply Dense Ranking by Species

We’ll use dense_rank() to ensure tied penguins receive the same rank without skipping numbers.

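species_ranks <- clean_penguins |>
  group_by(species) |>
  mutate(
    weight_rank = dense_rank(desc(body_mass_g))  # rank 1 = heaviest in its species
  ) |>
  ungroup() |>  # ranks are computed; drop the grouping for later steps
  arrange(species, weight_rank)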

Dense ranking ensures that if two penguins tie for 2nd place, the next penguin gets 3rd place (not 4th), maximizing study inclusion.
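The difference between 3rd and 4th place can be checked on a small hypothetical vector, comparing dense_rank() with dplyr’s min_rank(), which does leave a gap after a tie.

```r
library(dplyr)

# Hypothetical body masses: two birds tie for 2nd heaviest
mass <- c(5000, 4800, 4800, 4600)

dense_pos <- dense_rank(desc(mass))  # 1 2 2 3
min_pos   <- min_rank(desc(mass))    # 1 2 2 4
```

With a cutoff of 3, the 4600 g bird is kept under dense ranking and dropped under minimum ranking.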

Step 3: Filter Top Performers

Now we’ll extract the top 3 ranked penguins from each species.

top_penguins <- species_ranks |>
  filter(weight_rank <= 3) |>
  arrange(species, weight_rank, desc(body_mass_g))

Filtering on the dense rank keeps every penguin whose body mass is among the three heaviest distinct values in its species, so tied birds are never split.
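For comparison, dplyr’s slice_max() is a common shortcut for top-n selection, but it is stricter about ties. The sketch below uses a hypothetical tibble (the birds data and its columns are made up for illustration) to show how the two approaches differ.

```r
library(dplyr)

# Hypothetical birds: ties at 4800 and at 4600
birds <- tibble(
  id   = 1:5,
  mass = c(5000, 4800, 4800, 4600, 4600)
)

# Dense rank: the three heaviest distinct masses, with all their ties
by_dense <- birds |>
  filter(dense_rank(desc(mass)) <= 3)

# slice_max(): the top 3 rows, extended only by ties with the last row kept
by_slice <- birds |>
  slice_max(mass, n = 3)

nrow(by_dense)  # 5
nrow(by_slice)  # 3
```

Which one is right depends on the study design: dense ranking favors inclusion, while slice_max() stays closer to the requested count.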

Step 4: Verify Results by Species

Let’s examine how many penguins we selected per species.

top_penguins |>
  count(species, name = "selected_penguins") |>
  arrange(species)

Some species might have more than 3 penguins selected because of ties at any of the top three ranks, not only at 3rd place, which is exactly what we want for inclusive research.
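The same check can be reproduced on a toy dataset where the ties are known in advance. The species labels and masses here are hypothetical.

```r
library(dplyr)

# Hypothetical data: species "A" has a tie for 1st place, species "B" has none
toy <- tibble(
  species = c("A", "A", "A", "A", "B", "B", "B", "B"),
  mass    = c(50, 50, 40, 30, 60, 55, 52, 48)
)

selected <- toy |>
  group_by(species) |>
  filter(dense_rank(desc(mass)) <= 3) |>
  ungroup() |>
  count(species, name = "selected_penguins")

selected
# A: 4 rows (ranks 1, 1, 2, 3), B: 3 rows (ranks 1, 2, 3)
```

The tie for 1st place in species "A" is enough to push its count past 3, even though nothing is tied at 3rd.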

Summary

  • Base R’s rank() assigns average ranks to tied values by default, useful for statistical calculations requiring continuous ranking scales
  • dense_rank() gives consecutive integer ranks without gaps, ideal when you need compact ranking that includes all tied values
  • row_number() breaks ties by order of appearance, perfect when you need exactly one unique rank per observation regardless of ties
  • Group-wise ranking with group_by() enables analyses like finding top performers within categories
  • Choose your ranking method based on how you want ties handled: averaged, shared without gaps, or broken by order of appearance