3 ways to rank numbers with tidyverse

Categories: dplyr, dplyr dense_rank(), dplyr min_rank(), dplyr row_number()

Learn 3 ways to rank numbers with tidyverse in R. Practical tutorial with examples.

Published: September 16, 2023

Introduction

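Ranking data is a fundamental operation in data analysis that helps identify top performers, outliers, or ordering within groups. This tutorial compares three ranking functions that are available once the tidyverse is loaded: base R’s rank() and dplyr’s dense_rank() and row_number(). Each handles ties differently and suits different analytical needs. dplyr also provides min_rank(), which gives tied values the lowest rank they span and leaves a gap afterwards.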

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We need to understand how the three ranking functions behave differently when applied to the same dataset. Let’s compare their outputs using penguin body mass data.

Step 1: Create Sample Data

We’ll filter for a subset of penguins to clearly see ranking differences.

sample_penguins <- penguins |>
  filter(species == "Adelie", island == "Torgersen") |>
  select(species, body_mass_g) |>
  slice_head(n = 8) |>
  arrange(body_mass_g)

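This gives us a manageable eight-row dataset. Inspect it for tied and missing values before ranking: this slice includes a penguin with no recorded body mass, and rank() places NA last by default, while dense_rank() and row_number() return NA for it.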

Step 2: Apply All Three Ranking Methods

Now we’ll add columns showing each ranking method side by side.

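ranked_penguins <- sample_penguins |>
  mutate(
    # Base R: tied values share the average of the positions they span
    standard_rank = rank(body_mass_g),
    # dplyr: tied values share a rank, and no rank is skipped afterwards
    dense_rank = dense_rank(body_mass_g),
    # dplyr: every row gets a unique rank; ties go in order of appearance
    row_number = row_number(body_mass_g)
  )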

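Each function handles ties differently: rank() gives tied values the average of the positions they span, dense_rank() gives them the same integer rank and leaves no gaps afterwards, and row_number() gives every row a unique rank, breaking ties by order of appearance (the first occurrence gets the lower rank).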
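To make the tie handling concrete, here is a minimal sketch on a hypothetical four-element vector (not penguin data) in which two values tie. It also includes dplyr’s min_rank() for comparison.

```r
library(dplyr)

# Hypothetical toy vector: the two 20s are tied
x <- c(10, 20, 20, 30)

avg_rank  <- rank(x)        # 1.0 2.5 2.5 4.0  (ties share the average position)
min_rnk   <- min_rank(x)    # 1 2 2 4          (ties share the lowest position, then a gap)
dense_rnk <- dense_rank(x)  # 1 2 2 3          (ties share a rank, no gap)
row_num   <- row_number(x)  # 1 2 3 4          (ties broken by order of appearance)

print(tibble(x, avg_rank, min_rnk, dense_rnk, row_num))
```

Only row_number() is guaranteed to produce a distinct rank for every element; the other three repeat a rank whenever values tie.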

Step 3: Examine the Results

Let’s view the comparison to understand the differences.

ranked_penguins |>
  arrange(body_mass_g) |>
  print()

Notice how tied values are treated: standard ranking averages their positions, dense ranking gives them the same rank and keeps the numbering consecutive, and row numbering gives each a distinct rank in order of appearance.

Example 2: Practical Application

The Problem

A wildlife researcher wants to identify the heaviest penguins within each species for a health study. They need the top 3 penguins per species, but must handle ties appropriately to ensure no penguin is unfairly excluded from the study.

Step 1: Clean and Prepare Data

First, we’ll remove any penguins with missing body mass measurements.

clean_penguins <- penguins |>
  filter(!is.na(body_mass_g)) |>
  select(species, island, sex, body_mass_g)

This ensures our ranking calculations work with complete data only.
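Removing missing values first matters because the functions disagree about NA. The sketch below uses a hypothetical three-element vector to show the difference; the variable names are illustrative only.

```r
library(dplyr)

# Hypothetical masses with one missing measurement
mass <- c(3750, NA, 3250)

base_default <- rank(mass)                    # 2 3 1   (NA is ranked last by default)
base_keep    <- rank(mass, na.last = "keep")  # 2 NA 1  (NA stays NA)
dplyr_dense  <- dense_rank(mass)              # 2 NA 1
dplyr_row    <- row_number(mass)              # 2 NA 1

print(tibble(mass, base_default, base_keep, dplyr_dense, dplyr_row))
```

With the default settings, base rank() would quietly give a missing measurement the largest rank, so filtering out NA beforehand keeps all three methods comparable.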

Step 2: Apply Dense Ranking by Species

We’ll use dense_rank() to ensure tied penguins receive the same rank without skipping numbers.

species_ranks <- clean_penguins |>
  group_by(species) |>
  mutate(
    weight_rank = dense_rank(desc(body_mass_g))
  ) |>
  arrange(species, weight_rank)

Dense ranking ensures that if two penguins tie for 2nd place, the next penguin gets 3rd place (not 4th), maximizing study inclusion.
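The effect of a tie for 2nd place can be checked on a small hypothetical vector. This sketch compares dense_rank() with dplyr’s min_rank(), which skips to 4th after the tie.

```r
library(dplyr)

# Hypothetical masses: two values tie for 2nd heaviest
mass <- c(50, 40, 40, 30, 20)

dense <- dense_rank(desc(mass))  # 1 2 2 3 4
gaps  <- min_rank(desc(mass))    # 1 2 2 4 5

top_dense <- mass[dense <= 3]    # 50 40 40 30
top_gaps  <- mass[gaps <= 3]     # 50 40 40

print(top_dense)
print(top_gaps)
```

A cutoff of rank 3 therefore keeps four values under dense ranking but only three under min_rank(), which is why dense ranking is the more inclusive choice here.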

Step 3: Filter Top Performers

Now we’ll extract the top 3 ranked penguins from each species.

top_penguins <- species_ranks |>
  filter(weight_rank <= 3) |>
  arrange(species, weight_rank, desc(body_mass_g))

This approach keeps every penguin whose body mass is among the three heaviest distinct values in its species, so tied penguins are never split by the cutoff.

Step 4: Verify Results by Species

Let’s examine how many penguins we selected per species.

top_penguins |>
  count(species, name = "selected_penguins") |>
  arrange(species)

Some species might have more than 3 penguins selected because a tie at any of the top three ranks adds an extra penguin, which is exactly what we want for inclusive research.
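As a rough illustration of how the per-species counts can exceed 3, the sketch below uses a hypothetical toy table (species "A" and "B" with made-up masses) and compares the dense-rank filter with slice_max(), which keeps ties only at the cutoff row. It is a comparison under these assumptions, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data, not the penguins dataset
toy <- tibble(
  species = c("A", "A", "A", "A", "A", "B", "B", "B"),
  mass    = c(50, 40, 40, 30, 20, 9, 9, 8)
)

# Top 3 distinct masses per species (the approach used above)
dense_top <- toy |>
  group_by(species) |>
  filter(dense_rank(desc(mass)) <= 3) |>
  ungroup()

# Top 3 rows per species, extended only by ties with the 3rd row
slice_top <- toy |>
  group_by(species) |>
  slice_max(mass, n = 3, with_ties = TRUE) |>
  ungroup()

dense_counts <- count(dense_top, species, name = "selected")  # A: 4, B: 3
slice_counts <- count(slice_top, species, name = "selected")  # A: 3, B: 3

print(dense_counts)
print(slice_counts)
```

In species "A" the tie for 2nd place pushes the dense-rank selection to four rows, while slice_max() stops at three because the tie does not fall on the cutoff.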

Summary

rank() (base R) assigns average ranks to tied values by default, useful for statistical calculations that need a continuous ranking scale
dense_rank() gives tied values the same integer rank with no gaps afterwards, ideal when you need a compact ranking that includes all tied values
row_number() gives every observation a unique rank, breaking ties by order of appearance, which suits cases where you need exactly one rank per observation
Group-wise ranking with group_by() enables analyses such as finding top performers within categories
Choose your ranking method based on how you want ties handled: averaged, shared without gaps, or broken by order of appearance
