dplyr row_number(): Add unique row numbers to a dataframe
Introduction
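The row_number() function in dplyr is a ranking (window) function. Called without arguments inside a verb such as mutate(), it returns the sequential integers 1, 2, ..., n for the rows of a dataframe, or for the rows of each group when the data is grouped. Unlike base R’s rownames(), which are stored as an attribute rather than as a column, row_number() fits dplyr’s grammar and works inside grouped operations and data pipelines.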
This function is particularly useful when you need to create unique identifiers, rank observations, or perform operations that require row positioning. Whether you’re working with ungrouped data or need to number rows within specific groups, row_number() provides a clean, efficient solution that maintains the tidyverse philosophy of readable, chainable code.
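To make the difference between ungrouped and grouped numbering concrete, here is a minimal sketch on a small made-up table. The tibble toy and its columns group and value are hypothetical and exist only for this illustration:

```r
library(dplyr)

# Hypothetical toy data, not part of the tutorial's dataset
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  value = c(10, 20, 30, 40, 50)
)

numbered <- toy |>
  mutate(overall_id = row_number()) |>  # numbers rows across the whole table
  group_by(group) |>
  mutate(group_id = row_number()) |>    # numbering restarts at 1 in each group
  ungroup()

numbered
```

Here overall_id runs from 1 to 5, while group_id restarts for group "b".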
Getting Started
Let’s load the required packages for this tutorial:
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The simplest use case is adding row numbers to an entire dataframe. Here’s how to use row_number() with the penguins dataset:
penguins_numbered <- penguins |>
  mutate(row_id = row_number())
head(penguins_numbered)
You can also number rows within groups and then filter on that number. For example, to keep the first 5 rows of each species:
first_five_per_species <- penguins |>
  group_by(species) |>
  mutate(species_row = row_number()) |>
  filter(species_row <= 5) |>
  select(species, island, species_row, everything())
first_five_per_species
Example 2: Practical Application
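As a warm-up for ranking, note that row_number() also accepts a vector and then ranks its values, breaking ties by order of appearance. The sketch below uses a hypothetical scores tibble (the names scores, name and score are assumptions for illustration) and compares it with min_rank(), which lets ties share a rank:

```r
library(dplyr)

# Hypothetical scores with one tie, used only for illustration
scores <- tibble(
  name  = c("w", "x", "y", "z"),
  score = c(70, 90, 90, 80)
)

ranked <- scores |>
  mutate(
    rank_unique = row_number(desc(score)),  # ties broken by row order
    rank_tied   = min_rank(desc(score))     # tied values share the lowest rank
  )

ranked
```

Because row_number() never repeats a value, it is a reasonable choice when you need exactly n rows per group. min_rank() may be a better fit when tied observations should be treated equally.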
Let’s explore a more complex real-world scenario where we want to analyze penguin body measurements and identify the largest penguins within each species by body mass:
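penguin_rankings <- penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  # arrange() ignores grouping by default, so this sorts the whole table;
  # within each species the rows still end up in descending mass order
  arrange(desc(body_mass_g)) |>
  mutate(
    mass_rank = row_number(),  # 1 = heaviest penguin in the species
    total_in_species = n(),
    # rank as a percentage of the species count; low values mean heavier birds
    percentile_rank = round((mass_rank / total_in_species) * 100, 1)
  ) |>
  select(species, island, sex, body_mass_g, mass_rank, percentile_rank) |>
  ungroup()
# Keep the three heaviest penguins of each species
top_penguins <- penguin_rankings |>
  filter(mass_rank <= 3)
top_penguins
We can also combine row_number() with grouped summaries such as n() and mean() to build more detailed analyses: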
penguin_analysis <- penguins |>
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm)) |>
  group_by(species, sex) |>
  arrange(desc(body_mass_g)) |>
  mutate(
    rank_in_group = row_number(),
    is_top_third = rank_in_group <= ceiling(n() / 3),
    mass_vs_group_avg = body_mass_g - mean(body_mass_g),
    flipper_vs_group_avg = flipper_length_mm - mean(flipper_length_mm)
  ) |>
  filter(rank_in_group <= 5) |>
  select(species, sex, rank_in_group, body_mass_g, flipper_length_mm,
         is_top_third, mass_vs_group_avg) |>
  ungroup()
penguin_analysis
Summary
The row_number() function is an essential tool for adding sequential identifiers and creating rankings in your data analysis workflow. Key takeaways include:
- Use row_number() within mutate() to add row identifiers
- Combine with group_by() to create row numbers within groups
- Pair with arrange() to control the ordering before numbering
- Integrate with filter() to select top-n observations
- Works seamlessly with other dplyr functions in pipe chains
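The takeaways above can be combined in one short sketch. It uses a hypothetical toy tibble (the names toy, group and value are assumptions for illustration) and selects the top two values per group in two ways: once with filter() and row_number(), and once with dplyr's slice_max(), which should give the same rows here:

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  group = c("a", "a", "a", "b", "b"),
  value = c(5, 9, 7, 3, 8)
)

# Top two per group via row_number() on the descending values
top_two <- toy |>
  group_by(group) |>
  filter(row_number(desc(value)) <= 2) |>
  ungroup() |>
  arrange(group, desc(value))

# The same selection with slice_max(); with_ties = FALSE keeps exactly n rows
top_two_slice <- toy |>
  group_by(group) |>
  slice_max(value, n = 2, with_ties = FALSE) |>
  ungroup()

top_two
top_two_slice
```

Passing desc(value) to row_number() avoids a separate arrange() step before numbering. Which form reads better is largely a matter of taste.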