dplyr row_number(): Add unique row numbers to a dataframe

Master dplyr row_number() to add unique row numbers to a dataframe. A complete R tutorial with examples using a real dataset.
Published

January 23, 2022

Introduction

The row_number() function in dplyr is a window function that numbers rows sequentially. Called with no arguments inside mutate() or filter(), it returns each row's position (1, 2, 3, ...), and on a grouped dataframe the count restarts at 1 within each group. Unlike base R’s rownames(), which are row labels rather than a column, row_number() produces an ordinary column that fits naturally into dplyr pipelines.

This function is particularly useful when you need to create unique identifiers, rank observations, or perform operations that require row positioning. Whether you’re working with ungrouped data or need to number rows within specific groups, row_number() provides a clean, efficient solution that maintains the tidyverse philosophy of readable, chainable code.
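The ranking use mentioned above is easiest to see on a bare vector. In this minimal sketch (the vector `scores` is invented for illustration), row_number() called with an argument ranks the values and breaks ties by position, which matches base R's rank() with ties.method = "first"; missing values stay missing.

```r
library(dplyr)

# Hypothetical scores with a tie (two 10s) and a missing value
scores <- c(10, 30, 20, 10, NA)

ranks <- row_number(scores)
ranks
# 1 4 3 2 NA: the tie is broken by position, NA stays NA

# Base R equivalent, kept here for comparison
base_ranks <- rank(scores, ties.method = "first", na.last = "keep")
base_ranks
# 1 4 3 2 NA
```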

Getting Started

Let’s load the required packages for this tutorial:

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The simplest use case is adding row numbers to an entire dataframe. Here’s how to use row_number() with the penguins dataset:

penguins_numbered <- penguins |>
  mutate(row_id = row_number())

head(penguins_numbered)
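To check the result without loading palmerpenguins, here is the same pattern on a tiny dataframe. The `birds` tibble and its values are made up for illustration; the `.before` argument of mutate() is used only to put the new id in the first column, where it is easier to spot.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
birds <- tibble(
  species     = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, 5000, 3700)
)

birds_numbered <- birds |>
  mutate(row_id = row_number(), .before = 1)  # place the id in the first column

birds_numbered$row_id
# 1 2 3
```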

row_number() also respects grouping, which makes it useful for filtering by position. For example, to keep the first 5 rows of each species:

first_five_per_species <- penguins |>
  group_by(species) |>
  mutate(species_row = row_number()) |>
  filter(species_row <= 5) |>
  select(species, island, species_row, everything())

first_five_per_species
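If the row number itself is not needed afterwards, the helper column can be skipped. The sketch below, on a hypothetical toy tibble, shows two shorter variants: calling row_number() directly inside filter(), and using slice_head(), which keeps the first n rows of each group.

```r
library(dplyr)

# Hypothetical toy data: three rows for "a", two for "b"
toy <- tibble(
  grp = c("a", "a", "a", "b", "b"),
  val = c(5, 3, 9, 2, 7)
)

# row_number() used directly in filter(), without a helper column
via_row_number <- toy |>
  group_by(grp) |>
  filter(row_number() <= 2) |>
  ungroup()

# slice_head() keeps the first n rows of each group
via_slice <- toy |>
  group_by(grp) |>
  slice_head(n = 2) |>
  ungroup()

via_row_number$val
# 5 3 2 7
via_slice$val
# 5 3 2 7
```

The mutate() version from the text remains the better choice when the position should stay in the output as a column.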

Example 2: Practical Application

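Let’s look at a more realistic scenario: ranking penguins by body mass within each species and picking out the heaviest ones. Because arrange() ignores grouping by default, the rows are sorted by body mass across the whole dataset; row_number() then counts within each species in that order, so rank 1 is the heaviest penguin of its species: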

penguin_rankings <- penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  arrange(desc(body_mass_g)) |>
  mutate(
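    mass_rank = row_number(),        # 1 = heaviest penguin in its species
    total_in_species = n(),
    # Percent of the species ranked at or above this penguin (lower = heavier)
    percentile_rank = round((mass_rank / total_in_species) * 100, 1)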
  ) |>
  select(species, island, sex, body_mass_g, mass_rank, percentile_rank) |>
  ungroup()

top_penguins <- penguin_rankings |>
  filter(mass_rank <= 3)

top_penguins
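Passing the ordering variable directly to row_number() gives the same kind of ranking without sorting the rows first. The sketch below uses a hypothetical toy tibble with a deliberate tie to show how row_number() differs from min_rank(): row_number() always produces unique ranks, so tied rows are separated by their position, while min_rank() gives tied rows the same rank.

```r
library(dplyr)

# Hypothetical toy data with a tie (two 50s) in group "a"
toy <- tibble(
  grp  = c("a", "b", "a", "a", "b"),
  mass = c(40, 70, 50, 50, 60)
)

ranked <- toy |>
  group_by(grp) |>
  mutate(
    rank_first = row_number(desc(mass)),  # ties broken by position
    rank_min   = min_rank(desc(mass))     # ties share the lowest rank
  ) |>
  ungroup()

ranked$rank_first
# 3 1 1 2 2
ranked$rank_min
# 3 1 1 1 2
```

With row_number(), a filter such as rank <= 3 always returns exactly three rows per group; with min_rank() it can return more when there are ties at the cutoff.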

We can also combine row_number() with grouped calculations such as n() and mean(), which are evaluated per group inside mutate(), to build a richer analysis:

penguin_analysis <- penguins |>
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm)) |>
  group_by(species, sex) |>
  arrange(desc(body_mass_g)) |>
  mutate(
    rank_in_group = row_number(),
    is_top_third = rank_in_group <= ceiling(n() / 3),
    mass_vs_group_avg = body_mass_g - mean(body_mass_g),
    flipper_vs_group_avg = flipper_length_mm - mean(flipper_length_mm)
  ) |>
  filter(rank_in_group <= 5) |>
  select(species, sex, rank_in_group, body_mass_g, flipper_length_mm,
         is_top_third, mass_vs_group_avg, flipper_vs_group_avg) |>
  ungroup()

penguin_analysis
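The top-third flag is worth checking on small groups, because ceiling() rounds the cutoff up. In this sketch (toy data and column names are hypothetical), a group of four rows gets a cutoff of 2 and a group of two rows gets a cutoff of 1, so "top third" can cover half of a small group.

```r
library(dplyr)

# Hypothetical toy data: one group of 4 rows and one of 2 rows
toy <- tibble(
  grp  = c("a", "a", "a", "a", "b", "b"),
  mass = c(10, 40, 30, 20, 5, 15)
)

flagged <- toy |>
  group_by(grp) |>
  mutate(
    rank_in_group = row_number(desc(mass)),  # 1 = largest mass in the group
    cutoff        = ceiling(n() / 3),        # 2 for group "a", 1 for group "b"
    is_top_third  = rank_in_group <= cutoff,
    vs_group_avg  = mass - mean(mass)        # mean() is computed per group
  ) |>
  ungroup()

flagged$rank_in_group
# 4 1 2 3 2 1
flagged$is_top_third
# FALSE TRUE TRUE FALSE FALSE TRUE
flagged$vs_group_avg
# -15 15 5 -5 -5 5
```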

Summary

The row_number() function is an essential tool for adding sequential identifiers and creating rankings in your data analysis workflow. Key takeaways include:

  • Use row_number() within mutate() to add row identifiers
  • Combine with group_by() to create row numbers within groups
  • Pair with arrange() to control the ordering before numbering
  • Integrate with filter() to select top-n observations
  • Works seamlessly with other dplyr functions in pipe chains

row_number() is a good fit whenever you need to rank data, select rows by position, or create unique identifiers, while keeping the clean, readable syntax that makes dplyr so effective for data manipulation.