How to add row numbers within each group in dplyr

Learn how to add row numbers within each group in dplyr with this R tutorial, which includes practical examples and code snippets.
Published June 3, 2022

Introduction

Adding row numbers within groups is a fundamental data manipulation task in R. This technique allows you to assign sequential numbers to observations within each category or group of your dataset. You’ll commonly use this when you need to identify the first, last, or nth observation within groups, rank items within categories, or create unique identifiers for grouped data.

The row_number() function in dplyr handles this directly. Called inside mutate() on a data frame grouped with group_by(), it returns row numbers that restart at 1 for each group. This is particularly useful for time series data, survey responses, or any dataset where you need to track the order of observations within specific categories.
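
As a minimal sketch of that restart behavior, the toy tibble below has two groups. The data and the names toy and toy_numbered are made up for illustration and are not part of the penguins dataset.

```r
library(dplyr)

# Hypothetical toy data: two groups, "a" and "b"
toy <- tibble(
  group = c("a", "a", "a", "b", "b"),
  value = c(10, 20, 30, 40, 50)
)

toy_numbered <- toy |>
  group_by(group) |>
  mutate(row_num = row_number()) |>  # counts 1, 2, 3 within "a", then 1, 2 within "b"
  ungroup()

toy_numbered$row_num
# 1 2 3 1 2
```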

Getting Started

First, let’s load the required packages and prepare our data:

library(tidyverse)
library(palmerpenguins)

# View the first few rows of penguins data
head(penguins)

Example 1: Basic Usage

Let’s start with a simple example using the penguins dataset. We’ll add row numbers within each species group:

penguins_numbered <- penguins |>
  group_by(species) |>
  mutate(row_num = row_number()) |>
  ungroup()

# View the first 10 rows; to inspect one species, filter on species first
penguins_numbered |>
  select(species, island, row_num) |>
  head(10)

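Row numbers follow the current row order, so sort the data within each group first if the numbering should reflect a particular variable. Setting .by_group = TRUE makes arrange() sort by the grouping variable before the sorting column: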

penguins_sorted <- penguins |>
  group_by(species) |>
  arrange(bill_length_mm, .by_group = TRUE) |>
  mutate(row_num = row_number()) |>
  ungroup()

# Check the ordering within groups
penguins_sorted |>
  select(species, bill_length_mm, row_num) |>
  filter(species == "Adelie") |>
  head(8)
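
If you would rather not reorder the rows, row_number() also accepts a vector to rank by. The sketch below uses made-up toy data (the names toy and toy_ranked are illustrative) to show numbering by value while the rows stay where they are. Note that missing values in the ranked vector receive a row number of NA.

```r
library(dplyr)

# Hypothetical toy data, deliberately stored out of order
toy <- tibble(
  group = c("a", "a", "a", "b", "b"),
  value = c(30, 10, 20, 50, 40)
)

toy_ranked <- toy |>
  group_by(group) |>
  # Rank by value within each group; the rows keep their original positions
  mutate(row_num = row_number(value)) |>
  ungroup()

toy_ranked$row_num
# 3 1 2 2 1
```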

Example 2: Practical Application

Now let’s explore a more practical scenario. Suppose we want to rank penguins by body mass within each species and island combination, then use that rank to identify the heaviest penguin in each group:

penguin_rankings <- penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species, island) |>
  arrange(desc(body_mass_g), .by_group = TRUE) |>
  mutate(
    mass_rank = row_number(),
    total_in_group = n(),
    is_heaviest = mass_rank == 1
  ) |>
  ungroup()

# Show the heaviest penguin in each species-island group
penguin_rankings |>
  filter(is_heaviest) |>
  select(species, island, body_mass_g, mass_rank, total_in_group)
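
If only the top row per group is needed and the rank column itself is not, dplyr's slice_max() is a shorter alternative. This is a different verb from the row_number() approach used above, sketched here on made-up toy data (toy and heaviest are illustrative names). Setting with_ties = FALSE returns exactly one row per group even when the maximum is tied.

```r
library(dplyr)

# Hypothetical toy data with a tie for the maximum in group "b"
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  mass  = c(3500, 4200, 5100, 4800, 5100)
)

heaviest <- toy |>
  group_by(group) |>
  # Keep the single row with the largest mass in each group
  slice_max(mass, n = 1, with_ties = FALSE) |>
  ungroup()

heaviest
```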

You can also combine row_number() with n() and filter() to compute a percentile-style score and keep only the top few rows in each group:

top_penguins <- penguins |>
  # Drop rows with a missing bill length or sex so no NA sex group is formed
  filter(!is.na(bill_length_mm), !is.na(sex)) |>
  group_by(species, sex) |>
  arrange(desc(bill_length_mm), .by_group = TRUE) |>
  mutate(
    bill_rank = row_number(),
    percentile = round(100 * (n() - bill_rank + 1) / n(), 1)
  ) |>
  filter(bill_rank <= 3) |>
  ungroup()

# Display top 3 penguins by bill length within each species-sex group
top_penguins |>
  select(species, sex, bill_length_mm, bill_rank, percentile)
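
One detail worth knowing for rankings like this: row_number() always assigns distinct numbers, so tied values are separated by their position in the data. The sketch below, on made-up scores (scores and ranked are illustrative names), compares it with dplyr's min_rank() and dense_rank(), which give tied values the same rank.

```r
library(dplyr)

# Hypothetical toy data with two tied scores of 85
scores <- tibble(
  group = c("a", "a", "a", "a"),
  score = c(90, 85, 85, 70)
)

ranked <- scores |>
  group_by(group) |>
  mutate(
    rn = row_number(desc(score)),  # ties broken by position: 1 2 3 4
    mr = min_rank(desc(score)),    # ties share a rank, gap after: 1 2 2 4
    dr = dense_rank(desc(score))   # ties share a rank, no gap: 1 2 2 3
  ) |>
  ungroup()

ranked
```

With row_number(), a filter such as rank <= 3 always returns at most three rows per group; with min_rank() it can return more when values are tied at the cutoff.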

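Another useful application is flagging the first and last observation in each group by year. The dataset records only the year, so among penguins observed in the same year, which row counts as first or last depends on the existing row order: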

penguin_timeline <- penguins |>
  group_by(species, island) |>
  arrange(year, .by_group = TRUE) |>
  mutate(
    obs_number = row_number(),
    total_obs = n(),
    is_first = obs_number == 1,
    is_last = obs_number == n()
  ) |>
  ungroup()

# Show first and last observations for each group
penguin_timeline |>
  filter(is_first | is_last) |>
  select(species, island, year, obs_number, total_obs, is_first, is_last) |>
  arrange(species, island, year)
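
When the helper columns are not needed afterwards, the same first-and-last selection can be written as a single filter(). This is a minimal sketch on made-up data (visits and ends are illustrative names), relying on row_number() and n() being evaluated per group inside filter().

```r
library(dplyr)

# Hypothetical toy data: years stored out of order within each group
visits <- tibble(
  group = c("a", "a", "a", "b", "b"),
  year  = c(2008, 2007, 2009, 2009, 2007)
)

ends <- visits |>
  group_by(group) |>
  arrange(year, .by_group = TRUE) |>
  # Keep the first and the last row of each group after sorting by year
  filter(row_number() == 1 | row_number() == n()) |>
  ungroup()

ends$year
# 2007 2009 2007 2009
```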

Summary

Adding row numbers within groups in dplyr comes down to combining group_by() with row_number(). Key takeaways include:

  • Use group_by() to define your grouping variables, then mutate() with row_number() to add sequential numbers
  • Use arrange() to sort your data before adding row numbers whenever the order matters
  • Row numbers restart at 1 for each new group
  • Combine with functions like n(), filtering, and logical conditions for advanced applications
  • Remember to ungroup() when finished to avoid unexpected behavior in subsequent operations
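
To make the last point concrete, the sketch below shows how a forgotten ungroup() changes a later step. The data and names (toy, still_grouped, per_group, overall) are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: two groups, "a" and "b"
toy <- tibble(
  group = c("a", "a", "a", "b", "b"),
  value = c(10, 20, 30, 40, 50)
)

# Row numbers added, but the result is still grouped
still_grouped <- toy |>
  group_by(group) |>
  mutate(row_num = row_number())

# On the grouped data, summarise() returns one total per group (60 and 90)
per_group <- still_grouped |>
  summarise(total = sum(value))

# After ungroup(), summarise() returns a single overall total (150)
overall <- still_grouped |>
  ungroup() |>
  summarise(total = sum(value))
```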

This technique is invaluable for data analysis tasks involving rankings, identifying specific observations within groups, and creating structured datasets for further analysis.