How to add row numbers within each group in dplyr
Introduction
Adding row numbers within groups is a fundamental data manipulation task in R. This technique allows you to assign sequential numbers to observations within each category or group of your dataset. You’ll commonly use this when you need to identify the first, last, or nth observation within groups, rank items within categories, or create unique identifiers for grouped data.
The row_number() function in dplyr makes this process straightforward and efficient. Combined with group_by(), it calculates row numbers that restart at 1 for each new group. This is particularly useful for time series data, survey responses, or any dataset where you need to track the order of observations within specific categories.
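To make the restart-at-1 behavior concrete, here is a minimal sketch on a small made-up data frame (the `scores` table and its column names are hypothetical and used only for illustration):

```r
library(dplyr)

# Hypothetical toy data: two teams with a few rows each
scores <- data.frame(
  team  = c("a", "a", "b", "b", "b"),
  score = c(10, 20, 5, 15, 25)
)

numbered <- scores |>
  group_by(team) |>                  # define the groups
  mutate(row_num = row_number()) |>  # number rows within each group
  ungroup()

numbered$row_num
# 1 2 1 2 3
```

The counter runs 1, 2 for team "a" and starts again at 1 for team "b".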
Getting Started
First, let’s load the required packages and prepare our data:
library(tidyverse)
library(palmerpenguins)
# View the first few rows of penguins data
head(penguins)
Example 1: Basic Usage
Let’s start with a simple example using the penguins dataset. We’ll add row numbers within each species group:
penguins_numbered <- penguins |>
  group_by(species) |>
  mutate(row_num = row_number()) |>
  ungroup()
# View the first 10 rows; row_num counts up within each species
penguins_numbered |>
  select(species, island, row_num) |>
  head(10)
You can also sort the data before adding row numbers to control the ordering:
penguins_sorted <- penguins |>
  group_by(species) |>
  arrange(bill_length_mm, .by_group = TRUE) |>
  mutate(row_num = row_number()) |>
  ungroup()
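The pipeline above sorts the rows first and then numbers them. If you would rather leave the row order untouched, one alternative is to pass the ordering column to row_number() directly, which ranks rows by that column with ties broken by position. A minimal sketch, assuming a hypothetical `bills` data frame:

```r
library(dplyr)

# Hypothetical toy data, deliberately unsorted
bills <- data.frame(
  species = c("x", "x", "x", "y", "y"),
  bill_mm = c(40.1, 38.5, 42.0, 50.2, 47.3)
)

ranked <- bills |>
  group_by(species) |>
  mutate(row_num = row_number(bill_mm)) |>  # rank by bill_mm, rows stay in place
  ungroup()

ranked$row_num
# 2 1 3 2 1
```

One difference to keep in mind: with this form, rows where the ordering column is missing receive NA instead of a number, whereas arrange() followed by row_number() numbers them last.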
# Check the ordering within groups
penguins_sorted |>
  select(species, bill_length_mm, row_num) |>
  filter(species == "Adelie") |>
  head(8)
Example 2: Practical Application
Now let’s explore a more practical scenario. Suppose we want to rank penguins by body mass within each species and island combination, record the size of each group, and flag the heaviest penguin in each one:
penguin_rankings <- penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species, island) |>
  arrange(desc(body_mass_g), .by_group = TRUE) |>
  mutate(
    mass_rank = row_number(),
    total_in_group = n(),
    is_heaviest = mass_rank == 1
  ) |>
  ungroup()
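Because row_number() always assigns distinct numbers, exactly one row per group ends up flagged as the heaviest, even if two animals share the top body mass. If tied rows should share a rank, min_rank() is one option. A small sketch on made-up data (the `masses` data frame is hypothetical):

```r
library(dplyr)

# Hypothetical toy data with a tie for the heaviest animal
masses <- data.frame(
  colony = c("g1", "g1", "g1"),
  mass_g = c(5000, 5000, 4200)
)

tied <- masses |>
  group_by(colony) |>
  mutate(
    by_row_number = row_number(desc(mass_g)),  # 1, 2, 3: tie broken by position
    by_min_rank   = min_rank(desc(mass_g))     # 1, 1, 3: tied rows share a rank
  ) |>
  ungroup()
```

Filtering on `by_min_rank == 1` would keep both tied rows, while `by_row_number == 1` keeps only the first.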
# Show the heaviest penguin in each species-island group
penguin_rankings |>
  filter(is_heaviest) |>
  select(species, island, body_mass_g, mass_rank, total_in_group)
You can also combine row_number() with n() and filter() to build more detailed rankings, such as a within-group percentile and a top-3 cut:
top_penguins <- penguins |>
  filter(!is.na(bill_length_mm), !is.na(body_mass_g)) |>
  # Note: rows with a missing sex are kept and form their own group
  group_by(species, sex) |>
  arrange(desc(bill_length_mm), .by_group = TRUE) |>
  mutate(
    bill_rank = row_number(),
    # Rank 1 (longest bill) maps to the 100th percentile
    percentile = round(100 * (n() - bill_rank + 1) / n(), 1)
  ) |>
  filter(bill_rank <= 3) |>
  ungroup()
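When only the top rows per group are needed and the rank column itself is not, dplyr's slice_max() is a shorter alternative to ranking and then filtering. This is a different technique from the row_number() approach above, sketched here on a hypothetical `bills` data frame:

```r
library(dplyr)

# Hypothetical toy data standing in for a grouped measurement table
bills <- data.frame(
  species = c("x", "x", "x", "y", "y", "y"),
  bill_mm = c(40.1, 38.5, 42.0, 50.2, 47.3, 49.0)
)

# Keep the 2 longest bills per species;
# with_ties = FALSE caps each group at exactly n rows
top_two <- bills |>
  group_by(species) |>
  slice_max(bill_mm, n = 2, with_ties = FALSE) |>
  ungroup()
```

The row_number() version remains the better fit when you also want the rank or a derived value such as a percentile in the output.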
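# Display top 3 penguins by bill length within each species-sex group
top_penguins |>
  select(species, sex, bill_length_mm, bill_rank, percentile)
Here’s another useful application: finding the first and last observation in each group by year. The penguins data records only the study year, not a full date, so "first" and "last" here mean the first and last rows after sorting by year: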
penguin_timeline <- penguins |>
  group_by(species, island) |>
  arrange(year, .by_group = TRUE) |>
  mutate(
    obs_number = row_number(),
    total_obs = n(),
    is_first = obs_number == 1,
    is_last = obs_number == n()  # the last row number equals the group size
  ) |>
  ungroup()
# Show first and last observations for each group
penguin_timeline |>
  filter(is_first | is_last) |>
  select(species, island, year, obs_number, total_obs, is_first, is_last) |>
  arrange(species, island, year)
Summary
Adding row numbers within groups using dplyr is accomplished through the combination of group_by() and row_number() functions. Key takeaways include:
- Use group_by() to define your grouping variables, then mutate() with row_number() to add sequential numbers
- Always arrange() your data before adding row numbers if order matters
- Row numbers restart at 1 for each new group
- Combine with functions like n(), filtering, and logical conditions for advanced applications
- Remember to ungroup() when finished to avoid unexpected behavior in subsequent operations