dplyr row_number(): Add unique row numbers to a dataframe


Learn how to use dplyr row_number() to add unique row numbers to a dataframe, with worked examples on the palmerpenguins dataset.

Published

January 23, 2022

Introduction

The row_number() function in dplyr is a window function that assigns sequential integers, starting at 1, to the rows of a dataframe. Unlike base R’s rownames(), row_number() works inside dplyr verbs such as mutate() and filter(), and on grouped data it restarts at 1 within each group.

This makes it useful when you need to create unique identifiers, rank observations, or select rows by position. Called with no arguments, it numbers rows in their current order; called with a column, it ranks the values of that column and breaks ties by row order. In both cases it fits into a readable, chainable pipeline.

Getting Started

Let’s load the required packages for this tutorial:

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The simplest use case is adding row numbers to an entire dataframe. Here’s how to use row_number() with the penguins dataset:

penguins_numbered <- penguins |>
  mutate(row_id = row_number())

head(penguins_numbered)
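To see the result without scrolling through the penguins columns, here is a minimal sketch on a small made-up tibble (the `toy` data and its column names are hypothetical, not part of the tutorial's dataset). It contrasts `row_number()` with no arguments, which follows the current row order, with `row_number(value)`, which ranks a column.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  name  = c("a", "b", "c", "d"),
  value = c(10, 40, 20, 30)
)

numbered <- toy |>
  mutate(
    row_id     = row_number(),      # position in the current row order
    value_rank = row_number(value)  # rank of `value`, smallest value gets 1
  )

numbered
```

Here `row_id` simply counts 1 to 4, while `value_rank` depends on the values in `value` rather than on where each row sits.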

You can also number rows within groups and then filter on the result. For example, to keep the first 5 rows of each species, in the dataset’s existing row order:

first_five_per_species <- penguins |>
  group_by(species) |>
  mutate(species_row = row_number()) |>
  filter(species_row <= 5) |>
  select(species, island, species_row, everything())

first_five_per_species
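If you only need the rows and not the numbering column itself, `slice_head()` on grouped data should give the same selection. The sketch below checks this on a hypothetical `toy` tibble (names and values are made up for illustration) by keeping the first two rows per group both ways.

```r
library(dplyr)

# Hypothetical toy data: three rows in group "x", two in group "y"
toy <- tibble(
  group = c("x", "x", "x", "y", "y"),
  value = 1:5
)

# Approach used in this tutorial: number rows per group, then filter
via_row_number <- toy |>
  group_by(group) |>
  filter(row_number() <= 2) |>
  ungroup()

# Alternative: take the first n rows of each group directly
via_slice <- toy |>
  group_by(group) |>
  slice_head(n = 2) |>
  ungroup()

via_row_number
via_slice
```

The `row_number()` version is the better fit when you want to keep the position as a column for later steps, as `species_row` is kept above.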

Example 2: Practical Application

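Let’s look at a more realistic scenario: ranking penguins by body mass within each species and pulling out the three heaviest. Because row_number() with no arguments numbers rows in their current order, the data is sorted by descending body mass before numbering, so rank 1 is the heaviest penguin of its species. Note that percentile_rank is measured from the top: a low value means a heavy penguin.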

penguin_rankings <- penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  # arrange() ignores grouping by default, so this sorts the whole table
  arrange(desc(body_mass_g)) |>
  mutate(
    mass_rank = row_number(),  # 1 = heaviest within each species
    total_in_species = n(),
    # position from the top as a percentage of the species
    percentile_rank = round((mass_rank / total_in_species) * 100, 1)
  ) |>
  select(species, island, sex, body_mass_g, mass_rank, percentile_rank) |>
  ungroup()

top_penguins <- penguin_rankings |>
  filter(mass_rank <= 3)

top_penguins
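One thing to keep in mind with this kind of ranking is that `row_number()` always returns distinct numbers, so penguins with identical body mass get different ranks depending on row order. The sketch below uses a hypothetical `toy` tibble with a deliberate tie to compare `row_number()` with dplyr's `min_rank()` and `dense_rank()`. It also passes `desc(mass)` directly to the ranking function, which avoids the separate `arrange()` step.

```r
library(dplyr)

# Hypothetical masses with a tie at 5000
toy <- tibble(
  id   = c("p1", "p2", "p3", "p4"),
  mass = c(5000, 4200, 5000, 3900)
)

ranked <- toy |>
  mutate(
    rn = row_number(desc(mass)),  # ties broken by row order: always unique
    mr = min_rank(desc(mass)),    # ties share the lowest rank, gaps follow
    dr = dense_rank(desc(mass))   # ties share a rank, no gaps
  )

ranked
```

With `row_number()`, `filter(rn <= 3)` returns exactly three rows. With `min_rank()`, a tie at the cutoff can return more, which may or may not be what you want for a top-n list.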

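We can also combine row_number() with grouped summaries such as n() and mean() inside the same mutate() call. Here each penguin is ranked within its species and sex, and compared against the averages of that group: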

penguin_analysis <- penguins |>
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm)) |>
  group_by(species, sex) |>
  arrange(desc(body_mass_g)) |>
  mutate(
    rank_in_group = row_number(),  # 1 = heaviest in its species/sex group
    is_top_third = rank_in_group <= ceiling(n() / 3),
    # differences from the group mean (computed before the filter below)
    mass_vs_group_avg = body_mass_g - mean(body_mass_g),
    flipper_vs_group_avg = flipper_length_mm - mean(flipper_length_mm)
  ) |>
  filter(rank_in_group <= 5) |>
  select(species, sex, rank_in_group, body_mass_g, flipper_length_mm,
         is_top_third, mass_vs_group_avg, flipper_vs_group_avg) |>
  ungroup()

penguin_analysis
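Both examples in this section rely on `arrange()` being called after `group_by()`. By default `arrange()` ignores the grouping and sorts the whole table, yet the per-group numbering still comes out right because `row_number()` follows the row order within each group. The sketch below illustrates this on a hypothetical `toy` tibble, and shows `.by_group = TRUE` as an option when you also want the rows of each group kept together.

```r
library(dplyr)

# Hypothetical toy data with interleaved groups
toy <- tibble(
  group = c("a", "b", "a", "b"),
  value = c(1, 4, 3, 2)
)

# Default: the table is sorted by value only; groups stay interleaved
ranked <- toy |>
  group_by(group) |>
  arrange(desc(value)) |>
  mutate(rank_in_group = row_number()) |>
  ungroup()

# .by_group = TRUE sorts by group first, then by value within each group
ranked_by_group <- toy |>
  group_by(group) |>
  arrange(desc(value), .by_group = TRUE) |>
  mutate(rank_in_group = row_number()) |>
  ungroup()

ranked
ranked_by_group
```

Each row gets the same `rank_in_group` in both results; only the order in which the rows are printed differs.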

Summary

The row_number() function is an essential tool for adding sequential identifiers and creating rankings in your data analysis workflow. Key takeaways include:

Use row_number() within mutate() to add row identifiers
Combine with group_by() to create row numbers within groups
Pair with arrange() to control the ordering before numbering
Integrate with filter() to select top-n observations
Chain it with other dplyr verbs in a single pipeline

This function excels in scenarios requiring data ranking, sampling, or creating unique identifiers while maintaining the clean, readable syntax that makes dplyr so powerful for data manipulation tasks.
