How to use group_by() in R

Learn how to use group_by() in R with practical examples. Step-by-step guide with code you can copy and run immediately.

Published

February 20, 2026

Introduction

The group_by() function from dplyr, part of R’s tidyverse, lets you perform operations on subsets of your data. It marks the rows of a data frame as belonging to groups defined by one or more variables. The data itself is unchanged, but subsequent verbs such as summarise() and mutate() then operate on each group separately. This is essential for calculating group-specific statistics, creating summaries by category, or performing transformations within groups.

You’ll use group_by() when you need to answer questions like “What’s the average sales by region?” or “How many observations are in each category?” It’s particularly valuable in data analysis workflows where you need to compare metrics across different segments of your data, making it indispensable for exploratory data analysis and reporting.
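As a minimal sketch of the "average sales by region" question, the block below uses a small sales tibble that is invented purely for illustration (the names sales, region, and amount are assumptions, not part of any real dataset). It also shows that group_by() on its own only records the grouping, and that summarise() does the actual collapsing.

```r
library(dplyr)

# Hypothetical sales data, invented for illustration only
sales <- tibble(
  region = c("North", "North", "South", "South", "South"),
  amount = c(100, 200, 50, 150, 100)
)

# group_by() only records the grouping; the rows are unchanged
grouped_sales <- sales |> group_by(region)
group_vars(grouped_sales)  # "region"
n_groups(grouped_sales)    # 2

# summarise() then collapses each group to a single row
sales_by_region <- grouped_sales |>
  summarise(avg_sales = mean(amount), n_obs = n())

sales_by_region
```

The result has one row per region, with the average amount and the number of observations in that region.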

Getting Started

First, let’s load the required packages:

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

Let’s start with a simple example using the Palmer penguins dataset to calculate basic statistics by species:

penguins |>
  group_by(species) |>
  summarise(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_body_mass = mean(body_mass_g, na.rm = TRUE)
  )

This code groups the penguins by species and calculates the count, average bill length, and average body mass for each species. The n() function counts the number of observations in each group, while mean() calculates the average values.
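To see why the example passes na.rm = TRUE, here is a small sketch on a made-up tibble (measurements, group, and value are hypothetical names). It contrasts n(), which counts every row, with a count of non-missing values, and shows what mean() returns with and without na.rm.

```r
library(dplyr)

# Hypothetical measurements with one missing value in group "a"
measurements <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, NA, 3, 5)
)

na_demo <- measurements |>
  group_by(group) |>
  summarise(
    count = n(),                            # counts rows, including the NA row
    n_non_missing = sum(!is.na(value)),     # counts only observed values
    mean_default = mean(value),             # NA for any group containing an NA
    mean_na_rm = mean(value, na.rm = TRUE)  # ignores the NA
  )

na_demo
```

Group "a" gets a missing mean_default because a single NA propagates through mean(), while mean_na_rm is computed from the one observed value.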

We can also group by multiple variables:

penguins |>
  group_by(species, island) |>
  summarise(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    .groups = "drop"
  )

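The .groups = "drop" argument returns an ungrouped result. Without it, summarise() drops only the last grouping variable (here island), so the output would still be grouped by species. Dropping all grouping is often what you want.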
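The difference is easy to inspect with group_vars(). The sketch below uses a tiny hypothetical tibble (obs, with made-up species, island, and mass columns) rather than the penguins data so that it runs with dplyr alone.

```r
library(dplyr)

# Hypothetical data with two grouping variables
obs <- tibble(
  species = c("A", "A", "B", "B"),
  island  = c("x", "y", "x", "x"),
  mass    = c(10, 20, 30, 40)
)

# Default: only the last grouping variable (island) is dropped;
# dplyr may print a message about the grouping that remains
still_grouped <- obs |>
  group_by(species, island) |>
  summarise(avg_mass = mean(mass))

# With .groups = "drop" the result is an ordinary ungrouped tibble
dropped <- obs |>
  group_by(species, island) |>
  summarise(avg_mass = mean(mass), .groups = "drop")

group_vars(still_grouped)  # "species"
group_vars(dropped)        # character(0)
```

Both results contain the same rows and values; only the grouping metadata differs.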

Example 2: Practical Application

Let’s build a more comprehensive analysis that shows group_by() in a realistic scenario. We’ll summarise penguin characteristics by species and sex, first removing rows where sex is missing:

penguin_analysis <- penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarise(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    sd_bill_length = sd(bill_length_mm, na.rm = TRUE),
    avg_bill_depth = mean(bill_depth_mm, na.rm = TRUE),
    avg_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    max_body_mass = max(body_mass_g, na.rm = TRUE),
    min_body_mass = min(body_mass_g, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(species, sex)

penguin_analysis

We can also use group_by() with mutate() to add calculated columns within groups:

penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  mutate(
    body_mass_centered = body_mass_g - mean(body_mass_g, na.rm = TRUE),
    body_mass_rank = rank(desc(body_mass_g)),
    above_avg_mass = body_mass_g > mean(body_mass_g, na.rm = TRUE)
  ) |>
  select(species, body_mass_g, body_mass_centered, body_mass_rank, above_avg_mass) |>
  arrange(species, desc(body_mass_g))

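This example centers body mass values around each species’ mean, ranks penguins within their species by body mass (heaviest first, with tied values sharing an averaged rank), and creates a logical flag for above-average individuals. Because mutate() keeps every row, the result has one row per penguin rather than one per group.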
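To make the effect of grouping on mutate() concrete, the sketch below centers the same column twice on a small invented tibble (weights, species, and mass are hypothetical): once against the overall mean and once against each group's mean. It swaps in dplyr's min_rank() for rank(), an alternative that returns integer ranks and gives tied values the smallest rank.

```r
library(dplyr)

# Hypothetical weights for two species
weights <- tibble(
  species = c("A", "A", "B", "B"),
  mass    = c(10, 20, 100, 300)
)

centered <- weights |>
  mutate(centered_overall = mass - mean(mass)) |>  # overall mean is 107.5
  group_by(species) |>
  mutate(
    centered_in_species = mass - mean(mass),       # group means: A = 15, B = 200
    rank_in_species = min_rank(desc(mass))         # 1 = heaviest in the species
  ) |>
  ungroup()

centered
```

The same expression, mass - mean(mass), gives different values depending on whether the data is grouped when mutate() runs.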

For a final practical example, let’s calculate proportions within groups:

penguins |>
  filter(!is.na(sex)) |>
  group_by(species) |>
  count(sex) |>
  mutate(
    total_per_species = sum(n),
    proportion = n / total_per_species,
    percentage = round(proportion * 100, 1)
  ) |>
  ungroup()
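A quick way to validate within-group proportions is to confirm that they sum to 1 in every group. The sketch below does this on a small hypothetical tibble (animals is invented for illustration) and uses count(species, sex) followed by group_by(species), which is one equivalent way to arrange the same steps.

```r
library(dplyr)

# Hypothetical category data
animals <- tibble(
  species = c("A", "A", "A", "B", "B"),
  sex     = c("female", "male", "male", "female", "female")
)

proportions <- animals |>
  count(species, sex) |>                # one row per species/sex combination
  group_by(species) |>
  mutate(proportion = n / sum(n)) |>    # sum(n) is the total within each species
  ungroup()

# Sanity check: proportions should add up to 1 within every species
check <- proportions |>
  group_by(species) |>
  summarise(total = sum(proportion))

check
```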

Summary

The group_by() function is essential for data analysis in R, enabling you to perform calculations on subsets of your data efficiently. Key takeaways include:

- Use group_by() with summarise() to calculate statistics for each group
- Combine with mutate() to add group-wise calculations as new columns
- Group by multiple variables for more detailed analysis
- Pass na.rm = TRUE to functions like mean() and sd() when a column contains missing values; otherwise the result for any group with a missing value is NA
- Consider using .groups = "drop" in summarise() to remove grouping when finished

Grouping persists until you remove it, so call ungroup() or use .groups = "drop" when you’re done with group operations. Otherwise, later steps keep operating group by group, which can produce unexpected results. Combined with other tidyverse functions, group_by() gives you a powerful toolkit for data manipulation and analysis.
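As an illustration of the kind of surprise leftover grouping can cause, the sketch below computes a share column twice on a hypothetical tibble (obs and n_birds are invented names). The only difference between the two pipelines is the call to ungroup().

```r
library(dplyr)

# Hypothetical counts
obs <- tibble(
  species = c("A", "A", "B"),
  island  = c("x", "y", "x"),
  n_birds = c(10, 30, 60)
)

# Still grouped by species: sum() is computed within each species
grouped_share <- obs |>
  group_by(species) |>
  mutate(share = n_birds / sum(n_birds))

# Ungrouped first: sum() is computed over all rows
overall_share <- obs |>
  group_by(species) |>
  ungroup() |>
  mutate(share = n_birds / sum(n_birds))

grouped_share$share  # 0.25 0.75 1.00
overall_share$share  # 0.10 0.30 0.60
```

If you intended shares of the overall total, the grouped version silently gives the wrong numbers without raising any error.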
