How to use group_by() in R

Published

February 20, 2026

dplyr::group_by() Tutorial

1. Introduction

The group_by() function from the dplyr package groups a data frame by one or more variables. It does not reorder or split the rows; it records which rows belong to which group, and other dplyr functions then operate on each group separately. This makes it the starting point for data aggregation, summary statistics, and other split-apply-combine operations.

You would use group_by() when you need to calculate statistics for different categories in your data, such as finding the average height by species, counting observations per group, or applying transformations within groups. The function is part of the dplyr package, which is included in the tidyverse collection of packages. group_by() doesn't change the values in your data; it adds grouping metadata that subsequent functions use.
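To make the "metadata only" point concrete, here is a minimal sketch. The `toy` tibble is a hypothetical stand-in, not the tutorial's dataset; `is_grouped_df()`, `group_vars()` and `n_groups()` are dplyr helpers for inspecting the grouping.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  species = c("a", "a", "b"),
  height  = c(10, 20, 30)
)

grouped <- toy |> group_by(species)

# The rows and columns are untouched by group_by()
nrow(grouped) == nrow(toy)
identical(grouped$height, toy$height)

# What changed is the grouping metadata
is_grouped_df(grouped)   # TRUE
group_vars(grouped)      # "species"
n_groups(grouped)        # 2
```

Printing `grouped` would also show a `# Groups:` line in the tibble header, which is the only visible sign that the data frame is grouped.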

2. Syntax

group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))

Key arguments:

- .data: A data frame or tibble to group
- ...: Variables to group by (can be column names or expressions)
- .add: If TRUE, adds grouping variables to existing groups instead of overriding them
- .drop: If TRUE (the default for data that has not already been grouped with .drop = FALSE), drops groups formed by factor levels that do not appear in the data
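The effect of `.add` and `.drop` is easier to see on a small example. The sketch below assumes a hypothetical tibble `df` whose `site` factor has an unused level `"c"`; it is an illustration, not output from the penguins data.

```r
library(dplyr)

# Hypothetical data: the factor level "c" never occurs in the rows
df <- tibble(
  site  = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  year  = c(2020, 2021, 2020),
  value = c(1, 2, 3)
)

# .add = FALSE (default): the second group_by() replaces the first
replaced <- df |> group_by(site) |> group_by(year)
group_vars(replaced)   # "year"

# .add = TRUE: the second group_by() extends the existing grouping
extended <- df |> group_by(site) |> group_by(year, .add = TRUE)
group_vars(extended)   # "site" "year"

# .drop = FALSE keeps an empty group for the unused level "c"
kept <- df |> group_by(site, .drop = FALSE) |> summarise(n = n())
kept$n                 # 2 1 0
```

Keeping empty groups is mainly useful when a report must list every category, including those with zero observations.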

3. Example 1: Basic Usage

library(tidyverse)
library(palmerpenguins)

# Basic grouping by species
penguins |>
  group_by(species) |>
  summarise(count = n())

# A tibble: 3 × 2
  species   count
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

In this example, we grouped the penguins dataset by the species variable and then used summarise() with n() to count the number of observations in each group. The group_by() function created three invisible groups (one for each penguin species), and summarise() calculated the count for each group separately, returning one row per group.
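Counting rows per group is common enough that dplyr also provides `count()` as a shortcut. The sketch below uses a hypothetical `toy` tibble to show that the two forms give the same counts.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(species = c("a", "a", "b", "c", "c", "c"))

# Explicit form: group, then summarise with n()
by_hand <- toy |>
  group_by(species) |>
  summarise(n = n())

# Shortcut: count() groups, counts, and ungroups in one step
shortcut <- toy |> count(species)

by_hand$n    # 2 1 3
shortcut$n   # 2 1 3
```

The explicit form is still the one to reach for when you need other summaries alongside the count.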

4. Example 2: Practical Application

# Calculate average body measurements by species and sex
penguin_summary <- penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarise(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(species, sex)

penguin_summary

# A tibble: 6 × 5
  species   sex    count avg_bill_length avg_body_mass
  <fct>     <fct>  <int>           <dbl>         <dbl>
1 Adelie    female    73            37.3         3369.
2 Adelie    male      73            40.4         4043.
3 Chinstrap female    34            46.6         3527.
4 Chinstrap male      34            51.1         3939.
5 Gentoo    female    58            45.6         4680.
6 Gentoo    male      61            49.5         5485.

This practical example demonstrates grouping by multiple variables (species and sex) to calculate comprehensive summary statistics. We filtered out missing sex values, grouped by both variables, and computed multiple summary metrics for each combination.

5. Example 3: Advanced Usage

# Using group_by with mutate for within-group calculations
penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  mutate(
    mass_rank = rank(desc(body_mass_g)),
    mass_percentile = percent_rank(body_mass_g),
    deviation_from_species_mean = body_mass_g - mean(body_mass_g, na.rm = TRUE)
  ) |>
  select(species, body_mass_g, mass_rank, mass_percentile, deviation_from_species_mean) |>
  slice_head(n = 3)

# A tibble: 9 × 5
# Groups:   species [3]
  species body_mass_g mass_rank mass_percentile deviation_from_species_mean
  <fct>         <int>     <dbl>           <dbl>                       <dbl>
1 Adelie         3750         76          0.503                        79.8
2 Adelie         3800         65          0.570                       129.8
3 Adelie         3250        117          0.168                      -420.2
4 Chinstrap      3500         37          0.448                      -187.5
5 Chinstrap      3900          9          0.866                       212.5
6 Chinstrap      3650         24          0.642                       -37.5
7 Gentoo         4650         82          0.339                      -568.1
8 Gentoo         5700          8          0.926                       481.9
9 Gentoo         4725         79          0.372                      -493.1

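This advanced example shows how group_by() works with mutate() to perform within-group calculations. Each calculation (ranking, percentiles, deviations) is performed separately for each species group, and the number of rows is unchanged. Because the data is still grouped when slice_head(n = 3) runs, it keeps the first three rows of each species, which is why the result has nine rows and still reports three groups.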
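A smaller version of the same idea, with numbers that can be checked by hand, may help. The `toy` tibble and its column names are hypothetical; `min_rank()` is used instead of `rank()` so that ties would get integer ranks.

```r
library(dplyr)

# Hypothetical toy data: two groups with known means (20 and 200)
toy <- tibble(
  species = c("a", "a", "b", "b"),
  mass    = c(10, 30, 100, 300)
)

result <- toy |>
  group_by(species) |>
  mutate(
    group_mean = mean(mass),            # computed once per group
    deviation  = mass - group_mean,     # each row minus its own group's mean
    rank_desc  = min_rank(desc(mass))   # 1 = heaviest within the group
  ) |>
  ungroup()                             # drop the grouping when done

result$deviation   # -10 10 -100 100
result$rank_desc   # 2 1 2 1
```

Ending the pipeline with `ungroup()` keeps later steps from being silently computed per group.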

6. Common Mistakes

Mistake 1: Forgetting to ungroup

# Risky - summarise() drops only the last grouping variable
df_grouped <- penguins |> group_by(species, sex)
df_grouped |> summarise(count = n())  # Result is still grouped by species!

# Correct - explicitly drop all groups
df_grouped |> summarise(count = n(), .groups = "drop")
# Or use ungroup()
df_grouped |> summarise(count = n()) |> ungroup()
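To check which grouping survives a `summarise()` call, you can inspect the result with `group_vars()`. The sketch below uses a hypothetical `toy` tibble and compares three values of the `.groups` argument.

```r
library(dplyr)

# Hypothetical toy data with two grouping variables
toy <- tibble(
  species = c("a", "a", "b", "b"),
  sex     = c("f", "m", "f", "m"),
  mass    = c(1, 2, 3, 4)
)

by_both <- toy |> group_by(species, sex)

# "drop_last" peels off only the last grouping variable (sex)
partly <- by_both |> summarise(mean_mass = mean(mass), .groups = "drop_last")
group_vars(partly)     # "species"

# "drop" returns an ungrouped tibble
none <- by_both |> summarise(mean_mass = mean(mass), .groups = "drop")
group_vars(none)       # character(0)

# "keep" preserves the full grouping
all_kept <- by_both |> summarise(mean_mass = mean(mass), .groups = "keep")
group_vars(all_kept)   # "species" "sex"
```

Setting `.groups` explicitly also documents your intent for whoever reads the pipeline later.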

Mistake 2: Using group_by() without a subsequent operation

# Incomplete - group_by() alone only attaches grouping metadata; the rows are unchanged
penguins |> group_by(species)

# Correct - follow with an operation that uses the groups
penguins |> group_by(species) |> summarise(count = n())

Mistake 3: Not handling missing values in grouping variables

# Be careful with NA values in grouping variables
penguins |> group_by(sex) |> summarise(count = n())  # Creates an NA group
# Consider filtering first, e.g. filter(!is.na(sex));
# na.omit() also works but drops rows with an NA in any column
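The difference between a targeted filter and `na.omit()` matters when other columns also contain missing values. Here is a minimal sketch on a hypothetical `toy` tibble in which one row has a missing `sex` and another has a missing `mass`.

```r
library(dplyr)

# Hypothetical toy data: NA in the grouping column and in another column
toy <- tibble(
  sex  = c("f", "m", NA, "f"),
  mass = c(1, NA, 3, 4)
)

# NA in the grouping variable forms its own group
with_na <- toy |> group_by(sex) |> summarise(count = n())
nrow(with_na)      # 3 (f, m, NA)

# Targeted: remove only rows where the grouping variable is missing
targeted <- toy |>
  filter(!is.na(sex)) |>
  group_by(sex) |>
  summarise(count = n())
targeted$count     # 2 1

# Blunt: na.omit() drops any row with an NA anywhere,
# so the "m" row (missing mass) disappears as well
blunt <- toy |>
  na.omit() |>
  group_by(sex) |>
  summarise(count = n())
blunt$sex          # "f"
```

Which approach is appropriate depends on whether rows with missing values in other columns should still be counted.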

7. Related Functions

- ungroup(): Removes grouping from a grouped data frame
- summarise(): Creates summary statistics for each group
- mutate(): Adds new variables, computed within groups when data is grouped
- filter(): Filters rows; conditions are evaluated within each group when data is grouped
- slice(): Selects rows by position within each group when data is grouped
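As a brief illustration of the last two entries, the sketch below shows `filter()` and `slice()` operating within groups. The `toy` tibble is hypothetical.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  species = c("a", "a", "b", "b"),
  mass    = c(10, 30, 300, 100)
)

# filter(): max() is evaluated per group, so this keeps each group's heaviest row
heaviest <- toy |>
  group_by(species) |>
  filter(mass == max(mass)) |>
  ungroup()
heaviest$mass     # 30 300

# slice(): positions are counted within each group, so 1 means "first row per group"
first_rows <- toy |>
  group_by(species) |>
  slice(1) |>
  ungroup()
first_rows$mass   # 10 300
```

Without the `group_by()` step, the same `filter()` call would keep only the single heaviest row overall.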