How to use group_by() in R

Published

February 20, 2026

dplyr::group_by() Tutorial

1. Introduction

The group_by() function from the dplyr package is a fundamental tool for grouping data by one or more variables. It creates invisible groups within your data frame that can then be used with other dplyr functions to perform operations on each group separately. This function is essential for data aggregation, summary statistics, and split-apply-combine operations.

You would use group_by() when you need to calculate statistics for different categories in your data, such as finding the average height by species, counting observations per group, or applying transformations within groups. The function is part of the dplyr package, which is included in the tidyverse collection of packages. group_by() doesn’t change your data itself but adds grouping metadata that subsequent functions can utilize.
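To make "grouping metadata" concrete, here is a minimal sketch on a small made-up data frame (the name `toy` and its columns are illustrative assumptions, not part of the tutorial's data). It checks that the rows and columns are unchanged and that only the recorded grouping differs.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- tibble(
  species = c("a", "a", "b"),
  height  = c(10, 12, 20)
)

grouped <- toy |> group_by(species)

# Same number of rows and columns before and after grouping
dim(toy)
dim(grouped)

# Only the metadata differs
is_grouped_df(grouped)   # TRUE
group_vars(grouped)      # "species"
n_groups(grouped)        # 2

# ungroup() removes the grouping metadata again
is_grouped_df(ungroup(grouped))   # FALSE
```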

2. Syntax

group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))

Key arguments:
- .data: A data frame or tibble to group
- ...: Variables to group by (column names, or expressions computed from them)
- .add: If TRUE, adds the new grouping variables to any existing groups instead of replacing them
- .drop: If TRUE, drops groups formed by factor levels that do not appear in the data; if FALSE, keeps them as empty groups
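The effect of .add and .drop is easiest to see on a tiny example. The sketch below uses a made-up data frame (`toy`, `g1`, `g2`, and `value` are hypothetical names) in which the factor level "c" has no rows.

```r
library(dplyr)

# Hypothetical toy data; "c" is a factor level with no rows
toy <- tibble(
  g1    = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  g2    = c("x", "y", "x"),
  value = c(1, 2, 3)
)

# By default, a second group_by() replaces the existing grouping
replaced <- toy |> group_by(g1) |> group_by(g2)
group_vars(replaced)   # "g2"

# .add = TRUE keeps g1 and adds g2
added <- toy |> group_by(g1) |> group_by(g2, .add = TRUE)
group_vars(added)      # "g1" "g2"

# Default .drop: the empty level "c" produces no group (2 rows)
dropped <- toy |> group_by(g1) |> summarise(n = n())

# .drop = FALSE: "c" is kept as an empty group with n = 0 (3 rows)
kept <- toy |> group_by(g1, .drop = FALSE) |> summarise(n = n())
```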

3. Example 1: Basic Usage

library(tidyverse)
library(palmerpenguins)

# Basic grouping by species
penguins |>
  group_by(species) |>
  summarise(count = n())
# A tibble: 3 × 2
  species   count
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

In this example, we grouped the penguins dataset by the species variable and then used summarise() with n() to count the number of observations in each group. The group_by() function created three invisible groups (one for each penguin species), and summarise() calculated the count for each group separately, returning one row per group.
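For plain counts, dplyr also provides count(), which combines the grouping and the n() summary in one call. The sketch below compares the two forms on a made-up data frame (`toy` is a hypothetical name); it is meant as a shorthand to be aware of, not a replacement for group_by() in general.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- tibble(species = c("a", "a", "b"))

# Long form: group, then count rows per group
long_form <- toy |>
  group_by(species) |>
  summarise(n = n())

# Short form: count() groups and counts in one step
short_form <- toy |> count(species)

# Both give the same groups and the same counts
identical(long_form$species, short_form$species)
identical(long_form$n, short_form$n)
```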

4. Example 2: Practical Application

# Calculate average body measurements by species and sex
penguin_summary <- penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarise(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(species, sex)

penguin_summary
# A tibble: 6 × 5
  species   sex    count avg_bill_length avg_body_mass
  <fct>     <fct>  <int>           <dbl>         <dbl>
1 Adelie    female    73            37.3         3369.
2 Adelie    male      73            40.4         4043.
3 Chinstrap female    34            46.6         3527.
4 Chinstrap male      34            51.1         3939.
5 Gentoo    female    58            45.6         4680.
6 Gentoo    male      61            49.5         5485.

This practical example demonstrates grouping by multiple variables (species and sex) to calculate comprehensive summary statistics. We filtered out missing sex values, grouped by both variables, and computed multiple summary metrics for each combination.

5. Example 3: Advanced Usage

# Using group_by with mutate for within-group calculations
penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  mutate(
    mass_rank = rank(desc(body_mass_g)),
    mass_percentile = percent_rank(body_mass_g),
    deviation_from_species_mean = body_mass_g - mean(body_mass_g, na.rm = TRUE)
  ) |>
  select(species, body_mass_g, mass_rank, mass_percentile, deviation_from_species_mean) |>
  slice_head(n = 3)
# A tibble: 9 × 5
# Groups:   species [3]
  species body_mass_g mass_rank mass_percentile deviation_from_species_mean
  <fct>         <int>     <dbl>           <dbl>                       <dbl>
1 Adelie         3750         76          0.503                        79.8
2 Adelie         3800         65          0.570                       129.8
3 Adelie         3250        117          0.168                      -420.2
4 Chinstrap      3500         37          0.448                      -187.5
5 Chinstrap      3900          9          0.866                       212.5
6 Chinstrap      3650         24          0.642                       -37.5
7 Gentoo         4650         82          0.339                      -568.1
8 Gentoo         5700          8          0.926                       481.9
9 Gentoo         4725         79          0.372                      -493.1

This advanced example shows how group_by() works with mutate() to perform within-group calculations. Each calculation (ranking, percentiles, deviations) is performed separately for each species group.
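To see what "performed separately for each species group" changes in practice, the sketch below runs the same mutate() with and without grouping on a small made-up data frame (`toy`, `species`, and `mass` are hypothetical). The numbers are chosen so the group means are easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- tibble(
  species = c("a", "a", "b", "b"),
  mass    = c(10, 20, 100, 300)
)

# Without grouping, mean(mass) is the overall mean (107.5)
overall <- toy |>
  mutate(deviation = mass - mean(mass))

# With grouping, mean(mass) is computed once per species (15 and 200)
by_species <- toy |>
  group_by(species) |>
  mutate(deviation = mass - mean(mass)) |>
  ungroup()

overall$deviation      # -97.5, -87.5, -7.5, 192.5
by_species$deviation   # -5, 5, -100, 100
```

Unlike summarise(), mutate() keeps every input row, so the grouped result has the same number of rows as the input.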

6. Common Mistakes

Mistake 1: Forgetting to ungroup

# Wrong - with several grouping variables, summarise() drops only the last one
df_grouped <- penguins |> group_by(species, sex)
df_grouped |> summarise(count = n())  # Still grouped by species!

# Correct - explicitly drop all grouping
df_grouped |> summarise(count = n(), .groups = "drop")
# Or use ungroup()
df_grouped |> summarise(count = n()) |> ungroup()
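A quick way to check which grouping is left after a step is group_vars(). The sketch below uses a small made-up data frame (`toy` and its columns are hypothetical) to compare summarise() with one and two grouping variables, the .groups = "drop" option, and mutate().

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- tibble(
  species = c("a", "a", "b", "b"),
  sex     = c("f", "m", "f", "m"),
  mass    = c(10, 20, 100, 300)
)

# One grouping variable: summarise() returns an ungrouped result
one <- toy |> group_by(species) |> summarise(n = n())
group_vars(one)       # character(0)

# Two grouping variables: only the last one (sex) is dropped
two <- toy |> group_by(species, sex) |> summarise(n = n())
group_vars(two)       # "species"

# .groups = "drop" removes all grouping
dropped <- toy |>
  group_by(species, sex) |>
  summarise(n = n(), .groups = "drop")
group_vars(dropped)   # character(0)

# mutate() never drops groups, so ungroup() is needed afterwards
kept <- toy |> group_by(species) |> mutate(share = mass / sum(mass))
group_vars(kept)      # "species"
```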

Mistake 2: Using group_by() without a subsequent operation

# Wrong - group_by() alone only records the grouping; the rows are unchanged
penguins |> group_by(species)

# Correct - follow with an operation
penguins |> group_by(species) |> summarise(count = n())

Mistake 3: Not handling missing values in grouping variables

# Be careful with NA values in grouping variables
penguins |> group_by(sex) |> summarise(count = n())  # Creates NA group
# Consider filter(!is.na(sex)) or tidyr::drop_na(sex) first;
# na.omit() would also drop rows that have an NA in any other column
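The two usual ways of dealing with a missing grouping value can be sketched as follows, again on a made-up data frame (`toy` and the "unknown" label are illustrative assumptions). Which option is appropriate depends on whether the rows with a missing value should count toward the results.

```r
library(dplyr)

# Hypothetical toy data with one missing grouping value
toy <- tibble(
  sex  = c("f", "m", NA, "f"),
  mass = c(10, 20, 30, 40)
)

# By default, NA forms its own group (3 rows: f, m, NA)
with_na <- toy |>
  group_by(sex) |>
  summarise(n = n())

# Option 1: drop rows where the grouping variable is missing (2 rows)
without_na <- toy |>
  filter(!is.na(sex)) |>
  group_by(sex) |>
  summarise(n = n())

# Option 2: keep the rows but give them an explicit label (3 rows)
labelled <- toy |>
  mutate(sex = if_else(is.na(sex), "unknown", sex)) |>
  group_by(sex) |>
  summarise(n = n())
```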