How to use summarize() in R

Published: February 20, 2026

R Tutorial: dplyr::summarize()

Introduction

The summarize() function (also spelled summarise()) from the dplyr package collapses a data frame into summary statistics. It reduces many observations to one row per group (or a single row when the data is ungrouped), which makes it a staple of descriptive statistics, data exploration, and reporting.

You would use summarize() when you need to calculate statistics like means, medians, counts, standard deviations, or custom summary measures across your dataset. It’s particularly powerful when combined with group_by() to create summaries for different categories in your data. The function is part of the tidyverse ecosystem and works seamlessly with pipe operators for clean, readable data analysis workflows.

Syntax

summarize(.data, ..., .by = NULL, .groups = NULL)

Key arguments:

- .data: A data frame or tibble to summarize
- ...: Name-value pairs of summary functions (e.g., mean_height = mean(height))
- .by: Optional grouping variables for this call only (alternative to group_by(); available since dplyr 1.1.0)
- .groups: How to handle the grouping structure of the output ("drop_last", "drop", "keep", "rowwise")
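As a minimal sketch of how .by relates to group_by(), assuming dplyr 1.1.0 or later and a small made-up data frame (toy, group, and value are illustrative names, not part of the tutorial data), the two pipelines below should give the same ungrouped result:

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  group = c("a", "a", "b"),
  value = c(1, 3, 10)
)

# Per-call grouping with .by (requires dplyr >= 1.1.0)
by_result <- toy |>
  summarize(mean_value = mean(value), .by = group)

# Persistent grouping with group_by(), dropped explicitly afterwards
group_by_result <- toy |>
  group_by(group) |>
  summarize(mean_value = mean(value), .groups = "drop")

print(by_result)
print(group_by_result)
```

One difference to keep in mind: .by keeps groups in order of first appearance, while group_by() sorts them, so the row order can differ on data that is not already sorted.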

Example 1: Basic Usage

Let’s start with a simple example using the palmerpenguins dataset:

library(tidyverse)
library(palmerpenguins)

# Basic summary statistics
penguins |> 
  summarize(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_body_mass = mean(body_mass_g, na.rm = TRUE)
  )

# A tibble: 1 × 3
  count avg_bill_length avg_body_mass
  <int>           <dbl>         <dbl>
1   344            43.9         4202.

This code creates a single-row summary of the entire penguins dataset. The n() function counts the total number of rows, including rows with missing measurements, while mean() calculates average values. Passing na.rm = TRUE tells mean() to ignore missing values; without it, a single NA would make the result NA. The result is a new tibble with three columns containing our summary statistics.
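Because n() counts rows rather than non-missing values, it can help to report both. The following is a small sketch on made-up data (toy and length_mm are hypothetical names) showing the difference:

```r
library(dplyr)

# Hypothetical measurements with one missing value
toy <- tibble(length_mm = c(40, NA, 44))

toy_summary <- toy |>
  summarize(
    n_rows = n(),                               # counts every row, including the NA
    n_measured = sum(!is.na(length_mm)),        # counts only non-missing values
    avg_length = mean(length_mm, na.rm = TRUE)  # mean of the non-missing values
  )

print(toy_summary)
```

Here n_rows is 3 while n_measured is 2, and the mean is computed from the two observed values only.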

Example 2: Practical Application

A more practical use case involves grouping data to compare different categories:

# Compare penguin species characteristics
species_summary <- penguins |> 
  group_by(species, island) |> 
  summarize(
    n_penguins = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_bill_depth = mean(bill_depth_mm, na.rm = TRUE),
    avg_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    sd_body_mass = sd(body_mass_g, na.rm = TRUE),
    .groups = "drop"
  ) |> 
  arrange(desc(avg_bill_length))

print(species_summary)

# A tibble: 5 × 7
  species   island n_penguins avg_bill_length avg_bill_depth avg_flipper_length sd_body_mass
  <fct>     <fct>       <int>           <dbl>          <dbl>              <dbl>        <dbl>
1 Chinstrap Dream          68            48.8           18.4               196.         384.
2 Gentoo    Biscoe        124            47.5           15.0               217.         504.
3 Adelie    Torgersen      52            39.0           18.4               191.         445.
4 Adelie    Biscoe         44            39.0           18.4               189.         347.
5 Adelie    Dream          56            38.5           18.3               190.         297.

This example demonstrates how summarize() works with group_by() to create one summary row for each combination of species and island that occurs in the data. The .groups = "drop" argument removes the grouping structure from the output, and arrange(desc(avg_bill_length)) sorts the rows by average bill length in descending order.
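To see what .groups actually changes, the sketch below inspects the grouping of the result with group_vars(). It uses a small made-up data frame (toy and its columns are hypothetical) and spells out "drop_last", which is the default when every group yields one row:

```r
library(dplyr)

# Hypothetical toy data with two grouping columns
toy <- tibble(
  species = c("x", "x", "y", "y"),
  island  = c("p", "q", "p", "p"),
  mass    = c(1, 2, 3, 4)
)

# "drop_last" removes only the last grouping variable (island)
default_groups <- toy |>
  group_by(species, island) |>
  summarize(n = n(), .groups = "drop_last")

# "drop" removes all grouping
dropped_groups <- toy |>
  group_by(species, island) |>
  summarize(n = n(), .groups = "drop")

group_vars(default_groups)  # still grouped by species
group_vars(dropped_groups)  # no grouping variables left
```

A result that is still grouped affects every later verb in the pipeline, which is why dropping groups explicitly is often the safer choice.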

Example 3: Advanced Usage

Here’s an advanced example showing multiple summary types and custom functions:

# Advanced summaries with multiple statistics and custom functions
advanced_summary <- penguins |> 
  filter(!is.na(sex)) |> 
  group_by(species, sex) |> 
  summarize(
    across(c(bill_length_mm, body_mass_g), 
           list(mean = ~mean(.x, na.rm = TRUE),
                median = ~median(.x, na.rm = TRUE),
                q75 = ~quantile(.x, 0.75, na.rm = TRUE))),
    bill_length_range = max(bill_length_mm, na.rm = TRUE) - min(bill_length_mm, na.rm = TRUE),
    heavy_penguins = sum(body_mass_g > 4000, na.rm = TRUE),
    .groups = "drop"
  )

print(advanced_summary)

# A tibble: 6 × 10
  species   sex    bill_length_mm_mean bill_length_mm_median bill_length_mm_q75 body_mass_g_mean
  <fct>     <fct>                <dbl>                 <dbl>              <dbl>            <dbl>
1 Adelie    female                37.3                  37.2               38.7            3369.
2 Adelie    male                  40.4                  40.8               42.6            4043.
3 Chinstrap female                46.6                  46.9               49.6            3527.
4 Chinstrap male                  51.1                  51.1               53.2            3939.
5 Gentoo    female                45.6                  45.8               48.5            4680.
6 Gentoo    male                  49.5                  49.6               52.2            5485.
# ℹ 4 more variables: body_mass_g_median <dbl>, body_mass_g_q75 <dbl>, bill_length_range <dbl>, heavy_penguins <int>

This advanced example uses across() to apply several summary functions to multiple columns at once, naming each result column after its source column and function (for example, bill_length_mm_mean). It also adds custom calculations, such as a range and a conditional count, in the same summarize() call.
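The output column names come from the default across() pattern "{.col}_{.fn}". As a hedged sketch on made-up data (toy, grp, x, and y are hypothetical names), the .names argument can change that pattern:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  grp = c("a", "a", "b"),
  x = c(1, 3, 5),
  y = c(10, 20, 30)
)

renamed <- toy |>
  group_by(grp) |>
  summarize(
    across(
      c(x, y),
      list(mean = ~mean(.x, na.rm = TRUE), max = ~max(.x, na.rm = TRUE)),
      .names = "{.fn}_{.col}"  # function name first, then column name
    ),
    .groups = "drop"
  )

names(renamed)
```

With this pattern the columns are named mean_x, max_x, mean_y, and max_y instead of x_mean, x_max, and so on.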

Common Mistakes

1. Forgetting to handle missing values:

# Wrong - will return NA if any missing values exist
penguins |> summarize(avg_bill = mean(bill_length_mm))

# Correct - use na.rm = TRUE
penguins |> summarize(avg_bill = mean(bill_length_mm, na.rm = TRUE))

2. Not understanding grouping behavior:

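# With two grouping variables, summarize() drops only the last one (island),
# so the intermediate result is still grouped by species
penguins |> 
  group_by(species, island) |> 
  summarize(count = n()) |> 
  summarize(total = sum(count))  # One row per species, not one overall total

# Better approach - drop all grouping before the second summary
penguins |> 
  group_by(species, island) |> 
  summarize(count = n(), .groups = "drop") |> 
  summarize(total = sum(count))  # Single row with the overall total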

3. Mixing vector and scalar operations incorrectly:

# Problematic - returns multiple values per group
# (deprecated since dplyr 1.1.0; warns and suggests reframe())
penguins |> 
  group_by(species) |> 
  summarize(all_bills = bill_length_mm)  # Deprecation warning

# Alternative - wrap the values in list() to keep one row per group
penguins |> 
  group_by(species) |> 
  summarize(bill_list = list(bill_length_mm))  # Returns list column
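When several rows per group are what you actually want, reframe() is the intended tool. The following is a minimal sketch, assuming dplyr 1.1.0 or later and a made-up data frame (toy, grp, and value are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  grp = c("a", "a", "a", "b", "b"),
  value = c(1, 2, 3, 10, 20)
)

# reframe() may return any number of rows per group; here, two quantiles each
quartiles <- toy |>
  reframe(
    prob = c(0.25, 0.75),
    q = quantile(value, c(0.25, 0.75), names = FALSE),
    .by = grp
  )

print(quartiles)
```

The result has two rows per group, and reframe() always returns an ungrouped data frame, so no .groups argument is needed.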

Related Functions

- group_by(): Groups data by variables before summarizing
- count(): Shortcut for summarize(n = n()) with optional grouping
- tally(): Similar to count() but for already-grouped data
- across(): Applies summary functions to multiple columns simultaneously
- reframe(): Like summarize() but allows multiple rows per group (newer alternative)