How to use summarize() in R

Published

February 20, 2026

R Tutorial: dplyr::summarize()

Introduction

The summarize() function (also spelled summarise()) from the dplyr package computes summary statistics from your data. It collapses many observations into a single row of summary values per group (or one row overall when the data is ungrouped), making it essential for descriptive statistics, data exploration, and reporting.

You would use summarize() when you need to calculate statistics like means, medians, counts, standard deviations, or custom summary measures across your dataset. It’s particularly powerful when combined with group_by() to create summaries for different categories in your data. The function is part of the tidyverse ecosystem and works seamlessly with pipe operators for clean, readable data analysis workflows.

Syntax

summarize(.data, ..., .by = NULL, .groups = NULL)

Key arguments:

- .data: A data frame or tibble to summarize
- ...: Name-value pairs of summary functions (e.g., mean_height = mean(height))
- .by: Optional grouping variables (alternative to group_by())
- .groups: How to handle grouping structure in output ("drop_last", "drop", "keep", "rowwise")
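As a minimal sketch of the .by argument, the following compares per-call grouping with the group_by() approach. It assumes dplyr 1.1.0 or later (where .by is available) and the palmerpenguins package used throughout this tutorial; the object names by_result and grouped_result are illustrative only.

```r
library(dplyr)
library(palmerpenguins)

# Per-call grouping with .by (assumes dplyr >= 1.1.0)
by_result <- penguins |>
  summarize(avg_mass = mean(body_mass_g, na.rm = TRUE), .by = species)

# Persistent grouping with group_by(), then summarize()
grouped_result <- penguins |>
  group_by(species) |>
  summarize(avg_mass = mean(body_mass_g, na.rm = TRUE))

# Both are ungrouped tibbles with one row per species
by_result
grouped_result
```

One difference worth checking in your own output: group_by() sorts the result by the grouping keys, whereas .by should keep groups in the order they first appear in the data.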

Example 1: Basic Usage

Let’s start with a simple example using the palmerpenguins dataset:

library(tidyverse)
library(palmerpenguins)

# Basic summary statistics
penguins |> 
  summarize(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_body_mass = mean(body_mass_g, na.rm = TRUE)
  )
# A tibble: 1 × 3
  count avg_bill_length avg_body_mass
  <int>           <dbl>         <dbl>
1   344            43.9         4202.

This code creates a single-row summary of the entire penguins dataset. The n() function counts the total number of rows (including rows with missing measurements), while mean() calculates average values. We pass na.rm = TRUE so that missing values are dropped before averaging; without it, mean() returns NA whenever the column contains an NA. The result is a new tibble with three columns containing our summary statistics.
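To see how missing values interact with these functions, here is a small sketch that counts rows, non-missing measurements, and missing measurements side by side. The column names n_rows, n_measured, and n_missing are chosen for illustration.

```r
library(dplyr)
library(palmerpenguins)

na_check <- penguins |>
  summarize(
    n_rows = n(),                              # every row, NA or not
    n_measured = sum(!is.na(bill_length_mm)),  # rows with a recorded bill length
    n_missing = sum(is.na(bill_length_mm)),    # rows where bill length is NA
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE)
  )

na_check
```

When n_rows and n_measured differ, the average is based on fewer observations than the row count suggests, which is worth reporting alongside the mean.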

Example 2: Practical Application

A more practical use case involves grouping data to compare different categories:

# Compare penguin species characteristics
species_summary <- penguins |> 
  group_by(species, island) |> 
  summarize(
    n_penguins = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_bill_depth = mean(bill_depth_mm, na.rm = TRUE),
    avg_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    sd_body_mass = sd(body_mass_g, na.rm = TRUE),
    .groups = "drop"
  ) |> 
  arrange(desc(avg_bill_length))

print(species_summary)
# A tibble: 5 × 7
  species   island n_penguins avg_bill_length avg_bill_depth avg_flipper_length sd_body_mass
  <fct>     <fct>       <int>           <dbl>          <dbl>              <dbl>        <dbl>
1 Chinstrap Dream          68            48.8           18.4               196.         285.
2 Gentoo    Biscoe        124            47.5           15.0               217.         504.
3 Adelie    Torgersen      52            39.0           18.4               191.         445.
4 Adelie    Biscoe         44            39.0           18.4               189.         347.
5 Adelie    Dream          56            38.5           18.3               190.         297.

This example demonstrates how summarize() works with group_by() to create summaries for each combination of species and island. The .groups = "drop" argument removes the grouping structure from the output; without it, the result would still be grouped by species. Finally, arrange(desc(avg_bill_length)) sorts the rows from the longest to the shortest average bill length.
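The effect of .groups can be inspected directly with group_vars(), which lists the grouping variables that remain on a result. The sketch below is illustrative; the object names are not part of the original example.

```r
library(dplyr)
library(palmerpenguins)

by_species_island <- penguins |> group_by(species, island)

# Default behavior here is "drop_last": only the last grouping variable is removed
# (dplyr also prints a message about the resulting grouping)
default_out <- by_species_island |> summarize(n = n())
group_vars(default_out)

# "drop": the result is fully ungrouped
drop_out <- by_species_island |> summarize(n = n(), .groups = "drop")
group_vars(drop_out)

# "keep": the result retains both grouping variables
keep_out <- by_species_island |> summarize(n = n(), .groups = "keep")
group_vars(keep_out)
```

Checking group_vars() after a summary is a cheap way to avoid surprises in later steps such as mutate() or a second summarize().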

Example 3: Advanced Usage

Here’s an advanced example showing multiple summary types and custom functions:

# Advanced summaries with multiple statistics and custom functions
advanced_summary <- penguins |> 
  filter(!is.na(sex)) |> 
  group_by(species, sex) |> 
  summarize(
    across(c(bill_length_mm, body_mass_g), 
           list(mean = ~mean(.x, na.rm = TRUE),
                median = ~median(.x, na.rm = TRUE),
                q75 = ~quantile(.x, 0.75, na.rm = TRUE))),
    bill_length_range = max(bill_length_mm, na.rm = TRUE) - min(bill_length_mm, na.rm = TRUE),
    heavy_penguins = sum(body_mass_g > 4000, na.rm = TRUE),
    .groups = "drop"
  )

print(advanced_summary)
# A tibble: 6 × 10
  species   sex    bill_length_mm_mean bill_length_mm_median bill_length_mm_q75 body_mass_g_mean
  <fct>     <fct>                <dbl>                 <dbl>              <dbl>            <dbl>
1 Adelie    female                37.3                  37.2               38.7            3369.
2 Adelie    male                  40.4                  40.8               42.6            4043.
3 Chinstrap female                46.6                  46.9               49.6            3527.
4 Chinstrap male                  51.1                  51.1               53.2            3939.
5 Gentoo    female                45.6                  45.8               48.5            4680.
6 Gentoo    male                  49.5                  49.6               52.2            5485.
# ℹ 4 more variables: body_mass_g_median <dbl>, body_mass_g_q75 <dbl>, bill_length_range <dbl>, heavy_penguins <int>

This advanced example uses across() to apply multiple summary functions to several columns simultaneously, creates custom calculations like range and conditional counts, and demonstrates how to build complex analytical summaries efficiently.
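By default, across() names its outputs as column_function (for example bill_length_mm_mean). As a hedged sketch, the .names argument of across() can change that pattern; the glue specification "{.fn}_{.col}" below is one possible choice, not the only one, and named_summary is an illustrative object name.

```r
library(dplyr)
library(palmerpenguins)

named_summary <- penguins |>
  group_by(species) |>
  summarize(
    across(
      c(bill_length_mm, body_mass_g),
      list(mean = ~mean(.x, na.rm = TRUE), sd = ~sd(.x, na.rm = TRUE)),
      .names = "{.fn}_{.col}"  # function name first, then column name
    ),
    .groups = "drop"
  )

names(named_summary)
```

Putting the statistic first can make it easier to select related columns later, for example with starts_with("mean_").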

Common Mistakes

1. Forgetting to handle missing values:

# Wrong - will return NA if any missing values exist
penguins |> summarize(avg_bill = mean(bill_length_mm))

# Correct - use na.rm = TRUE
penguins |> summarize(avg_bill = mean(bill_length_mm, na.rm = TRUE))

2. Not understanding grouping behavior:

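# With two grouping variables, summarize() drops only the last one,
# so the intermediate result is still grouped by species
penguins |> 
  group_by(species, island) |> 
  summarize(count = n()) |> 
  summarize(total = sum(count))  # One row per species, not a grand total

# Better approach - drop all grouping before the second summary
penguins |> 
  group_by(species, island) |> 
  summarize(count = n(), .groups = "drop") |> 
  summarize(total = sum(count))  # Single row with the overall total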

3. Mixing vector and scalar operations incorrectly:

# Problematic - returns multiple values per group
# (deprecated since dplyr 1.1.0, which warns and recommends reframe())
penguins |> 
  group_by(species) |> 
  summarize(all_bills = bill_length_mm)

# Correct - use summary functions that return a single value per group
penguins |> 
  group_by(species) |> 
  summarize(bill_list = list(bill_length_mm))  # One list-column entry per species
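When several values per group are genuinely needed, reframe() is the function dplyr recommends instead of summarize(). The sketch below assumes dplyr 1.1.0 or later; the probs vector and the output column names prob and bill_length are illustrative choices.

```r
library(dplyr)
library(palmerpenguins)

# Quantile levels to compute for each species (illustrative choice)
probs <- c(0.25, 0.5, 0.75)

bill_quantiles <- penguins |>
  reframe(
    prob = probs,
    # unname() drops the "25%", "50%", "75%" labels that quantile() attaches
    bill_length = unname(quantile(bill_length_mm, probs, na.rm = TRUE)),
    .by = species
  )

bill_quantiles
```

Each species contributes three rows here, one per quantile level, and reframe() always returns an ungrouped data frame.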