How to use summarize() in R
R Tutorial: dplyr::summarize()
Introduction
The summarize() function (also spelled summarise()) is a powerful data manipulation function from the dplyr package that creates summary statistics from your data. It takes multiple observations and reduces them to a single summary value per group, making it essential for descriptive statistics, data exploration, and reporting.
You would use summarize() when you need to calculate statistics like means, medians, counts, standard deviations, or custom summary measures across your dataset. It’s particularly powerful when combined with group_by() to create summaries for different categories in your data. The function is part of the tidyverse ecosystem and works seamlessly with pipe operators for clean, readable data analysis workflows.
Syntax
summarize(.data, ..., .by = NULL, .groups = NULL)
Key arguments:
- .data: A data frame or tibble to summarize
- ...: Name-value pairs of summary functions (e.g., mean_height = mean(height))
- .by: Optional grouping variables (alternative to group_by())
- .groups: How to handle grouping structure in output ("drop_last", "drop", "keep", "rowwise")
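As a rough illustration of the .by argument, the sketch below computes the same per-species average twice, once with group_by() and once with .by. It assumes dplyr 1.1.0 or later (the release that introduced .by) and the palmerpenguins package; the object names via_group_by and via_by are made up for this example.

```r
library(dplyr)
library(palmerpenguins)

# Grouping with group_by(), then dropping the grouping explicitly
via_group_by <- penguins |>
  group_by(species) |>
  summarize(avg_mass = mean(body_mass_g, na.rm = TRUE), .groups = "drop")

# Per-operation grouping with .by (assumes dplyr >= 1.1.0);
# the result is always an ungrouped tibble
via_by <- penguins |>
  summarize(avg_mass = mean(body_mass_g, na.rm = TRUE), .by = species)
```

Both results hold one row per species with the same averages. The row order can differ: group_by() sorts the output by the grouping variable, while .by keeps groups in the order they first appear in the data.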
Example 1: Basic Usage
Let’s start with a simple example using the palmerpenguins dataset:
library(tidyverse)
library(palmerpenguins)
# Basic summary statistics
penguins |>
  summarize(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_body_mass = mean(body_mass_g, na.rm = TRUE)
  )
# A tibble: 1 × 3
  count avg_bill_length avg_body_mass
  <int>           <dbl>         <dbl>
1   344            43.9         4202.
This code creates a single-row summary of the entire penguins dataset. The n() function counts the total number of observations, while mean() calculates average values. We use na.rm = TRUE to handle missing values properly. The result is a new tibble with three columns containing our summary statistics.
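One detail worth checking in practice: n() counts every row, including rows where a measurement is missing, while mean(..., na.rm = TRUE) only uses the non-missing values. A minimal sketch of how to report both counts side by side follows; the column names n_rows and n_bill_measured and the object name row_counts are illustrative choices, not part of dplyr.

```r
library(dplyr)
library(palmerpenguins)

row_counts <- penguins |>
  summarize(
    n_rows = n(),                                   # all rows, including those with NA
    n_bill_measured = sum(!is.na(bill_length_mm)),  # rows with a recorded bill length
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE)
  )

print(row_counts)
```

If the two counts differ, the average is based on fewer observations than the row count suggests.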
Example 2: Practical Application
A more practical use case involves grouping data to compare different categories:
# Compare penguin species characteristics
species_summary <- penguins |>
  group_by(species, island) |>
  summarize(
    n_penguins = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_bill_depth = mean(bill_depth_mm, na.rm = TRUE),
    avg_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    sd_body_mass = sd(body_mass_g, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_bill_length))
print(species_summary)
# A tibble: 5 × 7
  species   island    n_penguins avg_bill_length avg_bill_depth avg_flipper_length sd_body_mass
  <fct>     <fct>          <int>           <dbl>          <dbl>              <dbl>        <dbl>
1 Chinstrap Dream             68            48.8           18.4               196.         285.
2 Gentoo    Biscoe           124            47.5           15.0               217.         504.
3 Adelie    Torgersen         52            39.0           18.4               191.         445.
4 Adelie    Biscoe            44            39.0           18.4               189.         347.
5 Adelie    Dream             56            38.5           18.3               190.         297.
This example demonstrates how summarize() works with group_by() to create one summary row for each combination of species and island that occurs in the data. The .groups = "drop" argument removes the grouping structure from the output, and arrange(desc(avg_bill_length)) sorts the rows from the longest to the shortest average bill length.
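To make the effect of .groups concrete, the sketch below compares the default behaviour with .groups = "drop" by inspecting the grouping variables that remain on the result. The object names default_groups and dropped_groups are illustrative only.

```r
library(dplyr)
library(palmerpenguins)

# Default behaviour: only the last grouping variable (island) is dropped,
# so the result is still grouped by species (dplyr prints a message about this)
default_groups <- penguins |>
  group_by(species, island) |>
  summarize(n_penguins = n())

group_vars(default_groups)
# "species"

# .groups = "drop" returns an ungrouped tibble
dropped_groups <- penguins |>
  group_by(species, island) |>
  summarize(n_penguins = n(), .groups = "drop")

group_vars(dropped_groups)
# character(0)
```

Leftover grouping matters for later steps: a subsequent mutate() or summarize() on default_groups would still operate per species.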
Example 3: Advanced Usage
Here’s an advanced example showing multiple summary types and custom functions:
# Advanced summaries with multiple statistics and custom functions
advanced_summary <- penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarize(
    across(c(bill_length_mm, body_mass_g),
           list(mean = ~mean(.x, na.rm = TRUE),
                median = ~median(.x, na.rm = TRUE),
                q75 = ~quantile(.x, 0.75, na.rm = TRUE))),
    bill_length_range = max(bill_length_mm, na.rm = TRUE) - min(bill_length_mm, na.rm = TRUE),
    heavy_penguins = sum(body_mass_g > 4000, na.rm = TRUE),
    .groups = "drop"
  )
print(advanced_summary)
# A tibble: 6 × 10
  species   sex    bill_length_mm_mean bill_length_mm_median bill_length_mm_q75 body_mass_g_mean
  <fct>     <fct>                <dbl>                 <dbl>              <dbl>            <dbl>
1 Adelie    female                37.3                  37.2               38.7            3369.
2 Adelie    male                  40.4                  40.8               42.6            4043.
3 Chinstrap female                46.6                  46.9               49.6            3527.
4 Chinstrap male                  51.1                  51.1               53.2            3939.
5 Gentoo    female                45.6                  45.8               48.5            4680.
6 Gentoo    male                  49.5                  49.6               52.2            5485.
# ℹ 4 more variables: body_mass_g_median <dbl>, body_mass_g_q75 <dbl>, bill_length_range <dbl>, heavy_penguins <int>
This advanced example uses across() to apply multiple summary functions to several columns simultaneously, creates custom calculations like range and conditional counts, and demonstrates how to build complex analytical summaries efficiently.
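By default across() names its output columns as column_function (for example bill_length_mm_mean). If a different pattern is preferred, the .names argument of across() accepts a glue-style template. The sketch below is one possible variation, with a smaller set of functions than the example above; named_summary is a hypothetical object name.

```r
library(dplyr)
library(palmerpenguins)

named_summary <- penguins |>
  group_by(species) |>
  summarize(
    across(
      c(bill_length_mm, body_mass_g),
      list(mean = ~mean(.x, na.rm = TRUE), sd = ~sd(.x, na.rm = TRUE)),
      .names = "{.fn}_{.col}"  # function name first, e.g. mean_bill_length_mm
    ),
    .groups = "drop"
  )

names(named_summary)
```

Putting the function name first can make it easier to select all means or all standard deviations later with starts_with().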
Common Mistakes
1. Forgetting to handle missing values:
# Wrong - will return NA if any missing values exist
penguins |> summarize(avg_bill = mean(bill_length_mm))
# Correct - use na.rm = TRUE
penguins |> summarize(avg_bill = mean(bill_length_mm, na.rm = TRUE))
2. Not understanding grouping behavior:
# Relies on the default: summarize() silently drops the last grouping level,
# so the second summarize() collapses everything into one total row
penguins |>
  group_by(species) |>
  summarize(count = n()) |>
  summarize(total = sum(count))
# Better approach - state the grouping of the result explicitly with .groups
penguins |>
  group_by(species) |>
  summarize(count = n(), .groups = "drop") |>
  summarize(total = sum(count))
3. Mixing vector and scalar operations incorrectly:
# Wrong - returns multiple values per group
# (deprecated since dplyr 1.1.0, which warns and points to reframe();
# dplyr versions before 1.0.0 raised an error)
penguins |>
  group_by(species) |>
  summarize(all_bills = bill_length_mm)
# Correct - return one value per group, here by wrapping the vector in a list
penguins |>
  group_by(species) |>
  summarize(bill_list = list(bill_length_mm)) # Returns list column
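When several rows per group are actually the goal, reframe() is the function dplyr provides for that purpose. The sketch below assumes dplyr 1.1.0 or later, where reframe() and .by were introduced; the column names prob and bill_length and the object name bill_quartiles are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Three quantiles per species: reframe() allows any number of rows per group
bill_quartiles <- penguins |>
  reframe(
    prob = c(0.25, 0.5, 0.75),
    bill_length = quantile(bill_length_mm, prob, na.rm = TRUE, names = FALSE),
    .by = species
  )

print(bill_quartiles)
```

The result has three rows per species (one per probability) and, unlike summarize(), reframe() always returns an ungrouped data frame.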