How to use summarise() in R
R Tutorial: dplyr::summarise()
Introduction
The summarise() function from the dplyr package computes summary statistics from data frames. It collapses many rows into a single summary row (or one row per group when the data is grouped) by applying aggregate functions such as mean(), sum(), n(), or max() to columns. This makes it a core tool whenever you need descriptive statistics, report tables, or per-group comparisons. The summarise() function is part of the tidyverse and works alongside other dplyr functions like group_by() and filter(). It’s particularly powerful when combined with grouping operations, which let you calculate statistics for different subsets of your data in a single operation.
Syntax
summarise(.data, ..., .by = NULL, .groups = NULL)
Key Arguments:
- .data: A data frame or tibble to summarize
- ...: Name-value pairs of summary functions (e.g., mean_height = mean(height))
- .by: Optional grouping columns (alternative to using group_by())
- .groups: How to handle grouping structure in the output ("drop_last", "drop", "keep", "rowwise")
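As a minimal sketch of the .by argument, assuming dplyr 1.1.0 or later (the release that introduced .by) and a small made-up tibble named toy that is not part of this tutorial's data, the two calls below should produce the same per-group means:

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  g = c("a", "a", "b", "b"),
  x = c(1, 3, 10, 20)
)

# Persistent grouping with group_by(), then dropping the groups explicitly
via_group_by <- toy |>
  group_by(g) |>
  summarise(mean_x = mean(x), .groups = "drop")

# Per-operation grouping with .by; the result comes back ungrouped
via_by <- toy |>
  summarise(mean_x = mean(x), .by = g)
```

One design difference worth knowing: group_by() sorts the output by the grouping keys, while .by keeps groups in order of first appearance, so the two results can differ in row order on unsorted data.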
Example 1: Basic Usage
Let’s start with a simple example using the palmerpenguins dataset:
library(tidyverse)
library(palmerpenguins)
# Basic summary statistics
penguins |>
  summarise(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    max_body_mass = max(body_mass_g, na.rm = TRUE),
    min_flipper_length = min(flipper_length_mm, na.rm = TRUE)
  )

# A tibble: 1 × 4
  count avg_bill_length max_body_mass min_flipper_length
  <int>           <dbl>         <int>              <int>
1   344            43.9          6300                172
This example demonstrates the basic functionality of summarise(). We created four summary statistics: total count of observations using n(), average bill length, maximum body mass, and minimum flipper length. The na.rm = TRUE argument handles missing values by excluding them from calculations.
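One detail that is easy to miss: n() counts every row, including rows where the summarised column is missing, while na.rm = TRUE only affects the function it is passed to. The sketch below illustrates the difference on a made-up tibble (toy and its column x are hypothetical, chosen only for this illustration):

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(x = c(2, 4, NA, 6))

counts <- toy |>
  summarise(
    n_rows    = n(),                   # counts all 4 rows, including the NA
    n_present = sum(!is.na(x)),        # counts only the 3 non-missing values
    avg_x     = mean(x, na.rm = TRUE)  # mean of 2, 4 and 6
  )
```

Reporting n_present next to the mean makes it clear how many observations the average is actually based on.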
Example 2: Practical Application
Here’s a more practical example that combines summarise() with group_by() to analyze penguin species:
# Species comparison with grouped summaries
species_summary <- penguins |>
  group_by(species, island) |>
  summarise(
    penguin_count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_bill_depth = mean(bill_depth_mm, na.rm = TRUE),
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    mass_sd = sd(body_mass_g, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_body_mass))
species_summary

# A tibble: 5 × 7
  species   island    penguin_count avg_bill_length avg_bill_depth avg_body_mass mass_sd
  <fct>     <fct>             <int>           <dbl>          <dbl>         <dbl>   <dbl>
1 Gentoo    Biscoe              124            47.5           15.0         5076.    504.
2 Chinstrap Dream                68            48.8           18.4         3733.    384.
3 Adelie    Biscoe               44            39.0           18.4         3710.    488.
4 Adelie    Torgersen            52            39.0           18.4         3706.    445.
5 Adelie    Dream                56            38.5           18.3         3688.    455.
This example shows how summarise() works with grouped data to create comprehensive summaries for each species-island combination. We calculated multiple statistics and used arrange() to sort by average body mass, revealing that Gentoo penguins are the heaviest on average.
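The .groups = "drop" setting used above matters because, with more than one grouping variable, summarise() by default drops only the last one. The following sketch shows the difference on a made-up tibble (toy and its columns are hypothetical); "drop_last" is passed explicitly here to mirror that default without triggering dplyr's informational message:

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  species = c("A", "A", "B", "B"),
  island  = c("x", "y", "x", "x"),
  mass    = c(10, 20, 30, 50)
)

# "drop_last": the result is still grouped by species
still_grouped <- toy |>
  group_by(species, island) |>
  summarise(avg_mass = mean(mass), .groups = "drop_last")

# "drop": the result is a plain, ungrouped tibble
ungrouped <- toy |>
  group_by(species, island) |>
  summarise(avg_mass = mean(mass), .groups = "drop")

group_vars(still_grouped)  # remaining grouping variables
group_vars(ungrouped)      # none left
```

Leftover grouping can silently change what later verbs such as mutate() or slice() do, which is why dropping groups before arrange() is a reasonable habit.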
Example 3: Advanced Usage
A more advanced summary can combine conditional counts, statistics derived from earlier summary columns, and list columns:
# Advanced summarise with conditional logic and multiple functions
advanced_summary <- penguins |>
  group_by(species) |>
  summarise(
    total_count = n(),
    complete_cases = sum(!is.na(bill_length_mm) & !is.na(body_mass_g)),
    heavy_penguins = sum(body_mass_g > 4000, na.rm = TRUE),
    pct_heavy = round(heavy_penguins / complete_cases * 100, 1),
    bill_length_range = max(bill_length_mm, na.rm = TRUE) - min(bill_length_mm, na.rm = TRUE),
    mass_quartiles = list(quantile(body_mass_g, na.rm = TRUE)),
    .groups = "drop"
  )
# Extract quartiles for one species
advanced_summary$mass_quartiles[[1]] # Adelie quartiles

  0%  25%  50%  75% 100%
2850 3350 3700 4000 4775
This advanced example demonstrates conditional counting, percentage calculations, range calculations, and storing complex objects like quartiles in list columns.
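When the quartiles are needed as ordinary columns rather than inside a list column, one alternative is to request each quantile separately. This is only a sketch on a made-up tibble (toy, g, and mass are hypothetical names); unname() is used here on the assumption that the "25%"-style names quantile() attaches are not wanted in the result:

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  g    = c("a", "a", "a", "a", "a", "b", "b", "b"),
  mass = c(1, 2, 3, 4, 5, 10, 20, 30)
)

quartile_cols <- toy |>
  group_by(g) |>
  summarise(
    q25         = unname(quantile(mass, 0.25, na.rm = TRUE)),
    median_mass = median(mass, na.rm = TRUE),
    q75         = unname(quantile(mass, 0.75, na.rm = TRUE)),
    .groups = "drop"
  )
```

Flat columns like these are easier to filter, sort, and plot than a list column, at the cost of one summarise() entry per quantile.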
Common Mistakes
1. Forgetting na.rm = TRUE with missing data:
# Wrong - will return NA if any missing values exist
penguins |> summarise(avg_mass = mean(body_mass_g))
# Correct - handles missing values
penguins |> summarise(avg_mass = mean(body_mass_g, na.rm = TRUE))
2. Not handling grouping properly:
# The result silently stays grouped by species, which can surprise later steps
penguins |>
  group_by(species, sex) |>
  summarise(count = n()) # dplyr prints a message (not a warning) about the remaining grouping
# Better - explicitly control grouping
penguins |>
  group_by(species, sex) |>
  summarise(count = n(), .groups = "drop")
3. Using summarise() when you need mutate():
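# Wrong - summarise() expects one value per group and drops every other column;
# dplyr 1.1.0 and later also emit a deprecation warning for multi-row results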
penguins |> summarise(bill_ratio = bill_length_mm / bill_depth_mm)
# Correct - mutate() adds new columns while keeping all rows
penguins |> mutate(bill_ratio = bill_length_mm / bill_depth_mm)
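To make the difference concrete, here is a small sketch on a made-up tibble (toy, len, and depth are hypothetical names, not columns from the penguins data) comparing the shape of the two results:

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  len   = c(40, 50, 45),
  depth = c(20, 25, 15)
)

# mutate(): one ratio per row, all original columns kept
with_ratio <- toy |> mutate(ratio = len / depth)

# summarise(): a single aggregate value for the whole table
avg_ratio <- toy |> summarise(avg_ratio = mean(len / depth))

dim(with_ratio)  # same number of rows as toy, one extra column
dim(avg_ratio)   # one row, one column
```

A quick rule of thumb: if the expression yields one value per row, it belongs in mutate(); if it yields one value per group, it belongs in summarise().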