How to use summarise() in R

Categories: dplyr, dplyr summarise()

Published: February 20, 2026

R Tutorial: dplyr::summarise()

Introduction

The summarise() function from the dplyr package computes summary statistics from a data frame. It collapses the rows of each group into a single summary row (or the whole data frame into one row when there is no grouping) by applying aggregate functions such as mean(), sum(), n(), or max() to columns. It is a core tool whenever you need descriptive statistics, report tables, or per-group comparisons. summarise() is part of the tidyverse and works with other dplyr functions like group_by() and filter(). It’s particularly useful with grouping, which lets you calculate statistics for each subset of your data in a single call.

Syntax

summarise(.data, ..., .by = NULL, .groups = NULL)

Key Arguments:

- .data: A data frame or tibble to summarize
- ...: Name-value pairs of summary functions (e.g., mean_height = mean(height))
- .by: Optional grouping columns (alternative to using group_by())
- .groups: How to handle grouping structure in the output ("drop_last", "drop", "keep", "rowwise")
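As a minimal sketch of the .by argument listed above, the following computes per-species summaries without a separate group_by() step. It assumes dplyr 1.1.0 or later (where .by is available) and the palmerpenguins data used throughout this tutorial; the names by_species, n, and avg_mass are illustrative only.

```r
library(dplyr)
library(palmerpenguins)

# Group by species for this call only, instead of piping through group_by()
by_species <- penguins |>
  summarise(
    n = n(),                                      # rows per species
    avg_mass = mean(body_mass_g, na.rm = TRUE),   # mean mass, ignoring NAs
    .by = species
  )

by_species
```

Because .by groups only for the duration of the call, the result comes back ungrouped, so there is no grouping structure left to manage afterwards.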

Example 1: Basic Usage

Let’s start with a simple example using the palmerpenguins dataset:

library(tidyverse)
library(palmerpenguins)

# Basic summary statistics
penguins |>
  summarise(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    max_body_mass = max(body_mass_g, na.rm = TRUE),
    min_flipper_length = min(flipper_length_mm, na.rm = TRUE)
  )

# A tibble: 1 × 4
  count avg_bill_length max_body_mass min_flipper_length
  <int>           <dbl>         <int>              <int>
1   344            43.9          6300                172

This example demonstrates the basic functionality of summarise(). We created four summary statistics: total count of observations using n(), average bill length, maximum body mass, and minimum flipper length. The na.rm = TRUE argument handles missing values by excluding them from calculations.
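When the same statistic is needed for several columns, repeating mean(..., na.rm = TRUE) gets tedious. One possible shortcut is across(), sketched below under the assumption that R 4.1 or later is available (for the \(x) lambda syntax); the avg_ prefix and the name col_means are illustrative choices.

```r
library(dplyr)
library(palmerpenguins)

# Apply the same NA-aware mean to three measurement columns at once
col_means <- penguins |>
  summarise(
    across(
      c(bill_length_mm, bill_depth_mm, body_mass_g),
      \(x) mean(x, na.rm = TRUE),
      .names = "avg_{.col}"   # e.g. avg_bill_length_mm
    )
  )

col_means
```

The result is still a single row, with one avg_ column per input column.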

Example 2: Practical Application

Here’s a more practical example that combines summarise() with group_by() to analyze penguin species:

# Species comparison with grouped summaries
species_summary <- penguins |>
  group_by(species, island) |>
  summarise(
    penguin_count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_bill_depth = mean(bill_depth_mm, na.rm = TRUE),
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    mass_sd = sd(body_mass_g, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_body_mass))

species_summary

# A tibble: 5 × 7
  species   island    penguin_count avg_bill_length avg_bill_depth avg_body_mass mass_sd
  <fct>     <fct>             <int>           <dbl>          <dbl>         <dbl>   <dbl>
1 Gentoo    Biscoe              124            47.5           15.0         5076.    504.
2 Chinstrap Dream                68            48.8           18.4         3733.    384.
3 Adelie    Biscoe               44            39.0           18.4         3710.    488.
4 Adelie    Torgersen            52            39.0           18.4         3706.    445.
5 Adelie    Dream                56            38.5           18.3         3688.    455.

This example shows how summarise() works with grouped data to create comprehensive summaries for each species-island combination. We calculated multiple statistics and used arrange() to sort by average body mass, revealing that Gentoo penguins are the heaviest on average.
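To see what .groups = "drop" changes in an example like this one, it can help to inspect the grouping of the result directly. The sketch below compares the default behaviour with an explicit drop using group_vars(); the object names default_groups and dropped are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Default: only the last grouping variable (island) is dropped,
# and dplyr prints a message saying the output is grouped by 'species'
default_groups <- penguins |>
  group_by(species, island) |>
  summarise(n = n())

# Explicit: all grouping is removed from the result
dropped <- penguins |>
  group_by(species, island) |>
  summarise(n = n(), .groups = "drop")

group_vars(default_groups)  # "species"
group_vars(dropped)         # character(0)
```

A result that is still grouped affects later steps such as mutate() or arrange(..., .by_group = TRUE), which is why being explicit is usually the safer habit.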

Example 3: Advanced Usage

More advanced summaries can combine conditional counts, values derived from earlier summary columns, and list columns:

# Advanced summarise with conditional logic and multiple functions
advanced_summary <- penguins |>
  group_by(species) |>
  summarise(
    total_count = n(),
    complete_cases = sum(!is.na(bill_length_mm) & !is.na(body_mass_g)),
    heavy_penguins = sum(body_mass_g > 4000, na.rm = TRUE),
    pct_heavy = round(heavy_penguins / complete_cases * 100, 1),
    bill_length_range = max(bill_length_mm, na.rm = TRUE) - min(bill_length_mm, na.rm = TRUE),
    mass_quartiles = list(quantile(body_mass_g, na.rm = TRUE)),
    .groups = "drop"
  )

# Extract quartiles for one species
advanced_summary$mass_quartiles[[1]]  # Adelie quartiles

  0%  25%  50%  75% 100% 
2850 3350 3700 4000 4775

This advanced example demonstrates conditional counting, a percentage computed from columns created earlier in the same summarise() call, range calculations, and storing complex objects like quartiles in list columns.
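If a list column is awkward for downstream steps, one alternative (a sketch, not the only option) is to ask quantile() for one probability at a time so that each quartile becomes a regular numeric column. The column names q25, median_mass, and q75 and the object name mass_spread are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# One scalar per group per column, so no list column is needed
mass_spread <- penguins |>
  group_by(species) |>
  summarise(
    q25 = quantile(body_mass_g, 0.25, na.rm = TRUE, names = FALSE),
    median_mass = median(body_mass_g, na.rm = TRUE),
    q75 = quantile(body_mass_g, 0.75, na.rm = TRUE, names = FALSE),
    .groups = "drop"
  )

mass_spread
```

Here names = FALSE asks quantile() to return a plain number rather than one labelled "25%" or "75%".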

Common Mistakes

1. Forgetting na.rm = TRUE with missing data:

# Wrong - will return NA if any missing values exist
penguins |> summarise(avg_mass = mean(body_mass_g))

# Correct - handles missing values
penguins |> summarise(avg_mass = mean(body_mass_g, na.rm = TRUE))

2. Not handling grouping properly:

# Leaves the result grouped by species, which can surprise later steps
penguins |>
  group_by(species, sex) |>
  summarise(count = n())  # Message: summarise() has grouped output by 'species'

# Better - explicitly control grouping
penguins |> 
  group_by(species, sex) |>
  summarise(count = n(), .groups = "drop")

3. Using summarise() when you need mutate():

# Wrong - summarise() expects one value per group, but this expression
# returns one value per row (dplyr 1.1.0+ warns and suggests reframe())
penguins |> summarise(bill_ratio = bill_length_mm / bill_depth_mm)

# Correct - mutate() adds new columns while keeping all rows
penguins |> mutate(bill_ratio = bill_length_mm / bill_depth_mm)
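A related case is wanting several rows per group, for example one row per quantile. summarise() is meant to return one row per group, so a possible fit is reframe(), sketched here assuming dplyr 1.1.0 or later; the column names prob and mass and the object name quartiles are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Three rows per species: one for each requested probability
quartiles <- penguins |>
  reframe(
    prob = c(0.25, 0.5, 0.75),
    mass = quantile(body_mass_g, prob, na.rm = TRUE, names = FALSE),
    .by = species
  )

quartiles
```

With three species and three probabilities, the result has nine rows and is returned ungrouped.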

Related Functions

- group_by(): Groups data by one or more variables before summarizing
- mutate(): Adds new columns while preserving all existing rows
- count(): Shortcut for summarise(n = n()) to count observations
- across(): Apply summary functions across multiple columns at once
- reframe(): Like summarise() but allows returning multiple rows per group