How to use summarise() in R
Introduction
The summarise() function is one of the most powerful tools in R’s dplyr package for data analysis. It allows you to collapse multiple rows of data into summary statistics, creating condensed insights from larger datasets. Whether you’re calculating means, counts, standard deviations, or custom metrics, summarise() transforms your raw data into meaningful summaries.
You’ll use summarise() whenever you need to compute aggregate statistics from your data. Common scenarios include calculating average sales by region, counting observations in different categories, finding the maximum values in groups, or creating custom summary metrics. It’s particularly powerful when combined with group_by() to create summaries for different subsets of your data, making it essential for exploratory data analysis and reporting.
Getting Started
First, let’s load the required packages. We’ll use the tidyverse for data manipulation and the penguins dataset from the palmerpenguins package for our examples.
library(tidyverse)
library(palmerpenguins)
# Take a look at our data
head(penguins)
Example 1: Basic Usage
Let’s start with simple summary statistics using the penguins dataset. Without grouping, summarise() returns a single row with one column for each summary statistic you define.
# Basic summary statistics
penguins |>
  summarise(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    max_flipper_length = max(flipper_length_mm, na.rm = TRUE)
  )
You can also use summarise() with a single statistic:
# Single summary statistic
penguins |>
  summarise(total_penguins = n())
The na.rm = TRUE argument is important when working with real data that may contain missing values. Without it, a single NA in a column makes functions such as mean() and max() return NA. Note that n() takes no arguments and counts every row, including rows with missing values.
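To see the effect of na.rm in isolation, here is a minimal sketch on a made-up three-row tibble. The toy data and the column name mass_g are assumptions for illustration and are not part of the penguins dataset.

```r
library(dplyr)

# Hypothetical toy data: three measurements, one of them missing
toy <- tibble(mass_g = c(3500, NA, 4100))

na_demo <- toy |>
  summarise(
    n_rows       = n(),                        # counts all rows, including the NA row
    mean_with_na = mean(mass_g),               # NA, because one value is missing
    mean_no_na   = mean(mass_g, na.rm = TRUE), # mean of the two known values
    n_missing    = sum(is.na(mass_g))          # number of missing values
  )

na_demo
```

Reporting the number of missing values next to the mean is a cheap way to show how many observations the average is actually based on.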
Example 2: Practical Application
The real power of summarise() emerges when combined with group_by(). Let’s analyze penguin characteristics by species and island to understand patterns in the data.
# Grouped summary statistics
penguin_summary <- penguins |>
  group_by(species, island) |>
  summarise(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_bill_depth = mean(bill_depth_mm, na.rm = TRUE),
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    sd_body_mass = sd(body_mass_g, na.rm = TRUE),
    .groups = "drop"
  )
We can also create more complex summaries with conditional logic and multiple summary functions:
# Advanced summary with conditional statistics
penguins |>
  group_by(species) |>
  summarise(
    total_count = n(),
    male_count = sum(sex == "male", na.rm = TRUE),
    female_count = sum(sex == "female", na.rm = TRUE),
    heavy_penguins = sum(body_mass_g > 4000, na.rm = TRUE),
    bill_length_range = max(bill_length_mm, na.rm = TRUE) - min(bill_length_mm, na.rm = TRUE),
    avg_bill_ratio = mean(bill_length_mm / bill_depth_mm, na.rm = TRUE),
    .groups = "drop"
  )
Because the penguins dataset includes an integer year column, we can also group by year to create time-based summaries:
# Summary by year and species
penguins |>
  group_by(year, species) |>
  summarise(
    count = n(),
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    median_flipper_length = median(flipper_length_mm, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(year, species)
Summary
The summarise() function is essential for data analysis in R, allowing you to transform detailed datasets into meaningful summary statistics. Key takeaways: pass na.rm = TRUE to functions such as mean() and sd() when a column may contain missing values, combine summarise() with group_by() to get one summary row per group, and use .groups = "drop" to return an ungrouped result and silence the message dplyr prints about the remaining grouping structure.
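The effect of .groups is easiest to see by inspecting the grouping of the result. The sketch below uses a small made-up tibble (the species, island, and mass_g values are assumptions for illustration) and dplyr's group_vars() to compare the default behaviour with .groups = "drop".

```r
library(dplyr)

# Hypothetical toy data with two grouping columns
toy <- tibble(
  species = c("A", "A", "B", "B"),
  island  = c("x", "y", "x", "x"),
  mass_g  = c(10, 20, 30, 40)
)

# Default: the last grouping variable (island) is peeled off, the result
# stays grouped by species, and dplyr prints a message saying so
still_grouped <- toy |>
  group_by(species, island) |>
  summarise(avg_mass = mean(mass_g))

# With .groups = "drop" the result is an ungrouped tibble and no message appears
ungrouped <- toy |>
  group_by(species, island) |>
  summarise(avg_mass = mean(mass_g), .groups = "drop")

group_vars(still_grouped)  # "species"
group_vars(ungrouped)      # character(0)
```

Dropping the groups explicitly matters when later steps such as mutate() or another summarise() should operate on the whole table rather than within each species.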