How to use aggregate in R

base-r
aggregate
A tutorial on aggregate() in R with worked examples, covering syntax, use cases, and best practices.
Published February 22, 2026

Introduction

The aggregate() function in R is a powerful tool for computing summary statistics by grouping data. It’s particularly useful when you need to calculate means, sums, counts, or other statistics for different categories in your dataset.

Getting Started

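# install.packages("palmerpenguins")  # run once if the package is not installed
library(palmerpenguins)  # provides the penguins data set
data(penguins)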

Example 1: Basic Usage

The Problem

We want to calculate the average body mass of penguins for each species. This requires grouping our data by species and computing the mean for each group.

Step 1: Examine the data structure

Let’s first look at our dataset to understand what we’re working with.

head(penguins)
str(penguins$species)
str(penguins$body_mass_g)

We can see that species is a factor with three levels (Adelie, Chinstrap, Gentoo) and that body_mass_g is an integer column containing some missing values.

Step 2: Apply basic aggregation

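We’ll use aggregate() to calculate mean body mass by species. The formula body_mass_g ~ species reads as “body mass by species”. With the formula interface, rows that have a missing value in any formula variable are dropped by default (na.action = na.omit), so na.rm = TRUE, which is passed through to mean(), is a harmless safeguard here rather than a requirement.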

aggregate(body_mass_g ~ species, 
          data = penguins, 
          FUN = mean, 
          na.rm = TRUE)

This returns a data frame with one row per species and a body_mass_g column holding the mean for that group.
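To see how missing values are handled, here is a minimal sketch on a made-up data frame (toy and its values are hypothetical, not taken from palmerpenguins). It contrasts the formula interface, which drops incomplete rows by default, with the data-frame interface (the x and by arguments), which does not.

```r
# Made-up values for illustration only; one mass is missing in species "B".
toy <- data.frame(
  species = c("A", "A", "A", "B", "B", "B"),
  mass    = c(10, 20, 30, 40, NA, 60)
)

# Formula interface: the default na.action = na.omit removes the NA row
# before FUN runs, so mean() never sees a missing value.
by_formula <- aggregate(mass ~ species, data = toy, FUN = mean)

# Data-frame interface: no rows are dropped for NA in x,
# so the mean for "B" is NA unless na.rm = TRUE is passed on to mean().
by_list_na   <- aggregate(toy["mass"], by = toy["species"], FUN = mean)
by_list_narm <- aggregate(toy["mass"], by = toy["species"], FUN = mean,
                          na.rm = TRUE)

by_formula
by_list_na
by_list_narm
```

In this sketch the first and third calls agree, while the second reports NA for the group that contains the missing value.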

Step 3: Multiple grouping variables

Now let’s group by both species and sex to get more detailed statistics.

aggregate(body_mass_g ~ species + sex, 
          data = penguins, 
          FUN = mean, 
          na.rm = TRUE)

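The result shows the average body mass for each combination of species and sex. Penguins whose sex was not recorded are left out, because the formula interface omits rows with NA in any formula variable.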
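When reading group means, it helps to know how many rows each one is based on. The following sketch, on a made-up data frame (toy is hypothetical), uses FUN = length to count rows per group and shows that a row with a missing grouping value does not contribute to any group.

```r
# Made-up values for illustration only; one row has no recorded sex.
toy <- data.frame(
  species = c("A", "A", "A", "B", "B", "B"),
  sex     = c("f", "m", NA, "f", "m", "m"),
  mass    = c(10, 20, 30, 40, 50, 60)
)

# Count the rows behind each species/sex combination.
group_n <- aggregate(mass ~ species + sex, data = toy, FUN = length)
names(group_n)[names(group_n) == "mass"] <- "n"
group_n

# The counts add up to one less than nrow(toy): the NA row was dropped.
sum(group_n$n)
nrow(toy)
```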

Example 2: Practical Application

The Problem

A researcher needs to analyze penguin morphology across different islands and years. They want to calculate multiple summary statistics (mean, standard deviation) for bill length and flipper length to understand population variations.

Step 1: Create summary statistics for bill length

We’ll calculate the mean and standard deviation of bill length, grouped by island and year.

bill_summary <- aggregate(bill_length_mm ~ island + year, 
                         data = penguins, 
                         FUN = function(x) c(mean = mean(x, na.rm = TRUE),
                                           sd = sd(x, na.rm = TRUE)))

This creates a data frame with three columns: island, year, and bill_length_mm. The bill_length_mm column is a two-column matrix (mean and sd) rather than two separate columns.
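A matrix column is easy to miss when the result is printed, so it can help to inspect it directly. This is a small sketch on made-up data (toy and toy_summary are hypothetical names) showing how such a column looks and how to pull one statistic out of it.

```r
# Made-up values for illustration only.
toy <- data.frame(
  island = c("X", "X", "X", "Y", "Y", "Y"),
  bill   = c(40, 42, 44, 50, 50, 53)
)

toy_summary <- aggregate(bill ~ island, data = toy,
                         FUN = function(x) c(mean = mean(x), sd = sd(x)))

ncol(toy_summary)             # 2: island plus one matrix column
is.matrix(toy_summary$bill)   # TRUE
colnames(toy_summary$bill)    # names returned by FUN
toy_summary$bill[, "mean"]    # extract one statistic by name
```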

Step 2: Flatten the results

The previous output has a matrix column, so we need to flatten it for easier analysis.

bill_summary_flat <- do.call(data.frame, bill_summary)
colnames(bill_summary_flat) <- c("island", "year", "mean_bill", "sd_bill")
head(bill_summary_flat)

Now we have a clean data frame with separate columns for each statistic.
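Assigning column names by position works only if the flattened columns come out in the expected order. As a hedged sketch on made-up data (toy, toy_summary, and toy_flat are hypothetical names), the generated names can be checked first and then rewritten by pattern instead of by position.

```r
# Made-up values for illustration only.
toy <- data.frame(
  island = c("X", "X", "X", "Y", "Y", "Y"),
  bill   = c(40, 42, 44, 50, 50, 53)
)

toy_summary <- aggregate(bill ~ island, data = toy,
                         FUN = function(x) c(mean = mean(x), sd = sd(x)))

# Flattening prefixes each statistic with the name of the matrix column.
toy_flat <- do.call(data.frame, toy_summary)
names(toy_flat)

# Rename by pattern: "bill.mean" becomes "mean_bill", "bill.sd" becomes "sd_bill".
names(toy_flat) <- sub("^bill\\.(.*)$", "\\1_bill", names(toy_flat))
names(toy_flat)
```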

Step 3: Apply multiple functions to multiple variables

Let’s aggregate two variables at once, applying the same pair of functions to each.

multi_summary <- aggregate(cbind(bill_length_mm, flipper_length_mm) ~ island, 
                          data = penguins, 
                          FUN = function(x) c(mean = mean(x, na.rm = TRUE),
                                            median = median(x, na.rm = TRUE)))

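This calculates the mean and median of bill length and flipper length for each island. Each response variable is returned as its own two-column matrix (mean and median), and rows with a missing value in either measurement are dropped before the functions run.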
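A single aggregate() call applies the same FUN to every response variable. If each variable needs a different function, one possible approach (a sketch, not the only option) is to run one call per variable and join the results with merge(). The data frame and the names toy, bill_mean, flipper_median, and combined are made up for illustration.

```r
# Made-up values for illustration only.
toy <- data.frame(
  island  = c("X", "X", "X", "Y", "Y", "Y"),
  bill    = c(40, 42, 44, 50, 50, 53),
  flipper = c(180, 190, 215, 200, 210, 220)
)

# One aggregate() call per variable, each with its own function.
bill_mean      <- aggregate(bill ~ island, data = toy, FUN = mean)
flipper_median <- aggregate(flipper ~ island, data = toy, FUN = median)

# Join the two summaries on the grouping column.
combined <- merge(bill_mean, flipper_median, by = "island")
names(combined) <- c("island", "mean_bill", "median_flipper")
combined
```

Because each call drops incomplete rows separately, the two summaries may be based on different subsets of rows when the variables have missing values in different places.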

Step 4: Using aggregate with custom functions

We can create custom functions for more complex calculations.

cv_function <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100
}

cv_results <- aggregate(body_mass_g ~ species, 
                       data = penguins, 
                       FUN = cv_function)

This calculates the coefficient of variation (CV) for body mass by species.
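The same pattern extends to functions that return more than one value. As an illustrative sketch, the hypothetical ci_function below computes a t-based 95% confidence interval for the mean; it assumes roughly normal data within each group and at least two observations per group. The toy data frame is made up.

```r
# Hypothetical helper: t-based confidence interval for the mean of x.
ci_function <- function(x, level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# Made-up values for illustration only.
toy <- data.frame(
  species = c("A", "A", "A", "B", "B", "B"),
  mass    = c(10, 20, 30, 40, 50, 60)
)

# The result has a matrix column with "lower" and "upper".
ci_results <- aggregate(mass ~ species, data = toy, FUN = ci_function)
ci_flat <- do.call(data.frame, ci_results)
ci_flat
```

For a single group, the interval should match what t.test() reports in its conf.int element, which is a convenient way to check the helper.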

Summary

  • aggregate() uses the formula syntax variable ~ grouping_variables to specify what to calculate and how to group
  • The FUN argument accepts any function (mean, sum, sd, custom functions) and applies it to each group
  • Multiple grouping variables can be combined using + in the formula
  • Use cbind() to aggregate multiple response variables simultaneously
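  • With the formula interface, rows with missing values are dropped before FUN runs (na.action = na.omit); with the data-frame interface (x and by), pass na.rm = TRUE through to FUN to avoid NA results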
  • Custom functions can be created for specialized calculations like coefficients of variation or confidence intervals