How to use aggregate in R
Introduction
The aggregate() function in R is a powerful tool for computing summary statistics by grouping data. It’s particularly useful when you need to calculate means, sums, counts, or other statistics for different categories in your dataset.
Getting Started
library(palmerpenguins)
data(penguins)
Example 1: Basic Usage
The Problem
We want to calculate the average body mass of penguins for each species. This requires grouping our data by species and computing the mean for each group.
Step 1: Examine the data structure
Let’s first look at our dataset to understand what we’re working with.
head(penguins)
str(penguins$species)
str(penguins$body_mass_g)
The output shows that species is a factor and body_mass_g holds numeric (integer) values, a few of which are missing.
Step 2: Apply basic aggregation
We’ll use aggregate to calculate mean body mass by species.
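aggregate(body_mass_g ~ species,
          data = penguins,
          FUN = mean,
          na.rm = TRUE)
This returns a data frame with one row per species and its average body mass. With the formula interface, aggregate() already drops rows with missing values (na.action = na.omit by default), so na.rm = TRUE, which is passed through to mean(), acts only as a safeguard here.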
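To make the handling of missing values concrete, here is a minimal sketch on a small made-up data frame (the toy object and its values are illustrative assumptions, not part of the penguins data). It compares the formula interface with the data-frame interface of aggregate().

```r
# Toy data with made-up values (not the penguins dataset).
toy <- data.frame(
  species = factor(c("A", "A", "A", "B", "B")),
  mass    = c(10, 20, NA, 30, 50)
)

# Formula interface: the row with the missing mass is dropped before
# FUN runs, because na.action defaults to na.omit.
by_formula <- aggregate(mass ~ species, data = toy, FUN = mean)
print(by_formula)

# Data-frame interface: no rows are dropped, so na.rm = TRUE has to be
# forwarded to mean() to avoid an NA for group A.
by_list <- aggregate(toy["mass"], by = list(species = toy$species),
                     FUN = mean, na.rm = TRUE)
print(by_list)
```

Both calls should give the same group means on this toy data. The difference only shows up when na.rm = TRUE is left out of the second call.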
Step 3: Multiple grouping variables
Now let’s group by both species and sex to get more detailed statistics.
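aggregate(body_mass_g ~ species + sex,
          data = penguins,
          FUN = mean,
          na.rm = TRUE)
The result shows the average body mass for each combination of species and sex. Penguins whose sex is missing are left out, because the formula interface drops incomplete rows by default.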
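The effect of a missing grouping value can be seen in a small sketch. The toy data frame below uses made-up values, and the deliberately odd mass of 99 marks the row that should disappear from the result.

```r
# Toy data with made-up values; row 3 has no recorded sex.
toy <- data.frame(
  species = c("A", "A", "A", "B", "B", "B"),
  sex     = c("f", "m", NA,  "f", "f", "m"),
  mass    = c(10,  20,  99,  30,  50,  40)
)

# One output row per observed species/sex combination.
# The row with sex = NA is removed by the default na.action (na.omit),
# so the value 99 does not contribute to any group mean.
by_both <- aggregate(mass ~ species + sex, data = toy, FUN = mean)
print(by_both)
```

If those rows matter for the analysis, one option is to recode the missing values as an explicit category before aggregating.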
Example 2: Practical Application
The Problem
A researcher needs to analyze penguin morphology across different islands and years. They want to calculate multiple summary statistics (mean, standard deviation) for bill length and flipper length to understand population variations.
Step 1: Create summary statistics for bill length
We’ll calculate the mean and standard deviation of bill length, grouped by island and year.
bill_summary <- aggregate(bill_length_mm ~ island + year,
                          data = penguins,
                          FUN = function(x) c(mean = mean(x, na.rm = TRUE),
                                              sd = sd(x, na.rm = TRUE)))
This creates a data frame in which bill_length_mm is a two-column matrix holding the mean and standard deviation for each group.
Step 2: Flatten the results
The previous output has a matrix column, so we need to flatten it for easier analysis.
bill_summary_flat <- do.call(data.frame, bill_summary)
colnames(bill_summary_flat) <- c("island", "year", "mean_bill", "sd_bill")
head(bill_summary_flat)
Now we have a clean data frame with separate columns for each statistic.
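The matrix column is easy to miss, because printing the result makes it look like ordinary columns. The sketch below, on a made-up toy data frame, shows how to check for it and what do.call(data.frame, ...) does to the column names.

```r
# Toy data with made-up values.
toy <- data.frame(
  island = c("X", "X", "Y", "Y"),
  bill   = c(40, 44, 50, 54)
)

nested <- aggregate(bill ~ island, data = toy,
                    FUN = function(x) c(mean = mean(x), sd = sd(x)))

# The data frame has only two columns; 'bill' is itself a matrix
# with one column per statistic.
print(ncol(nested))
print(is.matrix(nested$bill))

# Flattening turns each matrix column into a regular column,
# named <variable>.<statistic>.
flat <- do.call(data.frame, nested)
print(names(flat))
```

Renaming with colnames() afterwards, as in the step above, is optional. The generated names are already unique.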
Step 3: Apply multiple functions to multiple variables
Let’s aggregate several variables at once, applying the same pair of functions to each of them.
multi_summary <- aggregate(cbind(bill_length_mm, flipper_length_mm) ~ island,
                           data = penguins,
                           FUN = function(x) c(mean = mean(x, na.rm = TRUE),
                                               median = median(x, na.rm = TRUE)))
This calculates both the mean and the median of bill length and flipper length by island. Each response variable comes back as a matrix column, which can be flattened in the same way as in Step 2.
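One detail worth knowing when combining responses with cbind(): the default na.action removes a row if any of the responses is missing, so a gap in one variable also removes that row's values for the other variables. The sketch below illustrates this on made-up toy data.

```r
# Toy data with made-up values; flipper is missing in row 2.
toy <- data.frame(
  island  = c("X", "X", "X"),
  bill    = c(40, 50, 90),
  flipper = c(180, NA, 200)
)

# Row 2 is dropped as a whole, so the bill mean uses only 40 and 90.
both <- aggregate(cbind(bill, flipper) ~ island, data = toy, FUN = mean)
print(both)

# Aggregating bill on its own keeps all three rows.
separate <- aggregate(bill ~ island, data = toy, FUN = mean)
print(separate)
```

When the variables have different patterns of missing values, aggregating them separately and merging the results may be closer to what is intended.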
Step 4: Using aggregate with custom functions
We can create custom functions for more complex calculations.
cv_function <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100
}
cv_results <- aggregate(body_mass_g ~ species,
                        data = penguins,
                        FUN = cv_function)
This calculates the coefficient of variation (CV) for body mass by species.
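A custom function is easiest to trust after checking it on numbers small enough to verify by hand. The sketch below does that for a CV function on made-up toy data; the names cv and toy are illustrative assumptions.

```r
# Same idea as cv_function above: CV in percent.
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100

# Toy data with made-up values.
toy <- data.frame(
  species = c("A", "A", "B", "B"),
  mass    = c(9, 11, 50, 150)
)

cv_by_species <- aggregate(mass ~ species, data = toy, FUN = cv)

# The result column keeps the name of the response variable,
# so rename it to make clear that it now holds a percentage.
names(cv_by_species)[2] <- "cv_pct"
print(cv_by_species)

# Hand check for group A: mean = 10, sd = sqrt(2), CV = sqrt(2) / 10 * 100.
print(sqrt(2) / 10 * 100)
```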
Summary
- aggregate() uses the formula syntax variable ~ grouping_variables to specify what to calculate and how to group
- The FUN parameter accepts any function (mean, sum, sd, custom functions) to apply to each group
- Multiple grouping variables can be combined using + in the formula
- Use cbind() to aggregate multiple response variables simultaneously
- The formula interface drops rows with missing values by default (na.action = na.omit); include na.rm = TRUE whenever missing values could still reach FUN, to avoid NA results
- Custom functions can be created for specialized calculations like coefficients of variation or confidence intervals
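As an illustration of the confidence-interval case mentioned in the last point, here is a minimal sketch. The helper mean_ci is hypothetical (it is not part of base R), it assumes a t-based interval is appropriate for the data, and the toy values are made up.

```r
# Hypothetical helper: t-based confidence interval for a group mean.
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(mean = mean(x), lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# Toy data with made-up values.
toy <- data.frame(
  species = c("A", "A", "A", "B", "B", "B"),
  mass    = c(9, 10, 11, 40, 50, 60)
)

# FUN returns three values per group, so the result has a matrix
# column that is flattened into mass.mean, mass.lower and mass.upper.
ci_nested <- aggregate(mass ~ species, data = toy, FUN = mean_ci)
ci_flat <- do.call(data.frame, ci_nested)
print(ci_flat)
```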