How to sum a column by group in R
Introduction
Summing values by group is one of the most common data analysis tasks in R. This operation allows you to calculate totals for different categories within your dataset, such as finding total sales by region, total expenses by department, or total measurements by species.
Group-wise summation is essential for exploratory data analysis, creating summary reports, and preparing data for visualization. Whether you’re working with financial data, scientific measurements, or survey responses, the ability to aggregate values by categorical variables helps reveal patterns and insights that might be hidden in individual observations. R provides several efficient methods to perform these calculations, with the tidyverse approach being particularly intuitive and readable.
Getting Started
First, let’s load the required packages for our examples:
library(tidyverse)
library(palmerpenguins)Example 1: Basic Usage
Let’s start with a simple example using the built-in mtcars dataset. We’ll sum the horsepower (hp) by the number of cylinders (cyl):
data(mtcars)
basic_sum <- mtcars |>
group_by(cyl) |>
summarise(total_hp = sum(hp))
print(basic_sum)This code groups cars by their cylinder count and calculates the total horsepower for each group. The group_by() function specifies the grouping variable, while summarise() performs the aggregation using sum().
You can also sum multiple columns simultaneously:
multiple_sums <- mtcars |>
group_by(cyl) |>
summarise(
total_hp = sum(hp),
total_weight = sum(wt),
count = n()
)
print(multiple_sums)Example 2: Practical Application
Now let’s work with a more complex example using the Palmer penguins dataset. We’ll calculate total body mass by species and island, while also handling missing values:
data(penguins)
penguin_summary <- penguins |>
filter(!is.na(body_mass_g)) |>
group_by(species, island) |>
summarise(
total_body_mass = sum(body_mass_g),
average_body_mass = mean(body_mass_g),
penguin_count = n(),
.groups = "drop"
) |>
arrange(desc(total_body_mass))
print(penguin_summary)For more advanced analysis, you can combine grouping with conditional summation. Here’s how to sum body mass only for penguins above a certain weight threshold:
conditional_sum <- penguins |>
filter(!is.na(body_mass_g)) |>
group_by(species) |>
summarise(
total_heavy_penguins = sum(body_mass_g[body_mass_g > 4000], na.rm = TRUE),
count_heavy_penguins = sum(body_mass_g > 4000, na.rm = TRUE),
proportion_heavy = mean(body_mass_g > 4000, na.rm = TRUE)
)
print(conditional_sum)You can also create summary statistics that include percentages of the total:
percentage_summary <- penguins |>
filter(!is.na(body_mass_g)) |>
group_by(species) |>
summarise(species_total = sum(body_mass_g)) |>
mutate(
overall_total = sum(species_total),
percentage_of_total = round(100 * species_total / overall_total, 1)
)
print(percentage_summary)