How to sum a column by group in R

dplyr group_by()
dplyr summarize()
Learn how to perform sum a column by group in R. Step-by-step statistical tutorial with examples.
Published

April 17, 2022

Introduction

Summing values by group is one of the most common data analysis tasks in R. This operation allows you to calculate totals for different categories within your dataset, such as finding total sales by region, total expenses by department, or total measurements by species.

Group-wise summation is essential for exploratory data analysis, creating summary reports, and preparing data for visualization. Whether you’re working with financial data, scientific measurements, or survey responses, the ability to aggregate values by categorical variables helps reveal patterns and insights that might be hidden in individual observations. R provides several efficient methods to perform these calculations, with the tidyverse approach being particularly intuitive and readable.

Getting Started

First, let’s load the required packages for our examples:

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

Let’s start with a simple example using the built-in mtcars dataset. We’ll sum the horsepower (hp) by the number of cylinders (cyl):

data(mtcars)

basic_sum <- mtcars |>
  group_by(cyl) |>
  summarise(total_hp = sum(hp))

print(basic_sum)

This code groups cars by their cylinder count and calculates the total horsepower for each group. The group_by() function specifies the grouping variable, while summarise() performs the aggregation using sum().

You can also sum multiple columns simultaneously:

multiple_sums <- mtcars |>
  group_by(cyl) |>
  summarise(
    total_hp = sum(hp),
    total_weight = sum(wt),
    count = n()
  )

print(multiple_sums)

Example 2: Practical Application

Now let’s work with a more complex example using the Palmer penguins dataset. We’ll calculate total body mass by species and island, while also handling missing values:

data(penguins)

penguin_summary <- penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species, island) |>
  summarise(
    total_body_mass = sum(body_mass_g),
    average_body_mass = mean(body_mass_g),
    penguin_count = n(),
    .groups = "drop"
  ) |>
  arrange(desc(total_body_mass))

print(penguin_summary)

For more advanced analysis, you can combine grouping with conditional summation. Here’s how to sum body mass only for penguins above a certain weight threshold:

conditional_sum <- penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  summarise(
    total_heavy_penguins = sum(body_mass_g[body_mass_g > 4000], na.rm = TRUE),
    count_heavy_penguins = sum(body_mass_g > 4000, na.rm = TRUE),
    proportion_heavy = mean(body_mass_g > 4000, na.rm = TRUE)
  )

print(conditional_sum)

You can also create summary statistics that include percentages of the total:

percentage_summary <- penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  summarise(species_total = sum(body_mass_g)) |>
  mutate(
    overall_total = sum(species_total),
    percentage_of_total = round(100 * species_total / overall_total, 1)
  )

print(percentage_summary)

Summary

Summing columns by group in R is straightforward with the tidyverse approach. The key steps are: use group_by() to specify grouping variables, then summarise() with sum() to calculate totals. Remember to handle missing values with na.rm = TRUE when necessary, and use filter() to remove incomplete cases beforehand. The pipe operator |> makes the code readable and easy to follow. These techniques form the foundation for more complex data aggregation tasks and are essential skills for effective data analysis in R.