How to compute proportions with the tidyverse

Learn how to compute proportions with the tidyverse in R. A practical tutorial with worked examples.
Published

November 21, 2024

Introduction

Computing proportions is a fundamental task in data analysis that helps you understand the relative frequency of different categories in your dataset. The tidyverse provides elegant tools for calculating proportions through grouping, counting, and mutating operations.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Proportion Calculation

The Problem

We want to calculate what proportion each penguin species represents in the Palmer Penguins dataset. This involves counting the rows for each species, dividing each count by the total, and optionally scaling the result to a percentage.

Step 1: Count Each Species

First, let’s count how many penguins we have for each species.

penguins |>
  count(species)

This gives us the raw counts for Adelie, Chinstrap, and Gentoo penguins.

Step 2: Calculate Total Count

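Next, we add the total count as a new column so it can serve as the denominator. Inside mutate(), sum(n) adds up all the species counts and repeats that total on every row.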

species_counts <- penguins |>
  count(species) |>
  mutate(total = sum(n))

species_counts

Now each row shows both the species count and the total number of penguins.

Step 3: Compute Proportions

Finally, we calculate the actual proportions by dividing each count by the total.

species_counts |>
  mutate(proportion = n / total,
         percentage = proportion * 100)

This shows that Adelie penguins make up about 44% of the dataset, while Chinstrap and Gentoo represent roughly 20% and 36% respectively.
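The intermediate total column is optional. As a minimal sketch, the three steps above can be collapsed into a single pipeline by dividing by sum(n) directly; the name species_props is only illustrative, and dplyr is loaded on its own here instead of the full tidyverse.

```r
library(dplyr)
library(palmerpenguins)

# Count each species, then divide every count by the sum of all counts
species_props <- penguins |>
  count(species) |>
  mutate(proportion = n / sum(n),
         percentage = round(proportion * 100, 1))

species_props
```

Keeping the separate total column can still be useful when you want the denominator visible in the output.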

Example 2: Practical Application

The Problem

A marine biologist wants to understand the distribution of penguin species across different islands and calculate what proportion each species-island combination represents. This analysis will help identify which species dominate specific habitats.

Step 1: Create Cross-Tabulation

We start by counting penguins for each species-island combination.

penguin_distribution <- penguins |>
  filter(!is.na(species), !is.na(island)) |>
  count(species, island)

penguin_distribution

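This creates a breakdown showing how many penguins of each species were recorded on each island. Note that count() returns only the combinations that occur in the data, so species-island pairs with no penguins do not appear as rows.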
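If a wide, table-like layout is easier to read, one possible approach is to reshape the counts with tidyr's pivot_wider(). This is a sketch rather than part of the original workflow; the name crosstab is illustrative, and values_fill = 0 is used so that combinations absent from the data show up as explicit zeros.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# One row per species, one column per island; missing pairs become 0
crosstab <- penguins |>
  count(species, island) |>
  pivot_wider(names_from = island, values_from = n, values_fill = 0)

crosstab
```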

Step 2: Calculate Overall Proportions

Now we calculate what proportion each combination represents of the total population.

penguin_distribution |>
  mutate(total_penguins = sum(n),
         proportion = n / total_penguins,
         percentage = round(proportion * 100, 1))

These results show the relative frequency of each species-island pairing across the entire dataset.
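A quick sanity check for overall proportions is that they add up to 1. The sketch below recomputes the proportions and sums them; the names overall and total_proportion are illustrative. Rounded percentages, by contrast, may not always add up to exactly 100.

```r
library(dplyr)
library(palmerpenguins)

# Proportion of the whole dataset for each species-island combination
overall <- penguins |>
  count(species, island) |>
  mutate(proportion = n / sum(n))

# The unrounded proportions should sum to 1
overall |>
  summarise(total_proportion = sum(proportion))
```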

Step 3: Calculate Within-Species Proportions

We can also calculate proportions within each species to see habitat preferences.

penguin_distribution |>
  group_by(species) |>
  mutate(species_total = sum(n),
         within_species_prop = n / species_total,
         within_species_pct = round(within_species_prop * 100, 1)) |>
  ungroup()  # drop the grouping so later steps operate on the whole table

This reveals that Adelie penguins are spread across all three islands (Biscoe, Dream, and Torgersen), while in this dataset Chinstrap penguins are found only on Dream Island and Gentoo penguins only on Biscoe Island.
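As an alternative sketch, assuming dplyr 1.1.0 or later is installed, the same within-species proportions can be computed with the .by argument of mutate(). It groups only for that one call, so the result comes back ungrouped. The name within_species is illustrative.

```r
library(dplyr)
library(palmerpenguins)

# .by = species applies the grouping to this mutate() call only
within_species <- penguins |>
  count(species, island) |>
  mutate(within_species_prop = n / sum(n), .by = species)

within_species
```

Within each species the proportions sum to 1, which is a convenient way to confirm the grouping did what you intended.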

Step 4: Calculate Within-Island Proportions

Finally, let’s see what proportion each species represents on each island.

penguin_distribution |>
  group_by(island) |>
  mutate(island_total = sum(n),
         within_island_prop = n / island_total,
         within_island_pct = round(within_island_prop * 100, 1)) |>
  ungroup() |>  # drop the grouping before sorting the full table
  arrange(island, desc(within_island_pct))

This analysis shows which species dominate each island, revealing that Gentoo penguins make up the majority (about 74%) of the penguins recorded on Biscoe Island.
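To pull out only the most common species on each island, one option is to keep the row with the largest within-island proportion per group. This is a minimal sketch; the name dominant is illustrative, and ties (if any) would all be kept.

```r
library(dplyr)
library(palmerpenguins)

# For each island, keep the species with the highest within-island share
dominant <- penguins |>
  count(species, island) |>
  group_by(island) |>
  mutate(within_island_prop = n / sum(n)) |>
  filter(within_island_prop == max(within_island_prop)) |>
  ungroup()

dominant
```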

Summary

  • Use count() to get frequencies, then mutate() with division to calculate proportions
  • Calculate total counts with sum(n) after counting to establish the denominator
  • Group by different variables to compute proportions within subsets of your data
  • Multiply proportions by 100 and use round() to create readable percentage values
  • Combine group_by() with proportion calculations to analyze distributions across multiple dimensions