How to compute proportions with tidyverse
Introduction
Computing proportions is a fundamental task in data analysis that helps you understand the relative frequency of different categories in your dataset. The tidyverse provides elegant tools for calculating proportions through grouping, counting, and mutating operations.
Getting Started
library(tidyverse)
library(palmerpenguins)

Example 1: Basic Proportion Calculation
The Problem
We want to calculate what proportion each penguin species represents in the Palmer Penguins dataset. This involves counting occurrences and converting them to percentages.
Step 1: Count Each Species
First, let’s count how many penguins we have for each species.
penguins |>
  count(species)

This gives us the raw counts for Adelie, Chinstrap, and Gentoo penguins.
Step 2: Calculate Total Count
Next, we need to add the total count to calculate proportions.
species_counts <- penguins |>
  count(species) |>
  mutate(total = sum(n))
species_counts

Now each row shows both the species count and the total number of penguins.
Step 3: Compute Proportions
Finally, we calculate the actual proportions by dividing each count by the total.
species_counts |>
  mutate(proportion = n / total,
         percentage = proportion * 100)

This shows that Adelie penguins make up about 44% of the dataset, while Chinstrap and Gentoo represent roughly 20% and 36%, respectively.
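The three steps above can also be collapsed into a single pipeline. The following is a minimal sketch, assuming R 4.1 or later (for the native pipe) and a recent dplyr; the name species_props is only illustrative and does not appear elsewhere in this tutorial.

```r
library(dplyr)
library(palmerpenguins)

# Count each species, then divide by the grand total in one mutate() call.
# count() returns an ungrouped table here, so sum(n) covers every row.
species_props <- penguins |>
  count(species) |>
  mutate(
    proportion = n / sum(n),
    percentage = round(proportion * 100, 1)
  )

species_props
```

Skipping the intermediate total column keeps the output narrower, at the cost of not showing the denominator explicitly.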
Example 2: Practical Application
The Problem
A marine biologist wants to understand the distribution of penguin species across different islands and calculate what proportion each species-island combination represents. This analysis will help identify which species dominate specific habitats.
Step 1: Create Cross-Tabulation
We start by counting penguins for each species-island combination.
penguin_distribution <- penguins |>
  filter(!is.na(species), !is.na(island)) |>
  count(species, island)
penguin_distribution

This creates a detailed breakdown showing how many penguins of each species live on each island.
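The cross-tabulation above lists only the species-island pairs that actually occur in the data. If the empty pairs should appear as explicit zeros, one possible approach is the .drop argument of count(), which keeps unused factor-level combinations. This is a sketch that assumes species and island are stored as factors (as they are in palmerpenguins); all_combos is an illustrative name.

```r
library(dplyr)
library(palmerpenguins)

# .drop = FALSE keeps every combination of factor levels,
# so pairs that never occur are reported with n = 0.
all_combos <- penguins |>
  count(species, island, .drop = FALSE) |>
  mutate(proportion = n / sum(n))

all_combos
```

The zero rows do not change the other proportions, because they add nothing to the denominator.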
Step 2: Calculate Overall Proportions
Now we calculate what proportion each combination represents of the total population.
penguin_distribution |>
  mutate(total_penguins = sum(n),
         proportion = n / total_penguins,
         percentage = round(proportion * 100, 1))

These results show the relative frequency of each species-island pairing across the entire dataset.
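A quick sanity check for overall proportions is that they add up to 1. The sketch below rebuilds the table from scratch so it runs on its own; overall is an illustrative name, and the comparison uses all.equal() to allow for floating-point error.

```r
library(dplyr)
library(palmerpenguins)

overall <- penguins |>
  count(species, island) |>
  mutate(proportion = n / sum(n))

# Unrounded proportions over the whole table should sum to 1
# (within floating-point tolerance).
isTRUE(all.equal(sum(overall$proportion), 1))
```

Rounded percentages will not necessarily sum to exactly 100, so it is safer to run this check on the unrounded proportion column.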
Step 3: Calculate Within-Species Proportions
We can also calculate proportions within each species to see habitat preferences.
penguin_distribution |>
  group_by(species) |>
  mutate(species_total = sum(n),
         within_species_prop = n / species_total,
         within_species_pct = round(within_species_prop * 100, 1))

This reveals that Adelie penguins are distributed across all three islands, while Chinstrap penguins are found only on Dream Island.
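Within-species percentages can be easier to read as a species-by-island table. One way to get there is tidyr's pivot_wider(); the sketch below assumes tidyr is installed (it is part of the tidyverse), and habitat_table is an illustrative name.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

habitat_table <- penguins |>
  count(species, island) |>
  group_by(species) |>
  mutate(within_species_pct = round(n / sum(n) * 100, 1)) |>
  ungroup() |>
  select(species, island, within_species_pct) |>
  pivot_wider(
    names_from = island,
    values_from = within_species_pct,
    values_fill = 0  # islands where a species was never observed get 0
  )

habitat_table
```

Each row then describes one species, with its percentages across the island columns adding up to roughly 100, apart from rounding.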
Step 4: Calculate Within-Island Proportions
Finally, let’s see what proportion each species represents on each island.
penguin_distribution |>
  group_by(island) |>
  mutate(island_total = sum(n),
         within_island_prop = n / island_total,
         within_island_pct = round(within_island_prop * 100, 1)) |>
  arrange(island, desc(within_island_pct))

This analysis shows which species dominate each island, revealing that Gentoo penguins make up the majority on Biscoe Island.
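To pull out only the most common species on each island, the within-island proportions can be combined with slice_max(). This is a sketch under the same assumptions as the earlier steps; dominant_species is an illustrative name.

```r
library(dplyr)
library(palmerpenguins)

dominant_species <- penguins |>
  count(species, island) |>
  group_by(island) |>
  mutate(within_island_prop = n / sum(n)) |>
  # Keep the row with the largest share within each island
  slice_max(within_island_prop, n = 1) |>
  ungroup()

dominant_species
```

By default slice_max() keeps ties, so an island with two equally common species would return both rows.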
Summary
- Use count() to get frequencies, then mutate() with division to calculate proportions
- Calculate total counts with sum(n) after counting to establish the denominator
- Group by different variables to compute proportions within subsets of your data
- Multiply proportions by 100 and use round() to create readable percentage values
- Combine group_by() with proportion calculations to analyze distributions across multiple dimensions
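The pattern summarized above can be wrapped in a small helper. The function count_with_prop() below is hypothetical (it is not part of dplyr or this tutorial's original code) and is only a sketch of how the count-then-divide steps might be reused.

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical helper: count the given columns, then add a proportion column.
# If `data` is already grouped, sum(n) is computed within each group,
# so the proportions are relative to the group rather than the whole table.
count_with_prop <- function(data, ...) {
  data |>
    count(...) |>
    mutate(proportion = n / sum(n))
}

# Overall proportions per species
overall_props <- penguins |>
  count_with_prop(species)

# Within-island proportions per species
island_props <- penguins |>
  group_by(island) |>
  count_with_prop(species) |>
  ungroup()
```

Letting the caller decide the grouping keeps the helper short: the same function covers both the overall and the within-group cases.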