How to compute proportions in R
Introduction
The count() function in dplyr is one of the most useful tools for quickly summarizing categorical data in R. It provides an efficient way to count observations by groups and can be combined with other functions to calculate proportions and percentages. This tutorial shows how to use count() effectively with the Palmer Penguins dataset.
Setup
First, let’s load the required packages and prepare our data:
library(tidyverse)
library(palmerpenguins)

We’ll remove any rows with missing values to ensure clean counts:
penguins_clean <- penguins |>
  drop_na()

This gives us a complete dataset with 333 penguin observations to work with.
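As a quick sanity check on the cleaning step, the sketch below compares row counts before and after drop_na() and shows where the missing values are. It loads tidyr directly instead of the full tidyverse to stay small; the names n_before and n_after are illustrative and not part of the original tutorial.

```r
library(tidyr)           # drop_na()
library(palmerpenguins)  # penguins dataset

penguins_clean <- penguins |>
  drop_na()

# Number of missing values in each column of the raw data
colSums(is.na(penguins))

# Rows before and after dropping incomplete cases
n_before <- nrow(penguins)
n_after <- nrow(penguins_clean)
c(before = n_before, after = n_after, dropped = n_before - n_after)
```

Because drop_na() with no arguments removes a row if any column is missing, it is worth checking that the dropped rows are few enough not to distort the counts that follow.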
Basic Counting
The simplest use of count() is to count observations in a single variable:
penguins_clean |>
  count(species)

This returns a tibble showing how many penguins of each species are in our dataset. The count() function automatically creates a column called n with the counts.
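If the default output is not quite what you need, count() also accepts sort and name arguments. The following is a minimal sketch; the object name species_counts and the column name n_penguins are illustrative choices, not something the tutorial prescribes.

```r
library(dplyr)           # count()
library(tidyr)           # drop_na()
library(palmerpenguins)  # penguins dataset

penguins_clean <- penguins |>
  drop_na()

# sort = TRUE orders rows from most to least frequent;
# name = "..." replaces the default count column name n
species_counts <- penguins_clean |>
  count(species, sort = TRUE, name = "n_penguins")

species_counts
```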
Adding Proportions
To convert counts to proportions, we can add a mutate() step:
penguins_clean |>
  count(species) |>
  mutate(prop = n / sum(n))

This calculates what fraction of all penguins each species represents. The proportions will sum to 1.0.
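Percentages follow directly from the proportions. The sketch below adds a pct column by multiplying by 100 and rounding to one decimal place; the column name pct, the object name species_props, and the rounding precision are assumptions made for illustration.

```r
library(dplyr)           # count(), mutate()
library(tidyr)           # drop_na()
library(palmerpenguins)  # penguins dataset

penguins_clean <- penguins |>
  drop_na()

species_props <- penguins_clean |>
  count(species) |>
  mutate(
    prop = n / sum(n),          # fraction of all penguins
    pct  = round(100 * prop, 1) # same value as a rounded percentage
  )

species_props
```

Rounding is best left to the final display step: rounded percentages may not add up to exactly 100, whereas the unrounded prop column still sums to 1.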
Counting Multiple Variables
You can count combinations of multiple variables by listing them in count():
penguins_clean |>
  count(species, sex)

This shows the count for each combination of species and sex, giving us six rows (3 species × 2 sexes).
Grouped Proportions
To calculate proportions within each species (rather than across all penguins), use group_by():
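penguins_clean |>
  count(species, sex) |>
  group_by(species) |>
  mutate(proportion = n / sum(n))

Now the proportions represent the male/female split within each species, and each species’ proportions sum to 1.0. Note that the result is still grouped by species; add ungroup() at the end if later steps should operate on the whole table.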
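One way to confirm that the grouping did what we intended is to add the proportions back up within each species. This is a minimal sketch; the object names sex_split and check, and the column name total, are illustrative.

```r
library(dplyr)           # count(), group_by(), mutate(), summarise()
library(tidyr)           # drop_na()
library(palmerpenguins)  # penguins dataset

penguins_clean <- penguins |>
  drop_na()

sex_split <- penguins_clean |>
  count(species, sex) |>
  group_by(species) |>
  mutate(proportion = n / sum(n)) |>
  ungroup()  # drop the grouping so later steps see the whole table

# Each species' proportions should add up to 1
check <- sex_split |>
  group_by(species) |>
  summarise(total = sum(proportion))

check
```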
Alternative Approach with .by
dplyr 1.1.0 and later offers a more compact syntax: the .by argument of mutate() groups the data for that single step only:
penguins_clean |>
  count(species, sex) |>
  mutate(proportion = n / sum(n), .by = species)

This produces the same proportions as the previous example, but the result comes back ungrouped, so there is no need for separate group_by() and ungroup() calls.
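To see the difference between the two approaches concretely, the sketch below runs both and compares them. It assumes dplyr 1.1.0 or later for .by; the object names with_group_by and with_by are illustrative.

```r
library(dplyr)           # count(), group_by(), mutate(), is_grouped_df()
library(tidyr)           # drop_na()
library(palmerpenguins)  # penguins dataset

penguins_clean <- penguins |>
  drop_na()

# Persistent grouping: the result stays grouped by species
with_group_by <- penguins_clean |>
  count(species, sex) |>
  group_by(species) |>
  mutate(proportion = n / sum(n))

# Per-operation grouping: the result is a plain, ungrouped tibble
with_by <- penguins_clean |>
  count(species, sex) |>
  mutate(proportion = n / sum(n), .by = species)

is_grouped_df(with_group_by)  # grouping is still attached
is_grouped_df(with_by)        # no grouping to remove

# The computed proportions are the same either way
all.equal(with_group_by$proportion, with_by$proportion)
```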
Overall Proportions vs Grouped Proportions
It’s important to understand the difference between overall and grouped proportions:
# Overall proportions (across all penguins)
penguins_clean |>
  count(species, sex) |>
  mutate(overall_prop = n / sum(n))

This calculates what fraction of the entire dataset each species-sex combination represents. These proportions are smaller than the grouped ones because each count is divided by the total number of penguins (333) rather than by its species total.
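Putting both kinds of proportion in the same table makes the contrast easier to read. The sketch below assumes dplyr 1.1.0 or later for .by; the object name both_props and the column name within_species_prop are illustrative.

```r
library(dplyr)           # count(), mutate()
library(tidyr)           # drop_na()
library(palmerpenguins)  # penguins dataset

penguins_clean <- penguins |>
  drop_na()

both_props <- penguins_clean |>
  count(species, sex) |>
  # Denominator: all penguins in the cleaned data
  mutate(overall_prop = n / sum(n)) |>
  # Denominator: penguins of the same species only
  mutate(within_species_prop = n / sum(n), .by = species)

both_props
```

The overall_prop column sums to 1 across the whole table, while within_species_prop sums to 1 separately for each species, so the two columns answer different questions about the same counts.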
Summary
The count() function is essential for exploratory data analysis in R. Key takeaways:
- Use count() alone for simple frequency tables
- Combine with mutate() to add proportions
- Count multiple variables to examine combinations
- Use .by or group_by() to calculate proportions within groups
- Always consider whether you want overall or grouped proportions for your analysis