dplyr count(): count unique values of a variable

dplyr count()

Master dplyr count() to count unique values of a variable. Complete R tutorial with examples using real datasets.

Published

January 26, 2022

Introduction

The count() function from dplyr is one of the most useful tools for exploratory data analysis in R. It provides a quick and efficient way to count the number of occurrences of unique values within one or more variables in your dataset. This function is particularly valuable when you need to understand the distribution of categorical variables, identify the most common values, or get a quick overview of your data structure.

You’ll find count() especially helpful during initial data exploration, quality checks, or when creating frequency tables for reporting. It’s also commonly used as a preprocessing step before creating visualizations like bar charts or preparing data for statistical analysis.

Getting Started

First, let’s load the required packages. We’ll use the tidyverse for data manipulation and the palmerpenguins dataset for our examples.

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The simplest use of count() is to count occurrences of a single variable. Let’s count the number of penguins by species in the Palmer penguins dataset:

penguins |> 
  count(species)
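
It can help to see what count() is shorthand for. The sketch below uses a small made-up tibble called birds (not the penguins data) and compares count() with the equivalent group_by() plus summarise() pipeline:

```r
library(dplyr)

# Hypothetical toy data, used here only for illustration
birds <- tibble(
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie")
)

# Shorthand: one row per species with its frequency in column n
via_count <- birds |>
  count(species)

# The same table written out explicitly
via_summarise <- birds |>
  group_by(species) |>
  summarise(n = n())
```

Both objects hold one row per species and an integer column n, so count() is mostly a convenience for this very common pattern.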

You can also count multiple variables simultaneously. This creates a frequency table with one row for each combination that occurs in the data:

penguins |> 
  count(species, island)
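
By default the result only has rows for combinations that actually appear in the data. If the counted columns are factors, the .drop argument of count() can keep empty combinations as well. A minimal sketch, assuming a made-up tibble with two factor columns:

```r
library(dplyr)

# Hypothetical toy data with factor columns, used only for illustration
birds <- tibble(
  species = factor(c("Adelie", "Adelie", "Gentoo")),
  island  = factor(c("Dream", "Torgersen", "Biscoe"))
)

# Default behaviour: only the combinations that occur in the data
observed <- birds |>
  count(species, island)

# .drop = FALSE keeps every combination of factor levels,
# reporting n = 0 for the ones with no rows
all_levels <- birds |>
  count(species, island, .drop = FALSE)
```

Here observed has three rows, while all_levels has six (two species levels times three island levels), three of them with a count of zero.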

To sort the results from most to least frequent, set the sort argument to TRUE:

penguins |> 
  count(species, sort = TRUE)
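
Sorting inside count() should give the same table as counting first and then ordering with arrange(desc(n)). A short sketch on made-up data (the birds tibble is hypothetical):

```r
library(dplyr)

# Hypothetical toy data: three Adelie, two Gentoo, one Chinstrap
birds <- tibble(
  species = c("Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo", "Adelie")
)

# Sort while counting
sorted_count <- birds |>
  count(species, sort = TRUE)

# Count, then sort in a separate step
sorted_arrange <- birds |>
  count(species) |>
  arrange(desc(n))
```

The explicit arrange() form is handy when you want a different ordering, for example ascending counts or sorting by a second column.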

If you want to customize the name of the count column (which defaults to “n”), use the name parameter:

penguins |> 
  count(species, name = "total_penguins")

Example 2: Practical Application

Let’s explore a more complex scenario where we analyze penguin populations across different islands and years, keeping only rows with a recorded body mass and sex. This demonstrates how count() integrates with other dplyr functions:

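penguins |> 
  # keep rows with a recorded body mass and sex
  filter(!is.na(body_mass_g), !is.na(sex)) |> 
  # sort = TRUE is not needed here because arrange() reorders the rows below
  count(island, year, species) |> 
  # keep groups with at least 10 penguins
  filter(n >= 10) |> 
  # within each island, list the largest groups first
  arrange(island, desc(n))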

We can also use count() with conditional logic. Here’s how to count penguins by size categories we create on the fly:

penguins |> 
  filter(!is.na(body_mass_g)) |> 
  mutate(size_category = case_when(
    body_mass_g < 3500 ~ "Small",
    body_mass_g < 4500 ~ "Medium",
    TRUE ~ "Large"
  )) |> 
  count(species, size_category, sort = TRUE) |> 
  pivot_wider(names_from = size_category, values_from = n, values_fill = 0)

For percentage calculations, you can combine count() with mutate():

penguins |> 
  count(species) |> 
  mutate(
    percentage = round(n / sum(n) * 100, 1),
    percentage_label = paste0(percentage, "%")
  )
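
The percentages above are shares of the whole table. When you count two variables and want shares within one of them, one option is to group the counted table before calling mutate(). A minimal sketch on made-up data (the birds tibble and the pct_of_island column name are illustrative assumptions):

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
birds <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo"),
  island  = c("Dream", "Dream", "Biscoe", "Biscoe")
)

shares <- birds |>
  count(island, species) |>
  # sum(n) is now computed separately for each island
  group_by(island) |>
  mutate(pct_of_island = round(n / sum(n) * 100, 1)) |>
  ungroup()
```

On this toy data each of the two Biscoe species gets 50 and the single Dream species gets 100, because the denominator is the island total rather than the overall total.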

Summary

The count() function is an essential tool for data exploration and summarization in R. Key takeaways include:

Use count(variable) for basic frequency counts of single variables
Count multiple variables with count(var1, var2) to see every combination present in the data
Add sort = TRUE to automatically order results by frequency
Customize the count column name with the name parameter
Combine with other dplyr functions like filter() and mutate() for more complex analyses
Use with pivot_wider() to create cross-tabulation tables

Remember that count() does not drop missing values: NA is counted as its own group in the counted variables, so filter those rows out or handle missing data explicitly when your analysis requires it.
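
How count() treats missing values is easy to check on a small example. The sketch below uses a made-up tibble with two missing entries (the birds data is hypothetical) and counts it with and without an explicit filter:

```r
library(dplyr)

# Hypothetical toy data with two missing values
birds <- tibble(sex = c("female", "male", NA, "male", NA))

# NA shows up as its own row in the result
with_na <- birds |>
  count(sex)

# Drop missing values first if they should not be counted
without_na <- birds |>
  filter(!is.na(sex)) |>
  count(sex)
```

In with_na the counts still add up to the full five rows, which makes the NA row a quick indicator of how much data is missing for that variable.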
