dplyr case_when(): creating a new variable from multiple conditions

Published March 17, 2023

Introduction

The case_when() function in dplyr allows you to create new variables based on multiple conditions, similar to a series of if-else statements. It’s particularly useful when you need to categorize data into groups or assign values based on complex logical conditions.
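
To make the comparison with if-else chains concrete, here is a minimal sketch that writes the same three-way classification twice: once with nested if_else() calls and once with case_when(). The vector x and its labels are made up for illustration and are not part of the penguins data.

```r
library(dplyr)

# Hypothetical input values, chosen only to hit each branch once
x <- c(-2, 0, 5)

# Nested if_else(): each extra condition adds another level of nesting
nested <- if_else(x < 0, "negative",
                  if_else(x == 0, "zero", "positive"))

# case_when(): one condition ~ value pair per line, read top to bottom
flat <- case_when(
  x < 0  ~ "negative",
  x == 0 ~ "zero",
  x > 0  ~ "positive"
)

# Both approaches give the same labels for this input
identical(nested, flat)
```

The flat form tends to stay readable as the number of conditions grows, which is the main reason to reach for case_when().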

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We want to categorize penguins from the Palmer Penguins dataset into size groups based on their body mass. This requires checking multiple conditions and assigning appropriate labels.

Step 1: Examine the data

First, let’s look at the body mass distribution to understand our data.

penguins |>
  select(species, body_mass_g) |>
  summary()

This shows the range of body masses and how many values are missing, which helps us choose sensible cutoff points for our categories.

Step 2: Create size categories

Now we’ll use case_when() to create size categories based on body mass.

penguins_sized <- penguins |>
  mutate(
    size_category = case_when(
      body_mass_g < 3500 ~ "Small",
      body_mass_g >= 3500 & body_mass_g < 4500 ~ "Medium",
      body_mass_g >= 4500 ~ "Large"
    )
  )

For each row, case_when() checks the conditions from top to bottom and assigns the value of the first condition that is TRUE. A row that matches no condition, including a row where body_mass_g is NA, gets NA.
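
The effect of this top-to-bottom rule can be sketched on a small vector. The masses below are hypothetical values, not rows from the penguins data. Because earlier conditions win, later conditions only need an upper bound, and a broad condition placed first will shadow the narrower ones after it.

```r
library(dplyr)

# Hypothetical body masses in grams (illustrative only)
mass <- c(3000, 3500, 4200, 4500, 6000)

# Conditions ordered from narrowest to broadest: each value gets
# the first label whose condition is TRUE
size_ordered <- case_when(
  mass < 3500  ~ "Small",
  mass < 4500  ~ "Medium",   # only reached when mass >= 3500
  mass >= 4500 ~ "Large"
)

# Same conditions, but the broad one comes first and shadows "Small"
size_shadowed <- case_when(
  mass < 4500  ~ "Medium",
  mass < 3500  ~ "Small",    # never used: such values already matched above
  mass >= 4500 ~ "Large"
)

size_ordered
size_shadowed
```

In the second version no value is ever labeled "Small", which is why the order of the conditions matters as much as the conditions themselves.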

Step 3: Verify the results

Let’s check our new variable by counting penguins in each category.

penguins_sized |>
  count(size_category, sort = TRUE)

This shows the distribution across size groups. The output also includes an NA row: penguins with a missing body_mass_g match none of the conditions, so their size_category is NA.
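
If missing values should get their own label rather than NA, one option is to test for them explicitly in the first condition. The sketch below uses a small made-up tibble named toy as a stand-in for the penguins data, and the "Unknown" label is an assumption chosen for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data, with one missing body mass
toy <- tibble(body_mass_g = c(3200, 4100, NA, 5000))

toy_sized <- toy |>
  mutate(
    size_category = case_when(
      is.na(body_mass_g)  ~ "Unknown",  # catch missing values before comparing
      body_mass_g < 3500  ~ "Small",
      body_mass_g < 4500  ~ "Medium",
      body_mass_g >= 4500 ~ "Large"
    )
  )

toy_sized |>
  count(size_category)
```

Without the is.na() line, the comparison NA < 3500 evaluates to NA rather than TRUE, so the row would fall through every condition and stay NA.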

Example 2: Practical Application

The Problem

We need to create a comprehensive penguin profile that considers multiple characteristics simultaneously. This involves combining species information with physical measurements to create meaningful categories for research purposes.

Step 1: Create the dataset

Let’s start by selecting the variables we’ll use for our classification.

penguin_data <- penguins |>
  select(species, bill_length_mm, bill_depth_mm, 
         flipper_length_mm, body_mass_g) |>
  filter(!is.na(body_mass_g), !is.na(bill_length_mm))

This gives us clean data with the measurements we need for our complex categorization.

Step 2: Create complex categories

Now we’ll use case_when() with multiple conditions to create research categories.

penguin_profiles <- penguin_data |>
  mutate(
    research_category = case_when(
      species == "Adelie" & body_mass_g > 4000 ~ "Large Adelie",
      species == "Adelie" & body_mass_g <= 4000 ~ "Standard Adelie",
      species == "Gentoo" ~ "Gentoo",
      species == "Chinstrap" & bill_length_mm > 50 ~ "Long-billed Chinstrap",
      TRUE ~ "Other Chinstrap"
    )
  )

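The TRUE ~ "Other Chinstrap" line is a catch-all: it labels every row that matched none of the earlier conditions. Because the Adelie and Gentoo penguins are fully covered above, the only rows left at this point are Chinstraps with a bill length of 50 mm or less.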
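
In dplyr 1.1.0 and later, the same fallback can be written with the .default argument instead of a TRUE condition. The sketch below assumes that version of dplyr is installed and uses a small made-up tibble named toy rather than the real penguin data. The Adelie rule is simplified to a single label to keep the example short.

```r
library(dplyr)

# Hypothetical rows standing in for the penguin data
toy <- tibble(
  species        = c("Adelie", "Gentoo", "Chinstrap", "Chinstrap"),
  bill_length_mm = c(39.1, 47.5, 52.0, 46.0)
)

toy_profiles <- toy |>
  mutate(
    research_category = case_when(
      species == "Adelie" ~ "Adelie",
      species == "Gentoo" ~ "Gentoo",
      species == "Chinstrap" & bill_length_mm > 50 ~ "Long-billed Chinstrap",
      .default = "Other Chinstrap"  # fallback value; needs dplyr >= 1.1.0
    )
  )

toy_profiles
```

Either spelling labels every unmatched row, so a fallback named after one species is only safe when the earlier conditions really do cover everything else.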

Step 3: Add bill characteristics

Let’s add another variable that classifies each bill by its length and depth, using the same thresholds for every species.

final_profiles <- penguin_profiles |>
  mutate(
    bill_type = case_when(
      bill_length_mm > 45 & bill_depth_mm > 18 ~ "Long & Deep",
      bill_length_mm > 45 & bill_depth_mm <= 18 ~ "Long & Narrow",
      bill_length_mm <= 45 & bill_depth_mm > 18 ~ "Short & Deep",
      TRUE ~ "Short & Narrow"
    )
  )

This creates a comprehensive bill classification that works across all penguin species.

Step 4: Analyze the results

Finally, let’s examine our new categories to ensure they make biological sense.

final_profiles |>
  count(species, research_category, bill_type) |>
  arrange(species, desc(n))

This summary helps us verify that our categorization creates meaningful and well-distributed groups for analysis.

Summary

  • case_when() checks conditions from top to bottom and, for each row, uses the first one that is TRUE
  • Use the format condition ~ value for each case, with conditions using standard logical operators
  • Include TRUE ~ "default_value" as the last condition to handle unmatched cases; without it, unmatched rows (including those with NA inputs) become NA
  • Multiple conditions can be combined using & (and) or | (or) operators
  • The function works seamlessly with mutate() to create new variables based on existing data
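
The examples above only combine conditions with &. As a minimal sketch of the | operator, the code below groups two species under one label. The tibble toy and the group labels are hypothetical and serve only as illustration.

```r
library(dplyr)

# Hypothetical rows standing in for the penguin data
toy <- tibble(
  species     = c("Adelie", "Chinstrap", "Gentoo", "Gentoo"),
  body_mass_g = c(3700, 3800, 4400, 5600)
)

toy_groups <- toy |>
  mutate(
    group = case_when(
      # | is TRUE when either comparison is TRUE
      species == "Adelie" | species == "Chinstrap" ~ "Adelie or Chinstrap",
      # & is TRUE only when both comparisons are TRUE
      species == "Gentoo" & body_mass_g > 5000 ~ "Heavy Gentoo",
      TRUE ~ "Other Gentoo"
    )
  )

toy_groups
```

When one column is compared against several values, species %in% c("Adelie", "Chinstrap") is an equivalent and often shorter way to write the first condition.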