dplyr case_when() to create new variable using multiple conditions
Introduction
The case_when() function in dplyr allows you to create new variables based on multiple conditions, similar to a series of if-else statements. It’s particularly useful when you need to categorize data into groups or assign values based on complex logical conditions.
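To make the comparison with if-else chains concrete, here is a minimal sketch that builds the same three-level label twice, once with nested if_else() calls and once with case_when(). The scores tibble and its cutoffs are hypothetical and are not part of the penguins data.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
scores <- tibble(score = c(35, 62, 88))

graded <- scores |>
  mutate(
    # Nested if_else(): each extra category adds another level of nesting
    grade_nested = if_else(
      score < 50, "Low",
      if_else(score < 75, "Mid", "High")
    ),
    # case_when(): one condition ~ value pair per category
    grade_cw = case_when(
      score < 50 ~ "Low",
      score < 75 ~ "Mid",
      score >= 75 ~ "High"
    )
  )

graded
```

Both columns hold the same labels. The case_when() version stays flat as more categories are added.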
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The Problem
We want to categorize penguins from the Palmer Penguins dataset into size groups based on their body mass. This requires checking multiple conditions and assigning appropriate labels.
Step 1: Examine the data
First, let’s look at the body mass distribution to understand our data.
penguins |>
  select(species, body_mass_g) |>
  summary()
This shows us the range of body masses, helping us decide on appropriate cutoff points for our categories.
Step 2: Create size categories
Now we’ll use case_when() to create size categories based on body mass.
penguins_sized <- penguins |>
  mutate(
    size_category = case_when(
      body_mass_g < 3500 ~ "Small",
      body_mass_g >= 3500 & body_mass_g < 4500 ~ "Medium",
      body_mass_g >= 4500 ~ "Large"
    )
  )
The case_when() function evaluates conditions from top to bottom and assigns the value of the first condition that matches. Rows that match no condition receive NA, which includes penguins with a missing body_mass_g.
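The top-to-bottom rule has two practical consequences, which the sketch below illustrates on a small hypothetical tibble (the masses values are made up for the example). First, later conditions only see rows that earlier ones did not match, so the explicit lower bounds can be dropped. Second, placing a broad condition ahead of a narrower one silently swallows the narrower category.

```r
library(dplyr)

# Hypothetical body masses in grams, including one missing value
masses <- tibble(body_mass_g = c(3000, 4000, 5000, NA))

sized <- masses |>
  mutate(
    # Narrowest cutoff first: the lower bounds are implied by the order
    ordered = case_when(
      body_mass_g < 3500 ~ "Small",
      body_mass_g < 4500 ~ "Medium",
      body_mass_g >= 4500 ~ "Large"
    ),
    # Broad condition first: "Small" can never be reached
    misordered = case_when(
      body_mass_g < 4500 ~ "Medium",
      body_mass_g < 3500 ~ "Small",
      body_mass_g >= 4500 ~ "Large"
    )
  )

sized
```

In the misordered column the 3000 g row is labelled "Medium" because the first condition already matches it. The NA row stays NA in both columns because no comparison with a missing value evaluates to TRUE.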
Step 3: Verify the results
Let’s check our new variable by counting penguins in each category.
penguins_sized |>
  count(size_category, sort = TRUE)
The counts show how the penguins are distributed across the size groups. Penguins with no recorded body mass match none of the conditions, so they appear in a separate NA row.
Example 2: Practical Application
The Problem
We need to create a comprehensive penguin profile that considers multiple characteristics simultaneously. This involves combining species information with physical measurements to create meaningful categories for research purposes.
Step 1: Create the dataset
Let’s start by selecting the variables we’ll use for our classification.
penguin_data <- penguins |>
  select(species, bill_length_mm, bill_depth_mm,
         flipper_length_mm, body_mass_g) |>
  filter(!is.na(body_mass_g), !is.na(bill_length_mm))
This gives us clean data with the measurements we need for our complex categorization.
Step 2: Create complex categories
Now we’ll use case_when() with multiple conditions to create research categories.
penguin_profiles <- penguin_data |>
  mutate(
    research_category = case_when(
      species == "Adelie" & body_mass_g > 4000 ~ "Large Adelie",
      species == "Adelie" & body_mass_g <= 4000 ~ "Standard Adelie",
      species == "Gentoo" ~ "Gentoo",
      species == "Chinstrap" & bill_length_mm > 50 ~ "Long-billed Chinstrap",
      TRUE ~ "Other Chinstrap"
    )
  )
The final TRUE ~ "Other Chinstrap" line serves as a catch-all for any rows that don’t match an earlier condition. Because every Adelie and Gentoo is already handled above and missing values were filtered out in Step 1, only Chinstraps with a bill length of 50 mm or less reach it.
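One caveat with a TRUE catch-all is that it also picks up rows where an earlier comparison evaluated to NA. The sketch below shows this on a hypothetical tibble with one missing bill length, and contrasts it with an explicit is.na() branch. It also uses the .default argument, which assumes dplyr 1.1.0 or later; the "Unmeasured" label is an illustrative choice and does not come from the example above.

```r
library(dplyr)

# Hypothetical rows; the last one has a missing bill length
birds <- tibble(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Chinstrap"),
  bill_length_mm = c(39, 52, 46, NA)
)

labelled <- birds |>
  mutate(
    # TRUE matches every remaining row, including the one with NA
    catch_all = case_when(
      species == "Adelie" ~ "Adelie",
      species == "Chinstrap" & bill_length_mm > 50 ~ "Long-billed Chinstrap",
      TRUE ~ "Other Chinstrap"
    ),
    # Handle missing values first, then fall back via .default
    # (assumes dplyr >= 1.1.0)
    explicit = case_when(
      is.na(bill_length_mm) ~ "Unmeasured",
      species == "Adelie" ~ "Adelie",
      species == "Chinstrap" & bill_length_mm > 50 ~ "Long-billed Chinstrap",
      .default = "Other Chinstrap"
    )
  )

labelled
```

In the catch_all column the unmeasured Chinstrap is labelled "Other Chinstrap", while the explicit column keeps it separate. Filtering missing values beforehand, as in Step 1, avoids the issue altogether.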
Step 3: Add bill characteristics
Let’s add another variable that considers bill proportions across all species.
final_profiles <- penguin_profiles |>
  mutate(
    bill_type = case_when(
      bill_length_mm > 45 & bill_depth_mm > 18 ~ "Long & Deep",
      bill_length_mm > 45 & bill_depth_mm <= 18 ~ "Long & Narrow",
      bill_length_mm <= 45 & bill_depth_mm > 18 ~ "Short & Deep",
      TRUE ~ "Short & Narrow"
    )
  )
This creates a comprehensive bill classification that works across all penguin species.
Step 4: Analyze the results
Finally, let’s examine our new categories to ensure they make biological sense.
final_profiles |>
  count(species, research_category, bill_type) |>
  arrange(species, desc(n))
This summary helps us verify that our categorization creates meaningful and well-distributed groups for analysis.
Summary
- case_when() evaluates conditions sequentially from top to bottom, stopping at the first match
- Use the format condition ~ value for each case, with conditions using standard logical operators
- Include TRUE ~ "default_value" as the last condition to handle unmatched cases
- Multiple conditions can be combined using & (and) or | (or) operators
- The function works seamlessly with mutate() to create new variables based on existing data
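The examples above combine conditions only with &. As a brief sketch of the | operator mentioned in the summary, the following uses a hypothetical tibble whose column names mirror the penguins data; the thresholds and labels are illustrative assumptions. It also uses %in%, a base R operator that is often a more compact alternative to chaining several == comparisons with |.

```r
library(dplyr)

# Hypothetical measurements in the style of the penguins columns
birds <- tibble(
  species = c("Adelie", "Chinstrap", "Gentoo", "Adelie"),
  body_mass_g = c(3400, 3700, 4400, 4700),
  flipper_length_mm = c(185, 212, 208, 190)
)

flagged <- birds |>
  mutate(
    review_flag = case_when(
      # | : either measurement being large is enough
      body_mass_g > 4500 | flipper_length_mm > 210 ~ "Large build",
      # %in% stands in for species == "Adelie" | species == "Chinstrap"
      species %in% c("Adelie", "Chinstrap") ~ "Typical small species",
      TRUE ~ "Unclassified"
    )
  )

flagged
```

The second Adelie is labelled "Large build" rather than "Typical small species" because the first matching condition wins.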