dplyr case_when(): create a new variable using multiple conditions


Learn how to use dplyr's case_when() to create a new variable from multiple conditions, with practical examples and code snippets.

Published

March 17, 2023

Introduction

The case_when() function in dplyr allows you to create new variables based on multiple conditions, similar to a series of if-else statements. It’s particularly useful when you need to categorize data into groups or assign values based on complex logical conditions.
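To make the if-else comparison concrete, here is a minimal sketch that builds the same labels twice, once with nested if_else() calls and once with case_when(). The flipper lengths and the 190/210 cutoffs are made-up values for illustration, not taken from the penguins data.

```r
library(dplyr)

# Hypothetical flipper lengths in millimetres (illustrative values only)
flipper_mm <- c(181, 197, 224)

# Nested if_else(): each extra category adds another level of nesting
nested <- if_else(
  flipper_mm < 190, "Short",
  if_else(flipper_mm < 210, "Medium", "Long")
)

# case_when(): one `condition ~ value` pair per category
flat <- case_when(
  flipper_mm < 190 ~ "Short",
  flipper_mm < 210 ~ "Medium",
  flipper_mm >= 210 ~ "Long"
)

# Both approaches produce the same character vector
identical(nested, flat)
```

The two results match, but the case_when() version stays flat as the number of categories grows.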

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We want to categorize penguins from the Palmer Penguins dataset into size groups based on their body mass. This requires checking multiple conditions and assigning appropriate labels.

Step 1: Examine the data

First, let’s look at the body mass distribution to understand our data.

penguins |>
  select(species, body_mass_g) |>
  summary()

This shows us the range of body masses, helping us decide on appropriate cutoff points for our categories.
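If you would rather derive cutoffs from the data than pick round numbers, one option is to compute quantiles. The sketch below uses a small made-up vector of masses (an assumption, not the real penguins values) and splits it into thirds.

```r
# Hypothetical body masses in grams, including one missing value
mass <- c(2900, 3400, 3700, 4100, 4800, 5600, NA)

# Terciles split the non-missing values into three equally sized groups;
# na.rm = TRUE is needed because quantile() errors on NA by default
cutoffs <- quantile(mass, probs = c(1 / 3, 2 / 3), na.rm = TRUE)
cutoffs
```

The two values returned could then serve as the boundaries between the small, medium, and large groups.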

Step 2: Create size categories

Now we’ll use case_when() to create size categories based on body mass.

penguins_sized <- penguins |>
  mutate(
    size_category = case_when(
      body_mass_g < 3500 ~ "Small",
      body_mass_g >= 3500 & body_mass_g < 4500 ~ "Medium",
      body_mass_g >= 4500 ~ "Large"
    )
  )

The case_when() function evaluates conditions from top to bottom and assigns the value of the first condition that matches. Rows that match no condition, including rows where body_mass_g is NA, get NA.
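Because the first match wins, the order of the conditions matters. The following sketch, using made-up masses, shows that later conditions need no lower bound when the order is right, and that a broad condition placed first hides a narrower one.

```r
library(dplyr)

# Hypothetical body masses in grams (illustrative values only)
mass <- c(3200, 3900, 5100)

# Earlier conditions win, so "Medium" does not need `mass >= 3500`
ordered <- case_when(
  mass < 3500 ~ "Small",
  mass < 4500 ~ "Medium",
  mass >= 4500 ~ "Large"
)

# Broadest condition first: 3200 is also < 4500, so it is labelled
# "Medium" and the "Small" branch is never reached
misordered <- case_when(
  mass < 4500 ~ "Medium",
  mass < 3500 ~ "Small",
  mass >= 4500 ~ "Large"
)

ordered
misordered
```

A useful habit is to list conditions from most specific to most general.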

Step 3: Verify the results

Let’s check our new variable by counting penguins in each category.

penguins_sized |>
  count(size_category, sort = TRUE)

This shows the distribution across size groups. The count also includes an NA row: penguins with a missing body_mass_g match none of the conditions, so case_when() returns NA for them.
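If missing masses should get their own label instead of NA, one approach is to test for them first with is.na(). This is a sketch on a small stand-in data frame; the name toy and the "Unknown" label are assumptions for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data, with one missing mass
toy <- data.frame(body_mass_g = c(3200, NA, 4700))

toy_sized <- toy |>
  mutate(
    size_category = case_when(
      is.na(body_mass_g) ~ "Unknown",  # handle missing values explicitly
      body_mass_g < 3500 ~ "Small",
      body_mass_g < 4500 ~ "Medium",
      body_mass_g >= 4500 ~ "Large"
    )
  )

# Every row now has a non-missing label
toy_sized |>
  count(size_category)
```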

Example 2: Practical Application

The Problem

We need to create a comprehensive penguin profile that considers multiple characteristics simultaneously. This involves combining species information with physical measurements to create meaningful categories for research purposes.

Step 1: Create the dataset

Let’s start by selecting the variables we’ll use for our classification.

penguin_data <- penguins |>
  select(species, bill_length_mm, bill_depth_mm, 
         flipper_length_mm, body_mass_g) |>
  filter(!is.na(body_mass_g), !is.na(bill_length_mm))

This drops rows with a missing body mass or bill length, leaving the measurements we need for the categorization.

Step 2: Create complex categories

Now we’ll use case_when() with multiple conditions to create research categories.

penguin_profiles <- penguin_data |>
  mutate(
    research_category = case_when(
      species == "Adelie" & body_mass_g > 4000 ~ "Large Adelie",
      species == "Adelie" & body_mass_g <= 4000 ~ "Standard Adelie",
      species == "Gentoo" ~ "Gentoo",
      species == "Chinstrap" & bill_length_mm > 50 ~ "Long-billed Chinstrap",
      TRUE ~ "Other Chinstrap"
    )
  )

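The TRUE ~ "Other Chinstrap" line is a catch-all: any row that matches none of the earlier conditions gets this label. That is safe here only because Adelie and Gentoo are fully covered by the earlier conditions and rows with missing measurements were filtered out in Step 1. Otherwise, unmatched rows of any species would be labelled "Other Chinstrap".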
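A more defensive variant names the remaining Chinstrap case explicitly and reserves the catch-all for rows that are truly unexpected. The sketch below runs on a made-up data frame; the "Emperor" row and the "Unclassified" label are hypothetical and exist only to show the fallback firing. In dplyr 1.1.0 and later, the .default argument can be used in place of the TRUE ~ line.

```r
library(dplyr)

# Hypothetical rows; the last species is not covered by any rule
toy <- data.frame(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Emperor"),
  body_mass_g = c(4200, 3600, 3900, 23000),
  bill_length_mm = c(40, 52, 47, 80)
)

toy_profiles <- toy |>
  mutate(
    research_category = case_when(
      species == "Adelie" & body_mass_g > 4000 ~ "Large Adelie",
      species == "Adelie" ~ "Standard Adelie",
      species == "Gentoo" ~ "Gentoo",
      species == "Chinstrap" & bill_length_mm > 50 ~ "Long-billed Chinstrap",
      species == "Chinstrap" ~ "Other Chinstrap",  # named explicitly
      TRUE ~ "Unclassified"                        # genuine fallback
    )
  )

toy_profiles
```

With this layout, an unexpected species shows up as "Unclassified" in a later count() instead of being silently merged into a Chinstrap group.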

Step 3: Add bill characteristics

Let’s add another variable that classifies bills by length and depth, using the same thresholds for all species.

final_profiles <- penguin_profiles |>
  mutate(
    bill_type = case_when(
      bill_length_mm > 45 & bill_depth_mm > 18 ~ "Long & Deep",
      bill_length_mm > 45 & bill_depth_mm <= 18 ~ "Long & Narrow",
      bill_length_mm <= 45 & bill_depth_mm > 18 ~ "Short & Deep",
      TRUE ~ "Short & Narrow"
    )
  )

This creates a comprehensive bill classification that works across all penguin species.

Step 4: Analyze the results

Finally, let’s examine our new categories to ensure they make biological sense.

final_profiles |>
  count(species, research_category, bill_type) |>
  arrange(species, desc(n))

This summary shows how many penguins fall into each combination of categories, which makes empty or very small groups easy to spot before further analysis.

Summary

- case_when() evaluates conditions sequentially from top to bottom, stopping at the first match
- Use the format condition ~ value for each case, with conditions using standard logical operators
- Include TRUE ~ "default_value" as the last condition to handle unmatched cases, which otherwise return NA (in dplyr 1.1.0 and later, the .default argument does the same job)
- Multiple conditions can be combined using & (and) or | (or) operators
- The function works seamlessly with mutate() to create new variables based on existing data
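The examples in this tutorial combine conditions with &. For completeness, here is a minimal sketch of the | (or) operator on a made-up data frame; the 210 mm threshold and the group labels are assumptions for illustration.

```r
library(dplyr)

# Hypothetical rows (illustrative values only)
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  flipper_length_mm = c(185, 220, 212)
)

toy_flagged <- toy |>
  mutate(
    flipper_group = case_when(
      # A row matches if EITHER side of | is TRUE
      species == "Gentoo" | flipper_length_mm > 210 ~ "Long-flippered",
      TRUE ~ "Standard"
    )
  )

toy_flagged
```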

