dplyr if_else(): Create new variable from existing variable
Introduction
The if_else() function in dplyr allows you to create new variables based on logical conditions from existing variables. It’s particularly useful when you need to categorize data, create flags, or transform values based on specific criteria. This function provides a vectorized way to implement conditional logic in your data transformations.
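As a minimal sketch of what "vectorized" means here, the snippet below applies one condition to every element of a vector at once. The vector `x` and its values are hypothetical and are not taken from the penguins data.

```r
library(dplyr)

# Hypothetical toy vector, used only for illustration
x <- c(-2, 0, 3)

# The condition is evaluated once per element, so the result has the same length as x
sign_label <- if_else(x > 0, "positive", "not positive")
sign_label
```

Because the comparison is strict, the element equal to 0 falls into the "not positive" branch.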
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The Problem
We want to create a simple binary categorization of penguins based on their body mass. Specifically, we need to classify penguins as either “heavy” or “light” based on whether they weigh more than 4000 grams.
Step 1: Examine the data structure
First, let’s look at our penguin data to understand what we’re working with.
penguins |>
  select(species, body_mass_g) |>
  head(8)
This shows us the species and body mass columns, giving us a clear view of the data we’ll be transforming.
Step 2: Apply basic if_else logic
Now we’ll create our weight category using if_else() with mutate().
penguins |>
  mutate(weight_category = if_else(body_mass_g > 4000,
                                   "heavy",
                                   "light")) |>
  select(species, body_mass_g, weight_category)
The if_else() function evaluates each penguin’s body mass and assigns “heavy” for masses strictly over 4000 g and “light” for every other recorded mass. A penguin whose body mass is NA gets NA, because the condition itself is NA.
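To check how the cutoff and missing values behave, here is a small sketch on a hypothetical tibble (the name `toy` and its four masses are made up for illustration). One mass sits below the cutoff, one exactly on it, one above it, and one is missing.

```r
library(dplyr)

# Hypothetical masses: below, exactly at, and above the cutoff, plus one missing value
toy <- tibble(body_mass_g = c(3750, 4000, 4675, NA))

toy <- toy |>
  mutate(weight_category = if_else(body_mass_g > 4000, "heavy", "light"))

# Exactly 4000 g is "light" because the comparison is strict; NA stays NA
toy$weight_category
```

The missing mass stays NA because no `missing` value has been supplied yet.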
Step 3: Handle missing values
Let’s improve our code to properly handle any missing values in the dataset.
penguins |>
  mutate(weight_category = if_else(body_mass_g > 4000,
                                   "heavy",
                                   "light",
                                   missing = "unknown")) |>
  select(species, body_mass_g, weight_category) |>
  filter(is.na(body_mass_g) | row_number() <= 5)
The missing argument ensures that penguins with unknown body mass get labeled as “unknown” rather than NA. The final filter() keeps the first five rows plus every row with a missing body mass, so the effect is visible in the output.
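The same pattern can be tried on a small, hypothetical tibble where the result is easy to read by eye. The `id` column and the masses below are invented for this sketch, and the row limit is lowered to 2 so that the filter visibly drops rows.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins; row 4 has no recorded mass
toy <- tibble(
  id = 1:5,
  body_mass_g = c(3750, 4200, 3900, NA, 5100)
)

labelled <- toy |>
  mutate(weight_category = if_else(body_mass_g > 4000,
                                   "heavy",
                                   "light",
                                   missing = "unknown")) |>
  # Keep the first two rows plus every row with a missing mass
  filter(is.na(body_mass_g) | row_number() <= 2)

labelled
```

Rows 1, 2, and 4 remain, and row 4 carries the label "unknown" instead of NA.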
Example 2: Practical Application
The Problem
We’re analyzing penguin populations and need to create a comprehensive classification system. We want to categorize penguins based on multiple criteria: create size categories based on body mass, flag penguins that have reached adult breeding size, and create species-specific classifications that account for different size standards across species.
Step 1: Create multiple size categories
We’ll start by creating detailed size classifications using nested conditions.
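penguins_classified <- penguins |>
  mutate(size_class = if_else(body_mass_g < 3500, "small",
                              if_else(body_mass_g < 4500, "medium",
                                      "large",
                                      missing = "unknown"),
                              missing = "unknown"))
This creates a three-tier classification using nested if_else() calls: masses below 3500 g are “small”, masses from 3500 g up to but not including 4500 g are “medium”, and masses of 4500 g or more are “large”. Penguins with a missing body mass are labeled “unknown”. In this case the outer missing argument is the one that catches them, and the inner one is a harmless safeguard.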
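A quick way to confirm where the boundaries fall is to run the nested logic on hand-picked values. The sketch below uses a hypothetical vector `mass` with values on either side of each cutoff. It also leaves out the inner `missing` on purpose, to illustrate that the outer one is sufficient when every level tests the same column.

```r
library(dplyr)

# Hypothetical masses chosen to sit on either side of each cutoff
mass <- c(3499, 3500, 4499, 4500, NA)

size_class <- if_else(mass < 3500, "small",
                      if_else(mass < 4500, "medium", "large"),
                      missing = "unknown")
size_class
```

The values 3500 and 4499 both land in "medium", 4500 lands in "large", and the NA becomes "unknown".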
Step 2: Create breeding readiness flags
Next, we’ll identify potentially breeding-age penguins based on size thresholds.
penguins_classified <- penguins_classified |>
  mutate(breeding_ready = if_else(body_mass_g >= 3800,
                                  "ready",
                                  "not_ready",
                                  missing = "unknown"))
This flag marks penguins at or above the 3800 g threshold that this example uses for breeding size, so they can be filtered or counted quickly in population studies.
Step 3: Apply species-specific logic
Finally, we’ll create species-specific size categories since different species have different typical weights.
penguins_final <- penguins_classified |>
  mutate(species_size = if_else(
    species == "Adelie" & body_mass_g > 3700, "large_adelie",
    if_else(species == "Gentoo" & body_mass_g > 5000, "large_gentoo",
      if_else(species == "Chinstrap" & body_mass_g > 3700, "large_chinstrap",
        "standard_size",
        missing = "unknown"),
      missing = "unknown"),
    missing = "unknown")) |>
  select(species, body_mass_g, size_class, breeding_ready, species_size)
This creates nuanced categories that respect the natural size differences between penguin species.
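When a condition combines two columns with `&`, the `missing` argument matters at every level. In R, `FALSE & NA` is `FALSE` while `TRUE & NA` is `NA`, so a Gentoo with a missing mass passes the Adelie test as `FALSE` and only becomes `NA` at the Gentoo test. The sketch below, on a hypothetical two-row tibble with only two levels of nesting, compares a version with `missing` only on the outer call against one with `missing` on every call.

```r
library(dplyr)

# Hypothetical rows: one Gentoo with a missing mass, one with a recorded mass
toy <- tibble(
  species = c("Gentoo", "Gentoo"),
  body_mass_g = c(NA, 5200)
)

result <- toy |>
  mutate(
    # missing supplied only on the outer call
    outer_only = if_else(
      species == "Adelie" & body_mass_g > 3700, "large_adelie",
      if_else(species == "Gentoo" & body_mass_g > 5000, "large_gentoo",
              "standard_size"),
      missing = "unknown"),
    # missing supplied on every call, as in the step above
    every_level = if_else(
      species == "Adelie" & body_mass_g > 3700, "large_adelie",
      if_else(species == "Gentoo" & body_mass_g > 5000, "large_gentoo",
              "standard_size",
              missing = "unknown"),
      missing = "unknown")
  )

result
```

In `outer_only` the first row ends up NA, while `every_level` labels it "unknown".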
Step 4: Verify the results
Let’s examine our final classification to ensure it worked correctly.
penguins_final |>
  count(species, species_size) |>
  arrange(species, species_size)
This summary shows how many penguins fall into each species-specific size category, allowing us to validate our classification logic.
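One further sanity check, sketched here on hypothetical data rather than on the penguins themselves, is to confirm that the number of "unknown" labels matches the number of missing inputs. The tibble `toy` and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data; two of the five masses are missing
toy <- tibble(body_mass_g = c(3400, 4100, NA, 5200, NA)) |>
  mutate(size_class = if_else(body_mass_g < 3500, "small",
                              if_else(body_mass_g < 4500, "medium", "large"),
                              missing = "unknown"))

# Tally each label with count(), as in the verification step
label_counts <- toy |> count(size_class)

# The number of "unknown" labels should equal the number of missing masses
n_unknown <- sum(toy$size_class == "unknown")
n_missing <- sum(is.na(toy$body_mass_g))
stopifnot(n_unknown == n_missing)

label_counts
```

If the two counts differ, some NA values slipped through a branch that lacks a `missing` value.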
Summary
- if_else() provides vectorized conditional logic for creating new variables based on existing data
- The basic syntax is if_else(condition, true, false, missing = NULL); with the default, rows where the condition is NA return NA
- Always include the missing argument when working with real-world data that may contain NA values
- You can nest multiple if_else() calls to create complex multi-level categorizations
- Combine if_else() with mutate() to add new classification variables to your datasets