dplyr if_else(): Create new variable from existing variable

Learn how to use dplyr's if_else() to create a new variable from an existing one. A complete R tutorial with examples using the palmerpenguins dataset.
Published: November 10, 2022

Introduction

The if_else() function in dplyr creates a new variable by testing a logical condition on existing variables. It’s particularly useful when you need to categorize data, create flags, or transform values based on specific criteria. The function is vectorized: it checks the condition for every element at once and returns one result per element, so it works directly on whole columns.
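To make the vectorized behavior concrete before turning to the dataset, here is a minimal sketch on a plain numeric vector. The vector mass_g and its values are made up for illustration and are not taken from the penguins data.

```r
library(dplyr)

# Hypothetical toy vector of body masses in grams (not from the dataset)
mass_g <- c(3200, 4100, 4000, 5600)

# The condition is checked element by element, and the single-value
# labels are recycled, so the result has one label per input value
labels <- if_else(mass_g > 4000, "heavy", "light")
labels
```

Note that the value of exactly 4000 is labeled "light" because the comparison is strictly greater than.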

Getting Started

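# tidyverse loads dplyr, which provides if_else(), mutate(), and select()
library(tidyverse)
# palmerpenguins provides the penguins dataset used in the examples
library(palmerpenguins)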

Example 1: Basic Usage

The Problem

We want to create a simple binary categorization of penguins based on their body mass. Specifically, we need to classify penguins as either “heavy” or “light” based on whether they weigh more than 4000 grams.

Step 1: Examine the data structure

First, let’s look at our penguin data to understand what we’re working with.

penguins |>
  select(species, body_mass_g) |>
  head(8)

This shows us the species and body mass columns, giving us a clear view of the data we’ll be transforming.

Step 2: Apply basic if_else logic

Now we’ll create our weight category using if_else() with mutate().

penguins |>
  mutate(weight_category = if_else(body_mass_g > 4000, 
                                   "heavy", 
                                   "light")) |>
  select(species, body_mass_g, weight_category)

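The if_else() function evaluates each penguin’s body mass and assigns “heavy” for masses over 4000g, “light” otherwise. Penguins whose body mass is NA get NA in the new column, because the condition itself evaluates to NA for those rows.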

Step 3: Handle missing values

Let’s improve our code to properly handle any missing values in the dataset.

penguins |>
  mutate(weight_category = if_else(body_mass_g > 4000, 
                                   "heavy", 
                                   "light",
                                   missing = "unknown")) |>
  select(species, body_mass_g, weight_category) |>
  filter(is.na(body_mass_g) | row_number() <= 5)

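The missing argument ensures that penguins with unknown body mass get labeled as “unknown” rather than NA. The final filter() keeps the first five rows plus every row with a missing body mass, so the effect is visible in the output.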
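The difference between the two versions can be seen side by side on a small vector. This is a minimal sketch; mass_g is a made-up vector with one missing value, not data from penguins.

```r
library(dplyr)

# Hypothetical masses in grams; the second value is missing
mass_g <- c(3200, NA, 5600)

# Without `missing`, an NA condition produces NA in the result
default_labels <- if_else(mass_g > 4000, "heavy", "light")

# With `missing`, an NA condition produces the supplied label instead
filled_labels <- if_else(mass_g > 4000, "heavy", "light",
                         missing = "unknown")

default_labels
filled_labels
```

Only the element with the NA condition differs between the two results; the other labels are unchanged.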

Example 2: Practical Application

The Problem

We’re analyzing penguin populations and need to create a more detailed classification system. We want to categorize penguins based on multiple criteria: create size categories based on body mass, flag penguins that have reached an assumed breeding size, and create species-specific classifications that account for different size standards across species.

Step 1: Create multiple size categories

We’ll start by creating detailed size classifications using nested conditions.

penguins_classified <- penguins |>
  mutate(size_class = if_else(body_mass_g < 3500, "small",
                             if_else(body_mass_g < 4500, "medium", 
                                    "large",
                                    missing = "unknown"),
                             missing = "unknown"))

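This creates a three-tier classification system using nested if_else() calls. The outer missing = "unknown" labels penguins whose body mass is NA. The inner one is redundant here, since both conditions test the same column, but it is harmless.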
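Nested calls become harder to read as tiers are added. As a possible alternative, the sketch below writes the same three tiers with dplyr's case_when() and checks that both forms agree on a made-up vector. The vector mass_g is hypothetical, and the TRUE ~ catch-all is one way to express the final "otherwise" branch.

```r
library(dplyr)

# Hypothetical masses in grams, including one missing value
mass_g <- c(3000, 4000, 5000, NA)

# Nested form, with a single `missing` on the outer call
nested <- if_else(mass_g < 3500, "small",
                  if_else(mass_g < 4500, "medium", "large"),
                  missing = "unknown")

# Flat form: conditions are tried top to bottom, first match wins,
# so the NA check has to come first
flat <- case_when(
  is.na(mass_g) ~ "unknown",
  mass_g < 3500 ~ "small",
  mass_g < 4500 ~ "medium",
  TRUE ~ "large"
)

identical(nested, flat)
```

Whether the flat form is preferable is a matter of taste; for two outcomes if_else() is usually the shorter choice.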

Step 2: Create breeding readiness flags

Next, we’ll identify potentially breeding-age penguins based on size thresholds.

penguins_classified <- penguins_classified |>
  mutate(breeding_ready = if_else(body_mass_g >= 3800, 
                                  "ready", 
                                  "not_ready",
                                  missing = "unknown"))

This flag marks penguins that weigh at least 3800 g, the threshold this example uses for breeding size, so they can be filtered or counted quickly in later steps.

Step 3: Apply species-specific logic

Finally, we’ll create species-specific size categories since different species have different typical weights.

penguins_final <- penguins_classified |>
  mutate(species_size = if_else(
    species == "Adelie" & body_mass_g > 3700, "large_adelie",
    if_else(species == "Gentoo" & body_mass_g > 5000, "large_gentoo",
           if_else(species == "Chinstrap" & body_mass_g > 3700, "large_chinstrap",
                  "standard_size",
                  missing = "unknown"),
           missing = "unknown"),
    missing = "unknown")) |>
  select(species, body_mass_g, size_class, breeding_ready, species_size)

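This gives each species its own cutoff: 3700 g for Adelie and Chinstrap, and 5000 g for Gentoo. Penguins at or below their species’ cutoff are labeled “standard_size”, and penguins with a missing body mass are labeled “unknown”.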
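When the number of species grows, the cutoffs can also be kept in a small lookup table and joined in, which leaves a single if_else() call. The following is a sketch under stated assumptions: birds is made-up toy data standing in for the penguins columns, species is stored as character rather than as a factor, and the names thresholds and large_cutoff_g are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for the penguins columns used above
birds <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"),
  body_mass_g = c(3900, 3500, 5200, 3600, NA)
)

# One row per species, holding the same cutoffs as the nested version
thresholds <- tibble(
  species        = c("Adelie", "Gentoo", "Chinstrap"),
  large_cutoff_g = c(3700, 5000, 3700)
)

birds_sized <- birds |>
  # Attach each bird's species-specific cutoff
  left_join(thresholds, by = "species") |>
  # One comparison replaces the three nested conditions
  mutate(species_size = if_else(
    body_mass_g > large_cutoff_g,
    paste0("large_", tolower(species)),
    "standard_size",
    missing = "unknown"
  )) |>
  select(-large_cutoff_g)

birds_sized
```

The trade-off is an extra join in exchange for cutoffs that can be edited in one place without touching the conditional logic.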

Step 4: Verify the results

Let’s examine our final classification to ensure it worked correctly.

penguins_final |>
  count(species, species_size) |>
  arrange(species, species_size)

This summary shows how many penguins fall into each species-specific size category, allowing us to validate our classification logic.

Summary

  • if_else() provides vectorized conditional logic for creating new variables based on existing data
  • The basic syntax is if_else(condition, true, false, missing = NULL); with the default missing = NULL, elements where the condition is NA come back as NA
  • Set the missing argument when NA values in real-world data should get an explicit label instead of staying NA
  • You can nest multiple if_else() calls to create complex multi-level categorizations
  • Combine if_else() with mutate() to add new classification variables to your datasets