How to Convert a Numerical Variable into a Categorical Variable in R

Learn how to convert a numerical variable into a categorical variable in R. Step-by-step tutorial with examples.

Published: November 23, 2022

Introduction

Converting numerical variables into categorical variables is a common data preprocessing task that helps create meaningful groups from continuous data. This technique is useful when you want to analyze data by ranges, create ordinal categories, or prepare variables for certain statistical analyses that require categorical inputs.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We need to convert the continuous body mass variable from the penguins dataset into size categories. This will help us analyze penguins by size groups rather than exact weights.

Step 1: Examine the Data

Let’s first look at the distribution of body mass values.

data(penguins)
penguins |>
  select(species, body_mass_g) |>
  summary()

This shows us the range and distribution of body mass, helping us decide on appropriate category boundaries.
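Before choosing breakpoints, it can also help to look at the range, the quartiles, and the number of missing values directly. The following is a minimal sketch, not part of the original walkthrough; the object names `mass_range`, `n_missing`, and `mass_quartiles` are illustrative.

```r
library(palmerpenguins)

# Smallest and largest observed body mass, ignoring missing values
mass_range <- range(penguins$body_mass_g, na.rm = TRUE)

# Number of missing values; these rows stay NA after cut()
n_missing <- sum(is.na(penguins$body_mass_g))

# Quartiles give a quick picture of where most values sit
mass_quartiles <- quantile(penguins$body_mass_g,
                           probs = c(0.25, 0.5, 0.75),
                           na.rm = TRUE)

mass_range
n_missing
mass_quartiles
```

Knowing how many values are missing up front explains why later steps call drop_na() before counting.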

Step 2: Create Categories Using cut()

We’ll divide body mass into three categories: Small, Medium, and Large.

penguins_categorized <- penguins |>
  mutate(size_category = cut(body_mass_g, 
                            breaks = 3, 
                            labels = c("Small", "Medium", "Large")))

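With breaks = 3, cut() splits the observed range of body_mass_g into three equal-width intervals and assigns the specified labels in order, from lowest to highest. Rows with a missing body mass remain NA.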
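To see which intervals sit behind the labels, one option is to run cut() once without labels and inspect the factor levels. This sketch uses a small vector of made-up weights (hypothetical values, not taken from the penguins data) so the binning is easy to follow by hand.

```r
# Toy weights in grams (hypothetical values for illustration)
weights <- c(2700, 3500, 4200, 5000, 6300)

# Without labels, cut() names each level after its interval
default_bins <- cut(weights, breaks = 3)
levels(default_bins)

# The same three bins with readable labels
labelled_bins <- cut(weights, breaks = 3,
                     labels = c("Small", "Medium", "Large"))
table(labelled_bins)
```

The interval width depends only on the minimum and maximum of the data, so a few extreme values can leave some bins nearly empty.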

Step 3: Verify the Results

Let’s check how many penguins fall into each category.

penguins_categorized |>
  count(size_category) |>
  drop_na()

This shows the distribution of penguins across our newly created size categories.
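When verifying, it is often worth keeping missing values visible at least once, since cut() passes NA through unchanged. A minimal base R sketch with made-up values (the vector `masses` is hypothetical):

```r
# Toy body masses with one missing value (hypothetical data)
masses <- c(2900, 3400, NA, 4800, 6100)

sizes <- cut(masses, breaks = 3,
             labels = c("Small", "Medium", "Large"))

# useNA = "ifany" adds an NA column when missing values exist
size_counts <- table(sizes, useNA = "ifany")
size_counts
```

If the NA count matches the number of missing inputs, no observed value fell outside the bins.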

Example 2: Practical Application

The Problem

We want a more nuanced categorization of penguin flipper length that yields biologically meaningful groups. Instead of equal-width intervals, we’ll use percentiles so the groups are balanced in size for statistical analysis.

Step 1: Calculate Percentiles

First, let’s determine the 25th, 50th, and 75th percentiles for flipper length.

flipper_quantiles <- penguins |>
  summarise(
    q25 = quantile(flipper_length_mm, 0.25, na.rm = TRUE),
    q50 = quantile(flipper_length_mm, 0.50, na.rm = TRUE),
    q75 = quantile(flipper_length_mm, 0.75, na.rm = TRUE)
  )
print(flipper_quantiles)

These percentile values will serve as our category boundaries, ensuring roughly equal sample sizes.
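As an alternative sketch, quantile() can return the minimum, the quartiles, and the maximum in one call, and that vector can be passed straight to cut(). The data here are made-up flipper lengths, and the labels "Q1" to "Q4" are illustrative. Note that cut() stops with an error if two quantiles coincide, because breaks must be unique.

```r
# Hypothetical flipper lengths in millimetres, with one missing value
flipper <- c(172, 181, 186, 190, 193, 197, 203, 210, 215, 222, 231, NA)

# Minimum, three quartiles, and maximum in a single call
quartile_breaks <- quantile(flipper, probs = seq(0, 1, by = 0.25),
                            na.rm = TRUE)

# include.lowest = TRUE keeps the minimum value in the first bin
quartile_group <- cut(flipper, breaks = quartile_breaks,
                      include.lowest = TRUE,
                      labels = c("Q1", "Q2", "Q3", "Q4"))

table(quartile_group, useNA = "ifany")
```

Without include.lowest = TRUE, the smallest observation would sit exactly on the lowest break and be returned as NA.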

Step 2: Create Custom Categories

Now we’ll use these percentiles as breakpoints for our categories.

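penguins_flipper <- penguins |>
  mutate(flipper_category = cut(flipper_length_mm,
                               # Quartiles from Step 1 (190, 197, and 213 mm)
                               breaks = c(-Inf, flipper_quantiles$q25,
                                          flipper_quantiles$q50,
                                          flipper_quantiles$q75, Inf),
                               labels = c("Short", "Medium-Short", 
                                        "Medium-Long", "Long")))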

Using -Inf and Inf ensures all values are captured, even outliers beyond our calculated range.
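Values that fall exactly on a breakpoint deserve a second look. By default cut() uses intervals that are closed on the right, so a flipper length of exactly 190 lands in the lower group; setting right = FALSE flips this. A small sketch with made-up lengths (the vector `lengths_mm` is hypothetical):

```r
# Hypothetical flipper lengths, including values exactly on the breaks
lengths_mm <- c(185, 190, 197, 213, 220)
breaks <- c(-Inf, 190, 197, 213, Inf)
labels <- c("Short", "Medium-Short", "Medium-Long", "Long")

# Default: intervals of the form (a, b]
right_closed <- cut(lengths_mm, breaks = breaks, labels = labels)

# right = FALSE: intervals of the form [a, b)
left_closed <- cut(lengths_mm, breaks = breaks, labels = labels,
                   right = FALSE)

data.frame(lengths_mm, right_closed, left_closed)
```

Which convention is appropriate depends on how the categories are defined; the point is to choose it deliberately rather than by default.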

Step 3: Cross-Tabulate with Species

Let’s see how flipper categories relate to penguin species.

penguins_flipper |>
  count(species, flipper_category) |>
  drop_na() |>
  pivot_wider(names_from = flipper_category, 
              values_from = n, 
              values_fill = 0)

This cross-tabulation reveals whether certain species tend to have longer or shorter flippers.
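Because the species differ in sample size, raw counts can be hard to compare. One possible extension, sketched here with the same breakpoints as Step 2, is to compute the share of each flipper category within a species. The names `flipper_props` and `prop` are illustrative.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

flipper_props <- penguins |>
  mutate(flipper_category = cut(flipper_length_mm,
                                breaks = c(-Inf, 190, 197, 213, Inf),
                                labels = c("Short", "Medium-Short",
                                           "Medium-Long", "Long"))) |>
  drop_na(flipper_category) |>
  count(species, flipper_category) |>
  # Share of each category within its species
  group_by(species) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

flipper_props
```

Within each species the proportions sum to 1, which makes the rows directly comparable.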

Step 4: Visualize the Categories

Finally, let’s create a visualization to see our categorization in action.

penguins_flipper |>
  drop_na(flipper_category) |>
  ggplot(aes(x = flipper_category, fill = species)) +
  geom_bar(position = "dodge") +
  labs(title = "Penguin Count by Flipper Category and Species")

Figure: Dodged bar chart of penguin count by flipper length category and species.

This bar chart clearly shows the relationship between our created categories and the original species variable.

Summary

Use cut() with a single number for the breaks argument to create equal-width intervals from numerical data
Specify custom labels to make categories more interpretable than default ranges
Calculate percentiles first when you need balanced group sizes rather than equal intervals
Use -Inf and Inf as boundary values to capture all possible data points including outliers
Always verify your categorization with count() and cross-tabulation to ensure meaningful groups

