How to Replace NAs with Column Mean Using tidyverse
Introduction
Replacing missing values (NAs) with column means is a common data preprocessing technique in R. It keeps every row in the dataset and leaves each column’s mean unchanged, although it does shrink the column’s variance. This makes it a convenient first step for statistical analysis and machine learning workflows.
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The Problem
We need to replace missing values in a single numeric column with that column’s mean. This is the fundamental building block for handling NAs in any dataset.
Step 1: Load data with missing values
Let’s start by selecting a few columns from the penguins dataset, which already contains missing values we can use for demonstration.
# Load and examine the data
data <- penguins |>
  select(species, bill_length_mm, bill_depth_mm, body_mass_g)
# Check for existing NAs
sum(is.na(data$bill_length_mm))
We can see there are already some missing values in the bill_length_mm column.
Step 2: Replace NAs with column mean
Now we’ll replace the missing values using the mutate() and ifelse() functions.
# Replace NAs with column mean
data_clean <- data |>
  mutate(bill_length_mm = ifelse(is.na(bill_length_mm),
                                 mean(bill_length_mm, na.rm = TRUE),
                                 bill_length_mm))
This code checks each value in bill_length_mm and replaces NAs with the calculated mean of non-missing values.
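The same replacement can also be written with dplyr's coalesce(), which keeps the first non-missing value at each position. The sketch below compares the two forms on a small made-up tibble (`toy` and its column `x` are hypothetical, not part of penguins).

```r
library(dplyr)

# Hypothetical toy data, not taken from penguins
toy <- tibble(x = c(1, NA, 3, NA, 5))

toy_filled <- toy |>
  mutate(
    # ifelse() form, as used in this tutorial
    x_ifelse = ifelse(is.na(x), mean(x, na.rm = TRUE), x),
    # coalesce() keeps x where it is present and falls back to the mean otherwise
    x_coalesce = coalesce(x, mean(x, na.rm = TRUE))
  )

toy_filled
```

Both new columns should hold the same values; which form to use is mostly a matter of readability.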
Step 3: Verify the replacement
Let’s confirm that our NA replacement worked correctly.
# Check that NAs are gone
sum(is.na(data_clean$bill_length_mm))
# Compare before and after
cat("Original NAs:", sum(is.na(data$bill_length_mm)), "\n")
cat("After replacement:", sum(is.na(data_clean$bill_length_mm)))
The output confirms that all NAs in the bill_length_mm column have been successfully replaced.
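Counting NAs is one check; it can also be worth comparing summary statistics before and after. The sketch below uses a small made-up vector (hypothetical data) to illustrate that mean imputation leaves the mean unchanged while the standard deviation gets smaller.

```r
library(dplyr)

# Hypothetical toy data with two missing values
toy <- tibble(x = c(2, 4, NA, 6, NA, 8))

filled <- toy |>
  mutate(x = ifelse(is.na(x), mean(x, na.rm = TRUE), x))

# The mean of the observed values is preserved by construction
mean_before <- mean(toy$x, na.rm = TRUE)
mean_after  <- mean(filled$x)

# The spread shrinks, because the imputed values sit exactly at the mean
sd_before <- sd(toy$x, na.rm = TRUE)
sd_after  <- sd(filled$x)

c(mean_before = mean_before, mean_after = mean_after,
  sd_before = sd_before, sd_after = sd_after)
```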
Example 2: Practical Application
The Problem
In real-world scenarios, you often need to replace NAs across multiple numeric columns simultaneously. Manually handling each column would be inefficient and error-prone, so we need a scalable approach.
Step 1: Identify numeric columns with missing values
First, let’s examine which columns have missing values and determine our strategy.
# Check NA counts across all numeric columns
penguins |>
  select(where(is.numeric)) |>
  summarise(across(everything(), ~sum(is.na(.))))
This shows us exactly which numeric columns contain missing values and how many.
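If only the names of the affected columns are needed, the selection predicate can test for both conditions at once. This is a minimal sketch on a made-up tibble (`toy` and its columns are hypothetical), using base R's anyNA() inside where().

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  id     = 1:4,                    # numeric, no NAs
  height = c(1.5, NA, 1.7, 1.6),   # numeric, one NA
  label  = c("a", NA, "c", "d")    # character, so not selected
)

# Keep only columns that are numeric AND contain at least one NA
cols_with_na <- toy |>
  select(where(~ is.numeric(.x) && anyNA(.x))) |>
  names()

cols_with_na
```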
Step 2: Replace NAs across multiple columns
We’ll use across() to apply our NA replacement logic to multiple columns at once.
# Replace NAs with column means for all numeric columns
penguins_clean <- penguins |>
  mutate(across(where(is.numeric),
                ~ifelse(is.na(.),
                        mean(., na.rm = TRUE),
                        .)))
The across() function applies our replacement logic to all numeric columns, making the code both concise and maintainable.
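One side effect worth knowing about: when an integer column contains NAs, the imputed mean is generally not a whole number, so the column comes back as double. In penguins this applies to integer columns such as body_mass_g. The sketch below shows the effect on a made-up tibble (hypothetical data).

```r
library(dplyr)

# Hypothetical toy data with an integer column
toy <- tibble(
  mass  = c(10L, NA, 13L),     # integer column with one NA
  label = c("a", "b", "c")     # non-numeric, left untouched by where(is.numeric)
)

toy_clean <- toy |>
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))

class(toy$mass)        # type before replacement
class(toy_clean$mass)  # type after replacement
toy_clean$mass
```

If whole numbers are required downstream, rounding the result afterwards is one option, at the cost of no longer matching the mean exactly.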
Step 3: Create a reusable function
For repeated use, we can create a custom function that encapsulates this logic.
# Create reusable function
replace_na_with_mean <- function(data) {
  data |>
    mutate(across(where(is.numeric),
                  ~ifelse(is.na(.),
                          mean(., na.rm = TRUE),
                          .)))
}
This function can now be applied to any dataset to replace numeric NAs with column means.
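A possible variation lets the caller choose which columns to fill. The sketch below is not part of the original tutorial: the function name `replace_na_with_mean_in` and its `cols` argument are hypothetical, and the column selection is forwarded to across() with the embrace operator.

```r
library(dplyr)

# Hypothetical variant: `cols` is any tidyselect specification
replace_na_with_mean_in <- function(data, cols) {
  data |>
    mutate(across({{ cols }},
                  ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
}

# Hypothetical toy data
toy <- tibble(a = c(1, NA, 3), b = c(NA, 10, 20))

only_a <- replace_na_with_mean_in(toy, a)                  # fill one column
all_num <- replace_na_with_mean_in(toy, where(is.numeric)) # fill every numeric column
```

Restricting the selection is useful when some numeric columns, such as identifiers or years, should never be imputed.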
Step 4: Apply and validate the solution
Let’s test our function and verify the results.
# Apply the function
final_data <- replace_na_with_mean(penguins)
# Verify no NAs remain in numeric columns
final_data |>
  select(where(is.numeric)) |>
  summarise(across(everything(), ~sum(is.na(.))))
The verification step confirms that our function successfully eliminated all NAs from numeric columns.
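This check matters because there is one case where the replacement cannot succeed: a column with no observed values at all. Its mean is NaN, and is.na() treats NaN as missing, so the NA count stays above zero. The sketch below shows this on a made-up tibble (hypothetical data).

```r
library(dplyr)

# Hypothetical toy data: one normal column, one entirely missing
toy <- tibble(
  ok    = c(1, NA, 3),
  empty = c(NA_real_, NA_real_, NA_real_)
)

toy_clean <- toy |>
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))

# Count remaining missing values per column
remaining <- toy_clean |>
  summarise(across(everything(), ~ sum(is.na(.x))))

remaining
```

A column flagged this way needs a different strategy, such as dropping it or filling it with a domain-specific default.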
Summary
- Use ifelse() with mutate() to replace NAs with column means for single columns
- Combine across() and where(is.numeric) to handle multiple numeric columns simultaneously
- Always include na.rm = TRUE when calculating means to handle existing missing values
- Create reusable functions to standardize your NA replacement workflow across projects
- Verify your results by checking NA counts before and after replacement