How to Replace NAs with the Column Mean Using tidyverse

Learn how to replace NAs with the column mean using the tidyverse in this R tutorial, with practical examples and code snippets.
Published: January 20, 2022

Introduction

Replacing missing values (NAs) with the column mean is a common data preprocessing technique in R. It keeps every row in the dataset and leaves each column’s mean unchanged, which makes it convenient for statistical analysis and machine learning workflows. The trade-off is that it shrinks the column’s variance, because every imputed value sits exactly at the mean.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We need to replace missing values in a single numeric column with that column’s mean. This is the fundamental building block for handling NAs in any dataset.

Step 1: Prepare the data and check for missing values

Let’s start by selecting a few columns from the penguins dataset and counting the missing values that are already present.

# Load and examine the data
data <- penguins |>
  select(species, bill_length_mm, bill_depth_mm, body_mass_g)

# Check for existing NAs
sum(is.na(data$bill_length_mm))

The count is 2, so the bill_length_mm column already contains missing values to work with.
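If a dataset has no gaps and you want some for practice, one option is to blank out a few values at random. The sketch below does this on a small made-up tibble; toy and toy_missing are hypothetical names rather than part of the penguins data, and set.seed() is there only to make the random choice repeatable.

```r
library(dplyr)  # attached by library(tidyverse); loaded alone to keep the sketch self-contained

set.seed(42)  # arbitrary seed so the same positions are blanked on every run

# Hypothetical toy data with no missing values
toy <- tibble(x = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5))

# Overwrite two randomly chosen positions with NA
toy_missing <- toy |>
  mutate(x = replace(x, sample(n(), 2), NA))

# Count the NAs that were introduced
sum(is.na(toy_missing$x))
```

Because sample() draws two distinct positions, the count is always 2, whichever rows are picked.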

Step 2: Replace NAs with column mean

Now we’ll replace the missing values using the mutate() and ifelse() functions.

# Replace NAs with column mean
data_clean <- data |>
  mutate(bill_length_mm = ifelse(is.na(bill_length_mm),
                                mean(bill_length_mm, na.rm = TRUE),
                                bill_length_mm))

This code checks each value in bill_length_mm and replaces NAs with the calculated mean of non-missing values.
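To see the replacement on numbers small enough to check by hand, here is a minimal sketch on a made-up tibble. The names toy and toy_clean and the values in them are assumptions for illustration only.

```r
library(dplyr)  # attached by library(tidyverse); loaded alone to keep the sketch self-contained

# Hypothetical toy data: one numeric column with a single NA
toy <- tibble(id = 1:4, x = c(10, NA, 30, 20))

# Same pattern as above: keep existing values, fill NAs with the column mean
toy_clean <- toy |>
  mutate(x = ifelse(is.na(x), mean(x, na.rm = TRUE), x))

toy_clean$x
# [1] 10 20 30 20
```

The mean of the three observed values (10, 30, 20) is 20, and that is the value written into the second row.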

Step 3: Verify the replacement

Let’s confirm that our NA replacement worked correctly.

# Check that NAs are gone
sum(is.na(data_clean$bill_length_mm))

# Compare before and after
cat("Original NAs:", sum(is.na(data$bill_length_mm)), "\n")
cat("After replacement:", sum(is.na(data_clean$bill_length_mm)))

The output confirms that all NAs in the bill_length_mm column have been successfully replaced.
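Counting NAs is one check. Another, sketched here on made-up data, is to compare summary statistics before and after: the mean should be identical, while the standard deviation should drop because the filled-in value adds no spread. The tibble and its values are hypothetical.

```r
library(dplyr)  # attached by library(tidyverse); loaded alone to keep the sketch self-contained

# Hypothetical toy data with one missing value
toy <- tibble(x = c(10, NA, 30, 20))

toy_clean <- toy |>
  mutate(x = ifelse(is.na(x), mean(x, na.rm = TRUE), x))

# The mean is unchanged by mean imputation
mean(toy$x, na.rm = TRUE)  # 20
mean(toy_clean$x)          # 20

# The standard deviation shrinks
sd(toy$x, na.rm = TRUE)    # 10
sd(toy_clean$x)            # about 8.16
```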

Example 2: Practical Application

The Problem

In real-world scenarios, you often need to replace NAs across multiple numeric columns simultaneously. Manually handling each column would be inefficient and error-prone, so we need a scalable approach.

Step 1: Identify numeric columns with missing values

First, let’s examine which columns have missing values and determine our strategy.

# Check NA counts across all numeric columns
penguins |>
  select(where(is.numeric)) |>
  summarise(across(everything(), ~sum(is.na(.))))

This shows us exactly which numeric columns contain missing values and how many.
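The same counting pattern can be tried on a tiny made-up tibble where the answer is obvious. The column names g, x, and y are hypothetical.

```r
library(dplyr)  # attached by library(tidyverse); loaded alone to keep the sketch self-contained

# Hypothetical toy data: x has one NA, y has two, g is not numeric
toy <- tibble(
  g = c("a", "b", "c"),
  x = c(1, NA, 3),
  y = c(NA, NA, 6)
)

# One row of NA counts, one column per numeric variable
na_counts <- toy |>
  select(where(is.numeric)) |>
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
# x = 1, y = 2; g is dropped because it is a character column
```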

Step 2: Replace NAs across multiple columns

We’ll use across() to apply our NA replacement logic to multiple columns at once.

# Replace NAs with column means for all numeric columns
penguins_clean <- penguins |>
  mutate(across(where(is.numeric), 
                ~ifelse(is.na(.),
                       mean(., na.rm = TRUE),
                       .)))

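The across() function applies our replacement logic to all numeric columns, making the code both concise and maintainable. Note that where(is.numeric) also selects integer columns such as body_mass_g and year, and any integer column that receives a mean comes back as a double.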
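One side effect worth knowing about is the type change in integer columns. The sketch below, on a hypothetical tibble, shows an integer column becoming double once a non-integer mean is written into it, while the character column is left alone.

```r
library(dplyr)  # attached by library(tidyverse); loaded alone to keep the sketch self-contained

# Hypothetical toy data: n is integer, w is double, label is character
toy <- tibble(
  label = c("a", "b", "c"),
  n = c(1L, NA, 4L),
  w = c(0.5, NA, 1.5)
)

toy_clean <- toy |>
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))

typeof(toy$n)        # "integer"
typeof(toy_clean$n)  # "double"
toy_clean$n          # 1.0 2.5 4.0
toy_clean$w          # 0.5 1.0 1.5
```

If the integer type matters downstream, rounding the mean and converting back with as.integer() is one option, at the cost of the column mean no longer being preserved exactly.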

Step 3: Create a reusable function

For repeated use, we can create a custom function that encapsulates this logic.

# Create reusable function
replace_na_with_mean <- function(data) {
  data |>
    mutate(across(where(is.numeric),
                  ~ifelse(is.na(.),
                         mean(., na.rm = TRUE),
                         .)))
}

This function can now be applied to any data frame to replace NAs in its numeric columns with the corresponding column means.
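One edge case to keep in mind when applying this to arbitrary data is a numeric column that is entirely NA. There are no observed values to average, so mean(..., na.rm = TRUE) returns NaN and the column ends up filled with NaN. The sketch below reproduces this on hypothetical data using the same replacement logic.

```r
library(dplyr)  # attached by library(tidyverse); loaded alone to keep the sketch self-contained

# Hypothetical toy data: column b is numeric but has no observed values
toy <- tibble(
  a = c(1, NA, 3),
  b = c(NA_real_, NA_real_, NA_real_)
)

out <- toy |>
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))

out$a  # 1 2 3
out$b  # NaN NaN NaN

# is.na() is TRUE for NaN, so an NA count still reports 3 for b
sum(is.na(out$b))
```

Checking for all-NA columns before imputing, or dropping them, avoids this surprise.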

Step 4: Apply and validate the solution

Let’s test our function and verify the results.

# Apply the function
final_data <- replace_na_with_mean(penguins)

# Verify no NAs remain in numeric columns
final_data |>
  select(where(is.numeric)) |>
  summarise(across(everything(), ~sum(is.na(.))))

The verification step confirms that every numeric column now reports zero NAs. Non-numeric columns are left untouched, so the missing values in sex are still present.
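The function above fills every gap with the overall column mean. When groups differ a lot, as penguin species do in body mass, a possible variation is to fill each gap with the mean of its own group by adding group_by() before mutate(). This is a sketch of that variation on made-up data, not something the steps above do; the species labels and mass values are hypothetical.

```r
library(dplyr)  # attached by library(tidyverse); loaded alone to keep the sketch self-contained

# Hypothetical toy data: two groups with very different typical values
toy <- tibble(
  species = c("A", "A", "A", "B", "B", "B"),
  mass = c(10, NA, 20, 100, 110, NA)
)

# Inside a grouped mutate(), mean() is computed separately per group
toy_grouped <- toy |>
  group_by(species) |>
  mutate(mass = ifelse(is.na(mass), mean(mass, na.rm = TRUE), mass)) |>
  ungroup()

toy_grouped$mass
# [1]  10  15  20 100 110 105
```

The overall mean of the observed values would have been 60 for both gaps, which fits neither group well.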

Summary

  • Use ifelse() with mutate() to replace NAs with column means for single columns
  • Combine across() and where(is.numeric) to handle multiple numeric columns simultaneously
  • Always include na.rm = TRUE when calculating the mean; without it, mean() returns NA whenever the column has a missing value, and nothing gets replaced
  • Create reusable functions to standardize your NA replacement workflow across projects
  • Verify your results by checking NA counts before and after replacement