How to Randomly Replace Values in Numerical Columns of a Data Frame with NAs

dplyr across()
NAs in R
Published

August 16, 2022

Introduction

Randomly replacing values with NAs in numerical columns is a common technique for simulating missing data patterns or testing the robustness of your analysis. This approach is particularly useful when you want to evaluate how your statistical models or data processing pipelines handle incomplete datasets.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We need to randomly introduce missing values into specific numerical columns of our dataset. Let’s start with a simple approach using the penguins dataset.

Step 1: Examine the original data

First, let’s look at our starting dataset to understand its structure.

data(penguins)
penguins |>
  select(bill_length_mm, bill_depth_mm, flipper_length_mm) |>
  head(10)

This shows us the first 10 rows of three numerical columns that we’ll work with.
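Because penguins already contains some missing measurements, it can help to record the baseline NA counts before adding more. A minimal sketch using base R's colSums() and is.na(); the names cols and baseline_nas are illustrative choices, not part of the original example:

```r
library(palmerpenguins)

# Columns used in this example
cols <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm")

# Count the NAs already present in each column before any replacement
baseline_nas <- colSums(is.na(penguins[cols]))
baseline_nas
```

Keeping this baseline makes it easier to tell later which NAs were introduced on purpose and which were in the data from the start.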

Step 2: Set up random sampling parameters

We’ll define what percentage of values should become NA and set a seed for reproducibility.

set.seed(123)
na_proportion <- 0.15  # 15% of values will become NA
n_rows <- nrow(penguins)

With the seed fixed, every run of the code below selects the same rows, and na_proportion controls how many of them there are.

Step 3: Create random NA positions

We’ll generate random row indices where values should be replaced with NA.

# seq_len() is a safer way to write 1:n_rows (it also handles n_rows = 0)
random_indices <- sample(seq_len(n_rows),
                         size = round(n_rows * na_proportion),
                         replace = FALSE)
head(random_indices)

These indices represent the rows where we’ll introduce missing values.
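As a quick sanity check on the draw, the sketch below repeats the setup from Steps 2 and 3 and confirms that the indices are distinct, valid row numbers of the expected length. With 344 rows and a proportion of 0.15, round(344 * 0.15) gives 52 indices. The variable n_expected is an illustrative name.

```r
library(palmerpenguins)

set.seed(123)
na_proportion <- 0.15
n_rows <- nrow(penguins)

random_indices <- sample(seq_len(n_rows),
                         size = round(n_rows * na_proportion),
                         replace = FALSE)

# Number of rows we asked for: round(344 * 0.15) = 52
n_expected <- round(n_rows * na_proportion)

length(random_indices) == n_expected                 # requested size
anyDuplicated(random_indices) == 0                   # no row drawn twice
all(random_indices >= 1 & random_indices <= n_rows)  # valid row numbers
```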

Step 4: Replace values with NAs

Now we’ll apply the NA replacement to a single column using conditional logic.

penguins_modified <- penguins |>
  mutate(bill_length_mm = ifelse(row_number() %in% random_indices, 
                                NA, 
                                bill_length_mm))

The ifelse() function checks whether each row number is in random_indices. If it is, that row's bill_length_mm becomes NA; otherwise the original value is kept.
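To confirm the replacement, one option is to compare NA counts before and after. The sketch below repeats the earlier steps so it runs on its own; na_before and na_after are illustrative names. The increase can be smaller than the number of selected rows, since a selected row may already have been NA.

```r
library(dplyr)
library(palmerpenguins)

set.seed(123)
random_indices <- sample(seq_len(nrow(penguins)),
                         size = round(nrow(penguins) * 0.15))

penguins_modified <- penguins |>
  mutate(bill_length_mm = ifelse(row_number() %in% random_indices,
                                 NA,
                                 bill_length_mm))

# NA counts in the column before and after the replacement
na_before <- sum(is.na(penguins$bill_length_mm))
na_after  <- sum(is.na(penguins_modified$bill_length_mm))

c(before = na_before, after = na_after, selected = length(random_indices))

# Every selected row should now be NA
all(is.na(penguins_modified$bill_length_mm[random_indices]))
```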

Example 2: Practical Application

The Problem

In real-world scenarios, you often need to introduce missing values across several numerical columns at once, simulating realistic data collection issues. Let’s build a more general solution in which each column gets its own random pattern of missing values.

Step 1: Create a function for multiple columns

We’ll build a reusable function that can randomly introduce NAs into any numerical columns.

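introduce_random_nas <- function(data, columns, proportion = 0.1, seed = 42) {
  # Seed the RNG for reproducibility; pass seed = NULL to leave it untouched
  if (!is.null(seed)) set.seed(seed)
  n_rows <- nrow(data)
  n_missing <- round(n_rows * proportion)

  for (col in columns) {
    # Draw a separate random set of rows for each column
    random_rows <- sample(seq_len(n_rows), size = n_missing)
    data[[col]][random_rows] <- NA
  }
  data
}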

This function loops over the specified columns and, for each one, draws its own random set of rows to set to NA, so the missing values fall in different rows from column to column.

Step 2: Apply to multiple numerical columns

Let’s use our function to introduce missing values across several numerical columns.

numerical_cols <- c("bill_length_mm", "bill_depth_mm", 
                   "flipper_length_mm", "body_mass_g")

penguins_with_nas <- penguins |>
  introduce_random_nas(columns = numerical_cols, proportion = 0.12)

Now multiple columns have randomly distributed missing values, simulating real data collection challenges.
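The same idea can also be sketched with dplyr's across(). This is an illustrative alternative, not the function defined above: it decides cell by cell with runif(), so each column ends up with roughly, rather than exactly, the requested share of NAs. The name penguins_across and the reuse of 0.12 as the probability are assumptions made for this example.

```r
library(dplyr)
library(palmerpenguins)

numerical_cols <- c("bill_length_mm", "bill_depth_mm",
                    "flipper_length_mm", "body_mass_g")

set.seed(42)

penguins_across <- penguins |>
  mutate(across(
    all_of(numerical_cols),
    # Each cell independently becomes NA with probability 0.12
    ~ replace(.x, runif(length(.x)) < 0.12, NA)
  ))

# Share of NAs per modified column (close to, but not exactly, 0.12)
penguins_across |>
  summarise(across(all_of(numerical_cols), ~ mean(is.na(.x))))
```

The loop-based function fixes the number of NAs per column, while this version fixes only the probability per cell. Which one fits better depends on whether an exact count matters for the simulation.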

Step 3: Verify the results

Let’s examine how many NAs were introduced and their distribution across columns.

penguins_with_nas |>
  select(all_of(numerical_cols)) |>
  summarise(across(everything(), ~ sum(is.na(.x))))

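This summary shows the count of missing values in each numerical column. With 344 rows and a proportion of 0.12, each count should be at least round(344 * 0.12) = 41; it can be slightly higher because penguins already contains a few missing measurements in these columns.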
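To separate the NAs that were added from those already in the data, the counts can be compared with the original dataset. The sketch below recreates penguins_with_nas with the same sampling logic as introduce_random_nas() so that it runs on its own; n_target and added are illustrative names.

```r
library(palmerpenguins)

numerical_cols <- c("bill_length_mm", "bill_depth_mm",
                    "flipper_length_mm", "body_mass_g")

# Recreate the modified data with the same logic as introduce_random_nas()
set.seed(42)
penguins_with_nas <- penguins
n_target <- round(nrow(penguins) * 0.12)  # round(344 * 0.12) = 41
for (col in numerical_cols) {
  rows <- sample(seq_len(nrow(penguins)), size = n_target)
  penguins_with_nas[[col]][rows] <- NA
}

# NAs added per column = NAs after - NAs already present
added <- colSums(is.na(penguins_with_nas[numerical_cols])) -
  colSums(is.na(penguins[numerical_cols]))

rbind(target = n_target, added = added)
```

A column's added count falls below the target only when one of its sampled rows was already NA in the original data.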

Step 4: Compare before and after

Finally, let’s quantify the impact of the missing values we introduced by counting complete cases.

original_complete <- sum(complete.cases(penguins[numerical_cols]))
modified_complete <- sum(complete.cases(penguins_with_nas[numerical_cols]))

cat("Complete cases before:", original_complete, "\n")
cat("Complete cases after:", modified_complete, "\n")

This comparison helps us understand how the missing data affects the completeness of our dataset.
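The drop in complete cases is usually much larger than 12%, because a row is lost as soon as any one of its four columns is missing. If the columns are treated as independent, the expected share of complete rows is about (1 - 0.12)^4 ≈ 0.60. The sketch below recreates the modified data so it runs on its own and compares that rough expectation with the observed share; expected_share and observed_share are illustrative names, and the independence assumption ignores rows that were incomplete to begin with.

```r
library(palmerpenguins)

numerical_cols <- c("bill_length_mm", "bill_depth_mm",
                    "flipper_length_mm", "body_mass_g")

# Recreate the modified data with the same logic as introduce_random_nas()
set.seed(42)
penguins_with_nas <- penguins
for (col in numerical_cols) {
  rows <- sample(seq_len(nrow(penguins)),
                 size = round(nrow(penguins) * 0.12))
  penguins_with_nas[[col]][rows] <- NA
}

# Rough expectation under independence: 0.88^4, about 0.60
expected_share <- (1 - 0.12)^length(numerical_cols)

# Observed share of rows with no NA in any of the four columns
observed_share <- mean(complete.cases(penguins_with_nas[numerical_cols]))

c(expected = expected_share, observed = observed_share)

# How many rows are missing 0, 1, 2, ... of the four measurements
table(rowSums(is.na(penguins_with_nas[numerical_cols])))
```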

Summary

  • Use sample() and ifelse() for basic random NA introduction in single columns
  • Create reusable functions when working with multiple numerical columns simultaneously
  • Set seeds with set.seed() to ensure reproducible missing data patterns
  • Control the proportion of missing values to match realistic data scenarios
  • Always verify your results by counting NAs and comparing complete cases before and after modification