How to Randomly Replace Values in Numerical Columns of a Data Frame with NAs
Introduction
Randomly replacing values with NAs in numerical columns is a common technique for simulating missing data patterns or testing the robustness of your analysis. This approach is particularly useful when you want to evaluate how your statistical models or data processing pipelines handle incomplete datasets.
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The Problem
We need to randomly introduce missing values into specific numerical columns of our dataset. Let’s start with a simple approach using the penguins dataset.
Step 1: Examine the original data
First, let’s look at our starting dataset to understand its structure.
data(penguins)
penguins |>
  select(bill_length_mm, bill_depth_mm, flipper_length_mm) |>
  head(10)
This shows us the first 10 rows of three numerical columns that we’ll work with.
Step 2: Set up random sampling parameters
We’ll define what percentage of values should become NA and set a seed for reproducibility.
set.seed(123)
na_proportion <- 0.15 # 15% of values will become NA
n_rows <- nrow(penguins)
Now we have a reproducible setup for introducing missing values: the seed fixes the random draw, and the proportion controls how many rows are affected.
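To illustrate why the seed matters, here is a minimal sketch in base R. The range of 100 and the sample size of 15 are arbitrary values chosen for the demonstration and are not taken from the penguins data.

```r
# Resetting the same seed reproduces the same random draw
set.seed(123)
first_draw <- sample(seq_len(100), size = 15)

set.seed(123)
second_draw <- sample(seq_len(100), size = 15)

# TRUE: both draws contain the same indices in the same order
identical(first_draw, second_draw)
```

Without the second call to set.seed(), the two draws would almost certainly differ, which is why the seed is set immediately before the sampling step.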
Step 3: Create random NA positions
We’ll generate random row indices where values should be replaced with NA.
random_indices <- sample(seq_len(n_rows),
                         size = round(n_rows * na_proportion),
                         replace = FALSE)
head(random_indices)
These indices represent the rows where we’ll introduce missing values.
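Before using the indices, it can help to confirm that they behave as expected. The sketch below uses a hypothetical row count of 50 and a proportion of 0.2 (both assumptions for illustration) to check that sampling without replacement yields the requested number of distinct, in-range row numbers.

```r
set.seed(123)
n_demo <- 50  # hypothetical number of rows, for illustration only
idx <- sample(seq_len(n_demo), size = round(n_demo * 0.2), replace = FALSE)

length(idx)                    # 10 indices were drawn
anyDuplicated(idx)             # 0 means no row was picked twice
all(idx %in% seq_len(n_demo))  # TRUE: every index is a valid row number
```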
Step 4: Replace values with NAs
Now we’ll apply the NA replacement to a single column using conditional logic.
penguins_modified <- penguins |>
  mutate(bill_length_mm = ifelse(row_number() %in% random_indices,
                                 NA_real_, # numeric NA keeps the column type explicit
                                 bill_length_mm))
The ifelse() function checks whether each row number is in our random indices and, if so, replaces that row’s value with NA; all other values are kept unchanged.
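The same idea can be expressed in base R with replace(), which may be easier to read when no pipeline is involved. This is a sketch on a hypothetical toy data frame (the names toy, value, and value_with_na are made up for the example), not on the penguins data.

```r
set.seed(123)
# Hypothetical toy data frame standing in for a real dataset
toy <- data.frame(id = 1:20, value = seq(10, 200, by = 10))
na_rows <- sample(seq_len(nrow(toy)), size = round(nrow(toy) * 0.15))

# replace() returns a copy of the vector with the chosen positions set to NA
toy$value_with_na <- replace(toy$value, na_rows, NA)

sum(is.na(toy$value_with_na))  # 3, i.e. 15% of 20 rows
```

The original column is left untouched, which makes it easy to compare the two versions side by side.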
Example 2: Practical Application
The Problem
In real-world scenarios, you often need to introduce missing values across multiple numerical columns simultaneously, simulating realistic data collection issues. Let’s create a more comprehensive solution that handles multiple columns with different missing data patterns.
Step 1: Create a function for multiple columns
We’ll build a reusable function that can randomly introduce NAs into any numerical columns.
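introduce_random_nas <- function(data, columns, proportion = 0.1, seed = 42) {
  set.seed(seed) # default seed keeps results reproducible; pass another to vary them
  n_rows <- nrow(data)
  n_to_replace <- round(n_rows * proportion)
  for (col in columns) {
    # draw a fresh set of rows for each column
    random_rows <- sample(seq_len(n_rows), size = n_to_replace)
    data[[col]][random_rows] <- NA
  }
  return(data)
}
This function iterates through the specified columns and sets a freshly sampled set of rows to NA in each one, so the missing values fall in different rows for each column. The seed is exposed as an argument rather than hard-coded, so repeated calls can produce different patterns when needed.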
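If you would rather not list the columns by hand, one possible variant detects the numeric columns automatically. The helper below, na_in_numeric_cols(), is a hypothetical name introduced for this sketch, and the toy data frame is invented for illustration. It leaves seeding to the caller.

```r
# Hypothetical helper: set a proportion of values to NA in every numeric column
na_in_numeric_cols <- function(data, proportion = 0.1) {
  stopifnot(is.data.frame(data), proportion >= 0, proportion <= 1)
  # keep only the names of columns that hold numeric data
  numeric_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  n_to_replace <- round(nrow(data) * proportion)
  for (col in numeric_cols) {
    rows <- sample(seq_len(nrow(data)), size = n_to_replace)
    data[[col]][rows] <- NA
  }
  data
}

set.seed(42)
toy <- data.frame(
  group = rep(c("a", "b"), each = 10),  # non-numeric, should stay intact
  x = rnorm(20),
  y = runif(20)
)
result <- na_in_numeric_cols(toy, proportion = 0.2)
colSums(is.na(result))  # group: 0, x: 4, y: 4
```

Non-numeric columns such as group are skipped, so categorical identifiers stay complete.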
Step 2: Apply to multiple numerical columns
Let’s use our function to introduce missing values across several numerical columns.
numerical_cols <- c("bill_length_mm", "bill_depth_mm",
                    "flipper_length_mm", "body_mass_g")
penguins_with_nas <- penguins |>
  introduce_random_nas(columns = numerical_cols, proportion = 0.12)
Now multiple columns have randomly distributed missing values, simulating real data collection challenges.
Step 3: Verify the results
Let’s examine how many NAs were introduced and their distribution across columns.
penguins_with_nas |>
  select(all_of(numerical_cols)) |>
  summarise(across(everything(), ~ sum(is.na(.x))))
This summary shows the count of missing values in each numerical column. Because penguins already contains a few missing values in these columns, a count can be slightly higher than the number of rows we sampled.
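A base R alternative for the same check uses colSums() and colMeans() on the logical matrix returned by is.na(). The small data frame here is made up purely to show the pattern.

```r
# Hypothetical toy data with known missing values
toy <- data.frame(a = c(1, NA, 3, NA), b = c(NA, 2, 3, 4))

colSums(is.na(toy))   # counts of NAs per column: a = 2, b = 1
colMeans(is.na(toy))  # proportions of NAs per column: a = 0.50, b = 0.25
```

Proportions are often easier to compare against the target proportion than raw counts.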
Step 4: Compare before and after
Finally, let’s quantify the impact of the introduced missing values by counting complete cases.
original_complete <- sum(complete.cases(penguins[numerical_cols]))
modified_complete <- sum(complete.cases(penguins_with_nas[numerical_cols]))
cat("Complete cases before:", original_complete, "\n")
cat("Complete cases after:", modified_complete, "\n")
This comparison shows how much the introduced NAs reduce the number of fully observed rows in our dataset.
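As a rough guide to what to expect, the fraction of complete rows can be approximated ahead of time. The sketch below assumes the NA positions are chosen independently for each column and that the data had no missing values to begin with, so it is only an approximation for the penguins example. It uses the proportion of 0.12 and the four columns from above.

```r
p <- 0.12  # proportion of rows set to NA in each column
k <- 4     # number of columns receiving NAs

# A row stays complete only if it is spared in all k columns
expected_complete <- (1 - p)^k
round(expected_complete, 3)  # roughly 0.6
```

So even a modest per-column proportion can remove around 40% of complete cases once several columns are affected.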
Summary
- Use sample() and ifelse() for basic random NA introduction in single columns
- Create reusable functions when working with multiple numerical columns simultaneously
- Set seeds with set.seed() to ensure reproducible missing data patterns
- Control the proportion of missing values to match realistic data scenarios
- Always verify your results by counting NAs and comparing complete cases before and after modification