How to Randomly Replace Values of Numerical Columns in a Dataframe with NAs

dplyr across()

NAs in R

Learn how to randomly replace values in the numerical columns of a dataframe with NAs in this R tutorial. Includes practical examples and code snip…

Published

August 16, 2022

Introduction

Randomly replacing values with NAs in numerical columns is a common technique for simulating missing data patterns or testing the robustness of your analysis. This approach is particularly useful when you want to evaluate how your statistical models or data processing pipelines handle incomplete datasets.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We need to randomly introduce missing values into specific numerical columns of our dataset. Let’s start with a simple approach using the penguins dataset.

Step 1: Examine the original data

First, let’s look at our starting dataset to understand its structure.

data(penguins)
penguins |>
  select(bill_length_mm, bill_depth_mm, flipper_length_mm) |>
  head(10)

This shows us the first 10 rows of three numerical columns that we’ll work with.

Step 2: Set up random sampling parameters

We’ll define what percentage of values should become NA and set a seed for reproducibility.

set.seed(123)
na_proportion <- 0.15  # 15% of values will become NA
n_rows <- nrow(penguins)

Setting the seed makes sample() return the same rows on every run, and na_proportion controls how many rows of the chosen column will be affected.
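To make the proportion concrete, the short sketch below works out how many rows a 15% setting translates to for the penguins data. The name n_to_replace is introduced here only for illustration.

```r
library(palmerpenguins)

na_proportion <- 0.15
n_rows <- nrow(penguins)   # penguins has 344 rows

# Number of rows whose value will be set to NA in the chosen column
n_to_replace <- round(n_rows * na_proportion)
n_to_replace               # 344 * 0.15 = 51.6, which rounds to 52
```

Because of the rounding, the realised share of missing values (52 / 344) is close to, but not exactly, 15%.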

Step 3: Create random NA positions

We’ll generate random row indices where values should be replaced with NA.

# seq_len() is safer than 1:n_rows if the data frame ever has zero rows
random_indices <- sample(seq_len(n_rows),
                         size = round(n_rows * na_proportion),
                         replace = FALSE)
head(random_indices)

These indices represent the rows where we’ll introduce missing values.

Step 4: Replace values with NAs

Now we’ll apply the NA replacement to a single column using conditional logic.

penguins_modified <- penguins |>
  mutate(bill_length_mm = ifelse(row_number() %in% random_indices, 
                                NA, 
                                bill_length_mm))

The ifelse() function checks whether each row number appears in random_indices. Matching rows get NA, and all other rows keep their original bill_length_mm value.
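As an alternative to ifelse(), base R's replace() can overwrite the sampled positions directly. The following is a minimal sketch, assuming the same seed and proportion as in the steps above; idx and penguins_alt are names introduced here for illustration.

```r
library(dplyr)
library(palmerpenguins)

set.seed(123)
n_rows <- nrow(penguins)
idx <- sample(seq_len(n_rows), size = round(n_rows * 0.15))

# replace(x, positions, value) returns x with the given positions overwritten
penguins_alt <- penguins |>
  mutate(bill_length_mm = replace(bill_length_mm, idx, NA))

# Check that every sampled row is now missing
all(is.na(penguins_alt$bill_length_mm[idx]))
```

replace() keeps the column's type (here, double), so no explicit NA_real_ is needed.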

Example 2: Practical Application

The Problem

In real-world scenarios, you often need to introduce missing values across multiple numerical columns simultaneously, simulating realistic data collection issues. Let’s create a more comprehensive solution that handles multiple columns and draws a separate random set of rows for each one.

Step 1: Create a function for multiple columns

We’ll build a reusable function that can randomly introduce NAs into any numerical columns.

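introduce_random_nas <- function(data, columns, proportion = 0.1, seed = 42) {
  # Seed once per call; pass seed = NULL to leave the RNG state untouched
  if (!is.null(seed)) set.seed(seed)
  n_rows <- nrow(data)

  for (col in columns) {
    # Draw a fresh set of rows for each column
    random_rows <- sample(seq_len(n_rows), size = round(n_rows * proportion))
    data[[col]][random_rows] <- NA
  }
  data
}

This function iterates through the specified columns and sets a freshly sampled set of rows to NA in each one. Because the seed is set inside the function, repeated calls with the same seed produce the same pattern; pass a different seed (or NULL) to vary it.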
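The same idea can also be written with dplyr's across(), which applies one function to several columns inside mutate(). This is a sketch of one possible approach rather than the only way to do it; na_at_random and penguins_across are hypothetical names introduced for this example.

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical helper: set a random share of one vector's elements to NA
na_at_random <- function(x, proportion = 0.1) {
  n_na <- round(length(x) * proportion)
  x[sample(seq_along(x), size = n_na)] <- NA
  x
}

numerical_cols <- c("bill_length_mm", "bill_depth_mm",
                    "flipper_length_mm", "body_mass_g")

set.seed(42)
penguins_across <- penguins |>
  mutate(across(all_of(numerical_cols),
                \(x) na_at_random(x, proportion = 0.12)))
```

Here the seed is set once outside the helper, so the helper itself stays free of side effects, and sample() is still called separately for each column.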

Step 2: Apply to multiple numerical columns

Let’s use our function to introduce missing values across several numerical columns.

numerical_cols <- c("bill_length_mm", "bill_depth_mm", 
                   "flipper_length_mm", "body_mass_g")

penguins_with_nas <- penguins |>
  introduce_random_nas(columns = numerical_cols, proportion = 0.12)

Now multiple columns have randomly distributed missing values, simulating real data collection challenges.

Step 3: Verify the results

Let’s examine how many NAs were introduced and their distribution across columns.

penguins_with_nas |>
  select(all_of(numerical_cols)) |>
  summarise(across(everything(), ~sum(is.na(.))))

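This summary shows the count of missing values in each numerical column. Note that penguins already has 2 missing values in each of these columns, so a count can be slightly higher than round(344 * 0.12) = 41 whenever the sampled rows do not include those two.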
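To separate newly introduced NAs from those already in the data, we can compare per-column counts before and after. The sketch below rebuilds a modified copy with a plain loop so that it runs on its own; penguins_na, before, and after are illustrative names.

```r
library(palmerpenguins)

numerical_cols <- c("bill_length_mm", "bill_depth_mm",
                    "flipper_length_mm", "body_mass_g")

# Rebuild a modified copy so this snippet is self-contained
set.seed(42)
penguins_na <- penguins
n_rows <- nrow(penguins_na)
for (col in numerical_cols) {
  rows <- sample(seq_len(n_rows), size = round(n_rows * 0.12))
  penguins_na[[col]][rows] <- NA
}

# Per-column NA counts in the original and the modified data
before <- colSums(is.na(penguins[numerical_cols]))
after  <- colSums(is.na(penguins_na[numerical_cols]))

data.frame(before, after, newly_missing = after - before)
```

The newly_missing column can fall slightly below the number of sampled rows when a sampled row was already NA in the original data.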

Step 4: Compare before and after

Finally, let’s quantify the impact of our missing data introduction by counting complete cases.

original_complete <- sum(complete.cases(penguins[numerical_cols]))
modified_complete <- sum(complete.cases(penguins_with_nas[numerical_cols]))

cat("Complete cases before:", original_complete, "\n")
cat("Complete cases after:", modified_complete, "\n")

This comparison helps us understand how the missing data affects the completeness of our dataset.
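As a rough sanity check on the complete-case count, we can estimate what share of rows should survive. Assuming each of k columns loses a fraction p of its values independently of the others, a row stays complete with probability (1 - p)^k. This is only an approximation, since the real data also contains a few pre-existing NAs and the number of sampled rows is rounded; the variable names below are illustrative.

```r
p <- 0.12   # proportion of values removed per column
k <- 4      # number of columns modified

# Expected share of rows with no missing value in any of the k columns
expected_share <- (1 - p)^k
expected_share   # 0.88^4, roughly 0.60
```

In other words, removing 12% of values from each of four columns is expected to leave only about 60% of rows complete, which is why complete-case analysis can lose data quickly.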

Summary

Use sample() and ifelse() for basic random NA introduction in single columns
Create reusable functions when working with multiple numerical columns simultaneously
Set seeds with set.seed() to ensure reproducible missing data patterns
Control the proportion of missing values to match realistic data scenarios
Always verify your results by counting NAs and comparing complete cases before and after modification
