Remove rows with missing values using drop_na() in R

A complete guide to removing rows with missing values using drop_na() in R, with practical examples and code.
Published September 16, 2021

Introduction

The drop_na() function from the tidyr package (part of the tidyverse) removes rows containing missing values (NA) from a data frame. It is a staple of data cleaning workflows where an analysis requires complete cases or where missing data would interfere with statistical computations.
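As a quick illustration of the behavior described above, here is a minimal sketch on a small made-up tibble. The toy data and its column names (id, score, group) are hypothetical and exist only for this example.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: row 2 has an NA in score, row 3 has an NA in group
toy <- tibble(
  id    = 1:4,
  score = c(10, NA, 30, 40),
  group = c("a", "b", NA, "d")
)

# With no column arguments, any row holding at least one NA is dropped,
# so only the rows with id 1 and 4 remain
toy_complete <- drop_na(toy)
toy_complete
```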

Getting Started

library(tidyverse)       # attaches tidyr (drop_na) and dplyr (summarise, group_by)
library(palmerpenguins)  # provides the penguins dataset

Example 1: Basic Usage

The Problem

We have a dataset with missing values scattered across different columns, and we need to understand how drop_na() works in its simplest form.

Step 1: Examine the original data

Let’s first look at the penguins dataset to see what missing values exist.

penguins |>
  summarise(across(everything(), ~sum(is.na(.))))

This shows us that five columns contain missing values: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g each have 2 NAs, and sex has 11.

Step 2: Remove all rows with any missing values

The most basic usage removes any row that contains at least one missing value.

clean_penguins <- penguins |>
  drop_na()

nrow(clean_penguins)

This reduces our dataset from 344 rows to 333 rows, removing the 11 rows that contain at least one missing value.
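Before discarding rows, it can be useful to look at what is about to be lost. The sketch below counts the removed rows and pulls them out for inspection. It assumes a dplyr version that provides if_any() (1.0.4 or later), and the object names rows_before, rows_after, and dropped are illustrative.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Row counts before and after dropping incomplete rows
rows_before <- nrow(penguins)
rows_after  <- nrow(drop_na(penguins))
rows_before - rows_after  # number of rows drop_na() removes

# The rows drop_na() would remove: those with an NA in at least one column
dropped <- penguins |>
  filter(if_any(everything(), is.na))

dropped
```

Looking at dropped shows which columns are responsible for each removal, which helps decide whether dropping on every column is really necessary.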

Step 3: Compare before and after

Let’s verify that no missing values remain in our cleaned dataset.

clean_penguins |>
  summarise(across(everything(), ~sum(is.na(.))))

All columns now show 0 missing values, confirming that drop_na() successfully removed incomplete rows.
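The same check can be written as an assertion that stops a script if any NA slips through, rather than relying on reading the output by eye. This is one possible sketch using base R's anyNA() and stopifnot().

```r
library(tidyr)
library(palmerpenguins)

clean_penguins <- drop_na(penguins)

# anyNA() returns TRUE if any cell in the data frame is NA;
# stopifnot() raises an error unless the condition holds
stopifnot(!anyNA(clean_penguins))
```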

Example 2: Practical Application

The Problem

In a real analysis scenario, you might only care about missing values in specific columns that are critical for your analysis. Removing rows based on missing values in irrelevant columns would unnecessarily reduce your sample size.

Step 1: Target specific columns for NA removal

Let’s say we only need complete data for bill measurements, not for sex or other variables.

penguins_bills <- penguins |>
  drop_na(bill_length_mm, bill_depth_mm)

nrow(penguins_bills)

This preserves more data (342 rows) by only removing rows where the specified bill measurement columns have missing values.
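One consequence worth checking: columns that were not named in drop_na() can still contain NAs afterwards. A short sketch that counts the remaining NAs per column with base R's colSums():

```r
library(tidyr)
library(palmerpenguins)

penguins_bills <- penguins |>
  drop_na(bill_length_mm, bill_depth_mm)

# NA count per column after the targeted drop:
# the two bill columns are complete, but untargeted columns such as sex
# may still hold missing values
na_counts <- colSums(is.na(penguins_bills))
na_counts
```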

Step 2: Remove NAs from a single critical column

Sometimes you only need one specific column to be complete.

penguins_mass <- penguins |>
  drop_na(body_mass_g)

nrow(penguins_mass)

This approach keeps 342 rows, removing only the 2 rows where body mass data is missing.
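Columns do not have to be listed one by one. drop_na() accepts tidyselect helpers, so a group of columns can be targeted by a naming pattern. The sketch below assumes the goal is complete data for every measurement whose name ends in "_mm"; the object name penguins_mm is illustrative.

```r
library(tidyr)
library(palmerpenguins)

# Drop rows with an NA in any column whose name ends with "_mm"
# (bill_length_mm, bill_depth_mm, flipper_length_mm)
penguins_mm <- penguins |>
  drop_na(ends_with("_mm"))

nrow(penguins_mm)
```

This is equivalent to naming the three columns explicitly, but it keeps working if further "_mm" columns are added to the data later.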

Step 3: Chain with other data operations

Use drop_na() as part of a larger data pipeline for analysis.

penguin_summary <- penguins |>
  drop_na(bill_length_mm, flipper_length_mm) |>
  group_by(species) |>
  summarise(
    avg_bill = mean(bill_length_mm),
    avg_flipper = mean(flipper_length_mm)
  )

This creates a clean summary table for the variables of interest. Because the rows with missing values are removed first, mean() returns a number for every species instead of NA.
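To see why the drop_na() step matters here, the sketch below runs the same summary without it, then shows a common alternative, na.rm = TRUE. The alternative is not part of the original walkthrough and is included only for comparison; the object names with_na and with_na_rm are illustrative.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Without removing NAs, mean() returns NA for any species
# that has a missing bill length
with_na <- penguins |>
  group_by(species) |>
  summarise(avg_bill = mean(bill_length_mm))

# Alternative: keep every row and let mean() skip NAs column by column
with_na_rm <- penguins |>
  group_by(species) |>
  summarise(
    avg_bill    = mean(bill_length_mm, na.rm = TRUE),
    avg_flipper = mean(flipper_length_mm, na.rm = TRUE)
  )

with_na
with_na_rm
```

The two approaches are not interchangeable in general: drop_na() removes a row if any of the named columns is missing, so every average is computed over the same set of rows, whereas na.rm = TRUE skips missing values separately for each column.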

Step 4: Verify the cleaning worked

Always check that your data cleaning achieved the desired result.

penguin_summary |>
  summarise(across(where(is.numeric), ~sum(is.na(.))))

The summary shows no missing values in our calculated averages, confirming successful data cleaning.

Summary

  • drop_na() without arguments removes any row containing at least one missing value across all columns
  • Specify column names to remove rows only when those particular columns have missing values
  • Targeting specific columns preserves more data than removing all incomplete rows
  • Always verify your data cleaning by checking for remaining missing values after using drop_na()
  • Chain drop_na() with other tidyverse functions for efficient data cleaning pipelines