Remove rows with missing values using drop_na() in R

drop_na R

A complete guide to removing rows with missing values using drop_na() in R, with practical examples and code.

Published: September 16, 2021

Introduction

The drop_na() function from the tidyr package, which is part of the tidyverse, removes rows containing missing values (NA) from a data frame. It is a staple of data cleaning workflows where an analysis requires complete cases or where missing data would interfere with statistical computations.
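As a minimal sketch of the idea before turning to real data, the snippet below applies drop_na() to a tiny made-up data frame. The `toy` data and its column names are hypothetical and exist only for illustration.

```r
library(tidyr)

# Hypothetical toy data: rows 2 and 3 each contain one NA
toy <- data.frame(
  id     = 1:3,
  height = c(1.7, NA, 1.6),
  city   = c("Oslo", "Lima", NA)
)

# With no column arguments, a row is kept only if every column is non-missing
complete_rows <- drop_na(toy)
complete_rows
```

Only the row with `id` 1 survives, because it is the only row with a value in every column.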

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We have a dataset with missing values scattered across different columns, and we need to understand how drop_na() works in its simplest form.

Step 1: Examine the original data

Let’s first look at the penguins dataset to see what missing values exist.

penguins |>
  summarise(across(everything(), ~sum(is.na(.))))

The output shows that five columns contain missing values: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g each have 2 NAs, and sex has 11.
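Per-column NA counts do not tell you how many rows will be dropped, because several NAs can sit in the same row. In penguins, for example, the rows that lack the body measurements also lack sex, so the number of incomplete rows is smaller than the total number of NA cells. The sketch below illustrates the difference with base R on a hypothetical `toy` data frame (the data and column names are made up).

```r
# Hypothetical toy data: row 2 is missing everything, row 3 is missing only sex
toy <- data.frame(
  bill = c(39.1, NA, 40.3, 36.7),
  mass = c(3750, NA, 3250, 3450),
  sex  = c("male", NA, NA, "female")
)

# NAs per column (a base R counterpart of the summarise() call above)
na_per_column <- colSums(is.na(toy))

# Rows with at least one NA: these are the rows drop_na() would remove
n_incomplete <- sum(!complete.cases(toy))

na_per_column
n_incomplete
```

Here the columns hold four NA cells in total, but only two rows are incomplete.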

Step 2: Remove all rows with any missing values

The most basic usage removes any row that contains at least one missing value.

clean_penguins <- penguins |>
  drop_na()

nrow(clean_penguins)

This reduces our dataset from 344 rows to 333 rows, removing the 11 rows that contain at least one missing value.
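Before discarding rows, it can help to look at exactly which ones will go. One possible approach, sketched here on a hypothetical `toy` data frame, is to filter for rows where any column is NA using dplyr's if_any(), and then compare row counts before and after drop_na().

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: rows 2 and 3 are incomplete
toy <- data.frame(
  id   = 1:4,
  bill = c(39.1, NA, 40.3, 36.7),
  sex  = c("male", NA, NA, "female")
)

# Rows that drop_na() would remove: at least one column is NA
dropped <- toy |>
  filter(if_any(everything(), is.na))

# Rows that drop_na() keeps
kept <- toy |>
  drop_na()

# How many rows the cleaning step costs
rows_removed <- nrow(toy) - nrow(kept)
```

Inspecting `dropped` first shows whether the missing values cluster in particular groups, which matters if removing them could bias the analysis.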

Step 3: Compare before and after

Let’s verify that no missing values remain in our cleaned dataset.

clean_penguins |>
  summarise(across(everything(), ~sum(is.na(.))))

All columns now show 0 missing values, confirming that drop_na() successfully removed incomplete rows.
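When a yes/no answer is enough, base R's anyNA() is a more compact check than a per-column summary. The following sketch uses a hypothetical `toy` data frame rather than penguins.

```r
library(tidyr)

# Hypothetical toy data with one NA in each column
toy <- data.frame(
  x = c(1, NA, 3),
  y = c("a", "b", NA)
)

clean <- drop_na(toy)

# TRUE while missing values are present, FALSE once they are gone
has_na_before <- anyNA(toy)
has_na_after  <- anyNA(clean)
```

A check like `stopifnot(!anyNA(clean))` can then act as a guard at the top of an analysis script.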

Example 2: Practical Application

The Problem

In a real analysis scenario, you might only care about missing values in specific columns that are critical for your analysis. Removing rows based on missing values in irrelevant columns would unnecessarily reduce your sample size.

Step 1: Target specific columns for NA removal

Let’s say we only need complete data for bill measurements, not for sex or other variables.

penguins_bills <- penguins |>
  drop_na(bill_length_mm, bill_depth_mm)

nrow(penguins_bills)

This preserves more data (342 rows) by only removing rows where the specified bill measurement columns have missing values.
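Because drop_na() accepts tidyselect syntax for its column arguments, related columns can also be targeted with a helper such as starts_with() instead of being listed by name. The sketch below assumes a hypothetical `toy` data frame that borrows the penguins column names.

```r
library(tidyr)

# Hypothetical toy data: rows 2 and 3 lack a bill measurement, row 4 lacks sex
toy <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3, 36.7),
  bill_depth_mm  = c(18.7, 17.4, NA, 19.3),
  sex            = c("male", "female", "male", NA)
)

# Naming the columns explicitly
by_name <- drop_na(toy, bill_length_mm, bill_depth_mm)

# Selecting the same columns with a tidyselect helper
by_helper <- drop_na(toy, starts_with("bill"))
```

Both calls keep rows 1 and 4. Row 4 stays even though sex is NA, because sex is not among the selected columns.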

Step 2: Remove NAs from a single critical column

Sometimes you only need one specific column to be complete.

penguins_mass <- penguins |>
  drop_na(body_mass_g)

nrow(penguins_mass)

This approach keeps 342 rows, removing only the 2 rows where body mass data is missing.
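For a single column, drop_na() does the same job as filtering on !is.na(). A short comparison on a hypothetical `toy` data frame:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: rows 2 and 4 have no body mass
toy <- data.frame(
  id          = 1:4,
  body_mass_g = c(3750, NA, 3250, NA)
)

# Two ways to keep only rows with a recorded body mass
with_drop_na <- toy |> drop_na(body_mass_g)
with_filter  <- toy |> filter(!is.na(body_mass_g))

# Both approaches keep the same rows
identical(with_drop_na$id, with_filter$id)
```

Which form to use is largely a matter of taste. drop_na() tends to read more clearly when several columns are involved, while filter() is convenient when the NA test is combined with other conditions.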

Step 3: Chain with other data operations

Use drop_na() as part of a larger data pipeline for analysis.

penguin_summary <- penguins |>
  drop_na(bill_length_mm, flipper_length_mm) |>
  group_by(species) |>
  summarise(
    avg_bill = mean(bill_length_mm),
    avg_flipper = mean(flipper_length_mm)
  )

This creates a summary table built only from rows with complete data for the variables of interest. Because the NAs are removed first, mean() returns a number for every species rather than NA.
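To see why the order of operations matters, the sketch below compares three variants on a hypothetical `toy` data frame: no cleaning, drop_na() before summarising, and na.rm = TRUE inside mean(). The data and column names are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: species A has one NA in each measurement column
toy <- data.frame(
  species = c("A", "A", "A", "B", "B"),
  bill    = c(10, NA, 14, 20, 22),
  flipper = c(100, 110, NA, 200, 210)
)

# No cleaning: a single NA makes the whole group mean NA
raw_means <- toy |>
  group_by(species) |>
  summarise(avg_bill = mean(bill), avg_flipper = mean(flipper))

# drop_na() first: only rows complete in BOTH columns contribute
dropped_means <- toy |>
  drop_na(bill, flipper) |>
  group_by(species) |>
  summarise(avg_bill = mean(bill), avg_flipper = mean(flipper))

# na.rm = TRUE: NAs are skipped column by column
narm_means <- toy |>
  group_by(species) |>
  summarise(
    avg_bill    = mean(bill, na.rm = TRUE),
    avg_flipper = mean(flipper, na.rm = TRUE)
  )
```

For species A the two cleaning strategies give different answers: drop_na() averages only the single fully complete row (bill 10, flipper 100), while na.rm = TRUE uses every available value in each column (bill 12, flipper 105). drop_na() is the better fit when all summaries should describe the same set of rows.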

Step 4: Verify the cleaning worked

Always check that your data cleaning achieved the desired result.

penguin_summary |>
  summarise(across(where(is.numeric), ~sum(is.na(.))))

The summary shows no missing values in our calculated averages, confirming successful data cleaning.

Summary

- drop_na() without arguments removes any row containing at least one missing value across all columns
- Specify column names to remove rows only when those particular columns have missing values
- Targeting specific columns preserves more data than removing all incomplete rows
- Always verify your data cleaning by checking for remaining missing values after using drop_na()
- Chain drop_na() with other tidyverse functions for efficient data cleaning pipelines

