How to count the number of missing values per row in a dataframe

dplyr rowwise()

Learn how to count the number of missing values per row in a dataframe with this R tutorial. Includes practical examples and code snippets.

Published

October 13, 2022

Introduction

Counting missing values per row is a common data cleaning task that helps identify which observations have incomplete data. This technique is particularly useful when deciding whether to remove rows with too many missing values or when creating data quality reports.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We need to count how many missing values (NA) exist in each row of our dataset. This helps us understand the completeness of our data at the observation level.

Step 1: Create sample data with missing values

Let’s start by creating a dataset with some missing values to work with.

# Create sample data with missing values
sample_data <- penguins |>
  slice(1:10) |>
  mutate(
    bill_length_mm = ifelse(row_number() %in% c(2, 5), NA, bill_length_mm),
    body_mass_g = ifelse(row_number() %in% c(2, 7), NA, body_mass_g)
  )

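This creates a 10-row sample from the penguins dataset and sets bill_length_mm to NA in rows 2 and 5 and body_mass_g to NA in rows 2 and 7, in addition to any NAs already present in those rows of penguins.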
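To confirm where the missing values ended up, one option is to tabulate NAs per column. The sketch below uses a small made-up tibble (`toy`, not part of the original example) so the result is easy to check by eye; the same `colSums(is.na())` call should also work on `sample_data`.

```r
library(dplyr)

# Hypothetical toy data standing in for sample_data
toy <- tibble(
  id = 1:5,
  bill = c(39.1, 39.5, 40.3, 36.7, 39.3),
  mass = c(3750L, 3800L, 3250L, 3450L, 3650L)
)

# Same pattern as above: blank out chosen rows by position
toy_na <- toy |>
  mutate(
    bill = ifelse(row_number() %in% c(2, 5), NA, bill),
    mass = ifelse(row_number() %in% c(2, 4), NA, mass)
  )

# Number of NAs in each column
na_per_column <- colSums(is.na(toy_na))
print(na_per_column)
```

Here `id` has no missing values, while `bill` and `mass` each have two.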

Step 2: Count missing values using rowwise()

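We’ll use rowwise() to process one row at a time. Note that c_across() combines the selected values with vctrs::vec_c(), so c_across(everything()) fails on penguins, where factor columns sit next to numeric ones (a "Can't combine" error). Applying is.na() to the one-row tibble returned by pick() (dplyr 1.1.0 or later) avoids the type conflict.

# Count missing values per row
result <- sample_data |>
  rowwise() |>
  # pick() returns the current row as a one-row tibble of any column types
  mutate(missing_count = sum(is.na(pick(everything())))) |>
  ungroup()

print(result$missing_count)

This approach counts missing values across all columns for each row, storing the result in a new column.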
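Because rowwise() evaluates the expression once per row, it can become slow on large data frames. A vectorized alternative, shown here as a minimal base R sketch on a made-up data frame (`toy` is hypothetical), is to call `rowSums()` on the logical matrix that `is.na()` returns.

```r
# Hypothetical toy data frame with known NA positions
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", NA, NA),
  c = c(TRUE, FALSE, NA)
)

# is.na() on a data frame returns a logical matrix;
# rowSums() adds up the TRUEs in each row
toy$missing_count <- rowSums(is.na(toy))
print(toy$missing_count)
```

Row 1 has no missing values, and rows 2 and 3 have two each. This version also handles mixed column types without any extra work.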

Step 3: Select specific columns for counting

Often you only want to count missing values in specific columns of interest.

# Count missing values in specific columns only
result_selective <- sample_data |>
  rowwise() |>
  mutate(
    missing_count = sum(is.na(c_across(bill_length_mm:body_mass_g)))
  ) |>
  ungroup()

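This counts missing values only in the four measurement columns from bill_length_mm through body_mass_g (bill length, bill depth, flipper length, and body mass), excluding the categorical variables and year.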
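Columns can also be chosen by type rather than by name. The following is a sketch on a made-up tibble (`toy` and `missing_numeric` are hypothetical names) that assumes dplyr 1.1.0 or later for `pick()`; it counts NAs only in numeric columns.

```r
library(dplyr)

# Hypothetical toy data: one character column, two numeric columns
toy <- tibble(
  species = c("Adelie", NA, "Gentoo"),
  bill = c(39.1, NA, 46.1),
  mass = c(3750, NA, NA)
)

# Select columns by type instead of by name, then count NAs per row
toy_counts <- toy |>
  mutate(missing_numeric = rowSums(is.na(pick(where(is.numeric)))))

print(toy_counts$missing_numeric)
```

The NA in `species` is ignored because that column is not numeric. On penguins, `where(is.numeric)` would also pick up `year`, so a name-based range may be the better fit there.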

Example 2: Practical Application

The Problem

You’re analyzing the penguins dataset for a research project and need to identify rows with multiple missing measurements. Rows with more than 2 missing values should be flagged for review or potential removal.

Step 1: Count missing values and create quality flags

We’ll count missing values and create a data quality flag for further analysis.

# Analyze data quality across the full dataset
penguins_quality <- penguins |>
  rowwise() |>
  mutate(
    missing_count = sum(is.na(c_across(bill_length_mm:body_mass_g))),
    quality_flag = case_when(
      missing_count == 0 ~ "Complete",
      missing_count <= 2 ~ "Acceptable",  # 1 or 2 missing
      TRUE ~ "Poor"                       # more than 2 missing
    )
  ) |>
  ungroup()

This adds two columns: missing_count, the number of NAs among the four measurement columns (0 to 4), and quality_flag, a label derived from that count.
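The order of the conditions matters, because `case_when()` returns the value for the first condition that is TRUE. A small sketch with made-up counts (the `counts` vector is hypothetical) shows how each value maps to a flag.

```r
library(dplyr)

# Hypothetical counts covering every branch
counts <- c(0, 1, 2, 3, 4)

flags <- case_when(
  counts == 0 ~ "Complete",    # checked first
  counts <= 2 ~ "Acceptable",  # reached only when counts is 1 or 2
  TRUE ~ "Poor"                # everything else, i.e. more than 2
)

print(flags)
```

A count of 0 also satisfies `counts <= 2`, but it is labeled "Complete" because that condition comes first.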

Step 2: Summarize data quality patterns

Let’s examine the distribution of missing value patterns across our dataset.

# Summarize quality patterns
quality_summary <- penguins_quality |>
  count(missing_count, quality_flag) |>
  arrange(missing_count)

print(quality_summary)

This summary shows how many rows fall into each missing value category.
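Raw counts are often easier to read next to their share of the total. As a sketch, assuming a made-up input (`toy_quality` is hypothetical and stands in for `penguins_quality`), a proportion column can be added after `count()`.

```r
library(dplyr)

# Hypothetical stand-in for penguins_quality
toy_quality <- tibble(
  missing_count = c(0, 0, 0, 1, 4),
  quality_flag = c("Complete", "Complete", "Complete", "Acceptable", "Poor")
)

# count() stores the group sizes in a column named n
toy_summary <- toy_quality |>
  count(missing_count, quality_flag) |>
  mutate(share = n / sum(n))

print(toy_summary)
```

The `share` column sums to 1 across the rows of the summary.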

Step 3: Filter and examine problematic rows

Finally, let’s identify and examine the rows that need attention.

# Examine rows with poor data quality
poor_quality_rows <- penguins_quality |>
  filter(quality_flag == "Poor") |>
  select(species, island, missing_count, everything())

print(poor_quality_rows)

This filters to show only the problematic rows, making it easy to decide on appropriate handling strategies.
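If the decision is to drop rows above a threshold, filtering on the count directly keeps the cutoff in one place. This is a minimal sketch on made-up data; `toy` and `max_missing` are hypothetical names, and the threshold of 2 simply mirrors the rule used in this example.

```r
library(dplyr)

# Hypothetical rows with precomputed missing counts
toy <- tibble(
  id = 1:4,
  missing_count = c(0, 1, 3, 4)
)

# Single place to change the cutoff
max_missing <- 2

kept <- toy |> filter(missing_count <= max_missing)
dropped <- toy |> filter(missing_count > max_missing)

print(kept$id)
print(dropped$id)
```

Keeping the dropped rows in their own object makes it easy to report what was removed.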

Summary

Use rowwise() with c_across() to count missing values across multiple columns of compatible types
Apply sum(is.na()) to count NAs per row; sum() treats each TRUE as 1 and each FALSE as 0
Use column selection within c_across() to focus on specific variables of interest
Create quality flags based on missing value thresholds to categorize data completeness
Combine counting with filtering and summarizing to develop comprehensive data quality workflows
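The points above can be wrapped in a small reusable function. This is only a sketch: `add_missing_count()` is a hypothetical helper, not part of dplyr, and it assumes dplyr 1.1.0 or later for `pick()`. It uses the vectorized `rowSums(is.na())` form rather than rowwise().

```r
library(dplyr)

# Hypothetical helper: adds a per-row NA count over the selected columns
add_missing_count <- function(data, cols = everything()) {
  data |>
    mutate(missing_count = rowSums(is.na(pick({{ cols }}))))
}

# Made-up data with mixed column types
toy <- tibble(
  x = c(1, NA, 3),
  y = c(NA, NA, "c"),
  z = c(TRUE, TRUE, NA)
)

all_cols <- add_missing_count(toy)          # count across every column
xy_only <- add_missing_count(toy, c(x, y))  # count across x and y only

print(all_cols$missing_count)
print(xy_only$missing_count)
```

The `cols` argument accepts any tidyselect expression, so a range such as `bill_length_mm:body_mass_g` should work the same way on penguins.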
