How to count the number of missing values per row in a data frame
Introduction
Counting missing values per row is a common data cleaning task that helps identify which observations have incomplete data. This technique is particularly useful when deciding whether to remove rows with too many missing values or when creating data quality reports.
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The Problem
We need to count how many missing values (NA) exist in each row of our dataset. This helps us understand the completeness of our data at the observation level.
Step 1: Create sample data with missing values
Let’s start by creating a dataset with some missing values to work with.
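# Create sample data with missing values
sample_data <- penguins |>
  slice(1:10) |>
  mutate(
    bill_length_mm = ifelse(row_number() %in% c(2, 5), NA, bill_length_mm),
    body_mass_g = ifelse(row_number() %in% c(2, 7), NA, body_mass_g)
  )
This creates a 10-row sample from the penguins dataset with missing values added at known positions: rows 2 and 5 of bill_length_mm and rows 2 and 7 of body_mass_g. The original rows also contain a few missing values of their own (row 4, for example, has no measurements recorded), so the per-row counts will reflect both.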
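Before counting per row, it can help to confirm where the missing values ended up. The following is a minimal sketch in base R; `toy` is a hypothetical hand-built stand-in for `sample_data`, not the tutorial's actual data.

```r
# Hypothetical stand-in for sample_data, with NAs at known positions
toy <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3, 36.7, NA),
  body_mass_g    = c(3750, NA, 3250, 3450, 3650)
)

# Number of NAs in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Row positions of the NAs in each column
na_rows <- lapply(toy, function(col) which(is.na(col)))
print(na_rows)
```

The same two calls should work unchanged on `sample_data`, since `is.na()` accepts any data frame.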
Step 2: Count missing values using rowwise()
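We’ll use rowwise() to evaluate the expression one row at a time and count the NAs across all columns. Because penguins mixes factor and numeric columns, c_across(everything()) cannot combine them into a single vector and fails with a "Can't combine" error. For the all-columns case we therefore use pick() (available in dplyr 1.1.0 or later), which returns the current row as a one-row data frame.
# Count missing values per row across all columns
result <- sample_data |>
  rowwise() |>
  mutate(missing_count = sum(is.na(pick(everything())))) |>
  ungroup()
print(result$missing_count)
This approach counts missing values across all columns for each row, storing the result in a new column.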
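As an aside, rowwise() evaluates its expression once per row, which can be slow on large data frames. A vectorized alternative, which is a different technique from the one this tutorial uses, is base R's `rowSums()` over the logical matrix returned by `is.na()`. The sketch below assumes a small hypothetical data frame named `toy` with mixed column types.

```r
# Hypothetical data frame mixing a factor with numeric columns
toy <- data.frame(
  species        = factor(c("Adelie", "Adelie", NA)),
  bill_length_mm = c(39.1, NA, NA),
  body_mass_g    = c(3750L, NA, 3250L)
)

# is.na() on a data frame returns a logical matrix;
# rowSums() then adds up the TRUE values in each row
toy$missing_count <- rowSums(is.na(toy))
print(toy$missing_count)
```

Because `is.na()` is applied column by column, the mix of factor and numeric columns causes no type conflict here.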
Step 3: Select specific columns for counting
Often you only want to count missing values in specific columns of interest.
# Count missing values in specific columns only
result_selective <- sample_data |>
  rowwise() |>
  mutate(
    missing_count = sum(is.na(c_across(bill_length_mm:body_mass_g)))
  ) |>
  ungroup()
This counts missing values only in the four numeric measurement columns (bill_length_mm through body_mass_g), excluding the categorical variables.
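When the columns of interest are not adjacent, selecting by type may be more convenient than a name range. The sketch below assumes a hypothetical tibble `toy` and uses the tidyselect helper `where(is.numeric)` inside c_across(); the `species` column is there only to show that non-numeric columns are skipped.

```r
library(dplyr)

# Hypothetical tibble: one factor column, two numeric columns
toy <- tibble(
  species        = factor(c("Adelie", NA, "Gentoo")),
  bill_length_mm = c(39.1, NA, 46.1),
  body_mass_g    = c(3750L, NA, NA)
)

# Count NAs per row in numeric columns only;
# the missing species value in row 2 is not counted
numeric_missing <- toy |>
  rowwise() |>
  mutate(missing_count = sum(is.na(c_across(where(is.numeric))))) |>
  ungroup()

print(numeric_missing$missing_count)
```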
Example 2: Practical Application
The Problem
You’re analyzing the penguins dataset for a research project and need to identify rows with multiple missing measurements. Rows with more than 2 missing values should be flagged for review or potential removal.
Step 1: Count missing values and create quality flags
We’ll count missing values and create a data quality flag for further analysis.
# Analyze data quality across the full dataset
penguins_quality <- penguins |>
  rowwise() |>
  mutate(
    missing_count = sum(is.na(c_across(bill_length_mm:body_mass_g))),
    quality_flag = case_when(
      missing_count == 0 ~ "Complete",
      missing_count <= 2 ~ "Acceptable",
      TRUE ~ "Poor"
    )
  ) |>
  ungroup()
This creates a comprehensive quality assessment with both counts and categorical flags.
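The order of the case_when() conditions matters, because each value takes the label of the first condition it satisfies. A minimal sketch on a plain vector of hypothetical counts shows how the thresholds map to flags:

```r
library(dplyr)

# Hypothetical per-row missing counts
missing_count <- c(0, 1, 2, 3, 4)

# A count of 0 also satisfies `<= 2`, but the first matching
# condition wins, so it is labelled "Complete"
quality_flag <- case_when(
  missing_count == 0 ~ "Complete",
  missing_count <= 2 ~ "Acceptable",
  TRUE ~ "Poor"
)

print(quality_flag)
```

Swapping the first two conditions would label every row with zero to two missing values as "Acceptable", so the most specific test should come first.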
Step 2: Summarize data quality patterns
Let’s examine the distribution of missing value patterns across our dataset.
# Summarize quality patterns
quality_summary <- penguins_quality |>
  count(missing_count, quality_flag) |>
  arrange(missing_count)
print(quality_summary)
This summary shows how many rows fall into each missing value category.
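Raw counts are easier to compare across datasets when expressed as shares. The sketch below assumes a hypothetical tibble `toy_quality` holding only a `quality_flag` column, and adds a `share` column (a name chosen here for illustration) after count().

```r
library(dplyr)

# Hypothetical flags for five rows
toy_quality <- tibble(
  quality_flag = c("Complete", "Complete", "Complete", "Acceptable", "Poor")
)

# count() names its result column `n`; dividing by the total gives each share
toy_summary <- toy_quality |>
  count(quality_flag) |>
  mutate(share = n / sum(n))

print(toy_summary)
```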
Step 3: Filter and examine problematic rows
Finally, let’s identify and examine the rows that need attention.
# Examine rows with poor data quality
poor_quality_rows <- penguins_quality |>
  filter(quality_flag == "Poor") |>
  select(species, island, missing_count, everything())
print(poor_quality_rows)
This filters to show only the problematic rows, making it easy to decide on appropriate handling strategies.
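If the decision is to remove the flagged rows, one possible approach is to split the data on the same threshold so that the dropped rows stay available for inspection. This is a sketch on a hypothetical tibble `toy_quality`; the `id` column and the threshold of 2 are assumptions that mirror the rule used in this example.

```r
library(dplyr)

# Hypothetical rows with precomputed missing counts
toy_quality <- tibble(
  id            = 1:4,
  missing_count = c(0L, 2L, 3L, 4L)
)

# Keep rows at or below the threshold, set the rest aside
kept    <- toy_quality |> filter(missing_count <= 2)
dropped <- toy_quality |> filter(missing_count > 2)

print(kept)
print(dropped)
```

Keeping `dropped` as its own object makes it possible to report how many observations were excluded and why.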
Summary
- Use rowwise() with c_across() to count missing values across multiple columns of compatible types
- Apply sum(is.na()) to convert logical TRUE/FALSE values into numeric counts per row
- Use column selection within c_across() to focus on specific variables of interest
- Create quality flags based on missing value thresholds to categorize data completeness
- Combine counting with filtering and summarizing to develop comprehensive data quality workflows