How to use sapply in R
Introduction
The sapply() function in R applies a function to each element of a list or vector and returns a simplified result, typically a vector or matrix. It’s particularly useful when you need to perform the same operation across multiple data elements and want cleaner output than lapply() provides.
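To make the contrast with lapply() concrete, here is a minimal sketch on a small made-up list. The `scores` list and the variable names are illustrative assumptions, not part of the datasets used later.

```r
# A small hypothetical list of numeric vectors (illustrative only)
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# lapply() always returns a list with one element per input
as_list <- lapply(scores, mean)

# sapply() runs the same calls, then simplifies the result
# to a named numeric vector because each result has length one
as_vector <- sapply(scores, mean)

print(as_list)
print(as_vector)
```

Both calls compute the same means; only the shape of the returned object differs.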
Getting Started
library(tidyverse)
data(mtcars)
data(penguins, package = "palmerpenguins")
Example 1: Basic Usage
The Problem
We need to calculate summary statistics for multiple numeric columns in the mtcars dataset. Instead of writing separate functions for each column, we want to apply the same function efficiently across all columns.
Step 1: Create sample data
First, let’s select a few numeric columns to work with.
# Select key numeric variables
car_data <- mtcars |>
  select(mpg, hp, wt, qsec)
head(car_data, 3)
This gives us a clean subset with four numeric variables to analyze.
Step 2: Apply a single function
Now we’ll use sapply() to calculate the mean of each column.
# Calculate mean for each column
column_means <- sapply(car_data, mean)
print(column_means)
The sapply() function applied mean() to each column and returned a named vector with the results.
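Because the result is a named vector, individual values can be looked up by column name or sorted like any other vector. The sketch below rebuilds the same means with base R indexing so it runs on its own; `sorted_means` is an illustrative name.

```r
# Recreate the column means using base R column selection
column_means <- sapply(mtcars[, c("mpg", "hp", "wt", "qsec")], mean)

# Look up a single value by its column name
print(column_means["hp"])

# Sort the means from largest to smallest; names travel with the values
sorted_means <- sort(column_means, decreasing = TRUE)
print(sorted_means)
```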
Step 3: Apply with additional arguments
We can pass additional arguments to our function through sapply().
# Add some NA values for demonstration
car_data_na <- car_data
car_data_na[1, "mpg"] <- NA
# Calculate means ignoring NA values
means_no_na <- sapply(car_data_na, mean, na.rm = TRUE)
print(means_no_na)
The na.rm = TRUE argument was passed along to every mean() call, so the missing value is skipped.
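To see what the extra argument changes, the sketch below compares the result with and without na.rm = TRUE, and shows an equivalent anonymous-function form. It uses base R indexing so it runs on its own; `with_na`, `without_na`, and `via_anonymous` are illustrative names.

```r
# Same subset as above, with one missing value introduced
car_data_na <- mtcars[, c("mpg", "hp", "wt", "qsec")]
car_data_na[1, "mpg"] <- NA

# Without na.rm, the mean of a column containing NA is NA
with_na <- sapply(car_data_na, mean)

# Extra arguments after the function are forwarded to every call
without_na <- sapply(car_data_na, mean, na.rm = TRUE)

# Equivalent form that spells out the forwarded argument
via_anonymous <- sapply(car_data_na, function(x) mean(x, na.rm = TRUE))

print(with_na)
print(without_na)
```

Only the mpg column is affected; the other columns have no missing values, so their means are the same in both results.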
Example 2: Practical Application
The Problem
We’re analyzing the penguins dataset and want a per-variable overview of data quality and distribution: several summary statistics for each numeric measurement, which also serve as a first screen for outliers and unusually high variability.
Step 1: Prepare the data
Let’s extract the numeric columns and drop any rows with missing values so the statistics are computed on complete cases.
# Get numeric columns from penguins
penguin_numeric <- penguins |>
  select(bill_length_mm, bill_depth_mm,
         flipper_length_mm, body_mass_g) |>
  na.omit()
dim(penguin_numeric)
This gives us a clean dataset with four numeric measurements for analysis.
Step 2: Create a custom function
We’ll build a function that returns multiple statistics for outlier detection.
# Function to calculate summary stats
get_stats <- function(x) {
  c(mean = mean(x),
    median = median(x),
    sd = sd(x),
    iqr = IQR(x))
}
This function returns four key statistics that help us understand each variable’s distribution.
Step 3: Apply custom function
Now we’ll use sapply() to apply our custom function across all columns.
# Apply custom function to all columns
penguin_stats <- sapply(penguin_numeric, get_stats)
print(round(penguin_stats, 2))
The result is a matrix where each column represents a variable and each row represents a different statistic.
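Since the result is an ordinary matrix with row and column names, single values can be pulled out by name, and t() flips the layout so that each variable becomes a row. The sketch below uses the built-in mtcars data instead of penguins so it runs without extra packages; `stats_matrix` and `stats_by_variable` are illustrative names.

```r
# Same kind of summary function as above
get_stats <- function(x) {
  c(mean = mean(x),
    median = median(x),
    sd = sd(x),
    iqr = IQR(x))
}

# One column per variable, one row per statistic
stats_matrix <- sapply(mtcars[, c("mpg", "hp", "wt", "qsec")], get_stats)
print(dim(stats_matrix))

# Index by statistic name (row) and variable name (column)
print(stats_matrix["mean", "mpg"])

# Transpose so each variable is a row, which is often easier to read
stats_by_variable <- t(stats_matrix)
print(round(stats_by_variable, 2))
```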
Step 4: Create logical tests
We can also use sapply() for logical operations across columns.
# Check which variables have high variability (CV > 0.15)
high_variation <- sapply(penguin_numeric, function(x) {
  coefficient_variation <- sd(x) / mean(x)
  coefficient_variation > 0.15
})
print(high_variation)
This returns a logical vector showing which measurements have high relative variability.
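A logical vector like this is often used as a filter, for example to list the names of the flagged variables. The sketch below applies the same coefficient-of-variation test to the built-in mtcars data so it runs without extra packages; `numeric_cols` and `flagged` are illustrative names, and the 0.15 threshold is simply carried over from the step above.

```r
# Stand-in numeric columns from a built-in dataset
numeric_cols <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# TRUE where the coefficient of variation exceeds 0.15
high_variation <- sapply(numeric_cols, function(x) {
  sd(x) / mean(x) > 0.15
})

# Keep only the names whose test came back TRUE
flagged <- names(high_variation)[high_variation]
print(flagged)
```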
Step 5: Count categories by groups
We can also use sapply() with factors to count occurrences within each group.
# Count species occurrences
species_counts <- sapply(split(penguins$species, penguins$island),
                         function(x) table(x))
print(species_counts)
This creates a breakdown of species counts by island, demonstrating sapply() with more complex data structures.
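The counts simplify to a matrix because every island’s table has the same length: one count per species level. When the results have different lengths, sapply() cannot simplify and returns a list instead. A minimal sketch of both cases, using made-up inputs (`ragged` and `uniform` are illustrative names):

```r
# Results of different lengths: sapply() falls back to a list
ragged <- sapply(1:3, function(n) seq_len(n))
print(is.list(ragged))

# Results of equal length: sapply() builds a matrix,
# with one column per input element
uniform <- sapply(1:3, function(n) c(n, n^2))
print(is.matrix(uniform))
print(dim(uniform))
```

Checking the shape of the result is a cheap safeguard whenever the applied function might return a varying number of values.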
Summary
- sapply() applies functions across list or vector elements and simplifies results into vectors or matrices
- It’s ideal for calculating summary statistics across multiple columns efficiently
- You can pass additional arguments to functions using extra parameters in sapply()
- Custom functions work seamlessly with sapply() for complex operations
- The function returns simplified output compared to lapply(), making results easier to read and work with