How to calculate standard deviation and variance in R
Introduction
Standard deviation and variance are fundamental statistical measures that quantify the spread or dispersion of data around the mean. Standard deviation is the square root of variance and is expressed in the same units as your data, making it more interpretable. These measures are essential for understanding data distribution, comparing variability between groups, and identifying outliers.
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The Problem
We need to calculate the standard deviation and variance of penguin body mass to understand how much individual penguins vary from the average weight. This helps us assess whether penguin weights are tightly clustered or widely spread.
Step 1: Examine the data
Let’s first look at the penguin body mass data to understand what we’re working with.
# Load and examine penguin data
data(penguins)
head(penguins$body_mass_g, 10)
summary(penguins$body_mass_g)
This shows us the first 10 body mass values and basic summary statistics including the mean.
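The summary() output also reports how many values are missing, which matters because var() and sd() return NA as soon as a single input is NA. A minimal sketch of that behaviour, using a small made-up vector (the values are illustrative and not taken from the penguins data):

```r
# Hypothetical toy vector with one missing value (not from the penguins data)
x <- c(3200, 3800, NA, 4100)

# Count the missing values
n_missing <- sum(is.na(x))

# Without na.rm, a single NA makes the result NA
sd_with_na <- sd(x)

# With na.rm = TRUE, missing values are dropped before the calculation
sd_without_na <- sd(x, na.rm = TRUE)

n_missing
sd_with_na
sd_without_na
```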
Step 2: Calculate variance
Variance measures the average squared deviation from the mean.
# Calculate variance using var() function
body_mass_variance <- var(penguins$body_mass_g, na.rm = TRUE)
print(paste("Variance:", round(body_mass_variance, 2)))
The variance is about 643,131 grams squared, which is difficult to interpret because it’s in squared units.
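Note that var() computes the sample variance: it divides the sum of squared deviations by n - 1 rather than n, so it is not quite a plain average. The following sketch reproduces that calculation by hand on a small made-up vector (hypothetical values, chosen so the arithmetic is easy to follow):

```r
# Hypothetical toy data, not from the penguins dataset
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)
squared_dev <- (x - mean(x))^2   # squared deviation of each value from the mean

sample_var <- sum(squared_dev) / (n - 1)  # what var() computes
population_var <- sum(squared_dev) / n    # divides by n instead

sample_var
population_var
all.equal(sample_var, var(x))
```

For large samples the two versions are nearly identical; the difference matters mainly when a group has only a few observations.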
Step 3: Calculate standard deviation
Standard deviation is more interpretable since it’s in the same units as our original data.
# Calculate standard deviation using sd() function
body_mass_sd <- sd(penguins$body_mass_g, na.rm = TRUE)
print(paste("Standard deviation:", round(body_mass_sd, 2), "grams"))
The standard deviation is approximately 802 grams. As a rough guide, if body mass were normally distributed, about two-thirds of penguins would fall within 802 grams of the average body mass.
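How many penguins actually fall within one standard deviation of the mean can be checked directly rather than assumed. This sketch assumes the palmerpenguins package is installed; the variable names are illustrative.

```r
library(palmerpenguins)

# Drop missing body masses first
mass <- penguins$body_mass_g[!is.na(penguins$body_mass_g)]

mass_mean <- mean(mass)
mass_sd <- sd(mass)

# Proportion of penguins whose mass lies within one SD of the mean
within_one_sd <- mean(abs(mass - mass_mean) <= mass_sd)
within_one_sd
```

For roughly bell-shaped data this proportion is close to 0.68; a noticeably different value suggests the distribution is skewed or made up of several clusters.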
Step 4: Verify the relationship
Let’s confirm that standard deviation is the square root of variance.
# Verify relationship between variance and standard deviation
sqrt(body_mass_variance)
body_mass_sd
Both values match, confirming that standard deviation equals the square root of variance.
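Comparing two printed numbers by eye works here, but a programmatic check is more robust. Because floating-point results can differ in their last digits, all.equal() is generally a safer choice than == for this kind of comparison. A minimal sketch, assuming palmerpenguins is installed:

```r
library(palmerpenguins)

mass_var <- var(penguins$body_mass_g, na.rm = TRUE)
mass_sd <- sd(penguins$body_mass_g, na.rm = TRUE)

# all.equal() compares the two values within a small numerical tolerance
relationship_holds <- isTRUE(all.equal(sqrt(mass_var), mass_sd))
relationship_holds
```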
Example 2: Practical Application
The Problem
A marine biologist wants to compare the variability in flipper length between different penguin species to determine which species shows the most consistent flipper size. This analysis will help understand morphological diversity within and between species.
Step 1: Group data by species
We’ll organize our data by species to compare variability across groups.
# Group penguins by species and examine flipper lengths
penguin_summary <- penguins |>
  filter(!is.na(flipper_length_mm)) |>
  group_by(species)
This creates a grouped dataset while removing any missing flipper length values.
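Because a standard deviation estimated from a handful of observations is less stable than one estimated from many, it can help to check the group sizes before comparing spreads. A short sketch (the column name n_penguins is an arbitrary choice):

```r
library(dplyr)
library(palmerpenguins)

# Number of penguins per species with a recorded flipper length
group_sizes <- penguins |>
  filter(!is.na(flipper_length_mm)) |>
  count(species, name = "n_penguins")

group_sizes
```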
Step 2: Calculate statistics by species
Now we’ll compute variance and standard deviation for each species.
# Calculate variance and SD for each species
species_stats <- penguin_summary |>
  summarise(
    mean_flipper = mean(flipper_length_mm),
    variance = var(flipper_length_mm),
    std_dev = sd(flipper_length_mm),
    .groups = "drop"
  )
This gives us comprehensive statistics for flipper length variability by species.
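The same per-species statistics can also be computed without dplyr. As one possible alternative, this sketch uses base R's tapply(); it assumes palmerpenguins is installed, and the result names are illustrative.

```r
library(palmerpenguins)

# tapply() applies a function to flipper length within each species;
# na.rm = TRUE is passed through to sd() and var()
sd_by_species <- tapply(
  penguins$flipper_length_mm,
  penguins$species,
  sd,
  na.rm = TRUE
)

var_by_species <- tapply(
  penguins$flipper_length_mm,
  penguins$species,
  var,
  na.rm = TRUE
)

sd_by_species
var_by_species
```

The dplyr version returns a tibble that is easy to pipe into further steps, while tapply() returns a named array, which is convenient for quick checks.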
Step 3: Identify the most variable species
Let’s examine which species has the highest and lowest variability.
# Display results sorted by standard deviation
species_stats |>
  arrange(desc(std_dev)) |>
  mutate(across(where(is.numeric), ~round(.x, 2)))
Chinstrap penguins show the highest flipper length variability (a standard deviation of about 7.1 mm), while Gentoo penguins are the most consistent (about 6.5 mm), only marginally ahead of Adelie penguins.
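Instead of reading the extremes off a sorted table, the most and least variable species can be extracted programmatically. A sketch using dplyr's slice_max() and slice_min(); the summary is rebuilt here so the block runs on its own, and the object names are illustrative.

```r
library(dplyr)
library(palmerpenguins)

species_sd <- penguins |>
  filter(!is.na(flipper_length_mm)) |>
  group_by(species) |>
  summarise(std_dev = sd(flipper_length_mm), .groups = "drop")

# Row with the largest standard deviation
most_variable <- species_sd |> slice_max(std_dev, n = 1)

# Row with the smallest standard deviation
least_variable <- species_sd |> slice_min(std_dev, n = 1)

most_variable
least_variable
```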
Step 4: Calculate coefficient of variation
For better comparison across species with different means, we’ll calculate the coefficient of variation.
# Calculate coefficient of variation (CV)
species_stats |>
  mutate(
    cv_percent = round((std_dev / mean_flipper) * 100, 1)
  ) |>
  select(species, mean_flipper, std_dev, cv_percent)
The coefficient of variation shows relative variability as a percentage, making it easier to compare across species with different average flipper lengths.
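To illustrate why the coefficient of variation is useful across different scales, the sketch below defines a small helper and applies it to the same made-up measurements in two units. The function name coef_var_percent and the data are hypothetical, introduced only for this example.

```r
# Hypothetical helper: coefficient of variation as a percentage
coef_var_percent <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm) * 100
}

# Toy data (made up): the same lengths in millimetres and centimetres
length_mm <- c(180, 190, 195, 200, 210)
length_cm <- length_mm / 10

sd(length_mm)                # standard deviation changes with the unit
sd(length_cm)
coef_var_percent(length_mm)  # coefficient of variation is the same in both units
coef_var_percent(length_cm)
```

The coefficient of variation is only meaningful for measurements on a ratio scale with a mean well away from zero, which holds for lengths and masses.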
Summary
- Use var() to calculate variance and sd() to calculate standard deviation in R
- Always include na.rm = TRUE when working with datasets that may contain missing values
- Standard deviation is more interpretable than variance because it’s in the same units as your original data
- Group operations with group_by() and summarise() allow efficient calculation of statistics across categories
- Coefficient of variation (standard deviation divided by mean) enables comparison of relative variability across groups with different scales