How to standard deviation and variance in R

statistics
standard deviation and variance
Published

February 20, 2026

Standard Deviation and Variance in R: A Complete Tutorial

1. Introduction

Standard deviation and variance are fundamental measures of variability or spread in data. They tell us how much individual data points deviate from the average (mean) value. While the mean tells us the central tendency, variance and standard deviation reveal how clustered or dispersed our data points are around that center.

Variance measures the average squared differences from the mean, while standard deviation is simply the square root of variance. Standard deviation is more interpretable because it’s in the same units as our original data.

When to use them: - Describing the spread of continuous numerical data - Comparing variability between different groups or datasets - Quality control and identifying outliers - Risk assessment in finance - Understanding measurement precision in experiments

Key assumptions: - Data should be numerical (continuous or discrete) - Works best with roughly normal distributions - Sensitive to outliers - Assumes data points are independent

These measures are essential building blocks for more advanced statistical concepts like confidence intervals, hypothesis testing, and standardized scores.

2. The Math

Variance Formula:

Variance = Sum of (each value - mean)² / (n - 1)

Standard Deviation Formula:

Standard Deviation = Square root of Variance

Where: - Each value = individual data points in your dataset - Mean = average of all values - n = number of observations - n - 1 = degrees of freedom (used for sample variance)

Why square the differences? If we didn’t square them, positive and negative deviations would cancel out, giving us zero. Squaring ensures all deviations are positive.

Why divide by n-1? This is called Bessel’s correction. When calculating sample variance (which we usually do), dividing by n-1 instead of n gives us an unbiased estimate of the population variance.

3. R Implementation

Let’s start by loading the necessary packages and exploring the penguin dataset:

library(tidyverse)
library(palmerpenguins)

# Look at the structure of our data
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.…
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 1…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475…
$ sex               <fct> male, female, female, NA, female, male, female, mal…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 20…

Base R functions:

# Remove rows with missing values for cleaner analysis
penguins_clean <- penguins %>% drop_na()

# Calculate variance and standard deviation
var(penguins_clean$body_mass_g)
sd(penguins_clean$body_mass_g)
[1] 644988.4
[1] 803.1

Using tidyverse for group calculations:

# Calculate by species
penguins_clean %>%
  group_by(species) %>%
  summarise(
    mean_mass = mean(body_mass_g),
    var_mass = var(body_mass_g),
    sd_mass = sd(body_mass_g),
    n = n()
  )
# A tibble: 3 × 5
  species   mean_mass var_mass sd_mass     n
  <fct>         <dbl>    <dbl>   <dbl> <int>
1 Adelie        3706.  250892.   501.   146
2 Chinstrap     3733.  218730.   468.    68
3 Gentoo        5092.  282579.   532.   119

4. Full Worked Example

Let’s analyze the body mass variability of penguin species step by step:

# Step 1: Prepare the data
penguins_mass <- penguins %>%
  drop_na(body_mass_g) %>%
  select(species, body_mass_g)

# Step 2: Calculate overall statistics
overall_stats <- penguins_mass %>%
  summarise(
    mean_mass = mean(body_mass_g),
    variance = var(body_mass_g),
    std_dev = sd(body_mass_g),
    min_mass = min(body_mass_g),
    max_mass = max(body_mass_g),
    n = n()
  )

print(overall_stats)
# A tibble: 1 × 6
  mean_mass variance std_dev min_mass max_mass     n
      <dbl>    <dbl>   <dbl>    <int>    <int> <int>
1     4207.  644988.    803.     2700     6300   333
# Step 3: Calculate by species for comparison
species_stats <- penguins_mass %>%
  group_by(species) %>%
  summarise(
    mean_mass = mean(body_mass_g),
    variance = var(body_mass_g),
    std_dev = sd(body_mass_g),
    cv = sd(body_mass_g) / mean(body_mass_g) * 100,  # Coefficient of variation
    n = n()
  ) %>%
  arrange(desc(mean_mass))

print(species_stats)
# A tibble: 3 × 6
  species   mean_mass variance std_dev    cv     n
  <fct>         <dbl>    <dbl>   <dbl> <dbl> <int>
1 Gentoo        5092.  282579.   532.  10.4   119
2 Chinstrap     3733.  218730.   468.  12.5    68
3 Adelie        3706.  250892.   501.  13.5   146

Interpretation: - Gentoo penguins are heaviest on average (5,092g) but have the lowest relative variability (10.4% CV) - Adelie penguins have the highest relative variability (13.5% CV) despite similar average mass to Chinstraps - The overall standard deviation (803g) is much larger than within-species standard deviations, indicating species is a major factor in body mass variation

5. Visualization

# Create a visualization showing distributions and variability
penguins_mass %>%
  ggplot(aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.5, size = 1) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3,
               fill = "red", color = "darkred") +
  labs(
    title = "Penguin Body Mass Distribution by Species",
    subtitle = "Red diamonds show means, boxes show quartiles and outliers",
    x = "Species",
    y = "Body Mass (grams)",
    fill = "Species"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
Figure 1: Penguin Body Mass Distribution by Species
# Create a second plot showing standard deviations
species_stats %>%
  ggplot(aes(x = species, y = std_dev, fill = species)) +
  geom_col(alpha = 0.7) +
  geom_text(aes(label = paste0(round(std_dev, 0), "g")),
            vjust = -0.5, size = 4) +
  labs(
    title = "Standard Deviation of Body Mass by Species",
    x = "Species",
    y = "Standard Deviation (grams)",
    fill = "Species"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  ylim(0, 600)
Figure 2: Standard Deviation of Body Mass by Species

These plots show that while Gentoo penguins have the largest absolute standard deviation, they also have the largest mean, so their relative variability is actually the smallest. The boxplot reveals that Gentoo penguins form a distinct group, contributing to the high overall variability.

6. Assumptions & Limitations

When NOT to use standard deviation and variance:

  1. With ordinal or categorical data: These measures require numerical data with meaningful distances between values
# Don't do this:
# var(penguins$species)  # This would be meaningless
  1. With highly skewed data: Standard deviation can be misleading
# Example of skewed data
skewed_data <- c(1, 2, 2, 3, 3, 3, 4, 4, 100)
mean(skewed_data)  # 13.6
sd(skewed_data)    # 32.3 - larger than the mean!

# Consider median and IQR instead
median(skewed_data)  # 3
IQR(skewed_data)     # 2
  1. When outliers are present: Both measures are sensitive to extreme values
# Robust alternatives for outlier-prone data
mad(penguins_clean$body_mass_g)  # Median Absolute Deviation
[1] 742.3

What to do instead: - Use median and interquartile range (IQR) for skewed data - Use robust measures like MAD for outlier-prone data - Consider transforming data (log, square root) for highly skewed distributions

7. Common Mistakes

1. Confusing population vs. sample formulas:

# R uses sample formulas by default (dividing by n-1)
# This is usually what you want
var(c(1, 2, 3, 4, 5))  # Correct for sample data

# Manual calculation showing the difference:
x <- c(1, 2, 3, 4, 5)
n <- length(x)
sample_var <- sum((x - mean(x))^2) / (n - 1)  # Divide by n-1
pop_var <- sum((x - mean(x))^2) / n           # Divide by n

c(sample_var, pop_var)
[1] 2.5 2.0

2. Forgetting about missing values:

# This will return NA if any values are missing
var(penguins$body_mass_g)  

# Always handle missing data explicitly
var(penguins$body_mass_g, na.rm = TRUE)  # Correct approach
[1] NA
[1] 644988.4

3. Misinterpreting units:

# Variance is in squared units - harder to interpret
var(penguins_clean$body_mass_g)  # 644988.4 grams²

# Standard deviation is in original units - more intuitive
sd(penguins_clean$body_mass_g)   # 803.1 grams