How to Calculate the Mean and Median in R

statistics
mean and median
Published: February 20, 2026

Understanding Mean and Median in R: A Complete Tutorial

1. Introduction

The mean and median are two fundamental measures of central tendency that help us understand the “typical” or “average” value in a dataset. The mean is the arithmetic average of all values, while the median is the middle value when data is arranged in order.

When to use the mean: The mean works best with roughly symmetric, continuous data without extreme outliers. It’s the natural choice when totals matter (the mean multiplied by the number of observations recovers the sum), for reported averages, and when every data point should contribute equally to the central measure. Examples include average test scores, mean temperature, or average sales figures.

When to use the median: The median is preferred when data contains outliers, is skewed, or when you want a measure that represents the “typical” observation rather than the mathematical average. It’s particularly useful for income data, house prices, or any dataset where extreme values might distort the mean.

Key assumptions: The mean assumes that extreme values are meaningful and should influence the central measure, and it requires at least interval-level data so that sums and differences make sense. The median only requires that values can be ordered, so it also works with ordinal data, and it is robust to outliers. Neither requires a normal distribution, but the mean’s interpretation is clearer with symmetric distributions.
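To make the ordinal case concrete, here is a minimal sketch of finding the median category of an ordered factor. The `responses` vector is made up for illustration. Since `median()` is aimed at numeric input, this sketch works on the integer level codes and uses `quantile()` with `type = 1`, which always returns a value that was actually observed; other approaches are possible.

```r
# Hypothetical survey responses on an ordered scale (illustrative data only)
responses <- factor(
  c("low", "medium", "medium", "high", "low", "medium", "high"),
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

# Take the median of the integer level codes.
# type = 1 (inverse empirical CDF) never averages two neighbouring codes.
median_code <- quantile(as.integer(responses), probs = 0.5,
                        type = 1, names = FALSE)

# Map the code back to its category label
median_level <- levels(responses)[median_code]
median_level
#> [1] "medium"
```

Averaging two categories would be meaningless for ordinal data, which is why this sketch avoids the "average the two middle values" rule.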

2. The Math

Mean Formula:

Mean = (Sum of all values) / (Number of values)
Mean = (x₁ + x₂ + x₃ + ... + xₙ) / n

Where:
- x₁, x₂, x₃, …, xₙ are the individual data points
- n is the total number of observations

Median Calculation:
1. Sort all values from smallest to largest
2. If n is odd: Median = the middle value
3. If n is even: Median = (sum of the two middle values) / 2

For example, with values [1, 3, 5, 7, 9], the median is 5 (middle value). With values [2, 4, 6, 8], the median is (4 + 6) / 2 = 5.
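The formulas above can be written out by hand to check that they agree with R's built-in functions. This is only a sketch for illustration: the helper names `manual_mean` and `manual_median` are hypothetical, they do not handle missing values, and in practice you would call `mean()` and `median()` directly.

```r
# Mean: sum of all values divided by the number of values
manual_mean <- function(x) {
  sum(x) / length(x)
}

# Median: sort, then take the middle value (odd n)
# or the average of the two middle values (even n)
manual_median <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2
  }
}

odd_values  <- c(1, 3, 5, 7, 9)
even_values <- c(2, 4, 6, 8)

manual_mean(odd_values)
#> [1] 5
manual_median(odd_values)
#> [1] 5
manual_median(even_values)
#> [1] 5
```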

3. R Implementation

Let’s start by loading necessary packages and exploring the basic functions:

# Load required packages
library(tidyverse)
library(palmerpenguins)

# Basic mean and median functions
# (named `values` rather than `data` to avoid masking the built-in data() function)
values <- c(10, 15, 20, 25, 30, 35, 100)

# Calculate mean
mean(values)
#> [1] 33.57143

# Calculate median
median(values)
#> [1] 25

# Handle missing values
values_with_na <- c(10, 15, NA, 25, 30)
mean(values_with_na, na.rm = TRUE)
#> [1] 20
median(values_with_na, na.rm = TRUE)
#> [1] 20

Using tidyverse for grouped calculations:

# Load penguins data
data(penguins)

# Calculate mean and median by species
penguins %>%
  group_by(species) %>%
  summarise(
    mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
    median_bill_length = median(bill_length_mm, na.rm = TRUE),
    .groups = 'drop'
  )
#> # A tibble: 3 × 3
#>   species   mean_bill_length median_bill_length
#>   <fct>                <dbl>              <dbl>
#> 1 Adelie                38.8               38.8
#> 2 Chinstrap             48.8               49.6
#> 3 Gentoo                47.5               47.3
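If the tidyverse is not available, a rough base R counterpart of the grouped summary is `tapply()`. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `group` and `value`) rather than the penguins data, so it runs without any packages.

```r
# Illustrative data only: two groups, one missing value
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 9, 4, NA, 6)
)

# Apply mean() and median() within each group; na.rm is passed through
group_means   <- tapply(toy$value, toy$group, mean, na.rm = TRUE)
group_medians <- tapply(toy$value, toy$group, median, na.rm = TRUE)

group_means
#> a b 
#> 4 5 
group_medians
#> a b 
#> 2 5 
```

Group `a` shows the same pattern as the larger examples: the single large value (9) pulls the mean up to 4 while the median stays at 2.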

4. Full Worked Example

Let’s analyze penguin body mass to understand the difference between mean and median:

# Step 1: Explore the data
penguins %>%
  select(species, body_mass_g) %>%
  summary()
#>     species    body_mass_g   
#>  Adelie   :152   Min.   :2700  
#>  Chinstrap: 68   1st Qu.:3550  
#>  Gentoo   :124   Median :4050  
#>                  Mean   :4202  
#>                  3rd Qu.:4750  
#>                  Max.   :6300  
#>                  NA's   :2     
# Step 2: Calculate overall mean and median
overall_stats <- penguins %>%
  summarise(
    count = sum(!is.na(body_mass_g)),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    median_mass = median(body_mass_g, na.rm = TRUE),
    difference = mean_mass - median_mass
  )

overall_stats
#> # A tibble: 1 × 4
#>   count mean_mass median_mass difference
#>   <int>     <dbl>       <dbl>      <dbl>
#> 1   342      4202        4050        152
# Step 3: Compare by species
species_stats <- penguins %>%
  group_by(species) %>%
  summarise(
    count = sum(!is.na(body_mass_g)),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    median_mass = median(body_mass_g, na.rm = TRUE),
    difference = mean_mass - median_mass,
    .groups = 'drop'
  )

species_stats
#> # A tibble: 3 × 5
#>   species   count mean_mass median_mass difference
#>   <fct>     <int>     <dbl>       <dbl>      <dbl>
#> 1 Adelie      151      3701        3700        0.7
#> 2 Chinstrap    68      3733        3700       32.6
#> 3 Gentoo      123      5076        5000       75.8

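Interpretation: The overall mean (4202 g) is higher than the median (4050 g), suggesting a right-skewed distribution with some heavier penguins pulling the mean upward. Part of that gap comes from pooling three species of different sizes: within each species the mean and median are much closer. Gentoo penguins show the largest within-species difference, which points to mild right skew or a few unusually heavy birds rather than to greater variability as such.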
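A raw gap between mean and median is hard to compare across groups with different spreads. One common way to scale it is Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation. The sketch below is an illustration, not part of the analysis above: the helper name `pearson_median_skew` and both example vectors are made up.

```r
# Pearson's second skewness coefficient: 3 * (mean - median) / sd
# Positive values suggest right skew, negative values left skew.
pearson_median_skew <- function(x, na.rm = TRUE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  3 * (mean(x) - median(x)) / sd(x)
}

# Illustrative vectors only
right_skewed <- c(10, 15, 20, 25, 30, 35, 100)  # one large value
symmetric    <- c(10, 20, 30, 40, 50)           # evenly spaced

pearson_median_skew(right_skewed)  # positive: mean sits above the median
pearson_median_skew(symmetric)     # zero: mean equals median
```

The same function could be applied per species inside `summarise()`, for example `pearson_median_skew(body_mass_g)`.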

5. Visualization

# Calculate statistics for plotting
plot_data <- penguins %>%
  filter(!is.na(body_mass_g)) %>%
  group_by(species) %>%
  summarise(
    mean_mass = mean(body_mass_g),
    median_mass = median(body_mass_g),
    .groups = 'drop'
  ) %>%
  pivot_longer(cols = c(mean_mass, median_mass),
               names_to = "statistic", values_to = "value") %>%
  mutate(statistic = case_when(
    statistic == "mean_mass" ~ "Mean",
    statistic == "median_mass" ~ "Median"
  ))

# Create the plot
ggplot(penguins %>% filter(!is.na(body_mass_g)),
       aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(alpha = 0.7) +
  geom_point(data = plot_data, aes(x = species, y = value, shape = statistic),
             size = 4, color = "black") +
  scale_shape_manual(values = c("Mean" = 16, "Median" = 17)) +
  labs(title = "Penguin Body Mass: Mean vs Median by Species",
       subtitle = "Circles = Mean, Triangles = Median",
       x = "Species", y = "Body Mass (g)",
       shape = "Statistic") +
  theme_minimal() +
  theme(legend.position = "bottom")
Figure 1: Penguin Body Mass: Mean vs Median by Species

This plot shows the distribution of body mass for each penguin species with boxplots, overlaid with points showing the mean (circles) and median (triangles). Notice how the mean and median are nearly identical for Adelie penguins, suggesting a symmetric distribution, while Gentoo penguins show the mean above the median, indicating right skewness.

6. Assumptions & Limitations

When NOT to use the mean:
- With highly skewed data (income, real estate prices)
- When outliers significantly distort the central tendency
- With ordinal data where intervals aren’t meaningful
- When you need a value that actually exists in your dataset

When NOT to use the median:
- When you need to account for the magnitude of all values
- For calculating totals or when extreme values are meaningful
- When working with small, symmetric datasets where mean is more precise
- In mathematical operations where additivity is important

Common violations:
- Using mean with heavily skewed data: Use median instead
- Ignoring outliers when calculating mean: Consider robust alternatives or outlier removal
- Using median when you need mathematical properties: Mean supports algebraic operations better
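One robust alternative that sits between the two measures is the trimmed mean, available through the `trim` argument of base R's `mean()`. The sketch below uses an illustrative vector with one large value. The choice of `trim = 0.2` is arbitrary here; with seven values it drops one observation from each end before averaging.

```r
# Illustrative data: six moderate values and one large outlier
values <- c(10, 15, 20, 25, 30, 35, 100)

# Ordinary mean: pulled upward by the outlier
mean(values)
#> [1] 33.57143

# Trimmed mean: drops the smallest and largest value (20% from each end),
# then averages the remaining 15, 20, 25, 30, 35
mean(values, trim = 0.2)
#> [1] 25

# Median for comparison
median(values)
#> [1] 25
```

Unlike outlier removal by hand, trimming is symmetric and rule-based, so it does not depend on a judgment about which points count as outliers.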

7. Common Mistakes

Mistake 1: Forgetting to handle missing values

# Wrong - will return NA
mean(c(1, 2, NA, 4, 5))
#> [1] NA
# Correct - specify na.rm = TRUE
mean(c(1, 2, NA, 4, 5), na.rm = TRUE)
#> [1] 3
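Before dropping missing values it can help to check how many there are, since `na.rm = TRUE` removes them silently. A minimal sketch, using the same illustrative vector stored in a variable named `x`:

```r
x <- c(1, 2, NA, 4, 5)

# Number of missing values
n_missing <- sum(is.na(x))
n_missing
#> [1] 1

# Proportion of missing values
prop_missing <- mean(is.na(x))
prop_missing
#> [1] 0.2
```

If a large share of values is missing, the mean and median of the remaining data may not represent the full dataset.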

Mistake 2: Using mean with highly skewed data

# Example: Income data with outliers
income <- c(25000, 30000, 35000, 40000, 45000, 1000000)
mean(income)  # Misleading due to outlier
#> [1] 195833.3
median(income)  # More representative
#> [1] 37500

Mistake 3: Assuming mean and median are always different

With symmetric distributions, mean and median can be nearly identical, and both provide valid measures of central tendency.
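A quick check with symmetric data illustrates the point. Both vectors below are made up for illustration. The simulated sample uses an arbitrary seed and parameters, so its exact numbers are not shown.

```r
# Perfectly symmetric values: mean and median coincide
symmetric <- c(2, 4, 6, 8, 10)
mean(symmetric)
#> [1] 6
median(symmetric)
#> [1] 6

# Simulated draws from a normal distribution (seed chosen arbitrarily)
set.seed(42)
simulated <- rnorm(1000, mean = 50, sd = 10)

# The gap is small relative to the standard deviation of 10
gap <- abs(mean(simulated) - median(simulated))
gap
```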