Computing Correlation with R

statistics
correlation
Learn how to compute Pearson, Spearman, and Kendall correlation in R using the cor() function with practical examples.
Published August 17, 2022

In this tutorial, we will learn how to compute the correlation between two numerical variables in R using the cor() function. We’ll cover three correlation methods: Pearson, Spearman, and Kendall.

Correlation values range from -1 to +1:

  • -1: Perfect negative correlation
  • 0: No correlation
  • +1: Perfect positive correlation
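To make these endpoints concrete, here is a small sketch using made-up vectors (x, y_up, y_down, x_sym, and y_sq are illustrative names, not part of the penguins data). The last pair is a reminder that a Pearson value of 0 means no linear association, even when one variable is fully determined by the other.

```r
# Toy vectors chosen so the results are exact
x      <- 1:10
y_up   <- 2 * x + 3       # increases exactly linearly with x
y_down <- -0.5 * x + 7    # decreases exactly linearly with x

cor(x, y_up)     # 1: perfect positive correlation
cor(x, y_down)   # -1: perfect negative correlation

# A symmetric, non-linear relationship: y depends on x, yet Pearson is 0
x_sym <- c(-2, -1, 0, 1, 2)
y_sq  <- x_sym^2
cor(x_sym, y_sq) # 0: no linear association
```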

Setup

library(palmerpenguins)
library(tidyverse)

df <- penguins %>%
  drop_na()

head(df)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           36.7          19.3               193        3450
5 Adelie  Torgersen           39.3          20.6               190        3650
6 Adelie  Torgersen           38.9          17.8               181        3625
# ℹ 2 more variables: sex <fct>, year <int>
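The drop_na() step matters because cor() returns NA as soon as either input contains a missing value. As an alternative sketch, the use argument of cor() can drop incomplete pairs at call time. The vectors a and b below are invented for illustration.

```r
# Hypothetical vectors with one missing value each
a <- c(1, 2, 3, 4, NA, 6)
b <- c(2, 4, 5, 8, 10, NA)

cor(a, b)                        # NA, because of the missing values
cor(a, b, use = "complete.obs")  # uses only the four complete pairs

# Same result as filtering the incomplete pairs by hand
keep <- complete.cases(a, b)
cor(a[keep], b[keep])
```

Note that drop_na() removes rows with a missing value in any column, while use = "complete.obs" only looks at the variables passed to cor(), so the two approaches can end up using slightly different rows.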

Pearson Correlation (Default)

Pearson correlation measures the linear relationship between two variables. It’s the default method in cor().

# Correlation between body mass and flipper length
cor(df$body_mass_g, df$flipper_length_mm)

This shows a strong positive correlation (about 0.87): penguins with larger body mass tend to have longer flippers.

# Explicitly specify method
cor(df$body_mass_g, df$flipper_length_mm, method = "pearson")
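As a sketch of what the Pearson method computes, the coefficient is the covariance of the two variables divided by the product of their standard deviations. The measurements below are invented (not taken from the penguins data) so the example runs on its own.

```r
# Hypothetical measurements, for illustration only
mass    <- c(3200, 3500, 3800, 4100, 4600, 5000)
flipper <- c(182, 188, 190, 197, 205, 212)

# Pearson r = cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(mass, flipper) / (sd(mass) * sd(flipper))
r_manual

# The manual value matches cor() up to floating-point tolerance
all.equal(r_manual, cor(mass, flipper))
```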

Spearman Correlation

Spearman correlation measures monotonic relationships using ranks rather than raw values. It’s more robust to outliers than Pearson and works well for non-linear but monotonic relationships.

cor(df$body_mass_g, df$flipper_length_mm, method = "spearman")
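The difference from Pearson is easiest to see on a relationship that is monotonic but far from linear. The sketch below uses a made-up exponential curve, and also checks that Spearman is simply Pearson applied to the ranks.

```r
# A monotonic but strongly non-linear relationship (toy data)
x <- 1:10
y <- exp(x)

cor(x, y, method = "pearson")   # noticeably below 1: the relationship is not linear
cor(x, y, method = "spearman")  # 1: the ranks agree perfectly

# Spearman is Pearson computed on the ranks
all.equal(cor(rank(x), rank(y)), cor(x, y, method = "spearman"))
```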

Kendall Correlation

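Kendall’s tau measures ordinal association by comparing how many pairs of observations are ordered the same way in both variables (concordant) versus in opposite ways (discordant). It is often preferred over Spearman for small samples.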

cor(df$body_mass_g, df$flipper_length_mm, method = "kendall")
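One way to see what Kendall’s tau measures is to count concordant and discordant pairs directly. The following is a minimal sketch on a small invented example without ties. When ties are present, cor() adjusts for them, so its result can differ from this simple formula.

```r
# Small hypothetical example with no ties
x <- 1:5
y <- c(2, 1, 4, 3, 5)

# Every pair of observations (i < j), one pair per column
pairs <- combn(length(x), 2)
signs <- sign(x[pairs[1, ]] - x[pairs[2, ]]) *
         sign(y[pairs[1, ]] - y[pairs[2, ]])

concordant <- sum(signs > 0)   # pairs ordered the same way in x and y
discordant <- sum(signs < 0)   # pairs ordered in opposite ways

# tau = (concordant - discordant) / number of pairs
tau_manual <- (concordant - discordant) / ncol(pairs)
tau_manual                      # 0.6
cor(x, y, method = "kendall")   # 0.6
```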

Comparing All Three Methods

# Create a comparison
methods <- c("pearson", "spearman", "kendall")

correlations <- sapply(methods, function(m) {
  cor(df$body_mass_g, df$flipper_length_mm, method = m)
})

data.frame(
  Method = methods,
  Correlation = round(correlations, 4),
  row.names = NULL  # avoid repeating the method names as row names
)

Correlation with Vectors

You can also compute correlation between standalone vectors:

set.seed(42)

# Generate correlated data
x <- rnorm(100, mean = 50, sd = 10)
y <- x * 2 + rnorm(100, mean = 0, sd = 5)  # y is related to x

# Compute correlations
cor(x, y, method = "pearson")
cor(x, y, method = "spearman")
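As a sanity check on the simulation, the population Pearson correlation implied by this setup can be worked out from the parameters. This is a sketch that assumes x and the noise term are independent, which is how they are generated here. The names sd_x, sd_noise, slope, and rho are introduced only for illustration.

```r
# y = slope * x + noise, with the parameters used in the simulation
sd_x     <- 10
sd_noise <- 5
slope    <- 2

# cov(x, y) = slope * var(x)
# var(y)    = slope^2 * var(x) + var(noise)
rho <- slope * sd_x^2 / (sd_x * sqrt(slope^2 * sd_x^2 + sd_noise^2))
rho  # about 0.97
```

The sample estimate from cor(x, y) should land near this value, with the exact figure depending on the seed and sample size.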

Correlation Matrix

To compute correlations between multiple variables at once:

# Select numeric columns
numeric_cols <- df %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

# Correlation matrix with Pearson
round(cor(numeric_cols), 2)

# Correlation matrix with Spearman
round(cor(numeric_cols, method = "spearman"), 2)
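A few properties of the result are worth knowing when reading it. The sketch below uses the built-in iris data rather than the penguins data, purely so it runs without extra packages. The names num and m are illustrative.

```r
# Built-in iris data: the first four columns are numeric
num <- iris[, 1:4]
m   <- cor(num)

isSymmetric(m)   # TRUE: cor(a, b) equals cor(b, a)
diag(m)          # all 1: each variable is perfectly correlated with itself

# Look up a single pair by row and column name
m["Sepal.Length", "Petal.Length"]
```

Because the matrix is symmetric, only the values above (or below) the diagonal carry distinct information.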

Visualizing Correlation

ggplot(df, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(aes(color = species), alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title = "Body Mass vs Flipper Length",
    subtitle = paste("Pearson r =", round(cor(df$body_mass_g, df$flipper_length_mm), 3)),
    x = "Body Mass (g)",
    y = "Flipper Length (mm)"
  ) +
  theme_minimal()

Figure 1: Scatter plot showing the correlation between body mass and flipper length

When to Use Each Method

Method    Use When
Pearson   Data is normally distributed, relationship is linear
Spearman  Data has outliers, relationship is monotonic but not linear
Kendall   Small sample size, ordinal data, more robust estimation needed

Summary

  • Use cor(x, y) for Pearson correlation (default)
  • Use cor(x, y, method = "spearman") for rank-based correlation
  • Use cor(x, y, method = "kendall") for ordinal association
  • All methods return values from -1 to +1