Computing Correlation with R

statistics

correlation

Learn how to compute Pearson, Spearman, and Kendall correlation in R using the cor() function with practical examples.

Published

August 17, 2022

In this tutorial, we will learn how to compute correlation between two numerical variables in R using the cor() function. We’ll cover three correlation methods:

Pearson - measures linear relationship (default)
Spearman - measures monotonic relationship using ranks
Kendall - measures ordinal association

Correlation values range from -1 to +1:

-1: Perfect negative correlation
0: No correlation (no linear or monotonic association, depending on the method)
+1: Perfect positive correlation
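To make the two endpoints of that range concrete, here is a minimal sketch with made-up vectors (`x`, `y_up`, and `y_down` are illustrative names, not part of the penguin data). An exactly linear increasing relationship gives +1 and an exactly linear decreasing one gives -1, up to floating-point error.

```r
# Illustrative vectors only; not taken from the penguins dataset
x <- 1:10
y_up <- 2 * x + 3        # exactly linear, increasing
y_down <- -0.5 * x + 7   # exactly linear, decreasing

r_up <- cor(x, y_up)      # perfect positive correlation
r_down <- cor(x, y_down)  # perfect negative correlation

c(r_up = r_up, r_down = r_down)
```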

Setup

library(palmerpenguins)
library(tidyverse)

df <- penguins %>%
  drop_na()

head(df)

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           36.7          19.3               193        3450
5 Adelie  Torgersen           39.3          20.6               190        3650
6 Adelie  Torgersen           38.9          17.8               181        3625
# ℹ 2 more variables: sex <fct>, year <int>
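The setup drops rows with missing values because `cor()` returns `NA` by default when either input contains an `NA`. The sketch below uses two made-up vectors (`a` and `b` are hypothetical) to show that behavior, along with the `use = "complete.obs"` argument as an alternative to removing rows beforehand.

```r
# Hypothetical vectors with missing values
a <- c(1, 2, NA, 4, 5)
b <- c(2, 4, 6, NA, 10)

# Default behavior: any NA in the inputs makes the result NA
r_default <- cor(a, b)

# Use only the pairs where both values are present: (1, 2), (2, 4), (5, 10)
r_complete <- cor(a, b, use = "complete.obs")

r_default
r_complete
```

Dropping incomplete rows up front, as the setup does, keeps every later call on the same set of observations.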

Pearson Correlation (Default)

Pearson correlation measures the linear relationship between two variables. It’s the default method in cor().

# Correlation between body mass and flipper length
cor(df$body_mass_g, df$flipper_length_mm)

This shows a strong positive correlation (~0.87) - penguins with larger body mass tend to have longer flippers.

# Explicitly specify method
cor(df$body_mass_g, df$flipper_length_mm, method = "pearson")
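As a rough check on what the Pearson coefficient represents, it can be reproduced by dividing the covariance by the product of the standard deviations. The vectors below are made up for illustration; this is a sketch of the textbook formula, not a description of how `cor()` is implemented internally.

```r
# Illustrative data
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 7, 9, 15)

# Pearson r = cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)
```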

Spearman Correlation

Spearman correlation measures monotonic relationships using ranks. It’s more robust to outliers than Pearson and works well for relationships that are monotonic but not linear.

cor(df$body_mass_g, df$flipper_length_mm, method = "spearman")
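One way to see what "using ranks" means: Spearman's coefficient is the Pearson coefficient computed on the ranks of the data. The sketch below uses made-up vectors where `y` always increases with `x` but far from linearly, so the Spearman value reaches 1 while the Pearson value stays below 1.

```r
# Illustrative data: strictly increasing, but not linear
x <- c(10, 20, 30, 40, 50, 60)
y <- c(1, 4, 9, 20, 80, 500)

rho_builtin <- cor(x, y, method = "spearman")
rho_manual <- cor(rank(x), rank(y))  # Pearson on the ranks
r_pearson <- cor(x, y)               # Pearson on the raw values

c(spearman = rho_builtin, spearman_manual = rho_manual, pearson = r_pearson)
```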

Kendall Correlation

Kendall’s tau measures ordinal association between variables by comparing the number of concordant and discordant pairs of observations. It’s often preferred over Spearman for small samples.

cor(df$body_mass_g, df$flipper_length_mm, method = "kendall")
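For intuition, Kendall's tau can be computed by hand on a small example by counting concordant and discordant pairs. The vectors below are made up and contain no ties; in that case the simple formula (concordant minus discordant, divided by the number of pairs) should agree with `cor(..., method = "kendall")`. With ties, `cor()` applies a correction that this sketch does not cover.

```r
# Illustrative data without ties
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# All index pairs (i, j) with i < j, one pair per column
idx <- combn(length(x), 2)

# +1 for a concordant pair, -1 for a discordant pair
s <- sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]])

tau_manual <- sum(s) / ncol(idx)
tau_builtin <- cor(x, y, method = "kendall")

c(manual = tau_manual, builtin = tau_builtin)
```

Here 8 of the 10 pairs are concordant and 2 are discordant, so tau is (8 - 2) / 10 = 0.6.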

Comparing All Three Methods

# Create a comparison
methods <- c("pearson", "spearman", "kendall")

correlations <- sapply(methods, function(m) {
  cor(df$body_mass_g, df$flipper_length_mm, method = m)
})

data.frame(
  Method = methods,
  Correlation = round(correlations, 4)
)

Correlation with Vectors

You can also compute correlation between standalone vectors:

set.seed(42)

# Generate correlated data
x <- rnorm(100, mean = 50, sd = 10)
y <- x * 2 + rnorm(100, mean = 0, sd = 5)  # y is related to x

# Compute correlations
cor(x, y, method = "pearson")
cor(x, y, method = "spearman")

Correlation Matrix

To compute correlations between multiple variables at once:

# Select numeric columns
numeric_cols <- df %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

# Correlation matrix with Pearson
round(cor(numeric_cols), 2)

# Correlation matrix with Spearman
round(cor(numeric_cols, method = "spearman"), 2)
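A correlation matrix has a few properties that are useful when reading it: it is symmetric, its diagonal is all 1s, and each cell equals the pairwise `cor()` of the corresponding two columns. The sketch below checks these on the built-in `mtcars` dataset, which is an assumption made here so the example runs without extra packages. It is not the penguin data used above.

```r
# Built-in dataset, used only so the example is self-contained
m <- cor(mtcars[, c("mpg", "hp", "wt")])

is_sym <- isSymmetric(m)                                  # m[i, j] equals m[j, i]
diag_ok <- all.equal(unname(diag(m)), rep(1, 3))          # each variable vs itself
cell_ok <- all.equal(m["mpg", "wt"], cor(mtcars$mpg, mtcars$wt))

round(m, 2)
```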

Visualizing Correlation

ggplot(df, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(aes(color = species), alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title = "Body Mass vs Flipper Length",
    subtitle = paste("Pearson r =", round(cor(df$body_mass_g, df$flipper_length_mm), 3)),
    x = "Body Mass (g)",
    y = "Flipper Length (mm)"
  ) +
  theme_minimal()

Figure 1: Scatter plot showing correlation between body mass and flipper length

When to Use Each Method

Method	Use When
Pearson	Data is roughly normally distributed without extreme outliers, relationship is linear
Spearman	Data has outliers, relationship is monotonic but not linear
Kendall	Small sample size, ordinal data, more robust estimation needed
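The effect of outliers on these choices can be seen in a small made-up example: nine points on an increasing straight line plus one extreme value. In this sketch a single outlier flips the sign of the Pearson coefficient, while the rank-based Spearman coefficient stays positive. The data are contrived to exaggerate the effect.

```r
# Nine points on a perfect increasing line, plus one extreme outlier
x <- 1:10
y <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, -60)

r_pearson <- cor(x, y, method = "pearson")
r_spearman <- cor(x, y, method = "spearman")

# Without the outlier, both methods agree on a perfect relationship
r_clean <- cor(x[-10], y[-10])

c(pearson = r_pearson, spearman = r_spearman, pearson_without_outlier = r_clean)
```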

Summary

Use cor(x, y) for Pearson correlation (default)
Use cor(x, y, method = "spearman") for rank-based correlation
Use cor(x, y, method = "kendall") for ordinal association
All methods return values from -1 to +1
