Computing Correlation with R

cor() in R
Published

August 17, 2022

In this tutorial, we will learn how to compute the correlation between two numerical variables in R using the cor() function, which ships with base R.

The correlation between two numerical variables ranges from -1 to +1. Negative values indicate that the two variables are negatively correlated, and positive values indicate that they are positively correlated. When there is no linear relationship between the two variables, the correlation will be close to zero.
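To make these three cases concrete, here is a minimal sketch using small made-up vectors. The names a, b_pos, b_neg and b_none are hypothetical and are not part of the data used later in this tutorial.

```r
# toy vector, invented purely for illustration
a <- -5:5

b_pos  <- 2 * a + 3    # rises exactly linearly with a
b_neg  <- -2 * a + 3   # falls exactly linearly with a
b_none <- a^2          # depends on a, but with no linear trend

cor(a, b_pos)    # +1, up to floating-point rounding
cor(a, b_neg)    # -1, up to floating-point rounding
cor(a, b_none)   # 0, up to floating-point rounding
```

The last case is worth noting: a value near zero rules out a linear relationship, not every kind of relationship.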

First, we will compute the correlation between two numerical vectors. Next, we will see two examples of how to compute the correlation between two numerical variables stored in a dataframe.

How to compute correlation between two numerical vectors

First, let us generate two numerical variables, x and y, using random numbers from a normal distribution.
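As a minimal sketch of this step, the block below generates two vectors and passes them to cor(). The sample size of 100, the seed, the names x_demo and y_demo, and the way y_demo is built from x_demo are all assumptions chosen for illustration, not values taken from this tutorial.

```r
set.seed(1)   # hypothetical seed, used only to make the sketch reproducible
n <- 100      # assumed sample size

# x_demo: random numbers from a standard normal distribution
x_demo <- rnorm(n, mean = 0, sd = 1)

# y_demo: x_demo plus independent normal noise, so the two are positively related
y_demo <- x_demo + rnorm(n, mean = 0, sd = 0.5)

# Pearson correlation between the two vectors
r_demo <- cor(x_demo, y_demo)
r_demo
```

Because y_demo is built from x_demo plus a smaller amount of noise, the result should be clearly positive. Increasing the sd of the noise term would pull it toward zero.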

set.seed(21)
# generate x variable: random numbers from normal distribution
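x <- …
# assumed setup: penguins data from the palmerpenguins package, rows with NA dropped
library(tidyverse)
library(palmerpenguins)
df <- penguins %>%
  drop_na()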
df %>% head()

# A tibble: 6 × 8
  species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
  <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
1 Adelie  Torge…           39.1          18.7              181        3750 male 
2 Adelie  Torge…           39.5          17.4              186        3800 fema…
3 Adelie  Torge…           40.3          18                195        3250 fema…
4 Adelie  Torge…           36.7          19.3              193        3450 fema…
5 Adelie  Torge…           39.3          20.6              190        3650 male 
6 Adelie  Torge…           38.9          17.8              181        3625 fema…
# … with 1 more variable: year <int>

To compute the correlation between body mass and flipper length, we will extract those two columns from the dataframe and save them as new vectors.

body_mass <- df %>%
  pull(body_mass_g)
flipper_length <- df %>%
  pull(flipper_length_mm)

Now we can compute the correlation as before using the cor() function. In this example, the two variables are highly correlated, with a Pearson correlation of about 0.87.

cor(body_mass, flipper_length)

[1] 0.8729789
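cor() returns the Pearson correlation by default. To show what that number is, the sketch below recomputes it by hand and compares the result with cor(). The vectors u and v are toy values invented for illustration; they are not penguin measurements.

```r
# toy measurements, invented for illustration (not penguin data)
u <- c(2.1, 3.4, 3.9, 5.2, 6.8)
v <- c(1.0, 2.2, 2.9, 4.1, 5.5)

# deviations of each value from its vector's mean
du <- u - mean(u)
dv <- v - mean(v)

# Pearson's r: sum of cross-products of the deviations, divided by the
# square root of the product of the sums of squared deviations
# (the n - 1 factors of the covariance and standard deviations cancel)
r_manual <- sum(du * dv) / sqrt(sum(du^2) * sum(dv^2))

# agrees with the built-in function up to floating-point rounding
all.equal(r_manual, cor(u, v))
```

The same quantity can also be written as cov(u, v) / (sd(u) * sd(v)).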

In base R, the $ operator gives direct access to a column of a dataframe. In this second approach, we skip the intermediate vectors and pass the columns to cor() directly using $, as shown below.

cor(df$body_mass_g, df$flipper_length_mm)

[1] 0.8729789
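The dataframe example calls drop_na() before computing the correlation. The sketch below shows why removing missing values matters, and one alternative that cor() itself offers through its use argument. The vectors p and q are toy values invented for illustration.

```r
# toy vectors with one missing value, invented for illustration
p <- c(1, 2, NA, 4, 5)
q <- c(2, 4, 6, 8, 10)

# with the default settings, a missing value makes the result NA
cor(p, q)

# use = "complete.obs" keeps only the pairs where both values are present
cor(p, q, use = "complete.obs")
```

Dropping incomplete rows up front, as the tutorial does, and passing use = "complete.obs" give the same result for a single pair of variables. The up-front approach also keeps the rest of the analysis on the same set of rows.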