How to Compute Z-Score of Multiple Columns
In this post, we will learn how to compute Z-scores of multiple variables (columns) at the same time with tidyverse in R, using three different approaches.
First, we will show an example of computing Z-scores of multiple columns where all the columns in the dataframe are numeric, and then we will show examples where the dataframe has both numeric and non-numeric columns.
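As a quick reminder, the Z-score of a value is its distance from the mean, measured in standard deviations: z = (x - mean(x)) / sd(x). Here is a minimal base R sketch on a made-up vector (the values in `x` are purely illustrative and are not from the penguins data):

```r
# Hypothetical example values, chosen only for illustration
x <- c(2, 4, 6, 8, 10)

# Z-score: subtract the mean, then divide by the sample standard deviation
z <- (x - mean(x)) / sd(x)

round(z, 3)
# -1.265 -0.632  0.000  0.632  1.265

mean(z)  # approximately 0
sd(z)    # approximately 1
```

After standardizing, a variable has mean 0 and standard deviation 1, which is what makes columns measured in different units comparable.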
library(tidyverse)
library(palmerpenguins)

penguins |>
  head()
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>

Compute Z-score on multiple columns: When all columns are numeric
Let us first keep only the numeric columns of our dataframe using the where() and is.numeric() functions. Along the way, we drop rows with missing values and remove the year column.
df <- penguins |>
  drop_na() |>
  select(-year) |>
  select(where(is.numeric))
Now we have a dataframe where all columns are numeric.
df
# A tibble: 333 × 4
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 39.1 18.7 181 3750
2 39.5 17.4 186 3800
3 40.3 18 195 3250
4 36.7 19.3 193 3450
5 39.3 20.6 190 3650
6 38.9 17.8 181 3625
7 39.2 19.6 195 4675
8 41.1 17.6 182 3200
9 38.6 21.2 191 3800
10 34.6 21.1 198 4400
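# ℹ 323 more rows
We can use the scale() function to compute Z-scores for all the columns at once. Since scale() returns a matrix, we convert the result back to a tibble with as_tibble().
df |>
  scale() |>
  as_tibble()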
# A tibble: 333 × 4
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 -0.895 0.780 -1.42 -0.568
2 -0.822 0.119 -1.07 -0.506
3 -0.675 0.424 -0.426 -1.19
4 -1.33 1.08 -0.568 -0.940
5 -0.858 1.74 -0.782 -0.692
6 -0.931 0.323 -1.42 -0.723
7 -0.876 1.24 -0.426 0.581
8 -0.529 0.221 -1.35 -1.25
9 -0.986 2.05 -0.711 -0.506
10 -1.72 2.00 -0.212 0.240
# ℹ 323 more rows

Compute Z-score on multiple columns: When some columns are non-numeric
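A dataframe that contains non-numeric columns cannot be passed straight to scale(), which is why the numeric columns have to be picked out first. Here is a minimal base R sketch using a small made-up data frame (`mixed` and its values are hypothetical, not part of the penguins data):

```r
# Hypothetical data frame with one character and one numeric column
mixed <- data.frame(
  species = c("a", "b", "c"),
  mass    = c(1, 2, 3)
)

# scale() converts its input to a matrix; with a character column present
# the whole matrix is character, so computing the column means fails
res <- tryCatch(scale(mixed), error = function(e) e)
inherits(res, "error")
# TRUE

# Scaling only the numeric column works as expected
as.numeric(scale(mixed$mass))
# -1  0  1
```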
The second approach computes Z-scores of all the numeric columns of a dataframe where some columns are numeric and others are not. We use across() inside mutate() to select the numeric columns with where(is.numeric) and compute the Z-scores manually, as shown below. The non-numeric columns are left unchanged.
penguins |>
  drop_na() |>
  select(-year) |>
  mutate(across(where(is.numeric), ~ (. - mean(.)) / sd(.)))
# A tibble: 333 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen -0.895 0.780 -1.42 -0.568
2 Adelie Torgersen -0.822 0.119 -1.07 -0.506
3 Adelie Torgersen -0.675 0.424 -0.426 -1.19
4 Adelie Torgersen -1.33 1.08 -0.568 -0.940
5 Adelie Torgersen -0.858 1.74 -0.782 -0.692
6 Adelie Torgersen -0.931 0.323 -1.42 -0.723
7 Adelie Torgersen -0.876 1.24 -0.426 0.581
8 Adelie Torgersen -0.529 0.221 -1.35 -1.25
9 Adelie Torgersen -0.986 2.05 -0.711 -0.506
10 Adelie Torgersen -1.72 2.00 -0.212 0.240
# ℹ 323 more rows
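# ℹ 1 more variable: sex <fct>
The third approach is similar to the one above, but this time we use the scale() function inside across() instead of computing the Z-score manually. Because scale() returns a one-column matrix for each column, we use [, 1] to extract the values as a plain vector.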
penguins |>
  drop_na() |>
  select(-year) |>
  mutate(across(where(is.numeric), ~ scale(.)[, 1]))
# A tibble: 333 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen -0.895 0.780 -1.42 -0.568
2 Adelie Torgersen -0.822 0.119 -1.07 -0.506
3 Adelie Torgersen -0.675 0.424 -0.426 -1.19
4 Adelie Torgersen -1.33 1.08 -0.568 -0.940
5 Adelie Torgersen -0.858 1.74 -0.782 -0.692
6 Adelie Torgersen -0.931 0.323 -1.42 -0.723
7 Adelie Torgersen -0.876 1.24 -0.426 0.581
8 Adelie Torgersen -0.529 0.221 -1.35 -1.25
9 Adelie Torgersen -0.986 2.05 -0.711 -0.506
10 Adelie Torgersen -1.72 2.00 -0.212 0.240
# ℹ 323 more rows
# ℹ 1 more variable: sex <fct>
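As a sanity check, the manual formula and the scale() version should give the same numbers. The sketch below compares the two on a small made-up tibble (`toy` and its columns are hypothetical, chosen only for illustration), and also shows that scale() returns a matrix, which is the reason for the [, 1] subsetting:

```r
library(dplyr)

# Hypothetical toy data, not the penguins dataset
toy <- tibble(
  group  = c("a", "a", "b", "b", "b"),
  height = c(150, 160, 170, 180, 190),
  weight = c(50, 55, 65, 80, 100)
)

# scale() returns a one-column matrix, not a plain vector
m <- scale(toy$height)
is.matrix(m)
# TRUE

# Manual Z-score on every numeric column
manual <- toy |>
  mutate(across(where(is.numeric), ~ (.x - mean(.x)) / sd(.x)))

# Same computation with scale(); [, 1] extracts the column as a vector
scaled <- toy |>
  mutate(across(where(is.numeric), ~ scale(.x)[, 1]))

# Both approaches agree up to floating-point tolerance
all.equal(manual$height, scaled$height)
all.equal(manual$weight, scaled$weight)
```

In both versions the non-numeric `group` column passes through untouched, just like species, island and sex in the penguins examples.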