How to Compute Z-Score of Multiple Columns

R Function

scale()

Published

October 27, 2024

In this post, we will learn how to compute the Z-score of multiple variables (columns) at the same time in R with the tidyverse, using three approaches.

First, we will show an example of computing Z-scores for multiple columns when all the columns in the dataframe are numeric. Then we will show examples where the dataframe has both numeric and non-numeric columns.

library(tidyverse)
library(palmerpenguins)

penguins |>
  head()

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Compute Z-score on multiple columns: When all columns are numeric

Let us first drop the rows with missing values and the year column, and then select all numerical columns in our dataframe using the where() and is.numeric() functions.

df <- penguins |>
  drop_na() |>
  select(-year) |>
  select(where(is.numeric))

Now we have a dataframe where all columns are numeric.

df

# A tibble: 333 × 4
   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
            <dbl>         <dbl>             <int>       <int>
 1           39.1          18.7               181        3750
 2           39.5          17.4               186        3800
 3           40.3          18                 195        3250
 4           36.7          19.3               193        3450
 5           39.3          20.6               190        3650
 6           38.9          17.8               181        3625
 7           39.2          19.6               195        4675
 8           41.1          17.6               182        3200
 9           38.6          21.2               191        3800
10           34.6          21.1               198        4400
# ℹ 323 more rows

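We can use the scale() function to compute Z-scores for all the columns at once. scale() returns a matrix, so we convert the result back to a tibble with as_tibble().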

df |> 
  scale() |>
  as_tibble()

# A tibble: 333 × 4
   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
            <dbl>         <dbl>             <dbl>       <dbl>
 1         -0.895         0.780            -1.42       -0.568
 2         -0.822         0.119            -1.07       -0.506
 3         -0.675         0.424            -0.426      -1.19 
 4         -1.33          1.08             -0.568      -0.940
 5         -0.858         1.74             -0.782      -0.692
 6         -0.931         0.323            -1.42       -0.723
 7         -0.876         1.24             -0.426       0.581
 8         -0.529         0.221            -1.35       -1.25 
 9         -0.986         2.05             -0.711      -0.506
10         -1.72          2.00             -0.212       0.240
# ℹ 323 more rows
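
As a quick sanity check on what scale() does, the sketch below uses a small made-up data frame (toy, with hypothetical values rather than the penguins data) and confirms that each scaled column has mean 0 and standard deviation 1. It also reads the scaled:center and scaled:scale attributes, where scale() records the column means and standard deviations it used.

```r
# Toy data frame with made-up values (not the penguins data)
toy <- data.frame(
  a = c(2, 4, 6, 8),
  b = c(10, 20, 40, 50)
)

# scale() centers each column on its mean and divides by its standard deviation
z <- scale(toy)

# Each scaled column should have mean ~0 and standard deviation 1
col_means <- colMeans(z)
col_sds   <- apply(z, 2, sd)

# scale() keeps the values it used for centering and scaling as attributes
centers <- attr(z, "scaled:center")
scales  <- attr(z, "scaled:scale")
```

Keeping the centers and scales around can be handy if the Z-scores later need to be converted back to the original units.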

Compute Z-score on multiple columns: When some columns are non-numeric

The second approach computes the Z-score of all numerical columns in a dataframe where some columns are numeric and others are not. We use the across() function inside mutate() to select the numerical columns and compute the Z-scores manually, subtracting each column's mean and dividing by its standard deviation, as shown below.

penguins |>
  drop_na() |>
  select(-year) |>
  mutate(across(where(is.numeric), ~ (.-mean(.)) / sd(.)))

# A tibble: 333 × 7
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen         -0.895         0.780            -1.42       -0.568
 2 Adelie  Torgersen         -0.822         0.119            -1.07       -0.506
 3 Adelie  Torgersen         -0.675         0.424            -0.426      -1.19 
 4 Adelie  Torgersen         -1.33          1.08             -0.568      -0.940
 5 Adelie  Torgersen         -0.858         1.74             -0.782      -0.692
 6 Adelie  Torgersen         -0.931         0.323            -1.42       -0.723
 7 Adelie  Torgersen         -0.876         1.24             -0.426       0.581
 8 Adelie  Torgersen         -0.529         0.221            -1.35       -1.25 
 9 Adelie  Torgersen         -0.986         2.05             -0.711      -0.506
10 Adelie  Torgersen         -1.72          2.00             -0.212       0.240
# ℹ 323 more rows
# ℹ 1 more variable: sex <fct>
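
The example above removes incomplete rows with drop_na() before computing the mean and standard deviation. If dropping rows is not desirable, one possible variation is to keep them and pass na.rm = TRUE to mean() and sd(). The sketch below illustrates this on a made-up data frame; the helper name z_score and the toy values are assumptions for illustration, not part of the original example.

```r
library(dplyr)

# Hypothetical helper: Z-score that ignores missing values when
# computing the mean and standard deviation
z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Made-up data with one non-numeric column and one missing value
toy <- data.frame(
  group = c("a", "b", "c", "d"),
  x = c(1, 2, 3, NA),
  y = c(10, 20, 30, 40)
)

# Only the numeric columns are transformed; group is left untouched
toy_z <- toy |>
  mutate(across(where(is.numeric), z_score))
```

With this variation, a missing value stays NA in the result, while the other values in that column are standardized using the observed values only.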

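The third approach is similar to the one above, but this time we use the scale() function inside across() instead of computing the Z-score manually. Since scale() returns a one-column matrix when given a single column, we extract that column with [, 1] to get a plain numeric vector.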

penguins |>
  drop_na() |>
  select(-year) |>
  mutate(across(where(is.numeric), ~ scale(.)[, 1]))

# A tibble: 333 × 7
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen         -0.895         0.780            -1.42       -0.568
 2 Adelie  Torgersen         -0.822         0.119            -1.07       -0.506
 3 Adelie  Torgersen         -0.675         0.424            -0.426      -1.19 
 4 Adelie  Torgersen         -1.33          1.08             -0.568      -0.940
 5 Adelie  Torgersen         -0.858         1.74             -0.782      -0.692
 6 Adelie  Torgersen         -0.931         0.323            -1.42       -0.723
 7 Adelie  Torgersen         -0.876         1.24             -0.426       0.581
 8 Adelie  Torgersen         -0.529         0.221            -1.35       -1.25 
 9 Adelie  Torgersen         -0.986         2.05             -0.711      -0.506
10 Adelie  Torgersen         -1.72          2.00             -0.212       0.240
# ℹ 323 more rows
# ℹ 1 more variable: sex <fct>
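
To see why the [, 1] subsetting matters, here is a minimal base R sketch on a made-up vector (the values are hypothetical). It shows that scale() on a single vector returns a one-column matrix, that [, 1] turns it into a plain numeric vector, and that the result agrees with the manual formula used in the second approach.

```r
x <- c(2, 4, 6, 8)   # made-up values

m <- scale(x)        # one-column matrix, not a plain vector
v <- scale(x)[, 1]   # plain numeric vector

# Manual Z-score for comparison
manual <- (x - mean(x)) / sd(x)

is.matrix(m)                  # TRUE
is.null(dim(v))               # TRUE
isTRUE(all.equal(v, manual))  # TRUE
```

Without the subsetting, each transformed column inside mutate() would be stored as a matrix column rather than an ordinary numeric column.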
