How to compute Z-score
In this post, we will learn how to compute Z-scores in R using two different approaches: first manually with the Z-score formula, and then with the scale() function available in base R.
What is Z-score
Z-score is a commonly used transformation technique to standardize/normalize a numerical variable. Transforming numerical variables into Z-scores puts them on a common scale, which enables us to compare variables easily.
At its core, a Z-score is a statistical measure that describes how far a data point is from the mean of a variable, expressed in standard deviations. It is a useful way to understand how unusual or typical a particular data point is within a distribution.
How to calculate Z-score
To compute the Z-score of a numerical variable, we need the mean and standard deviation of the variable. We can then compute the Z-score for each value by subtracting the mean from the value and dividing the result by the standard deviation, that is, z = (x - mean) / sd.
The magnitude of the Z-score shows how many standard deviations away from the mean the data point is. A Z-score of 0 means the data point is exactly at the mean. A positive Z-score indicates the data point is above the mean. A negative Z-score indicates the data point is below the mean.
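As a quick illustration of the formula and of how to read the sign, here is a minimal sketch on a toy vector. The values are made up so the arithmetic is easy to check by hand; they are not from the penguins data.

```r
# Toy data chosen so the arithmetic is easy to follow (an assumption, not penguin data)
x <- c(10, 20, 30)

x_mean <- mean(x)  # 20
x_sd   <- sd(x)    # 10 (sd() uses the sample formula with n - 1 in the denominator)

# Z-score: subtract the mean from each value, then divide by the standard deviation
z <- (x - x_mean) / x_sd
z
# [1] -1  0  1
```

The value 10 is one standard deviation below the mean, 20 is exactly at the mean, and 30 is one standard deviation above it.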
Let us load the packages needed.
library(tidyverse)
library(palmerpenguins)
theme_set(theme_bw(16))
We will use two variables from the Palmer penguins dataset to show how to compute Z-scores.
df <- penguins %>%
  drop_na() %>%
  select(bill_depth_mm, body_mass_g)
df |> head()
# A tibble: 6 × 2
bill_depth_mm body_mass_g
1 18.7 3750
2 17.4 3800
3 18 3250
4 19.3 3450
5 20.6 3650
6 17.8 3625
Let us use the summary() function to get a quick summary of the two numerical variables.
df |> summary()
bill_depth_mm body_mass_g
Min. :13.10 Min. :2700
1st Qu.:15.60 1st Qu.:3550
Median :17.30 Median :4050
Mean :17.16 Mean :4207
3rd Qu.:18.70 3rd Qu.:4775
Max. :21.50 Max. :6300
Let us manually compute the Z-score for each of these two variables.
df <- df |>
  mutate(bill_depth_zscore_m = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm),
         body_mass_zscore_m = (body_mass_g - mean(body_mass_g)) / sd(body_mass_g))
df |> head()
# A tibble: 6 × 4
bill_depth_mm body_mass_g bill_depth_zscore_m body_mass_zscore_m
1 18.7 3750 0.780 -0.568
2 17.4 3800 0.119 -0.506
3 18 3250 0.424 -1.19
4 19.3 3450 1.08 -0.940
5 20.6 3650 1.74 -0.692
6 17.8 3625 0.323 -0.723
We can look at the summary to see how different the Z-scores are from the original values of the variables.
df |>
summary()
bill_depth_mm body_mass_g bill_depth_zscore_m body_mass_zscore_m
Min. :13.10 Min. :2700 Min. :-2.06418 Min. :-1.8716
1st Qu.:15.60 1st Qu.:3550 1st Qu.:-0.79466 1st Qu.:-0.8160
Median :17.30 Median :4050 Median : 0.06862 Median :-0.1950
Mean :17.16 Mean :4207 Mean : 0.00000 Mean : 0.0000
3rd Qu.:18.70 3rd Qu.:4775 3rd Qu.: 0.77956 3rd Qu.: 0.7053
Max. :21.50 Max. :6300 Max. : 2.20143 Max. : 2.5992
We can also make a scatter plot to see the effect of the Z-score transformation on the original variable. The points fall on a straight line, because a Z-score is a linear rescaling of the original values. As expected, the range of the Z-scores is very different from the range of the original values.
df |>
  ggplot(aes(x = bill_depth_mm, y = bill_depth_zscore_m)) +
  geom_point() +
  labs(title = "Z-score: before and after")
Comparing Z-score with its original values
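The same properties can be checked numerically. The sketch below uses a small hypothetical vector (not the penguins data) to show that Z-scores have mean 0 and standard deviation 1, and that they are perfectly correlated with the original values.

```r
# Hypothetical toy values, used only to illustrate the properties
x <- c(13.1, 15.6, 17.3, 18.7, 21.5)
z <- (x - mean(x)) / sd(x)

mean(z)    # zero, up to floating-point rounding
sd(z)      # 1, up to floating-point rounding
cor(x, z)  # 1, because a Z-score is a linear rescaling of x

range(x)   # in the original units
range(z)   # unitless, measured in standard deviations
```

Because the transformation only shifts and rescales the values, the shape of the distribution does not change; only the axis it is measured on does.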
Computing Z-score using scale() function
We can also compute Z-scores using the scale() function available in base R.
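One detail worth knowing before using scale() inside mutate(): it returns a one-column matrix with extra attributes rather than a plain vector. Wrapping the result in c() is one way to get a plain numeric vector back. A minimal sketch on toy values (made up for illustration):

```r
x <- c(10, 20, 30)  # toy values for illustration

s <- scale(x)
is.matrix(s)              # TRUE: one column, three rows
attr(s, "scaled:center")  # the mean that was subtracted (20)
attr(s, "scaled:scale")   # the standard deviation used (10)

# c() drops the dim and scaling attributes, leaving a plain numeric vector
z <- c(s)
is.matrix(z)              # FALSE
z
# [1] -1  0  1
```

as.numeric() or as.vector() would work in place of c() here; the choice is mostly a matter of taste.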
df <- df |>
  mutate(bill_depth_zscore = c(scale(bill_depth_mm)),
         body_mass_zscore = c(scale(body_mass_g)))
df |> head()
# A tibble: 6 × 6
bill_depth_mm body_mass_g bill_depth_zscore_m body_mass_zscore_m
1 18.7 3750 0.780 -0.568
2 17.4 3800 0.119 -0.506
3 18 3250 0.424 -1.19
4 19.3 3450 1.08 -0.940
5 20.6 3650 1.74 -0.692
6 17.8 3625 0.323 -0.723
# ℹ 2 more variables: bill_depth_zscore <dbl>, body_mass_zscore <dbl>
We can check the summaries of the Z-scores computed by the two approaches.
df |>
select(-bill_depth_mm, -body_mass_g )|>
summary()
bill_depth_zscore_m body_mass_zscore_m bill_depth_zscore body_mass_zscore
Min. :-2.06418 Min. :-1.8716 Min. :-2.06418 Min. :-1.8716
1st Qu.:-0.79466 1st Qu.:-0.8160 1st Qu.:-0.79466 1st Qu.:-0.8160
Median : 0.06862 Median :-0.1950 Median : 0.06862 Median :-0.1950
Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
3rd Qu.: 0.77956 3rd Qu.: 0.7053 3rd Qu.: 0.77956 3rd Qu.: 0.7053
Max. : 2.20143 Max. : 2.5992 Max. : 2.20143 Max. : 2.5992
We can check whether each element of the Z-scores computed by the two approaches is the same.
all.equal(df$bill_depth_zscore_m,
df$bill_depth_zscore)
[1] TRUE
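A note on the check above: all.equal() compares numbers within a small tolerance, which suits floating-point results like these. When the values differ, it returns a text description instead of FALSE, so it is usually wrapped in isTRUE() inside a condition. A minimal sketch on hypothetical toy values:

```r
x <- c(13.1, 15.6, 17.3, 18.7, 21.5)  # hypothetical toy values

z_manual <- (x - mean(x)) / sd(x)
z_scale  <- c(scale(x))

# all.equal() tolerates tiny floating-point differences
# (the default tolerance is about 1.5e-8)
all.equal(z_manual, z_scale)

# When values differ, all.equal() returns a character description, not FALSE,
# so wrap it in isTRUE() when using it in a condition
isTRUE(all.equal(z_manual, z_manual + 1))
# [1] FALSE
```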