How to calculate z-scores for multiple columns in R
Introduction
Data standardization (also called z-score normalization) is a preprocessing step that transforms each variable to have a mean of 0 and a standard deviation of 1. It matters when variables are on different scales, particularly before applying scale-sensitive machine learning algorithms or when comparing variables measured in different units.
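As a quick illustration of the transformation itself, here is a minimal base R sketch on a made-up vector (the values are hypothetical and not taken from the penguins data). It subtracts the mean, divides by the sample standard deviation, and then checks the result.

```r
# Toy measurements (hypothetical values, not from the penguins data)
x <- c(2, 4, 6, 8)

# z-score: subtract the mean, then divide by the sample standard deviation
z <- (x - mean(x)) / sd(x)

mean(z)  # should be 0, up to floating-point error
sd(z)    # should be 1, up to floating-point error
```

Note that sd() uses the sample (n - 1) denominator, so these are sample z-scores.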
Setup and Data Overview
Let’s start by loading the necessary packages and examining our dataset.
library(palmerpenguins)
library(tidyverse)
theme_set(theme_bw(16))
The Palmer penguins dataset contains measurements of three penguin species. Let’s take a look at the structure of our data:
penguins |>
  head()
This gives us a good overview of the variables we’ll be working with, including both numeric measurements and categorical variables.
Preparing the Data
Before standardizing, we need to clean our data and select only the numeric variables we want to transform.
df <- penguins |>
  drop_na() |>
  select(-year) |>
  select(where(is.numeric))
We remove missing values with drop_na(), exclude the year variable (since it’s not a measurement), and keep only numeric columns. This creates a clean dataset with four numeric variables: bill length, bill depth, flipper length, and body mass.
Let’s examine our prepared dataset:
df |>
  head()
Method 1: Using Base R’s scale() Function
The scale() function is the most straightforward way to standardize variables in R.
scaled_data <- df |>
  scale() |>
  as_tibble()
The scale() function returns a matrix, so we convert it back to a tibble for easier manipulation. Each variable now has a mean of 0 and standard deviation of 1, making them directly comparable.
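One way to confirm the result is to summarise the scaled columns. The sketch below rebuilds df as in the preparation step and keeps the intermediate matrix under the name scaled_matrix (a name introduced here for illustration). That way the "scaled:center" and "scaled:scale" attributes that scale() attaches to the matrix remain available for back-transforming.

```r
library(palmerpenguins)
library(tidyverse)

# Rebuild the numeric-only data frame from the preparation step
df <- penguins |>
  drop_na() |>
  select(-year) |>
  select(where(is.numeric))

# Keep the raw matrix so the centering/scaling attributes stay available
scaled_matrix <- scale(df)
scaled_data <- as_tibble(scaled_matrix)

# Column means should be ~0 and standard deviations ~1
scaled_data |>
  summarise(across(everything(), list(mean = mean, sd = sd)))

# scale() stores the original means and SDs as attributes on the matrix
centers <- attr(scaled_matrix, "scaled:center")
scales  <- attr(scaled_matrix, "scaled:scale")

# Undo the transformation for one column: z * sd + mean
restored <- scaled_data$body_mass_g * scales[["body_mass_g"]] +
  centers[["body_mass_g"]]
all.equal(restored, as.numeric(df$body_mass_g))
```

The attributes are not carried over to the tibble, so save them from the matrix if you expect to convert z-scores back to the original units later.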
Method 2: Manual Standardization with across()
For more control over the process, we can manually calculate z-scores using the standardization formula: (x - mean(x)) / sd(x).
penguins |>
  drop_na() |>
  select(-year) |>
  mutate(across(where(is.numeric), ~ (. - mean(.)) / sd(.)))
This approach keeps both categorical and numeric variables together while only transforming the numeric ones. The across() function applies our standardization formula to each numeric column.
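To show what the extra control can look like, here is a sketch of one possible variation. The helper z_score() is hypothetical (defined here, not part of any package). It ignores missing values instead of requiring drop_na(), and the .names argument of across() keeps the original columns and adds the standardized versions alongside them.

```r
library(palmerpenguins)
library(tidyverse)

# Hypothetical helper (not part of any package): z-score that skips NAs
z_score <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

penguins_z <- penguins |>
  select(-year) |>
  # .names keeps the original columns and adds new ones with a "_z" suffix
  mutate(across(where(is.numeric), z_score, .names = "{.col}_z"))

penguins_z |>
  select(ends_with("_z")) |>
  head()
```

Rows with a missing measurement are kept, and their z-score is simply NA. Whether that is preferable to dropping the rows depends on the analysis.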
Method 3: Using scale() within mutate()
We can also combine the convenience of scale() with the flexibility of mutate():
penguins |>
  drop_na() |>
  select(-year) |>
  mutate(across(where(is.numeric), ~ scale(.)[, 1]))
The [, 1] extracts the first (and only) column from the matrix returned by scale(). This method preserves the original data structure while standardizing only the numeric variables.
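The reason for the [, 1] is easier to see on a small vector. In this sketch (toy values, chosen only for illustration), scale() turns a plain vector into a one-column matrix. Without the subsetting, mutate() would store that matrix as a matrix column rather than an ordinary numeric column.

```r
x <- c(10, 20, 30, 40)  # hypothetical toy values

s <- scale(x)
is.matrix(s)  # TRUE
dim(s)        # 4 1

# Two equivalent ways to get a plain numeric vector back
v1 <- s[, 1]
v2 <- as.numeric(s)
all.equal(v1, v2)
```

Either s[, 1] or as.numeric(s) should work inside across(); the former is what the pipeline above uses.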
Summary
We’ve explored three methods for standardizing data in R: using base R’s scale() function, manual calculation with the z-score formula, and combining scale() with mutate(). Each approach has its advantages: scale() alone is simplest for numeric-only data, while the mutate() approaches are better when you need to preserve categorical variables alongside your standardized numeric data. Choose the method that best fits your specific analysis needs.
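As a sanity check, the three methods should produce the same numbers. The sketch below recomputes each one on the cleaned data and compares the numeric columns as matrices. The object names clean, m1, m2, and m3 are introduced here for illustration only.

```r
library(palmerpenguins)
library(tidyverse)

clean <- penguins |>
  drop_na() |>
  select(-year)

# Method 1: scale() on the numeric columns only
m1 <- clean |>
  select(where(is.numeric)) |>
  scale() |>
  as_tibble()

# Method 2: manual z-score formula inside across()
m2 <- clean |>
  mutate(across(where(is.numeric), ~ (.x - mean(.x)) / sd(.x))) |>
  select(where(is.numeric))

# Method 3: scale() inside across()
m3 <- clean |>
  mutate(across(where(is.numeric), ~ scale(.x)[, 1])) |>
  select(where(is.numeric))

# Compare as plain matrices so only the values and column names matter
all.equal(as.matrix(m1), as.matrix(m2))
all.equal(as.matrix(m2), as.matrix(m3))
```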