How to Compute Z-Score of Multiple Columns

R Function

scale()

Learn how to compute z-scores for multiple columns in R, with practical examples and code snippets.

Published

October 27, 2024

Introduction

Z-scores standardize data by measuring how many standard deviations a value lies from the mean. Computing z-scores for several columns at once is a common preprocessing step, especially when preparing data for machine learning or comparing variables measured on different scales.
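As a quick illustration of the definition, here is a minimal sketch that computes z-scores by hand as (value - mean) / standard deviation. The vector `x` is made up for illustration and is not part of the penguins data. Note that R's sd() uses the sample standard deviation (n - 1 in the denominator).

```r
# Toy vector (hypothetical values, for illustration only)
x <- c(2, 4, 6, 8, 10)

# z-score: distance from the mean, in units of standard deviation
z <- (x - mean(x)) / sd(x)

round(z, 3)
# -1.265 -0.632  0.000  0.632  1.265
```

The middle value equals the mean, so its z-score is 0; values below the mean get negative scores and values above it get positive ones.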

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Z-Score Calculation

The Problem

We need to standardize multiple numeric columns in a dataset to make them comparable. Let’s start with the penguins dataset and standardize the body measurement columns.

Step 1: Prepare the Data

First, we’ll examine our dataset and select the numeric columns we want to standardize.

# Load and examine the penguins data (explicitly from palmerpenguins)
data(penguins, package = "palmerpenguins")
head(penguins)

# Select numeric columns for z-score calculation
numeric_cols <- c("bill_length_mm", "bill_depth_mm", 
                  "flipper_length_mm", "body_mass_g")

This gives us four body measurement columns that we’ll standardize.

Step 2: Calculate Z-Scores Using Base R

We’ll use the scale() function to compute z-scores for multiple columns at once.

# Calculate z-scores for selected columns
penguins_scaled <- penguins |>
  select(all_of(numeric_cols)) |>
  drop_na() |>
  scale() |>
  as_tibble()

head(penguins_scaled)

The scale() function automatically computes z-scores by subtracting the mean and dividing by the standard deviation for each column.
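One detail worth knowing: scale() returns a numeric matrix rather than a data frame, which is why the pipeline above ends with as_tibble(). It also stores the column means and standard deviations it used as the attributes "scaled:center" and "scaled:scale". The following base R sketch shows this on a made-up data frame (`toy` and its columns are hypothetical):

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(10, 20, 30, 40, 100))

scaled <- scale(toy)

# The result is a matrix, not a data frame
is.matrix(scaled)

# The means and standard deviations used are kept as attributes
centers <- attr(scaled, "scaled:center")
scales  <- attr(scaled, "scaled:scale")

# Manual z-scores for column b match the scale() output
manual_b <- (toy$b - mean(toy$b)) / sd(toy$b)
all.equal(unname(scaled[, "b"]), manual_b)
```

Keeping `centers` and `scales` around can be handy if the same transformation later needs to be applied to other data.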

Step 3: Verify the Standardization

Let’s confirm our z-scores have mean ≈ 0 and standard deviation ≈ 1.

# Check means and standard deviations
penguins_scaled |>
  summarise(across(everything(), 
                   list(mean = mean, sd = sd),
                   .names = "{.col}_{.fn}"))

Perfect! All means are essentially zero and standard deviations are 1, confirming our standardization worked correctly.
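The means come out as "essentially" zero rather than exactly zero because of floating-point rounding. If you want to check this programmatically, a tolerance-based comparison is safer than `==`. A minimal base R sketch, using a made-up matrix `m` (hypothetical values):

```r
# Hypothetical toy matrix, for illustration only
m <- scale(cbind(a = c(1.5, 2.5, 4, 7),
                 b = c(100, 250, 300, 900)))

# Column means are tiny floating-point values, not exact zeros,
# so compare against a small tolerance instead of using ==
all(abs(colMeans(m)) < 1e-12)

# Column standard deviations equal 1 up to numerical tolerance
all.equal(unname(apply(m, 2, sd)), c(1, 1))
```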

Example 2: Practical Application with Custom Function

The Problem

In real-world scenarios, you often need more control over the z-score calculation. Let’s create a custom function that keeps rows with missing values (instead of dropping them) and preserves the other columns in our dataset.

Step 1: Create a Custom Z-Score Function

We’ll build a flexible function that can standardize selected columns while preserving the original dataset structure.

# Custom function to calculate z-scores
calculate_z_scores <- function(data, cols) {
  data |>
    mutate(across(all_of(cols), 
                  ~ (. - mean(., na.rm = TRUE)) / sd(., na.rm = TRUE),
                  .names = "{.col}_z"))
}

This function creates new columns with “_z” suffix containing the standardized values.
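To see how this pattern treats missing values, here is a small self-contained sketch. The function is restated under the hypothetical name `z_score_cols` so the block runs on its own, and `toy` is made-up data:

```r
library(dplyr)

# Same idea as the function above, restated so this sketch is self-contained
# (`z_score_cols` and `toy` are hypothetical names)
z_score_cols <- function(data, cols) {
  data |>
    mutate(across(all_of(cols),
                  \(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE),
                  .names = "{.col}_z"))
}

toy <- tibble(id = 1:4,
              height = c(150, 160, NA, 170),
              weight = c(50, 60, 70, 80))

result <- z_score_cols(toy, c("height", "weight"))
result$height_z
# -1  0 NA  1
```

The mean and standard deviation are computed from the non-missing values only, every row is kept, and a missing input simply produces a missing z-score.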

Step 2: Apply the Function to Our Dataset

Now we’ll apply our custom function. All original columns and rows are kept; rows with a missing measurement simply get NA for the corresponding z-score.

# Apply z-score calculation while keeping original data
penguins_with_z <- penguins |>
  calculate_z_scores(numeric_cols)

# View the results (ends_with() matches only the new "_z" columns)
penguins_with_z |>
  select(species, ends_with("_z")) |>
  head()

This approach preserves the original data while adding standardized versions, making it easier to compare different species’ measurements.
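If you would rather reuse scale() while still keeping the other columns, one option is to call it inside across(). Because scale() returns a one-column matrix, the sketch below wraps it in as.numeric() so each new column is a plain numeric vector. The tibble `toy` and its columns are hypothetical:

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(group = c("a", "a", "b", "b"),
              x = c(1, 2, 3, 6),
              y = c(10, NA, 30, 50))

toy_z <- toy |>
  mutate(across(c(x, y),
                \(col) as.numeric(scale(col)),  # drop the matrix structure
                .names = "{.col}_z"))

toy_z$y_z
# -1 NA  0  1
```

scale() ignores missing values when computing the mean and standard deviation, so the result matches the custom formula with `na.rm = TRUE`.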

Step 3: Visualize the Standardized Data

Let’s create a visualization to see how standardization affects our data distribution.

# Compare original vs standardized distributions
penguins_with_z |>
  select(bill_length_mm, bill_length_mm_z) |>
  filter(!is.na(bill_length_mm)) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = value, fill = name)) +
  geom_density(alpha = 0.7) +
  facet_wrap(~name, scales = "free") +
  labs(title = "Original vs Z-score Standardized Bill Length",
       x = "Value", y = "Density")

Figure: Density plots of raw penguin bill length (mm) and its z-score standardized version. The shape is preserved, while the standardized values have mean zero and unit variance.

The standardized version shows the same distribution shape but centered at zero with unit variance.
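The shape is preserved because a z-score is a linear transformation: it shifts and rescales the values but never reorders them, and it does not make skewed data normal. A minimal base R sketch on simulated, right-skewed data (the sample `x` is generated here purely for illustration):

```r
set.seed(42)

# Simulated right-skewed sample (illustrative, not penguin data)
x <- rexp(200, rate = 0.5)
z <- (x - mean(x)) / sd(x)

# Raw and standardized values are perfectly linearly related
all.equal(cor(x, z), 1)

# Sorting by the raw values also sorts the z-scores: the order is unchanged
all(diff(z[order(x)]) >= 0)
```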

Step 4: Compare Groups Using Z-Scores

Standardization makes it easy to compare measurements across different penguin species.

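# Compare species using standardized measurements
# (an anonymous function replaces passing na.rm through across(),
#  which is deprecated in dplyr 1.1.0 and later)
penguins_with_z |>
  group_by(species) |>
  summarise(across(ends_with("_z"), 
                   \(x) mean(x, na.rm = TRUE))) |>
  pivot_longer(-species, 
               names_to = "measurement", 
               values_to = "mean_z_score")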

Now we can easily identify which species tend to be above or below average for each measurement.

Summary

Z-scores rescale each column to have mean 0 and standard deviation 1
Use scale() for quick standardization or create custom functions for more control
The across() function with mutate() provides flexible column-wise operations
Standardization preserves distribution shape while making variables comparable
Z-scores are essential for machine learning preprocessing and comparative analysis
