How to calculate median-based z-scores in R

statistics

z-score median

Learn calculate median-based z-scores in r with clear examples and explanations.

Published

March 26, 2026

Introduction

Gene Set Enrichment Analysis (GSEA) is a computational method used to determine whether predefined sets of genes show statistically significant differences between biological conditions. This tutorial demonstrates how to calculate median Z-scores for gene sets using R and tidyverse functions, which is a key step in many GSEA workflows. We’ll walk through data preparation, normalization, and aggregation techniques using simulated gene expression data.

Setting Up the Environment

First, let’s load the required packages and set up our workspace:

library(tidyverse)
set.seed(123)

Creating Simulated Gene Expression Data

We’ll start by creating a realistic gene expression dataset to work with. This simulates the type of data you’d typically get from RNA-seq or microarray experiments:

# Create expression matrix: 200 samples × 5000 genes
expression_matrix <- matrix(
  rnorm(200 * 5000, mean = 10, sd = 2), 
  nrow = 200, 
  ncol = 5000
)

Now let’s add meaningful row and column names, then convert to a tibble for easier manipulation:

colnames(expression_matrix) <- paste0("Gene", 1:5000)
rownames(expression_matrix) <- paste0("Sample", 1:200)
expression_df <- as_tibble(expression_matrix, rownames = "Sample")

The resulting tibble has samples as rows and genes as columns, which is the standard format for most genomic analyses.

Defining Gene Sets

Gene sets represent groups of genes that share common biological functions, pathways, or characteristics. Let’s define some example gene sets:

gene_sets <- list(
  set1 = paste0("Gene", 1:100),
  set2 = paste0("Gene", 101:200),
  set3 = paste0("Gene", 201:300)
)

Each gene set contains 100 genes. In real applications, these might represent genes from specific biological pathways like “cell cycle” or “immune response.”

Z-Score Normalization

Before calculating enrichment scores, we need to normalize our data. Z-score normalization standardizes each gene’s expression across samples:

calculate_z <- function(x) {
  (x - mean(x)) / sd(x)
}

Now we’ll apply this normalization to all gene expression columns:

z_score_df <- expression_df |>
  mutate(across(starts_with("Gene"), calculate_z))

This transformation ensures that genes with different expression ranges can be compared fairly, with each gene having a mean of 0 and standard deviation of 1.

Calculating Median Z-Scores for Individual Gene Sets

Let’s first see how to calculate median Z-scores for a single gene set. This shows the core logic we’ll later apply to all sets:

single_set_result <- z_score_df |>
  select(Sample, all_of(gene_sets$set1)) |>
  rowwise() |>
  mutate(
    GeneSet = "set1",
    median_z = median(c_across(starts_with("Gene")), na.rm = TRUE)
  ) |>
  select(Sample, GeneSet, median_z)

The rowwise() function ensures that the median is calculated separately for each sample across the selected genes.

Creating a Reusable Function

To avoid repeating code, let’s create a function that calculates median Z-scores for any gene set:

compute_median_z <- function(gene_set_name, genes, z_score_df) {
  z_score_df |>
    select(Sample, all_of(genes)) |>
    rowwise() |>
    mutate(
      GeneSet = gene_set_name,
      median_z = median(c_across(starts_with("Gene")), na.rm = TRUE)
    ) |>
    ungroup() |>
    select(Sample, GeneSet, median_z)
}

This function takes a gene set name, the list of genes, and our normalized data, returning a clean tibble with median Z-scores.

Processing All Gene Sets

Now we can efficiently process all gene sets at once using map2_dfr():

medians_df <- map2_dfr(
  names(gene_sets), 
  gene_sets, 
  compute_median_z, 
  z_score_df = z_score_df
)

The map2_dfr() function applies our function to each gene set and automatically combines the results into a single tibble.

Exploring the Results

Let’s examine our final results to understand what we’ve calculated:

medians_df |>
  group_by(GeneSet) |>
  summarise(
    mean_median_z = mean(median_z),
    sd_median_z = sd(median_z),
    .groups = "drop"
  )

This summary shows the distribution of median Z-scores for each gene set across all samples, which forms the foundation for downstream enrichment analysis.

Summary

We’ve successfully demonstrated how to calculate median Z-scores for gene sets using tidyverse functions. The key steps involved data normalization through Z-score transformation, creating reusable functions for median calculations, and efficiently processing multiple gene sets using functional programming approaches. These median Z-scores can now be used for statistical testing to identify significantly enriched gene sets in your biological comparisons.

--- title: "How to calculate median-based z-scores in R" description: "Learn calculate median-based z-scores in r with clear examples and explanations." date: 2026-03-26 categories: ['statistics', 'z-score median'] format: html: code-fold: false code-tools: true --- ## Introduction Gene Set Enrichment Analysis (GSEA) is a computational method used to determine whether predefined sets of genes show statistically significant differences between biological conditions. This tutorial demonstrates how to calculate median Z-scores for gene sets using R and tidyverse functions, which is a key step in many GSEA workflows. We'll walk through data preparation, normalization, and aggregation techniques using simulated gene expression data. ## Setting Up the Environment First, let's load the required packages and set up our workspace: ```r library(tidyverse) set.seed(123) ``` ## Creating Simulated Gene Expression Data We'll start by creating a realistic gene expression dataset to work with. This simulates the type of data you'd typically get from RNA-seq or microarray experiments: ```r # Create expression matrix: 200 samples × 5000 genes expression_matrix <- matrix( rnorm(200 * 5000, mean = 10, sd = 2), nrow = 200, ncol = 5000 ) ``` Now let's add meaningful row and column names, then convert to a tibble for easier manipulation: ```r colnames(expression_matrix) <- paste0("Gene", 1:5000) rownames(expression_matrix) <- paste0("Sample", 1:200) expression_df <- as_tibble(expression_matrix, rownames = "Sample") ``` The resulting tibble has samples as rows and genes as columns, which is the standard format for most genomic analyses. ## Defining Gene Sets Gene sets represent groups of genes that share common biological functions, pathways, or characteristics. Let's define some example gene sets: ```r gene_sets <- list( set1 = paste0("Gene", 1:100), set2 = paste0("Gene", 101:200), set3 = paste0("Gene", 201:300) ) ``` Each gene set contains 100 genes. In real applications, these might represent genes from specific biological pathways like "cell cycle" or "immune response." ## Z-Score Normalization Before calculating enrichment scores, we need to normalize our data. Z-score normalization standardizes each gene's expression across samples: ```r calculate_z <- function(x) { (x - mean(x)) / sd(x) } ``` Now we'll apply this normalization to all gene expression columns: ```r z_score_df <- expression_df |> mutate(across(starts_with("Gene"), calculate_z)) ``` This transformation ensures that genes with different expression ranges can be compared fairly, with each gene having a mean of 0 and standard deviation of 1. ## Calculating Median Z-Scores for Individual Gene Sets Let's first see how to calculate median Z-scores for a single gene set. This shows the core logic we'll later apply to all sets: ```r single_set_result <- z_score_df |> select(Sample, all_of(gene_sets$set1)) |> rowwise() |> mutate( GeneSet = "set1", median_z = median(c_across(starts_with("Gene")), na.rm = TRUE) ) |> select(Sample, GeneSet, median_z) ``` The `rowwise()` function ensures that the median is calculated separately for each sample across the selected genes. ## Creating a Reusable Function To avoid repeating code, let's create a function that calculates median Z-scores for any gene set: ```r compute_median_z <- function(gene_set_name, genes, z_score_df) { z_score_df |> select(Sample, all_of(genes)) |> rowwise() |> mutate( GeneSet = gene_set_name, median_z = median(c_across(starts_with("Gene")), na.rm = TRUE) ) |> ungroup() |> select(Sample, GeneSet, median_z) } ``` This function takes a gene set name, the list of genes, and our normalized data, returning a clean tibble with median Z-scores. ## Processing All Gene Sets Now we can efficiently process all gene sets at once using `map2_dfr()`: ```r medians_df <- map2_dfr( names(gene_sets), gene_sets, compute_median_z, z_score_df = z_score_df ) ``` The `map2_dfr()` function applies our function to each gene set and automatically combines the results into a single tibble. ## Exploring the Results Let's examine our final results to understand what we've calculated: ```r medians_df |> group_by(GeneSet) |> summarise( mean_median_z = mean(median_z), sd_median_z = sd(median_z), .groups = "drop" ) ``` This summary shows the distribution of median Z-scores for each gene set across all samples, which forms the foundation for downstream enrichment analysis. ## Summary We've successfully demonstrated how to calculate median Z-scores for gene sets using tidyverse functions. The key steps involved data normalization through Z-score transformation, creating reusable functions for median calculations, and efficiently processing multiple gene sets using functional programming approaches. These median Z-scores can now be used for statistical testing to identify significantly enriched gene sets in your biological comparisons.