How to randomly select rows from a dataframe in R

dplyr slice_sample()
Learn how to randomly select rows from a dataframe in R using dplyr's slice_sample(). Step-by-step tutorial with examples.
Published

January 25, 2022

Introduction

Random sampling of rows is a fundamental technique in data analysis for creating representative subsets of larger datasets. This approach is essential for exploratory data analysis, creating training/testing splits for machine learning, or simply working with manageable portions of large datasets.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Random Sampling

The Problem

We need to randomly select a specific number of rows from our dataset to create a smaller, manageable sample for analysis.

Step 1: Examine the original dataset

Let’s start by looking at our penguins dataset to understand its structure.

data(penguins)
glimpse(penguins)
nrow(penguins)

The penguins dataset contains 344 rows with information about different penguin species and their characteristics.

Step 2: Select a fixed number of rows

Use slice_sample() to randomly select exactly 10 rows from the dataset.

random_10 <- penguins |>
  slice_sample(n = 10)

random_10

This creates a new dataframe with exactly 10 randomly selected penguins, maintaining all original columns.
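To check what "exactly 10 randomly selected penguins" means in practice, the sketch below tags each row with an identifier before sampling. The `row_id` column is a hypothetical helper added only for illustration (it is not part of the dataset), and the code assumes the dplyr and palmerpenguins packages are installed.

```r
library(dplyr)

# Assumption: palmerpenguins is installed. The namespace prefix makes sure
# we use its `penguins` data rather than a same-named object elsewhere.
penguins_id <- palmerpenguins::penguins |>
  mutate(row_id = row_number())  # hypothetical helper column to track rows

sample_10 <- penguins_id |>
  slice_sample(n = 10)

# slice_sample() samples without replacement by default,
# so every sampled row is a different original row
n_distinct(sample_10$row_id)

# Only rows are subset; the columns are unchanged
identical(names(sample_10), names(penguins_id))
```

Because the default is sampling without replacement, no penguin can appear twice in `sample_10`.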

Step 3: Select a percentage of rows

Sometimes it’s more useful to sample a proportion rather than a fixed number.

random_percent <- penguins |>
  slice_sample(prop = 0.15)

nrow(random_percent)

This selects 15% of the total rows. slice_sample() rounds the row count down, so 344 × 0.15 = 51.6 yields 51 penguins, and the sample size scales automatically with dataset size.
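The size of a proportional sample can be worked out ahead of time. The following minimal sketch compares a hand-computed row count with what slice_sample() returns, assuming a recent dplyr version in which the proportion is rounded down to a whole number of rows.

```r
library(dplyr)

# Assumption: palmerpenguins is installed
penguins <- palmerpenguins::penguins

# 344 * 0.15 = 51.6, which is rounded down to a whole number of rows
expected_n <- floor(nrow(penguins) * 0.15)

sampled_n <- penguins |>
  slice_sample(prop = 0.15) |>
  nrow()

c(expected = expected_n, sampled = sampled_n)
```

If an exact sample size matters more than the proportion, passing `n` directly avoids any rounding.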

Step 4: Set a seed for reproducibility

Make your random sampling reproducible by setting a seed value.

set.seed(123)
reproducible_sample <- penguins |>
  slice_sample(n = 20)

head(reproducible_sample)

Calling set.seed() with the same value immediately before sampling ensures you get the same random sample every time you run the code.
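One way to convince yourself that the seed controls the result is to draw the sample twice and compare. The `draw_sample()` function below is a hypothetical helper written for this illustration, not part of dplyr.

```r
library(dplyr)

# Assumption: palmerpenguins is installed
penguins <- palmerpenguins::penguins

# Hypothetical helper: reset the seed, then draw 20 rows
draw_sample <- function(seed) {
  set.seed(seed)
  penguins |> slice_sample(n = 20)
}

first_draw  <- draw_sample(123)
second_draw <- draw_sample(123)

# Same seed, same rows in the same order
identical(first_draw, second_draw)
```

Note that the seed must be reset before each draw: calling slice_sample() twice after a single set.seed() advances the random number generator and generally gives two different samples.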

Example 2: Practical Application

The Problem

Imagine you’re a researcher who needs to create balanced samples from different penguin species for a comparative study. You want to ensure each species is equally represented in your sample while maintaining randomness.

Step 1: Check species distribution

First, let’s see how many penguins we have for each species.

penguins |>
  count(species, sort = TRUE)

This shows us the available sample size for each species, helping us plan our sampling strategy.

Step 2: Stratified random sampling

Sample an equal number of penguins from each species to create a balanced dataset.

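balanced_sample <- penguins |>
  drop_na() |>                 # remove rows with a missing value in any column
  group_by(species) |>         # sample within each species
  slice_sample(n = 15) |>
  ungroup()                    # drop grouping so later steps are not grouped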

balanced_sample |> count(species)

This creates a balanced sample with exactly 15 penguins from each species. Calling drop_na() with no arguments first removes every row that has a missing value in any column.
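The same grouped sampling can also produce the training/testing split mentioned in the introduction. The sketch below is one possible approach, not the only one: it samples 80% of each species for training and uses anti_join() to keep the remaining rows for testing. The `row_id` column, the 80/20 ratio, and the seed value are assumptions chosen for illustration.

```r
library(dplyr)

# Assumption: palmerpenguins is installed
penguins_id <- palmerpenguins::penguins |>
  mutate(row_id = row_number())  # hypothetical helper column to track rows

set.seed(2022)

# Training set: 80% of each species, so species proportions are preserved
train <- penguins_id |>
  group_by(species) |>
  slice_sample(prop = 0.8) |>
  ungroup()

# Test set: every row that was not selected for training
test <- penguins_id |>
  anti_join(train, by = "row_id")

# The two sets are disjoint and together cover the full dataset
nrow(train) + nrow(test) == nrow(penguins_id)
```

Sampling within groups keeps the species mix of the training set close to that of the full dataset, which a single ungrouped slice_sample() call does not guarantee.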

Step 3: Sample with replacement

Sometimes you need to sample with replacement, especially when the population is smaller than your desired sample size.

set.seed(456)
bootstrap_sample <- penguins |>
  slice_sample(n = 400, replace = TRUE)

nrow(bootstrap_sample)

This creates a bootstrap sample larger than the original dataset, with some penguins appearing multiple times.
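To see the effect of `replace`, the sketch below requests 400 rows both with and without replacement. It assumes dplyr 1.0.0 or later, where asking for more rows than exist without replacement silently returns the whole dataset rather than raising an error. The `row_id` column is a hypothetical helper for counting repeated rows.

```r
library(dplyr)

# Assumption: palmerpenguins is installed
penguins_id <- palmerpenguins::penguins |>
  mutate(row_id = row_number())  # hypothetical helper column to track rows

set.seed(456)

with_replacement <- penguins_id |>
  slice_sample(n = 400, replace = TRUE)

without_replacement <- penguins_id |>
  slice_sample(n = 400)

nrow(with_replacement)     # 400 rows, more than the dataset contains
nrow(without_replacement)  # capped at the number of rows in the dataset

# Drawing 400 rows from 344 must repeat at least some rows
n_distinct(with_replacement$row_id) < nrow(with_replacement)
```

The silent cap is worth keeping in mind: without `replace = TRUE`, a request for too many rows returns a shuffled copy of the data instead of the requested sample size.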

Step 4: Weighted random sampling

Create samples where certain rows have a higher probability of selection.

weighted_sample <- penguins |>
  drop_na(body_mass_g) |>
  slice_sample(n = 50, weight_by = body_mass_g)

summary(weighted_sample$body_mass_g)

This approach gives heavier penguins a higher chance of being selected, useful for studies focusing on larger specimens.
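A single weighted sample can look similar to an unweighted one by chance, so the effect is easier to see across many draws. The rough simulation below compares the average body mass of weighted and unweighted samples. The `sample_mean()` helper, the 200 repetitions, and the seed value are assumptions made for illustration.

```r
library(dplyr)

# Assumption: palmerpenguins is installed
penguins_mass <- palmerpenguins::penguins |>
  filter(!is.na(body_mass_g))  # weights must not be missing

set.seed(789)

# Hypothetical helper: mean body mass of one sample of 50 penguins,
# drawn either weighted by body mass or with equal probability
sample_mean <- function(weighted) {
  drawn <- if (weighted) {
    slice_sample(penguins_mass, n = 50, weight_by = body_mass_g)
  } else {
    slice_sample(penguins_mass, n = 50)
  }
  mean(drawn$body_mass_g)
}

weighted_means <- replicate(200, sample_mean(weighted = TRUE))
uniform_means  <- replicate(200, sample_mean(weighted = FALSE))

# Weighted samples should average noticeably heavier than unweighted ones
c(
  population = mean(penguins_mass$body_mass_g),
  uniform    = mean(uniform_means),
  weighted   = mean(weighted_means)
)
```

The unweighted average should sit close to the population mean, while the weighted average is pulled upward. That bias is the point of weighting, but it also means a weighted sample should not be used to estimate the population mean directly.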

Summary

  • Use slice_sample(n = x) to select a specific number of rows randomly
  • Use slice_sample(prop = x) to select a percentage of rows for scalable sampling
  • Combine group_by() with slice_sample() for stratified sampling across categories
  • Set replace = TRUE to sample with replacement, for bootstrapping or when you need samples larger than the original dataset
  • Always use set.seed() before sampling to ensure reproducible results in your analysis