How to randomly select rows from a dataframe in R
Introduction
Random sampling of rows is a fundamental technique in data analysis for creating representative subsets of larger datasets. This approach is essential for exploratory data analysis, creating training/testing splits for machine learning, or simply working with manageable portions of large datasets.
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Random Sampling
The Problem
We need to randomly select a specific number of rows from our dataset to create a smaller, manageable sample for analysis.
Step 1: Examine the original dataset
Let’s start by looking at our penguins dataset to understand its structure.
data(penguins)
glimpse(penguins)
nrow(penguins)
The penguins dataset contains 344 rows with information about different penguin species and their characteristics.
Step 2: Select a fixed number of rows
Use slice_sample() to randomly select exactly 10 rows from the dataset.
random_10 <- penguins |>
  slice_sample(n = 10)
random_10
This creates a new dataframe with exactly 10 randomly selected penguins, maintaining all original columns.
Step 3: Select a percentage of rows
Sometimes it’s more useful to sample a proportion rather than a fixed number.
random_percent <- penguins |>
  slice_sample(prop = 0.15)
nrow(random_percent)
This selects 15% of the total rows, rounded down to a whole number (51 of the 344 penguins), so the sample size scales automatically with dataset size.
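To see how prop translates into a row count, the short sketch below compares the sample size with the truncated product of prop and the number of rows. It relies on dplyr's documented behaviour that prop is rounded towards zero; the variable names are illustrative only.

```r
library(dplyr)
library(palmerpenguins)

total_rows <- nrow(penguins)               # 344 rows in palmerpenguins::penguins
expected_rows <- floor(0.15 * total_rows)  # 0.15 * 344 = 51.6, truncated to 51

sampled <- penguins |>
  slice_sample(prop = 0.15)

# TRUE when the sample size matches the truncated product
nrow(sampled) == expected_rows
```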
Step 4: Set a seed for reproducibility
Make your random sampling reproducible by setting a seed value.
set.seed(123)
reproducible_sample <- penguins |>
  slice_sample(n = 20)
head(reproducible_sample)
Calling set.seed() with the same value immediately before sampling ensures you get the same random sample every time you run the code.
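One way to check this is to draw the sample twice with the same seed and compare the results. The sketch below is illustrative (the object names are not from the original); it also draws a third sample without reseeding to show that the seed has to be reset before each draw you want to repeat.

```r
library(dplyr)
library(palmerpenguins)

set.seed(123)
first_draw <- penguins |> slice_sample(n = 20)

set.seed(123)  # reset the generator to the same starting state
second_draw <- penguins |> slice_sample(n = 20)

# No reseed here: this draw continues the random stream,
# so it will almost certainly differ from the first two.
third_draw <- penguins |> slice_sample(n = 20)

identical(first_draw, second_draw)  # same seed, same rows in the same order
```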
Example 2: Practical Application
The Problem
Imagine you’re a researcher who needs to create balanced samples from different penguin species for a comparative study. You want to ensure each species is equally represented in your sample while maintaining randomness.
Step 1: Check species distribution
First, let’s see how many penguins we have for each species.
penguins |>
  count(species, sort = TRUE)
This shows us the available sample size for each species, helping us plan our sampling strategy.
Step 2: Stratified random sampling
Sample an equal number of penguins from each species to create a balanced dataset.
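balanced_sample <- penguins |>
  drop_na() |>               # remove rows with any missing value
  group_by(species) |>       # sample within each species
  slice_sample(n = 15) |>
  ungroup()                  # drop the grouping so later steps see one table
balanced_sample |> count(species)
This creates a balanced sample with exactly 15 penguins from each species (45 rows in total). Rows with missing values are removed before sampling.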
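If you would rather keep each species in proportion to its share of the data instead of forcing equal counts, the same grouped pattern works with prop. This is a sketch of that variation, not part of the original example; the 10% figure and the name proportional_sample are arbitrary choices.

```r
library(dplyr)
library(palmerpenguins)

proportional_sample <- penguins |>
  group_by(species) |>
  slice_sample(prop = 0.10) |>  # 10% of each species, rounded down per group
  ungroup()

# Each species contributes roughly a tenth of its rows
proportional_sample |> count(species)
```

Because the rounding happens within each group, the total can be slightly smaller than 10% of the whole dataset.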
Step 3: Sample with replacement
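Sometimes you need to sample with replacement, especially when the desired sample size is larger than the number of rows available, as in bootstrapping. Without replace = TRUE, slice_sample() silently caps the result at the number of rows in the data.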
set.seed(456)
bootstrap_sample <- penguins |>
  slice_sample(n = 400, replace = TRUE)
nrow(bootstrap_sample)
This creates a bootstrap sample larger than the original dataset, with some penguins appearing multiple times.
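The difference between the two modes can be made visible by tagging each row with an identifier before sampling. In the sketch below, row_id, capped and with_replacement are hypothetical names added for illustration.

```r
library(dplyr)
library(palmerpenguins)

set.seed(456)
tagged <- penguins |>
  mutate(row_id = row_number())  # hypothetical helper column to track rows

# Without replacement, asking for 400 rows returns only the rows that exist
capped <- tagged |> slice_sample(n = 400)

# With replacement, all 400 rows are returned and some row_ids repeat
with_replacement <- tagged |> slice_sample(n = 400, replace = TRUE)

nrow(capped)
nrow(with_replacement)
n_distinct(with_replacement$row_id)  # fewer distinct rows than draws
```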
Step 4: Weighted random sampling
Create samples where certain rows have a higher probability of selection.
weighted_sample <- penguins |>
  drop_na(body_mass_g) |>
  slice_sample(n = 50, weight_by = body_mass_g)
summary(weighted_sample$body_mass_g)
This approach gives heavier penguins a higher chance of being selected, useful for studies focusing on larger specimens.
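To get a feel for how strong this weighting is, you can compute the relative weights yourself. The sketch below assumes, as the dplyr documentation states, that weight_by values are standardised to sum to 1; the column name draw_prob is made up for illustration, and the value is the chance of being picked on the first draw only, since probabilities shift as rows are removed when sampling without replacement.

```r
library(dplyr)
library(palmerpenguins)

weights <- penguins |>
  filter(!is.na(body_mass_g)) |>  # same effect as drop_na(body_mass_g)
  mutate(draw_prob = body_mass_g / sum(body_mass_g))  # hypothetical column

# Probability range across rows, and how much more likely the
# heaviest penguin is to be drawn first than the lightest
range(weights$draw_prob)
max(weights$draw_prob) / min(weights$draw_prob)
```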
Summary
- Use slice_sample(n = x) to select a specific number of rows randomly
- Use slice_sample(prop = x) to select a percentage of rows for scalable sampling
- Combine group_by() with slice_sample() for stratified sampling across categories
- Set replace = TRUE for bootstrap sampling when you need samples larger than the original dataset
- Always use set.seed() before sampling to ensure reproducible results in your analysis