sample() function in R for random sampling
Introduction
The sample() function in R is a fundamental tool for random sampling that allows you to select elements randomly from a vector or dataset. It’s essential for data analysis tasks like creating training/testing splits, bootstrap sampling, or generating random subsets for analysis.
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The Problem
We need to understand how to use sample() to randomly select elements from a simple vector. This forms the foundation for more complex sampling operations.
Step 1: Sample without replacement
We’ll start by randomly selecting elements from a vector where each element can only be chosen once.
# Create a simple vector
numbers <- 1:10
# Sample 5 numbers without replacement
random_sample <- sample(numbers, size = 5)
print(random_sample)
This returns 5 unique numbers randomly selected from 1 through 10, with no duplicates possible.
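Two edge cases of sampling without replacement are worth seeing in code. The sketch below uses only base R; `safe_sample` is a hypothetical helper name introduced here for illustration, following the index-based pattern described in the `?sample` help page.

```r
# sample() with a single number n >= 1 draws from 1:n, not from the value itself
set.seed(1)
draw <- sample(10, size = 3)        # equivalent to sample(1:10, size = 3)

# Hypothetical helper: always sample the elements of x by index,
# so a length-one vector is never expanded to 1:x
safe_sample <- function(x, ...) x[sample.int(length(x), ...)]

one_value <- safe_sample(10, size = 1)   # can only return 10

# Requesting more items than exist without replacement raises an error
too_many <- tryCatch(
  sample(1:10, size = 11),
  error = function(e) "error"
)
```

The helper matters mostly when the vector being sampled is computed at run time and may shrink to a single element.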
Step 2: Sample with replacement
Now we’ll allow elements to be selected multiple times by enabling replacement.
# Sample 8 numbers with replacement
with_replacement <- sample(numbers, size = 8, replace = TRUE)
print(with_replacement)
# Check for duplicates
any(duplicated(with_replacement))
With replacement enabled, we can select more items than exist in the original vector, and duplicates are possible.
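Sampling with replacement is the basis of the bootstrap mentioned in the introduction. The following is a minimal sketch, assuming a small made-up vector `values` and 1000 resamples; both are illustrative choices, not part of the original example.

```r
# Bootstrap sketch: resample a vector with replacement many times
set.seed(42)
values <- c(12, 15, 9, 20, 17, 11, 14, 18)   # illustrative data only

n_boot <- 1000
boot_means <- replicate(
  n_boot,
  mean(sample(values, size = length(values), replace = TRUE))
)

# The spread of the resampled means approximates the uncertainty of the sample mean
boot_se <- sd(boot_means)
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))
```

Each resample has the same length as the original vector, which is only possible because `replace = TRUE` lets elements repeat.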
Step 3: Set random seed for reproducibility
Setting a seed ensures we get the same “random” results each time we run our code.
# Set seed for reproducible results
set.seed(123)
reproducible_sample <- sample(numbers, size = 5)
print(reproducible_sample)
# Run again with same seed
set.seed(123)
same_sample <- sample(numbers, size = 5)
identical(reproducible_sample, same_sample)
The seed ensures our random sampling is reproducible for testing and sharing results.
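It can help to see that the seed fixes the whole sequence of draws, not each individual call. In this sketch (variable names are illustrative), the generator state advances after every call to sample(), and resetting the seed replays the sequence from the start.

```r
# The seed fixes the sequence of draws; each call advances the generator state
set.seed(123)
first_draw  <- sample(1:10, size = 5)
second_draw <- sample(1:10, size = 5)   # a fresh draw, not a repeat of the first

# Resetting the seed replays the same sequence from the beginning
set.seed(123)
first_again  <- sample(1:10, size = 5)
second_again <- sample(1:10, size = 5)

replayed <- identical(first_draw, first_again) &&
  identical(second_draw, second_again)
```

In practice this means a single set.seed() call at the top of a script is usually enough, provided the code that follows runs in the same order.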
Example 2: Practical Application
The Problem
We want to randomly sample penguin observations from the Palmer Penguins dataset to create a smaller subset for analysis. This mimics real-world scenarios where you need to work with a random portion of your data.
Step 1: Explore the dataset
First, let’s examine our dataset to understand what we’re working with.
# Load and examine the penguins data
data(penguins)
glimpse(penguins)
# Check total number of observations
nrow(penguins)
We have 344 penguin observations with various measurements and can see the structure of our data.
Step 2: Random row sampling
We’ll randomly select specific rows from our dataset to create a smaller sample.
# Set seed for reproducibility
set.seed(456)
# Sample 50 random row indices
random_rows <- sample(nrow(penguins), size = 50)
# Create subset using these indices
penguin_sample <- penguins[random_rows, ]
nrow(penguin_sample)
This creates a random subset of 50 penguins from our original dataset of 344 observations.
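The same index-sampling pattern gives the training/testing split mentioned in the introduction. This is a minimal sketch, assuming an 80/20 split and using the built-in `iris` data as a stand-in so the block runs without extra packages; the same code would apply to `penguins`.

```r
# Train/test split sketch using the built-in iris data (150 rows)
set.seed(2024)
n <- nrow(iris)

# Draw 80% of the row indices without replacement for training
train_idx <- sample(n, size = floor(0.8 * n))

train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]   # negative indexing keeps the remaining rows
```

Because the indices are drawn without replacement, no row can appear in both sets.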
Step 3: Stratified sampling by species
Let’s ensure we get representatives from each penguin species in our sample.
# Sample 5 penguins from each species
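stratified_sample <- penguins |>
  drop_na() |>
  group_by(species) |>
  slice_sample(n = 5) |>
  ungroup()
# Check the species distribution
stratified_sample |>
  count(species)
Using slice_sample() on grouped data, which draws rows at random within each group much as sample() does on a vector, we get exactly 5 penguins from each species for balanced representation. Note that drop_na() with no arguments removes every row containing a missing value before sampling.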
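For comparison, the same per-group sampling can be sketched in base R with split() and sample.int(). The block uses the built-in `iris` data as a stand-in, and `n_per_group` is an illustrative name; it is an alternative to the dplyr pipeline, not a description of how slice_sample() is implemented.

```r
# Base R stratified sampling sketch on the built-in iris data
set.seed(99)
n_per_group <- 5

# Split row indices by group, then sample within each group
idx_by_species <- split(seq_len(nrow(iris)), iris$Species)
picked <- unlist(lapply(idx_by_species, function(idx) {
  idx[sample.int(length(idx), size = n_per_group)]
}))

stratified_base <- iris[picked, ]
species_counts <- table(stratified_base$Species)
```

Sampling positions with sample.int() and then indexing avoids the pitfall where sample() expands a single number into a range.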
Step 4: Weighted sampling
We can also sample with different probabilities based on certain characteristics.
# Drop rows with missing body mass, which serves as the sampling weight
penguins_clean <- penguins |> drop_na(body_mass_g)
# Sample with probability proportional to body mass
set.seed(789)
weighted_indices <- sample(
  nrow(penguins_clean),
  size = 20,
  prob = penguins_clean$body_mass_g
)
weighted_sample <- penguins_clean[weighted_indices, ]
This sampling approach gives heavier penguins a higher probability of being selected in our sample.
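To check that the prob argument behaves as expected, one option is to draw many times with replacement and compare observed frequencies with the normalized weights. The items and weights below are made up for illustration. With replace = FALSE, as in the penguin example, the weights are applied sequentially to the remaining items, so inclusion probabilities are only approximately proportional to the weights.

```r
# Weighted sampling sketch: draw many times and compare frequencies to weights
set.seed(7)
items   <- c("a", "b", "c")
weights <- c(1, 2, 7)   # illustrative weights; sample() rescales them to sum to 1

draws <- sample(items, size = 10000, replace = TRUE, prob = weights)

# Observed share of each item versus its normalized weight
observed <- as.numeric(table(factor(draws, levels = items))) / length(draws)
expected <- weights / sum(weights)
```

The weights do not need to sum to 1, which is why raw body mass values can be passed directly to prob.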
Summary
- Use sample(x, size, replace = FALSE) for basic random sampling without replacement
- Set replace = TRUE to allow duplicate selections and sample more items than available
- Always use set.seed() before sampling to ensure reproducible results
- Combine sample() with dataset indexing to randomly select rows from data frames
- Use slice_sample() from dplyr for modern, pipe-friendly random sampling within grouped data