sample() function in R for random sampling

rstats
Learn how to use the sample() function in R for random sampling. Includes practical examples and code snippets.
Published

June 30, 2021

Introduction

The sample() function in base R draws random elements from a vector, with or without replacement. It underpins common data analysis tasks such as creating training/testing splits, bootstrap resampling, and drawing random subsets of a dataset.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We need to understand how to use sample() to randomly select elements from a simple vector. This forms the foundation for more complex sampling operations.

Step 1: Sample without replacement

We’ll start by randomly selecting elements from a vector where each element can only be chosen once.

# Create a simple vector
numbers <- 1:10

# Sample 5 numbers without replacement
random_sample <- sample(numbers, size = 5)
print(random_sample)

This returns 5 of the numbers 1 through 10 in random order. Because each element can be drawn only once and the values in numbers are all distinct, the result contains no duplicates.
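One behavior worth knowing at this stage: when x is a single number of at least 1, sample() draws from 1:x rather than from that one value. The sketch below illustrates this and shows one way around it. The helper name safe_sample is hypothetical and not part of base R; it simply samples positions with sample.int() and then indexes the vector.

```r
# A length-one numeric x is treated as "sample from 1:x"
set.seed(1)
surprise <- sample(10, size = 3)   # draws from 1:10, not from the single value 10
print(surprise)

# Hypothetical helper: sample positions, then index, so a
# length-one vector is treated like any other vector
safe_sample <- function(x, size = length(x), ...) {
  x[sample.int(length(x), size = size, ...)]
}

only_ten <- safe_sample(10, size = 1)          # the only possible result is 10
two_of_three <- safe_sample(c(4, 8, 15), size = 2)
print(only_ten)
print(two_of_three)
```

This matters mostly when the vector being sampled is produced by a filter and may shrink to one element at run time.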

Step 2: Sample with replacement

Now we’ll allow elements to be selected multiple times by enabling replacement.

# Sample 8 numbers with replacement
with_replacement <- sample(numbers, size = 8, replace = TRUE)
print(with_replacement)

# Check for duplicates
any(duplicated(with_replacement))

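With replace = TRUE, every draw is made from the full vector, so duplicates are possible and size may exceed length(numbers). Here, 8 draws from 10 values will usually, but not always, contain at least one repeat, which is what the duplicated() check reports.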
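As a rough illustration of why replacement matters, the sketch below first asks for more items than the vector holds, with and without replacement, and then uses sampling with replacement for a simple bootstrap of the mean. The variable names and the choice of 1000 resamples are assumptions made for this example only.

```r
numbers <- 1:10

# Without replacement, asking for more items than exist is an error;
# tryCatch() captures the message instead of stopping the script
too_many <- tryCatch(
  sample(numbers, size = 15),
  error = function(e) conditionMessage(e)
)
print(too_many)

# With replacement, the same request succeeds
oversample <- sample(numbers, size = 15, replace = TRUE)
length(oversample)

# Bootstrap sketch: resample the data with replacement many times
# and look at how much the mean varies across resamples
set.seed(42)
boot_means <- replicate(1000, mean(sample(numbers, replace = TRUE)))
sd(boot_means)   # rough bootstrap estimate of the standard error of the mean
```

Each bootstrap resample has the same length as the original vector, which is the default when size is omitted.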

Step 3: Set random seed for reproducibility

Setting a seed ensures we get the same “random” results each time we run our code.

# Set seed for reproducible results
set.seed(123)
reproducible_sample <- sample(numbers, size = 5)
print(reproducible_sample)

# Run again with same seed
set.seed(123)
same_sample <- sample(numbers, size = 5)
identical(reproducible_sample, same_sample)

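Because both calls start from the same seed, identical() returns TRUE. This makes random sampling reproducible for testing and for sharing results. Note that R 3.6.0 changed the default sampling algorithm, so the same seed can produce different samples in older R versions.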
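It may help to think of set.seed() as fixing the starting point of a stream of random numbers rather than fixing the output of each call. Every call to sample() consumes part of the stream, so a second call after a single set.seed() generally returns a different sample. The sketch below, with illustrative variable names, replays two consecutive draws by resetting the seed once.

```r
set.seed(123)
first_draw  <- sample(1:10, size = 5)
second_draw <- sample(1:10, size = 5)   # continues from the advanced RNG state

# Resetting the seed replays the stream from the start
set.seed(123)
replay_first  <- sample(1:10, size = 5)
replay_second <- sample(1:10, size = 5)

identical(first_draw, replay_first)     # TRUE
identical(second_draw, replay_second)   # TRUE
```

In practice this means one set.seed() call at the top of a script is enough, as long as the code that follows runs in the same order each time.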

Example 2: Practical Application

The Problem

We want to randomly sample penguin observations from the Palmer Penguins dataset to create a smaller subset for analysis. This mimics real-world scenarios where you need to work with a random portion of your data.

Step 1: Explore the dataset

First, let’s examine our dataset to understand what we’re working with.

# Load and examine the penguins data
data(penguins, package = "palmerpenguins")
glimpse(penguins)

# Check total number of observations
nrow(penguins)

The dataset has 344 penguin observations and 8 variables; glimpse() shows each column's type and its first few values.

Step 2: Random row sampling

We’ll randomly select specific rows from our dataset to create a smaller sample.

# Set seed for reproducibility
set.seed(456)

# Sample 50 random row indices
random_rows <- sample(nrow(penguins), size = 50)

# Create subset using these indices
penguin_sample <- penguins[random_rows, ]
nrow(penguin_sample)

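Passing the single number nrow(penguins) makes sample() draw from the row indices 1 through 344, so this creates a random subset of 50 distinct rows from our original dataset of 344 observations.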
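The same index-based pattern extends to a training/testing split. The sketch below uses a small made-up data frame (toy, with hypothetical columns id and score) so that it runs without any packages; the 80/20 ratio is an assumption chosen for illustration.

```r
# Hypothetical toy data; any data frame works the same way
toy <- data.frame(id = 1:100, score = (1:100) / 2)

set.seed(2021)
n <- nrow(toy)

# Draw 80% of the row indices without replacement
train_idx <- sample(n, size = round(0.8 * n))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]   # negative indexing keeps the rows not drawn

nrow(train)                            # 80
nrow(test)                             # 20
length(intersect(train$id, test$id))   # 0, the two sets do not overlap
```

Sampling indices rather than rows keeps the split explicit, so the same train_idx can be reused on other objects aligned with the data.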

Step 3: Stratified sampling by species

Let’s ensure we get representatives from each penguin species in our sample.

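# Set seed for reproducibility
set.seed(101)

# Sample 5 penguins from each species
stratified_sample <- penguins |>
  drop_na() |>            # drop rows with any missing value
  group_by(species) |>
  slice_sample(n = 5) |>  # draws 5 rows within each group
  ungroup()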

# Check the species distribution
stratified_sample |>
  count(species)

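Applied to grouped data, slice_sample() draws within each group, so we get exactly 5 penguins from each species for balanced representation. It relies on R's random number generator, so set.seed() makes it reproducible in the same way as sample().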
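For readers who want to see the idea without dplyr, a rough base R equivalent is to split the row numbers by group and sample within each piece. The sketch below uses a made-up data frame (toy, with hypothetical columns group and value) rather than the penguins data, and is meant as one possible approach, not a description of how slice_sample() is implemented.

```r
# Hypothetical toy data with three groups of unequal size
toy <- data.frame(
  group = rep(c("a", "b", "c"), times = c(30, 20, 10)),
  value = seq_len(60)
)

set.seed(2021)

# Split the row numbers by group, then draw 5 row numbers within each group
rows_by_group <- split(seq_len(nrow(toy)), toy$group)
picked_rows <- unlist(lapply(rows_by_group, function(rows) {
  rows[sample.int(length(rows), size = 5)]
}))

stratified_toy <- toy[picked_rows, ]
table(stratified_toy$group)   # 5 rows in each of a, b and c
```

Sampling positions with sample.int() and then indexing keeps the code correct even if a group has only one row.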

Step 4: Weighted sampling

The prob argument assigns each element a sampling weight, so some rows can be made more likely to be drawn than others.

# Drop rows with missing body mass, since prob cannot contain NA
penguins_clean <- penguins |> drop_na(body_mass_g)

# Sample with probability proportional to body mass
set.seed(789)
weighted_indices <- sample(
  nrow(penguins_clean),
  size = 20,
  prob = penguins_clean$body_mass_g  # weights need not sum to 1
)

weighted_sample <- penguins_clean[weighted_indices, ]

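This sampling approach gives heavier penguins a higher probability of being selected. The weights do not need to sum to 1 because sample() rescales them. Since we sample without replacement, each draw uses the weights of the rows that have not yet been chosen.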
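To see how weights translate into selection frequencies, the sketch below samples with replacement from three made-up labels and compares the observed shares with the rescaled weights. The labels and weights are hypothetical. With replacement the shares should land close to weights / sum(weights); without replacement, as in the penguin example, the match is only approximate because the pool shrinks after each draw.

```r
set.seed(789)
items   <- c("light", "medium", "heavy")
weights <- c(1, 3, 6)   # hypothetical weights; they sum to 10, not 1

draws <- sample(items, size = 10000, replace = TRUE, prob = weights)

# Observed share of each item, in the same order as items
observed <- as.vector(table(factor(draws, levels = items))) / length(draws)
names(observed) <- items

# Shares implied by the weights after rescaling
expected <- weights / sum(weights)
names(expected) <- items

round(rbind(observed, expected), 3)
```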

Summary

  • Use sample(x, size, replace = FALSE) for basic random sampling without replacement
  • Set replace = TRUE to allow duplicate selections and to sample more items than the vector contains
  • Call set.seed() before sampling whenever results need to be reproducible
  • Combine sample() with dataset indexing to randomly select rows from data frames
  • Use slice_sample() from dplyr for modern, pipe-friendly random sampling within grouped data