sample() function in R for random sampling

rstats

Learn how to use the sample() function in R for random sampling, with practical examples and code snippets.

Published

June 30, 2021

Introduction

The sample() function in base R draws random elements from a vector, and by sampling row indices it can also draw random rows from a data frame. It’s useful for data analysis tasks like creating training/testing splits, bootstrap resampling, or pulling a random subset for exploration.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We need to understand how to use sample() to randomly select elements from a simple vector. This forms the foundation for more complex sampling operations.

Step 1: Sample without replacement

We’ll start by randomly selecting elements from a vector where each element can only be chosen once.

# Create a simple vector
numbers <- 1:10

# Sample 5 numbers without replacement
random_sample <- sample(numbers, size = 5)
print(random_sample)

This returns 5 unique numbers randomly selected from 1 through 10, with no duplicates possible.
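Two related behaviours are worth a quick look at this point. The sketch below is illustrative rather than part of the original walkthrough, and the names shuffled and too_many are made up for the example. It shows that leaving out size returns a random permutation of the whole vector, and that asking for more values than the vector holds fails while replace is left at its default of FALSE.

```r
numbers <- 1:10

# With no size argument, sample() returns all elements in random order
shuffled <- sample(numbers)
length(shuffled)             # same length as the input
setequal(shuffled, numbers)  # same values, different order

# Requesting 11 values from 10 without replacement raises an error;
# tryCatch() captures the message instead of stopping the script
too_many <- tryCatch(
  sample(numbers, size = 11),
  error = function(e) conditionMessage(e)
)
print(too_many)
```

Wrapping the call in tryCatch() is only for demonstration; in real code the error is a useful signal that size or replace needs adjusting.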

Step 2: Sample with replacement

Now we’ll allow elements to be selected multiple times by enabling replacement.

# Sample 8 numbers with replacement
with_replacement <- sample(numbers, size = 8, replace = TRUE)
print(with_replacement)

# Check for duplicates
any(duplicated(with_replacement))

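With replacement enabled, the same value can be drawn more than once, so duplicates are possible (though not guaranteed), and any(duplicated()) reports whether this particular draw contains one. Replacement also lets us request more items than the original vector holds, for example size = 15 from these 10 numbers.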
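Sampling with replacement is also the building block of bootstrap resampling. The following is a minimal sketch, assuming a small made-up numeric vector called values, an arbitrary seed, and 1000 resamples; none of these come from the original example.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical measurements used purely for illustration
values <- c(12, 15, 9, 20, 17, 11, 14)

# One bootstrap resample: same size as the data, drawn with replacement
resample <- sample(values, size = length(values), replace = TRUE)
print(resample)

# Repeat the resampling 1000 times and record the mean of each resample
boot_means <- replicate(
  1000,
  mean(sample(values, size = length(values), replace = TRUE))
)

# Rough 95% percentile interval for the mean
quantile(boot_means, probs = c(0.025, 0.975))
```

Each resample has the same length as the data, so the spread of boot_means gives a rough picture of how much the sample mean would vary.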

Step 3: Set random seed for reproducibility

Setting a seed ensures we get the same “random” results each time we run our code.

# Set seed for reproducible results
set.seed(123)
reproducible_sample <- sample(numbers, size = 5)
print(reproducible_sample)

# Run again with same seed
set.seed(123)
same_sample <- sample(numbers, size = 5)
identical(reproducible_sample, same_sample)

Because both calls start from the same seed, identical() returns TRUE. Reproducible sampling makes results easier to test and share.
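One detail that is easy to miss: a seed fixes the whole sequence of random draws that follows it, not each individual call. The sketch below, with illustrative variable names, draws twice after each set.seed() call and checks that the first draws match each other and the second draws match each other.

```r
numbers <- 1:10

set.seed(123)
first_a  <- sample(numbers, size = 5)
second_a <- sample(numbers, size = 5)  # continues from the advanced RNG state

set.seed(123)
first_b  <- sample(numbers, size = 5)
second_b <- sample(numbers, size = 5)

identical(first_a, first_b)    # TRUE: same seed, same first draw
identical(second_a, second_b)  # TRUE: same seed, same second draw
```

In practice this means one set.seed() call at the top of a script is usually enough, as long as the order of the sampling calls does not change.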

Example 2: Practical Application

The Problem

We want to randomly sample penguin observations from the Palmer Penguins dataset to create a smaller subset for analysis. This mimics real-world scenarios where you need to work with a random portion of your data.

Step 1: Explore the dataset

First, let’s examine our dataset to understand what we’re working with.

# Load and examine the penguins data
data(penguins)
glimpse(penguins)

# Check total number of observations
nrow(penguins)

The dataset has 344 penguin observations across 8 columns, and glimpse() shows the type and first few values of each column.

Step 2: Random row sampling

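We’ll randomly select rows from the dataset to create a smaller sample. When sample() receives a single positive number n, it samples from 1:n, so sample(nrow(penguins), size = 50) returns 50 distinct row numbers that we can use as an index.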

# Set seed for reproducibility
set.seed(456)

# Sample 50 random row indices
random_rows <- sample(nrow(penguins), size = 50)

# Create subset using these indices
penguin_sample <- penguins[random_rows, ]
nrow(penguin_sample)

This creates a random subset of 50 penguins from our original dataset of 344 observations.
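The same row-index idea extends to a simple training/testing split, one of the uses mentioned in the introduction. This is a minimal sketch, assuming an 80/20 split and an arbitrary seed of 2021; the names train_rows, train, and test are chosen for illustration.

```r
library(palmerpenguins)

set.seed(2021)  # arbitrary seed for a repeatable split

n <- nrow(penguins)

# Draw 80% of the row numbers without replacement for the training set
train_rows <- sample(n, size = floor(0.8 * n))

# Rows that were drawn go to training; all remaining rows go to testing
train <- penguins[train_rows, ]
test  <- penguins[-train_rows, ]

c(train = nrow(train), test = nrow(test))  # 275 and 69
```

Negative indexing with -train_rows keeps every row that was not drawn, so the two sets never overlap and together cover all 344 rows.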

Step 3: Stratified sampling by species

Let’s ensure we get representatives from each penguin species in our sample.

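# Set a seed so the stratified draw is reproducible
set.seed(2021)

# Drop rows with any missing value, then sample 5 penguins from each species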
stratified_sample <- penguins |>
  drop_na() |>
  group_by(species) |>
  slice_sample(n = 5) |>
  ungroup()

# Check the species distribution
stratified_sample |>
  count(species)

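slice_sample() draws rows at random within each group, much as sample() does with row indices, so we get exactly 5 penguins from each species for balanced representation. Note that drop_na() with no arguments removes every row that has a missing value in any column before sampling.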
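To see how grouped sampling relates to sample(), here is a rough base R equivalent. It is a sketch of the idea, not a description of how dplyr implements slice_sample(), and unlike the pipeline above it does not drop rows with missing values. The names rows_by_species, picked, and base_stratified are illustrative.

```r
library(palmerpenguins)

set.seed(2021)  # arbitrary seed

# Split the row numbers into one integer vector per species
rows_by_species <- split(seq_len(nrow(penguins)), penguins$species)

# Within each species, pick 5 positions and map them back to row numbers
picked <- unlist(
  lapply(rows_by_species, function(rows) rows[sample.int(length(rows), 5)]),
  use.names = FALSE
)

base_stratified <- penguins[picked, ]

# Each species should appear exactly 5 times
table(base_stratified$species)
```

Indexing with rows[sample.int(length(rows), 5)] rather than sample(rows, 5) sidesteps the special case where a group holds a single row number, which sample() would interpret as a range to draw from.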

Step 4: Weighted sampling

We can also sample with different probabilities based on certain characteristics.

# Drop rows with missing body mass (prob cannot contain NA values)
penguins_clean <- penguins |> drop_na(body_mass_g)

# Sample 20 rows without replacement, weighted by body mass
set.seed(789)
weighted_indices <- sample(
  nrow(penguins_clean), 
  size = 20,
  prob = penguins_clean$body_mass_g
)

weighted_sample <- penguins_clean[weighted_indices, ]

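This sampling approach gives heavier penguins a higher probability of being selected. The weights passed to prob do not need to sum to 1. Because we sample without replacement, the weights are reapplied to the remaining rows after each draw, so the chance of a penguin ending up in the sample is only approximately proportional to its body mass.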
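A quick way to check what prob does is to sample a toy vector many times and compare the observed shares with the weights. The sketch below assumes two made-up items with weights 1 and 9 and samples with replacement, where the proportional interpretation holds directly; the names items, weights, and draws are illustrative.

```r
set.seed(1)  # arbitrary seed

items   <- c("light", "heavy")
weights <- c(1, 9)  # relative weights; they do not need to sum to 1

# Draw 10,000 times with replacement so every draw uses the same weights
draws <- sample(items, size = 10000, replace = TRUE, prob = weights)

# Observed share of each item; "heavy" should land close to 0.9
prop.table(table(draws))
```

The observed proportions will vary slightly from run to run with a different seed, but with this many draws they stay close to the 10% and 90% implied by the weights.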

Summary

Use sample(x, size, replace = FALSE) for basic random sampling without replacement
Set replace = TRUE to allow duplicate selections and sample more items than available
Call set.seed() before sampling whenever results need to be reproducible
Combine sample() with dataset indexing to randomly select rows from data frames
Use slice_sample() from dplyr for modern, pipe-friendly random sampling within grouped data
