How to randomly select rows from a dataframe in R

dplyr slice_sample()

Learn how to randomly select rows from a dataframe in R. Step-by-step tutorial with examples.

Published

January 25, 2022

Introduction

Random sampling of rows is a fundamental technique in data analysis for creating representative subsets of larger datasets. This approach is essential for exploratory data analysis, creating training/testing splits for machine learning, or simply working with manageable portions of large datasets.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Random Sampling

The Problem

We need to randomly select a specific number of rows from our dataset to create a smaller, manageable sample for analysis.

Step 1: Examine the original dataset

Let’s start by looking at our penguins dataset to understand its structure.

data(penguins)
glimpse(penguins)
nrow(penguins)

The penguins dataset contains 344 rows with information about different penguin species and their characteristics.

Step 2: Select a fixed number of rows

Use slice_sample() to randomly select exactly 10 rows from the dataset.

random_10 <- penguins |>
  slice_sample(n = 10)

random_10

This creates a new dataframe with exactly 10 randomly selected penguins, maintaining all original columns.
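For comparison, the same fixed-size draw can be sketched in base R, which may help when dplyr is not available. The names `row_positions` and `random_10_base` are illustrative only and are not part of the original tutorial.

```r
library(palmerpenguins)

# Draw 10 distinct row positions (sample() works without replacement by default),
# then subset the dataframe by position, keeping all columns.
row_positions <- sample(nrow(penguins), size = 10)
random_10_base <- penguins[row_positions, ]

nrow(random_10_base)
```

The dplyr version is usually easier to combine with grouping and pipes, which is why the rest of this tutorial uses slice_sample().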

Step 3: Select a percentage of rows

Sometimes it’s more useful to sample a proportion rather than a fixed number.

random_percent <- penguins |>
  slice_sample(prop = 0.15)

nrow(random_percent)

This selects 15% of the total rows, rounded down to a whole number (51 of the 344 penguins), so the sample size scales automatically with the dataset.
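To see exactly how many rows a proportion yields, we can compare the sample size with the row count directly. This is a small sketch that assumes the behavior of current dplyr versions, where the product of prop and the number of rows is rounded down.

```r
library(dplyr)
library(palmerpenguins)

random_percent <- penguins |>
  slice_sample(prop = 0.15)

# 0.15 * 344 = 51.6, so compare the sample size with the rounded-down value
nrow(random_percent)
floor(0.15 * nrow(penguins))
```

If the exact count matters for your analysis, passing n directly avoids any surprise from rounding.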

Step 4: Set a seed for reproducibility

Make your random sampling reproducible by setting a seed value.

set.seed(123)
reproducible_sample <- penguins |>
  slice_sample(n = 20)

head(reproducible_sample)

Calling set.seed() with the same value immediately before sampling ensures you get the same random sample every time you run the code.
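Reproducibility matters most when a sample feeds later steps, such as the training/testing split mentioned in the introduction. The sketch below is one possible way to build such a split with slice_sample(). The `row_id` column and the 80/20 proportion are assumptions made for illustration, not part of the original example.

```r
library(dplyr)
library(palmerpenguins)

# row_id is a hypothetical helper column, used only to tell rows apart
penguins_id <- penguins |>
  mutate(row_id = row_number())

set.seed(123)
train <- penguins_id |>
  slice_sample(prop = 0.8)

# Every row that was not drawn for training goes to the test set
test <- penguins_id |>
  anti_join(train, by = "row_id")

# Resetting the same seed reproduces the same draw
set.seed(123)
train_again <- penguins_id |>
  slice_sample(prop = 0.8)

identical(train$row_id, train_again$row_id)
c(train = nrow(train), test = nrow(test))
```

Because the seed is reset right before each draw, both training sets contain the same rows in the same order.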

Example 2: Practical Application

The Problem

Imagine you’re a researcher who needs to create balanced samples from different penguin species for a comparative study. You want to ensure each species is equally represented in your sample while maintaining randomness.

Step 1: Check species distribution

First, let’s see how many penguins we have for each species.

penguins |>
  count(species, sort = TRUE)

This shows us the available sample size for each species, helping us plan our sampling strategy.

Step 2: Stratified random sampling

Sample an equal number of penguins from each species to create a balanced dataset.

balanced_sample <- penguins |>
  drop_na() |>
  group_by(species) |>
  slice_sample(n = 15)

balanced_sample |> count(species)

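This creates a balanced sample with exactly 15 penguins from each species. Because drop_na() runs first, rows with missing values are removed before sampling. The result is still grouped by species, so add ungroup() if later steps should treat it as a single table.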
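Balance only holds while every group has at least n rows. As a hedged illustration, the sketch below asks for 100 rows per species. When sampling without replacement, slice_sample() returns all rows of a group that has fewer than n rows instead of raising an error, so the smallest species ends up underrepresented. The name `capped_sample` is illustrative.

```r
library(dplyr)
library(palmerpenguins)

capped_sample <- penguins |>
  group_by(species) |>
  slice_sample(n = 100) |>
  ungroup()  # drop the grouping so later steps see one table

# Groups with fewer than 100 rows contribute all of their rows
capped_sample |> count(species)
```

Checking the per-group counts first, as in Step 1, is a simple way to pick an n that every group can supply.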

Step 3: Sample with replacement

Sometimes you need to sample with replacement, especially when the population is smaller than your desired sample size.

set.seed(456)
bootstrap_sample <- penguins |>
  slice_sample(n = 400, replace = TRUE)

nrow(bootstrap_sample)

This creates a bootstrap sample larger than the original dataset, with some penguins appearing multiple times.
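To confirm that rows really repeat, we can tag each row before sampling and count how often each tag appears. This is a minimal sketch, and the `row_id` column is a hypothetical helper added only for this check.

```r
library(dplyr)
library(palmerpenguins)

set.seed(456)
bootstrap_sample <- penguins |>
  mutate(row_id = row_number()) |>  # hypothetical helper column
  slice_sample(n = 400, replace = TRUE)

# Number of distinct original rows that were drawn
n_distinct(bootstrap_sample$row_id)

# Rows that were drawn most often
bootstrap_sample |>
  count(row_id, sort = TRUE) |>
  head()
```

Since 400 draws come from only 344 rows, at least some rows must appear more than once. A conventional bootstrap resample usually matches the size of the original data, which corresponds to n = nrow(penguins).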

Step 4: Weighted random sampling

Create samples where certain rows have a higher probability of selection.

weighted_sample <- penguins |>
  drop_na(body_mass_g) |>
  slice_sample(n = 50, weight_by = body_mass_g)

summary(weighted_sample$body_mass_g)

This approach gives heavier penguins a higher chance of being selected, useful for studies focusing on larger specimens.
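Weights are relative, not absolute, and a weight of zero excludes a row entirely. The sketch below illustrates this with an artificial weight of 1 for Gentoo penguins and 0 for all others. The weighting rule and the name `gentoo_only` are assumptions chosen for demonstration, not something from the original example.

```r
library(dplyr)
library(palmerpenguins)

set.seed(789)
gentoo_only <- penguins |>
  # weight_by accepts an expression evaluated on the data;
  # rows with weight 0 can never be drawn
  slice_sample(n = 50, weight_by = if_else(species == "Gentoo", 1, 0))

gentoo_only |> count(species)
```

The requested n has to stay at or below the number of rows with a positive weight when sampling without replacement, otherwise the draw fails.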

Summary

Use slice_sample(n = x) to select a specific number of rows randomly
Use slice_sample(prop = x) to select a percentage of rows for scalable sampling
Combine group_by() with slice_sample() for stratified sampling across categories
Set replace = TRUE for bootstrap sampling when you need samples larger than the original dataset
Always use set.seed() before sampling to ensure reproducible results in your analysis

