How to use select() in R

Published: February 20, 2026

dplyr::select() Function Tutorial

Introduction

The select() function from the dplyr package is a fundamental tool for column selection and manipulation in R. It allows you to choose specific columns from a data frame or tibble, making your datasets more focused and manageable. This function is particularly useful when working with large datasets containing many variables where you only need a subset for analysis.

Use select() when you want to reduce the number of columns in a dataset, reorder columns, rename columns during selection, or pick columns by name pattern or by property such as type. It works naturally with the pipe operator, which makes it a staple of data wrangling workflows. dplyr is a core tidyverse package, so library(tidyverse) attaches it for you.

Syntax

select(.data, ...)

Key arguments:

- .data: A data frame or tibble to select columns from
- ...: One or more unquoted expressions separated by commas. You can use:
  - Column names directly
  - Selection helpers (starts_with(), ends_with(), contains(), etc.)
  - Ranges of columns (column1:column5)
  - Negative selection to exclude columns (-column_name)
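The four kinds of expression listed above can be sketched side by side. The scores tibble and its column names are made up for illustration and are not part of the tutorial's data.

```r
library(dplyr)

# Toy data invented for this sketch (hypothetical columns)
scores <- tibble(
  id = 1:3,
  test_math = c(90, 85, 70),
  test_art = c(60, 75, 88),
  final_grade = c("A", "B", "B"),
  notes = c("", "late", "")
)

by_name   <- scores |> select(id, final_grade)      # column names directly
by_range  <- scores |> select(id:test_art)          # contiguous range of columns
by_helper <- scores |> select(starts_with("test"))  # selection helper
by_negate <- scores |> select(-notes)               # everything except notes

names(by_range)
```

Each call returns a new tibble; scores itself is left unchanged.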

Example 1: Basic Usage

Let’s start with a simple example using the palmerpenguins dataset:

library(tidyverse)
library(palmerpenguins)

# Select specific columns by name
penguins |> 
  select(species, island, bill_length_mm)

# A tibble: 344 × 3
   species island    bill_length_mm
   <fct>   <fct>              <dbl>
 1 Adelie  Torgersen           39.1
 2 Adelie  Torgersen           39.5
 3 Adelie  Torgersen           40.3
 4 Adelie  Torgersen             NA
 5 Adelie  Torgersen           36.7
 # … with 339 more rows

This example demonstrates the most basic usage of select(). We’ve chosen three specific columns from the penguins dataset: species, island, and bill_length_mm. The function returns a new tibble containing only these columns while preserving all rows. This is useful when you want to focus your analysis on specific variables without the distraction of unnecessary columns.
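To make the "new tibble, same rows" point concrete, here is a minimal sketch on a made-up three-row tibble. The birds data and its columns are assumptions for illustration, not the real penguins dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for a wider dataset
birds <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Dream"),
  mass_g = c(3750L, 5000L, 3700L)
)

# Columns come back in the order they are listed in select()
slim <- birds |> select(mass_g, species)

names(slim)   # the two requested columns, in the requested order
nrow(slim)    # same number of rows as birds
names(birds)  # the input still has all three columns
```

Because the input is not modified, assign the result to a name if you want to keep it.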

Example 2: Practical Application

Here’s a more practical example that combines select() with other dplyr functions to answer a research question:

# Select and analyze bill measurements by species
penguins |> 
  select(species, starts_with("bill")) |> 
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |> 
  group_by(species) |> 
  summarise(
    avg_bill_length = round(mean(bill_length_mm), 2),
    avg_bill_depth = round(mean(bill_depth_mm), 2),
    n_penguins = n()
  )

# A tibble: 3 × 4
  species   avg_bill_length avg_bill_depth n_penguins
  <fct>               <dbl>          <dbl>      <int>
1 Adelie               38.8           18.3        151
2 Chinstrap            48.8           18.4         68
3 Gentoo               47.5           15.0        123

This example showcases select() in a real analytical workflow. We use the starts_with() helper function to select all columns beginning with “bill”, then pipe the result through additional operations to calculate species-specific averages. This demonstrates how select() serves as a foundation for more complex data analysis pipelines.
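The same selection helpers can also be used inside across(), which may save the separate select() step when every selected column gets the same summary. The sketch below uses a small made-up tibble (measurements, with invented values) rather than the penguins data, and handles missing values with na.rm = TRUE instead of filter().

```r
library(dplyr)

# Invented values; only the column naming mirrors the penguins data
measurements <- tibble(
  species = c("a", "a", "b", "b"),
  bill_length_mm = c(40, 42, 50, NA),
  bill_depth_mm = c(18, 19, 15, 16),
  body_mass_g = c(3700, 3800, 5000, 5200)
)

# starts_with() picks the bill columns; body_mass_g is left out of the summary
bill_means <- measurements |>
  group_by(species) |>
  summarise(across(starts_with("bill"), \(x) mean(x, na.rm = TRUE)))

bill_means
```

Whether to select first or to use across() is mostly a readability choice; selecting first keeps the intermediate data narrow, which can help when inspecting a pipeline step by step.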

Example 3: Advanced Usage

Advanced selection techniques using helper functions and negative selection:

# Complex selection with multiple criteria
penguins |> 
  select(species, island, ends_with("mm"), -contains("depth")) |> 
  head(5)

# A tibble: 5 × 4
  species island    bill_length_mm flipper_length_mm
  <fct>   <fct>              <dbl>             <int>
1 Adelie  Torgersen           39.1               181
2 Adelie  Torgersen           39.5               186
3 Adelie  Torgersen           40.3               195
4 Adelie  Torgersen             NA                NA
5 Adelie  Torgersen           36.7               193

# Reorder and rename columns simultaneously
penguins |> 
  select(penguin_species = species, 
         location = island,
         everything()) |> 
  head(3)

# A tibble: 3 × 8
  penguin_species location  bill_length_mm bill_depth_mm flipper_length_mm
  <fct>           <fct>              <dbl>         <dbl>             <int>
1 Adelie          Torgersen           39.1          18.7               181
2 Adelie          Torgersen           39.5          17.4               186
3 Adelie          Torgersen           40.3          18                 195
# … with 3 more variables: body_mass_g <int>, sex <fct>, year <int>

These examples show advanced features like combining multiple selection criteria, excluding columns with negative selection, renaming during selection, and using everything() to include remaining columns after specific selections.
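Selection criteria can also be combined with the Boolean operators that tidyselect supports (&, | and !). The following is a minimal sketch on a made-up two-row tibble; the toy object and its values are assumptions for illustration.

```r
library(dplyr)

# Toy columns named after the penguins variables (values are made up)
toy <- tibble(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 47.5),
  bill_depth_mm = c(18.7, 15.0),
  flipper_length_mm = c(181L, 217L),
  body_mass_g = c(3750L, 5000L)
)

# Intersection: starts with "bill" AND contains "length"
both <- toy |> select(starts_with("bill") & contains("length"))

# Union: bill columns OR columns ending in "_g"
either <- toy |> select(starts_with("bill") | ends_with("_g"))

# Complement: every column that does not end in "mm"
not_mm <- toy |> select(!ends_with("mm"))

names(not_mm)
```

The comma-separated form used earlier behaves like a union followed by removals, so the operator form is mainly useful when you need an intersection.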

Common Mistakes

1. Forgetting to wrap column names that contain spaces or special characters in backticks:

# Wrong
df |> select(my column)

# Correct
df |> select(`my column`)

2. Using quotes around regular column names:

# Unnecessary (though not wrong)
penguins |> select("species", "island")

# Preferred
penguins |> select(species, island)

3. Using wildcard patterns instead of selection helpers:

# Wrong - using wildcards instead of helper functions
penguins |> select(*bill*)

# Correct
penguins |> select(contains("bill"))
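The three corrections above can be tried on a small made-up tibble. The survey object and its columns are hypothetical. The sketch also shows all_of() and any_of(), which are one reasonable way to use quoted names when they are stored in a character vector, and matches(), which accepts a regular expression when contains() is too coarse.

```r
library(dplyr)

# Hypothetical survey data; `my column` is deliberately non-syntactic
survey <- tibble(
  `my column` = c(1, 2, 3),
  bill_length_mm = c(39.1, 39.5, 40.3),
  bill_depth_mm = c(18.7, 17.4, 18.0),
  species = c("Adelie", "Adelie", "Gentoo")
)

# Mistake 1: backticks make select() read the whole name as one symbol
spaced <- survey |> select(`my column`)

# Mistake 2: quoted names are handy when they live in a character vector;
# all_of() errors on missing names, any_of() skips them
wanted <- c("species", "my column")
from_vector <- survey |> select(all_of(wanted))
tolerant <- survey |> select(any_of(c("species", "not_a_column")))

# Mistake 3: matches() takes a regular expression instead of a wildcard
bill_mm <- survey |> select(matches("^bill_.*_mm$"))

names(bill_mm)
```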

Related Functions

- rename(): Rename columns without changing which columns are selected
- relocate(): Change the position of columns without dropping any
- pull(): Extract a single column as a vector
- across(): Apply functions across multiple selected columns
- where(): Select columns based on their data type or other properties
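As a rough comparison of how some of these related functions differ from select(), here is a minimal sketch on a made-up tibble. The birds data is an assumption for illustration.

```r
library(dplyr)

# Hypothetical toy data
birds <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Dream"),
  mass_g = c(3750, 5000, 3700)
)

# rename() changes a name and keeps every column
renamed <- birds |> rename(location = island)

# relocate() moves a column and keeps every column
moved <- birds |> relocate(mass_g, .before = species)

# pull() returns a plain vector rather than a tibble
masses <- birds |> pull(mass_g)

# where() is used inside select() to pick columns by a predicate
numeric_only <- birds |> select(where(is.numeric))

names(moved)
```

By contrast, select(location = island) would rename the column and drop all the others, so rename() is usually the safer choice when only the name should change.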