How to use select() in R

Published February 20, 2026

dplyr::select() Function Tutorial

Introduction

The select() function from the dplyr package is a fundamental tool for column selection and manipulation in R. It allows you to choose specific columns from a data frame or tibble, making your datasets more focused and manageable. This function is particularly useful when working with large datasets containing many variables where you only need a subset for analysis.

You would use select() when you want to reduce the number of columns in your dataset, reorder columns, rename columns during selection, or pick columns by name pattern or by a property such as type. It works with both the native pipe (|>) and the magrittr pipe (%>%), which makes it a staple of data wrangling workflows. dplyr is a core tidyverse package, so library(tidyverse) attaches it automatically.

Syntax

select(.data, ...)

Key arguments:

- .data: A data frame or tibble to select columns from
- ...: One or more unquoted expressions separated by commas. You can use:
  - Column names directly
  - Selection helpers (starts_with(), ends_with(), contains(), etc.)
  - Ranges of columns (column1:column5)
  - Negative selection to exclude columns (-column_name)
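The four kinds of expression listed above can be sketched side by side. This is a minimal illustration on a toy tibble; the data frame df and its column names are invented for the example and are not part of any package.

```r
library(dplyr)

# Hypothetical toy data, invented purely for illustration
df <- tibble(
  id         = 1:3,
  score_math = c(90, 85, 70),
  score_art  = c(60, 75, 95),
  group      = c("a", "b", "a"),
  notes      = c("x", "y", "z")
)

by_name   <- df |> select(id, group)             # column names directly
by_helper <- df |> select(starts_with("score"))  # selection helper
by_range  <- df |> select(id:score_art)          # range of adjacent columns
by_drop   <- df |> select(-notes)                # negative selection

names(by_name)    # "id" "group"
names(by_helper)  # "score_math" "score_art"
names(by_range)   # "id" "score_math" "score_art"
names(by_drop)    # "id" "score_math" "score_art" "group"
```

In every case the result keeps all rows, and df itself is left unchanged.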

Example 1: Basic Usage

Let’s start with a simple example using the penguins dataset from the palmerpenguins package:

library(tidyverse)
library(palmerpenguins)

# Select specific columns by name
penguins |> 
  select(species, island, bill_length_mm)
# A tibble: 344 × 3
   species island    bill_length_mm
   <fct>   <fct>              <dbl>
 1 Adelie  Torgersen           39.1
 2 Adelie  Torgersen           39.5
 3 Adelie  Torgersen           40.3
 4 Adelie  Torgersen             NA
 5 Adelie  Torgersen           36.7
 # … with 339 more rows

This example demonstrates the most basic usage of select(). We’ve chosen three specific columns from the penguins dataset: species, island, and bill_length_mm. The function returns a new tibble containing only these columns while preserving all rows. This is useful when you want to focus your analysis on specific variables without the distraction of unnecessary columns.

Example 2: Practical Application

Here’s a more practical example that combines select() with other dplyr functions to answer a research question:

# Select and analyze bill measurements by species
penguins |> 
  select(species, starts_with("bill")) |> 
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |> 
  group_by(species) |> 
  summarise(
    avg_bill_length = round(mean(bill_length_mm), 2),
    avg_bill_depth = round(mean(bill_depth_mm), 2),
    n_penguins = n()
  )
# A tibble: 3 × 4
  species   avg_bill_length avg_bill_depth n_penguins
  <fct>               <dbl>          <dbl>      <int>
1 Adelie               38.8           18.3        151
2 Chinstrap            48.8           18.4         68
3 Gentoo               47.5           15.0        123

This example showcases select() in a real analytical workflow. We use the starts_with() helper function to select all columns beginning with “bill”, then pipe the result through additional operations to calculate species-specific averages. This demonstrates how select() serves as a foundation for more complex data analysis pipelines.

Example 3: Advanced Usage

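The following examples combine helper functions with negative selection, then rename and reorder columns in a single step:

# Complex selection with multiple criteria.
# -ends_with("g") would drop any selected column ending in "g"; it is a no-op
# here because none of the selected columns does.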
penguins |> 
  select(species, island, contains("length"), -ends_with("g")) |> 
  head(5)
# A tibble: 5 × 4
  species island    bill_length_mm flipper_length_mm
  <fct>   <fct>              <dbl>             <int>
1 Adelie  Torgersen           39.1               181
2 Adelie  Torgersen           39.5               186
3 Adelie  Torgersen           40.3               195
4 Adelie  Torgersen             NA                NA
5 Adelie  Torgersen           36.7               193

# Reorder and rename columns simultaneously
penguins |> 
  select(penguin_species = species, 
         location = island,
         everything()) |> 
  head(3)
# A tibble: 3 × 8
  penguin_species location  bill_length_mm bill_depth_mm flipper_length_mm
  <fct>           <fct>              <dbl>         <dbl>             <int>
1 Adelie          Torgersen           39.1          18.7               181
2 Adelie          Torgersen           39.5          17.4               186
3 Adelie          Torgersen           40.3          18                 195
# … with 3 more variables: body_mass_g <int>, sex <fct>, year <int>

These examples show advanced features like combining multiple selection criteria, excluding columns with negative selection, renaming during selection, and using everything() to include remaining columns after specific selections.
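The introduction also mentions selecting columns by their properties, which none of the examples so far demonstrates. One way to do this is the where() helper, which applies a predicate function to each column. The sketch below uses a small made-up tibble (df and its columns are hypothetical) rather than the penguins data.

```r
library(dplyr)

# Hypothetical toy data, invented purely for illustration
df <- tibble(
  name   = c("a", "b"),
  height = c(1.5, 1.8),
  weight = c(60L, 75L),
  flag   = c(TRUE, FALSE)
)

# Keep only the columns for which is.numeric() returns TRUE
numeric_cols <- df |> select(where(is.numeric))
names(numeric_cols)  # "height" "weight"

# Predicates can be combined with name-based helpers using & and !
numeric_not_w <- df |> select(where(is.numeric) & !starts_with("w"))
names(numeric_not_w)  # "height"
```

Note that is.numeric() is TRUE for integer and double columns but FALSE for logical ones, so flag is not selected.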

Common Mistakes

1. Forgetting backticks around column names that contain spaces or special characters:

# Wrong: a syntax error, because R reads my and column as two separate names
df |> select(my column)

# Correct
df |> select(`my column`)
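A runnable version of the correct form, assuming a toy tibble with non-syntactic column names (the names here are invented for illustration). Renaming during selection is a convenient way to get rid of the backticks for later steps.

```r
library(dplyr)

# Hypothetical data with non-syntactic column names
df <- tibble(
  `my column`  = 1:3,
  `2024 total` = c(10, 20, 30),
  plain        = c("a", "b", "c")
)

# Backticks let select() refer to names that are not valid R identifiers
picked <- df |> select(`my column`, plain)
names(picked)   # "my column" "plain"

# Rename while selecting so later code can use ordinary names
renamed <- df |> select(my_column = `my column`, total_2024 = `2024 total`)
names(renamed)  # "my_column" "total_2024"
```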

2. Using quotes around regular column names:

# Unnecessary (though not wrong)
penguins |> select("species", "island")

# Preferred
penguins |> select(species, island)
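Quoted names do become useful when the column names live in a character vector, for example when they come from a configuration or a function argument. In that situation the tidyselect helpers all_of() and any_of() are the usual tools. A small sketch with made-up data (df, wanted, and maybe are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data, invented purely for illustration
df <- tibble(
  species = c("Adelie", "Gentoo"),
  island  = c("Torgersen", "Biscoe"),
  year    = c(2007L, 2008L)
)

# all_of() is strict: it raises an error if any listed name is missing
wanted <- c("species", "island")
strict <- df |> select(all_of(wanted))
names(strict)   # "species" "island"

# any_of() is lenient: names that do not exist are silently skipped
maybe <- c("species", "sex")
lenient <- df |> select(any_of(maybe))
names(lenient)  # "species"
```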

3. Using wildcard syntax instead of selection helpers:

# Wrong - using wildcards instead of helper functions
penguins |> select(*bill*)

# Correct
penguins |> select(contains("bill"))
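When a simple substring is not enough, the matches() helper accepts a regular expression, which covers most cases where a wildcard might be tempting. The sketch below uses a toy tibble whose column names mirror the penguins measurements; the values are illustrative only.

```r
library(dplyr)

# Hypothetical toy data with penguins-style column names
df <- tibble(
  bill_length_mm    = c(39.1, 46.5),
  bill_depth_mm     = c(18.7, 17.9),
  flipper_length_mm = c(181L, 192L),
  body_mass_g       = c(3750L, 3500L)
)

# Regex anchored at the start of the name: columns beginning with "bill_"
bill_only <- df |> select(matches("^bill_"))
names(bill_only)       # "bill_length_mm" "bill_depth_mm"

# Alternation: names containing either "length" or "mass"
length_or_mass <- df |> select(matches("length|mass"))
names(length_or_mass)  # "bill_length_mm" "flipper_length_mm" "body_mass_g"
```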