How to use select() in R
dplyr::select() Function Tutorial
Introduction
The select() function from the dplyr package is a fundamental tool for column selection and manipulation in R. It allows you to choose specific columns from a data frame or tibble, making your datasets more focused and manageable. This function is particularly useful when working with large datasets containing many variables where you only need a subset for analysis.
You would use select() when you want to reduce the number of columns in your dataset, reorder columns, rename columns during selection, or pick columns by name pattern or by a property such as type. It works seamlessly with the pipe operator, making it an essential function for data wrangling workflows. Because dplyr is a core tidyverse package, library(tidyverse) attaches it, so select() is available without loading dplyr separately.
Syntax
select(.data, ...)
Key arguments:
- .data: A data frame or tibble to select columns from
- ...: One or more unquoted expressions separated by commas. You can use:
  - Column names directly
  - Selection helpers (starts_with(), ends_with(), contains(), etc.)
  - Ranges of columns (column1:column5)
  - Negative selection to exclude columns (-column_name)
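The four kinds of expressions listed above can be sketched on a small data frame. This is only an illustration: the data frame df and its column names are hypothetical and not part of the tutorial's dataset.

```r
library(dplyr)

# Hypothetical toy data frame, invented for illustration only
df <- data.frame(
  id = 1:3,
  score_math = c(90, 85, 70),
  score_art = c(60, 75, 88),
  city = c("Oslo", "Lima", "Pune"),
  notes = c("a", "b", "c")
)

# Column names directly
by_name <- df |> select(id, city)

# Selection helper: columns whose names start with "score"
by_helper <- df |> select(starts_with("score"))

# Range: every column from id through score_art, in data order
by_range <- df |> select(id:score_art)

# Negative selection: every column except notes
without_notes <- df |> select(-notes)

names(by_range)
names(without_notes)
```

Each call returns a new data frame and leaves df unchanged, so the four results can be compared side by side.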
Example 1: Basic Usage
Let’s start with a simple example using the palmerpenguins dataset:
library(tidyverse)
library(palmerpenguins)
# Select specific columns by name
penguins |>
  select(species, island, bill_length_mm)
# A tibble: 344 × 3
species island bill_length_mm
<fct> <fct> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
# … with 339 more rows
This example demonstrates the most basic usage of select(). We’ve chosen three specific columns from the penguins dataset: species, island, and bill_length_mm. The function returns a new tibble containing only these columns while preserving all rows. This is useful when you want to focus your analysis on specific variables without the distraction of unnecessary columns.
Example 2: Practical Application
Here’s a more practical example that combines select() with other dplyr functions to answer a research question:
# Select and analyze bill measurements by species
penguins |>
  select(species, starts_with("bill")) |>
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  group_by(species) |>
  summarise(
    avg_bill_length = round(mean(bill_length_mm), 2),
    avg_bill_depth = round(mean(bill_depth_mm), 2),
    n_penguins = n()
  )
# A tibble: 3 × 4
species avg_bill_length avg_bill_depth n_penguins
<fct> <dbl> <dbl> <int>
1 Adelie 38.8 18.3 151
2 Chinstrap 48.8 18.4 68
3 Gentoo 47.5 15.0 123
This example showcases select() in a real analytical workflow. We use the starts_with() helper function to select all columns beginning with “bill”, then pipe the result through additional operations to calculate species-specific averages. This demonstrates how select() serves as a foundation for more complex data analysis pipelines.
Example 3: Advanced Usage
Advanced selection techniques using helper functions and negative selection:
# Complex selection with multiple criteria
penguins |>
  select(species, island, contains("length"), -ends_with("g")) |>
  head(5)
# A tibble: 5 × 4
species island bill_length_mm flipper_length_mm
<fct> <fct> <dbl> <int>
1 Adelie Torgersen 39.1 181
2 Adelie Torgersen 39.5 186
3 Adelie Torgersen 40.3 195
4 Adelie Torgersen NA NA
5 Adelie Torgersen 36.7 193
# Reorder and rename columns simultaneously
penguins |>
  select(penguin_species = species,
         location = island,
         everything()) |>
  head(3)
# A tibble: 3 × 8
penguin_species location bill_length_mm bill_depth_mm flipper_length_mm
<fct> <fct> <dbl> <dbl> <int>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
# … with 3 more variables: body_mass_g <int>, sex <fct>, year <int>
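These examples show advanced features like combining multiple selection criteria, excluding columns with negative selection, renaming during selection, and using everything() to include remaining columns after specific selections. Note that in the first example -ends_with("g") does not change the result: the only column ending in “g”, body_mass_g, was never picked up by the preceding terms, so there is nothing to remove.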
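The introduction mentions selecting columns by their properties, and the examples above use negative selection. A minimal sketch of both, using the where() helper on a hypothetical toy data frame (df and its column names are invented for illustration), might look like this:

```r
library(dplyr)

# Hypothetical toy data frame, invented for illustration only
df <- data.frame(
  id = 1:3,
  height_cm = c(150.5, 160.2, 170.1),
  weight_kg = c(50L, 60L, 70L),
  group = factor(c("a", "b", "a"))
)

# Select by property: keep only columns for which is.numeric() is TRUE
numeric_cols <- df |> select(where(is.numeric))

# Negative selection that changes the result:
# numeric columns, minus any whose name ends in "kg"
no_kg <- df |> select(where(is.numeric), -ends_with("kg"))

names(numeric_cols)
names(no_kg)
```

Here the negative term removes a column that the earlier term had selected, which is the situation where combining positive and negative selection is useful.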
Common Mistakes
1. Forgetting to wrap column names that contain spaces or special characters in backticks:
# Wrong
df |> select(my column)
# Correct
df |> select(`my column`)
2. Using quotes around regular column names:
# Unnecessary (though not wrong)
penguins |> select("species", "island")
# Preferred
penguins |> select(species, island)
3. Mixing up selection helpers syntax:
# Wrong - using wildcards instead of helper functions
penguins |> select(*bill*)
# Correct
penguins |> select(contains("bill"))
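Related to mistake 2: quoting does matter when the column names are already stored in a character vector, for example when they arrive as a function argument. A minimal sketch, assuming a hypothetical toy data frame and using the tidyselect helpers all_of() and any_of():

```r
library(dplyr)

# Hypothetical toy data frame, invented for illustration only
df <- data.frame(
  id = 1:2,
  species = c("x", "y"),
  island = c("p", "q")
)

# Column names held as strings in a variable
wanted <- c("species", "island")

# all_of() selects exactly these columns and raises an error if one is missing
strict <- df |> select(all_of(wanted))

# any_of() skips names that are not present instead of raising an error
lenient <- df |> select(any_of(c("species", "not_a_column")))

names(strict)
names(lenient)
```

Choosing between the two is a design decision: all_of() surfaces typos early, while any_of() suits cases where some columns may legitimately be absent.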